* WT-3499 Add a visibility rwlock between transactions and checkpoints.
* Typo
* Just acquire/release the lock immediately for synchronization.
(cherry picked from commit 80c6cee91f)
During WT_SESSION::reset, if there has been a schema change (such as a WT_SESSION::drop operation) since the last sweep, do a pass through the table cache and remove any obsolete table handles.
(cherry picked from commit 74510affec)
When acquiring a lock on our parent internal page, we use the WT_REF.home
field to reference our parent page. As a child of the parent page, we
prevent its eviction, but that's a weak guarantee. If the parent page
splits, and our WT_REF were to move with the split, the WT_REF.home field
might change underneath us and we could race, and end up attempting to
access an evicted page. Set the session page-index generation so if the
parent splits, it still can't be evicted.
Testing has uncovered another case where drops can spin trying to lock a
checkpoint handle until a checkpoint completes. This change fixes that
in two ways: attempting to lock (but not open) a handle won't spin, and
drop will always attempt to lock the live tree before locking any
checkpoint handles.
We use a pragma on Windows to force a struct to be packed, but were
missing the "end" pragma that restores normal layout. The result was
that most structs were being packed, leading to poor performance for
workloads (particularly when accessing session structures).
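The packing problem described above can be illustrated with a minimal sketch. The structure names are hypothetical, and the `push`/`pop` form of the pragma is an assumption about how the packing is expressed; the point is only that a structure declared after the restoring pragma gets its natural layout back.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structures for illustration; not WiredTiger's own. */

#pragma pack(push, 1)           /* Start packing: no padding between fields. */
struct packed_cell {
    uint8_t type;
    uint32_t length;
};
#pragma pack(pop)               /* The "end" pragma: restore normal layout. */

/* Declared after the pop, so its fields are naturally aligned again. */
struct session_like {
    uint8_t flag;
    uint32_t counter;
};
```

Without the `pop`, `struct session_like` would be laid out packed as well, so `counter` would sit at an unaligned offset.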
* WT-3356 Use atomic reads of rwlocks.
Previously we had some conditions that checked several fields within a rwlock by indirecting to the live structure. Switch to always doing a read of the full 64-bit value, then using local reads from the copy.
Otherwise, we're relying on the compiler and the memory model to order the structure accesses in "code execution order". That could explain assertion failures and/or incorrect behavior with the new rwlock implementation.
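A minimal sketch of the "read the full 64-bit value once, then test the local copy" pattern, using C11 atomics. The field names and widths here are assumptions for illustration, not WiredTiger's actual rwlock layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical lock layout: the field names and widths are assumptions,
 * not WiredTiger's actual rwlock definition.
 */
typedef union {
    uint64_t v;                 /* Full 64-bit value */
    struct {
        uint8_t current;        /* Ticket currently being served */
        uint8_t next;           /* Next ticket to hand out */
        uint16_t readers_active;
        uint32_t unused;
    } u;
} lock_snapshot;

typedef struct {
    _Atomic uint64_t v;         /* The live lock word */
} sketch_rwlock;

/* Return true if the lock looked idle at the moment of one atomic read. */
static bool
lock_is_idle(sketch_rwlock *l)
{
    lock_snapshot old;

    /* One atomic load of the whole word; never re-read the live structure. */
    old.v = atomic_load(&l->v);

    /* All checks use the local copy, so they see one consistent state. */
    return (old.u.current == old.u.next && old.u.readers_active == 0);
}
```

Checking `l->current` and `l->next` through the live pointer instead would allow the two fields to be read at different times, which is the hazard the change above removes.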
* Change all waits to 10ms.
Previously when stalling waiting to get into the lock we would wait for 1ms, but once queued we waited forever. The former is probably too aggressive (burns too much CPU when we should be able to wait for a notification), and the latter is dangerous if a notification is ever lost (a thread with a ticket may never wake up).
* WT-3354 Fix bugs found by Coverity.
* two cases where error checking for rwlocks should goto the error label for cleanup.
* LSM code not restoring isolation if a checkpoint fails part way through
* Take care with ordering an assertion after a read barrier.
We just had an assertion failure on PPC, and from inspection it looks
like the read in the assertion could be scheduled before the read that
sees the ticket allocated. We have a read barrier in this path to protect
against exactly that kind of thing happening to application data; move
the assertion after it so our diagnostics are also safe.
* Add a workload that stresses rwlock performance under various conditions (including `threads >> cores`), tune read and write lock operations to only spin when it is likely to help, and to back off to a condition variable when there is heavy contention.
* New rwlock implementation: queue readers and writers separately, don't enforce fairness among readers or if the lock is overwhelmed.
* Switch to a spinlock whenever we need to lock a page.
Previously we had a read/write lock in the __wt_page structure that was only ever acquired in write mode, plus a spinlock in the page->modify structure. Switch to using the spinlock for everything.
One slight downside of this change is that we can no longer precisely determine whether a page is locked based on the status of the spinlock (since another page sharing the same lock could be holding it in the places where we used to check). Since that was only ever used for diagnostic / debugging purposes, I think the benefit of the change outweighs this issue.
* Fix a bug where a failure during `__wt_curfile_create` caused a data handle to be released twice. This is caught by the sanity checking assertions in the new read/write lock code.
* Split may be holding a page lock when restoring update. Tell the restore code we have the page exclusive and no further locking is required.
* Allocate a spinlock for each modified page.
Using shared page locks for multiple operations that need to lock a page (including inserts and reconciliation) resulted in self-deadlock when the lookaside table was used. That's because reconciliation held a page lock, then caused inserts to the lookaside table, which acquired the page lock for a page in the lookaside table. With a shared set of page locks, they could both be the same lock.
Switch (back?) to allocating a spinlock per modified page. Earlier in this ticket we saved some space in __wt_page, so growing __wt_page_modify is unlikely to be noticeable.
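The self-deadlock described above can be sketched with a small shared lock pool. Everything here is hypothetical (the pool size, the modulo mapping, and the function names); it only shows how two different pages can resolve to the same lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 4             /* Hypothetical pool size for illustration. */

static atomic_flag lock_pool[POOL_SIZE] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Map a page to a lock in the shared pool (assumed mapping: id modulo size). */
static atomic_flag *
shared_page_lock(uint32_t page_id)
{
    return (&lock_pool[page_id % POOL_SIZE]);
}

/* Try once to take a spinlock; returns true on success. */
static bool
sketch_try_lock(atomic_flag *lock)
{
    return (!atomic_flag_test_and_set(lock));
}

/* Release a spinlock. */
static void
sketch_unlock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

With this mapping, pages 1 and 5 share a lock. A thread that holds the lock for page 1 and then spins on the lock for page 5 waits on itself forever; a lock allocated per modified page cannot collide this way.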
* Tweak padding and position of the spinlock in WT_PAGE_MODIFY to claw back some bytes.
Move evict_pass_gen to the end of WT_PAGE: on inspection, it should be a cold field relative to the others, which now fit in one x86 cache line.
(cherry picked from commit 42daa132f2)
(cherry picked from commit 1bcb9a0cc4)
* Improve two recent assertions, one from WT-2798 relating to writing metadata updates to disk that are part of a running transaction, and another from WT-2802 that checks that we don't try to copy values from a cursor without a transaction pinned. The latter doesn't apply to cursors on checkpoints (including chunk cursors in an LSM tree).
* Copy cursor values before rollback in autocommit.
If an autocommit operation such as WT_CURSOR::update touches multiple trees (e.g., multiple column groups in a table, or index updates, or multiple chunks in an LSM tree), then some cursors may have consumed the application's key/value pair when the operation has to roll back. Take a copy of any such values before attempting to retry the operation.
(cherry picked from commit 41eb2dcaac)
When logging is disabled, a create operation (and potentially other
metadata updates) could write partially completed checkpoint metadata,
leaving on-disk files inconsistent until the checkpoint completes.
(cherry picked from commit 7e1a47dd45)
Change the default remove/rename calls to flush the enclosing directory.
Simplify the pluggable file system API by replacing the directory-sync method
with a "durable" boolean argument to the remove, rename and open-file methods.
* Add "durable" arguments to relevant functions so that each remove or rename
call specifies its durability requirements.
* Switch the WT_FILE_SYSTEM::fs_open_file type enum from WT_OPEN_FILE_TYPE,
with WT_OPEN_XXX names, to the WT_FS_OPEN_FILE_TYPE, with WT_FS_OPEN_XXX
names.
Switch the WT_FILE_SYSTEM::fs_open_file flags from WT_OPEN_XXX names to
WT_FS_OPEN_XXX names.
* Replace the "bool durable" argument to WT_FILE_SYSTEM.fs_remove and
WT_FILE_SYSTEM.fs_rename with a "uint32_t flags" argument, and the
WT_FS_DURABLE flag.
* Remove a stray bracket.
(cherry picked from commit 11f018322c)
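The flags-based remove described in this change can be sketched as follows, assuming a POSIX system. The function name, its signature, and `SKETCH_FS_DURABLE` are hypothetical stand-ins modeled on the WT_FS_DURABLE flag; this is not the WT_FILE_SYSTEM interface itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define SKETCH_FS_DURABLE 0x1u  /* Hypothetical flag modeled on WT_FS_DURABLE. */

/*
 * Remove a file. If the caller asks for durability, also flush the
 * enclosing directory so the removal survives a crash.
 */
static int
sketch_remove(const char *dir, const char *path, uint32_t flags)
{
    int fd, ret;

    if (unlink(path) != 0)
        return (errno);
    if (!(flags & SKETCH_FS_DURABLE))
        return (0);

    /* Durable removal: open the directory and sync it. */
    if ((fd = open(dir, O_RDONLY)) == -1)
        return (errno);
    ret = fsync(fd) == 0 ? 0 : errno;
    (void)close(fd);
    return (ret);
}
```

A flags word leaves room for further per-call options without changing the signature again, which a single boolean would not.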
This problem can occur for both row and column store.
The WT_CURSOR_BTREE.rip_saved field potentially has the same problem
as the cip_saved field: initializing it on point-searches is wrong;
it should be initialized as a cursor moves to a new page.
* Clear cip_saved and rip_saved when starting to iterate from a search
position. This wasn't necessary before because we cleared them in
__cursor_pos_clear(), but I removed that code.
In summary, we now clear them in the iteration code, both when starting
an iteration and when switching to a new page. That's correct because
they have nothing to do with searches, so the clear doesn't belong in
__cursor_pos_clear(), and we have to do the clear when switching to a
new page regardless, because __cursor_pos_clear() isn't called when
switching to a new page.
(cherry picked from commit 1b6a9220c3)
Reset the column-store saved slot information on each new page, otherwise
it's possible for it to match the last page we were traversing.
(cherry picked from commit 51a4e1593d)
* WT-2711 Remove POSIX expanded strftime values and use older C89 values
* Fix issues with s_string
* Add a comment so nobody rewrites the strftime format and reintroduces the bug.
* Fix strings sort order.
(cherry picked from commit 1c67c4e0f0)
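As an illustration of the strftime change above, here is a minimal sketch that formats a timestamp using only C89 conversion specifiers. Which expanded specifiers the original code used is an assumption; `%F` and `%T` are taken here as typical examples, and the function name is hypothetical.

```c
#include <stddef.h>
#include <time.h>

/*
 * Format a timestamp using only C89 conversion specifiers. The newer
 * shorthands "%F" and "%T" expand to the same text where they are
 * supported, but not every C library provides them, so do not "simplify"
 * the format string back to them.
 */
static size_t
format_timestamp(char *buf, size_t len, const struct tm *tm)
{
    return (strftime(buf, len, "%Y-%m-%d %H:%M:%S", tm));
}
```

Passing the broken-down time for the Unix epoch in UTC would yield a 19-character string beginning with the year 1970.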
If there's no server running, discard any configuration information so
we don't leak memory during reconfiguration.
(cherry picked from commit e001657e5c)
No longer support setting the statistics_log path in WT_CONNECTION::reconfigure.
No longer support setting a custom name for statistics files, only allow a destination directory.
Be more explicit about which logging configuration options are allowed in WT_CONNECTION::reconfigure.
The aim of these changes is to avoid situations where applications that embed WiredTiger allow their users to overwrite unexpected files on a file system.
This potentially requires an upgrade step for applications that were specifying a non-standard file name component for statistics log file names; it is not backward compatible.
(cherry picked from commit 9cc5d0f4b1)
Randomize visits to trees that use a tiny fraction of the cache.
Eviction optimizations.
Now that we are queuing more entries (potentially), make sure enough of
them become candidates. Previously, a skewed distribution of read
generations could mean that only 10% of queue entries were considered.
Improve the efficiency of sorting the queue by calculating the score
once when pages are added to the queue.
Take care to bound the maximum eviction slot.
(cherry picked from commit 521270d54c)
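The "calculate the score once when pages are added to the queue" optimization above can be sketched like this. The entry layout and the scoring function are placeholders chosen for illustration, not the eviction code's real definitions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue entry; field names are assumptions for illustration. */
struct evict_entry {
    uint64_t read_gen;          /* Page read generation */
    uint64_t score;             /* Computed once, when the entry is queued */
};

/* Placeholder scoring: a lower read generation sorts first. */
static uint64_t
compute_score(uint64_t read_gen)
{
    return (read_gen);
}

/* Fill in an entry, caching its score so the sort never recomputes it. */
static void
queue_add(struct evict_entry *e, uint64_t read_gen)
{
    e->read_gen = read_gen;
    e->score = compute_score(read_gen);
}

/* Comparison only reads the cached scores. */
static int
entry_cmp(const void *a, const void *b)
{
    uint64_t sa = ((const struct evict_entry *)a)->score;
    uint64_t sb = ((const struct evict_entry *)b)->score;

    return ((sa > sb) - (sa < sb));
}

static void
queue_sort(struct evict_entry *q, size_t n)
{
    qsort(q, n, sizeof(q[0]), entry_cmp);
}
```

A sort calls the comparison function many times per entry, so computing the score inside the comparison repeats the work; caching it at queue time does it once per entry.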
When splitting the root page and updating the child's WT_REF.addr, reconciliation/eviction can race with us, updating WT_REF.addr after our read and before our update. The update is necessary because the child's address points into the page being split: if the address changes, then it can no longer point into the page being split and the update is no longer necessary.
Define system call success as a 0 return, and split error handling into two parts: if the call returns -1, use errno, otherwise expect the failing return to be an error value.
Replace calls to remove with unlink, so we know errno will be set. Do the best we can with rename, there's no easy workaround.
POSIX requires posix_madvise return an errno value, but some OS versions return a -1/errno pair instead (at least FreeBSD and OS X). I don't care about retrying posix_madvise calls on failure, but since WT_SYSCALL_RETRY includes the necessary error handling magic, wrap the posix_madvise calls in WT_SYSCALL_RETRY.
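The two-part error handling described above can be sketched as a small helper. The function name is hypothetical, and this is not the WT_SYSCALL_RETRY macro itself (in particular, it does no retrying).

```c
#include <errno.h>

/*
 * Normalize a system call's return value into 0 or an error number:
 *    0     success
 *   -1     failure reported through errno
 *   other  the return value is itself the error number
 */
static int
syscall_error(int ret)
{
    if (ret == 0)
        return (0);
    if (ret == -1)
        return (errno);
    return (ret);
}
```

A call such as posix_madvise, which may return either an error number or a -1/errno pair depending on the platform, gives the same result through this helper in both cases.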
(cherry picked from commit ced588aecd)
Add more options for callers when updating the oldest ID to control how much they care about the ID being updated.
(cherry picked from commit 116e41e5e1)
When the cache hits eviction triggers, all application threads can
hammer the eviction queue lock, starving each other and server threads.
Also noticed with the same workload: the eviction server doesn't need
to force updates to the oldest ID (which can starve the eviction server
thread if there are hundreds of application threads getting snapshots).
It is sufficient to update it lazily.
* Clear the eviction walk if we don't find any candidates.
Otherwise, we are keeping a page pinned in what might be an idle file,
and tying up a hazard pointer that could prevent eviction from an active
file (since the eviction server tracks how many hazard pointers it is
using to avoid going over the limit).
(cherry picked from commit 7f9d7aecea)
* Default checkpoint_wait is true. This change is useful because it means concurrent create/drop calls don't generate EBUSY returns.
* Mark lock_wait and checkpoint_wait as undocumented.
(cherry picked from commit 4b48ad6fb7)
uninitialized in __ref_is_leaf() (based on a call to __wt_ref_info()).
It's not really possible because the path where type isn't set is a path
where we panic because the WT_ADDR structure has an impossible type.
We already ignore the __wt_ref_info() error return in one path, and
there are only two paths that care about the returned type; remove the
error check from __wt_ref_info() and set type to 0 in the failing case
(the same value we use when there's no WT_REF addr to check), the code
that calls this function already checks addr on return.
This simplifies __ref_is_leaf() slightly, it now returns a boolean
instead of an error code with a boolean pointer argument.
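The shape of that simplification can be sketched as below. The types and function names are hypothetical stand-ins (the real WT_REF and address structures are more involved); the sketch only shows an info function with no error return that reports type 0 when there is nothing to check, and a leaf test that returns a boolean directly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not WiredTiger's own definitions. */
enum addr_type { ADDR_NONE = 0, ADDR_INTERNAL, ADDR_LEAF };

struct ref_sketch {
    const enum addr_type *addr; /* NULL when there is no address to check */
};

/* Report the address type; no error return, type is 0 when unknown. */
static void
ref_info(const struct ref_sketch *ref, enum addr_type *typep)
{
    *typep = ref->addr == NULL ? ADDR_NONE : *ref->addr;
}

/* Return a boolean directly instead of an error code plus a bool pointer. */
static bool
ref_is_leaf(const struct ref_sketch *ref)
{
    enum addr_type type;

    ref_info(ref, &type);
    return (type == ADDR_LEAF);
}
```

Because `type` is always assigned, the caller never reads it uninitialized, which is the complaint the change addresses.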
* We need to make sure that the log records in the checkpoint
* LSN are on disk. In particular to make sure that the