
CompleteCheckpoint(true) never completes for CheckpointStrategy.FoldOver #153

Closed
marius-klimantavicius opened this issue Jul 8, 2019 · 6 comments

@marius-klimantavicius
Contributor

I've noticed that sometimes CompleteCheckpoint(true) gets stuck in WAIT_FLUSH when the FoldOver strategy is used. The most consistent way for me to reproduce it was:

  1. Start session
  2. Upsert a key/value
  3. Take full checkpoint, CompleteCheckpoint(true)
  4. Stop session
  5. Exit app
  6. Start app
  7. Recover
  8. Start session
  9. Read a key/value
  10. Stop session
  11. Start session
  12. TakeFullCheckpoint and CompleteCheckpoint(true)

It seems that if there were no changes in the session, there is nothing to write to the checkpoint, yet it still waits for a flush that never happens. This has also happened to me (a few times, though inconsistently) even in cases where other sessions were writing key/values.
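
For concreteness, a rough sketch of the two runs is below. It assumes the 2019-era FasterKV API with thread-affinitized sessions; MyStore is a hypothetical alias for whatever FasterKV<...> instantiation the app uses, configured with the FoldOver checkpoint strategy, and the store construction, Functions implementation, and exact Upsert/Read overloads are approximations rather than verified signatures.

using System;
using FASTER.core;

static class FoldOverRepro
{
    // First process run: write one record, checkpoint, shut down. The checkpoint completes promptly.
    static Guid FirstRun(MyStore store) // MyStore: hypothetical FasterKV<...> alias, FoldOver checkpoints
    {
        store.StartSession();                                 // 1. start session
        long key = 1, value = 42;
        store.Upsert(ref key, ref value, Empty.Default, 0);   // 2. upsert a key/value
        store.TakeFullCheckpoint(out Guid token);             // 3. take full checkpoint
        store.CompleteCheckpoint(true);                       //    returns quickly on this run
        store.StopSession();                                  // 4. stop session
        return token;                                         // 5./6. exit the app, then start it again
    }

    // Second process run: recover, read only (no mutations), checkpoint again. This one hangs.
    static void SecondRun(MyStore store, Guid token)
    {
        store.Recover(token);                                          // 7. recover
        store.StartSession();                                          // 8. start session
        long key = 1, input = 0, output = 0;
        store.Read(ref key, ref input, ref output, Empty.Default, 0);  // 9. read a key/value
        store.StopSession();                                           // 10. stop session

        store.StartSession();                                          // 11. start session
        store.TakeFullCheckpoint(out _);                               // 12. take full checkpoint...
        store.CompleteCheckpoint(true);                                //     ...stuck in WAIT_FLUSH
        store.StopSession();
    }
}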

@badrishc
Collaborator

Is this an issue with 'variable length structs' as well?

@marius-klimantavicius
Contributor Author

I am sure I encountered this both with variable length structs and when using the object log; I don't remember whether I tested fixed length key/values.

@marius-klimantavicius
Contributor Author

Created gists that reproduce the issue:

@badrishc
Collaborator

I ran the repro for fixed size (x64, debug/release, net46) around 10 times, no error. Can you try the same config and see?

@marius-klimantavicius
Contributor Author

I can reliably reproduce this with x64/release/(net46, net472). The first time the app is run it finishes within a second; on the second and subsequent runs I give up after 10-15 seconds and kill the process (when I first encountered it I waited for 15 minutes [went for lunch] and it did not complete).
This is what I see in the trace window:
This is what I see in trace window:

********* Primary Recovery Information ********
Index Checkpoint: e86357a9-0461-46e7-a2e2-d22e4bf7523c
HybridLog Checkpoint: 36c44720-c0a4-4569-8051-456a0987b7e5
******** Index Checkpoint Info for e86357a9-0461-46e7-a2e2-d22e4bf7523c ********
Table Size: 2097152
Main Table Size (in GB): 0.134217728
Overflow Table Size (in GB): 1.024E-06
Num Buckets: 16
Start Logical Address: 88
Final Logical Address: 88
******** HybridLog Checkpoint Info for 36c44720-c0a4-4569-8051-456a0987b7e5 ********
Version: 1
Is Snapshot?: False
Flushed LogicalAddress: 0
Start Logical Address: 88
Final Logical Address: 88
Num sessions recovered: 1
Recovered sessions: 
693a141d-289b-437d-a340-73c3d3d977be: 0
******* Recovered HybridLog Stats *******
Head Address: 64
Safe Head Address: 64
ReadOnly Address: 88
Safe ReadOnly Address: 88
Tail Address: 88
Moved to INTERMEDIATE, 2
Moved to PREP_INDEX_CHECKPOINT, 2
Moved to INTERMEDIATE, 2
Moved to PREPARE, 2
Moved to INTERMEDIATE, 2
Moved to IN_PROGRESS, 3
Moved to INTERMEDIATE, 3
Moved to WAIT_PENDING, 3
Moved to INTERMEDIATE, 3
Moved to WAIT_FLUSH, 3

My little investigation shows the following (a simplified sketch of this failure mode follows the list):

  1. We move from WAIT_PENDING to WAIT_FLUSH.
  2. We check whether the strategy is FoldOverSnapshot:
if (FoldOverSnapshot)
{
    hlog.ShiftReadOnlyToTail(out long tailAddress);
    _hybridLogCheckpoint.info.finalLogicalAddress = tailAddress;
}
  3. ShiftReadOnlyToTail tries to monotonically update ReadOnlyAddress, but fails because ReadOnlyAddress and tailAddress are already equal (88 in my case).
  4. Because the monotonic update failed, OnPagesMarkedReadOnly is never called, and so AsyncFlushPages is not called either.
  5. As the flush was never issued, we are stuck forever in the WAIT_FLUSH state.
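
To make the failure mode concrete, here is a minimal, self-contained sketch of it. This is not FASTER's actual code; MonotonicUpdate, ShiftReadOnlyToTail, and OnPagesMarkedReadOnly mirror the names above but their bodies are simplified for illustration, with the addresses taken from the trace.

using System;
using System.Threading;

class ShiftReadOnlyDemo
{
    long ReadOnlyAddress = 88;   // values from the trace above: ReadOnly == Tail == 88
    long TailAddress = 88;
    public bool FlushIssued;     // stands in for AsyncFlushPages having been scheduled

    // Monotonic update in the style used by ShiftReadOnlyToTail: it only succeeds
    // when the new value is strictly greater than the current one.
    static bool MonotonicUpdate(ref long variable, long newValue, out long oldValue)
    {
        do
        {
            oldValue = Volatile.Read(ref variable);
            if (oldValue >= newValue) return false;   // nothing to advance, report failure
        } while (Interlocked.CompareExchange(ref variable, newValue, oldValue) != oldValue);
        return true;
    }

    // Simplified ShiftReadOnlyToTail: the flush path is only reached on a successful update.
    public void ShiftReadOnlyToTail(out long tailAddress)
    {
        tailAddress = TailAddress;
        if (MonotonicUpdate(ref ReadOnlyAddress, tailAddress, out _))
            OnPagesMarkedReadOnly(tailAddress);        // would eventually trigger AsyncFlushPages
        // else: no callback fires, so nothing ever reports "flushed up to tailAddress"
    }

    void OnPagesMarkedReadOnly(long newSafeReadOnlyAddress) => FlushIssued = true;

    static void Main()
    {
        var demo = new ShiftReadOnlyDemo();
        demo.ShiftReadOnlyToTail(out long tail);
        // ReadOnlyAddress == TailAddress == 88, so the update fails and no flush is issued,
        // while the checkpoint state machine keeps waiting in WAIT_FLUSH for the flush to
        // reach finalLogicalAddress (88).
        Console.WriteLine($"tail={tail}, flush issued={demo.FlushIssued}");   // flush issued=False
    }
}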

@marius-klimantavicius
Contributor Author

marius-klimantavicius commented Jul 14, 2019

I've just checked that this seems to have been fixed in the master branch (by #144); all of my tests were done with the latest release (2019.4.24.4).
