Quick question #1

Closed

Orgil opened this issue Mar 4, 2021 · 19 comments

Comments

Orgil commented Mar 4, 2021

I'm going to use it for in-memory game state along with persistence. How do I set it up so records are persisted even when memory isn't full?
Also, releasing it would be great. Thanks.

JeremyTCD (Member)

Hey! MixedStorageKVStore exposes the underlying FasterKV via MixedStorageKVStore.FasterKV. You can call methods described here on the FasterKV instance to flush records to disk and create recoverable checkpoints.
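
For illustration, a hedged sketch of what that could look like (the generic parameters are illustrative, and the flush/checkpoint method names follow FASTER's checkpointing docs, so they may differ between versions):

    // Sketch only: assumes fasterKV was obtained from MixedStorageKVStore.FasterKV;
    // generic parameters are illustrative and method names follow FASTER's docs.
    using System;
    using System.Threading.Tasks;
    using FASTER.core;

    static async Task PersistAsync(FasterKV<long, long> fasterKV)
    {
        // Flush the in-memory portion of the log to disk; 'true' waits for completion
        fasterKV.Log.Flush(true);

        // Take a recoverable checkpoint; the token identifies it for later recovery
        (bool success, Guid token) = await fasterKV.TakeFullCheckpointAsync(CheckpointType.FoldOver);
        Console.WriteLine($"Checkpoint success: {success}, token: {token}");
    }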

We'll be releasing the first version of this package tomorrow.

JeremyTCD (Member)

Unfortunately, we're freezing development of this project due to issues on Linux. Sorry about that.

badrishc commented Mar 10, 2021

Hey @JeremyTCD -- do you see the issues on Linux as fundamental or things we can fix in FASTER? It seems premature to freeze development due to this. Also, are you seeing any of the reported issues on Windows, or is it a Linux specific thing?

If you provide us a debuggable repro of the issues on Linux, I'm confident we can come up with a way to get performance on Linux back to what it should be.

JeremyTCD (Member) commented Mar 11, 2021

Hey @badrishc!

Apologies for the abrupt freeze. I put the project on hold because of looming deadlines for deliverables (our application runs on Linux VMs).

To expand on the issues:

  • Only observed on Linux/macOS (non-Windows)
  • Issues can be reproduced by running this repo's tests:
    • Clone the repo

    • Open PowerShell in the solution's root directory

    • dotnet test should yield:

      Test run for C:\Projects\Jering.KeyValueStore\test\KeyValueStore\bin\Debug\net5.0\Jering.KeyValueStore.Tests.dll (.NETCoreApp,Version=v5.0)
      Microsoft (R) Test Execution Command Line Tool Version 16.9.1
      Copyright (c) Microsoft Corporation.  All rights reserved.
      
      Starting test execution, please wait...
      A total of 1 test files matched the specified pattern.
      
      Passed!  - Failed:     0, Passed:     9, Skipped:     0, Total:     9, Duration: 8 s - Jering.KeyValueStore.Tests.dll (net5.0)  
      
    • Open a WSL terminal in the solution's root directory

    • dotnet test. The tests take a long time to complete. For reference, see our CI attempt for this library: Linux tests, macOS tests, Windows tests. In that attempt, the test LogCompaction_OccursIfSafeReadOnlyRegionIsLargerThanThreshold took 395 ms on Windows vs 18 min on Linux, and failed after 40 s on macOS. The test inserts 500 records and updates 500 records. (One test failed on Windows due to a string formatting mistake in our test; the library itself works well on Windows.)

  • Also reproducible by running benchmarks:
    • Open a WSL terminal in the solution's root directory
    • cd perf/KeyValueStore
    • dotnet run --configuration "Release"
    • Benchmarks take a long time to complete.

From issues on FASTER, it seems like the performance issue is expected. In particular, this issue highlights that operations have to run sequentially on non-Windows platforms. I'm not sure whether .NET Core has resolved the underlying limitation. If it's unresolved, I'm not sure how FASTER could proceed. I suppose FASTER could manually invoke native code to handle I/O if .NET Core still isn't up to scratch, but that seems like a difficult undertaking.

Do you think these issues are resolvable? I'd be happy to create a minimal repro and open an issue in the FASTER repo.

badrishc

My initial gut inclination is to say that these are resolvable. In fact, for the linked issue, we already work around it by keeping multiple file handles for the same file in a pool (FixedPool) in ManagedStorageDevice. This should scale out reads and writes pretty well (up to a parallelism of 8: see this). So, I am not sure what is causing the slowdown -- you could try a larger handle pool, for example, if 8 is insufficient.
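
For illustration, here's a minimal conceptual sketch of the handle-pool idea; it is not FASTER's actual FixedPool/ManagedStorageDevice code, and the names and pool size of 8 are just illustrative. The point is that multiple handles to the same file let several reads/writes be in flight concurrently:

    // Conceptual sketch only -- not FASTER's FixedPool/ManagedStorageDevice implementation.
    using System;
    using System.IO;
    using System.Threading;

    sealed class FileHandlePool : IDisposable
    {
        private readonly FileStream[] _streams;
        private int _next;

        // Open several handles to the same file so multiple I/Os can be in flight at once
        public FileHandlePool(string path, int size = 8)
        {
            _streams = new FileStream[size];
            for (int i = 0; i < size; i++)
            {
                _streams[i] = new FileStream(
                    path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite,
                    bufferSize: 4096, FileOptions.Asynchronous);
            }
        }

        // Round-robin lease: concurrent callers get different handles
        public FileStream Rent()
        {
            int i = (Interlocked.Increment(ref _next) & int.MaxValue) % _streams.Length;
            return _streams[i];
        }

        public void Dispose()
        {
            foreach (FileStream stream in _streams) stream.Dispose();
        }
    }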

Question: does the slowdown occur only when log compaction is active? How does performance look without log compaction running? That would be the first step before we analyze the impact of GC on performance.

PS. We do not have access to MacOS but have Linux.

JeremyTCD (Member)

Ah okay, I wasn't aware of the file handle pooling; I'd assumed operations were all sequential on non-Windows platforms. I'll try running with a larger pool.

Question: does the slowdown occur only when log compaction is active? How does performance look without log compaction running? That would be the first step before we analyze the impact of GC on performance.

The benchmarks mentioned in my previous comment all run with compaction disabled. So just plain inserts/reads. Their performance is poor on Linux.

I'll create a minimal repro and open an issue on FASTER later today.

PS. We do not have access to MacOS but have Linux.

Noted on MacOS.

badrishc

Sounds good. I just ran StoreLogCompaction from our public samples on a Linux box; performance seemed better than on Windows. Both systems have an NVMe drive.

Linux:

Total time to delete/upsert 1000000 elements: 0.699 secs (1430615.16 inserts/sec)
Log tail address: 1033923272
Compacting log
Log begin address: 1032876032
Log tail address: 1056970144
0: 412M
524288: 390M
Total time to delete/upsert 1000000 elements: 0.661 secs (1512859.30 inserts/sec)
Log tail address: 1080345680

Windows:

Total time to delete/upsert 1000000 elements: 1.315 secs (760456.27 inserts/sec)
Log tail address: 1503383600
Compacting log
Log begin address: 1502339072
Log tail address: 1526433184
0: 443M
524288: 436M
Total time to delete/upsert 1000000 elements: 1.241 secs (805801.77 inserts/sec)
Log tail address: 1549808960

JeremyTCD (Member) commented Mar 11, 2021

Hey, from your numbers it looks like the issue may be on my end, perhaps something to do with how the store uses sessions. My assumption about sequential operations was wrong. Upon closer inspection, it might be a concurrency issue involving SpinLocks.

I'll update the ReadMe of this project once the issue is identified and resolved. I've created a minimal repro using a session pool, and an issue to continue the discussion: [C#] Non-windows performance.
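
For reference, a minimal sketch of what a session pool along those lines might look like; this is illustrative only, not the repro's actual code, and the usage comment assumes a FasterKV<long, long> with FASTER's SimpleFunctions:

    // Illustrative session pool sketch -- not the repro's actual code. Reusing
    // pooled client sessions avoids creating a new session per operation.
    using System;
    using System.Collections.Concurrent;

    sealed class SessionPool<TSession> where TSession : class
    {
        private readonly Func<TSession> _factory;
        private readonly ConcurrentQueue<TSession> _sessions = new();

        public SessionPool(Func<TSession> factory) => _factory = factory;

        // Reuse a pooled session if one is available, otherwise create a new one
        public TSession Rent() => _sessions.TryDequeue(out TSession session) ? session : _factory();

        public void Return(TSession session) => _sessions.Enqueue(session);
    }

    // Usage (assuming a FasterKV<long, long> named store):
    // var pool = new SessionPool<ClientSession<long, long, long, long, Empty, SimpleFunctions<long, long>>>(
    //     () => store.For(new SimpleFunctions<long, long>()).NewSession<SimpleFunctions<long, long>>());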

badrishc

Sounds good, thanks.

badrishc

The latest in master should have fixed all the issues you encountered on Linux. See how to use the latest async Upsert API in our wrapper-based sample here:

https://github.com/microsoft/FASTER/tree/master/cs/playground/AsyncStress

The NuGet package will be released next week.

JeremyTCD (Member) commented Apr 17, 2021

Thanks for the fixes! I've just given the latest commit a go; the example (without tweaks) works well on Linux.

I tried converting the wrapper's underlying FasterKV to use SpanByte keys and values. UpsertAsync worked well, but Upsert hangs and ReadAsync throws the following:

Fatal error.    at FASTER.core.LightEpoch.Acquire()
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.   at FASTER.core.LightEpoch.Acquire()   at FASTER.core.LightEpoch.Acquire()Unhandled exception.System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.   at FASTER.core.LightEpoch.Acquire()

There might've been mistakes in my tweaks; I haven't had the chance to look into the issue.

Any chance you guys could add an example async wrapper for SpanByte?

badrishc

Upsert hangs

This is expected when used with tasks on Linux: Upsert blocks the caller thread (and uses up the thread) waiting for the flush to complete, so the scheduler is unable to schedule the I/O completion. For the loaded scenario, using Upsert is not recommended; use UpsertAsync instead.
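
A hedged sketch of the difference; the store/session setup is illustrative, and the UpsertAsync completion pattern follows FASTER's documentation of that era, so it may differ in later versions:

    // Sketch only: setup values are illustrative; UpsertAsync usage follows the
    // FASTER C# docs of the time and may differ in newer versions.
    using System;
    using System.Threading.Tasks;
    using FASTER.core;

    using var log = Devices.CreateLogDevice("upsert-demo.log", deleteOnClose: true);
    using var store = new FasterKV<long, long>(1L << 20, new LogSettings { LogDevice = log });
    using var session = store.For(new SimpleFunctions<long, long>()).NewSession<SimpleFunctions<long, long>>();

    // Problematic under load: a synchronous Upsert on a task thread can block that
    // thread-pool thread waiting for a flush, starving the pool of the threads
    // needed to complete the very I/O it is waiting on.
    await Task.Run(() => session.Upsert(1L, 100L));

    // Recommended: UpsertAsync releases the thread while the flush is in progress.
    var result = await session.UpsertAsync(2L, 200L);
    while (result.Status == Status.PENDING) // complete any pending operation
        result = await result.CompleteAsync();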

badrishc

ReadAsync throws the following

The error looks like you (or some unfinished task) are trying to access the store after it has been disposed. If you share the code/repro, we should be able to diagnose this.

JeremyTCD (Member) commented Apr 18, 2021

Upsert

Ah okay, I think I get it: all threads are blocked waiting for a free thread, a bit like a deadlock situation.
I'd suggest updating the docs to highlight that Upsert can't be used in MLSD multi-threaded situations.

ReadAsync

Okay will create a repro when I have some time.

badrishc

Sounds good. Just to clarify, it's happening only because the Upserts are being done by task threads. If normal threads issued the Upserts, you wouldn't see the thread starvation issue. Starvation is generally a problem for any C# blocking operation issued on a Task thread.

badrishc

The latest in this PR should work well with SpanByte as well: microsoft/FASTER#452

JeremyTCD (Member)

Just ran some basic tests with the PR. Works well!

JeremyTCD (Member)

Will clean up and publish this package sometime this week.

JeremyTCD (Member) commented May 14, 2021

I've released 1.0.0. Thanks @badrishc for all the help.
