Unbuffered event stream. #144

Open

eddyb opened this issue Nov 6, 2020 · 6 comments
eddyb (Member) commented Nov 6, 2020

To avoid hitting a Mutex for every event being recorded, and assuming that all the other streams have (much) lower frequency, we could keep the current format, while recording it differently:

  • after the file header, and after any non-events page, start an events page (i.e. leave enough space for a page header)
  • writing events happens ("atomically"), just like writing whole pages
    • requires the backing storage to be a mmap'd file, we don't want to actually hit a syscall here
      • we also need this for the ability to track what position we are at (using an AtomicUsize)
    • should work as we're (i.e. the atomic position is) effectively always inside an "open" events page
  • writing a non-events page (as well as finishing the file) has to write the correct length in the "currently open" events page, for which we'd have to:
    • first grab a global (per backing store) lock, to avoid page-writing races with other non-events streams
    • claim the range for the new page to write using fetch_add on the same AtomicUsize position (that writing events uses), effectively "closing" the "currently open" events page (new events would effectively end up in a newer page)
    • use the "start of the events page" (from the global lock), and the "start of the new non-events page" (claimed above) to compute the correct length to use in the header of the just-closed events page, and to write that header
    • update the "start of the events page" (in the global lock) to just after the end of the newly claimed non-events page
    • release the lock and start writing the new page (in the claimed range)
      • nothing else will access that range so there's no need to keep the lock any longer
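The bookkeeping above can be sketched as follows. This is an in-memory stand-in with hypothetical names (`Sink`, `claim_event`, `claim_non_events_page`); the real implementation would copy bytes through an mmap'd region at the claimed offsets and write the closed page's header while still holding the lock:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Offset bookkeeping only; actual byte copies into the mmap'd region are elided.
struct Sink {
    pos: AtomicUsize,                // shared write cursor into the file
    events_page_start: Mutex<usize>, // start of the "currently open" events page
}

impl Sink {
    // Lock-free fast path: claim `len` bytes for an event.
    fn claim_event(&self, len: usize) -> usize {
        self.pos.fetch_add(len, Ordering::SeqCst)
    }

    // Locked path: close the open events page and claim a non-events page.
    // Returns (closed events page start, closed page length, new page start).
    fn claim_non_events_page(&self, page_len: usize) -> (usize, usize, usize) {
        let mut events_start = self.events_page_start.lock().unwrap();
        // This fetch_add "closes" the open events page: any new events will
        // land past the claimed range, i.e. in a newer events page.
        let new_page_start = self.pos.fetch_add(page_len, Ordering::SeqCst);
        let closed_len = new_page_start - *events_start;
        let closed_start = *events_start;
        // The next events page starts right after the claimed non-events page.
        *events_start = new_page_start + page_len;
        // (the closed events page's header would be written here, under the
        // lock; the non-events page body can be written after releasing it)
        (closed_start, closed_len, new_page_start)
    }
}
```

For example, two 8-byte events followed by a 32-byte non-events page closes an events page of length 16 at offset 0, and the next event lands at offset 48, inside the new open events page.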

Sadly, bypassing File entirely will probably be most of the work here; I don't fully understand why measureme never properly mmap'd a file-on-disk (the pre-paging mmaps were "anonymous", i.e. no different from ungrowable Vec<u8>s).

I'm also not sure how we'd handle not being able to grow the mmap'd region (not without re-adding locking to event writes); we'd need to be on a 64-bit host to even grab gratuitous amounts of virtual memory without hampering the rest of the process, wouldn't we?
Maybe for 32-bit hosts we can keep paging events and writing to a File?

cc @michaelwoerister @wesleywiser

eddyb (Member, Author) commented Nov 6, 2020

There were also some ideas on Zulip around using "non-temporal" (i.e. cache-bypassing) stores to the mmap'd file, in order to avoid polluting the cache with data only the kernel should ever read again (if at all, maybe AHCI/NVMe would read it themselves from RAM through DMA), but that's farther off, and would require a scheme like this to even be plausible.

michaelwoerister (Member) commented

> Sadly bypassing File entirely will probably be most of the work here, I don't fully understand why measureme never properly mmap'd a file-on-disk

For reference, there was a (half-hearted) attempt to do that in https://github.com/rust-lang/measureme/pull/16/files. It was abandoned due to lack of time and because the existing SerializationSink implementations were deemed good enough for the time being.

michaelwoerister (Member) commented

Thanks for writing that up, @eddyb! This protocol looks quite promising to me. I like how it basically keeps the file format intact and is actually not that hard to understand (surprisingly so for something involving locks and atomic operations 😄)

The main obstacle I see is the handling of mmap'd files, especially how to grow them, and doing so across the various platforms.

eddyb (Member, Author) commented Nov 12, 2020

For growing, I came up with this hack:
Keep the "atomic writes" for the events, but have a "slow path" in the same place where "out-of-bounds panic" would normally be handled (i.e. the result of pos.fetch_add(size_of::<RawEvent>()) went out of bounds of the mmap'd region).

The "slow path" for events can grab the same lock that non-event pages use, and then we can have a common codepath for all writes, inside the lock. Events will have already done their fetch_add (that's how they hit the slow path), whereas non-event pages still have to do it to detect that growing is necessary; events should also fetch_add again inside the lock, to detect that they "lost" the race and that growing has already occurred.

Even if this is practically unlikely, if there are enough threads that fetch_add overflows (and therefore other threads might start writing at the start of the mmap'd region), we can make overflowing the atomic position mark the file as invalid (using pwrite), or even outright delete it, and panic. This should keep everything safe/correct even in a chaotic extreme.

Once everything is either waiting for a lock (or bumping pos and about to hit the lock as well), we can safely:

  • increase the size of the file on disk (using fallocate?)
  • remap the mmap'd region to point into the new area of the file
  • reset pos to 0 (effectively "unlocking" any incoming event writes)
  • go back to the start of the locked "slow path" region
    • while unlikely, it is possible that a lot of events on another thread would fill the newly-added area, and require growing again, just after pos being reset to 0
  • (eventually) unlock the non-event pages lock
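The slow path above can be sketched like this. Again an in-memory stand-in with hypothetical names: `window_base` stands in for the fallocate-and-remap step (it just records which file offset the current mapped window starts at), and `WINDOW` is an artificially tiny region size for illustration. Overflow of `pos` (the "chaotic extreme" above) is not handled here:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const WINDOW: usize = 64; // size of the mapped region (tiny, for illustration)

struct Sink {
    pos: AtomicUsize,          // write cursor, relative to the current window
    window_base: Mutex<usize>, // file offset where the current window starts
}

impl Sink {
    fn new() -> Self {
        Sink { pos: AtomicUsize::new(0), window_base: Mutex::new(0) }
    }

    // Returns the window-relative offset at which `len` bytes may be written.
    fn claim(&self, len: usize) -> usize {
        loop {
            let start = self.pos.fetch_add(len, Ordering::SeqCst);
            if start + len <= WINDOW {
                return start; // fast path: inside the current window
            }
            // Slow path: every out-of-bounds claimer ends up here.
            let mut base = self.window_base.lock().unwrap();
            // Re-check under the lock: a winner may have already grown.
            if self.pos.load(Ordering::SeqCst) + len > WINDOW {
                // We "won" the race: growing the file (fallocate?) and
                // remapping the region would happen here.
                *base += WINDOW;
                // Resetting `pos` effectively "unlocks" incoming event writes.
                self.pos.store(0, Ordering::SeqCst);
            }
            // "Losers" drop the lock and retry their fetch_add.
        }
    }

    fn window_base(&self) -> usize {
        *self.window_base.lock().unwrap()
    }
}
```

A 48-byte claim fills most of a 64-byte window; a second 48-byte claim overflows it, slides the window forward by WINDOW, and lands at relative offset 0 in the new window.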

michaelwoerister (Member) commented

> and is actually not that hard to understand (surprisingly so for something involving locks and atomic operations 😄)

I spoke too soon 😉

eddyb (Member, Author) commented Apr 3, 2022

It occurs to me that I don't know why we would want multiple threads to write to the same page - do we need the exact original interleaving? IIRC we split the event stream by thread anyway.

It would also be more efficient for every event page to contain the thread ID once, instead of wasting space on it in every event in the page.


Event pages

If each thread has its own "active event page", only running out of space in that needs any synchronization, and it can be a single AtomicUsize which allocates where in the file pages go.

Once we've reserved the offset of a page, we could posix_fallocate(fd, new_page_offset, PAGE_SIZE) - IIUC it doesn't matter in what order threads' calls land, and it will monotonically grow the file as needed (e.g. a later page can win the race, making an earlier page's posix_fallocate a no-op).

If the posix_fallocate call succeeds, we can use mmap to switch the page mapping to start at new_page_offset instead, write the page header, and go back to writing the event.

(EDIT: randomly stumbled upon NixOS/nixpkgs#101490 (comment), which suggests not all filesystems support fallocate and ftruncate may need to be used instead - looks like that can also grow the file, but has no monotonic behavior built in, so we'd need a lock around the call, which is fine by me)
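The ftruncate variant can be sketched with std alone (names hypothetical): std's File::set_len is ftruncate on Unix, and the "never shrink" check happens under a lock, since ftruncate has no built-in monotonic behavior:

```rust
use std::fs::File;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const PAGE_SIZE: u64 = 4096;

// A shared cursor hands out page offsets; the file is grown under a lock.
struct PageAlloc {
    next_page: AtomicUsize,
    grow_lock: Mutex<()>,
}

impl PageAlloc {
    fn new() -> Self {
        PageAlloc { next_page: AtomicUsize::new(0), grow_lock: Mutex::new(()) }
    }

    // Reserve the next PAGE_SIZE range in the file and make sure the file is
    // long enough to back it (a real implementation would then remap its
    // per-thread page mapping to start at the returned offset).
    fn alloc_page(&self, file: &File) -> std::io::Result<u64> {
        let offset =
            self.next_page.fetch_add(PAGE_SIZE as usize, Ordering::SeqCst) as u64;
        let needed = offset + PAGE_SIZE;
        let _g = self.grow_lock.lock().unwrap();
        if file.metadata()?.len() < needed {
            file.set_len(needed)?; // ftruncate; only ever grows, under the lock
        }
        Ok(offset)
    }
}
```

Two allocations from a fresh file hand out offsets 0 and 4096 and leave the file 8192 bytes long, regardless of which thread's grow lands first.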

Non-event pages

For non-event pages we could have them managed globally under a lock, while sharing the common AtomicUsize for page offset allocation inside the file. If they're rare enough we can even avoid mmap and just pwrite the data into the right position.


Overall this is a much simpler design, assuming we never need the exact cross-thread interleaving in the file (i.e. readers de-interleave into per-thread streams anyway).
The main downside AFAICT is that it requires an API split between (names subject to bikeshed):

  • Profiler newtype of some Arc<GlobalState>
    • no event recording operations, only emitting into non-events pages
  • PerThreadProfiler (!Send & !Sync) wrapper of Profiler + thread-specific state
    • stores thread ID on creation, which never becomes invalid thanks to !Send/!Sync
      • could even allocate thread IDs from Arc<GlobalState>, or instead register something from std::thread to disallow misuse and error when creating more than one PerThreadProfiler per thread per Profiler
    • exposes the Profiler API as well (Deref I guess?)
    • manages its own events page and emits events into it

I think I've been confused before by the lack of a separate PerThreadProfiler; there's probably more than one reason to have this kind of explicit design anyway.
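The split above could look roughly like this (all names hypothetical, per the bikeshed caveat; a raw-pointer PhantomData is one way to make the per-thread handle !Send + !Sync so its thread ID never becomes invalid):

```rust
use std::marker::PhantomData;
use std::ops::Deref;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

struct GlobalState {
    next_thread_id: AtomicU32,
    // ... shared sink state, non-events page lock, etc.
}

// Cloneable, thread-safe handle; only emits into non-events pages.
#[derive(Clone)]
struct Profiler(Arc<GlobalState>);

// Per-thread handle: owns its events page and records events into it.
struct PerThreadProfiler {
    profiler: Profiler,
    thread_id: u32,
    // raw pointer makes this !Send + !Sync, pinning it to its thread
    _not_send_sync: PhantomData<*mut ()>,
}

impl Profiler {
    fn new() -> Self {
        Profiler(Arc::new(GlobalState { next_thread_id: AtomicU32::new(0) }))
    }

    // Allocate a thread ID from the shared state, as suggested above.
    fn for_current_thread(&self) -> PerThreadProfiler {
        PerThreadProfiler {
            profiler: self.clone(),
            thread_id: self.0.next_thread_id.fetch_add(1, Ordering::Relaxed),
            _not_send_sync: PhantomData,
        }
    }
}

// Expose the Profiler API through the per-thread wrapper.
impl Deref for PerThreadProfiler {
    type Target = Profiler;
    fn deref(&self) -> &Profiler {
        &self.profiler
    }
}
```

This sketch doesn't guard against creating two PerThreadProfilers on one thread; the std::thread-registration idea above would be needed for that.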

cc @wesleywiser
