Test NativeLink with Buck2 #222
Comments
In testing Buck2, things seem to be working, though Buck2 does hammer the service for everything it has. From this testing I discovered a bug where too many file descriptors could be opened and end up blocking each other, resulting in a deadlock. A fix is en route (#239). The other minor bug I found is that buck2 does not send a …
In testing Buck2, it turns out it hammers the scheduler with jobs and assumes the scheduler can keep up (which is good). In local testing, where all services run on the same machine, it was deadlocking because the service performs the following operations:

1. Open file1 for reading
2. Open file2 for writing
3. Stream file1 -> file2

Since we allow users to limit the number of open files at any given time, this deadlocked: file1 was held open waiting for file2 to open, which in turn was waiting for some other file to be closed. Because Buck2 issues so many requests at once, it hit this deadlock reliably. In most production systems this is not an issue because the CAS is separated from the workers, but it could occasionally happen on the workers if `max_open_files` is set too low. To get around this, `ResumeableFileSlot` is introduced: it lets callers use a timeout and call `.close_file()` on it, and the next time the struct is used it will re-open the file.

related #222 closes #238
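For readers unfamiliar with the pattern, here is a minimal, synchronous Rust sketch of the idea: the slot remembers the path and the read offset, so the caller can release the descriptor under fd pressure and the next read transparently re-opens the file and resumes. Everything here (`ResumableReadSlot`, its `new`/`close_file`/`read` methods, the demo file) is hypothetical and simplified; NativeLink's actual `ResumeableFileSlot` is async and timeout-driven, and this is not its API.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

// Illustrative only: a read-side slot that can drop its descriptor and
// later re-open the file at the saved offset.
struct ResumableReadSlot {
    path: PathBuf,
    offset: u64,        // bytes read so far; where to resume after re-opening
    file: Option<File>, // `None` while the descriptor is released
}

impl ResumableReadSlot {
    fn new(path: PathBuf) -> Self {
        Self { path, offset: 0, file: None }
    }

    /// Release the underlying descriptor so another task may open a file.
    fn close_file(&mut self) {
        self.file = None;
    }

    /// Re-open the file if needed, seek to the saved offset, and read.
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.file.is_none() {
            let mut f = File::open(&self.path)?;
            f.seek(SeekFrom::Start(self.offset))?;
            self.file = Some(f);
        }
        let n = self.file.as_mut().expect("file was just opened").read(buf)?;
        self.offset += n as u64;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    // Self-contained demo: write a small file, read part of it, release the
    // descriptor, then resume reading from the saved offset.
    let path = std::env::temp_dir().join("resumable_slot_demo.txt");
    std::fs::write(&path, b"hello, resumable world")?;

    let mut slot = ResumableReadSlot::new(path);
    let mut buf = [0u8; 5];
    let n = slot.read(&mut buf)?; // lazily opens the file, reads "hello"
    println!("first read:  {}", String::from_utf8_lossy(&buf[..n]));

    slot.close_file(); // give the fd back, e.g. after an idle timeout

    let n = slot.read(&mut buf)?; // re-opens and resumes at offset 5: ", res"
    println!("second read: {}", String::from_utf8_lossy(&buf[..n]));
    Ok(())
}
```

The key point of the design is that the slot, not the caller, tracks the byte offset, so releasing a descriptor costs nothing more than a later open and seek when the stream resumes.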
I will work on this issue. cc: @allada, @MarcusSorealheis
I've made a draft PR for this issue against the Buck2 repo. cc: @allada, @MarcusSorealheis
I hope you can close this PR now that it's been merged into the buck2 repo. cc: @allada, @MarcusSorealheis
@aleksdmladenovic thank you for handling this!
Buck2 has the potential to gain a lot of adoption quickly. They seem to make BRE a first-class citizen, so we should test to ensure TurboCache can fully build Buck2 on Buck2.
If all passes, we should make an upstream patch to add TurboCache to their examples:
https://github.com/facebook/buck2/tree/main/examples/remote_execution