Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test NativeLink with Buck2 #222

Closed
allada opened this issue Jul 23, 2023 · 5 comments
Closed

Test NativeLink with Buck2 #222

allada opened this issue Jul 23, 2023 · 5 comments

Comments

@allada
Copy link
Member

allada commented Jul 23, 2023

Buck2 has potential for gaining a lot of adoption quickly. They seem to make BRE a first-class-citizen, so we should test to ensure TruboCache can fully build Buck2 on Buck2.

If all passes, we should make an upstream patch to add TurboCache to their examples:
https://github.com/facebook/buck2/tree/main/examples/remote_execution

allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
@allada
Copy link
Member Author

allada commented Aug 26, 2023

In testing Buck2, things seem to be working. Buck2 does hammer the service for all it has though. From this testing I did discover a bug where too many file descriptors could be opened and end up blocking each other resulting in a deadlock.

A fix is en route #239.

The other minor bug I found is that buck2 does not send a PATH env. I'm going to make a simple env injector for the worker in a followup PR to solve this issue.

allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
allada added a commit that referenced this issue Aug 26, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
closes #238
allada added a commit that referenced this issue Sep 3, 2023
In testing Buck2, it turns out they hammer the scheduler with jobs
and assume it can keep up (which is good). On local testing, where
all services are on the same machine it was deadlocking because
it was performing the following operations:
1. Open file1 for reading
2. Open file2 for writing
3. Streaming file1 -> file2

Since we allow users to limit the number of open files at any given
time, this was deadlocking because file1 was held open waiting
for file2 to open, which was waiting for a file to be closed. Since
buck2 goes crazy, it was causing a deadlock.

In most production systems this is not an issue because the CAS
is separated from the workers, but rarely might happen on the workers
if the `max_open_files` was set too low.

To get around this issue `ResumeableFileSlot` is introduced. It
allows callers to use a timeout and call `.close_file()` on it
and the next time the struct is used it will re-open the file.

related #222
closes #238
@aleksdmladenovic
Copy link
Contributor

I will work on this issue. cc: @allada , @MarcusSorealheis

@allada allada changed the title Test TruboCache with Buck2 Test NativeLink with Buck2 May 16, 2024
@aleksdmladenovic
Copy link
Contributor

I've made PR draft for this issue to Buck2 repo.
facebook/buck2#658

cc: @allada , @MarcusSorealheis

@aleksdmladenovic
Copy link
Contributor

aleksdmladenovic commented Jul 17, 2024

Hope you to close this PR since it's merged to buck2 repo.

facebook/buck2#658

cc: @allada , @MarcusSorealheis

@adam-singer
Copy link
Contributor

@aleksdmladenovic thank you for handling this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants