-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve startup performance of airbyte-ci #34430
improve startup performance of airbyte-ci #34430
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @stephane-airbyte and the rest of your teammates on Graphite |
0d1afa6
to
7048a7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting improvement!
I know glob
is known to be slow.
I originally used it as I think the first dagger version we used did not support wildcards so I tried to build a full list of excluded files.
I think it's worth adding a unit test to the get_repo_dir
method to make sure this does not lead to a regression.
A workaround to just speed up format would be to make DEFAULT_EXCLUDED_FILES
a property/function, this would prevent from evaluating the glob
wherever PipelineContext
is imported. If I'm not mistaken forlmat
is not using PipelineContext
instance, it means PipelineContext
is imported in a module used during format (and its class variable with glob are evaluated).
4b5afe7
to
5e5cf33
Compare
2824d05
to
d3eb495
Compare
test_context.dagger_client = dagger_client | ||
filtered_directory = await test_context.get_repo_dir("airbyte-ci/connectors/") | ||
unfiltered_directory = await dagger_client.host().directory("airbyte-ci/connectors/") | ||
raise Exception("" + str((await filtered_directory.entries())) + "--" + str((await unfiltered_directory.entries()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephane-airbyte you told me in DM that something is going wrong with this test? Could you please share more details? Raising here will fail the test.
Is the problem that filtered and unfiltered entries are the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's exactly my problem. And that's without my changes to the filter logic. So either I'm missing something (likely) or the filter is useless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have some better understanding of what was going on.
The filter starts on the current working dir, and it's always being set to the directory from which I was executing poetry.
In pycharm I'm able to change that by modifying the working directory in the running configuration. But basically, I can't get this test to pass from the command line.
That being said, when running it from pycharm, the result is that dagger_client.host().directory()
doesn't support the **
construct. So I ended up implemented what you suggested, with a cache, so we only call glob 12 (!) times per instance of PipelineContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, I'm not sure that the current dependency on the working directory is a desired behavior. I don't have enough understanding of the code to be definitive, but I wonder if we shouldn't use the root_dir
parameter to glob
.
Finally, I think we could improve the performance further by ourselves filtering the values out of glob, so we only walk the filesystem once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't resist but try to avoid doing so many calls to glob. See the faster
function. I checked the returned set versus the original one, and they're the same. I haven't measured the speed difference, but I'd expect it to be another 12x. Now, the bottleneck for format seems to be the time to get a dagger_client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, I'm not sure that the current dependency on the working directory is a desired behavior.
I can't say it desired but it's a fact - a lot of the logic in this codebase depends on the repo structure. This is why we call this function very early in the execution path . It'd be interesting to call it in your tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually did call this function first thing in my test, and still the working dir was somehow set to the directory from which the tests were started...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to just @lru_cache the previous logic. My attempt depends on python 3.11+, which we don't seem to have everywhere :(
08ffeb7
to
ae23f84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm validating this approach, can you please document faster_get_default_excluded_files
and make it be the one called?
Please also bump the package version and update the changelog 🙏
for file in all_files | ||
if file.find("/build/") > 0 | ||
or file.endswith("/build") | ||
or file.find("/.venv/") > 0 | ||
or file.endswith("/.venv") | ||
or file.find("/secrets/") > 0 | ||
or file.endswith("/secrets") | ||
or file.find("/__pycache__/") > 0 | ||
or file.endswith("/__pycache__") | ||
or file.find(".egg-info/") > 0 | ||
or file.endswith(".egg-info") | ||
or file.find("/.vscode/") > 0 | ||
or file.endswith("/.vscode") | ||
or file.find("/.pytest_cache/") > 0 | ||
or file.endswith("/.pytest_cache") | ||
or file.find("/.eggs/") > 0 | ||
or file.endswith("/.eggs") | ||
or file.find("/.mypy_cache/") > 0 | ||
or file.endswith("/.mypy_cache") | ||
or file.find("/.DS_Store/") > 0 | ||
or file.endswith("/.DS_Store") | ||
or file.find("/airbyte_ci_logs/") > 0 | ||
or file.endswith("/airbyte_ci_logs") | ||
or file.find("/.gradle/") > 0 | ||
or file.endswith("/.gradle") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we dynamically generate these or statement from a list with wildwards which looks like DEFAULT_EXCLUDED_FILES
?
+ glob("**/airbyte_ci_logs", recursive=True) | ||
+ glob("**/.gradle", recursive=True) | ||
) | ||
def faster_get_default_excluded_files(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You could use a @lru_cache
decorator to memoize this function and not assign to self._faster_excluded_files
+ glob("**/airbyte_ci_logs", recursive=True) | ||
+ glob("**/.gradle", recursive=True) | ||
) | ||
def faster_get_default_excluded_files(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain in a docstring why its faster?
self._faster_excluded_files = [".git", "airbyte-ci/connectors/pipelines/*"] + self._faster_excluded_files | ||
return self._faster_excluded_files | ||
|
||
def get_default_excluded_files(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the other one is faster let's keep the other one wdyt?
ae23f84
to
d3de13f
Compare
c9c79fc
to
0d48365
Compare
6b099bb
to
f9cd487
Compare
f9cd487
to
1c035bc
Compare
1d7cee9
to
ccf7293
Compare
ccf7293
to
2d109b6
Compare
Merge activity
|
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary. The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75 ``` ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null real 1m44.166s user 0m12.037s sys 0m8.598s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null real 0m26.372s user 0m8.554s sys 0m7.045s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 1m6.916s user 0m6.614s sys 0m4.071s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 0m7.116s user 0m4.616s sys 0m3.042s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] ``` For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s... There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary. The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75 ``` ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null real 1m44.166s user 0m12.037s sys 0m8.598s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null real 0m26.372s user 0m8.554s sys 0m7.045s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 1m6.916s user 0m6.614s sys 0m4.071s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 0m7.116s user 0m4.616s sys 0m3.042s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] ``` For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s... There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary. The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75 ``` ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null real 1m44.166s user 0m12.037s sys 0m8.598s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null real 0m26.372s user 0m8.554s sys 0m7.045s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 1m6.916s user 0m6.614s sys 0m4.071s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 0m7.116s user 0m4.616s sys 0m3.042s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] ``` For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s... There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary. The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75 ``` ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null real 1m44.166s user 0m12.037s sys 0m8.598s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null real 0m26.372s user 0m8.554s sys 0m7.045s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 1m6.916s user 0m6.614s sys 0m4.071s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 0m7.116s user 0m4.616s sys 0m3.042s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] ``` For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s... There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary. The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75 ``` ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null real 1m44.166s user 0m12.037s sys 0m8.598s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null real 0m26.372s user 0m8.554s sys 0m7.045s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 1m6.916s user 0m6.614s sys 0m4.071s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] 14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null real 0m7.116s user 0m4.616s sys 0m3.042s ✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2] ``` For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s... There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to
clob(pattern, recursive=true)
, which seem unnecessary.The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75
For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...
There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.