Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve startup performance of airbyte-ci #34430

Conversation

stephane-airbyte
Copy link
Contributor

@stephane-airbyte stephane-airbyte commented Jan 22, 2024

when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to clob(pattern, recursive=true), which seem unnecessary.
The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75

✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    1m44.166s
user    0m12.037s
sys     0m8.598s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    0m26.372s
user    0m8.554s
sys     0m7.045s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    1m6.916s
user    0m6.614s
sys     0m4.071s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    0m7.116s
user    0m4.616s
sys     0m3.042s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]

For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...

There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.

@stephane-airbyte stephane-airbyte requested a review from a team January 22, 2024 22:21
Copy link

vercel bot commented Jan 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Feb 8, 2024 6:44am

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @stephane-airbyte and the rest of your teammates on Graphite Graphite

@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 3 times, most recently from 0d1afa6 to 7048a7e Compare January 23, 2024 00:44
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting improvement!
I know glob is known to be slow.
I originally used it as I think the first dagger version we used did not support wildcards so I tried to build a full list of excluded files.

I think it's worth adding a unit test to the get_repo_dir method to make sure this does not lead to a regression.

A workaround to just speed up format would be to make DEFAULT_EXCLUDED_FILES a property/function, this would prevent from evaluating the glob wherever PipelineContext is imported. If I'm not mistaken forlmat is not using PipelineContext instance, it means PipelineContext is imported in a module used during format (and its class variable with glob are evaluated).

@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 2 times, most recently from 4b5afe7 to 5e5cf33 Compare January 24, 2024 16:14
@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 3 times, most recently from 2824d05 to d3eb495 Compare January 31, 2024 23:16
test_context.dagger_client = dagger_client
filtered_directory = await test_context.get_repo_dir("airbyte-ci/connectors/")
unfiltered_directory = await dagger_client.host().directory("airbyte-ci/connectors/")
raise Exception("" + str((await filtered_directory.entries())) + "--" + str((await unfiltered_directory.entries())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephane-airbyte you told me in DM that something is going wrong with this test? Could you please share more details? Raising here will fail the test.
Is the problem that filtered and unfiltered entries are the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's exactly my problem. And that's without my changes to the filter logic. So either I'm missing something (likely) or the filter is useless

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have some better understanding of what was going on.
The filter starts on the current working dir, and it's always being set to the directory from which I was executing poetry.
In pycharm I'm able to change that by modifying the working directory in the running configuration. But basically, I can't get this test to pass from the command line.
That being said, when running it from pycharm, the result is that dagger_client.host().directory() doesn't support the ** construct. So I ended up implemented what you suggested, with a cache, so we only call glob 12 (!) times per instance of PipelineContext

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I'm not sure that the current dependency on the working directory is a desired behavior. I don't have enough understanding of the code to be definitive, but I wonder if we shouldn't use the root_dir parameter to glob.
Finally, I think we could improve the performance further by ourselves filtering the values out of glob, so we only walk the filesystem once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't resist but try to avoid doing so many calls to glob. See the faster function. I checked the returned set versus the original one, and they're the same. I haven't measured the speed difference, but I'd expect it to be another 12x. Now, the bottleneck for format seems to be the time to get a dagger_client

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I'm not sure that the current dependency on the working directory is a desired behavior.

I can't say it desired but it's a fact - a lot of the logic in this codebase depends on the repo structure. This is why we call this function very early in the execution path . It'd be interesting to call it in your tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually did call this function first thing in my test, and still the working dir was somehow set to the directory from which the tests were started...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to just @lru_cache the previous logic. My attempt depends on python 3.11+, which we don't seem to have everywhere :(

@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 2 times, most recently from 08ffeb7 to ae23f84 Compare February 2, 2024 07:54
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm validating this approach, can you please document faster_get_default_excluded_files and make it be the one called?

Please also bump the package version and update the changelog 🙏

Comment on lines 54 to 78
for file in all_files
if file.find("/build/") > 0
or file.endswith("/build")
or file.find("/.venv/") > 0
or file.endswith("/.venv")
or file.find("/secrets/") > 0
or file.endswith("/secrets")
or file.find("/__pycache__/") > 0
or file.endswith("/__pycache__")
or file.find(".egg-info/") > 0
or file.endswith(".egg-info")
or file.find("/.vscode/") > 0
or file.endswith("/.vscode")
or file.find("/.pytest_cache/") > 0
or file.endswith("/.pytest_cache")
or file.find("/.eggs/") > 0
or file.endswith("/.eggs")
or file.find("/.mypy_cache/") > 0
or file.endswith("/.mypy_cache")
or file.find("/.DS_Store/") > 0
or file.endswith("/.DS_Store")
or file.find("/airbyte_ci_logs/") > 0
or file.endswith("/airbyte_ci_logs")
or file.find("/.gradle/") > 0
or file.endswith("/.gradle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we dynamically generate these or statement from a list with wildwards which looks like DEFAULT_EXCLUDED_FILES?

+ glob("**/airbyte_ci_logs", recursive=True)
+ glob("**/.gradle", recursive=True)
)
def faster_get_default_excluded_files(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You could use a @lru_cache decorator to memoize this function and not assign to self._faster_excluded_files

+ glob("**/airbyte_ci_logs", recursive=True)
+ glob("**/.gradle", recursive=True)
)
def faster_get_default_excluded_files(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in a docstring why its faster?

self._faster_excluded_files = [".git", "airbyte-ci/connectors/pipelines/*"] + self._faster_excluded_files
return self._faster_excluded_files

def get_default_excluded_files(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the other one is faster let's keep the other one wdyt?

@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch from ae23f84 to d3de13f Compare February 2, 2024 19:31
@stephane-airbyte stephane-airbyte requested a review from a team as a code owner February 2, 2024 19:31
@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 2 times, most recently from c9c79fc to 0d48365 Compare February 2, 2024 19:35
@octavia-squidington-iii octavia-squidington-iii removed the area/connectors Connector related issues label Feb 2, 2024
@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 6 times, most recently from 6b099bb to f9cd487 Compare February 2, 2024 19:47
@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch from f9cd487 to 1c035bc Compare February 8, 2024 05:38
@stephane-airbyte stephane-airbyte requested a review from a team as a code owner February 8, 2024 05:38
@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch 2 times, most recently from 1d7cee9 to ccf7293 Compare February 8, 2024 06:35
@stephane-airbyte stephane-airbyte force-pushed the stephane/01-22-improve_startup_performance_of_airbyte-ci branch from ccf7293 to 2d109b6 Compare February 8, 2024 06:44
@stephane-airbyte stephane-airbyte merged commit 955b9eb into master Feb 8, 2024
24 checks passed
@stephane-airbyte stephane-airbyte deleted the stephane/01-22-improve_startup_performance_of_airbyte-ci branch February 8, 2024 07:07
Copy link
Contributor Author

Merge activity

xiaohansong pushed a commit that referenced this pull request Feb 13, 2024
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary.
The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75

```
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    1m44.166s
user    0m12.037s
sys     0m8.598s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    0m26.372s
user    0m8.554s
sys     0m7.045s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    1m6.916s
user    0m6.614s
sys     0m4.071s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    0m7.116s
user    0m4.616s
sys     0m3.042s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
```

For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...

There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary.
The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75

```
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    1m44.166s
user    0m12.037s
sys     0m8.598s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    0m26.372s
user    0m8.554s
sys     0m7.045s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    1m6.916s
user    0m6.614s
sys     0m4.071s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    0m7.116s
user    0m4.616s
sys     0m3.042s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
```

For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...

There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary.
The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75

```
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    1m44.166s
user    0m12.037s
sys     0m8.598s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    0m26.372s
user    0m8.554s
sys     0m7.045s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    1m6.916s
user    0m6.614s
sys     0m4.071s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    0m7.116s
user    0m4.616s
sys     0m3.042s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
```

For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...

There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary.
The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75

```
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    1m44.166s
user    0m12.037s
sys     0m8.598s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    0m26.372s
user    0m8.554s
sys     0m7.045s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    1m6.916s
user    0m6.614s
sys     0m4.071s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    0m7.116s
user    0m4.616s
sys     0m3.042s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
```

For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...

There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
xiaohansong pushed a commit that referenced this pull request Feb 27, 2024
when using airbyte ci, I noticed that it's very slow to start. That's especially visible if pre-push hooks are set up to call airbyte-ci format. A little bit of inspection showed me that the main contributors were the calls to `clob(pattern, recursive=true)`, which seem unnecessary.
The hypothesys mostly come form the fac that https://github.com/airbytehq/airbyte/pull/31525/files introduced the new behavior, and that the previous behavior can be seen at https://github.com/airbytehq/airbyte/pull/31525/files#diff-b86e6d2f27d70812919004f8b786a5bb0ea5a7a28ec99c5ffd8773d9eb424eecL75

```
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:32 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    1m44.166s
user    0m12.037s
sys     0m8.598s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:33 $ time airbyte-ci --disable-update-check format check all > /dev/null

real    0m26.372s
user    0m8.554s
sys     0m7.045s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:37 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    1m6.916s
user    0m6.614s
sys     0m4.071s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
14:39 $ time airbyte-ci-dev --disable-update-check format check all > /dev/null

real    0m7.116s
user    0m4.616s
sys     0m3.042s
✔ ~/dev/airbyte [stephane/01-22-improve_startup_performance_of_airbyte-ci|⚑ 2]
```

For some reason the 1st call to airbyte-ci or airbyte-ci-dev is much slower than subsequent calls. But you can see the 1st call going from 1m44s to 1m6s, and the 2nd call going from 26s to 7s...

There should also be some memory consumption improvements because we are not storing the list of files we want to ignore anymore.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants