[data] Fix high memory usage with FileBasedDatasource & ParquetDatasource when using a large number of files #55978
Conversation
…source to reduce memory usage
Signed-off-by: Jack Gammack <jgammack@etsy.com>
Thank you for the contribution, and good catch on the high memory usage for a large number of blocks. On closer look, …
Yeah, it is not used anywhere and I don't see any reason to keep it. The only argument for keeping it would be to ensure that we don't break any assumptions for downstream consumers (e.g. users who extend …).

Yes, I'm happy to update this to just remove it. The unresolved paths aren't used anywhere, and aren't accessible to users using the ….
@gvspraveen thanks for pinging me. I'm using this for lineage-tracking purposes, which is not yet open-sourced. The relation is not that clear; I'm looking at the object through …. Maybe those `unresolved_paths` could be put into GCS, similarly to the resolved paths?
I think keeping it in the object store is okay; the combined size of the many file paths isn't so bad if the list is stored only once. I only run into issues when I increase the number of blocks above 2,000 or so.

Thanks! I'd really love to avoid losing that metadata.
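The "store the list once, hand every task a tiny reference" idea discussed here can be sketched without Ray at all. The snippet below is a hypothetical stand-in for Ray's object store (a plain dict keyed by a generated id playing the role of an `ObjectRef`), just to show why per-task payload size stays constant no matter how many paths there are:

```python
import pickle
import uuid

# Stand-in for Ray's shared object store (illustrative only).
object_store = {}

def put(obj):
    ref = uuid.uuid4().hex  # stand-in for an ObjectRef
    object_store[ref] = obj
    return ref

def get(ref):
    return object_store[ref]

# A large file listing, stored exactly once.
paths = [f"s3://bucket/data/part-{i:06d}.parquet" for i in range(50_000)]
paths_ref = put(paths)

# Each "read task" serializes to a constant, tiny payload: a reference
# plus a shard index, never the full path list.
tasks = [pickle.dumps((paths_ref, shard)) for shard in range(2_000)]
assert all(len(t) < 200 for t in tasks)
assert len(object_store) == 1  # the big list exists once, not 2,000 times
```

With real Ray, `ray.put(paths)` plays the role of `put` and workers call `ray.get` on the reference; the point is only that the big object's size no longer multiplies with the block count.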
omatthew98 left a comment:
Thanks @JDarDagran for the input, thanks @JackGammack for the change and iteration. LGTM.
A couple things to note after updating:
alexeykudinkin left a comment:
Thank you for your contribution @JackGammack!
Review thread on this line of the datasource constructor:

`self._unresolved_paths = paths`
Let's actually just delete this field
Are you saying to remove `self._unresolved_paths_ref` for `ParquetDatasource`, but keep it for `FileBasedDatasource`? Or is this an older comment?
I see no good reason for us not to nuke it in both places.
I do agree it should go if it's not used anywhere currently, especially given the big performance issue, though I'm not aware of other ongoing work. It should be very easy to add back.

I have updated the PR to just remove `self._unresolved_paths` now.
Why are these changes needed?

Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning:

```
The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'.
```

When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in Ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files.

Related issue number

Similar to this issue with Iceberg: #49054

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
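The failure mode in this PR can be reproduced in miniature with plain `pickle`, no Ray required. The sketch below uses hypothetical class names (not Ray's actual internals): a datasource that keeps the full path list on `self` makes every serialized task that references the instance carry the whole list, while a datasource that drops the field serializes to almost nothing:

```python
import pickle

# A large file listing, as produced by listing an input directory.
paths = [f"s3://bucket/data/part-{i:06d}.parquet" for i in range(50_000)]

class LeakyDatasource:
    """Keeps the full path list on `self` (the pattern this PR removes)."""
    def __init__(self, paths):
        self._unresolved_paths = list(paths)

class LeanDatasource:
    """Keeps no copy of the path list on `self`."""
    def __init__(self, paths):
        pass

def task_payload(datasource, block_paths):
    # Stand-in for a serialized read task: anything the task references
    # (including the datasource instance) ends up in the pickled bytes.
    return pickle.dumps((datasource, block_paths))

leaky = len(task_payload(LeakyDatasource(paths), paths[:10]))
lean = len(task_payload(LeanDatasource(paths), paths[:10]))

# The leaky payload is megabytes; the lean one is a few hundred bytes.
# Multiply by thousands of blocks and the leaky version exhausts memory.
assert leaky > 1_000_000
assert lean < 10_000
```

This is the same effect the Ray warning above describes: the per-task cost is paid once per block, so raising `override_num_blocks` multiplies the duplicated path list until the driver or object store runs out of memory.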