Fix memory unbounded Arrow data format export/import #1169

Merged
Changes from 7 commits
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
(<https://github.com/openvinotoolkit/datumaro/pull/1162>)
- Fix hyperlink errors in the document
(<https://github.com/openvinotoolkit/datumaro/pull/1159>, <https://github.com/openvinotoolkit/datumaro/pull/1161>)
- Fix memory unbounded Arrow data format export/import
(<https://github.com/openvinotoolkit/datumaro/pull/1169>)

## 15/09/2023 - Release 1.5.0
### New features
7 changes: 2 additions & 5 deletions docs/source/docs/data-formats/formats/arrow.md
@@ -178,13 +178,10 @@ Extra options for exporting to Arrow format:
- `JPEG/95`: [JPEG](https://en.wikipedia.org/wiki/JPEG) with 95 quality
- `JPEG/75`: [JPEG](https://en.wikipedia.org/wiki/JPEG) with 75 quality
- `NONE`: skip saving image.
- `--max-chunk-size MAX_CHUNK_SIZE` allow to specify maximum chunk size (batch size) when saving into arrow format.
- `--max-shard-size MAX_SHARD_SIZE` allow to specify maximum number of dataset items when saving into arrow format.
(default: `1000`)
- `--num-shards NUM_SHARDS` allow to specify the number of shards to generate.
`--num-shards` and `--max-shard-size` are mutually exclusive.
(default: `1`)
- `--max-shard-size MAX_SHARD_SIZE` allow to specify maximum size of each shard. (e.g. 7KB = 7 \* 2^10, 3MB = 3 \* 2^20, and 2GB = 2 \* 2^30)
`--num-shards` and `--max-shard-size` are mutually exclusive.
`--num-shards` and `--max-shard-size` are mutually exclusive.
(default: `None`)
- `--num-workers NUM_WORKERS` allow to multi-processing for the export. If num_workers = 0, do not use multiprocessing (default: `0`).
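
The updated `--max-shard-size` line above describes a bound on the number of dataset items per shard. As a rough illustration of why such a bound keeps memory use flat, here is a minimal sketch that cuts a stream of items into shards of at most `max_shard_size` items, holding only one shard in memory at a time. `iter_shards` is a hypothetical helper written for this note, not a function from Datumaro, and it makes no claim about how the actual exporter is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_shards(items: Iterable[T], max_shard_size: int = 1000) -> Iterator[List[T]]:
    """Yield lists of at most `max_shard_size` items (hypothetical helper)."""
    if max_shard_size <= 0:
        raise ValueError("max_shard_size must be positive")
    it = iter(items)
    while True:
        # Pull at most one shard's worth of items from the stream.
        shard = list(islice(it, max_shard_size))
        if not shard:
            return
        yield shard


# 2500 items with a bound of 1000 give two full shards and one partial shard.
shard_lengths = [len(s) for s in iter_shards(range(2500), max_shard_size=1000)]
print(shard_lengths)  # [1000, 1000, 500]
```

Because each shard is built and yielded before the next one is read, peak memory depends on `max_shard_size` and not on the total dataset size.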

6 changes: 3 additions & 3 deletions src/datumaro/components/dataset_base.py
@@ -178,13 +178,13 @@ def media_type(_):

         return _DatasetFilter()

-    def infos(self):
+    def infos(self) -> DatasetInfo:
         return {}

-    def categories(self):
+    def categories(self) -> CategoriesInfo:
         return {}

-    def get(self, id, subset=None):
+    def get(self, id, subset=None) -> Optional[DatasetItem]:
         subset = subset or DEFAULT_SUBSET_NAME
         for item in self:
             if item.id == id and item.subset == subset:
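
The new `Optional[DatasetItem]` annotation on `get` signals that a lookup can come back empty. The sketch below mirrors the visible lines of the hunk in a self-contained toy class. `Item`, `TinyDataset`, and the value of `DEFAULT_SUBSET_NAME` are stand-ins invented for illustration, and the `return item` / `return None` tail is an assumption, since the hunk is cut off before the end of the method.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional

DEFAULT_SUBSET_NAME = "default"  # assumed value, for illustration only


@dataclass
class Item:
    """Stand-in for a dataset item; not Datumaro's DatasetItem."""

    id: str
    subset: str = DEFAULT_SUBSET_NAME


class TinyDataset:
    """Toy dataset that only supports iteration and lookup."""

    def __init__(self, items: List[Item]) -> None:
        self._items = items

    def __iter__(self) -> Iterator[Item]:
        return iter(self._items)

    def get(self, id: str, subset: Optional[str] = None) -> Optional[Item]:
        subset = subset or DEFAULT_SUBSET_NAME
        # Linear scan, as in the hunk above.
        for item in self:
            if item.id == id and item.subset == subset:
                return item  # assumed: the original body is truncated here
        return None  # assumed: a miss yields None, matching the Optional hint


ds = TinyDataset([Item("a"), Item("b", "train")])
found = ds.get("b", "train")
missing = ds.get("b")  # "b" is not in the default subset
```

Callers of a method typed this way should check for `None` before using the result.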
2 changes: 1 addition & 1 deletion src/datumaro/components/format_detection.py
@@ -319,7 +319,7 @@ def _require_files_iter(
     @contextlib.contextmanager
     def probe_text_file(
         self, path: str, requirement_desc: str, is_binary_file: bool = False
-    ) -> Union[BufferedReader, TextIO]:
+    ) -> Iterator[Union[BufferedReader, TextIO]]:
         """
         Returns a context manager that can be used to place a requirement on
         the contents of the file referred to by `path`. To do so, you must
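
The annotation change in this hunk follows from how `contextlib.contextmanager` works: the decorated function is a generator, so its own return annotation describes the generator (`Iterator[...]`), while the object bound by `with ... as f` is the yielded value. A minimal sketch of the same pattern follows. `open_for_probe` is a hypothetical function written for this note, not the body of `probe_text_file`.

```python
import contextlib
import os
import tempfile
from io import BufferedReader
from typing import Iterator, TextIO, Union


@contextlib.contextmanager
def open_for_probe(
    path: str, is_binary_file: bool = False
) -> Iterator[Union[BufferedReader, TextIO]]:
    """Yield an open file handle and close it on exit (hypothetical helper)."""
    mode = "rb" if is_binary_file else "r"
    with open(path, mode) as f:
        # The function is a generator: it yields the handle rather than returning it.
        yield f


# Write a small temporary file, then probe it in text and binary mode.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\n")
    tmp_path = tmp.name

try:
    with open_for_probe(tmp_path) as f:
        first_line = f.readline()
    with open_for_probe(tmp_path, is_binary_file=True) as f:
        first_bytes = f.read(5)
finally:
    os.remove(tmp_path)
```

With the `Iterator[...]` annotation, a type checker infers that `f` inside the `with` block is `Union[BufferedReader, TextIO]`, which is what callers actually receive.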
204 changes: 0 additions & 204 deletions src/datumaro/plugins/data_formats/arrow/arrow_dataset.py

This file was deleted.
