Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop JsonSectionPageMapper in Rust API #1224

Merged

Conversation

vinnamkim
Copy link
Contributor

@vinnamkim vinnamkim commented Dec 13, 2023

Summary

  • Ticket no. 127135 and 127136.
  • Develop JsonSectionPageMapper to construct page maps for top-level sections in a given JSON file.
  • Enhance DatumaroImporter.detect()'s performance by replacing JSON file parsing logic with the JsonSectionPageMapper.

How to test

Our existing test will validate its functionality. For the performance comparison, please see the following.

  • Before
from datumaro.rust_api import JsonSectionPageMapper
from time import time
import datumaro as dm

start = time()
format = dm.Dataset.detect("ws_test/coco/datumaro")
dt = 1000.0 * (time() - start)
print(f"Duration for detecting Datumaro data format: {dt:.1f}ms, format={format}")
Duration for detecting Datumaro data format: 25784.5ms, format=datumaro
  • After
from datumaro.rust_api import JsonSectionPageMapper
from time import time
import datumaro as dm

start = time()
format = dm.Dataset.detect("ws_test/coco/datumaro")
dt = 1000.0 * (time() - start)
print(f"Duration for detecting Datumaro data format: {dt:.1f}ms, format={format}")
Duration for detecting Datumaro data format: 17234.7ms, format=datumaro

It saves ~7 secs.

Checklist

  • I have added unit tests to cover my changes.​
  • I have added integration tests to cover my changes.​
  • I have added the description of my changes into CHANGELOG.​
  • I have updated the documentation accordingly

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@vinnamkim vinnamkim changed the title Develop JsonSectionPageMapper and enhance DatumaroImporter.detect() performance Develop JsonSectionPageMapper in Rust API and enhance DatumaroImporter.detect() performance Dec 13, 2023
@vinnamkim vinnamkim changed the title Develop JsonSectionPageMapper in Rust API and enhance DatumaroImporter.detect() performance Develop JsonSectionPageMapper in Rust API Dec 13, 2023
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@vinnamkim
Copy link
Contributor Author

Unless we merge this PR, we cannot reach our first objective, memory bounded detect(). Please see the following memory profiling result with this PR:
after

This is because there are many data formats which read the whole JSON file at once during their detect(). This is one of them:

@classmethod
def detect(
cls,
context: FormatDetectionContext,
) -> Optional[FormatDetectionConfidence]:
# test maximum 10 annotation files only
ctr = 0
for file in context.require_files_iter("*.json"):
ctr += 1
with context.probe_text_file(
file, "Annotation format is not Segmentat-Anything format", is_binary_file=True
) as f:
anno = parse_json(f.read())

I'll fix the other data formats as well in the next PR.

@vinnamkim vinnamkim marked this pull request as ready for review December 13, 2023 08:23
@vinnamkim vinnamkim requested review from a team as code owners December 13, 2023 08:23
@vinnamkim vinnamkim requested review from sooahleex and removed request for a team December 13, 2023 08:23
Copy link

codecov bot commented Dec 13, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9b254ce) 80.10% compared to head (b2263ea) 80.44%.
Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1224      +/-   ##
===========================================
+ Coverage    80.10%   80.44%   +0.34%     
===========================================
  Files          269      269              
  Lines        29915    29915              
  Branches      5850     5850              
===========================================
+ Hits         23962    24066     +104     
+ Misses        4617     4487     -130     
- Partials      1336     1362      +26     
Flag Coverage Δ
macos-11_Python-3.8 79.10% <100.00%> (+0.34%) ⬆️
ubuntu-20.04_Python-3.8 80.44% <100.00%> (+0.34%) ⬆️
windows-2022_Python-3.8 80.41% <100.00%> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

contents = parse_json(f.read())
if not {"categories", "items"} <= contents.keys():
):
fpath = osp.join(context.root_path, annot_file)
Copy link
Contributor

@wonjuleee wonjuleee Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I personally love to get access to context.root_path directly, but this is recommended to avoid during detection.

Detectors should avoid using this property in favor of specific
requirement methods.

Do you have any objection to use this during detection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That guide is because the methods provided by context are using self._root_path internally. For example,

try:
if is_binary_file:
with open(osp.join(self._root_path, path), "rb") as f:
yield f

However, JsonSectionPageMapper Rust API has no information for root_path. There is no choice but to inject the full path osp.join(context.root_path, annot_file) to access the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, how about other formats with pure Python codes? I just want to hear your opinion.
Frankly speaking, I don't want to avoid the use of context.root_path.

Copy link
Contributor

@wonjuleee wonjuleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great to me in overall. Especially, it is happy to reduce the total time consumption during detection. Thank you!

@vinnamkim vinnamkim merged commit 6b567b7 into openvinotoolkit:develop Dec 14, 2023
@vinnamkim vinnamkim deleted the vinnamki/add-rust-json-detect branch December 14, 2023 02:19
@vinnamkim vinnamkim added this to the 1.6.0 milestone Dec 14, 2023
@vinnamkim vinnamkim added the FEATURE New feature & functionality label Dec 14, 2023
vinnamkim added a commit that referenced this pull request Dec 15, 2023
### Summary

- Ticket no. 127586
- It comes from this feedback,
#1224 (comment).

### How to test
```console
$ cd rust
$ cargo test
```

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.​
- [ ] I have added integration tests to cover my changes.​
- [ ] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).​
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [x] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [x] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
vinnamkim added a commit that referenced this pull request Dec 18, 2023
…mant (#1229)

### Summary

- Ticket no. 127136

### How to test
Refer to #1224 for
details on how we obtained the following results.

1. Performance

- Before

```console
Duration for detecting Datumaro data format: 25784.5ms, format=datumaro
```
- After

```console
Duration for detecting Datumaro data format: 5966.8ms, format=datumaro
```

2. Memory usage
- Before

![before](https://github.com/openvinotoolkit/datumaro/assets/26541465/9f6432f7-108d-4d9f-a535-f954bfd55f02)
- After

![after](https://github.com/openvinotoolkit/datumaro/assets/26541465/8ff7a1a4-6106-46cc-9f16-74a4979b8a3b)

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.​
- [ ] I have added integration tests to cover my changes.​
- [x] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).​
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [x] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [x] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

---------

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
vinnamkim added a commit to vinnamkim/datumaro that referenced this pull request Jan 9, 2024
- Ticket no. 127135 and 127136.
- Develop `JsonSectionPageMapper` to construct page maps for top-level
sections in a given JSON file.
- Enhance `DatumaroImporter.detect()`'s performance by replacing JSON
file parsing logic with the `JsonSectionPageMapper`.

Our existing test will validate its functionality. For the performance
comparison, please see the following.

- Before
```python
from datumaro.rust_api import JsonSectionPageMapper
from time import time
import datumaro as dm

start = time()
format = dm.Dataset.detect("ws_test/coco/datumaro")
dt = 1000.0 * (time() - start)
print(f"Duration for detecting Datumaro data format: {dt:.1f}ms, format={format}")
```

```console
Duration for detecting Datumaro data format: 25784.5ms, format=datumaro
```

- After
```python
from datumaro.rust_api import JsonSectionPageMapper
from time import time
import datumaro as dm

start = time()
format = dm.Dataset.detect("ws_test/coco/datumaro")
dt = 1000.0 * (time() - start)
print(f"Duration for detecting Datumaro data format: {dt:.1f}ms, format={format}")
```

```console
Duration for detecting Datumaro data format: 17234.7ms, format=datumaro
```

It saves ~7 secs.

<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.​
- [ ] I have added integration tests to cover my changes.​
- [x] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).​
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

- [x] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [x] I have updated the license header for each file (see an example
below).

```python
```

---------

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
vinnamkim added a commit to vinnamkim/datumaro that referenced this pull request Jan 9, 2024
…mant (openvinotoolkit#1229)

- Ticket no. 127136

Refer to openvinotoolkit#1224 for
details on how we obtained the following results.

1. Performance

- Before

```console
Duration for detecting Datumaro data format: 25784.5ms, format=datumaro
```
- After

```console
Duration for detecting Datumaro data format: 5966.8ms, format=datumaro
```

2. Memory usage
- Before

![before](https://github.com/openvinotoolkit/datumaro/assets/26541465/9f6432f7-108d-4d9f-a535-f954bfd55f02)
- After

![after](https://github.com/openvinotoolkit/datumaro/assets/26541465/8ff7a1a4-6106-46cc-9f16-74a4979b8a3b)

<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.​
- [ ] I have added integration tests to cover my changes.​
- [x] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).​
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

- [x] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [x] I have updated the license header for each file (see an example
below).

```python
```

---------

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
vinnamkim added a commit that referenced this pull request Jan 11, 2024
### Summary

- Ticket no 128951
- Apply #1224 and #1229 changes to the releases/1.5.0 branch

### How to test
Already tested in the previous PRs.

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.​
- [ ] I have added integration tests to cover my changes.​
- [ ] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).​
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

---------

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@yunchu yunchu modified the milestones: 1.6.0, 2.0.0 Mar 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FEATURE New feature & functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants