Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash_info: don't use dataclass #56

Merged
merged 1 commit into from
Jun 18, 2022
Merged

hash_info: don't use dataclass #56

merged 1 commit into from
Jun 18, 2022

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Jun 18, 2022

We are creating lots of these instances and dataclass is significantly slower.

For example, running:

from dvc_data.hashfile.hash_info import HashInfo


hashes = []
for value in range(10000000):
    hashes.append(HashInfo("md5", value))

before this PR takes ~6.1sec and after ~4.9sec.

Most of the improvement is actually coming from using __slots__, which were finally supported by dataclasses in 3.10.

Another (temporary) downside to using dataclass is that it is kinda buggy with other tools. E.g. mypyc is failing to compile it throwing an error like mypyc/mypyc#921 for the HashInfo.name.

@skshetry
Copy link
Member

We are creating lots of these instances and dataclass is significantly slower.

How much perf improvements do you see with this?

@codecov-commenter
Copy link

codecov-commenter commented Jun 18, 2022

Codecov Report

Merging #56 (a90646e) into main (070fd08) will increase coverage by 0.03%.
The diff coverage is 76.47%.

@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   34.24%   34.28%   +0.03%     
==========================================
  Files          33       33              
  Lines        1673     1677       +4     
  Branches      261      262       +1     
==========================================
+ Hits          573      575       +2     
- Misses       1089     1090       +1     
- Partials       11       12       +1     
Impacted Files Coverage Δ
src/dvc_data/hashfile/hash_info.py 56.41% <76.47%> (-0.74%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 070fd08...a90646e. Read the comment docs.

We are creating lots of these instances and dataclass is significantly slower.
@efiop
Copy link
Contributor Author

efiop commented Jun 18, 2022

@skshetry Great question! Added to the PR desc.

@efiop
Copy link
Contributor Author

efiop commented Jun 18, 2022

Overall we are not heavy users of dataclasses in this application (and a few more places like TreeEntry), so the overhead for a simpler implementation is very minor.

@efiop efiop merged commit a3419f2 into iterative:main Jun 18, 2022
efiop added a commit to efiop/dvc-data that referenced this pull request Jun 18, 2022
@efiop efiop mentioned this pull request Jun 18, 2022
efiop added a commit that referenced this pull request Jun 18, 2022
efiop added a commit to efiop/dvc-data that referenced this pull request Jun 18, 2022
efiop added a commit that referenced this pull request Jun 18, 2022
efiop added a commit to efiop/dvc-objects that referenced this pull request Jun 20, 2022
We are creating lots of these, so it will help to save up on memory and
improve performance.

Similar to iterative/dvc-data#56
efiop added a commit to iterative/dvc-objects that referenced this pull request Jun 20, 2022
We are creating lots of these, so it will help to save up on memory and
improve performance.

Similar to iterative/dvc-data#56
@skshetry
Copy link
Member

skshetry commented Aug 9, 2022

I just wish we had not done this kind of change without looking into other alternatives (eg: attrs with slots, dataclasses, etc). This makes debugging quite hard. Also, the above benchmarks seem unfair.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants