Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calculating datasets stats size #418

Merged
merged 2 commits into from
Sep 11, 2024
Merged

Conversation

dreadatour
Copy link
Contributor

@dreadatour dreadatour commented Sep 10, 2024

It looks like calculating datasets stats is broken in some cases.

The way it works now is we are calculating the sum of either file__size or size fields in all dataset rows. I think size is not used anymore and file__size will works only for listings.

In this PR I am calculating size dataset stat as sum of all fields with name ending with file__size. This way we will be able to calculate listing size (file__size field) and other signals, like laion.file.size, source.file.size, emd.file.size, etc. Also we are now calculating sum of all file.size fields instead of calculating only one of them.

@dreadatour dreadatour self-assigned this Sep 10, 2024
"""
Returns tuple with dataset stats: total number of rows and total dataset size.
"""
dataset = self.get_dataset(name)
dataset_version = dataset.get_version(version)
dataset_version = dataset.get_version(version or dataset.latest_version)
Copy link
Contributor Author

@dreadatour dreadatour Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datachain dataset-stats ds-name command now takes latest dataset version if version is not provided instead of raising an error.

query = select(*expressions)
((nrows, *rest),) = self.db.execute(query)
return nrows, rest[0] if rest else None
return nrows, sum(rest) if rest else 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are waiting for int in many places, errors happens if numbers of rows is set, but size is None (for example, here). It is better to return 0 in such cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error looks like this:

$ datachain dataset-stats laion_pq
Number of objects: 2
Error: bad operand type for abs(): 'NoneType'
$

@dreadatour dreadatour marked this pull request as draft September 10, 2024 16:08
@dreadatour dreadatour force-pushed the fix-calculating-stats-size branch from af99a34 to bbb129b Compare September 11, 2024 15:40
Copy link

cloudflare-workers-and-pages bot commented Sep 11, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 2c8be93
Status: ✅  Deploy successful!
Preview URL: https://8638fafa.datachain-documentation.pages.dev
Branch Preview URL: https://fix-calculating-stats-size.datachain-documentation.pages.dev

View logs

@dreadatour dreadatour force-pushed the fix-calculating-stats-size branch from bbb129b to 2c8be93 Compare September 11, 2024 15:40
Copy link

codecov bot commented Sep 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.09%. Comparing base (22ec048) to head (2c8be93).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #418      +/-   ##
==========================================
- Coverage   87.09%   87.09%   -0.01%     
==========================================
  Files          92       92              
  Lines        9929     9928       -1     
  Branches     2032     2032              
==========================================
- Hits         8648     8647       -1     
  Misses        927      927              
  Partials      354      354              
Flag Coverage Δ
datachain 87.04% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dreadatour dreadatour marked this pull request as ready for review September 11, 2024 15:54
@dreadatour dreadatour merged commit 46ac8e6 into main Sep 11, 2024
38 checks passed
@dreadatour dreadatour deleted the fix-calculating-stats-size branch September 11, 2024 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants