Fix calculating datasets stats size #418
 """
 Returns tuple with dataset stats: total number of rows and total dataset size.
 """
 dataset = self.get_dataset(name)
-dataset_version = dataset.get_version(version)
+dataset_version = dataset.get_version(version or dataset.latest_version)
The `datachain dataset-stats ds-name` command now uses the latest dataset version when no version is provided, instead of raising an error.
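The fallback can be sketched in isolation roughly as follows. The `Dataset` class and `resolve_version` helper here are hypothetical stand-ins for illustration, not the project's real classes; only the `version or dataset.latest_version` expression comes from the diff.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    # Hypothetical stand-in for the real dataset record.
    name: str
    versions: list[int] = field(default_factory=list)

    @property
    def latest_version(self) -> int:
        return max(self.versions)

    def get_version(self, version: int) -> int:
        # Mimic a lookup that fails for unknown versions.
        if version not in self.versions:
            raise ValueError(f"Dataset {self.name} has no version {version}")
        return version


def resolve_version(dataset: Dataset, version: Optional[int] = None) -> int:
    # `or` falls back whenever `version` is falsy (None, but also 0).
    return dataset.get_version(version or dataset.latest_version)


ds = Dataset("ds-name", versions=[1, 2, 3])
print(resolve_version(ds))     # no version given: latest is used
print(resolve_version(ds, 2))  # explicit version is respected
```

Note that `or` treats `0` the same as `None`, so this pattern assumes version numbers are never `0`.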
 query = select(*expressions)
 ((nrows, *rest),) = self.db.execute(query)
-return nrows, rest[0] if rest else None
+return nrows, sum(rest) if rest else 0
We expect an `int` in many places, and errors happen if the number of rows is set but the size is `None` (for example, here). It is better to return `0` in such cases.
Error looks like this:
$ datachain dataset-stats laion_pq
Number of objects: 2
Error: bad operand type for abs(): 'NoneType'
$
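A minimal sketch of why the old return value leads to this error and how the new one avoids it. The helper names `unpack_stats_old` and `unpack_stats_new` are made up for illustration, and the row tuple stands in for what the query returns when no size column is selected.

```python
def unpack_stats_old(row):
    # Old behavior: size is None when the query returned only the row count.
    nrows, *rest = row
    return nrows, rest[0] if rest else None


def unpack_stats_new(row):
    # New behavior: sum every size aggregate, falling back to 0.
    nrows, *rest = row
    return nrows, sum(rest) if rest else 0


row = (2,)  # only the row count, no size columns

_, old_size = unpack_stats_old(row)
try:
    abs(old_size)  # callers that expect an int fail here
    old_error = None
except TypeError as exc:
    old_error = str(exc)

_, new_size = unpack_stats_new(row)
print(old_error)
print(abs(new_size))
```

This sketch assumes every aggregate in `rest` is a number. If an aggregate could itself be `None`, `sum(rest)` would still raise, so that case would need separate handling.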
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #418 +/- ##
==========================================
- Coverage 87.09% 87.09% -0.01%
==========================================
Files 92 92
Lines 9929 9928 -1
Branches 2032 2032
==========================================
- Hits 8648 8647 -1
Misses 927 927
Partials 354 354
It looks like calculating dataset stats is broken in some cases.
The way it works now, we calculate the sum of either the `file__size` or the `size` field across all dataset rows. I think `size` is not used anymore, and `file__size` works only for listings.
In this PR I am calculating the `size` dataset stat as the sum of all fields whose names end with `file__size`. This way we will be able to calculate the listing size (the `file__size` field) as well as other signals, like `laion.file.size`, `source.file.size`, `emd.file.size`, etc. Also, we are now calculating the sum of all `file.size` fields instead of only one of them.
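The approach described above can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumptions, not the project's implementation: the table name `rows`, the `dataset_stats` helper, the double-underscore column names, and the `COALESCE` guard for empty tables are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rows ("
    "sys__id INTEGER PRIMARY KEY, "
    "laion__file__size INTEGER, "
    "source__file__size INTEGER, "
    "caption TEXT)"
)
conn.executemany(
    "INSERT INTO rows (laion__file__size, source__file__size, caption) "
    "VALUES (?, ?, ?)",
    [(100, 10, "a"), (200, 20, "b")],
)


def dataset_stats(conn, table):
    # Column names are in the second field of each PRAGMA table_info row.
    columns = [info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')]
    # Pick every column whose name ends with file__size.
    size_columns = [c for c in columns if c.endswith("file__size")]
    # One COUNT plus one SUM per size column; COALESCE maps NULL sums to 0.
    expressions = ["COUNT(*)"]
    expressions += [f'COALESCE(SUM("{c}"), 0)' for c in size_columns]
    select_list = ", ".join(expressions)
    ((nrows, *rest),) = conn.execute(
        f'SELECT {select_list} FROM "{table}"'
    ).fetchall()
    # Sum all per-column totals; fall back to 0 when there are no size columns.
    return nrows, sum(rest) if rest else 0


print(dataset_stats(conn, "rows"))
```

With the two sample rows, the per-column totals are 300 and 30, so the combined size is their sum. A table without any matching column yields a size of `0` rather than `None`.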