Dataset pull fixes #560

ilongin · 2024-11-04T14:42:21Z

Fixing various issues regarding dataset pull after recent feature based schema changes:

Adding UInt32 type as it's needed for Clickhouse random column as UInt64 is too big for SQLlite
Removing casting dataframe data to string - it seems it's not needed anymore
Fix parsing empty string as json

cloudflare-workers-and-pages · 2024-11-04T14:43:28Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`7a6be18`
Status:	✅ Deploy successful!
Preview URL:	https://e5156e5b.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-539-dataset-pull-fix.datachain-documentation.pages.dev

View logs

mattseddon · 2024-11-04T23:25:32Z

test_row_random looks like a genuine failure.

amritghimire · 2024-11-05T03:45:43Z

I am getting the error as below when using the pull using datachain pull ds://02jpg_files -o jpf -vv

Version not specified, pulling the latest one (v1)
Saving dataset ds://02jpg_files@v1 locally: 100%|█████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:00<00:00, 10.2 rows/s]
Dataset ds://02jpg_files@v1 saved locally
_request non-retriable exception: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 126, in retry_request
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 113, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Error: Dataset lst__gs://datachain-demo/ not found.
Traceback (most recent call last):
  File ".../datachain/src/datachain/cli.py", line 1016, in main
    catalog.pull_dataset(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1454, in pull_dataset
    _instantiate_dataset()
  File ".../datachain/src/datachain/catalog/catalog.py", line 1324, in _instantiate_dataset
    self.cp(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1563, in cp
    node_groups = self.enlist_sources_grouped(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File .../iterative/datachain/src/datachain/catalog/catalog.py", line 703, in enlist_sources_grouped
    listing = Listing(st, client, self.get_dataset(dataset_name))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/src/datachain/catalog/catalog.py", line 1090, in get_dataset
    return self.metastore.get_dataset(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../src/datachain/data_storage/metastore.py", line 704, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
datachain.error.DatasetNotFoundError: Dataset lst__gs://datachain-demo/ not found.
Telemetry is disabled by environment variable.

The datasets were created using:

from datachain import DataChain, C

ds = DataChain.from_storage("gs://datachain-demo")
ds1 = ds.filter(C('file.path').glob('*.jpg')).save("jpg_files")
ds2 = ds.filter(C('file.path').glob('*.png')).save("png_files")
ds4 = ds1.union(ds2)
ds5 = ds4.save("jpg_png_files")

dsf1 = ds.filter(C("file.path").glob("*02.jpg")).limit(5).save("02jpg_files")

ds6 = ds5.union(dsf1)
ds6.save("all_files")

codecov · 2024-11-05T15:37:53Z

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.87%. Comparing base (4f08133) to head (7a6be18).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/sql/types.py	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #560      +/-   ##
==========================================
- Coverage   87.89%   87.87%   -0.02%     
==========================================
  Files          96       96              
  Lines        9907     9908       +1     
  Branches     1350     1350              
==========================================
- Hits         8708     8707       -1     
- Misses        859      860       +1     
- Partials      340      341       +1

Flag	Coverage Δ
datachain	`87.81% <66.66%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ilongin · 2024-11-06T00:02:18Z

datachain pull ds://02jpg_files -o jpf -vv

@amritghimire Thanks for finding the issue.

This is more related to instantiating dataset for which I will create a separate issue as it's not so trivial to fix atm. I would merge current PR to at least have dataset pull without instantiation working fine (which is enough in majority of use cases for clients)

Created follow up issue related to instantiation #566

ilongin · 2024-11-06T00:06:25Z

test_row_random looks like a genuine failure.

I've fixed the test. Eventually I decided to use 64 bit for random number but limit it to 2^63 - 1 which is the largest possible value that can be stored in SQLite as it only supports signed numbers. Clickhouse supports unsigned ones, so the largest value is 2^64 but as I mentioned, 2^63 - 1 is largest possible value supported in both DBs. This gives us much larger number "pool" then changing this to 32 bit int.

src/datachain/sql/types.py

mattseddon · 2024-11-06T00:26:37Z

tests/func/test_datasets.py

-        RAND_MAX = 2**63  # noqa: N806
-    else:
-        RAND_MAX = 2**64  # noqa: N806
+    RAND_MAX = DataTable.MAX_RANDOM  # noqa: N806


[Q] Won't this test fail when the number drawn by clickhouse is > 2^63?

And if my math is correct the chances of that happening is:

((2^64) - (2^63)) / (2^64) = 0.5 or 50%

In Clickhouse we have explicit limit to 2^63-1 https://github.com/iterative/studio/pull/10860/files#diff-1136c5032c82c12345c3103e18093d3256798daa2d4b13e5f5f2d5e8670ac32bR328

In general, now we actually know what is the range of that random number regardless of DB implementation.

fixing dataset pull

e0cf2bf

ilongin linked an issue Nov 4, 2024 that may be closed by this pull request

datachain pull doesn't work #539

Closed

ilongin requested a review from amritghimire November 4, 2024 16:10

ilongin added 3 commits November 5, 2024 14:22

Merge branch 'main' into ilongin/539-dataset-pull-fix

a2c0200

fixing pull test

8270ad8

fixing random default

a12194c

testing uint32

6962903

ilongin requested review from dreadatour, skshetry, mattseddon and shcheklein November 6, 2024 00:07

mattseddon reviewed Nov 6, 2024

View reviewed changes

src/datachain/sql/types.py Outdated Show resolved Hide resolved

mattseddon reviewed Nov 6, 2024

View reviewed changes

ilongin mentioned this pull request Nov 6, 2024

Instantiation after pulling remote dataset is failing #566

Closed

removing uint32 support from this PR

487ed1a

ilongin requested a review from mattseddon November 6, 2024 09:44

amritghimire approved these changes Nov 6, 2024

View reviewed changes

Merge branch 'main' into ilongin/539-dataset-pull-fix

7a6be18

ilongin merged commit 3dc4f3e into main Nov 6, 2024
37 of 38 checks passed

ilongin deleted the ilongin/539-dataset-pull-fix branch November 6, 2024 13:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset pull fixes #560

Dataset pull fixes #560

ilongin commented Nov 4, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Nov 4, 2024 •

edited

Loading

mattseddon commented Nov 4, 2024

amritghimire commented Nov 5, 2024

codecov bot commented Nov 5, 2024 •

edited

Loading

ilongin commented Nov 6, 2024 •

edited

Loading

ilongin commented Nov 6, 2024

mattseddon Nov 6, 2024

ilongin Nov 6, 2024 •

edited

Loading

Dataset pull fixes #560

Dataset pull fixes #560

Conversation

ilongin commented Nov 4, 2024 • edited Loading

cloudflare-workers-and-pages bot commented Nov 4, 2024 • edited Loading

Deploying datachain-documentation with Cloudflare Pages

mattseddon commented Nov 4, 2024

amritghimire commented Nov 5, 2024

codecov bot commented Nov 5, 2024 • edited Loading

Codecov Report

ilongin commented Nov 6, 2024 • edited Loading

ilongin commented Nov 6, 2024

mattseddon Nov 6, 2024

Choose a reason for hiding this comment

ilongin Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

ilongin commented Nov 4, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Nov 4, 2024 •

edited

Loading

codecov bot commented Nov 5, 2024 •

edited

Loading

ilongin commented Nov 6, 2024 •

edited

Loading

ilongin Nov 6, 2024 •

edited

Loading