Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset pull fixes #560

Merged
merged 7 commits into from
Nov 6, 2024
Merged

Dataset pull fixes #560

merged 7 commits into from
Nov 6, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Nov 4, 2024

Fixing various issues regarding dataset pull after recent feature based schema changes:

  1. Adding UInt32 type as it's needed for Clickhouse random column as UInt64 is too big for SQLlite
  2. Removing casting dataframe data to string - it seems it's not needed anymore
  3. Fix parsing empty string as json

@ilongin ilongin linked an issue Nov 4, 2024 that may be closed by this pull request
Copy link

cloudflare-workers-and-pages bot commented Nov 4, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 7a6be18
Status: ✅  Deploy successful!
Preview URL: https://e5156e5b.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-539-dataset-pull-fix.datachain-documentation.pages.dev

View logs

@ilongin ilongin requested a review from amritghimire November 4, 2024 16:10
@mattseddon
Copy link
Member

test_row_random looks like a genuine failure.

@amritghimire
Copy link
Contributor

I am getting the error as below when using the pull using datachain pull ds://02jpg_files -o jpf -vv

Version not specified, pulling the latest one (v1)
Saving dataset ds://02jpg_files@v1 locally: 100%|█████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:00<00:00, 10.2 rows/s]
Dataset ds://02jpg_files@v1 saved locally
_request non-retriable exception: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 126, in retry_request
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 113, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Error: Dataset lst__gs://datachain-demo/ not found.
Traceback (most recent call last):
  File ".../datachain/src/datachain/cli.py", line 1016, in main
    catalog.pull_dataset(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1454, in pull_dataset
    _instantiate_dataset()
  File ".../datachain/src/datachain/catalog/catalog.py", line 1324, in _instantiate_dataset
    self.cp(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1563, in cp
    node_groups = self.enlist_sources_grouped(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File .../iterative/datachain/src/datachain/catalog/catalog.py", line 703, in enlist_sources_grouped
    listing = Listing(st, client, self.get_dataset(dataset_name))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/src/datachain/catalog/catalog.py", line 1090, in get_dataset
    return self.metastore.get_dataset(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../src/datachain/data_storage/metastore.py", line 704, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
datachain.error.DatasetNotFoundError: Dataset lst__gs://datachain-demo/ not found.
Telemetry is disabled by environment variable.

The datasets were created using:

from datachain import DataChain, C

ds = DataChain.from_storage("gs://datachain-demo")
ds1 = ds.filter(C('file.path').glob('*.jpg')).save("jpg_files")
ds2 = ds.filter(C('file.path').glob('*.png')).save("png_files")
ds4 = ds1.union(ds2)
ds5 = ds4.save("jpg_png_files")

dsf1 = ds.filter(C("file.path").glob("*02.jpg")).limit(5).save("02jpg_files")

ds6 = ds5.union(dsf1)
ds6.save("all_files")

Copy link

codecov bot commented Nov 5, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.87%. Comparing base (4f08133) to head (7a6be18).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/sql/types.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #560      +/-   ##
==========================================
- Coverage   87.89%   87.87%   -0.02%     
==========================================
  Files          96       96              
  Lines        9907     9908       +1     
  Branches     1350     1350              
==========================================
- Hits         8708     8707       -1     
- Misses        859      860       +1     
- Partials      340      341       +1     
Flag Coverage Δ
datachain 87.81% <66.66%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ilongin
Copy link
Contributor Author

ilongin commented Nov 6, 2024

datachain pull ds://02jpg_files -o jpf -vv

@amritghimire Thanks for finding the issue.

This is more related to instantiating dataset for which I will create a separate issue as it's not so trivial to fix atm. I would merge current PR to at least have dataset pull without instantiation working fine (which is enough in majority of use cases for clients)

Created follow up issue related to instantiation #566

@ilongin
Copy link
Contributor Author

ilongin commented Nov 6, 2024

test_row_random looks like a genuine failure.

I've fixed the test. Eventually I decided to use 64 bit for random number but limit it to 2^63 - 1 which is the largest possible value that can be stored in SQLite as it only supports signed numbers. Clickhouse supports unsigned ones, so the largest value is 2^64 but as I mentioned, 2^63 - 1 is largest possible value supported in both DBs. This gives us much larger number "pool" then changing this to 32 bit int.

RAND_MAX = 2**63 # noqa: N806
else:
RAND_MAX = 2**64 # noqa: N806
RAND_MAX = DataTable.MAX_RANDOM # noqa: N806
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Won't this test fail when the number drawn by clickhouse is > 2^63?

And if my math is correct the chances of that happening is:

((2^64) - (2^63)) / (2^64) = 0.5 or 50%

Copy link
Contributor Author

@ilongin ilongin Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Clickhouse we have explicit limit to 2^63-1 https://github.com/iterative/studio/pull/10860/files#diff-1136c5032c82c12345c3103e18093d3256798daa2d4b13e5f5f2d5e8670ac32bR328

In general, now we actually know what is the range of that random number regardless of DB implementation.

@ilongin ilongin requested a review from mattseddon November 6, 2024 09:44
@ilongin ilongin merged commit 3dc4f3e into main Nov 6, 2024
37 of 38 checks passed
@ilongin ilongin deleted the ilongin/539-dataset-pull-fix branch November 6, 2024 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

datachain pull doesn't work
3 participants