parse_tabular(nrows=3) is slow #210
Comments
@dberenbaum is it happening even if you run it a second time (indexing is cached)? Because right now we index the whole parent even if we access a single file.
uri = "gs://datachain-demo/laion-aesthetics-csv"
print()
print("========================================================================")
print("dynamic CSV with header schema test parsing 3/3M objects")
print("========================================================================")
dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion", nrows=3)
dynamic_csv_ds.print_schema()
print(dynamic_csv_ds.to_pandas())
I have run the above example multiple times with
From my investigations last week it seems like
Thread 6133035008:
File "/datachain/.env/lib/python3.12/site-packages/fsspec/spec.py", line 1941, in read
out = self.cache._fetch(self.loc, self.loc + length)
File "/datachain/.env/lib/python3.12/site-packages/fsspec/caching.py", line 234, in _fetch
self.cache = self.fetcher(start, end) # new block replaces old
File "/datachain/.env/lib/python3.12/site-packages/gcsfs/core.py", line 1924, in _fetch_range
return self.gcsfs.cat_file(self.path, start=start, end=end)
File "/datachain/.env/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/datachain/.env/lib/python3.12/site-packages/fsspec/asyn.py", line 91, in sync
if event.wait(1):
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 655, in wait
signaled = self._cond.wait(timeout)
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 359, in wait
gotit = waiter.acquire(True, timeout)
Thread Thread-1:
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1030, in _bootstrap
self._bootstrap_inner()
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/datachain/.env/lib/python3.12/site-packages/tqdm/_monitor.py", line 60, in run
self.was_killed.wait(self.sleep_interval)
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 655, in wait
signaled = self._cond.wait(timeout)
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 359, in wait
gotit = waiter.acquire(True, timeout)
Thread asyncio_0:
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1030, in _bootstrap
self._bootstrap_inner()
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 89, in _worker
work_item = work_queue.get(block=True)
Thread fsspecIO:
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1030, in _bootstrap
self._bootstrap_inner()
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 641, in run_forever
self._run_once()
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 1949, in _run_once
event_list = self._selector.select(timeout)
File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/selectors.py", line 566, in select
kev_list = self._selector.control(None, max_ev, timeout)
Thread MainThread:
File "/datachain/examples/get_started/json-csv-reader.py", line 113, in <module>
main()
File "/datachain/examples/get_started/json-csv-reader.py", line 108, in main
traceback.print_stack(frame)
Now that we avoid threads for the
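For anyone who wants to capture a similar per-thread dump while the slow call is hanging, here is a minimal sketch using only the standard library. It is an assumption about how such a trace can be produced, not necessarily what was used for the dump above, and `dump_all_stacks` is a hypothetical helper name.

```python
import sys
import threading
import traceback


def dump_all_stacks(file=None):
    """Print the current stack of every live thread, labelled by thread name."""
    out = file if file is not None else sys.stderr
    # Map thread ids to names for threads created through `threading`.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        # Threads unknown to `threading` are labelled by their numeric id.
        print(f"Thread {names.get(ident, ident)}:", file=out)
        traceback.print_stack(frame, file=out)
```

Calling `dump_all_stacks()` from the main thread while a worker is blocked shows where each thread is waiting, for example in `fsspec/asyn.py` `sync` as in the trace above. The standard library's `faulthandler.dump_traceback(all_threads=True)` is another option that does not rely on the private `sys._current_frames()`.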
Description
Thanks to both @mattseddon and @volkfox for raising this. Any of from_csv/from_parquet/parse_tabular will be slow if used with nrows and a cloud path.
To reproduce, run https://github.com/iterative/datachain/blob/main/examples/get_started/json-csv-reader.py. The last example takes a long time to complete even though it uses nrows=3. After diving into it a bit, it looks to be an issue with either pyarrow or fsspec. Opened apache/arrow#43497 to track the issue upstream.
Version Info