fix for loading a resource with a large amount of data in a single cell #112

Closed
OriHoch wants to merge 3 commits

Conversation

@OriHoch OriHoch (Contributor) commented Jan 2, 2018

Running load_resource on a CSV file that has a large amount of data in a single cell raises an exception:

tests/stdlib/test_stdlib.py::test_stdlib_simple_load_large_resource <- datapackage_pipelines/utilities/lib_test_helpers.py
Traceback (most recent call last):
  File "/home/ori/datapackage-pipelines/tests/stdlib/../../datapackage_pipelines/lib/load_resource.py", line 56, in <module>
    ResourceLoader()()
  File "/home/ori/datapackage-pipelines/tests/stdlib/../../datapackage_pipelines/lib/load_resource.py", line 49, in __call__
    spew(self.dp, itertools.chain(self.res_iter, selected_resources))
  File "/home/ori/datapackage-pipelines/datapackage_pipelines/wrapper/wrapper.py", line 69, in spew
    for rec in res:
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tableschema/table.py", line 77, in iter
    self.__stream.open()
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tabulator/stream.py", line 158, in open
    self.__extract_sample()
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tabulator/stream.py", line 287, in __extract_sample
    row_number, headers, row = next(self.__parser.extended_rows)
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tabulator/parsers/csv.py", line 99, in __iter_extended_rows
    for row_number, item in enumerate(items, start=1):
_csv.Error: field larger than field limit (131072)

Fixed by adding a large-resource boolean parameter to load-resource, which lets users opt in to the fix for this problem. I'm not sure what the implications of always applying this fix would be.
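For reference, a minimal sketch of what the opt-in approach could look like inside the load-resource processor (the actual change is in this PR's commits; the helper name and the 100 MB value below are assumptions for illustration):

```python
import csv

def maybe_raise_csv_field_limit(parameters):
    # Hypothetical helper illustrating the opt-in fix described above.
    # 'large-resource' is the boolean parameter proposed in this PR.
    if parameters.get('large-resource'):
        # Raise the stdlib CSV field size limit above the default of
        # 131072 bytes so rows with very large cells can be parsed.
        csv.field_size_limit(100 * 1024 * 1024)  # 100 MB, an arbitrary choice
```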

@akariv akariv (Member) commented Jan 3, 2018

I think we can probably always call csv.field_size_limit(...) with a large value (although not sys.maxsize), and not use a parameter here.
(Either way, 'large-resource' is not a good name...)
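A sketch of that alternative, assuming a module-level default set wherever the CSV parsing happens (the 32 MB constant is an assumption, not a value from the codebase):

```python
import csv

# No parameter: always raise the limit to a fixed large value,
# deliberately smaller than sys.maxsize.
MAX_CSV_FIELD_SIZE = 32 * 1024 * 1024  # assumed 32 MB default

csv.field_size_limit(MAX_CSV_FIELD_SIZE)
```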

@akariv akariv force-pushed the master branch 14 times, most recently from ccf920c to d24ac21 on February 15, 2018 20:45
@rufuspollock rufuspollock (Contributor)

@akariv what's blocking merging this one in?

@akariv akariv (Member) commented Apr 1, 2018

@rufuspollock see my comment from Jan 3rd: basically, don't add a parameter; instead, always set the field size limit to a large default value (but not sys.maxsize).

@rufuspollock rufuspollock (Contributor)

@akariv so this is pending an update from @OriHoch?

@OriHoch OriHoch (Contributor, Author) commented Apr 2, 2018

I think it's better to leave these system-specific details out of datapackage-pipelines.

There will always be some limit on field size (if not from field_size_limit then from memory), so I think it's better to leave it to the calling code to set csv.field_size_limit appropriately if it's really needed.
(In most cases it will be better to save these fields as separate plain data files rather than as tabular data.)
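For anyone who hits this, a sketch of that caller-side workaround, assuming you control the Python code that parses the CSV (for example a custom processor); the 10 MB value is arbitrary:

```python
import csv

# Raise the stdlib CSV field size limit yourself, before the file is parsed,
# instead of relying on datapackage-pipelines to do it for you.
csv.field_size_limit(10 * 1024 * 1024)  # e.g. 10 MB instead of the 131072 default
```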

@OriHoch OriHoch closed this Apr 2, 2018
@OriHoch OriHoch (Contributor, Author) commented Apr 2, 2018

Documented this issue in the README: #128
