fix for loading a resource with a large amount of data in a single cell #112

Closed
OriHoch wants to merge 3 commits

Conversation

@OriHoch OriHoch (Contributor) commented Jan 2, 2018

Running load_resource on a CSV file that has a large amount of data in a single cell raises an exception:

tests/stdlib/test_stdlib.py::test_stdlib_simple_load_large_resource <- datapackage_pipelines/utilities/lib_test_helpers.py
Traceback (most recent call last):
  File "/home/ori/datapackage-pipelines/tests/stdlib/../../datapackage_pipelines/lib/load_resource.py", line 56, in <module>
    ResourceLoader()()
  File "/home/ori/datapackage-pipelines/tests/stdlib/../../datapackage_pipelines/lib/load_resource.py", line 49, in __call__
    spew(self.dp, itertools.chain(self.res_iter, selected_resources))
  File "/home/ori/datapackage-pipelines/datapackage_pipelines/wrapper/wrapper.py", line 69, in spew
    for rec in res:
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tableschema/table.py", line 77, in iter
    self.__stream.open()
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tabulator/stream.py", line 158, in open
    self.__extract_sample()
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tabulator/stream.py", line 287, in __extract_sample
    row_number, headers, row = next(self.__parser.extended_rows)
  File "/home/ori/virtualenvs/datapackage-pipelines/lib/python3.6/site-packages/tabulator/parsers/csv.py", line 99, in __iter_extended_rows
    for row_number, item in enumerate(items, start=1):
_csv.Error: field larger than field limit (131072)

Fixed by adding a large-resource boolean parameter to load-resource, which lets users opt in to the fix for this problem. I'm not sure what the implications of always applying this fix would be.
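For reference, a minimal sketch of what the opt-in approach could look like inside the load-resource processor (the actual change is in this PR's commits; the helper name and the 100 MB value below are assumptions for illustration):

```python
import csv

def maybe_raise_csv_field_limit(parameters):
    # Hypothetical helper illustrating the opt-in fix described above.
    # 'large-resource' is the boolean parameter proposed in this PR.
    if parameters.get('large-resource'):
        # Raise the stdlib CSV field size limit above the default of
        # 131072 bytes so rows with very large cells can be parsed.
        csv.field_size_limit(100 * 1024 * 1024)  # 100 MB, an arbitrary choice
```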

@akariv akariv (Member) commented Jan 3, 2018

I think we can probably always call csv.field_size_limit(...) with a large value (although not sys.maxsize), and not use a parameter here.
(Either way, 'large-resource' is not a good name...)
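A sketch of that alternative, assuming a module-level default set wherever the CSV parsing happens (the 32 MB constant is an assumption, not a value from the codebase):

```python
import csv

# No parameter: always raise the limit to a fixed large value,
# deliberately smaller than sys.maxsize.
MAX_CSV_FIELD_SIZE = 32 * 1024 * 1024  # assumed 32 MB default

csv.field_size_limit(MAX_CSV_FIELD_SIZE)
```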

@akariv akariv force-pushed the master branch 14 times, most recently from ccf920c to d24ac21 on February 15, 2018 20:45
@rufuspollock rufuspollock (Contributor)

@akariv what's blocking merging this one in?

@akariv akariv (Member) commented Apr 1, 2018

@rufuspollock see my comment from Jan 3rd: basically, don't add a parameter; instead, always set the field size limit to a large default value (but not sys.maxsize).

@rufuspollock rufuspollock (Contributor)

@akariv so this is pending an update from @OriHoch?

@OriHoch OriHoch (Contributor, Author) commented Apr 2, 2018

I think it's better to leave these system-specific details out of datapackage-pipelines.

There will always be some limit on field size (if not from field_size_limit then from memory), so I think it's better to leave it to the calling code to set csv.field_size_limit appropriately if it's really needed.
(In most cases it will be better to save these fields as separate plain data files rather than as tabular data.)
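For anyone who hits this, a sketch of that caller-side workaround, assuming you control the Python code that parses the CSV (for example a custom processor); the 10 MB value is arbitrary:

```python
import csv

# Raise the stdlib CSV field size limit yourself, before the file is parsed,
# instead of relying on datapackage-pipelines to do it for you.
csv.field_size_limit(10 * 1024 * 1024)  # e.g. 10 MB instead of the 131072 default
```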

@OriHoch OriHoch closed this Apr 2, 2018
@OriHoch OriHoch (Contributor, Author) commented Apr 2, 2018

Documented this issue in the README: #128
