mmap files are treated as streams but can't be read #34

dracos · 2017-05-12T15:30:23Z

xlrd can operate on mmap files, so it would be useful to be able to pass in one to pyexcel, e.g.

sheet = pyexcel.get_sheet(file_type='xls', file_content=mmap.mmap(fp.fileno(), 0,
    access=mmap.ACCESS_READ))

isstream is true for an mmap because it has a read function, so even though I passed in file_content, pyexcel-io's get_data calls load_data with file_stream rather than file_content. Down in pyexcel-xls, this means that either getvalue is called (current release), which errors, or read is called without an argument (after the pyexcel/pyexcel-xls#16 fix), which also errors because mmap's read must have an argument.
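The detection problem described above can be reproduced without pyexcel. This is a minimal sketch, assuming the stream check amounts to testing for a read attribute; `looks_like_stream` is a hypothetical stand-in, not pyexcel-io's actual code.

```python
import mmap
import tempfile


def looks_like_stream(obj):
    # Hypothetical stand-in for the check described above:
    # anything exposing a ``read`` attribute is treated as a stream.
    return hasattr(obj, "read")


def inspect_mmap():
    # Map a small temporary file read-only and report what the object exposes.
    with tempfile.TemporaryFile() as fp:
        fp.write(b"placeholder bytes")
        fp.flush()
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return looks_like_stream(mapped), hasattr(mapped, "getvalue")


is_stream, has_getvalue = inspect_mmap()
```

An mmap object therefore passes a read-based stream test while lacking getvalue, which matches the failure reported for the current pyexcel-xls release.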

My aim is to avoid reading the file contents in one go anywhere, and for pyexcel-xls/xlrd, mmap appears to be the only way (#33 would fix it for CSV, I think). What I have done is create an mmap subclass that does not have a read function, which means pyexcel-io passes file_content through to pyexcel-xls and thus to xlrd.
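One way such a subclass might look is sketched below. This is an assumption about the approach, not necessarily the code used here: `NoReadMmap` is a hypothetical name, and it hides read by overriding it with a property that raises AttributeError, so that `hasattr(obj, "read")` is False while slicing still works.

```python
import mmap
import tempfile


class NoReadMmap(mmap.mmap):
    """mmap variant on which hasattr(obj, "read") is False (hypothetical name)."""

    @property
    def read(self):
        # Raising AttributeError from a property makes hasattr() return False.
        raise AttributeError("read is deliberately hidden")


def demo():
    # Map a small temporary file and check that the bytes stay accessible.
    with tempfile.TemporaryFile() as fp:
        fp.write(b"spreadsheet bytes")
        fp.flush()
        mapped = NoReadMmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return hasattr(mapped, "read"), bytes(mapped[:11])
        finally:
            mapped.close()


hidden_read, first_bytes = demo()
```

With read hidden, a read-based stream check no longer matches, so the object would be handled as content rather than as a stream.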

chfw · 2017-05-12T19:33:45Z

Yes, I agree with what you have found, and I will look at mmap options for file_stream and file_content. Could you please evaluate get_data(..., streaming=True, ...)? Streaming enables 'yield', which would let you process large csv files at least. For large xls files, I will have to look at mmap.

dracos · 2017-05-12T20:00:05Z

streaming=True nearly works, but my #33 shows the one case I think is left - the entire CSV file is read in by the read() at

pyexcel-io/pyexcel_io/fileformat/_csv.py

Line 269 in 1cffd9d

content = self._file_stream.read()
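For contrast with a single read() of the whole stream, rows can be pulled one at a time. The following is a standard-library sketch only, not the change made in pyexcel-io; `iter_csv_rows` is a hypothetical helper name.

```python
import csv
import io


def iter_csv_rows(text_stream):
    # csv.reader pulls lines from the stream on demand, so rows are
    # produced one at a time instead of after a full read().
    for row in csv.reader(text_stream):
        yield row


sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
rows = iter_csv_rows(sample)
first_row = next(rows)
remaining = list(rows)
```

Only the rows actually requested are parsed, so a consumer that stops early never touches the rest of the file.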

In case it's of interest, here's what I've done to switch from csv DictReader/DictWriter to pyexcel: mysociety/mapit.mysociety.org@3c3dd94
I think (apart from #33 and whatever full reads the underlying odfpy package might do) this is hopefully fully iterative and not loading anything into memory. Or hopefully near enough anyway! Thanks for providing this package :)

chfw · 2017-05-12T20:28:23Z

I see you used iget_records, and that is OK. "streaming=True" is passed and kept by iget_* and isave_*. odfpy and ezodf both read the file fully into memory. However, pyexcel-odsr is a stripped-down, ods-only version of messytables. If you need both pyexcel-ods and pyexcel-odsr installed, you could specify iget_records(..., library='pyexcel-odsr').

chfw · 2017-05-18T19:40:49Z

Please verify using pyexcel-io 0.3.4.

chfw closed this as completed in 06924f9 May 13, 2017

chfw added a commit that referenced this issue May 13, 2017

test issue #33 #34

17ff82f
