mmap files are treated as streams but can't be read #34

Closed
dracos opened this issue May 12, 2017 · 4 comments

Comments

dracos commented May 12, 2017

xlrd can operate on mmap files, so it would be useful to be able to pass in one to pyexcel, e.g.

sheet = pyexcel.get_sheet(file_type='xls', file_content=mmap.mmap(fp.fileno(), 0,
    access=mmap.ACCESS_READ))

isstream is true for an mmap object because it has a read method, so even though I passed it in as file_content, pyexcel-io's get_data calls load_data with file_stream rather than file_content. Down in pyexcel-xls this fails either way: the current release calls getvalue, which mmap does not have, and after the pyexcel/pyexcel-xls#16 fix it calls read without an argument, which errors because mmap's read must be given one.
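To make the detection problem concrete, here is a minimal sketch. `looks_like_stream` is a hypothetical stand-in for pyexcel-io's isstream check, assumed (from the description above) to key off the presence of a read method; it is not the library's actual code.

```python
import mmap
import tempfile


def looks_like_stream(obj):
    # Hypothetical stand-in for pyexcel-io's isstream check, assumed to
    # test only for the presence of a read attribute.
    return hasattr(obj, "read")


with tempfile.TemporaryFile() as fp:
    fp.write(b"placeholder bytes, not a real xls file")
    fp.flush()
    mapped = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        is_stream = looks_like_stream(mapped)       # mmap exposes read()
        has_getvalue = hasattr(mapped, "getvalue")  # but not getvalue()
    finally:
        mapped.close()
```

So an mmap object is classified as a stream, yet it lacks the getvalue method that an in-memory stream such as io.BytesIO would offer.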

My aim is to avoid reading the whole file contents in one go anywhere, and for pyexcel-xls/xlrd, mmap appears to be the only way (and #33 would fix it for CSV, I think). As a workaround I have created an mmap subclass that does not have a read function, so pyexcel-io passes file_content through to pyexcel-xls and thus to xlrd.
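One possible way to build such a subclass is sketched below. The class name `ReadlessMmap` and the property trick are assumptions for illustration; the issue does not show the actual workaround.

```python
import mmap
import tempfile


class ReadlessMmap(mmap.mmap):
    """Hypothetical mmap subclass that hides read() from hasattr checks."""

    @property
    def read(self):
        # Raising AttributeError here makes hasattr(obj, "read") return False.
        raise AttributeError("read is intentionally hidden")


with tempfile.TemporaryFile() as fp:
    fp.write(b"hello mmap")
    fp.flush()
    mapped = ReadlessMmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        hidden = not hasattr(mapped, "read")
        head = mapped[:5]  # slicing and indexing still work as usual
    finally:
        mapped.close()
```

Because only the read attribute is hidden, the object still supports slicing, so it can be handed on as file contents.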

Member

chfw commented May 12, 2017

Yes, I agree with your findings, and I will look at mmap options for file_stream and file_content. Could you please evaluate get_data(... streaming=True..)? Streaming makes the reader yield rows one at a time, which would at least allow you to process large CSV files. For large xls files, I will have to look at mmap.
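The idea behind yielding rows can be sketched with the standard library alone. This illustrates the general technique only and is not pyexcel-io's implementation; `iter_rows` is a hypothetical helper.

```python
import csv
import io


def iter_rows(text_stream):
    """Yield one parsed CSV row at a time instead of building a full list."""
    for row in csv.reader(text_stream):
        yield row


sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
rows = iter_rows(sample)
first = next(rows)      # rows are produced on demand, starting with the header
remaining = list(rows)  # the caller decides when to consume the rest
```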

Author

dracos commented May 12, 2017

streaming=True nearly works, but #33 shows the one case I think is left: the entire CSV file is read in by the read() call at

content = self._file_stream.read()

In case it's of interest, here's what I've done to switch from csv DictReader/DictWriter to pyexcel: mysociety/mapit.mysociety.org@3c3dd94
Apart from #33 and whatever full reads the underlying odfpy package might do, I think this is now fully iterative and does not load a whole file into memory. Or near enough, anyway! Thanks for providing this package :)
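For comparison with the full read() mentioned above, a lazy alternative for a binary CSV stream might look like the following. It is a sketch under the assumption that the stream is a buffered binary file object; `iter_csv_rows` is a hypothetical helper, not the change made in pyexcel-io.

```python
import csv
import io


def iter_csv_rows(binary_stream, encoding="utf-8"):
    # Decode incrementally: TextIOWrapper pulls data from the underlying
    # stream in chunks rather than through one read() of the whole file.
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    for row in csv.reader(text):
        yield row


raw = io.BytesIO(b"id,name\n1,alpha\n2,beta\n")
rows = list(iter_csv_rows(raw))
```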

Member

chfw commented May 12, 2017

I see you used iget_records, and that is OK. "streaming=True" is passed and kept by iget_* and isave_*. odfpy and ezodf both read the file fully into memory. However, pyexcel-odsr, a stripped-down, ods-only version of messytables, does not. If you need both pyexcel-ods and pyexcel-odsr installed, you could specify iget_records(...library='pyexcel-odsr').

chfw closed this as completed in 06924f9 May 13, 2017
chfw added a commit that referenced this issue May 13, 2017
Member

chfw commented May 18, 2017

Please verify using pyexcel-io 0.3.4.
