Buffering tools: warn if running against extremely large files #737
Comments
For files larger than 1GB, I don't think we can expect csvkit to ever perform well, so we can at least start there. Some tools will have lower limits. For reference, the buffering tools are listed here.
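As a rough illustration of the check proposed here, a buffering tool could compare the input file's size against a threshold before reading it. This is only a sketch: `warn_if_large` and `LARGE_FILE_BYTES` are hypothetical names, not csvkit code, and the 1 GB default is just the figure floated above.

```python
import os
import sys

# Hypothetical threshold; the thread does not settle on a value.
LARGE_FILE_BYTES = 1024 ** 3  # 1 GB


def warn_if_large(path, limit=LARGE_FILE_BYTES, stream=sys.stderr):
    """Write a warning to `stream` if `path` is larger than `limit` bytes.

    Returns True if a warning was written, False otherwise.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        # Missing file or input that is not a file on disk: nothing to check.
        return False
    if size > limit:
        stream.write(
            "Warning: %s is %.1f MB; this tool buffers its input in memory "
            "and may be slow or run out of memory.\n" % (path, size / 1e6)
        )
        return True
    return False
```

A size check like this only works for input given as a path; input piped through standard input has no size to inspect up front.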
Noting that issues have been opened about the performance of csvsort (#157, #338, #457, #626), csvsql (#428, #633), and csvstat (#581). #141 discusses general performance.
Fast alternatives:
The warning can also mention this from tricks.rst:
Grepping for I think Anyway, for posterity, noting that the streaming tools probably can't get much faster within csvkit.
Does I see why it would need to buffer in the case that you are both creating the table and filling the table, as that requires two passes on the data. But if you are only doing one of those tasks, it seems like csvsql should be able to stream? From Similarly, many of the statistics calculated by
csvsql has a faster alternative in #735 which should maybe be pursued. csvjson does stream if you set csvstat buffers because it uses agate, but we can implement some statistics directly in csvkit to avoid buffering.
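To illustrate what computing statistics without buffering could look like, here is a minimal single-pass sketch using only the standard `csv` module. It is not how csvstat or agate work; `stream_column_stats` is a hypothetical helper, and it assumes the chosen column holds plain numeric values that `float()` can parse.

```python
import csv
import io


def stream_column_stats(rows, column):
    """Compute count, min, max and mean of one numeric column in a single pass.

    `rows` is any iterable of dicts (for example a csv.DictReader), so only
    one row is held in memory at a time.
    """
    count = 0
    total = 0.0
    minimum = None
    maximum = None
    for row in rows:
        raw = row.get(column)
        if raw is None or raw.strip() == "":
            continue  # skip missing or empty cells
        value = float(raw)
        count += 1
        total += value
        minimum = value if minimum is None else min(minimum, value)
        maximum = value if maximum is None else max(maximum, value)
    mean = total / count if count else None
    return {"count": count, "min": minimum, "max": maximum, "mean": mean}


# Usage with an in-memory example; a real file object works the same way.
reader = csv.DictReader(io.StringIO("a,b\n1,2\n3,4\n"))
stats = stream_column_stats(reader, "a")
```

Count, min, max, sum and mean fit this pattern because they need only a running value per column. Statistics such as the median or the number of unique values need more state and are harder to stream.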
I use this technique a lot, but it doesn't help much if the pattern of a column shifts at the 6,000,001st row. (I understand that streaming won't really make things faster, but it should help with memory.) Would y'all be interested in a csvsql that uses streaming when appropriate?
I would be, yes!
When I run csvstat on a 218MB file, it fails silently, giving no output. When I run it in verbose mode, it gives a memory error. Would it be possible for it to display this memory error while not in verbose mode? Traceback (most recent call last):
I've made a commit to do that - thanks!
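For readers wondering what reporting the error outside verbose mode might involve, here is one possible shape. It is a sketch only and not the commit mentioned above: `run_tool` is a hypothetical wrapper, and the message text is made up for illustration.

```python
import sys


def run_tool(main, verbose=False, stream=sys.stderr):
    """Run `main()` and report a MemoryError instead of failing silently.

    In verbose mode the exception propagates so the full traceback is shown;
    otherwise a one-line message is written and a non-zero status returned.
    """
    try:
        main()
    except MemoryError:
        if verbose:
            raise
        stream.write("MemoryError: the input is too large to load into memory.\n")
        return 1
    return 0
```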
import Sequence from collections.abc to suppress warning in python 3.…
Such as #581.
But what's the limit? 100MB? 500MB? 1GB?