Use on files too large to fit in memory #18
Comments
Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back? https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
I'd be interested to know the speed of it.
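As a rough illustration of the mmap suggestion, here is a minimal sketch that maps a file with mmapFileByteString and hands the resulting strict ByteString to xeno. It assumes the mmap and xeno packages are installed and uses Xeno.SAX.validate only as a convenient stand-in for real processing; the helper names validateMapped and timedValidate are made up for this example.

```haskell
import System.CPUTime (getCPUTime)
import System.IO.MMap (mmapFileByteString) -- from the mmap package
import qualified Xeno.SAX as Xeno          -- from the xeno package

-- | Map the whole file and run xeno's validator over it.
-- 'Nothing' asks for the entire file rather than an (offset, size) range.
validateMapped :: FilePath -> IO Bool
validateMapped path = do
  bytes <- mmapFileByteString path Nothing
  let ok = Xeno.validate bytes
  -- Force the result so the parsing work happens inside this action.
  ok `seq` pure ok

-- | Run 'validateMapped' and also return the CPU time it took, in seconds.
-- (hypothetical helper; getCPUTime reports picoseconds)
timedValidate :: FilePath -> IO (Bool, Double)
timedValidate path = do
  start <- getCPUTime
  ok <- validateMapped path
  end <- getCPUTime
  pure (ok, fromIntegral (end - start) / 1e12)
```

Calling timedValidate "big.xml" from GHCi or a small main gives a first number to compare against other approaches. Note that it measures CPU time, so time spent waiting on page faults is not included.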
Nice, I didn't know about that. In our case, the huge xml files are zipped, so we can't mmap without first wasting some disk space, but I suppose that's a possibility. We've been using https://hackage.haskell.org/package/zip-1.0.0/docs/Codec-Archive-Zip.html#v:sourceEntry with xml-conduit so far, but it's slow.
A quick-and-dirty speed comparison shows
SAX seems to only stream on the output side – but maybe it would help with avoiding space leaks? I notice the "count elements" examples from the README using
Interesting! I expect most of that speed hit comes from
Eesh! Question: how long does it take to simply unzip the file with
That's a good point, it does indeed only stream on the output side.
So
Actually, it's faster than zcat, under 2s! Even if I insert
This might be related: snoyberg/conduit#370
Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage. (EDIT: If I upgrade conduit to 1.3.1, I do see snoyberg/conduit#370 – but we've been on 1.2, where there is reasonable memory usage.)
One thing to check would be the GC productivity and allocations info. The CPU usage might well be high because the program spends too much time cleaning up after itself.
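The quickest way to get those numbers is to run the program with +RTS -s. If it is more convenient to read them from inside the program, a minimal sketch using GHC.Stats from base could look like the following. The names productivityOf and reportGc are made up for this example, the productivity figure is a simple approximation (mutator CPU time over mutator plus GC CPU time), and the program has to be started with +RTS -T for the statistics to be collected.

```haskell
import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Fraction of CPU time spent running the program itself rather than
-- the garbage collector. Both arguments are in nanoseconds.
productivityOf :: Int64 -> Int64 -> Double
productivityOf mutatorNs gcNs
  | total <= 0 = 1
  | otherwise = fromIntegral mutatorNs / fromIntegral total
  where
    total = mutatorNs + gcNs

-- | Print a short GC summary; call this at the end of the parsing run.
reportGc :: IO ()
reportGc = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats are off; run the program with +RTS -T"
    else do
      s <- getRTSStats
      putStrLn ("productivity:    " ++ show (productivityOf (mutator_cpu_ns s) (gc_cpu_ns s)))
      putStrLn ("allocated bytes: " ++ show (allocated_bytes s))
      putStrLn ("max live bytes:  " ++ show (max_live_bytes s))
```

A low productivity figure together with a large allocation count would point at GC overhead rather than the parsing itself.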
@chrisdone should we add memory-mapped file access?
@ocramz I think that's already available from https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
Even with mmap-ing, I can't get this to have constant memory usage.
If I do
I get constant memory usage. If I do
then small.xml gives 52080maxresidentk while big.xml gives 4286264maxresidentk. I've tried building xeno both with and without
@unhammer How big are your files that do not fit in memory?
see gmail @mgajda
For the time being I recommend using If there are any performance issues, they would be due to garbage collection time (GC time); you may decrease it by processing the nodes inside the XML in a streaming fashion.
From what I can tell from reading the (short!) source, there doesn't seem to be a way to use this on files too large to fit in memory. In particular, the ByteStrings are strict, and if one simply tries splitting the input and sending in the parts, there's a possibility of getting errors like
Left (XenoParseError "Couldn't find the matching quote character.")
I'd rather not have Lazy ByteStrings because reasons, though that would probably be the simplest solution.
An alternative would be to add an eofF :: (Int -> m ()) argument to process, pass it the index of the last fully processed element, and call it wherever we now call throw. That would allow simply sending in arbitrarily sized ByteStrings and stitching them together using the eofF function. But most of the throws happen in s_index, and making that return a Maybe seems like it would hurt performance.
Or is it possible to make indexEndOfLastMatch an argument of the XenoException constructors, and catch the throws without losing the monad return value?
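To make the stitching idea concrete, here is a minimal sketch of the caller's side. It is not xeno's API: stitch and lastCompleteEnd are hypothetical names, and lastCompleteEnd is a naive stand-in for "the index of the last fully processed element" that just looks for the last '>' byte, which comments, CDATA sections, or attribute values containing '>' would fool. A real implementation would get that index from the parser itself.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Hypothetical stand-in for the index a parser would report: the offset
-- just past the last '>' in the buffer, or 0 if there is none.
lastCompleteEnd :: B.ByteString -> Int
lastCompleteEnd buf = maybe 0 (+ 1) (B.elemIndexEnd '>' buf)

-- | Feed chunks in order. Each step prepends the bytes left over from the
-- previous step, asks 'consumed' how far processing got, emits that prefix,
-- and carries the remainder forward. Bytes still pending after the last
-- chunk are returned separately, so the caller can report a real error.
stitch
  :: (B.ByteString -> Int)
  -> [B.ByteString]
  -> ([B.ByteString], B.ByteString)
stitch consumed = go B.empty
  where
    go pending [] = ([], pending)
    go pending (c : cs) =
      let buf = pending <> c
          (done, rest) = B.splitAt (consumed buf) buf
          (more, leftover) = go rest cs
       in (if B.null done then more else done : more, leftover)
```

With chunks split in the middle of a tag, such as "<a><b at", "tr=\"1\">x</b", and "></a>", every emitted piece ends on a tag boundary and nothing is left pending, so a "Couldn't find the matching quote character." error would only be raised for input that is actually malformed. The cost is that leftover bytes are scanned again together with the next chunk.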