
Use on files too large to fit in memory #18

Open · unhammer opened this issue May 10, 2018 · 17 comments

@unhammer (Contributor)

From what I can tell from reading the (short!) source, there doesn't seem to be a way to use this on files too large to fit in memory. In particular, the ByteStrings are strict, and if one simply tries splitting the input and feeding in the parts, there's a possibility of getting errors like

Left (XenoParseError "Couldn't find the matching quote character.")
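
To make that concrete, here is a tiny illustrative sketch (not from the original report): cutting a document in the middle of a quoted attribute value leaves the first chunk with an unterminated quote, which xeno reports as a parse error.

```haskell
import qualified Data.ByteString.Char8 as S8
import qualified Xeno.DOM

main :: IO ()
main = do
  let whole  = S8.pack "<a href=\"http://example.com\">x</a>"
      chunk1 = S8.take 12 whole  -- cut lands inside the quoted attribute value
  -- expected: Left (XenoParseError ...) complaining about the missing quote
  print (Xeno.DOM.parse chunk1)
```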

I'd rather not have Lazy ByteStrings because reasons, though that would probably be the simplest solution.

An alternative would be to add an eofF :: (Int -> m ()) argument to process, pass it the index of the last fully processed element, and call it wherever we now call throw. That would allow simply feeding in arbitrarily sized ByteStrings and stitching them together using the eofF function. But most of the throws happen in s_index; making that return a Maybe seems like it would hurt performance.
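
For illustration only, the shape of that proposal might be something like the following; processWithEof and its eofF argument are hypothetical and do not exist in xeno:

```haskell
import Data.ByteString (ByteString)

-- Hypothetical sketch: like Xeno.SAX.process, but instead of throwing when the
-- input ends mid-element, it calls eofF with the index of the last fully
-- processed element so the caller can resume from there with the next chunk.
processWithEof
  :: Monad m
  => (ByteString -> m ())                -- openF
  -> (ByteString -> ByteString -> m ())  -- attrF
  -> (ByteString -> m ())                -- endOpenF
  -> (ByteString -> m ())                -- textF
  -> (ByteString -> m ())                -- closeF
  -> (ByteString -> m ())                -- cdataF
  -> (Int -> m ())                       -- eofF (proposed)
  -> ByteString
  -> m ()
processWithEof = undefined  -- type sketch only
```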

Or would it be possible to make indexEndOfLastMatch an argument of XenoException, and catch the throws without losing the monadic return value?

@chrisdone (Collaborator)

Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?

https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
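
A minimal sketch of that suggestion, using Xeno.SAX.validate as a stand-in for real handlers:

```haskell
import System.Environment (getArgs)
import qualified System.IO.MMap as MM
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  [path] <- getArgs
  -- Nothing = map the whole file; the ByteString is strict but paged in lazily
  bs <- MM.mmapFileByteString path Nothing
  print (Xeno.validate bs)
```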

@chrisdone (Collaborator)

I'd be interested to know the speed of it.

@chrisdone (Collaborator) commented May 10, 2018

By the way, the sax package is a streaming parser implemented using xeno, so that might be an option if you want to stream. (Edit: wrong, see below.)

@unhammer (Contributor, Author)

> Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?
>
> https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString

Nice, I didn't know about that.

In our case, the huge XML files are zipped, so we can't mmap them without first wasting some disk space, but I suppose that's a possibility. We've been using https://hackage.haskell.org/package/zip-1.0.0/docs/Codec-Archive-Zip.html#v:sourceEntry with xml-conduit so far, but it's slow.

@unhammer (Contributor, Author)

A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file, whereas mawk '/</{print}' takes 1.5s and a stripped-down version of our xml-conduit parser (which only prints elements) takes about 4 minutes. So this looks promising :-)
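
For reference, the benchmark described above could be set up roughly like this, assuming the positional Xeno.SAX.process signature xeno had at the time (openF, attrF, endOpenF, textF, closeF, cdataF); newer versions changed this interface:

```haskell
import qualified Data.ByteString as BS
import System.Environment (getArgs)
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  [path] <- getArgs
  bs <- BS.readFile path
  let c = \_ -> return ()
  -- print each opened element name; ignore attributes, text, CDATA and closes
  Xeno.process print (\_ _ -> return ()) c c c c bs
```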

@unhammer (Contributor, Author)

SAX seems to only stream on the output side, but maybe it would help with avoiding space leaks? I notice the "count elements" example from the README, which uses fold, at least gobbled memory.

@chrisdone (Collaborator) commented May 11, 2018

> A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file,

Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to a String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.

> a stripped-down version of our xml-conduit parser (which only prints elements) takes about 4 minutes.

Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious about how much the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.
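
That measurement might look roughly like this; the zip-package API (withArchive, sourceEntry, mkEntrySelector) is assumed from the docs linked earlier:

```haskell
import Codec.Archive.Zip (mkEntrySelector, sourceEntry, withArchive)
import Data.Conduit.Combinators (sinkNull)
import System.Environment (getArgs)

main :: IO ()
main = do
  [archive, entryName] <- getArgs
  entry <- mkEntrySelector entryName
  -- stream the entry out of the archive and discard it, timing decompression alone
  withArchive archive (sourceEntry entry sinkNull)
```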

> SAX seems to only stream on the output side

That's a good point, it does indeed only stream on the output side.

@unhammer (Contributor, Author)

> Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to a String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.

So S8.putStrLn takes it down to 6s, and (\_ -> return ()) down to 4s. I didn't even try a no-op at first, since I'm used to GHC completely optimising such things away, giving useless benchmarks =P but 4s seems like it's doing what we mean here.

> a stripped-down version of our xml-conduit parser (which only prints elements) takes about 4 minutes.
>
> Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious about how much the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.

Actually, it's faster than zcat, under 2s! Even if I insert CL.mapM (liftIO . S8.putStr) .| in there. I'm going to start using that instead of zcat on the CLI for these single-file zip files …
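
Roughly, that "zcat for single-file zips" could look like the following sketch (same assumed zip-package API as above, with the CL.mapM stage inserted):

```haskell
import Codec.Archive.Zip (mkEntrySelector, sourceEntry, withArchive)
import Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString.Char8 as S8
import Data.Conduit ((.|))
import Data.Conduit.Combinators (sinkNull)
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)

main :: IO ()
main = do
  [archive, entryName] <- getArgs
  entry <- mkEntrySelector entryName
  -- stream the entry to stdout instead of discarding it
  withArchive archive $
    sourceEntry entry (CL.mapM (liftIO . S8.putStr) .| sinkNull)
```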

@chrisdone (Collaborator)

This might be related: snoyberg/conduit#370

@unhammer (Contributor, Author) commented May 14, 2018

Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage.

(EDIT: if I upgrade conduit to 1.3.1, I do see snoyberg/conduit#370, but we've been on 1.2, where memory usage is reasonable.)

@zudov commented May 24, 2018

> Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage.

One thing to check would be the GC productivity and allocation figures. The CPU usage might well be high because the program spends too much time cleaning up after itself.

@ocramz (Owner) commented Jul 6, 2018

@chrisdone should we add memory-mapped file access?

@chrisdone (Collaborator)

@unhammer (Contributor, Author) commented Oct 18, 2019

Even with mmap-ing, I can't get this to have constant memory usage.

-rw-rw-r-- 1 unhammer unhammer 15M okt.  18 11:01 big.xml
-rw-rw-r-- 1 unhammer unhammer 337 okt.  18 11:03 small.xml

If I do

```haskell
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ Data.ByteString.length bs
```

I get constant memory usage; /usr/bin/time reports ~51980k maxresident for both big.xml and small.xml.

If I do

```haskell
-- (imports added for completeness; module locations assumed)
import Data.ByteString (ByteString)
import System.Environment (getArgs)
import qualified System.IO.MMap as MM
import qualified Xeno.SAX as Xeno
import Xeno.Types (XenoException)

noop :: ByteString -> Either XenoException Integer
noop = Xeno.fold const (\m _ _ -> m) const const const const 0

main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ noop bs
```

then small.xml gives ~52080k maxresident while big.xml gives ~4286264k maxresident (over 4 GB).

[three screenshots attached in the original comment]

I've tried building xeno both with and without -f-no-full-laziness.
Tested on GHC 8.6.5, xeno 82be21b.

@mgajda (Collaborator) commented Feb 22, 2021

@unhammer How big are the files that do not fit in memory?
Can we have a look at them?
We have parsed multi-gigabyte files fine, but maybe that is not what you need.

@unhammer (Contributor, Author)

see gmail @mgajda

@mgajda (Collaborator) commented Feb 23, 2021

For the time being I recommend using bytestring-mmap to map the entire file as a ByteString, since the memory is foreign and can be paged out without adding to garbage collection time.
If you have a compressed file, I would recommend uncompressing it on disk (in /tmp) and consuming it from there with mmap.
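
A minimal sketch of that recommendation, assuming bytestring-mmap's System.IO.Posix.MMap.unsafeMMapFile and a file already uncompressed to /tmp:

```haskell
import System.IO.Posix.MMap (unsafeMMapFile)
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  -- the mapped file must not change while the ByteString is alive
  bs <- unsafeMMapFile "/tmp/big.xml"
  print (Xeno.validate bs)
```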

If there are any remaining performance issues, they would be due to garbage collection (GC) time; you can reduce it by processing the nodes inside the XML in a streaming fashion.
