
Use on files too large to fit in memory #18

Open · unhammer opened this issue May 10, 2018 · 17 comments

@unhammer (Contributor)

From what I can tell from reading the (short!) source, there doesn't seem to be a way to use this on files too large to fit in memory. In particular, the ByteStrings are strict, and if one simply tries splitting the input and feeding in the parts, there's a possibility of getting errors like

Left (XenoParseError "Couldn't find the matching quote character.")
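
To make that concrete, here is a tiny illustrative sketch (not from the original report): cutting a document in the middle of a quoted attribute value leaves the first chunk with an unterminated quote, which xeno reports as a parse error.

```haskell
import qualified Data.ByteString.Char8 as S8
import qualified Xeno.DOM

main :: IO ()
main = do
  let whole  = S8.pack "<a href=\"http://example.com\">x</a>"
      chunk1 = S8.take 12 whole  -- cut lands inside the quoted attribute value
  -- expected: Left (XenoParseError ...) complaining about the missing quote
  print (Xeno.DOM.parse chunk1)
```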

I'd rather not have Lazy ByteStrings because reasons, though that would probably be the simplest solution.

An alternative would be to add an eofF :: (Int -> m ()) argument to process, pass it the index of the last fully processed element, and call it wherever we now call throw. That would allow simply feeding in arbitrarily sized ByteStrings and stitching them together using the eofF function. But most of the throws happen in s_index; making that return a Maybe seems like it would hurt performance.
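
For illustration only, the shape of that proposal might be something like the following; processWithEof and its eofF argument are hypothetical and do not exist in xeno:

```haskell
import Data.ByteString (ByteString)

-- Hypothetical sketch: like Xeno.SAX.process, but instead of throwing when the
-- input ends mid-element, it calls eofF with the index of the last fully
-- processed element so the caller can resume from there with the next chunk.
processWithEof
  :: Monad m
  => (ByteString -> m ())                -- openF
  -> (ByteString -> ByteString -> m ())  -- attrF
  -> (ByteString -> m ())                -- endOpenF
  -> (ByteString -> m ())                -- textF
  -> (ByteString -> m ())                -- closeF
  -> (ByteString -> m ())                -- cdataF
  -> (Int -> m ())                       -- eofF (proposed)
  -> ByteString
  -> m ()
processWithEof = undefined  -- type sketch only
```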

Or would it be possible to make indexEndOfLastMatch an argument of XenoException, and catch the throws without losing the monadic return value?

@chrisdone (Collaborator)

Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?

https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString
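
A minimal sketch of that suggestion, using Xeno.SAX.validate as a stand-in for real handlers:

```haskell
import System.Environment (getArgs)
import qualified System.IO.MMap as MM
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  [path] <- getArgs
  -- Nothing = map the whole file; the ByteString is strict but paged in lazily
  bs <- MM.mmapFileByteString path Nothing
  print (Xeno.validate bs)
```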

@chrisdone (Collaborator)

I'd be interested to know the speed of it.

@chrisdone (Collaborator) commented May 10, 2018

By the way, the sax package is a streaming parser implemented using xeno, so that might be an option if you want to stream. (Edit: wrong, see below.)

@unhammer (Contributor, Author)

> Neil Mitchell said that he just memory maps the file. Perhaps you could try that and report back?
>
> https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html#v:mmapFileByteString

Nice, I didn't know about that.

In our case, the huge XML files are zipped, so we can't mmap them without first wasting some disk space, but I suppose that's a possibility. We've been using https://hackage.haskell.org/package/zip-1.0.0/docs/Codec-Archive-Zip.html#v:sourceEntry with xml-conduit so far, but it's slow.

@unhammer (Contributor, Author)

A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file, whereas mawk '/</{print}' takes 1.5s and a stripped-down version of our xml-conduit parser (which only prints elements) takes about 4 minutes. So this looks promising :-)
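
For reference, the benchmark described above could be set up roughly like this, assuming the positional Xeno.SAX.process signature xeno had at the time (openF, attrF, endOpenF, textF, closeF, cdataF); newer versions changed this interface:

```haskell
import qualified Data.ByteString as BS
import System.Environment (getArgs)
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  [path] <- getArgs
  bs <- BS.readFile path
  let c = \_ -> return ()
  -- print each opened element name; ignore attributes, text, CDATA and closes
  Xeno.process print (\_ _ -> return ()) c c c c bs
```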

@unhammer (Contributor, Author)

SAX seems to only stream on the output side, but maybe it would help with avoiding space leaks? I notice the "count elements" example from the README, which uses fold, at least gobbled memory.

@chrisdone (Collaborator) commented May 11, 2018

> A quick-and-dirty speed comparison shows process print (\_ _ -> return ()) c c c c where c = (\_ -> return ()) taking 12s on a 1G file,

Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to a String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.

> a stripped-down version of our xml-conduit parser (which only prints elements) takes about 4 minutes.

Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious about how much the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.
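
That measurement might look roughly like this; the zip-package API (withArchive, sourceEntry, mkEntrySelector) is assumed from the docs linked earlier:

```haskell
import Codec.Archive.Zip (mkEntrySelector, sourceEntry, withArchive)
import Data.Conduit.Combinators (sinkNull)
import System.Environment (getArgs)

main :: IO ()
main = do
  [archive, entryName] <- getArgs
  entry <- mkEntrySelector entryName
  -- stream the entry out of the archive and discard it, timing decompression alone
  withArchive archive (sourceEntry entry sinkNull)
```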

> SAX seems to only stream on the output side

That's a good point, it does indeed only stream on the output side.

@unhammer (Contributor, Author)

> Interesting! I expect most of that speed hit comes from print (it has to convert the ByteString to a String and then use the slow putStrLn, which prints one character at a time). Probably S8.putStrLn from Data.ByteString.Char8 would be faster. I'd be curious to see how much time is taken with no-ops for everything.

So S8.putStrLn takes it down to 6s, and (\_ -> return ()) down to 4s. I didn't even try a no-op at first, since I'm used to GHC completely optimising such things away, giving useless benchmarks =P but 4s seems like it's doing what we mean here.

> a stripped-down version of our xml-conduit parser (which only prints elements) takes about 4 minutes.
>
> Eesh! Question: how long does it take to simply unzip the file with sourceEntry and e.g. sinkNull? I'm curious about how much the unzip takes versus xml-conduit. If we were to implement a streaming solution for xeno, it's possible that the unzipping overhead would be so large that a fast XML parser doesn't matter.

Actually, it's faster than zcat, under 2s! Even if I insert CL.mapM (liftIO . S8.putStr) .| in there. I'm going to start using that instead of zcat on the CLI for these single-file zip files …
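
Roughly, that "zcat for single-file zips" could look like the following sketch (same assumed zip-package API as above, with the CL.mapM stage inserted):

```haskell
import Codec.Archive.Zip (mkEntrySelector, sourceEntry, withArchive)
import Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString.Char8 as S8
import Data.Conduit ((.|))
import Data.Conduit.Combinators (sinkNull)
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)

main :: IO ()
main = do
  [archive, entryName] <- getArgs
  entry <- mkEntrySelector entryName
  -- stream the entry to stdout instead of discarding it
  withArchive archive $
    sourceEntry entry (CL.mapM (liftIO . S8.putStr) .| sinkNull)
```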

@chrisdone (Collaborator)

This might be related: snoyberg/conduit#370

@unhammer (Contributor, Author) commented May 14, 2018

Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage.

(EDIT: if I upgrade conduit to 1.3.1, I do see snoyberg/conduit#370, but we've been on 1.2, where memory usage is reasonable.)

@zudov commented May 24, 2018

> Hm, I haven't noticed high memory usage with xml-conduit here, just very high cpu usage.

One thing to check would be the GC productivity and allocation figures. The CPU usage might well be high because the program spends too much time cleaning up after itself.

@ocramz (Owner) commented Jul 6, 2018

@chrisdone should we add memory-mapped file access?

@chrisdone (Collaborator)

@unhammer (Contributor, Author) commented Oct 18, 2019

Even with mmap-ing, I can't get this to have constant memory usage.

-rw-rw-r-- 1 unhammer unhammer 15M okt.  18 11:01 big.xml
-rw-rw-r-- 1 unhammer unhammer 337 okt.  18 11:03 small.xml

If I do

```haskell
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ Data.ByteString.length bs
```

I get constant memory usage; /usr/bin/time reports ~51980k maxresident for both big.xml and small.xml.

If I do

```haskell
-- (imports added for completeness; module locations assumed)
import Data.ByteString (ByteString)
import System.Environment (getArgs)
import qualified System.IO.MMap as MM
import qualified Xeno.SAX as Xeno
import Xeno.Types (XenoException)

noop :: ByteString -> Either XenoException Integer
noop = Xeno.fold const (\m _ _ -> m) const const const const 0

main = do
  [path] <- getArgs
  bs <- MM.mmapFileByteString path Nothing
  print $ noop bs
```

then small.xml gives ~52080k maxresident while big.xml gives ~4286264k maxresident (over 4 GB).

[three screenshots attached in the original comment]

I've tried building xeno both with and without -f-no-full-laziness.
Tested on GHC 8.6.5, xeno 82be21b.

@mgajda (Collaborator) commented Feb 22, 2021

@unhammer How big are the files that do not fit in memory?
Can we have a look at them?
We have parsed multi-gigabyte files fine, but maybe that is not what you need.

@unhammer (Contributor, Author)

see gmail @mgajda

@mgajda (Collaborator) commented Feb 23, 2021

For the time being I recommend using bytestring-mmap to map the entire file as a ByteString, since the memory is foreign and can be paged out without adding to garbage collection time.
If you have a compressed file, I would recommend uncompressing it on disk (in /tmp) and consuming it from there with mmap.
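
A minimal sketch of that recommendation, assuming bytestring-mmap's System.IO.Posix.MMap.unsafeMMapFile and a file already uncompressed to /tmp:

```haskell
import System.IO.Posix.MMap (unsafeMMapFile)
import qualified Xeno.SAX as Xeno

main :: IO ()
main = do
  -- the mapped file must not change while the ByteString is alive
  bs <- unsafeMMapFile "/tmp/big.xml"
  print (Xeno.validate bs)
```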

If there are any remaining performance issues, they would be due to garbage collection (GC) time; you can reduce it by processing the nodes inside the XML in a streaming fashion.
