Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] support streaming osm files through libexpat #55

Merged
merged 1 commit into from
Jan 5, 2015
Merged

[WIP] support streaming osm files through libexpat #55

merged 1 commit into from
Jan 5, 2015

Conversation

yeesian
Copy link
Collaborator

@yeesian yeesian commented Dec 20, 2014

Just for comparison, when reading an osm file that's ~320MB (extracts available here):

  • it takes up about 7.6GB RAM using the tree-based parser from LightXML:
julia> @time nodes, highways, buildings, features = getOSMData(filename, nodes=true, highways=true, buildings=true, features=true);
elapsed time: 77.810447564 seconds (7885139620 bytes allocated, 70.46% gc time)
  • it takes up around 0.5GB RAM using the event-based parser from LibExpat:
julia> @time nodes, highways, buildings, features = getOSMData(filename, true);
elapsed time: 627.010413888 seconds (176812171176 bytes allocated, 89.00% gc time)

However, garbage collection is making it very slow, and I'm not sure how to improve the performance. Any suggestions?
@garborg, @mlubin, @IainNZ, @joehuchette

Some numbers:

julia> length(nodes)
2447660

julia> length(highways)
65610

julia> length(buildings)
185823

julia> length(features)
12077

@yeesian yeesian mentioned this pull request Dec 20, 2014
@IainNZ
Copy link

IainNZ commented Dec 20, 2014

@yeesian try running it with http://docs.julialang.org/en/latest/stdlib/profile/#direct-analysis-of-memory-allocation to identify worst spots - also consider disabling gc selectively if its just generating a million of small objects in a tight loop.

@yeesian
Copy link
Collaborator Author

yeesian commented Dec 20, 2014

I've tried disabling gc selectively, but haven't yet gotten it to work. On the few occasions where it seemed promising, it blew up in memory; I'll try profiling as you suggest

(btw, if you're looking for a smaller file to play with, @tedsteiner has a small instance on dropbox, mostly for testing)

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling f677b74 on yeesian:streaming-xml-parser into 2293c61 on tedsteiner:master.

@yeesian
Copy link
Collaborator Author

yeesian commented Dec 22, 2014

Sorry for the noise, i did a git revert where I really meant to do a git reset instead.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 3, 2015

Thought you all might be interested to know: here's the latest comparison, when reading an osm file that's ~320MB (extracts available here)

  • it takes up about 7.6GB RAM using the tree-based parser from LightXML:
julia> @time nodes, highways, buildings, features = getOSMData(filename, nodes=true, highways=true, buildings=true, features=true);
elapsed time: 77.810447564 seconds (7885139620 bytes allocated, 70.46% gc time)
  • it takes up around 0.5GB RAM using the event-based parser from LibExpat (on master):
julia> @time nodes, highways, buildings, features = getOSMData(filename, true);
elapsed time: 85.566212155 seconds (10574148240 bytes allocated, 65.05% gc time)

@mlubin
Copy link

mlubin commented Jan 3, 2015

👍

@garborg
Copy link
Collaborator

garborg commented Jan 3, 2015

Excellent!

On Fri, Jan 2, 2015 at 6:15 PM, Miles Lubin notifications@github.com
wrote:

[image: 👍]


Reply to this email directly or view it on GitHub
#55 (comment)
.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 3, 2015

I'm wondering now

  • are the LightXML portions of the codebase still needed? (removed, except for parseMapXML)
  • should I provide keyword arguments for the user to select those aspects of the map they want? Or just parse everything, and let them discard the portions they don't need

@coveralls
Copy link

Coverage Status

Coverage increased (+1.15%) when pulling dc730f9 on yeesian:streaming-xml-parser into 2293c61 on tedsteiner:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.15%) when pulling 5a1d50b on yeesian:streaming-xml-parser into 2293c61 on tedsteiner:master.

@garborg
Copy link
Collaborator

garborg commented Jan 3, 2015

@yeesian I think letting users select what they want parsed should be on the roadmap. I'm not certain keyword arguments is a granular enough solution in the longterm, but I haven't thought about it much. Anyway, I'd be for implementing something for now or just leaving the design open to adding it later.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 3, 2015

Mm, agreed. I think it's worth having on the roadmap, but is not an immediate priority (at least to me). I'd prefer to get the other aspects of the parsing right (e.g. relations/turn-restrictions) working first, before introducing conditional filtering. That said, if you wish, feel free to have a shot!

One way to implement it would be simply to discard the irrelevant parts within getOSMData and just return the ones specified by the keyword arguments?

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 3, 2015

Also, I'm starting to wonder if a Dict might be a better object to return. For users to check what keys/objects are available, and to make it easy for us to introduce new OSM entities

@garborg
Copy link
Collaborator

garborg commented Jan 3, 2015

Yep, not trying to suggest someone work on it now. I care about turn restrictions more, too -- that sounds great.

@garborg
Copy link
Collaborator

garborg commented Jan 3, 2015

Yes, that sounds good -- returning a tuple that varies based on values of the inputs seems like a bit of an anti-pattern.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.15%) when pulling de29440 on yeesian:streaming-xml-parser into 2293c61 on tedsteiner:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.15%) when pulling de29440 on yeesian:streaming-xml-parser into 2293c61 on tedsteiner:master.

@garborg
Copy link
Collaborator

garborg commented Jan 3, 2015

Oh no, just thought of this -- I often only parse nodes and highways -- instead of a 10% slowdown, it will be much slower in cases like that -- I'm not sure what @tedsteiner does in his work, but he's the heaviest user, so I'll defer to him.

In the meantime I'll run a quick test and report numbers -- and if if it happens people need it a little faster before merging into master, perhaps a feature branch we all could commit to to get it to parity as soon as possible.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 3, 2015

If the numbers are significant, I'll have a look to see what I can do in this PR. I'm guessing that it might be a challenge though.. since (i) the number of features is usually small, and (ii) you need to parse the tags to disambiguate between a building and a highway, which comes only after parsing their nd references, and so you can't "skip over" the parsing (in an event-based system) the way you previously could (in a tree-based system).

That said, it's always nice to have benchmarks to aim for (:

@garborg
Copy link
Collaborator

garborg commented Jan 3, 2015

I tried a couple extracts, and I'm seeing performance like you mentioned for parsing everything, and performance 2-3x slower for parsing just the nodes and highways, e.g.:

~  julia-dev -e 'include("osmp.jl"); time_fun(tree4)'
elapsed time: 38.797660288 seconds (3637203160 bytes allocated, 34.39% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(tree4)'
elapsed time: 30.45888161 seconds (3637203160 bytes allocated, 43.91% gc time)
➜  ~~  julia-dev -e 'include("osmp.jl"); time_fun(tree2)'
elapsed time: 11.748997784 seconds (792707816 bytes allocated, 4.55% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(tree2)'
elapsed time: 11.620226646 seconds (792707816 bytes allocated, 4.68% gc time)
➜  ~
Switched to branch 'streaming'
➜  OpenStreetMap git:(streaming)
➜  ~~  julia-dev -e 'include("osmp.jl"); time_fun(stream4)'
elapsed time: 31.529423398 seconds (6290493080 bytes allocated, 40.69% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(stream4)'
elapsed time: 31.566031387 seconds (6290493080 bytes allocated, 40.51% gc time)
➜  ~~  julia-dev -e 'include("osmp.jl"); time_fun(stream2)'
elapsed time: 31.150185125 seconds (6290493112 bytes allocated, 40.87% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(stream2)'
elapsed time: 31.565353016 seconds (6290493112 bytes allocated, 41.32% gc time)

Notes:
While this PR appears to allocate much more than master, it can handle larger files without running out of space.
All the garbage created in the current parser happens in parsing buildings and features -- maybe that's avoidable.
I'm not sure what to suggest between keeping both parsers around, keeping a separate branch for a little bit, and just merging it.
.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 4, 2015

Thanks for the benchmarks! Guess I should not have collapsed the commits; I can revert back to having both parsers tmr, but --

I'd imagine it to be confusing to anyone using openstreetmap to suddenly see the memory requirement spike (when they are in fact asking for fewer OSM objects), so we'll have to be careful with the documentation.

2-3x slower does sound like quite abit though (is it 10 vs 30sec?); I guess I should look into possibilities for speeding the parser up, since I've only worked with node and highways myself so far.

@garborg
Copy link
Collaborator

garborg commented Jan 5, 2015

@yeesian I think a note in the documentation is enough there, but...

Do you think having two parsers to keep in sync will slow down development efforts implement other parsing? If that's so, I'd be happy to take slower parsing in exchange for a better graph (as long as the new parsing functionality comes with tests, anyone who needed faster parsing would have a clear path).

@tedsteiner, what's more important to you right now?:
a) Fast parsing of (nodes and highways) for extracts that the current parser doesn't brush against memory limits on?
b) Parsing of larger extracts, and possibly getting more parsing functionality a bit soone

@tedsteiner
Copy link
Owner

Sorry about being away for a while, this looks awesome, @yeesian!

@garborg TL; DR: I'd lean towards Option B.

My general thoughts are that:

  1. We probably only want one dependency on an XML reading package, rather than two. Less to break. If it looks like one option is going to win out long-term or is more actively developed, that might be a better option. I chose LightXML because it was easier for me to get running and seemed more stable at the time, but I did so reluctantly because I really wanted streaming XML.
  2. As I see it, the memory bloat of parsing the entire XML tree in memory is probably unacceptable long-term, when considering all use cases. I think I selected the boundaries for my example maps of Boston (the ones linked in the Readme) according to the size of the OSM file. I shrank the boundaries until it fit in memory (that OSM file parses to about 3 GB). Anyone wanting to view maps of an entire state or country is going to run out of memory.
  3. In general, I think a longer loading time is fine, since you only need to load each datafile once per session. And it's frustrating to have gigabytes of RAM eaten up by an XML tree you don't even have access to any more. For some reason, at least on my system, the memory stays allocated until I close Julia.
  4. My understanding is that the OSM XML format isn't really intended for fast parsing, but rather for convenience. Applications requiring fast parsing should probably use the binary format. Somewhere on the roadmap we should maybe add the ability to read these binary files, but I don't really have a use for it myself, since the OSM files are working fine and I don't mind waiting a little while to read them.

@garborg
Copy link
Collaborator

garborg commented Jan 5, 2015

That all sounds right to me, so I guess the one worry is that we'd be switching away from a tested package under the JuliaLang organization to an untested package that less people have their eyes on -- on the other hand, easy enough to remedy and switching and adding testing upstream sounds like something that needs to happen eventually anyway.

@garborg
Copy link
Collaborator

garborg commented Jan 5, 2015

I opened an issue out of curiosity (not to wait on) to see if support for the streaming API was on anyone's LightXML.jl roadmap.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 5, 2015

FWIW, LibExpat also has some tests, even if it's not very well-documented.

I'm okay with having support for multiple parsers, but we don't yet have a parser-agnostic interface (apart from getOSMData and parseMapXML). Longer-term, I don't think tree-based parsing is the way to go for OpenStreetMap anyway (for reasons @tedsteiner has mentioned), and so it doesn't really hurt to remove the existing (tree-based) implementation for LightXML. If and when they do have support for a streaming API, we can probably revisit this discussion, although I don't think the performance difference will be all that much --

My hunch is: this package will probably profit more from understanding/modelling the different OSM entities (e.g. #56), and their relationships, before we embark on the work of conditionally filtering. As it is, we're already leaving out alot of information in the OSM file, that other people might be interested in.

Moving forward, I think it might be nice (from time to time), to abstract out the backends abit, e.g.

  • [Plotting/Visualization] to use Gadfly/Compose, apart from Winston
  • [Routing/Algorithms] to use JuMP, or maybe even OSRM, apart from Graphs
  • [Parsing] LibExpat versus LightXML, etc

@garborg
Copy link
Collaborator

garborg commented Jan 5, 2015

Sounds like we're all on the same page.

Good to see the package has tests -- it will be quick work to get Pkg.test("LibExpat") working, Travis set up, and PkgEval to test nightly. I'll submit a PR there for that now, and to get 0.3 support, so this PR will apply to 0.3.

Thanks again for getting this working!

garborg added a commit that referenced this pull request Jan 5, 2015
[WIP] support streaming osm files through libexpat
@garborg garborg merged commit 367af58 into tedsteiner:master Jan 5, 2015
@yeesian yeesian deleted the streaming-xml-parser branch January 5, 2015 19:28
@garborg
Copy link
Collaborator

garborg commented Jan 5, 2015

Compatibility / testing PR submitted upstream: JuliaIO/LibExpat.jl#23

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 5, 2015

Nice, thanks!

@tedsteiner
Copy link
Owner

@yeesian @garborg Is there a reason why string fields in OSMattributes have type UTF8String rather than just String or ASCIIString? I'm getting an error on my machine (MacOS, tried Julia 0.3.1 and Julia 0.3.5) when I try to run the tests:
MethodError(convert,(UTF8String,\"Alewife\")), , 0, 0, 0"
When I changed the fields in OSMattributes to String this error went away and the tests all passed again. It seems like the problem is that the xml file elements are being read as ASCIIString but there is no convert function available from ASCIIString to UTF8String (which seems surprising to me).

Could it be that the XML tags are being read as different types on different architectures? It looks like that test is passing on the Travis Linux system.

@garborg
Copy link
Collaborator

garborg commented Jan 12, 2015

@tedsteiner I believe I've run tests and some downstream code on OSX since this was merged, but perhaps only on 0.4 -- I'll see how 0.3 is looking here on my laptop and let you know.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 12, 2015

Apologies, yeah, it was originally just String, but some discussion with @mlubin led me to think that it'll improve performance to have UTF8String, as a concrete type, rather than using String. ASCIIString won't work here, since some of the attributes have chinese characters. Although the bottleneck was later found to be in libexpat, I haven't yet changed back to using String.

@garborg
Copy link
Collaborator

garborg commented Jan 13, 2015

@tedsteiner On the latest commit(367af58), tests pass for me on the heads of both Julia release-0.3 and master for OSX, as they did on Travis for Linux. I'm not sure why it's giving you trouble.

By all means, we can loosen up the type if it's getting in the way of your work, but I'm generally a fan of letting the type system work its magic and generate as specialized of code as possible, even when it's not the main bottleneck.

@tedsteiner
Copy link
Owner

@garborg Below is the error message I'm getting on OS X. I was able to run the tests on my Linux machine without any problems last night. Does anything stand out to you?

It seems to be reading all of the tags as ASCIIStrings for me, then it is unable to convert them to UTF8Strings. Which seems silly, because why couldn't we do that conversion? But it probably just hasn't been written. When I change over to just using Strings, this error goes away and everything works. But I do agree with your statement that we should let the type system generate as specialized of code as possible.

julia> Pkg.test("OpenStreetMap")
INFO: Testing OpenStreetMap
testing read_data ...
ERROR: "MethodError(convert,(UTF8String,"")), , 0, 0, 0"
in convert at base.jl:13
in reset_attributes! at /Users/mirage/.julia/OpenStreetMap/src/parseMap.jl:44
in collectValues at /Users/mirage/.julia/OpenStreetMap/src/parseMap.jl:192
in streaming_end_element at /Users/mirage/.julia/LibExpat/src/streaming.jl:90
in XML_Parse at /Users/mirage/.julia/LibExpat/src/lX_common_h.jl:4
in parsefile at /Users/mirage/.julia/LibExpat/src/streaming.jl:221
in getOSMData at /Users/mirage/.julia/OpenStreetMap/src/parseMap.jl:215
in include at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib
in include_from_node1 at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib
in anonymous at no file:56
in include at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib
in include_from_node1 at loading.jl:128
in process_options at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib
in _start at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib
while loading /Users/mirage/.julia/OpenStreetMap/test/read_data.jl, in expression starting on line 14
while loading /Users/mirage/.julia/OpenStreetMap/test/runtests.jl, in expression starting on line 11

@garborg
Copy link
Collaborator

garborg commented Jan 13, 2015

Huh, that seems like a bug in Julia 0.3.5... Let me download the app and see if I can reproduce. Perhaps we managed to trigger that "Convertacolypse" error.

Probably unrelated, but I just tagged a new version LibExpat, so probably worth running Pkg.update to get Yeesian's speedup and dual 0.3 and 0.4 support.

@tedsteiner
Copy link
Owner

So here's something weird.

First, I tried loading data with streets including Chinese characters, to see if that triggered the tags to be read as UTF8Strings. I got the same error.

Then, I tried doing the following:

Pkg.test("OpenStreetMap")
using OpenStreetMap
nodes, hwys, builds, feats = getOSMData("filename")

The test fails with the error message above, but then getOSMData works as expected.

Running getOSMData in a fresh Julia session fails, and it continues to do so if you try it repeatedly. So there must be something about how the test is either spinning up libexpat or crashing it, in that it gets it to start reading as UTF8Strings after that (or, possibly, the convert function magically engages).

@garborg
Copy link
Collaborator

garborg commented Jan 13, 2015

Shoot, I can't replicate this running the 0.3.5 app with OpenStreetMap checked out, with LibExpat on either v0.0.6 (newly tagged) or v0.0.4 (last 0.3 compatible version).. frustrating. At this point, I'd say switch to using String for now, but I just checked and it looks like that adds 25% overhead.

@tedsteiner
Copy link
Owner

Do you know what version of libexpat you're using? My Mac has libexpat.1.5.2.dylib in /usr/lib/, and I'm assuming it's using that. It looks like Ubuntu 14.04 uses version 2.1.0. I tried brew install expat, which gave me version 2.1, but I still got the same error. I don't have any issues on my Linux machines.

Now I'll admit I don't really understand LibExpat.jl, but here's something that does concern me a little. In streaming.jl, it looks like the elements are passed around as a String, then we are making the assumption that it is really a UTF8String and treating it as such in parse_tag(). Isn't this asking for trouble?

Since we're the ones introducing the type conflict, I think the bug is in OpenStreetMap.jl not LibExpat.jl. If we want a robust speed up, I think we need to get LibExpat.jl to be able to guarantee that its output is a UTF8String, before we can treat it as such in all cases.

I think that robustness needs to be the top priority, even though I want to parse files 25% faster. This is a pretty serious bug for anyone who shares my situation (they won't be able to read OSM data), and I don't want this to be a new user's first experience with OpenStreetMap.jl.

It seems like the options are:

  1. Use String
  2. Get LibExpat to use UTF8String (which will push the bug into their court)
  3. Use UTF8String and provide (or request?) a conversion from ASCIIString

Thoughts? I'm leaning towards Option 1 until Option 2 becomes available.

@yeesian
Copy link
Collaborator Author

yeesian commented Jan 14, 2015

I'll lean towards using String for now too. We can optimize for speed later

@garborg
Copy link
Collaborator

garborg commented Jan 14, 2015

That sounds good. For the record, though, there shouldn't be anything stopping us from changing the code right back once the issue disappears:

I don't think any of our code is the issue, it seems like a configuration issue or that somehow Ted's machine but not mine or Travis is triggering the tuple/convert bug that's been messing with 0.3 and 0.4 for so long.

@tedsteiner The line you linked to in LibExpat.jl is not a problem -- there's a ::String annotation, but the function returns either an ASCIIString or a UTF8String and the annotation does not throw an error or cause a conversion -- it just passes them along unchanged after checking that they are Strings, which they are.

@tedsteiner
Copy link
Owner

@garborg Yes, it's not a problem for LibExpat.jl, but it is a problem for us, right? Because LibExpat is happy to work with either ASCIIString or UTF8String, but we're only content when it delivers an ASCIIString. It seems risky for us to assume it's a UTF8String coming through, when it could be either.

@garborg
Copy link
Collaborator

garborg commented Jan 14, 2015

We're forcing conversion from ASCIIString to UTF8String, ala,

type A
       x::Int
end

A(1.0) # -> A(1)

which is not ideal for files that will have only ASCIIStrings, but better than keeping it at String because of how that keeps all our downstream methods from inlining specialized methods during compilation (maybe could cause type instability, too?).

The fact that the convert function is broken for you suggests a known scary bug in Base Julia where using a library or user code that creates a pair or a tuple (in certain, but legitimate ways) can magically break convert:
JuliaLang/julia#8818
JuliaLang/julia#8631

Here's my package status running off a clean install of OpenStreetMap -- older versions of some packages, like Color, could trigger the bug, so if we're lucky, you just have an old version of one of these packages:

julia> Pkg.status()
1 required packages:
 - OpenStreetMap                 0.7.0+             master
16 additional packages:
 - BinDeps                       0.3.7
 - Cairo                         0.2.21
 - Color                         0.3.15
 - Compat                        0.2.9
 - DataStructures                0.3.5
 - FixedPointNumbers             0.0.6
 - Graphs                        0.5.0
 - Homebrew                      0.1.13
 - IniFile                       0.2.4
 - JSON                          0.4.0
 - LibExpat                      0.0.4              pinned.3ee8e669.tmp # I also tested against 0.0.6
 - LightXML                      0.1.9
 - SHA                           0.0.3
 - Tk                            0.2.16
 - URIParser                     0.0.3
 - Winston                       0.11.7

Otherwise, I give up, and perhaps we can switch back after it fixes itself someday, or perhaps getting BinDeps up for LibExpat would be a good side project for one of us.

@tedsteiner
Copy link
Owner

Yes! That's it! I had Color pinned at an old version to try to fix that devil bug from a long time ago. I went through and cleaned out some other packages I don't use anymore, either, and now I can parse with UTF8String.

tedsteiner added a commit that referenced this pull request Jan 14, 2015
Signed-off-by: Ted Steiner <tsteiner2@gmail.com>
@garborg
Copy link
Collaborator

garborg commented Jan 14, 2015

Awesome! That is a really nasty bug, luckily people are chipping away at it, and fingers crossed it's fixed completely soon. Glad it wasn't indicative of a LibExpat issue that we'd have to keep worrying about. Sorry it took me so long to think of the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants