
+Arrow/Feather #37

Open · e-lo opened this issue Sep 14, 2020 · 31 comments


@e-lo
Member

e-lo commented Sep 14, 2020

I'd like to propose that we evaluate the feasibility of supporting the faster Arrow-based data format.

@billyc
Member

billyc commented Sep 14, 2020 via email

@e-lo
Member Author

e-lo commented Sep 14, 2020

Proposal based on:

  1. Feedback that OMX was too slow to use in production (noted by some in the data standards learning session)
  2. Use of these data formats in other contexts because of their speed benefits, and the sense that HDF5 is not a standard most people are adopting these days (as far as I can tell)

@pedrocamargo

I would love for the next iteration of OMX to be based on Arrow, but is the objective of OMX to be used in production now?

@e-lo
Member Author

e-lo commented Sep 15, 2020

> is the objective of OMX to be used in production now?

That's a good question for the organizing group (which is who, these days?). In practice, it is being used in production.

@pedrocamargo

I also use it in production and made AequilibraE capable of using it as well. However, if the OMX mission changes, then it would be worth exploring other data formats to make sure we get it right.
Also, would we ask software providers to switch to the new format? Or will we support both?

@billyc
Member

billyc commented Sep 15, 2020 via email

@gregorbj

I think that supporting an Arrow-based format and other formats in the future is probably necessary if OMX is to endure as anything more than an exchange format. The spec would have to become more abstract. One issue will be how specific the spec should be about data structure. For example, it is my (limited) understanding that Arrow supports storage of tabular data in columnar format, where each column can store a different data type. This is the approach that VisionEval takes. OMX stores matrix data in a matrix format. So what should the spec say in that regard? There might need to be a part of the specification to deal with each type of backend that is supported: if HDF5, how it is structured; if Arrow, how it is structured; etc. Or maybe the spec is entirely functional, identifying functions that must be supported.
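
As an illustration of the layout question, here is a minimal sketch (purely hypothetical - the column naming and one-column-per-destination-zone layout are my assumptions, not anything the spec has settled) of mapping a skim matrix onto Arrow's columnar model with pyarrow:

```python
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

# Hypothetical layout: a square skim matrix stored as an Arrow table with
# one column per destination zone. This is just one of several possible
# mappings the spec could choose (or abstract over).
skim = np.random.rand(100, 100).astype(np.float32)
table = pa.table({f"d{j}": skim[:, j] for j in range(skim.shape[1])})
feather.write_feather(table, "skim.feather")

# Reassemble the matrix from its columns on read.
back = feather.read_table("skim.feather")
skim2 = np.column_stack([back.column(n).to_numpy() for n in back.column_names])
```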

@pedrocamargo

@billyc, I was referring to using OMX in production and not as a common format for transfer between platforms (the latter was my understanding of the mission, but I am probably mistaken and remember only part of it).

@bstabler bstabler self-assigned this Sep 17, 2020
@bstabler
Member

bstabler commented Sep 17, 2020

I like this idea and would like to discuss it further. Anyone interested in discussing, please comment on this thread and then we can brainstorm next steps - maybe a meeting to discuss, maybe a prototype, etc. Thanks!

@bstabler
Member

If we're thinking about a next version, let's include other potential ideas as well - more flexibility, more data types, better API conformity, CI for testing APIs, better viewers, etc.

@e-lo
Member Author

e-lo commented Sep 17, 2020

@bstabler - perhaps:

  1. create a feature-request issue template
  2. make a call for feature requests (more broadly than just here on GitHub)
  3. ask people to comment about their support
  4. develop a backlog for next version

@jpn--

jpn-- commented Oct 13, 2020

Apparently I was not "watching" and didn't see this conversation initially. Count me in 👍

@jeabraham

How interesting! HDF5 is primarily a disk storage format, with an option to force in-memory. Arrow is exclusively an in-memory format, right? So the two are complementary.

I've never been a big fan of HDF5, but don't see Arrow as a way to get away from HDF5.

Arrow sure would be nice for letting us use higher-performance libraries without having to go through disk storage just to work in another platform or language for a bit.

@e-lo
Member Author

e-lo commented Oct 14, 2020

> Arrow is exclusively an in-memory format, right?

Feather is its on-disk complement.

@pedrocamargo

And Arrow+Feather is ridiculously fast...

@jpn--

jpn-- commented Oct 19, 2020

> And Arrow+Feather is ridiculously fast...

Did some noodling on this over the weekend. +1 to ridiculously fast ... not just "I don't want to wait while the data saves to disk" fast, but bordering on "I don't need to load skims into RAM to use them" fast.
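
(For the curious: the near-zero-RAM behavior comes from Feather V2 files being memory-mappable. A minimal sketch, assuming an uncompressed Feather file named skims.feather:)

```python
import pyarrow.feather as feather

# memory_map=True mmaps the file instead of reading it into RAM; the OS
# faults pages in only as columns are actually touched. The zero-copy path
# requires the file to have been written uncompressed.
table = feather.read_table("skims.feather", memory_map=True)
col = table.column(0).to_numpy()  # only this column's pages get read
```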

@jpn--

jpn-- commented Oct 27, 2020

Talk is cheap. Here instead is a straw man proposal for you all to beat around a bit. https://github.com/jpn--/arrowmatrix

@pedrocamargo

Quite impressive results and effort, @jpn-- !

@pedrocamargo

pedrocamargo commented Nov 12, 2020

The development of the PyTables project (on which OMX relies) seems to be quite slow these days, and there doesn't seem to be any hurry to support the newly released Python 3.9.

PyTables/PyTables#823

@jpn--

jpn-- commented Nov 12, 2020

> The development of the PyTables project (on which OMX relies) seems to be quite slow these days, and there doesn't seem to be any hurry to support the newly released Python 3.9.

I wouldn't worry too hard about not having wheels out on PyPI supporting 3.9 yet. The same applies to plenty of other relevant and very active projects <cough>pyarrow</cough>. Both have 3.9 support on conda-forge.

@pedrocamargo

My concern is a little more with the frequency of updates to the library, @jpn-- , but you are right that the 3.9 release in itself is nothing to worry about for now.

@amotl

amotl commented Nov 17, 2020

Dear Pedro and Jeffrey,

Thanks to @avalentino, PyTables cp39 wheels for Linux are now available on PyPI. See also PyTables/PyTables#823 (comment).

With kind regards,
Andreas.

@pedrocamargo

Has anybody looked further into this change? PyTables still does not have wheels for Python 3.9 for either Windows or macOS, so I would say that the case for migrating to Arrow is getting even better...

@bstabler
Member

bstabler commented Aug 1, 2021

@toliwaga did some further comparisons of HDF5 versus Arrow/Feather for ActivitySim, and the performance gains were not great. If I recall correctly, the results for our typical use case - reading several full matrices into RAM, which is what we do for activity-based models because we need random access to hundreds of millions of cells as fast as possible - were underwhelming. Maybe @toliwaga can add some more details?

Nevertheless, I'm supportive of developing and releasing an updated version of OMX, say v0.3, that supports either HDF5 or Arrow/Feather, because the latter is popular, well supported, and faster for some additional use cases.

@billyc
Member

billyc commented Aug 5, 2021

It would be great to see the results of those comparisons here, if @toliwaga is willing to share them. Otherwise someone will probably ask for it again :-)

@pedrocamargo

My concern, besides the fact that HDF5 has lost a lot of momentum in favor of more modern formats such as Arrow and Feather, is that the use case of just loading all arrays from disk once is a rather narrow one, @billyc.

@e-lo
Member Author

e-lo commented Aug 5, 2021

> the use case of just loading all arrays from disk once is a rather narrow one.

Fully agree. Even within the scope of a travel model, there are lots of uses for the matrices used/created in travel models beyond "running the actual model". I'm surprised that there wasn't a significant amount of time saved. Based on some of what I've read, there should be time saved on serialization/deserialization in addition to raw I/O, as well as significant RAM improvements. The RAM improvements alone are worth considering, since they could reduce the need for specialized "modeling machines".

Another thing to consider is whether Arrow/Feather is the right "storage" mechanism beyond intra-run use, or whether Parquet (which is considered "archival") is. Ideally OMX would deal with either.
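
To make the trade-off concrete, here is a hedged sketch of writing the same table both ways with pyarrow (file names and data are illustrative):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"time": [1.5, 2.0], "dist": [0.7, 1.1]})  # stand-in data

# Feather: very fast to write/read, lightly (or un-)compressed - well
# suited to intra-run scratch storage.
feather.write_feather(table, "skim.feather", compression="uncompressed")

# Parquet: slower to encode/decode but heavily compressed and widely
# supported - the more "archival" choice.
pq.write_table(table, "skim.parquet", compression="zstd")
```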

@billyc
Member

billyc commented Aug 5, 2021

Beyond all the above reasons, HDF5 doesn't have any bindings for JavaScript (and likely never will) -- so it's literally impossible to access OMX skims from front-end browser code without relying on a Node server to broker requests.

It sounds like we have more than enough justifications to at least keep exploring this.

@jpn--

jpn-- commented Aug 5, 2021

> I'm surprised that there wasn't a significant amount of time saved.

The work by @toliwaga on this was in the context of ActivitySim. Overall the time spent loading HDF5 OMX data in an ActivitySim model is tiny compared to the runtime of the whole model -- cutting the plain load time from say 50 seconds to 10 seconds (not @toliwaga's results, just some approximate numbers from what I've played with) doesn't matter much when running everything else takes hours, and that makes it not worth a ton of development effort on the part of the ActivitySim consortium. But as we all agree, that's just one use case.

So I'd like to invite all of you who are interested to look at the straw man proposal I put forth a few months ago, and particularly the implementation details. Post here some thoughts about what's good and what's bad in there. From some more concrete thoughts perhaps we can move past "yes we should talk more about this" to actually outlining a new set of principles we want to pursue in the next version of the standard.

@toliwaga
Collaborator

Sorry to be so slow in responding - I took a very long (and wonderful) summer vacation and am only just sorting through all the stuff that happened while I was away.

I agree with @jpn-- that the ActivitySim use case is not representative, and so my observations may have little bearing on this question.

ActivitySim is a long-running program with many models that do repeated lookups of various skims.

The ordinary use case is that ActivitySim loads all of the skims into memory once at the start of the run and stores them in a large 3-dimensional numpy array (which is placed in shared memory when multiprocessing). The various models access individual skims or skim sets (e.g. drive time for different time periods) via wrappers designed for convenience and legibility in expression files. The initial load time is not very important - what matters is that subsequent skim references are fast and that the data is stored in a way that can be shared across processes.

@jpn-- presented a straw man proposal suggesting that, in addition to other possible advantages, it might be possible to avoid the runtime and memory overhead of preloading the skims by instead reading them just-in-time for skim lookup. The example showed both good performance and a promising near-zero memory footprint.

I played around with that approach to see whether feather files could serve as an alternative to in-memory skims.

The first problem I ran into was that accessing all of the skims would eventually bring all of the skim data into memory. As the documentation says, "Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory." This wasn't obvious in the example @jpn-- provided, because it accessed the same skim repeatedly and so the gradual increase in memory usage didn't show up. I couldn't find any way to free the memory short of opening and closing the file at every access - which slowed the process down.

However, the rapidity of feather file opening suggested a different, analogous approach which I then explored.

I implemented a numpy memmapped skim_dict class as an alternative to the existing ActivitySim in-memory array version. By opening and closing the memmap file just-in-time to perform skim or skim_stack lookups, the memmap implementation avoided the 'leakage' associated with Jeff's approach - at the expense of redundant (albeit rapid) loads of skim data.
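
A minimal sketch of that just-in-time pattern (the function name, shape, and arguments are illustrative, not the actual ActivitySim code):

```python
import numpy as np

def jit_lookup(cache_path, shape, dtype, skim_idx, orig, dest):
    # Open the memmap only for the duration of this lookup ...
    mm = np.memmap(cache_path, mode="r", dtype=dtype, shape=shape)
    # ... copy just the requested cells out of the mapped file ...
    values = np.array(mm[skim_idx, orig, dest])
    # ... then drop the map so the paged-in data can be released.
    del mm
    return values
```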

This resulted in a zero-overhead skim implementation with runtime performance 'only' 60% slower than in-memory skims (a runtime handicap that could possibly be offset by the reduced memory requirements in certain implementations). This is worth exploring - I should think it might be of interest to MPOs with truly gigantic skims, especially if they are more constrained on the memory side than the processor side.

```
# stats below are for a Full run of 3-zone Marin on wrjsofppw01

households_sample_size: # all households
initialize_tvpb num_processes: 20
tour_mode_choice_simulate num_processes: 32?

# skim_dict_factory: NumpyArraySkimFactory
Time to execute run_sub_simulations step mp_tvpb : 670.767 seconds (11.2 minutes)
Time to execute run_sub_simulations step mp_mode_choice : 327.657 seconds (5.5 minutes)?
high water mark rss: 485.85

# skim_dict_factory: MemMapSkimFactory
Time to execute run_sub_simulations step mp_tvpb : 763.762 seconds (12.7 minutes)
Time to execute run_sub_simulations step mp_mode_choice : 525.076 seconds (8.8 minutes)
high water mark rss: 333.78
```

Disabling the tap-tap utility calculation (rebuild_tvpb_cache: False) shows that the memory requirements for the 32-processor tour_mode_choice model run are strikingly low:

Total memory requirements for the 32-processor tour_mode_choice model step with MemMapSkimFactory are 145GB - or under 5GB per process.

This is all - last I checked - easily turned on and off by simply changing the skim_dict_factory setting in network_los.yaml from NumpyArraySkimFactory (the default) to MemMapSkimFactory:

```yaml
#skim_dict_factory: NumpyArraySkimFactory
skim_dict_factory: MemMapSkimFactory
```

This will cause ActivitySim to create a numpy memmap cache file (if it does not already exist), which it then opens and closes just-in-time for each skim access. This should work in either single- or multi-process mode.

This was never really exhaustively tested because it was just a little side project I did on my own time - not something that was part of the funded development effort.

@bstabler bstabler removed their assignment Oct 7, 2021
@bstabler
Member

bstabler commented Oct 7, 2021

Anyone eager to get something going on this topic? I've been too busy to move this along. Thanks.
