Progress report and discussion topics for 12-19-14 meeting #10

fscottfoti · 2014-12-17T22:14:38Z

Progress since the last meeting

Infrastructure

Pull request for (YAML-based) general choice modeling in UrbanSim
Defaults for some standard data sources and variable definitions that are usable for all regions
A notebook for moving data from the directory structure into a single HDF5 file
A notebook for browsing the registered tables and running the example models
A pull request for integrating OMX Python support
A pull request to wrap a 2D matrix with a Skim object
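
As a rough illustration of wrapping a 2D matrix with a Skim object, here is a minimal sketch. It is not the code from the pull request: the class layout, the `get` method, and the `offset` parameter are assumptions made for this example, and the matrix values are made up.

```python
import numpy as np


class Skim:
    """Hypothetical wrapper around a 2D origin-by-destination matrix."""

    def __init__(self, matrix, offset=0):
        # offset converts zone ids to 0-based indices
        # (e.g. -1 when zone ids start at 1).
        self.matrix = np.asarray(matrix)
        self.offset = offset

    def get(self, origins, destinations):
        # Vectorized lookup: one value per (origin, destination) pair.
        o = np.asarray(origins) + self.offset
        d = np.asarray(destinations) + self.offset
        return self.matrix[o, d]


# Made-up 3-zone distance matrix with 1-based zone ids.
distance = Skim([[0.0, 2.5, 4.0],
                 [2.5, 0.0, 1.5],
                 [4.0, 1.5, 0.0]], offset=-1)

print(distance.get([1, 3], [2, 2]))  # prints [2.5 1.5]
```

Keeping the lookup vectorized means a whole column of origin ids and a whole column of destination ids can be resolved in one call.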

Specific Models

We now have auto ownership and workplace location choice basically working on MTC data; the latter uses the OMX and Skim objects. There are a few design decisions that can be made at this point. Here is a proposal for how the models could work.
- An example implementation directory (these are client specific)
- The configuration is still a csv file with the same basic form as the current UPCs.
- Note that this is different from the current YAML approach in UrbanSim and is discussed in this issue
- In short, models are specified in the csv file and configured in Python code like this. Transformations that are more complicated than those allowed by Patsy can be specified in the CSV file - we use DataFrame.eval, or straight Python eval when there is an @ as the first character. This allows more flexibility in the CSV than the current YAML files provide.
- Additionally, there are simulation settings here
- An example of using a skim object is here: the OMX file is read here, the specific matrix is selected here, and these are injected and configured here
- A dictionary of skims is passed to "simple simulate". A specified column (e.g. TAZ) should occur in both the choosers, where it is used as the origin id, and the alternatives, where it is used as the destination id
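
To make the "@" rule in the proposal above concrete, here is a minimal sketch of how one expression from the CSV could be evaluated. The function name, the example columns, and the expression-to-coefficient layout of `spec` are assumptions for illustration only, not the actual activitysim code.

```python
import numpy as np
import pandas as pd


def evaluate_expression(expr, df):
    """Evaluate one spec expression against a table of choosers.

    Expressions starting with "@" go through Python's eval with the table
    available as `df`; everything else goes through DataFrame.eval.
    """
    expr = expr.strip()
    if expr.startswith("@"):
        # Plain Python eval: only appropriate for trusted config files.
        return eval(expr[1:], {"np": np, "pd": pd}, {"df": df})
    return df.eval(expr)


# Made-up choosers table.
households = pd.DataFrame({"income": [20000, 85000, 140000],
                           "persons": [1, 4, 2]})

# Hypothetical spec rows: expression -> coefficient.
spec = {
    "persons > 2": 0.5,
    "@np.log(df.income)": 0.1,
}

# Utility is the coefficient-weighted sum of the evaluated expressions.
utility = sum(coef * evaluate_expression(expr, households).astype(float)
              for expr, coef in spec.items())
print(utility)
```

Because the "@" branch runs arbitrary Python, it buys flexibility at the cost of requiring that the CSV comes from a trusted source.
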
A few design decisions to make
- YAML vs CSV
- and related: how much in configuration vs in code - how comfortable are folks with Python at this level?
- also related: dependencies in UrbanSim right now are basically the sim framework and low-level mnl routines. Should those move to a third library, or is everyone OK with that?
- main concern at this point is performance - we're not slow per se, but not fast and a lot of memory is used. Did I hear that folks actually parallelize households on different processors?
Next 2 weeks (actually first 2 weeks of Jan)
- If we're comfortable with this proposal, we can comment, document, and test the activitysim code
- Size variables in workplace location choice
- Next model is Coordinated daily activity pattern unless we want to head to something complicated

Thoughts?

fscottfoti · 2014-12-18T21:54:32Z

One other thing to add to this list is naming conventions - PEP8 conventions say variables should be named number_persons_age_5_to_15 while most of the variables currently are named like numPersAge5to15. I've changed most of these in the code to match PEP8 conventions but want to make sure people are OK with this. Column names in tables might need to be changed too which I haven't done yet.
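
If table column names are changed as well, one low-risk way is an explicit rename map. This is only a sketch: `numPersAge5to15` comes from the example above, while `autoOwnership` and the table contents are made up for illustration.

```python
import pandas as pd

# Hypothetical mapping from legacy camelCase names to PEP8-style names.
# Abbreviations such as "num" or "Pers" cannot be expanded mechanically,
# so the mapping is written out by hand.
COLUMN_RENAMES = {
    "numPersAge5to15": "number_persons_age_5_to_15",
    "autoOwnership": "auto_ownership",
}

households = pd.DataFrame({"numPersAge5to15": [0, 2],
                           "autoOwnership": [1, 2]})
households = households.rename(columns=COLUMN_RENAMES)

print(list(households.columns))
# prints ['number_persons_age_5_to_15', 'auto_ownership']
```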

danielsclint · 2014-12-19T06:06:13Z

@fscottfoti Thank you for pulling together this thorough update on the progress for the last two weeks. It's been nice to follow the progress you and Matt are making via the frequent documentation in GitHub.

A few thoughts on where we are and where we may want to go on the issues outlined above.

Code

I would like to walk through some of the classes and configurations set up over the past two weeks.

Configuration Files:

I NEVER want to see another binary XLSX configuration file ever again. You can't imagine how often we have been burned by losing track of binary file changes in GitHub.
It may be beyond the scope of this work product, but I would love for configuration / estimation information to be stored in a database with a list of definitive default values and isolated overrides for each scenario. This past year we have done over 350 unique model runs where input parameters are largely the same across runs. However, since all inputs are stored in a file-based system, we are copying the same information from scenario folder to scenario folder, wasting space. A more ideal solution would be to store the inputs in a database keyed by scenario where, unless an override was explicitly asserted, the model would use default values.
How much configuration to put in code is a balancing act. I would advocate for as agnostic an implementation as possible in code, with much of the configuration outside the Python code base. However, not being as familiar with the Python system, I would be interested in discussing the pros and cons.
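
The defaults-plus-overrides idea above can be sketched with an in-memory SQLite database. This is one possible layout, not a proposed design; the table names, parameter names, and values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE defaults (param TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE overrides (scenario TEXT, param TEXT, value TEXT,
                            PRIMARY KEY (scenario, param));
""")
# Defaults are stored once; a scenario stores only what it changes.
conn.executemany("INSERT INTO defaults VALUES (?, ?)",
                 [("auto_operating_cost", "18.3"), ("sample_rate", "1.0")])
conn.execute("INSERT INTO overrides VALUES (?, ?, ?)",
             ("high_fuel_price", "auto_operating_cost", "27.0"))


def scenario_settings(conn, scenario):
    """Return {param: value}, using an override only where one exists."""
    rows = conn.execute("""
        SELECT d.param, COALESCE(o.value, d.value)
        FROM defaults AS d
        LEFT JOIN overrides AS o
          ON o.param = d.param AND o.scenario = ?
    """, (scenario,))
    return dict(rows)


# auto_operating_cost is "27.0" here; sample_rate falls back to "1.0".
print(scenario_settings(conn, "high_fuel_price"))
# A scenario with no overrides gets the defaults unchanged.
print(scenario_settings(conn, "baseline"))
```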

Relationship to UrbanSim

I would rather pull common libraries where possible into a common-util type of project that is shared between UrbanSim and ActivitySim. We are wading into the UrbanSim world at SANDAG, but I don't think I should have to have UrbanSim "installed" to run ActivitySim.

Performance

Let's see how well the model is working from an implementation perspective. Is it estimating well? Is it producing reasonable results? If performance becomes an issue (and I would advocate we determine what slow is first), we can look at that a little later.
I hate buying expensive computers / servers. I have an ungodly amount of unused computing capacity sitting on the desks of SANDAG staff. I would love to explore a SETI-like system where I could run ActivitySim in the background of some accountant's computer while they're at lunch.
The parallelization in CT-RAMP is not implemented particularly well. It does distribute the load, but it's not terribly effective at keeping track of requests (we occasionally lose work packets), and I also believe (but cannot prove yet) that the parallel processing is the root cause of some of the simulation noise. In effect, I don't think CT-RAMP is doling out the random number seed chain in the same order to each of the nodes from run to run, even with a fixed seed. We have tamped this down quite a bit in our implementation, but it is something to keep in mind with parallelization. I think this problem has been resolved (or at least claims to be) in the latest implementations of DaySim.
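
One common way to avoid this kind of run-to-run noise is to derive each household's random stream from the run seed and the household id, rather than handing out draws from one shared stream. The sketch below illustrates that technique only; it does not describe how CT-RAMP or DaySim work, and the function names and seed value are made up.

```python
import numpy as np

BASE_SEED = 42  # hypothetical run-level seed


def household_draw(household_id, base_seed=BASE_SEED):
    # Seed from (run seed, household id) instead of a shared stream, so
    # the draw does not depend on which worker handles the household or
    # on the order in which households are processed.
    rng = np.random.default_rng([base_seed, household_id])
    return rng.random()


def simulate(household_ids):
    return {hh: household_draw(hh) for hh in household_ids}


one_worker = simulate([1, 2, 3, 4])
# Same households split across two "workers", in a different order.
two_workers = {**simulate([4, 2]), **simulate([3, 1])}

print(one_worker == two_workers)  # prints True
```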

Naming Conventions

PEP8 Standards +1

Next Models

I defer to the modelers.

e-lo · 2014-12-19T07:15:03Z

I concur with @danielsclint that this update was very fun to review and I appreciate the sending of this report with links to everything. Pretty Sweet.

One thing that may be useful in the future is to discuss specific design questions in their own issues so they are easier to follow. But for now, I will follow Clint's lead and just add my thoughts here.

Config file formats

I agree with Clint about "no binary" and prefer YAML slightly to CSV
I disagree with Clint about wanting to have a database with specifications. I feel this is overly complex and less traceable. We use git to track different specs at SFCTA. However, I think an in-depth discussion about the relationship between /examples/ and defaults will be important. The way I see it eventually working is that ActivitySim is a code base that I (well, the future person playing my role at SFCTA) would reference to implement CHAMP, i.e. CHAMP is its own git repo full of specifications and a set of routines to run CHAMP, but it is just using ActivitySim building blocks (that will probably look a lot like /example/TravelModelOne).
I would reallllly like to separate the "codebase" from the "travel model".

Performance

@danielsclint's plan to take over the accountants' computers on their lunch hour is intriguing (and something I've done before using Cube Cluster) ...and I think this points to a more disaggregate file system and extremely careful random number seed alignment.
I like for parallelization to be flexible. Sometimes I have 30 CPUs and 8 models to run, sometimes 100 CPUs and 2 models to run (or something like that but with better math). Our SF-CHAMP implementation can deal with either flexibly and we just name which computers we want it to use when we start the run...the list could be 1 or dozens.

guyrousseau · 2014-12-19T14:01:39Z

It was interesting to keep up with software development over the past couple of weeks, very transparent process, much appreciated, many thanks. That said, a few comments:

I agree that we should not have to have UrbanSim installed in order to run ActivitySim, though leveraging the UrbanSim architecture is most definitely ok and the way to go.

Distributed processing across servers should be as flexible as possible.

CDAP as the next model component to tackle is one more step heading in the right direction with this project moving forward.

jiffyclub closed this as completed Feb 10, 2015

bstabler pushed a commit that referenced this issue Oct 20, 2021

Merge pull request #10 from SANDAG/develop

c61f397

Merge in changes to Develop

jpn-- pushed a commit that referenced this issue Feb 13, 2024

Merge pull request #10 from wsp-sag/develop

a3cb622

Bring in the latest Develop
