
ABM Cleaning Protocol

Binny Paul edited this page Nov 30, 2020 · 6 revisions

Overview

Currently, an ABM run with 5 iterations takes about 9-10 GB of disk space, about 95% of which is in the output folder. While a run is being used and analyzed, it is wise to keep all of the files (to fully answer any questions about the run, or to be able to pull and assemble data from multiple aspects of the run). However, after the run is complete, ODOT requires that cleaning (space reduction) be applied before the ABM results are stored in a backed-up server location.
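The share of disk space held by each top-level folder can be checked before and after cleaning. The following is a minimal Python sketch, not part of the ABM tooling; the function names are hypothetical, and it assumes the scenario lives under a single root folder.

```python
from pathlib import Path


def dir_size_bytes(folder: Path) -> int:
    """Total size in bytes of all regular files under folder (recursive)."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def size_share_by_subfolder(scenario_root: Path) -> dict[str, float]:
    """Fraction of the scenario's total size held by each top-level entry."""
    sizes = {}
    for entry in scenario_root.iterdir():
        # Folders are measured recursively; loose files count as themselves.
        sizes[entry.name] = dir_size_bytes(entry) if entry.is_dir() else entry.stat().st_size
    total = sum(sizes.values())
    return {name: (size / total if total else 0.0) for name, size in sizes.items()}
```

Calling `size_share_by_subfolder(Path("."))` from the scenario root returns a dictionary of folder names and fractions, which makes it easy to confirm where the space is going.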

Cleaning Steps

Step 1 - ABM_OutputClean.R

The first step of the cleaning process is to run the script ABM_OutputClean.R. This script is stored in the "scripts" folder, but is assumed to be run from the root directory of the given scenario that is to be cleaned. This script will reduce the total disk size by approximately 60%. This script will delete:

  • all of the transit-specific version files, along with the bike and walk (MAZ level) version files and other interim version files;
  • the PM peak highway assignment version file, since the final summary version file is a copy of the PM file with all other period data appended (meaning it's the PM version file, plus some);
  • all interim files related to the commercial vehicle and external model;
  • all interim iteration files of the ABM run (the script takes the final iteration's ABM result set and renames it without the iteration reference, removing the "_X" notation from the file name, where X is the last iteration of the ABM); and
  • all omx copies of the skim data (noting that skims are also stored in the version files). The one exception is "outputs\skims\taz_skim_sov_pm.omx". This file is retained in case a specific hour assignment needs to be run, because the skim is needed to re-run CVM to generate period-specific demands / assignments. If a specific hour or specific period run is not needed, this skim can be removed by hand (the cleaning script keeps it).
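The renaming of the final iteration's files can be illustrated with a short Python sketch. This is not the logic of ABM_OutputClean.R (which is an R script); it assumes the "_X" notation sits immediately before the file extension, and the function name and the example file name are hypothetical.

```python
import re
from pathlib import Path


def strip_iteration_suffix(filename: str, last_iteration: int) -> str:
    """Drop a trailing "_X" (X = last iteration) that sits just before the extension."""
    path = Path(filename)
    # Only an exact "_X" at the end of the stem is removed; "_15" is not touched when X is 5.
    stem = re.sub(rf"_{last_iteration}$", "", path.stem)
    return str(path.with_name(stem + path.suffix))


# Hypothetical file name, for illustration only.
print(strip_iteration_suffix("results_5.csv", last_iteration=5))
```

Files from earlier iterations (for example a name ending in "_4" when the last iteration is 5) are left unchanged by this function, since those are the interim files that the cleaning step deletes rather than renames.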

If for some reason the application/scenario requires that some of the files above be stored long term, the analyst should either remove the unneeded files by hand (saving the ones that need to be saved), or modify the scenario's copy of the script, commenting out (not removing) the lines for files that should be kept.

Of specific note - this script removes all raw transit assignment files. If the scenario involved transit flow bundles, for example, this step should be modified to save the right files (in this case, the right transit Visum version files).
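One way to think about scenario-specific exceptions is as a keep-list applied to the deletion candidates. The Python sketch below is an illustrative alternative to commenting out lines in the script, not how ABM_OutputClean.R works; the function name, the file names, and the wildcard pattern are all hypothetical.

```python
from fnmatch import fnmatch


def files_to_delete(candidates: list[str], keep_patterns: list[str]) -> list[str]:
    """Return the candidates that match none of the keep patterns."""
    return [
        path for path in candidates
        if not any(fnmatch(path, pattern) for pattern in keep_patterns)
    ]


# Hypothetical candidates; a transit flow bundle scenario keeps its transit version files.
candidates = ["outputs/network/transit_example.ver", "outputs/network/bike_example.ver"]
print(files_to_delete(candidates, keep_patterns=["*transit*"]))
```

Keeping the exceptions in one list makes it easy to record, alongside the stored scenario, which files were deliberately retained and why.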

Manual Removal of Additional Files

The largest files that the above process does not delete are the 4 other time of day highway assignment version files:

  1. outputs\network\Highway_Assignment_Results_EA.ver
  2. outputs\network\Highway_Assignment_Results_AM.ver
  3. outputs\network\Highway_Assignment_Results_MD.ver
  4. outputs\network\Highway_Assignment_Results_EV.ver

These are saved so that time of day specific flow bundles can be produced. If the scenario / request does not require time of day flow bundles beyond the PM (which is in the "Final" version file), then these four files should also be removed, reducing the total scenario space by another 2+ GB, or 25%.
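The manual removal of the four files listed above could be scripted along the following lines. This is a hedged Python sketch, not an ODOT-provided tool: the function name is hypothetical, it assumes it is given the scenario root directory, and it defaults to a dry run so that nothing is deleted until the list has been reviewed.

```python
from pathlib import Path

# Periods other than PM, as listed above; the PM data is in the "Final" version file.
OTHER_PERIODS = ("EA", "AM", "MD", "EV")


def remove_time_of_day_versions(scenario_root: Path, dry_run: bool = True) -> list[Path]:
    """Find the four non-PM highway assignment version files and optionally delete them."""
    network_dir = scenario_root / "outputs" / "network"
    found = []
    for period in OTHER_PERIODS:
        ver_file = network_dir / f"Highway_Assignment_Results_{period}.ver"
        if ver_file.is_file():
            found.append(ver_file)
            if not dry_run:
                ver_file.unlink()
    return found
```

Calling `remove_time_of_day_versions(Path("."))` only reports the files it found; passing `dry_run=False` deletes them.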

Final Assessment

After these two steps are complete, the ABM scenario's total disk space should be between 1 and 2 GB - appropriately sized for server storage (it might be larger if the specifics of the scenario require that extra files be kept). The analyst can do another review to remove any additional unneeded files, but the project team authorizing this instruction set has already done a review, and the files remaining after these two steps are fairly unique and important to understanding the run, so they likely all should be kept. Still, there are always special circumstances, and user judgment might suggest that additional files should be removed.
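For the optional final review, it can help to list the largest files that remain. The following Python sketch is an illustration only; the function name and the default of ten files are arbitrary choices, not part of the protocol.

```python
from pathlib import Path


def largest_files(scenario_root: Path, top_n: int = 10) -> list[tuple[Path, int]]:
    """Return the top_n largest files under scenario_root as (path, size in bytes)."""
    files = [(p, p.stat().st_size) for p in scenario_root.rglob("*") if p.is_file()]
    return sorted(files, key=lambda item: item[1], reverse=True)[:top_n]
```

The result is a starting point for user judgment, not a deletion list: a large file that remains after the two steps above is likely one that should be kept.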

PopulationSim

PopulationSim is a linked tool; it is part of the ABM process. In order to run the ABM, a synthetic population must be generated as an input, and the tool that ODOT uses to generate that input is PopulationSim. PopulationSim is only run when the population or household component of the ABM scenario is impacted, so there will not be a need to review and reduce the size of a PopulationSim run for every ABM scenario. However, a PopulationSim run can take up a gigabyte or two of space in addition to the ABM, so monitoring and managing the space of a PopulationSim run is an important part of the cleaning protocol.

Cleaning Steps PopulationSim

There are two major contributors to large PopulationSim run size:

  1. Raw seed data
  2. The HDF5 data pipeline (output\pipeline.h5)

When storing a PopulationSim run on the server, it's good to consider both of these sources of run size and to remove these files if possible. The other files in a PopulationSim run should be relatively small or, if not small, are likely essential, like the final output used as the input to the ABM. The other potential set of files to clean/remove would be any large documentation files that aren't actually used in the run of PopulationSim. If those exist, they should be reviewed and considered for deletion.

Raw seed data

It is common practice to document the process of taking raw ACS PUMS data and processing it to be the input "seed" table for PopulationSim. In order to allow a user to fully understand these steps, the practice has been to have the raw ACS data (typically for the entire state) as an initial input to a processing script. The script then documents all the steps to take the raw PUMS product and transform it into the seed for PopulationSim. This transformed seed is typically a much smaller file because (a) many of the fields of the raw data are removed (since they are not needed), and (b) the seed typically covers only a sub-region of the state (one to several PUMAs), as opposed to all PUMAs in the state (or multi-state region).

It's important to save the final seed with the run, so that PopulationSim can be re-run, but it is not critical to save the raw (typically statewide) PUMS data with every PopulationSim run. What is suggested is that raw PUMS census data for the state (or states, if needed) be saved in one location on the agency's shared drives. The seed processing scripts should then point to that single location as opposed to a local directory / copy. In this way the raw data and processing steps can be kept as a record, without duplicating the identical and large raw data set for every application of PopulationSim that uses that raw data file.
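The idea of a seed script that reads raw PUMS data from one shared location, and writes only the smaller seed into the run folder, can be sketched in Python as follows. This is not ODOT's seed processing script: the function name is hypothetical, it assumes the raw file is a CSV with a "PUMA" column, and the fields to keep are supplied by the caller.

```python
import csv
from pathlib import Path


def build_seed(raw_pums_csv: Path, seed_csv: Path, pumas: set[str], keep_fields: list[str]) -> int:
    """Write rows for the selected PUMAs, with only keep_fields, and return the row count."""
    written = 0
    with raw_pums_csv.open(newline="") as src, seed_csv.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every raw field that is not in keep_fields.
        writer = csv.DictWriter(dst, fieldnames=keep_fields, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            if row["PUMA"] in pumas:  # assumed column name in the raw file
                writer.writerow(row)
                written += 1
    return written
```

In use, `raw_pums_csv` would point at the single shared copy of the raw data, while `seed_csv` would sit inside the PopulationSim run folder, so that only the small seed travels with each run.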

HDF5 data pipeline

The HDF5 data pipeline (output\pipeline.h5) is the largest single file in a PopulationSim run. This is because the file contains the data from every step of PopulationSim's run sequence, so that the run can be restarted at any point and data can be pulled from any point of the run to understand the data processing at various steps. The data pipeline is therefore valuable, but the final synthetic population and household tables are also written out as separate outputs, so the pipeline may not be necessary. For this reason, the project manager should consider whether the data pipeline should be kept (or stored) whenever moving a copy to the server.

One important reason to save the data pipeline is repop mode. The PopulationSim tool has a feature called repop. This feature allows a user to come back to a previously run PopulationSim scenario (say a final RTP adopted land use scenario). If that scenario has a Traffic Impact Assessment (TIA) or similar, where only a few zones are impacted within the entire region, the user can make a copy of the PopulationSim run/scenario and then use the repop feature to only impact a handful of zones (as many as are specified by the user), leaving the rest of the region unchanged. This feature makes it easier to do an apples-to-apples comparison, where all zones are the same except for the handful of zones that were meant to be altered. Under a normal (full) run, the changes to a handful of zones might have ripple effects that change the values of many other zones throughout the modeling region. So the repop feature can be a very helpful option for some scenarios.

Repop is not a full PopulationSim run. It uses the data pipeline from a copy of a previous run and changes only the specified zones, in a way that is consistent with how the other households and persons for the region were developed. It then exports a new set of synthetic output tables in which the desired zones have been altered as specified. This feature is not available if the data pipeline has been deleted.

The point is that, when deciding whether to keep or delete the data pipeline, consider whether it can be anticipated that the given PopulationSim run will be used as the basis of a repop run in the future. The answer to that question will help determine whether this file, which is the majority of the space in a PopulationSim run, can be removed or needs to be kept before storing the run on the server.
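The keep-or-delete decision for the pipeline can be made explicit in a small helper. The Python sketch below is an assumption-laden illustration, not part of PopulationSim: the function name and return strings are hypothetical, the repop question is answered by the caller, and deletion only happens when the dry run is switched off.

```python
from pathlib import Path


def clean_pipeline(popsim_run: Path, keep_for_repop: bool, dry_run: bool = True) -> str:
    """Delete output/pipeline.h5 unless it is needed as the basis of a later repop run."""
    pipeline = popsim_run / "output" / "pipeline.h5"
    if not pipeline.is_file():
        return "missing"
    if keep_for_repop:
        return "kept"
    if dry_run:
        return "would delete"
    pipeline.unlink()
    return "deleted"
```

Making `keep_for_repop` a required argument forces the question to be answered for every run before it is stored on the server.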
