Optimize Data Type Usage #673

i-am-sijia · 2023-05-08T20:43:23Z

Background

In the memory profiling work conducted by WSP during the Phase 8 Interim, we found that ActivitySim's memory usage is high when large “choosers” tables with inefficient data types are created. For example, in the work tour scheduling step of the ARC run with a 25 percent sample, the choosers table has 540 million rows and 80 columns, a memory footprint of about 250 GB. Among the 80 columns are five string variables, including tour purpose (e.g., “work”, “school”), tour category (e.g., “mandatory”), and time period (e.g., “AM”); each takes about 32 GB of RAM. If we change the string variables into something like enums using the int8 data type, the memory footprint of this choosers table could be reduced from 250 GB to 102 GB. Many of the other columns unnecessarily use memory-intensive data types such as float64 and int64. A logical next step is to optimize the data types used in ActivitySim as part of the Phase 8 development.

Methodology

String variables

We looked into two alternatives for optimizing string variables:

IntEnum. Presented at the June 27, 2023 Meeting.
Pandas Categorical data type. Presented at the July 18, 2023 Meeting.

The table below recaps the pros and cons of the two alternatives:

In consultation with the consortium members and the bench contractors, we decided to implement pandas Categorical for converting string variables, mainly because the level of effort is lower and it preserves backward compatibility. We will modify the ActivitySim source code so that when a string variable is created, it is converted to pandas categorical.
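A minimal sketch of the conversion, assuming a small stand-in table (the column names and values are illustrative, not taken from an actual ActivitySim run). It compares the memory footprint before and after converting string columns to pandas categorical:

```python
import pandas as pd

# Hypothetical chooser-like table; names and values are illustrative only.
n = 100_000
choosers = pd.DataFrame(
    {
        "tour_purpose": ["work", "school"] * (n // 2),
        "tour_category": ["mandatory"] * n,
    }
)

before = choosers.memory_usage(deep=True).sum()

# Convert the string columns to pandas categorical. Each distinct string is
# stored once, and the rows hold small integer codes instead.
for col in ["tour_purpose", "tour_category"]:
    choosers[col] = choosers[col].astype("category")

after = choosers.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
print(choosers["tour_purpose"].cat.codes.dtype)  # integer codes backing the column
```

With only a handful of distinct values per column, the codes fit in int8, which is where the saving comes from.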

Numeric variables

Numeric variables are created and used in the following sources:

Input data
Pre- and post-annotation of each ActivitySim sub-model
.py source code

For numeric variables in the input data, users can define the data types in settings.yaml (see example here). For numeric variables created in the annotations and the source code, we can create a function that downcasts them based on the value ranges of the variables.
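One way such a range-based downcast could look is sketched below. The helper `downcast_numeric` and the column names are hypothetical and are not the ActivitySim implementation; the sketch relies on the `downcast` option of `pandas.to_numeric`, which picks the smallest dtype that can hold the observed values:

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast where values allow.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that fits the column's value range.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 when pandas judges the values representable.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame(
    {
        "person_num": np.array([1, 2, 3], dtype="int64"),
        "household_id": np.array([10, 70_000, 3_000_000_000], dtype="int64"),
        "distance": np.array([0.5, 12.25, 3.0], dtype="float64"),
    }
)
small = downcast_numeric(df)
print(small.dtypes)
```

In this sketch `household_id` stays int64 because one value exceeds the int32 range, which shows why the decision has to depend on the data rather than on the column's declared type.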

Implementation Details

Overview

The string to pandas categorical conversion happens under the hood, in the ActivitySim source code, and should require minimal work from users to implement their models with this change. The downcasting of numeric variables is implemented as an option that users can turn on and off.

Relevant discussions/presentations can be found at:
Project-Meeting-2023.06.27
Project-Meeting-2023.07.18
Project-Meeting-2023.08.08
Project-Meeting-2023.08.22
Project-Meeting-2023.08.29
Project-Meeting-2023.09.12
Project-Meeting-2023.09.26
Project-Meeting-2023.10.10
Project-Meeting-2023.12.12

String to pandas categorical

Although pandas categorical data type is a convenient solution to the memory issue, we have found the following caveats during implementation:

Assigning a value that is not already defined in the existing categories results in an error. There are places in the ActivitySim source code, as well as in model UECs, where new values are assigned to existing string columns. Example 1, Example 2, Example 3. We need to make sure those new values are pre-defined in the categories.
Pandas categorical is fragile with pandas merge(). Merging two pandas categorical columns with different categories results in an object type (string) column, which cancels the memory savings. Hence, before joining, we should make sure the two categorical columns use the same categories.
Calling pandas groupby() on a pandas categorical column will by default create groups for all pre-defined categories. This will crash the model if some pre-defined categories are not observed in the data. Example code.
We could also convert numeric variables to pandas categorical, but they would not work with any numeric operations. Example. We suggest not using pandas categorical for numeric variables in ActivitySim.
Time period variables in ActivitySim needed special treatment: because they are used in Sharrow to look up skims, and Sharrow requires them to be ordered, we converted time period string variables to ordered pandas categorical.
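The caveats above can be reproduced with small stand-in data. This is a sketch only: the column names, category labels, and time period codes are illustrative assumptions, not values from the ActivitySim source.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assigning a value outside the declared categories fails.
purpose = pd.Series(["work", "school"], dtype="category")
try:
    purpose.iloc[0] = "escort"
    assign_failed = False
except (TypeError, ValueError):  # exception type has varied across pandas versions
    assign_failed = True

# Pre-defining the new value in the categories makes the assignment valid.
purpose = purpose.cat.add_categories(["escort"])
purpose.iloc[0] = "escort"

# Merging on categorical keys with different categories drops the dtype.
left = pd.DataFrame({"purpose": pd.Categorical(["work", "school"]), "x": [1, 2]})
right = pd.DataFrame({"purpose": pd.Categorical(["work", "shop"]), "y": [3, 4]})
merged = left.merge(right, on="purpose", how="left")
print(merged["purpose"].dtype)  # no longer categorical

# Casting both sides to one shared dtype before joining preserves it.
shared = CategoricalDtype(["work", "school", "shop"])
left["purpose"] = left["purpose"].astype(shared)
right["purpose"] = right["purpose"].astype(shared)
merged_ok = left.merge(right, on="purpose", how="left")

# groupby() with observed=False creates a group for every declared category,
# including ones that never occur. Passing observed explicitly avoids relying
# on the default, which has depended on the pandas version.
trips = pd.DataFrame(
    {"period": pd.Categorical(["AM", "AM"], categories=["AM", "PM"]), "n": [1, 2]}
)
all_groups = trips.groupby("period", observed=False)["n"].sum()
seen_groups = trips.groupby("period", observed=True)["n"].sum()

# Ordered categorical for time periods: integer codes follow the declared order.
period_dtype = CategoricalDtype(["EA", "AM", "MD", "PM", "EV"], ordered=True)
periods = pd.Series(["PM", "AM"]).astype(period_dtype)
print(periods.cat.codes.tolist())
```

Sharing one `CategoricalDtype` object across tables is a simple way to keep categories aligned, since the dtype is only preserved through a merge when both key columns declare exactly the same categories.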

Downcasting numeric variables

In our tests, downcasting numeric variables helped further bring down the memory requirement of ActivitySim. However, changing the precision of numeric variables, especially float variables, caused the model results to change slightly. Hence, we have implemented numeric downcasting as a switch in the ActivitySim settings, turned off by default.
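The trade-off can be illustrated with a small sketch. The function `maybe_downcast` and its `enabled` flag are hypothetical stand-ins for the switch, not the actual ActivitySim setting:

```python
import numpy as np


def maybe_downcast(values: np.ndarray, enabled: bool = False) -> np.ndarray:
    """Hypothetical switch: downcast float64 to float32 only when enabled."""
    return values.astype(np.float32) if enabled else values


utilities = np.array([0.1, 0.2, 0.3], dtype=np.float64)

kept = maybe_downcast(utilities)                   # default: off, values unchanged
reduced = maybe_downcast(utilities, enabled=True)  # half the bytes per value

# The stored values shift slightly because float32 cannot represent them exactly.
max_shift = np.abs(reduced.astype(np.float64) - utilities).max()
print(kept.nbytes, reduced.nbytes, max_shift)
```

The shift per value is tiny, but it can propagate through utility calculations and change individual choices, which is consistent with leaving the switch off by default.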

Other notable findings

When running an ActivitySim model with Sharrow turned on, household debug tracing requires additional memory. Presented at the September 26, 2023 Meeting. See issue #754.
When running an ActivitySim model with Sharrow turned on, additional memory was held unnecessarily (a memory leak) because the Sharrow flow cache was not released properly. Presented at the September 26, 2023 Meeting. Jeff investigated and fixed this in PR #751.
When running an ActivitySim model with Sharrow turned on, utility expressions that compare a pandas categorical variable to strings can be evaluated incorrectly. This is documented in issue #766.

Results

prototype_arc 25% sample

The memory footprint of the work tour scheduling choosers table in the 25% sample ARC run dropped from 254 GB to 79 GB after the data type optimization. The data type optimization alone reduced the peak memory from 491 GB to 335 GB. The implementation also includes the fix for the memory leak we discovered in Sharrow, which reduced the peak memory by another 27 GB. Overall, the data type optimization work, along with the memory leak fix in Sharrow, reduced the peak memory of the 25% ARC run from 491 GB to 308 GB. The chart below shows the memory profile of the 25% ARC model before and after data type optimization.

prototype_mtc_extended 100%

In our latest test with the extended MTC model, we found that school escorting (added in Phase 7) is the new memory peak, instead of the mandatory tour scheduling model. The data type optimization has brought down the memory peak from 375 GB to 154 GB (excluding school escorting), and from 490 GB to 380 GB (including school escorting). The chart below shows the memory profile of the 100% extended MTC model before and after data type optimization.

run time implication

In addition to the memory reduction, we also observed a run time reduction (from 488 minutes to 359 minutes) for the ARC model with optimized data types. However, we did not see a run time reduction for the extended MTC model.

Guidance for the future

Converting string variables to pandas categorical is a quick solution that reduces the memory burden created by strings, but it does not remove the use of strings in ActivitySim. Although it has brought down the memory requirement greatly, it also has a few caveats, as documented above. In future development, a more systematic way of truly getting rid of strings (such as a data type model with IntEnum) would be worth looking into.

jpn-- · 2024-07-26T15:00:00Z

Closed by #782

i-am-sijia self-assigned this May 8, 2023

i-am-sijia mentioned this issue Dec 28, 2023

Data Type Optimization #782

Merged

i-am-sijia mentioned this issue Feb 22, 2024

Double check Sharrow memory usage fix (#751) is merged and implemented #816

Closed

jpn-- closed this as completed Jul 26, 2024
