Optimize Data Type Usage #673

Closed
i-am-sijia opened this issue May 8, 2023 · 1 comment

i-am-sijia commented May 8, 2023

Background

In the memory profiling work conducted by WSP during Phase 8 Interim, we found that ActivitySim's memory usage is high when large “choosers” tables are created with inefficient data types. For example, in the work tour scheduling step of the ARC run with a 25 percent sample, the choosers table has 540 million rows and 80 columns, for a memory footprint of about 250 GB. Among the 80 columns are five string variables, including tour purpose (e.g., “work”, “school”), tour category (e.g., “mandatory”), and time period (e.g., “AM”); each takes about 32 GB of RAM. If we changed the string variables into something like enums stored as int8, the memory footprint of this choosers table could be reduced from 250 GB to 102 GB. Many of the other columns also use unnecessarily memory-intensive data types such as float64 and int64. A logical next step is to optimize the data types used in ActivitySim as part of the Phase 8 development.

Methodology

String variables

We looked into two alternatives to optimizing string variables:

The table below recaps the pros and cons of the two alternatives:

[image: table of the pros and cons of the two alternatives]

In consultation with the consortium members and the bench contractors, we decided to use pandas Categorical for converting string variables, mainly because the level of effort is lower and it preserves backward compatibility. We will modify the ActivitySim source code so that whenever a string variable is created, it is converted to a pandas categorical.
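To illustrate why the conversion saves memory, here is a minimal sketch that compares an object (string) column with the same column stored as a pandas categorical. The column contents and the row count are made up for illustration and are not taken from an ActivitySim run.

```python
import numpy as np
import pandas as pd

# Hypothetical chooser column: one tour-purpose label per row.
rng = np.random.default_rng(0)
purposes = ["work", "school", "shopping", "escort"]
as_object = pd.Series(rng.choice(purposes, size=1_000_000), dtype=object)

# Same values stored as small integer codes plus a lookup of categories.
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {object_bytes / 1e6:.1f} MB")
print(f"category: {category_bytes / 1e6:.1f} MB")
print(as_category.cat.codes.dtype)  # int8 while there are fewer than 128 categories
```

With only a handful of distinct labels, each row costs one byte for the code instead of a pointer plus a Python string object, which is where the savings come from.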

Numeric variables

Numeric variables are created and used in the following sources:

  1. Input data
  2. Pre- and post-annotation of each ActivitySim sub-model
  3. .py source code

For the numeric variables in the input data, the user can define their data types in settings.yaml (see example here). For the numeric variables created in the annotations and the source code, we can create a function that downcasts them based on the value range of each variable.
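A range-based downcast could look roughly like the sketch below. It relies on `pd.to_numeric(..., downcast=...)`, which picks the smallest dtype that can hold the observed values. The function name `downcast_numeric` and the example columns are hypothetical; this is not the function implemented in ActivitySim.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns shrunk to the smallest fitting dtype."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Hypothetical columns that default to 64-bit types.
df = pd.DataFrame({
    "num_tours": np.array([0, 1, 2, 3], dtype="int64"),
    "zone_id": np.array([1, 250, 4000, 30000], dtype="int64"),
    "distance": np.array([0.5, 1.25, 10.0, 42.75], dtype="float64"),
})
small = downcast_numeric(df)
print(small.dtypes)
print(df.memory_usage(index=False).sum(), "->", small.memory_usage(index=False).sum())
```

Because the target dtype depends on the values seen at the time of the call, a column that later receives a larger value would need to be upcast again.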

Implementation Details

Overview

The string-to-pandas-categorical conversion happens under the hood, in the ActivitySim source code, and should require minimal work from users who implement their models with this change. The downcasting of numeric variables is implemented as an option that users can turn on or off.

Relevant discussions/presentations can be found at:
Project-Meeting-2023.06.27
Project-Meeting-2023.07.18
Project-Meeting-2023.08.08
Project-Meeting-2023.08.22
Project-Meeting-2023.08.29
Project-Meeting-2023.09.12
Project-Meeting-2023.09.26
Project-Meeting-2023.10.10
Project-Meeting-2023.12.12

String to pandas categorical

Although pandas categorical data type is a convenient solution to the memory issue, we have found the following caveats during implementation:

  1. Assigning a value not already defined in the existing categories results in an error. There are places in the ActivitySim source code as well as in model UECs where new values are being assigned to existing string columns. Example 1, Example 2, Example 3. We need to make sure those new values are pre-defined in the categories.
  2. Pandas categorical is fragile with pandas merge(). Merging two pandas categorical columns that have different categories produces an object-type (string) column, which cancels the memory saving. Hence, before joining, we should make sure the two categorical columns use the same categories.
  3. Calling pandas groupby() on a pandas categorical column will by default create groups for all pre-defined categories. This will crash the model if some pre-defined categories are not observed in the data. Example code.
  4. We could also convert numeric variables to pandas categorical, but they would then not work with any numeric operations. Example. We suggest not using pandas categorical for numeric variables in ActivitySim.
  5. Time period variables in ActivitySim needed special treatment: they are used in Sharrow to look up skims, and Sharrow requires them to be ordered. We therefore converted time period string variables to ordered pandas categoricals.
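The caveats about assignment, merging, grouping, and ordering can be reproduced with a small pandas sketch. All column names, labels, and the time period order used here are illustrative assumptions, not ActivitySim code. The `groupby` default for `observed` has differed across pandas versions, so it is passed explicitly.

```python
import pandas as pd

# Caveat 1: a value outside the declared categories cannot be assigned.
purpose = pd.Series(["work", "school"], dtype="category")
try:
    purpose.iloc[0] = "shopping"
except (TypeError, ValueError) as err:
    print("assignment failed:", err)
purpose = purpose.cat.add_categories(["shopping"])
purpose.iloc[0] = "shopping"  # works once the category is declared

# Caveat 2: merging on categoricals with different categories loses the dtype.
left = pd.DataFrame({"purpose": pd.Categorical(["work", "school"]), "tours": [10, 5]})
right = pd.DataFrame({"purpose": pd.Categorical(["work", "escort"]), "weight": [1.0, 2.0]})
merged = left.merge(right, on="purpose", how="left")
print(merged["purpose"].dtype)  # no longer categorical

shared = pd.CategoricalDtype(["work", "school", "escort"])
aligned = left.astype({"purpose": shared}).merge(
    right.astype({"purpose": shared}), on="purpose", how="left"
)
print(aligned["purpose"].dtype)  # category

# Caveat 3: observed=False creates a group for every declared category.
tours = pd.DataFrame({
    "purpose": pd.Categorical(["work", "work"], categories=["work", "school"]),
    "n": [1, 2],
})
all_groups = tours.groupby("purpose", observed=False)["n"].sum()
seen_groups = tours.groupby("purpose", observed=True)["n"].sum()
print(len(all_groups), len(seen_groups))  # 2 1

# Caveat 5: an ordered categorical keeps codes in the declared order.
periods = pd.Categorical(
    ["PM", "AM", "MD"], categories=["EA", "AM", "MD", "PM", "EV"], ordered=True
)
print(list(periods.codes))  # [3, 1, 2]
```

Declaring one shared `CategoricalDtype` up front and reusing it on both sides of a join is one way to address caveats 1 and 2 together.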

Downcasting numeric variables

In our tests, downcasting numeric variables helped further reduce the memory requirement of ActivitySim. However, changing the precision of numeric variables, especially float variables, caused the model results to change slightly. Hence, we have implemented numeric downcasting as a switch in the ActivitySim settings, turned off by default.
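The following sketch shows why float downcasting can shift results: float32 carries roughly 7 significant decimal digits versus about 16 for float64, so values are rounded when the column is downcast. The helper `maybe_downcast_floats` and its `enabled` flag are hypothetical stand-ins for the switch described above, not the actual ActivitySim setting.

```python
import numpy as np
import pandas as pd

def maybe_downcast_floats(series: pd.Series, enabled: bool = False) -> pd.Series:
    """Downcast to float32 only when the (hypothetical) switch is on."""
    return series.astype("float32") if enabled else series

# Made-up utility values stored at full precision.
utilities = pd.Series([-1.234567891, 0.000123456789, 3.987654321], dtype="float64")
unchanged = maybe_downcast_floats(utilities)               # default: off
downcast = maybe_downcast_floats(utilities, enabled=True)  # opt in

# Rounding error introduced by the narrower type.
max_abs_diff = (downcast.astype("float64") - utilities).abs().max()
print(unchanged.dtype, downcast.dtype)
print(max_abs_diff)

# Integers above 2**24 are not all representable in float32.
print(int(np.float32(16_777_217)))  # 16777216
```

Differences of this size are small per value, but they can change which alternative wins when utilities are close, which is consistent with the slight result changes observed.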

Other notable findings

  1. When running the ActivitySim model with Sharrow turned on, household debug tracing requires additional memory. This was presented at the September 26, 2023 meeting; see issue #754.
  2. When running the ActivitySim model with Sharrow turned on, additional memory was held unnecessarily (a memory leak) because the Sharrow flow cache was not released properly. This was presented at the September 26, 2023 meeting; Jeff investigated and fixed it in PR #751.
  3. When running the ActivitySim model with Sharrow turned on, utility expressions that compare a pandas categorical variable to strings can be evaluated incorrectly. This is documented in issue #766.

Results

prototype_arc 25% sample

The memory used by the work tour scheduling choosers table in the 25% sample ARC run dropped from 254 GB to 79 GB after the data type optimization. The data type optimization alone reduced the peak memory from 491 GB to 335 GB. The implementation also includes the fix for the memory leak we discovered in Sharrow, which reduced the peak memory by another 27 GB. Overall, the data type optimization work, along with the memory leak fix in Sharrow, reduced the peak memory of the 25% ARC run from 491 GB to 308 GB. The chart below shows the memory profile of the 25% ARC model before and after data type optimization.

[image: memory profile of the 25% ARC model before and after data type optimization]

prototype_mtc_extended 100%

In our latest test with the extended MTC model, we found that school escorting (added in Phase 7) is the new memory peak, instead of the mandatory tour scheduling model. The data type optimization has brought down the memory peak from 375 GB to 154 GB (excluding school escorting), and from 490 GB to 380 GB (including school escorting). The chart below shows the memory profile of the 100% extended MTC model before and after data type optimization.

[image: memory profile of the 100% extended MTC model before and after data type optimization]

Run time implications

In addition to the memory reduction, we observed a run time reduction for the ARC model (from 488 minutes to 359 minutes) after the data type optimization. However, we did not see a run time reduction for the extended MTC model.

Guidance for the future

Converting string variables to pandas categorical is a quick way to reduce the memory burden created by strings, but it does not remove the use of strings in ActivitySim. Although it has brought down the memory requirement greatly, it comes with the caveats documented above. In future development, a more systematic way of truly getting rid of strings (such as a data type model with IntEnum) would be worth looking into.
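As a rough idea of what an IntEnum-based approach might look like, the sketch below stores tour purposes as int8 codes and uses the enum only for naming. The `TourPurpose` class and its members are hypothetical and do not reflect an existing ActivitySim data model.

```python
from enum import IntEnum

import pandas as pd

class TourPurpose(IntEnum):
    """Hypothetical enumeration of tour purposes."""
    WORK = 1
    SCHOOL = 2
    SHOPPING = 3

# Convert string labels to compact integer codes once, at the boundary.
labels = pd.Series(["work", "school", "work", "shopping"])
codes = labels.map(lambda s: TourPurpose[s.upper()].value).astype("int8")

# Comparisons use the enum member instead of a string literal.
is_work = codes == TourPurpose.WORK
print(is_work.tolist())  # [True, False, True, False]

# Codes can be turned back into names for reporting.
print(TourPurpose(int(codes.iloc[1])).name)  # SCHOOL
```

Unlike a categorical, the column is a plain integer array, so merges and groupbys behave like any other numeric column; the cost is that expressions currently written against string labels would need to be rewritten.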

jpn-- commented Jul 26, 2024

Closed by #782

jpn-- closed this as completed Jul 26, 2024