Optimize Data Type Usage #673
Background
In the memory profiling work conducted by WSP during Phase 8 Interim, we have identified that memory usage of the ActivitySim model is high when large “choosers” tables with inefficient data types are created. For example, in the work tour scheduling step in the ARC run with a 25 percent sample, the choosers table has 540 million rows and 80 columns – a memory footprint of about 250 GB. Among the 80 columns, there are five string variables including tour purpose (e.g., “work”, “school”), tour category (e.g., “mandatory”), and time-period (e.g., “AM”); each takes about 32 GB of RAM. If we change string variables into something like enums using int8 data types, the memory footprint of this chooser table could be reduced from 250 GB to 102 GB. Many of the other columns unnecessarily use memory-intensive data types like float64 and int64. A logical next step is to optimize the data types used in ActivitySim, as part of the Phase 8 development.
Methodology
String variables
We looked into two alternatives for optimizing string variables:
The table below recaps the pros and cons of the two alternatives:
In consultation with the consortium members and the bench contractors, we decided to implement pandas Categorical for converting string variables, mainly because the level of effort is lower and it preserves backward compatibility. We will modify the ActivitySim source code so that when a string variable is created, it is converted to a pandas categorical.
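As a rough illustration of why the categorical approach saves memory while keeping string comparisons working, here is a minimal sketch. The table and its column names are hypothetical stand-ins for a choosers table, not ActivitySim code.

```python
import pandas as pd

# Hypothetical miniature "choosers" table; names and values are illustrative only.
n = 100_000
choosers = pd.DataFrame({
    "tour_purpose": ["work", "school"] * (n // 2),
    "tour_category": ["mandatory"] * n,
})
string_columns = ["tour_purpose", "tour_category"]

before = choosers.memory_usage(deep=True).sum()

# Convert each string column to a pandas categorical.
for col in string_columns:
    choosers[col] = choosers[col].astype("category")

after = choosers.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")

# Comparisons against string literals still work after the conversion.
n_work = int((choosers["tour_purpose"] == "work").sum())
```

With only a handful of distinct values, pandas stores each row as a small integer code plus one shared copy of the category labels, which is where the saving comes from.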
Numeric variables
Numeric variables are created and used in the following sources:
For the numeric variables in the input data, the user can define their data types in the settings.yaml, see example here. For the numeric variables created in the annotations and source code, we can create a function that downcasts them based on the value ranges of the variables.
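A range-based downcasting function could look roughly like the sketch below. The helper name `downcast_numeric` and the example columns are assumptions made for illustration; this is not the function implemented in ActivitySim. It relies on `pandas.to_numeric` with the `downcast` argument, which picks the smallest dtype that can hold the observed values.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with numeric columns downcast by value range."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer dtype that fits the observed min/max.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Smallest float dtype, but never below float32.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Illustrative columns only.
df = pd.DataFrame({
    "person_id": np.arange(1_000, dtype="int64"),           # values fit in int16
    "num_tours": np.array([0, 1, 2, 3] * 250, dtype="int64"),  # values fit in int8
    "distance": np.linspace(0.0, 50.0, 1_000),              # float64 input
})
small = downcast_numeric(df)
print(small.dtypes)
print(df.memory_usage().sum(), "->", small.memory_usage().sum())
```

Note that a dtype chosen from the values seen in one table may be too narrow for values produced later, so a real implementation would need to guard against overflow.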
Implementation Details
Overview
The string to pandas categorical conversion shall happen under the hood, in the ActivitySim source code, and it should require minimal work for users to implement their models with this change. The downcasting of numeric variables is implemented as an option that users can turn on and off.
Relevant discussions/presentations can be found at:
Project-Meeting-2023.06.27
Project-Meeting-2023.07.18
Project-Meeting-2023.08.08
Project-Meeting-2023.08.22
Project-Meeting-2023.08.29
Project-Meeting-2023.09.12
Project-Meeting-2023.09.26
Project-Meeting-2023.10.10
Project-Meeting-2023.12.12
String to pandas categorical
Although pandas categorical data type is a convenient solution to the memory issue, we have found the following caveats during implementation:
Downcasting numeric variables
In our tests, downcasting numeric variables further reduced the memory requirement of ActivitySim. However, changing the precision of numeric variables, especially float variables, caused the model results to change slightly. Hence, we have implemented numeric downcasting as a switch in the ActivitySim settings, and it is turned off by default.
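The small NumPy sketch below illustrates, with made-up values, why float downcasting can shift results while integer downcasting within range does not: float32 carries fewer significant digits than float64, so values that differ in float64 can become identical.

```python
import numpy as np

# Two values that differ by 1e-9, representable as distinct float64 numbers.
a64 = np.array([1.0, 1.0 + 1e-9], dtype=np.float64)
a32 = a64.astype(np.float32)

distinct_in_float64 = bool(a64[0] != a64[1])
distinct_in_float32 = bool(a32[0] != a32[1])
print(distinct_in_float64, distinct_in_float32)

# Integer downcasting is lossless as long as every value fits the target range.
counts = np.array([0, 5, 120], dtype=np.int64)
counts8 = counts.astype(np.int8)
lossless = bool(np.array_equal(counts, counts8))
print(lossless)
```

Differences of this size can propagate through utility calculations and change individual choices, which is consistent with keeping the switch off by default.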
Other notable findings
Results
prototype_arc 25% sample
The memory footprint of the work tour scheduling choosers table in the 25% sample ARC run dropped from 254 GB to 79 GB after the data type optimization. The data type optimization alone reduced the peak memory from 491 GB to 335 GB. The implementation also includes a fix for the memory leak we discovered in Sharrow, which reduced the peak memory by another 27 GB. Overall, the data type optimization work, along with the memory leak fix in Sharrow, reduced the peak memory of the 25% ARC run from 491 GB to 308 GB. The chart below shows the memory profile of the 25% ARC model before and after data type optimization.
prototype_mtc_extended 100%
In our latest test with the extended MTC model, we found that school escorting (added in Phase 7) is the new memory peak, instead of the mandatory tour scheduling model. The data type optimization has brought down the memory peak from 375 GB to 154 GB (excluding school escorting), and from 490 GB to 380 GB (including school escorting). The chart below shows the memory profile of the 100% extended MTC model before and after data type optimization.
Run time implication
In addition to the memory reduction, we also observed a run time reduction (from 488 minutes to 359 minutes) for the ARC model with the data types optimized. However, we did not see a run time reduction for the extended MTC model.
Guidance for the future
The way we converted string variables to pandas categorical is a quick solution to reduce the memory burden created by strings, but it does not remove the use of strings in ActivitySim. Although it has greatly reduced the memory requirement, it also has a few caveats, as documented above. In future development, a more systematic way of truly getting rid of strings (such as a data type model with IntEnum) would be worth looking into.
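To give a sense of what an IntEnum-based approach might look like, here is a minimal sketch. The `TourPurpose` enum, its members, and its integer codes are hypothetical and only illustrate the idea of storing small integers while keeping readable names in code.

```python
from enum import IntEnum

import numpy as np
import pandas as pd


class TourPurpose(IntEnum):
    """Hypothetical enum; member names and codes are illustrative."""
    WORK = 1
    SCHOOL = 2
    SHOPPING = 3


# Store the purpose as int8 codes rather than strings.
tours = pd.DataFrame({
    "tour_purpose": np.array(
        [TourPurpose.WORK, TourPurpose.SCHOOL, TourPurpose.WORK], dtype=np.int8
    )
})

# Filters use the enum member instead of a string literal.
is_work = tours["tour_purpose"] == TourPurpose.WORK.value

# Labels can be recovered from the codes when needed for output.
labels = tours["tour_purpose"].map(lambda code: TourPurpose(int(code)).name)
print(labels.tolist())
```

Unlike the categorical conversion, this would require model expressions that currently compare against strings to reference enum members instead, so it is a larger change.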