Compile activity detected when running Sharrow in production mode #756

Closed
aletzdy opened this issue Oct 18, 2023 · 7 comments

@aletzdy
Contributor

aletzdy commented Oct 18, 2023

Describe the bug

This bug is encountered in the MWCOG model. The compile step of Sharrow (running in test mode) concludes successfully, and the sharrowcache folder is created. Running the model in Sharrow production mode also concludes successfully, but the runtimes of a few model steps (and the overall runtime) are much longer than in the non-Sharrow version. Workplace location stands out in particular: the Sharrow version takes 320 min vs. 50 min in non-Sharrow mode. The notes column in the timing_log.csv from the Sharrow production run shows compile information for these steps. According to the source code, this is a bug that needs to be investigated.

@jpn-- Any suggestions on why this is happening?

aletzdy added the Bug label Oct 18, 2023
@jpn--
Member

jpn-- commented Oct 18, 2023

It is likely that (re)compiling is being triggered because some DataFrame column data type in production mode is different from the type in the compile step. This can happen sometimes due to corner cases (e.g. rare instances where no choice is valid and the choice comes back as "null" instead of an integer) or just from having more observations, if a column is promoted to a larger bit width to prevent an overflow. Solutions can include (a) run a much larger sample in the compile step so you encounter all these corner cases there instead of in production, (b) just run production again, since all your compiling should be cached now, and/or (c) look forward to a future version of ActivitySim where an explicit data model prevents dtypes from changing unpredictably during a model run.
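
The two kinds of dtype change described above can be reproduced in plain pandas. This is a minimal sketch with made-up column names and values, not ActivitySim or Sharrow code. `dtype_diff` is a hypothetical helper for comparing the column types seen in a compile run against those seen in production, and `downcast="integer"` is used only as one illustration of how the chosen bit width can depend on the data.

```python
import numpy as np
import pandas as pd

# Corner case 1: a missing choice turns an integer column into float64,
# because NaN cannot be stored in a NumPy integer column.
choices_ok = pd.Series([3, 7, 5])
choices_null = pd.Series([3, None, 5])

# Corner case 2: the same code picks a wider integer type when the data
# contains larger values (here via pandas' downcast option).
small_ids = pd.to_numeric(pd.Series(np.arange(100)), downcast="integer")
big_ids = pd.to_numeric(pd.Series(np.arange(100_000)), downcast="integer")


def dtype_diff(compile_df, production_df):
    """Return {column: (compile dtype, production dtype)} where the dtypes differ."""
    shared = compile_df.columns.intersection(production_df.columns)
    return {
        col: (str(compile_df[col].dtype), str(production_df[col].dtype))
        for col in shared
        if compile_df[col].dtype != production_df[col].dtype
    }


compile_df = pd.DataFrame({"choice": choices_ok, "zone": [1, 2, 3]})
production_df = pd.DataFrame({"choice": choices_null, "zone": [1, 2, 3]})

print(small_ids.dtype, big_ids.dtype)
print(dtype_diff(compile_df, production_df))
```

Running a check like this on the tables feeding a slow step, once in test mode and once in production mode, is one way to see which columns change type between the two runs.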

@aletzdy
Contributor Author

aletzdy commented Nov 6, 2023

Thanks, @jpn--.

I tested your solutions. Solution (a) was not easy to work with: even when running a 50% sample in test mode, the recompiling still seemed to happen in production, and running 100% in test mode unexpectedly ended in a memory crash. I tested Solution (b), and a subsequent production run no longer showed any recompiling note in the timing log, but that run still took as long (300+ min). So I am not sure how much of a problem (if any) this recompiling bug was.

My tests show that if I take out all the calibration constants (of which we have about 40), the production runtime decreases to 38 min. Those constants are defined in the following format:
@np.where((df['home_jurisdiction']==0) & (_COUNTY==0), 1, 0)

with _COUNTY being a temp variable defined at the top. Do you see any problem with this way of defining the constants?
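
For readers unfamiliar with this format, the sketch below evaluates the same kind of expression on a toy table using plain NumPy and pandas. The data values are invented, and `_COUNTY` is stood in for by an ordinary array (in the spec it is a temp variable defined elsewhere). The second form, a boolean comparison cast to integer, gives the same values; nothing here shows whether either form is responsible for the runtime difference.

```python
import numpy as np
import pandas as pd

# Toy chooser table; values are made up for illustration.
df = pd.DataFrame({"home_jurisdiction": [0, 0, 1, 2]})

# Hypothetical stand-in for the _COUNTY temp variable: one county code per row.
_COUNTY = np.array([0, 1, 0, 0])

# The calibration constant as written in the spec: 1 where both conditions hold.
constant_flag = np.where((df["home_jurisdiction"] == 0) & (_COUNTY == 0), 1, 0)

# The same indicator written as a boolean mask cast to integer.
indicator = ((df["home_jurisdiction"] == 0) & (_COUNTY == 0)).astype(int).to_numpy()

print(constant_flag)  # → [1 0 0 0]
```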

@jpn--
Member

jpn-- commented Nov 6, 2023

@aletzdy is the spec file for this component published somewhere on GitHub where I can see it? If not, can you send it to me? Thanks.

@aletzdy
Contributor Author

aletzdy commented Nov 6, 2023

It is similar to the mwcog_example spec, but with some calibration-related updates:
workplace_location_mwcog.csv

@aletzdy
Contributor Author

aletzdy commented Nov 6, 2023

Another piece of potentially relevant info: the current model implementation reads in the area type and county data as separate omx files. These pseudo-skim files are created separately using a Python script so that the county or area type of an alternative destination can be fetched in the workplace location model. I initially suspected that this might be the issue, so I merged all the skim files into one and made sure the zarr digital encoding is working fine (I checked the created zarr cache), and it all looks good to me. But I am not sure if there might be a datatype issue here.
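
One way to look for a datatype issue of this kind is to list the dtype of each matrix and flag matrices whose values are all whole numbers but are stored as floats. The sketch below does this with NumPy only. The matrix names and values are hypothetical, `summarize_dtypes` is an illustrative helper rather than part of ActivitySim or Sharrow, and the arrays would need to be loaded from the actual skim files by whatever means the model already uses.

```python
import numpy as np


def summarize_dtypes(matrices):
    """Map each matrix name to (dtype name, True if every value is a whole number)."""
    return {
        name: (arr.dtype.name, bool(np.all(np.mod(arr, 1) == 0)))
        for name, arr in matrices.items()
    }


# Hypothetical stand-ins for a merged skim file with two pseudo-skims.
merged = {
    "DIST": np.array([[0.5, 2.3], [2.3, 0.4]], dtype=np.float32),
    "COUNTY": np.array([[0.0, 1.0], [0.0, 1.0]], dtype=np.float64),
    "AREATYPE": np.array([[1, 3], [1, 3]], dtype=np.int8),
}

for name, (dtype_name, whole) in summarize_dtypes(merged).items():
    # A float dtype with only whole-number values marks a category code stored as float.
    print(name, dtype_name, whole)
```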

@aletzdy
Contributor Author

aletzdy commented Nov 28, 2023

@jpn-- I wanted to check back on this issue and see if you have any suggestions on how we can fix it.

@jpn--
Member

jpn-- commented Jul 26, 2024

Closed by #782

jpn-- closed this as completed Jul 26, 2024