Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large project.yaml can lead to timeouts on "run jobs" page #4617

Open
evansd opened this issue Sep 19, 2024 · 5 comments
Open

Large project.yaml can lead to timeouts on "run jobs" page #4617

evansd opened this issue Sep 19, 2024 · 5 comments

Comments

@evansd
Copy link
Contributor

evansd commented Sep 19, 2024

A user reported getting the big blue Dokku error page:

We're sorry, but something went wrong
If you are the application owner please check the logs for more information.

when trying to access this URL:
http://jobs.opensafely.org/comparing-disparities-in-rsv-influenza-and-covid-19/disparities-comparison-rsv-flu-c19/run-jobs/

The cause seems to be this very large (programatically generated) project.yaml file with 6,874 actions in it:
https://github.com/opensafely/disparities-comparison/blob/6cba2f6f1609f26dc4ee5698f51afcc7e9be3840/project.yaml

In the short term, the user has been advised to reduce the number of actions to the minimum needed to get started with the project.

It's possible that we decide we just don't support this many actions (the user expressed surprise at the number, so possibly they don't actually need all of them in any case). But if so it would be good to fail more gracefully and give a more explanatory error.

Related issue:

Original Slack thread:
https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1726760871032629

Subsequent discussion:
https://bennettoxford.slack.com/archives/C069SADHP1Q/p1726761557795209

Sentry error:
https://ebm-datalab.sentry.io/issues/5874686309/

@evansd
Copy link
Contributor Author

evansd commented Oct 1, 2024

I can see some initial profiling work was done the last time the performance of this page was raised. But that's no longer representative of the current situation.

I've done some basic profiling locally of loading the run-jobs/ page with the above very large project.yaml file. This was using the job-server code as of this commit and was just measuring the time to return HTML. I suspect, given the size of the DOM, there will also be issues rendering this HTML but I haven't attempted to measure those.

Total time to return HTML was 48 seconds on my laptop. I suspect that on the server, which is not particularly powerful and has more to do than satisfy a single request, this takes even longer leading to the timeout errors.

The time breaks down as follows (so roughly 27s parsing the YAML, 6s validating and 15s rendering HTML):
Screenshot from 2024-10-01 13-21-49

The slow YAML parsing time is a consequence of doing the parsing in pure Python, which we had good reason for at the time. Switching the YAML library back from ruyaml to ruamel1 and using its libyaml parsing backend drops the parsing time to 1.9 seconds for an overall response time of 23 seconds:
Screenshot from 2024-10-01 13-25-35

The slow template rendering time is a consequence of having many slippers components in the hot loop and the high overhead of calling in to each component. It may be possible to inline some of the code here, or restructure the page in some way to reduce this.

The previous concerns about the calls out to the Github or OpenCodelists APIs don't seem to be particularly relevant to the overall runtime.

Footnotes

  1. ruyaml is a community "maintained" fork of ruamel, but it does not support the C-based parser. In general, Python YAML support is a bit of a shit-show.

@bloodearnest
Copy link
Member

Yet another place were our current method for distributing opensafely cli puts limitations on us (we need a pure python solution).

But if it also takes 20+s to parse parsing a 6k job project.yaml on every invocation of opensafely run, no wonder some of our users are not testing locally...

@evansd
Copy link
Contributor Author

evansd commented Oct 2, 2024

There may be some quick (if slightly dirty) wins here. One would be for the pipeline library to expose a variant of the load_pipeline function which took a pre-parsed structure rather than a YAML string. Then job-server could use pyaml/ruamel.clib/whatever to parse the YAML and pass it in.

It's obviously not ideal to be using different parsers in different parts of the system. But there's nothing particularly exotic about the YAML used in pipeline files, so I can't see it would cause problems in practice.

Of course this doesn't help with running things locally, nor with the problem of submitting lots of jobs in one go. But it would (hopefully) buy us enough of a speed up that it becomes at least physically possible to run jobs from a very large project file, which seems to me the major blocker at the moment.

@lucyb
Copy link
Contributor

lucyb commented Oct 3, 2024

Thanks for this excellent analysis, Dave. The proposed fix for the pipeline library seems like a pragmatic approach that will have some significant benefits.

I don't think there's anything we should do on Job Server in the very short term, so I'm going to remove this ticket from our board for now, but keep it open.

@mikerkelly
Copy link
Contributor

Removing tech-support label as I don't think we need actively follow this up or get back further to the reporter (the tech-support process uses this label for tracking such tickets). We have captured all information needed to resolve the issue and the reporter isn't expecting further responses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants