Proposed short-term fix for the slow parsing of Very Large project.yaml files
#340
Having investigated the options for fast YAML parsing, I think we currently have two, with different trade-offs:

pyyaml is the default package most people reach for (for instance, it's already installed in job-server). It's tolerably actively maintained, in the sense that people seem to keep the build working for new versions of Python; I'm not sure much in the way of bug fixes or feature development happens though. Most importantly for our purposes, it provides binary wheels for a wide variety of platforms and Python versions. However, it has a significant downside (the reason we initially switched away from it) in that it silently ignores duplicate keys. This is a reasonably likely kind of error for a user to make while copy/pasting actions, and it would have the effect that one of the duplicated actions gets silently dropped.

ruamel.yaml.clib is the compiled backend for ruamel.yaml.

Both packages parse at approximately the same speed (well over an order of magnitude faster than the pure Python parser).

Overall, given that this is fundamentally a short-term hack to buy us some time, I think we should go with pyyaml.
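To make the duplicate-key problem concrete, here is a small demonstration of pyyaml's behaviour (the action names below are invented for illustration, not taken from a real project.yaml): the safe loader accepts a document with a repeated key and silently keeps only the last value.

```python
import yaml  # pyyaml

doc = """
actions:
  generate_study_population:
    run: cohortextractor:latest generate_cohort
  generate_study_population:
    run: cohortextractor:latest generate_cohort --output-format csv
"""

parsed = yaml.safe_load(doc)
# Only one of the duplicated actions survives: the later one wins,
# and the user gets no warning that the first was silently dropped.
print(len(parsed["actions"]))  # 1
```

This is exactly the copy/paste failure mode described above: the user thinks they have two actions, but the parser has quietly discarded one.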
And also helping any users who are struggling with large YAML files locally to install these packages.
Have you checked parsing compatibility on the files of interest for this? One of the reasons YAML is a bit of a nightmare is that basically no two YAML parsers are actually compatible. They pretend to be, and manage compatibility for simple examples, but the YAML semantics are complicated enough that you inevitably run into discrepancies, and if you're talking about really large YAML files I'd assume by default that some of these are lurking in there somewhere.
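One concrete example of the kind of semantic trap being described: under the YAML 1.1 rules that pyyaml implements, unquoted scalars like `NO`, `yes` and `on` resolve to booleans, whereas YAML 1.2 parsers treat them as strings. A quick demonstration with pyyaml (the key name here is made up):

```python
import yaml  # pyyaml implements YAML 1.1 scalar resolution

# "NO" looks like a country code, but YAML 1.1 resolves it as a boolean
print(yaml.safe_load("country: NO"))    # {'country': False}

# Quoting the scalar opts out of implicit resolution
print(yaml.safe_load("country: 'NO'"))  # {'country': 'NO'}
```

Two parsers that disagree on which YAML version's resolution rules to apply will produce different data from the same document, without either of them reporting an error.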
We could add it as a dependency to Maybe as an extra, and tell users to do
So. Yes. This is one of the many terrible things about YAML which make me deeply regret its use in OpenSAFELY, and determined that The Glorious Future Pipeline Format will use something else. In this particular case though, the YAML is all pretty vanilla and stays well clear of the ambiguous corners of the spec. The reason the files get so large is that we have users who want to run a small number of actions many times with different parameters, and so they write scripts which generate these files. No one (I dearly hope) is writing 3MB of YAML by hand.
I think that until we switch our whole installation story to using
No, for sure, I wasn't imagining this was manually written, but I've definitely run into cases where YAML emitted by one implementation is not interpreted in the same way by another, and it would make me worried about implementing something like this without an explicit sanity check on compatibility.
By "sanity check" do you just mean checking that existing files parse identically? If so, I have done that and it looks OK, but obviously that's no proof it will always be so. Is there something cleverer we could do here? I think it helps that the two parsers here share a common lineage. Ultimately the risk of divergence is always going to be present. But it feels like the least worst of our currently available options in the short term.
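By way of illustration, a minimal version of that check might look like the following. This sketch compares pyyaml's compiled loader against its pure Python loader on the same document; the real check would compare the pipeline's current parser against whichever fast parser is chosen, and the commented-out file glob is an invented example of how it could be run over a corpus of existing files.

```python
import yaml  # pyyaml

def parses_identically(text: str) -> bool:
    """Return True if the fast (C) and pure Python loaders agree on `text`."""
    # CSafeLoader only exists if pyyaml was built against libyaml; fall back
    # to SafeLoader if it's absent (which makes the check trivially true).
    fast_loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)
    fast = yaml.load(text, Loader=fast_loader)
    slow = yaml.load(text, Loader=yaml.SafeLoader)
    return fast == slow

# Hypothetical usage over a collection of real project.yaml files:
# for path in pathlib.Path("projects").glob("**/project.yaml"):
#     assert parses_identically(path.read_text()), path
```

This only proves agreement on the files checked, not on all possible inputs, which matches the caveat above.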
@bloodearnest On reflection, I think your suggestion about using an
No, I think that's mostly sufficient to requirements. If there are discrepancies later, those are just bugs that can be fixed. https://github.com/opensafely-core/pipeline/pull/346/files#r1796580792 is the other big sanity check that I think is probably necessary to make this safe to run, but I've added that detail there.
We're using pure Python parsing for two reasons:

1. It keeps all our dependencies pure Python, which keeps local installation simple.
2. The pure Python parser produces helpful error messages when parsing fails.
But for very large projects involving megabytes of YAML it's intolerably slow. This makes local `opensafely run` very difficult to use, and makes it impossible to dispatch jobs in production because the page times out.

A short term workaround for this would be for the pipeline library to first attempt to parse the YAML using a fast, compiled parser (if one is importable) and only if that produces errors to re-parse it using the pure Python parser to get the helpful error messages.
In pseudo-code, what I'm proposing is something like:
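The original pseudo-code didn't survive into this page, but the proposal described above amounts to something like the following sketch. The pure Python parser is stubbed out here; in the real pipeline it would be the existing parsing code.

```python
def fast_parse(text):
    """Parse with a compiled parser; raises ImportError if none is installed."""
    import yaml  # pyyaml; ImportError here means no fast parser is available
    # Use the libyaml-backed loader when pyyaml was built with it
    loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)
    return yaml.load(text, Loader=loader)

def slow_parse(text):
    """Stand-in for the existing pure Python parser with its helpful errors."""
    raise NotImplementedError  # the real pipeline parsing code goes here

def parse_yaml(text):
    try:
        return fast_parse(text)
    except ImportError:
        # No compiled parser importable: behave exactly as before.
        return slow_parse(text)
    except Exception:
        # The fast parser rejected the document. Re-parse with the pure
        # Python parser so the user gets its friendlier error messages.
        return slow_parse(text)
```

On the happy path the document is parsed once, quickly; only invalid documents pay the cost of being parsed twice, which matches the "unhappy path slower, but not by much" observation below.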
This does make the unhappy path slower, but not by much. And it would massively speed up the happy path, assuming that a fast parser is importable. There are three different contexts we need to think about.
1. Job Server
Here it looks like `pyyaml` is already available, so there'd be nothing more to do than upgrading the pipeline library.

https://github.com/opensafely-core/job-server/blob/4814a7a17d42c55a508fb527c2b3c5a9121027c3/requirements.prod.txt#L794
2. Codespaces
I'm not exactly sure of the mechanics here, but presumably we can use whatever mechanism we do for ensuring that `opensafely-cli` is installed to also install `pyyaml`.

3. Running locally
This is obviously the hardest part. I think in the first instance we'd just need to talk the affected users through installing `pyyaml` (or whatever we choose). That's obviously not sustainable, but it makes it practical right now for these users to interact with their projects locally, which I think is really important.

Longer term, if we move to using `uv` for local installation, then the need to keep all our dependencies as pure Python goes away.