Lineage search field with sub-lineage queries #3074
We might need to reconsider responsibilities regarding lineage validations. For SILO 0.3.0 it was decided that we validate all input values for lineage columns: they need to either be defined in the lineage definition file or be null.

This validation has advantages, as invalid values can lead to misleading behavior for the user. An incorrect lineage definition file might make users assume that lineages are case-insensitive, because there is no error when they input lower-case lineages. Likewise, the absence of errors might prevent users from noticing that a new lineage is present in the data without being present in the tree definition.

However, SILO preprocessing that produces errors can be a little cumbersome to deal with from the Loculus side of things, so some proper validation needs to be part of a solution to this issue. It would also be possible (and not too much effort) to make SILO's "errors on unknown values" behavior configurable and thus optional. If we chose that, the warnings emitted by SILO should be propagated somewhere to prevent the misleading behavior mentioned above. Otherwise, proper validation should most likely happen before the SILO import script is called, as the errors would be a lot easier to deal with in the backend.
If a lineage is missing from the lineage definition file, it will essentially be treated as a root node without children. In the case of the Pango lineages, this only applies to a new recombinant node for which no children have been designated yet, and that is quite a rare case. In all other cases, a missing lineage definition can cause wrong results. Of course, any wrong lineage definition file can cause wrong results, but I think that we should try to prevent them as much as possible, so the current SILO behavior of aborting when a lineage is not found seems very sensible to me. In any case, I think that Loculus should implement a validation (which doesn't mean that we cannot decide to do it in a second step if we want to get the feature out quickly). We also have to think about how such a file can be updated and whether or how this might be related to the pipeline versions. A new pipeline might introduce new lineages but also potentially remove others, so this needs to be in sync.
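For concreteness, the lineage definition file being discussed is a YAML mapping from each lineage to its parents and aliases; a minimal sketch with made-up lineage names (see the SILO docs for the exact schema):

```yaml
# Minimal illustrative lineage definition file; the lineage names are
# made up and the exact schema is defined by SILO, not by this sketch.
A:
  aliases:
    - "1"
B:
  parents:
    - A
B.1:
  parents:
    - B
XA:
  # a recombinant: multiple parents
  parents:
    - A
    - B.1
```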
This sounds a bit worrying to me. Not all sequences will have a lineage: for multi-segmented organisms such as CCHF or H5N1, the clade or lineage is often defined based on only one segment, so a sample that does not include that segment will not be labeled. SILO aborting when a sequence does not have a lineage would therefore be quite troubling. I also recently added clades to H5N1, and even in cases where the HA segment was known, I had a couple of sequences that could not be assigned a clade.
SILO only errors on invalid lineages; missing lineage assignments are totally fine :)

Yes, sorry, with "missing lineage" I meant a lineage that is missing from the lineage definition file.
@fhennig and I talked a bit about this and here's a brief summary and solution proposal:

Background

Concept

Proposed solution and sub-tasks
Bit of an aside: for usability, SILO should have a tolerant mode for this, in case users don't want it to be so strict, that just ignores things it doesn't understand (still warn at import but don't error, and simply treat lineages not present in the config file as null).
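If SILO made this configurable, the preprocessing config might gain a flag along these lines (the option name is hypothetical and does not exist today):

```yaml
# Hypothetical SILO preprocessing option, sketched for discussion only.
lineageDefinitionsFilename: "lineage_definitions.yaml"
failOnUnknownLineages: false  # warn at import instead of erroring; unknown lineages become null
```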
There are warnings for a reason. As Postel's Law suggests: "Be conservative in what you send, be liberal in what you accept." Allow users to fail, but don't force failure.

I don't see why the backend needs to be involved. Anyone can download the get-released-data output at any point and validate it against the config file; this doesn't need to be baked into the backend. It's a very specific type of validation that doesn't fit into the general nature of the backend. A lot of extra code for no clear reason: just run a script and send a notification if you're worried about a mismatch. There are so many ways one can shoot oneself in the foot; we shouldn't build in complicated safeguards just in case, at least not before we have noticed this being a real problem. Let's solve the real problems we know we have first and not worry too much upfront about things that might happen in the future.

A couple more points:
The pipeline can't do this. The lineage definitions file is a config file like the Nextclade dataset. It cannot in general be generated from a Nextclade dataset alone; it potentially requires extra knowledge about the tree topology and aliases that is not in Nextclade datasets. To me this feature should be done like this:
If it turns out that some lineages are present in the data but missing from the lineage config file: fix the config file, push, fixed. That shouldn't take more than 10 minutes when it happens. I'm not sure how long your proposal would take to implement, but I guess at least a week? What do you think?
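A sketch of how the config-link approach could look in values.yaml (the key names are illustrative, not the actual Loculus chart schema):

```yaml
# Hypothetical values.yaml excerpt: the SILO import job downloads the
# lineage definitions from a maintained repository. Key names and the
# exact file path within the repository are assumptions.
organisms:
  mpox:
    silo:
      lineageDefinitionsUrl: "https://raw.githubusercontent.com/corneliusroemer/mpxv-lapis-lineage-tree/main/lineages.yaml"
```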
I don't yet know the architecture and domain well enough to have a strong opinion about how this should be implemented. I'm missing the domain knowledge about the lineage definition specifics. Chaoran said:
Then you said:
And then you said:
Can you maybe explain a bit more how these files would be generated, which extra knowledge is needed, why it is not available in the pipeline, and why the pipeline cannot be adapted?
The pipeline has no domain knowledge, and the Nextclade datasets don't contain the information. So it has to be generated somehow, and the generation is per-lineage-system, so it is definitely not something to do within Loculus, which is pathogen-agnostic. One has to inform oneself about what the tree looks like per lineage system. This is not something you would do; I can generate it for the pathogens we're looking at.
I'm not generally against it, but I think that it is important to think this through carefully, as it is always possible to introduce such a feature later when we have stronger arguments/use cases, but more difficult to remove it again. Anyway, even if we had this feature in SILO, I don't think that we should use it in Loculus, at least not by default. A Loculus maintainer should need to know about SILO as little as possible and would likely not check the SILO logs/warnings closely, so we would face the challenge of making the SILO warnings visible, understandable and actionable. Instead, I think that Loculus should avoid ever reaching the situation where an unknown lineage is sent to SILO.
Can you please explain in more detail how this could work? What should the maintainer do exactly?
I think that this is exactly the type of validation that the backend should do and is already doing. The backend, at the moment, is already performing all the validation (hopefully, unless we forgot something) that is needed to make sure that the data it returns in get-released-data can be processed by SILO.
We are programming the pipeline, and we can either develop it in a way that it can provide the lineage definitions itself: for SARS-CoV-2, that could mean that it knows where the alias file lives. Or we can maintain a GitHub repository with the lineage definitions for major Nextclade datasets, as you suggested, and the pipeline downloads from it. The difference between our approaches is just the separation of concerns: I think that it makes a lot of sense for the pipeline to be responsible for providing the lineage definitions, as it is also the entity that either calls the lineages (most cases, probably) or at least has to verify the lineages in some way (if users submit the lineages). I know that my approach will take longer; a very rough estimate: one week for the main implementation and another for adding tests and documentation and getting reviews? But I believe that it is also the more reliable, stable and complete solution. Your solution could be enough as a Pathoplexus-specific solution: we could (somehow) overwrite/adjust the SILO config and overwrite import_silo_job.sh to download the lineage file. Such a solution, however, wouldn't be a re-usable Loculus feature.
I've spent a bit of time getting my head around this, and at the moment I find that something closer to Cornelius's idea makes sense to me. As a slight enhancement, one could give the SILO import script an env var which would map from pipeline version to lineage definition file,
and it could use that to download the appropriate file depending on the pipeline version it received (given we planned to add a …).

I think this idea that new pipeline versions can involve changes which aren't simply new versions of each datum is something that ultimately applies beyond lineage definitions; it potentially applies to the whole schema. E.g. we're about to change the raw …

Happy to discuss this on a call.
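For instance, such a pipeline-version-to-file mapping might look like this (the variable name and format are made up for illustration):

```yaml
# Hypothetical contents of an env var read by the SILO import script:
# a map from pipeline version to the lineage definition file to use
# for data processed by that version. URLs are placeholders.
lineageDefinitionsByPipelineVersion:
  1: "https://example.org/lineage-definitions/v1.yaml"
  2: "https://example.org/lineage-definitions/v2.yaml"
```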
I think that we are talking about two different issues here: one is whether the backend should validate the output of the pipeline, and the other is how to pass in the lineage definitions.

Discussion 1: validation

As said above, the backend has been validating all output provided by the pipeline insofar as it ensured that the output can be processed by SILO. This means that a wrongly-behaving pipeline (hopefully) cannot cause SILO preprocessing to crash. I think that this is an important feature for being maintainer-friendly. We should try to minimize the likelihood that a Loculus instance breaks due to bugs in the pipeline; I remember that we discussed this quite a bit in August and wanted to further improve it by addressing #2570.

Discussion 2: providing the lineage definitions

SILO preprocessing definitely needs to know the definitions, and in my opinion, as argued above, the backend should know them as well. So the question is how to give the definitions to them, and two ideas are (1) through the configuration via values.yaml and (2) via a new backend endpoint to register new pipeline versions.

First, conceptually: where do the definitions "fit" best? I can see arguments for either. One can well argue that the lineage definitions are part of the schema: it is like changing the type of the "lineage" field from a string to an enum (and w.r.t. the backend validation (discussion 1) it really is just an enum, i.e. the backend will just check whether each value provided by the pipeline is one of the defined values), and a schema seems like something that belongs in a configuration file. On the other hand, the lineage definitions are very dynamic. The configuration of server software (I'm thinking of software like PostgreSQL, GitLab, Discourse, …) usually doesn't need to be changed on a regular basis, but the lineage definitions are expected to update regularly.

Regarding the implementation complexity: both approaches have to achieve the same thing (i.e., providing the lineage definitions to the components that need them), and the difference is just whether the logic of passing around the definitions lives in the software code or in the deployment code. I'm not sure whether the software-based solution really takes more effort. With the configuration approach, we will increase the complexity of the Helm templates, which are more error-prone and difficult to write (unit) tests for.

Regarding maintainer-friendliness: since lineage definition updates are expected to be a regular operation that maintainers will perform, and not just a one-off thing when they set up their instance, we should try to make them maintainer-friendly. An API endpoint enables many approaches: if the lineage definitions don't update too frequently, a maintainer can just call the endpoint manually. In the future, we could also implement a graphical admin panel where the admin can drag & drop a lineage file. But the API endpoint will also allow easy automated approaches: for a pathogen like SARS-CoV-2, where the lineages update quite often, one might want to set up a fully automated system that, e.g., upon detecting a new Nextclade dataset, automatically downloads the current Pango lineage aliases along with the Nextclade dataset, generates the lineage definition file and submits it. Automation is more difficult with the configuration approach, as it requires write access to the cluster.
Further, with the endpoint approach, it is easier to provide error messages when there is a problem with the provided lineage definitions: the endpoint can simply send a response with an HTTP error status and explanations in the body. With the config system, the error messages will only be visible somewhere in the logs and will be more difficult for a maintainer to find (and even harder to retrieve in an automated way). The proposed …

I'm also happy to discuss this more in a call.
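To illustrate, registering the definitions for a new pipeline version could then look roughly like this (the path and payload are hypothetical; no such endpoint exists in the current backend API):

```yaml
# Hypothetical call, sketched as comments:
#
#   POST /admin/pipeline-versions/2/lineage-definitions
#   Content-Type: application/yaml
#
# with the lineage definition file itself as the request body:
B: {}
B.1:
  parents:
    - B
```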
It seems like the big disagreement here is whether the backend should validate the lineage definition files; however, our main goal is just to enable sub-lineage queries. As SILO now supports sub-lineage querying, the validation isn't required for the feature, so I would suggest we move forward with @theosanderson's suggestion for the moment and pass the lineage definitions in the config with a link to the pipeline version they use. Then we can add the sub-lineage search option to the website. I personally think it would be overkill for the backend to additionally validate the lineage definitions or to have to register them. To me, lineage definition files are like Nextclade datasets: we use them without validation on our end and assume they are correct. However, we can come back to this after adding the search option to the website.
What do you mean exactly? For me, the issue is that SILO will crash if the pipeline produces data with lineages that are not in the definitions file, and I believe that Loculus should avoid that.
We are already in a place where SILO import can (and does) crash during version upgrades due to schema changes, which is why I am, for the moment at least, fairly willing to tolerate that (but I do acknowledge that this would add another new thing that could cause crashes).
Thanks @theosanderson, I was just trying to formulate this myself. I think we should try to move forward with the sub-lineage search feature and then think about adding sub-lineage validation. We also don't, for example, validate that Nextclade datasets are accurate; we assume the Nextclade CLI does this. If we created another library to generate the lineage files (something @corneliusroemer suggested in #3074 (comment)), we could add a similar verification there.
I don't think that this is really comparable, because we explicitly designed the architecture in a way that considers the pipeline an external component. It is natural that the backend doesn't validate the Nextclade dataset name, as it doesn't know about it, and other pipelines might not use Nextclade at all. For that reason, the backend also doesn't and shouldn't fully trust what the pipeline produces, but should validate that the data types are correct. (Obviously, the backend doesn't check whether the values are semantically correct, as this would be outside the scope of a database.)
Schema changes can currently indeed cause downtime, but for me this is something that we should try to avoid/improve going forward rather than creating more cases that can cause a crash. In particular, wouldn't you agree that in practice we expect lineage definition updates to occur much more often than breaking schema changes? We talked earlier about wanting more stability, and implementing this ticket without validation would move us further away from that.
In the next 6 months, potentially actually less often, but beyond that, hopefully yes.
If you tie pipeline versions to the SILO lineage config by registering them through a backend endpoint, then fixing a lineage config will require reprocessing everything, which is wasteful when one could just fix the lineage config and be done.
LAPIS is just as much of an external component
That's why you validate things in SILO and error if things are wrong, like the backend rejects bad feedback.
The difference is: we expect no one to plug in a different version of LAPIS. Loculus is designed in a way that maintainers should be able to use their own pipeline. That's what "external" refers to in this context.
Lineage config file from @corneliusroemer: https://github.com/corneliusroemer/mpxv-lapis-lineage-tree
I had a chat with @chaoran-chen and this is what I got:
```yaml
metadata:
  - name: lineage
    lineageSystem:
      1: link
      2: link
```

preprocessing_config.yaml example:

```yaml
ndjsonInputFilename: "sample.ndjson.zst"
lineageDefinitionsFilename: "lineage_definitions.yaml"
referenceGenomeFilename: "reference_genomes.json"
duckdbMemoryLimitInG: 150
```

Not sure whether all of these need to be given or not.
An unspecified thing that I hadn't previously considered is what to do when …
In LAPIS v0.3.7, the sub-lineage filter was generalized, making it possible to include the sub-lineages of a lineage by appending a `*` to the query. In Loculus, we should incorporate this feature. Sub-tasks include (probably not complete):
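For example, assuming a metadata field named `lineage` (the actual field name depends on the instance config), a request like the following would match B.1 and all of its sub-lineages:

```
GET /sample/aggregated?lineage=B.1*
```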