
Lineage search field with sub-lineage queries #3074

Open
chaoran-chen opened this issue Oct 25, 2024 · 23 comments · May be fixed by #3467
Labels
discussion (Open questions) · epic (A major task that should be broken down into smaller tasks)

Comments

@chaoran-chen
Member

In LAPIS v0.3.7, the sub-lineage filter was generalized, making it possible to include the sub-lineages of a lineage by appending a * to the query value. We should incorporate this feature in Loculus.
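To illustrate the syntax (a sketch only; the field name, endpoint, and instance URL below are placeholders, not from this issue):

# without the wildcard, only the lineage itself matches
curl "https://lapis.example.org/sample/aggregated?lineage=B.1.1.7"
# with the trailing *, the lineage and all of its sub-lineages match
curl "https://lapis.example.org/sample/aggregated?lineage=B.1.1.7*"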

Sub-tasks include (probably not complete):

  • Website: add a UI control to send the corresponding queries and let users decide whether to include sub-lineages
  • Deployment and config: pass the lineage tree definition file to SILO (and LAPIS?) and the backend
  • Backend: If a field has the new lineage type, SILO expects that only values that are also present in the lineage tree definition file appear. The backend should therefore validate the lineages that the preprocessing pipeline generates.
@chaoran-chen added the epic label Oct 25, 2024
@Taepper

Taepper commented Oct 28, 2024

We might need to reconsider responsibilities regarding lineage validations.

For SILO 0.3.0 it was decided that we validate all input values for lineage columns. This means that they need to be either NULL or a valid lineage. Validity is defined through presence in the lineage definition file (covid example), as defined here by @corneliusroemer.
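For context, a minimal sketch of the shape of such a file (keys assumed from the linked covid example; the exact layout may differ):

# each lineage lists its aliases and its parent(s) in the tree
A:
  aliases: []
  parents: []
A.1:
  aliases: []
  parents:
    - A
XBB:
  aliases: []
  parents:    # recombinants can have more than one parent
    - BJ.1
    - BM.1.1.1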

This validation has advantages, as invalid values could lead to misleading behavior for the user. An incorrectly specified lineage definition file might make users assume that lineages are case-insensitive, since there is no error when they input lower-case lineages. Also, in the absence of errors, users might not notice that the data contains a new lineage that is missing from the tree definition.

However, SILO preprocessing that produces errors can be a little cumbersome to deal with from the Loculus side of things, so a solution to this issue needs to include proper validation. It would also be possible (and not too much effort) to make SILO's "error on unknown values" behavior configurable and thus optional. If this were chosen, the warnings emitted by SILO should be propagated somewhere to prevent the above-mentioned misleading behavior.

Otherwise, proper validation should most likely happen before the SILO import script is called, as the errors would be a lot easier to deal with in the backend.

@fhennig added the discussion label Oct 28, 2024
@chaoran-chen
Member Author

If a lineage is missing, it will essentially be treated as a root node without children. In the case of the Pango lineages, this only applies to a new recombinant node for which no children have been designated yet, which is quite rare. In all other cases, a missing lineage definition can cause wrong results. Of course, any wrong lineage definition file can cause wrong results, but I think that we should try to prevent them as much as possible, so the current SILO behavior of aborting when a lineage is not found seems very sensible to me.

In any case, in Loculus, I think that it is important to implement a validation (which doesn't mean that we cannot decide to do it in a second step if we want to get the feature out quickly).

We also have to think about how such a file can be updated and whether or how this might be related to the pipeline versions. A new pipeline might introduce new lineages but also potentially remove others, so this needs to be in sync.

@anna-parker
Contributor

So the current SILO behavior to abort when a lineage is not found seems very sensible to me.

This sounds a bit worrying to me. Not all sequences will have a lineage: for multi-segmented organisms such as CCHF or H5N1, the clade or lineage is often defined based on only one segment, so a sample that does not include that segment will not be labeled. SILO aborting when a sequence does not have a lineage is therefore quite troubling. I also recently added clades to H5N1, and even in cases where the HA segment was known, I had a couple of sequences that could not be assigned a clade.

@Taepper

Taepper commented Oct 29, 2024

SILO only errors on invalid lineages; missing lineage assignments are totally fine :)

@chaoran-chen
Member Author

Yes, sorry, with "missing lineage", I meant if a lineage is missing from the lineage definition file.

@chaoran-chen
Member Author

chaoran-chen commented Nov 25, 2024

@fhennig and I talked a bit about this and here's a brief summary and solution proposal:

Background

  • To enable the lineage search in SILO, we need to give it a lineage definition file which must be complete (i.e., contain all lineages that are in the dataset).
  • The lineages are called by the processing pipeline and the available lineages may change if a pipeline is updated.
  • The backend's job is to ensure that everything SILO preprocessing gets from /get-released-data is really correct (from SILO's perspective) and can be processed by SILO preprocessing. This means that the backend must ensure that SILO only gets lineages that are in the lineage definition file, and must reject the data if the pipeline misbehaves and sends a lineage that is not known.
  • It is possible for a pathogen to have zero, one, or more than one lineage column. That means that for an organism, there can be more than one lineage definition file.

Concept

  • The pipeline naturally knows about the available lineages, so it makes sense for it to be responsible for creating the lineage definition file(s) and providing them to the backend.
  • The available lineages can change with each new pipeline version, so it also makes sense for a pipeline to provide new definitions for each version. In other words, the backend would know the lineage definitions for each pipeline version.
  • silo_import_job will get the lineage definition file(s) from the backend.

Proposed solution and sub-tasks

  1. Fix "Make pipeline version scoped at organism level rather than global" (#1728)
  2. Add a backend endpoint /{organism}/register-new-pipeline-version that the pipeline must call before submitting data of a new version. The registered versions are stored in a "pipeline_version" table. If a pipeline calls the backend with a version that has not been registered, it gets an error. We can adapt our Nextclade-based pipeline to register a new version when it gets such an error message.
  3. We extend the "pipeline_version" table to also store lineage definitions. Lineage definitions are stored as a JSON with the lineage field names as keys and the definition file contents as values. When /{organism}/register-new-pipeline-version is called, the pipeline must provide the JSON with the lineage definitions if the organism has lineage columns. The Nextclade-based pipeline will be adapted to generate the lineage definitions file.
  4. The backend will verify the content of lineage fields and ensure that they are known.
  5. Adapt the get-released-data endpoint to return the pipeline version in the header.
  6. Add a new get-lineage-definitions?pipelineVersion=<version> endpoint that returns the lineage definition file for a particular version.
  7. Adapt the SILO import job to fetch the lineage definition file when needed (see the sketch below)
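A rough sketch of what this flow could look like on the wire (endpoint paths and payload shape follow the proposal above; nothing here exists yet, and the organism name, version, and backend URL are placeholders):

BACKEND="https://backend.example.org"

# steps 2+3: the pipeline registers a new version together with its lineage definitions
curl -X POST "$BACKEND/cchf/register-new-pipeline-version" \
  -H 'Content-Type: application/json' \
  -d '{"version": 2, "lineageDefinitions": {"lineage": "<contents of the definition file>"}}'

# step 6: the SILO import job fetches the definitions for the version it is processing
curl "$BACKEND/cchf/get-lineage-definitions?pipelineVersion=2" -o lineage_definitions.yaml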

@corneliusroemer
Contributor

To enable the lineage search in SILO, we need to give it a lineage definition file which must be complete (i.e., contain all lineages that are in the dataset).

Bit of an aside: For usability, SILO should have a tolerant mode for users who don't want it to be so strict: it would just ignore things it doesn't understand (still warn at import time but don't error, and simply treat lineages not present in the config file as null).

This validation has advantages as any invalid values could lead to misleading behavior to the user. Incorrect specification of lineage definition file might make the user assume that lineages are case-insensitive as there is no error when he inputs lower-case lineages. Also, absence of errors might make users not detect that a new lineage is in the data, without it being present in the tree definition.

There are warnings for a reason, as Postel’s Law suggests: “Be conservative in what you send, be liberal in what you accept” - allow users to fail, but don't force failure.


I don't see why the backend needs to be involved. Anyone can download the released data at any point and validate it against the config file; this doesn't need to be baked into the backend. It's a very specific type of validation that doesn't fit the general nature of the backend. That's a lot of extra code for no clear reason: just run a script and send a notification if you're worried about a mismatch.

There are so many ways one can shoot oneself in the foot; we shouldn't build in complicated safeguards just in case, at least not before we have seen this be a real problem. Let's solve the real problems we know we have first and not worry too much upfront about things that might happen in the future.


A couple more points:

  • Nothing against making the pipeline version organism-scoped, not a bad idea, but not urgent either as long as the data volume is manageable. Making this a prerequisite for lineage search feels overblown.
  • Nothing against emitting the pipeline version in the metadata, we've discussed this before.

The Nextclade-based pipeline will be adapted to generate the lineage definitions file.

The pipeline can't do this. The lineage definitions file is a config file, like the Nextclade dataset. It cannot in general be generated from a Nextclade dataset alone; it potentially requires extra knowledge about the tree topology and aliases that is not in Nextclade datasets.


To me this feature should be done like this:

  • Set up some simple code to generate the lineage definitions file for each pathogen with sub-lineages. This doesn't have to live in Loculus; it is ideally done over at SILO, like Nextclade datasets live with Nextclade, so everyone can use the same file. But we can contribute, of course. A simple script should take no more than a few hours per pathogen, ideally generated from a machine-readable source if possible so it can just run in CI or with minimal intervention: ~half a day
  • Pull those config files into the SILO config the way we pull reference sequences and Nextclade datasets, via a link/config preprocessor: ~30 min
  • Adapt the lineage UI to have radio buttons for monophyletic (wildcard, all children included) or paraphyletic (without children) search: ~half a day

If it turns out that some lineages present in the data are missing from the lineage config file: fix the config file, push, fixed; this shouldn't take more than 10 minutes when it happens.

I'm not sure how long your proposal would take to implement but I guess at least a week? What do you think?

@fhennig
Contributor

fhennig commented Nov 26, 2024

I don't yet know the architecture and domain well enough to have a strong opinion about how this should be implemented.

I'm missing the domain knowledge about the lineage definition specifics.

Chaoran said:

The Nextclade-based pipeline will be adapted to generate the lineage definitions file.

Then you said:

The pipeline can't do this. The lineage definitions file is a config file like the Nextclade dataset. It cannot in general be generated from a Nextclade dataset alone, it requires potentially extra knowledge about the tree topology and aliases that are not in Nextclade datasets.

And then you said:

Set up some simple code to generate lineage definitions file for each pathogen with sublineages

Can you maybe explain a bit more how these files would be generated, which extra knowledge is needed, why it is not available in the pipeline, and why the pipeline cannot be adapted?

@corneliusroemer
Contributor

The pipeline has no domain knowledge, and the Nextclade datasets don't contain the information. So the file has to be generated somehow, and its generation is specific to each lineage system, so it is definitely not something to do within Loculus, which is pathogen-agnostic.

One has to find out what the tree looks like for each lineage system. This is not something you would do; I can generate it for the pathogens we're looking at.

@chaoran-chen
Member Author

SILO should have a tolerant mode

I'm not generally against it, but I think it is important to think this through carefully: it is always possible to introduce such a feature later, when we have stronger arguments/use cases, but it is more difficult to remove it again.

Anyway, even if we had this feature in SILO, I don't think that we should use it in Loculus – at least not by default. A Loculus maintainer should need to know as little as possible about SILO and would likely not check the SILO logs/warnings closely, so we would face the challenge of making the SILO warnings visible, understandable, and actionable. Instead, I think that Loculus should avoid ever reaching the situation where an unknown lineage is sent to SILO.

Anyone can download get released data at any point and validate it vs the config file, this doesn't need to be baked into the backend.

Can you please explain in more detail how this could work? What should the maintainer do exactly?

It's a very specific type of validation that doesn't fit into the general nature of the backend.

I think that this is exactly the type of validation that the backend should be doing and is already doing. The backend, at the moment, is already performing all the validation (hopefully – unless we forgot something) needed to make sure that the data it returns in /get-released-data can be accepted by SILO without issues. It is checking that fields of type date are really formatted as dates, etc. Lineages are of course more sophisticated than dates, but if we want to implement sub-lineage queries as a Loculus feature, then the backend will also need to know about the lineages.

The pipeline can't do this

We are programming the pipeline, and we can develop it in a way that it can – for SARS-CoV-2, it could mean that it knows where the alias file lives. Or we can maintain a GitHub repository with the lineage definitions for major Nextclade datasets, as you suggested, and the pipeline downloads from it. The difference between our approaches is just the separation of concerns: I think that it makes a lot of sense for the pipeline to be responsible for providing the lineage definitions, as it is also the entity that either calls the lineages (in most cases, probably) or at least has to verify the lineages in some way (if users submit them).


I know that my approach will take longer, very rough estimate: one week for the main implementation and another for adding tests and documentation and getting reviews? But I believe that it is also the more reliable/stable and complete solution. Your solution could be enough as a Pathoplexus-specific solution: we could (somehow) overwrite/adjust the SILO config and overwrite the import_silo_job.sh to download the lineage file. Such a solution, however, wouldn't be a re-usable Loculus feature.

@theosanderson
Member

theosanderson commented Nov 26, 2024

I've spent a bit of time getting my head around this. At the moment, I find that something closer to Cornelius's idea makes sense to me.

As a slight enhancement, one could give the SILO import script an env var that maps from pipeline version to lineage definition file:

{1:"http://bla.com/versionxoflineagedefs.yaml", 2: "http://bla.com/versionyoflineagedefs.yaml"}

and it could use that to download the appropriate file depending on the pipeline version it received (given we planned to add a pipelineVersion column anyway).
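A sketch of how the import script might consume such a mapping (the env var name and the jq-based lookup are my assumptions, not something decided here):

# hypothetical env var holding the version-to-URL map from above
LINEAGE_DEFINITIONS='{"1":"http://bla.com/versionxoflineagedefs.yaml","2":"http://bla.com/versionyoflineagedefs.yaml"}'

# the version would come from get-released-data (e.g. the planned
# pipelineVersion column); hard-coded here for the sketch
PIPELINE_VERSION=2

URL=$(echo "$LINEAGE_DEFINITIONS" | jq -r --arg v "$PIPELINE_VERSION" '.[$v]')
curl -sf "$URL" -o lineage_definitions.yaml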

I think the idea that new pipeline versions can involve changes which aren't simply new versions of each datum ultimately applies beyond lineage definitions; it potentially applies to the whole schema. E.g., we're about to change the raw date column from date to string to accommodate ranges. (Or maybe we'll rename it, but that's a conceivable thing that might happen, and it would likely break SILO.) I think at some point in the future we may want basically the whole metadata schema to be versioned like the pipeline version, but we don't have the capacity to do that soon. That makes me not necessarily want to invest a lot in a specific solution for this specific issue.

Happy to discuss this on a call.

@chaoran-chen
Member Author

chaoran-chen commented Dec 1, 2024

I think that we are talking about two different issues here: one is about whether the backend should validate the output of the pipeline and the other one is about how to pass in the lineage definitions.

Discussion 1: validation

As said above, the backend has so far been validating all output provided by the pipeline insofar as it ensures that the output can be processed by SILO. This means that a misbehaving pipeline (hopefully) cannot cause SILO preprocessing to crash. I think that this is an important feature for maintainer-friendliness. We should try to minimize the likelihood that a Loculus instance breaks due to bugs in the pipeline; I remember that we discussed this quite a bit in August and wanted to improve further by addressing #2570.

Discussion 2: providing the lineage definitions

SILO preprocessing definitely needs to know the definitions, and in my opinion, as argued above, the backend should know them as well. So the question is how to give the definitions to them; the two ideas are (1) through the configuration via Values.yaml and (2) via a new backend endpoint to register new pipeline versions.

First, conceptually: what makes the most sense / where do the definitions "fit" best? I can see arguments for either. One can well argue that the lineage definitions are part of the schema: it is like changing the type of the "lineage" field from a string to an enum (and w.r.t. the backend validation (discussion 1), it really is just an enum, i.e., the backend will just check whether each value provided by the pipeline is one of the defined values) – and a schema seems like something that belongs in a configuration file.

On the other hand, the lineage definitions are very dynamic. The configuration of a server software (I'm thinking about software like PostgreSQL, GitLab, Discourse, …) usually doesn't need to be changed on a regular basis but the lineage definitions are expected to update regularly.

Regarding the implementation complexity: Both approaches have to achieve the same thing (i.e., provide the lineage definitions to the components that need them); the difference is just whether the logic for passing around the definitions lives in the software code or in the deployment code. I'm not sure whether the software-based solution really takes more effort. With the configuration approach, we would increase the complexity of the Helm templates, which are more error-prone and more difficult to write (unit) tests for.

Regarding maintainer-friendliness: Since lineage definition updates are expected to be a regular operation that maintainers will perform, and not just a one-off thing at the beginning when they set up their instance, we should try to make them maintainer-friendly.

An API endpoint will enable many approaches: If the lineage definitions don't update too frequently, a maintainer can just call the endpoint manually. In the future, we can also implement a graphical admin panel where the admin can drag & drop a lineage file. But the API endpoint will also allow easy automated approaches: for a pathogen like SARS-CoV-2, where the lineages update quite often, one might want to set up a fully automated system that, e.g., upon detecting a new Nextclade dataset, automatically downloads the current Pango lineage aliases along with the Nextclade dataset, generates the lineage definition file, and submits it. Automation is more difficult with the configuration approach, as it requires write access to the cluster.
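As a concrete sketch of that automation for SARS-CoV-2 (the generator script is made up; the endpoint and payload shape follow the proposal earlier in this thread):

# download the current Pango lineage aliases (a real file in the
# cov-lineages/pango-designation repository)
curl -sf "https://raw.githubusercontent.com/cov-lineages/pango-designation/master/pango_designation/alias_key.json" -o alias_key.json

# hypothetical helper that derives the lineage tree from the alias key
./generate_lineage_definitions alias_key.json > lineage_definitions.yaml

# submit via the proposed endpoint, JSON-encoding the file contents with jq
curl -X POST "$BACKEND/covid/register-new-pipeline-version" \
  -H 'Content-Type: application/json' \
  -d "{\"version\": 3, \"lineageDefinitions\": {\"lineage\": $(jq -Rs . < lineage_definitions.yaml)}}"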

Further, with the endpoint approach, it is easier to provide error messages when there is a problem with the provided lineage definitions: the endpoint can simply send a response with an HTTP error status and explanations in the body. With the config system, the error messages will only be visible somewhere in the logs, making them more difficult for a maintainer to find (and even harder to retrieve in an automated way).

The proposed {1:"http://bla.com/versionxoflineagedefs.yaml", 2: "http://bla.com/versionyoflineagedefs.yaml"} also doesn't seem ideal, as it requires the definition files to be uploaded to a different place first. We'd then also need to deal with the question of what should happen if a file is not reachable and how to inform the maintainer.

I'm also happy to discuss this more in a call.

@anna-parker
Contributor

It seems like the big disagreement here is whether the backend should validate the lineage definition files; however, our main goal is just to enable sub-lineage queries. As SILO now supports sub-lineage querying, the validation isn't required for the feature, so I would suggest we move forward with @theosanderson's suggestion for the moment and pass the lineage definitions in the config, with a link per pipeline version. Then we can add the sub-lineage search option to the website.

I personally think it would be overkill for the backend to additionally validate the lineage definitions or to require registering them. To me, lineage definition files are like Nextclade datasets: we use them without validation on our end and assume they are correct. However, we can come back to this after adding the search option to the website.

@chaoran-chen
Member Author

As SILO now supports sublineage querying the validation isn't required for the feature

What do you mean exactly? For me, the issue is that SILO will crash if the pipeline produces data with lineages that are not in the definitions file, and I believe that Loculus should avoid that.

@theosanderson
Member

theosanderson commented Dec 2, 2024

We are already in a place where SILO import can [and does] crash during version upgrades due to schema changes, which is why, for the moment, I think I am fairly willing to tolerate that (but I do acknowledge that this would add another new thing that could cause crashes).

@anna-parker
Contributor

Thanks @theosanderson, I was just trying to find the best way to formulate this.

I think we should try to move forward with the sub-lineage search feature and then think about adding sub-lineage validation. We also don't, for example, validate that Nextclade datasets are accurate; we assume the Nextclade datasets CLI does this. If we created another library to generate the lineage files (something @corneliusroemer suggested in #3074 (comment)), we could add a similar verification there.

@chaoran-chen
Member Author

We also don't for example validate that nextclade datasets are accurate

I don't think that this is really comparable, because we explicitly designed the architecture in a way that considers the pipeline an external component. It is natural that the backend doesn't validate the Nextclade dataset name, as it doesn't know about it, and other pipelines might not use Nextclade at all. For that reason, the backend also doesn't and shouldn't fully trust what the pipeline produces, but should validate that the data types are correct. (Obviously, the backend doesn't check whether the values are semantically correct, as this would be outside the scope of a database.)

We are already in a place where SILO import can [and does] crash during version upgrades due to schema changes, which is why I am for the moment I think fairly willing to tolerate that (but I do acknowedge that this would add another new thing that could cause crashes).

Schema changes can indeed currently cause downtime, but for me this is something we should try to avoid/improve going forward rather than creating more cases that can cause a crash. In particular, wouldn't you agree that, in practice, we expect lineage definition updates to occur much more often than breaking schema changes?

We talked earlier about wanting more stability, and implementing this ticket without validation would move us further away from that.

@theosanderson
Member

In particular, wouldn't you agree that in practice, we expect lineage definition updates to occur much more often than breaking schema changes?

In the next 6 months, potentially actually less often, but beyond that, hopefully yes

@corneliusroemer
Contributor

If you tie pipeline versions to the SILO lineage config and register them via a backend endpoint, then fixing a lineage config will require reprocessing everything, which is wasteful when one could just fix the lineage config and be done.


designed the architecture in a way that considers the pipeline as an external component

LAPIS is just as much of an external component

For that reason, the backend also doesn't and shouldn't fully trust what the pipeline produces but validate that the data types are correct.

That's why you validate things in SILO and error if things are wrong, like the backend rejects bad feedback.

@fengelniederhammer
Contributor

designed the architecture in a way that considers the pipeline as an external component

LAPIS is just as much of an external component

The difference is: we expect no one to plug in a different version of LAPIS. Loculus is designed in a way that maintainers should be able to use their own pipeline. That's what "external" refers to in this context.


@fhennig
Contributor

fhennig commented Dec 9, 2024

I had a chat with @chaoran-chen and this is what I got:

  • Make it possible to configure the lineage system for a metadata field. This also means configuring a lineage config file per processing pipeline version, for example:
metadata:
  - name: lineage
    lineageSystem:
      1: link
      2: link
  • These links should then be passed into the Pod that runs silo_import_job.sh. A JSON like this could be passed through an env var: {1:"http://bla.com/versionxoflineagedefs.yaml", 2: "http://bla.com/versionyoflineagedefs.yaml"}
  • In the job, the call to siloApi --processing needs to receive the preprocessing_config.json that then refers to the lineage.yaml.
  • UI - The website doesn't need to know about the lineage file, but it needs to know when a field is of type lineage.
    • A new search control for the lineage type is needed, allowing searches with "include sublineages" enabled. This would then modify the query that gets sent to LAPIS.
Example preprocessing_config.yaml:
ndjsonInputFilename: "sample.ndjson.zst"
lineageDefinitionsFilename: "lineage_definitions.yaml"
referenceGenomeFilename: "reference_genomes.json"
duckdbMemoryLimitInG: 150

Not sure whether all of these need to be given or not.

@theosanderson
Member

theosanderson commented Dec 12, 2024

An unspecified thing that I hadn't previously considered is what to do when getReleasedData returns no sequences, and thus no pipeline version. I'm not sure whether SILO demands a lineage file when there is no data. If so, we can (I presume) just supply an empty file.

7 participants