SGD - Noctua migration #34

Closed
pgaudet opened this issue Oct 26, 2022 · 24 comments

pgaudet commented Oct 26, 2022

Project link

https://github.com/orgs/geneontology/projects/61

Project description

Tasks needed to complete the migration of manual SGD annotations from Protein2GO to Noctua for full adoption of Noctua as SGD GO curation tool.

PI

Mike

Project owner (PO)

@suzialeksander

Technical lead (TL)

@dustine32

Other personnel (OP)

TBD

Technical specs

Using current system

Other comments

Code changes will likely be in https://github.com/biolink/ontobio; tickets about bugs/code will be here.

SGD models will be isolated in https://github.com/geneontology/sgd-go-cams; tickets about project progress will be here.

@pgaudet pgaudet moved this from Hopper to Priority in Project Metadata Overview Oct 26, 2022
@pgaudet pgaudet moved this from Priority to Active in Project Metadata Overview Oct 26, 2022
@kltm kltm added Needs LA approval, Needs PM approval, Needs tech doc and removed Issue incomplete labels Oct 27, 2022
@pgaudet pgaudet removed Needs LA approval, Needs PM approval labels Nov 22, 2022
@kltm kltm added Ready and removed Needs tech doc labels Dec 6, 2022

pgaudet commented Feb 1, 2023

Update on managers call:

  • converting UniProt IDs to SGD
  • some non-yeast data in there needs to be kept in P2GO
  • some annotations to non-canonical yeast strains


pgaudet commented Mar 22, 2023

Was delayed because of GPI issue (now fixed)


pgaudet commented May 24, 2023

Managers call: @suzialeksander says data not yet loaded


pgaudet commented Jul 24, 2023

@suzialeksander Can you add the current state of the project here please? i.e., remaining work and planned delivery date

Thanks, Pascale


suzialeksander commented Jul 26, 2023

These models are in Noctua Dev, and SGD is testing. Hopefully, if no major issues, we can ask to have these loaded to prod quite soon (couple weeks?). Other remaining work includes switching pipelines at SGD/GO to not scoop QuickGO anymore and make sure we aren't eating tails anywhere.


pgaudet commented Aug 30, 2023

  • SGD is testing the models on dev
  • so far models look OK


srengel commented Dec 13, 2023

SGD is ready for this project to move forward.

steps to get into prod:

  1. fresh dump from Alex of SGD P2GO content, then delete the annotations at P2GO or "freeze" by marking as archived (so that they are not used by anyone since once loaded into Noctua, the P2GO annotations with source=SGD will now be duplicates of what is in Noctua)
  2. Dustin (?) load the fresh set of annotations into Noctua prod
  3. at SGD, we will continue our pipeline 'as is' to gather annotations and generate necessary files. (only difference is that the P2GO load into SGD pipeline will no longer have any source=SGD annotations.)


pgaudet commented Dec 14, 2023

Thanks @srengel !

Just to check:

  • you will keep on loading manual annotations from P2GO from other sources? (or have these always been excluded from the SGD loads, if I recall correctly?)
  • are the IEAs coming from the GOA pipeline, or are they run locally at SGD?
    Thanks, Pascale


srengel commented Dec 14, 2023

Hi @pgaudet

  • yes we will still load all the same stuff we load now from both P2GO and Noctua. only difference is that, after the migration of SGD annotations from P2GO to Noctua, all source=SGD will only come from Noctua.
  • we don't run IEAs at SGD, we only pick up the ones from the GOA pipeline

FYI, we currently have GO annotations in SGD from these sources:
GO_Central
GOC
InterPro
RNAcentral
SGD
UniProt
ComplexPortal
RHEA

@suzialeksander

Update, SGD is ready for a new file. Pending production of the new file by @alexsign, we should be able to load a new file next week with help from @dustine32. There are a few pipeline tweaks and data checks for both P2GO and SGD after that.


suzialeksander commented Jan 23, 2024

There is some documentation at https://docs.google.com/document/d/1PZH2SiyF9FJhvW_M_cr3GlReSZfvbj96AkFHs9DD6Qc. To expand the process:

  1. Alex generated a GPAD of all annotations with source=SGD.
  • This GPAD needs UniProtKBs converted into SGDIDs in the first column, done by @dustine32. There are a handful of non-yeast entities that should remain in P2GO; these seem to be old annotations used for ISS, etc. The SGD GPI did not contain all of these external IDs, so mapping files from the SGD database, plus UniProt Mappings, were used to resolve the remaining UniProtKB, ComplexPortal, and RNAcentral IDs.
  • There are external IDs in the with/from as well as extensions.
  • Not all contributors have ORCIDs.
  1. Load this file into Noctua
  2. SGD curators compare the Noctua load (in Dev or Prod) to the annotations in P2GO. Filter in Noctua by Title: SGD:*
  3. P2GO checks that no annotations were lost, and the few they needed to retain (not to Sc) are still in P2GO.
  4. P2GO marks all Sc annotations assigned_by:SGD as status=archived
  5. SGD moves the Noctua GPAD as the priority load at SGD, P2GO continues to produce the "remainders" file of IEAs, non-SGD sources (ComplexPortal, etc.)
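
The ID conversion described in the first step can be sketched roughly as follows. This is only an illustration, not the script that was actually used: it assumes tab-separated rows whose first column holds a prefixed ID, and the function name, the mapping dictionary, and the example IDs are all hypothetical placeholders.

```python
def convert_first_column(rows, id_map):
    """Rewrite the first column of tab-separated rows using id_map.

    Returns (converted, unmapped). Header lines starting with '!' and
    blank lines pass through unchanged. Rows whose ID has no mapping
    are set aside rather than guessed at (for example, non-yeast
    entities that should stay in P2GO).
    """
    converted, unmapped = [], []
    for raw in rows:
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            converted.append(line)
            continue
        cols = line.split("\t")
        new_id = id_map.get(cols[0])
        if new_id is None:
            unmapped.append(line)
        else:
            converted.append("\t".join([new_id] + cols[1:]))
    return converted, unmapped


# Hypothetical mapping and rows, for illustration only.
example_map = {"UniProtKB:P00001": "SGD:S000000001"}
example_rows = [
    "!header line",
    "UniProtKB:P00001\tenables\tGO:0000001",
    "UniProtKB:Q99999\tenables\tGO:0000002",
]
example_converted, example_unmapped = convert_first_column(example_rows, example_map)
```

Keeping the unmapped rows in a separate list makes it easy to review which entities still need a mapping file or should remain in P2GO.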

in progress


kltm commented Jan 23, 2024

@suzialeksander As we get closer to the end of this, it would be good to work out a final timeline for these steps to make sure that we're not causing any double-ups or gaps anywhere.

@vanaukenk It might also be interesting to see what the profile of this import set is when viewed through the lens of @balhoff 's recent tooling for geneontology/go-shapes#306 . It might help contextualize some choices before final commitments are made.


suzialeksander commented Jan 30, 2024

Final (?) planning call today: agenda in Shared Drive

Slides with current yeast dataflow, and nearly identical flow post-project

Next steps:

Feb 1:

Upon successful snapshot containing above models:

  • notify @alexsign so they can test Noctua load at GOA
  • trigger next GO Release candidate

@suzialeksander

Also, SGD is waiting for the remainders file that @alexsign is working on.


suzialeksander commented Feb 2, 2024

After the outage, the models seem to have landed as intended. Success!!

However, a tiny issue emerged during spot checking: two curators were assigned the same ORCID when converting the files for loading, ~647 models out of the 7075 loaded.

Next steps:

  • Noctua curation can proceed/restart, as SGD was on a curation freeze. However, SGD will refrain from editing models that have the contributor "https://orcid.org/0000-0003-3166-4638". This includes not using "model copy" on these models.

  • Flush all models as usual next outage (8 Feb)

  • @dustine32 will sed to today's date (Feb 1)

  • hopefully see ~647 models changed


alexsign commented Feb 2, 2024

@suzialeksander remainders file available now. same name, same place. please take a look and let me know if it's good.


kltm commented Feb 2, 2024

Noting that #34 (comment) has changed given recent discussions: we will essentially be doing a full clobber with the expectation that we're essentially doing a re-run of yesterday (as SGD is still in their curation freeze).


suzialeksander commented Feb 9, 2024

Update from the 8 Feb Noctua outage/load:

Spot checking has revealed some extraneous inferred annotations, specifically "reproductive process" from

cellular response to pheromone PMID:12446563 IMP 20231004 SGD part_of(conjugation with cellular fusion)

The immediate actions are:

  • @alexsign notified to not import Snapshot yet, even if there is a successful run.
  • @suzialeksander will notify him when it's safe to start that import.
  • SGD models will remain in Noctua and are available for spot checking by SGD
  • @dustine32 will look into diff-ing GPADs, when feasible, to see how much impact these extra annotations might have. For SGD, it looks like there are roughly 6k inferred annotations on top of a load of 50k.
  • Call on 9 Feb, time TBD, where @kltm @dustine32 @vanaukenk @suzialeksander and other vital parties- @balhoff, etc. will discuss next steps:
  • one proposal: New rule- if reference, EC are the same, Noctua will be prevented from releasing an annotation to an ancestor GO into the output
  • Some groups (MGI) do want these, others (SGD) do not
  • Different MODs get different rules internally? Hard to implement.
  • All annotations treated the same and all groups' annotations treated the same- somebody loses
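
For discussion, the proposed rule (same reference and same EC, so suppress the annotation to the ancestor term) could look something like the sketch below. It is not how Noctua works today: the tuple layout, the function name, and the `ancestors` lookup are assumptions made for illustration, and in practice the ancestor sets would have to come from the ontology.

```python
def drop_redundant_ancestors(annotations, ancestors):
    """Drop annotations made redundant by a more specific one.

    annotations: list of (gene, term, reference, evidence) tuples.
    ancestors: dict mapping a term to the set of its ancestor terms
               (hypothetical input, derived from the ontology).

    An annotation is dropped when the same gene has another annotation
    with the same reference and evidence code whose term is a
    descendant of this annotation's term.
    """
    annotations = list(annotations)
    terms_by_key = {}
    for gene, term, ref, ev in annotations:
        terms_by_key.setdefault((gene, ref, ev), set()).add(term)

    kept = []
    for gene, term, ref, ev in annotations:
        others = terms_by_key[(gene, ref, ev)] - {term}
        redundant = any(term in ancestors.get(other, set()) for other in others)
        if not redundant:
            kept.append((gene, term, ref, ev))
    return kept
```

Because the rule is a plain filter over the output, it could in principle be switched on per group, although that is exactly the "different MODs get different rules" option noted above as hard to implement.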

@vanaukenk

Just to clarify in advance of today's call - I can't speak directly for them, but I doubt that MGI would also want redundant ancestor/child annotations with the same evidence code from the same paper.

There may, however, be other inferred annotations that they do want.

One other option we've considered for the GPAD output is to give inferred annotations their own evidence code so they could more readily be filtered if groups do not want them.

That said, it would still be nice to create useful inferences wherever possible.

@suzialeksander

After today's call, @kltm and @cmungall will look into diff'ing the terms and seeing if adding do_not_annotate or similar tags on terms will help, as it's likely most of the inferred annotations are to a handful of terms with a long tail.

Managers agreed that dealing with inferred annotations is really a separate project from the import, and further work in this new project would include giving these inferred annotations a more accurate EC than implying the curator made these inferred annotations directly. Inferred annotation situation is analogous to when GO inferred BP-MF annotations, then backtracked.
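
A rough sketch of the kind of diff and per-term tally discussed here, to check whether the extra annotations really do concentrate on a handful of terms. It assumes annotations have already been parsed into (gene, term, reference, evidence) tuples; the function names and the tuple layout are hypothetical, and no real GPAD parsing is shown.

```python
from collections import Counter


def extra_annotations(before, after):
    """Return annotations present in `after` but not in `before`."""
    return set(after) - set(before)


def tally_by_term(annotations):
    """Count annotations per term, most frequent first."""
    counts = Counter(term for _gene, term, _ref, _ev in annotations)
    return counts.most_common()
```

Running `tally_by_term(extra_annotations(loaded, exported))` on the loaded and exported sets would give a ranked list of the terms that receive the most inferred annotations, which is where a tag such as do_not_annotate would have the largest effect.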


kltm commented Feb 14, 2024

Noting that the diff/exploration has a ticket here: geneontology/noctua-models#271

@suzialeksander

After spot checking some models, there are ShEx violations in several models: incorrect relations for the terms, etc. Waiting on the violation report to see a full list, but the ones that have come up are individually fixable so far.

As for the inferences, it seems these might be fixable through ontology improvements.

Still waiting on a release to make sure the entire cycle SGD-GO works, but starting to test the snapshot that just came out.

@suzialeksander

Discussing with @pgaudet
The GO part of this work is now finished.
Remaining tickets are work for the SGD curators to fix annotations.
We will close this project, and open a new one specifically for SGD tasks & cleanup of annotations that are in Noctua

@github-project-automation github-project-automation bot moved this from Active to Complete - 2023 and earlier in Project Metadata Overview Mar 12, 2024
@pgaudet pgaudet moved this from Complete - 2023 and earlier to Complete - 2024 in Project Metadata Overview Mar 13, 2024

srengel commented Apr 1, 2024

@alexsign Please start Noctua import. After it’s done, please cross check annotations and delete old ones. Then please make NoctuaSGD public. (this is our understanding of remaining steps for this project. please correct us if this is wrong.)

for reference, this is Suzi's email from last Thursday, Mar 28:

Hi Alex,

Thanks for these files. We've looked at them, specifically the P2GO_not_in_Noctua, and it looks like these are left out mostly due to being not yeast, or not in the protein-centric world (this is expected, lots of RNAs and such). The latest Snapshot is 2024-03-21, and Pascale and I cleared it for SGD annotations this week, although it doesn't have a lot of our latest edits to save annotations that failed the import. I think everything looks good for you to proceed with Step 4, deleting SGD data from P2GO & make NoctuaSGD public.

Thanks
Suzi

Status: Complete - 2024
6 participants