tier1 to dcp journey #1324

Open
arschat opened this issue Nov 12, 2024 · 1 comment
arschat commented Nov 12, 2024

To ingest Tier2 metadata into our ingest platform, we first need to convert the existing Tier1 metadata to the DCP schema. If a previously wrangled submission exists, we then compare the two spreadsheets and show the differences between them.
Then we can decide which fields should be updated before appending the Tier2 fields to the submission, matched on donor_id or sample_id.
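
As a rough illustration of the comparison, here is a minimal sketch in plain Python. It assumes each spreadsheet tab has already been loaded as a list of row dictionaries and that both tabs share an ID column; the function name `compare_tab`, the report layout, and the column names in the usage example are hypothetical and are not taken from `compare_with_dcp.py`.

```python
from typing import Dict, List

Row = Dict[str, str]


def compare_tab(tier1: List[Row], wrangled: List[Row], id_field: str) -> dict:
    """Compare one entity tab from two spreadsheets, keyed on id_field.

    Hypothetical helper: reports entity counts, IDs present on only one
    side, and differing values for fields that both sides define.
    """
    t1 = {row[id_field]: row for row in tier1}
    wr = {row[id_field]: row for row in wrangled}

    report = {
        "count": (len(t1), len(wr)),                       # entities per side
        "only_in_tier1": sorted(t1.keys() - wr.keys()),    # IDs missing from wrangled
        "only_in_wrangled": sorted(wr.keys() - t1.keys()), # IDs missing from Tier1
        "field_diffs": [],                                 # (id, field, tier1, wrangled)
    }
    # Only compare fields that exist on both sides for IDs that match.
    for entity_id in sorted(t1.keys() & wr.keys()):
        shared_fields = t1[entity_id].keys() & wr[entity_id].keys()
        for field in sorted(shared_fields):
            a, b = t1[entity_id][field], wr[entity_id][field]
            if a != b:
                report["field_diffs"].append((entity_id, field, a, b))
    return report


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    tier1_rows = [{"biomaterial_id": "d1", "sex": "female"}]
    wrangled_rows = [{"biomaterial_id": "d1", "sex": "male"}]
    print(compare_tab(tier1_rows, wrangled_rows, "biomaterial_id"))
```

The report only lists discrepancies. Deciding which side is correct is still left to manual curation.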

flowchart TD
    A[CxG <a href='https://cellxgene.cziscience.com/collections'> public database</a>]
    B[Tier 1 - csv]
    C[Tier 1 - dcp spreadsheet]
    D[Tier 1 - dcp validated]
    E[previously wrangled - spreadsheet] 
    F[Tier 2 - xlsx/ csv]
    G[Full metadata - dcp spreadsheet]
    
    A --> |<a href='https://github.com/ebi-ait/hca-tier1-to-dcp/blob/main/cellxgene_metadata_collection.py'>collection</a>| B
    B --> |<a href='https://github.com/ebi-ait/hca-tier1-to-dcp/blob/main/convert_to_dcp.py'>DCP mapping</a>| C
    C --> |If not previously wrangled| D
    C --> |If previously wrangled| E
    E --> |<a href='https://github.com/ebi-ait/hca-tier1-to-dcp/blob/main/compare_with_dcp.py'>Compare</a>| D
    D --> G
    F --> D
    
    subgraph collection
    A
    end
    subgraph conversion
    B
    C
    end
    subgraph comparison
    E
    end
    subgraph addition
    F
    G
    end

We can split this process into the following steps.

  1. create a DCP-ingestible spreadsheet
    a. pull Tier1 metadata from the CxG API as csv
    b. convert the Tier1 csv to a DCP spreadsheet
  2. compare the two spreadsheets
    a. compare by number of entities
    b. compare per biomaterial_id
    c. compare all other field values
    d. if there are discrepancies, curate manually to decide how to update the submission
  3. append Tier2 to "Tier1"
    a. following the Tier2 to DCP mapping, join all Tier2 metadata on donor_id & sample_id
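
Step 3 could look roughly like the sketch below, written in plain Python. It assumes the DCP rows and the Tier2 rows are lists of dictionaries that share a key column such as donor_id, and that the Tier2 to DCP mapping is a simple dictionary from Tier2 column name to DCP column name. The function name `append_tier2` and the field names in the usage example are placeholders, not the real mapping.

```python
from typing import Dict, List

Row = Dict[str, str]


def append_tier2(dcp_rows: List[Row], tier2_rows: List[Row],
                 key: str, mapping: Dict[str, str]) -> List[Row]:
    """Left-join Tier2 values onto DCP rows that share `key`.

    Hypothetical helper: `mapping` maps Tier2 column names to DCP column
    names. DCP rows without a Tier2 match are returned unchanged.
    """
    tier2_by_key = {row[key]: row for row in tier2_rows}
    merged = []
    for row in dcp_rows:
        out = dict(row)  # copy, so the input rows are not mutated
        extra = tier2_by_key.get(row.get(key))
        if extra is not None:
            for tier2_field, dcp_field in mapping.items():
                if tier2_field in extra:
                    out[dcp_field] = extra[tier2_field]
        merged.append(out)
    return merged


if __name__ == "__main__":
    # Placeholder field names, for illustration only.
    dcp = [{"donor_id": "d1", "sex": "female"}]
    tier2 = [{"donor_id": "d1", "example_tier2_field": "yes"}]
    print(append_tier2(dcp, tier2, "donor_id",
                       {"example_tier2_field": "example.dcp_field"}))
```

The same function can be called a second time with sample_id as the key for sample-level Tier2 fields.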
@arschat arschat changed the title tier1 to dcp validator tier1 to dcp journey Nov 12, 2024
@arschat arschat added the HCA label Nov 12, 2024
arschat commented Nov 13, 2024

Repo for step 1 created:

Previous experiment here.
