ftp: README review #2227

Open
ValWood opened this issue Sep 3, 2024 · 38 comments

@ValWood
Member

ValWood commented Sep 3, 2024

subtask from:
pombase/pombase-chado#720
#2058

@ValWood
Member Author

ValWood commented Sep 3, 2024

  1. @kimrutherford add file names of new structure to this document
    https://docs.google.com/document/d/1TfvWngsI2U9-wkw2czxhHOZNmmQ8nxRwYa-TctrhMs0/edit

  2. @PCarme write READMEs (a lot of the info will be on the downloads website, so decide how much detail is required and add referring URLs). If you don't know what something is, tag me in the doc

  3. @ValWood / @kimrutherford to review and fill in missing parts

  4. @PCarme to copy into place in Git

@kimrutherford
Member

add file names of new structure to this document

I can do that but does it make sense to duplicate what is already in Git? Can't we edit the README text directly rather than copy into a Google doc and then back to the files in Git?

The READMEs are here:
https://github.com/pombase/pombase-scripts/tree/main/release_readme_files

There is one README file for each of the directories in the new structure:
https://www.pombase.org/public_releases/pombase-2024-06-01/

@ValWood
Member Author

ValWood commented Sep 4, 2024

good point!

@PCarme
Contributor

PCarme commented Sep 4, 2024

The READMEs are here:
https://github.com/pombase/pombase-scripts/tree/main/release_readme_files

Okay, thanks Kim! I'll review the READMEs in there, and let you know when I'm done.

@PCarme
Contributor

PCarme commented Sep 4, 2024

In https://github.com/pombase/pombase-scripts/blob/main/release_readme_files/exports_for_external_resources-README.txt, I have listed the files in the directory, but I don't really know what each of those corresponds to.

@kimrutherford
Member

I have listed the files in the directory, but I don't really know what each of those corresponds to.

Thanks Pascal. I'll work on that one.

@PCarme
Contributor

PCarme commented Sep 4, 2024

The "genome_sequence_and_features" directory contains several subdirectories. Should there be a README for each subdirectory, or a single README describing the content of all the subdirectories?

@ValWood
Member Author

ValWood commented Sep 4, 2024

The contents are quite diverse so I think each directory needs a README

@PCarme
Contributor

PCarme commented Sep 4, 2024

Also, this file https://www.pombase.org/public_releases/pombase-2024-06-01/protein_features/transmembrane_domain_coords_and_seqs.tsv displays the entire sequence of each protein, not just the transmembrane domain sequences. Is that intended?

@ValWood
Member Author

ValWood commented Sep 4, 2024

It says coordinates and sequences, but it seems strange to put them together...

Maybe this wasn't a file for the public?

@kimrutherford ?

@kimrutherford
Member

It says coordinates and sequences, but it seems strange to put them together...
Maybe this wasn't a file for the public?

This is all I can find about it:

@kimrutherford
Member

This is all I can find about it:

I dug into my old email. This is from Snezhka. The thread is from April 2019, with the subject "transmembrane domains":


Hope everything is well - writing now to bug you with a question, sorry... Wonder if there is a way to, say, 'automatically' collect all transmembrane domains from all proteins. What I want to do is to compare the transmembrane domains (e.g. length distribution, unusual amino acids) in S. pombe to those in S. japonicus. Ideally so that I could do it separately for single spanners vs multispanners.


The file was created for Snezhka but it's updated nightly. Perhaps we don't need it in the new release directories?

@ValWood
Member Author

ValWood commented Sep 5, 2024

Perhaps we don't need it in the new release directories?

agree, it's a bit random

@kimrutherford
Member

agree, it's a bit random

OK, I've removed that file from the script that creates the new release directory structure.

@kimrutherford
Member

The contents are quite diverse so I think each directory needs a README

I've added empty READMEs and checked that the script can process README files for sub-directories correctly.
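The README file names that appear in this thread suggest a simple mapping from the source files in release_readme_files to their place in a release directory. The following is a minimal sketch of that mapping, not the actual release script: the function name `release_readme_path` is hypothetical, the naming pattern is inferred only from the example file names quoted in this issue, and how sub-directory READMEs are named in the source directory is not covered.

```python
from pathlib import PurePosixPath

# Suffix used by the source README files, e.g.
# "exports_for_external_resources-README.txt" (inferred from this thread).
README_SUFFIX = "-README.txt"


def release_readme_path(source_name: str) -> PurePosixPath:
    """Map a source README file name to its assumed path in a release directory."""
    if not source_name.endswith(README_SUFFIX):
        raise ValueError(f"not a README source file: {source_name}")
    directory = source_name[: -len(README_SUFFIX)]
    # Assumed pattern: <directory>/PomBase_<directory>_README.txt
    return PurePosixPath(directory) / f"PomBase_{directory}_README.txt"


print(release_readme_path("exports_for_external_resources-README.txt"))
# exports_for_external_resources/PomBase_exports_for_external_resources_README.txt
```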

@PCarme
Contributor

PCarme commented Sep 5, 2024

@PCarme
Contributor

PCarme commented Sep 5, 2024

@ValWood ValWood closed this as completed Sep 5, 2024
@ValWood ValWood reopened this Sep 5, 2024
@ValWood
Member Author

ValWood commented Sep 5, 2024

There is a file with introns in CDS only (more important that we have these annotated), and one with CDS+UTRs
(we started adding UTR introns later, and we definitely don't have them all)

@PCarme
Contributor

PCarme commented Sep 5, 2024

Oh right! I hadn't thought about the UTR introns, so it makes sense. Thanks!

@kimrutherford
Member

Also, this file isn't loaded properly https://www.pombase.org/public_releases/pombase-2024-06-01/genome_sequence_and_features/gff_format/Schizosaccharomyces_pombe_all_chromosomes_unstranded.gff3

I think that's OK. The file is empty because we don't have any unstranded features. Maybe we did have some years ago. I think it's best to remove it to prevent confusion.
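One way to catch files like this before they reach a release is to count the feature lines in each GFF3 file and skip any file that has none. This is only a sketch under assumptions: the helper name `gff3_feature_count` is hypothetical, the example feature line is invented for illustration, and it is not known from this thread whether the unstranded file was zero bytes or header-only (both cases count as empty here).

```python
def gff3_feature_count(text: str) -> int:
    """Count GFF3 feature lines in the given file contents.

    Blank lines, '#' comments and '##' directives are ignored. A feature
    line is taken to be any other line with nine tab-separated columns.
    """
    count = 0
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        if len(line.split("\t")) == 9:
            count += 1
    return count


header_only = "##gff-version 3\n"
# Hypothetical feature line, for illustration only.
with_feature = header_only + "I\tPomBase\tgene\t1\t100\t.\t+\t.\tID=example_gene\n"

print(gff3_feature_count(header_only))   # 0
print(gff3_feature_count(with_feature))  # 1
```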

@PCarme
Contributor

PCarme commented Sep 6, 2024

I'm done writing the READMEs by the way.

@kimrutherford
Member

I'm done writing the READMEs by the way.

Excellent. Thanks!

I haven't completed exports_for_external_resources-README.txt yet. Once I have, I'll make an example releases directory for 2024-09-01 so we can see if there is anything else needed.

@kimrutherford
Member

Here's how the structure looks with the new READMEs and the latest release:
https://www.pombase.org/public_releases/pombase-2024-09-01/

We currently have the GPI/GPAD files for GO in this directory:
https://www.pombase.org/public_releases/pombase-2024-09-01/exports_for_external_resources/

Maybe they should be in the gene_ontology directory? It could be a sub-directory.

@kimrutherford
Member

I've moved the allele_summaries.json file from exports_for_external_resources to the training_data_for_ML_and_AI directory since that's what it was created for (I think). There's nothing stopping us having files in more than one place so we could have a copy in exports_for_external_resources if it makes sense.

@kimrutherford
Member

I haven't completed exports_for_external_resources-README.txt yet.

I've done that now:
https://www.pombase.org/public_releases/pombase-2024-09-01/exports_for_external_resources/PomBase_exports_for_external_resources_README.txt

As an experiment, the format is a bit different from the other READMEs. Let me know if you think it's better or worse.

Once I have, I'll make an example releases directory for 2024-09-01 so we can see if there is anything else needed.

I've done that too. Perhaps we can have a chat about it once we're all back from holiday.

https://www.pombase.org/public_releases/pombase-2024-09-01

@ValWood
Member Author

ValWood commented Sep 11, 2024

I agree it makes sense to have the official GO release in the GO directory

@kimrutherford
Member

I've moved the GPI/GPAD files into the gene_ontology directory.
https://www.pombase.org/public_releases/pombase-2024-09-01/

@kimrutherford
Member

I think we decided to include all the gene expression results in the output file?

So should we remove these lines from the README?:

This file currently contains the RNA level and protein level
quantification data from:
 - Marguerat et al., 2012, Cell (PMID:23101633)
 - Carpy et al., 2014, Mol. Cell. Proteomics (PMID:24763107)

@ValWood
Member Author

ValWood commented Jan 29, 2025

So should we remove these lines from the README?:

yes go ahead

@kimrutherford
Member

New version of gene expression README:

https://www.pombase.org/public_releases/pombase-2024-12-01/gene_expression/PomBase_gene_expression_README.txt

@kimrutherford
Member

Here's the most recent release in the new directory structure, with READMEs:
https://www.pombase.org/public_releases/pombase-2025-02-01/

@kimrutherford
Member

kimrutherford commented Feb 5, 2025

I'm adding "For use of this dataset please cite: ..." to all the READMEs.

For the GO data I also added:

and the Gene Ontology Consortium:
  https://geneontology.org/docs/go-citation-policy/

https://www.pombase.org/monthly_releases/pombase-2025-02-01/gene_ontology/PomBase_gene_ontology_README.txt

For the disease dataset I was thinking of adding a link to: https://monarchinitiative.org/cite

Are there any other datasets or data types where we should do something similar?

https://www.pombase.org/public_releases/pombase-2025-02-01/
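Adding the same citation footer to every README, with extra lines for particular datasets, could look something like the sketch below. The function `add_citation` is hypothetical and is not taken from the release scripts. The citation text itself is left as "..." because it is not spelled out in this thread.

```python
def add_citation(readme_text: str, citation: str,
                 extra_lines: tuple[str, ...] = ()) -> str:
    """Return readme_text with a citation footer appended.

    extra_lines holds any dataset-specific additions, such as the
    Gene Ontology Consortium citation policy link for the GO README.
    """
    footer = ["For use of this dataset please cite: " + citation]
    footer.extend(extra_lines)
    return readme_text.rstrip("\n") + "\n\n" + "\n".join(footer) + "\n"


# Extra lines for the GO data, as quoted in the comment above.
go_extra = (
    "and the Gene Ontology Consortium:",
    "  https://geneontology.org/docs/go-citation-policy/",
)

# "..." stands in for the citation text, which is elided in this thread.
print(add_citation("Example README body.\n", "...", go_extra))
```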

@kimrutherford
Member

kimrutherford commented Feb 5, 2025

I've changed the web server configuration so that the README contents are shown after the list of files:

https://www.pombase.org/monthly_releases/pombase-2025-02-01/exports_for_external_resources/

@ValWood
Member Author

ValWood commented Feb 5, 2025

For disease should include
https://www.medrxiv.org/content/10.1101/2022.04.13.22273750v3

@kimrutherford
Member

kimrutherford commented Feb 5, 2025

@kimrutherford
Member

I'm experimenting with grouping the releases by year for tidiness:
https://www.pombase.org/monthly_releases/
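As a rough illustration of the grouping, release directory names of the form pombase-YYYY-MM-DD can be bucketed by their year component. This is a sketch, not the code behind the monthly_releases page: the function `group_by_year` is hypothetical, and the actual on-disk layout of the year groups is not described in this thread.

```python
import re
from collections import defaultdict

# Release directories in this thread are named like "pombase-2024-06-01".
RELEASE_RE = re.compile(r"^pombase-(\d{4})-(\d{2})-(\d{2})$")


def group_by_year(names):
    """Group release directory names by year, ignoring names that don't match."""
    groups = defaultdict(list)
    for name in names:
        match = RELEASE_RE.match(name)
        if match:
            groups[match.group(1)].append(name)
    # Sort years, and releases within each year, for a stable listing.
    return {year: sorted(releases) for year, releases in sorted(groups.items())}


releases = [
    "pombase-2025-02-01",
    "pombase-2024-06-01",
    "pombase-2024-12-01",
    "pombase-2024-09-01",
]
for year, names in group_by_year(releases).items():
    print(year, names)
# 2024 ['pombase-2024-06-01', 'pombase-2024-09-01', 'pombase-2024-12-01']
# 2025 ['pombase-2025-02-01']
```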

@kimrutherford
Member

I've now copied all the data files and READMEs to the new structure:
https://www.pombase.org/monthly_releases/

We could do with adding a short README in that directory.

The older monthly releases are missing some files. An example is the disease association file, which we only started exporting late in 2021.

This link will redirect to the most recent monthly release directory:
https://www.pombase.org/latest_release/

We can also link to a particular file in the latest release like this:
https://www.pombase.org/latest_release/gene_ontology/cc_go_slim_terms.tsv
which will redirect to the most recent version.
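For anyone scripting against these links, a small sketch of how a client might build a latest_release URL and find out which release it points to. The helper names are hypothetical, and the sketch assumes the redirect is a standard HTTP redirect, which `urllib.request.urlopen` follows automatically.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

LATEST_RELEASE = "https://www.pombase.org/latest_release/"


def latest_release_url(relative_path: str) -> str:
    """Build the latest_release URL for a file path within a release."""
    return urljoin(LATEST_RELEASE, relative_path.lstrip("/"))


def resolve_redirect(url: str, timeout: float = 30.0) -> str:
    """Follow redirects and return the final URL (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


url = latest_release_url("gene_ontology/cc_go_slim_terms.tsv")
print(url)
# https://www.pombase.org/latest_release/gene_ontology/cc_go_slim_terms.tsv

# With network access, resolve_redirect(url) should return the URL of the
# same file inside the most recent monthly release directory.
```

Recording the resolved URL alongside any downloaded file keeps an analysis tied to a specific release, even though the link it was fetched from keeps moving.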

So perhaps a good next step would be to work through the documentation and link to the "latest_release" files where possible.
Then we can rewrite the main Datasets documentation page to talk about the monthly releases and to link to the latest release.

@kimrutherford
Member

I've now copied all the data files and READMEs to the new structure:

I forgot to say, we only started doing monthly releases in 2019. We have older releases, but those have fewer exported file types and were made less often than monthly. The older releases are still available here if a user needs an old file:
https://www.pombase.org/releases/

For GO data we also have this collection of GAF files that go back to 2001 (although there are some gaps):
https://github.com/pombase/pombase-historic-go-stats/tree/main/raw_data
https://github.com/pombase/pombase-historic-go-stats/tree/main/extra_pombase_data


3 participants