Improving information about PGS Catalog scoring file versions #348

Open
mglev1n opened this issue Jul 31, 2024 · 1 comment
Labels
enhancement (New feature or request), user-query (User queries & requests)

Comments


mglev1n commented Jul 31, 2024

Description of feature

It would be great if the PGS Catalog had some form of version control, such that its state could be reconstructed as of a given date. At minimum, publishing a running change log (e.g. when a change occurred, what was changed, and why) would be extremely useful. Apologies if this feature already exists; if it does, making it more prominent would be great. A longer-term goal might be to allow users of pgsc_calc to request scores based on a given version/release of the PGS Catalog.

Motivation

My lab and collaborators have noticed that executing the exact same pgsc_calc command that pulls scores from the PGS Catalog has resulted in different output when run on different days. In troubleshooting, we noticed a few issues:

  • In one case, the #trait_efo assigned to a scorefile changed from one day to the next.

  • In two other cases we noticed that the sign of the effect_weight column was changed within the scorefiles.

It's great that archived versions of each scorefile are maintained on the PGS Catalog FTP site, which eventually allowed us to troubleshoot these issues. However, tracking down these individual scorefile changes is very time-consuming, particularly as the number of scores and archived versions increases. This problem also raises the potential for broader transparency/reproducibility issues. Thanks for all the hard work making this resource possible!
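
For anyone doing the same kind of troubleshooting, here is a minimal Python sketch of comparing two versions of one scorefile for the two kinds of change listed above (a changed header field such as #trait_efo, and a flipped effect_weight sign). It is an illustration, not part of pgsc_calc. It assumes header lines look like `#key=value`, that the data is tab-separated, and that variants can be matched on the `rsID` and `effect_allele` columns; those key columns and both function names are assumptions and may need adjusting for a given file.

```python
import csv
import io


def parse_scorefile(text):
    """Split scoring-file text into (header dict, list of row dicts)."""
    header = {}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Keep only "#key=value" lines; comment lines without "=" are skipped.
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    # Assumption: the first non-comment line holds tab-separated column names.
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t"))
    return header, rows


def diff_versions(old_text, new_text, key_cols=("rsID", "effect_allele")):
    """Return (changed header fields, variant keys whose weight changed sign)."""
    old_header, old_rows = parse_scorefile(old_text)
    new_header, new_rows = parse_scorefile(new_text)

    # Header fields whose value differs, or that exist in only one version.
    header_changes = {
        key: (old_header.get(key), new_header.get(key))
        for key in sorted(set(old_header) | set(new_header))
        if old_header.get(key) != new_header.get(key)
    }

    # Index old weights by the (assumed) variant key columns.
    old_weights = {
        tuple(row[col] for col in key_cols): float(row["effect_weight"])
        for row in old_rows
    }

    # A negative product means the two weights have opposite signs.
    sign_flips = []
    for row in new_rows:
        key = tuple(row[col] for col in key_cols)
        if key in old_weights and old_weights[key] * float(row["effect_weight"]) < 0:
            sign_flips.append(key)
    return header_changes, sign_flips
```

Both functions take the decompressed file contents as strings, so an archived version and the current version can be read with `gzip.open(path, "rt").read()` and passed straight in.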

@mglev1n mglev1n added the enhancement New feature or request label Jul 31, 2024
Member

smlmbrt commented Aug 1, 2024

@mglev1n, thanks for your comments and suggestions! I think there are some easy things we can do to better expose the version of the scores (putting the release date in the file header and adding it to the report), and we will discuss the feasibility of a changelog on our side.

  • Re: EFO term swaps. These are sometimes out of our control, when a trait gets deprecated from the ontology or changes parentage; other times we swap the term to be more accurate.
  • Re: effect_weight directions. We usually only change these when authors ask us to, but are always sure to leave the original file as an archived version on the FTP.

We do provide MD5 checksums for the scores so that versions can be compared. It is possible to download all the scorefiles you want using our Python package (https://pypi.org/project/pgscatalog-utils/, https://pygscatalog.readthedocs.io/en/latest/) and use those downloads as a stable input to the pipeline (or extract them from the first run of the pipeline for re-use).
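
As a rough illustration of using downloaded files as a stable input, the sketch below records an MD5 checksum for every scorefile in a local directory and lists which files differ between two such snapshots. It uses only the Python standard library and does not call pgscatalog-utils. The function names and the `*.txt.gz` file pattern are assumptions for the example, and the checksums describe the local files as downloaded, which may or may not be computed the same way as the ones published by the Catalog.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, pattern="*.txt.gz"):
    """Map each matching file name in `directory` to its MD5 digest."""
    return {
        path.name: md5_of_file(path)
        for path in sorted(Path(directory).glob(pattern))
    }


def changed_files(old_manifest, new_manifest):
    """List file names that were added, removed, or changed between manifests."""
    names = sorted(set(old_manifest) | set(new_manifest))
    return [name for name in names if old_manifest.get(name) != new_manifest.get(name)]
```

Saving the manifest next to the results of a run (for example with `json.dump`) gives a record of exactly which file versions were scored, and a later download can be checked against it with `changed_files` before re-running.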

@smlmbrt smlmbrt added the user-query User queries & requests label Aug 1, 2024
@smlmbrt smlmbrt changed the title from "PGS Catalog version control" to "Improving information about PGS Catalog scoring file versions" Aug 2, 2024