harmonised file conforms to new standard format #50

Closed
jdhayhurst opened this issue Sep 5, 2022 · 4 comments · Fixed by #71

@jdhayhurst
Contributor

Enable the harmonisation pipeline to produce a file that conforms to the new sumstats standard

@jdhayhurst
Contributor Author

Places of change are:

  • main_pysam.py script
  • the nf code that sorts and tabix indexes
  • qc
  • common constants

jiyue1214 added a commit that referenced this issue Nov 23, 2022
This branch is designed to solve ticket 49 by reading the genome build from the yaml file (assuming the yaml file is in the same folder), and ticket 50 by rearranging the output file into the new format.
@jdhayhurst
Contributor Author

jdhayhurst commented Nov 23, 2022

@ljwh2 does the following sound appropriate in the scenario where we have both beta and odds_ratio?
if beta and odds_ratio:
    harmonise both fields
    output beta in the effect size column (index 4) and output odds_ratio at the end

@jdhayhurst
Contributor Author

jdhayhurst commented Nov 23, 2022

@jhayhurst create ticket for:
qc change:
input: hm_rsid (generated by main_pysam.py), rsid (user uploaded, optional)
if not rsid:
    rename hm_rsid to rsid
else:
    compare hm_rsid to rsid (as we currently do for hm_rsid against variant_id)
    if hm_rsid is not equal or synonymous to rsid: drop record
    rename hm_rsid to rsid
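
For discussion, the qc rule above could be sketched roughly as follows. This is only an illustration, not the pipeline's code: the function name `resolve_rsid` and the `synonyms` argument (a set of rsIDs treated as synonymous with `hm_rsid`) are assumptions made for the example, and how synonyms are actually looked up is not covered here.

```python
def resolve_rsid(hm_rsid, rsid, synonyms=frozenset()):
    """Return (keep_record, rsid_value) for a single record.

    hm_rsid  -- rsID generated during harmonisation
    rsid     -- optional rsID uploaded by the user (None or "" if absent)
    synonyms -- hypothetical set of rsIDs considered synonymous with hm_rsid
    """
    if not rsid:
        # No user-supplied rsid: hm_rsid is simply renamed to rsid.
        return True, hm_rsid
    if rsid == hm_rsid or rsid in synonyms:
        # Equal or synonymous: keep the record; hm_rsid becomes rsid.
        return True, hm_rsid
    # Neither equal nor synonymous: the record is dropped.
    return False, None
```

Returning a `(keep_record, rsid_value)` pair keeps "drop the record" separate from "hm_rsid happens to be empty", which a single return value would conflate.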

@jdhayhurst
Contributor Author

@jiyue1214 we've decided that, if both beta and odds_ratio are present, we will:

  1. determine which is given in column 4
  2. harmonise both beta and odds_ratio
  3. output maintains whichever field was in column 4 in the input
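
A minimal sketch of step 1, assuming "column 4" means zero-based index 4 of the input header (the effect size position mentioned earlier in this thread). The names `split_effect_fields`, `EFFECT_FIELDS` and `EFFECT_INDEX` are made up for illustration and are not taken from the pipeline.

```python
EFFECT_FIELDS = ("beta", "odds_ratio")
EFFECT_INDEX = 4  # assumption: zero-based position of the effect size column


def split_effect_fields(input_header):
    """Return (primary, secondary) effect fields for an input header.

    primary   -- the effect field found in column 4 of the input; the
                 output keeps this field in column 4
    secondary -- the other effect field, harmonised as well (None if absent)
    """
    present = [name for name in EFFECT_FIELDS if name in input_header]
    if len(present) < 2:
        # Only one (or neither) effect field present: nothing to arbitrate.
        return (present[0] if present else None), None
    if (len(input_header) <= EFFECT_INDEX
            or input_header[EFFECT_INDEX] not in EFFECT_FIELDS):
        raise ValueError("expected beta or odds_ratio in column 4")
    primary = input_header[EFFECT_INDEX]
    secondary = "odds_ratio" if primary == "beta" else "beta"
    return primary, secondary
```

For example, a header with `odds_ratio` at index 4 and `beta` as a trailing column gives `("odds_ratio", "beta")`: both are harmonised and `odds_ratio` stays in column 4 of the output.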

@jiyue1214 jiyue1214 linked a pull request Mar 10, 2023 that will close this issue
jdhayhurst added a commit that referenced this issue May 23, 2023
* ticket #49 #50

This branch is designed to solve ticket 49 by reading the genome build from the yaml file (assuming the yaml file is in the same folder), and ticket 50 by rearranging the output file into the new format.

* Update map_to_build.nf

* Update common_constants.py

main.py uses --rsid_col rsid rather than variant_id, which solves the problem of multiple reference records for one site.

* Update main_pysam.py

keep harmonised header in specific order

* Update gwascatalogharm.nf

change the input to [GCST, yaml path, input path]

* bare bones meta client

* Create GCSTtest.tsv.ymal

* Update and rename GCSTtest_b37.tsv to GCSTtest.tsv

* Rename test_data/GCSTtest.tsv.ymal to test_data/homes/yueji/.nextflow/assets/EBISPOT/gwas-sumstats-harmoniser/test_data/GCSTtest.tsv-meta.yamll

* test yaml file

add test yaml file

* Update test.config

new format test data

* Create GCSTtest.tsv-meta.yaml

* Update GCSTtest.tsv

test data modify

* Update GCSTtest.tsv-meta.yaml

* Update GCSTtest.tsv

* Update main_pysam.py

tmp: temporarily keep the original variant_id and hm_varid columns, since the qc step needs them. The columns after qc are fixed.

* Update GCSTtest.tsv

modify the test input file

* Update qc.nf

will add bgzip and tabix in the qc.py using pysam

* Update main_pysam.py

Fix the hm_code and hm_coordinate_conversion columns at the position after the mandatory fields.

* Update harmonization_log.nf

new hm_code position

* Yaml

make raw_yaml available until the qc step

* keep raw content in col 4

keep what is in the raw column 4

* Update map_to_build_nf.py

bring the master #65 update into the branch

* Update main_pysam.py

Allow standard_error to be absent from the raw data.

* Update main_pysam.py

* Update qc.nf

sort the qc result for tabix

* sort chr and pos

update the sorting by chr and pos, including for data with mandatory columns only

* Update harmonization.nf

* adding metadata model, update yaml file and publish

* add coord system to metadata

* #70

#70 liftover with specified coordinate system

* #70

Further update on converting the coordinate system:
1. map_to_build.py: bp-coordinate -> liftover -> bp'+1
2. main_pysam.py:
if 0-base and indels and hm_coordinate_conversion="lo":
     vcf_rec=tabix(chr, bp-2,bp)
else:
     vcf_rec=tabix(chr, bp-1,bp)
3. harmonisation.nf: read yaml file
4. test_data: 1_base and 0_base tsv file.

* #70

* add one new row of the test data

* #70

* import schema from gwas-sumstats-tools, camelCase to snake_case

* add container.config

* auto map col names to schema

use utils.py to map args for strand count

* rename test input file

* Create GCST0.tsv-meta.yaml

* Create GCST1.tsv-meta.yaml

* Update main_harm.nf

* change on flip_beta

* Change variant_id to rsid for map_to_build

* Update container.config

* qc based on rsid, not variant_id

* using gwas_sumstats_tools to reorder the columns

reorder the output columns using the gwas_sumstats_tools function : _set_header_order()

* Update main_pysam.py

* Update main_pysam.py

* wait for all previous process to finish

* wait until all chr finish ten_sc process

* Update main_pysam.py

* Update container.config

resetting default container config

---------

Co-authored-by: jdhayhurst <hayhurst.jd@gmail.com>
Co-authored-by: jdhayhurst <38317975+jdhayhurst@users.noreply.github.com>
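
The coordinate rule from the #70 commit notes above (item 2, main_pysam.py) can be sketched as a small pure function. This is a hedged illustration only: `tabix_window` and its parameter names are hypothetical, and the sketch only reproduces the window arithmetic from the notes, not the actual VCF lookup.

```python
def tabix_window(bp, zero_based, is_indel, coordinate_conversion):
    """Return the (start, end) window used to query the reference VCF.

    bp                    -- base pair location of the record
    zero_based            -- True if the input uses a 0-based coordinate system
    is_indel              -- True if the variant is an indel
    coordinate_conversion -- value of hm_coordinate_conversion, e.g. "lo"
    """
    if zero_based and is_indel and coordinate_conversion == "lo":
        # 0-based indel that went through liftover: widen the window by one.
        return bp - 2, bp
    # All other cases: the single-base window ending at bp.
    return bp - 1, bp
```

The returned pair is meant to stand in for the start and end arguments of the `tabix(chr, start, end)` call in the notes. For a 0-based, lifted-over indel at `bp = 100` the window is `(98, 100)`; otherwise it is `(99, 100)`.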