arraymap2ga4gh_v2.py
- It works for the new arraymap/progenetix scheme.
- A few improvements.
- Please use only get_attribute() to read values; don't index the sample directly (sample[attribute]).
- get_attribute() is improved; please see the function comments for details.
- Please try to assign values directly within the JSON block.
- gvnc.py & gbnv.py are now integrated into arraymap2ga4gh.py
- most params work the same as before, but their names have changed.
- Some new behaviors:
- all the generated data is written to a new database different from the source.
- user can suppress the database overwriting warning prompt.
- user can use 2 filter params to restrict the input data.
- the default input db is "arraymap"
- the default output db is "arraymap_ga4gh"
- user can change both database names as well as the collection names.
- please see the help for a detailed list.
- this script can sometimes run for hours; to make life easier, it can now run fully unattended.
- example: >python3 arraymap2ga4gh.py --dwa
- this will suppress the warning prompt.
- double check before you do this, as it will overwrite the database.
- user can now provide a query by parameter or file.
- the query must be of correct mongodb syntax.
- when filtering through a parameter, surround the query with double quotes and escape any characters your shell treats specially with \.
- example: >python3 arraymap2ga4gh.py -f "{'ICDMORPHOLOGYCODE': {'$regex':'^[89]'}}"
- when filtering through file, only the first line is used.
- example: >python3 arraymap2ga4gh.py -ff f.txt
- file has priority over param.
- can generate individuals now.
- related cli option "--collection_dst_individuals" is provided.
- represents empty value as 'null' instead of ''.
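The filter rules above (query by parameter or by file, only the first line of the file is used, file has priority) can be sketched as follows. This is a minimal sketch, not the script's actual code: the function name load_filter and its arguments are hypothetical, and parsing with ast.literal_eval is an assumption based on the single-quoted example query.

```python
import ast


def load_filter(query_param=None, query_file=None):
    """Return the filter as a dict; a file takes priority over a parameter."""
    text = None
    if query_file is not None:
        # Only the first line of the file is used.
        with open(query_file, encoding="utf-8") as handle:
            text = handle.readline().strip()
    elif query_param is not None:
        text = query_param.strip()
    if not text:
        return {}  # no filter given: match every document
    # The example query uses Python-style single quotes, so literal_eval is
    # used here instead of json.loads (an assumption about the input format).
    query = ast.literal_eval(text)
    if not isinstance(query, dict):
        raise ValueError("the filter must be a dict-like MongoDB query")
    return query


print(load_filter("{'ICDMORPHOLOGYCODE': {'$regex': '^[89]'}}"))
```

The returned dict could then be passed to a find() call on the source collection.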
generates a pre-set "variant set".
- variant, biosample and callset are all following the same naming convention.
- they all have a unique "id" attribute generated as AM_V_"UID", AM_BS_"UID", and AM_CS_"UID".
- "UID" is from data collection "samples" in database "arraymap".
- all their other attributes follow the GA4GH schema exactly.
- calls do not store biosample_id anymore, it can be retrieved through callset.
- biosample_id of a variant is always generated now, instead of checking existence and generating when absent.
- new option:'-s', '--status', default='-exclude'
- it filters the data collection by STATUS value, param must be in format [+/-]keyword. "+" means to include, "-" means to exclude.
- it accepts regular expressions, e.g. "--status +include\|^NA" as an option from cmd.
- biosample_id is always generated now, instead of checking existence and generating when absent.
- characteristics of a biosample are implemented. Right now, they simply capture ICD information.
- duplicated "UID" (duplicated biosample_id) problem is temporarily resolved by only including samples with more than 50 attributes.
- shortnames '-d' and '-l' for options '--demo' and '--log' are provided.
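The [+/-]keyword format of the -s/--status option can be illustrated with a small sketch. The function names are hypothetical, and the treatment of samples without a STATUS value (as an empty string) is an assumption rather than documented behavior.

```python
import re


def parse_status(option="-exclude"):
    """Split '[+/-]keyword' into (include_flag, compiled_pattern)."""
    if len(option) < 2 or option[0] not in "+-":
        raise ValueError("status must look like +keyword or -keyword")
    return option[0] == "+", re.compile(option[1:])


def keep_sample(sample, option="-exclude"):
    """Return True if the sample passes the STATUS filter."""
    include, pattern = parse_status(option)
    # Assumption: a missing STATUS is treated as an empty string.
    status = str(sample.get("STATUS", ""))
    matched = pattern.search(status) is not None
    return matched if include else not matched


samples = [{"STATUS": "excluded"}, {"STATUS": "NA"}, {"STATUS": "ok"}]
print([keep_sample(s) for s in samples])
print([keep_sample(s, "+include|^NA") for s in samples])
```

Note that the backslash in "+include\|^NA" only protects the | from the shell; the script itself would see "+include|^NA".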
New cli script to generate BIOSAMPLEs and CALLSETs. Usage is very similar to gvnc.py except two destination collections are needed instead of one:
- '-dstb, --collection_dst_biosamples TEXT The collection to write into, default is "biosamples"'
- '-dstc, --collection_dst_callsets TEXT The collection to write into, default is "callsets"'
Not all relevant attributes in a sample are copied to the biosample; only a few are included for demonstration.
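As a rough illustration of how one sample could be turned into a biosample and a callset, here is a minimal sketch. The get_attribute function below is a hypothetical stand-in (the real function's signature is not shown in these notes), and the key names inside "characteristics" are illustrative only.

```python
def get_attribute(name, sample, default=None):
    """Hypothetical stand-in: return sample[name], or default if missing or empty."""
    value = sample.get(name)
    return default if value in (None, "") else value


def build_biosample_and_callset(sample):
    """Build one biosample and one callset dict from a source sample."""
    uid = get_attribute("UID", sample)
    if uid is None:
        raise ValueError("sample has no UID")
    biosample_id = "AM_BS_" + str(uid)
    biosample = {
        "id": biosample_id,
        # Only ICD information is captured here; empty values become None,
        # which is stored as null.
        "characteristics": {
            "icd_morphology_code": get_attribute("ICDMORPHOLOGYCODE", sample),
        },
    }
    callset = {
        "id": "AM_CS_" + str(uid),
        "biosample_id": biosample_id,
    }
    return biosample, callset


biosample, callset = build_biosample_and_callset({"UID": "x1", "ICDMORPHOLOGYCODE": ""})
print(biosample)
print(callset)
```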
New cli script gvnc.py is added.
This script goes through the given source db collection, generates new VARIANTs containing CALLs, and puts them in the destination db collection.
NOTE: Please manually create the destination collection to avoid accidentally overwriting an existing collection.
WARNING: This script will overwrite the destination db collection, check before running.
Example: python3 gvnc.py --demo 1000 --dnw --log log.txt
This will run for 1000 valid samples; the result won't be written to the database, and any error or warning messages will be written to log.txt.
-db, --dbname TEXT The name of the database, default is "arraymap"
-src, --collection_src TEXT The collection to read from, default is "samples"
-dst, --collection_dst TEXT The collection to write into, default is "variants"
--demo INTEGER RANGE Only process a limited number of entries
--dnw Do Not Write to the db
--log FILENAME Output errors and warnings to a log file
--help Show this message and exit.
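To show how the options above map onto parsed values, here is a minimal sketch. It deliberately uses the standard-library argparse instead of click, purely so that it runs without extra dependencies; it is not how gvnc.py itself is written, and the help texts are paraphrased.

```python
import argparse


def build_parser():
    """Mirror the gvnc.py options listed above (argparse stand-in for click)."""
    parser = argparse.ArgumentParser(prog="gvnc.py")
    parser.add_argument("-db", "--dbname", default="arraymap",
                        help="name of the database")
    parser.add_argument("-src", "--collection_src", default="samples",
                        help="collection to read from")
    parser.add_argument("-dst", "--collection_dst", default="variants",
                        help="collection to write into")
    parser.add_argument("--demo", type=int, default=None,
                        help="only process a limited number of entries")
    parser.add_argument("--dnw", action="store_true",
                        help="do not write to the db")
    parser.add_argument("--log", default=None,
                        help="file for errors and warnings")
    return parser


# Same arguments as in the example command line.
args = build_parser().parse_args(["--demo", "1000", "--dnw", "--log", "log.txt"])
print(args.dbname, args.collection_src, args.collection_dst)
print(args.demo, args.dnw, args.log)
```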
- sudo pip3 install pymongo
- sudo pip3 install click
The old script version:
python generate.py
sudo pip install pymongo
This script will scan the "samples", find all the "HG18 segment"s and put ones with identical type and location in one variant.
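The grouping step can be sketched as below. The field names on the input segments (segments, reference_name, start, end, variant_type) are assumptions chosen to match the variant query further down, not the actual field names of the source collection.

```python
from collections import defaultdict


def group_segments(samples, segments_key="segments"):
    """Merge segments with identical type and location into one variant."""
    calls_by_variant = defaultdict(list)
    for sample in samples:
        for seg in sample.get(segments_key, []):
            key = (seg["reference_name"], seg["start"], seg["end"],
                   seg["variant_type"])
            # Each call remembers which sample it came from.
            calls_by_variant[key].append({"sample_uid": sample["UID"]})
    return [
        {"reference_name": ref, "start": start, "end": end,
         "variant_type": vtype, "calls": calls}
        for (ref, start, end, vtype), calls in calls_by_variant.items()
    ]


seg = {"reference_name": "17", "start": 100, "end": 200, "variant_type": "DEL"}
for variant in group_segments([{"UID": "a", "segments": [seg]},
                               {"UID": "b", "segments": [seg]}]):
    print(variant)
```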
- query for variants hitting the gene TP53 at (hg18) 7512444 - 7531593, but staying inside 5012444 - 10031593:
db.variants.find( { reference_name:"17", variant_type:"DEL", start: { $gte:5012444, $lte:7531593 }, end: { $gte:7512444, $lte:10031593 } } ).count()
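The same range condition (overlap the gene, stay inside the surrounding window) can be built from Python. This is a sketch; overlap_query is a hypothetical helper. Each field appears only once in the dict, because a repeated key would silently keep only its last value.

```python
def overlap_query(reference_name, variant_type, gene, window):
    """Build a filter for variants that hit `gene` but stay inside `window`."""
    gene_start, gene_end = gene
    window_start, window_end = window
    return {
        "reference_name": reference_name,
        "variant_type": variant_type,
        # starts inside the window and no later than the end of the gene
        "start": {"$gte": window_start, "$lte": gene_end},
        # ends no earlier than the start of the gene and inside the window
        "end": {"$gte": gene_start, "$lte": window_end},
    }


query = overlap_query("17", "DEL",
                      gene=(7512444, 7531593),
                      window=(5012444, 10031593))
print(query)
# With pymongo (assuming `db` is a Database object):
#   db.variants.count_documents(query)
```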