should we store stats as YAML (or json) #79

sergpolly · 2020-01-31T21:16:26Z

that's how we store stats now:

total_mapped    2189618376
total_nodups    1753432070
cis     1533122797
...
pair_types/WW   88884
pair_types/MU   404456330
...
cis_1kb+        998076606
cis_2kb+        836035718
...
chrom_freq/chr1/chr1    137125332
chrom_freq/chr1/chr10   1791283
...

This is hard to parse, and I believe YAML would serve us just fine. Should we switch?
It would be useful for #78.

Phlya · 2020-01-31T21:23:36Z

It's quite easy to parse, I think... Just read as a table with pandas?

sergpolly · 2020-01-31T21:28:31Z

Things like pair_types:

...
pair_types/WW   88884
pair_types/MU   404456330
...

imply a nested structure, i.e. I would want to parse it as

stats = {..., "pair_types": {"WW": 88884, "MU": 404456330}, ...}

I'm not sure pandas would help with that

Also, for MultiQC: they don't want to rely on pandas for whatever reason. I guess pandas isn't the smallest dependency.
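
To make the nested structure described above concrete, here is a minimal sketch of a pandas-free parser for the flat format. It assumes each line holds a key and an integer value separated by whitespace, and that every "/" in the key marks one level of nesting. The function name `parse_flat_stats` is hypothetical, and this is not necessarily how pairtools itself parses the file.

```python
def parse_flat_stats(lines):
    """Turn 'key/subkey<whitespace>value' lines into a nested dict.

    Assumption: values are integers and '/' in a key means nesting.
    """
    stats = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        key, value = line.split()  # key and value separated by whitespace
        *parents, leaf = key.split("/")
        node = stats
        for part in parents:
            # descend into (or create) the intermediate dict
            node = node.setdefault(part, {})
        node[leaf] = int(value)
    return stats


example = [
    "total_mapped\t2189618376",
    "pair_types/WW\t88884",
    "pair_types/MU\t404456330",
    "chrom_freq/chr1/chr1\t137125332",
]
stats = parse_flat_stats(example)
```

With this approach `stats["pair_types"]` becomes a plain dict of pair-type counts, and `chrom_freq` ends up nested two levels deep because its keys contain two slashes.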

sergpolly · 2020-01-31T21:35:32Z

This is how we currently parse a typical stats file in pairtools: https://github.com/mirnylab/pairtools/blob/d1ddf9c39a336662f7fc725fa5a70ec68df9ba95/pairtools/pairtools_stats.py#L263

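With standard YAML, which is great for storing nested dicts and various small lists, it would simply look like:

import yaml

# yaml.safe_load expects a stream or YAML text, not a file name, so open the file first
with open("sample.nodups.stats.yml") as f:
    stats_dict = yaml.safe_load(f)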

and here is the ultimate goal:
https://multiqc.info/
https://multiqc.info/examples/hi-c/multiqc_report.html
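
For the JSON alternative mentioned in the title, a rough sketch using only the standard library might look like the following. The dict contents reuse the numbers quoted earlier in this thread. The layout is an assumption for illustration, not a format pairtools actually writes.

```python
import json

# Nested stats, using values quoted earlier in the thread
stats = {
    "total_mapped": 2189618376,
    "pair_types": {"WW": 88884, "MU": 404456330},
}

# Serialize to text and parse it back; nesting and integers survive the round trip
text = json.dumps(stats, indent=2)
restored = json.loads(text)
```

JSON needs no extra dependency in Python, while YAML requires a third-party package such as PyYAML but is easier to read and edit by hand.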

agalitsyna · 2022-04-14T09:54:22Z

I updated the pairtools stats output to YAML in version 1.0.0: https://github.com/open2c/pairtools/pull/117/files#diff-e4b8770efd538564222d48d69b00ed2c5012a76b35c926f1aba227fe45db2309

I guessed at the best way to convert some fields, e.g. reporting chromosome pairs separated by a slash instead of a separate dict for each chromosome:

chrom_freq:
  chr1/chr1: 3
  chr1/chr2: 1
  chr2/chr3: 1

But this is minor, and you may change it in the future.
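
If a per-chromosome nested dict is preferred downstream, the slash-separated keys can be regrouped after loading. A minimal sketch, assuming the `chrom_freq` mapping has already been loaded into a Python dict and that chromosome names never contain a slash:

```python
# chrom_freq as it would look after loading the YAML snippet above
chrom_freq = {"chr1/chr1": 3, "chr1/chr2": 1, "chr2/chr3": 1}

nested = {}
for pair, count in chrom_freq.items():
    # Assumption: exactly one "/" separates the two chromosome names
    chrom1, chrom2 = pair.split("/")
    nested.setdefault(chrom1, {})[chrom2] = count
```

The flat form keeps the YAML file compact, and the regrouping is a few lines on the reader's side when a nested lookup is needed.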

sergpolly added enhancement question labels Jan 31, 2020

Phlya mentioned this issue Feb 13, 2020

restore an option to do per-lane stats for QC open2c/distiller-nf#85

Open

Phlya mentioned this issue Mar 2, 2022

format pairtools stats output as YAML #111

Closed

agalitsyna mentioned this issue Apr 6, 2022

pairtools v1.0.0 roadmap #116

Closed

31 tasks

open2c locked and limited conversation to collaborators Apr 20, 2022

agalitsyna converted this issue into discussion #129 Apr 20, 2022

This issue was moved to a discussion.
