-
I'm trying to work with bigbed files from JASPAR: http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/ The files contain the following columns:
The expanded bed file is ~460 GB. The bigbed file is 86 GB and I can read it using:
data now contains an array with the relative scores. Is it possible to get the corresponding TF name column together with the scores using bbi? |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
Not quite. Unlike bigwig/bedgraph which stores non-overlapping quantitative intervals along the genome, bigbed/bed can contain overlapping intervals with lots of attributes. Therefore, the flat signal obtained from If you want the scores and any other BED fields, you'll have to read out the actual interval records themselves: In [1]: import bbi
In [2]: f = bbi.open('http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/JASPAR2020_hg19.bb')
In [3]: f.schema
Out[3]:
{'name': 'bed',
'comment': 'Browser Extensible Data',
'columns': ['chrom', 'start', 'end', 'name', 'score', 'strand'],
'dtypes': {'chrom': 'object',
'start': 'uint32',
'end': 'uint32',
'name': 'object',
'score': 'uint32',
'strand': 'object'},
'description': {'chrom': 'Reference sequence chromosome or scaffold',
'start': 'Start position in chromosome',
'end': 'End position in chromosome',
'name': 'Name of item.',
'score': 'Score (0-1000)',
'strand': '+ or - for strand'}}
In [4]: f.fetch_intervals('chr1', 94586705, 94587705)
Out[4]:
chrom start end name score strand
0 chr1 94586689 94586706 NR3C1 382 +
1 chr1 94586689 94586706 NR3C1 397 -
2 chr1 94586689 94586706 NR3C2 396 +
3 chr1 94586689 94586706 NR3C2 415 -
4 chr1 94586697 94586708 NR2F2 294 -
... ... ... ... ... ... ...
3769 chr1 94587702 94587709 NFATC2 285 -
3770 chr1 94587703 94587713 Sox3 298 -
3771 chr1 94587704 94587711 FOXP3 206 +
3772 chr1 94587704 94587714 BARHL2 201 +
3773 chr1 94587704 94587714 SOX4 253 +
[3774 rows x 6 columns] |
Beta Was this translation helpful? Give feedback.
Not quite. Unlike bigwig/bedgraph which stores non-overlapping quantitative intervals along the genome, bigbed/bed can contain overlapping intervals with lots of attributes. Therefore, the flat signal obtained from
.fetch()
for a bigBed file corresponds to the interval coverage (number of intervals in the file overlapping a given base or bin), not the relative scores. This is also whatbigBedSummary
does.If you want the scores and any other BED fields, you'll have to read out the actual interval records themselves: