inability to convert from json biom to hdf5 biom #513
Also, @ElDeveloper verified that he can get this error, so it's at least more than my machine.
I am able to reproduce this problem on my machine. @josenavas I think (I am not sure at all) this has something to do with the way the metadata is represented, as this table includes sample metadata:

In [21]: from biom import load_table
In [22]: bt = load_table('master.hsid.r3800.biom')
In [23]: for e in bt.ids('sample'):
   ....:     print bt.metadata(e, 'sample')
   ....:
defaultdict(<function <lambda> at 0x105c61668>, {u'Ix.s.064NYR3.1089051': {u'TARGET_SUBFRAGMENT': u'V13', u'LIFE_STAGE': u'adult', u'ASSIGNED_FROM_GEO': u'y', u'EXPERIMENT_CENTER': u'NCSU and NCU', u'TITLE': u'Meshnick_Ixodes_NY_CT_NC', u'RUN_PREFIX': u'HTCIOYZ08', u'DEPTH': u'0', u'HOST_TAXID': u'6945', u'TAXON_ID': u'749906', u'ELEVATION': u'9.78', u'RUN_DATE': u'12/19/11', u'COLLECTION_DATE': u'4/6/11', u'ALTITUDE': u'0', u'BarcodeSequence': u'ATACGACGTA', u'ENV_BIOME': u'ENVO:organism-associated habitat', u'SEX': u'M', u'PLATFORM': u'Titanium', u'COUNTRY': u'GAZ:United States of America', u'HSID': u'tick_064_NY', u'HOST_SUBJECT_ID': u'1885:Ix s/064NYR3', u'ANONYMIZED_NAME': u'Ix.s.064NYR3', u'SAMPLE_CENTER': u'NCSU-GSL', u'SAMP_SIZE': u'.1.,g', u'SITE': u'NY deer ticks', u'LONGITUDE': u'-74.01', u'STUDY_ID': u'1885', u'LinkerPrimerSequence': u'ATTACCGCGGCTGCTGG', u'EXPERIMENT_DESIGN_DESCRIPTION': u'Deer ticks from NY and CT where Lyme disease is endemic vs the same ticks in NC where', u'STUDY_CENTER': u'NCSU and NCU', u'HOST_COMMON_NAME': u'deer tick', u'SEQUENCING_METH': u'pyrosequencing', u'ENV_MATTER': u'ENVO:organic material feature', u'TARGET_GENE': u'16S rRNA', u'Description': u'Sample taken from tick Ix.s.064NYR3', u'ENV_FEATURE': u'ENVO:organism-associated habitat', u'KEY_SEQ': u'TCAG', u'RUN': u'run 3', u'REGION': u'NA', u'RUN_CENTER': u'UNC', u'PCR_PRIMERS': u'FWD:TACCGCGGCTGCTGG; REV:AGTTTGATCCTGGCTCAG', u'LIBRARY_CONSTRUCTION_PROTOCOL': u'454FLX Titanium targetting the V13 region, 27F,534R', u'EXPERIMENT_TITLE': u'Meshnick_Ixodes_CT_NY_NC', u'LATITUDE': u'9.78', u'PUBLIC': u'n'}, u'Ix.s.064NYR2.1089150': {u'TARGET_SUBFRAGMENT ........

I'm very unfamiliar with some of the latest changes that took place, so I'm not sure what might be causing this.
Yes, it has to do with the metadata. Is this a collapsed table? It looks like the sample metadata is a dict of dicts, which is not supported. In the current version, if you have collapsed metadata, it just stores the original IDs, not the complete metadata (we already agreed that there is no use case for keeping it).
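For reference, a quick way to check whether a table's sample metadata is nested like this is to look for dict values among the metadata entries. This is a minimal sketch; the has_nested_metadata helper is hypothetical and not part of the biom API:

from biom import load_table

def has_nested_metadata(table, axis='sample'):
    # Hypothetical helper: True if any metadata value on the given axis is
    # itself a dict, i.e. the table looks like a collapsed table.
    md = table.metadata(axis=axis)
    if md is None:
        return False
    return any(isinstance(v, dict) for entry in md for v in entry.values())

table = load_table('master.hsid.r3800.biom')
print(has_nested_metadata(table))  # expected to be True for the table in this issue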
It's a table produced through this command from October 27th, 2013:

What I worry about is that I have quite a few tables like this. This is a pretty aggressive bit of backwards incompatibility. Is there going to be a solution for this in the future?
IMHO this should be fixed, that's the whole point of the converter.
Well, we can modify the converter so it accepts a parameter (--collapsed) which tells it that the table is collapsed. It then modifies the metadata of the table to be the new one. Did you actually need the metadata stored on the table? i.e., are you actually accessing it from the BIOM table? If so, can you explain the use case?
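Until the converter grows that option, something along these lines could flatten the collapsed sample metadata down to just the original IDs so the table becomes writable as HDF5. This is only a rough sketch against the public Table API; flatten_collapsed and the 'collapsed_ids' category name are made up for illustration:

from biom import load_table
from biom.table import Table
from biom.parse import biom_open

def flatten_collapsed(table):
    # Replace each nested {original_id: {category: value}} entry with a flat
    # entry that only keeps the original sample IDs, dropping their metadata.
    flat_md = []
    for entry in table.metadata(axis='sample'):
        if all(isinstance(v, dict) for v in entry.values()):
            flat_md.append({'collapsed_ids': ';'.join(sorted(entry))})
        else:
            flat_md.append(dict(entry))
    return Table(table.matrix_data,
                 table.ids(axis='observation'),
                 table.ids(axis='sample'),
                 observation_metadata=table.metadata(axis='observation'),
                 sample_metadata=flat_md)

flat = flatten_collapsed(load_table('master.hsid.r3800.biom'))
with biom_open('master.hsid.r3800.hdf5.biom', 'w') as f:
    flat.to_hdf5(f, 'flattened by hand')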
I think that solution sounds great. The use case is just that I want to be able to convert BIOM tables I created with QIIME in the last year.
Ok, I will be able to put something in place by tomorrow. I'll use your table as a test.
I think this is still an issue with the latest biom-format package. Running QIIME's filter_otus_from_otu_table.py fails when writing the filtered BIOM table in HDF5 format, but succeeds when h5py is uninstalled and QIIME falls back to writing JSON. See this forum post for an example table with sample metadata and the command to reproduce the issue. The workaround for now is to have users uninstall h5py to force QIIME to write JSON. Another workaround is removing sample metadata from the input file, but I don't know of an easy way to do that.
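For the second workaround: a BIOM 1.0 file is plain JSON whose "columns" list carries the per-sample "metadata", so one blunt way to strip it is with the json module. A sketch only; the file names are placeholders:

import json

# Null out the per-sample metadata in a BIOM 1.0 (JSON) table so the HDF5
# writer never sees the nested values.
with open('input.biom') as f:
    table_json = json.load(f)

for col in table_json['columns']:  # one entry per sample in BIOM 1.0
    col['metadata'] = None

with open('input.no_sample_md.biom', 'w') as f:
    json.dump(table_json, f)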
That isn't a conversion issue but an issue with QIIME and h5py?
...from the forum post, this looks like a different issue. It appears the
It is the same issue here, just a different script calling into BIOM:

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

The issue is that the new HDF5 format can't write the sample metadata that's stored in the user's BIOM table. This was the issue @wdwvt1 was having (see previous conversation).
We do it in the unit tests though, and observation metadata is handled with
There is a problem with the HDF5 writer for this table. Here we load in the JSON table and try to write it as HDF5, which fails:

In [1]: from biom import load_table
In [2]: from biom.parse import biom_open
In [3]: t = load_table('Combined_otu_Early.biom')
In [4]: with biom_open('foo.biom', 'w') as biom_file:
   ...:     t.to_hdf5(biom_file, 'foo')
   ...:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-df9c961c71c4> in <module>()
1 with biom_open('foo.biom', 'w') as biom_file:
----> 2 t.to_hdf5(biom_file, 'foo')
3
/Users/jairideout/dev/biom-format/biom/table.pyc in to_hdf5(self, h5grp, generated_by, compress, format_fs)
3605 self.group_metadata(axis='observation'), 'csr', compression)
3606 axis_dump(h5grp.create_group('sample'), self.ids(),
-> 3607 self.metadata(), self.group_metadata(), 'csc', compression)
3608
3609 @classmethod
/Users/jairideout/dev/biom-format/biom/table.pyc in axis_dump(grp, ids, md, group_md, order, compression)
3575 # Create the dataset for the current category,
3576 # putting values in id order
-> 3577 formatter[category](grp, category, md, compression)
3578
3579 # Create the group for the group metadata
/Users/jairideout/dev/biom-format/biom/table.pyc in general_formatter(grp, header, md, compression)
244 'metadata/%s' % header, shape=(len(md),),
245 data=[m[header] for m in md],
--> 246 compression=compression)
247
248
/Users/jairideout/miniconda/envs/biom-format/lib/python2.7/site-packages/h5py/_hl/group.pyc in create_dataset(self, name, shape, dtype, data, **kwds)
101 """
102 with phil:
--> 103 dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
104 dset = dataset.Dataset(dsid)
105 if name is not None:
/Users/jairideout/miniconda/envs/biom-format/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
85 else:
86 dtype = numpy.dtype(dtype)
---> 87 tid = h5t.py_create(dtype, logical=1)
88
89 # Legacy
h5py/h5t.pyx in h5py.h5t.py_create (-------src-dir--------/h5py/h5t.c:16162)()
h5py/h5t.pyx in h5py.h5t.py_create (-------src-dir--------/h5py/h5t.c:15993)()
h5py/h5t.pyx in h5py.h5t.py_create (-------src-dir--------/h5py/h5t.c:15895)()
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Note the traceback indicates this is a problem with writing sample metadata.
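The failure is reproducible with h5py alone: numpy turns a list of dicts into an object-dtype array, which has no HDF5 equivalent. A standalone sketch, independent of biom, with an arbitrary file name:

import h5py

# Mimics the failing general_formatter call: the per-sample values for one
# metadata category are dicts, so numpy falls back to dtype('O').
values = [{u'WWTP': u'1'}, {u'WWTP': u'5'}]

with h5py.File('repro.h5', 'w') as h5grp:
    # Raises: TypeError: Object dtype dtype('O') has no native HDF5 equivalent
    h5grp.create_dataset('metadata/category', shape=(len(values),), data=values)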
The in-memory representations of the metadata are not well defined, so it

I agree this is a problem, assuming the metadata described in the table are
The JSON table can be read correctly as JSON and written as JSON:

In [1]: from biom import load_table
In [2]: t = load_table('Combined_otu_Early.biom')
In [3]: with open('foo.biom', 'w') as biom_file:
   ...:     t.to_json('foo', biom_file)
   ...:
In [4]:

It is a valid BIOM file:

$ biom validate-table -i Combined_otu_Early.biom
The input file is a valid BIOM-formatted file.

The user reported that this table used to work with QIIME 1.8.0 but not 1.9.0+.
Still does not indicate if it is the parser or writer, or how to begin to
Shouldn't the format spec for 2.1 define how to deal with all possible JSON serializations for metadata? This would indicate a bug in the writer, as the only thing BIOM 1.0 can physically store is JSON (maps, unicode strings, floats, booleans, null, and arrays).
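A writer could, for instance, cope with arbitrary JSON values by serializing every metadata value back to a JSON string, which HDF5 stores without complaint. This is only an illustrative sketch, not how the biom-format writer actually handles metadata; write_category and the file name are made up:

import json
import h5py

def write_category(grp, category, values, compression=None):
    # Illustration only: JSON-encode each value so dicts, lists, and nulls all
    # become storable byte strings instead of an object-dtype array.
    data = [json.dumps(v).encode('utf-8') for v in values]
    grp.create_dataset('metadata/%s' % category, shape=(len(data),),
                       data=data, compression=compression)

with h5py.File('demo.h5', 'w') as f:
    # The nested values that break the current writer round-trip fine here;
    # a reader would json.loads each stored string to recover the value.
    write_category(f, 'nested', [{u'WWTP': u'1'}, {u'WWTP': u'5'}])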
In [1]: from biom import load_table
In [2]: t = load_table('Combined_otu_Early.biom')
In [3]: t.metadata(axis='sample')
Out[3]:
(defaultdict(<function biom.table.<lambda>>,
{u'WW.1.memb.F.1': {u'BarcodeSequence': u'ACACTGTTCATG',
u'Category': u'Early',
u'Combined': u'Early_1',
u'Description': u'WW1_memb_5hours_filtration_1',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'1',
u'name': u'2'},
u'WW.1.memb.F.2': {u'BarcodeSequence': u'AGCAGTCGCGAT',
u'Category': u'Early',
u'Combined': u'Early_1',
u'Description': u'WW1_memb_5hours_filtration_2',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'1',
u'name': u'17'}}),
defaultdict(<function biom.table.<lambda>>,
{u'WW.5.memb.F.1': {u'BarcodeSequence': u'AACTGTGCGTAC',
u'Category': u'Early',
u'Combined': u'Early_5',
u'Description': u'WW5_memb_5hours_filtration_1',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'5',
u'name': u'12'},
u'WW.5.memb.F.2': {u'BarcodeSequence': u'ACGTCTGTAGCA',
u'Category': u'Early',
u'Combined': u'Early_5',
u'Description': u'WW5_memb_5hours_filtration_2',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'5',
u'name': u'29'}}),
defaultdict(<function biom.table.<lambda>>,
{u'WW.4.memb.F.1': {u'BarcodeSequence': u'ACGGATCGTCAG',
u'Category': u'Early',
u'Combined': u'Early_4',
u'Description': u'WW4_memb_5hours_filtration_1',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'4',
u'name': u'10'},
u'WW.4.memb.F.2': {u'BarcodeSequence': u'AAGCTGCAGTCG',
u'Category': u'Early',
u'Combined': u'Early_4',
u'Description': u'WW4_memb_5hours_filtration_2',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'4',
u'name': u'26'}}),
defaultdict(<function biom.table.<lambda>>,
{u'WW.3.memb.F.1': {u'BarcodeSequence': u'ACAGACCACTCA',
u'Category': u'Early',
u'Combined': u'Early_3',
u'Description': u'WW3_memb_5hours_filtration_1',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'3',
u'name': u'8'},
u'WW.3.memb.F.2': {u'BarcodeSequence': u'AGAGAGCAAGTG',
u'Category': u'Early',
u'Combined': u'Early_3',
u'Description': u'WW3_memb_5hours_filtration_2',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'3',
u'name': u'23'}}),
defaultdict(<function biom.table.<lambda>>,
{u'WW.2.memb.F.1': {u'BarcodeSequence': u'ACGCTCATGGAT',
u'Category': u'Early',
u'Combined': u'Early_2',
u'Description': u'WW2_memb_5hours_filtration_1',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'2',
u'name': u'4'},
u'WW.2.memb.F.2': {u'BarcodeSequence': u'ACAGCAGTGGTC',
u'Category': u'Early',
u'Combined': u'Early_2',
u'Description': u'WW2_memb_5hours_filtration_2',
u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
u'WWTP': u'2',
u'name': u'20'}}))
In [4]:
Evan, and nested forms, which is the issue.

Thanks, Jai. I'm not aware of these nestings having been encountered before. Note, we tabled the refactor of the HDF5 readers and writers until the port
This issue was originally opened by @wdwvt1 due to nested metadata in existing BIOM tables. That's why I reopened the issue and didn't create a new one.
Why was it closed? @josenavas, did you ever get a fix for this?
@jairideout it looks like your table has collapsed sample metadata, can you use the command
Thanks for the workaround, I can confirm this works.
Here's the JSON BIOM table: http://cl.ly/0R0q1s2t0o0W
Here's the command:
biom convert -i /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/nov7_open_ref_parallel/master.hsid.r3800.biom -o /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/july_2014/master.hsid.r3800.biom --to-hdf5 --table-type 'OTU table'
Here's the output:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent