inability to convert from json biom to hdf5 biom #513

Closed
wdwvt1 opened this issue Jul 21, 2014 · 25 comments · Fixed by #515

@wdwvt1
Contributor

wdwvt1 commented Jul 21, 2014

Here's the JSON BIOM table: http://cl.ly/0R0q1s2t0o0W

Here's the command: biom convert -i /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/nov7_open_ref_parallel/master.hsid.r3800.biom -o /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/july_2014/master.hsid.r3800.biom --to-hdf5 --table-type 'OTU table'

Here's the output: TypeError: Object dtype dtype('O') has no native HDF5 equivalent

@wdwvt1
Contributor Author

wdwvt1 commented Jul 21, 2014

Also, @ElDeveloper verified that he can get this error, so it's at least not limited to my machine.

@ElDeveloper
Member

I am able to reproduce this problem on my machine. @josenavas, I think (though I am not at all sure) this has something to do with the way the metadata is represented, as this table includes sample metadata:

In [21]: from biom import load_table

In [22]: bt = load_table('master.hsid.r3800.biom')

In [23]: for e in bt.ids('sample'):
   ....:     print bt.metadata(e, 'sample')
   ....:
defaultdict(<function <lambda> at 0x105c61668>, {u'Ix.s.064NYR3.1089051': {u'TARGET_SUBFRAGMENT': u'V13', u'LIFE_STAGE': u'adult', u'ASSIGNED_FROM_GEO': u'y', u'EXPERIMENT_CENTER': u'NCSU and NCU', u'TITLE': u'Meshnick_Ixodes_NY_CT_NC', u'RUN_PREFIX': u'HTCIOYZ08', u'DEPTH': u'0', u'HOST_TAXID': u'6945', u'TAXON_ID': u'749906', u'ELEVATION': u'9.78', u'RUN_DATE': u'12/19/11', u'COLLECTION_DATE': u'4/6/11', u'ALTITUDE': u'0', u'BarcodeSequence': u'ATACGACGTA', u'ENV_BIOME': u'ENVO:organism-associated habitat', u'SEX': u'M', u'PLATFORM': u'Titanium', u'COUNTRY': u'GAZ:United States of America', u'HSID': u'tick_064_NY', u'HOST_SUBJECT_ID': u'1885:Ix s/064NYR3', u'ANONYMIZED_NAME': u'Ix.s.064NYR3', u'SAMPLE_CENTER': u'NCSU-GSL', u'SAMP_SIZE': u'.1.,g', u'SITE': u'NY deer ticks', u'LONGITUDE': u'-74.01', u'STUDY_ID': u'1885', u'LinkerPrimerSequence': u'ATTACCGCGGCTGCTGG', u'EXPERIMENT_DESIGN_DESCRIPTION': u'Deer ticks from NY and CT where Lyme disease is endemic vs the same ticks in NC where', u'STUDY_CENTER': u'NCSU and NCU', u'HOST_COMMON_NAME': u'deer tick', u'SEQUENCING_METH': u'pyrosequencing', u'ENV_MATTER': u'ENVO:organic material feature', u'TARGET_GENE': u'16S rRNA', u'Description': u'Sample taken from tick Ix.s.064NYR3', u'ENV_FEATURE': u'ENVO:organism-associated habitat', u'KEY_SEQ': u'TCAG', u'RUN': u'run 3', u'REGION': u'NA', u'RUN_CENTER': u'UNC', u'PCR_PRIMERS': u'FWD:TACCGCGGCTGCTGG; REV:AGTTTGATCCTGGCTCAG', u'LIBRARY_CONSTRUCTION_PROTOCOL': u'454FLX Titanium targetting the V13 region, 27F,534R', u'EXPERIMENT_TITLE': u'Meshnick_Ixodes_CT_NY_NC', u'LATITUDE': u'9.78', u'PUBLIC': u'n'}, u'Ix.s.064NYR2.1089150': {u'TARGET_SUBFRAGMENT ........

I'm very unfamiliar with some of the latest changes that took place, so I'm not sure what might be causing this.

@josenavas
Member

Yes, it has to do with the metadata.

Is this a collapsed table? It looks like the sample metadata is a dict of dicts, which is not supported. In the current version, if you have collapsed metadata, it just stores the original ids, not the complete metadata (we already agreed that there is no use case for the latter).
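
For illustration, a minimal sketch (sample ids and values invented) of why dict-of-dicts metadata breaks the HDF5 writer: a column built from dict values gets numpy's object dtype, which h5py cannot serialize:

import numpy as np

# Collapsed sample metadata: each entry maps original sample ids to
# their full metadata dicts instead of holding scalar values.
md_entry = {
    'sample.A.1': {'SEX': 'M', 'RUN': 'run 3'},
    'sample.A.2': {'SEX': 'F', 'RUN': 'run 3'},
}

# An array of dict values has dtype('O'), which h5py rejects with
# "Object dtype dtype('O') has no native HDF5 equivalent".
column = np.array([md_entry])
print(column.dtype)  # object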

@wdwvt1
Contributor Author

wdwvt1 commented Jul 21, 2014

It's a table produced through this command from October 27th, 2013:

summarize_otu_by_cat.py -c /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/oct_27th/uclust.biom -i /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/map_gail_update.txt -m HSID -o /Users/wdwvt1/Desktop/work/ticks_v2_ixodes/oct_27th/master.hsid.biom

What I worry about is that I have quite a few tables like this. This is a pretty aggressive bit of backwards incompatibility. Is there going to be a solution for this in the future?

@ElDeveloper
Member

IMHO this should be fixed; that's the whole point of the converter (I think): to provide a layer of compatibility between multiple different format versions. What do others think?

@josenavas
Member

Well, we can modify the converter so it accepts a parameter (--collapsed) which tells it that the table is collapsed. It would then modify the metadata of the table to be the new one.

Do you actually need the metadata stored in the table? I.e., are you actually accessing it from the BIOM table? If so, can you explain the use case?
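
A hedged sketch of that proposal (the helper name and the 'collapsed_ids' key are hypothetical, not the actual biom API): the converter would replace each collapsed metadata entry with just the original ids:

def collapsed_md_to_ids(md):
    # md: sequence of dicts mapping original sample ids to their
    # metadata dicts; keep only the ids, drop the nested metadata.
    return tuple({'collapsed_ids': sorted(m)} for m in md)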

@wdwvt1
Contributor Author

wdwvt1 commented Jul 21, 2014

I think that solution sounds great.

The use case is just that I want to be able to convert BIOM tables I created with QIIME in the last year.

@josenavas
Member

OK, I should be able to put something in place by tomorrow. I'll use your table as a test.

@jairideout
Member

I think this is still an issue with the latest biom-format package. Running QIIME's filter_otus_from_otu_table.py fails when writing the filtered BIOM table in HDF5 format, but succeeds when h5py is uninstalled and QIIME falls back to writing JSON. See this forum post for an example table with sample metadata and the command to reproduce the issue.

The workaround for now is to have users uninstall h5py to force QIIME to write JSON. Another workaround is removing sample metadata from the input file, but I don't know of an easy way to do that.
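
One possible approach, sketched under the assumption that the biom 2.x Table constructor and matrix_data property behave as documented (filenames are placeholders): rebuild the table without sample metadata, then write it as HDF5.

from biom import load_table
from biom.table import Table
from biom.parse import biom_open

t = load_table('input.biom')
# Keep the data and observation metadata; drop the sample metadata.
stripped = Table(t.matrix_data,
                 t.ids(axis='observation'),
                 t.ids(axis='sample'),
                 observation_metadata=t.metadata(axis='observation'),
                 sample_metadata=None)
with biom_open('output.biom', 'w') as f:
    stripped.to_hdf5(f, 'sample metadata removed')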

@jairideout jairideout reopened this Nov 3, 2015
@wasade
Member

wasade commented Nov 3, 2015

That isn't a conversion issue, but an issue with QIIME and h5py?

@wasade
Member

wasade commented Nov 3, 2015

...from the forum post, this looks like a different issue. It appears the metadata are of type object, so h5py doesn't know how to serialize them. My guess is that the filter script is doing something unexpected to the metadata. It's possible that there is more aggressive casting when writing out JSON.

@jairideout
Member

It is the same issue here, just a different script calling into BIOM (convert vs. filter_otus_from_otu_table.py):

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

The issue is that the new HDF5 format can't write the sample metadata that's stored in the user's BIOM table. This was the issue @wdwvt1 was having (see previous conversation).

@wasade
Member

wasade commented Nov 3, 2015

We do this in the unit tests, though, and observation metadata is handled with the exact same code as sample metadata.

@jairideout
Member

There is a problem with the HDF5 writer for this table. Here we load in the JSON table and try to write it as HDF5, which fails:

In [1]: from biom import load_table

In [2]: from biom.parse import biom_open

In [3]: t = load_table('Combined_otu_Early.biom')

In [4]: with biom_open('foo.biom', 'w') as biom_file:
   ...:     t.to_hdf5(biom_file, 'foo')
   ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-df9c961c71c4> in <module>()
      1 with biom_open('foo.biom', 'w') as biom_file:
----> 2     t.to_hdf5(biom_file, 'foo')
      3

/Users/jairideout/dev/biom-format/biom/table.pyc in to_hdf5(self, h5grp, generated_by, compress, format_fs)
   3605                   self.group_metadata(axis='observation'), 'csr', compression)
   3606         axis_dump(h5grp.create_group('sample'), self.ids(),
-> 3607                   self.metadata(), self.group_metadata(), 'csc', compression)
   3608
   3609     @classmethod

/Users/jairideout/dev/biom-format/biom/table.pyc in axis_dump(grp, ids, md, group_md, order, compression)
   3575                     # Create the dataset for the current category,
   3576                     # putting values in id order
-> 3577                     formatter[category](grp, category, md, compression)
   3578
   3579             # Create the group for the group metadata

/Users/jairideout/dev/biom-format/biom/table.pyc in general_formatter(grp, header, md, compression)
    244             'metadata/%s' % header, shape=(len(md),),
    245             data=[m[header] for m in md],
--> 246             compression=compression)
    247
    248

/Users/jairideout/miniconda/envs/biom-format/lib/python2.7/site-packages/h5py/_hl/group.pyc in create_dataset(self, name, shape, dtype, data, **kwds)
    101         """
    102         with phil:
--> 103             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    104             dset = dataset.Dataset(dsid)
    105             if name is not None:

/Users/jairideout/miniconda/envs/biom-format/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
     85         else:
     86             dtype = numpy.dtype(dtype)
---> 87         tid = h5t.py_create(dtype, logical=1)
     88
     89     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create (-------src-dir--------/h5py/h5t.c:16162)()

h5py/h5t.pyx in h5py.h5t.py_create (-------src-dir--------/h5py/h5t.c:15993)()

h5py/h5t.pyx in h5py.h5t.py_create (-------src-dir--------/h5py/h5t.c:15895)()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Note the traceback indicates this is a problem with writing sample metadata.
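
The failure reduces to a minimal h5py call (a sketch; the file and dataset names are arbitrary): a list of dict values becomes an object-dtype numpy array, and h5py has no HDF5 type to map it to:

import numpy as np
import h5py

values = [{'nested': 'dict'}, {'nested': 'dict'}]  # dict-valued metadata column
with h5py.File('repro.h5', 'w') as f:
    # np.asarray(values) has dtype('O'); create_dataset raises
    # TypeError: Object dtype dtype('O') has no native HDF5 equivalent
    f.create_dataset('metadata/Category', data=np.asarray(values))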

@wasade
Member

wasade commented Nov 3, 2015

The in-memory representations of the metadata are not well defined, so it isn't clear whether the issue lies with the JSON parser or with the HDF5 writer.

I agree this is a problem, assuming the metadata described in the table are actually within the (loose) confines of the 1.0 metadata spec.

@jairideout
Member

The JSON table can be read correctly as JSON and written as JSON:

In [1]: from biom import load_table

In [2]: t = load_table('Combined_otu_Early.biom')

In [3]: with open('foo.biom', 'w') as biom_file:
   ...:     t.to_json('foo', biom_file)
   ...:

In [4]:

It is a valid BIOM file:

$ biom validate-table -i Combined_otu_Early.biom

The input file is a valid BIOM-formatted file.

The user reported that this table used to work with QIIME 1.8.0 but not 1.9.0+.

@wasade
Member

wasade commented Nov 3, 2015

That still does not indicate whether it is the parser or the writer, or how to begin to solve this issue. What do the parsed metadata actually look like?

@ebolyen
Member

ebolyen commented Nov 3, 2015

Shouldn't the format spec for 2.1 define how to deal with all possible JSON serializations for metadata? This would indicate a bug in the writer, as the only thing BIOM 1.0 can physically store is JSON (maps, unicode, floats, bool, null, and arrays).

@jairideout
Member

In [1]: from biom import load_table

In [2]: t = load_table('Combined_otu_Early.biom')

In [3]: t.metadata(axis='sample')
Out[3]:
(defaultdict(<function biom.table.<lambda>>,
             {u'WW.1.memb.F.1': {u'BarcodeSequence': u'ACACTGTTCATG',
               u'Category': u'Early',
               u'Combined': u'Early_1',
               u'Description': u'WW1_memb_5hours_filtration_1',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'1',
               u'name': u'2'},
              u'WW.1.memb.F.2': {u'BarcodeSequence': u'AGCAGTCGCGAT',
               u'Category': u'Early',
               u'Combined': u'Early_1',
               u'Description': u'WW1_memb_5hours_filtration_2',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'1',
               u'name': u'17'}}),
 defaultdict(<function biom.table.<lambda>>,
             {u'WW.5.memb.F.1': {u'BarcodeSequence': u'AACTGTGCGTAC',
               u'Category': u'Early',
               u'Combined': u'Early_5',
               u'Description': u'WW5_memb_5hours_filtration_1',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'5',
               u'name': u'12'},
              u'WW.5.memb.F.2': {u'BarcodeSequence': u'ACGTCTGTAGCA',
               u'Category': u'Early',
               u'Combined': u'Early_5',
               u'Description': u'WW5_memb_5hours_filtration_2',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'5',
               u'name': u'29'}}),
 defaultdict(<function biom.table.<lambda>>,
             {u'WW.4.memb.F.1': {u'BarcodeSequence': u'ACGGATCGTCAG',
               u'Category': u'Early',
               u'Combined': u'Early_4',
               u'Description': u'WW4_memb_5hours_filtration_1',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'4',
               u'name': u'10'},
              u'WW.4.memb.F.2': {u'BarcodeSequence': u'AAGCTGCAGTCG',
               u'Category': u'Early',
               u'Combined': u'Early_4',
               u'Description': u'WW4_memb_5hours_filtration_2',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'4',
               u'name': u'26'}}),
 defaultdict(<function biom.table.<lambda>>,
             {u'WW.3.memb.F.1': {u'BarcodeSequence': u'ACAGACCACTCA',
               u'Category': u'Early',
               u'Combined': u'Early_3',
               u'Description': u'WW3_memb_5hours_filtration_1',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'3',
               u'name': u'8'},
              u'WW.3.memb.F.2': {u'BarcodeSequence': u'AGAGAGCAAGTG',
               u'Category': u'Early',
               u'Combined': u'Early_3',
               u'Description': u'WW3_memb_5hours_filtration_2',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'3',
               u'name': u'23'}}),
 defaultdict(<function biom.table.<lambda>>,
             {u'WW.2.memb.F.1': {u'BarcodeSequence': u'ACGCTCATGGAT',
               u'Category': u'Early',
               u'Combined': u'Early_2',
               u'Description': u'WW2_memb_5hours_filtration_1',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'2',
               u'name': u'4'},
              u'WW.2.memb.F.2': {u'BarcodeSequence': u'ACAGCAGTGGTC',
               u'Category': u'Early',
               u'Combined': u'Early_2',
               u'Description': u'WW2_memb_5hours_filtration_2',
               u'LinkerPrimerSequence': u'CAAGAGTTTGATCCTGGCTCAG',
               u'WWTP': u'2',
               u'name': u'20'}}))

In [4]:

@wasade
Member

wasade commented Nov 3, 2015

Evan: and nested forms, which is the issue.

Thanks, Jai. I'm not aware of this nesting having been encountered previously, and I don't know if anyone has defined a formatter. The 2.x spec is more rigid than the 1.0 spec, so this is actually not directly supported according to the format definition for either axis, unless we serialize the metadata into JSON or something.

Note, we tabled the refactor of the HDF5 readers and writers until the port to skbio.
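
To make the "serialize the metadata into JSON" idea concrete, a hypothetical formatter (not part of biom) could flatten nested values to JSON strings before writing, so every HDF5 column is plain text:

import json

def flatten_metadata(md):
    # Dict and list values become JSON strings; scalars pass through.
    return tuple(
        {k: json.dumps(v) if isinstance(v, (dict, list)) else v
         for k, v in m.items()}
        for m in md
    )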

@jairideout
Member

"I'm not aware of this nesting having been encountered previously"

This issue was originally opened by @wdwvt1 due to nested metadata in existing BIOM tables. That's why I reopened the issue and didn't create a new one.

@wasade
Member

wasade commented Nov 3, 2015

Why was it closed? @josenavas, did you ever get a fix for this?

@josenavas
Member

Sorry for the late reply:

The original problem was that the table was the result of a collapse operation. The fix was put in place in #515, and the specific description of the solution is outlined in the original discussion with @wdwvt1 above.

@josenavas
Member

@jairideout, it looks like your table has collapsed sample metadata. Can you use the biom convert command with the parameter --collapsed-samples? It will create a compatible BIOM table. Note that there was a discussion on how we were going to handle collapsed metadata, and the agreement was to store only the ids.
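
For example, with the table from the forum post (the --to-hdf5 and --table-type flags as in the original report):

biom convert -i Combined_otu_Early.biom -o Combined_otu_Early_hdf5.biom --to-hdf5 --table-type 'OTU table' --collapsed-samples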

@jairideout
Member

Thanks for the workaround; I can confirm this works.
