Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give better errors when writing illegal column names #321

Open
flying-sheep opened this issue Feb 13, 2020 · 7 comments
Open

Give better errors when writing illegal column names #321

flying-sheep opened this issue Feb 13, 2020 · 7 comments

Comments

@flying-sheep
Copy link
Member

There’s a too generic/unhelpful error when writing DataFrame column names that are

  • numeric

  • 'index'

  • containing slashes (unsure if this is still the case)

  • non-unicode: issue_52_demo.tar.gz

    $ ./demo_load_maggie.py 
    Variable names are not unique. To make them unique, call `.var_names_make_unique`.
    anndata/readwrite/write.py:301: UserWarning: Could not save field with key = "uns/condition_categories" to hdf5 file.
    'to hdf5 file.'.format(key))

#52 has more details

@ivirshup
Copy link
Member

ivirshup commented Feb 16, 2020

numeric

Could be better, but I think it gives the important information (what element of the AnnData, and what key) at the end:

Example traceback:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-dc0e4fee41fa> in <module>
      1 a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
      2 a.write_h5ad("tmp.h5ad")
----> 3 b = ad.read_h5ad("tmp.h5ad")

~/github/anndata/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    427     _clean_uns(d)  # backwards compat
    428 
--> 429     return AnnData(**d)
    430 
    431 

~/github/anndata/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    311                 varp=varp,
    312                 filename=filename,
--> 313                 filemode=filemode,
    314             )
    315 

~/github/anndata/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    492 
    493         # TODO: Think about consequences of making obsm a group in hdf
--> 494         self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm))
    495         self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm))
    496 

~/github/anndata/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
    229         self._data = dict()
    230         if vals is not None:
--> 231             self.update(vals)
    232 
    233 

/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_collections_abc.py in update(*args, **kwds)
    839             if isinstance(other, Mapping):
    840                 for key in other:
--> 841                     self[key] = other[key]
    842             elif hasattr(other, "keys"):
    843                 for key in other.keys():

~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
    148 
    149     def __setitem__(self, key: str, value: V):
--> 150         value = self._validate_value(value, key)
    151         self._data[key] = value
    152 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
    212                 f"value.index does not match parent’s axis {self.axes[0]} names"
    213             )
--> 214         return super()._validate_value(val, key)
    215 
    216 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
     47         """Raises an error if value is invalid"""
     48         for i, axis in enumerate(self.axes):
---> 49             if self.parent.shape[axis] != val.shape[i]:
     50                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
     51                 raise ValueError(

AttributeError: 'dict' object has no attribute 'shape'

In [4]: a = gen_adata((10, 10)) 
   ...: a.obs[1] = 1 
   ...: a.write("tmp.h5ad") 
   ...: b = ad.read_h5ad("tmp.h5ad")                                                                                                    
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    187         try:
--> 188             return func(elem, key, val, *args, **kwargs)
    189         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
    280     else:
--> 281         group[key] = series.values
    282 

/usr/local/lib/python3.7/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
    369         with phil:
--> 370             name, lcpl = self._e(name, lcpl=True)
    371 

/usr/local/lib/python3.7/site-packages/h5py/_hl/base.py in _e(self, name, lcpl)
    136             try:
--> 137                 name = name.encode('ascii')
    138                 coding = h5t.CSET_ASCII

AttributeError: 'int' object has no attribute 'encode'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    187         try:
--> 188             return func(elem, key, val, *args, **kwargs)
    189         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
    254     for colname, series in df.items():
--> 255         write_series(group, colname, series, dataset_kwargs=dataset_kwargs)
    256 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    194                 f" from {parent}."
--> 195             ) from e
    196 

AttributeError: 'int' object has no attribute 'encode'

Above error raised while writing key 1 of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-4-44b896a10938> in <module>
      1 a = gen_adata((10, 10))
      2 a.obs[1] = 1
----> 3 a.write("tmp.h5ad")
      4 b = ad.read_h5ad("tmp.h5ad")

~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   2008             force_dense=force_dense,
   2009             as_dense=as_dense,
-> 2010         )
   2011 
   2012         if self.isbacked:

~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
    103         else:
    104             write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 105         write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
    106         write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    107         write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)

/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/functools.py in wrapper(*args, **kw)
    838                             '1 positional argument')
    839 
--> 840         return dispatch(args[0].__class__)(*args, **kw)
    841 
    842     funcname = getattr(func, '__name__', 'singledispatch function')

~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
    124     if key in f:
    125         del f[key]
--> 126     _write_method(type(value))(f, key, value, *args, **kwargs)
    127 
    128 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    193                 f"Above error raised while writing key {key!r} of {type(elem)}"
    194                 f" from {parent}."
--> 195             ) from e
    196 
    197     return func_wrapper

AttributeError: 'int' object has no attribute 'encode'

Above error raised while writing key 1 of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.

'index'

I think this one should work fine now. _index won't work, but I think we give a good error.

containing slashes (unsure if this is still the case)

To my surprise, this one works with dataframes, but not with values in mappings.

import numpy as np
import anndata as ad
from anndata.tests.helpers import gen_adata

a = gen_adata((10, 10))
a.obs["name/with/slashes"] = 1
a.write("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
print(b.obs.columns)
# Index(['obs_cat', 'cat_ordered', 'int64', 'float64', 'uint8',
#       'name/with/slashes'],

a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
a.write_h5ad("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
Traceback
AttributeError                            Traceback (most recent call last)
<ipython-input-3-dc0e4fee41fa> in <module>
      1 a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
      2 a.write_h5ad("tmp.h5ad")
----> 3 b = ad.read_h5ad("tmp.h5ad")

~/github/anndata/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    427     _clean_uns(d)  # backwards compat
    428 
--> 429     return AnnData(**d)
    430 
    431 

~/github/anndata/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    311                 varp=varp,
    312                 filename=filename,
--> 313                 filemode=filemode,
    314             )
    315 

~/github/anndata/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    492 
    493         # TODO: Think about consequences of making obsm a group in hdf
--> 494         self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm))
    495         self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm))
    496 

~/github/anndata/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
    229         self._data = dict()
    230         if vals is not None:
--> 231             self.update(vals)
    232 
    233 

/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_collections_abc.py in update(*args, **kwds)
    839             if isinstance(other, Mapping):
    840                 for key in other:
--> 841                     self[key] = other[key]
    842             elif hasattr(other, "keys"):
    843                 for key in other.keys():

~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
    148 
    149     def __setitem__(self, key: str, value: V):
--> 150         value = self._validate_value(value, key)
    151         self._data[key] = value
    152 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
    212                 f"value.index does not match parent’s axis {self.axes[0]} names"
    213             )
--> 214         return super()._validate_value(val, key)
    215 
    216 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
     47         """Raises an error if value is invalid"""
     48         for i, axis in enumerate(self.axes):
---> 49             if self.parent.shape[axis] != val.shape[i]:
     50                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
     51                 raise ValueError(

AttributeError: 'dict' object has no attribute 'shape'

non-unicode:

I don't get an error for that one.

@ivirshup
Copy link
Member

ivirshup commented Feb 16, 2020

Update on name/with/slashes in dataframes:

It works, but I'm not sure it's doing the right thing. Zarr does something similar. Truncated output of h5ls -r tmp.h5ad:

/obs                     Group
/obs/__categories        Group
/obs/__categories/cat_ordered Dataset {10}
/obs/__categories/obs_cat Dataset {7}
/obs/_index              Dataset {10}
/obs/cat_ordered         Dataset {10}
/obs/float64             Dataset {10}
/obs/int64               Dataset {10}
/obs/name                Group
/obs/name/with           Group
/obs/name/with/slashes   Dataset {10}
/obs/obs_cat             Dataset {10}
/obs/uint8               Dataset {10}
zarr["obs"].tree()
obs
 ├── __categories
 │   ├── cat_ordered (10,) object
 │   └── obs_cat (7,) object
 ├── _index (10,) object
 ├── cat_ordered (10,) int8
 ├── float64 (10,) float64
 ├── int64 (10,) int64
 ├── name
 │   └── with
 │       └── slashes (10,) int64
 ├── obs_cat (10,) int8
 └── uint8 (10,) uint8

Should we normalize or error for these? I'm thinking normalize since I think slashes will be fairly common in column/ group names. E.g. "CD4+/CD8-".

It would be good to use some external standard for normalizing these names. I'm not sure where to find this kind of thing though.

@flying-sheep
Copy link
Member Author

flying-sheep commented Feb 17, 2020

  • Option A mangle/encode the names (e.g. replacing slashes with the unicode division slash /), which modifies the names, so things aren’t as the user specified) and document this mangling in the format documentation.

    I think mangling is inherently problematic, as it’ll result in subtle bugs arising from just a few columns not matching the names you put in. In this case it’s less bad because we wouldn’t mangle data (like gene names) but manually-addressed metadata, and we can unmangle again, so the bugs would be in other peoples’ code reading AnnData.

  • Option B store the names somewhere else (which makes the file less explorable since you now need code instead of a generic CLI tool to descend into the HDF5 file)

  • Option C leave things as is, fix all code that iterates over HDF5-stored dataframe columns, and document this so people wanting to read the thing don’t make a mistake

    If it works, why does it matter if it’s internally represented as a tree of groups? The only problem is code that loops over dataframe columns, which would have to be changed to match and be a source of bugs for people wanting to read AnnData who don’t know this.

@ivirshup
Copy link
Member

Option A mangle/encode the names (e.g. replacing slashes with the unicode division slash /→∕), which modifies the names, so things aren’t as the user specified) and document this mangling in the format documentation.

I was thinking we'd only mangle to and from disk, and it should be a in a completely reversible way. Basically, I was hoping we could just escape characters that filesystems treat as special.

I'm not sure if this can actually be done with hdf5. The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google) says group names must be ascii and can't a "." or "/". However, group names can contain a "." and be unicode (mention in other parts), so maybe there's a way for them to include "/'.

Option C ... fix all code that iterates over HDF5-stored dataframe columns

What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes.

@flying-sheep
Copy link
Member Author

flying-sheep commented Feb 18, 2020

I was thinking we'd only mangle to and from disk, and it should be a in a completely reversible way.

You mean exactly the way I described above with the division slash? We can of course also escape the division slash by doubling it in the off chance someone uses one as column name, then stuff is reversible.

The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google)

fucking amazing, their online PDF viewer doesn’t even support Ctrl+F. Do they actively want to make developers’ lives harder?

I was hoping we could just escape characters that filesystems treat as special

HAhahahaHA “just” 🤪 Oh god, if we had to support filesystems we could either invest days into researching all the legacy crap that went into the 3 OS’s main file systems during decades of computing history or restrict ourselves to [A-Za-z_-] (or so! no idea if dashes and underscores are allowed everywhere! I mean, colons aren’t, so everything is possible 😵)

What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes.

It does? We don’t have code that does for column_dataset in columns_group without checking if column_dataset is a group itself?

@ivirshup
Copy link
Member

You mean exactly the way I described above with the division slash?

Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should. I was thinking more of escaping like \/.

Do they actively want to make developers’ lives harder?

Well, they have pivoted to emphasize consulting...

HAhahahaHA “just”

Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that. There is a normalize_key argument, but it might just enforce lowercasing.

It does? We don’t have code that does for column_dataset in columns_group without checking if column_dataset is a group itself?

Since the order of the columns is important for dataframes, we save the names in the correct order in the attributes, then do {k: h5group[k] for k in colnames}. Here's the logic: https://github.com/theislab/anndata/blob/4440b90ff3dff213b4c512478e21426cf210368d/anndata/_io/h5ad.py#L460-L469

@flying-sheep
Copy link
Member Author

flying-sheep commented Feb 19, 2020

I was thinking more of escaping like \/.

I’m pretty sure this won’t work. Escaping is something that depends on the interpreter. If HDF5 group names work like files system inode names in this regard it won’t interpret escape characters, it’ll simply do *group_names, value_name = value_path.split('/'), making it impossible to put a literal slash anywhere in a group name.

Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should.

As I said, if we go the mangling route, we should document whatever we do and de-mangle group names when reading. But yeah, it’s probably a good idea to not use lookalike characters. We should use an escape character that looks outlandishly unicody and is very unlikely to be used in an old anndata object so we don’t accidentally “unescape” something that was never escaped to begin with. Word uses some like that for their totally-not-LaTeX UnicodeMath (PDF)

Well, they have pivoted to emphasize consulting...

Well, did I ever hit the nail on the head.

Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that.

Ah fuck, right, Zarr does that. Well, if we want to use it more prominently we have to stop using group names to store column names, plain and simple. The pain of figuring out and escaping all known file systems’ disallowed characters is one thing, but then there’s also device files (which on windows don’t have absolute paths and are named e.g. com1), different file systems are case sensitive while others aren’t, starting a column name with / might introduce a security risk, and I bet there’s more hidden nastiness. Just no.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants