-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Give better errors when writing illegal column names #321
Comments
Could be better, but I think it gives the important information (what element of the AnnData, and what key) at the end: Example traceback:---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-dc0e4fee41fa> in <module>
1 a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
2 a.write_h5ad("tmp.h5ad")
----> 3 b = ad.read_h5ad("tmp.h5ad")
~/github/anndata/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
427 _clean_uns(d) # backwards compat
428
--> 429 return AnnData(**d)
430
431
~/github/anndata/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
311 varp=varp,
312 filename=filename,
--> 313 filemode=filemode,
314 )
315
~/github/anndata/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
492
493 # TODO: Think about consequences of making obsm a group in hdf
--> 494 self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm))
495 self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm))
496
~/github/anndata/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
229 self._data = dict()
230 if vals is not None:
--> 231 self.update(vals)
232
233
/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_collections_abc.py in update(*args, **kwds)
839 if isinstance(other, Mapping):
840 for key in other:
--> 841 self[key] = other[key]
842 elif hasattr(other, "keys"):
843 for key in other.keys():
~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
148
149 def __setitem__(self, key: str, value: V):
--> 150 value = self._validate_value(value, key)
151 self._data[key] = value
152
~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
212 f"value.index does not match parent’s axis {self.axes[0]} names"
213 )
--> 214 return super()._validate_value(val, key)
215
216
~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
47 """Raises an error if value is invalid"""
48 for i, axis in enumerate(self.axes):
---> 49 if self.parent.shape[axis] != val.shape[i]:
50 right_shape = tuple(self.parent.shape[a] for a in self.axes)
51 raise ValueError(
AttributeError: 'dict' object has no attribute 'shape'
In [4]: a = gen_adata((10, 10))
...: a.obs[1] = 1
...: a.write("tmp.h5ad")
...: b = ad.read_h5ad("tmp.h5ad")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
187 try:
--> 188 return func(elem, key, val, *args, **kwargs)
189 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
280 else:
--> 281 group[key] = series.values
282
/usr/local/lib/python3.7/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
369 with phil:
--> 370 name, lcpl = self._e(name, lcpl=True)
371
/usr/local/lib/python3.7/site-packages/h5py/_hl/base.py in _e(self, name, lcpl)
136 try:
--> 137 name = name.encode('ascii')
138 coding = h5t.CSET_ASCII
AttributeError: 'int' object has no attribute 'encode'
The above exception was the direct cause of the following exception:
AttributeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
187 try:
--> 188 return func(elem, key, val, *args, **kwargs)
189 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
254 for colname, series in df.items():
--> 255 write_series(group, colname, series, dataset_kwargs=dataset_kwargs)
256
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
194 f" from {parent}."
--> 195 ) from e
196
AttributeError: 'int' object has no attribute 'encode'
Above error raised while writing key 1 of <class 'h5py._hl.group.Group'> from /.
The above exception was the direct cause of the following exception:
AttributeError Traceback (most recent call last)
<ipython-input-4-44b896a10938> in <module>
1 a = gen_adata((10, 10))
2 a.obs[1] = 1
----> 3 a.write("tmp.h5ad")
4 b = ad.read_h5ad("tmp.h5ad")
~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
2008 force_dense=force_dense,
2009 as_dense=as_dense,
-> 2010 )
2011
2012 if self.isbacked:
~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
103 else:
104 write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 105 write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
106 write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
107 write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)
/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/functools.py in wrapper(*args, **kw)
838 '1 positional argument')
839
--> 840 return dispatch(args[0].__class__)(*args, **kw)
841
842 funcname = getattr(func, '__name__', 'singledispatch function')
~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
124 if key in f:
125 del f[key]
--> 126 _write_method(type(value))(f, key, value, *args, **kwargs)
127
128
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
193 f"Above error raised while writing key {key!r} of {type(elem)}"
194 f" from {parent}."
--> 195 ) from e
196
197 return func_wrapper
AttributeError: 'int' object has no attribute 'encode'
Above error raised while writing key 1 of <class 'h5py._hl.group.Group'> from /.
Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
I think this one should work fine now.
To my surprise, this one works with dataframes, but not with values in mappings. import numpy as np
import anndata as ad
from anndata.tests.helpers import gen_adata
a = gen_adata((10, 10))
a.obs["name/with/slashes"] = 1
a.write("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
print(b.obs.columns)
# Index(['obs_cat', 'cat_ordered', 'int64', 'float64', 'uint8',
# 'name/with/slashes'],
a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
a.write_h5ad("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad") TracebackAttributeError Traceback (most recent call last)
<ipython-input-3-dc0e4fee41fa> in <module>
1 a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
2 a.write_h5ad("tmp.h5ad")
----> 3 b = ad.read_h5ad("tmp.h5ad")
~/github/anndata/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
427 _clean_uns(d) # backwards compat
428
--> 429 return AnnData(**d)
430
431
~/github/anndata/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
311 varp=varp,
312 filename=filename,
--> 313 filemode=filemode,
314 )
315
~/github/anndata/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
492
493 # TODO: Think about consequences of making obsm a group in hdf
--> 494 self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm))
495 self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm))
496
~/github/anndata/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
229 self._data = dict()
230 if vals is not None:
--> 231 self.update(vals)
232
233
/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_collections_abc.py in update(*args, **kwds)
839 if isinstance(other, Mapping):
840 for key in other:
--> 841 self[key] = other[key]
842 elif hasattr(other, "keys"):
843 for key in other.keys():
~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
148
149 def __setitem__(self, key: str, value: V):
--> 150 value = self._validate_value(value, key)
151 self._data[key] = value
152
~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
212 f"value.index does not match parent’s axis {self.axes[0]} names"
213 )
--> 214 return super()._validate_value(val, key)
215
216
~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
47 """Raises an error if value is invalid"""
48 for i, axis in enumerate(self.axes):
---> 49 if self.parent.shape[axis] != val.shape[i]:
50 right_shape = tuple(self.parent.shape[a] for a in self.axes)
51 raise ValueError(
AttributeError: 'dict' object has no attribute 'shape'
I don't get an error for that one. |
Update on It works, but I'm not sure it's doing the right thing. Zarr does something similar. Truncated output of
zarr["obs"].tree()
Should we normalize or error for these? I'm thinking normalize since I think slashes will be fairly common in column/ group names. E.g. "CD4+/CD8-". It would be good to use some external standard for normalizing these names. I'm not sure where to find this kind of thing though. |
|
I was thinking we'd only mangle to and from disk, and it should be a in a completely reversible way. Basically, I was hoping we could just escape characters that filesystems treat as special. I'm not sure if this can actually be done with hdf5. The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google) says group names must be ascii and can't a "." or "/". However, group names can contain a "." and be unicode (mention in other parts), so maybe there's a way for them to include "/'.
What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes. |
You mean exactly the way I described above with the division slash? We can of course also escape the division slash by doubling it in the off chance someone uses one as column name, then stuff is reversible.
fucking amazing, their online PDF viewer doesn’t even support Ctrl+F. Do they actively want to make developers’ lives harder?
HAhahahaHA “just” 🤪 Oh god, if we had to support filesystems we could either invest days into researching all the legacy crap that went into the 3 OS’s main file systems during decades of computing history or restrict ourselves to
It does? We don’t have code that does |
Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should. I was thinking more of escaping like
Well, they have pivoted to emphasize consulting...
Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that. There is a
Since the order of the columns is important for dataframes, we save the names in the correct order in the attributes, then do |
I’m pretty sure this won’t work. Escaping is something that depends on the interpreter. If HDF5 group names work like files system inode names in this regard it won’t interpret escape characters, it’ll simply do
As I said, if we go the mangling route, we should document whatever we do and de-mangle group names when reading. But yeah, it’s probably a good idea to not use lookalike characters. We should use an escape character that looks outlandishly unicody and is very unlikely to be used in an old anndata object so we don’t accidentally “unescape” something that was never escaped to begin with. Word uses some like that for their totally-not-LaTeX UnicodeMath (PDF)
Well, did I ever hit the nail on the head.
Ah fuck, right, Zarr does that. Well, if we want to use it more prominently we have to stop using group names to store column names, plain and simple. The pain of figuring out and escaping all known file systems’ disallowed characters is one thing, but then there’s also device files (which on windows don’t have absolute paths and are named e.g. com1), different file systems are case sensitive while others aren’t, starting a column name with |
There’s a too generic/unhelpful error when writing DataFrame column names that are
numeric
'index'
containing slashes (unsure if this is still the case)
non-unicode: issue_52_demo.tar.gz
#52 has more details
The text was updated successfully, but these errors were encountered: