Object array fill value #216

Open
alimanfoo opened this issue Dec 7, 2017 · 11 comments

@alimanfoo
Member

If a fill value is provided for an array with object dtype, and the fill value cannot be JSON encoded, an error will occur:

In [5]: zarr.full(10, dtype=bytes, object_codec=numcodecs.VLenBytes(), fill_value=b'foobar')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-fa0fae8e2b58> in <module>()
----> 1 zarr.full(10, dtype=bytes, object_codec=numcodecs.VLenBytes(), fill_value=b'foobar')

~/src/github/alimanfoo/zarr/zarr/creation.py in full(shape, fill_value, **kwargs)
    267     """
    268 
--> 269     return create(shape=shape, fill_value=fill_value, **kwargs)
    270 
    271 

~/src/github/alimanfoo/zarr/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, read_only, object_codec, **kwargs)
    112     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    113                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
--> 114                chunk_store=chunk_store, filters=filters, object_codec=object_codec)
    115 
    116     # instantiate array

~/src/github/alimanfoo/zarr/zarr/storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    290                          order=order, overwrite=overwrite, path=path,
    291                          chunk_store=chunk_store, filters=filters,
--> 292                          object_codec=object_codec)
    293 
    294 

~/src/github/alimanfoo/zarr/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    364                 order=order, filters=filters_config)
    365     key = _path_to_prefix(path) + array_meta_key
--> 366     store[key] = encode_array_metadata(meta)
    367 
    368 

~/src/github/alimanfoo/zarr/zarr/meta.py in encode_array_metadata(meta)
     66     )
     67     s = json.dumps(meta, indent=4, sort_keys=True, ensure_ascii=True,
---> 68                    separators=(',', ': '))
     69     b = s.encode('ascii')
     70     return b

/usr/lib/python3.6/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

/usr/lib/python3.6/json/encoder.py in encode(self, o)
    199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
--> 201             chunks = list(chunks)
    202         return ''.join(chunks)
    203 

/usr/lib/python3.6/json/encoder.py in _iterencode(o, _current_indent_level)
    428             yield from _iterencode_list(o, _current_indent_level)
    429         elif isinstance(o, dict):
--> 430             yield from _iterencode_dict(o, _current_indent_level)
    431         else:
    432             if markers is not None:

/usr/lib/python3.6/json/encoder.py in _iterencode_dict(dct, _current_indent_level)
    402                 else:
    403                     chunks = _iterencode(value, _current_indent_level)
--> 404                 yield from chunks
    405         if newline_indent is not None:
    406             _current_indent_level -= 1

/usr/lib/python3.6/json/encoder.py in _iterencode(o, _current_indent_level)
    435                     raise ValueError("Circular reference detected")
    436                 markers[markerid] = o
--> 437             o = _default(o)
    438             yield from _iterencode(o, _current_indent_level)
    439             if markers is not None:

/usr/lib/python3.6/json/encoder.py in default(self, o)
    178         """
    179         raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180                         o.__class__.__name__)
    181 
    182     def encode(self, o):

TypeError: Object of type 'bytes' is not JSON serializable
@alimanfoo
Member Author

Hard to see how to resolve this one without a spec change and/or breaking compatibility with previous data. May have to punt on this for 2.2.

@jakirkham
Member

Hopefully this makes the issue clearer to someone less familiar with Zarr (and confirms that I understand it myself). Please correct me if any of this is wrong.

The fill_value is stored in .zarray, which is a JSON file. Hence the fill_value has to be JSON serializable.

A workaround would be to not use fill_value for these cases and simply fill all values in the array with the intended default value. This will be less space efficient than having the intended fill_value set, but is an option for now.
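The JSON constraint described above can be reproduced with the standard library alone. This is a minimal sketch, not Zarr's actual code path: the `meta` dictionary is a hypothetical stand-in for the metadata written to .zarray, and only the `fill_value` entry matters here.

```python
import json

# Hypothetical stand-in for the array metadata that ends up in .zarray.
# Only the bytes fill_value is relevant to the failure.
meta = {"dtype": "|O", "fill_value": b"foobar"}

try:
    json.dumps(meta, indent=4, sort_keys=True)
    serializable = True
except TypeError as exc:
    # json has no default encoding for bytes objects, so dumps() raises.
    serializable = False
    error_message = str(exc)

print(serializable)
```

Any fill value that `json.dumps` accepts directly (numbers, strings, `None`) would not take the `except` branch.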

@alimanfoo
Member Author

alimanfoo commented Dec 7, 2017 via email

@jakirkham
Member

Thanks for the follow-up. Do we guarantee that fill_value is always specified in .zarray? If not, what if we started storing the fill_value to its own "chunk"?

@alimanfoo
Member Author

alimanfoo commented Dec 7, 2017 via email

@jakirkham
Member

Yeah that sounds ideal.

An alternative might be to base64 encode it at the end and shove it back in JSON. Can flesh this out a bit more if it is interesting.

@alimanfoo
Member Author

alimanfoo commented Dec 7, 2017 via email

@jakirkham
Member

I was imagining the base64 encoding would look a bit like this, so that it wouldn't overlap with any existing strategy. That said, I'm not attached to this if you have other ideas for tackling the problem. FWIW, I don't think this is pressing, just interesting.

    "fill_value": {
        "base64": "..."
    }
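The `{"base64": "..."}` layout above could be round-tripped roughly as follows. This is only a sketch of the idea under discussion: `encode_fill_value` and `decode_fill_value` are hypothetical helper names, not part of Zarr, and the sketch assumes that no existing fill value is already a JSON object with a single `"base64"` key.

```python
import base64
import json


def encode_fill_value(value):
    """Hypothetical helper: wrap bytes in a {"base64": ...} object.

    Values that are not bytes are returned unchanged.
    """
    if isinstance(value, bytes):
        return {"base64": base64.standard_b64encode(value).decode("ascii")}
    return value


def decode_fill_value(value):
    """Hypothetical helper: undo encode_fill_value after json.loads."""
    if isinstance(value, dict) and set(value) == {"base64"}:
        return base64.standard_b64decode(value["base64"])
    return value


# Round trip through JSON text, as the metadata file would require.
meta_json = json.dumps({"fill_value": encode_fill_value(b"foobar")}, sort_keys=True)
restored = decode_fill_value(json.loads(meta_json)["fill_value"])
```

Wrapping the encoded string in an object, rather than storing a bare string, is what keeps it distinguishable from an ordinary string fill value when the metadata is read back.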

Also eager to see 2.2 ship reasonably soon. There's a lot of great functionality already built up in master, and it would be great to get it out there so more people can access it. Happy to table this until afterwards.

@jakirkham
Member

To propose another option here, we could use jsonpickle, which can handle a wide variety of cases for us. The usual concerns with pickling data still apply, though, and I'm not sure how we would avoid those.

@alimanfoo
Member Author

alimanfoo commented Dec 4, 2018 via email

@joshmoore
Member

Slightly related: several fixes to object arrays landed in #806.
cc: @abergou
