Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structured arrays support #1

Closed
wants to merge 70 commits into from
Closed

Structured arrays support #1

wants to merge 70 commits into from

Conversation

aldanor
Copy link
Owner

@aldanor aldanor commented Jun 21, 2016

This branch represents work-in-progress re: structured arrays support.

See example20.cpp, example20.py and example20.ref for the attached tests. Most of the implementation is contained in numpy.h.

  • Added py::memoryview wrapper type with constructor from buffer_info.
  • py::array constructor from buffer logic had to be rewritten to account for structured types.
    • Added PYBIND11_NUMPY_DTYPE macro for defining dtypes for structured arrays.
  • npy_format_descriptor::value had to be changed to npy_format_descriptor::dtype() since the typenums are only sufficient for built-in types. Internally, it is converted to a descriptor anyway by numpy. However, this is backwards-compatible since the typenums for non-structured types are still accessible via ::value.
  • format_descriptor::value had to be changed to format_descriptor::format() so as to be able to use numpy's logic to general buffer format strings at runtime. However, the change is backwards compatible since the format strings known at compile time are still accessible via ::value.

@aldanor
Copy link
Owner Author

aldanor commented Jun 21, 2016

Fixed segfault with nested descriptors, apparently you need to incref them before passing on to numpy API, the tests actually pass fine now.

@aldanor aldanor force-pushed the recarray branch 2 times, most recently from a8a3c53 to 84a5056 Compare July 2, 2016 13:53
auto buf_info = info;
if (!buf_info.ptr)
// always allocate at least 1 element, same way as NumPy does it
buf_info.ptr = std::calloc(std::max(info.size, 1ul), info.itemsize);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see any way to provide a way to allocate uninitialized memory, which was what the previous implementation did? I would say that in almost all cases, the user will allocate a buffer and then fill it evaluations with some kind of numerical routine, so the initialization is just wasted.

also, should this be (size_t) 1 instead of 1ul in the std::max call?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: 1ul, yea, I guess I was just lazy :)

Re: malloc vs calloc, here's the only piece of code from numpy that refers to NPY_NEEDS_INIT:

    if (data == NULL) {
        /*
         * Allocate something even for zero-space arrays
         * e.g. shape=(0,) -- otherwise buffer exposure
         * (a.data) doesn't work as it should.
         */
        if (nbytes == 0) {
            nbytes = descr->elsize;
        }
        /*
         * It is bad to have uninitialized OBJECT pointers
         * which could also be sub-fields of a VOID array
         */
        if (zeroed || PyDataType_FLAGCHK(descr, NPY_NEEDS_INIT)) {
            data = npy_alloc_cache_zero(nbytes);
        }
        else {
            data = npy_alloc_cache(nbytes);
        }
        if (data == NULL) {
            PyErr_NoMemory();
            goto fail;
        }
        fa->flags |= NPY_ARRAY_OWNDATA;

    }

Realistically, if we never handle object dtypes, we can just use malloc instead of calloc I guess.

>>> a = np.array(['foo', 'bar'], dtype=object)
>>> memoryview(a).format
'O'

Or maybe we can check if there's any 'O' in the format string, we switch to calloc.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to ignore object types, which is somewhat esoteric in any case.

@aldanor
Copy link
Owner Author

aldanor commented Jul 17, 2016

@wjakob Hey, I think I've fixed all issues with padding and stuff, rebased onto the most recent master (this took me a good while since there were name clashes, had to rename example18.* to example20.*, do some git filter-branch dark magic etc...), now just following up on the rest of your comments here and it will be then done :)

A few things:

  • Is there a great value in having PYBIND11_DESCR being constexpr (and hence descr not being constructible from runtime strings) other than it being fancy? I see it's only used to generate signatures in docstrings, that sounds quite runtime to me :)

It would be sure nice to see signatures like

def f(arr : numpy.ndarray[dtype=Foo]) -> numpy.ndarray[dtype=Bar]

instead of it just being "struct" for all user-defined types. Are there any workarounds around this?

  • Do you think it's worth introducing a py::dtype object? I don't think your previous suggestion (i.e. have methods like .append(offset, name, format)) would work since not all dtypes are structured, so you would need a whole hierarchy of c++ classes. But maybe moving padding stripping logic and a few other things into py::dtype would make sense (e.g., construct dtype from buffer format string, extract it from an array object, etc). Also there's now functions like py::dtype_of<T>() (maybe makes sense to make this a static function in py::dtype?) and py::array::dtype() that should naturally return py::dtype objects.
  • Re: static class variables, your suggestion wouldn't work, since they're not constexpr and this library is header-only, so there's no way to statically initialize them. Which implies they should be kept in those ugly static functions...

P.S. by the way there's a few exception leaks in examples 5 and 18 in master, I've posted the issue to the main repo

struct NestedStruct {
SimpleStruct a;
PackedStruct b;
} __attribute__((packed));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think MSVC will like this attribute. Do you want to turn the PR into an official one at the pybind11 repository? That way, the CI build bots will try to compile your code on various platforms, which may uncover a few more things.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch, I'll fix that tonight and open a PR to pybind11 :)

// please see my other few comments above

@wjakob
Copy link

wjakob commented Jul 18, 2016

PYBIND11_DESCR really has to stay constexpr. This enables generating function signatures at compile time, which reduces the size of generated binaries considerably.

I am greatly in favor of introducing a py::dtype object that can be used dynamically (in whatever way makes sense -- don't interpret my previous suggestion for an API too strictly).

@aldanor
Copy link
Owner Author

aldanor commented Jul 18, 2016

@wjakob I believe I've addressed all of the comments here (last 6 commits), except moving dtype-related logic inside of a py::dtype wrapper -- working on that right now.

I've also recalled I forgot to add descriptor bindings for one important class of types: string fields; it's fairly straightforward actually (e.g. so that char[7] and std::array<char, 7> map to S7 in numpy and 7s in buffer protocol), however I ran into a weird problem with detail::_, maybe I don't fully understand how it works :)

template <size_t N> struct npy_format_descriptor<char[N]> {
    static PYBIND11_DESCR name() { return _("S") + _<N>(); } 
    // more stuff
};

This compiles/links fine, however if you try to run it with an example, you get

Traceback (most recent call last):
  File "../example/example20.py", line 6, in <module>
    from example import (
ImportError: dlopen(../example/example.so, 2): Symbol not found: __ZN8pybind116detail10int_to_strILm0EJLm3EEE6digitsE
  Referenced from: ../example/example.so
  Expected in: flat namespace
 in ../example/example.so

So I can't use _ on template parameters? 🐼

@wjakob
Copy link

wjakob commented Jul 18, 2016

Weird .. sounds like a potential compiler issue. The whole point of _ is that it can be used with template parameters :)

@aldanor
Copy link
Owner Author

aldanor commented Jul 19, 2016

@wjakob Hmm, a very short example (this is clang on OSX, c++14). Should this work / compile?

#include <pybind11/pybind11.h>

namespace pybind11 {
namespace detail {
template<size_t N> PYBIND11_DESCR foo() { return _("S") + _<N>(); }
}
}

int main() {
    std::cout << pybind11::detail::foo<3>().text() << std::endl;
}
Undefined symbols for architecture x86_64:
  "pybind11::detail::int_to_str<0ul, 3ul>::digits", referenced from:
      auto pybind11::detail::_<3ul>() in pb11-_-2001d1.o
ld: symbol(s) not found for architecture x86_64

(This is built without -undefined dynamic_lookup so it actually triggers a compile error and not a runtime one)

Interestingly, it seems to work on gcc 5 on Linux (c++14), just checked.

@wjakob
Copy link

wjakob commented Jul 19, 2016

Fixed (4a87933). The issue is that a reference of a constexpr value was taken, which requires it to actually exist at runtime. The commit now switches to pass-by-value at compile time.

@aldanor
Copy link
Owner Author

aldanor commented Jul 19, 2016

Cool, I'll check this tonight when I get back to my mac 👍

// Wonder why gcc5 was fine with it though?

@wjakob
Copy link

wjakob commented Jul 19, 2016

I think whether this should give a linker error is implementation/optimization-specific behavior that is not covered by the spec.

@aldanor
Copy link
Owner Author

aldanor commented Jul 19, 2016

@wjakob Your int_to_str fix seems to have worked, I've implemented format/numpy descriptors for char[N] and std::array<char, N> + tests (that's not directly recarray-related, but very dtype-related, and is a great addition to our numpy arsenal, idk if it's worth mentioning in the docs or not), works on both clang/osx and gcc5/linux.

I've also moved all numpy API-related stuff out of py::array into a separate container for the sake of clarity, it was started to get swamped with it. So now the last cleanup bit is finishing py::dtype, hope that won't take long, will keep you posted :)

@wjakob
Copy link

wjakob commented Jul 21, 2016

Ok that sounds good! Please go ahead and create an official PR at the main repo once you have something.

@aldanor
Copy link
Owner Author

aldanor commented Jul 23, 2016

@wjakob This is the final bit of this PR, I've pushed preliminary implementation of py::dtype which mostly works (all tests pass, but when the process closes there's a segfault in handle::dec_ref(), seems like refcount is off somewhere but I can seem to find our where exactly so far........ -- hopefully I can find it).

While I'm searching for the refcount bug, looking forward to any comments and suggestions re: the API 😃

@wjakob
Copy link

wjakob commented Jul 24, 2016

@aldanor: This looks fantastic. I would suggest one minor reorganization and one addition before wrapping it up:

There is a part of the code which sets up the arguments for the dtype and then calls the dtype constructor.

args["names"] = names; args["formats"] = formats; args["offsets"] = offsets;

It would make more sense to me if (instead of the static from_args function) there was a special dtype constructor which directly takes these lists, i.e.:

dtype::dtype(list names, list formats, list offsets, size_t itemsize) 

(or something like that, feel free to modify)

The other bit that would be super nice to have is an array constructor which can accept the dynamic dtype:

array:array(const dtype &dt, size_t size, const void *ptr);

@aldanor
Copy link
Owner Author

aldanor commented Jul 24, 2016

Agreed on having two more constructors, good points! (maybe you can spot the ref count bug too, haha? refcount bug fixed)

Should from_args stay too or should it become internal? It accepts a much wider range of things (everything that np.dtype does, and it's a lot if you check the docstring). The only reason it's a static function and not a ctor is that it takes a single object which would mix up with object's ctors

@aldanor
Copy link
Owner Author

aldanor commented Jul 24, 2016

@wjakob Okay, done, the refcount bug fixed, plus a few constructors added, see latest commits:

  • simplified buffer_info constructor for 1-D case
  • dtype constructor accepting names/formats/offsets/itemsize
  • array gained a constructor accepting a dtype
  • all copy/paste eliminated inarray / array_t constructors, added a whole bunch of them to cover most common cases (e.g. if you provide just the shape, strides will be auto-computed)

Basically, it's now possible to do stuff like this:

auto arr1 = py::array_t<int>(10);
auto arr2 = py::array("int32", { 3, 2 });

without having to mess with buffer_info in such simple examples.

I could probably fix the existing examples in the docs to use those.

Also, the const qualifier for the data pointer was dropped for array ctors -- I'm not absolutely sure about this yet but it looks a more reasonable default, I think? This will prob need to be revisited as a part of the whole casting rework, but generally we do want to create and export arrays that don't own their data (e.g. wrappers to expose huge vectors in c++), I'd say that'd be one of the most common use cases. It's also possible to add ctors from std::vector, too, but that needs more though re: casting and data ownership (the general numpy convention is that nothing is copied unless you explicitly ask for it, e.g. via ensurecopy flag; this would also mean though that such a constructor would be unsafe be default since its data could now reside in some external vector whose lifetime is not tied to the array). Maybe it's best to just have pointers to hint at the fact that it's not 100% safe and you should know what you're doing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants