WIP - C Data Interface #179

Closed
wants to merge 18 commits into from

@sa- sa- commented Apr 17, 2021

Building off of this PR
#178

@sa- sa- force-pushed the sa-/c_data_interface branch 2 times, most recently from 83fd834 to feb5fff Compare April 17, 2021 23:35
Author

sa- commented Apr 17, 2021

using Arrow, PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
sch = Arrow.CDataInterface.get_schema() do ptr
    rb.schema._export_to_c(Int(ptr))
end
arr = Arrow.CDataInterface.get_array() do ptr
    rb._export_to_c(Int(ptr))
end

missed a spot

little refactor

pycall and conda to extras

little refactoring

squash
@sa- sa- force-pushed the sa-/c_data_interface branch from feb5fff to 130de79 Compare April 18, 2021 09:12
precision = Int(splits[1])
scale = Int(splits[2])
if length(splits) == 3
bandwidth = splits[3]
Member

should be bitwidth instead of bandwidth

if length(splits) == 3
bandwidth = splits[3]
end
#TODO return something here
Member

In the eltypes.jl file, we define:

struct Decimal{P, S, T}
    value::T # only Int128 or Int256
end

which is what we should return here.
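
A minimal sketch of what returning that type could look like, assuming the Arrow C data interface decimal format `d:precision,scale` with an optional third `bitwidth` field. The `Decimal` struct below is a local stand-in copied from the definition quoted above so the snippet runs without Arrow.jl, and `decimal_type_from_format` is a hypothetical name, not a function in this PR. Base Julia has no `Int256`, so the sketch only accepts a bitwidth of 128.

```julia
# Local stand-in mirroring the struct quoted from eltypes.jl (assumption: same shape).
struct Decimal{P, S, T}
    value::T # only Int128 or Int256
end

# Hypothetical helper: maps "d:19,10" or "d:19,10,128" to a Decimal type.
function decimal_type_from_format(format_string::AbstractString)
    startswith(format_string, "d:") ||
        throw(ArgumentError("not a decimal format string: $format_string"))
    splits = split(format_string[3:end], ',')
    length(splits) in (2, 3) ||
        throw(ArgumentError("expected precision,scale[,bitwidth]"))
    # The fields are strings, so they have to be parsed rather than converted.
    precision = parse(Int, splits[1])
    scale = parse(Int, splits[2])
    # When the bitwidth is omitted, the format denotes a 128-bit decimal.
    bitwidth = length(splits) == 3 ? parse(Int, splits[3]) : 128
    bitwidth == 128 ||
        throw(ArgumentError("this sketch only handles a bitwidth of 128"))
    return Decimal{precision, scale, Int128}
end
```

For example, `decimal_type_from_format("d:19,10")` yields `Decimal{19, 10, Int128}`.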

end
#TODO return something here
elseif format_string[1] == 'w'
#TODO figure out fixed width binary
Member

This will just be the same as Arrow.FixedSizeList, but with UInt8 as the element type

Member

So a fixed width binary type won't have any children; it's like a hard-coded UInt8 child type and so doesn't need to be parsed recursively.
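
As a rough illustration of the point above, a fixed width binary format string (`w:` followed by the byte width in the Arrow C data interface) can be resolved without touching any children. The helper name is hypothetical, and using `NTuple{N, UInt8}` as the element type is an assumption made for this sketch rather than something the PR defines.

```julia
# Hypothetical helper: "w:16" -> NTuple{16, UInt8}.
# No recursion is needed because the child type is implicitly UInt8.
function fixed_width_binary_eltype(format_string::AbstractString)
    startswith(format_string, "w:") ||
        throw(ArgumentError("not a fixed width binary format string: $format_string"))
    nbytes = parse(Int, format_string[3:end])
    nbytes >= 0 || throw(ArgumentError("byte width must be non-negative"))
    return NTuple{nbytes, UInt8}
end
```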

#TODO figure out fixed width binary
elseif format_string[1] == '+'
if format_string[2] == 'l' || format_string[2] == 'L'
Arrow.List
Member

For these nested types, we'll need to parse the children recursively. So the overall method signature needs to take the full ArrowSchema type, and then access the format string at the top. Then when we get here, we'll call get_type_from_format_string(sch.children[1]) and so on to get the List element type so we end up with a type like Arrow.List{Vector{Int64}} or whatever.
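
The recursion described above might look roughly like the sketch below. `SchemaNode` is a hypothetical, already-decoded stand-in for the C `ArrowSchema` struct (only the format string and the children), and the function returns plain Julia `Vector` types instead of `Arrow.List` so that it runs without Arrow.jl. Only a handful of primitive format strings are covered.

```julia
# Hypothetical decoded view of an ArrowSchema: format string plus child schemas.
struct SchemaNode
    format::String
    children::Vector{SchemaNode}
end
SchemaNode(format::AbstractString) = SchemaNode(format, SchemaNode[])

# A few primitive format strings from the Arrow C data interface.
const PRIMITIVE_FORMATS = Dict(
    "b" => Bool, "i" => Int32, "l" => Int64,
    "f" => Float32, "g" => Float64, "u" => String,
)

function julia_type_from_schema(node::SchemaNode)
    fmt = node.format
    if haskey(PRIMITIVE_FORMATS, fmt)
        return PRIMITIVE_FORMATS[fmt]
    elseif fmt == "+l" || fmt == "+L"
        # List and large list: the element type comes from the single child.
        length(node.children) == 1 ||
            throw(ArgumentError("a list schema must have exactly one child"))
        return Vector{julia_type_from_schema(node.children[1])}
    else
        throw(ArgumentError("unsupported format string: $fmt"))
    end
end
```

A list of lists of 64-bit integers, `SchemaNode("+l", [SchemaNode("+l", [SchemaNode("l")])])`, resolves to `Vector{Vector{Int64}}`.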

end
elseif format_string[1] == 't'
if format_string[2:3]
Date
Member

The other thing we should probably support is a convert::Bool=true keyword arg to this function. It would allow the user to specify whether they'd like the native arrow type converted to a more natural Julia type or not. We support this in Arrow.Table. It's nice because I think there are some cases where the user just wants the raw arrow data type, but usually the user wants the nice Julia type. This is one of those cases where we have Arrow.Date, which is different from Dates.Date.
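
One possible shape for that keyword, sketched for the 32-bit date format `tdD` (days since the UNIX epoch). `RawDate32` is a local stand-in for the raw arrow date type, and both function names are hypothetical; the real signature in this PR may differ.

```julia
using Dates

# Stand-in for a raw arrow date: days since 1970-01-01 (assumption for this sketch).
struct RawDate32
    days::Int32
end

# Hypothetical signature: convert=true returns the natural Julia type,
# convert=false returns the raw arrow representation.
function date_type_from_format(format_string::AbstractString; convert::Bool=true)
    format_string == "tdD" ||
        throw(ArgumentError("this sketch only handles tdD, got $format_string"))
    return convert ? Dates.Date : RawDate32
end

# Conversion from the raw representation to Dates.Date.
to_julia(x::RawDate32) = Dates.Date(1970, 1, 1) + Dates.Day(Int(x.days))
```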

jrevels and others added 3 commits April 22, 2021 18:28
* fix propagation of maxdepth kwarg

* bump Project.toml
* ability to append partitions to an arrow file

This adds a method to `append` partitions to existing arrow files. Partitions to append are supplied in the form of any [Tables.jl](https://github.com/JuliaData/Tables.jl)-compatible table.

Multiple record batches will be written based on the number of `Tables.partitions(tbl)` that are provided.

Each partition being appended must have the same `Tables.Schema` as the destination arrow file that is being appended to.

Other parameters that `append` accepts are similar to what `write` accepts.

* remove unused methods

* add more tests and some fixes

* allow appends to both seekable IO and files

* few changes to Stream, avoid duplication for append

store few additional stream properties in the `Stream` data type and avoid duplicating code for append functionality

* call Tables.schema on result of Tables.columns
* Ensure requested List type is requested on List getindex

Fixes apache#167. Not tested yet.

* add test
quinnj and others added 2 commits April 23, 2021 21:26
…pache#183)

* Add global metadata lock to ensure thread safety of global metadata store

Follow up to apache#90, based on discussions in that issue.

* fix
Arrow.Flatbuf.TimeUnitModule.NANOSECOND
end

timezone = length(format_string) == 4 ? nothing : format_string[5:end]
Author

Hey @quinnj, this line of code is incorrect. I couldn't quite figure out how to construct the timezone type. Is there any documentation around this?
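
For reference, a sketch of just the string-splitting half of the problem, assuming the Arrow C data interface timestamp formats `tss:`, `tsm:`, `tsu:` and `tsn:` followed by an optional timezone. It returns the timezone as a plain `String` (or `nothing`) and does not attempt to build whatever timezone type Arrow.jl expects; the function name and the returned named tuple are hypothetical.

```julia
# Hypothetical helper: splits "tsu:UTC" into a time unit and an optional timezone string.
function parse_timestamp_format(format_string::AbstractString)
    m = match(r"^ts([smun]):(.*)$", format_string)
    m === nothing &&
        throw(ArgumentError("not a timestamp format string: $format_string"))
    units = Dict("s" => :second, "m" => :millisecond,
                 "u" => :microsecond, "n" => :nanosecond)
    unit = units[String(m.captures[1])]
    # An empty suffix after the colon means the timestamp has no timezone.
    tz = isempty(m.captures[2]) ? nothing : String(m.captures[2])
    return (unit = unit, timezone = tz)
end
```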

Author

sa- commented Apr 24, 2021

And in general, how does one new-up an ArrowVector? I was hoping to find a constructor that looks vaguely like this

ArrowVector{T}(buffers ::Vector{ArrowBuffer{T}})

@sa- sa- closed this Dec 2, 2022
5 participants