-
Notifications
You must be signed in to change notification settings - Fork 32
feat: implement avro file reader #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ee7cdbf
to
1f1885e
Compare
cc @mapleFU would you like to take another look? |
Personally I prefer set a internal |
@mapleFU Sounds good. I think the current implementation aims to provide a minimal Avro reader implementation. We can revisit this once we have complete Avro and Parquet reader impls and make them robust. |
We can just move forward fast and leave this as a minor todo |
|
||
// Open the input stream and adapt to the avro interface. | ||
// TODO(gangwu): make this configurable | ||
constexpr int64_t kDefaultBufferSize = 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the TODO, we can also defer this to a later PR. Implementations aim to create manifest files of ~8MB, and performance-wise wise it is best to read them all the way directly. The manifest list can be unbounded (theoretically).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I just blindly chose a default value for now.
HasIdVisitor has_id_visitor; | ||
ICEBERG_RETURN_UNEXPECTED(has_id_visitor.Visit(file_schema)); | ||
if (has_id_visitor.HasNoIds()) { | ||
// TODO(gangwu): support applying field-ids based on name mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind that name-mapping only applies to data-files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we just need to pass NameMapping
object (if available) via ReaderOptions to read data files. For manifest and manifest list files, it should error out when field ids are missing.
@@ -46,6 +46,9 @@ class ICEBERG_BUNDLE_EXPORT ArrowFileSystemFileIO : public FileIO { | |||
/// \brief Delete a file at the given location. | |||
Status DeleteFile(const std::string& file_location) override; | |||
|
|||
/// \brief Get the Arrow file system. | |||
const std::shared_ptr<::arrow::fs::FileSystem>& fs() const { return arrow_fs_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko We have two libraries: libiceberg
and libiceberg-bundle
:
- libiceberg: it only uses interfaces of
FileIO
andFileReader
, downstream projects should provide their own implementations. - libiceberg-bundle: it uses
ArrowFileSystemFileIO
as the implementation and both Avro and Parquet reader implementations assume that arrow::FileSystem is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AppendDatumToBuilder
is not implemented yet.