feat: implement avro file reader #113

Merged
merged 2 commits into from
Jun 4, 2025

wgtmac
Member

@wgtmac wgtmac commented May 28, 2025

  • Refactor the Reader and ReaderFactory interfaces a little.
  • Implement the skeleton of the Avro reader to read data into ArrowArray.
  • AppendDatumToBuilder is not implemented yet.

@wgtmac wgtmac force-pushed the avro_reader branch 2 times, most recently from ee7cdbf to 1f1885e Compare May 30, 2025 07:37
@wgtmac wgtmac marked this pull request as ready for review May 30, 2025 07:39
@wgtmac wgtmac requested a review from lidavidm May 30, 2025 07:46
@Xuanwo
Member

Xuanwo commented Jun 3, 2025

cc @mapleFU would you like to take another look?

@mapleFU
Member

mapleFU commented Jun 3, 2025

Personally, I prefer setting an internal status_ when an error happens and checking that status_ at the start of Next before doing any work. Since this is generally a batch interface, the check would not harm performance and would make the reader more robust to use. However, the current impl also LGTM.

@wgtmac
Member Author

wgtmac commented Jun 3, 2025

@mapleFU Sounds good. I think the current implementation aims to provide a minimal Avro reader implementation. We can revisit this once we have complete Avro and Parquet reader impls and make them robust.

@mapleFU
Member

mapleFU commented Jun 3, 2025

We can just move forward fast and leave this as a minor todo
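
The sticky-status idea from this thread could look roughly like the sketch below. It is not code from the PR: `Status`, `Batch`, and `StickyReader` are hypothetical stand-ins, and the failure is simulated at a chosen batch index. The point is only that the stored status is checked once per `Next` call, so the cost is per batch rather than per row.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the iceberg-cpp types.
struct Status {
  bool ok;
  std::string message;
};

using Batch = std::vector<int>;

// Toy batch reader that remembers the first error it hits.
class StickyReader {
 public:
  // fail_at: index of the batch whose decoding is simulated to fail.
  StickyReader(std::vector<Batch> batches, std::size_t fail_at)
      : batches_(std::move(batches)), fail_at_(fail_at) {}

  // Returns the next batch, or std::nullopt at end of stream or on error.
  // Callers inspect status() to tell the two cases apart.
  std::optional<Batch> Next() {
    // One cheap check per batch: refuse to continue after an earlier error.
    if (!status_.ok) return std::nullopt;
    if (pos_ >= batches_.size()) return std::nullopt;  // end of stream
    if (pos_ == fail_at_) {
      status_ = Status{false, "decode failed at batch " + std::to_string(pos_)};
      return std::nullopt;
    }
    return batches_[pos_++];
  }

  const Status& status() const { return status_; }

 private:
  std::vector<Batch> batches_;
  std::size_t fail_at_;
  std::size_t pos_ = 0;
  Status status_{true, ""};
};
```

With this shape, a caller that ignores one failed `Next` cannot accidentally resume reading from a half-consumed stream; every later call returns early with the same stored error.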


// Open the input stream and adapt to the avro interface.
// TODO(gangwu): make this configurable
constexpr int64_t kDefaultBufferSize = 1024 * 1024;
Contributor

Looking at the TODO, we can also defer this to a later PR. Implementations aim to create manifest files of ~8MB, and performance-wise it is best to read them all the way through directly. The manifest list can be unbounded (theoretically).

Member Author

Sounds good to me. I just blindly chose a default value for now.
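
One way the TODO could eventually be resolved is sketched below, assuming a hypothetical `AvroReadOptions` struct with an optional `buffer_size` field; neither name comes from the PR. Only `kDefaultBufferSize` and its 1 MiB value are taken from the diff above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Default taken from the diff under review (1 MiB).
constexpr std::int64_t kDefaultBufferSize = 1024 * 1024;

// Hypothetical options struct; the real reader options may look different.
struct AvroReadOptions {
  // Unset means "use the default".
  std::optional<std::int64_t> buffer_size;
};

// Picks the caller's buffer size when it is set and positive,
// otherwise falls back to the default.
std::int64_t EffectiveBufferSize(const AvroReadOptions& options) {
  if (options.buffer_size.has_value() && *options.buffer_size > 0) {
    return *options.buffer_size;
  }
  return kDefaultBufferSize;
}
```

A caller reading manifests of roughly 8 MB could then set `buffer_size` to `8 * 1024 * 1024`, while other callers keep the default.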

HasIdVisitor has_id_visitor;
ICEBERG_RETURN_UNEXPECTED(has_id_visitor.Visit(file_schema));
if (has_id_visitor.HasNoIds()) {
  // TODO(gangwu): support applying field-ids based on name mapping
Contributor

Keep in mind that name mapping only applies to data files.

Member Author

Yes, I think we just need to pass a NameMapping object (if available) via ReaderOptions to read data files. For manifest and manifest list files, it should error out when field ids are missing.
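
The behaviour described in this reply might be sketched as follows. Everything here is an assumption made for illustration: `Field`, `NameMapping` (a flat name-to-id map), `ReaderOptions::name_mapping`, and `ResolveFieldIds` are hypothetical and simplified to a flat schema; the actual iceberg-cpp types are not shown in this thread.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, flattened schema field.
struct Field {
  std::string name;
  std::optional<int> field_id;
};

// Hypothetical name mapping: field name -> field id.
using NameMapping = std::map<std::string, int>;

// Hypothetical reader options; data-file readers may set name_mapping,
// manifest and manifest-list readers leave it unset.
struct ReaderOptions {
  std::optional<NameMapping> name_mapping;
};

// Returns std::nullopt on success, or an error message.
std::optional<std::string> ResolveFieldIds(std::vector<Field>& fields,
                                           const ReaderOptions& options) {
  bool has_no_ids = true;
  for (const Field& field : fields) {
    if (field.field_id.has_value()) has_no_ids = false;
  }
  if (!has_no_ids) return std::nullopt;  // ids are present, nothing to do

  if (!options.name_mapping.has_value()) {
    // Manifest / manifest list case: missing ids are an error.
    return std::string("field ids are missing and no name mapping was provided");
  }
  // Data file case: assign ids by name. In this sketch, fields that are
  // absent from the mapping simply stay without an id.
  for (Field& field : fields) {
    auto it = options.name_mapping->find(field.name);
    if (it != options.name_mapping->end()) field.field_id = it->second;
  }
  return std::nullopt;
}
```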

@@ -46,6 +46,9 @@ class ICEBERG_BUNDLE_EXPORT ArrowFileSystemFileIO : public FileIO {
/// \brief Delete a file at the given location.
Status DeleteFile(const std::string& file_location) override;

/// \brief Get the Arrow file system.
const std::shared_ptr<::arrow::fs::FileSystem>& fs() const { return arrow_fs_; }
Member Author

@Fokko We have two libraries: libiceberg and libiceberg-bundle:

  • libiceberg: uses only the FileIO and FileReader interfaces; downstream projects should provide their own implementations.
  • libiceberg-bundle: uses ArrowFileSystemFileIO as the implementation, and both the Avro and Parquet reader implementations assume that arrow::FileSystem is available.
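
The split between the two libraries can be illustrated with a minimal sketch. The `FileIO` interface below is a hypothetical, heavily simplified stand-in (the real one returns `Status`/`Result` types and has a different method set), and `InMemoryFileIO` plays the role of a downstream-provided implementation, in the same position that ArrowFileSystemFileIO occupies in the bundle.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified interface; core code would depend only on this.
class FileIO {
 public:
  virtual ~FileIO() = default;
  // Returns the file content, or std::nullopt if the file does not exist.
  virtual std::optional<std::string> ReadFile(const std::string& location) = 0;
  // Returns true if a file was deleted.
  virtual bool DeleteFile(const std::string& location) = 0;
};

// A downstream project's own implementation, here backed by a std::map.
class InMemoryFileIO : public FileIO {
 public:
  void Put(std::string location, std::string content) {
    files_[std::move(location)] = std::move(content);
  }
  std::optional<std::string> ReadFile(const std::string& location) override {
    auto it = files_.find(location);
    if (it == files_.end()) return std::nullopt;
    return it->second;
  }
  bool DeleteFile(const std::string& location) override {
    return files_.erase(location) > 0;
  }

 private:
  std::map<std::string, std::string> files_;
};

// Interface-only code: works with any FileIO implementation.
std::size_t FileSizeOrZero(FileIO& io, const std::string& location) {
  std::optional<std::string> content = io.ReadFile(location);
  return content.has_value() ? content->size() : 0;
}
```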

Contributor

@Fokko Fokko left a comment

Let's move this forward, thanks for the great work @wgtmac and thanks for the review @mapleFU and @lidavidm

@Fokko Fokko merged commit ed49d1e into apache:main Jun 4, 2025
6 checks passed
5 participants