Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add minimal implementation of ingesting Parquet and CSV files #327

Merged
merged 21 commits into from
Dec 1, 2019

Conversation

voonhous
Copy link
Collaborator

@voonhous voonhous commented Nov 26, 2019

This PR tries to follow minimal implementation to ingest a parquet file using PyArrow.
This is achieved by reading a Parquet file using PyArrow, batching it into RecordBatches before ingesting it with the existing code.

my_file.parquet → PyArrow Table → RecordBatches → Dataframe → FeatureRows (existing code) → Stream

Other modifications:

  • Generator function to encode chunks of PyArrow table of type RecordBatch to yield FeatureRow objects.

@woop
Copy link
Member

woop commented Nov 26, 2019

Thanks @voonhous

I have two comments

  1. Can you please submit all of the formatting and other changes separately? It makes it incredibly difficult to review your code if you submit 700+ lines of code difference.
  2. Is there a good reason you wrap ingest with ingest_file? It seems like ingest has steps that both it and ingest_file should share, like validating the dataframe, getting the feature set, and inferring/applying the schema from the dataframe. These should not be rerun for every batch.

I think it would make more sense if you extended the existing ingest so that it took either a dataframe, path, or file-like object. You would then handle it just like in ingest_file, where you use different methods to read the source based on what the user provides, but ultimately it results in a table.

At the infer_schema step, a single batch can be materialized into a DF for creating the schema. Then the rest of the table is passed into ingest_kafka where all iteration can happen.

What do you think?

@woop
Copy link
Member

woop commented Nov 27, 2019

Also, would you mind updating the title of the PR to something more descriptive? We use the PRs as items in our change logs.

@davidheryanto
Copy link
Collaborator

/retest

@davidheryanto
Copy link
Collaborator

/test test-core-and-ingestion
Trying to figure out if the integration test in feast-ingestion is flaky i.e. the test can fail with the same test configuration, specifically this test feast.ingestion.ImportJobTest

@voonhous voonhous changed the title Pyarrow dev Adding minimal implementation of Parquet and CSV files Nov 27, 2019
@voonhous voonhous changed the title Adding minimal implementation of Parquet and CSV files Adding minimal implementation of ingesting Parquet and CSV files Nov 27, 2019
@voonhous voonhous changed the title Adding minimal implementation of ingesting Parquet and CSV files Add minimal implementation of ingesting Parquet and CSV files Nov 27, 2019
@woop
Copy link
Member

woop commented Nov 28, 2019

I have created this code snippet which creates a testing dataframe. It has all the types besides lists of boolean. I think we should ensure that we can ingest this as both a parquet and a pandas dataframe.

https://gist.github.com/woop/d074ded542bc2b6ec5a0b5a96c72e9ab

@woop
Copy link
Member

woop commented Nov 30, 2019

/retest

1 similar comment
@woop
Copy link
Member

woop commented Dec 1, 2019

/retest

@woop
Copy link
Member

woop commented Dec 1, 2019

/lgtm
/approve

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: voonhous, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 2abeabd into master Dec 1, 2019
@voonhous voonhous deleted the pyarrow-dev branch December 17, 2019 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants