
dvc and apache hudi integration #4937

Closed
LuisMoralesAlonso opened this issue Nov 22, 2020 · 8 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@LuisMoralesAlonso

Does this kind of integration make sense? We could rely on Hudi to manage versions (incremental ones this time, so with lower storage needs).

Looking forward to your comments,

Luis

@efiop
Contributor

efiop commented Nov 22, 2020

@LuisMoralesAlonso Could you please elaborate?

@efiop added the "awaiting response" label Nov 22, 2020
@LuisMoralesAlonso
Author

  • We have a data lake based on Hive + the Parquet format. All of our use cases will work with this "external data". We've organized this data lake into several datahubs, based on functional requirements.
  • Regarding ML, we are thinking about creating a new datahub, the "datascience" datahub, which will store the data we consider "features" (data that has been used in any of our ML use cases). We want to version data here, so we've thought about using Apache Hudi to manage this as efficiently as possible.
  • At the same time, DVC seems like the right tool to be the base for applying CI/CD to our ML projects, as we can leverage our previous knowledge and expertise with tools like Git. DVC will allow us to manage models too, not only data.
  • As there is no integration between frameworks like TF or PyTorch and Apache Hudi, we are thinking of materializing a dataset from a particular version of the data in Apache Hudi on each run of a DVC pipeline (maybe because there is new data); see the sketch after this list.
  • The question arises here: would it be possible to align versioning in DVC and versioning in Apache Hudi?
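As a rough illustration of that materialization step, here is a minimal sketch (added for clarity; the table location, output path, and Spark setup are assumptions, not from the issue): a Spark job reads the latest snapshot of the Hudi table and writes plain Parquet that DVC can track.

```python
# materialize_features.py -- hedged sketch; all paths are hypothetical.
from pyspark.sql import SparkSession

HUDI_TABLE = "hdfs:///datahubs/datascience/features"  # hypothetical Hudi table path
OUTPUT_DIR = "data/features_snapshot"                 # directory tracked by DVC

spark = SparkSession.builder.appName("materialize-features").getOrCreate()

# Snapshot query: reads the latest committed state of the Hudi table through
# the Hudi Spark datasource (format name may be "hudi" or "org.apache.hudi"
# depending on the Hudi version).
df = spark.read.format("hudi").load(HUDI_TABLE)

# Materialize as plain Parquet so downstream tools (TF/PyTorch loaders,
# petastorm, etc.) can consume it without knowing anything about Hudi.
df.write.mode("overwrite").parquet(OUTPUT_DIR)
```

The output directory could then be declared as an `out` of a `dvc.yaml` stage (or tracked with `dvc add data/features_snapshot`), so each pipeline run pins one concrete materialization that DVC versions alongside code and models.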

@karajan1001
Contributor

karajan1001 commented Nov 23, 2020

Incremental data; this might be related to #331.

@dmpetrov
Member

@LuisMoralesAlonso there are a few more questions...

  • We want to version data here, so we've thought about using Apache Hudi to manage this as efficiently as possible.
  1. Is the Parquet format the primary format for the ML / "datascience" datahub, or do you use "less structured" formats?
  2. Is Hudi needed to get close to real time? Is close-to-real-time important for your ML use cases?

@LuisMoralesAlonso
Author

LuisMoralesAlonso commented Nov 24, 2020

Answers:
1. We are actually using Parquet for our entire data lake, so we want to use it as much as possible. At the same time, we want to version the features we are using for our ML projects; that's the reason for thinking about Apache Hudi as the primary format for this datascience hub. We want this data governed.
2. Once we need to train a model in a particular project, we would materialize the features needed from the datascience hub. At that point you can use whatever format is needed (it will depend mainly on the formats supported by the particular framework you are using). This will be more ephemeral.
3. We could use petastorm for consuming Parquet from the main DL frameworks, but it's not compatible with Hudi; this is something we are asking the petastorm team about too (see the sketch after this list).
4. For real time, in the case of online model serving, we have an in-memory grid where the features will be replicated (or calculated) to coordinate both training and the different serving options (inference).
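On the petastorm point, a minimal sketch of how the materialized Parquet snapshot (not the Hudi table itself) could be fed to PyTorch; the dataset URL is an assumption:

```python
# Hedged sketch: petastorm reads the plain-Parquet snapshot produced by the
# materialization job; it does not read Hudi tables directly. URL is hypothetical.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

PARQUET_URL = "file:///work/project/data/features_snapshot"  # assumption

with make_batch_reader(PARQUET_URL) as reader:
    loader = DataLoader(reader, batch_size=1024)
    for batch in loader:
        # each batch maps Parquet column names to torch tensors
        ...
```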

@LuisMoralesAlonso
Author

Any comments here?

@dmpetrov
Member

@LuisMoralesAlonso sorry for the delay.

I'm trying to understand where you already have data versioning and where it needs to be introduced. So far, it seems like DVC and Hudi serve somewhat different purposes, and I'm trying to understand your scenario (and Hudi) better.

Does Hudi have proper versioning? I'm not a Hudi expert, but it seems like it can efficiently support the latest version but not the whole history.
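(For reference, Hudi does keep a commit timeline, and its Spark datasource can query the changes between two instants; a minimal sketch of an incremental query, with the table path and begin-instant as assumptions:)

```python
# Hedged sketch of a Hudi incremental query via the Spark datasource.
# The table path and the begin-instant timestamp are illustrative values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

increments = (
    spark.read.format("hudi")
    # return only rows committed after this instant on the Hudi timeline
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20201101000000")
    .load("hdfs:///datahubs/datascience/features")
)
```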

  • We have a data lake based on Hive + the Parquet format. All of our use cases will work with this "external data". We've organized this data lake into several datahubs, based on functional requirements.

1. We are actually using Parquet for our entire data lake, so we want to use it as much as possible. At the same time, we want to version the features we are using for our ML projects; that's the reason for thinking about Apache Hudi as the primary format for this datascience hub. We want this data governed.

Are you building/deriving features for the datascience hub from the regular tables/datahubs, or from some other sources/streaming? Do you have any versioning for the regular datahubs/tables?

2. Once we need to train a model in a particular project, we would materialize the features needed from the datascience hub.... This will be more ephemeral.

Would you like to create a version of the Hudi "table" on request?

4. For real time, ... the features will be replicated (or calculated) to coordinate both training and the different serving options (inference).

This is usually done with real streaming. I thought Hudi couldn't handle that level of latency, but I'm not an expert in Hudi.

PS: It can be way more efficient to schedule a chat - please feel free to shoot me an email to my-first-name at iterative.ai or DM at https://twitter.com/fullstackml

@efiop
Contributor

efiop commented Mar 13, 2021

Closing as stale. Please feel free to reopen.

@efiop efiop closed this as completed Mar 13, 2021