
dvc and apache hudi integration #4937

Closed
LuisMoralesAlonso opened this issue Nov 22, 2020 · 8 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@LuisMoralesAlonso

Does this kind of integration make sense? We could rely on Hudi to manage versions (incremental ones this time, so with lower storage needs).

Looking forward to your comments,

Luis

@efiop
Contributor

efiop commented Nov 22, 2020

@LuisMoralesAlonso Could you please elaborate?

@efiop added the "awaiting response" label Nov 22, 2020
@LuisMoralesAlonso
Author

  • We have a data lake based on Hive + the Parquet format. All of our use cases will work with this "external data". We've organized this data lake into several datahubs, based on functional requirements.
  • Regarding ML, we are thinking about creating a new datahub, the "datascience" datahub, which will store the data we consider "features" (data that has been used in any of our ML use cases). We want to version data here, so we've thought about using Apache Hudi to manage this as efficiently as possible.
  • At the same time, DVC seems like the right tool to be the base for applying CI/CD to our ML projects, as we can leverage our previous knowledge and expertise with tools like Git. DVC will allow us to manage models too, not only data.
  • As there is no integration between frameworks like TF or PyTorch and Apache Hudi, we are thinking of materializing a dataset from a particular version of the data in Apache Hudi on each run of a DVC pipeline (maybe because there is new data); see the sketch after this list.
  • The question arises here: would it be possible to align versioning in DVC and versioning in Apache Hudi?
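As a rough illustration of that materialization step, here is a minimal sketch (added for clarity; the table location, output path, and Spark setup are assumptions, not from the issue): a Spark job reads the latest snapshot of the Hudi table and writes plain Parquet that DVC can track.

```python
# materialize_features.py -- hedged sketch; all paths are hypothetical.
from pyspark.sql import SparkSession

HUDI_TABLE = "hdfs:///datahubs/datascience/features"  # hypothetical Hudi table path
OUTPUT_DIR = "data/features_snapshot"                 # directory tracked by DVC

spark = SparkSession.builder.appName("materialize-features").getOrCreate()

# Snapshot query: reads the latest committed state of the Hudi table through
# the Hudi Spark datasource (format name may be "hudi" or "org.apache.hudi"
# depending on the Hudi version).
df = spark.read.format("hudi").load(HUDI_TABLE)

# Materialize as plain Parquet so downstream tools (TF/PyTorch loaders,
# petastorm, etc.) can consume it without knowing anything about Hudi.
df.write.mode("overwrite").parquet(OUTPUT_DIR)
```

The output directory could then be declared as an `out` of a `dvc.yaml` stage (or tracked with `dvc add data/features_snapshot`), so each pipeline run pins one concrete materialization that DVC versions alongside code and models.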

@karajan1001
Contributor

karajan1001 commented Nov 23, 2020

Incremental data; this might be related to #331.

@dmpetrov
Member

@LuisMoralesAlonso there are a few more questions...

  • We want to version data here, so we've thought about using Apache Hudi to manage this as efficiently as possible.
  1. Is the Parquet format the primary format for the ML / "datascience" datahub, or do you use "less structured" formats?
  2. Is Hudi needed to get close to real time? Is close-to-real-time important for your ML use cases?

@LuisMoralesAlonso
Author

LuisMoralesAlonso commented Nov 24, 2020

Answers:
1. We are actually using Parquet for our entire data lake, so we want to use it as much as possible. At the same time, we want to version the features we are using for our ML projects; that's the reason for thinking about Apache Hudi as the primary format for this datascience hub. We want this data governed.
2. Once we need to train a model in a particular project, we would materialize the features needed from the datascience hub. At that point you can use whatever format is needed (it will depend mainly on the formats supported by the particular framework you are using). This will be more ephemeral.
3. We could use petastorm for consuming Parquet from the main DL frameworks, but it's not compatible with Hudi; this is something we are asking the petastorm team about too (see the sketch after this list).
4. For real time, in the case of online model serving, we have an in-memory grid where the features will be replicated (or calculated) to coordinate both training and the different serving options (inference).
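On the petastorm point, a minimal sketch of how the materialized Parquet snapshot (not the Hudi table itself) could be fed to PyTorch; the dataset URL is an assumption:

```python
# Hedged sketch: petastorm reads the plain-Parquet snapshot produced by the
# materialization job; it does not read Hudi tables directly. URL is hypothetical.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

PARQUET_URL = "file:///work/project/data/features_snapshot"  # assumption

with make_batch_reader(PARQUET_URL) as reader:
    loader = DataLoader(reader, batch_size=1024)
    for batch in loader:
        # each batch maps Parquet column names to torch tensors
        ...
```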

@LuisMoralesAlonso
Author

Any comments here?

@dmpetrov
Member

@LuisMoralesAlonso sorry for the delay.

I'm trying to understand where you already have data versioning and where it needs to be introduced. So far, it seems like DVC and Hudi serve somewhat different purposes, and I'm trying to understand your scenario (and Hudi) better.

Does Hudi have proper versioning? I'm not a Hudi expert, but it seems like it can efficiently support the latest version but not the whole history.
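(For reference, Hudi does keep a commit timeline, and its Spark datasource can query the changes between two instants; a minimal sketch of an incremental query, with the table path and begin-instant as assumptions:)

```python
# Hedged sketch of a Hudi incremental query via the Spark datasource.
# The table path and the begin-instant timestamp are illustrative values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

increments = (
    spark.read.format("hudi")
    # return only rows committed after this instant on the Hudi timeline
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20201101000000")
    .load("hdfs:///datahubs/datascience/features")
)
```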

  • We have a data lake based on Hive + the Parquet format. All of our use cases will work with this "external data". We've organized this data lake into several datahubs, based on functional requirements.

1. We are actually using Parquet for our entire data lake, so we want to use it as much as possible. At the same time, we want to version the features we are using for our ML projects; that's the reason for thinking about Apache Hudi as the primary format for this datascience hub. We want this data governed.

Are you building/deriving features for the datascience hub from the regular tables/datahubs, or from some other sources/streaming? Do you have any versioning for the regular datahubs/tables?

2. Once we need to train a model in a particular project, we would materialize the features needed from the datascience hub.... This will be more ephemeral.

Would you like to create a version of the Hudi "table" on request?

4. For real time, ... the features will be replicated (or calculated) to coordinate both training and the different serving options (inference).

This is usually done with real streaming. I thought Hudi couldn't handle that level of latency, but I'm not an expert in Hudi.

PS: It can be way more efficient to schedule a chat - please feel free to shoot me an email to my-first-name at iterative.ai or DM at https://twitter.com/fullstackml

@efiop
Contributor

efiop commented Mar 13, 2021

Closing as stale. Please feel free to reopen.

@efiop efiop closed this as completed Mar 13, 2021