[QST] Does Spark RAPIDS support Delta Lake? #5340
-
What is your question?
Replies: 5 comments
-
We have not been explicitly testing Delta Lake. Reading should, for the most part, just work; I am not sure about writing, though.

Delta Lake stores its data internally as Parquet, with metadata kept in a combination of JSON and Parquet. When the data is read, that metadata is queried and cached. The metadata query often involves JSON, which the RAPIDS Accelerator does not yet support, so it either runs off the GPU or only partially on it. The amount of metadata is generally small, so this has little impact on overall read performance.

As for writes, it has been a while, so I do not remember exactly what happens there. I'll try to test it again.
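For reference, here is a minimal sketch of reading a Delta table with the accelerator enabled. The table path, column name, and the Delta 0.8.x / Spark 3.0.x style session configuration are illustrative assumptions, not something confirmed in this thread; the plugin jars are assumed to already be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("delta-read-on-gpu")
  // RAPIDS Accelerator plugin
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  // Delta Lake session configuration (0.8.x on Spark 3.0.x style)
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// The Parquet data files are scanned on the GPU; the transaction-log (JSON)
// metadata query runs partially or entirely on the CPU.
val df = spark.read.format("delta").load("/data/events_delta") // hypothetical path
df.filter(col("status") === "ok").count()                      // hypothetical column
```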
-
Thanks @revans2 for the information. Please let me know if you are able to test the writes. Thank you!
-
So I just did a quick test with Delta Lake 0.8.0 on Spark 3.0.2. The read is accelerated, but the metadata queries are not fully accelerated. For writes, Delta Lake appears to run several queries: one runs the main portion of the job, a few others calculate metrics on it, and one final query actually writes the data out. That last query, the one that writes the data, is not accelerated. If this is important to you, please file a feature request issue and we can see what it might take to support it, but it is not going to be trivial.
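A rough sketch of how one might reproduce this observation: enable the accelerator's explain output and run a small Delta write, then check the driver log for operators that stayed on the CPU. The data and output path below are made up for illustration; this assumes the same session configuration as the read sketch above.

```scala
// Log the operators that could not be placed on the GPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

spark.range(0, 1000000L)
  .selectExpr("id", "id % 10 AS bucket")
  .write
  .format("delta")
  .mode("overwrite")
  .save("/data/ids_delta") // hypothetical output path

// The driver log then lists the operations that fell back to the CPU,
// which in the test described above included the final file-writing query.
```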
-
Thanks @revans2 for the quick check. I've created a feature request issue for this, as it is an important feature for us. Link to Feature request
-
Sure, I'll close this then. Feel free to reopen if you have more questions.