[QST] Does Spark RAPIDS support Delta Lake? #5340
-
What is your question?
Replies: 5 comments
-
We have not been explicitly testing Delta Lake. Reading should, for the most part, just work; I am not sure about writing, though.

Delta Lake stores its data internally as Parquet, with metadata kept in a combination of JSON and Parquet. When the data is read, that metadata is queried and cached. The metadata query often involves JSON, which the RAPIDS Accelerator does not yet support, so it either runs off the GPU or only partially on it. The amount of metadata is generally small, so this has little impact on overall read performance.

As for writes, it has been a while, so I do not remember exactly what happens there. I'll try to test it again.
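For reference, here is a minimal sketch of reading a Delta table with the accelerator enabled. The table path, column name, and the Delta 0.8.x / Spark 3.0.x style session configuration are illustrative assumptions, not something confirmed in this thread; the plugin jars are assumed to already be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("delta-read-on-gpu")
  // RAPIDS Accelerator plugin
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  // Delta Lake session configuration (0.8.x on Spark 3.0.x style)
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// The Parquet data files are scanned on the GPU; the transaction-log (JSON)
// metadata query runs partially or entirely on the CPU.
val df = spark.read.format("delta").load("/data/events_delta") // hypothetical path
df.filter(col("status") === "ok").count()                      // hypothetical column
```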
-
Thanks @revans2 for the information. Please let me know if you are able to test the writes. Thank you!
-
So I just did a quick test with Delta Lake 0.8.0 on Spark 3.0.2. The read is accelerated, but the metadata queries are not fully accelerated. For writes, Delta Lake appears to run several queries: one runs the main portion of the job, a few others calculate metrics on it, and one final query actually writes the data out. That last query, the one that writes the data, is not accelerated. If this is important to you, please file a feature request issue and we can see what it might take to support it, but it is not going to be trivial.
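A rough sketch of how one might reproduce this observation: enable the accelerator's explain output and run a small Delta write, then check the driver log for operators that stayed on the CPU. The data and output path below are made up for illustration; this assumes the same session configuration as the read sketch above.

```scala
// Log the operators that could not be placed on the GPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

spark.range(0, 1000000L)
  .selectExpr("id", "id % 10 AS bucket")
  .write
  .format("delta")
  .mode("overwrite")
  .save("/data/ids_delta") // hypothetical output path

// The driver log then lists the operations that fell back to the CPU,
// which in the test described above included the final file-writing query.
```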
-
Thanks @revans2 for the quick check. I've created a feature request issue for this, as it is an important feature for us. Link to Feature request
-
Sure, I'll close this then. Feel free to reopen if you have more questions.