delta (parquet) format #13
@rambabu-posa Let me try to answer each question one by one
It is not possible directly, because the Delta format relies on the transaction log being present, and obviously a plain Parquet table does not have that log. An attempt to read a Parquet table using the Delta format will therefore throw an error. That said, Managed Delta Lake already has a CONVERT command that can convert a Parquet table to a Delta table in place by writing a new transaction log inside the same directory. We are hoping that we can eventually open-source that command.
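For reference, a sketch of what such a conversion looks like, using the API that later shipped in open-source Delta Lake (the path below is hypothetical, and `spark` is assumed to be an active SparkSession):

```scala
import io.delta.tables.DeltaTable

// Sketch: convert an existing Parquet directory to Delta in place.
// This scans the Parquet files and writes a _delta_log in the same
// directory. "/data/existing-parquet-table" is a hypothetical path.
DeltaTable.convertToDelta(spark, "parquet.`/data/existing-parquet-table`")
```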
As of now, we support only the Parquet format so that Delta Lake users can get the maximum benefit of Parquet data skipping, etc. when querying the Delta table. We may make this configurable in the future. But really, as a Delta Lake user, you should not have to worry about what the underlying file format is: you query through the "delta" format and get the full benefit of partition pruning, data skipping, etc.
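For example (the path and filter below are illustrative), reading through the "delta" format is all you need; pruning and skipping happen automatically:

```scala
// Sketch: reading through the "delta" format resolves the transaction
// log first, so partition pruning and data skipping are applied before
// any Parquet file is opened. "/delta/events" and the filter are
// illustrative, not from the original thread.
val events = spark.read.format("delta").load("/delta/events")
events.where("date = '2019-04-29'").count()
```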
When you say "see that data", do you mean visually see it in Hue? I am not that familiar with the Hue UI, so I am not sure how to answer this. My guess would be that it depends on how Hue detects which format a directory uses.
I have a question about the format, too. Right now "delta" uses snappy for compression, but I would like to use gzip. gzip is slower to write, but it reads about as fast as snappy, and most of all, it compresses better than snappy. Why does Delta use snappy? Will you support gzip in the future?
Regarding this question: I can see part-00007-144fb4c5-dff0-4487-a4e8-241a9e850b35.c000.snappy.parquet in my local FS after writing it. In the same way, if we write to HDFS, I can traverse the given path in HDFS using the Hue UI and see this kind of file there. Similarly, is it possible to log in to the Delta Lake file system (if I think of it as a DFS, like HDFS or S3), browse to the given path, and see my files?
@hkak03key - One of the reasons we chose snappy is that gzip isn't splittable. We have no current plans to support gzip, but that can always change based on community feedback.
Actually, Delta Lake is not a file format. It's like a Hive Metastore, but the table metadata is stored in the file system so that Spark can be used to process it (table metadata is itself a big data problem for a large table). Delta Lake also provides advanced features (ACID transactions, with DML support coming soon) on top of that distributed metadata.
Hence, if you are using HDFS, you will see all the files that Delta writes in Hue. No extra plugin needed.
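Concretely, a Delta table directory on HDFS looks something like this (file names are illustrative), and everything in it shows up in Hue:

```
/delta/events/_delta_log/00000000000000000000.json    <- transaction log
/delta/events/part-00000-<uuid>.c000.snappy.parquet   <- data file
/delta/events/part-00001-<uuid>.c000.snappy.parquet   <- data file
```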
Best Regards,
Shixiong Zhu
@rambabu-posa - Everything Shixiong said is correct, but one warning: while you may be able to see all the individual files Delta writes, you are not necessarily looking at a consistent view of the table, because some of those files may have been logically removed from it.
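A quick illustration of that caveat (the path below is hypothetical): after an overwrite, the old data files remain on disk until VACUUM removes them, but they are no longer part of the table:

```scala
// Sketch: "/delta/events2" is a hypothetical path.
spark.range(10).write.format("delta").save("/delta/events2")
spark.range(20).write.format("delta").mode("overwrite").save("/delta/events2")

// Listing the directory (e.g. in Hue) shows Parquet files from BOTH
// writes, but the transaction log only references the second set:
spark.read.format("delta").load("/delta/events2").count()  // returns 20
```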
Based on my data engineering experience, a Parquet file with gzip compression is actually splittable: Parquet applies the codec inside each row group, so the file can still be split at row-group boundaries regardless of the codec used.
@hkak03key - My mistake, we actually do support gzip through the Spark config.
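Presumably this refers to Spark's standard Parquet codec setting, which Delta's writer respects since Delta writes its data files through Spark's Parquet writer (a sketch; the output path is hypothetical):

```scala
// Sketch: Delta writes Parquet through Spark, so Spark's Parquet codec
// setting applies to Delta data files as well.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.range(100).write.format("delta").save("/delta/gzip-events")
```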
I am closing this issue for now. Feel free to reopen it if your questions haven't been answered.
Hi Delta team,
I tried Delta, and it's interesting. I have a few questions.
Even though we use the "delta" format, its underlying format is Parquet. So is it possible to use this Spark Delta format to read my existing Parquet data that was written without using Delta?
Why does it support only Parquet? Why not the other Spark-supported formats? Do you plan to add them in the future?
I'm able to read and write data from and to this Delta lake. Is it possible to see that data in Delta Lake, just like we can see HDFS data from the Hue UI?
Many thanks,
Ram