Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deltalake.DeltaTable constructor hangs in Jupyter #1093

Closed
xbrianh opened this issue Jan 23, 2023 · 14 comments · Fixed by #1112
Closed

deltalake.DeltaTable constructor hangs in Jupyter #1093

xbrianh opened this issue Jan 23, 2023 · 14 comments · Fixed by #1112
Labels
bug Something isn't working

Comments

@xbrianh
Copy link

xbrianh commented Jan 23, 2023

Environment

Delta-rs version:0.6.0

Binding: python 3.10

Environment:

  • Cloud provider: Azure
  • OS: ubuntu, windws WSL
  • Other:

Bug

What happened: DeltaTable constructor hangs in Jupyter

dt = deltatable.DeltaTable(“az://container/…”, storage_options={‘account_name’: “foo”, “AZURE_STORAGE_ACCOUNT_KEY”: “bar”})

What you expected to happen: Not hang

How to reproduce it: Attempt to create an azure storage backed DeltaTable in Jupyter

More details: This was lightly discussed in #915

@xbrianh xbrianh added the bug Something isn't working label Jan 23, 2023
@jabbera
Copy link

jabbera commented Jan 24, 2023

I am seeing this issue as well.

@MrPowers
Copy link
Collaborator

@xbrianh - can you try delta-rs v0.6.4? If I recall correctly, this was a 0.6.0 bug that was fixed. Let me know if that helps!

@xbrianh
Copy link
Author

xbrianh commented Jan 25, 2023

@MrPowers Same issue:

>pip show deltalake
Name: deltalake
Version: 0.6.4
Summary: Native Delta Lake Python binding based on delta-rs with Pandas integration
Home-page: https://github.com/delta-io/delta-rs
Author: Qingping Hou <dave2008713@gmail.com>
Author-email: Qingping Hou <dave2008713@gmail.com>
License: Apache-2.0
Location: /Users/xath/.virtualenvs/p310/lib/python3.11/site-packages
Requires: pyarrow
Required-by:

@craustin
Copy link

This was never resolved for me in #915. I can still repro in deltalake==0.7.0.

@jabbera
Copy link

jabbera commented Jan 29, 2023

@MrPowers I compiled a develop version and ran a backtrace. Looks like it's hanging writing to stdio for some reason:

__GI___libc_write (nbytes=792, buf=0x32a9f90, fd=2) at ../sysdeps/unix/sysv/linux/write.c:26
26      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) bt
#0  __GI___libc_write (nbytes=792, buf=0x32a9f90, fd=2) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=2, buf=0x32a9f90, nbytes=792) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007fe51abf5707 in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:152
#3  std::sys::unix::stdio::{impl#5}::write () at library/std/src/sys/unix/stdio.rs:66
#4  std::io::stdio::{impl#2}::write () at library/std/src/io/stdio.rs:153
#5  std::io::stdio::{impl#20}::write () at library/std/src/io/stdio.rs:942
#6  0x00007fe514f2a0a6 in termcolor::{impl#9}::write (self=0x7ffd30081ab0, b=...) at src/lib.rs:427
#7  termcolor::{impl#45}::write<termcolor::IoStandardStreamLock> (self=0x7ffd30081ab0, buf=...)
    at src/lib.rs:2141
#8  0x00007fe514f28f7d in std::io::Write::write_all<termcolor::LossyStandardStream<termcolor::IoStandardStreamLock>> (self=0x7ffd30081ab0, buf=...)
    at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/io/mod.rs:1542
#9  0x00007fe514f29af6 in termcolor::BufferWriter::print (self=0x1d977c0, buf=0x2ef3188) at src/lib.rs:1035
--Type <RET> for more, q to quit, c to continue without paging--
#10 0x00007fe514f0d813 in env_logger::fmt::writer::termcolor::imp::BufferWriter::print (self=0x1d977c0, 
    buf=0x2ef3188) at src/fmt/writer/termcolor/extern_impl.rs:138
#11 0x00007fe514f256d3 in env_logger::fmt::writer::Writer::print (self=0x1d977c0, buf=0x2ef3188)
    at src/fmt/writer/mod.rs:120
#12 0x00007fe514f1c56c in env_logger::fmt::Formatter::print (self=0x241a4f8, writer=0x1d977c0)
    at src/fmt/mod.rs:115
#13 0x00007fe514f1f73f in env_logger::{impl#2}::log::{closure#0}::{closure#0} () at src/lib.rs:924
#14 0x00007fe514f1a59e in core::result::Result<(), std::io::error::Error>::and_then<(), std::io::error::Error, (), env_logger::{impl#2}::log::{closure#0}::{closure_env#0}> (self=..., op=...)
    at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/result.rs:1372
#15 0x00007fe514f1f6f4 in env_logger::{impl#2}::log::{closure#0} (formatter=0x241a4f8, record=0x7ffd300821b8)
    at src/lib.rs:924
#16 0x00007fe514f1f996 in env_logger::{impl#2}::log::{closure#1} (tl_buf=0x241a4f0) at src/lib.rs:943
--Type <RET> for more, q to quit, c to continue without paging--
#17 0x00007fe514f1784c in std::thread::local::LocalKey<core::cell::RefCell<core::option::Option<env_logger::fmt::Formatter>>>::try_with<core::cell::RefCell<core::option::Option<env_logger::fmt::Formatter>>, env_logger::{impl#2}::log::{closure_env#1}, ()> (self=0x7fe51bd64348, f=...)
    at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/thread/local.rs:446
#18 0x00007fe514f1f5d2 in env_logger::{impl#2}::log (self=0x1d977b0, record=0x7ffd300821b8) at src/lib.rs:930
#19 0x00007fe51a8dcfed in log::__private_api_log (args=..., level=log::Level::Warn, kvs=...) at src/lib.rs:1597
#20 0x00007fe51513db82 in deltalake::action::Remove::from_parquet_record (record=0x32b1fb0)
    at rust/src/action/parquet_read/mod.rs:468
#21 0x00007fe51513eef4 in deltalake::action::Action::from_parquet_record (schema=0x2ef7320, 
    record=0x7ffd30084490) at rust/src/action/parquet_read/mod.rs:588
#22 0x00007fe514f9dbdd in deltalake::table_state::DeltaTableState::process_checkpoint_bytes (
    self=0x7ffd3008cf58, data=..., table_config=0x7ffd3008cde0) at rust/src/table_state.rs:116
#23 0x00007fe5151e91c4 in deltalake::table_state::{impl#0}::from_checkpoint::{async_fn#0} ()

@jabbera
Copy link

jabbera commented Jan 29, 2023

Confirmed that logging has something to do with it. Setting log level to error prevents the hang. Unfortunately, that is not a solution that would work in production:

image

@jabbera
Copy link

jabbera commented Jan 29, 2023

This can be worked around by setting the following configuration value: c.IPKernelApp.capture_fd_output = False. It is related to this github issue: jupyterlab/jupyterlab#12845. That said can anyone tell me why this much output is being generated? (This is a VERY short summary of what it's trying to print)

[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-002807cb-a22c-48f8-a085-373169bcf56f-c000.snappy.parquet")), ("deletionTimestamp", Long(1674489804387)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12320)), ("tags", MapInternal(Map { entries: [(Str("INSERTION_TIME"), Str("1674432561000000")), (Str("MIN_INSERTION_TIME"), Str("1674432561000000")), (Str("MAX_INSERTION_TIME"), Str("1674432561000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-091a9bf6-6130-42c6-a2d1-00df42d5ec80-c000.snappy.parquet")), ("deletionTimestamp", Long(1674548310286)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12406)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674522189000000")), (Str("INSERTION_TIME"), Str("1674522189000000")), (Str("compactedInto"), Str("[\"part-00000-c22b1419-4adf-4e6b-a2a7-19f3167e0b61-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1674522189000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-0e8cb4d9-055b-4ea8-af05-2f4630be8e7d.c000.snappy.parquet")), ("deletionTimestamp", Long(1674408258854)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12668)), ("tags", MapInternal(Map { entries: [(Str("INSERTION_TIME"), Str("1674408219000000")), (Str("MIN_INSERTION_TIME"), Str("1674407379000000")), (Str("MAX_INSERTION_TIME"), Str("1674408219000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-233d6744-772f-4a20-a658-5266da2d7900.c000.snappy.parquet")), ("deletionTimestamp", Long(1674548310286)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(79225)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674489797000000")), (Str("INSERTION_TIME"), Str("1674489797000000")), (Str("compactedInto"), Str("[\"part-00000-c22b1419-4adf-4e6b-a2a7-19f3167e0b61-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1659461412000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-26d8e855-d3f7-4ca8-9582-5f6fe7db7179-c000.snappy.parquet")), ("deletionTimestamp", Long(1674548310286)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12582)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674519753000000")), (Str("INSERTION_TIME"), Str("1674519753000000")), (Str("compactedInto"), Str("[\"part-00000-c22b1419-4adf-4e6b-a2a7-19f3167e0b61-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1674519753000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-27b997e1-a630-44f8-b1ce-9e5a8ef0dfa1-c000.snappy.parquet")), ("deletionTimestamp", Long(1674489804387)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(33293161)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1659461412000000")), (Str("INSERTION_TIME"), Str("1659461412000000")), (Str("MIN_INSERTION_TIME"), Str("1659461412000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-34b88e2a-486c-4a36-a17d-2af1ccd4b634-c000.snappy.parquet")), ("deletionTimestamp", Long(1674408263353)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12326)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674220136000000")), (Str("INSERTION_TIME"), Str("1674220136000000")), (Str("compactedInto"), Str("[\"part-00000-9e0a699d-fc4b-45d5-81b9-653f8ce90da9-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1674220136000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-36c1c6b6-e24e-4104-9fe3-e455d64d6be1-c000.snappy.parquet")), ("deletionTimestamp", Long(1674806817725)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12512)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674796733000008")), (Str("INSERTION_TIME"), Str("1674796733000008")), (Str("compactedInto"), Str("[\"part-00000-6872ac86-d729-4675-8340-7dacd35efb67-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1674796733000008")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-38672aca-d1a5-41c0-8f61-fc637118772e.c000.snappy.parquet")), ("deletionTimestamp", Long(1674808323448)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12609)), ("tags", MapInternal(Map { entries: [(Str("INSERTION_TIME"), Str("1674808212000000")), (Str("MIN_INSERTION_TIME"), Str("1659461412000000")), (Str("MAX_INSERTION_TIME"), Str("1674808212000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-387f6ff0-03a6-40a5-8e1b-f6849cb569ec-c000.snappy.parquet")), ("deletionTimestamp", Long(1674489804387)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12328)), ("tags", MapInternal(Map { entries: [(Str("INSERTION_TIME"), Str("1674436710000000")), (Str("MIN_INSERTION_TIME"), Str("1674436710000000")), (Str("MAX_INSERTION_TIME"), Str("1674436710000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-3aca15a9-1cdf-4932-8b4b-fec727a70997-c000.snappy.parquet")), ("deletionTimestamp", Long(1674751843814)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(30466)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674548517000020")), (Str("INSERTION_TIME"), Str("1674548517000020")), (Str("MIN_INSERTION_TIME"), Str("1659461412000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-3b254f9f-ac04-41f3-859d-ef116d38134b-c000.snappy.parquet")), ("deletionTimestamp", Long(1674409550884)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(1698441)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674408250000019")), (Str("INSERTION_TIME"), Str("1673250634000000")), (Str("compactedInto"), Str("[\"part-00000-ca8505d7-8cc8-4f3f-8544-e41baa67ed0f-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1659461412000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-3e872e6a-05a6-446f-86f2-4bd744f38f7d-c000.snappy.parquet")), ("deletionTimestamp", Long(1674806817725)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(12606)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674796733000000")), (Str("INSERTION_TIME"), Str("1674796733000000")), (Str("compactedInto"), Str("[\"part-00000-6872ac86-d729-4675-8340-7dacd35efb67-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1674796733000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }
[2023-01-29T19:05:54Z WARN  deltalake::action::parquet_read] Unexpected field name `deletionVector` for remove action: Row { fields: [("path", Str("part-00000-41203d0f-14a3-4091-a045-3225274800af.c000.snappy.parquet")), ("deletionTimestamp", Long(1674752370253)), ("dataChange", Bool(false)), ("extendedFileMetadata", Bool(true)), ("partitionValues", MapInternal(Map { entries: [] })), ("size", Long(13642)), ("tags", MapInternal(Map { entries: [(Str("MAX_INSERTION_TIME"), Str("1674551663000000")), (Str("INSERTION_TIME"), Str("1674551663000000")), (Str("compactedInto"), Str("[\"part-00000-e7204879-cf6f-4a06-b15d-bcfe22d0c2ae-c000.snappy.parquet\"]")), (Str("MIN_INSERTION_TIME"), Str("1659461412000000")), (Str("OPTIMIZE_TARGET_SIZE"), Str("268435456"))] })), ("deletionVector", Null)] }

@jabbera
Copy link

jabbera commented Jan 29, 2023

Looks like this is a known issue maybe the fact that this will hang a default jupyter lab installation ups the priority? #928

@houqp
Copy link
Member

houqp commented Jan 29, 2023

Thanks @jabbera for digging into this. I filed #1112 to workaround this.

To properly fix this, we should resolve #928, which should be pretty straightforward to implement!

@jabbera
Copy link

jabbera commented Jan 29, 2023

@houqp many thanks! Any idea what the side effect of not respecting that log field is? I'm new to all this delta table stuff but to a beginner it seems like we will be getting rows back that should be deleted.

houqp added a commit that referenced this issue Jan 29, 2023
# Description
logging this error per row is too verbose when reading newer unsupported
versions of delta log

# Related Issue(s)
closes #1093
@houqp
Copy link
Member

houqp commented Jan 29, 2023

@jabbera you will end up reading deleted rows if the deletion vector is set but ignored, so we should probably add support for this as soon we could to avoid data issue.

@wjones127
Copy link
Collaborator

I'm new to all this delta table stuff but to a beginner it seems like we will be getting rows back that should be deleted.

@jabbera @houqp If the deletion vector has a value, then the reader protocol version should be 3, and we will therefore error (since the max we support is 1). So this should never happen.

@jabbera
Copy link

jabbera commented Mar 6, 2023

@wjones127 any idea when we are getting a new release of the python bindings with this fix?

@wjones127
Copy link
Collaborator

Yeah I will probably cut a release next weekend.

chitralverma pushed a commit to chitralverma/delta-rs that referenced this issue Mar 17, 2023
# Description
logging this error per row is too verbose when reading newer unsupported
versions of delta log

# Related Issue(s)
closes delta-io#1093
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants