

Error/Segfault when writing many partitions #396

Closed
svilupp opened this issue Mar 12, 2023 · 0 comments · Fixed by #445
Labels
bug Something isn't working

Comments

@svilupp
Contributor

svilupp commented Mar 12, 2023

When writing a partitioned table with compression, it throws an error or segfaults.
It impacts Julia 1.8.5 on both ARM and Linux when threading is enabled. Possible root causes are suggested below.

How to reproduce:

using Arrow, Tables, Random
using Logging
debuglogger = ConsoleLogger(stderr, Logging.Debug)

# Create dataset
N = 64 # set to 2x Threads.nthreads()
len = 100
t = Tables.rowtable((; x1 = map(x -> randstring(len), 1:N)))

# This runs okay
Arrow.write("test_partitioned.arrow", Iterators.partition(t, 1); compress = :zstd);

# This fails with Threads.nthreads() > 1 and compression enabled (logging causes a yield, which creates the collision)
with_logger(debuglogger) do
    # fails for both :zstd and :lz4
    Arrow.write("test_partitioned.arrow", Iterators.partition(t, 1); compress = :lz4)
end
# if it doesn't fail immediately, run it a few times

If you don't use compression, run with a single thread, don't use logging, or don't partition the data (partitioning is what enables the threaded workload), everything works fine.

On Linux, it results in the following errors:

# ERROR: fatal error writing arrow data
# or: InvalidStateException: Channel is closed. ERROR: fatal error writing arrow data

On ARM (Apple M1), it either gives the same errors, or it sometimes segfaults:

# signal (11): Segmentation fault: 11
# in expression starting at /Users/x/arrow-internals/bug_with_partitions.jl:15
# jl_table_peek_bp at /Users/x/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.8.dylib (unknown line)
# ijl_eqtable_get at /Users/x/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.8.dylib (unknown line)

# signal (11): Segmentation fault: 11
# in expression starting at /Users/x/arrow-internals/bug_with_partitions.jl:15

# signal (11): Segmentation fault: 11
# in expression starting at /Users/x/arrow-internals/bug_with_partitions.jl:15
# _ZN5dyld44APIs6dladdrEPKvP7dl_info at /usr/lib/dyld (unknown line)
# Allocations: 103975646 (Pool: 103867433; Big: 108213); GC: 82

# signal (11): Segmentation fault: 11
# in expression starting at /Users/x/arrow-internals/bug_with_partitions.jl:15
# subtype at /Users/x/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.8.dylib (unknown line)
# forall_exists_subtype at /Users/x/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.8.dylib (unknown line)
# ijl_subtype_env at /Users/x/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.8.dylib (unknown line)
# jl_type_intersection_env_s at /Users/x/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.8.dylib (unknown line)

# signal (11): Segmentation fault: 11
# in expression starting at /Users/x/arrow-internals/bug_with_partitions.jl:15
# _ZN5dyld44APIs6dladdrEPKvP7dl_info at /usr/lib/dyld (unknown line)
# Allocations: 103975646 (Pool: 103867433; Big: 108213); GC: 82

Suspected root cause:

  • This line, where the list of pre-initialized compressors is accessed by Threads.threadid()
  • From Julia 1.8, the :dynamic task scheduler is the default, so if any part of the code yields (for example, when logging, as above), different tasks may end up using the same compressor
  • See the issue discussed on FLoops.jl and here
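The failure mode in the bullets above can be reproduced without Arrow at all. The following is a hypothetical sketch (illustrative names, not Arrow.jl's actual internals) of why indexing a shared pool by Threads.threadid() is unsafe once code can yield: two tasks resumed on the same thread id can grab the same slot.

```julia
# One pre-initialized "compressor" slot per thread, as in the unsafe pattern.
const BUSY = [false for _ in 1:Threads.nthreads()]

function compress_unsafe!(payload)
    tid = Threads.threadid()
    BUSY[tid] && error("collision: compressor $tid already in use")
    BUSY[tid] = true
    yield()  # a @debug statement can yield here, letting the scheduler
             # resume another task that sees the same threadid()
    BUSY[tid] = false
    return payload
end

# Two cooperatively scheduled tasks on one thread are enough to collide:
function demo_collision()
    outcomes = String[]
    @sync for i in 1:2
        @async try
            compress_unsafe!(i)
            push!(outcomes, "ok")
        catch
            push!(outcomes, "collision")
        end
    end
    return sort(outcomes)
end
```

Note that this collides even on a single thread: `@async` tasks share thread id 1, so the first task's `yield` hands control to the second task, which finds the slot already marked busy.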

Versioninfo:

Arrow 2.4.3 (both the released version and the main branch)

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
Threads: 1 on 32 virtual cores

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 6 virtual cores

@ericphanson ericphanson added the bug Something isn't working label Apr 9, 2023
quinnj added a commit that referenced this issue May 25, 2023
Fixes #396.

As noted in the originally reported issue, enabling debug logging when
writing arrow data with compression can result in segfaults because the
underlying CodecX packages have debug logs, causing task switches/migration
and thus making the pattern of using a single `X_COMPRESSOR` array indexed
by `Threads.threadid()` unsafe, since multiple threads may try to use the
same compressor at the same time.

We fix this by wrapping each compressor in a `Lockable` and ensuring the
`compress` (or `uncompress`) operation holds the lock for the duration of
the operation. We also:
* Add a decompressor per thread to avoid recreating them over and over during reading
* Lazily initialize compressors/decompressors in a way that is 1.9+ safe and only
creates the object when needed by a specific thread
* Switch from WorkerUtilities -> ConcurrentUtilities (the package was renamed)

Co-authored-by: J S <49557684+svilupp@users.noreply.github.com>
quinnj added a commit that referenced this issue May 30, 2023
Fixes #396.

As noted in the originally reported issue, enabling debug logging when
writing arrow data with compression can result in segfaults because the
underlying CodecX packages have debug logs, causing task
switches/migration and thus making the pattern of using a single
`X_COMPRESSOR` array indexed by `Threads.threadid()` unsafe, since
multiple threads may try to use the same compressor at the same time.

We fix this by wrapping each compressor in a `Lockable` and ensuring the
`compress` (or `uncompress`) operation holds the lock for the duration
of the operation. We also:
* Add a decompressor per thread to avoid recreating them over and over
during reading
* Lazily initialize compressors/decompressors in a way that is 1.9+ safe
and only creates the object when needed by a specific thread
* Switch from WorkerUtilities -> ConcurrentUtilities (the package was
renamed)

Successor to #397; I've added
@svilupp as a co-author here since they started the original effort to
move the code in this direction.

---------

Co-authored-by: J S <49557684+svilupp@users.noreply.github.com>
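The locking approach described in the commit message above can be sketched as follows. `Guarded`/`withguard` are simplified, hypothetical stand-ins for `Lockable` from ConcurrentUtilities.jl (not the package's actual API), and `FakeCompressor` is purely illustrative:

```julia
mutable struct FakeCompressor
    in_use::Bool
end

# A value paired with a lock; access only happens while the lock is held.
struct Guarded{T}
    value::T
    lock::ReentrantLock
end
Guarded(value) = Guarded(value, ReentrantLock())

# Run f with exclusive access to the wrapped value.
withguard(f, g::Guarded) = lock(() -> f(g.value), g.lock)

const COMPRESSORS = [Guarded(FakeCompressor(false)) for _ in 1:Threads.nthreads()]

function compress_safe!(payload)
    withguard(COMPRESSORS[Threads.threadid()]) do c
        c.in_use && error("collision: compressor already in use")
        c.in_use = true
        yield()  # even if a debug log yields here, the lock keeps any other
                 # task out of this compressor until we release it
        c.in_use = false
    end
    return payload
end

# The same two-task interleaving that breaks the unlocked pattern now succeeds:
function demo_safe()
    outcomes = String[]
    @sync for i in 1:2
        @async push!(outcomes, try compress_safe!(i); "ok" catch; "collision" end)
    end
    return outcomes
end
```

Holding the lock across the `yield` is the key point: a second task that lands on the same thread id simply blocks on `lock` until the first task finishes, instead of sharing the compressor mid-operation.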