[Python] Segmentation fault when writing empty RecordBatches to Parquet #20163

asfimport · 2018-11-14T18:08:03Z

Background

I am trying to convert a very sparse dataset to Parquet (~3% of the rows in a range are populated). The file I am working with spans up to ~63M rows. I decided to iterate in batches of 500k rows, 127 batches in total. Each row batch is a RecordBatch. I create 4 batches at a time and write them to a Parquet file incrementally, something like this:

batches = [..]  # 4 batches
tbl = pa.Table.from_batches(batches)
pqwriter.write_table(tbl, row_group_size=15000)
# same issue with pq.write_table(..)

I was getting a segmentation fault at the final step, and I narrowed it down to a specific iteration. That iteration had empty batches; specifically, its row counts were [0, 0, 2876, 14423]. The number of rows in each RecordBatch for the whole dataset is below:

[14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
18807, 18789, 14258, 0, 0]
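
For anyone wanting to locate the suspicious iterations from a list of row counts like the one above, here is a minimal sketch. It assumes batches are grouped four at a time starting from the first batch, as described earlier; the helper name `groups_with_empty_batches` is hypothetical and not part of pyarrow. The excerpt begins at 0-based batch 52, which is a multiple of 4, so its groups line up with the groups actually written.

```python
def groups_with_empty_batches(row_counts, group_size=4):
    """Yield (group_index, counts) for every group that contains a zero-row batch."""
    for start in range(0, len(row_counts), group_size):
        group = row_counts[start:start + group_size]
        if 0 in group:
            yield start // group_size, group


# Excerpt of the per-batch row counts above, starting at 0-based batch 52.
excerpt = [8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330, 15452]

suspects = list(groups_with_empty_batches(excerpt))
for index, group in suspects:
    # Group indices are relative to the start of the excerpt.
    print(index, group)
```

The second group reported is the [0, 0, 2876, 14423] iteration mentioned above.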

When I exclude the empty RecordBatches, the segfault goes away, but unfortunately I couldn't create a proper minimal example with synthetic data.
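
The workaround of excluding empty batches could look roughly like the sketch below. The helper `drop_empty_batches` is a hypothetical name, and `FakeBatch` is only a stand-in so the snippet runs without the dataset; the helper relies on nothing but a `num_rows` attribute, which pyarrow's `RecordBatch` provides.

```python
from collections import namedtuple


def drop_empty_batches(batches):
    """Return only the batches that hold at least one row."""
    return [batch for batch in batches if batch.num_rows > 0]


# Stand-in for pyarrow.RecordBatch; only num_rows matters for the filter.
FakeBatch = namedtuple("FakeBatch", ["label", "num_rows"])

group = [
    FakeBatch("a", 0),
    FakeBatch("b", 0),
    FakeBatch("c", 2876),
    FakeBatch("d", 14423),
]

kept = drop_empty_batches(group)
print([batch.label for batch in kept])
```

With real data, the filtered list would be passed to `pa.Table.from_batches(...)` in place of the original one. If every batch in a group is empty, it is probably simplest to skip the write for that group, since building a table from an empty list of batches needs an explicit schema.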

Not quite minimal example

The data I am using is from the 1000 Genomes Project, which has been public for many years, so we can be reasonably sure the data is good. The following steps should help you replicate the issue.

Download the data file (and index), about 330MB:

$ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}

Install the Cython library pysam, a thin wrapper around the reference implementation of the VCF file spec. You will need zlib headers, but that's probably not a problem :)
```
$ pip3 install --user pysam
```
Now you can use the attached script to replicate the crash.

Extra information

I have tried attaching gdb; the backtrace at the point of the segfault is shown below (it may help, as this is how I realised empty batches could be the cause).

{code}
(gdb) bt
#0 0x00007f3e7676d670 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#1 0x00007f3e76733d1e in arrow::Status parquet::arrow::(anonymous namespace)::ArrowColumnWriter::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>, arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#2 0x00007f3e7673a3d4 in parquet::arrow::(anonymous namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#3 0x00007f3e7673df09 in parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#4 0x00007f3e7673c74d in parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#5 0x00007f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#6 0x00007f3e731e3a51 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) ()
   from /home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
{code}

Environment: Fedora 28, pyarrow installed with pip; Fedora 29, pyarrow installed from conda-forge
Reporter: Suvayu Ali / @suvayu
Assignee: Wes McKinney / @wesm

Original Issue Attachments:

Externally tracked issue: #2951

PRs and other links:

GitHub Pull Request #3141

Note: This issue was originally created as ARROW-3792. Please see the migration documentation for further details.


asfimport · 2018-12-05T15:48:48Z

Tanya Schlusser / @tanyaschlusser:
I have followed Suvayu's instructions and can successfully reproduce the segfault. I am going to try working on this, thanks!

asfimport · 2018-12-06T02:41:04Z

Tanya Schlusser / @tanyaschlusser:
I can now reproduce the bug with more minimal code (see the attached file minimal_bug_arrow3792.py). The problem is with the column that contains a list: I think the segfault occurs when handling an empty batch that is supposed to contain a list column. I'm going to keep looking at it, but in case I'm slow and someone else wants to fix it sooner, you no longer need to download the genome dataset or install pysam.

asfimport · 2018-12-06T02:54:51Z

Wes McKinney / @wesm:
Thank you for creating the minimal reproduction! That's very helpful. I can take a look sometime in the next several days.

asfimport · 2018-12-06T03:03:25Z

Tanya Schlusser / @tanyaschlusser:
Sweet! I'll stop on this then :)

asfimport · 2018-12-10T02:28:24Z

Wes McKinney / @wesm:
The Parquet writer code behaves incorrectly when writing a length-0 array. There is another bug report about writing length-0 record batches, so possibly the same fix is involved.

asfimport · 2018-12-10T13:05:28Z

Krisztian Szucs / @kszucs:
Issue resolved by pull request 3141
#3141

asfimport closed this as completed Dec 10, 2018

asfimport assigned wesm Jan 10, 2023

asfimport added this to the 0.12.0 milestone Jan 11, 2023
