Skip to content
Merged
Changes from all commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
311 changes: 311 additions & 0 deletions _posts/2022-08-16-9.0.0-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,311 @@
---
layout: post
title: "Apache Arrow 9.0.0 Release"
date: "2022-08-16 00:00:00"
author: pmc
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->


The Apache Arrow team is pleased to announce the 9.0.0 release. This covers
over 3 months of development work and includes [**509 resolved issues**][1]
from [**114 distinct contributors**][2]. See the Install Page to learn how to
get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bug fixes and improvements have been made: we refer
you to the [complete changelog][3].

## Community

Since the 8.0.0 release, Dewey Dunnington, Alenka Frim and Rok Mihevc
have been invited to be committers.
Thanks for your contributions and participation in the project!

## Columnar Format Notes

## Arrow Flight RPC notes

Arrow Flight is now available in MacOS M1 Python wheels ([ARROW-16779](https://issues.apache.org/jira/browse/ARROW-16779)).
Arrow Flight SQL is now buildable on Windows ([ARROW-16902](https://issues.apache.org/jira/browse/ARROW-16902)).
Ruby now exposes more of the Flight and Flight SQL APIs (various JIRAs).

## Linux packages notes

AlmaLinux 9 is now supported. ([ARROW-16745](https://issues.apache.org/jira/browse/ARROW-16745))

AmazonLinux 2 aarch64 is now supported. ([ARROW-16477](https://issues.apache.org/jira/browse/ARROW-16477))

## C++ notes

STL-like iteration is now provided over chunked arrays ([ARROW-602](https://issues.apache.org/jira/browse/ARROW-602)).

### Compute

The C++ compute and execution engine is now officially named "Acero", though
its C++ namespaces have not changed.

New light-weight data holder abstractions have been introduced in order
to reduce the overhead of invoking compute functions and kernels, especially
at the small data sizes desirable for efficient parallelization (typically
L1- or L2-sized). Specifically, the non-owning `ArraySpan` and `ExecSpan`
structures have internally superseded the much heavier `ExecBatch`, which
is still supported for compatibility at the API level
([ARROW-16756](https://issues.apache.org/jira/browse/ARROW-16756), [ARROW-16824](https://issues.apache.org/jira/browse/ARROW-16824), [ARROW-16852](https://issues.apache.org/jira/browse/ARROW-16852)).

In a similar vein, the `ValueDescr` class was removed and `ScalarKernel`
implementations now always receive at least one non-scalar input, removing
the special case where a `ScalarKernel` needs to output a scalar rather than
an array. The higher-level compute APIs still allow executing a scalar function
over all-scalar inputs; but those scalars are internally broadcasted to
1-element arrays so as to simplify kernel implementation ([ARROW-16757](https://issues.apache.org/jira/browse/ARROW-16757)).

Some performance improvements were made to the hash join node. These changes
do not require additional configuration. The hash join exec node has been
improved to more efficiently use CPU cache and make better use of available
vectorization hardware ([ARROW-14182](https://issues.apache.org/jira/browse/ARROW-14182)).

Some plans containing a sequence of hash join operators will now use bloom
filters to eliminate rows earlier in the plan, reducing the overall CPU
cost of the plan ([ARROW-15498](https://issues.apache.org/jira/browse/ARROW-15498)).

Timestamp comparison is now supported ([ARROW-16425](https://issues.apache.org/jira/browse/ARROW-16425)).

A cumulative sum function is implemented over numeric inputs ([ARROW-13530](https://issues.apache.org/jira/browse/ARROW-13530)). Note that this is a vector
function so cannot be used in an Acero ExecPlan.

A rank vector kernel has been added ([ARROW-16234](https://issues.apache.org/jira/browse/ARROW-16234)).

Temporal rounding functions received additional options to control how
rounding is done ([ARROW-14821](https://issues.apache.org/jira/browse/ARROW-14821)).

Improper computation of the "mode" function on boolean input was fixed
([ARROW-17096](https://issues.apache.org/jira/browse/ARROW-17096)).

Function registries can now be nested ([ARROW-16677](https://issues.apache.org/jira/browse/ARROW-16677)).

### Dataset

The `autogenerate_column_names` option for CSV reading is now handled correctly
([ARROW-16436](https://issues.apache.org/jira/browse/ARROW-16436)).

Fix `InMemoryDataset::ReplaceSchema` to actually replace the schema
([ARROW-16085](https://issues.apache.org/jira/browse/ARROW-16085)).

Fix `FilenamePartitioning` to properly support null values ([ARROW-16302](https://issues.apache.org/jira/browse/ARROW-16302)).

### Filesystem

A number of bug fixes and improvements were made to the Google Cloud Storage
filesystem implementation ([ARROW-14892](https://issues.apache.org/jira/browse/ARROW-14892)).

By default, the S3 filesystem implementation does not create or drop buckets
anymore ([ARROW-15906](https://issues.apache.org/jira/browse/ARROW-15906)). This is a compatibility-breaking change intended
to prevent user errors from having potentially catastrophic consequences.
Options have been added to restore the previous behavior if necessary.

### Parquet

The default Parquet version is now 2.4 for writing, enabling use of
more recent logical types by default ([ARROW-12203](https://issues.apache.org/jira/browse/ARROW-12203)).

Non-nullable fields are now handled correctly by the Parquet reader
([ARROW-16116](https://issues.apache.org/jira/browse/ARROW-16116)).

Reading encrypted files should now be thread-safe ([ARROW-14114](https://issues.apache.org/jira/browse/ARROW-14114)).

Statistics equality now works correctly with minmax ([ARROW-16487](https://issues.apache.org/jira/browse/ARROW-16487)).

The minimum Thrift version required for building is now 0.13 ([ARROW-16721](https://issues.apache.org/jira/browse/ARROW-16721)).

The Thrift deserialization limits can now be configured to accommodate for
data files with very large metadata ([ARROW-16546](https://issues.apache.org/jira/browse/ARROW-16546)).

### Substrait

The Substrait spec has been updated to 0.6.0 ([ARROW-16816](https://issues.apache.org/jira/browse/ARROW-16816)). In addition, a
larger subset of the Substrait specification is now supported ([ARROW-15587](https://issues.apache.org/jira/browse/ARROW-15587),
[ARROW-15590](https://issues.apache.org/jira/browse/ARROW-15590),
[ARROW-15901](https://issues.apache.org/jira/browse/ARROW-15901),
[ARROW-16657](https://issues.apache.org/jira/browse/ARROW-16657),
[ARROW-15591](https://issues.apache.org/jira/browse/ARROW-15591)).

## C# notes

#### New Features

* Added support for Time32Array and Time64Array ([ARROW-16660](https://github.com/apache/arrow/pull/13279))

#### Bug Fixes

* When using TableFromRecordBatches, the resulting table columns have no data array. ([ARROW-13129](https://github.com/apache/arrow/pull/10562))
* Fix intermittent test failures due to async memory management bug. ([ARROW-16978](https://github.com/apache/arrow/pull/13573))

## Go notes

### Security

* Updated testify dependency to address CVE-2022-28948. ([ARROW-16759](https://issues.apache.org/jira/browse/ARROW-16759)) (This was also backported to previous versions and released as patch versions v6.0.2, v7.0.1, and v8.0.1)

### Arrow

#### New Features

* Dictionary Scalars are now available ([ARROW-16323](https://issues.apache.org/jira/browse/ARROW-16323))
* Introduced a DictionaryUnifier object along with functions for unifying Chunked Arrays and Tables ([ARROW-16324](https://issues.apache.org/jira/browse/ARROW-16324))
* New CSV examples added to documentation to demonstrate error handling ([ARROW-16450](https://issues.apache.org/jira/browse/ARROW-16450))
* CSV Reader now supports arrow.TimestampType ([ARROW-16504](https://issues.apache.org/jira/browse/ARROW-16504))
* JSON parsing for Temporal Types now allow passing numeric values in addition to strings for parsing. Timezones will be properly parsed if they exist in the string and a function was added to retrieve a time.Location object from a TimestampType ([ARROW-16551](https://issues.apache.org/jira/browse/ARROW-16551))
* New utilities added to decimal128 for rescaling and easy conversion to and from float32/float64 ([ARROW-16552](https://issues.apache.org/jira/browse/ARROW-16552))
* Arrow DataType interface now has a LayoutMethod which returns the physical layout of the given datatype such as the number of buffers, types, etc. This matches the behavior of the layout() methods in C++ for data types. ([ARROW-16556](https://issues.apache.org/jira/browse/ARROW-16556))
* Added a SliceBuffer function to the memory package to allow better re-using of memory across buffer objects ([ARROW-16557](https://issues.apache.org/jira/browse/ARROW-16557))
* Dictionary Arrays can now be concatenated using array.Concatenate ([ARROW-17095](https://issues.apache.org/jira/browse/ARROW-17095))

#### Bug Fixes

* ipc.FileReader now properly uses the memory.Allocator interface ([ARROW-16002](https://issues.apache.org/jira/browse/ARROW-16002))
* Addressed issue with Integration tests between Go and Java ([ARROW-16441](https://issues.apache.org/jira/browse/ARROW-16441))
* RecordBuilder.UnmarshalJSON now properly ignores extra unknown fields rather than panicking ([ARROW-16456](https://issues.apache.org/jira/browse/ARROW-16456))
* StructBuilder.UnmarshalJSON will no longer fail and panic when Nullable fields are missing ([ARROW-16502](https://issues.apache.org/jira/browse/ARROW-16502))
* ipc.Reader no longer silently accepts string columns with invalid offsets, preventing unexpected panics later when writing or accessing the resulting arrays. ([ARROW-16831](https://issues.apache.org/jira/browse/ARROW-16831))
* Arrow CSV reader no longer clobbers its reported errors and properly surfaces them ([ARROW-16926](https://issues.apache.org/jira/browse/ARROW-16926))

### Parquet

#### New Features

* The CreatedBy version string for the Parquet writer will now correctly reflect the library version, and will be updated by the release scripts ([ARROW-16484](https://issues.apache.org/jira/browse/ARROW-16484))
* Parquet bit_packing functions now have ARM64 NEON implementations for performance ([ARROW-16486](https://issues.apache.org/jira/browse/ARROW-16486))
* It is now possible to customize the root node in the Parquet writer instead of hardcoding it to be named "schema" with a repetition type of Repeated. This was needed to allow producing files similar to Apache Spark where the root node has a repetition type of Required. It still defaults to the spec definition of Repeated. ([ARROW-16561](https://issues.apache.org/jira/browse/ARROW-16561))
* parquet_reader CLI mainprog has been enhanced to dump values out as JSON and CSV along with setting an output file instead of just dumping to the terminal. ([ARROW-16934](https://issues.apache.org/jira/browse/ARROW-16934))

#### Bug Fixes

* Fixed a memory leak with Parquet page reading ([ARROW-16473](https://issues.apache.org/jira/browse/ARROW-16473))
* Parquet Reader properly parallelizes column reads when the parallel option is set to true. ([ARROW-16530](https://issues.apache.org/jira/browse/ARROW-16530))
* Fixed bug in the Bool decoder for plain encoding ([ARROW-16563](https://issues.apache.org/jira/browse/ARROW-16563))
* Fixed a bug in the Parquet bool column reader where it failed to properly skip rows ([ARROW-16638](https://issues.apache.org/jira/browse/ARROW-16638))
* Fixed the flakey travis ARM64 builds by reducing the size of a test case in the pqarrow unit tests to reduce the memory usage for the tests. ([ARROW-16669](https://issues.apache.org/jira/browse/ARROW-16669))
* Parquet writer now properly handles writing arrow.NULL type arrays ([ARROW-16749](https://issues.apache.org/jira/browse/ARROW-16749))
* Column level dictionary encoding configuration for Parquet writing now correctly respects the input value ([ARROW-16813](https://issues.apache.org/jira/browse/ARROW-16813))
* Memory leak in DeltaByteArray encoding fixed ([ARROW-16983](https://issues.apache.org/jira/browse/ARROW-16983))


## Java notes
#### New Features
* Allow overriding column nullability in arrow-jdbc ([#13558](https://github.com/apache/arrow/pull/13558))
* Enable skip BOUNDS_CHECKING with setBytes and getBytes of ArrowBuf ([#13161](https://github.com/apache/arrow/pull/13161))
* Initialize JNI components on use instead of statically ([#13146](https://github.com/apache/arrow/pull/13146))
* Provide explicit JDBC column type mapping ([#13166](https://github.com/apache/arrow/pull/13166))
* Allow duplicated field names in Java C data interface ([#13247](https://github.com/apache/arrow/pull/13247))
* Improve and document StackTrace ([#12656](https://github.com/apache/arrow/pull/12656))
* Keep more context when marshaling errors through JNI ([#13246](https://github.com/apache/arrow/pull/13246))
* Make RoundingMode configurable to handle inconsistent scale in BigDecimals ([#13433](https://github.com/apache/arrow/pull/13433))
* Improve Java dev experience with IntelliJ ([#13017](https://github.com/apache/arrow/pull/13017))
* Implement ArrowArrayStream ([#13465](https://github.com/apache/arrow/pull/13465)))

#### Bug Fixes
* Fix variable-width vectors in integration JSON writer ([#13676](https://github.com/apache/arrow/pull/13676))
* Handle empty JDBC ResultSet ([#13049](https://github.com/apache/arrow/pull/13049))
* Fix hasNext() in ArrowVectorIterator ([#13107](https://github.com/apache/arrow/pull/13107))
* Fix ArrayConsumer when using ArrowVectorIterator ([#12692](https://github.com/apache/arrow/pull/12692))
* Update Gandiva Protobuf library to enable builds on Apple M1 ([#13121](https://github.com/apache/arrow/pull/13121))
* Patch dataset module testing failure with JSE11+ ([#13200](https://github.com/apache/arrow/pull/13200))
* Don't duplicate generated Protobuf classes between flight-core and flight-sql ([#13596](https://github.com/apache/arrow/pull/13596))

## JavaScript notes

* Fix error iterating tables with no batches ([ARROW-16371](https://issues.apache.org/jira/browse/ARROW-16371))
* Handle case where `tableFromIPC` input is an async `RecordBatchReader` ([ARROW-16704](https://issues.apache.org/jira/browse/ARROW-16704))

## Python notes

Compatibility notes:

* PyArrow now requires Python >= 3.7 ([ARROW-16474](https://issues.apache.org/jira/browse/ARROW-16474)).

* The default behaviour regarding memory mapping has changed in several APIs (reading of Feather or Parquet files, IPC RecordBatchFileReader and RecordBatchStreamReader) to disable memory mapping by default ([ARROW-16382](https://issues.apache.org/jira/browse/ARROW-16382)).

* The default Parquet version is now 2.4 for writing, enabling use of
more recent logical types by default such as unsigned integers ([ARROW-12203](https://issues.apache.org/jira/browse/ARROW-12203)). One can specify `version="2.6"` to also enable support for nanosecond timestamps. Use `version="1.0"` to restore the old behaviour and maximizes file compatibility.

* Some deprecated APIs (deprecated at least since pyarrow 1.0.0) have been removed: IPC methods in the top-level namespace, the `Value` scalar classes and the `pyarrow.compat` module ([ARROW-17010](https://issues.apache.org/jira/browse/ARROW-17010)).

New features:

* Google Cloud Storage (GCS) File System support is now available in the Python bindings ([ARROW-14892](https://issues.apache.org/jira/browse/ARROW-14892)).

* The `Table.filter()` method now supports passing an expression in addition to a boolean array ([ARROW-16469](https://issues.apache.org/jira/browse/ARROW-16469)).

* When implementing extension types in Python, it is now possible to also customize which Python scalar gets returned (in `Array.to_pylist()` or `Scalar.as_py()`) by subclassing `ExtensionScalar` ([ARROW-13612](https://issues.apache.org/jira/browse/ARROW-13612), ([ARROW-17065](https://issues.apache.org/jira/browse/ARROW-17065))).

* It is now possible to register User Defined Functions (UDF) for scalar functions using `register_scalar_function` ([ARROW-15639](https://issues.apache.org/jira/browse/ARROW-15639)).

* Basic support for consuming a Substrait plan has been exposed in Python as `pyarrow.substrait.run_query` ([ARROW-15779](https://issues.apache.org/jira/browse/ARROW-15779)).

* The `cast` method and compute kernel now exposes the fine grained options in addition to safe/unsafe casting ([ARROW-15365](https://issues.apache.org/jira/browse/ARROW-15365)).

In addition, this release includes several bug fixes and documention improvements (such as expanded examples in docstrings ([ARROW-16091](https://issues.apache.org/jira/browse/ARROW-16091))).

Further, the Python bindings benefit from improvements in the C++ library
(e.g. new compute functions); see the C++ notes above for additional details.


## R notes

Highlights include several new `dplyr` verbs, including `glimpse()` and `union_all()`, as well as many more datetime functions from `lubridate`. There is also experimental support for user-defined scalar functions in the query engine, and most packages include native support for datasets in Google Cloud Storage (opt-in in the Linux full source build).

For more on what’s in the 9.0.0 R package, see the [R changelog][4].

## Ruby and C GLib notes

FlightSQL is now supported but there are minimum features for now.

More Flight features are now supported.

### Ruby

`Enumerable` compatible methods such as `#min` and `#max` on `Arrow::Array`, `Arrow::ChunkedArray` and `Arrow::Column` are implemented by C++'s [compute functions]({{ site.baseurl }}/docs/cpp/compute.html). This improves performance. ([ARROW-15222](https://issues.apache.org/jira/browse/ARROW-15222))

This release fixed some memory leaks. ([ARROW-14790](https://issues.apache.org/jira/browse/ARROW-14790))

This release improved support for interval type arrays such as `Arrow::MonthIntervalArray`. ([ARROW-16206](https://issues.apache.org/jira/browse/ARROW-16206))

This release improved auto data type conversion. ([ARROW-16874](https://issues.apache.org/jira/browse/ARROW-16874))

### C GLib

Vala is now supported. ([ARROW-15671](https://issues.apache.org/jira/browse/ARROW-15671)). See [`c_glib/example/vala/`](https://github.com/apache/arrow/tree/apache-arrow-9.0.0/c_glib/example/vala) for examples.

`GArrowQuantil
eOptions` is added. ([ARROW-16623](https://issues.apache.org/jira/browse/ARROW-16623))

## Rust notes

The Rust projects have moved to separate repositories outside the
main Arrow monorepo. For notes on the 19.0.0 release of the Rust
implementation, see the [Arrow Rust changelog][5].

[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%209.0.0
[2]: {{ site.baseurl }}/release/9.0.0.html#contributors
[3]: {{ site.baseurl }}/release/9.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/blob/19.0.0/CHANGELOG.md