diff --git a/_posts/2022-08-16-9.0.0-release.md b/_posts/2022-08-16-9.0.0-release.md new file mode 100644 index 000000000000..f69e3edfc26b --- /dev/null +++ b/_posts/2022-08-16-9.0.0-release.md @@ -0,0 +1,311 @@ +--- +layout: post +title: "Apache Arrow 9.0.0 Release" +date: "2022-08-16 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 9.0.0 release. This covers +over 3 months of development work and includes [**509 resolved issues**][1] +from [**114 distinct contributors**][2]. See the Install Page to learn how to +get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bug fixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 8.0.0 release, Dewey Dunnington, Alenka Frim and Rok Mihevc +have been invited to be committers. +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +Arrow Flight is now available in MacOS M1 Python wheels ([ARROW-16779](https://issues.apache.org/jira/browse/ARROW-16779)). +Arrow Flight SQL is now buildable on Windows ([ARROW-16902](https://issues.apache.org/jira/browse/ARROW-16902)). +Ruby now exposes more of the Flight and Flight SQL APIs (various JIRAs). + +## Linux packages notes + +AlmaLinux 9 is now supported. ([ARROW-16745](https://issues.apache.org/jira/browse/ARROW-16745)) + +AmazonLinux 2 aarch64 is now supported. ([ARROW-16477](https://issues.apache.org/jira/browse/ARROW-16477)) + +## C++ notes + +STL-like iteration is now provided over chunked arrays ([ARROW-602](https://issues.apache.org/jira/browse/ARROW-602)). + +### Compute + +The C++ compute and execution engine is now officially named "Acero", though +its C++ namespaces have not changed. + +New light-weight data holder abstractions have been introduced in order +to reduce the overhead of invoking compute functions and kernels, especially +at the small data sizes desirable for efficient parallelization (typically +L1- or L2-sized). Specifically, the non-owning `ArraySpan` and `ExecSpan` +structures have internally superseded the much heavier `ExecBatch`, which +is still supported for compatibility at the API level +([ARROW-16756](https://issues.apache.org/jira/browse/ARROW-16756), [ARROW-16824](https://issues.apache.org/jira/browse/ARROW-16824), [ARROW-16852](https://issues.apache.org/jira/browse/ARROW-16852)). + +In a similar vein, the `ValueDescr` class was removed and `ScalarKernel` +implementations now always receive at least one non-scalar input, removing +the special case where a `ScalarKernel` needs to output a scalar rather than +an array. The higher-level compute APIs still allow executing a scalar function +over all-scalar inputs; but those scalars are internally broadcasted to +1-element arrays so as to simplify kernel implementation ([ARROW-16757](https://issues.apache.org/jira/browse/ARROW-16757)). + +Some performance improvements were made to the hash join node. These changes +do not require additional configuration. The hash join exec node has been +improved to more efficiently use CPU cache and make better use of available +vectorization hardware ([ARROW-14182](https://issues.apache.org/jira/browse/ARROW-14182)). + +Some plans containing a sequence of hash join operators will now use bloom +filters to eliminate rows earlier in the plan, reducing the overall CPU +cost of the plan ([ARROW-15498](https://issues.apache.org/jira/browse/ARROW-15498)). + +Timestamp comparison is now supported ([ARROW-16425](https://issues.apache.org/jira/browse/ARROW-16425)). + +A cumulative sum function is implemented over numeric inputs ([ARROW-13530](https://issues.apache.org/jira/browse/ARROW-13530)). Note that this is a vector +function so cannot be used in an Acero ExecPlan. + +A rank vector kernel has been added ([ARROW-16234](https://issues.apache.org/jira/browse/ARROW-16234)). + +Temporal rounding functions received additional options to control how +rounding is done ([ARROW-14821](https://issues.apache.org/jira/browse/ARROW-14821)). + +Improper computation of the "mode" function on boolean input was fixed +([ARROW-17096](https://issues.apache.org/jira/browse/ARROW-17096)). + +Function registries can now be nested ([ARROW-16677](https://issues.apache.org/jira/browse/ARROW-16677)). + +### Dataset + +The `autogenerate_column_names` option for CSV reading is now handled correctly +([ARROW-16436](https://issues.apache.org/jira/browse/ARROW-16436)). + +Fix `InMemoryDataset::ReplaceSchema` to actually replace the schema +([ARROW-16085](https://issues.apache.org/jira/browse/ARROW-16085)). + +Fix `FilenamePartitioning` to properly support null values ([ARROW-16302](https://issues.apache.org/jira/browse/ARROW-16302)). + +### Filesystem + +A number of bug fixes and improvements were made to the Google Cloud Storage +filesystem implementation ([ARROW-14892](https://issues.apache.org/jira/browse/ARROW-14892)). + +By default, the S3 filesystem implementation does not create or drop buckets +anymore ([ARROW-15906](https://issues.apache.org/jira/browse/ARROW-15906)). This is a compatibility-breaking change intended +to prevent user errors from having potentially catastrophic consequences. +Options have been added to restore the previous behavior if necessary. + +### Parquet + +The default Parquet version is now 2.4 for writing, enabling use of +more recent logical types by default ([ARROW-12203](https://issues.apache.org/jira/browse/ARROW-12203)). + +Non-nullable fields are now handled correctly by the Parquet reader +([ARROW-16116](https://issues.apache.org/jira/browse/ARROW-16116)). + +Reading encrypted files should now be thread-safe ([ARROW-14114](https://issues.apache.org/jira/browse/ARROW-14114)). + +Statistics equality now works correctly with minmax ([ARROW-16487](https://issues.apache.org/jira/browse/ARROW-16487)). + +The minimum Thrift version required for building is now 0.13 ([ARROW-16721](https://issues.apache.org/jira/browse/ARROW-16721)). + +The Thrift deserialization limits can now be configured to accommodate for +data files with very large metadata ([ARROW-16546](https://issues.apache.org/jira/browse/ARROW-16546)). + +### Substrait + +The Substrait spec has been updated to 0.6.0 ([ARROW-16816](https://issues.apache.org/jira/browse/ARROW-16816)). In addition, a +larger subset of the Substrait specification is now supported ([ARROW-15587](https://issues.apache.org/jira/browse/ARROW-15587), +[ARROW-15590](https://issues.apache.org/jira/browse/ARROW-15590), +[ARROW-15901](https://issues.apache.org/jira/browse/ARROW-15901), +[ARROW-16657](https://issues.apache.org/jira/browse/ARROW-16657), +[ARROW-15591](https://issues.apache.org/jira/browse/ARROW-15591)). + +## C# notes + +#### New Features + +* Added support for Time32Array and Time64Array ([ARROW-16660](https://github.com/apache/arrow/pull/13279)) + +#### Bug Fixes + +* When using TableFromRecordBatches, the resulting table columns have no data array. ([ARROW-13129](https://github.com/apache/arrow/pull/10562)) +* Fix intermittent test failures due to async memory management bug. ([ARROW-16978](https://github.com/apache/arrow/pull/13573)) + +## Go notes + +### Security + +* Updated testify dependency to address CVE-2022-28948. ([ARROW-16759](https://issues.apache.org/jira/browse/ARROW-16759)) (This was also backported to previous versions and released as patch versions v6.0.2, v7.0.1, and v8.0.1) + +### Arrow + +#### New Features + +* Dictionary Scalars are now available ([ARROW-16323](https://issues.apache.org/jira/browse/ARROW-16323)) +* Introduced a DictionaryUnifier object along with functions for unifying Chunked Arrays and Tables ([ARROW-16324](https://issues.apache.org/jira/browse/ARROW-16324)) +* New CSV examples added to documentation to demonstrate error handling ([ARROW-16450](https://issues.apache.org/jira/browse/ARROW-16450)) +* CSV Reader now supports arrow.TimestampType ([ARROW-16504](https://issues.apache.org/jira/browse/ARROW-16504)) +* JSON parsing for Temporal Types now allow passing numeric values in addition to strings for parsing. Timezones will be properly parsed if they exist in the string and a function was added to retrieve a time.Location object from a TimestampType ([ARROW-16551](https://issues.apache.org/jira/browse/ARROW-16551)) +* New utilities added to decimal128 for rescaling and easy conversion to and from float32/float64 ([ARROW-16552](https://issues.apache.org/jira/browse/ARROW-16552)) +* Arrow DataType interface now has a LayoutMethod which returns the physical layout of the given datatype such as the number of buffers, types, etc. This matches the behavior of the layout() methods in C++ for data types. ([ARROW-16556](https://issues.apache.org/jira/browse/ARROW-16556)) +* Added a SliceBuffer function to the memory package to allow better re-using of memory across buffer objects ([ARROW-16557](https://issues.apache.org/jira/browse/ARROW-16557)) +* Dictionary Arrays can now be concatenated using array.Concatenate ([ARROW-17095](https://issues.apache.org/jira/browse/ARROW-17095)) + +#### Bug Fixes + +* ipc.FileReader now properly uses the memory.Allocator interface ([ARROW-16002](https://issues.apache.org/jira/browse/ARROW-16002)) +* Addressed issue with Integration tests between Go and Java ([ARROW-16441](https://issues.apache.org/jira/browse/ARROW-16441)) +* RecordBuilder.UnmarshalJSON now properly ignores extra unknown fields rather than panicking ([ARROW-16456](https://issues.apache.org/jira/browse/ARROW-16456)) +* StructBuilder.UnmarshalJSON will no longer fail and panic when Nullable fields are missing ([ARROW-16502](https://issues.apache.org/jira/browse/ARROW-16502)) +* ipc.Reader no longer silently accepts string columns with invalid offsets, preventing unexpected panics later when writing or accessing the resulting arrays. ([ARROW-16831](https://issues.apache.org/jira/browse/ARROW-16831)) +* Arrow CSV reader no longer clobbers its reported errors and properly surfaces them ([ARROW-16926](https://issues.apache.org/jira/browse/ARROW-16926)) + +### Parquet + +#### New Features + +* The CreatedBy version string for the Parquet writer will now correctly reflect the library version, and will be updated by the release scripts ([ARROW-16484](https://issues.apache.org/jira/browse/ARROW-16484)) +* Parquet bit_packing functions now have ARM64 NEON implementations for performance ([ARROW-16486](https://issues.apache.org/jira/browse/ARROW-16486)) +* It is now possible to customize the root node in the Parquet writer instead of hardcoding it to be named "schema" with a repetition type of Repeated. This was needed to allow producing files similar to Apache Spark where the root node has a repetition type of Required. It still defaults to the spec definition of Repeated. ([ARROW-16561](https://issues.apache.org/jira/browse/ARROW-16561)) +* parquet_reader CLI mainprog has been enhanced to dump values out as JSON and CSV along with setting an output file instead of just dumping to the terminal. ([ARROW-16934](https://issues.apache.org/jira/browse/ARROW-16934)) + +#### Bug Fixes + +* Fixed a memory leak with Parquet page reading ([ARROW-16473](https://issues.apache.org/jira/browse/ARROW-16473)) +* Parquet Reader properly parallelizes column reads when the parallel option is set to true. ([ARROW-16530](https://issues.apache.org/jira/browse/ARROW-16530)) +* Fixed bug in the Bool decoder for plain encoding ([ARROW-16563](https://issues.apache.org/jira/browse/ARROW-16563)) +* Fixed a bug in the Parquet bool column reader where it failed to properly skip rows ([ARROW-16638](https://issues.apache.org/jira/browse/ARROW-16638)) +* Fixed the flakey travis ARM64 builds by reducing the size of a test case in the pqarrow unit tests to reduce the memory usage for the tests. ([ARROW-16669](https://issues.apache.org/jira/browse/ARROW-16669)) +* Parquet writer now properly handles writing arrow.NULL type arrays ([ARROW-16749](https://issues.apache.org/jira/browse/ARROW-16749)) +* Column level dictionary encoding configuration for Parquet writing now correctly respects the input value ([ARROW-16813](https://issues.apache.org/jira/browse/ARROW-16813)) +* Memory leak in DeltaByteArray encoding fixed ([ARROW-16983](https://issues.apache.org/jira/browse/ARROW-16983)) + + +## Java notes +#### New Features +* Allow overriding column nullability in arrow-jdbc ([#13558](https://github.com/apache/arrow/pull/13558)) +* Enable skip BOUNDS_CHECKING with setBytes and getBytes of ArrowBuf ([#13161](https://github.com/apache/arrow/pull/13161)) +* Initialize JNI components on use instead of statically ([#13146](https://github.com/apache/arrow/pull/13146)) +* Provide explicit JDBC column type mapping ([#13166](https://github.com/apache/arrow/pull/13166)) +* Allow duplicated field names in Java C data interface ([#13247](https://github.com/apache/arrow/pull/13247)) +* Improve and document StackTrace ([#12656](https://github.com/apache/arrow/pull/12656)) +* Keep more context when marshaling errors through JNI ([#13246](https://github.com/apache/arrow/pull/13246)) +* Make RoundingMode configurable to handle inconsistent scale in BigDecimals ([#13433](https://github.com/apache/arrow/pull/13433)) +* Improve Java dev experience with IntelliJ ([#13017](https://github.com/apache/arrow/pull/13017)) +* Implement ArrowArrayStream ([#13465](https://github.com/apache/arrow/pull/13465))) + +#### Bug Fixes +* Fix variable-width vectors in integration JSON writer ([#13676](https://github.com/apache/arrow/pull/13676)) +* Handle empty JDBC ResultSet ([#13049](https://github.com/apache/arrow/pull/13049)) +* Fix hasNext() in ArrowVectorIterator ([#13107](https://github.com/apache/arrow/pull/13107)) +* Fix ArrayConsumer when using ArrowVectorIterator ([#12692](https://github.com/apache/arrow/pull/12692)) +* Update Gandiva Protobuf library to enable builds on Apple M1 ([#13121](https://github.com/apache/arrow/pull/13121)) +* Patch dataset module testing failure with JSE11+ ([#13200](https://github.com/apache/arrow/pull/13200)) +* Don't duplicate generated Protobuf classes between flight-core and flight-sql ([#13596](https://github.com/apache/arrow/pull/13596)) + +## JavaScript notes + +* Fix error iterating tables with no batches ([ARROW-16371](https://issues.apache.org/jira/browse/ARROW-16371)) +* Handle case where `tableFromIPC` input is an async `RecordBatchReader` ([ARROW-16704](https://issues.apache.org/jira/browse/ARROW-16704)) + +## Python notes + +Compatibility notes: + +* PyArrow now requires Python >= 3.7 ([ARROW-16474](https://issues.apache.org/jira/browse/ARROW-16474)). + +* The default behaviour regarding memory mapping has changed in several APIs (reading of Feather or Parquet files, IPC RecordBatchFileReader and RecordBatchStreamReader) to disable memory mapping by default ([ARROW-16382](https://issues.apache.org/jira/browse/ARROW-16382)). + +* The default Parquet version is now 2.4 for writing, enabling use of +more recent logical types by default such as unsigned integers ([ARROW-12203](https://issues.apache.org/jira/browse/ARROW-12203)). One can specify `version="2.6"` to also enable support for nanosecond timestamps. Use `version="1.0"` to restore the old behaviour and maximizes file compatibility. + +* Some deprecated APIs (deprecated at least since pyarrow 1.0.0) have been removed: IPC methods in the top-level namespace, the `Value` scalar classes and the `pyarrow.compat` module ([ARROW-17010](https://issues.apache.org/jira/browse/ARROW-17010)). + +New features: + +* Google Cloud Storage (GCS) File System support is now available in the Python bindings ([ARROW-14892](https://issues.apache.org/jira/browse/ARROW-14892)). + +* The `Table.filter()` method now supports passing an expression in addition to a boolean array ([ARROW-16469](https://issues.apache.org/jira/browse/ARROW-16469)). + +* When implementing extension types in Python, it is now possible to also customize which Python scalar gets returned (in `Array.to_pylist()` or `Scalar.as_py()`) by subclassing `ExtensionScalar` ([ARROW-13612](https://issues.apache.org/jira/browse/ARROW-13612), ([ARROW-17065](https://issues.apache.org/jira/browse/ARROW-17065))). + +* It is now possible to register User Defined Functions (UDF) for scalar functions using `register_scalar_function` ([ARROW-15639](https://issues.apache.org/jira/browse/ARROW-15639)). + +* Basic support for consuming a Substrait plan has been exposed in Python as `pyarrow.substrait.run_query` ([ARROW-15779](https://issues.apache.org/jira/browse/ARROW-15779)). + +* The `cast` method and compute kernel now exposes the fine grained options in addition to safe/unsafe casting ([ARROW-15365](https://issues.apache.org/jira/browse/ARROW-15365)). + +In addition, this release includes several bug fixes and documention improvements (such as expanded examples in docstrings ([ARROW-16091](https://issues.apache.org/jira/browse/ARROW-16091))). + +Further, the Python bindings benefit from improvements in the C++ library +(e.g. new compute functions); see the C++ notes above for additional details. + + +## R notes + +Highlights include several new `dplyr` verbs, including `glimpse()` and `union_all()`, as well as many more datetime functions from `lubridate`. There is also experimental support for user-defined scalar functions in the query engine, and most packages include native support for datasets in Google Cloud Storage (opt-in in the Linux full source build). + +For more on what’s in the 9.0.0 R package, see the [R changelog][4]. + +## Ruby and C GLib notes + +FlightSQL is now supported but there are minimum features for now. + +More Flight features are now supported. + +### Ruby + +`Enumerable` compatible methods such as `#min` and `#max` on `Arrow::Array`, `Arrow::ChunkedArray` and `Arrow::Column` are implemented by C++'s [compute functions]({{ site.baseurl }}/docs/cpp/compute.html). This improves performance. ([ARROW-15222](https://issues.apache.org/jira/browse/ARROW-15222)) + +This release fixed some memory leaks. ([ARROW-14790](https://issues.apache.org/jira/browse/ARROW-14790)) + +This release improved support for interval type arrays such as `Arrow::MonthIntervalArray`. ([ARROW-16206](https://issues.apache.org/jira/browse/ARROW-16206)) + +This release improved auto data type conversion. ([ARROW-16874](https://issues.apache.org/jira/browse/ARROW-16874)) + +### C GLib + +Vala is now supported. ([ARROW-15671](https://issues.apache.org/jira/browse/ARROW-15671)). See [`c_glib/example/vala/`](https://github.com/apache/arrow/tree/apache-arrow-9.0.0/c_glib/example/vala) for examples. + +`GArrowQuantil +eOptions` is added. ([ARROW-16623](https://issues.apache.org/jira/browse/ARROW-16623)) + +## Rust notes + +The Rust projects have moved to separate repositories outside the +main Arrow monorepo. For notes on the 19.0.0 release of the Rust +implementation, see the [Arrow Rust changelog][5]. + +[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%209.0.0 +[2]: {{ site.baseurl }}/release/9.0.0.html#contributors +[3]: {{ site.baseurl }}/release/9.0.0.html#changelog +[4]: {{ site.baseurl }}/docs/r/news/ +[5]: https://github.com/apache/arrow-rs/blob/19.0.0/CHANGELOG.md