From 65552ab52358fd1190882f987ce514b95d6e4de6 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Tue, 12 Sep 2023 08:32:23 +0900
Subject: [PATCH 01/96] MINOR: [C#] Bump Grpc.Net.Client from 2.56.0 to 2.57.0 in /csharp (#37660)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.56.0 to 2.57.0.
Release notes

Sourced from Grpc.Net.Client's releases.

Release v2.57.0

What's Changed

Full Changelog: https://github.com/grpc/grpc-dotnet/compare/v2.56.0...v2.57.0

Release v2.57.0-pre1

What's Changed

... (truncated)

Commits

Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj b/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
index 005714ef28a18..eaa28e33ea9d4 100644
--- a/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
+++ b/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
@@ -6,7 +6,7 @@
-    <PackageReference Include="Grpc.Net.Client" Version="2.56.0" />
+    <PackageReference Include="Grpc.Net.Client" Version="2.57.0" />

From 18792852007e45889326aaf4c94b22a780521f67 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Tue, 12 Sep 2023 08:33:15 +0900
Subject: [PATCH 02/96] MINOR: [C#] Bump BenchmarkDotNet from 0.13.7 to 0.13.8 in /csharp (#37661)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet) from 0.13.7 to 0.13.8.
Release notes

Sourced from BenchmarkDotNet's releases.

0.13.8

Full changelog: https://benchmarkdotnet.org/changelog/v0.13.8.html

Highlights

This release contains important bug fixes.

What's Changed

New Contributors

Full Changelog: https://github.com/dotnet/BenchmarkDotNet/compare/v0.13.7...v0.13.8

Commits

Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 .../test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj b/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj
index a81fc15bae861..6a058a752bc2e 100644
--- a/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj
+++ b/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj
@@ -6,7 +6,7 @@
-    <PackageReference Include="BenchmarkDotNet" Version="0.13.7" />
+    <PackageReference Include="BenchmarkDotNet" Version="0.13.8" />

From b7581fee01ed0d111d5a0361c2f05779aa3c33e8 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Tue, 12 Sep 2023 08:34:12 +0900
Subject: [PATCH 03/96] MINOR: [C#] Bump Grpc.AspNetCore from 2.56.0 to 2.57.0 in /csharp (#37664)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [Grpc.AspNetCore](https://github.com/grpc/grpc-dotnet) from 2.56.0 to 2.57.0.
Release notes

Sourced from Grpc.AspNetCore's releases.

Release v2.57.0

What's Changed

Full Changelog: https://github.com/grpc/grpc-dotnet/compare/v2.56.0...v2.57.0

Release v2.57.0-pre1

What's Changed

... (truncated)

Commits
  • 7733c07 [2.57.x] Update version to 2.57.0 (#2272)
  • 4dadd82 [v2.57.x] Update version to 2.57.0-pre1 (#2266)
  • b50e46f Fix connection bugs from BalancerAddress changes (#2265)
  • 311f878 Change subchannel BalancerAddress when attributes change (#2228)
  • b421751 Fix unobserved exceptions with retries (#2255)
  • 41f67ad Log socket lifetime when closing unusable sockets (#2258)
  • a9e810c Update call debugger display to show status code (#2259)
  • 6429ae2 Reduce logger allocations by not using generic CreateLogger (#2256)
  • 3db1683 Add transport status to subchannel picked log (#2261)
  • 0104983 Update Grpc.Tools dependency to 2.57.0 (#2257)
  • Additional commits viewable in compare view

Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 .../Apache.Arrow.Flight.TestWeb.csproj | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj b/csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj
index ef7d730d2cd45..ce46466bd6ca5 100644
--- a/csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj
+++ b/csharp/test/Apache.Arrow.Flight.TestWeb/Apache.Arrow.Flight.TestWeb.csproj
@@ -5,7 +5,7 @@
-    <PackageReference Include="Grpc.AspNetCore" Version="2.56.0" />
+    <PackageReference Include="Grpc.AspNetCore" Version="2.57.0" />

From 247b7f0c067dbcdbdd69b3a970f3fbfc03b484f0 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Tue, 12 Sep 2023 09:48:39 +0900
Subject: [PATCH 04/96] MINOR: [C#] Bump BenchmarkDotNet.Diagnostics.Windows from 0.13.7 to 0.13.8 in /csharp (#37659)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [BenchmarkDotNet.Diagnostics.Windows](https://github.com/dotnet/BenchmarkDotNet) from 0.13.7 to 0.13.8.
Release notes

Sourced from BenchmarkDotNet.Diagnostics.Windows's releases.

0.13.8

Full changelog: https://benchmarkdotnet.org/changelog/v0.13.8.html

Highlights

This release contains important bug fixes.

What's Changed

New Contributors

Full Changelog: https://github.com/dotnet/BenchmarkDotNet/compare/v0.13.7...v0.13.8

Commits
  • f8de1e9 Prepare v0.13.8 changelog
  • e2e888c Use Roslyn Toolchain by default if no build settings are changed.
  • b035d90 feat: add text justification style (#2410)
  • 2a8bab5 Fixed nullability warnings for some files from BenchmarkDotNet project
  • 3860e4a Removed redundant check
  • 83fc5ed Updated CodeAnnotations to the actual version
  • 2d763cf Enable nullability for BenchmarkDotNet.Annotations
  • d391085 Update stub decoding for .NET 8 for disassemblers (#2416)
  • e0c667f - update the templates install command to reflect dotnet cli updates (#2415)
  • c35dcb2 Refactor out base TextLogger from StreamLogger (#2406)
  • Additional commits viewable in compare view

Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 .../test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj b/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj
index 6a058a752bc2e..35f17270e0b04 100644
--- a/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj
+++ b/csharp/test/Apache.Arrow.Benchmarks/Apache.Arrow.Benchmarks.csproj
@@ -7,7 +7,7 @@
-    <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.13.7" />
+    <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.13.8" />

From 5009282ddb4c980374833a95071f147a08b2b9f5 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Tue, 12 Sep 2023 09:49:15 +0900
Subject: [PATCH 05/96] MINOR: [C#] Bump Grpc.Tools from 2.57.0 to 2.58.0 in /csharp (#37662)

Bumps [Grpc.Tools](https://github.com/grpc/grpc) from 2.57.0 to 2.58.0.
Commits

Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 .../src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj | 2 +-
 csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj b/csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj
index 57bb6b6876ca8..4f785971b2849 100644
--- a/csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj
+++ b/csharp/src/Apache.Arrow.Flight.Sql/Apache.Arrow.Flight.Sql.csproj
@@ -5,7 +5,7 @@
-    <PackageReference Include="Grpc.Tools" Version="2.57.0" />
+    <PackageReference Include="Grpc.Tools" Version="2.58.0" />

diff --git a/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj b/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
index eaa28e33ea9d4..0ffb2f0a8e518 100644
--- a/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
+++ b/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj
@@ -7,7 +7,7 @@
-    <PackageReference Include="Grpc.Tools" Version="2.57.0" />
+    <PackageReference Include="Grpc.Tools" Version="2.58.0" />

From 1adc745f67791dde1d9e2574b9de52a8d00ca7e8 Mon Sep 17 00:00:00 2001
From: Arkadiusz Rudny <93520526+aru-trackunit@users.noreply.github.com>
Date: Tue, 12 Sep 2023 06:27:34 +0200
Subject: [PATCH 06/96] GH-37560 [Python][Documentation] Replacing confusing batch size from 128Ki to 128_000 (#37605)

### Rationale for this change

https://github.com/apache/arrow/issues/37560

### Are these changes tested?

-> No

### Are there any user-facing changes?

-> Documentation

* Closes: #37560

Authored-by: Arkadiusz Rudny
Signed-off-by: AlenkaF
---
 python/pyarrow/_dataset.pyx | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/python/pyarrow/_dataset.pyx b/python/pyarrow/_dataset.pyx
index 8f5688de29072..d29fa125e2061 100644
--- a/python/pyarrow/_dataset.pyx
+++ b/python/pyarrow/_dataset.pyx
@@ -319,7 +319,7 @@ cdef class Dataset(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -441,7 +441,7 @@ cdef class Dataset(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -519,7 +519,7 @@ cdef class Dataset(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -597,7 +597,7 @@ cdef class Dataset(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -675,7 +675,7 @@ cdef class Dataset(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -730,7 +730,7 @@ cdef class Dataset(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -1411,7 +1411,7 @@ cdef class Fragment(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -1491,7 +1491,7 @@ cdef class Fragment(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -1574,7 +1574,7 @@ cdef class Fragment(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -1653,7 +1653,7 @@ cdef class Fragment(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -1731,7 +1731,7 @@ cdef class Fragment(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -1786,7 +1786,7 @@ cdef class Fragment(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
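An aside on the value in the hunks above: the old and new spellings denote the same number, since 128Ki means 128 × 1024 = 131_072 rows; only the docstring notation changes. A minimal sketch of overriding the default when scanning a dataset — the `data/` path is a hypothetical stand-in for any Parquet source:

```python
import pyarrow.dataset as ds

# The documented default batch size, written both ways:
assert 128 * 1024 == 131_072

# Hypothetical Parquet directory; any dataset source works the same way.
dataset = ds.dataset("data/", format="parquet")

# Lowering batch_size trades throughput for a smaller peak memory footprint.
total_rows = 0
for batch in dataset.to_batches(batch_size=10_000):
    total_rows += batch.num_rows
print(total_rows)
```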
@@ -3436,7 +3436,7 @@ cdef class Scanner(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -3515,7 +3515,7 @@ cdef class Scanner(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
@@ -3601,7 +3601,7 @@ cdef class Scanner(_Weakrefable):
             partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
-        batch_size : int, default 128Ki
+        batch_size : int, default 131_072
             The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

From 5a1ca69d617288a12769b279502d8815a859f487 Mon Sep 17 00:00:00 2001
From: Chris Jordan-Squire <788080+chrisjordansquire@users.noreply.github.com>
Date: Tue, 12 Sep 2023 07:02:17 -0400
Subject: [PATCH 07/96] GH-37244: [Python] Remove support for pickle5 (#37644)

Resolve issue https://github.com/apache/arrow/issues/37244 by removing pickle5 usage in pyarrow.

### Rationale for this change

See issue https://github.com/apache/arrow/issues/37244.

### What changes are included in this PR?

pickle5 usage is removed from pyarrow.

### Are these changes tested?

Yes, the python test suite was run.

### Are there any user-facing changes?

No.

* Closes: #37244

Authored-by: Chris Jordan-Squire
Signed-off-by: AlenkaF
---
 python/pyarrow/compat.pxi          | 8 +-------
 python/pyarrow/io.pxi              | 3 ++-
 python/pyarrow/pandas_compat.py    | 5 +++--
 python/pyarrow/types.pxi           | 5 +++--
 python/requirements-wheel-test.txt | 1 -
 5 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/python/pyarrow/compat.pxi b/python/pyarrow/compat.pxi
index 98aa1f2433ef0..8cf106d5609b5 100644
--- a/python/pyarrow/compat.pxi
+++ b/python/pyarrow/compat.pxi
@@ -33,16 +33,10 @@ def encode_file_path(path):
 ordered_dict = dict

-try:
-    import pickle5 as builtin_pickle
-except ImportError:
-    import pickle as builtin_pickle
-
-
 try:
     import cloudpickle as pickle
 except ImportError:
-    pickle = builtin_pickle
+    import pickle

 def tobytes(o):

diff --git a/python/pyarrow/io.pxi b/python/pyarrow/io.pxi
index e3018ab4704f0..460e932b86273 100644
--- a/python/pyarrow/io.pxi
+++ b/python/pyarrow/io.pxi
@@ -21,6 +21,7 @@ from libc.stdlib cimport malloc, free
 import codecs
+import pickle
 import re
 import sys
 import threading
@@ -1368,7 +1369,7 @@ cdef class Buffer(_Weakrefable):
     def __reduce_ex__(self, protocol):
         if protocol >= 5:
-            bufobj = builtin_pickle.PickleBuffer(self)
+            bufobj = pickle.PickleBuffer(self)
         elif self.buffer.get().is_mutable():
             # Need to pass a bytearray to recreate a mutable buffer when
             # unpickling.
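For background on the `Buffer.__reduce_ex__` hunk above: `pickle.PickleBuffer` has shipped in the standard library since Python 3.8, which is why the `pickle5` backport can be dropped. A minimal sketch of protocol-5 buffer pickling, independent of pyarrow:

```python
import pickle

data = bytearray(b"arrow")
buf = pickle.PickleBuffer(data)  # wraps the buffer without copying it

# Only protocol 5 understands PickleBuffer; older protocols raise PickleError.
payload = pickle.dumps(buf, protocol=5)

# Serialized in-band, a writable buffer round-trips as a bytearray.
assert pickle.loads(payload) == bytearray(b"arrow")
```

Out-of-band transfer (the `buffer_callback`/`buffers` arguments to `dumps`/`loads`) follows the same protocol and is what makes zero-copy pickling of Arrow buffers possible.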
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 12f1cc431293c..4e5c868efd4c8 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -26,13 +26,14 @@ from itertools import zip_longest
 import json
 import operator
+import pickle
 import re
 import warnings

 import numpy as np

 import pyarrow as pa
-from pyarrow.lib import _pandas_api, builtin_pickle, frombytes  # noqa
+from pyarrow.lib import _pandas_api, frombytes  # noqa

 _logical_type_map = {}
@@ -720,7 +721,7 @@ def _reconstruct_block(item, columns=None, extension_columns=None):
                                 klass=_int.DatetimeTZBlock,
                                 dtype=dtype)
     elif 'object' in item:
-        block = _int.make_block(builtin_pickle.loads(block_arr),
+        block = _int.make_block(pickle.loads(block_arr),
                                 placement=placement)
     elif 'py_array' in item:
         # create ExtensionBlock

diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index ffaebd2418a58..9f8b347d56294 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -19,6 +19,7 @@ from cpython.pycapsule cimport PyCapsule_CheckExact, PyCapsule_GetPointer
 import atexit
 from collections.abc import Mapping
+import pickle
 import re
 import sys
 import warnings
@@ -1699,12 +1700,12 @@ cdef class PyExtensionType(ExtensionType):
                 .format(type(self).__name__))

     def __arrow_ext_serialize__(self):
-        return builtin_pickle.dumps(self)
+        return pickle.dumps(self)

     @classmethod
     def __arrow_ext_deserialize__(cls, storage_type, serialized):
         try:
-            ty = builtin_pickle.loads(serialized)
+            ty = pickle.loads(serialized)
         except Exception:
             # For some reason, it's impossible to deserialize the
             # ExtensionType instance. Perhaps the serialized data is

diff --git a/python/requirements-wheel-test.txt b/python/requirements-wheel-test.txt
index c23a30f70e838..9de0acb754079 100644
--- a/python/requirements-wheel-test.txt
+++ b/python/requirements-wheel-test.txt
@@ -1,7 +1,6 @@
 cffi
 cython
 hypothesis
-pickle5; platform_system != "Windows" and python_version < "3.8"
 pytest
 pytest-lazy-fixture
 pytz

From f71594c982f115e01463c352b95bea0722bcd866 Mon Sep 17 00:00:00 2001
From: david dali susanibar arce
Date: Tue, 12 Sep 2023 08:40:13 -0500
Subject: [PATCH 08/96] GH-37216: [Docs] adding documentation to deal with unreleased allocators (#37498)

### Rationale for this change

To close #37216

### What changes are included in this PR?

Documentation added to try to catch unreleased allocations.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: #37216

Lead-authored-by: david dali susanibar arce
Co-authored-by: David Li
Signed-off-by: David Li
---
 docs/source/java/memory.rst | 49 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/docs/source/java/memory.rst b/docs/source/java/memory.rst
index af6c0abc7c82a..036befa148692 100644
--- a/docs/source/java/memory.rst
+++ b/docs/source/java/memory.rst
@@ -133,7 +133,7 @@ Development Guidelines
 Applications should generally:

 * Use the BufferAllocator interface in APIs instead of RootAllocator.
-* Create one RootAllocator at the start of the program.
+* Create one RootAllocator at the start of the program and explicitly pass it when needed.
 * ``close()`` allocators after use (whether they are child allocators or the RootAllocator), either manually or preferably via a try-with-resources statement.
@@ -288,6 +288,53 @@ Finally, enabling the ``TRACE`` logging level will automatically provide this st
    | at RootAllocator.close (RootAllocator.java:29)
    | at (#8:1)

+Sometimes, explicitly passing allocators around is difficult. For example, it
+can be hard to pass around extra state, like an allocator, through layers of
+existing application or framework code. A global or singleton allocator instance
+can be useful here, though it should not be your first choice.
+
+How this works:
+
+1. Set up a global allocator in a singleton class.
+2. Provide methods to create child allocators from the global allocator.
+3. Give child allocators proper names to make it easier to figure out where
+   allocations occurred in case of errors.
+4. Ensure that resources are properly closed.
+5. Check that the global allocator is empty at some suitable point, such as
+   right before program shutdown.
+6. If it is not empty, review the above allocation bugs.
+
+.. code-block:: java
+
+    //1
+    private static final BufferAllocator allocator = new RootAllocator();
+    private static final AtomicInteger childNumber = new AtomicInteger(0);
+    ...
+    //2
+    public static BufferAllocator getChildAllocator() {
+      return allocator.newChildAllocator(nextChildName(), 0, Long.MAX_VALUE);
+    }
+    ...
+    //3
+    private static String nextChildName() {
+      return "Allocator-Child-" + childNumber.incrementAndGet();
+    }
+    ...
+    //4: Business code
+    try (BufferAllocator allocator = GlobalAllocator.getChildAllocator()) {
+      ...
+    }
+    ...
+    //5
+    public static void checkGlobalCleanUpResources() {
+      ...
+      if (!allocator.getChildAllocators().isEmpty()) {
+        throw new IllegalStateException(...);
+      } else if (allocator.getAllocatedMemory() != 0) {
+        throw new IllegalStateException(...);
+      }
+    }
+
 .. _`ArrowBuf`: https://arrow.apache.org/docs/java/reference/org/apache/arrow/memory/ArrowBuf.html
 .. _`ArrowBuf.print()`: https://arrow.apache.org/docs/java/reference/org/apache/arrow/memory/ArrowBuf.html#print-java.lang.StringBuilder-int-org.apache.arrow.memory.BaseAllocator.Verbosity-
 .. _`BufferAllocator`: https://arrow.apache.org/docs/java/reference/org/apache/arrow/memory/BufferAllocator.html

From 47bf6e9d9eb60fe873e42261db5cae3480d1cc73 Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Tue, 12 Sep 2023 17:49:41 +0100
Subject: [PATCH 09/96] GH-37671: [R] legacy timezone symlinks cause CRAN failures (#37672)

### Rationale for this change

A tzdata update causes our tests to fail on CRAN due to our use of legacy timezones (i.e. starting with "US/") in some of our tests.

### What changes are included in this PR?

Update the timezones

### Are these changes tested?

No

### Are there any user-facing changes?
No

* Closes: #37671

Authored-by: Nic Crane
Signed-off-by: Nic Crane
---
 r/tests/testthat/helper-data.R               |  4 ++--
 r/tests/testthat/test-Array.R                |  4 ++--
 r/tests/testthat/test-dplyr-funcs-datetime.R | 16 ++++++++--------
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/r/tests/testthat/helper-data.R b/r/tests/testthat/helper-data.R
index 1088be6850143..0631cfccae3fc 100644
--- a/r/tests/testthat/helper-data.R
+++ b/r/tests/testthat/helper-data.R
@@ -59,9 +59,9 @@ haven_data <- tibble::tibble(
 example_with_times <- tibble::tibble(
   date = Sys.Date() + 1:10,
   posixct = lubridate::ymd_hms("2018-10-07 19:04:05") + 1:10,
-  posixct_tz = lubridate::ymd_hms("2018-10-07 19:04:05", tz = "US/Eastern") + 1:10,
+  posixct_tz = lubridate::ymd_hms("2018-10-07 19:04:05", tz = "America/New_York") + 1:10,
   posixlt = as.POSIXlt(lubridate::ymd_hms("2018-10-07 19:04:05") + 1:10),
-  posixlt_tz = as.POSIXlt(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "US/Eastern") + 1:10)
+  posixlt_tz = as.POSIXlt(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "America/New_York") + 1:10)
 )

 verses <- list(

diff --git a/r/tests/testthat/test-Array.R b/r/tests/testthat/test-Array.R
index 960faa8bb751b..b29c1f4e09dde 100644
--- a/r/tests/testthat/test-Array.R
+++ b/r/tests/testthat/test-Array.R
@@ -283,8 +283,8 @@ test_that("array supports POSIXct (ARROW-3340)", {
   times[5] <- NA
   expect_array_roundtrip(times, timestamp("us", "UTC"))

-  times2 <- lubridate::ymd_hms("2018-10-07 19:04:05", tz = "US/Eastern") + 1:10
-  expect_array_roundtrip(times2, timestamp("us", "US/Eastern"))
+  times2 <- lubridate::ymd_hms("2018-10-07 19:04:05", tz = "America/New_York") + 1:10
+  expect_array_roundtrip(times2, timestamp("us", "America/New_York"))
 })

 test_that("array uses local timezone for POSIXct without timezone", {

diff --git a/r/tests/testthat/test-dplyr-funcs-datetime.R b/r/tests/testthat/test-dplyr-funcs-datetime.R
index bcd2584851b70..e707a194a3626 100644
--- a/r/tests/testthat/test-dplyr-funcs-datetime.R
+++ b/r/tests/testthat/test-dplyr-funcs-datetime.R
@@ -3606,7 +3606,7 @@ test_that("with_tz() and force_tz() works", {
       "2012-01-01 01:02:03"
     ), tz = "UTC")

-  timestamps_non_utc <- force_tz(timestamps, "US/Central")
+  timestamps_non_utc <- force_tz(timestamps, "America/Chicago")

   nonexistent <- as_datetime(c(
     "2015-03-29 02:30:00",
@@ -3622,10 +3622,10 @@
     .input %>%
       mutate(
         timestamps_with_tz_1 = with_tz(timestamps, "UTC"),
-        timestamps_with_tz_2 = with_tz(timestamps, "US/Central"),
+        timestamps_with_tz_2 = with_tz(timestamps, "America/Chicago"),
         timestamps_with_tz_3 = with_tz(timestamps, "Asia/Kolkata"),
         timestamps_force_tz_1 = force_tz(timestamps, "UTC"),
-        timestamps_force_tz_2 = force_tz(timestamps, "US/Central"),
+        timestamps_force_tz_2 = force_tz(timestamps, "America/Chicago"),
         timestamps_force_tz_3 = force_tz(timestamps, "Asia/Kolkata")
       ) %>%
       collect(),
@@ -3636,7 +3636,7 @@
     .input %>%
       mutate(
         timestamps_with_tz_1 = with_tz(timestamps, "UTC"),
-        timestamps_with_tz_2 = with_tz(timestamps, "US/Central"),
+        timestamps_with_tz_2 = with_tz(timestamps, "America/Chicago"),
         timestamps_with_tz_3 = with_tz(timestamps, "Asia/Kolkata")
       ) %>%
       collect(),
@@ -3733,17 +3733,17 @@ test_that("with_tz() and force_tz() can add timezone to timestamp without timezo
   expect_equal(
     arrow_table(timestamps = timestamps) %>%
-      mutate(timestamps = with_tz(timestamps, "US/Central")) %>%
+      mutate(timestamps = with_tz(timestamps, "America/Chicago")) %>%
       compute(),
-    arrow_table(timestamps = timestamps$cast(timestamp("s", "US/Central")))
+    arrow_table(timestamps = timestamps$cast(timestamp("s", "America/Chicago")))
   )

   expect_equal(
     arrow_table(timestamps = timestamps) %>%
-      mutate(timestamps = force_tz(timestamps, "US/Central")) %>%
+      mutate(timestamps = force_tz(timestamps, "America/Chicago")) %>%
       compute(),
     arrow_table(
-      timestamps = call_function("assume_timezone", timestamps, options = list(timezone = "US/Central"))
+      timestamps = call_function("assume_timezone", timestamps, options = list(timezone = "America/Chicago"))
     )
   )
 })

From 1940cbbb3a5224243e429f9577604bb7b276cf4f Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Tue, 12 Sep 2023 20:03:31 +0100
Subject: [PATCH 10/96] GH-37681: [R] Update NEWS.md for 13.0.0.1 (#37682)

### Rationale for this change

Update R changelog

### What changes are included in this PR?

Update R changelog

### Are these changes tested?

No

### Are there any user-facing changes?

No

* Closes: #37681

Authored-by: Nic Crane
Signed-off-by: Nic Crane
---
 r/NEWS.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/r/NEWS.md b/r/NEWS.md
index d80efbf8de18e..2e2db1ad5d3fa 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,6 +19,10 @@

 # arrow 13.0.0.9000

+# arrow 13.0.0.1
+
+* Remove reference to legacy timezones to prevent CRAN check failures (#37671)
+
 # arrow 13.0.0

 ## Breaking changes

From 8e4c826130df1592c7de0d2c221c16369e23c960 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Wed, 13 Sep 2023 08:28:47 +0900
Subject: [PATCH 11/96] MINOR: [C#] Bump Grpc.AspNetCore.Server from 2.56.0 to 2.57.0 in /csharp (#37663)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

**This PR includes breaking changes to public APIs.** Apache.Arrow.Flight.AspNetCore requires .NET 6.0 or later.

Bumps [Grpc.AspNetCore.Server](https://github.com/grpc/grpc-dotnet) from 2.56.0 to 2.57.0.
Release notes

Sourced from Grpc.AspNetCore.Server's releases.

Release v2.57.0

What's Changed

Full Changelog: https://github.com/grpc/grpc-dotnet/compare/v2.56.0...v2.57.0

Release v2.57.0-pre1

What's Changed

... (truncated)

Commits
  • 7733c07 [2.57.x] Update version to 2.57.0 (#2272)
  • 4dadd82 [v2.57.x] Update version to 2.57.0-pre1 (#2266)
  • b50e46f Fix connection bugs from BalancerAddress changes (#2265)
  • 311f878 Change subchannel BalancerAddress when attributes change (#2228)
  • b421751 Fix unobserved exceptions with retries (#2255)
  • 41f67ad Log socket lifetime when closing unusable sockets (#2258)
  • a9e810c Update call debugger display to show status code (#2259)
  • 6429ae2 Reduce logger allocations by not using generic CreateLogger (#2256)
  • 3db1683 Add transport status to subchannel picked log (#2261)
  • 0104983 Update Grpc.Tools dependency to 2.57.0 (#2257)
  • Additional commits viewable in compare view

Lead-authored-by: Eric Erhardt
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 .../Apache.Arrow.Flight.AspNetCore.csproj | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/csharp/src/Apache.Arrow.Flight.AspNetCore/Apache.Arrow.Flight.AspNetCore.csproj b/csharp/src/Apache.Arrow.Flight.AspNetCore/Apache.Arrow.Flight.AspNetCore.csproj
index c794e1a4f5089..845f2667970e4 100644
--- a/csharp/src/Apache.Arrow.Flight.AspNetCore/Apache.Arrow.Flight.AspNetCore.csproj
+++ b/csharp/src/Apache.Arrow.Flight.AspNetCore/Apache.Arrow.Flight.AspNetCore.csproj
@@ -1,11 +1,11 @@
-    <TargetFramework>netcoreapp3.1</TargetFramework>
+    <TargetFramework>net6.0</TargetFramework>
-    <PackageReference Include="Grpc.AspNetCore.Server" Version="2.56.0" />
+    <PackageReference Include="Grpc.AspNetCore.Server" Version="2.57.0" />

From 2a56d4598a3a401447c91645448853b940e86fbe Mon Sep 17 00:00:00 2001
From: Matt Topol
Date: Tue, 12 Sep 2023 22:21:21 -0400
Subject: [PATCH 12/96] GH-37636: [Go] Bump minimum go versions (#37637)

Updating the Github Actions workflows to only test Go 1.19 and Go 1.20 instead of Go 1.17 and Go 1.18. Also bumping the default Go version in the `.env` file to update the docker images.

* Closes: #37636

Lead-authored-by: Matt Topol
Co-authored-by: Sutou Kouhei
Signed-off-by: Jacob Wujciak-Jens
---
 .env                                      |   4 +-
 .github/workflows/go.yml                  |  74 ++++-------
 ci/docker/conda-integration.dockerfile    |   2 +-
 ci/docker/debian-11-go.dockerfile         |   4 +-
 go/arrow/compute/executor.go              |   3 +
 go/arrow/compute/exprs/builders_test.go   |   2 +-
 go/go.mod                                 |  54 ++++----
 go/go.sum                                 | 150 +++++++++-------------
 go/internal/hashing/hash_string_go1.19.go |  11 +-
 9 files changed, 130 insertions(+), 174 deletions(-)

diff --git a/.env b/.env
index c9cd6c8094ed8..014bad3fe2a7a 100644
--- a/.env
+++ b/.env
@@ -58,8 +58,8 @@ CUDA=11.0.3
 DASK=latest
 DOTNET=7.0
 GCC_VERSION=""
-GO=1.17
-STATICCHECK=v0.2.2
+GO=1.19.13
+STATICCHECK=v0.4.5
 HDFS=3.2.1
 JDK=8
 KARTOTHEK=latest

diff --git a/.github/workflows/go.yml b/.github/workflows/go.yml
index c31ad0b77c2df..3c695891b48d6 100644
--- a/.github/workflows/go.yml
+++ b/.github/workflows/go.yml
@@ -54,28 +54,23 @@ jobs:
       include:
         - arch-label: AMD64
           arch: amd64
-          go: 1.17
+          go: 1.19
           runs-on: ubuntu-latest
-          staticcheck: v0.2.2
         - arch-label: AMD64
           arch: amd64
-          go: 1.18
+          go: '1.20'
           runs-on: ubuntu-latest
-          staticcheck: v0.3.3
         - arch-label: ARM64
          arch: arm64v8
-          go: 1.17
-          staticcheck: v0.2.2
+          go: 1.19
           runs-on: ["self-hosted", "arm", "linux"]
         - arch-label: ARM64
           arch: arm64v8
-          go: 1.18
-          staticcheck: v0.3.3
+          go: '1.20'
           runs-on: ["self-hosted", "arm", "linux"]
     env:
       ARCH: ${{ matrix.arch }}
       GO: ${{ matrix.go }}
-      STATICCHECK: ${{ matrix.staticcheck }}
     steps:
       - name: Checkout Arrow
         uses: actions/checkout@v4
@@ -145,7 +140,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.18
+          go-version: 1.19
           cache: true
           cache-dependency-path: go/go.sum
       - name: Run build
@@ -161,15 +156,9 @@ jobs:
     strategy:
       fail-fast: false
      matrix:
-        go: [1.17, 1.18]
-        include:
-          - go: 1.17
-            staticcheck: v0.2.2
-          - go: 1.18
-            staticcheck: v0.3.3
+        go: [1.19, '1.20']
     env:
       GO: ${{ matrix.go }}
-      STATICCHECK: ${{ matrix.staticcheck }}
     steps:
       - name: Checkout Arrow
         uses: actions/checkout@v4
@@ -208,15 +197,9 @@
     strategy:
       fail-fast: false
       matrix:
-        go: [1.17, 1.18]
-        include:
-          - go: 1.17
-            staticcheck: v0.2.2
-          - go: 1.18
-            staticcheck: v0.3.3
+        go: [1.19, '1.20']
     env:
       GO: ${{ matrix.go }}
-      STATICCHECK: ${{ matrix.staticcheck }}
     steps:
       - name: Checkout Arrow
         uses: actions/checkout@v4
@@ -253,12 +236,7 @@
     strategy:
       fail-fast: false
       matrix:
-        go: [1.17, 1.18]
-        include:
-          - go: 1.17
-            staticcheck: v0.2.2
-          - go: 1.18
-            staticcheck: v0.3.3
+        go: [1.19, '1.20']
     steps:
       - name: Checkout Arrow
         uses: actions/checkout@v4
@@ -272,7 +250,10 @@
           cache: true
           cache-dependency-path: go/go.sum
       - name: Install staticcheck
-        run: go install honnef.co/go/tools/cmd/staticcheck@${{ matrix.staticcheck }}
+        shell: bash
+        run: |
+          . .env
+          go install honnef.co/go/tools/cmd/staticcheck@${STATICCHECK}
       - name: Build
         shell: bash
         run: ci/scripts/go_build.sh $(pwd)
@@ -288,12 +269,7 @@
     strategy:
       fail-fast: false
       matrix:
-        go: [1.17, 1.18]
-        include:
-          - go: 1.17
-            staticcheck: v0.2.2
-          - go: 1.18
-            staticcheck: v0.3.3
+        go: [1.19, '1.20']
     steps:
       - name: Checkout Arrow
         uses: actions/checkout@v4
@@ -306,8 +282,10 @@
           go-version: ${{ matrix.go }}
           cache: true
           cache-dependency-path: go/go.sum
-      - name: Install staticcheck
-        run: go install honnef.co/go/tools/cmd/staticcheck@${{ matrix.staticcheck }}
+      - name: Install staticcheck
+        run: |
+          . .env
+          go install honnef.co/go/tools/cmd/staticcheck@${STATICCHECK}
       - name: Build
         shell: bash
         run: ci/scripts/go_build.sh $(pwd)
@@ -349,12 +327,7 @@
     strategy:
       fail-fast: false
       matrix:
-        go: [1.17, 1.18]
-        include:
-          - go: 1.17
-            staticcheck: v0.2.2
-          - go: 1.18
-            staticcheck: v0.3.3
+        go: [1.19, '1.20']
     env:
       ARROW_GO_TESTCGO: "1"
     steps:
@@ -373,7 +346,9 @@
         shell: bash
         run: brew install apache-arrow pkg-config
       - name: Install staticcheck
-        run: go install honnef.co/go/tools/cmd/staticcheck@${{ matrix.staticcheck }}
+        run: |
+          . .env
+          go install honnef.co/go/tools/cmd/staticcheck@${STATICCHECK}
       - name: Add To pkg config path
         shell: bash
         run: |
@@ -430,11 +405,14 @@
       - name: Install go
         uses: actions/setup-go@v4
         with:
-          go-version: '1.18'
+          go-version: '1.19'
           cache: true
           cache-dependency-path: go/go.sum
       - name: Install staticcheck
-        run: go install honnef.co/go/tools/cmd/staticcheck@v0.3.3
+        shell: bash
+        run: |
+          . .env
+          go install honnef.co/go/tools/cmd/staticcheck@${STATICCHECK}
       - name: Build
         shell: bash
         run: ci/scripts/go_build.sh $(pwd)

diff --git a/ci/docker/conda-integration.dockerfile b/ci/docker/conda-integration.dockerfile
index 43d7e7ab0b60d..a306790b5cb6d 100644
--- a/ci/docker/conda-integration.dockerfile
+++ b/ci/docker/conda-integration.dockerfile
@@ -24,7 +24,7 @@ ARG maven=3.5
 ARG node=16
 ARG yarn=1.22
 ARG jdk=8
-ARG go=1.15
+ARG go=1.19.13

 # Install Archery and integration dependencies
 COPY ci/conda_env_archery.txt /arrow/ci/

diff --git a/ci/docker/debian-11-go.dockerfile b/ci/docker/debian-11-go.dockerfile
index 9f75bf23fddf2..de8186b9b8e1c 100644
--- a/ci/docker/debian-11-go.dockerfile
+++ b/ci/docker/debian-11-go.dockerfile
@@ -16,8 +16,8 @@
 # under the License.

 ARG arch=amd64
-ARG go=1.17
-ARG staticcheck=v0.2.2
+ARG go=1.19
+ARG staticcheck=v0.4.5
 FROM ${arch}/golang:${go}-bullseye

 # FROM collects all the args, get back the staticcheck version arg

diff --git a/go/arrow/compute/executor.go b/go/arrow/compute/executor.go
index ac87d063915b7..6da7ed1293065 100644
--- a/go/arrow/compute/executor.go
+++ b/go/arrow/compute/executor.go
@@ -1007,6 +1007,9 @@ func (v *vectorExecutor) WrapResults(ctx context.Context, out <-chan Datum, hasC
 		case <-ctx.Done():
 			return nil
 		case output = <-out:
+			if output == nil {
+				return nil
+			}
 			// if the inputs contained at least one chunked array
 			// then we want to return chunked output
 			if hasChunked {

diff --git a/go/arrow/compute/exprs/builders_test.go b/go/arrow/compute/exprs/builders_test.go
index 9aaa4a2c4f9e4..e42d7569a8f03 100644
--- a/go/arrow/compute/exprs/builders_test.go
+++ b/go/arrow/compute/exprs/builders_test.go
@@ -37,7 +37,7 @@ func TestNewScalarFunc(t *testing.T) {
 	require.NoError(t, err)
 	assert.Equal(t, "add(i32(1), i32(10), {overflow: [ERROR]}) => i32", fn.String())
-	assert.Equal(t, "add:i32_i32", fn.Name())
+	assert.Equal(t, "add:i32_i32", fn.CompoundName())
 }

 func TestFieldRefDotPath(t *testing.T) {

diff --git a/go/go.mod b/go/go.mod
index 46c093ed1ece2..a5581eb3925ca 100644
--- a/go/go.mod
+++ b/go/go.mod
@@ -20,60 +20,60 @@ go 1.20
 require (
 	github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c
-	github.com/andybalholm/brotli v1.0.4
-	github.com/apache/thrift v0.16.0
+	github.com/andybalholm/brotli v1.0.5
+	github.com/apache/thrift v0.17.0
 	github.com/docopt/docopt-go v0.0.0-20180111231733-ee0de3bc6815
-	github.com/goccy/go-json v0.10.0
+	github.com/goccy/go-json v0.10.2
 	github.com/golang/snappy v0.0.4
-	github.com/google/flatbuffers v23.1.21+incompatible
+	github.com/google/flatbuffers v23.5.26+incompatible
 	github.com/klauspost/asmfmt v1.3.2
-	github.com/klauspost/compress v1.15.15
-	github.com/klauspost/cpuid/v2 v2.2.3
+	github.com/klauspost/compress v1.16.7
+	github.com/klauspost/cpuid/v2 v2.2.5
 	github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8
 	github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3
-	github.com/pierrec/lz4/v4 v4.1.17
-	github.com/stretchr/testify v1.8.1
+	github.com/pierrec/lz4/v4 v4.1.18
+	github.com/stretchr/testify v1.8.4
 	github.com/zeebo/xxh3 v1.0.2
-	golang.org/x/exp v0.0.0-20230206171751-46f607a40771
-	golang.org/x/sync v0.1.0
-	golang.org/x/sys v0.5.0
-	golang.org/x/tools v0.6.0
+	golang.org/x/exp v0.0.0-20230905200255-921286631fa9
+	golang.org/x/sync v0.3.0
+	golang.org/x/sys v0.12.0
+	golang.org/x/tools v0.13.0
 	golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2
 	gonum.org/v1/gonum v0.12.0
-	google.golang.org/grpc v1.53.0
-	google.golang.org/protobuf v1.28.1
-	modernc.org/sqlite v1.20.4
+	google.golang.org/grpc v1.54.0
+	google.golang.org/protobuf v1.31.0
+	modernc.org/sqlite v1.21.2
 )

 require (
 	github.com/google/uuid v1.3.0
-	github.com/substrait-io/substrait-go v0.2.1-0.20230517203920-30fa08bd57d0
+	github.com/substrait-io/substrait-go v0.4.2
 )

 require (
-	github.com/alecthomas/participle/v2 v2.0.0 // indirect
+	github.com/alecthomas/participle/v2 v2.1.0 // indirect
 	github.com/davecgh/go-spew v1.1.1 // indirect
 	github.com/dustin/go-humanize v1.0.1 // indirect
-	github.com/fatih/color v1.13.0 // indirect
-	github.com/goccy/go-yaml v1.9.8 // indirect
-	github.com/golang/protobuf v1.5.2 // indirect
+	github.com/fatih/color v1.15.0 // indirect
+	github.com/goccy/go-yaml v1.11.0 // indirect
+	github.com/golang/protobuf v1.5.3 // indirect
 	github.com/kballard/go-shellquote v0.0.0-20180428030007-95032a82bc51 // indirect
 	github.com/kr/text v0.2.0 // indirect
 	github.com/mattn/go-colorable v0.1.13 // indirect
-	github.com/mattn/go-isatty v0.0.17 // indirect
+	github.com/mattn/go-isatty v0.0.19 // indirect
 	github.com/pmezard/go-difflib v1.0.0 // indirect
 	github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
 	github.com/rogpeppe/go-internal v1.9.0 // indirect
 	github.com/stretchr/objx v0.5.0 // indirect
-	golang.org/x/mod v0.8.0 // indirect
-	golang.org/x/net v0.7.0 // indirect
-	golang.org/x/text v0.7.0 // indirect
-	google.golang.org/genproto v0.0.0-20230209215440-0dfe4f8abfcc // indirect
+	golang.org/x/mod v0.12.0 // indirect
+	golang.org/x/net v0.15.0 // indirect
+	golang.org/x/text v0.13.0 // indirect
+	google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
-	lukechampine.com/uint128 v1.2.0 // indirect
+	lukechampine.com/uint128 v1.3.0 // indirect
 	modernc.org/cc/v3 v3.40.0 // indirect
 	modernc.org/ccgo/v3 v3.16.13 // indirect
-	modernc.org/libc v1.22.2 // indirect
+	modernc.org/libc v1.22.4 // indirect
 	modernc.org/mathutil v1.5.0 // indirect
 	modernc.org/memory v1.5.0 // indirect
 	modernc.org/opt v0.1.3 // indirect

diff --git a/go/go.sum b/go/go.sum
index 0ccd809f50fae..609cf7173ef98 100644
--- a/go/go.sum
+++ b/go/go.sum
@@ -1,13 +1,13 @@
 github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c h1:RGWPOewvKIROun94nF7v2cua9qP+thov/7M50KEoeSU=
 github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c/go.mod h1:X0CRv0ky0k6m906ixxpzmDRLvX58TFUKS2eePweuyxk=
-github.com/alecthomas/assert/v2 v2.2.2 h1:Z/iVC0xZfWTaFNE6bA3z07T86hd45Xe2eLt6WVy2bbk=
-github.com/alecthomas/participle/v2 v2.0.0 h1:Fgrq+MbuSsJwIkw3fEj9h75vDP0Er5JzepJ0/HNHv0g=
-github.com/alecthomas/participle/v2 v2.0.0/go.mod h1:rAKZdJldHu8084ojcWevWAL8KmEU+AT+Olodb+WoN2Y=
+github.com/alecthomas/assert/v2 v2.3.0 h1:mAsH2wmvjsuvyBvAmCtm7zFsBlb8mIHx5ySLVdDZXL0=
+github.com/alecthomas/participle/v2 v2.1.0 h1:z7dElHRrOEEq45F2TG5cbQihMtNTv8vwldytDj7Wrz4=
+github.com/alecthomas/participle/v2 v2.1.0/go.mod h1:Y1+hAs8DHPmc3YUFzqllV+eSQ9ljPTk0ZkPMtEdAx2c=
 github.com/alecthomas/repr v0.2.0 h1:HAzS41CIzNW5syS8Mf9UwXhNH1J9aix/BvDRf1Ml2Yk=
-github.com/andybalholm/brotli v1.0.4 h1:V7DdXeJtZscaqfNuAdSRuRFzuiKlHSC/Zh3zl9qY3JY=
-github.com/andybalholm/brotli v1.0.4/go.mod h1:fO7iG3H7G2nSZ7m0zPUDn85XEX2GTukHGRSepvi9Eig=
-github.com/apache/thrift v0.16.0 h1:qEy6UW60iVOlUy+b9ZR0d5WzUWYGOo4HfopoyBaNmoY=
-github.com/apache/thrift v0.16.0/go.mod h1:PHK3hniurgQaNMZYaCLEqXKsYK8upmhPbmdP2FXSqgU=
+github.com/andybalholm/brotli v1.0.5 h1:8uQZIdzKmjc/iuPu7O2ioW48L81FgatrcpfFmiq/cCs=
+github.com/andybalholm/brotli v1.0.5/go.mod h1:fO7iG3H7G2nSZ7m0zPUDn85XEX2GTukHGRSepvi9Eig=
+github.com/apache/thrift v0.17.0 h1:cMd2aj52n+8VoAtvSvLn4kDC3aZ6IAkBuqWQ2IDu7wo=
+github.com/apache/thrift v0.17.0/go.mod h1:OLxhMRJxomX+1I/KUw03qoV3mMz16BwaKI+d4fPBx7Q=
 github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E=
 github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
@@ -16,28 +16,22 @@ github.com/docopt/docopt-go v0.0.0-20180111231733-ee0de3bc6815 h1:bWDMxwH3px2JBh
 github.com/docopt/docopt-go v0.0.0-20180111231733-ee0de3bc6815/go.mod h1:WwZ+bS3ebgob9U8Nd0kOddGdZWjyMGR8Wziv+TBNwSE=
 github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
 github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
-github.com/fatih/color v1.10.0/go.mod h1:ELkj/draVOlAH/xkhN6mQ50Qd0MPOk5AAr3maGEBuJM=
-github.com/fatih/color v1.13.0 h1:8LOYc1KYPPmyKMuN8QV2DNRWNbLo6LZ0iLs8+mlH53w=
-github.com/fatih/color v1.13.0/go.mod h1:kLAiJbzzSOZDVNGyDpeOxJ47H46qBXwg5ILebYFFOfk=
-github.com/go-playground/assert/v2 v2.0.1/go.mod h1:VDjEfimB/XKnb+ZQfWdccd7VUvScMdVu0Titje2rxJ4=
+github.com/fatih/color v1.15.0 h1:kOqh6YHBtK8aywxGerMG2Eq3H6Qgoqeo13Bk2Mv/nBs=
+github.com/fatih/color v1.15.0/go.mod h1:0h5ZqXfHYED7Bhv2ZJamyIOUej9KtShiJESRwBDUSsw=
 github.com/go-playground/locales v0.13.0 h1:HyWk6mgj5qFqCT5fjGBuRArbVDfE4hi8+e8ceBS/t7Q=
-github.com/go-playground/locales v0.13.0/go.mod h1:taPMhCMXrRLJO55olJkUXHZBHCxTMfnGwq/HNwmWNS8=
 github.com/go-playground/universal-translator v0.17.0 h1:icxd5fm+REJzpZx7ZfpaD876Lmtgy7VtROAbHHXk8no=
-github.com/go-playground/universal-translator v0.17.0/go.mod h1:UkSxE5sNxxRwHyU+Scu5vgOQjsIJAF8j9muTVoKLVtA=
 github.com/go-playground/validator/v10 v10.4.1 h1:pH2c5ADXtd66mxoE0Zm9SUhxE20r7aM3F26W0hOn+GE=
-github.com/go-playground/validator/v10 v10.4.1/go.mod h1:nlOn6nFhuKACm19sB/8EGNn9GlaMV7XkbRSipzJ0Ii4=
-github.com/goccy/go-json v0.10.0 h1:mXKd9Qw4NuzShiRlOXKews24ufknHO7gx30lsDyokKA=
-github.com/goccy/go-json v0.10.0/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
-github.com/goccy/go-yaml v1.9.8 h1:5gMyLUeU1/6zl+WFfR1hN7D2kf+1/eRGa7DFtToiBvQ=
-github.com/goccy/go-yaml v1.9.8/go.mod h1:JubOolP3gh0HpiBc4BLRD4YmjEjHAmIIB2aaXKkTfoE=
-github.com/golang/mock v1.5.0/go.mod h1:CWnOUgYIOo4TcNZ0wHX3YZCqsaM1I1Jvs6v3mP3KVu8=
+github.com/goccy/go-json v0.10.2 h1:CrxCmQqYDkv1z7lO7Wbh2HN93uovUHgrECaO5ZrCXAU=
+github.com/goccy/go-json v0.10.2/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
+github.com/goccy/go-yaml v1.11.0 h1:n7Z+zx8S9f9KgzG6KtQKf+kwqXZlLNR2F6018Dgau54=
+github.com/goccy/go-yaml v1.11.0/go.mod h1:H+mJrWtjPTJAHvRbV09MCK9xYwODM+wRTVFFTWckfng=
 github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=
-github.com/golang/protobuf v1.5.2 h1:ROPKBNFfQgOUMifHyP+KYbvpjbdoFNs+aK7DXlji0Tw=
-github.com/golang/protobuf v1.5.2/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY=
+github.com/golang/protobuf v1.5.3 h1:KhyjKVUg7Usr/dYsdSqoFveMYd5ko72D+zANwlG1mmg=
+github.com/golang/protobuf v1.5.3/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY=
 github.com/golang/snappy v0.0.4 h1:yAGX7huGHXlcLOEtBnF4w7FQwA26wojNCwOYAEhLjQM=
 github.com/golang/snappy v0.0.4/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
-github.com/google/flatbuffers v23.1.21+incompatible h1:bUqzx/MXCDxuS0hRJL2EfjyZL3uQrPbMocUa8zGqsTA=
-github.com/google/flatbuffers v23.1.21+incompatible/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=
+github.com/google/flatbuffers v23.5.26+incompatible h1:M9dgRyhJemaM4Sw8+66GHBu8ioaQmyPLg1b8VwK5WJg=
+github.com/google/flatbuffers v23.5.26+incompatible/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=
 github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
 github.com/google/go-cmp v0.5.9 h1:O2Tfq5qg4qc4AmwVlvv0oLiVAGB7enBSJ2x2DqQFi38=
 github.com/google/pprof v0.0.0-20221118152302-e6195bd50e26 h1:Xim43kblpZXfIBQsbuBVKCudVG457BR2GZFIz3uw3hQ=
@@ -48,31 +42,26 @@ github.com/kballard/go-shellquote v0.0.0-20180428030007-95032a82bc51 h1:Z9n2FFNU
 github.com/kballard/go-shellquote v0.0.0-20180428030007-95032a82bc51/go.mod h1:CzGEWj7cYgsdH8dAjBGEr58BoE7ScuLd+fwFZ44+/x8=
 github.com/klauspost/asmfmt v1.3.2 h1:4Ri7ox3EwapiOjCki+hw14RyKk201CN4rzyCJRFLpK4=
 github.com/klauspost/asmfmt v1.3.2/go.mod h1:AG8TuvYojzulgDAMCnYn50l/5QV3Bs/tp6j0HLHbNSE=
-github.com/klauspost/compress v1.15.15 h1:EF27CXIuDsYJ6mmvtBRlEuB2UVOqHG1tAXgZ7yIO+lw=
-github.com/klauspost/compress v1.15.15/go.mod h1:ZcK2JAFqKOpnBlxcLsJzYfrS9X1akm9fHZNnD9+Vo/4=
-github.com/klauspost/cpuid/v2 v2.2.3 h1:sxCkb+qR91z4vsqw4vGGZlDgPz3G7gjaLyK3V8y70BU=
-github.com/klauspost/cpuid/v2 v2.2.3/go.mod h1:RVVoqg1df56z8g3pUjL/3lE5UfnlrJX8tyFgg4nqhuY=
+github.com/klauspost/compress v1.16.7 h1:2mk3MPGNzKyxErAw8YaohYh69+pa4sIQSC0fPGCFR9I=
+github.com/klauspost/compress v1.16.7/go.mod h1:ntbaceVETuRiXiv4DpjP66DpAtAGkEQskQzEyD//IeE=
+github.com/klauspost/cpuid/v2 v2.2.5 h1:0E5MSMDEoAulmXNFquVs//DdoomxaoTY1kUhbc/qbZg=
+github.com/klauspost/cpuid/v2 v2.2.5/go.mod h1:Lcz8mBdAVJIBVzewtcLocK12l3Y+JytZYpaMropDUws=
 github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
 github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
 github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
 github.com/leodido/go-urn v1.2.0 h1:hpXL4XnriNwQ/ABnpepYM/1vCLWNDfUNts8dX3xTG6Y=
-github.com/leodido/go-urn v1.2.0/go.mod h1:+8+nEpDfqqsY+g338gtMEUOtuK+4dEMhiQEgxpxOKII=
-github.com/mattn/go-colorable v0.1.8/go.mod h1:u6P/XSegPjTcexA+o6vUJrdnUu04hMope9wVRipJSqc=
-github.com/mattn/go-colorable v0.1.9/go.mod h1:u6P/XSegPjTcexA+o6vUJrdnUu04hMope9wVRipJSqc=
 github.com/mattn/go-colorable v0.1.13 h1:fFA4WZxdEF4tXPZVKMLwD8oUnCTTo08duU7wxecdEvA=
 github.com/mattn/go-colorable v0.1.13/go.mod h1:7S9/ev0klgBDR4GtXTXX8a3vIGJpMovkB8vQcUbaXHg=
-github.com/mattn/go-isatty v0.0.12/go.mod h1:cbi8OIDigv2wuxKPP5vlRcQ1OAZbq2CE4Kysco4FUpU=
-github.com/mattn/go-isatty v0.0.14/go.mod h1:7GGIvUiUoEMVVmxf/4nioHXj79iQHKdU27kJ6hsGG94=
 github.com/mattn/go-isatty v0.0.16/go.mod h1:kYGgaQfpe5nmfYZH+SKPsOc2e4SrIfOl2e/yFXSvRLM=
-github.com/mattn/go-isatty v0.0.17 h1:BTarxUcIeDqL27Mc+vyvdWYSL28zpIhv3RoTdsLMPng=
-github.com/mattn/go-isatty v0.0.17/go.mod h1:kYGgaQfpe5nmfYZH+SKPsOc2e4SrIfOl2e/yFXSvRLM=
-github.com/mattn/go-sqlite3 v1.14.15 h1:vfoHhTN1af61xCRSWzFIWzx2YskyMTwHLrExkBOjvxI=
+github.com/mattn/go-isatty v0.0.19 h1:JITubQf0MOLdlGRuRq+jtsDlekdYPia9ZFsB8h/APPA=
+github.com/mattn/go-isatty v0.0.19/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
+github.com/mattn/go-sqlite3 v1.14.16 h1:yOQRA0RpS5PFz/oikGwBEqvAWhWg5ufRz4ETLjwpU1Y=
 github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 h1:AMFGa4R4MiIpspGNG7Z948v4n35fFGB3RR3G/ry4FWs=
 github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8/go.mod h1:mC1jAcsrzbxHt8iiaC+zU4b1ylILSosueou12R++wfY=
 github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 h1:+n/aFZefKZp7spd8DFdX7uMikMLXX4oubIzJF4kv/wI=
 github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3/go.mod h1:RagcQ7I8IeTMnF8JTXieKnO4Z6JCsikNEzj0DwauVzE=
-github.com/pierrec/lz4/v4 v4.1.17 h1:kV4Ip+/hUBC+8T6+2EgburRtkE9ef4nbY3f4dFhGjMc=
-github.com/pierrec/lz4/v4 v4.1.17/go.mod h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
+github.com/pierrec/lz4/v4 v4.1.18 h1:xaKrnTkyoqfh1YItXl56+6KJNVYWlEEPuAQW9xsplYQ=
+github.com/pierrec/lz4/v4 v4.1.18/go.mod h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/remyoudompheng/bigfft v0.0.0-20200410134404-eec4a21b6bb0/go.mod h1:qqbHyh8v60DhA7CoWK5oRCqLrMHRGoxYCSS9EjAz6Eo=
@@ -84,93 +73,72 @@ github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+
 github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
 github.com/stretchr/objx v0.5.0 h1:1zr/of2m5FGMsad5YfcqgdqdWrIhu+EBEJRhR1U7z/c=
 github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
-github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
 github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
 github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
-github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk=
-github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
-github.com/substrait-io/substrait-go v0.2.1-0.20230517203920-30fa08bd57d0 h1:ULhfcCHY7uxA133qmInVpNpqfjyicryPXIaxCjbDVbw=
-github.com/substrait-io/substrait-go v0.2.1-0.20230517203920-30fa08bd57d0/go.mod h1:qhpnLmrcvAnlZsUyPXZRqldiHapPTXC3t7xFgDi3aQg=
+github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
+github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
+github.com/substrait-io/substrait-go v0.4.2 h1:buDnjsb3qAqTaNbOR7VKmNgXf4lYQxWEcnSGUWBtmN8=
+github.com/substrait-io/substrait-go v0.4.2/go.mod h1:qhpnLmrcvAnlZsUyPXZRqldiHapPTXC3t7xFgDi3aQg=
 github.com/zeebo/assert v1.3.0 h1:g7C04CbJuIDKNPFHmsk4hwZDO5O+kntRxzaUoNXj+IQ=
 github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
 github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
-golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
-golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
-golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
-golang.org/x/crypto v0.1.0 h1:MDRAIl0xIo9Io2xV565hzXHw3zVseKrJKodhohM5CjU=
-golang.org/x/exp v0.0.0-20230206171751-46f607a40771 h1:xP7rWLUr1e1n2xkK5YB4LI0hPEy3LJC6Wk+D4pGlOJg=
-golang.org/x/exp v0.0.0-20230206171751-46f607a40771/go.mod h1:CxIveKay+FTh1D0yPZemJVgC/95VzuuOLq5Qi4xnoYc=
-golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
-golang.org/x/mod v0.8.0 h1:LUYupSeNrTNCGzR/hVBk2NHZO4hXcVaW1k4Qx7rjPx8=
-golang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=
-golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
-golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
-golang.org/x/net v0.7.0 h1:rJrUqqhjsgNp7KqAIc25s9pZnjU7TUcSY7HcVZjdn1g=
-golang.org/x/net v0.7.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=
-golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
-golang.org/x/sync v0.1.0 h1:wsuoTGHzEhffawBOhz5CYhcrV4IdKZbEyZjBMuTp12o=
-golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
-golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
-golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
-golang.org/x/sys v0.0.0-20200116001909-b77594299b42/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
-golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae/go.mod
h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= -golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.0.0-20220406163625-3f8b81556e12/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.0.0-20220704084225-05e143d24a9e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/crypto v0.13.0 h1:mvySKfSWJ+UKUii46M40LOvyWfN0s2U+46/jDd0e6Ck= +golang.org/x/exp v0.0.0-20230905200255-921286631fa9 h1:GoHiUyI/Tp2nVkLI2mCxVkOjsbSXD66ic0XW0js0R9g= +golang.org/x/exp v0.0.0-20230905200255-921286631fa9/go.mod h1:S2oDrQGGwySpoQPVqRShND87VCbxmc6bL1Yd2oYrm6k= +golang.org/x/mod v0.12.0 h1:rmsUpXtvNzj340zd98LZ4KntptpfRHwpFOHG188oHXc= +golang.org/x/mod v0.12.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs= +golang.org/x/net v0.15.0 h1:ugBLEUaxABaB5AJqW9enI0ACdci2RUd4eP51NTBvuJ8= +golang.org/x/net v0.15.0/go.mod h1:idbUs1IY1+zTqbi8yxTbhexhEEk5ur9LInksu6HrEpk= +golang.org/x/sync v0.3.0 h1:ftCYgMx6zT/asHUrPw8BLLscYtGznsLAnjq5RH9P66E= +golang.org/x/sync v0.3.0/go.mod h1:FU7BRWz2tNW+3quACPkgCx/L+uEAv1htQ0V83Z9Rj+Y= golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.5.0 h1:MUK/U/4lj1t1oPg0HfuXDN/Z1wv31ZJ/YcPiGccS4DU= golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= -golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk= -golang.org/x/text v0.7.0 h1:4BRB4x83lYWy72KwLD/qYDuTu7q9PjSagHvijDw7cLo= -golang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8= -golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= -golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= -golang.org/x/tools v0.6.0 h1:BOw41kyTf3PuCW1pVQf8+Cyg8pMlkYB1oo9iJ6D/lKM= -golang.org/x/tools v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU= -golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= +golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.12.0 h1:CM0HF96J0hcLAwsHPJZjfdNzs0gftsLfgKt57wWHJ0o= +golang.org/x/sys v0.12.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/text v0.13.0 h1:ablQoSUd0tRdKxZewP80B+BaqeKJuVhuRxj/dkrun3k= +golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE= +golang.org/x/tools v0.13.0 h1:Iey4qkscZuv0VvIt8E0neZjtPVQFSc870HQ448QgEmQ= +golang.org/x/tools v0.13.0/go.mod h1:HvlwmtVNQAhOuCjW7xxvovg8wbNq7LwfXh/k7wXUl58= golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 h1:H2TDz8ibqkAF6YGhCdN3jS9O0/s90v0rJh3X/OLHEUk= golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2/go.mod h1:K8+ghG5WaK9qNqU5K3HdILfMLy1f3aNYFI/wnl100a8= gonum.org/v1/gonum v0.12.0 h1:xKuo6hzt+gMav00meVPUlXwSdoEJP46BR+wdxQEFK2o= gonum.org/v1/gonum v0.12.0/go.mod h1:73TDxJfAAHeA8Mk9mf8NlIppyhQNo5GLTcYeqgo2lvY= -google.golang.org/genproto v0.0.0-20230209215440-0dfe4f8abfcc 
h1:ijGwO+0vL2hJt5gaygqP2j6PfflOBrRot0IczKbmtio= -google.golang.org/genproto v0.0.0-20230209215440-0dfe4f8abfcc/go.mod h1:RGgjbofJ8xD9Sq1VVhDM1Vok1vRONV+rg+CjzG4SZKM= -google.golang.org/grpc v1.53.0 h1:LAv2ds7cmFV/XTS3XG1NneeENYrXGmorPxsBbptIjNc= -google.golang.org/grpc v1.53.0/go.mod h1:OnIrk0ipVdj4N5d9IUoFUx72/VlD7+jUsHwZgwSMQpw= +google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1 h1:KpwkzHKEF7B9Zxg18WzOa7djJ+Ha5DzthMyZYQfEn2A= +google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1/go.mod h1:nKE/iIaLqn2bQwXBg8f1g2Ylh6r5MN5CmZvuzZCgsCU= +google.golang.org/grpc v1.54.0 h1:EhTqbhiYeixwWQtAEZAxmV9MGqcjEU2mFx52xCzNyag= +google.golang.org/grpc v1.54.0/go.mod h1:PUSEXI6iWghWaB6lXM4knEgpJNu2qUcKfDtNci3EC2g= google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw= google.golang.org/protobuf v1.26.0/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc= -google.golang.org/protobuf v1.28.1 h1:d0NfwRgPtno5B1Wa6L2DAG+KivqkdutMf1UhdNx175w= -google.golang.org/protobuf v1.28.1/go.mod h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I= +google.golang.org/protobuf v1.31.0 h1:g0LDEJHgrBl9N9r17Ru3sqWhkIx2NB67okBHPwC7hs8= +google.golang.org/protobuf v1.31.0/go.mod h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= -gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= -lukechampine.com/uint128 v1.2.0 h1:mBi/5l91vocEN8otkC5bDLhi2KdCticRiwbdB0O+rjI= -lukechampine.com/uint128 v1.2.0/go.mod h1:c4eWIwlEGaxC/+H1VguhU4PHXNWDCDMUlWdIWl2j1gk= +lukechampine.com/uint128 v1.3.0 h1:cDdUVfRwDUDovz610ABgFD17nXD4/uDgVHl2sC3+sbo= +lukechampine.com/uint128 v1.3.0/go.mod h1:c4eWIwlEGaxC/+H1VguhU4PHXNWDCDMUlWdIWl2j1gk= modernc.org/cc/v3 v3.40.0 h1:P3g79IUS/93SYhtoeaHW+kRCIrYaxJ27MFPv+7kaTOw= modernc.org/cc/v3 v3.40.0/go.mod h1:/bTg4dnWkSXowUO6ssQKnOV0yMVxDYNIsIrzqTFDGH0= modernc.org/ccgo/v3 v3.16.13 h1:Mkgdzl46i5F/CNR/Kj80Ri59hC8TKAhZrYSaqvkwzUw= modernc.org/ccgo/v3 v3.16.13/go.mod h1:2Quk+5YgpImhPjv2Qsob1DnZ/4som1lJTodubIcoUkY= modernc.org/ccorpus v1.11.6 h1:J16RXiiqiCgua6+ZvQot4yUuUy8zxgqbqEEUuGPlISk= modernc.org/httpfs v1.0.6 h1:AAgIpFZRXuYnkjftxTAZwMIiwEqAfk8aVB2/oA6nAeM= -modernc.org/libc v1.22.2 h1:4U7v51GyhlWqQmwCHj28Rdq2Yzwk55ovjFrdPjs8Hb0= -modernc.org/libc v1.22.2/go.mod h1:uvQavJ1pZ0hIoC/jfqNoMLURIMhKzINIWypNM17puug= +modernc.org/libc v1.22.4 h1:wymSbZb0AlrjdAVX3cjreCHTPCpPARbQXNz6BHPzdwQ= +modernc.org/libc v1.22.4/go.mod h1:jj+Z7dTNX8fBScMVNRAYZ/jF91K8fdT2hYMThc3YjBY= modernc.org/mathutil v1.5.0 h1:rV0Ko/6SfM+8G+yKiyI830l3Wuz1zRutdslNoQ0kfiQ= modernc.org/mathutil v1.5.0/go.mod h1:mZW8CKdRPY1v87qxC/wUdX5O1qDzXMP5TH3wjfpga6E= modernc.org/memory v1.5.0 h1:N+/8c5rE6EqugZwHii4IFsaJ7MUhoWX07J5tC/iI5Ds= modernc.org/memory v1.5.0/go.mod h1:PkUhL0Mugw21sHPeskwZW4D6VscE/GQJOnIpCnW6pSU= modernc.org/opt v0.1.3 h1:3XOZf2yznlhC+ibLltsDGzABUGVx8J6pnFMS3E4dcq4= modernc.org/opt v0.1.3/go.mod h1:WdSiB5evDcignE70guQKxYUl14mgWtbClRi5wmkkTX0= -modernc.org/sqlite v1.20.4 h1:J8+m2trkN+KKoE7jglyHYYYiaq5xmz2HoHJIiBlRzbE= -modernc.org/sqlite v1.20.4/go.mod h1:zKcGyrICaxNTMEHSr1HQ2GUraP0j+845GYw37+EyT6A= 
+modernc.org/sqlite v1.21.2 h1:ixuUG0QS413Vfzyx6FWx6PYTmHaOegTY+hjzhn7L+a0= +modernc.org/sqlite v1.21.2/go.mod h1:cxbLkB5WS32DnQqeH4h4o1B0eMr8W/y8/RGuxQ3JsC0= modernc.org/strutil v1.1.3 h1:fNMm+oJklMGYfU9Ylcywl0CO5O6nTfaowNsh2wpPjzY= modernc.org/strutil v1.1.3/go.mod h1:MEHNA7PdEnEwLvspRMtWTNnp2nnyvMfkimT1NKNAGbw= -modernc.org/tcl v1.15.0 h1:oY+JeD11qVVSgVvodMJsu7Edf8tr5E/7tuhF5cNYz34= +modernc.org/tcl v1.15.1 h1:mOQwiEK4p7HruMZcwKTZPw/aqtGM4aY00uzWhlKKYws= modernc.org/token v1.1.0 h1:Xl7Ap9dKaEs5kLoOQeQmPWevfnk/DM5qcLcYlA8ys6Y= modernc.org/token v1.1.0/go.mod h1:UGzOrNV1mAFSEB63lOFHIpNRUVMvYTc6yu1SMY/XTDM= modernc.org/z v1.7.0 h1:xkDw/KepgEjeizO2sNco+hqYkU12taxQFqPEmgm1GWE= diff --git a/go/internal/hashing/hash_string_go1.19.go b/go/internal/hashing/hash_string_go1.19.go index c496f43abdcc6..f38eb5c523dde 100644 --- a/go/internal/hashing/hash_string_go1.19.go +++ b/go/internal/hashing/hash_string_go1.19.go @@ -24,7 +24,14 @@ import ( ) func hashString(val string, alg uint64) uint64 { - buf := *(*[]byte)(unsafe.Pointer(&val)) - (*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(val) + if val == "" { + return Hash([]byte{}, alg) + } + // highly efficient way to get byte slice without copy before + // the introduction of unsafe.StringData in go1.20 + // (https://stackoverflow.com/questions/59209493/how-to-use-unsafe-get-a-byte-slice-from-a-string-without-memory-copy) + const MaxInt32 = 1<<31 - 1 + buf := (*[MaxInt32]byte)(unsafe.Pointer((*reflect.StringHeader)( + unsafe.Pointer(&val)).Data))[: len(val)&MaxInt32 : len(val)&MaxInt32] return Hash(buf, alg) } From 4fac528e2ee9c4efce453420275cbaf0b3b6adb0 Mon Sep 17 00:00:00 2001 From: Jacob Wujciak-Jens Date: Wed, 13 Sep 2023 11:24:17 +0200 Subject: [PATCH 13/96] GH-37639: [CI] Fix checkout on older OSes (#37640) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### Rationale for this change Jobs are failing due to not being supported by node 20 ### What changes are included in this PR? Using v3 where needed. ### Are these changes tested? 
Crossbow

* Closes: #37639

Lead-authored-by: Jacob Wujciak-Jens
Co-authored-by: Raúl Cumplido
Signed-off-by: Raúl Cumplido
---
 dev/tasks/macros.jinja                | 4 ++--
 dev/tasks/r/github.macos.autobrew.yml | 2 +-
 dev/tasks/r/github.packages.yml       | 3 ++-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/dev/tasks/macros.jinja b/dev/tasks/macros.jinja
index 06b9390c0f974..faf77a1168d1b 100644
--- a/dev/tasks/macros.jinja
+++ b/dev/tasks/macros.jinja
@@ -25,9 +25,9 @@ on:
       - "*-github-*"
 {% endmacro %}
 
-{%- macro github_checkout_arrow(fetch_depth=1, submodules="recursive") -%}
+{%- macro github_checkout_arrow(fetch_depth=1, submodules="recursive", action_v="4") -%}
   - name: Checkout Arrow
-    uses: actions/checkout@v4
+    uses: actions/checkout@v{{ action_v }}
     with:
       fetch-depth: {{ fetch_depth }}
      path: arrow
diff --git a/dev/tasks/r/github.macos.autobrew.yml b/dev/tasks/r/github.macos.autobrew.yml
index 28733dbfef148..b8e23690e2090 100644
--- a/dev/tasks/r/github.macos.autobrew.yml
+++ b/dev/tasks/r/github.macos.autobrew.yml
@@ -34,7 +34,7 @@ jobs:
         - "{{ macros.r_release.ver }}"
         - "{{ macros.r_oldrel.ver }}"
     steps:
-      {{ macros.github_checkout_arrow()|indent }}
+      {{ macros.github_checkout_arrow(action_v='3')|indent }}
       - name: Configure autobrew script
         run: |
           # minio and sccache are pre-installed on the self-hosted 10.13 runner
diff --git a/dev/tasks/r/github.packages.yml b/dev/tasks/r/github.packages.yml
index e3e3d34e156dc..dbe21ffb6b160 100644
--- a/dev/tasks/r/github.packages.yml
+++ b/dev/tasks/r/github.packages.yml
@@ -262,7 +262,8 @@ jobs:
       # Get the arrow checkout just for the docker config scripts
       # Don't need submodules for this (hence false arg to macro): they fail on
       # actions/checkout for some reason in this context
-      {{ macros.github_checkout_arrow(1, false)|indent }}
+      {{ macros.github_checkout_arrow(1, false, '3')|indent }}
+
       - name: Install system requirements
         env:
           ARROW_R_DEV: "TRUE" # To install curl/openssl in r_docker_configure.sh

From 15a8ac3ce4e3ac31f9f361770ad4a38c69102aa1 Mon Sep 17 00:00:00 2001
From: andrewchambers
Date: Thu, 14 Sep 2023 05:03:11 +1200
Subject: [PATCH 14/96] GH-37687: [Go] Don't copy in realloc when capacity is
 sufficient. (#37688)

This removes excessive copies observed in some benchmarks.

* Closes: #37687

Authored-by: Andrew Chambers
Signed-off-by: Matt Topol
---
 go/arrow/memory/go_allocator.go | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/go/arrow/memory/go_allocator.go b/go/arrow/memory/go_allocator.go
index 1dea4a8d23385..1017eb688d2ff 100644
--- a/go/arrow/memory/go_allocator.go
+++ b/go/arrow/memory/go_allocator.go
@@ -32,10 +32,9 @@ func (a *GoAllocator) Allocate(size int) []byte {
 }
 
 func (a *GoAllocator) Reallocate(size int, b []byte) []byte {
-	if size == len(b) {
-		return b
+	if cap(b) >= size {
+		return b[:size]
 	}
-
 	newBuf := a.Allocate(size)
 	copy(newBuf, b)
 	return newBuf

From 396b4759bfed70ad1f5d7724baaa7ee81654c6ea Mon Sep 17 00:00:00 2001
From: Alenka Frim
Date: Thu, 14 Sep 2023 12:42:48 +0200
Subject: [PATCH 15/96] GH-37555: [Python] Update get_file_info_selector to
 ignore base directory (#37558)

### Rationale for this change

There have been some changes in the way fsspec lists directories in the new version 2023.9.0, see https://github.com/fsspec/filesystem_spec/pull/1329, which caused our tests to start failing.

### What changes are included in this PR?
This PR updates the `get_file_info_selector` in the [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: #37555

Authored-by: AlenkaF
Signed-off-by: AlenkaF
---
 python/pyarrow/fs.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/python/pyarrow/fs.py b/python/pyarrow/fs.py
index 567bea8ac05e8..36655c7d12863 100644
--- a/python/pyarrow/fs.py
+++ b/python/pyarrow/fs.py
@@ -356,7 +356,12 @@ def get_file_info_selector(self, selector):
             selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
         )
         for path, info in selected_files.items():
-            infos.append(self._create_file_info(path, info))
+            _path = path.strip("/")
+            base_dir = selector.base_dir.strip("/")
+            # Need to exclude base directory from selected files if present
+            # (fsspec filesystems, see GH-37555)
+            if _path != base_dir:
+                infos.append(self._create_file_info(path, info))
 
         return infos

From 28266f1f173f27c0db2aafd9497d4af7eb3f441c Mon Sep 17 00:00:00 2001
From: Slobodan Ilic
Date: Thu, 14 Sep 2023 17:46:32 +0200
Subject: [PATCH 16/96] MINOR: [Python][Docs] Add examples for
 `MapArray.from_arrays` (#37656)

### Rationale for this change

This PR enriches `MapArray.from_arrays` with examples. The examples are drawn from a real-world scenario of working with survey data (scaled down, of course).

### What changes are included in this PR?

The only change that this PR presents is to the docstring of the `MapArray.from_arrays` function.

### Are these changes tested?

Does not apply

### Are there any user-facing changes?

Yes, the docstring of the `MapArray.from_arrays` function.

Lead-authored-by: Slobodan Ilic
Co-authored-by: Slobodan Ilic
Co-authored-by: Alenka Frim
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: AlenkaF
---
 python/pyarrow/array.pxi | 73 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index e26b1ad3291b5..e36d8b2f04315 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -2363,6 +2363,79 @@ cdef class MapArray(ListArray):
         Returns
         -------
         map_array : MapArray
+
+        Examples
+        --------
+        First, let's understand the structure of our dataset when viewed in a rectangular data model.
+        The total of 5 respondents answered the question "How much did you like the movie x?".
+        The value -1 in the integer array means that the value is missing. The boolean array
+        represents the null bitmask corresponding to the missing values in the integer array.
+
+        >>> import pyarrow as pa
+        >>> movies_rectangular = np.ma.masked_array([
+        ...     [10, -1, -1],
+        ...     [8, 4, 5],
+        ...     [-1, 10, 3],
+        ...     [-1, -1, -1],
+        ...     [-1, -1, -1]
+        ...     ],
+        ...     [
+        ...     [False, True, True],
+        ...     [False, False, False],
+        ...     [True, False, False],
+        ...     [True, True, True],
+        ...     [True, True, True],
+        ...     ])
+
+        To represent the same data with the MapArray and from_arrays, the data is
+        formed like this:
+
+        >>> offsets = [
+        ...     0, # -- row 1 start
+        ...     1, # -- row 2 start
+        ...     4, # -- row 3 start
+        ...     6, # -- row 4 start
+        ...     6, # -- row 5 start
+        ...     6, # -- row 5 end
+        ... ]
+        >>> movies = [
+        ...     "Dark Knight", # ---------------------------------- row 1
+        ...     "Dark Knight", "Meet the Parents", "Superman", # -- row 2
+        ...     "Meet the Parents", "Superman", # ----------------- row 3
+        ... ]
+        >>> likings = [
+        ...     10, # -------- row 1
+        ...     8, 4, 5, # --- row 2
+        ...     10, 3 # ------ row 3
+        ... ]
+        >>> pa.MapArray.from_arrays(offsets, movies, likings).to_pandas()
+        0 [(Dark Knight, 10)]
+        1 [(Dark Knight, 8), (Meet the Parents, 4), (Sup...
+        2 [(Meet the Parents, 10), (Superman, 3)]
+        3 []
+        4 []
+        dtype: object
+
+        If the data in the empty rows needs to be marked as missing, it's possible
+        to do so by modifying the offsets argument, so that we specify `None` as
+        the starting positions of the rows we want marked as missing. The end row
+        offset still has to refer to the existing value from keys (and values):
+
+        >>> offsets = [
+        ...     0, # ----- row 1 start
+        ...     1, # ----- row 2 start
+        ...     4, # ----- row 3 start
+        ...     None, # -- row 4 start
+        ...     None, # -- row 5 start
+        ...     6, # ----- row 5 end
+        ... ]
+        >>> pa.MapArray.from_arrays(offsets, movies, likings).to_pandas()
+        0 [(Dark Knight, 10)]
+        1 [(Dark Knight, 8), (Meet the Parents, 4), (Sup...
+        2 [(Meet the Parents, 10), (Superman, 3)]
+        3 None
+        4 None
+        dtype: object
         """
         cdef:
             Array _offsets, _keys, _items

From 670cf3b820ade8f553a81ba5b3346b74734cf972 Mon Sep 17 00:00:00 2001
From: Matthias Loibl
Date: Thu, 14 Sep 2023 18:12:57 +0100
Subject: [PATCH 17/96] GH-37694: [Go] Add SetNull to array builders (#37695)

### Rationale for this change

We are proposing to add a `SetNull(i int)` method to the `Builder` interface. The underlying `builder` type implements this for most types.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

`SetNull` is added to the `Builder` interface.

* Closes: #37694

Authored-by: Matthias Loibl
Signed-off-by: Matt Topol
---
 go/arrow/array/builder.go      | 10 ++++++++++
 go/arrow/array/builder_test.go | 24 +++++++++++++++++++++++
 go/arrow/array/map.go          |  4 ++++
 go/arrow/array/map_test.go     | 35 ++++++++++++++++++++++++++++++++++
 4 files changed, 73 insertions(+)

diff --git a/go/arrow/array/builder.go b/go/arrow/array/builder.go
index aada095d099b8..58d4a0f4b8895 100644
--- a/go/arrow/array/builder.go
+++ b/go/arrow/array/builder.go
@@ -86,6 +86,9 @@ type Builder interface {
 	// IsNull returns if a previously appended value at a given index is null or not.
 	IsNull(i int) bool
 
+	// SetNull sets the value at index i to null.
+ SetNull(i int) + UnsafeAppendBoolToBitmap(bool) init(capacity int) @@ -126,6 +129,13 @@ func (b *builder) IsNull(i int) bool { return b.nullBitmap.Len() != 0 && bitutil.BitIsNotSet(b.nullBitmap.Bytes(), i) } +func (b *builder) SetNull(i int) { + if i < 0 || i >= b.length { + panic("arrow/array: index out of range") + } + bitutil.ClearBit(b.nullBitmap.Bytes(), i) +} + func (b *builder) init(capacity int) { toAlloc := bitutil.CeilByte(capacity) / 8 b.nullBitmap = memory.NewResizableBuffer(b.mem) diff --git a/go/arrow/array/builder_test.go b/go/arrow/array/builder_test.go index eeb7a2ac46b3f..3cacb54f725e7 100644 --- a/go/arrow/array/builder_test.go +++ b/go/arrow/array/builder_test.go @@ -97,3 +97,27 @@ func TestBuilder_IsNull(t *testing.T) { assert.Equal(t, i%2 != 0, b.IsNull(i)) } } + +func TestBuilder_SetNull(t *testing.T) { + b := &builder{mem: memory.NewGoAllocator()} + n := 32 + b.init(n) + + for i := 0; i < n; i++ { + // Set everything to true + b.UnsafeAppendBoolToBitmap(true) + } + for i := 0; i < n; i++ { + if i%2 == 0 { // Set all even numbers to null + b.SetNull(i) + } + } + + for i := 0; i < n; i++ { + if i%2 == 0 { + assert.True(t, b.IsNull(i)) + } else { + assert.False(t, b.IsNull(i)) + } + } +} diff --git a/go/arrow/array/map.go b/go/arrow/array/map.go index 84c1b55a0cf21..4fe860f26ef61 100644 --- a/go/arrow/array/map.go +++ b/go/arrow/array/map.go @@ -234,6 +234,10 @@ func (b *MapBuilder) AppendNulls(n int) { } } +func (b *MapBuilder) SetNull(i int) { + b.listBuilder.SetNull(i) +} + func (b *MapBuilder) AppendEmptyValue() { b.Append(true) } diff --git a/go/arrow/array/map_test.go b/go/arrow/array/map_test.go index cfb1cac87bedc..3fe78549ec803 100644 --- a/go/arrow/array/map_test.go +++ b/go/arrow/array/map_test.go @@ -217,3 +217,38 @@ func TestMapStringRoundTrip(t *testing.T) { assert.True(t, array.Equal(arr, arr1)) } + +func TestMapBuilder_SetNull(t *testing.T) { + pool := memory.NewCheckedAllocator(memory.NewGoAllocator()) + defer pool.AssertSize(t, 0) + + var ( + arr *array.Map + equalValid = []bool{true, true, true, true, true, true, true} + equalOffsets = []int32{0, 1, 2, 5, 6, 7, 8, 10} + equalKeys = []string{"a", "a", "a", "b", "c", "a", "a", "a", "a", "b"} + equalValues = []int32{1, 2, 3, 4, 5, 2, 2, 2, 5, 6} + ) + + bldr := array.NewMapBuilder(pool, arrow.BinaryTypes.String, arrow.PrimitiveTypes.Int32, false) + defer bldr.Release() + + kb := bldr.KeyBuilder().(*array.StringBuilder) + ib := bldr.ItemBuilder().(*array.Int32Builder) + + bldr.AppendValues(equalOffsets, equalValid) + for _, k := range equalKeys { + kb.Append(k) + } + ib.AppendValues(equalValues, nil) + + bldr.SetNull(0) + bldr.SetNull(3) + + arr = bldr.NewMapArray() + defer arr.Release() + + assert.True(t, arr.IsNull(0)) + assert.True(t, arr.IsValid(1)) + assert.True(t, arr.IsNull(3)) +} From 49890e94a0e878c60d8b4a62f48665f494ab2067 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 15 Sep 2023 11:32:28 +0900 Subject: [PATCH 18/96] GH-37715: [Packaging][CentOS] Use default g++ on CentOS 9 Stream (#37718) ### Rationale for this change We can use default g++ by using shared LLVM library. ### What changes are included in this PR? Use default g++ and remove needless `llvm-static`. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. 
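To make the effect of the `SetNull` builder method from [PATCH 17/96] above concrete, here is a minimal usage sketch. It is not part of either patch, and the module import path (`go/v14`) is an assumption that may need adjusting to the version in tree:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
)

func main() {
	// Build an int32 array, then retroactively mark index 1 as null
	// via the SetNull method now exposed on the Builder interface.
	bldr := array.NewInt32Builder(memory.DefaultAllocator)
	defer bldr.Release()

	bldr.AppendValues([]int32{1, 2, 3}, nil)
	bldr.SetNull(1)

	arr := bldr.NewInt32Array()
	defer arr.Release()

	// true: index 1 was cleared in the validity bitmap
	fmt.Println(arr.IsNull(1))
}
```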
* Closes: #37715 Authored-by: Sutou Kouhei Signed-off-by: Sutou Kouhei --- .../apache-arrow/yum/centos-9-stream/Dockerfile | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/dev/tasks/linux-packages/apache-arrow/yum/centos-9-stream/Dockerfile b/dev/tasks/linux-packages/apache-arrow/yum/centos-9-stream/Dockerfile index 513a63fee8128..b1e1630103c34 100644 --- a/dev/tasks/linux-packages/apache-arrow/yum/centos-9-stream/Dockerfile +++ b/dev/tasks/linux-packages/apache-arrow/yum/centos-9-stream/Dockerfile @@ -18,15 +18,12 @@ ARG FROM=quay.io/centos/centos:stream9 FROM ${FROM} -ENV SCL=gcc-toolset-12 - ARG DEBUG RUN \ quiet=$([ "${DEBUG}" = "yes" ] || echo "--quiet") && \ dnf install -y ${quiet} epel-release && \ dnf install --enablerepo=crb -y ${quiet} \ - ${SCL} \ bison \ boost-devel \ brotli-devel \ @@ -46,7 +43,6 @@ RUN \ libarchive \ libzstd-devel \ llvm-devel \ - llvm-static \ lz4-devel \ make \ ncurses-devel \ @@ -65,11 +61,3 @@ RUN \ vala \ zlib-devel && \ dnf clean ${quiet} all - -# Workaround: We can remove this once redhat-rpm-config uses "annobin" -# not "gcc-annobin". -RUN \ - sed \ - -i \ - -e 's/gcc-annobin/annobin/g' \ - /usr/lib/rpm/redhat/redhat-annobin-select-gcc-built-plugin From 783a0023ffdb020b8bf20098e6bffff463a83541 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Fri, 15 Sep 2023 11:33:37 +0900 Subject: [PATCH 19/96] GH-37648: [Packaging][Linux] Fix libarrow-glib-dev/arrow-glib-devel dependencies (#37714) ### Rationale for this change Apache Arrow C GLib depends on Acero. So `libarrow-glib-dev`/`arrow-glib-devel` should depend on `libarrow-acero-dev`/`arrow-acero-devel`. ### What changes are included in this PR? Fix dependencies. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * Closes: #37648 Authored-by: Sutou Kouhei Signed-off-by: Sutou Kouhei --- dev/tasks/linux-packages/apache-arrow/debian/control.in | 2 +- dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/tasks/linux-packages/apache-arrow/debian/control.in b/dev/tasks/linux-packages/apache-arrow/debian/control.in index 8c1bab8d058da..f08fc05bfc3ad 100644 --- a/dev/tasks/linux-packages/apache-arrow/debian/control.in +++ b/dev/tasks/linux-packages/apache-arrow/debian/control.in @@ -317,7 +317,7 @@ Multi-Arch: same Depends: ${misc:Depends}, libglib2.0-dev, - libarrow-dev (= ${binary:Version}), + libarrow-acero-dev (= ${binary:Version}), libarrow-glib1400 (= ${binary:Version}), gir1.2-arrow-1.0 (= ${binary:Version}) Suggests: libarrow-glib-doc diff --git a/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in b/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in index 6a87e19cd3091..4691f9e5439da 100644 --- a/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in +++ b/dev/tasks/linux-packages/apache-arrow/yum/arrow.spec.in @@ -562,7 +562,7 @@ This package contains the libraries for Apache Arrow GLib. 
%package glib-devel Summary: Libraries and header files for Apache Arrow GLib License: Apache-2.0 -Requires: %{name}-devel = %{version}-%{release} +Requires: %{name}-acero-devel = %{version}-%{release} Requires: %{name}%{major_version}-glib-libs = %{version}-%{release} Requires: glib2-devel Requires: gobject-introspection-devel From a446ff71b87880d399a204b767ee493cff573d15 Mon Sep 17 00:00:00 2001 From: Abe Tomoaki Date: Fri, 15 Sep 2023 17:09:45 +0900 Subject: [PATCH 20/96] GH-34567: [JS] Improve build and do not generate `bin/bin` directory (#36607) ### Rationale for this change `bin/bin` directory is unnecessary and should not be generated. ### What changes are included in this PR? * Add setting to exclude in tsconfig * Correctly set up `bin` out directory ### Are these changes tested? The following files are not generated. ``` targets/apache-arrow/bin/bin/arrow2csv.js targets/apache-arrow/bin/bin/arrow2csv.js.map targets/apache-arrow/bin/bin/arrow2csv.mjs targets/apache-arrow/bin/src/bin/arrow2csv.ts targets/es2015/cjs/bin/bin/arrow2csv.js targets/es2015/cjs/bin/bin/arrow2csv.js.map targets/es2015/cjs/bin/src/bin/arrow2csv.ts targets/es2015/esm/bin/bin/arrow2csv.js targets/es2015/esm/bin/bin/arrow2csv.js.map targets/es2015/esm/bin/src/bin/arrow2csv.ts targets/es2015/umd/bin/bin/arrow2csv.js targets/es2015/umd/bin/bin/arrow2csv.js.map targets/es2015/umd/bin/src/bin/arrow2csv.ts targets/es5/cjs/bin/bin/arrow2csv.js targets/es5/cjs/bin/bin/arrow2csv.js.map targets/es5/cjs/bin/src/bin/arrow2csv.ts targets/es5/esm/bin/bin/arrow2csv.js targets/es5/esm/bin/bin/arrow2csv.js.map targets/es5/esm/bin/src/bin/arrow2csv.ts targets/es5/umd/bin/bin/arrow2csv.js targets/es5/umd/bin/bin/arrow2csv.js.map targets/es5/umd/bin/src/bin/arrow2csv.ts targets/esnext/cjs/bin/bin/arrow2csv.js targets/esnext/cjs/bin/bin/arrow2csv.js.map targets/esnext/cjs/bin/src/bin/arrow2csv.ts targets/esnext/esm/bin/bin/arrow2csv.js targets/esnext/esm/bin/bin/arrow2csv.js.map targets/esnext/esm/bin/src/bin/arrow2csv.ts targets/esnext/umd/bin/bin/arrow2csv.js targets/esnext/umd/bin/bin/arrow2csv.js.map targets/esnext/umd/bin/src/bin/arrow2csv.ts ``` ### Are there any user-facing changes? * Closes: #34567 Lead-authored-by: abetomo Co-authored-by: ptaylor Signed-off-by: Dominik Moritz --- js/.eslintrc.cjs | 2 +- js/gulp/arrow-task.js | 28 +++++++++++++++++++--------- js/gulp/typescript-task.js | 26 ++++++++++++++++++++------ js/gulpfile.js | 4 ++++ js/src/Arrow.ts | 2 ++ js/src/bin/arrow2csv.ts | 16 ++++++++-------- js/tsconfig/tsconfig.base.json | 2 +- 7 files changed, 55 insertions(+), 25 deletions(-) mode change 100644 => 100755 js/src/bin/arrow2csv.ts diff --git a/js/.eslintrc.cjs b/js/.eslintrc.cjs index b629b862190f4..8a36516eec1c0 100644 --- a/js/.eslintrc.cjs +++ b/js/.eslintrc.cjs @@ -23,7 +23,7 @@ module.exports = { }, parser: "@typescript-eslint/parser", parserOptions: { - project: "tsconfig.json", + project: ["tsconfig.json", "tsconfig/tsconfig.bin.cjs.json"], sourceType: "module", ecmaVersion: 2020, }, diff --git a/js/gulp/arrow-task.js b/js/gulp/arrow-task.js index 411a817ddc09e..2de20947dc2f5 100644 --- a/js/gulp/arrow-task.js +++ b/js/gulp/arrow-task.js @@ -15,19 +15,18 @@ // specific language governing permissions and limitations // under the License. 
-import { targetDir, observableFromStreams } from './util.js'; +import { mainExport, targetDir, observableFromStreams } from './util.js'; -import { deleteAsync as del } from 'del'; import gulp from 'gulp'; +import path from 'path'; import { mkdirp } from 'mkdirp'; +import * as fs from 'fs/promises'; import gulpRename from 'gulp-rename'; import gulpReplace from 'gulp-replace'; import { memoizeTask } from './memoize-task.js'; import { ReplaySubject, forkJoin as ObservableForkJoin } from 'rxjs'; import { share } from 'rxjs/operators'; -import util from 'util'; -import stream from 'stream'; -const pipeline = util.promisify(stream.pipeline); +import { pipeline } from 'stream/promises'; export const arrowTask = ((cache) => memoizeTask(cache, function copyMain(target) { const out = targetDir(target); @@ -54,9 +53,20 @@ export const arrowTask = ((cache) => memoizeTask(cache, function copyMain(target }))({}); export const arrowTSTask = ((cache) => memoizeTask(cache, async function copyTS(target, format) { + const umd = targetDir(`es5`, `umd`); const out = targetDir(target, format); - await mkdirp(out); - await pipeline(gulp.src(`src/**/*`), gulp.dest(out)); - await del(`${out}/**/*.js`); -}))({}); + const arrowUMD = path.join(umd, `${mainExport}.js`); + const arrow2csvUMD = path.join(umd, `bin`, `arrow2csv.js`); + + await mkdirp(path.join(out, 'bin')); + await Promise.all([ + pipeline(gulp.src(`src/**/*`), gulp.dest(out)), + pipeline( + gulp.src([arrowUMD, arrow2csvUMD]), + gulpReplace(`../${mainExport}.js`, `./${mainExport}.js`), + gulp.dest(path.join(out, 'bin')) + ), + fs.writeFile(path.join(out, 'bin', 'package.json'), '{"type": "commonjs"}') + ]); +}))({}); diff --git a/js/gulp/typescript-task.js b/js/gulp/typescript-task.js index 02192192327ad..31769e3b1b236 100644 --- a/js/gulp/typescript-task.js +++ b/js/gulp/typescript-task.js @@ -19,12 +19,13 @@ import { targetDir, tsconfigName, observableFromStreams, shouldRunInChildProcess import gulp from 'gulp'; import path from 'path'; -import ts from 'gulp-typescript'; import tsc from 'typescript'; +import ts from 'gulp-typescript'; +import * as fs from 'fs/promises'; import sourcemaps from 'gulp-sourcemaps'; import { memoizeTask } from './memoize-task.js'; -import { ReplaySubject, forkJoin as ObservableForkJoin } from 'rxjs'; -import { mergeWith, takeLast, share } from 'rxjs/operators'; +import { ReplaySubject, forkJoin as ObservableForkJoin, defer as ObservableDefer } from 'rxjs'; +import { mergeWith, takeLast, share, concat } from 'rxjs/operators'; export const typescriptTask = ((cache) => memoizeTask(cache, function typescript(target, format) { if (shouldRunInChildProcess(target, format)) { @@ -44,10 +45,15 @@ export default typescriptTask; export function compileBinFiles(target, format) { const out = targetDir(target, format); const tsconfigPath = path.join(`tsconfig`, `tsconfig.${tsconfigName('bin', 'cjs')}.json`); - return compileTypescript(path.join(out, 'bin'), tsconfigPath, { target }); + const tsconfigOverrides = format === 'esm' ? 
{ target, module: 'ES2015' } : { target }; + return compileTypescript(out, tsconfigPath, tsconfigOverrides, false) + .pipe(takeLast(1)) + .pipe(concat(ObservableDefer(() => { + return fs.chmod(path.join(out, 'bin', 'arrow2csv.js'), 0o755); + }))); } -function compileTypescript(out, tsconfigPath, tsconfigOverrides) { +function compileTypescript(out, tsconfigPath, tsconfigOverrides, writeSourcemaps = true) { const tsProject = ts.createProject(tsconfigPath, { typescript: tsc, ...tsconfigOverrides }); const { stream: { js, dts } } = observableFromStreams( tsProject.src(), sourcemaps.init(), @@ -56,7 +62,15 @@ function compileTypescript(out, tsconfigPath, tsconfigOverrides) { const writeSources = observableFromStreams(tsProject.src(), gulp.dest(path.join(out, 'src'))); const writeDTypes = observableFromStreams(dts, sourcemaps.write('./', { includeContent: false, sourceRoot: './src' }), gulp.dest(out)); const mapFile = tsProject.options.module === tsc.ModuleKind.ES2015 ? esmMapFile : cjsMapFile; - const writeJS = observableFromStreams(js, sourcemaps.write('./', { mapFile, includeContent: false, sourceRoot: './src' }), gulp.dest(out)); + const writeJSArgs = writeSourcemaps ? [ + js, + sourcemaps.write('./', { mapFile, includeContent: false, sourceRoot: './src' }), + gulp.dest(out) + ] : [ + js, + gulp.dest(out) + ]; + const writeJS = observableFromStreams(...writeJSArgs); return ObservableForkJoin([writeSources, writeDTypes, writeJS]); } diff --git a/js/gulpfile.js b/js/gulpfile.js index 6544b987b73f6..bf84a4a9e1b49 100644 --- a/js/gulpfile.js +++ b/js/gulpfile.js @@ -54,6 +54,10 @@ knownTargets.forEach((target) => { )); }); +gulp.task(`build:ts`, gulp.series( + `build:es5:umd`, `clean:ts`, `compile:ts`, `package:ts` +)); + // The main "apache-arrow" module builds the es2015/umd, es2015/cjs, // es2015/esm, and esnext/umd targets, then copies and renames the // compiled output into the apache-arrow folder diff --git a/js/src/Arrow.ts b/js/src/Arrow.ts index dc44e10b9206f..4a6394c266b1b 100644 --- a/js/src/Arrow.ts +++ b/js/src/Arrow.ts @@ -99,6 +99,7 @@ import * as util_bit_ from './util/bit.js'; import * as util_math_ from './util/math.js'; import * as util_buffer_ from './util/buffer.js'; import * as util_vector_ from './util/vector.js'; +import * as util_pretty_ from './util/pretty.js'; import { compareSchemas, compareFields, compareTypes } from './visitor/typecomparator.js'; /** @ignore */ @@ -109,6 +110,7 @@ export const util = { ...util_math_, ...util_buffer_, ...util_vector_, + ...util_pretty_, compareSchemas, compareFields, compareTypes, diff --git a/js/src/bin/arrow2csv.ts b/js/src/bin/arrow2csv.ts old mode 100644 new mode 100755 index eae7f5805c41c..39db8c17497cd --- a/js/src/bin/arrow2csv.ts +++ b/js/src/bin/arrow2csv.ts @@ -21,8 +21,7 @@ import * as fs from 'fs'; import * as stream from 'stream'; -import { valueToString } from '../util/pretty.js'; -import { Schema, RecordBatch, RecordBatchReader, AsyncByteQueue } from '../Arrow.node.js'; +import { Schema, RecordBatch, RecordBatchReader, AsyncByteQueue, util } from '../Arrow.js'; import commandLineUsage from 'command-line-usage'; import commandLineArgs from 'command-line-args'; @@ -58,9 +57,10 @@ type ToStringState = { if (state.closed) { break; } for await (reader of recordBatchReaders(source)) { hasReaders = true; - const transformToString = batchesToString(state, reader.schema); + const batches = stream.Readable.from(reader); + const toString = batchesToString(state, reader.schema); await pipeTo( - reader.pipe(transformToString), + 
batches.pipe(toString),
                 process.stdout,
                 { end: false }
             ).catch(() => state.closed = true); // Handle EPIPE errors
         }
@@ -129,7 +129,7 @@ function batchesToString(state: ToStringState, schema: Schema) {
     let maxColWidths = [10];
     const { hr, sep, metadata } = state;
-    const header = ['row_id', ...schema.fields.map((f) => `${f}`)].map(val => valueToString(val));
+    const header = ['row_id', ...schema.fields.map((f) => `${f}`)].map(val => util.valueToString(val));
 
     state.maxColWidths = header.map((x, i) => Math.max(maxColWidths[i] || 0, x.length));
 
@@ -181,7 +181,7 @@ function batchesToString(state: ToStringState, schema: Schema) {
                     if (rowId % 350 === 0) {
                         this.push(`${formatRow(header, maxColWidths, sep)}\n`);
                     }
-                    this.push(`${formatRow([rowId++, ...row.toArray()].map(v => valueToString(v)), maxColWidths, sep)}\n`);
+                    this.push(`${formatRow([rowId++, ...row.toArray()].map(v => util.valueToString(v)), maxColWidths, sep)}\n`);
                 }
             }
             cb();
@@ -202,7 +202,7 @@ function formatMetadataValue(value = '') {
     try {
         parsed = JSON.stringify(JSON.parse(value), null, 2);
     } catch { parsed = value; }
-    return valueToString(parsed).split('\n').join('\n ');
+    return util.valueToString(parsed).split('\n').join('\n ');
 }
 
 function formatMetadata(metadata: Map<string, string>) {
@@ -236,7 +236,7 @@ function measureColumnWidths(rowId: number, batch: RecordBatch, maxColWidths: nu
                     (val.length * elementWidth) // width of stringified 2^N-1
                 );
             } else {
-                maxColWidths[j + 1] = Math.max(maxColWidths[j + 1] || 0, valueToString(val).length);
+                maxColWidths[j + 1] = Math.max(maxColWidths[j + 1] || 0, util.valueToString(val).length);
             }
             ++j;
         }
diff --git a/js/tsconfig/tsconfig.base.json b/js/tsconfig/tsconfig.base.json
index fb4ecb38b5892..0d7fefd90949f 100644
--- a/js/tsconfig/tsconfig.base.json
+++ b/js/tsconfig/tsconfig.base.json
@@ -1,5 +1,5 @@
 {
-  "exclude": ["../node_modules"],
+  "exclude": ["../node_modules", "../src/bin/*.ts"],
   "include": ["../src/**/*.ts"],
   "compileOnSave": false,
   "compilerOptions": {

From d8ab9a802227468976958b109c08c1ca9637a7e8 Mon Sep 17 00:00:00 2001
From: Christian Lorentzen
Date: Fri, 15 Sep 2023 10:12:26 +0200
Subject: [PATCH 21/96] MINOR: [C++][Python] [Docs] clearer description of q
 argument of quantiles (#37380)

### Rationale for this change

Documentation of the `q` parameter of quantiles is made more precise.

### What changes are included in this PR?

Only documentation changes.

### Are these changes tested?

No

### Are there any user-facing changes?

No

Authored-by: Christian Lorentzen
Signed-off-by: Joris Van den Bossche
---
 cpp/src/arrow/compute/api_aggregate.h | 4 ++--
 python/pyarrow/_compute.pyx           | 6 ++++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/cpp/src/arrow/compute/api_aggregate.h b/cpp/src/arrow/compute/api_aggregate.h
index 8f45f6199fbe1..3493c3146310d 100644
--- a/cpp/src/arrow/compute/api_aggregate.h
+++ b/cpp/src/arrow/compute/api_aggregate.h
@@ -138,7 +138,7 @@ class ARROW_EXPORT QuantileOptions : public FunctionOptions {
   static constexpr char const kTypeName[] = "QuantileOptions";
   static QuantileOptions Defaults() { return QuantileOptions{}; }
 
-  /// quantile must be between 0 and 1 inclusive
+  /// probability level of quantile must be between 0 and 1 inclusive
  std::vector<double> q;
  enum Interpolation interpolation;
  /// If true (the default), null values are ignored. Otherwise, if any value is null,
@@ -162,7 +162,7 @@ class ARROW_EXPORT TDigestOptions : public FunctionOptions {
   static constexpr char const kTypeName[] = "TDigestOptions";
   static TDigestOptions Defaults() { return TDigestOptions{}; }
 
-  /// quantile must be between 0 and 1 inclusive
+  /// probability level of quantile must be between 0 and 1 inclusive
  std::vector<double> q;
  /// compression parameter, default 100
  uint32_t delta;
diff --git a/python/pyarrow/_compute.pyx b/python/pyarrow/_compute.pyx
index 0c1744febbe1e..609307528d2ec 100644
--- a/python/pyarrow/_compute.pyx
+++ b/python/pyarrow/_compute.pyx
@@ -2145,7 +2145,8 @@ class QuantileOptions(_QuantileOptions):
     Parameters
     ----------
     q : double or sequence of double, default 0.5
-        Quantiles to compute. All values must be in [0, 1].
+        Probability levels of the quantiles to compute. All values must be in
+        [0, 1].
     interpolation : str, default "linear"
         How to break ties between competing data points for a given quantile.
         Accepted values are:
@@ -2182,7 +2183,8 @@ class TDigestOptions(_TDigestOptions):
     Parameters
     ----------
     q : double or sequence of double, default 0.5
-        Quantiles to approximate. All values must be in [0, 1].
+        Probability levels of the quantiles to approximate. All values must be
+        in [0, 1].
     delta : int, default 100
         Compression parameter for the T-digest algorithm.
     buffer_size : int, default 500

From ac7e9a4c41b73242b2f7a15f13f3c8fde843416d Mon Sep 17 00:00:00 2001
From: Bryce Mecum
Date: Fri, 15 Sep 2023 00:50:20 -0800
Subject: [PATCH 22/96] GH-34105: [R] Provide extra output for failed builds
 (#37727)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### Rationale for this change

This is a replacement for the previous PR https://github.com/apache/arrow/pull/37698. The rationale for this PR is providing extra output for R package builds where the C++ build fails.

### What changes are included in this PR?

Update the system call to save output when building Arrow C++ from the R package, and print that output if the build failed.

### Are these changes tested?

No automated tests, but the changes have been tested manually.

### Are there any user-facing changes?

Yes, but only for users building the R package from source, which is hopefully not common.
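For readers who want to see the pattern in isolation: the capture-and-replay approach above can be exercised outside of nixlibs.R. A minimal sketch in R (the failing command is a placeholder, not the real build script):

```r
# Capture a command's stdout/stderr to a temporary log file and
# replay the log only when the command exits non-zero.
log_path <- tempfile(fileext = ".log")
status <- suppressWarnings(system2(
  "bash", c("-c", "echo compiling; exit 1"),  # placeholder failing build
  stdout = log_path,
  stderr = log_path
))
if (status != 0) {
  cat("**** Build failed; captured log follows\n")
  cat(readLines(log_path), sep = "\n")
}
```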
* Closes: #34105

Lead-authored-by: Bryce Mecum
Co-authored-by: Nic Crane
Signed-off-by: Raúl Cumplido
---
 r/tools/nixlibs.R | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/r/tools/nixlibs.R b/r/tools/nixlibs.R
index dca277c80948c..3d908c05cab07 100644
--- a/r/tools/nixlibs.R
+++ b/r/tools/nixlibs.R
@@ -473,17 +473,25 @@ build_libarrow <- function(src_dir, dst_dir) {
   env_vars <- env_vars_as_string(env_var_list)
 
   cat("**** arrow", ifelse(quietly, "", paste("with", env_vars)), "\n")
-  status <- suppressWarnings(system(
-    paste(env_vars, "inst/build_arrow_static.sh"),
-    ignore.stdout = quietly, ignore.stderr = quietly
+
+  build_log_path <- tempfile(fileext = ".log")
+  status <- suppressWarnings(system2(
+    "bash",
+    "inst/build_arrow_static.sh",
+    env = env_vars,
+    stdout = ifelse(quietly, build_log_path, ""),
+    stderr = ifelse(quietly, build_log_path, "")
  ))
+
  if (status != 0) {
    # It failed :(
-    cat(
-      "**** Error building Arrow C++.",
-      ifelse(env_is("ARROW_R_DEV", "true"), "", "Re-run with ARROW_R_DEV=true for debug information."),
-      "\n"
-    )
+    cat("**** Error building Arrow C++.", "\n")
+    if (quietly) {
+      cat("**** Printing contents of build log because the build failed",
+          "while ARROW_R_DEV was set to FALSE\n")
+      cat(readLines(build_log_path), sep = "\n")
+      cat("**** Complete build log may still be present at", build_log_path, "\n")
+    }
  }
  invisible(status)
 }

From d9ee3a7961df44a1e30d3fe615ee3a20af569025 Mon Sep 17 00:00:00 2001
From: mwish
Date: Fri, 15 Sep 2023 23:38:23 +0800
Subject: [PATCH 23/96] GH-37643: [C++] Enhance arrow::Datum::ToString (#37646)

### Rationale for this change

Print the child contents in `arrow::Datum::ToString`; previously it printed just the kind of datum.

### What changes are included in this PR?

Add detail to the output of `arrow::Datum::ToString`.

### Are these changes tested?

No

### Are there any user-facing changes?
Yes * Closes: #37643 Authored-by: mwish Signed-off-by: Benjamin Kietzman --- c_glib/test/test-array-datum.rb | 2 +- c_glib/test/test-chunked-array-datum.rb | 2 +- c_glib/test/test-record-batch-datum.rb | 2 +- c_glib/test/test-scalar-datum.rb | 2 +- c_glib/test/test-table-datum.rb | 11 ++++++++++- cpp/src/arrow/datum.cc | 10 +++++----- cpp/src/arrow/datum.h | 1 - cpp/src/arrow/datum_test.cc | 4 ++-- 8 files changed, 21 insertions(+), 13 deletions(-) diff --git a/c_glib/test/test-array-datum.rb b/c_glib/test/test-array-datum.rb index 623e5589ce40b..1b2c9f91e2aa2 100644 --- a/c_glib/test/test-array-datum.rb +++ b/c_glib/test/test-array-datum.rb @@ -61,7 +61,7 @@ def test_false end def test_to_string - assert_equal("Array", @datum.to_s) + assert_equal("Array([\n" + " true,\n" + " false\n" + "])", @datum.to_s) end def test_value diff --git a/c_glib/test/test-chunked-array-datum.rb b/c_glib/test/test-chunked-array-datum.rb index 76317315327e8..b82f3eed8a7af 100644 --- a/c_glib/test/test-chunked-array-datum.rb +++ b/c_glib/test/test-chunked-array-datum.rb @@ -49,7 +49,7 @@ def test_false end def test_to_string - assert_equal("ChunkedArray", @datum.to_s) + assert_equal("ChunkedArray([\n" + " [\n" + " true,\n" + " false\n" + " ]\n" + "])", @datum.to_s) end def test_value diff --git a/c_glib/test/test-record-batch-datum.rb b/c_glib/test/test-record-batch-datum.rb index 33eb793ba869a..ec572e0f13023 100644 --- a/c_glib/test/test-record-batch-datum.rb +++ b/c_glib/test/test-record-batch-datum.rb @@ -49,7 +49,7 @@ def test_false end def test_to_string - assert_equal("RecordBatch", @datum.to_s) + assert_equal("RecordBatch(visible: [\n" + " true,\n" + " false\n" + " ]\n" + ")", @datum.to_s) end def test_value diff --git a/c_glib/test/test-scalar-datum.rb b/c_glib/test/test-scalar-datum.rb index 17e5d6b061cc7..32a5331518d8b 100644 --- a/c_glib/test/test-scalar-datum.rb +++ b/c_glib/test/test-scalar-datum.rb @@ -60,7 +60,7 @@ def test_false end def test_to_string - assert_equal("Scalar", @datum.to_s) + assert_equal("Scalar(true)", @datum.to_s) end def test_value diff --git a/c_glib/test/test-table-datum.rb b/c_glib/test/test-table-datum.rb index 7ff3997e88a37..c34ecf6314118 100644 --- a/c_glib/test/test-table-datum.rb +++ b/c_glib/test/test-table-datum.rb @@ -49,7 +49,16 @@ def test_false end def test_to_string - assert_equal("Table", @datum.to_s) + assert_equal("Table(visible: bool\n" + + "----\n" + + "visible:\n" + + " [\n" + + " [\n" + + " true,\n" + + " false\n" + + " ]\n" + + " ]\n" + + ")", @datum.to_s) end def test_value diff --git a/cpp/src/arrow/datum.cc b/cpp/src/arrow/datum.cc index d0b5cf62c61be..2ac230232e1b7 100644 --- a/cpp/src/arrow/datum.cc +++ b/cpp/src/arrow/datum.cc @@ -182,15 +182,15 @@ std::string Datum::ToString() const { case Datum::NONE: return "nullptr"; case Datum::SCALAR: - return "Scalar"; + return "Scalar(" + scalar()->ToString() + ")"; case Datum::ARRAY: - return "Array"; + return "Array(" + make_array()->ToString() + ")"; case Datum::CHUNKED_ARRAY: - return "ChunkedArray"; + return "ChunkedArray(" + chunked_array()->ToString() + ")"; case Datum::RECORD_BATCH: - return "RecordBatch"; + return "RecordBatch(" + record_batch()->ToString() + ")"; case Datum::TABLE: - return "Table"; + return "Table(" + table()->ToString() + ")"; default: DCHECK(false); return ""; diff --git a/cpp/src/arrow/datum.h b/cpp/src/arrow/datum.h index 57ae3731b5ccd..31b2d2274c900 100644 --- a/cpp/src/arrow/datum.h +++ b/cpp/src/arrow/datum.h @@ -301,7 +301,6 @@ struct ARROW_EXPORT Datum { bool 
operator==(const Datum& other) const { return Equals(other); }
  bool operator!=(const Datum& other) const { return !Equals(other); }
 
-  /// \brief Return a string representation of the kind of datum stored.
  std::string ToString() const;
 };
 
diff --git a/cpp/src/arrow/datum_test.cc b/cpp/src/arrow/datum_test.cc
index 14daac6a794fc..909d2577e68fb 100644
--- a/cpp/src/arrow/datum_test.cc
+++ b/cpp/src/arrow/datum_test.cc
@@ -154,8 +154,8 @@ TEST(Datum, ToString) {
   Datum v1(arr);
   Datum v2(std::make_shared(1));
 
-  ASSERT_EQ("Array", v1.ToString());
-  ASSERT_EQ("Scalar", v2.ToString());
+  ASSERT_EQ("Array([\n 1,\n 2,\n 3,\n 4\n])", v1.ToString());
+  ASSERT_EQ("Scalar(1)", v2.ToString());
 }
 
 TEST(Datum, TotalBufferSize) {

From e32e87529e0810572821b0e11afbe1562f1e7edd Mon Sep 17 00:00:00 2001
From: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Date: Fri, 15 Sep 2023 14:40:20 -0400
Subject: [PATCH 24/96] GH-37654: [MATLAB] Add `Fields` property to
 `arrow.type.Type` MATLAB class (#37725)

### Rationale for this change

In order to implement `arrow.type.StructType`, we need to add a property called `Fields` to `arrow.type.Type`. This property will be a 1-by-N `arrow.type.Field` array. Adding `Fields` will let users inspect the `Type`s contained by a `StructType` object.

### What changes are included in this PR?

1. Added `Fields` as a property to `arrow.type.Type`. `Fields` is a 1-by-N `arrow.type.Field` array, where `N` is the number of fields.
2. Added a method `field(idx)` to `arrow.type.Type`. This method accepts a numeric index and returns the `arrow.type.Field` stored at the specified index.

### Are these changes tested?

1. Yes, updated `hFixedWidthType.m` and `tStringType.m` to verify the behavior of the new property and method.
2. Currently, none of the concrete `arrow.type.Type`s has any fields. This means the `Fields` property is always a 0-by-0 `arrow.type.Field` array. Once we implement `StructType`, we will be able to test having a nonempty `Fields` property.

### Are there any user-facing changes?

Yes, users can now extract fields from an `arrow.type.Type` object.

### Future Directions

1. #37724
2.
#37653 * Closes: #37654 Authored-by: Sarah Gilmore Signed-off-by: Kevin Gurney --- matlab/src/cpp/arrow/matlab/error/error.h | 8 +-- matlab/src/cpp/arrow/matlab/index/validate.cc | 56 +++++++++++++++ matlab/src/cpp/arrow/matlab/index/validate.h | 26 +++++++ .../cpp/arrow/matlab/tabular/proxy/schema.cc | 68 ++++++------------- .../src/cpp/arrow/matlab/type/proxy/type.cc | 35 ++++++++++ matlab/src/cpp/arrow/matlab/type/proxy/type.h | 2 + matlab/src/matlab/+arrow/+type/Type.m | 24 +++++++ matlab/test/arrow/tabular/tSchema.m | 10 +-- matlab/test/arrow/type/hFixedWidthType.m | 25 +++++++ matlab/test/arrow/type/tStringType.m | 25 +++++++ .../cmake/BuildMatlabArrowInterface.cmake | 4 +- 11 files changed, 223 insertions(+), 60 deletions(-) create mode 100644 matlab/src/cpp/arrow/matlab/index/validate.cc create mode 100644 matlab/src/cpp/arrow/matlab/index/validate.h diff --git a/matlab/src/cpp/arrow/matlab/error/error.h b/matlab/src/cpp/arrow/matlab/error/error.h index 2b3009d51eb5a..4ff77da8d8360 100644 --- a/matlab/src/cpp/arrow/matlab/error/error.h +++ b/matlab/src/cpp/arrow/matlab/error/error.h @@ -174,10 +174,7 @@ namespace arrow::matlab::error { static const char* INVALID_TIME_UNIT = "arrow:type:InvalidTimeUnit"; static const char* FIELD_FAILED_TO_CREATE_TYPE_PROXY = "arrow:field:FailedToCreateTypeProxy"; static const char* ARRAY_FAILED_TO_CREATE_TYPE_PROXY = "arrow:array:FailedToCreateTypeProxy"; - static const char* ARROW_TABULAR_SCHEMA_INVALID_NUMERIC_FIELD_INDEX = "arrow:tabular:schema:InvalidNumericFieldIndex"; - static const char* ARROW_TABULAR_SCHEMA_UNKNOWN_FIELD_NAME = "arrow:tabular:schema:UnknownFieldName"; static const char* ARROW_TABULAR_SCHEMA_AMBIGUOUS_FIELD_NAME = "arrow:tabular:schema:AmbiguousFieldName"; - static const char* ARROW_TABULAR_SCHEMA_NUMERIC_FIELD_INDEX_WITH_EMPTY_SCHEMA = "arrow:tabular:schema:NumericFieldIndexWithEmptySchema"; static const char* UNKNOWN_PROXY_FOR_ARRAY_TYPE = "arrow:array:UnknownProxyForArrayType"; static const char* RECORD_BATCH_NUMERIC_INDEX_WITH_EMPTY_RECORD_BATCH = "arrow:tabular:recordbatch:NumericIndexWithEmptyRecordBatch"; static const char* RECORD_BATCH_INVALID_NUMERIC_COLUMN_INDEX = "arrow:tabular:recordbatch:InvalidNumericColumnIndex"; @@ -195,6 +192,7 @@ namespace arrow::matlab::error { static const char* CHUNKED_ARRAY_MAKE_FAILED = "arrow:chunkedarray:MakeFailed"; static const char* CHUNKED_ARRAY_NUMERIC_INDEX_WITH_EMPTY_CHUNKED_ARRAY = "arrow:chunkedarray:NumericIndexWithEmptyChunkedArray"; static const char* CHUNKED_ARRAY_INVALID_NUMERIC_CHUNK_INDEX = "arrow:chunkedarray:InvalidNumericChunkIndex"; - - + + static const char* INDEX_EMPTY_CONTAINER = "arrow:index:EmptyContainer"; + static const char* INDEX_OUT_OF_RANGE = "arrow:index:OutOfRange"; } diff --git a/matlab/src/cpp/arrow/matlab/index/validate.cc b/matlab/src/cpp/arrow/matlab/index/validate.cc new file mode 100644 index 0000000000000..b24653f1b814c --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/index/validate.cc @@ -0,0 +1,56 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/matlab/index/validate.h" + +#include + +namespace arrow::matlab::index { + + namespace { + std::string makeEmptyContainerErrorMessage() { + return "Numeric indexing using the field method is not supported for objects with zero fields."; + } + + std::string makeIndexOutOfRangeErrorMessage(const int32_t matlab_index, const int32_t num_fields) { + std::stringstream error_message_stream; + error_message_stream << "Invalid field index: "; + // matlab uses 1-based indexing + error_message_stream << matlab_index; + error_message_stream << ". Field index must be between 1 and the number of fields ("; + error_message_stream << num_fields; + error_message_stream << ")."; + return error_message_stream.str(); + } + } // anonymous namespace + + arrow::Status validateNonEmptyContainer(const int32_t num_fields) { + if (num_fields == 0) { + const auto msg = makeEmptyContainerErrorMessage(); + return arrow::Status::Invalid(std::move(msg)); + } + return arrow::Status::OK(); + } + + arrow::Status validateInRange(const int32_t matlab_index, const int32_t num_fields) { + if (matlab_index < 1 || matlab_index > num_fields) { + const auto msg = makeIndexOutOfRangeErrorMessage(matlab_index, num_fields); + return arrow::Status::Invalid(std::move(msg)); + } + return arrow::Status::OK(); + } +} \ No newline at end of file diff --git a/matlab/src/cpp/arrow/matlab/index/validate.h b/matlab/src/cpp/arrow/matlab/index/validate.h new file mode 100644 index 0000000000000..40e109c19e9ef --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/index/validate.h @@ -0,0 +1,26 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#pragma once + +#include "arrow/status.h" + +namespace arrow::matlab::index { + + arrow::Status validateNonEmptyContainer(const int32_t num_fields); + arrow::Status validateInRange(const int32_t matlab_index, const int32_t num_fields); +} \ No newline at end of file diff --git a/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc b/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc index 62fe863ca8b5f..ec1ac1eecb2fd 100644 --- a/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc +++ b/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc @@ -18,6 +18,7 @@ #include "arrow/matlab/error/error.h" #include "arrow/matlab/tabular/proxy/schema.h" #include "arrow/matlab/type/proxy/field.h" +#include "arrow/matlab/index/validate.h" #include "libmexclass/proxy/ProxyManager.h" #include "libmexclass/error/Error.h" @@ -28,25 +29,6 @@ namespace arrow::matlab::tabular::proxy { - namespace { - - libmexclass::error::Error makeUnknownFieldNameError(const std::string& name) { - using namespace libmexclass::error; - std::stringstream error_message_stream; - error_message_stream << "Unknown field name: '"; - error_message_stream << name; - error_message_stream << "'."; - return Error{error::ARROW_TABULAR_SCHEMA_UNKNOWN_FIELD_NAME, error_message_stream.str()}; - } - - libmexclass::error::Error makeEmptySchemaError() { - using namespace libmexclass::error; - return Error{error::ARROW_TABULAR_SCHEMA_NUMERIC_FIELD_INDEX_WITH_EMPTY_SCHEMA, - "Numeric indexing using the field method is not supported for schemas with no fields."}; - } - - } - Schema::Schema(std::shared_ptr schema) : schema{std::move(schema)} { REGISTER_METHOD(Schema, getFieldByIndex); REGISTER_METHOD(Schema, getFieldByName); @@ -86,37 +68,27 @@ namespace arrow::matlab::tabular::proxy { mda::StructArray args = context.inputs[0]; const mda::TypedArray index_mda = args[0]["Index"]; const auto matlab_index = int32_t(index_mda[0]); - // Note: MATLAB uses 1-based indexing, so subtract 1. - // arrow::Schema::field does not do any bounds checking. - const int32_t index = matlab_index - 1; - const auto num_fields = schema->num_fields(); - if (num_fields == 0) { - const auto& error = makeEmptySchemaError(); - context.error = error; - return; - } + // Validate there is at least 1 field + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT( + index::validateNonEmptyContainer(schema->num_fields()), + context, + error::INDEX_EMPTY_CONTAINER); - if (matlab_index < 1 || matlab_index > num_fields) { - using namespace libmexclass::error; - const std::string& error_message_id = std::string{error::ARROW_TABULAR_SCHEMA_INVALID_NUMERIC_FIELD_INDEX}; - std::stringstream error_message_stream; - error_message_stream << "Invalid field index: "; - error_message_stream << matlab_index; - error_message_stream << ". Field index must be between 1 and the number of fields ("; - error_message_stream << num_fields; - error_message_stream << ")."; - const std::string& error_message = error_message_stream.str(); - context.error = Error{error_message_id, error_message}; - return; - } + // Validate the matlab index provided is within the range [1, num_fields] + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT( + index::validateInRange(matlab_index, schema->num_fields()), + context, + error::INDEX_OUT_OF_RANGE); - const auto& field = schema->field(index); - auto field_proxy = std::make_shared(field); - const auto field_proxy_id = ProxyManager::manageProxy(field_proxy); - const auto field_proxy_id_mda = factory.createScalar(field_proxy_id); + // Note: MATLAB uses 1-based indexing, so subtract 1. 
+ // arrow::Schema::field does not do any bounds checking. + const int32_t index = matlab_index - 1; - context.outputs[0] = field_proxy_id_mda; + auto field = schema->field(index); + auto field_proxy = std::make_shared(std::move(field)); + auto field_proxy_id = ProxyManager::manageProxy(field_proxy); + context.outputs[0] = factory.createScalar(field_proxy_id); } void Schema::getFieldByName(libmexclass::proxy::method::Context& context) { @@ -135,9 +107,7 @@ namespace arrow::matlab::tabular::proxy { const auto field = schema->GetFieldByName(name); auto field_proxy = std::make_shared(field); const auto field_proxy_id = ProxyManager::manageProxy(field_proxy); - const auto field_proxy_id_mda = factory.createScalar(field_proxy_id); - - context.outputs[0] = field_proxy_id_mda; + context.outputs[0] = factory.createScalar(field_proxy_id); } void Schema::getNumFields(libmexclass::proxy::method::Context& context) { diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/type.cc b/matlab/src/cpp/arrow/matlab/type/proxy/type.cc index 1eed4e6141347..1cbaaf328ee86 100644 --- a/matlab/src/cpp/arrow/matlab/type/proxy/type.cc +++ b/matlab/src/cpp/arrow/matlab/type/proxy/type.cc @@ -15,7 +15,11 @@ // specific language governing permissions and limitations // under the License. + +#include "arrow/matlab/error/error.h" +#include "arrow/matlab/index/validate.h" #include "arrow/matlab/type/proxy/type.h" +#include "arrow/matlab/type/proxy/field.h" #include "libmexclass/proxy/ProxyManager.h" @@ -24,6 +28,7 @@ namespace arrow::matlab::type::proxy { Type::Type(std::shared_ptr type) : data_type{std::move(type)} { REGISTER_METHOD(Type, getTypeID); REGISTER_METHOD(Type, getNumFields); + REGISTER_METHOD(Type, getFieldByIndex); REGISTER_METHOD(Type, isEqual); } @@ -47,6 +52,36 @@ namespace arrow::matlab::type::proxy { context.outputs[0] = num_fields_mda; } + void Type::getFieldByIndex(libmexclass::proxy::method::Context& context) { + namespace mda = ::matlab::data; + mda::ArrayFactory factory; + + mda::StructArray args = context.inputs[0]; + const mda::TypedArray index_mda = args[0]["Index"]; + const auto matlab_index = int32_t(index_mda[0]); + + // Validate there is at least 1 field + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT( + index::validateNonEmptyContainer(data_type->num_fields()), + context, + error::INDEX_EMPTY_CONTAINER); + + // Validate the matlab index provided is within the range [1, num_fields] + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT( + index::validateInRange(matlab_index, data_type->num_fields()), + context, + error::INDEX_OUT_OF_RANGE); + + // Note: MATLAB uses 1-based indexing, so subtract 1. + // arrow::DataType::field does not do any bounds checking. 
+ const int32_t index = matlab_index - 1; + + auto field = data_type->field(index); + auto field_proxy = std::make_shared(std::move(field)); + auto field_proxy_id = libmexclass::proxy::ProxyManager::manageProxy(field_proxy); + context.outputs[0] = factory.createScalar(field_proxy_id); + } + void Type::isEqual(libmexclass::proxy::method::Context& context) { namespace mda = ::matlab::data; diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/type.h b/matlab/src/cpp/arrow/matlab/type/proxy/type.h index efd2b8255aa28..3a6b287a9254e 100644 --- a/matlab/src/cpp/arrow/matlab/type/proxy/type.h +++ b/matlab/src/cpp/arrow/matlab/type/proxy/type.h @@ -37,6 +37,8 @@ class Type : public libmexclass::proxy::Proxy { void getNumFields(libmexclass::proxy::method::Context& context); + void getFieldByIndex(libmexclass::proxy::method::Context& context); + void isEqual(libmexclass::proxy::method::Context& context); std::shared_ptr data_type; diff --git a/matlab/src/matlab/+arrow/+type/Type.m b/matlab/src/matlab/+arrow/+type/Type.m index 24f83e0267058..0fd0139b18b7a 100644 --- a/matlab/src/matlab/+arrow/+type/Type.m +++ b/matlab/src/matlab/+arrow/+type/Type.m @@ -19,6 +19,7 @@ properties (Dependent, GetAccess=public, SetAccess=private) ID + Fields NumFields end @@ -41,6 +42,29 @@ function typeID = get.ID(obj) typeID = arrow.type.ID(obj.Proxy.getTypeID()); end + + function F = field(obj, idx) + import arrow.internal.validate.* + + idx = index.numeric(idx, "int32", AllowNonScalar=false); + args = struct(Index=idx); + proxyID = obj.Proxy.getFieldByIndex(args); + proxy = libmexclass.proxy.Proxy(Name="arrow.type.proxy.Field", ID=proxyID); + F = arrow.type.Field(proxy); + end + + function fields = get.Fields(obj) + numFields = obj.NumFields; + if numFields == 0 + fields = arrow.type.Field.empty(0, 0); + else + fields = cell(1, numFields); + for ii = 1:numFields + fields{ii} = obj.field(ii); + end + fields = horzcat(fields); + end + end end methods(Access = protected) diff --git a/matlab/test/arrow/tabular/tSchema.m b/matlab/test/arrow/tabular/tSchema.m index 3220236d4aabe..e4c706d9a3d6c 100644 --- a/matlab/test/arrow/tabular/tSchema.m +++ b/matlab/test/arrow/tabular/tSchema.m @@ -239,7 +239,7 @@ function GetFieldByNameWithWhitespace(testCase) testCase.verifyEqual(field.Type.ID, arrow.type.ID.UInt32); end - function ErrorIfInvalidNumericFieldIndex(testCase) + function ErrorIfIndexIsOutOfRange(testCase) % Verify that an error is thrown when trying to access a field % with an invalid numeric index (e.g. greater than NumFields). schema = arrow.schema([... @@ -250,7 +250,7 @@ function ErrorIfInvalidNumericFieldIndex(testCase) % Index is greater than NumFields. index = 100; - testCase.verifyError(@() schema.field(index), "arrow:tabular:schema:InvalidNumericFieldIndex"); + testCase.verifyError(@() schema.field(index), "arrow:index:OutOfRange"); end function ErrorIfFieldNameDoesNotExist(testCase) @@ -376,7 +376,7 @@ function EmptySchema(testCase) testCase.verifyEqual(schema.FieldNames, string.empty(1, 0)); testCase.verifyEqual(schema.Fields, arrow.type.Field.empty(0, 0)); testCase.verifyError(@() schema.field(0), "arrow:badsubscript:NonPositive"); - testCase.verifyError(@() schema.field(1), "arrow:tabular:schema:NumericFieldIndexWithEmptySchema"); + testCase.verifyError(@() schema.field(1), "arrow:index:EmptyContainer"); % 0x1 empty Field array. 
fields = arrow.type.Field.empty(0, 1); @@ -385,7 +385,7 @@ function EmptySchema(testCase) testCase.verifyEqual(schema.FieldNames, string.empty(1, 0)); testCase.verifyEqual(schema.Fields, arrow.type.Field.empty(0, 0)); testCase.verifyError(@() schema.field(0), "arrow:badsubscript:NonPositive"); - testCase.verifyError(@() schema.field(1), "arrow:tabular:schema:NumericFieldIndexWithEmptySchema"); + testCase.verifyError(@() schema.field(1), "arrow:index:EmptyContainer"); % 1x0 empty Field array. fields = arrow.type.Field.empty(1, 0); @@ -394,7 +394,7 @@ function EmptySchema(testCase) testCase.verifyEqual(schema.FieldNames, string.empty(1, 0)); testCase.verifyEqual(schema.Fields, arrow.type.Field.empty(0, 0)); testCase.verifyError(@() schema.field(0), "arrow:badsubscript:NonPositive"); - testCase.verifyError(@() schema.field(1), "arrow:tabular:schema:NumericFieldIndexWithEmptySchema"); + testCase.verifyError(@() schema.field(1), "arrow:index:EmptyContainer"); end function GetFieldByNameWithChar(testCase) diff --git a/matlab/test/arrow/type/hFixedWidthType.m b/matlab/test/arrow/type/hFixedWidthType.m index adb234bbd3f38..b23c21a6b4feb 100644 --- a/matlab/test/arrow/type/hFixedWidthType.m +++ b/matlab/test/arrow/type/hFixedWidthType.m @@ -49,6 +49,31 @@ function TestNumFields(testCase) testCase.verifyEqual(arrowType.NumFields, int32(0)); end + function TestFieldsProperty(testCase) + % Verify Fields is a 0x0 arrow.type.Field array. + type = testCase.ArrowType; + fields = type.Fields; + testCase.verifyEqual(fields, arrow.type.Field.empty(0, 0)); + end + + function FieldsNoSetter(testCase) + % Verify the Fields property is not settable. + type = testCase.ArrowType; + testCase.verifyError(@() setfield(type, "Fields", "1"), "MATLAB:class:SetProhibited"); + end + + function InvalidFieldIndex(testCase) + % Verify the field() method throws the expected error message + % when given an invalid index. + type = testCase.ArrowType; + + testCase.verifyError(@() type.field(0), "arrow:badsubscript:NonPositive"); + testCase.verifyError(@() type.field("A"), "arrow:badsubscript:NonNumeric"); + + % NOTE: For FixedWidthTypes, Fields is always empty. + testCase.verifyError(@() type.field(1), "arrow:index:EmptyContainer"); + end + function TestBitWidthNoSetter(testCase) % Verify that an error is thrown when trying to set the value % of the BitWidth property. diff --git a/matlab/test/arrow/type/tStringType.m b/matlab/test/arrow/type/tStringType.m index e2a16ab133dbd..3d518b3da3320 100644 --- a/matlab/test/arrow/type/tStringType.m +++ b/matlab/test/arrow/type/tStringType.m @@ -64,6 +64,31 @@ function IsEqualFalse(testCase) testCase.verifyFalse(isequal(typeArray1, typeArray2)); end + function TestFieldsProperty(testCase) + % Verify Fields is a 0x0 arrow.type.Field array. + type = arrow.string(); + fields = type.Fields; + testCase.verifyEqual(fields, arrow.type.Field.empty(0, 0)); + end + + function FieldsNoSetter(testCase) + % Verify the Fields property is not settable. + type = arrow.string(); + testCase.verifyError(@() setfield(type, "Fields", "1"), "MATLAB:class:SetProhibited"); + end + + function InvalidFieldIndex(testCase) + % Verify the field() method throws the expected error message + % when given an invalid index. + type = arrow.string(); + + testCase.verifyError(@() type.field(0), "arrow:badsubscript:NonPositive"); + testCase.verifyError(@() type.field("A"), "arrow:badsubscript:NonNumeric"); + + % NOTE: For StringType, Fields is always empty. 
+ testCase.verifyError(@() type.field(1), "arrow:index:EmptyContainer"); + end + end end diff --git a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake index a5c0b079b34a6..b5c480d6a68e7 100644 --- a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake +++ b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake @@ -68,7 +68,9 @@ set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_SOURCES "${CMAKE_SOURCE_DIR}/src/cpp/a "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/field.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/wrap.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/feather/proxy/writer.cc" - "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/feather/proxy/reader.cc") + "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/feather/proxy/reader.cc" + "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/index/validate.cc") + set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_FACTORY_INCLUDE_DIR "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/proxy") From 60e7d24e1f9d41456313ffb5317eae064d0b4194 Mon Sep 17 00:00:00 2001 From: abandy Date: Fri, 15 Sep 2023 17:06:41 -0400 Subject: [PATCH 25/96] GH-37744: [Swift] Add test for arrow flight doGet FlightData (#37746) Add a test for the doGet call that handles the FlightData response. * Closes: #37744 Authored-by: Alva Bandy Signed-off-by: Sutou Kouhei --- .../Tests/ArrowFlightTests/FlightTest.swift | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/swift/ArrowFlight/Tests/ArrowFlightTests/FlightTest.swift b/swift/ArrowFlight/Tests/ArrowFlightTests/FlightTest.swift index d0db593b10304..3fd52af08b82f 100644 --- a/swift/ArrowFlight/Tests/ArrowFlightTests/FlightTest.swift +++ b/swift/ArrowFlight/Tests/ArrowFlightTests/FlightTest.swift @@ -225,6 +225,25 @@ public class FlightClientTester { XCTAssertEqual(num_call, 1) } + func doGetTestFlightData() async throws { + let ticket = FlightTicket("flight_ticket test".data(using: .utf8)!) 
+ var num_call = 0 + try await client?.doGet(ticket, flightDataClosure: { flightData in + let reader = ArrowReader(); + let result = reader.fromStream(flightData.dataBody) + switch result { + case .success(let rb): + XCTAssertEqual(rb.schema?.fields.count, 3) + XCTAssertEqual(rb.batches[0].length, 4) + num_call += 1 + case .failure(let error): + throw error + } + }) + + XCTAssertEqual(num_call, 1) + } + func doPutTest() async throws { let rb = try makeRecordBatch() var num_call = 0 @@ -290,6 +309,7 @@ final class FlightTest: XCTestCase { try await clientImpl.doActionTest() try await clientImpl.getSchemaTest() try await clientImpl.doGetTest() + try await clientImpl.doGetTestFlightData() try await clientImpl.doPutTest() try await clientImpl.doExchangeTest() From 1db4c99f2448fc53588848f01eb49c1a51cf5a17 Mon Sep 17 00:00:00 2001 From: Matt Topol Date: Fri, 15 Sep 2023 17:14:46 -0400 Subject: [PATCH 26/96] GH-37738: [Go][CI] Update Go version for verification (#37745) ### Rationale for this change Forgot to update the verification script to use the newer version of Go; it was still using Go 1.17. * Closes: #37738 Authored-by: Matt Topol Signed-off-by: Sutou Kouhei --- dev/release/verify-release-candidate.sh | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh index ce31b497c1fab..77b996766f78c 100755 --- a/dev/release/verify-release-candidate.sh +++ b/dev/release/verify-release-candidate.sh @@ -24,7 +24,7 @@ # - JDK >=7 # - gcc >= 4.8 # - Node.js >= 11.12 (best way is to use nvm) -# - Go >= 1.17 +# - Go >= 1.19 # - Docker # # If using a non-system Boost, set BOOST_ROOT and add Boost libraries to @@ -405,7 +405,7 @@ install_go() { return 0 fi - local version=1.17.13 + local version=1.19.13 show_info "Installing go version ${version}..." local arch="$(uname -m)" @@ -422,8 +422,9 @@ install_go() { fi local archive="go${version}.${os}-${arch}.tar.gz" - curl -sLO https://dl.google.com/go/$archive + curl -sLO https://go.dev/dl/$archive + + ls -l local prefix=${ARROW_TMPDIR}/go mkdir -p $prefix tar -xzf $archive -C $prefix @@ -860,12 +861,12 @@ test_go() { show_header "Build and test Go libraries" maybe_setup_go || exit 1 - maybe_setup_conda compilers go=1.17 || exit 1 + maybe_setup_conda compilers go=1.19 || exit 1 pushd go go get -v ./... go test ./... - go install ./... + go install -buildvcs=false ./... go clean -modcache popd } From 3db62050daa67b927810574dea60fd5b84bcb523 Mon Sep 17 00:00:00 2001 From: James Duong Date: Fri, 15 Sep 2023 14:45:08 -0700 Subject: [PATCH 27/96] GH-37701: [Java] Add default comparators for more types (#37748) ### Rationale for this change Add default comparators for more vector types to make algorithms easier to use and provide more consistency for Java compared to other languages. ### What changes are included in this PR? Add default type comparators for: - BitVector - DateDayVector - DateMilliVector - Decimal256Vector - DecimalVector - DurationVector - IntervalDayVector - TimeMicroVector - TimeMilliVector - TimeNanoVector - TimeSecVector - TimeStampVector IntervalMonthDayNanoVector is not supported due to its public type PeriodDuration not being Comparable. BitVector's getValueWidth() method does not return valid data by design, since its values are smaller than 1 byte. Using a BitVector with a fixed-width type's algorithm will throw an IllegalArgumentException. ### Are these changes tested? Yes. ### Are there any user-facing changes? No.
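**Example Usage** (an illustrative sketch added by the editor, not part of the original patch; the class name and printed checks are assumptions, but the APIs used exist in `arrow-memory`, `arrow-vector`, and `arrow-algorithm`):

```java
import org.apache.arrow.algorithm.sort.DefaultVectorComparators;
import org.apache.arrow.algorithm.sort.VectorValueComparator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.TimeSecVector;

public class DefaultComparatorSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         TimeSecVector vec = new TimeSecVector("times", allocator)) {
      vec.allocateNew(3);
      vec.set(0, 30);   // 00:00:30
      vec.setNull(1);   // nulls compare before all values
      vec.set(2, 10);   // 00:00:10
      vec.setValueCount(3);

      // Before this patch, createDefaultComparator threw
      // IllegalArgumentException for TimeSecVector.
      VectorValueComparator<TimeSecVector> comparator =
          DefaultVectorComparators.createDefaultComparator(vec);
      comparator.attachVector(vec);

      System.out.println(comparator.compare(0, 2) > 0); // true: 30 > 10
      System.out.println(comparator.compare(1, 0) < 0); // true: null first
    }
  }
}
```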
* Closes: #37701 Authored-by: James Duong Signed-off-by: David Li --- .../sort/DefaultVectorComparators.java | 329 +++++++++++++ .../FixedWidthOutOfPlaceVectorSorter.java | 4 + .../sort/TestDefaultVectorComparator.java | 447 ++++++++++++++++++ .../apache/arrow/vector/Decimal256Vector.java | 12 +- .../apache/arrow/vector/DecimalVector.java | 12 +- .../apache/arrow/vector/DurationVector.java | 14 +- .../arrow/vector/IntervalDayVector.java | 18 +- .../testing/ValueVectorDataPopulator.java | 30 ++ 8 files changed, 858 insertions(+), 8 deletions(-) diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java index c418219170380..99d66f94261ee 100644 --- a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java +++ b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java @@ -19,15 +19,31 @@ import static org.apache.arrow.vector.complex.BaseRepeatedValueVector.OFFSET_WIDTH; +import java.math.BigDecimal; +import java.time.Duration; + import org.apache.arrow.memory.util.ArrowBufPointer; import org.apache.arrow.memory.util.ByteFunctionHelpers; import org.apache.arrow.vector.BaseFixedWidthVector; import org.apache.arrow.vector.BaseVariableWidthVector; import org.apache.arrow.vector.BigIntVector; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.DateDayVector; +import org.apache.arrow.vector.DateMilliVector; +import org.apache.arrow.vector.Decimal256Vector; +import org.apache.arrow.vector.DecimalVector; +import org.apache.arrow.vector.DurationVector; import org.apache.arrow.vector.Float4Vector; import org.apache.arrow.vector.Float8Vector; import org.apache.arrow.vector.IntVector; +import org.apache.arrow.vector.IntervalDayVector; +import org.apache.arrow.vector.IntervalMonthDayNanoVector; import org.apache.arrow.vector.SmallIntVector; +import org.apache.arrow.vector.TimeMicroVector; +import org.apache.arrow.vector.TimeMilliVector; +import org.apache.arrow.vector.TimeNanoVector; +import org.apache.arrow.vector.TimeSecVector; +import org.apache.arrow.vector.TimeStampVector; import org.apache.arrow.vector.TinyIntVector; import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt2Vector; @@ -69,6 +85,32 @@ public static VectorValueComparator createDefaultComp return (VectorValueComparator) new UInt4Comparator(); } else if (vector instanceof UInt8Vector) { return (VectorValueComparator) new UInt8Comparator(); + } else if (vector instanceof BitVector) { + return (VectorValueComparator) new BitComparator(); + } else if (vector instanceof DateDayVector) { + return (VectorValueComparator) new DateDayComparator(); + } else if (vector instanceof DateMilliVector) { + return (VectorValueComparator) new DateMilliComparator(); + } else if (vector instanceof Decimal256Vector) { + return (VectorValueComparator) new Decimal256Comparator(); + } else if (vector instanceof DecimalVector) { + return (VectorValueComparator) new DecimalComparator(); + } else if (vector instanceof DurationVector) { + return (VectorValueComparator) new DurationComparator(); + } else if (vector instanceof IntervalDayVector) { + return (VectorValueComparator) new IntervalDayComparator(); + } else if (vector instanceof IntervalMonthDayNanoVector) { + throw new IllegalArgumentException("No default comparator for " + vector.getClass().getCanonicalName()); + } else if (vector instanceof 
TimeMicroVector) { + return (VectorValueComparator) new TimeMicroComparator(); + } else if (vector instanceof TimeMilliVector) { + return (VectorValueComparator) new TimeMilliComparator(); + } else if (vector instanceof TimeNanoVector) { + return (VectorValueComparator) new TimeNanoComparator(); + } else if (vector instanceof TimeSecVector) { + return (VectorValueComparator) new TimeSecComparator(); + } else if (vector instanceof TimeStampVector) { + return (VectorValueComparator) new TimeStampComparator(); } } else if (vector instanceof BaseVariableWidthVector) { return (VectorValueComparator) new VariableWidthComparator(); @@ -345,6 +387,293 @@ public VectorValueComparator createNew() { } } + /** + * Default comparator for bit type. + * The comparison is based on values, with null comes first. + */ + public static class BitComparator extends VectorValueComparator { + + public BitComparator() { + super(-1); + } + + @Override + public int compareNotNull(int index1, int index2) { + boolean value1 = vector1.get(index1) != 0; + boolean value2 = vector2.get(index2) != 0; + + return Boolean.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new BitComparator(); + } + } + + /** + * Default comparator for DateDay type. + * The comparison is based on values, with null comes first. + */ + public static class DateDayComparator extends VectorValueComparator { + + public DateDayComparator() { + super(DateDayVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + int value1 = vector1.get(index1); + int value2 = vector2.get(index2); + return Integer.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new DateDayComparator(); + } + } + + /** + * Default comparator for DateMilli type. + * The comparison is based on values, with null comes first. + */ + public static class DateMilliComparator extends VectorValueComparator { + + public DateMilliComparator() { + super(DateMilliVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + long value1 = vector1.get(index1); + long value2 = vector2.get(index2); + + return Long.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new DateMilliComparator(); + } + } + + /** + * Default comparator for Decimal256 type. + * The comparison is based on values, with null comes first. + */ + public static class Decimal256Comparator extends VectorValueComparator { + + public Decimal256Comparator() { + super(Decimal256Vector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + BigDecimal value1 = vector1.getObjectNotNull(index1); + BigDecimal value2 = vector2.getObjectNotNull(index2); + + return value1.compareTo(value2); + } + + @Override + public VectorValueComparator createNew() { + return new Decimal256Comparator(); + } + } + + /** + * Default comparator for Decimal type. + * The comparison is based on values, with null comes first. 
+ */ + public static class DecimalComparator extends VectorValueComparator { + + public DecimalComparator() { + super(DecimalVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + BigDecimal value1 = vector1.getObjectNotNull(index1); + BigDecimal value2 = vector2.getObjectNotNull(index2); + + return value1.compareTo(value2); + } + + @Override + public VectorValueComparator createNew() { + return new DecimalComparator(); + } + } + + /** + * Default comparator for Duration type. + * The comparison is based on values, with null comes first. + */ + public static class DurationComparator extends VectorValueComparator { + + public DurationComparator() { + super(DurationVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + Duration value1 = vector1.getObjectNotNull(index1); + Duration value2 = vector2.getObjectNotNull(index2); + + return value1.compareTo(value2); + } + + @Override + public VectorValueComparator createNew() { + return new DurationComparator(); + } + } + + /** + * Default comparator for IntervalDay type. + * The comparison is based on values, with null comes first. + */ + public static class IntervalDayComparator extends VectorValueComparator { + + public IntervalDayComparator() { + super(IntervalDayVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + Duration value1 = vector1.getObjectNotNull(index1); + Duration value2 = vector2.getObjectNotNull(index2); + + return value1.compareTo(value2); + } + + @Override + public VectorValueComparator createNew() { + return new IntervalDayComparator(); + } + } + + /** + * Default comparator for TimeMicro type. + * The comparison is based on values, with null comes first. + */ + public static class TimeMicroComparator extends VectorValueComparator { + + public TimeMicroComparator() { + super(TimeMicroVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + long value1 = vector1.get(index1); + long value2 = vector2.get(index2); + + return Long.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new TimeMicroComparator(); + } + } + + /** + * Default comparator for TimeMilli type. + * The comparison is based on values, with null comes first. + */ + public static class TimeMilliComparator extends VectorValueComparator { + + public TimeMilliComparator() { + super(TimeMilliVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + int value1 = vector1.get(index1); + int value2 = vector2.get(index2); + + return Integer.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new TimeMilliComparator(); + } + } + + /** + * Default comparator for TimeNano type. + * The comparison is based on values, with null comes first. + */ + public static class TimeNanoComparator extends VectorValueComparator { + + public TimeNanoComparator() { + super(TimeNanoVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + long value1 = vector1.get(index1); + long value2 = vector2.get(index2); + + return Long.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new TimeNanoComparator(); + } + } + + /** + * Default comparator for TimeSec type. + * The comparison is based on values, with null comes first. 
+ */ + public static class TimeSecComparator extends VectorValueComparator { + + public TimeSecComparator() { + super(TimeSecVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + int value1 = vector1.get(index1); + int value2 = vector2.get(index2); + + return Integer.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new TimeSecComparator(); + } + } + + /** + * Default comparator for TimeStamp type. + * The comparison is based on values, with null comes first. + */ + public static class TimeStampComparator extends VectorValueComparator { + + public TimeStampComparator() { + super(TimeStampVector.TYPE_WIDTH); + } + + @Override + public int compareNotNull(int index1, int index2) { + long value1 = vector1.get(index1); + long value2 = vector2.get(index2); + + return Long.compare(value1, value2); + } + + @Override + public VectorValueComparator createNew() { + return new TimeStampComparator(); + } + } + /** * Default comparator for {@link org.apache.arrow.vector.BaseVariableWidthVector}. * The comparison is in lexicographic order, with null comes first. diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/FixedWidthOutOfPlaceVectorSorter.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/FixedWidthOutOfPlaceVectorSorter.java index 43d604060d086..c3b68facfda97 100644 --- a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/FixedWidthOutOfPlaceVectorSorter.java +++ b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/FixedWidthOutOfPlaceVectorSorter.java @@ -21,6 +21,7 @@ import org.apache.arrow.memory.util.MemoryUtil; import org.apache.arrow.util.Preconditions; import org.apache.arrow.vector.BaseFixedWidthVector; +import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.BitVectorHelper; import org.apache.arrow.vector.IntVector; @@ -35,6 +36,9 @@ public class FixedWidthOutOfPlaceVectorSorter im @Override public void sortOutOfPlace(V srcVector, V dstVector, VectorValueComparator comparator) { + if (srcVector instanceof BitVector) { + throw new IllegalArgumentException("BitVector is not supported with FixedWidthOutOfPlaceVectorSorter."); + } comparator.attachVector(srcVector); int valueWidth = comparator.getValueWidth(); diff --git a/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java b/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java index 818bb60d116da..62051197740d8 100644 --- a/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java +++ b/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java @@ -25,8 +25,23 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.BigIntVector; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.DateDayVector; +import org.apache.arrow.vector.DateMilliVector; +import org.apache.arrow.vector.Decimal256Vector; +import org.apache.arrow.vector.DecimalVector; +import org.apache.arrow.vector.DurationVector; +import org.apache.arrow.vector.Float4Vector; +import org.apache.arrow.vector.Float8Vector; import org.apache.arrow.vector.IntVector; +import org.apache.arrow.vector.IntervalDayVector; import org.apache.arrow.vector.SmallIntVector; +import org.apache.arrow.vector.TimeMicroVector; +import org.apache.arrow.vector.TimeMilliVector; +import
org.apache.arrow.vector.TimeNanoVector; +import org.apache.arrow.vector.TimeSecVector; +import org.apache.arrow.vector.TimeStampMilliVector; +import org.apache.arrow.vector.TimeStampVector; import org.apache.arrow.vector.TinyIntVector; import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt2Vector; @@ -34,6 +49,7 @@ import org.apache.arrow.vector.UInt8Vector; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.testing.ValueVectorDataPopulator; +import org.apache.arrow.vector.types.TimeUnit; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.FieldType; @@ -271,6 +287,76 @@ public void testCompareUInt8() { } } + @Test + public void testCompareFloat4() { + try (Float4Vector vec = new Float4Vector("", allocator)) { + vec.allocateNew(9); + ValueVectorDataPopulator.setVector( + vec, -1.1f, 0.0f, 1.0f, null, 1.0f, 2.0f, Float.NaN, Float.NaN, Float.POSITIVE_INFINITY, + Float.NEGATIVE_INFINITY); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + assertTrue(comparator.compare(8, 3) > 0); + + // NaN behavior. + assertTrue(comparator.compare(6, 7) == 0); + assertTrue(comparator.compare(7, 6) == 0); + assertTrue(comparator.compare(7, 7) == 0); + assertTrue(comparator.compare(6, 0) > 0); + assertTrue(comparator.compare(6, 8) > 0); + assertTrue(comparator.compare(6, 3) > 0); + } + } + + @Test + public void testCompareFloat8() { + try (Float8Vector vec = new Float8Vector("", allocator)) { + vec.allocateNew(9); + ValueVectorDataPopulator.setVector( + vec, -1.1, 0.0, 1.0, null, 1.0, 2.0, Double.NaN, Double.NaN, Double.POSITIVE_INFINITY, + Double.NEGATIVE_INFINITY); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + assertTrue(comparator.compare(8, 3) > 0); + + // NaN behavior. 
+ assertTrue(comparator.compare(6, 7) == 0); + assertTrue(comparator.compare(7, 6) == 0); + assertTrue(comparator.compare(7, 7) == 0); + assertTrue(comparator.compare(6, 0) > 0); + assertTrue(comparator.compare(6, 8) > 0); + assertTrue(comparator.compare(6, 3) > 0); + } + } + @Test public void testCompareLong() { try (BigIntVector vec = new BigIntVector("", allocator)) { @@ -393,6 +479,367 @@ public void testCompareByte() { } } + @Test + public void testCompareBit() { + try (BitVector vec = new BitVector("", allocator)) { + vec.allocateNew(6); + ValueVectorDataPopulator.setVector( + vec, 1, 2, 0, 0, -1, null); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) == 0); + assertTrue(comparator.compare(0, 2) > 0); + assertTrue(comparator.compare(0, 4) == 0); + assertTrue(comparator.compare(2, 1) < 0); + assertTrue(comparator.compare(2, 4) < 0); + + // null first + assertTrue(comparator.compare(5, 0) < 0); + assertTrue(comparator.compare(5, 2) < 0); + } + } + + @Test + public void testCompareDateDay() { + try (DateDayVector vec = new DateDayVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1, 0, 1, null, 1, 5, Integer.MIN_VALUE + 1, Integer.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareDateMilli() { + try (DateMilliVector vec = new DateMilliVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareDecimal() { + try (DecimalVector vec = new DecimalVector("", allocator, 10, 1)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + 
assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareDecimal256() { + try (Decimal256Vector vec = new Decimal256Vector("", allocator, 10, 1)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareDuration() { + try (DurationVector vec = + new DurationVector("", FieldType.nullable(new ArrowType.Duration(TimeUnit.MILLISECOND)), allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareIntervalDay() { + try (IntervalDayVector vec = + new IntervalDayVector("", FieldType.nullable(new ArrowType.Duration(TimeUnit.MILLISECOND)), allocator)) { + vec.allocateNew(8); + vec.set(0, -1, 0); + vec.set(1, 0, 0); + vec.set(2, 1, 0); + vec.setNull(3); + vec.set(4, -1, -1); + vec.set(5, 1, 1); + vec.set(6, 1, 1); + vec.set(7, -1, -1); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + assertTrue(comparator.compare(2, 5) < 0); + assertTrue(comparator.compare(0, 4) > 0); + + // test equality + assertTrue(comparator.compare(5, 6) == 0); + assertTrue(comparator.compare(4, 7) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + } + } + + @Test + public void testCompareTimeMicro() { + try (TimeMicroVector vec = + new TimeMicroVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality 
+ assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareTimeMilli() { + try (TimeMilliVector vec = new TimeMilliVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1, 0, 1, null, 1, 5, Integer.MIN_VALUE + 1, Integer.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareTimeNano() { + try (TimeNanoVector vec = + new TimeNanoVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareTimeSec() { + try (TimeSecVector vec = new TimeSecVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1, 0, 1, null, 1, 5, Integer.MIN_VALUE + 1, Integer.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) > 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + + @Test + public void testCompareTimeStamp() { + try (TimeStampMilliVector vec = + new TimeStampMilliVector("", allocator)) { + vec.allocateNew(8); + ValueVectorDataPopulator.setVector( + vec, -1L, 0L, 1L, null, 1L, 5L, Long.MIN_VALUE + 1L, Long.MAX_VALUE); + + VectorValueComparator comparator = + DefaultVectorComparators.createDefaultComparator(vec); + comparator.attachVector(vec); + + assertTrue(comparator.compare(0, 1) < 0); + assertTrue(comparator.compare(0, 2) < 0); + assertTrue(comparator.compare(2, 1) 
> 0); + + // test equality + assertTrue(comparator.compare(5, 5) == 0); + assertTrue(comparator.compare(2, 4) == 0); + + // null first + assertTrue(comparator.compare(3, 4) < 0); + assertTrue(comparator.compare(5, 3) > 0); + + // potential overflow + assertTrue(comparator.compare(6, 7) < 0); + assertTrue(comparator.compare(7, 6) > 0); + assertTrue(comparator.compare(7, 7) == 0); + } + } + @Test public void testCheckNullsOnCompareIsFalseForNonNullableVector() { try (IntVector vec = new IntVector("not nullable", diff --git a/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java b/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java index 4ccee50d6805a..70a895ff40496 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java @@ -154,10 +154,20 @@ public BigDecimal getObject(int index) { if (isSet(index) == 0) { return null; } else { - return DecimalUtility.getBigDecimalFromArrowBuf(valueBuffer, index, scale, TYPE_WIDTH); + return getObjectNotNull(index); } } + /** + * Same as {@link #getObject(int)}, but does not check for null. + * + * @param index position of element + * @return element at given index + */ + public BigDecimal getObjectNotNull(int index) { + return DecimalUtility.getBigDecimalFromArrowBuf(valueBuffer, index, scale, TYPE_WIDTH); + } + /** * Return precision for the decimal value. */ diff --git a/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java b/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java index db04563df24d7..6a3ec60afc52e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java @@ -153,10 +153,20 @@ public BigDecimal getObject(int index) { if (isSet(index) == 0) { return null; } else { - return DecimalUtility.getBigDecimalFromArrowBuf(valueBuffer, index, scale, TYPE_WIDTH); + return getObjectNotNull(index); } } + /** + * Same as {@link #getObject(int)} but does not check for null. + * + * @param index position of element + * @return element at given index + */ + public BigDecimal getObjectNotNull(int index) { + return DecimalUtility.getBigDecimalFromArrowBuf(valueBuffer, index, scale, TYPE_WIDTH); + } + /** * Return precision for the decimal value. */ diff --git a/java/vector/src/main/java/org/apache/arrow/vector/DurationVector.java b/java/vector/src/main/java/org/apache/arrow/vector/DurationVector.java index 1e1db0d1c3c5f..b6abc16194b77 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/DurationVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/DurationVector.java @@ -147,11 +147,21 @@ public Duration getObject(int index) { if (isSet(index) == 0) { return null; } else { - final long value = get(valueBuffer, index); - return toDuration(value, unit); + return getObjectNotNull(index); } } + /** + * Same as {@link #getObject(int)} but does not check for null. + * + * @param index position of element + * @return element at given index + */ + public Duration getObjectNotNull(int index) { + final long value = get(valueBuffer, index); + return toDuration(value, unit); + } + /** * Converts the given value and unit to the appropriate {@link Duration}.
*/ diff --git a/java/vector/src/main/java/org/apache/arrow/vector/IntervalDayVector.java b/java/vector/src/main/java/org/apache/arrow/vector/IntervalDayVector.java index 35312ba7c96a1..7c0d19baa9a6f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/IntervalDayVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/IntervalDayVector.java @@ -168,13 +168,23 @@ public Duration getObject(int index) { if (isSet(index) == 0) { return null; } else { - final long startIndex = (long) index * TYPE_WIDTH; - final int days = valueBuffer.getInt(startIndex); - final int milliseconds = valueBuffer.getInt(startIndex + MILLISECOND_OFFSET); - return Duration.ofDays(days).plusMillis(milliseconds); + return getObjectNotNull(index); } } + /** + * Same as {@link #getObject(int)} but does not check for null. + * + * @param index position of element + * @return element at given index + */ + public Duration getObjectNotNull(int index) { + final long startIndex = (long) index * TYPE_WIDTH; + final int days = valueBuffer.getInt(startIndex); + final int milliseconds = valueBuffer.getInt(startIndex + MILLISECOND_OFFSET); + return Duration.ofDays(days).plusMillis(milliseconds); + } + /** * Get the Interval value at a given index as a {@link StringBuilder} object. * diff --git a/java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java b/java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java index 15d6a5cf993c4..f9f0357861c15 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java @@ -19,6 +19,7 @@ import static org.junit.Assert.assertEquals; +import java.math.BigDecimal; import java.nio.charset.StandardCharsets; import java.util.List; import java.util.Map; @@ -29,6 +30,7 @@ import org.apache.arrow.vector.BitVectorHelper; import org.apache.arrow.vector.DateDayVector; import org.apache.arrow.vector.DateMilliVector; +import org.apache.arrow.vector.Decimal256Vector; import org.apache.arrow.vector.DecimalVector; import org.apache.arrow.vector.DurationVector; import org.apache.arrow.vector.FixedSizeBinaryVector; @@ -147,6 +149,34 @@ public static void setVector(DecimalVector vector, Long... values) { vector.setValueCount(length); } + /** + * Populate values for Decimal256Vector. + */ + public static void setVector(Decimal256Vector vector, Long... values) { + final int length = values.length; + vector.allocateNew(length); + for (int i = 0; i < length; i++) { + if (values[i] != null) { + vector.set(i, values[i]); + } + } + vector.setValueCount(length); + } + + /** + * Populate values for Decimal256Vector. + */ + public static void setVector(Decimal256Vector vector, BigDecimal... values) { + final int length = values.length; + vector.allocateNew(length); + for (int i = 0; i < length; i++) { + if (values[i] != null) { + vector.set(i, values[i]); + } + } + vector.setValueCount(length); + } + /** * Populate values for DurationVector. * @param values values of elapsed time in either seconds, milliseconds, microseconds or nanoseconds. 
From a7f5ee01b5949687bbc3b3918454860c6db6934c Mon Sep 17 00:00:00 2001 From: sgilmore10 <74676073+sgilmore10@users.noreply.github.com> Date: Mon, 18 Sep 2023 11:26:34 -0400 Subject: [PATCH 28/96] GH-37724: [MATLAB] Add `arrow.type.StructType` MATLAB class (#37749) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### Rationale for this change In order to add an `arrow.array.StructArray` MATLAB class (https://github.com/apache/arrow/issues/37653), we first need to implement the `arrow.type.StructType` MATLAB class. ### What changes are included in this PR? 1. Added a new MATLAB class `arrow.type.StructType` 2. Added convenience constructor function `arrow.struct()` 3. Added `Struct` as an enumeration value to `arrow.type.ID` 4. Added `arrow.type.traits.StructTraits` MATLAB class. Some of its properties, such as `ArrayConstructor` and `ArrayProxyClassName`, are set to `missing` because they require `arrow.array.StructArray` (#37653). When that class is added, we can initialize these properties to the correct values. **Example Usage**

```matlab
>> fieldA = arrow.field("A", arrow.int32());
>> fieldB = arrow.field("B", arrow.timestamp(TimeZone="America/New_York"));
>> fieldC = arrow.field("C", arrow.string());
>> structType = arrow.struct(fieldA, fieldB, fieldC)

structType =

  StructType with properties:

        ID: Struct
    Fields: [1×3 arrow.type.Field]

>> fieldBFromStruct = structType.field(2)

fieldBFromStruct =

  B: timestamp[us, tz=America/New_York]
```

### Are these changes tested? Yes. 1. Added a new test class called `tStructType.m` 2. Added a new test case to `tTypeDisplay.m` 3. Updated test case in `tID.m` ### Are there any user-facing changes? Yes. Users can now create an `arrow.type.StructType` object using the new `arrow.struct()` function. ### Future Directions 1.
#37653 * Closes: #37724 Authored-by: Sarah Gilmore Signed-off-by: Kevin Gurney --- matlab/src/cpp/arrow/matlab/proxy/factory.cc | 2 + .../arrow/matlab/type/proxy/struct_type.cc | 45 +++++ .../cpp/arrow/matlab/type/proxy/struct_type.h | 34 ++++ .../src/cpp/arrow/matlab/type/proxy/wrap.cc | 3 + .../+arrow/+type/+traits/StructTraits.m | 36 ++++ .../src/matlab/+arrow/+type/+traits/traits.m | 2 + matlab/src/matlab/+arrow/+type/ID.m | 6 + matlab/src/matlab/+arrow/+type/StructType.m | 46 +++++ matlab/src/matlab/+arrow/+type/Type.m | 2 +- matlab/src/matlab/+arrow/struct.m | 43 ++++ matlab/test/arrow/type/tField.m | 1 + matlab/test/arrow/type/tID.m | 3 +- matlab/test/arrow/type/tStructType.m | 190 ++++++++++++++++++ matlab/test/arrow/type/tTypeDisplay.m | 28 ++- matlab/test/arrow/type/traits/tStructTraits.m | 31 +++ matlab/test/arrow/type/traits/ttraits.m | 12 ++ .../cmake/BuildMatlabArrowInterface.cmake | 1 + 17 files changed, 482 insertions(+), 3 deletions(-) create mode 100644 matlab/src/cpp/arrow/matlab/type/proxy/struct_type.cc create mode 100644 matlab/src/cpp/arrow/matlab/type/proxy/struct_type.h create mode 100644 matlab/src/matlab/+arrow/+type/+traits/StructTraits.m create mode 100644 matlab/src/matlab/+arrow/+type/StructType.m create mode 100644 matlab/src/matlab/+arrow/struct.m create mode 100644 matlab/test/arrow/type/tStructType.m create mode 100644 matlab/test/arrow/type/traits/tStructTraits.m diff --git a/matlab/src/cpp/arrow/matlab/proxy/factory.cc b/matlab/src/cpp/arrow/matlab/proxy/factory.cc index 4035725f2b382..ebeb020a9e7c7 100644 --- a/matlab/src/cpp/arrow/matlab/proxy/factory.cc +++ b/matlab/src/cpp/arrow/matlab/proxy/factory.cc @@ -33,6 +33,7 @@ #include "arrow/matlab/type/proxy/date64_type.h" #include "arrow/matlab/type/proxy/time32_type.h" #include "arrow/matlab/type/proxy/time64_type.h" +#include "arrow/matlab/type/proxy/struct_type.h" #include "arrow/matlab/type/proxy/field.h" #include "arrow/matlab/io/feather/proxy/writer.h" #include "arrow/matlab/io/feather/proxy/reader.h" @@ -81,6 +82,7 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const ClassName& class_name, REGISTER_PROXY(arrow.type.proxy.Time64Type , arrow::matlab::type::proxy::Time64Type); REGISTER_PROXY(arrow.type.proxy.Date32Type , arrow::matlab::type::proxy::Date32Type); REGISTER_PROXY(arrow.type.proxy.Date64Type , arrow::matlab::type::proxy::Date64Type); + REGISTER_PROXY(arrow.type.proxy.StructType , arrow::matlab::type::proxy::StructType); REGISTER_PROXY(arrow.io.feather.proxy.Writer , arrow::matlab::io::feather::proxy::Writer); REGISTER_PROXY(arrow.io.feather.proxy.Reader , arrow::matlab::io::feather::proxy::Reader); diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/struct_type.cc b/matlab/src/cpp/arrow/matlab/type/proxy/struct_type.cc new file mode 100644 index 0000000000000..fbb8dc3f6edbe --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/type/proxy/struct_type.cc @@ -0,0 +1,45 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/matlab/type/proxy/struct_type.h" +#include "arrow/matlab/type/proxy/field.h" +#include "libmexclass/proxy/ProxyManager.h" + +namespace arrow::matlab::type::proxy { + + StructType::StructType(std::shared_ptr<arrow::StructType> struct_type) : Type(std::move(struct_type)) {} + + libmexclass::proxy::MakeResult StructType::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) { + namespace mda = ::matlab::data; + using StructTypeProxy = arrow::matlab::type::proxy::StructType; + + mda::StructArray args = constructor_arguments[0]; + const mda::TypedArray<uint64_t> field_proxy_ids_mda = args[0]["FieldProxyIDs"]; + + std::vector<std::shared_ptr<arrow::Field>> fields; + fields.reserve(field_proxy_ids_mda.getNumberOfElements()); + for (const auto proxy_id : field_proxy_ids_mda) { + using namespace libmexclass::proxy; + auto proxy = std::static_pointer_cast<arrow::matlab::type::proxy::Field>(ProxyManager::getProxy(proxy_id)); + auto field = proxy->unwrap(); + fields.push_back(field); + } + + auto struct_type = std::static_pointer_cast<arrow::StructType>(arrow::struct_(fields)); + return std::make_shared<StructTypeProxy>(std::move(struct_type)); + } +} \ No newline at end of file diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/struct_type.h b/matlab/src/cpp/arrow/matlab/type/proxy/struct_type.h new file mode 100644 index 0000000000000..8ec6217b34278 --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/type/proxy/struct_type.h @@ -0,0 +1,34 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#pragma once + +#include "arrow/matlab/type/proxy/type.h" + +namespace arrow::matlab::type::proxy { + + class StructType : public arrow::matlab::type::proxy::Type { + + public: + StructType(std::shared_ptr<arrow::StructType> struct_type); + + ~StructType() {} + + static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments); +}; + +} \ No newline at end of file diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/wrap.cc b/matlab/src/cpp/arrow/matlab/type/proxy/wrap.cc index 91a1e353496c7..3dd86e91409fa 100644 --- a/matlab/src/cpp/arrow/matlab/type/proxy/wrap.cc +++ b/matlab/src/cpp/arrow/matlab/type/proxy/wrap.cc @@ -24,6 +24,7 @@ #include "arrow/matlab/type/proxy/date32_type.h" #include "arrow/matlab/type/proxy/date64_type.h" #include "arrow/matlab/type/proxy/string_type.h" +#include "arrow/matlab/type/proxy/struct_type.h" namespace arrow::matlab::type::proxy { @@ -64,6 +65,8 @@ namespace arrow::matlab::type::proxy { return std::make_shared<Date64Type>(std::static_pointer_cast<arrow::Date64Type>(type)); case ID::STRING: return std::make_shared<StringType>(std::static_pointer_cast<arrow::StringType>(type)); + case ID::STRUCT: + return std::make_shared<StructType>(std::static_pointer_cast<arrow::StructType>(type)); default: return arrow::Status::NotImplemented("Unsupported DataType: " + type->ToString()); } diff --git a/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m b/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m new file mode 100644 index 0000000000000..a8ed98f8ae468 --- /dev/null +++ b/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m @@ -0,0 +1,36 @@ +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. 
+ +classdef StructTraits < arrow.type.traits.TypeTraits + + properties (Constant) + % TODO: When arrow.array.StructArray is implemented, set these + % properties appropriately + ArrayConstructor = missing + ArrayClassName = missing + ArrayProxyClassName = missing + ArrayStaticConstructor = missing + + TypeConstructor = @arrow.type.StructType + TypeClassName = "arrow.type.StructType" + TypeProxyClassName = "arrow.type.proxy.StructType" + + % TODO: When arrow.array.StructArray is implemented, set these + % properties appropriately + MatlabConstructor = missing + MatlabClassName = missing + end + +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+type/+traits/traits.m b/matlab/src/matlab/+arrow/+type/+traits/traits.m index 78804fdccb3f0..f737108ce5f76 100644 --- a/matlab/src/matlab/+arrow/+type/+traits/traits.m +++ b/matlab/src/matlab/+arrow/+type/+traits/traits.m @@ -56,6 +56,8 @@ typeTraits = Date32Traits(); case ID.Date64 typeTraits = Date64Traits(); + case ID.Struct + typeTraits = StructTraits(); otherwise error("arrow:type:traits:UnsupportedArrowTypeID", "Unsupported Arrow type ID: " + type); end diff --git a/matlab/src/matlab/+arrow/+type/ID.m b/matlab/src/matlab/+arrow/+type/ID.m index 646edb85c6632..b2c4facbe4065 100644 --- a/matlab/src/matlab/+arrow/+type/ID.m +++ b/matlab/src/matlab/+arrow/+type/ID.m @@ -37,5 +37,11 @@ Timestamp (18) Time32 (19) Time64 (20) + % IntervalMonths (21) + % IntervalDayTime (22) + % Decimal128 (23) + % Decimal256 (24) + % List (25) + Struct (26) end end diff --git a/matlab/src/matlab/+arrow/+type/StructType.m b/matlab/src/matlab/+arrow/+type/StructType.m new file mode 100644 index 0000000000000..6c1318f6376f3 --- /dev/null +++ b/matlab/src/matlab/+arrow/+type/StructType.m @@ -0,0 +1,46 @@ +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +classdef StructType < arrow.type.Type + + methods + function obj = StructType(proxy) + arguments + proxy(1, 1) libmexclass.proxy.Proxy {validate(proxy, "arrow.type.proxy.StructType")} + end + import arrow.internal.proxy.validate + obj@arrow.type.Type(proxy); + end + end + + methods(Access = protected) + function groups = getDisplayPropertyGroups(obj) + targets = ["ID", "Fields"]; + groups = matlab.mixin.util.PropertyGroup(targets); + end + end + + methods (Hidden) + % TODO: Consider using a mixin approach to add this behavior. For + % example, ChunkedArray's toMATLAB method could check if its + % Type inherits from a mixin called "Preallocateable" (or something + % more descriptive). If so, we can call preallocateMATLABArray + % in the toMATLAB method. + function preallocateMATLABArray(~) + error("arrow:type:UnsupportedFunction", ... 
+ "preallocateMATLABArray is not supported for StructType"); + end + end +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+type/Type.m b/matlab/src/matlab/+arrow/+type/Type.m index 0fd0139b18b7a..6dc4fbc438f34 100644 --- a/matlab/src/matlab/+arrow/+type/Type.m +++ b/matlab/src/matlab/+arrow/+type/Type.m @@ -62,7 +62,7 @@ for ii = 1:numFields fields{ii} = obj.field(ii); end - fields = horzcat(fields); + fields = horzcat(fields{:}); end end end diff --git a/matlab/src/matlab/+arrow/struct.m b/matlab/src/matlab/+arrow/struct.m new file mode 100644 index 0000000000000..2fdbd6a9864fd --- /dev/null +++ b/matlab/src/matlab/+arrow/struct.m @@ -0,0 +1,43 @@ +%STRUCT Constructs an arrow.type.StructType object + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +function type = struct(fields) + arguments(Repeating) + fields(1, :) arrow.type.Field {mustBeNonempty} + end + + % Must have at least one Field in a Struct + if isempty(fields) + error("arrow:struct:TooFewInputs", ... + "Must supply at least one arrow.type.Field"); + end + + fields = horzcat(fields{:}); + + % Extract the corresponding Proxy IDs from each of the + % supplied arrow.type.Field objects. + numFields = numel(fields); + fieldProxyIDs = zeros(1, numFields, "uint64"); + for ii = 1:numFields + fieldProxyIDs(ii) = fields(ii).Proxy.ID; + end + + % Construct an Arrow Field Proxy in C++ from the supplied Field Proxy IDs. + args = struct(FieldProxyIDs=fieldProxyIDs); + proxy = arrow.internal.proxy.create("arrow.type.proxy.StructType", args); + type = arrow.type.StructType(proxy); +end \ No newline at end of file diff --git a/matlab/test/arrow/type/tField.m b/matlab/test/arrow/type/tField.m index dba7190b49ce2..1a89c0077b5ae 100644 --- a/matlab/test/arrow/type/tField.m +++ b/matlab/test/arrow/type/tField.m @@ -42,6 +42,7 @@ function TestSupportedTypes(testCase) arrow.float64, ... arrow.string, ... arrow.timestamp, ... + arrow.struct(arrow.field("A", arrow.float32())) }; for ii = 1:numel(supportedTypes) supportedType = supportedTypes{ii}; diff --git a/matlab/test/arrow/type/tID.m b/matlab/test/arrow/type/tID.m index b69cd89842d73..e97d77e81c124 100644 --- a/matlab/test/arrow/type/tID.m +++ b/matlab/test/arrow/type/tID.m @@ -46,7 +46,8 @@ function CastToUInt64(testCase) ID.Date64, 17, ... ID.Timestamp, 18, ... ID.Time32, 19, ... - ID.Time64, 20 ... + ID.Time64, 20, ... + ID.Struct, 26 ... ); enumValues = typeIDs.keys(); diff --git a/matlab/test/arrow/type/tStructType.m b/matlab/test/arrow/type/tStructType.m new file mode 100644 index 0000000000000..f0585823f8dcf --- /dev/null +++ b/matlab/test/arrow/type/tStructType.m @@ -0,0 +1,190 @@ +% TSTRUCTTYPE Unit tests for arrow.type.StructType + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. 
See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +classdef tStructType < matlab.unittest.TestCase + + properties (Constant) + Field1 = arrow.field("A", arrow.float64()) + Field2 = arrow.field("C", arrow.boolean()) + Field3 = arrow.field("B", arrow.timestamp(TimeUnit="Microsecond", TimeZone="America/New_York")); + end + + methods (Test) + function Basic(tc) + % Verify arrow.struct() returns an arrow.type.StructType + % object. + type = arrow.struct(tc.Field1); + className = string(class(type)); + tc.verifyEqual(className, "arrow.type.StructType"); + tc.verifyEqual(type.ID, arrow.type.ID.Struct); + end + + function TooFewInputsError(tc) + % Verify arrow.struct() errors if given zero input arguments. + fcn = @() arrow.struct(); + tc.verifyError(fcn, "arrow:struct:TooFewInputs"); + end + + function InvalidInputTypeError(tc) + % Verify arrow.struct() errors if any one of the input + % arguments is not an arrow.type.Field object. + fcn = @() arrow.struct(1); + tc.verifyError(fcn, "MATLAB:validation:UnableToConvert"); + end + + function EmptyFieldError(tc) + % Verify arrow.struct() errors if given an empty + % arrow.type.Field array as one of its inputs. + fcn = @() arrow.struct(tc.Field1, arrow.type.Field.empty(0, 0)); + tc.verifyError(fcn, "MATLAB:validators:mustBeNonempty"); + end + + function NumFieldsGetter(tc) + % Verify the NumFields getter returns the expected value. + type = arrow.struct(tc.Field1); + tc.verifyEqual(type.NumFields, int32(1)); + + type = arrow.struct(tc.Field1, tc.Field2); + tc.verifyEqual(type.NumFields, int32(2)); + + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + tc.verifyEqual(type.NumFields, int32(3)); + end + + function NumFieldsNoSetter(tc) + % Verify the NumFields property is not settable. + type = arrow.struct(tc.Field1); + fcn = @() setfield(type, "NumFields", 20); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function FieldsGetter(tc) + % Verify the Fields getter returns the expected + % arrow.type.Field array. + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + actual = type.Fields; + expected = [tc.Field1, tc.Field2, tc.Field3]; + tc.verifyEqual(actual, expected); + end + + function FieldsNoSetter(tc) + % Verify the Fields property is not settable. + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + fcn = @() setfield(type, "Fields", tc.Field3); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function IDGetter(tc) + % Verify the ID getter returns the expected enum value. + type = arrow.struct(tc.Field1); + actual = type.ID; + expected = arrow.type.ID.Struct; + tc.verifyEqual(actual, expected); + end + + function IDNoSetter(tc) + % Verify the ID property is not settable. 
+ type = arrow.struct(tc.Field1); + fcn = @() setfield(type, "ID", arrow.type.ID.Boolean); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function FieldMethod(tc) + % Verify the field method returns the expected arrow.type.Field + % with respect to the index provided. + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + + % Extract the 1st field + actual1 = type.field(1); + expected1 = tc.Field1; + tc.verifyEqual(actual1, expected1); + + % Extract the 2nd field + actual2 = type.field(2); + expected2 = tc.Field2; + tc.verifyEqual(actual2, expected2); + + % Extract the 3rd field + actual3 = type.field(3); + expected3 = tc.Field3; + tc.verifyEqual(actual3, expected3); + end + + function FieldIndexOutOfRangeError(tc) + % Verify field() throws an error if provided an index that + % exceeds NumFields. + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + fcn = @() type.field(100); + tc.verifyError(fcn, "arrow:index:OutOfRange"); + end + + function FieldIndexNonScalarError(tc) + % Verify field() throws an error if provided a nonscalar array + % of indices. + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + fcn = @() type.field([1 2]); + tc.verifyError(fcn, "arrow:badsubscript:NonScalar"); + end + + function FieldIndexNonNumberError(tc) + % Verify field() throws an error if not provided a number as + % the index. + + type = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + fcn = @() type.field("A"); + tc.verifyError(fcn, "arrow:badsubscript:NonNumeric"); + end + + function IsEqualTrue(tc) + % Verify two StructTypes are considered equal if their Fields + % properties are equal. + + type1 = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + type2 = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + + tc.verifyTrue(isequal(type1, type2)); + tc.verifyTrue(isequal(type1, type2, type2, type1)); + + % Non-scalar arrow.type.StructType arrays + type3 = [type1 type2]; + type4 = [type1 type2]; + tc.verifyTrue(isequal(type3, type4)); + end + + function IsEqualFalse(tc) + % Verify isequal returns false when expected. + type1 = arrow.struct(tc.Field1, tc.Field2, tc.Field3); + type2 = arrow.struct(tc.Field1, tc.Field2); + type3 = arrow.struct(tc.Field1, tc.Field3, tc.Field2); + + % Fields properties have different lengths + tc.verifyFalse(isequal(type1, type2)); + + % The corresponding elements in the Fields arrays are not equal + tc.verifyFalse(isequal(type1, type3)); + + % Non-scalar arrow.type.StructType arrays + type4 = [type1 type2]; + type5 = [type1; type2]; + type6 = [type1 type2]; + type7 = [type1 type3]; + tc.verifyFalse(isequal(type4, type5)); + tc.verifyFalse(isequal(type6, type7)); + + end + end +end \ No newline at end of file diff --git a/matlab/test/arrow/type/tTypeDisplay.m b/matlab/test/arrow/type/tTypeDisplay.m index 4d3c023da71ab..f84c5ab56e270 100644 --- a/matlab/test/arrow/type/tTypeDisplay.m +++ b/matlab/test/arrow/type/tTypeDisplay.m @@ -189,7 +189,7 @@ function TestDateType(testCase, DateType) testCase.verifyEqual(actualDisplay, expectedDisplay); end - function Display(testCase) + function TimestampTypeDisplay(testCase) % Verify the display of TimestampType objects. % % Example: @@ -211,6 +211,32 @@ function Display(testCase) actualDisplay = evalc('disp(type)'); testCase.verifyEqual(actualDisplay, expectedDisplay); end + + function StructTypeDisplay(testCase) + % Verify the display of StructType objects. 
+ % + % Example: + % + % StructType with properties: + % + % ID: Struct + % Fields: [1x2 arrow.type.Field] + + fieldA = arrow.field("A", arrow.int32()); + fieldB = arrow.field("B", arrow.timestamp(TimeZone="America/Anchorage")); + type = arrow.struct(fieldA, fieldB); %#ok + classnameLink = makeLinkString(FullClassName="arrow.type.StructType", ClassName="StructType", BoldFont=true); + header = " " + classnameLink + " with properties:" + newline; + body = strjust(pad(["ID:"; "Fields:"])); + dimensionString = makeDimensionString([1 2]); + fieldString = compose("[%s %s]", dimensionString, "arrow.type.Field"); + body = body + " " + ["Struct"; fieldString]; + body = " " + body; + footer = string(newline); + expectedDisplay = char(strjoin([header body' footer], newline)); + actualDisplay = evalc('disp(type)'); + testCase.verifyEqual(actualDisplay, expectedDisplay); + end end methods diff --git a/matlab/test/arrow/type/traits/tStructTraits.m b/matlab/test/arrow/type/traits/tStructTraits.m new file mode 100644 index 0000000000000..6a97b1e1852d6 --- /dev/null +++ b/matlab/test/arrow/type/traits/tStructTraits.m @@ -0,0 +1,31 @@ +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. 
+ +classdef tStructTraits < hTypeTraits + + properties + TraitsConstructor = @arrow.type.traits.StructTraits + ArrayConstructor = missing + ArrayClassName = missing + ArrayProxyClassName = missing + ArrayStaticConstructor = missing + TypeConstructor = @arrow.type.StructType + TypeClassName = "arrow.type.StructType" + TypeProxyClassName = "arrow.type.proxy.StructType" + MatlabConstructor = missing + MatlabClassName = missing + end + +end \ No newline at end of file diff --git a/matlab/test/arrow/type/traits/ttraits.m b/matlab/test/arrow/type/traits/ttraits.m index cdc5990ed03ba..2880645f2957c 100644 --- a/matlab/test/arrow/type/traits/ttraits.m +++ b/matlab/test/arrow/type/traits/ttraits.m @@ -199,6 +199,18 @@ function TestDate64(testCase) testCase.verifyEqual(actualTraits, expectedTraits); end + function TestStruct(testCase) + import arrow.type.traits.* + import arrow.type.* + + type = ID.Struct; + expectedTraits = StructTraits(); + + actualTraits = traits(type); + + testCase.verifyEqual(actualTraits, expectedTraits); + end + function TestMatlabUInt8(testCase) import arrow.type.traits.* diff --git a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake index b5c480d6a68e7..40c6b5a51d4fe 100644 --- a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake +++ b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake @@ -65,6 +65,7 @@ set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_SOURCES "${CMAKE_SOURCE_DIR}/src/cpp/a "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/time_type.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/time32_type.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/time64_type.cc" + "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/struct_type.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/field.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/wrap.cc" "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/feather/proxy/writer.cc" From 6a3c6a78e25b808174c71a9460f036b9f80390c0 Mon Sep 17 00:00:00 2001 From: James Duong Date: Mon, 18 Sep 2023 11:51:50 -0700 Subject: [PATCH 29/96] GH-37704: [Java] Add schema IPC serialization methods (#37778) ### Rationale for this change The methods in the Schema class in Java serialize the schema in a way that is inconsistent with other languages. This can cause integration issues when using a Java-serialized schema in another language via IPC or Arrow Flight or vice-versa. ### What changes are included in this PR? Added the serializeAsMessage() and deserializeMessage() methods to Schema for serializing Schemas to the standard IPC format (wrapped in an IPC Message). Marked the methods that serialize to raw FlatBuffer objects as deprecated (toByteArray() and deserialize()). ### Are these changes tested? Yes. ### Are there any user-facing changes? No. 
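**Example Usage** A minimal sketch of the intended round trip (the wrapper class and the two fields here are illustrative; only `serializeAsMessage()` and `deserializeMessage()` are the APIs added in this PR):

```java
import static java.util.Arrays.asList;

import java.nio.ByteBuffer;

import org.apache.arrow.vector.types.Types.MinorType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaIpcRoundTrip {
  public static void main(String[] args) {
    Schema schema = new Schema(asList(
        Field.nullable("a", MinorType.BIGINT.getType()),
        Field.nullable("b", MinorType.VARCHAR.getType())));

    // Serialize wrapped in an IPC Message, so other languages can read it.
    byte[] bytes = schema.serializeAsMessage();

    // Rebuild the Schema from the IPC message bytes.
    Schema roundTripped = Schema.deserializeMessage(ByteBuffer.wrap(bytes));
    System.out.println(schema.equals(roundTripped)); // prints: true
  }
}
```

Because the bytes are a standard IPC Message, the same buffer should be readable by other Arrow implementations, which is the point of this change.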
* Closes: #37704 Authored-by: James Duong Signed-off-by: David Li --- .../flight/perf/PerformanceTestServer.java | 3 +- .../apache/arrow/flight/perf/TestPerf.java | 3 +- .../arrow/vector/types/pojo/Schema.java | 44 +++++++++++++++++++ .../arrow/vector/types/pojo/TestSchema.java | 39 ++++++++++++++++ 4 files changed, 86 insertions(+), 3 deletions(-) diff --git a/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/PerformanceTestServer.java b/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/PerformanceTestServer.java index 319aee445dca6..0ded2f7065f9c 100644 --- a/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/PerformanceTestServer.java +++ b/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/PerformanceTestServer.java @@ -18,7 +18,6 @@ package org.apache.arrow.flight.perf; import java.io.IOException; -import java.nio.ByteBuffer; import java.util.ArrayList; import java.util.List; import java.util.concurrent.ExecutorService; @@ -115,7 +114,7 @@ public void getStream(CallContext context, Ticket ticket, try { Token token = Token.parseFrom(ticket.getBytes()); Perf perf = token.getDefinition(); - Schema schema = Schema.deserialize(ByteBuffer.wrap(perf.getSchema().toByteArray())); + Schema schema = Schema.deserializeMessage(perf.getSchema().asReadOnlyByteBuffer()); root = VectorSchemaRoot.create(schema, allocator); BigIntVector a = (BigIntVector) root.getVector("a"); BigIntVector b = (BigIntVector) root.getVector("b"); diff --git a/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/TestPerf.java b/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/TestPerf.java index a7af8b713097d..17c83c205feb0 100644 --- a/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/TestPerf.java +++ b/java/flight/flight-core/src/test/java/org/apache/arrow/flight/perf/TestPerf.java @@ -65,7 +65,8 @@ public static FlightDescriptor getPerfFlightDescriptor(long recordCount, int rec Field.nullable("d", MinorType.BIGINT.getType()) )); - ByteString serializedSchema = ByteString.copyFrom(pojoSchema.toByteArray()); + byte[] bytes = pojoSchema.serializeAsMessage(); + ByteString serializedSchema = ByteString.copyFrom(bytes); return FlightDescriptor.command(Perf.newBuilder() .setRecordsPerStream(recordCount) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java index 2b3db1fb7de43..dcffea0ef5367 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -20,9 +20,11 @@ import static org.apache.arrow.vector.types.pojo.Field.convertField; +import java.io.ByteArrayOutputStream; import java.io.IOException; import java.nio.ByteBuffer; import java.nio.ByteOrder; +import java.nio.channels.Channels; import java.util.AbstractMap; import java.util.ArrayList; import java.util.Collections; @@ -36,7 +38,10 @@ import org.apache.arrow.flatbuf.KeyValue; import org.apache.arrow.util.Collections2; import org.apache.arrow.util.Preconditions; +import org.apache.arrow.vector.ipc.ReadChannel; +import org.apache.arrow.vector.ipc.WriteChannel; import org.apache.arrow.vector.ipc.message.FBSerializables; +import org.apache.arrow.vector.ipc.message.MessageSerializer; import com.fasterxml.jackson.annotation.JsonCreator; import com.fasterxml.jackson.annotation.JsonIgnore; @@ -47,6 +52,7 @@ import 
com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.ObjectReader; import com.fasterxml.jackson.databind.ObjectWriter; +import com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream; import com.google.flatbuffers.FlatBufferBuilder; /** @@ -83,10 +89,30 @@ public static Schema fromJSON(String json) throws IOException { return reader.readValue(Preconditions.checkNotNull(json)); } + /** + * Deserialize a schema that has been serialized using {@link #toByteArray()}. + * @param buffer the bytes to deserialize. + * @return The deserialized schema. + */ + @Deprecated public static Schema deserialize(ByteBuffer buffer) { return convertSchema(org.apache.arrow.flatbuf.Schema.getRootAsSchema(buffer)); } + /** + * Deserialize a schema that has been serialized as a message using {@link #serializeAsMessage()}. + * @param buffer the bytes to deserialize. + * @return The deserialized schema. + */ + public static Schema deserializeMessage(ByteBuffer buffer) { + ByteBufferBackedInputStream stream = new ByteBufferBackedInputStream(buffer); + try (ReadChannel channel = new ReadChannel(Channels.newChannel(stream))) { + return MessageSerializer.deserializeSchema(channel); + } catch (IOException ex) { + throw new RuntimeException(ex); + } + } + /** Converts a flatbuffer schema to its POJO representation. */ public static Schema convertSchema(org.apache.arrow.flatbuf.Schema schema) { List<Field> fields = new ArrayList<>(); @@ -217,9 +243,27 @@ public int getSchema(FlatBufferBuilder builder) { return org.apache.arrow.flatbuf.Schema.endSchema(builder); } + /** + * Returns the serialized flatbuffer bytes of the schema wrapped in a message table. + * Use {@link #deserializeMessage()} to rebuild the Schema. + */ + public byte[] serializeAsMessage() { + ByteArrayOutputStream out = new ByteArrayOutputStream(); + try (WriteChannel channel = new WriteChannel(Channels.newChannel(out))) { + MessageSerializer.serialize(channel, this); + return out.toByteArray(); + } catch (IOException ex) { + throw new RuntimeException(ex); + } + } + /** * Returns the serialized flatbuffer representation of this schema. + * @deprecated This method does not encapsulate the schema in a Message payload, which makes it incompatible with other + * languages. Use {@link #serializeAsMessage()} instead. 
*/ + @Deprecated public byte[] toByteArray() { FlatBufferBuilder builder = new FlatBufferBuilder(); int schemaOffset = this.getSchema(builder); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index 0e5375865a8bd..7b62247c6e12d 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -24,6 +24,7 @@ import static org.junit.Assert.assertTrue; import java.io.IOException; +import java.nio.ByteBuffer; import java.util.HashMap; import java.util.Map; @@ -216,6 +217,35 @@ public void testMetadata() throws IOException { contains(schema, "\"" + METADATA_KEY + "\" : \"testKey\"", "\"" + METADATA_VALUE + "\" : \"testValue\""); } + @Test + public void testMessageSerialization() { + Schema schema = new Schema(asList( + field("a", false, new Null()), + field("b", new Struct(), field("ba", new Null())), + field("c", new List(), field("ca", new Null())), + field("d", new Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new Null())), + field("e", new Int(8, true)), + field("f", new FloatingPoint(FloatingPointPrecision.SINGLE)), + field("g", new Utf8()), + field("h", new Binary()), + field("i", new Bool()), + field("j", new Decimal(5, 5, 128)), + field("k", new Date(DateUnit.DAY)), + field("l", new Date(DateUnit.MILLISECOND)), + field("m", new Time(TimeUnit.SECOND, 32)), + field("n", new Time(TimeUnit.MILLISECOND, 32)), + field("o", new Time(TimeUnit.MICROSECOND, 64)), + field("p", new Time(TimeUnit.NANOSECOND, 64)), + field("q", new Timestamp(TimeUnit.MILLISECOND, "UTC")), + field("r", new Timestamp(TimeUnit.MICROSECOND, null)), + field("s", new Interval(IntervalUnit.DAY_TIME)), + field("t", new FixedSizeBinary(100)), + field("u", new Duration(TimeUnit.SECOND)), + field("v", new Duration(TimeUnit.MICROSECOND)) + )); + roundTripMessage(schema); + } + private void roundTrip(Schema schema) throws IOException { String json = schema.toJson(); Schema actual = Schema.fromJSON(json); @@ -225,6 +255,15 @@ private void roundTrip(Schema schema) throws IOException { assertEquals(schema.hashCode(), actual.hashCode()); } + private void roundTripMessage(Schema schema) { + byte[] bytes = schema.serializeAsMessage(); + Schema actual = Schema.deserializeMessage(ByteBuffer.wrap(bytes)); + assertEquals(schema.toJson(), actual.toJson()); + assertEquals(schema, actual); + validateFieldsHashcode(schema.getFields(), actual.getFields()); + assertEquals(schema.hashCode(), actual.hashCode()); + } + private void validateFieldsHashcode(java.util.List schemaFields, java.util.List actualFields) { assertEquals(schemaFields.size(), actualFields.size()); if (schemaFields.size() == 0) { From efbabd3e9583a07cc44e13de84772e9de4f6953c Mon Sep 17 00:00:00 2001 From: Austin Dickey Date: Mon, 18 Sep 2023 15:39:11 -0500 Subject: [PATCH 30/96] GH-37771: [Go][Benchmarking] Update Conbench git info (#37772) This PR fixes the Go benchmarking script to comply with the new Conbench server and client versions. Fixes https://github.com/apache/arrow/issues/37771. 
* Closes: #37771 Authored-by: Austin Dickey Signed-off-by: Matt Topol --- ci/scripts/go_bench_adapt.py | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/ci/scripts/go_bench_adapt.py b/ci/scripts/go_bench_adapt.py index e4eea5c17af37..a05e25de8bdd3 100644 --- a/ci/scripts/go_bench_adapt.py +++ b/ci/scripts/go_bench_adapt.py @@ -20,7 +20,7 @@ import uuid import logging from pathlib import Path -from typing import List, Optional, Dict +from typing import List from benchadapt import BenchmarkResult from benchadapt.adapters import BenchmarkAdapter @@ -33,9 +33,9 @@ # `github_commit_info` is meant to communicate GitHub-flavored commit # information to Conbench. See -# https://github.com/conbench/conbench/blob/7c4968e631ecdc064559c86a1174a1353713b700/benchadapt/python/benchadapt/result.py#L66 +# https://github.com/conbench/conbench/blob/cf7931f/benchadapt/python/benchadapt/result.py#L66 # for a specification. -github_commit_info: Optional[Dict] = None +github_commit_info = {"repository": "https://github.com/apache/arrow"} if os.environ.get("CONBENCH_REF") == "main": # Assume GitHub Actions CI. The environment variable lookups below are @@ -53,7 +53,7 @@ # This is probably a local dev environment, for testing. In this case, it # does usually not make sense to provide commit information (not a - # controlled CI environment). Explicitly keep `github_commit_info=None` to + # controlled CI environment). Explicitly leave out "commit" and "pr_number" to # reflect that (to not send commit information). # Reflect 'local dev' scenario in run_reason. Allow user to (optionally) @@ -114,10 +114,9 @@ def _transform_results(self) -> List[BenchmarkResult]: run_reason=run_reason, github=github_commit_info, ) - if github_commit_info is not None: - parsed.run_name = ( - f"{parsed.run_reason}: {github_commit_info['commit']}" - ) + parsed.run_name = ( + f"{parsed.run_reason}: {github_commit_info.get('commit')}" + ) parsed_results.append(parsed) return parsed_results From 440dc92caa73ca67c8ca98cebfb74f33788150bf Mon Sep 17 00:00:00 2001 From: Dewey Dunnington Date: Mon, 18 Sep 2023 22:50:33 +0200 Subject: [PATCH 31/96] GH-37576: [R] Use `SafeCallIntoR()` to call garbage collector after a failed allocation (#37565) ### Rationale for this change The `gc_memory_pool()` is the one we use almost everywhere in the R package. It uses a special allocation mechanism that calls into R to run the garbage collector after a failed allocation (in case there are any large objects that can be removed). In the case where an allocation happens on another thread (most of the time when running exec plans), the call into R may cause a crash: even though the memory pool was ensuring serialized access using a mutex, this is not sufficient for R (for reasons I don't understand). ### What changes are included in this PR? Use `SafeCallIntoR()` to run the garbage collector instead. This ensures that the calling thread is used for any call into R (or errors if this is not possible). ### Are these changes tested? Yes: there is an existing test that ensures this code path occurs at least once. ### Are there any user-facing changes? No. 
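For context, the retry logic described above reduces to the following pattern, sketched here with illustrative names (`GcRetryPool` and `run_gc` are not the R package's actual classes; only `SafeCallIntoRVoid` comes from this PR):

```cpp
#include <functional>
#include <utility>

// Stand-in for arrow::Status, reduced to success/failure for this sketch.
struct Status {
  bool ok;
};

// On a failed allocation, ask the host language to run its garbage
// collector through a callback that is safe to invoke from this thread
// (the role SafeCallIntoRVoid plays in the patch), then retry once.
class GcRetryPool {
 public:
  explicit GcRetryPool(std::function<Status()> run_gc)
      : run_gc_(std::move(run_gc)) {}

  Status AllocateWithRetry(const std::function<Status()>& allocate) {
    Status st = allocate();
    if (st.ok) {
      return st;
    }
    Status gc_status = run_gc_();  // e.g. SafeCallIntoRVoid([] { gc(); })
    if (!gc_status.ok) {
      return gc_status;  // calling into R was not possible; report that error
    }
    return allocate();  // second and final attempt
  }

 private:
  std::function<Status()> run_gc_;
};
```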
* Closes: #37576 Authored-by: Dewey Dunnington Signed-off-by: Dewey Dunnington --- r/src/memorypool.cpp | 14 +++++++------- r/tests/testthat/test-arrow.R | 2 +- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/r/src/memorypool.cpp b/r/src/memorypool.cpp index 027aa8ef2aa8d..696e913eadc70 100644 --- a/r/src/memorypool.cpp +++ b/r/src/memorypool.cpp @@ -16,8 +16,8 @@ // under the License. #include -#include #include "./arrow_types.h" +#include "./safe-call-into-r.h" class GcMemoryPool : public arrow::MemoryPool { public: @@ -59,17 +59,17 @@ class GcMemoryPool : public arrow::MemoryPool { if (call().ok()) { return arrow::Status::OK(); } else { - auto lock = mutex_.Lock(); - // ARROW-10080: Allocation may fail spuriously since the garbage collector is lazy. // Force it to run then try again in case any reusable allocations have been freed. - static cpp11::function gc = cpp11::package("base")["gc"]; - gc(); + arrow::Status r_call = SafeCallIntoRVoid([] { + cpp11::function gc = cpp11::package("base")["gc"]; + gc(); + }); + ARROW_RETURN_NOT_OK(r_call); + return call(); } - return call(); } - arrow::util::Mutex mutex_; arrow::MemoryPool* pool_; }; diff --git a/r/tests/testthat/test-arrow.R b/r/tests/testthat/test-arrow.R index 071a5ad3d982c..c6ae27ac52296 100644 --- a/r/tests/testthat/test-arrow.R +++ b/r/tests/testthat/test-arrow.R @@ -64,6 +64,6 @@ test_that("MemoryPool calls gc() to free memory when allocation fails (ARROW-100 on.exit(suppressMessages(untrace(gc))) # We expect this should fail because we don't have this much memory, # but it should gc() and retry (and fail again) - expect_error(BufferOutputStream$create(2**60)) + expect_error(BufferOutputStream$create(2**60), "Out of memory") expect_true(env$gc_was_called) }) From 42192d82cdd2a63bd3baf48ffbbda181d2c9cb97 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 19 Sep 2023 08:53:01 +0900 Subject: [PATCH 32/96] MINOR: [C#] Bump xunit from 2.5.0 to 2.5.1 in /csharp (#37775) Bumps [xunit](https://github.com/xunit/xunit) from 2.5.0 to 2.5.1.
Commits
  • 9dc851b v2.5.1
  • 86b0eef #2770: Make SerializationHelper public
  • f49dafc File re-sort for SerializationHelper & XunitSerializationInfo
  • d0004ae #2773: Add Assert.RaisesAny and Assert.RaisesAnyAsync non-generic for EventAr...
  • c1dba28 Latest assertions
  • b7c828e Move .editorconfig up into contentFiles
  • 08e74e7 Add .editorconfig to assert source NuGet package to indicate generated code (...
  • 79f411a #2767: Verify types match when comparing FileSystemInfo values (v2)
  • 9c28f58 Latest dependencies
  • 63dce4f #2767: Special case FileSystemInfo objects by just comparing the FullName in ...
  • Additional commits viewable in compare view

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=xunit&package-manager=nuget&previous-version=2.5.0&new-version=2.5.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@ dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) ---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@ dependabot rebase` will rebase this PR - `@ dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@ dependabot merge` will merge this PR after your CI passes on it - `@ dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@ dependabot cancel merge` will cancel a previously requested merge and block automerging - `@ dependabot reopen` will reopen this PR if it is closed - `@ dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@ dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@ dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@ dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@ dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Sutou Kouhei --- .../Apache.Arrow.Compression.Tests.csproj | 2 +- .../Apache.Arrow.Flight.Sql.Tests.csproj | 2 +- .../Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj | 2 +- csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj b/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj index e06e0f9ef2845..730099246edc9 100644 --- a/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj +++ b/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj @@ -8,7 +8,7 @@ - + diff --git a/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj b/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj index dff3f1e541c08..d829d02cfada6 100644 --- a/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj +++ b/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj @@ -7,7 +7,7 @@ - + diff --git a/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj b/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj index 7f5a726ee5f03..4d00208fbaf2e 100644 --- a/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj +++ b/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj @@ -7,7 +7,7 @@ - + diff --git a/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj b/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj index d6dc25d6b5e20..ccd630d279279 100644 --- a/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj +++ b/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj @@ -11,7 +11,7 @@ - + all runtime; build; native; contentfiles; analyzers From 96907837707f79483baf4f3eda00b09c7fcb246b Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 19 Sep 2023 08:54:25 +0900 Subject: [PATCH 33/96] MINOR: [C#] Bump Google.Protobuf from 3.24.2 to 3.24.3 in /csharp (#37776) Bumps [Google.Protobuf](https://github.com/protocolbuffers/protobuf) from 3.24.2 to 3.24.3.
Commits
  • ee13554 Updating version.json and repo version numbers to: 24.3
  • a19e9ab Merge pull request #13881 from honglooker/24.x
  • 6af84ee bring protobuf dep [upb] to parity for 24.3
  • 3465661 Silence warnings about extra semicolon in non-TSAN builds. (#13820)
  • 398a84c Workaround ICE on gcc 7.5 by not having one overload call the other one with ...
  • e5b5696 Merge pull request #13697 from protocolbuffers/24.x-202308252017
  • afecb47 Updating version.json to: 24.3-dev
  • See full diff in compare view

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Google.Protobuf&package-manager=nuget&previous-version=3.24.2&new-version=3.24.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@ dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) ---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@ dependabot rebase` will rebase this PR - `@ dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@ dependabot merge` will merge this PR after your CI passes on it - `@ dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@ dependabot cancel merge` will cancel a previously requested merge and block automerging - `@ dependabot reopen` will reopen this PR if it is closed - `@ dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@ dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@ dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@ dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@ dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Sutou Kouhei --- csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj b/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj index 0ffb2f0a8e518..3a3a7d406b128 100644 --- a/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj +++ b/csharp/src/Apache.Arrow.Flight/Apache.Arrow.Flight.csproj @@ -5,7 +5,7 @@ - + From 25fa89df2b55012b447b9c0348733570cb695566 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 19 Sep 2023 11:46:24 +0900 Subject: [PATCH 34/96] MINOR: [C#] Bump xunit.runner.visualstudio from 2.5.0 to 2.5.1 in /csharp (#37774) Bumps [xunit.runner.visualstudio](https://github.com/xunit/xunit) from 2.5.0 to 2.5.1.
Commits
  • 9dc851b v2.5.1
  • 86b0eef #2770: Make SerializationHelper public
  • f49dafc File re-sort for SerializationHelper & XunitSerializationInfo
  • d0004ae #2773: Add Assert.RaisesAny and Assert.RaisesAnyAsync non-generic for EventAr...
  • c1dba28 Latest assertions
  • b7c828e Move .editorconfig up into contentFiles
  • 08e74e7 Add .editorconfig to assert source NuGet package to indicate generated code (...
  • 79f411a #2767: Verify types match when comparing FileSystemInfo values (v2)
  • 9c28f58 Latest dependencies
  • 63dce4f #2767: Special case FileSystemInfo objects by just comparing the FullName in ...
  • Additional commits viewable in compare view

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=xunit.runner.visualstudio&package-manager=nuget&previous-version=2.5.0&new-version=2.5.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@ dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) ---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@ dependabot rebase` will rebase this PR - `@ dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@ dependabot merge` will merge this PR after your CI passes on it - `@ dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@ dependabot cancel merge` will cancel a previously requested merge and block automerging - `@ dependabot reopen` will reopen this PR if it is closed - `@ dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@ dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@ dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@ dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@ dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Sutou Kouhei --- .../Apache.Arrow.Compression.Tests.csproj | 2 +- .../Apache.Arrow.Flight.Sql.Tests.csproj | 2 +- .../Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj | 2 +- csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj b/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj index 730099246edc9..fc21b06ced689 100644 --- a/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj +++ b/csharp/test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj @@ -9,7 +9,7 @@ - + diff --git a/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj b/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj index d829d02cfada6..48ba93f58b973 100644 --- a/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj +++ b/csharp/test/Apache.Arrow.Flight.Sql.Tests/Apache.Arrow.Flight.Sql.Tests.csproj @@ -8,7 +8,7 @@ - + diff --git a/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj b/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj index 4d00208fbaf2e..6dd816ac73e86 100644 --- a/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj +++ b/csharp/test/Apache.Arrow.Flight.Tests/Apache.Arrow.Flight.Tests.csproj @@ -8,7 +8,7 @@ - + diff --git a/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj b/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj index ccd630d279279..e7af9e2246276 100644 --- a/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj +++ b/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj @@ -12,7 +12,7 @@ - + all runtime; build; native; contentfiles; analyzers From 3b646ad4c2b826fe08b31d19e6435f73650bcb5e Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Tue, 19 Sep 2023 16:41:31 +0200 Subject: [PATCH 35/96] GH-37537: [Integration][C++] Add C Data Interface integration testing (#37769) ### Rationale for this change Currently there are no systematic integration tests between implementations of the C Data Interface, only a couple ad-hoc tests. ### What changes are included in this PR? 1. Add Archery infrastructure for integration testing of the C Data Interface 2. Add implementation of this interface for Arrow C++ ### Are these changes tested? Yes, by construction. ### Are there any user-facing changes? No. 
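**Example Usage** A sketch of how a harness could drive the new C++ entrypoints for one schema round trip (in the real setup, Archery resolves these symbols from separately built producer and consumer libraries; the `main` wrapper and the JSON path below are illustrative):

```cpp
#include <cstdio>
#include <cstdlib>

#include "arrow/c/abi.h"
#include "arrow/integration/c_data_integration_internal.h"

int main(int argc, char** argv) {
  // Path to an integration "gold" JSON file; the default name is illustrative.
  const char* json_path = argc > 1 ? argv[1] : "generated_primitive.json";

  ArrowSchema c_schema;
  // Producer side: read the JSON file and export the schema through the
  // C Data Interface. A non-null return value is an error message.
  if (const char* err =
          ArrowCpp_CDataIntegration_ExportSchemaFromJson(json_path, &c_schema)) {
    std::fprintf(stderr, "export failed: %s\n", err);
    return EXIT_FAILURE;
  }
  // Consumer side: import the exported schema (which releases it) and
  // compare it against the same JSON file.
  if (const char* err = ArrowCpp_CDataIntegration_ImportSchemaAndCompareToJson(
          json_path, &c_schema)) {
    std::fprintf(stderr, "import/compare failed: %s\n", err);
    return EXIT_FAILURE;
  }
  std::printf("schema round trip OK, bytes still allocated: %lld\n",
              static_cast<long long>(ArrowCpp_BytesAllocated()));
  return EXIT_SUCCESS;
}
```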
* Closes: #37537 Lead-authored-by: Antoine Pitrou Co-authored-by: Antoine Pitrou Co-authored-by: Will Jones Signed-off-by: Antoine Pitrou --- ci/scripts/integration_arrow.sh | 6 +- cpp/src/arrow/CMakeLists.txt | 6 +- .../c_data_integration_internal.cc | 145 ++++++++++ .../integration/c_data_integration_internal.h | 48 ++++ cpp/src/arrow/integration/json_integration.cc | 7 +- cpp/src/arrow/symbols.map | 1 + dev/archery/archery/cli.py | 21 +- dev/archery/archery/integration/cdata.py | 107 +++++++ dev/archery/archery/integration/datagen.py | 110 +++++--- dev/archery/archery/integration/runner.py | 267 ++++++++++++++---- dev/archery/archery/integration/scenario.py | 7 +- dev/archery/archery/integration/tester.py | 183 +++++++++++- dev/archery/archery/integration/tester_cpp.py | 114 +++++++- dev/archery/archery/integration/util.py | 4 +- dev/archery/setup.py | 11 +- 15 files changed, 913 insertions(+), 124 deletions(-) create mode 100644 cpp/src/arrow/integration/c_data_integration_internal.cc create mode 100644 cpp/src/arrow/integration/c_data_integration_internal.h create mode 100644 dev/archery/archery/integration/cdata.py diff --git a/ci/scripts/integration_arrow.sh b/ci/scripts/integration_arrow.sh index 30cbb2d63791c..a165f8027bf8f 100755 --- a/ci/scripts/integration_arrow.sh +++ b/ci/scripts/integration_arrow.sh @@ -22,10 +22,12 @@ set -ex arrow_dir=${1} gold_dir=$arrow_dir/testing/data/arrow-ipc-stream/integration -pip install -e $arrow_dir/dev/archery +pip install -e $arrow_dir/dev/archery[integration] # Rust can be enabled by exporting ARCHERY_INTEGRATION_WITH_RUST=1 -archery integration \ +time archery integration \ + --run-c-data \ + --run-ipc \ --run-flight \ --with-cpp=1 \ --with-csharp=1 \ diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index f474d0c517fa0..9a6117011535e 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -383,7 +383,11 @@ endif() # if(ARROW_BUILD_INTEGRATION OR ARROW_BUILD_TESTS) - list(APPEND ARROW_SRCS integration/json_integration.cc integration/json_internal.cc) + list(APPEND + ARROW_SRCS + integration/c_data_integration_internal.cc + integration/json_integration.cc + integration/json_internal.cc) endif() if(ARROW_CSV) diff --git a/cpp/src/arrow/integration/c_data_integration_internal.cc b/cpp/src/arrow/integration/c_data_integration_internal.cc new file mode 100644 index 0000000000000..79e09eaf91a39 --- /dev/null +++ b/cpp/src/arrow/integration/c_data_integration_internal.cc @@ -0,0 +1,145 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/integration/c_data_integration_internal.h" + +#include +#include + +#include "arrow/c/bridge.h" +#include "arrow/integration/json_integration.h" +#include "arrow/io/file.h" +#include "arrow/memory_pool.h" +#include "arrow/pretty_print.h" +#include "arrow/record_batch.h" +#include "arrow/result.h" +#include "arrow/status.h" +#include "arrow/type.h" +#include "arrow/type_fwd.h" +#include "arrow/util/logging.h" + +namespace arrow::internal::integration { +namespace { + +template +const char* StatusToErrorString(Func&& func) { + static std::string error; + + Status st = func(); + if (st.ok()) { + return nullptr; + } + error = st.ToString(); + ARROW_CHECK_GT(error.length(), 0); + return error.c_str(); +} + +Result> ReadSchemaFromJson(const std::string& json_path, + MemoryPool* pool) { + ARROW_ASSIGN_OR_RAISE(auto file, io::ReadableFile::Open(json_path, pool)); + ARROW_ASSIGN_OR_RAISE(auto reader, IntegrationJsonReader::Open(pool, file)); + return reader->schema(); +} + +Result> ReadBatchFromJson(const std::string& json_path, + int num_batch, MemoryPool* pool) { + ARROW_ASSIGN_OR_RAISE(auto file, io::ReadableFile::Open(json_path, pool)); + ARROW_ASSIGN_OR_RAISE(auto reader, IntegrationJsonReader::Open(pool, file)); + return reader->ReadRecordBatch(num_batch); +} + +// XXX ideally, we should allow use of a custom memory pool in the C bridge API, +// but that requires non-trivial refactor + +Status ExportSchemaFromJson(std::string json_path, ArrowSchema* out) { + auto pool = default_memory_pool(); + ARROW_ASSIGN_OR_RAISE(auto schema, ReadSchemaFromJson(json_path, pool)); + return ExportSchema(*schema, out); +} + +Status ImportSchemaAndCompareToJson(std::string json_path, ArrowSchema* c_schema) { + auto pool = default_memory_pool(); + ARROW_ASSIGN_OR_RAISE(auto json_schema, ReadSchemaFromJson(json_path, pool)); + ARROW_ASSIGN_OR_RAISE(auto imported_schema, ImportSchema(c_schema)); + if (!imported_schema->Equals(json_schema, /*check_metadata=*/true)) { + return Status::Invalid("Schemas are different:", "\n- Json Schema: ", *json_schema, + "\n- Imported Schema: ", *imported_schema); + } + return Status::OK(); +} + +Status ExportBatchFromJson(std::string json_path, int num_batch, ArrowArray* out) { + auto pool = default_memory_pool(); + ARROW_ASSIGN_OR_RAISE(auto batch, ReadBatchFromJson(json_path, num_batch, pool)); + return ExportRecordBatch(*batch, out); +} + +Status ImportBatchAndCompareToJson(std::string json_path, int num_batch, + ArrowArray* c_batch) { + auto pool = default_memory_pool(); + ARROW_ASSIGN_OR_RAISE(auto batch, ReadBatchFromJson(json_path, num_batch, pool)); + ARROW_ASSIGN_OR_RAISE(auto imported_batch, ImportRecordBatch(c_batch, batch->schema())); + RETURN_NOT_OK(imported_batch->ValidateFull()); + if (!imported_batch->Equals(*batch, /*check_metadata=*/true)) { + std::stringstream pp_expected; + std::stringstream pp_actual; + PrettyPrintOptions options(/*indent=*/2); + options.window = 50; + ARROW_CHECK_OK(PrettyPrint(*batch, options, &pp_expected)); + ARROW_CHECK_OK(PrettyPrint(*imported_batch, options, &pp_actual)); + return Status::Invalid("Record Batches are different:", "\n- Json Batch: ", + pp_expected.str(), "\n- Imported Batch: ", pp_actual.str()); + } + return Status::OK(); +} + +} // namespace +} // namespace arrow::internal::integration + +const char* ArrowCpp_CDataIntegration_ExportSchemaFromJson(const char* json_path, + ArrowSchema* out) { + using namespace arrow::internal::integration; // NOLINT(build/namespaces) + return StatusToErrorString([=]() { 
return ExportSchemaFromJson(json_path, out); }); +} + +const char* ArrowCpp_CDataIntegration_ImportSchemaAndCompareToJson(const char* json_path, + ArrowSchema* schema) { + using namespace arrow::internal::integration; // NOLINT(build/namespaces) + return StatusToErrorString( + [=]() { return ImportSchemaAndCompareToJson(json_path, schema); }); +} + +const char* ArrowCpp_CDataIntegration_ExportBatchFromJson(const char* json_path, + int num_batch, + ArrowArray* out) { + using namespace arrow::internal::integration; // NOLINT(build/namespaces) + return StatusToErrorString( + [=]() { return ExportBatchFromJson(json_path, num_batch, out); }); +} + +const char* ArrowCpp_CDataIntegration_ImportBatchAndCompareToJson(const char* json_path, + int num_batch, + ArrowArray* batch) { + using namespace arrow::internal::integration; // NOLINT(build/namespaces) + return StatusToErrorString( + [=]() { return ImportBatchAndCompareToJson(json_path, num_batch, batch); }); +} + +int64_t ArrowCpp_BytesAllocated() { + auto pool = arrow::default_memory_pool(); + return pool->bytes_allocated(); +} diff --git a/cpp/src/arrow/integration/c_data_integration_internal.h b/cpp/src/arrow/integration/c_data_integration_internal.h new file mode 100644 index 0000000000000..0a62363dffab3 --- /dev/null +++ b/cpp/src/arrow/integration/c_data_integration_internal.h @@ -0,0 +1,48 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include "arrow/c/abi.h" +#include "arrow/util/visibility.h" + +// This file only serves as documentation for the C Data Interface integration +// entrypoints. The actual functions are called by Archery through DLL symbol lookup. 
+ +extern "C" { + +ARROW_EXPORT +const char* ArrowCpp_CDataIntegration_ExportSchemaFromJson(const char* json_path, + ArrowSchema* out); + +ARROW_EXPORT +const char* ArrowCpp_CDataIntegration_ImportSchemaAndCompareToJson(const char* json_path, + ArrowSchema* schema); + +ARROW_EXPORT +const char* ArrowCpp_CDataIntegration_ExportBatchFromJson(const char* json_path, + int num_batch, ArrowArray* out); + +ARROW_EXPORT +const char* ArrowCpp_CDataIntegration_ImportBatchAndCompareToJson(const char* json_path, + int num_batch, + ArrowArray* batch); + +ARROW_EXPORT +int64_t ArrowCpp_BytesAllocated(); + +} // extern "C" diff --git a/cpp/src/arrow/integration/json_integration.cc b/cpp/src/arrow/integration/json_integration.cc index 178abe5e8b687..590f6eddd7c24 100644 --- a/cpp/src/arrow/integration/json_integration.cc +++ b/cpp/src/arrow/integration/json_integration.cc @@ -144,10 +144,9 @@ class IntegrationJsonReader::Impl { } Result> ReadRecordBatch(int i) { - DCHECK_GE(i, 0) << "i out of bounds"; - DCHECK_LT(i, static_cast(record_batches_->GetArray().Size())) - << "i out of bounds"; - + if (i < 0 || i >= static_cast(record_batches_->GetArray().Size())) { + return Status::IndexError("record batch index ", i, " out of bounds"); + } return json::ReadRecordBatch(record_batches_->GetArray()[i], schema_, &dictionary_memo_, pool_); } diff --git a/cpp/src/arrow/symbols.map b/cpp/src/arrow/symbols.map index 9ef0e404bc091..0144e6116554b 100644 --- a/cpp/src/arrow/symbols.map +++ b/cpp/src/arrow/symbols.map @@ -32,6 +32,7 @@ }; # Also export C-level helpers arrow_*; + Arrow*; # ARROW-14771: export Protobuf symbol table descriptor_table_Flight_2eproto; descriptor_table_FlightSql_2eproto; diff --git a/dev/archery/archery/cli.py b/dev/archery/archery/cli.py index 70f865cc2fa70..7a3b45f9788e6 100644 --- a/dev/archery/archery/cli.py +++ b/dev/archery/archery/cli.py @@ -723,8 +723,12 @@ def _set_default(opt, default): envvar="ARCHERY_INTEGRATION_WITH_RUST") @click.option('--write_generated_json', default="", help='Generate test JSON to indicated path') +@click.option('--run-ipc', is_flag=True, default=False, + help='Run IPC integration tests') @click.option('--run-flight', is_flag=True, default=False, help='Run Flight integration tests') +@click.option('--run-c-data', is_flag=True, default=False, + help='Run C Data Interface integration tests') @click.option('--debug', is_flag=True, default=False, help='Run executables in debug mode as relevant') @click.option('--serial', is_flag=True, default=False, @@ -753,15 +757,19 @@ def integration(with_all=False, random_seed=12345, **args): gen_path = args['write_generated_json'] languages = ['cpp', 'csharp', 'java', 'js', 'go', 'rust'] + formats = ['ipc', 'flight', 'c_data'] enabled_languages = 0 for lang in languages: - param = 'with_{}'.format(lang) + param = f'with_{lang}' if with_all: args[param] = with_all + enabled_languages += args[param] - if args[param]: - enabled_languages += 1 + enabled_formats = 0 + for fmt in formats: + param = f'run_{fmt}' + enabled_formats += args[param] if gen_path: # XXX See GH-37575: this option is only used by the JS test suite @@ -769,8 +777,13 @@ def integration(with_all=False, random_seed=12345, **args): os.makedirs(gen_path, exist_ok=True) write_js_test_json(gen_path) else: + if enabled_formats == 0: + raise click.UsageError( + "Need to enable at least one format to test " + "(IPC, Flight, C Data Interface); try --help") if enabled_languages == 0: - raise Exception("Must enable at least 1 language to test") + raise click.UsageError( + 
"Need to enable at least one language to test; try --help") run_all_tests(**args) diff --git a/dev/archery/archery/integration/cdata.py b/dev/archery/archery/integration/cdata.py new file mode 100644 index 0000000000000..c201f5f867f8f --- /dev/null +++ b/dev/archery/archery/integration/cdata.py @@ -0,0 +1,107 @@ +# licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import cffi +from contextlib import contextmanager +import functools + +from .tester import CDataExporter, CDataImporter + + +_c_data_decls = """ + struct ArrowSchema { + // Array type description + const char* format; + const char* name; + const char* metadata; + int64_t flags; + int64_t n_children; + struct ArrowSchema** children; + struct ArrowSchema* dictionary; + + // Release callback + void (*release)(struct ArrowSchema*); + // Opaque producer-specific data + void* private_data; + }; + + struct ArrowArray { + // Array data description + int64_t length; + int64_t null_count; + int64_t offset; + int64_t n_buffers; + int64_t n_children; + const void** buffers; + struct ArrowArray** children; + struct ArrowArray* dictionary; + + // Release callback + void (*release)(struct ArrowArray*); + // Opaque producer-specific data + void* private_data; + }; + + struct ArrowArrayStream { + int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out); + int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out); + + const char* (*get_last_error)(struct ArrowArrayStream*); + + // Release callback + void (*release)(struct ArrowArrayStream*); + // Opaque producer-specific data + void* private_data; + }; + """ + + +@functools.lru_cache +def ffi() -> cffi.FFI: + """ + Return a FFI object supporting C Data Interface types. + """ + ffi = cffi.FFI() + ffi.cdef(_c_data_decls) + return ffi + + +@contextmanager +def check_memory_released(exporter: CDataExporter, importer: CDataImporter): + """ + A context manager for memory release checks. + + The context manager arranges cooperation between the exporter and importer + to try and release memory at the end of the enclosed block. + + However, if either the exporter or importer doesn't support deterministic + memory release, no memory check is performed. + """ + do_check = (exporter.supports_releasing_memory and + importer.supports_releasing_memory) + if do_check: + before = exporter.record_allocation_state() + yield + # We don't use a `finally` clause: if the enclosed block raised an + # exception, no need to add another one. 
diff --git a/dev/archery/archery/integration/datagen.py b/dev/archery/archery/integration/datagen.py
index f924c8a73cb8a..53f7ba58bff99 100644
--- a/dev/archery/archery/integration/datagen.py
+++ b/dev/archery/archery/integration/datagen.py
@@ -25,6 +25,7 @@
 import numpy as np
 
 from .util import frombytes, tobytes, random_bytes, random_utf8
+from .util import SKIP_C_SCHEMA, SKIP_C_ARRAY
 
 
 def metadata_key_values(pairs):
@@ -1224,15 +1225,16 @@ def get_json(self):
 class File(object):
 
     def __init__(self, name, schema, batches, dictionaries=None,
-                 skip=None, path=None, quirks=None):
+                 skip_testers=None, path=None, quirks=None):
         self.name = name
         self.schema = schema
         self.dictionaries = dictionaries or []
         self.batches = batches
-        self.skip = set()
+        self.skipped_testers = set()
+        self.skipped_formats = {}
         self.path = path
-        if skip:
-            self.skip.update(skip)
+        if skip_testers:
+            self.skipped_testers.update(skip_testers)
         # For tracking flags like whether to validate decimal values
         # fit into the given precision (ARROW-13558).
         self.quirks = set()
@@ -1258,14 +1260,39 @@ def write(self, path):
             f.write(json.dumps(self.get_json(), indent=2).encode('utf-8'))
         self.path = path
 
-    def skip_category(self, category):
-        """Skip this test for the given category.
+    def skip_tester(self, tester):
+        """Skip this test for the given tester (such as 'C#').
+        """
+        self.skipped_testers.add(tester)
+        return self
 
-        Category should be SKIP_ARROW or SKIP_FLIGHT.
+    def skip_format(self, format, tester='all'):
+        """Skip this test for the given format, and optionally tester.
         """
-        self.skip.add(category)
+        self.skipped_formats.setdefault(format, set()).add(tester)
         return self
 
+    def add_skips_from(self, other_file):
+        """Add skips from another File object.
+        """
+        self.skipped_testers.update(other_file.skipped_testers)
+        for format, testers in other_file.skipped_formats.items():
+            self.skipped_formats.setdefault(format, set()).update(testers)
+
+    def should_skip(self, tester, format):
+        """Whether this (tester, format) combination should be skipped.
+        """
+        if tester in self.skipped_testers:
+            return True
+        testers = self.skipped_formats.get(format, ())
+        return 'all' in testers or tester in testers
+
+    @property
+    def num_batches(self):
+        """The number of record batches in this file.
+ """ + return len(self.batches) + def get_field(name, type_, **kwargs): if type_ == 'binary': @@ -1295,8 +1322,8 @@ def get_field(name, type_, **kwargs): raise TypeError(dtype) -def _generate_file(name, fields, batch_sizes, dictionaries=None, skip=None, - metadata=None): +def _generate_file(name, fields, batch_sizes, *, + dictionaries=None, metadata=None): schema = Schema(fields, metadata=metadata) batches = [] for size in batch_sizes: @@ -1307,7 +1334,7 @@ def _generate_file(name, fields, batch_sizes, dictionaries=None, skip=None, batches.append(RecordBatch(size, columns)) - return File(name, schema, batches, dictionaries, skip=skip) + return File(name, schema, batches, dictionaries) def generate_custom_metadata_case(): @@ -1666,8 +1693,8 @@ def _temp_path(): generate_primitive_case([0, 0, 0], name='primitive_zerolength'), generate_primitive_large_offsets_case([17, 20]) - .skip_category('C#') - .skip_category('JS'), + .skip_tester('C#') + .skip_tester('JS'), generate_null_case([10, 0]), @@ -1676,66 +1703,71 @@ def _temp_path(): generate_decimal128_case(), generate_decimal256_case() - .skip_category('JS'), + .skip_tester('JS'), generate_datetime_case(), generate_duration_case() - .skip_category('C#') - .skip_category('JS'), # TODO(ARROW-5239): Intervals + JS + .skip_tester('C#') + .skip_tester('JS'), # TODO(ARROW-5239): Intervals + JS generate_interval_case() - .skip_category('C#') - .skip_category('JS'), # TODO(ARROW-5239): Intervals + JS + .skip_tester('C#') + .skip_tester('JS'), # TODO(ARROW-5239): Intervals + JS generate_month_day_nano_interval_case() - .skip_category('C#') - .skip_category('JS'), + .skip_tester('C#') + .skip_tester('JS'), generate_map_case() - .skip_category('C#'), + .skip_tester('C#'), generate_non_canonical_map_case() - .skip_category('C#') - .skip_category('Java'), # TODO(ARROW-8715) + .skip_tester('C#') + .skip_tester('Java') # TODO(ARROW-8715) + # Canonical map names are restored on import, so the schemas are unequal + .skip_format(SKIP_C_SCHEMA, 'C++'), generate_nested_case(), generate_recursive_nested_case(), generate_nested_large_offsets_case() - .skip_category('C#') - .skip_category('JS'), + .skip_tester('C#') + .skip_tester('JS'), generate_unions_case() - .skip_category('C#'), + .skip_tester('C#'), generate_custom_metadata_case() - .skip_category('C#'), + .skip_tester('C#'), generate_duplicate_fieldnames_case() - .skip_category('C#') - .skip_category('JS'), + .skip_tester('C#') + .skip_tester('JS'), generate_dictionary_case() - .skip_category('C#'), + .skip_tester('C#'), generate_dictionary_unsigned_case() - .skip_category('C#') - .skip_category('Java'), # TODO(ARROW-9377) + .skip_tester('C#') + .skip_tester('Java'), # TODO(ARROW-9377) generate_nested_dictionary_case() - .skip_category('C#') - .skip_category('Java'), # TODO(ARROW-7779) + .skip_tester('C#') + .skip_tester('Java'), # TODO(ARROW-7779) generate_run_end_encoded_case() - .skip_category('C#') - .skip_category('Java') - .skip_category('JS') - .skip_category('Rust'), + .skip_tester('C#') + .skip_tester('Java') + .skip_tester('JS') + .skip_tester('Rust'), generate_extension_case() - .skip_category('C#'), + .skip_tester('C#') + # TODO: ensure the extension is registered in the C++ entrypoint + .skip_format(SKIP_C_SCHEMA, 'C++') + .skip_format(SKIP_C_ARRAY, 'C++'), ] generated_paths = [] diff --git a/dev/archery/archery/integration/runner.py b/dev/archery/archery/integration/runner.py index 0ee9ab814e5e6..2fd1d2d7f0c44 100644 --- a/dev/archery/archery/integration/runner.py +++ 
b/dev/archery/archery/integration/runner.py @@ -25,17 +25,19 @@ import sys import tempfile import traceback -from typing import Callable, List +from typing import Callable, List, Optional +from . import cdata from .scenario import Scenario -from .tester import Tester -from .tester_cpp import CPPTester +from .tester import Tester, CDataExporter, CDataImporter +from .tester_cpp import CppTester from .tester_go import GoTester from .tester_rust import RustTester from .tester_java import JavaTester from .tester_js import JSTester from .tester_csharp import CSharpTester -from .util import guid, SKIP_ARROW, SKIP_FLIGHT, printer +from .util import guid, printer +from .util import SKIP_C_ARRAY, SKIP_C_SCHEMA, SKIP_FLIGHT, SKIP_IPC from ..utils.source import ARROW_ROOT_DEFAULT from . import datagen @@ -76,7 +78,7 @@ def __init__(self, json_files, self.json_files = [json_file for json_file in self.json_files if self.match in json_file.name] - def run(self): + def run_ipc(self): """ Run Arrow IPC integration tests for the matrix of enabled implementations. @@ -84,23 +86,24 @@ def run(self): for producer, consumer in itertools.product( filter(lambda t: t.PRODUCER, self.testers), filter(lambda t: t.CONSUMER, self.testers)): - self._compare_implementations( + self._compare_ipc_implementations( producer, consumer, self._produce_consume, self.json_files) if self.gold_dirs: for gold_dir, consumer in itertools.product( self.gold_dirs, filter(lambda t: t.CONSUMER, self.testers)): - log('\n\n\n\n') + log('\n') log('******************************************************') log('Tests against golden files in {}'.format(gold_dir)) log('******************************************************') def run_gold(_, consumer, test_case: datagen.File): return self._run_gold(gold_dir, consumer, test_case) - self._compare_implementations( + self._compare_ipc_implementations( consumer, consumer, run_gold, self._gold_tests(gold_dir)) + log('\n') def run_flight(self): """ @@ -112,6 +115,18 @@ def run_flight(self): self.testers) for server, client in itertools.product(servers, clients): self._compare_flight_implementations(server, client) + log('\n') + + def run_c_data(self): + """ + Run Arrow C Data interface integration tests for the matrix of + enabled implementations. 
+ """ + for producer, consumer in itertools.product( + filter(lambda t: t.C_DATA_EXPORTER, self.testers), + filter(lambda t: t.C_DATA_IMPORTER, self.testers)): + self._compare_c_data_implementations(producer, consumer) + log('\n') def _gold_tests(self, gold_dir): prefix = os.path.basename(os.path.normpath(gold_dir)) @@ -125,28 +140,31 @@ def _gold_tests(self, gold_dir): with open(out_path, "wb") as out: out.write(i.read()) + # Find the generated file with the same name as this gold file try: - skip = next(f for f in self.json_files - if f.name == name).skip + equiv_json_file = next(f for f in self.json_files + if f.name == name) except StopIteration: - skip = set() + equiv_json_file = None + + skip_testers = set() if name == 'union' and prefix == '0.17.1': - skip.add("Java") - skip.add("JS") + skip_testers.add("Java") + skip_testers.add("JS") if prefix == '1.0.0-bigendian' or prefix == '1.0.0-littleendian': - skip.add("C#") - skip.add("Java") - skip.add("JS") - skip.add("Rust") + skip_testers.add("C#") + skip_testers.add("Java") + skip_testers.add("JS") + skip_testers.add("Rust") if prefix == '2.0.0-compression': - skip.add("C#") - skip.add("JS") + skip_testers.add("C#") + skip_testers.add("JS") # See https://github.com/apache/arrow/pull/9822 for how to # disable specific compression type tests. if prefix == '4.0.0-shareddict': - skip.add("C#") + skip_testers.add("C#") quirks = set() if prefix in {'0.14.1', '0.17.1', @@ -157,12 +175,18 @@ def _gold_tests(self, gold_dir): quirks.add("no_date64_validate") quirks.add("no_times_validate") - yield datagen.File(name, None, None, skip=skip, path=out_path, - quirks=quirks) + json_file = datagen.File(name, schema=None, batches=None, + path=out_path, + skip_testers=skip_testers, + quirks=quirks) + if equiv_json_file is not None: + json_file.add_skips_from(equiv_json_file) + yield json_file def _run_test_cases(self, case_runner: Callable[[datagen.File], Outcome], - test_cases: List[datagen.File]) -> None: + test_cases: List[datagen.File], + *, serial: Optional[bool] = None) -> None: """ Populate self.failures with the outcomes of the ``case_runner`` ran against ``test_cases`` @@ -171,10 +195,13 @@ def case_wrapper(test_case): with printer.cork(): return case_runner(test_case) + if serial is None: + serial = self.serial + if self.failures and self.stop_on_error: return - if self.serial: + if serial: for outcome in map(case_wrapper, test_cases): if outcome.failure is not None: self.failures.append(outcome.failure) @@ -189,7 +216,7 @@ def case_wrapper(test_case): if self.stop_on_error: break - def _compare_implementations( + def _compare_ipc_implementations( self, producer: Tester, consumer: Tester, @@ -221,22 +248,17 @@ def _run_ipc_test_case( outcome = Outcome() json_path = test_case.path - log('==========================================================') + log('=' * 70) log('Testing file {0}'.format(json_path)) - log('==========================================================') - - if producer.name in test_case.skip: - log('-- Skipping test because producer {0} does ' - 'not support'.format(producer.name)) - outcome.skipped = True - elif consumer.name in test_case.skip: - log('-- Skipping test because consumer {0} does ' - 'not support'.format(consumer.name)) + if test_case.should_skip(producer.name, SKIP_IPC): + log(f'-- Skipping test because producer {producer.name} does ' + f'not support IPC') outcome.skipped = True - elif SKIP_ARROW in test_case.skip: - log('-- Skipping test') + elif test_case.should_skip(consumer.name, SKIP_IPC): + log(f'-- 
Skipping test because consumer {consumer.name} does ' + f'not support IPC') outcome.skipped = True else: @@ -247,6 +269,8 @@ def _run_ipc_test_case( outcome.failure = Failure(test_case, producer, consumer, sys.exc_info()) + log('=' * 70) + return outcome def _produce_consume(self, @@ -344,22 +368,17 @@ def _run_flight_test_case(self, """ outcome = Outcome() - log('=' * 58) + log('=' * 70) log('Testing file {0}'.format(test_case.name)) - log('=' * 58) - - if producer.name in test_case.skip: - log('-- Skipping test because producer {0} does ' - 'not support'.format(producer.name)) - outcome.skipped = True - elif consumer.name in test_case.skip: - log('-- Skipping test because consumer {0} does ' - 'not support'.format(consumer.name)) + if test_case.should_skip(producer.name, SKIP_FLIGHT): + log(f'-- Skipping test because producer {producer.name} does ' + f'not support Flight') outcome.skipped = True - elif SKIP_FLIGHT in test_case.skip: - log('-- Skipping test') + elif test_case.should_skip(consumer.name, SKIP_FLIGHT): + log(f'-- Skipping test because consumer {consumer.name} does ' + f'not support Flight') outcome.skipped = True else: @@ -380,6 +399,125 @@ def _run_flight_test_case(self, outcome.failure = Failure(test_case, producer, consumer, sys.exc_info()) + log('=' * 70) + + return outcome + + def _compare_c_data_implementations( + self, + producer: Tester, + consumer: Tester + ): + log('##########################################################') + log(f'C Data Interface: ' + f'{producer.name} exporting, {consumer.name} importing') + log('##########################################################') + + # Serial execution is required for proper memory accounting + serial = True + + exporter = producer.make_c_data_exporter() + importer = consumer.make_c_data_importer() + + case_runner = partial(self._run_c_schema_test_case, producer, consumer, + exporter, importer) + self._run_test_cases(case_runner, self.json_files, serial=serial) + + case_runner = partial(self._run_c_array_test_cases, producer, consumer, + exporter, importer) + self._run_test_cases(case_runner, self.json_files, serial=serial) + + def _run_c_schema_test_case(self, + producer: Tester, consumer: Tester, + exporter: CDataExporter, + importer: CDataImporter, + test_case: datagen.File) -> Outcome: + """ + Run one C ArrowSchema test case. + """ + outcome = Outcome() + + def do_run(): + json_path = test_case.path + ffi = cdata.ffi() + c_schema_ptr = ffi.new("struct ArrowSchema*") + with cdata.check_memory_released(exporter, importer): + exporter.export_schema_from_json(json_path, c_schema_ptr) + importer.import_schema_and_compare_to_json(json_path, c_schema_ptr) + + log('=' * 70) + log(f'Testing C ArrowSchema from file {test_case.name!r}') + + if test_case.should_skip(producer.name, SKIP_C_SCHEMA): + log(f'-- Skipping test because producer {producer.name} does ' + f'not support C ArrowSchema') + outcome.skipped = True + + elif test_case.should_skip(consumer.name, SKIP_C_SCHEMA): + log(f'-- Skipping test because consumer {consumer.name} does ' + f'not support C ArrowSchema') + outcome.skipped = True + + else: + try: + do_run() + except Exception: + traceback.print_exc(file=printer.stdout) + outcome.failure = Failure(test_case, producer, consumer, + sys.exc_info()) + + log('=' * 70) + + return outcome + + def _run_c_array_test_cases(self, + producer: Tester, consumer: Tester, + exporter: CDataExporter, + importer: CDataImporter, + test_case: datagen.File) -> Outcome: + """ + Run one set C ArrowArray test cases. 
+ """ + outcome = Outcome() + + def do_run(): + json_path = test_case.path + ffi = cdata.ffi() + c_array_ptr = ffi.new("struct ArrowArray*") + for num_batch in range(test_case.num_batches): + log(f'... with record batch #{num_batch}') + with cdata.check_memory_released(exporter, importer): + exporter.export_batch_from_json(json_path, + num_batch, + c_array_ptr) + importer.import_batch_and_compare_to_json(json_path, + num_batch, + c_array_ptr) + + log('=' * 70) + log(f'Testing C ArrowArray ' + f'from file {test_case.name!r}') + + if test_case.should_skip(producer.name, SKIP_C_ARRAY): + log(f'-- Skipping test because producer {producer.name} does ' + f'not support C ArrowArray') + outcome.skipped = True + + elif test_case.should_skip(consumer.name, SKIP_C_ARRAY): + log(f'-- Skipping test because consumer {consumer.name} does ' + f'not support C ArrowArray') + outcome.skipped = True + + else: + try: + do_run() + except Exception: + traceback.print_exc(file=printer.stdout) + outcome.failure = Failure(test_case, producer, consumer, + sys.exc_info()) + + log('=' * 70) + return outcome @@ -387,7 +525,7 @@ def get_static_json_files(): glob_pattern = os.path.join(ARROW_ROOT_DEFAULT, 'integration', 'data', '*.json') return [ - datagen.File(name=os.path.basename(p), path=p, skip=set(), + datagen.File(name=os.path.basename(p), path=p, schema=None, batches=None) for p in glob.glob(glob_pattern) ] @@ -395,13 +533,14 @@ def get_static_json_files(): def run_all_tests(with_cpp=True, with_java=True, with_js=True, with_csharp=True, with_go=True, with_rust=False, - run_flight=False, tempdir=None, **kwargs): + run_ipc=False, run_flight=False, run_c_data=False, + tempdir=None, **kwargs): tempdir = tempdir or tempfile.mkdtemp(prefix='arrow-integration-') testers: List[Tester] = [] if with_cpp: - testers.append(CPPTester(**kwargs)) + testers.append(CppTester(**kwargs)) if with_java: testers.append(JavaTester(**kwargs)) @@ -434,54 +573,57 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True, Scenario( "ordered", description="Ensure FlightInfo.ordered is supported.", - skip={"JS", "C#", "Rust"}, + skip_testers={"JS", "C#", "Rust"}, ), Scenario( "expiration_time:do_get", description=("Ensure FlightEndpoint.expiration_time with " "DoGet is working as expected."), - skip={"JS", "C#", "Rust"}, + skip_testers={"JS", "C#", "Rust"}, ), Scenario( "expiration_time:list_actions", description=("Ensure FlightEndpoint.expiration_time related " "pre-defined actions is working with ListActions " "as expected."), - skip={"JS", "C#", "Rust"}, + skip_testers={"JS", "C#", "Rust"}, ), Scenario( "expiration_time:cancel_flight_info", description=("Ensure FlightEndpoint.expiration_time and " "CancelFlightInfo are working as expected."), - skip={"JS", "C#", "Rust"}, + skip_testers={"JS", "C#", "Rust"}, ), Scenario( "expiration_time:renew_flight_endpoint", description=("Ensure FlightEndpoint.expiration_time and " "RenewFlightEndpoint are working as expected."), - skip={"JS", "C#", "Rust"}, + skip_testers={"JS", "C#", "Rust"}, ), Scenario( "poll_flight_info", description="Ensure PollFlightInfo is supported.", - skip={"JS", "C#", "Rust"} + skip_testers={"JS", "C#", "Rust"} ), Scenario( "flight_sql", description="Ensure Flight SQL protocol is working as expected.", - skip={"Rust"} + skip_testers={"Rust"} ), Scenario( "flight_sql:extension", description="Ensure Flight SQL extensions work as expected.", - skip={"Rust"} + skip_testers={"Rust"} ), ] runner = IntegrationRunner(json_files, flight_scenarios, testers, **kwargs) - 
runner.run() + if run_ipc: + runner.run_ipc() if run_flight: runner.run_flight() + if run_c_data: + runner.run_c_data() fail_count = 0 if runner.failures: @@ -492,7 +634,8 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True, log(test_case.name, producer.name, "producing, ", consumer.name, "consuming") if exc_info: - traceback.print_exception(*exc_info) + exc_type, exc_value, exc_tb = exc_info + log(f'{exc_type}: {exc_value}') log() log(fail_count, "failures") diff --git a/dev/archery/archery/integration/scenario.py b/dev/archery/archery/integration/scenario.py index 1fcbca64e6a1f..89c64452e5fc5 100644 --- a/dev/archery/archery/integration/scenario.py +++ b/dev/archery/archery/integration/scenario.py @@ -23,7 +23,10 @@ class Scenario: Does not correspond to a particular IPC JSON file. """ - def __init__(self, name, description, skip=None): + def __init__(self, name, description, skip_testers=None): self.name = name self.description = description - self.skip = skip or set() + self.skipped_testers = skip_testers or set() + + def should_skip(self, tester, format): + return tester in self.skipped_testers diff --git a/dev/archery/archery/integration/tester.py b/dev/archery/archery/integration/tester.py index 54bfe621efd92..6a3061992d006 100644 --- a/dev/archery/archery/integration/tester.py +++ b/dev/archery/archery/integration/tester.py @@ -17,12 +17,181 @@ # Base class for language-specific integration test harnesses +from abc import ABC, abstractmethod +import os import subprocess +import typing from .util import log -class Tester(object): +_Predicate = typing.Callable[[], bool] + + +class CDataExporter(ABC): + + @abstractmethod + def export_schema_from_json(self, json_path: os.PathLike, + c_schema_ptr: object): + """ + Read a JSON integration file and export its schema. + + Parameters + ---------- + json_path : Path + Path to the JSON file + c_schema_ptr : cffi pointer value + Pointer to the ``ArrowSchema`` struct to export to. + """ + + @abstractmethod + def export_batch_from_json(self, json_path: os.PathLike, + num_batch: int, + c_array_ptr: object): + """ + Read a JSON integration file and export one of its batches. + + Parameters + ---------- + json_path : Path + Path to the JSON file + num_batch : int + Number of the record batch in the JSON file + c_schema_ptr : cffi pointer value + Pointer to the ``ArrowArray`` struct to export to. + """ + + @property + @abstractmethod + def supports_releasing_memory(self) -> bool: + """ + Whether the implementation is able to release memory deterministically. + + Here, "release memory" means that, after the `release` callback of + a C Data Interface export is called, `compare_allocation_state` is + able to trigger the deallocation of the memory underlying the export + (for example buffer data). + + If false, then `record_allocation_state` and `compare_allocation_state` + are allowed to raise NotImplementedError. + """ + + def record_allocation_state(self) -> object: + """ + Record the current memory allocation state. + + Returns + ------- + state : object + Opaque object representing the allocation state, + for example the number of allocated bytes. + """ + raise NotImplementedError + + def compare_allocation_state(self, recorded: object, + gc_until: typing.Callable[[_Predicate], bool] + ) -> bool: + """ + Compare the current memory allocation state with the recorded one. 
+ + Parameters + ---------- + recorded : object + The previous allocation state returned by + `record_allocation_state()` + gc_until : callable + A callable itself accepting a callable predicate, and + returning a boolean. + `gc_until` should try to release memory until the predicate + becomes true, or until it decides to give up. The final value + of the predicate should be returned. + `gc_until` is typically provided by the C Data Interface importer. + + Returns + ------- + success : bool + Whether memory allocation state finally reached its previously + recorded value. + """ + raise NotImplementedError + + +class CDataImporter(ABC): + + @abstractmethod + def import_schema_and_compare_to_json(self, json_path: os.PathLike, + c_schema_ptr: object): + """ + Import schema and compare it to the schema of a JSON integration file. + + An error is raised if importing fails or the schemas differ. + + Parameters + ---------- + json_path : Path + The path to the JSON file + c_schema_ptr : cffi pointer value + Pointer to the ``ArrowSchema`` struct to import from. + """ + + @abstractmethod + def import_batch_and_compare_to_json(self, json_path: os.PathLike, + num_batch: int, + c_array_ptr: object): + """ + Import record batch and compare it to one of the batches + from a JSON integration file. + + The schema used for importing the record batch is the one from + the JSON file. + + An error is raised if importing fails or the batches differ. + + Parameters + ---------- + json_path : Path + The path to the JSON file + num_batch : int + Number of the record batch in the JSON file + c_array_ptr : cffi pointer value + Pointer to the ``ArrowArray`` struct to import from. + """ + + @property + @abstractmethod + def supports_releasing_memory(self) -> bool: + """ + Whether the implementation is able to release memory deterministically. + + Here, "release memory" means calling the `release` callback of + a C Data Interface export (which should then trigger a deallocation + mechanism on the exporter). + + If false, then `gc_until` is allowed to raise NotImplementedError. + """ + + def gc_until(self, predicate: _Predicate): + """ + Try to release memory until the predicate becomes true, or fail. + + Depending on the CDataImporter implementation, this may for example + try once, or run a garbage collector a given number of times, or + any other implementation-specific strategy for releasing memory. + + The running time should be kept reasonable and compatible with + execution of multiple C Data integration tests. + + This should not raise if `supports_releasing_memory` is true. + + Returns + ------- + success : bool + The final value of the predicate. + """ + raise NotImplementedError + + +class Tester: """ The interface to declare a tester to run integration tests against. 
""" @@ -34,8 +203,12 @@ class Tester(object): FLIGHT_SERVER = False # whether the language supports receiving Flight FLIGHT_CLIENT = False + # whether the language supports the C Data Interface as an exporter + C_DATA_EXPORTER = False + # whether the language supports the C Data Interface as an importer + C_DATA_IMPORTER = False - # the name shown in the logs + # the name used for skipping and shown in the logs name = "unknown" def __init__(self, debug=False, **args): @@ -85,3 +258,9 @@ def flight_server(self, scenario_name=None): def flight_request(self, port, json_path=None, scenario_name=None): raise NotImplementedError + + def make_c_data_exporter(self) -> CDataExporter: + raise NotImplementedError + + def make_c_data_importer(self) -> CDataImporter: + raise NotImplementedError diff --git a/dev/archery/archery/integration/tester_cpp.py b/dev/archery/archery/integration/tester_cpp.py index 52cc565dc00a3..9ddc3c480002a 100644 --- a/dev/archery/archery/integration/tester_cpp.py +++ b/dev/archery/archery/integration/tester_cpp.py @@ -16,10 +16,12 @@ # under the License. import contextlib +import functools import os import subprocess -from .tester import Tester +from . import cdata +from .tester import Tester, CDataExporter, CDataImporter from .util import run_cmd, log from ..utils.source import ARROW_ROOT_DEFAULT @@ -39,12 +41,19 @@ "localhost", ] +_dll_suffix = ".dll" if os.name == "nt" else ".so" -class CPPTester(Tester): +_DLL_PATH = _EXE_PATH +_ARROW_DLL = os.path.join(_DLL_PATH, "libarrow" + _dll_suffix) + + +class CppTester(Tester): PRODUCER = True CONSUMER = True FLIGHT_SERVER = True FLIGHT_CLIENT = True + C_DATA_EXPORTER = True + C_DATA_IMPORTER = True name = 'C++' @@ -133,3 +142,104 @@ def flight_request(self, port, json_path=None, scenario_name=None): if self.debug: log(' '.join(cmd)) run_cmd(cmd) + + def make_c_data_exporter(self): + return CppCDataExporter(self.debug, self.args) + + def make_c_data_importer(self): + return CppCDataImporter(self.debug, self.args) + + +_cpp_c_data_entrypoints = """ + const char* ArrowCpp_CDataIntegration_ExportSchemaFromJson( + const char* json_path, struct ArrowSchema* out); + const char* ArrowCpp_CDataIntegration_ImportSchemaAndCompareToJson( + const char* json_path, struct ArrowSchema* schema); + + const char* ArrowCpp_CDataIntegration_ExportBatchFromJson( + const char* json_path, int num_batch, struct ArrowArray* out); + const char* ArrowCpp_CDataIntegration_ImportBatchAndCompareToJson( + const char* json_path, int num_batch, struct ArrowArray* batch); + + int64_t ArrowCpp_BytesAllocated(); + """ + + +@functools.lru_cache +def _load_ffi(ffi, lib_path=_ARROW_DLL): + ffi.cdef(_cpp_c_data_entrypoints) + dll = ffi.dlopen(lib_path) + dll.ArrowCpp_CDataIntegration_ExportSchemaFromJson + return dll + + +class _CDataBase: + + def __init__(self, debug, args): + self.debug = debug + self.args = args + self.ffi = cdata.ffi() + self.dll = _load_ffi(self.ffi) + + def _check_c_error(self, c_error): + """ + Check a `const char*` error return from an integration entrypoint. + + A null means success, a non-empty string is an error message. + The string is statically allocated on the C++ side. 
+ """ + assert self.ffi.typeof(c_error) is self.ffi.typeof("const char*") + if c_error != self.ffi.NULL: + error = self.ffi.string(c_error).decode('utf8', + errors='replace') + raise RuntimeError( + f"C++ C Data Integration call failed: {error}") + + +class CppCDataExporter(CDataExporter, _CDataBase): + + def export_schema_from_json(self, json_path, c_schema_ptr): + c_error = self.dll.ArrowCpp_CDataIntegration_ExportSchemaFromJson( + str(json_path).encode(), c_schema_ptr) + self._check_c_error(c_error) + + def export_batch_from_json(self, json_path, num_batch, c_array_ptr): + c_error = self.dll.ArrowCpp_CDataIntegration_ExportBatchFromJson( + str(json_path).encode(), num_batch, c_array_ptr) + self._check_c_error(c_error) + + @property + def supports_releasing_memory(self): + return True + + def record_allocation_state(self): + return self.dll.ArrowCpp_BytesAllocated() + + def compare_allocation_state(self, recorded, gc_until): + def pred(): + # No GC on our side, so just compare allocation state + return self.record_allocation_state() == recorded + + return gc_until(pred) + + +class CppCDataImporter(CDataImporter, _CDataBase): + + def import_schema_and_compare_to_json(self, json_path, c_schema_ptr): + c_error = self.dll.ArrowCpp_CDataIntegration_ImportSchemaAndCompareToJson( + str(json_path).encode(), c_schema_ptr) + self._check_c_error(c_error) + + def import_batch_and_compare_to_json(self, json_path, num_batch, + c_array_ptr): + c_error = self.dll.ArrowCpp_CDataIntegration_ImportBatchAndCompareToJson( + str(json_path).encode(), num_batch, c_array_ptr) + self._check_c_error(c_error) + + @property + def supports_releasing_memory(self): + return True + + def gc_until(self, predicate): + # No GC on our side, so can evaluate predicate immediately + return predicate() diff --git a/dev/archery/archery/integration/util.py b/dev/archery/archery/integration/util.py index 80ba30052e4da..afef7d5eb13b9 100644 --- a/dev/archery/archery/integration/util.py +++ b/dev/archery/archery/integration/util.py @@ -32,8 +32,10 @@ def guid(): # SKIP categories -SKIP_ARROW = 'arrow' +SKIP_C_ARRAY = 'c_array' +SKIP_C_SCHEMA = 'c_schema' SKIP_FLIGHT = 'flight' +SKIP_IPC = 'ipc' class _Printer: diff --git a/dev/archery/setup.py b/dev/archery/setup.py index 627e576fb6f59..08e41225f673a 100755 --- a/dev/archery/setup.py +++ b/dev/archery/setup.py @@ -28,16 +28,17 @@ jinja_req = 'jinja2>=2.11' extras = { - 'lint': ['numpydoc==1.1.0', 'autopep8', 'flake8==6.1.0', 'cython-lint', - 'cmake_format==0.6.13'], 'benchmark': ['pandas'], - 'docker': ['ruamel.yaml', 'python-dotenv'], - 'release': ['pygithub', jinja_req, 'jira', 'semver', 'gitpython'], 'crossbow': ['github3.py', jinja_req, 'pygit2>=1.6.0', 'requests', 'ruamel.yaml', 'setuptools_scm'], 'crossbow-upload': ['github3.py', jinja_req, 'ruamel.yaml', 'setuptools_scm'], - 'numpydoc': ['numpydoc==1.1.0'] + 'docker': ['ruamel.yaml', 'python-dotenv'], + 'integration': ['cffi'], + 'lint': ['numpydoc==1.1.0', 'autopep8', 'flake8==6.1.0', 'cython-lint', + 'cmake_format==0.6.13'], + 'numpydoc': ['numpydoc==1.1.0'], + 'release': ['pygithub', jinja_req, 'jira', 'semver', 'gitpython'], } extras['bot'] = extras['crossbow'] + ['pygithub', 'jira'] extras['all'] = list(set(functools.reduce(operator.add, extras.values()))) From 64ad8e564ea013101b8565ce200e54e5c85bac8d Mon Sep 17 00:00:00 2001 From: Jonathan Keane Date: Tue, 19 Sep 2023 11:15:27 -0500 Subject: [PATCH 36/96] GH-33807: [R] Add a message if we detect running under emulation (#37777) Resolves #33807 and #37034 ### Rationale 
for this change If someone is running R under emulation, arrow segfaults without error. We can detect this when we load so can also warn people that this is not recommended. Though the version of R being run is not directly an arrow issue, arrow fails very quickly in this configuration. ### What changes are included in this PR? Detect when running under rosetta (on macOS only) and warn when the library is attached ### Are these changes tested? No, given the paucity of ARM-based mac CI, testing this organically would be difficult. But the logic is straightforward. ### Are there any user-facing changes? Yes, a warning when someone loads arrow under emulation. * Closes: #33807 Authored-by: Jonathan Keane Signed-off-by: Jonathan Keane --- r/R/arrow-package.R | 21 +++++++++++++++++++++ r/R/install-arrow.R | 4 +--- r/README.md | 2 ++ 3 files changed, 24 insertions(+), 3 deletions(-) diff --git a/r/R/arrow-package.R b/r/R/arrow-package.R index 8f44f8936bdd3..09183250ba3e0 100644 --- a/r/R/arrow-package.R +++ b/r/R/arrow-package.R @@ -183,6 +183,22 @@ configure_tzdb <- function() { # Just to be extra safe, let's wrap this in a try(); # we don't want a failed startup message to prevent the package from loading try({ + # On MacOS only, Check if we are running in under emulation, and warn this will not work + if (on_rosetta()) { + packageStartupMessage( + paste( + "Warning:", + " It appears that you are running R and Arrow in emulation (i.e. you're", + " running an Intel version of R on a non-Intel mac). This configuration is", + " not supported by arrow, you should install a native (arm64) build of R", + " and use arrow with that. See https://cran.r-project.org/bin/macosx/", + "", + sep = "\n" + ) + ) + } + + features <- arrow_info()$capabilities # That has all of the #ifdef features, plus the compression libs and the # string libraries (but not the memory allocators, they're added elsewhere) @@ -225,6 +241,11 @@ on_macos_10_13_or_lower <- function() { package_version(unname(Sys.info()["release"])) < "18.0.0" } +on_rosetta <- function() { + identical(tolower(Sys.info()[["sysname"]]), "darwin") && + identical(system("sysctl -n sysctl.proc_translated", intern = TRUE), "1") +} + option_use_threads <- function() { !is_false(getOption("arrow.use_threads")) } diff --git a/r/R/install-arrow.R b/r/R/install-arrow.R index 8380fa2af989c..7017d4f39b876 100644 --- a/r/R/install-arrow.R +++ b/r/R/install-arrow.R @@ -61,7 +61,6 @@ install_arrow <- function(nightly = FALSE, verbose = Sys.getenv("ARROW_R_DEV", FALSE), repos = getOption("repos"), ...) { - sysname <- tolower(Sys.info()[["sysname"]]) conda <- isTRUE(grepl("conda", R.Version()$platform)) if (conda) { @@ -80,8 +79,7 @@ install_arrow <- function(nightly = FALSE, # On the M1, we can't use the usual autobrew, which pulls Intel dependencies apple_m1 <- grepl("arm-apple|aarch64.*darwin", R.Version()$platform) # On Rosetta, we have to build without JEMALLOC, so we also can't autobrew - rosetta <- identical(sysname, "darwin") && identical(system("sysctl -n sysctl.proc_translated", intern = TRUE), "1") - if (rosetta) { + if (on_rosetta()) { Sys.setenv(ARROW_JEMALLOC = "OFF") } if (apple_m1 || rosetta) { diff --git a/r/README.md b/r/README.md index d343d6979c0a7..3c1e3570ffdd4 100644 --- a/r/README.md +++ b/r/README.md @@ -73,6 +73,8 @@ additional steps should be required. There are some special cases to note: +- On macOS, the R you use with Arrow should match the architecture of the machine you are using. If you're using an ARM (aka M1, M2, etc.) 
processor use R compiled for arm64. If you're using an Intel based mac, use R compiled for x86. Using R and Arrow compiled for Intel based macs on an ARM based mac will result in segfaults and crashes.
+
 - On Linux the installation process can sometimes be more involved because
   CRAN does not host binaries for Linux. For more information please see the
   [installation guide](https://arrow.apache.org/docs/r/articles/install.html).

From 76c4a6e317436d616ddcd62ca21245bebef6091d Mon Sep 17 00:00:00 2001
From: mwish
Date: Wed, 20 Sep 2023 00:29:10 +0800
Subject: [PATCH 37/96] GH-37487: [C++][Parquet] Dataset: Implement sync `ParquetFileFormat::GetReader` (#37514)

### Rationale for this change

As https://github.com/apache/arrow/issues/37487 says, when the thread count is 1, the thread might block in `ParquetFileFormat::GetReaderAsync`. That's because:

1. `ParquetFileFormat::CountRows` would call `EnsureCompleteMetadata` in `io_executor`
2. `EnsureCompleteMetadata` calls `ParquetFileFormat::GetReader`, which dispatches the real request to async mode
3. the `async` part is executed in `io_executor`.

Steps 1 and 3 run in the same fixed-size executor, causing a deadlock.
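The blocking pattern is easy to reproduce outside Arrow. A minimal Python sketch of the same single-thread-executor deadlock (illustrative only, not Arrow code; a timeout stands in for the real hang):

```python
# Illustrative sketch: a task on a 1-thread pool waits on a second task
# that can only run on the same, already-occupied pool thread (steps 1-3).
from concurrent.futures import ThreadPoolExecutor, TimeoutError

io_executor = ThreadPoolExecutor(max_workers=1)

def count_rows():
    # Steps 1/2: we already hold the only worker thread here...
    metadata = io_executor.submit(lambda: "metadata")  # step 3: queued behind us
    return metadata.result(timeout=2)  # would block forever without the timeout

try:
    io_executor.submit(count_rows).result()
except TimeoutError:
    print("deadlocked: both steps need the single pool thread")
```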
### What changes are included in this PR?

Implement a sync `ParquetFileFormat::GetReader`.

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Bugfix

* Closes: #37487

Authored-by: mwish
Signed-off-by: Benjamin Kietzman
---
 cpp/src/arrow/dataset/file_parquet.cc      | 61 +++++++++++++++++-----
 cpp/src/arrow/dataset/file_parquet_test.cc | 25 +++++++++
 2 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/cpp/src/arrow/dataset/file_parquet.cc b/cpp/src/arrow/dataset/file_parquet.cc
index c30441d911e4e..9d0e8a6515878 100644
--- a/cpp/src/arrow/dataset/file_parquet.cc
+++ b/cpp/src/arrow/dataset/file_parquet.cc
@@ -88,6 +88,22 @@ parquet::ArrowReaderProperties MakeArrowReaderProperties(
   return properties;
 }
 
+parquet::ArrowReaderProperties MakeArrowReaderProperties(
+    const ParquetFileFormat& format, const parquet::FileMetaData& metadata,
+    const ScanOptions& options, const ParquetFragmentScanOptions& parquet_scan_options) {
+  auto arrow_properties = MakeArrowReaderProperties(format, metadata);
+  arrow_properties.set_batch_size(options.batch_size);
+  // Must be set here since the sync ScanTask handles pre-buffering itself
+  arrow_properties.set_pre_buffer(
+      parquet_scan_options.arrow_reader_properties->pre_buffer());
+  arrow_properties.set_cache_options(
+      parquet_scan_options.arrow_reader_properties->cache_options());
+  arrow_properties.set_io_context(
+      parquet_scan_options.arrow_reader_properties->io_context());
+  arrow_properties.set_use_threads(options.use_threads);
+  return arrow_properties;
+}
+
 template <typename M>
 Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest(
     const M& metadata, const parquet::ArrowReaderProperties& properties) {
@@ -410,13 +426,42 @@ Result<std::shared_ptr<Schema>> ParquetFileFormat::Inspect(
 
 Result<std::shared_ptr<parquet::arrow::FileReader>> ParquetFileFormat::GetReader(
     const FileSource& source, const std::shared_ptr<ScanOptions>& options) const {
-  return GetReaderAsync(source, options, nullptr).result();
+  return GetReader(source, options, /*metadata=*/nullptr);
 }
 
 Result<std::shared_ptr<parquet::arrow::FileReader>> ParquetFileFormat::GetReader(
     const FileSource& source, const std::shared_ptr<ScanOptions>& options,
     const std::shared_ptr<parquet::FileMetaData>& metadata) const {
-  return GetReaderAsync(source, options, metadata).result();
+  ARROW_ASSIGN_OR_RAISE(
+      auto parquet_scan_options,
+      GetFragmentScanOptions<ParquetFragmentScanOptions>(kParquetTypeName, options.get(),
+                                                         default_fragment_scan_options));
+  auto properties =
+      MakeReaderProperties(*this, parquet_scan_options.get(), options->pool);
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+  // `parquet::ParquetFileReader::Open` will not wrap the exception as status,
+  // so using `open_parquet_file` to wrap it.
+  auto open_parquet_file = [&]() -> Result<std::unique_ptr<parquet::ParquetFileReader>> {
+    BEGIN_PARQUET_CATCH_EXCEPTIONS
+    auto reader = parquet::ParquetFileReader::Open(std::move(input),
+                                                   std::move(properties), metadata);
+    return reader;
+    END_PARQUET_CATCH_EXCEPTIONS
+  };
+
+  auto reader_opt = open_parquet_file();
+  if (!reader_opt.ok()) {
+    return WrapSourceError(reader_opt.status(), source.path());
+  }
+  auto reader = std::move(reader_opt).ValueOrDie();
+
+  std::shared_ptr<parquet::FileMetaData> reader_metadata = reader->metadata();
+  auto arrow_properties =
+      MakeArrowReaderProperties(*this, *reader_metadata, *options, *parquet_scan_options);
+  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
+  RETURN_NOT_OK(parquet::arrow::FileReader::Make(
+      options->pool, std::move(reader), std::move(arrow_properties), &arrow_reader));
+  return arrow_reader;
 }
 
 Future<std::shared_ptr<parquet::arrow::FileReader>> ParquetFileFormat::GetReaderAsync(
@@ -445,16 +490,8 @@ Future<std::shared_ptr<parquet::arrow::FileReader>> ParquetFileFormat::GetReader
   ARROW_ASSIGN_OR_RAISE(std::unique_ptr<parquet::ParquetFileReader> reader,
                         reader_fut.MoveResult());
   std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
-  auto arrow_properties = MakeArrowReaderProperties(*self, *metadata);
-  arrow_properties.set_batch_size(options->batch_size);
-  // Must be set here since the sync ScanTask handles pre-buffering itself
-  arrow_properties.set_pre_buffer(
-      parquet_scan_options->arrow_reader_properties->pre_buffer());
-  arrow_properties.set_cache_options(
-      parquet_scan_options->arrow_reader_properties->cache_options());
-  arrow_properties.set_io_context(
-      parquet_scan_options->arrow_reader_properties->io_context());
-  arrow_properties.set_use_threads(options->use_threads);
+  auto arrow_properties =
+      MakeArrowReaderProperties(*this, *metadata, *options, *parquet_scan_options);
   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
   RETURN_NOT_OK(parquet::arrow::FileReader::Make(options->pool, std::move(reader),
                                                  std::move(arrow_properties),
diff --git a/cpp/src/arrow/dataset/file_parquet_test.cc b/cpp/src/arrow/dataset/file_parquet_test.cc
index 42f923f0e6a27..8527c3af64c83 100644
--- a/cpp/src/arrow/dataset/file_parquet_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_test.cc
@@ -18,12 +18,14 @@
 #include "arrow/dataset/file_parquet.h"
 
 #include
+#include
 #include
 
 #include "arrow/compute/api_scalar.h"
 #include "arrow/dataset/dataset_internal.h"
 #include "arrow/dataset/test_util_internal.h"
+#include "arrow/io/interfaces.h"
 #include "arrow/io/memory.h"
 #include "arrow/io/test_common.h"
 #include "arrow/io/util_internal.h"
@@ -367,6 +369,29 @@ TEST_F(TestParquetFileFormat, MultithreadedScan) {
   ASSERT_EQ(batches.size(), kNumRowGroups);
 }
 
+TEST_F(TestParquetFileFormat, SingleThreadExecutor) {
+  // Reset capacity for io executor
+  struct PoolResetGuard {
+    int original_capacity = io::GetIOThreadPoolCapacity();
+    ~PoolResetGuard() { DCHECK_OK(io::SetIOThreadPoolCapacity(original_capacity)); }
+  } guard;
+  ASSERT_OK(io::SetIOThreadPoolCapacity(1));
+
+  auto reader = GetRecordBatchReader(schema({field("utf8", utf8())}));
+
+  ASSERT_OK_AND_ASSIGN(auto buffer, ParquetFormatHelper::Write(reader.get()));
+  auto buffer_reader = std::make_shared<::arrow::io::BufferReader>(buffer);
+  auto source = std::make_shared<FileSource>(std::move(buffer_reader), buffer->size());
+  auto options = std::make_shared<ScanOptions>();
+
+  {
+    auto fragment = MakeFragment(*source);
+    auto count_rows = fragment->CountRows(literal(true), options);
+    ASSERT_OK_AND_ASSIGN(auto result, count_rows.MoveResult());
+    ASSERT_EQ(expected_rows(), result);
+  }
+}
+
 class TestParquetFileSystemDataset : public WriteFileSystemDatasetMixin,
                                      public testing::Test {
  public:

From 0e6a68322c525aea84d8b7b127a8716bf484b227 Mon Sep 17 00:00:00 2001
From: mwish
Date: Wed, 20 Sep 2023 00:51:41 +0800
Subject: [PATCH 38/96] GH-37670: [C++] IO FileInterface extend from enable_shared_from_this (#37713)

### Rationale for this change

S3 `FlushAsync` might have a lifetime problem; this patch fixes it.

### What changes are included in this PR?

1. Move `enable_shared_from_this` to `FileInterface`
2. Update S3 `FlushAsync`
3. Implement a sync `Flush` to avoid calling `shared_from_this` in the destructor.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #37670

Lead-authored-by: mwish
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Benjamin Kietzman
Signed-off-by: Benjamin Kietzman
---
 cpp/src/arrow/filesystem/s3fs.cc      | 85 +++++++++++++++------------
 cpp/src/arrow/filesystem/s3fs_test.cc | 25 ++++++++
 cpp/src/arrow/io/interfaces.cc        |  5 +-
 cpp/src/arrow/io/interfaces.h         |  6 +-
 4 files changed, 79 insertions(+), 42 deletions(-)
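As a loose illustration of the lifetime pattern (a Python analogy only, not the C++ mechanism itself: capturing a strong reference in the continuation plays the role that `shared_from_this` plays in the diff below):

```python
# Hypothetical analogy: the continuation captures `self`, so the stream
# outlives the caller's last reference until the background flush finishes.
from concurrent.futures import ThreadPoolExecutor
import time

class Stream:
    def __init__(self):
        self.closed = False

    def close_async(self, pool):
        def flush_then_finish():
            time.sleep(0.1)     # stand-in for flushing buffered parts
            self.closed = True  # `self` is kept alive by this closure
        return pool.submit(flush_then_finish)

pool = ThreadPoolExecutor(max_workers=1)
fut = Stream().close_async(pool)  # caller keeps no reference to the stream
fut.result()                      # still completes; the closure held it alive
```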
client_lock.Move()->CompleteMultipartUploadWithErrorFixup(std::move(req)); + if (!outcome.IsSuccess()) { + return ErrorToStatus( + std::forward_as_tuple("When completing multiple part upload for key '", + path_.key, "' in bucket '", path_.bucket, "': "), + "CompleteMultipartUpload", outcome.GetError()); + } + + holder_ = nullptr; + closed_ = true; + return Status::OK(); + } + + Status Close() override { + if (closed_) return Status::OK(); + + RETURN_NOT_OK(EnsureReadyToFlushFromClose()); + + RETURN_NOT_OK(Flush()); + return FinishPartUploadAfterFlush(); + } + + Future<> CloseAsync() override { + if (closed_) return Status::OK(); + + RETURN_NOT_OK(EnsureReadyToFlushFromClose()); + + auto self = std::dynamic_pointer_cast(shared_from_this()); + // Wait for in-progress uploads to finish (if async writes are enabled) + return FlushAsync().Then([self]() { return self->FinishPartUploadAfterFlush(); }); } bool closed() const override { return closed_; } diff --git a/cpp/src/arrow/filesystem/s3fs_test.cc b/cpp/src/arrow/filesystem/s3fs_test.cc index e9f14fde72316..f88ee7eef9332 100644 --- a/cpp/src/arrow/filesystem/s3fs_test.cc +++ b/cpp/src/arrow/filesystem/s3fs_test.cc @@ -590,6 +590,21 @@ class TestS3FS : public S3TestMixin { AssertObjectContents(client_.get(), "bucket", "somefile", "new data"); } + void TestOpenOutputStreamCloseAsyncDestructor() { + std::shared_ptr stream; + ASSERT_OK_AND_ASSIGN(stream, fs_->OpenOutputStream("bucket/somefile")); + ASSERT_OK(stream->Write("new data")); + // Destructor implicitly closes stream and completes the multipart upload. + // GH-37670: Testing it doesn't matter whether flush is triggered asynchronously + // after CloseAsync or synchronously after stream.reset() since we're just + // checking that `closeAsyncFut` keeps the stream alive until completion + // rather than segfaulting on a dangling stream + auto closeAsyncFut = stream->CloseAsync(); + stream.reset(); + ASSERT_OK(closeAsyncFut.MoveResult()); + AssertObjectContents(client_.get(), "bucket", "somefile", "new data"); + } + protected: S3Options options_; std::shared_ptr fs_; @@ -1177,6 +1192,16 @@ TEST_F(TestS3FS, OpenOutputStreamDestructorSyncWrite) { TestOpenOutputStreamDestructor(); } +TEST_F(TestS3FS, OpenOutputStreamAsyncDestructorBackgroundWrites) { + TestOpenOutputStreamCloseAsyncDestructor(); +} + +TEST_F(TestS3FS, OpenOutputStreamAsyncDestructorSyncWrite) { + options_.background_writes = false; + MakeFileSystem(); + TestOpenOutputStreamCloseAsyncDestructor(); +} + TEST_F(TestS3FS, OpenOutputStreamMetadata) { std::shared_ptr stream; diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index e7819e139f67a..d3229fd067cbe 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -123,7 +123,8 @@ Result> InputStream::ReadMetadata() { // executor Future> InputStream::ReadMetadataAsync( const IOContext& ctx) { - auto self = shared_from_this(); + std::shared_ptr self = + std::dynamic_pointer_cast(shared_from_this()); return DeferNotOk(internal::SubmitIO(ctx, [self] { return self->ReadMetadata(); })); } @@ -165,7 +166,7 @@ Result> RandomAccessFile::ReadAt(int64_t position, Future> RandomAccessFile::ReadAsync(const IOContext& ctx, int64_t position, int64_t nbytes) { - auto self = checked_pointer_cast(shared_from_this()); + auto self = std::dynamic_pointer_cast(shared_from_this()); return DeferNotOk(internal::SubmitIO( ctx, [self, position, nbytes] { return self->ReadAt(position, nbytes); })); } diff --git a/cpp/src/arrow/io/interfaces.h 
b/cpp/src/arrow/io/interfaces.h index dcbe4feb261fb..d2a11b7b6d7ce 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -96,7 +96,7 @@ struct ARROW_EXPORT IOContext { StopToken stop_token_; }; -class ARROW_EXPORT FileInterface { +class ARROW_EXPORT FileInterface : public std::enable_shared_from_this { public: virtual ~FileInterface() = 0; @@ -205,9 +205,7 @@ class ARROW_EXPORT OutputStream : virtual public FileInterface, public Writable OutputStream() = default; }; -class ARROW_EXPORT InputStream : virtual public FileInterface, - virtual public Readable, - public std::enable_shared_from_this { +class ARROW_EXPORT InputStream : virtual public FileInterface, virtual public Readable { public: /// \brief Advance or skip stream indicated number of bytes /// \param[in] nbytes the number to move forward From fa4310635c784f03fe825ecc818efa3eca361ec0 Mon Sep 17 00:00:00 2001 From: James Duong Date: Tue, 19 Sep 2023 11:10:37 -0700 Subject: [PATCH 39/96] GH-37705: [Java] Extra input methods for binary writers (#37791) ### Rationale for this change ByteBuffer and byte[] are commonly used to hold binary data. The current writers require working with ArrowBuf objects which need to be populated by copying from these types, then copying into the vector. ### What changes are included in this PR? Add methods to VarBinary and LargeVarBinary writers to take in common binary parameters - byte[] and ByteBuffer. The writer now sets these objects on the Vectors directly. ### Are these changes tested? Yes. * Closes: #37705 Authored-by: James Duong Signed-off-by: David Li --- .../templates/AbstractFieldWriter.java | 18 +++ .../codegen/templates/ComplexWriters.java | 33 +++++ .../complex/writer/TestSimpleWriter.java | 138 ++++++++++++++++++ 3 files changed, 189 insertions(+) create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java diff --git a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java index 1f80f25266b57..5e6580b6131c1 100644 --- a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java @@ -124,6 +124,24 @@ public void write(${name}Holder holder) { } + <#if minor.class?ends_with("VarBinary")> + public void writeTo${minor.class}(byte[] value) { + fail("${name}"); + } + + public void writeTo${minor.class}(byte[] value, int offset, int length) { + fail("${name}"); + } + + public void writeTo${minor.class}(ByteBuffer value) { + fail("${name}"); + } + + public void writeTo${minor.class}(ByteBuffer value, int offset, int length) { + fail("${name}"); + } + + public void writeNull() { diff --git a/java/vector/src/main/codegen/templates/ComplexWriters.java b/java/vector/src/main/codegen/templates/ComplexWriters.java index 0b1e321afb70e..4ae4c4f75f208 100644 --- a/java/vector/src/main/codegen/templates/ComplexWriters.java +++ b/java/vector/src/main/codegen/templates/ComplexWriters.java @@ -180,6 +180,28 @@ public void writeNull() { vector.setValueCount(idx()+1); } + + <#if minor.class?ends_with("VarBinary")> + public void writeTo${minor.class}(byte[] value) { + vector.setSafe(idx(), value); + vector.setValueCount(idx() + 1); + } + + public void writeTo${minor.class}(byte[] value, int offset, int length) { + vector.setSafe(idx(), value, offset, length); + vector.setValueCount(idx() + 1); + } + + public void writeTo${minor.class}(ByteBuffer value) { + vector.setSafe(idx(), value, 0, 
value.remaining()); + vector.setValueCount(idx() + 1); + } + + public void writeTo${minor.class}(ByteBuffer value, int offset, int length) { + vector.setSafe(idx(), value, offset, length); + vector.setValueCount(idx() + 1); + } + } <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/writer/${eName}Writer.java" /> @@ -223,6 +245,17 @@ public interface ${eName}Writer extends BaseWriter { @Deprecated public void writeBigEndianBytesTo${minor.class}(byte[] value); + +<#if minor.class?ends_with("VarBinary")> + public void writeTo${minor.class}(byte[] value); + + public void writeTo${minor.class}(byte[] value, int offset, int length); + + public void writeTo${minor.class}(ByteBuffer value); + + public void writeTo${minor.class}(ByteBuffer value, int offset, int length); + + } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java new file mode 100644 index 0000000000000..7c06509b23c87 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.arrow.vector.complex.writer; + +import java.nio.ByteBuffer; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.LargeVarBinaryVector; +import org.apache.arrow.vector.VarBinaryVector; +import org.apache.arrow.vector.complex.impl.LargeVarBinaryWriterImpl; +import org.apache.arrow.vector.complex.impl.VarBinaryWriterImpl; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +public class TestSimpleWriter { + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Integer.MAX_VALUE); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testWriteByteArrayToVarBinary() { + try (VarBinaryVector vector = new VarBinaryVector("test", allocator); + VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + writer.writeToVarBinary(input); + byte[] result = vector.get(0); + Assert.assertArrayEquals(input, result); + } + } + + @Test + public void testWriteByteArrayWithOffsetToVarBinary() { + try (VarBinaryVector vector = new VarBinaryVector("test", allocator); + VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + writer.writeToVarBinary(input, 1, 1); + byte[] result = vector.get(0); + Assert.assertArrayEquals(new byte[] { 0x02 }, result); + } + } + + @Test + public void testWriteByteBufferToVarBinary() { + try (VarBinaryVector vector = new VarBinaryVector("test", allocator); + VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + ByteBuffer buffer = ByteBuffer.wrap(input); + writer.writeToVarBinary(buffer); + byte[] result = vector.get(0); + Assert.assertArrayEquals(input, result); + } + } + + @Test + public void testWriteByteBufferWithOffsetToVarBinary() { + try (VarBinaryVector vector = new VarBinaryVector("test", allocator); + VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + ByteBuffer buffer = ByteBuffer.wrap(input); + writer.writeToVarBinary(buffer, 1, 1); + byte[] result = vector.get(0); + Assert.assertArrayEquals(new byte[] { 0x02 }, result); + } + } + + @Test + public void testWriteByteArrayToLargeVarBinary() { + try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); + LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + writer.writeToLargeVarBinary(input); + byte[] result = vector.get(0); + Assert.assertArrayEquals(input, result); + } + } + + @Test + public void testWriteByteArrayWithOffsetToLargeVarBinary() { + try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); + LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + writer.writeToLargeVarBinary(input, 1, 1); + byte[] result = vector.get(0); + Assert.assertArrayEquals(new byte[] { 0x02 }, result); + } + } + + @Test + public void testWriteByteBufferToLargeVarBinary() { + try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); + LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + ByteBuffer buffer = ByteBuffer.wrap(input); + writer.writeToLargeVarBinary(buffer); + byte[] result = vector.get(0); + 
Assert.assertArrayEquals(input, result); + } + } + + @Test + public void testWriteByteBufferWithOffsetToLargeVarBinary() { + try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); + LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + byte[] input = new byte[] { 0x01, 0x02 }; + ByteBuffer buffer = ByteBuffer.wrap(input); + writer.writeToLargeVarBinary(buffer, 1, 1); + byte[] result = vector.get(0); + Assert.assertArrayEquals(new byte[] { 0x02 }, result); + } + } +} From cda1e8fa00dbb0215f7e7e07b81dca3554e2f430 Mon Sep 17 00:00:00 2001 From: Kevin Gurney Date: Tue, 19 Sep 2023 16:06:29 -0400 Subject: [PATCH 40/96] GH-37532: [CI][Docs][MATLAB] Remove `GoogleTest` support from the CMake build system for the MATLAB interface (#37784) ### Rationale for this change This pull request removes `GoogleTest` support from the CMake build system for the MATLAB interface. 1. `GoogleTest` support adds a lot of additional complexity to the CMake build system for the MATLAB interface, and we currently don't have any standalone C++ tests for the MATLAB interface code. 2. In order to use `GoogleTest` in the MATLAB CI workflows, we are currently relying on building the tests for the Arrow C++ libraries in order to "re-use" the `GoogleTest` binaries. This adds additional overhead to the MATLAB CI workflows. 3. If we want to test some internal C++ code for the MATLAB interface in the future, we can instead use a MEX function to call the code from a MATLAB test as suggested by @ kou in https://github.com/apache/arrow/issues/37532#issue-1879320217. 4. There is [precedent for testing internal C++ code without GoogleTest for the Python bindings](https://github.com/apache/arrow/pull/14117). 5. On a somewhat related note - removing `GoogleTest` support will help unblock https://github.com/apache/arrow/pull/37773 as discussed in https://github.com/apache/arrow/pull/37773#issuecomment-1724207859. ### What changes are included in this PR? 1. Removed the `MATLAB_BUILD_TESTS` flag from the CMake build system for the MATLAB interface since there are no longer any C++ tests for the MATLAB interface to build. 2. Updated the `matlab_build.sh` CI workflow script to avoid building the tests for the Arrow C++ libraries and to no longer call `ctest`. 3. Updated the `README.md` for the MATLAB interface to no longer mention building or running C++ tests. 4. Updated the design document for the MATLAB Interface to no longer mention `GoogleTest` since we may end up testing internal C++ code using MEX function calls from MATLAB instead. 5. Removed placeholder C++ test (i.e. `placeholder_test.cc`). ### Are these changes tested? Yes. The MATLAB CI workflow is passing on all platforms. ### Are there any user-facing changes? Yes. There are no longer any C++ tests for the MATLAB interface. The `MATLAB_BUILD_TESTS` flag has been removed from the CMake build system to reflect this change. If a user supplies a value for `MATLAB_BUILD_TESTS` when building the MATLAB interface, the flag will be ignored by CMake. ### Future Directions 1. Add more developer-focused documentation on how to test C++ code via MEX function calls from MATLAB. ### Notes 1. In the future, we can consider testing internal C++ code using MEX function calls from MATLAB tests as suggested by @ kou in https://github.com/apache/arrow/issues/37532#issue-1879320217. Currently, we don't have any C++ tests that need to be adapted to use this approach (a rough sketch of such a MEX gateway follows these notes). 2. Thank you @ sgilmore10 for your help with this pull request!
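As a rough illustration of the MEX-based testing approach described in note 1, a small MEX gateway can expose an internal C++ routine so that a MATLAB class-based unit test can call it directly. This is a hedged sketch only, not code from this PR: `ValidateName` stands in for whatever internal C++ function needs testing, while `mexFunction` and the `mx*`/`mex*` calls are the standard MATLAB MEX C API.

```cpp
// testutil_mex.cc - hypothetical sketch; build with `mex testutil_mex.cc`.
#include <cstring>

#include "mex.h"

namespace {
// Stand-in for an internal MATLAB-interface C++ routine under test.
bool ValidateName(const char* name) {
  return name != nullptr && std::strlen(name) > 0;
}
}  // namespace

// Standard MEX gateway: MATLAB calls this as `result = testutil_mex("name")`.
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
  if (nrhs != 1 || !mxIsChar(prhs[0])) {
    mexErrMsgIdAndTxt("arrow:testutil:badInput", "Expected one string input.");
  }
  char* name = mxArrayToString(prhs[0]);
  plhs[0] = mxCreateLogicalScalar(ValidateName(name));
  mxFree(name);
}
```

A MATLAB class-based unit test could then simply assert that `testutil_mex("x")` returns true.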
* Closes: #37532 Lead-authored-by: Kevin Gurney Co-authored-by: Sarah Gilmore Signed-off-by: Kevin Gurney --- ci/scripts/matlab_build.sh | 2 - matlab/CMakeLists.txt | 266 +++--------------- matlab/README.md | 19 -- ...atlab_interface_for_apache_arrow_design.md | 12 +- matlab/src/placeholder_test.cc | 27 -- 5 files changed, 49 insertions(+), 277 deletions(-) delete mode 100644 matlab/src/placeholder_test.cc diff --git a/ci/scripts/matlab_build.sh b/ci/scripts/matlab_build.sh index 235002da3afc6..d3f86adbb8a2b 100755 --- a/ci/scripts/matlab_build.sh +++ b/ci/scripts/matlab_build.sh @@ -29,8 +29,6 @@ cmake \ -S ${source_dir} \ -B ${build_dir} \ -G Ninja \ - -D MATLAB_BUILD_TESTS=ON \ -D CMAKE_INSTALL_PREFIX=${install_dir} \ -D MATLAB_ADD_INSTALL_DIR_TO_SEARCH_PATH=OFF cmake --build ${build_dir} --config Release --target install -ctest --test-dir ${build_dir} diff --git a/matlab/CMakeLists.txt b/matlab/CMakeLists.txt index d73173b58e78a..c8100a389ace0 100644 --- a/matlab/CMakeLists.txt +++ b/matlab/CMakeLists.txt @@ -17,9 +17,9 @@ cmake_minimum_required(VERSION 3.20) -# Build the Arrow C++ libraries. +# Build the Arrow C++ libraries using ExternalProject_Add. function(build_arrow) - set(options BUILD_GTEST) + set(options) set(one_value_args) set(multi_value_args) @@ -37,70 +37,50 @@ function(build_arrow) set(ARROW_CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=${ARROW_PREFIX}" "-DCMAKE_INSTALL_LIBDIR=lib" "-DARROW_BUILD_STATIC=OFF") - if(Arrow_FOUND - AND NOT GTest_FOUND - AND ARG_BUILD_GTEST) - # If find_package has already found a valid Arrow installation, then - # we don't want to link against the Arrow libraries that will be built - # from source. - # - # However, we still need to create a library target to trigger building - # of the arrow_ep target, which will ultimately build the bundled - # GoogleTest binaries. - add_library(arrow_shared_for_gtest SHARED IMPORTED) - set(ARROW_LIBRARY_TARGET arrow_shared_for_gtest) + add_library(arrow_shared SHARED IMPORTED) + set(ARROW_LIBRARY_TARGET arrow_shared) + + # Set the runtime shared library (.dll, .so, or .dylib) + if(WIN32) + # The shared library (i.e. .dll) is located in the "bin" directory. + set(ARROW_SHARED_LIBRARY_DIR "${ARROW_PREFIX}/bin") else() - add_library(arrow_shared SHARED IMPORTED) - set(ARROW_LIBRARY_TARGET arrow_shared) - - # Set the runtime shared library (.dll, .so, or .dylib) - if(WIN32) - # The shared library (i.e. .dll) is located in the "bin" directory. - set(ARROW_SHARED_LIBRARY_DIR "${ARROW_PREFIX}/bin") - else() - # The shared library (i.e. .so or .dylib) is located in the "lib" directory. - set(ARROW_SHARED_LIBRARY_DIR "${ARROW_PREFIX}/lib") - endif() - - set(ARROW_SHARED_LIB_FILENAME - "${CMAKE_SHARED_LIBRARY_PREFIX}arrow${CMAKE_SHARED_LIBRARY_SUFFIX}") - set(ARROW_SHARED_LIB "${ARROW_SHARED_LIBRARY_DIR}/${ARROW_SHARED_LIB_FILENAME}") - - set_target_properties(arrow_shared PROPERTIES IMPORTED_LOCATION ${ARROW_SHARED_LIB}) - - # Set the link-time import library (.lib) - if(WIN32) - # The import library (i.e. .lib) is located in the "lib" directory. 
- set(ARROW_IMPORT_LIB_FILENAME - "${CMAKE_IMPORT_LIBRARY_PREFIX}arrow${CMAKE_IMPORT_LIBRARY_SUFFIX}") - set(ARROW_IMPORT_LIB "${ARROW_PREFIX}/lib/${ARROW_IMPORT_LIB_FILENAME}") - - set_target_properties(arrow_shared PROPERTIES IMPORTED_IMPLIB ${ARROW_IMPORT_LIB}) - endif() - - # Set the include directories - set(ARROW_INCLUDE_DIR "${ARROW_PREFIX}/include") - file(MAKE_DIRECTORY "${ARROW_INCLUDE_DIR}") - set_target_properties(arrow_shared PROPERTIES INTERFACE_INCLUDE_DIRECTORIES - ${ARROW_INCLUDE_DIR}) - - # Set the build byproducts for the ExternalProject build - # The appropriate libraries need to be guaranteed to be available when linking the test - # executables. - if(WIN32) - set(ARROW_BUILD_BYPRODUCTS "${ARROW_IMPORT_LIB}") - else() - set(ARROW_BUILD_BYPRODUCTS "${ARROW_SHARED_LIB}") - endif() + # The shared library (i.e. .so or .dylib) is located in the "lib" directory. + set(ARROW_SHARED_LIBRARY_DIR "${ARROW_PREFIX}/lib") endif() - # Building the Arrow C++ libraries and bundled GoogleTest binaries requires ExternalProject. - include(ExternalProject) + set(ARROW_SHARED_LIB_FILENAME + "${CMAKE_SHARED_LIBRARY_PREFIX}arrow${CMAKE_SHARED_LIBRARY_SUFFIX}") + set(ARROW_SHARED_LIB "${ARROW_SHARED_LIBRARY_DIR}/${ARROW_SHARED_LIB_FILENAME}") + + set_target_properties(arrow_shared PROPERTIES IMPORTED_LOCATION ${ARROW_SHARED_LIB}) + + # Set the link-time import library (.lib) + if(WIN32) + # The import library (i.e. .lib) is located in the "lib" directory. + set(ARROW_IMPORT_LIB_FILENAME + "${CMAKE_IMPORT_LIBRARY_PREFIX}arrow${CMAKE_IMPORT_LIBRARY_SUFFIX}") + set(ARROW_IMPORT_LIB "${ARROW_PREFIX}/lib/${ARROW_IMPORT_LIB_FILENAME}") - if(ARG_BUILD_GTEST) - enable_gtest() + set_target_properties(arrow_shared PROPERTIES IMPORTED_IMPLIB ${ARROW_IMPORT_LIB}) endif() + # Set the include directories + set(ARROW_INCLUDE_DIR "${ARROW_PREFIX}/include") + file(MAKE_DIRECTORY "${ARROW_INCLUDE_DIR}") + set_target_properties(arrow_shared PROPERTIES INTERFACE_INCLUDE_DIRECTORIES + ${ARROW_INCLUDE_DIR}) + + # Set the build byproducts for the ExternalProject build + if(WIN32) + set(ARROW_BUILD_BYPRODUCTS "${ARROW_IMPORT_LIB}") + else() + set(ARROW_BUILD_BYPRODUCTS "${ARROW_SHARED_LIB}") + endif() + + # Building the Arrow C++ libraries requires ExternalProject. 
+ include(ExternalProject) + externalproject_add(arrow_ep SOURCE_DIR "${CMAKE_SOURCE_DIR}/../cpp" BINARY_DIR "${ARROW_BINARY_DIR}" @@ -109,69 +89,8 @@ function(build_arrow) add_dependencies(${ARROW_LIBRARY_TARGET} arrow_ep) - if(ARG_BUILD_GTEST) - build_gtest() - endif() endfunction() -macro(enable_gtest) - set(ARROW_GTEST_INCLUDE_DIR "${ARROW_PREFIX}/include/arrow-gtest") - - set(ARROW_GTEST_IMPORT_LIB_DIR "${ARROW_PREFIX}/lib") - if(WIN32) - set(ARROW_GTEST_SHARED_LIB_DIR "${ARROW_PREFIX}/bin") - else() - set(ARROW_GTEST_SHARED_LIB_DIR "${ARROW_PREFIX}/lib") - endif() - set(ARROW_GTEST_IMPORT_LIB - "${ARROW_GTEST_IMPORT_LIB_DIR}/${CMAKE_IMPORT_LIBRARY_PREFIX}arrow_gtest${CMAKE_IMPORT_LIBRARY_SUFFIX}" - ) - set(ARROW_GTEST_MAIN_IMPORT_LIB - "${ARROW_GTEST_IMPORT_LIB_DIR}/${CMAKE_IMPORT_LIBRARY_PREFIX}arrow_gtest_main${CMAKE_IMPORT_LIBRARY_SUFFIX}" - ) - set(ARROW_GTEST_SHARED_LIB - "${ARROW_GTEST_SHARED_LIB_DIR}/${CMAKE_SHARED_LIBRARY_PREFIX}arrow_gtest${CMAKE_SHARED_LIBRARY_SUFFIX}" - ) - set(ARROW_GTEST_MAIN_SHARED_LIB - "${ARROW_GTEST_SHARED_LIB_DIR}/${CMAKE_SHARED_LIBRARY_PREFIX}arrow_gtest_main${CMAKE_SHARED_LIBRARY_SUFFIX}" - ) - - list(APPEND ARROW_CMAKE_ARGS "-DARROW_BUILD_TESTS=ON") - - # The appropriate libraries need to be guaranteed to be available when linking the test - # executables. - if(WIN32) - # On Windows, add the gtest link libraries as BUILD_BYPRODUCTS for arrow_ep. - list(APPEND ARROW_BUILD_BYPRODUCTS "${ARROW_GTEST_IMPORT_LIB}" - "${ARROW_GTEST_MAIN_IMPORT_LIB}") - else() - # On Linux and macOS, add the gtest shared libraries as BUILD_BYPRODUCTS for arrow_ep. - list(APPEND ARROW_BUILD_BYPRODUCTS "${ARROW_GTEST_SHARED_LIB}" - "${ARROW_GTEST_MAIN_SHARED_LIB}") - endif() -endmacro() - -# Build the GoogleTest binaries that are bundled with the Arrow C++ libraries. -macro(build_gtest) - file(MAKE_DIRECTORY "${ARROW_GTEST_INCLUDE_DIR}") - - # Create target GTest::gtest - add_library(GTest::gtest SHARED IMPORTED) - set_target_properties(GTest::gtest - PROPERTIES IMPORTED_IMPLIB ${ARROW_GTEST_IMPORT_LIB} - IMPORTED_LOCATION ${ARROW_GTEST_SHARED_LIB} - INTERFACE_INCLUDE_DIRECTORIES - ${ARROW_GTEST_INCLUDE_DIR}) - add_dependencies(GTest::gtest arrow_ep) - - # Create target GTest::gtest_main - add_library(GTest::gtest_main SHARED IMPORTED) - set_target_properties(GTest::gtest_main - PROPERTIES IMPORTED_IMPLIB ${ARROW_GTEST_MAIN_IMPORT_LIB} - IMPORTED_LOCATION ${ARROW_GTEST_MAIN_SHARED_LIB}) - add_dependencies(GTest::gtest_main arrow_ep) -endmacro() - set(CMAKE_CXX_STANDARD 17) set(MLARROW_VERSION "14.0.0-SNAPSHOT") @@ -185,8 +104,6 @@ if(WIN32) set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreadedDLL") endif() -option(MATLAB_BUILD_TESTS "Build the C++ tests for the MATLAB interface" OFF) - # Add tools/cmake directory to the CMAKE_MODULE_PATH. list(PREPEND CMAKE_MODULE_PATH ${CMAKE_SOURCE_DIR}/tools/cmake) @@ -208,56 +125,9 @@ else() set(MATLAB_BUILD_OUTPUT_DIR "${CMAKE_BINARY_DIR}") endif() -# Only build the MATLAB interface C++ tests if MATLAB_BUILD_TESTS=ON. -if(MATLAB_BUILD_TESTS) - # find_package(GTest) supports custom GTEST_ROOT as well as package managers. - find_package(GTest) - - if(NOT GTest_FOUND) - # find_package(Arrow) supports custom ARROW_HOME as well as package - # managers. - find_package(Arrow QUIET) - # Trigger an automatic build of the Arrow C++ libraries and bundled - # GoogleTest binaries. If a valid Arrow installation was not already - # found by find_package, then build_arrow will use the Arrow - # C++ libraries that are built from source. 
- build_arrow(BUILD_GTEST) - else() - # On Windows, IMPORTED_LOCATION needs to be set to indicate where the shared - # libraries live when GTest is found. - if(WIN32) - set(GTEST_SHARED_LIB_DIR "${GTEST_ROOT}/bin") - set(GTEST_SHARED_LIBRARY_FILENAME - "${CMAKE_SHARED_LIBRARY_PREFIX}gtest${CMAKE_SHARED_LIBRARY_SUFFIX}") - set(GTEST_SHARED_LIBRARY_LIB - "${GTEST_SHARED_LIB_DIR}/${GTEST_SHARED_LIBRARY_FILENAME}") - - set(GTEST_MAIN_SHARED_LIB_DIR "${GTEST_ROOT}/bin") - set(GTEST_MAIN_SHARED_LIBRARY_FILENAME - "${CMAKE_SHARED_LIBRARY_PREFIX}gtest_main${CMAKE_SHARED_LIBRARY_SUFFIX}") - set(GTEST_MAIN_SHARED_LIBRARY_LIB - "${GTEST_MAIN_SHARED_LIB_DIR}/${GTEST_MAIN_SHARED_LIBRARY_FILENAME}") - - set_target_properties(GTest::gtest PROPERTIES IMPORTED_LOCATION - "${GTEST_SHARED_LIBRARY_LIB}") - - set_target_properties(GTest::gtest_main - PROPERTIES IMPORTED_LOCATION - "${GTEST_MAIN_SHARED_LIBRARY_LIB}") - endif() - - find_package(Arrow QUIET) - if(NOT Arrow_FOUND) - # Trigger an automatic build of the Arrow C++ libraries. - build_arrow() - endif() - endif() - -else() - find_package(Arrow QUIET) - if(NOT Arrow_FOUND) - build_arrow() - endif() +find_package(Arrow QUIET) +if(NOT Arrow_FOUND) + build_arrow() endif() # MATLAB is Required @@ -311,56 +181,6 @@ else() message(STATUS "ARROW_INCLUDE_DIR: ${ARROW_INCLUDE_DIR}") endif() -# ############################################################################## -# C++ Tests -# ############################################################################## -# Only build the C++ tests if MATLAB_BUILD_TESTS=ON. -if(MATLAB_BUILD_TESTS) - enable_testing() - - # Define a test executable target. TODO: Remove the placeholder test. This is - # just for testing GoogleTest integration. - add_executable(placeholder_test ${CMAKE_SOURCE_DIR}/src/placeholder_test.cc) - - # Declare a dependency on the GTest::gtest and GTest::gtest_main IMPORTED - # targets. - target_link_libraries(placeholder_test GTest::gtest GTest::gtest_main) - - # Ensure using GTest:gtest and GTest::gtest_main on macOS without - # specifying DYLD_LIBRARY_DIR. - set_target_properties(placeholder_test - PROPERTIES BUILD_RPATH - "$;$" - ) - - # Add test targets for C++ tests. - add_test(PlaceholderTestTarget placeholder_test) - - # On Windows: - # Add the directory of gtest.dll and gtest_main.dll to the %PATH% for running - # all tests. - # Add the directory of libmx.dll, libmex.dll, and libarrow.dll to the %PATH% for running - # CheckNumArgsTestTarget. - # Note: When appending to the path using set_test_properties' ENVIRONMENT property, - # make sure that we escape ';' to prevent CMake from interpreting the input as - # a list of strings. 
- if(WIN32) - get_target_property(GTEST_SHARED_LIB GTest::gtest IMPORTED_LOCATION) - get_filename_component(GTEST_SHARED_LIB_DIR ${GTEST_SHARED_LIB} DIRECTORY) - - get_target_property(GTEST_MAIN_SHARED_LIB GTest::gtest_main IMPORTED_LOCATION) - get_filename_component(GTEST_MAIN_SHARED_LIB_DIR ${GTEST_MAIN_SHARED_LIB} DIRECTORY) - - set_tests_properties(PlaceholderTestTarget - PROPERTIES ENVIRONMENT - "PATH=${GTEST_SHARED_LIB_DIR}\;${GTEST_MAIN_SHARED_LIB_DIR}\;$ENV{PATH}" - ) - - get_target_property(ARROW_SHARED_LIB arrow_shared IMPORTED_LOCATION) - get_filename_component(ARROW_SHARED_LIB_DIR ${ARROW_SHARED_LIB} DIRECTORY) - endif() -endif() - # ############################################################################## # Install # ############################################################################## diff --git a/matlab/README.md b/matlab/README.md index d6b08fbee1c15..0a2bdf01f465f 100644 --- a/matlab/README.md +++ b/matlab/README.md @@ -100,31 +100,12 @@ As part of the install step, the installation directory is added to the [MATLAB ## Test -There are two kinds of tests for the MATLAB Interface: MATLAB and C++. - -### MATLAB - To run the MATLAB tests, start MATLAB in the `arrow/matlab` directory and call the [`runtests`](https://mathworks.com/help/matlab/ref/runtests.html) command on the `test` directory with `IncludeSubFolders=true`: ``` matlab >> runtests("test", IncludeSubFolders=true); ``` -### C++ - -To enable the C++ tests, set the `MATLAB_BUILD_TESTS` flag to `ON` at build time: - -```console -$ cmake -S . -B build -D MATLAB_BUILD_TESTS=ON -$ cmake --build build --config Release -``` - -After building with the `MATLAB_BUILD_TESTS` flag enabled, the C++ tests can be run using [CTest](https://cmake.org/cmake/help/latest/manual/ctest.1.html): - -```console -$ ctest --test-dir build -``` - ## Usage Included below are some example code snippets that illustrate how to use the MATLAB interface. diff --git a/matlab/doc/matlab_interface_for_apache_arrow_design.md b/matlab/doc/matlab_interface_for_apache_arrow_design.md index 79b43fd02518b..17c7ba254c0ea 100644 --- a/matlab/doc/matlab_interface_for_apache_arrow_design.md +++ b/matlab/doc/matlab_interface_for_apache_arrow_design.md @@ -257,14 +257,13 @@ For large tables used in a multi-process "data processing pipeline", a user coul ## Testing To ensure code quality, we would like to include the following testing infrastructure, at a minimum: -1. C++ APIs - - GoogleTest C++ Unit Tests - - Integration with CI workflows -2. MATLAB APIs - - [MATLAB Class-Based Unit Tests] - - Integration with CI workflows + +1. [MATLAB Class-Based Unit Tests] +2. [MATLAB CI Workflows] 3. [Integration Testing] +**Note**: To test internal C++ code, we can use a [MEX function] to call the C++ code from a MATLAB Class-Based Unit Test. + ## Documentation To ensure usability, discoverability, and accessibility, we would like to include high quality documentation for the MATLAB Interface for Apache Arrow. 
@@ -318,3 +317,4 @@ The table below provides a high-level roadmap for the development of specific ca [`apache-arrow` package via the `npm` package manager]: https://www.npmjs.com/package/apache-arrow [Rust user]: https://github.com/apache/arrow-rs [`arrow` crate via the `cargo` package manager]: https://crates.io/crates/arrow +[MATLAB CI Workflows]: https://github.com/apache/arrow/actions/workflows/matlab.yml diff --git a/matlab/src/placeholder_test.cc b/matlab/src/placeholder_test.cc deleted file mode 100644 index eef37e178f623..0000000000000 --- a/matlab/src/placeholder_test.cc +++ /dev/null @@ -1,27 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include <gtest/gtest.h> - -namespace arrow { -namespace matlab { -namespace test { -// TODO: Remove this placeholder test. -TEST(PlaceholderTestSuite, PlaceholderTestCase) { ASSERT_TRUE(true); } -} // namespace test -} // namespace matlab -} // namespace arrow From 868b9bde7b7eefe487995dca279149a46eb99c34 Mon Sep 17 00:00:00 2001 From: Chris Jordan-Squire <788080+chrisjordansquire@users.noreply.github.com> Date: Tue, 19 Sep 2023 20:16:45 -0400 Subject: [PATCH 41/96] GH-36111: [C++] Refactor dict_internal.h to use Result (#37754) ### Rationale for this change This change addresses #36111, updating the method GetDictionaryArrayData in dict_internal.h to return a `Result` instead of a `Status` with an out-parameter (a sketch of the pattern follows this message). ### What changes are included in this PR? This is a narrow interpretation of the above issue, only changing the return type with minimal updates to the call sites or their return types. ### Are these changes tested? Yes. The C++ test suite was run on the `ninja-debug` build target using the command `ctest -j16 --output-on-failure`. All tests not requiring parquet data passed. (I was unsure how to set up those tests based on the Arrow development guidelines.) ### Are there any user-facing changes? No. The only changes should be library-internal.
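For readers unfamiliar with the pattern, here is a minimal hedged sketch of what this refactor looks like in general (the `MakeThing` functions are illustrative, not code from this PR): a `Status`-plus-out-parameter signature becomes a `Result<T>`, and call sites switch from a local variable plus `RETURN_NOT_OK` to `ARROW_ASSIGN_OR_RAISE`.

```cpp
#include <memory>

#include "arrow/result.h"
#include "arrow/status.h"

struct Thing {};

// Before: the output travels through a pointer argument.
arrow::Status MakeThingOld(int x, std::shared_ptr<Thing>* out) {
  if (x < 0) return arrow::Status::Invalid("negative x");
  *out = std::make_shared<Thing>();
  return arrow::Status::OK();
}

// After: the success value and the error travel in a single Result<T>.
arrow::Result<std::shared_ptr<Thing>> MakeThing(int x) {
  if (x < 0) return arrow::Status::Invalid("negative x");
  return std::make_shared<Thing>();
}

arrow::Status Caller() {
  // ARROW_ASSIGN_OR_RAISE unwraps the Result or propagates its error Status.
  ARROW_ASSIGN_OR_RAISE(auto thing, MakeThing(42));
  (void)thing;  // use `thing` here
  return arrow::Status::OK();
}
```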
* Closes: #36111 Authored-by: Chris Jordan-Squire Signed-off-by: Benjamin Kietzman --- cpp/src/arrow/array/array_dict.cc | 12 ++--- cpp/src/arrow/array/builder_dict.cc | 5 ++- cpp/src/arrow/array/dict_internal.h | 47 ++++++++------------ cpp/src/arrow/compute/kernels/vector_hash.cc | 5 ++- 4 files changed, 31 insertions(+), 38 deletions(-)
diff --git a/cpp/src/arrow/array/array_dict.cc b/cpp/src/arrow/array/array_dict.cc index cccc7bb78220d..c9e2f93cde66f 100644 --- a/cpp/src/arrow/array/array_dict.cc +++ b/cpp/src/arrow/array/array_dict.cc @@ -282,9 +282,9 @@ class DictionaryUnifierImpl : public DictionaryUnifier { *out_type = arrow::dictionary(index_type, value_type_); // Build unified dictionary array - std::shared_ptr<ArrayData> data; - RETURN_NOT_OK(DictTraits::GetDictionaryArrayData(pool_, value_type_, memo_table_, - 0 /* start_offset */, &data)); + ARROW_ASSIGN_OR_RAISE( + auto data, DictTraits::GetDictionaryArrayData(pool_, value_type_, memo_table_, + 0 /* start_offset */)); *out_dict = MakeArray(data); return Status::OK(); } @@ -299,9 +299,9 @@ class DictionaryUnifierImpl : public DictionaryUnifier { } // Build unified dictionary array - std::shared_ptr<ArrayData> data; - RETURN_NOT_OK(DictTraits::GetDictionaryArrayData(pool_, value_type_, memo_table_, - 0 /* start_offset */, &data)); + ARROW_ASSIGN_OR_RAISE( + auto data, DictTraits::GetDictionaryArrayData(pool_, value_type_, memo_table_, - 0 /* start_offset */)); *out_dict = MakeArray(data); return Status::OK(); }
diff --git a/cpp/src/arrow/array/builder_dict.cc b/cpp/src/arrow/array/builder_dict.cc index 061fb600412fd..525b0afbc908a 100644 --- a/cpp/src/arrow/array/builder_dict.cc +++ b/cpp/src/arrow/array/builder_dict.cc @@ -106,8 +106,9 @@ class DictionaryMemoTable::DictionaryMemoTableImpl { enable_if_memoize<T, Status> Visit(const T&) { using ConcreteMemoTable = typename DictionaryTraits<T>::MemoTableType; auto memo_table = checked_cast<ConcreteMemoTable*>(memo_table_); - return DictionaryTraits<T>::GetDictionaryArrayData(pool_, value_type_, *memo_table, - start_offset_, out_); + ARROW_ASSIGN_OR_RAISE(*out_, DictionaryTraits<T>::GetDictionaryArrayData( + pool_, value_type_, *memo_table, start_offset_)); + return Status::OK(); } };
diff --git a/cpp/src/arrow/array/dict_internal.h b/cpp/src/arrow/array/dict_internal.h index 5245c8d0ff313..3c1c8c453d1e7 100644 --- a/cpp/src/arrow/array/dict_internal.h +++ b/cpp/src/arrow/array/dict_internal.h @@ -29,6 +29,7 @@ #include "arrow/array.h" #include "arrow/buffer.h" +#include "arrow/result.h" #include "arrow/status.h" #include "arrow/type.h" #include "arrow/type_traits.h" @@ -63,11 +64,9 @@ struct DictionaryTraits<BooleanType> { using T = BooleanType; using MemoTableType = typename HashTraits<T>::MemoTableType; - static Status GetDictionaryArrayData(MemoryPool* pool, - const std::shared_ptr<DataType>& type, - const MemoTableType& memo_table, - int64_t start_offset, - std::shared_ptr<ArrayData>* out) { + static Result<std::shared_ptr<ArrayData>> GetDictionaryArrayData( + MemoryPool* pool, const std::shared_ptr<DataType>& type, + const MemoTableType& memo_table, int64_t start_offset) { if (start_offset < 0) { return Status::Invalid("invalid start_offset ", start_offset); } @@ -82,7 +81,9 @@ struct DictionaryTraits<BooleanType> { : builder.Append(bool_values[i])); } - return builder.FinishInternal(out); + std::shared_ptr<ArrayData> out; + RETURN_NOT_OK(builder.FinishInternal(&out)); + return out; } }; // namespace internal @@ -91,11 +92,9 @@ struct DictionaryTraits<T, enable_if_has_c_type<T>> { using c_type = typename T::c_type; using MemoTableType = typename HashTraits<T>::MemoTableType; - static Status GetDictionaryArrayData(MemoryPool* pool, - const std::shared_ptr<DataType>& type, - const MemoTableType& memo_table, - int64_t start_offset, - std::shared_ptr<ArrayData>* out) { + static Result<std::shared_ptr<ArrayData>> GetDictionaryArrayData( + MemoryPool* pool, const std::shared_ptr<DataType>& type, + const MemoTableType& memo_table, int64_t start_offset) { auto dict_length = static_cast<int64_t>(memo_table.size()) - start_offset; // This makes a copy, but we assume a dictionary array is usually small // compared to the size of the dictionary-using array. @@ -112,8 +111,7 @@ struct DictionaryTraits<T, enable_if_has_c_type<T>> { RETURN_NOT_OK( ComputeNullBitmap(pool, memo_table, start_offset, &null_count, &null_bitmap)); - *out = ArrayData::Make(type, dict_length, {null_bitmap, dict_buffer}, null_count); - return Status::OK(); + return ArrayData::Make(type, dict_length, {null_bitmap, dict_buffer}, null_count); } }; @@ -121,11 +119,9 @@ template <typename T> struct DictionaryTraits<T, enable_if_base_binary<T>> { using MemoTableType = typename HashTraits<T>::MemoTableType; - static Status GetDictionaryArrayData(MemoryPool* pool, - const std::shared_ptr<DataType>& type, - const MemoTableType& memo_table, - int64_t start_offset, - std::shared_ptr<ArrayData>* out) { + static Result<std::shared_ptr<ArrayData>> GetDictionaryArrayData( + MemoryPool* pool, const std::shared_ptr<DataType>& type, + const MemoTableType& memo_table, int64_t start_offset) { using offset_type = typename T::offset_type; // Create the offsets buffer @@ -148,11 +144,9 @@ struct DictionaryTraits<T, enable_if_base_binary<T>> { RETURN_NOT_OK( ComputeNullBitmap(pool, memo_table, start_offset, &null_count, &null_bitmap)); - *out = ArrayData::Make(type, dict_length, + return ArrayData::Make(type, dict_length, {null_bitmap, std::move(dict_offsets), std::move(dict_data)}, null_count); - - return Status::OK(); } }; @@ -160,11 +154,9 @@ template <typename T> struct DictionaryTraits<T, enable_if_fixed_size_binary<T>> { using MemoTableType = typename HashTraits<T>::MemoTableType; - static Status GetDictionaryArrayData(MemoryPool* pool, - const std::shared_ptr<DataType>& type, - const MemoTableType& memo_table, - int64_t start_offset, - std::shared_ptr<ArrayData>* out) { + static Result<std::shared_ptr<ArrayData>> GetDictionaryArrayData( + MemoryPool* pool, const std::shared_ptr<DataType>& type, + const MemoTableType& memo_table, int64_t start_offset) { const T& concrete_type = internal::checked_cast<const T&>(*type); // Create the data buffer @@ -182,9 +174,8 @@ struct DictionaryTraits<T, enable_if_fixed_size_binary<T>> { RETURN_NOT_OK( ComputeNullBitmap(pool, memo_table, start_offset, &null_count, &null_bitmap)); - *out = ArrayData::Make(type, dict_length, {null_bitmap, std::move(dict_data)}, + return ArrayData::Make(type, dict_length, {null_bitmap, std::move(dict_data)}, null_count); - return Status::OK(); } };
diff --git a/cpp/src/arrow/compute/kernels/vector_hash.cc b/cpp/src/arrow/compute/kernels/vector_hash.cc index a7bb2d88c291b..d9143b760f32b 100644 --- a/cpp/src/arrow/compute/kernels/vector_hash.cc +++ b/cpp/src/arrow/compute/kernels/vector_hash.cc @@ -285,8 +285,9 @@ class RegularHashKernel : public HashKernel { Status FlushFinal(ExecResult* out) override { return action_.FlushFinal(out); } Status GetDictionary(std::shared_ptr<ArrayData>* out) override { - return DictionaryTraits<Type>::GetDictionaryArrayData(pool_, type_, *memo_table_, - 0 /* start_offset */, out); + ARROW_ASSIGN_OR_RAISE(*out, DictionaryTraits<Type>::GetDictionaryArrayData( + pool_, type_, *memo_table_, 0 /* start_offset */)); + return Status::OK(); } std::shared_ptr<DataType> value_type() const override { return type_; }
From f52ebbbf76df5d5a16257aae2504d23319723ae5 Mon Sep 17 00:00:00 2001 From: Judah Rand <17158624+judahrand@users.noreply.github.com> Date: Wed, 20 Sep 2023 09:52:16 +0100 Subject: [PATCH 42/96] GH-37470: [Python][Parquet] Add missing arguments to `ParquetFileWriteOptions` (#37469) ### Rationale for
this change I think this may have been missed when this feature was added. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: #37470 Authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com> Signed-off-by: AlenkaF --- python/pyarrow/_dataset_parquet.pyx | 8 +++++++ python/pyarrow/tests/test_dataset.py | 32 ++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+)
diff --git a/python/pyarrow/_dataset_parquet.pyx b/python/pyarrow/_dataset_parquet.pyx index 79bd270ce54d2..cf5c44c1c964a 100644 --- a/python/pyarrow/_dataset_parquet.pyx +++ b/python/pyarrow/_dataset_parquet.pyx @@ -595,6 +595,10 @@ cdef class ParquetFileWriteOptions(FileWriteOptions): ), column_encoding=self._properties["column_encoding"], data_page_version=self._properties["data_page_version"], + encryption_properties=self._properties["encryption_properties"], + write_batch_size=self._properties["write_batch_size"], + dictionary_pagesize_limit=self._properties["dictionary_pagesize_limit"], + write_page_index=self._properties["write_page_index"], ) def _set_arrow_properties(self): @@ -631,6 +635,10 @@ cdef class ParquetFileWriteOptions(FileWriteOptions): coerce_timestamps=None, allow_truncated_timestamps=False, use_compliant_nested_type=True, + encryption_properties=None, + write_batch_size=None, + dictionary_pagesize_limit=None, + write_page_index=False, ) self._set_properties() self._set_arrow_properties()
diff --git a/python/pyarrow/tests/test_dataset.py b/python/pyarrow/tests/test_dataset.py index b8a0c38089980..e0988f2752033 100644 --- a/python/pyarrow/tests/test_dataset.py +++ b/python/pyarrow/tests/test_dataset.py @@ -5291,6 +5291,38 @@ def test_write_dataset_preserve_field_metadata(tempdir): assert dataset.to_table().schema.equals(schema_metadata, check_metadata=True) +def test_write_dataset_write_page_index(tempdir): + for write_statistics in [True, False]: + for write_page_index in [True, False]: + schema = pa.schema([ + pa.field("x", pa.int64()), + pa.field("y", pa.int64())]) + + arrays = [[1, 2, 3], [None, 5, None]] + table = pa.Table.from_arrays(arrays, schema=schema) + + file_format = ds.ParquetFileFormat() + base_dir = tempdir / f"write_page_index_{write_page_index}" + ds.write_dataset( + table, + base_dir, + format="parquet", + file_options=file_format.make_write_options( + write_statistics=write_statistics, + write_page_index=write_page_index, + ), + existing_data_behavior='overwrite_or_ignore', + ) + ds1 = ds.dataset(base_dir, format="parquet") + + for file in ds1.files: + # Can retrieve sorting columns from metadata + metadata = pq.read_metadata(file) + cc = metadata.row_group(0).column(0) + assert cc.has_offset_index is write_page_index + assert cc.has_column_index is write_page_index & write_statistics + + @pytest.mark.parametrize('dstype', [ "fs", "mem" ])
From 008d2777ea3bdb6cf5f62144ace42ff725bc6255 Mon Sep 17 00:00:00 2001 From: mwish Date: Wed, 20 Sep 2023 20:08:20 +0800 Subject: [PATCH 43/96] GH-37111: [C++][Parquet] Dataset: Fixing Schema Cast (#37793) ### Rationale for this change Parquet and Arrow have two schemas: 1. Parquet has a SchemaElement[1]; it's language- and implementation-independent. Parquet Arrow will extract the schema and decode it. 2. Parquet Arrow stores the schema and possible `field_id` in `key_value_metadata`[2] when `store_schema` is enabled.
When deserializing, it will first parse the `SchemaElement`[1], and if self-defined `key_value_metadata` exists, it will use the `key_value_metadata` to override [1]. [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L356 [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1033 The bug arises because, when Dataset parses the `SchemaManifest`, it doesn't use the `key_value_metadata` from the `Metadata`, which causes the problem. For durations, when `store_schema` is enabled, `Int64` is stored as the physical type and an `::arrow::Duration` marker is added to `key_value_metadata`. Since there is no `equal(Duration, i64)`, a not-implemented error was raised. ### What changes are included in this PR? Set `key_value_metadata` when building the `SchemaManifest`. ### Are these changes tested? Yes ### Are there any user-facing changes? Bug fix. * Closes: #37111 Authored-by: mwish Signed-off-by: Benjamin Kietzman --- cpp/src/arrow/dataset/file_parquet.cc | 7 ++--- cpp/src/arrow/dataset/file_parquet_test.cc | 31 ++++++++++++++++++++-- 2 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/cpp/src/arrow/dataset/file_parquet.cc b/cpp/src/arrow/dataset/file_parquet.cc index 9d0e8a6515878..751937e93b937 100644 --- a/cpp/src/arrow/dataset/file_parquet.cc +++ b/cpp/src/arrow/dataset/file_parquet.cc @@ -104,11 +104,12 @@ parquet::ArrowReaderProperties MakeArrowReaderProperties( return arrow_properties; } -template <typename M> Result<std::shared_ptr<SchemaManifest>> GetSchemaManifest( - const M& metadata, const parquet::ArrowReaderProperties& properties) { + const parquet::FileMetaData& metadata, + const parquet::ArrowReaderProperties& properties) { auto manifest = std::make_shared<SchemaManifest>(); - const std::shared_ptr<const KeyValueMetadata>& key_value_metadata = nullptr; + const std::shared_ptr<const KeyValueMetadata>& key_value_metadata = + metadata.key_value_metadata(); RETURN_NOT_OK(SchemaManifest::Make(metadata.schema(), key_value_metadata, properties, manifest.get())); return manifest; }
diff --git a/cpp/src/arrow/dataset/file_parquet_test.cc b/cpp/src/arrow/dataset/file_parquet_test.cc index 8527c3af64c83..177ca824179a8 100644 --- a/cpp/src/arrow/dataset/file_parquet_test.cc +++ b/cpp/src/arrow/dataset/file_parquet_test.cc @@ -65,11 +65,15 @@ class ParquetFormatHelper { public: using FormatType = ParquetFileFormat; - static Result<std::shared_ptr<Buffer>> Write(RecordBatchReader* reader) { + static Result<std::shared_ptr<Buffer>> Write( + RecordBatchReader* reader, + const std::shared_ptr<ArrowWriterProperties>& arrow_properties = + default_arrow_writer_properties()) { auto pool = ::arrow::default_memory_pool(); std::shared_ptr<Buffer> out; auto sink = CreateOutputStream(pool); - RETURN_NOT_OK(WriteRecordBatchReader(reader, pool, sink)); + RETURN_NOT_OK(WriteRecordBatchReader(reader, pool, sink, default_writer_properties(), + arrow_properties)); return sink->Finish(); } static std::shared_ptr<ParquetFileFormat> MakeFormat() { @@ -703,6 +707,29 @@ TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingStringC CountRowGroupsInFragment(fragment, {0, 3}, equal(field_ref("x"), literal("a"))); } +TEST_P(TestParquetFileFormatScan, PredicatePushdownRowGroupFragmentsUsingDurationColumn) { + // GH-37111: Parquet arrow stores writer schema and possible field_id in + // key_value_metadata when store_schema is enabled. When storing `arrow::duration`, it will
+ // be stored as int64. This test ensures that dataset can parse the writer schema + // correctly. + auto table = TableFromJSON(schema({field("t", duration(TimeUnit::NANO))}), + { + R"([{"t": 1}])", + R"([{"t": 2}, {"t": 3}])", + }); + TableBatchReader table_reader(*table); + ASSERT_OK_AND_ASSIGN( + auto buffer, + ParquetFormatHelper::Write( + &table_reader, ArrowWriterProperties::Builder().store_schema()->build())); + auto source = std::make_shared<FileSource>(buffer); + SetSchema({field("t", duration(TimeUnit::NANO))}); + ASSERT_OK_AND_ASSIGN(auto fragment, format_->MakeFragment(*source)); + + auto expr = equal(field_ref("t"), literal(::arrow::DurationScalar(1, TimeUnit::NANO))); + CountRowGroupsInFragment(fragment, {0}, expr); +} + // Tests projection with nested/indexed FieldRefs. // https://github.com/apache/arrow/issues/35579 TEST_P(TestParquetFileFormatScan, ProjectWithNonNamedFieldRefs) {
From fecb9681a07c7375d9ba30767a91ff0326d6dbfc Mon Sep 17 00:00:00 2001 From: Jacob Wujciak-Jens Date: Wed, 20 Sep 2023 16:41:36 +0200 Subject: [PATCH 44/96] GH-37750: [R][C++] Add compatibility with IntelLLVM (#37781) ### Rationale for this change The IntelLLVM compiler has a different compiler ID that we currently don't handle, which causes the build to fail. ### What changes are included in this PR? This adds the missing flags/IDs. Mimalloc in the current bundled version is not compatible with IntelLLVM; this can be worked around by using a system version. A simple version bump is not sufficient, and I don't think it's worth the effort to fix it properly for now. For the R package build I have disabled mimalloc with IntelLLVM to avoid issues on CRAN. * Closes: #37750 Authored-by: Jacob Wujciak-Jens Signed-off-by: Nic Crane --- cpp/cmake_modules/BuildUtils.cmake | 2 +- cpp/cmake_modules/SetupCxxFlags.cmake | 9 ++++++--- r/inst/build_arrow_static.sh | 7 +++++++ 3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 9112b836c9ef4..083ac2fe9a862 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -99,7 +99,7 @@ function(arrow_create_merged_static_lib output_target) if(APPLE) set(BUNDLE_COMMAND "libtool" "-no_warning_for_no_symbols" "-static" "-o" ${output_lib_path} ${all_library_paths}) - elseif(CMAKE_CXX_COMPILER_ID MATCHES "^(Clang|GNU|Intel)$") + elseif(CMAKE_CXX_COMPILER_ID MATCHES "^(Clang|GNU|Intel|IntelLLVM)$") set(ar_script_path ${CMAKE_BINARY_DIR}/${ARG_NAME}.ar) file(WRITE ${ar_script_path}.in "CREATE ${output_lib_path}\n")
diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake index a5f5659723c28..5531415ac2277 100644 --- a/cpp/cmake_modules/SetupCxxFlags.cmake +++ b/cpp/cmake_modules/SetupCxxFlags.cmake @@ -329,7 +329,8 @@ if("${BUILD_WARNING_LEVEL}" STREQUAL "CHECKIN") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-sign-conversion") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wunused-result") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wdate-time") - elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel") + elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel" OR CMAKE_CXX_COMPILER_ID STREQUAL + "IntelLLVM") if(WIN32) set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wall") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wno-deprecated") @@ -360,7 +361,8 @@ elseif("${BUILD_WARNING_LEVEL}" STREQUAL "EVERYTHING") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wextra") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wno-unused-parameter") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wunused-result") - elseif(CMAKE_CXX_COMPILER_ID
STREQUAL "Intel" OR CMAKE_CXX_COMPILER_ID STREQUAL + "IntelLLVM") if(WIN32) set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wall") else() @@ -383,7 +385,8 @@ else() OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU") set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -Wall") - elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel") + elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Intel" OR CMAKE_CXX_COMPILER_ID STREQUAL + "IntelLLVM") if(WIN32) set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} /Wall") else() diff --git a/r/inst/build_arrow_static.sh b/r/inst/build_arrow_static.sh index fe56b9fca9e59..52ac5b7d3245b 100755 --- a/r/inst/build_arrow_static.sh +++ b/r/inst/build_arrow_static.sh @@ -55,6 +55,13 @@ else ARROW_DEFAULT_PARAM="OFF" fi +# Disable mimalloc on IntelLLVM because the bundled version (2.0.x) does not support it +case "$CXX" in + *icpx*) + ARROW_MIMALLOC="OFF" + ;; +esac + mkdir -p "${BUILD_DIR}" pushd "${BUILD_DIR}" ${CMAKE} -DARROW_BOOST_USE_SHARED=OFF \ From 6b1bcae924a21bba7438502acf0f952efd895706 Mon Sep 17 00:00:00 2001 From: Tim Schaub Date: Wed, 20 Sep 2023 08:50:01 -0600 Subject: [PATCH 45/96] GH-37779: [Go] Link to the pkg.go.dev site for Go reference docs (#37780) ### Rationale for this change This change updates the Go `README.md` to point to the latest version of the reference documentation. ### What changes are included in this PR? The Go `README.md` has been updated with a badge that points to [the v14 version](https://pkg.go.dev/github.com/apache/arrow/go/v14) of the docs. In addition, the `dev/release/utils-prepare.sh` script has been modified so that the link in the Go `README.md` will be updated when the release version changes. ### Are these changes tested? I tried running `PREPARE_CHANGELOG=0 ./dev/release/01-prepare.sh 14 15 1`, but I don't have `mvn` installed, so was not able to fully validate the changes. In the `go` directory, I ran the following and confirmed that the changes looked like what I would expect: ```shell find . "(" -name "*.go*" -o -name "go.mod" -o -name README.md ")" -exec sed -i.bak -E -e \ "s|(github\\.com/apache/arrow/go)/v[0-9]+|\1/v15|g" {} \; ``` ### Are there any user-facing changes? Documentation changes only. * Closes: #37779 Lead-authored-by: Tim Schaub Co-authored-by: Sutou Kouhei Signed-off-by: Matt Topol --- dev/release/post-11-bump-versions-test.rb | 35 +++++++++++++++------- dev/release/utils-prepare.sh | 4 +-- go/README.md | 2 +- go/arrow/flight/flightsql/driver/README.md | 6 ++-- 4 files changed, 31 insertions(+), 16 deletions(-) diff --git a/dev/release/post-11-bump-versions-test.rb b/dev/release/post-11-bump-versions-test.rb index 79d17e84eb7cb..0ef4646236740 100644 --- a/dev/release/post-11-bump-versions-test.rb +++ b/dev/release/post-11-bump-versions-test.rb @@ -235,7 +235,7 @@ def test_version_post_tag ] end - Dir.glob("go/**/{go.mod,*.go,*.go.*}") do |path| + Dir.glob("go/**/{go.mod,*.go,*.go.*,README.md}") do |path| if path == "go/arrow/doc.go" expected_changes << { path: path, @@ -253,19 +253,34 @@ def test_version_post_tag hunks = [] if release_type == :major lines = File.readlines(path, chomp: true) - target_lines = lines.grep(/#{Regexp.escape(import_path)}/) + target_lines = lines.each_with_index.select do |line, i| + line.include?(import_path) + end next if target_lines.empty? 
- hunk = [] - target_lines.each do |line| - hunk << "-#{line}" + n_context_lines = 3 # The default of Git's diff.context + target_hunks = [[target_lines.first[0]]] + previous_i = target_lines.first[1] + target_lines[1..-1].each do |line, i| + if i - previous_i < n_context_lines + target_hunks.last << line + else + target_hunks << [line] + end + previous_i = i end - target_lines.each do |line| - new_line = line.gsub("v#{@snapshot_major_version}") do - "v#{@next_major_version}" + target_hunks.each do |lines| + hunk = [] + lines.each do |line,| + hunk << "-#{line}" + end + lines.each do |line| + new_line = line.gsub("v#{@snapshot_major_version}") do + "v#{@next_major_version}" + end + hunk << "+#{new_line}" end - hunk << "+#{new_line}" + hunks << hunk end - hunks << hunk end if path == "go/parquet/writer_properties.go" hunks << [ diff --git a/dev/release/utils-prepare.sh b/dev/release/utils-prepare.sh index ceb51812c11ae..464702b811d8b 100644 --- a/dev/release/utils-prepare.sh +++ b/dev/release/utils-prepare.sh @@ -155,8 +155,8 @@ update_versions() { popd pushd "${ARROW_DIR}/go" - find . "(" -name "*.go*" -o -name "go.mod" ")" -exec sed -i.bak -E -e \ - "s|(github\\.com/apache/arrow/go)/v[0-9]+|\1/v${major_version}|" {} \; + find . "(" -name "*.go*" -o -name "go.mod" -o -name README.md ")" -exec sed -i.bak -E -e \ + "s|(github\\.com/apache/arrow/go)/v[0-9]+|\1/v${major_version}|g" {} \; # update parquet writer version sed -i.bak -E -e \ "s/\"parquet-go version .+\"/\"parquet-go version ${version}\"/" \ diff --git a/go/README.md b/go/README.md index 5b3f72760f331..660549cb1b366 100644 --- a/go/README.md +++ b/go/README.md @@ -20,7 +20,7 @@ Apache Arrow for Go =================== -[![GoDoc](https://godoc.org/github.com/apache/arrow/go/arrow?status.svg)](https://godoc.org/github.com/apache/arrow/go/arrow) +[![Go Reference](https://pkg.go.dev/badge/github.com/apache/arrow/go/v14.svg)](https://pkg.go.dev/github.com/apache/arrow/go/v14) [Apache Arrow][arrow] is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format diff --git a/go/arrow/flight/flightsql/driver/README.md b/go/arrow/flight/flightsql/driver/README.md index f81cb9250e1c9..b8850527c19c1 100644 --- a/go/arrow/flight/flightsql/driver/README.md +++ b/go/arrow/flight/flightsql/driver/README.md @@ -36,7 +36,7 @@ connection pooling, transactions combined with ease of use (see (#usage)). ## Prerequisites * Go 1.17+ -* Installation via `go get -u github.com/apache/arrow/go/v12/arrow/flight/flightsql` +* Installation via `go get -u github.com/apache/arrow/go/v14/arrow/flight/flightsql` * Backend speaking FlightSQL --------------------------------------- @@ -55,7 +55,7 @@ import ( "database/sql" "time" - _ "github.com/apache/arrow/go/v12/arrow/flight/flightsql" + _ "github.com/apache/arrow/go/v14/arrow/flight/flightsql" ) // Open the connection to an SQLite backend @@ -141,7 +141,7 @@ import ( "log" "time" - "github.com/apache/arrow/go/v12/arrow/flight/flightsql" + "github.com/apache/arrow/go/v14/arrow/flight/flightsql" ) func main() { From 00481a2799420f8f00ca7fc137769c1c99186977 Mon Sep 17 00:00:00 2001 From: david dali susanibar arce Date: Wed, 20 Sep 2023 10:47:32 -0500 Subject: [PATCH 46/96] GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression (#35570) ### Rationale for this change To close https://github.com/apache/arrow/issues/34252 ### What changes are included in this PR? 
This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by a 3rd-party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position but able to infer by column name. This problem is solved by https://github.com/apache/arrow/pull/35798 This PR needs/uses these PRs/Issues: - https://github.com/apache/arrow/pull/34834 - https://github.com/apache/arrow/pull/34227 - https://github.com/apache/arrow/issues/35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: #34252 Lead-authored-by: david dali susanibar arce Co-authored-by: Weston Pace Co-authored-by: benibus Co-authored-by: David Li Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li --- docs/source/java/dataset.rst | 29 +- docs/source/java/substrait.rst | 351 +++++++++++++++++- java/dataset/src/main/cpp/jni_wrapper.cc | 57 ++- .../apache/arrow/dataset/jni/JniWrapper.java | 8 +- .../arrow/dataset/jni/NativeDataset.java | 4 + .../arrow/dataset/scanner/ScanOptions.java | 80 +++- .../substrait/TestAceroSubstraitConsumer.java | 265 ++++++++++++- 7 files changed, 770 insertions(+), 24 deletions(-)
diff --git a/docs/source/java/dataset.rst b/docs/source/java/dataset.rst index 35ffa81058072..a4381e0814638 100644 --- a/docs/source/java/dataset.rst +++ b/docs/source/java/dataset.rst @@ -132,12 +132,10 @@ within method ``Scanner::schema()``: .. _java-dataset-projection: -Projection -========== +Projection (Subset of Columns) +============================== -User can specify projections in ScanOptions. For ``FileSystemDataset``, only -column projection is allowed for now, which means, only column names -in the projection list will be accepted. For example: +User can specify projections in ScanOptions. For example: .. code-block:: Java @@ -159,6 +157,27 @@ Or use shortcut construtor: Then all columns will be emitted during scanning. +Projection (Produce New Columns) and Filters +============================================ + +User can specify projections (new columns) or filters in ScanOptions using Substrait. For example: + +.. code-block:: Java + + ByteBuffer substraitExpressionFilter = getSubstraitExpressionFilter(); + ByteBuffer substraitExpressionProject = getSubstraitExpressionProjection(); + // Use Substrait APIs to create an Expression and serialize to a ByteBuffer + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitExpressionFilter(substraitExpressionFilter) + .substraitExpressionProjection(getSubstraitExpressionProjection()) + .build(); + +.. seealso:: + + :doc:`Executing Projections and Filters Using Extended Expressions <substrait>` Projections and Filters using Substrait.
+ Read Data from HDFS =================== diff --git a/docs/source/java/substrait.rst b/docs/source/java/substrait.rst index 41effedbf01d9..d8d49a96e88f8 100644 --- a/docs/source/java/substrait.rst +++ b/docs/source/java/substrait.rst @@ -22,8 +22,10 @@ Substrait The ``arrow-dataset`` module can execute Substrait_ plans via the :doc:`Acero <../cpp/streaming_execution>` query engine. -Executing Substrait Plans -========================= +.. contents:: + +Executing Queries Using Substrait Plans +======================================= Plans can reference data in files via URIs, or "named tables" that must be provided along with the plan. @@ -102,6 +104,349 @@ Here is an example of a Java program that queries a Parquet file using Java Subs 0 ALGERIA 0 haggle. carefully final deposits detect slyly agai 1 ARGENTINA 1 al foxes promise slyly according to the regular accounts. bold requests alon +Executing Projections and Filters Using Extended Expressions +============================================================ + +Dataset also supports projections and filters with Substrait's `Extended Expression`_. +This requires the substrait-java library. + +This Java program: + +- Loads a Parquet file containing the "nation" table from the TPC-H benchmark. +- Projects two new columns: + - ``N_NAME || ' - ' || N_COMMENT`` + - ``N_REGIONKEY + 10`` +- Applies a filter: ``N_NATIONKEY > 18`` + +.. code-block:: Java + + import io.substrait.extension.ExtensionCollector; + import io.substrait.proto.Expression; + import io.substrait.proto.ExpressionReference; + import io.substrait.proto.ExtendedExpression; + import io.substrait.proto.FunctionArgument; + import io.substrait.proto.SimpleExtensionDeclaration; + import io.substrait.proto.SimpleExtensionURI; + import io.substrait.type.NamedStruct; + import io.substrait.type.Type; + import io.substrait.type.TypeCreator; + import io.substrait.type.proto.TypeProtoConverter; + import java.nio.ByteBuffer; + import java.util.ArrayList; + import java.util.Arrays; + import java.util.Base64; + import java.util.HashMap; + import java.util.List; + import java.util.Optional; + import org.apache.arrow.dataset.file.FileFormat; + import org.apache.arrow.dataset.file.FileSystemDatasetFactory; + import org.apache.arrow.dataset.jni.NativeMemoryPool; + import org.apache.arrow.dataset.scanner.ScanOptions; + import org.apache.arrow.dataset.scanner.Scanner; + import org.apache.arrow.dataset.source.Dataset; + import org.apache.arrow.dataset.source.DatasetFactory; + import org.apache.arrow.memory.BufferAllocator; + import org.apache.arrow.memory.RootAllocator; + import org.apache.arrow.vector.ipc.ArrowReader; + + public class ClientSubstraitExtendedExpressionsCookbook { + + public static void main(String[] args) throws Exception { + // project and filter dataset using extended expression definition - 03 Expressions: + // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3 + // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10 + // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18 + projectAndFilterDataset(); + } + + public static void projectAndFilterDataset() { + String uri = "file:///Users/data/tpch_parquet/nation.parquet"; + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitFilter(getSubstraitExpressionFilter()) + .substraitProjection(getSubstraitExpressionProjection()) + .build(); + try ( + BufferAllocator allocator = new RootAllocator(); + DatasetFactory datasetFactory = new 
FileSystemDatasetFactory( + allocator, NativeMemoryPool.getDefault(), + FileFormat.PARQUET, uri); + Dataset dataset = datasetFactory.finish(); + Scanner scanner = dataset.newScan(options); + ArrowReader reader = scanner.scanBatches() + ) { + while (reader.loadNextBatch()) { + System.out.println( + reader.getVectorSchemaRoot().contentToTSVString()); + } + } catch (Exception e) { + throw new RuntimeException(e); + } + } + + private static ByteBuffer getSubstraitExpressionProjection() { + // Expression: N_REGIONKEY + 10 = col 3 + 10 + Expression.Builder selectionBuilderProjectOne = Expression.newBuilder(). + setSelection( + Expression.FieldReference.newBuilder(). + setDirectReference( + Expression.ReferenceSegment.newBuilder(). + setStructField( + Expression.ReferenceSegment.StructField.newBuilder().setField( + 2) + ) + ) + ); + Expression.Builder literalBuilderProjectOne = Expression.newBuilder() + .setLiteral( + Expression.Literal.newBuilder().setI32(10) + ); + io.substrait.proto.Type outputProjectOne = TypeCreator.NULLABLE.I32.accept( + new TypeProtoConverter(new ExtensionCollector())); + Expression.Builder expressionBuilderProjectOne = Expression. + newBuilder(). + setScalarFunction( + Expression. + ScalarFunction. + newBuilder(). + setFunctionReference(0). + setOutputType(outputProjectOne). + addArguments( + 0, + FunctionArgument.newBuilder().setValue( + selectionBuilderProjectOne) + ). + addArguments( + 1, + FunctionArgument.newBuilder().setValue( + literalBuilderProjectOne) + ) + ); + ExpressionReference.Builder expressionReferenceBuilderProjectOne = ExpressionReference.newBuilder(). + setExpression(expressionBuilderProjectOne) + .addOutputNames("ADD_TEN_TO_COLUMN_N_REGIONKEY"); + + // Expression: name || name = N_NAME || "-" || N_COMMENT = col 1 || col 3 + Expression.Builder selectionBuilderProjectTwo = Expression.newBuilder(). + setSelection( + Expression.FieldReference.newBuilder(). + setDirectReference( + Expression.ReferenceSegment.newBuilder(). + setStructField( + Expression.ReferenceSegment.StructField.newBuilder().setField( + 1) + ) + ) + ); + Expression.Builder selectionBuilderProjectTwoConcatLiteral = Expression.newBuilder() + .setLiteral( + Expression.Literal.newBuilder().setString(" - ") + ); + Expression.Builder selectionBuilderProjectOneToConcat = Expression.newBuilder(). + setSelection( + Expression.FieldReference.newBuilder(). + setDirectReference( + Expression.ReferenceSegment.newBuilder(). + setStructField( + Expression.ReferenceSegment.StructField.newBuilder().setField( + 3) + ) + ) + ); + io.substrait.proto.Type outputProjectTwo = TypeCreator.NULLABLE.STRING.accept( + new TypeProtoConverter(new ExtensionCollector())); + Expression.Builder expressionBuilderProjectTwo = Expression. + newBuilder(). + setScalarFunction( + Expression. + ScalarFunction. + newBuilder(). + setFunctionReference(1). + setOutputType(outputProjectTwo). + addArguments( + 0, + FunctionArgument.newBuilder().setValue( + selectionBuilderProjectTwo) + ). + addArguments( + 1, + FunctionArgument.newBuilder().setValue( + selectionBuilderProjectTwoConcatLiteral) + ). + addArguments( + 2, + FunctionArgument.newBuilder().setValue( + selectionBuilderProjectOneToConcat) + ) + ); + ExpressionReference.Builder expressionReferenceBuilderProjectTwo = ExpressionReference.newBuilder(). 
+ setExpression(expressionBuilderProjectTwo) + .addOutputNames("CONCAT_COLUMNS_N_NAME_AND_N_COMMENT"); + + List columnNames = Arrays.asList("N_NATIONKEY", "N_NAME", + "N_REGIONKEY", "N_COMMENT"); + List dataTypes = Arrays.asList( + TypeCreator.NULLABLE.I32, + TypeCreator.NULLABLE.STRING, + TypeCreator.NULLABLE.I32, + TypeCreator.NULLABLE.STRING + ); + NamedStruct of = NamedStruct.of( + columnNames, + Type.Struct.builder().fields(dataTypes).nullable(false).build() + ); + // Extensions URI + HashMap extensionUris = new HashMap<>(); + extensionUris.put( + "key-001", + SimpleExtensionURI.newBuilder() + .setExtensionUriAnchor(1) + .setUri("/functions_arithmetic.yaml") + .build() + ); + // Extensions + ArrayList extensions = new ArrayList<>(); + SimpleExtensionDeclaration extensionFunctionAdd = SimpleExtensionDeclaration.newBuilder() + .setExtensionFunction( + SimpleExtensionDeclaration.ExtensionFunction.newBuilder() + .setFunctionAnchor(0) + .setName("add:i32_i32") + .setExtensionUriReference(1)) + .build(); + SimpleExtensionDeclaration extensionFunctionGreaterThan = SimpleExtensionDeclaration.newBuilder() + .setExtensionFunction( + SimpleExtensionDeclaration.ExtensionFunction.newBuilder() + .setFunctionAnchor(1) + .setName("concat:vchar") + .setExtensionUriReference(2)) + .build(); + extensions.add(extensionFunctionAdd); + extensions.add(extensionFunctionGreaterThan); + // Extended Expression + ExtendedExpression.Builder extendedExpressionBuilder = + ExtendedExpression.newBuilder(). + addReferredExpr(0, + expressionReferenceBuilderProjectOne). + addReferredExpr(1, + expressionReferenceBuilderProjectTwo). + setBaseSchema(of.toProto(new TypeProtoConverter( + new ExtensionCollector()))); + extendedExpressionBuilder.addAllExtensionUris(extensionUris.values()); + extendedExpressionBuilder.addAllExtensions(extensions); + ExtendedExpression extendedExpression = extendedExpressionBuilder.build(); + byte[] extendedExpressions = Base64.getDecoder().decode( + Base64.getEncoder().encodeToString( + extendedExpression.toByteArray())); + ByteBuffer substraitExpressionProjection = ByteBuffer.allocateDirect( + extendedExpressions.length); + substraitExpressionProjection.put(extendedExpressions); + return substraitExpressionProjection; + } + + private static ByteBuffer getSubstraitExpressionFilter() { + // Expression: Filter: N_NATIONKEY > 18 = col 1 > 18 + Expression.Builder selectionBuilderFilterOne = Expression.newBuilder(). + setSelection( + Expression.FieldReference.newBuilder(). + setDirectReference( + Expression.ReferenceSegment.newBuilder(). + setStructField( + Expression.ReferenceSegment.StructField.newBuilder().setField( + 0) + ) + ) + ); + Expression.Builder literalBuilderFilterOne = Expression.newBuilder() + .setLiteral( + Expression.Literal.newBuilder().setI32(18) + ); + io.substrait.proto.Type outputFilterOne = TypeCreator.NULLABLE.BOOLEAN.accept( + new TypeProtoConverter(new ExtensionCollector())); + Expression.Builder expressionBuilderFilterOne = Expression. + newBuilder(). + setScalarFunction( + Expression. + ScalarFunction. + newBuilder(). + setFunctionReference(1). + setOutputType(outputFilterOne). + addArguments( + 0, + FunctionArgument.newBuilder().setValue( + selectionBuilderFilterOne) + ). + addArguments( + 1, + FunctionArgument.newBuilder().setValue( + literalBuilderFilterOne) + ) + ); + ExpressionReference.Builder expressionReferenceBuilderFilterOne = ExpressionReference.newBuilder(). 
+ setExpression(expressionBuilderFilterOne) + .addOutputNames("COLUMN_N_NATIONKEY_GREATER_THAN_18"); + + List columnNames = Arrays.asList("N_NATIONKEY", "N_NAME", + "N_REGIONKEY", "N_COMMENT"); + List dataTypes = Arrays.asList( + TypeCreator.NULLABLE.I32, + TypeCreator.NULLABLE.STRING, + TypeCreator.NULLABLE.I32, + TypeCreator.NULLABLE.STRING + ); + NamedStruct of = NamedStruct.of( + columnNames, + Type.Struct.builder().fields(dataTypes).nullable(false).build() + ); + // Extensions URI + HashMap extensionUris = new HashMap<>(); + extensionUris.put( + "key-001", + SimpleExtensionURI.newBuilder() + .setExtensionUriAnchor(1) + .setUri("/functions_comparison.yaml") + .build() + ); + // Extensions + ArrayList extensions = new ArrayList<>(); + SimpleExtensionDeclaration extensionFunctionLowerThan = SimpleExtensionDeclaration.newBuilder() + .setExtensionFunction( + SimpleExtensionDeclaration.ExtensionFunction.newBuilder() + .setFunctionAnchor(1) + .setName("gt:any_any") + .setExtensionUriReference(1)) + .build(); + extensions.add(extensionFunctionLowerThan); + // Extended Expression + ExtendedExpression.Builder extendedExpressionBuilder = + ExtendedExpression.newBuilder(). + addReferredExpr(0, + expressionReferenceBuilderFilterOne). + setBaseSchema(of.toProto(new TypeProtoConverter( + new ExtensionCollector()))); + extendedExpressionBuilder.addAllExtensionUris(extensionUris.values()); + extendedExpressionBuilder.addAllExtensions(extensions); + ExtendedExpression extendedExpression = extendedExpressionBuilder.build(); + byte[] extendedExpressions = Base64.getDecoder().decode( + Base64.getEncoder().encodeToString( + extendedExpression.toByteArray())); + ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect( + extendedExpressions.length); + substraitExpressionFilter.put(extendedExpressions); + return substraitExpressionFilter; + } + } + +.. code-block:: text + + ADD_TEN_TO_COLUMN_N_REGIONKEY CONCAT_COLUMNS_N_NAME_AND_N_COMMENT + 13 ROMANIA - ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account + 14 SAUDI ARABIA - ts. silent requests haggle. closely express packages sleep across the blithely + 12 VIETNAM - hely enticingly express accounts. even, final + 13 RUSSIA - requests against the platelets use never according to the quickly regular pint + 13 UNITED KINGDOM - eans boost carefully special requests. accounts are. carefull + 11 UNITED STATES - y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be + .. _`Substrait`: https://substrait.io/ .. _`Substrait Java`: https://github.com/substrait-io/substrait-java -.. _`Acero`: https://arrow.apache.org/docs/cpp/streaming_execution.html \ No newline at end of file +.. _`Acero`: https://arrow.apache.org/docs/cpp/streaming_execution.html +.. 
_`Extended Expression`: https://github.com/substrait-io/substrait/blob/main/site/docs/expressions/extended_expression.md diff --git a/java/dataset/src/main/cpp/jni_wrapper.cc b/java/dataset/src/main/cpp/jni_wrapper.cc index 5640bc4349670..49e0f1720909f 100644 --- a/java/dataset/src/main/cpp/jni_wrapper.cc +++ b/java/dataset/src/main/cpp/jni_wrapper.cc @@ -29,6 +29,8 @@ #include "arrow/filesystem/path_util.h" #include "arrow/filesystem/s3fs.h" #include "arrow/engine/substrait/util.h" +#include "arrow/engine/substrait/serde.h" +#include "arrow/engine/substrait/relation.h" #include "arrow/ipc/api.h" #include "arrow/util/iterator.h" #include "jni_util.h" @@ -200,7 +202,6 @@ arrow::Result> SchemaFromColumnNames( return arrow::Status::Invalid("Partition column '", ref.ToString(), "' is not in dataset schema"); } } - return schema(std::move(columns))->WithMetadata(input->metadata()); } } // namespace @@ -317,6 +318,14 @@ std::shared_ptr GetTableByName(const std::vector& nam return it->second; } +std::shared_ptr LoadArrowBufferFromByteBuffer(JNIEnv* env, jobject byte_buffer) { + const auto *buff = reinterpret_cast(env->GetDirectBufferAddress(byte_buffer)); + int length = env->GetDirectBufferCapacity(byte_buffer); + std::shared_ptr buffer = JniGetOrThrow(arrow::AllocateBuffer(length)); + std::memcpy(buffer->mutable_data(), buff, length); + return buffer; +} + /* * Class: org_apache_arrow_dataset_jni_NativeMemoryPool * Method: getDefaultMemoryPool @@ -455,11 +464,12 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset /* * Class: org_apache_arrow_dataset_jni_JniWrapper * Method: createScanner - * Signature: (J[Ljava/lang/String;JJ)J + * Signature: (J[Ljava/lang/String;Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;JJ)J */ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner( - JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size, - jlong memory_pool_id) { + JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, + jobject substrait_projection, jobject substrait_filter, + jlong batch_size, jlong memory_pool_id) { JNI_METHOD_START arrow::MemoryPool* pool = reinterpret_cast(memory_pool_id); if (pool == nullptr) { @@ -474,6 +484,40 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann std::vector column_vector = ToStringVector(env, columns); JniAssertOkOrThrow(scanner_builder->Project(column_vector)); } + if (substrait_projection != nullptr) { + std::shared_ptr buffer = LoadArrowBufferFromByteBuffer(env, + substrait_projection); + std::vector project_exprs; + std::vector project_names; + arrow::engine::BoundExpressions bounded_expression = + JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer)); + for(arrow::engine::NamedExpression& named_expression : + bounded_expression.named_expressions) { + project_exprs.push_back(std::move(named_expression.expression)); + project_names.push_back(std::move(named_expression.name)); + } + JniAssertOkOrThrow(scanner_builder->Project(std::move(project_exprs), std::move(project_names))); + } + if (substrait_filter != nullptr) { + std::shared_ptr buffer = LoadArrowBufferFromByteBuffer(env, + substrait_filter); + std::optional filter_expr = std::nullopt; + arrow::engine::BoundExpressions bounded_expression = + JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer)); + for(arrow::engine::NamedExpression& named_expression : + bounded_expression.named_expressions) { + filter_expr = named_expression.expression; + if 
(named_expression.expression.type()->id() == arrow::Type::BOOL) { + filter_expr = named_expression.expression; + } else { + JniThrow("There is no filter expression in the expression provided"); + } + } + if (filter_expr == std::nullopt) { + JniThrow("The filter expression has not been provided"); + } + JniAssertOkOrThrow(scanner_builder->Filter(*filter_expr)); + } JniAssertOkOrThrow(scanner_builder->BatchSize(batch_size)); auto scanner = JniGetOrThrow(scanner_builder->Finish()); @@ -748,10 +792,7 @@ JNIEXPORT void JNICALL arrow::engine::ConversionOptions conversion_options; conversion_options.named_table_provider = std::move(table_provider); // mapping arrow::Buffer - auto *buff = reinterpret_cast(env->GetDirectBufferAddress(plan)); - int length = env->GetDirectBufferCapacity(plan); - std::shared_ptr buffer = JniGetOrThrow(arrow::AllocateBuffer(length)); - std::memcpy(buffer->mutable_data(), buff, length); + std::shared_ptr buffer = LoadArrowBufferFromByteBuffer(env, plan); // execute plan std::shared_ptr reader_out = JniGetOrThrow(arrow::engine::ExecuteSerializedPlan(*buffer, nullptr, nullptr, conversion_options)); diff --git a/java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java b/java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java index 93cc5d7a37040..a7df5be42f13b 100644 --- a/java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java +++ b/java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java @@ -17,6 +17,8 @@ package org.apache.arrow.dataset.jni; +import java.nio.ByteBuffer; + /** * JNI wrapper for Dataset API's native implementation. */ @@ -66,15 +68,19 @@ private JniWrapper() { /** * Create Scanner from a Dataset and get the native pointer of the Dataset. + * * @param datasetId the native pointer of the arrow::dataset::Dataset instance. * @param columns desired column names. * Columns not in this list will not be emitted when performing scan operation. Null equals * to "all columns". + * @param substraitProjection substrait extended expression to evaluate for project new columns + * @param substraitFilter substrait extended expression to evaluate for apply filter * @param batchSize batch size of scanned record batches. * @param memoryPool identifier of memory pool used in the native scanner. * @return the native pointer of the arrow::dataset::Scanner instance. */ - public native long createScanner(long datasetId, String[] columns, long batchSize, long memoryPool); + public native long createScanner(long datasetId, String[] columns, ByteBuffer substraitProjection, + ByteBuffer substraitFilter, long batchSize, long memoryPool); /** * Get a serialized schema from native instance of a Scanner. 
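For illustration, here is a caller-side sketch of how these new `ByteBuffer` parameters are expected to be produced. `loadSerializedExtendedExpression()` is a hypothetical helper standing in for substrait-java serialization; the key constraint, visible in the JNI code above, is that the buffer must be direct:

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import org.apache.arrow.dataset.scanner.ScanOptions;

// Hypothetical helper returning the serialized bytes of a Substrait
// ExtendedExpression (e.g. built and serialized with substrait-java).
byte[] expressionBytes = loadSerializedExtendedExpression();

// The buffer must be created with allocateDirect(): the native layer reads
// it through GetDirectBufferAddress, so a heap ByteBuffer would not work.
ByteBuffer substraitFilter = ByteBuffer.allocateDirect(expressionBytes.length);
substraitFilter.put(expressionBytes);

ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
    .columns(Optional.empty())
    .substraitFilter(substraitFilter)
    .build();
```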
diff --git a/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataset.java b/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataset.java index 30ff1a9302f7a..d9abad9971c4e 100644 --- a/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataset.java +++ b/java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataset.java @@ -40,8 +40,12 @@ public synchronized NativeScanner newScan(ScanOptions options) { if (closed) { throw new NativeInstanceReleasedException(); } + long scannerId = JniWrapper.get().createScanner(datasetId, options.getColumns().orElse(null), + options.getSubstraitProjection().orElse(null), + options.getSubstraitFilter().orElse(null), options.getBatchSize(), context.getMemoryPool().getNativeInstanceId()); + return new NativeScanner(context, scannerId); } diff --git a/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java b/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java index f5a1af384b24e..995d05ac3b314 100644 --- a/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java +++ b/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java @@ -17,6 +17,7 @@ package org.apache.arrow.dataset.scanner; +import java.nio.ByteBuffer; import java.util.Optional; import org.apache.arrow.util.Preconditions; @@ -25,8 +26,10 @@ * Options used during scanning. */ public class ScanOptions { - private final Optional columns; private final long batchSize; + private final Optional columns; + private final Optional substraitProjection; + private final Optional substraitFilter; /** * Constructor. @@ -56,6 +59,8 @@ public ScanOptions(long batchSize, Optional columns) { Preconditions.checkNotNull(columns); this.batchSize = batchSize; this.columns = columns; + this.substraitProjection = Optional.empty(); + this.substraitFilter = Optional.empty(); } public ScanOptions(long batchSize) { @@ -69,4 +74,77 @@ public Optional getColumns() { public long getBatchSize() { return batchSize; } + + public Optional getSubstraitProjection() { + return substraitProjection; + } + + public Optional getSubstraitFilter() { + return substraitFilter; + } + + /** + * Builder for Options used during scanning. + */ + public static class Builder { + private final long batchSize; + private Optional columns; + private ByteBuffer substraitProjection; + private ByteBuffer substraitFilter; + + /** + * Constructor. + * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch} + */ + public Builder(long batchSize) { + this.batchSize = batchSize; + } + + /** + * Set the Projected columns. Empty for scanning all columns. + * + * @param columns Projected columns. Empty for scanning all columns. + * @return the ScanOptions configured. + */ + public Builder columns(Optional columns) { + Preconditions.checkNotNull(columns); + this.columns = columns; + return this; + } + + /** + * Set the Substrait extended expression for Projection new columns. + * + * @param substraitProjection Expressions to evaluate for project new columns. + * @return the ScanOptions configured. + */ + public Builder substraitProjection(ByteBuffer substraitProjection) { + Preconditions.checkNotNull(substraitProjection); + this.substraitProjection = substraitProjection; + return this; + } + + /** + * Set the Substrait extended expression for Filter. + * + * @param substraitFilter Expressions to evaluate for apply Filter. + * @return the ScanOptions configured. 
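+     *                        The buffer must be direct (created with
+     *                        {@code ByteBuffer.allocateDirect}), since the native
+     *                        layer reads it through {@code GetDirectBufferAddress}.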
+ */ + public Builder substraitFilter(ByteBuffer substraitFilter) { + Preconditions.checkNotNull(substraitFilter); + this.substraitFilter = substraitFilter; + return this; + } + + public ScanOptions build() { + return new ScanOptions(this); + } + } + + private ScanOptions(Builder builder) { + batchSize = builder.batchSize; + columns = builder.columns; + substraitProjection = Optional.ofNullable(builder.substraitProjection); + substraitFilter = Optional.ofNullable(builder.substraitFilter); + } } diff --git a/java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java b/java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java index c23b7e002880a..0fba72892cdc6 100644 --- a/java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java +++ b/java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java @@ -18,6 +18,8 @@ package org.apache.arrow.dataset.substrait; import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertThrows; +import static org.junit.Assert.assertTrue; import java.nio.ByteBuffer; import java.nio.file.Files; @@ -27,6 +29,7 @@ import java.util.Collections; import java.util.HashMap; import java.util.Map; +import java.util.Optional; import org.apache.arrow.dataset.ParquetWriteSupport; import org.apache.arrow.dataset.TestDataset; @@ -85,7 +88,7 @@ public void testRunQueryLocalFiles() throws Exception { } @Test - public void testRunQueryNamedTableNation() throws Exception { + public void testRunQueryNamedTable() throws Exception { //Query: //SELECT id, name FROM Users //Isthmus: @@ -123,7 +126,7 @@ public void testRunQueryNamedTableNation() throws Exception { } @Test(expected = RuntimeException.class) - public void testRunQueryNamedTableNationWithException() throws Exception { + public void testRunQueryNamedTableWithException() throws Exception { //Query: //SELECT id, name FROM Users //Isthmus: @@ -160,7 +163,7 @@ public void testRunQueryNamedTableNationWithException() throws Exception { } @Test - public void testRunBinaryQueryNamedTableNation() throws Exception { + public void testRunBinaryQueryNamedTable() throws Exception { //Query: //SELECT id, name FROM Users //Isthmus: @@ -187,9 +190,7 @@ public void testRunBinaryQueryNamedTableNation() throws Exception { Map mapTableToArrowReader = new HashMap<>(); mapTableToArrowReader.put("USERS", reader); // get binary plan - byte[] plan = Base64.getDecoder().decode(binaryPlan); - ByteBuffer substraitPlan = ByteBuffer.allocateDirect(plan.length); - substraitPlan.put(plan); + ByteBuffer substraitPlan = getByteBuffer(binaryPlan); // run query try (ArrowReader arrowReader = new AceroSubstraitConsumer(rootAllocator()).runQuery( substraitPlan, @@ -204,4 +205,256 @@ public void testRunBinaryQueryNamedTableNation() throws Exception { } } } + + @Test + public void testRunExtendedExpressionsFilter() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("id", new ArrowType.Int(32, true)), + Field.nullable("name", new ArrowType.Utf8()) + ), null); + // Substrait Extended Expression: Filter: + // Expression 01: WHERE ID < 20 + String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" + + "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" + + "BCgRiAhABGAI="; + ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter); + 
ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitFilter(substraitExpressionFilter) + .build(); + try ( + DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), + FileFormat.PARQUET, writeSupport.getOutputURI()); + Dataset dataset = datasetFactory.finish(); + Scanner scanner = dataset.newScan(options); + ArrowReader reader = scanner.scanBatches() + ) { + assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields()); + int rowcount = 0; + while (reader.loadNextBatch()) { + rowcount += reader.getVectorSchemaRoot().getRowCount(); + assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]")); + assertTrue(reader.getVectorSchemaRoot().getVector("name").toString() + .equals("[value_19, value_1, value_11]")); + } + assertEquals(3, rowcount); + } + } + + @Test + public void testRunExtendedExpressionsFilterWithProjectionsInsteadOfFilterException() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("id", new ArrowType.Int(32, true)), + Field.nullable("name", new ArrowType.Utf8()) + ), null); + // Substrait Extended Expression: Project New Column: + // Expression ADD: id + 2 + // Expression CONCAT: name + '-' + name + String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" + + "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" + + "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" + + "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI="; + ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter); + ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitFilter(substraitExpressionFilter) + .build(); + try ( + DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), + FileFormat.PARQUET, writeSupport.getOutputURI()); + Dataset dataset = datasetFactory.finish() + ) { + Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options)); + assertTrue(e.getMessage().startsWith("There is no filter expression in the expression provided")); + } + } + + @Test + public void testRunExtendedExpressionsFilterWithEmptyFilterException() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("id", new ArrowType.Int(32, true)), + Field.nullable("name", new ArrowType.Utf8()) + ), null); + String base64EncodedSubstraitFilter = ""; + ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter); + ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitFilter(substraitExpressionFilter) + .build(); + try ( + DatasetFactory datasetFactory = new 
FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), + FileFormat.PARQUET, writeSupport.getOutputURI()); + Dataset dataset = datasetFactory.finish() + ) { + Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options)); + assertTrue(e.getMessage().contains("no anonymous struct type was provided to which names could be attached.")); + } + } + + @Test + public void testRunExtendedExpressionsProjection() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)), + Field.nullable("concat_column_a_and_b", new ArrowType.Utf8()) + ), null); + // Substrait Extended Expression: Project New Column: + // Expression ADD: id + 2 + // Expression CONCAT: name + '-' + name + String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" + + "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" + + "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" + + "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI="; + ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject); + ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitProjection(substraitExpressionProject) + .build(); + try ( + DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), + FileFormat.PARQUET, writeSupport.getOutputURI()); + Dataset dataset = datasetFactory.finish(); + Scanner scanner = dataset.newScan(options); + ArrowReader reader = scanner.scanBatches() + ) { + assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields()); + int rowcount = 0; + while (reader.loadNextBatch()) { + assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString() + .equals("[21, 3, 13, 23, 47]")); + assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString() + .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11, " + + "value_21 - value_21, value_45 - value_45]")); + rowcount += reader.getVectorSchemaRoot().getRowCount(); + } + assertEquals(5, rowcount); + } + } + + @Test + public void testRunExtendedExpressionsProjectionWithFilterInsteadOfProjectionException() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("filter_id_lower_than_20", new ArrowType.Bool()) + ), null); + // Substrait Extended Expression: Filter: + // Expression 01: WHERE ID < 20 + String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" + + "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" + + "BCgRiAhABGAI="; + ByteBuffer substraitExpressionFilter = getByteBuffer(binarySubstraitExpressionFilter); + ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitProjection(substraitExpressionFilter) + .build(); + try ( + DatasetFactory datasetFactory = new 
FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), + FileFormat.PARQUET, writeSupport.getOutputURI()); + Dataset dataset = datasetFactory.finish(); + Scanner scanner = dataset.newScan(options); + ArrowReader reader = scanner.scanBatches() + ) { + assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields()); + int rowcount = 0; + while (reader.loadNextBatch()) { + assertTrue(reader.getVectorSchemaRoot().getVector("filter_id_lower_than_20").toString() + .equals("[true, true, true, false, false]")); + rowcount += reader.getVectorSchemaRoot().getRowCount(); + } + assertEquals(5, rowcount); + } + } + + @Test + public void testRunExtendedExpressionsProjectionWithEmptyProjectionException() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("id", new ArrowType.Int(32, true)), + Field.nullable("name", new ArrowType.Utf8()) + ), null); + String base64EncodedSubstraitFilter = ""; + ByteBuffer substraitExpressionProjection = getByteBuffer(base64EncodedSubstraitFilter); + ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitProjection(substraitExpressionProjection) + .build(); + try ( + DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), + FileFormat.PARQUET, writeSupport.getOutputURI()); + Dataset dataset = datasetFactory.finish() + ) { + Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options)); + assertTrue(e.getMessage().contains("no anonymous struct type was provided to which names could be attached.")); + } + } + + @Test + public void testRunExtendedExpressionsProjectAndFilter() throws Exception { + final Schema schema = new Schema(Arrays.asList( + Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)), + Field.nullable("concat_column_a_and_b", new ArrowType.Utf8()) + ), null); + // Substrait Extended Expression: Project New Column: + // Expression ADD: id + 2 + // Expression CONCAT: name + '-' + name + String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" + + "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" + + "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" + + "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI="; + ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject); + // Substrait Extended Expression: Filter: + // Expression 01: WHERE ID < 20 + String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" + + "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" + + "BCgRiAhABGAI="; + ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter); + ParquetWriteSupport writeSupport = ParquetWriteSupport + .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1", + 11, "value_11", 21, "value_21", 45, "value_45"); + ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768) + .columns(Optional.empty()) + .substraitProjection(substraitExpressionProject) + .substraitFilter(substraitExpressionFilter) + .build(); + try ( + DatasetFactory datasetFactory = 
new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+          FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  private static ByteBuffer getByteBuffer(String base64EncodedSubstrait) {
+    byte[] decodedSubstrait = Base64.getDecoder().decode(base64EncodedSubstrait);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(decodedSubstrait.length);
+    substraitExpression.put(decodedSubstrait);
+    return substraitExpression;
+  }
 }

From e068b7f845514637d3ffebd6e341b38439742a40 Mon Sep 17 00:00:00 2001
From: Kevin Gurney
Date: Wed, 20 Sep 2023 13:21:45 -0400
Subject: [PATCH 47/96] GH-37805: [CI][MATLAB] Hard-code `release` to `R2023a` for `matlab-actions/setup-matlab` action in MATLAB CI workflows (#37808)

### Rationale for this change

Due to a recent change that makes `R2023b` the default for the [`matlab-actions/setup-matlab`](https://github.com/matlab-actions/setup-matlab) action in GitHub Actions, the MATLAB CI workflows are failing.

Example failure logs: https://github.com/apache/arrow/actions/runs/6250586979/job/16970596069?pr=37773#step:9:70

Our preferred solution to address this in the short term is to explicitly specify the [`release` parameter](https://github.com/matlab-actions/setup-matlab#set-up-matlab) to the `setup-matlab` action to be `R2023a`. In the long term, we can work on figuring out why the build is failing on Windows with `R2023b`.

For reference, it appears that the `FindMatlab` CMake module only recently added R2023b to its list of recognized versions: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/8804

### What changes are included in this PR?

1. Hard-coded MATLAB `release` to `R2023a` for the [`matlab-actions/setup-matlab`](https://github.com/matlab-actions/setup-matlab) action in the MATLAB CI workflows.

### Are these changes tested?

Yes.

1. [MATLAB CI workflows are passing on all platforms in `mathworks/arrow`](https://github.com/mathworks/arrow/actions/runs/6251345588).

### Are there any user-facing changes?

No. This change only impacts the MATLAB CI workflows, which previously used `R2023a` by default.

### Future Directions

1. 
#37809 * Closes: #37805 Authored-by: Kevin Gurney Signed-off-by: Kevin Gurney --- .github/workflows/matlab.yml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/.github/workflows/matlab.yml b/.github/workflows/matlab.yml index 221ed5c77cd47..6921e12213b5b 100644 --- a/.github/workflows/matlab.yml +++ b/.github/workflows/matlab.yml @@ -53,6 +53,8 @@ jobs: run: sudo apt-get install ninja-build - name: Install MATLAB uses: matlab-actions/setup-matlab@v1 + with: + release: R2023a - name: Install ccache run: sudo apt-get install ccache - name: Setup ccache @@ -99,6 +101,8 @@ jobs: run: brew install ninja - name: Install MATLAB uses: matlab-actions/setup-matlab@v1 + with: + release: R2023a - name: Install ccache run: brew install ccache - name: Setup ccache @@ -135,6 +139,8 @@ jobs: fetch-depth: 0 - name: Install MATLAB uses: matlab-actions/setup-matlab@v1 + with: + release: R2023a - name: Download Timezone Database shell: bash run: ci/scripts/download_tz_database.sh From 2b34e37a956ac59b79e74da1dde8f037c9c88c5d Mon Sep 17 00:00:00 2001 From: Kevin Gurney Date: Wed, 20 Sep 2023 14:31:12 -0400 Subject: [PATCH 48/96] GH-37770: [MATLAB] Add CSV `TableReader` and `TableWriter` MATLAB classes (#37773) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### Rationale for this change To enable initial CSV I/O support, this PR adds `arrow.io.csv.TableReader` and `arrow.io.csv.TableWriter` MATLAB classes to the MATLAB interface. ### What changes are included in this PR? 1. Added a new `arrow.io.csv.TableReader` class 2. Added a new `arrow.io.csv.TableWriter` class **Example** ```matlab >> matlabTableWrite = array2table(rand(3)) matlabTableWrite = 3×3 table Var1 Var2 Var3 _______ ________ _______ 0.91131 0.091595 0.24594 0.51315 0.27368 0.62119 0.42942 0.88665 0.49501 >> arrowTableWrite = arrow.table(matlabTableWrite) arrowTableWrite = Var1: double Var2: double Var3: double ---- Var1: [ [ 0.9113083542736461, 0.5131490075412158, 0.42942202968065213 ] ] Var2: [ [ 0.09159480217154525, 0.27367730380496647, 0.8866478145458545 ] ] Var3: [ [ 0.2459443412735529, 0.6211893868708748, 0.49500739584280073 ] ] >> writer = arrow.io.csv.TableWriter("example.csv") writer = TableWriter with properties: Filename: "example.csv" >> writer.write(arrowTableWrite) >> reader = arrow.io.csv.TableReader("example.csv") reader = TableReader with properties: Filename: "example.csv" >> arrowTableRead = reader.read() arrowTableRead = Var1: double Var2: double Var3: double ---- Var1: [ [ 0.9113083542736461, 0.5131490075412158, 0.42942202968065213 ] ] Var2: [ [ 0.09159480217154525, 0.27367730380496647, 0.8866478145458545 ] ] Var3: [ [ 0.2459443412735529, 0.6211893868708748, 0.49500739584280073 ] ] >> matlabTableRead = table(arrowTableRead) matlabTableRead = 3×3 table Var1 Var2 Var3 _______ ________ _______ 0.91131 0.091595 0.24594 0.51315 0.27368 0.62119 0.42942 0.88665 0.49501 >> isequal(arrowTableRead, arrowTableWrite) ans = logical 1 >> isequal(matlabTableRead, matlabTableWrite) ans = logical 1 ``` ### Are these changes tested? Yes. 1. Added new CSV I/O tests including `test/arrow/io/csv/tRoundTrip.m` and `test/arrow/io/csv/tError.m`. 2. Both of these test classes inherit from a `CSVTest` superclass. ### Are there any user-facing changes? Yes. 1. Users can now read and write CSV files using `arrow.io.csv.TableReader` and `arrow.io.csv.TableWriter`. ### Future Directions 1. 
Expose [options](https://github.com/apache/arrow/blob/main/cpp/src/arrow/csv/options.h) for controlling CSV reading and writing in MATLAB.
2. Add more read/write tests for null value handling and other datatypes beyond numeric and string values.
3. Add a `RecordBatchReader` and `RecordBatchWriter` for CSV.
4. Add support for more I/O formats like Parquet, JSON, ORC, Arrow IPC, etc.

### Notes

1. Thank you @ sgilmore10 for your help with this pull request!
2. I chose to add both the `TableReader` and `TableWriter` in one pull request because it simplified testing. My apologies for the slightly lengthy pull request.

* Closes: #37770

Lead-authored-by: Kevin Gurney
Co-authored-by: Sarah Gilmore
Signed-off-by: Kevin Gurney
---
 matlab/CMakeLists.txt                         |   5 +-
 matlab/src/cpp/arrow/matlab/error/error.h     |   3 +
 .../arrow/matlab/io/csv/proxy/table_reader.cc |  93 ++++++++++++++++
 .../arrow/matlab/io/csv/proxy/table_reader.h  |  38 +++++++
 .../arrow/matlab/io/csv/proxy/table_writer.cc |  86 +++++++++++++++
 .../arrow/matlab/io/csv/proxy/table_writer.h  |  38 +++++++
 matlab/src/cpp/arrow/matlab/proxy/factory.cc  |   4 +
 .../src/matlab/+arrow/+io/+csv/TableReader.m  |  51 +++++++++
 .../src/matlab/+arrow/+io/+csv/TableWriter.m  |  51 +++++++++
 matlab/test/arrow/io/csv/CSVTest.m            | 102 ++++++++++++++++++
 matlab/test/arrow/io/csv/tError.m             |  73 +++++++++++++
 matlab/test/arrow/io/csv/tRoundTrip.m         |  62 +++++++++++
 .../cmake/BuildMatlabArrowInterface.cmake     |   4 +-
 13 files changed, 606 insertions(+), 4 deletions(-)
 create mode 100644 matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.cc
 create mode 100644 matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.h
 create mode 100644 matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.cc
 create mode 100644 matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.h
 create mode 100644 matlab/src/matlab/+arrow/+io/+csv/TableReader.m
 create mode 100644 matlab/src/matlab/+arrow/+io/+csv/TableWriter.m
 create mode 100644 matlab/test/arrow/io/csv/CSVTest.m
 create mode 100644 matlab/test/arrow/io/csv/tError.m
 create mode 100644 matlab/test/arrow/io/csv/tRoundTrip.m

diff --git a/matlab/CMakeLists.txt b/matlab/CMakeLists.txt
index c8100a389ace0..b7af37a278536 100644
--- a/matlab/CMakeLists.txt
+++ b/matlab/CMakeLists.txt
@@ -34,8 +34,9 @@ function(build_arrow)
   set(ARROW_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/arrow_ep-prefix")
   set(ARROW_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}/arrow_ep-build")

-  set(ARROW_CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=${ARROW_PREFIX}"
-                       "-DCMAKE_INSTALL_LIBDIR=lib" "-DARROW_BUILD_STATIC=OFF")
+  set(ARROW_CMAKE_ARGS
+      "-DCMAKE_INSTALL_PREFIX=${ARROW_PREFIX}" "-DCMAKE_INSTALL_LIBDIR=lib"
+      "-DARROW_BUILD_STATIC=OFF" "-DARROW_CSV=ON")

   add_library(arrow_shared SHARED IMPORTED)
   set(ARROW_LIBRARY_TARGET arrow_shared)
diff --git a/matlab/src/cpp/arrow/matlab/error/error.h b/matlab/src/cpp/arrow/matlab/error/error.h
index 4ff77da8d8360..ada9954353d9b 100644
--- a/matlab/src/cpp/arrow/matlab/error/error.h
+++ b/matlab/src/cpp/arrow/matlab/error/error.h
@@ -182,6 +182,9 @@ namespace arrow::matlab::error {
     static const char* TABLE_INVALID_NUMERIC_COLUMN_INDEX = "arrow:tabular:table:InvalidNumericColumnIndex";
     static const char* FAILED_TO_OPEN_FILE_FOR_WRITE = "arrow:io:FailedToOpenFileForWrite";
     static const char* FAILED_TO_OPEN_FILE_FOR_READ = "arrow:io:FailedToOpenFileForRead";
+    static const char* CSV_FAILED_TO_WRITE_TABLE = "arrow:io:csv:FailedToWriteTable";
+    static const char* CSV_FAILED_TO_CREATE_TABLE_READER = "arrow:io:csv:FailedToCreateTableReader";
+    static const char* CSV_FAILED_TO_READ_TABLE = "arrow:io:csv:FailedToReadTable";
     static const char* FEATHER_FAILED_TO_WRITE_TABLE = "arrow:io:feather:FailedToWriteTable";
     static const char* TABLE_FROM_RECORD_BATCH = "arrow:table:FromRecordBatch";
     static const char* FEATHER_FAILED_TO_CREATE_READER = "arrow:io:feather:FailedToCreateReader";
diff --git a/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.cc b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.cc
new file mode 100644
index 0000000000000..ab9935ce145a8
--- /dev/null
+++ b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.cc
@@ -0,0 +1,93 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "libmexclass/proxy/ProxyManager.h"
+
+#include "arrow/matlab/error/error.h"
+#include "arrow/matlab/io/csv/proxy/table_reader.h"
+#include "arrow/matlab/tabular/proxy/table.h"
+
+#include "arrow/util/utf8.h"
+
+#include "arrow/result.h"
+
+#include "arrow/io/file.h"
+#include "arrow/io/interfaces.h"
+#include "arrow/csv/reader.h"
+#include "arrow/table.h"
+
+namespace arrow::matlab::io::csv::proxy {
+
+    TableReader::TableReader(const std::string& filename) : filename{filename} {
+        REGISTER_METHOD(TableReader, read);
+        REGISTER_METHOD(TableReader, getFilename);
+    }
+
+    libmexclass::proxy::MakeResult TableReader::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
+        namespace mda = ::matlab::data;
+        using TableReaderProxy = arrow::matlab::io::csv::proxy::TableReader;
+
+        mda::StructArray args = constructor_arguments[0];
+        const mda::StringArray filename_utf16_mda = args[0]["Filename"];
+        const auto filename_utf16 = std::u16string(filename_utf16_mda[0]);
+        MATLAB_ASSIGN_OR_ERROR(const auto filename, arrow::util::UTF16StringToUTF8(filename_utf16), error::UNICODE_CONVERSION_ERROR_ID);
+
+        return std::make_shared<TableReaderProxy>(filename);
+    }
+
+    void TableReader::read(libmexclass::proxy::method::Context& context) {
+        namespace mda = ::matlab::data;
+        using namespace libmexclass::proxy;
+        namespace csv = ::arrow::csv;
+        using TableProxy = arrow::matlab::tabular::proxy::Table;
+
+        mda::ArrayFactory factory;
+
+        // Create a file input stream.
+        MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto source, arrow::io::ReadableFile::Open(filename, arrow::default_memory_pool()), context, error::FAILED_TO_OPEN_FILE_FOR_READ);
+
+        const ::arrow::io::IOContext io_context;
+        const auto read_options = csv::ReadOptions::Defaults();
+        const auto parse_options = csv::ParseOptions::Defaults();
+        const auto convert_options = csv::ConvertOptions::Defaults();
+
+        // Create a TableReader from the file input stream.
+        MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto table_reader,
+                                            csv::TableReader::Make(io_context, source, read_options, parse_options, convert_options),
+                                            context,
+                                            error::CSV_FAILED_TO_CREATE_TABLE_READER);
+
+        // Read a Table from the file.
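+        // Note: Read() materializes the whole file into a single arrow::Table;
+        // record-batch-level (streaming) CSV reading is listed under Future Directions.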
+ MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto table, table_reader->Read(), context, error::CSV_FAILED_TO_READ_TABLE); + + auto table_proxy = std::make_shared(table); + const auto table_proxy_id = ProxyManager::manageProxy(table_proxy); + + const auto table_proxy_id_mda = factory.createScalar(table_proxy_id); + + context.outputs[0] = table_proxy_id_mda; + } + + void TableReader::getFilename(libmexclass::proxy::method::Context& context) { + namespace mda = ::matlab::data; + mda::ArrayFactory factory; + + MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto filename_utf16, arrow::util::UTF8StringToUTF16(filename), context, error::UNICODE_CONVERSION_ERROR_ID); + auto filename_utf16_mda = factory.createScalar(filename_utf16); + context.outputs[0] = filename_utf16_mda; + } + +} diff --git a/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.h b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.h new file mode 100644 index 0000000000000..d5dfce50e4096 --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_reader.h @@ -0,0 +1,38 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include "libmexclass/proxy/Proxy.h" + +namespace arrow::matlab::io::csv::proxy { + + class TableReader : public libmexclass::proxy::Proxy { + public: + TableReader(const std::string& filename); + ~TableReader() {} + static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments); + + protected: + void read(libmexclass::proxy::method::Context& context); + void getFilename(libmexclass::proxy::method::Context& context); + + private: + const std::string filename; + }; + +} diff --git a/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.cc b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.cc new file mode 100644 index 0000000000000..b24bd81b06681 --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.cc @@ -0,0 +1,86 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
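+
+// The writer proxy stores only the destination filename; each call to write()
+// opens a FileOutputStream and serializes the given table with
+// arrow::csv::WriteCSV using the default csv::WriteOptions.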
+ +#include "arrow/matlab/io/csv/proxy/table_writer.h" +#include "arrow/matlab/tabular/proxy/table.h" +#include "arrow/matlab/error/error.h" + +#include "arrow/result.h" +#include "arrow/table.h" +#include "arrow/util/utf8.h" + +#include "arrow/io/file.h" +#include "arrow/csv/writer.h" +#include "arrow/csv/options.h" + +#include "libmexclass/proxy/ProxyManager.h" + +namespace arrow::matlab::io::csv::proxy { + + TableWriter::TableWriter(const std::string& filename) : filename{filename} { + REGISTER_METHOD(TableWriter, getFilename); + REGISTER_METHOD(TableWriter, write); + } + + libmexclass::proxy::MakeResult TableWriter::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) { + namespace mda = ::matlab::data; + mda::StructArray opts = constructor_arguments[0]; + const mda::StringArray filename_mda = opts[0]["Filename"]; + using TableWriterProxy = ::arrow::matlab::io::csv::proxy::TableWriter; + + const auto filename_utf16 = std::u16string(filename_mda[0]); + MATLAB_ASSIGN_OR_ERROR(const auto filename_utf8, + arrow::util::UTF16StringToUTF8(filename_utf16), + error::UNICODE_CONVERSION_ERROR_ID); + + return std::make_shared(filename_utf8); + } + + void TableWriter::getFilename(libmexclass::proxy::method::Context& context) { + namespace mda = ::matlab::data; + MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto utf16_filename, + arrow::util::UTF8StringToUTF16(filename), + context, + error::UNICODE_CONVERSION_ERROR_ID); + mda::ArrayFactory factory; + auto str_mda = factory.createScalar(utf16_filename); + context.outputs[0] = str_mda; + } + + void TableWriter::write(libmexclass::proxy::method::Context& context) { + namespace csv = ::arrow::csv; + namespace mda = ::matlab::data; + using TableProxy = ::arrow::matlab::tabular::proxy::Table; + + mda::StructArray opts = context.inputs[0]; + const mda::TypedArray table_proxy_id_mda = opts[0]["TableProxyID"]; + const uint64_t table_proxy_id = table_proxy_id_mda[0]; + + auto proxy = libmexclass::proxy::ProxyManager::getProxy(table_proxy_id); + auto table_proxy = std::static_pointer_cast(proxy); + auto table = table_proxy->unwrap(); + + MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto output_stream, + arrow::io::FileOutputStream::Open(filename), + context, + error::FAILED_TO_OPEN_FILE_FOR_WRITE); + const auto options = csv::WriteOptions::Defaults(); + MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(csv::WriteCSV(*table, options, output_stream.get()), + context, + error::CSV_FAILED_TO_WRITE_TABLE); + } +} diff --git a/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.h b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.h new file mode 100644 index 0000000000000..b9916bd9bdc22 --- /dev/null +++ b/matlab/src/cpp/arrow/matlab/io/csv/proxy/table_writer.h @@ -0,0 +1,38 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. 
See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include "libmexclass/proxy/Proxy.h" + +namespace arrow::matlab::io::csv::proxy { + + class TableWriter : public libmexclass::proxy::Proxy { + public: + TableWriter(const std::string& filename); + ~TableWriter() {} + static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments); + + protected: + void getFilename(libmexclass::proxy::method::Context& context); + void write(libmexclass::proxy::method::Context& context); + + private: + const std::string filename; + }; + +} diff --git a/matlab/src/cpp/arrow/matlab/proxy/factory.cc b/matlab/src/cpp/arrow/matlab/proxy/factory.cc index ebeb020a9e7c7..d1f46c7e2f71f 100644 --- a/matlab/src/cpp/arrow/matlab/proxy/factory.cc +++ b/matlab/src/cpp/arrow/matlab/proxy/factory.cc @@ -37,6 +37,8 @@ #include "arrow/matlab/type/proxy/field.h" #include "arrow/matlab/io/feather/proxy/writer.h" #include "arrow/matlab/io/feather/proxy/reader.h" +#include "arrow/matlab/io/csv/proxy/table_writer.h" +#include "arrow/matlab/io/csv/proxy/table_reader.h" #include "factory.h" @@ -85,6 +87,8 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const ClassName& class_name, REGISTER_PROXY(arrow.type.proxy.StructType , arrow::matlab::type::proxy::StructType); REGISTER_PROXY(arrow.io.feather.proxy.Writer , arrow::matlab::io::feather::proxy::Writer); REGISTER_PROXY(arrow.io.feather.proxy.Reader , arrow::matlab::io::feather::proxy::Reader); + REGISTER_PROXY(arrow.io.csv.proxy.TableWriter , arrow::matlab::io::csv::proxy::TableWriter); + REGISTER_PROXY(arrow.io.csv.proxy.TableReader , arrow::matlab::io::csv::proxy::TableReader); return libmexclass::error::Error{error::UNKNOWN_PROXY_ERROR_ID, "Did not find matching C++ proxy for " + class_name}; }; diff --git a/matlab/src/matlab/+arrow/+io/+csv/TableReader.m b/matlab/src/matlab/+arrow/+io/+csv/TableReader.m new file mode 100644 index 0000000000000..1e0308bb8d4fe --- /dev/null +++ b/matlab/src/matlab/+arrow/+io/+csv/TableReader.m @@ -0,0 +1,51 @@ +%TABLEREADER Reads tabular data from a CSV file into an arrow.tabular.Table. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. 
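+%
+% Example usage (illustrative; assumes "data.csv" exists, see
+% matlab/test/arrow/io/csv/tRoundTrip.m for tested round-trip usage):
+%
+%   reader = arrow.io.csv.TableReader("data.csv");
+%   arrowTable = reader.read();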
+ +classdef TableReader + + properties (GetAccess=public, SetAccess=private, Hidden) + Proxy + end + + properties (Dependent, SetAccess=private, GetAccess=public) + Filename + end + + methods + + function obj = TableReader(filename) + arguments + filename (1, 1) string {mustBeNonmissing, mustBeNonzeroLengthText} + end + + args = struct(Filename=filename); + obj.Proxy = arrow.internal.proxy.create("arrow.io.csv.proxy.TableReader", args); + end + + function table = read(obj) + tableProxyID = obj.Proxy.read(); + proxy = libmexclass.proxy.Proxy(Name="arrow.tabular.proxy.Table", ID=tableProxyID); + table = arrow.tabular.Table(proxy); + end + + function filename = get.Filename(obj) + filename = obj.Proxy.getFilename(); + end + + end + +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+io/+csv/TableWriter.m b/matlab/src/matlab/+arrow/+io/+csv/TableWriter.m new file mode 100644 index 0000000000000..eb1aafe08f545 --- /dev/null +++ b/matlab/src/matlab/+arrow/+io/+csv/TableWriter.m @@ -0,0 +1,51 @@ +%TABLEWRITER Writes tabular data in an arrow.tabular.Table to a CSV file. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. +classdef TableWriter < matlab.mixin.Scalar + + properties(Hidden, SetAccess=private, GetAccess=public) + Proxy + end + + properties(Dependent, SetAccess=private, GetAccess=public) + Filename + end + + methods + function obj = TableWriter(filename) + arguments + filename (1, 1) string {mustBeNonmissing, mustBeNonzeroLengthText} + end + + args = struct(Filename=filename); + proxyName = "arrow.io.csv.proxy.TableWriter"; + obj.Proxy = arrow.internal.proxy.create(proxyName, args); + end + + function write(obj, table) + arguments + obj (1, 1) arrow.io.csv.TableWriter + table (1, 1) arrow.tabular.Table + end + args = struct(TableProxyID=table.Proxy.ID); + obj.Proxy.write(args); + end + + function filename = get.Filename(obj) + filename = obj.Proxy.getFilename(); + end + end +end diff --git a/matlab/test/arrow/io/csv/CSVTest.m b/matlab/test/arrow/io/csv/CSVTest.m new file mode 100644 index 0000000000000..49f77eaaa7c63 --- /dev/null +++ b/matlab/test/arrow/io/csv/CSVTest.m @@ -0,0 +1,102 @@ +%CSVTEST Super class for CSV related tests. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. 
You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. +classdef CSVTest < matlab.unittest.TestCase + + properties + Filename + end + + methods (TestClassSetup) + + function initializeProperties(~) + % Seed the random number generator. + rng(1); + end + + end + + methods (TestMethodSetup) + + function setupTestFilename(testCase) + import matlab.unittest.fixtures.TemporaryFolderFixture + fixture = testCase.applyFixture(TemporaryFolderFixture); + testCase.Filename = fullfile(fixture.Folder, "filename.csv"); + end + + end + + methods + + function verifyRoundTrip(testCase, arrowTable) + import arrow.io.csv.* + + writer = TableWriter(testCase.Filename); + reader = TableReader(testCase.Filename); + + writer.write(arrowTable); + arrowTableRead = reader.read(); + + testCase.verifyEqual(arrowTableRead, arrowTable); + end + + function arrowTable = makeArrowTable(testCase, opts) + arguments + testCase + opts.Type + opts.ColumnNames + opts.NumRows + opts.WithNulls (1, 1) logical = false + end + + if opts.Type == "numeric" + matlabTable = array2table(rand(opts.NumRows, numel(opts.ColumnNames))); + elseif opts.Type == "string" + matlabTable = array2table("A" + rand(opts.NumRows, numel(opts.ColumnNames)) + "B"); + end + + if opts.WithNulls + matlabTable = testCase.setNullValues(matlabTable, NullPercentage=0.2); + end + + arrays = cell(1, width(matlabTable)); + for ii = 1:width(matlabTable) + arrays{ii} = arrow.array(matlabTable.(ii)); + end + arrowTable = arrow.tabular.Table.fromArrays(arrays{:}, ColumnNames=opts.ColumnNames); + end + + function tWithNulls = setNullValues(testCase, t, opts) + arguments + testCase %#ok + t table + opts.NullPercentage (1, 1) double {mustBeGreaterThanOrEqual(opts.NullPercentage, 0)} = 0.5 + end + + tWithNulls = t; + for ii = 1:width(t) + temp = tWithNulls.(ii); + numValues = numel(temp); + numNulls = uint64(opts.NullPercentage * numValues); + nullIndices = randperm(numValues, numNulls); + temp(nullIndices) = missing; + tWithNulls.(ii) = temp; + end + end + + end + +end diff --git a/matlab/test/arrow/io/csv/tError.m b/matlab/test/arrow/io/csv/tError.m new file mode 100644 index 0000000000000..24c420e7ba2dd --- /dev/null +++ b/matlab/test/arrow/io/csv/tError.m @@ -0,0 +1,73 @@ +%TERROR Error tests for CSV. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. 
+classdef tError < CSVTest + + methods(Test) + + function EmptyFile(testCase) + import arrow.io.csv.* + + arrowTableWrite = arrow.table(); + + writer = TableWriter(testCase.Filename); + reader = TableReader(testCase.Filename); + + writer.write(arrowTableWrite); + fcn = @() reader.read(); + testCase.verifyError(fcn, "arrow:io:csv:FailedToReadTable"); + end + + function InvalidWriterFilenameType(testCase) + import arrow.io.csv.* + fcn = @() TableWriter(table); + testCase.verifyError(fcn, "MATLAB:validation:UnableToConvert"); + fcn = @() TableWriter(["a", "b"]); + testCase.verifyError(fcn, "MATLAB:validation:IncompatibleSize"); + end + + function InvalidReaderFilenameType(testCase) + import arrow.io.csv.* + fcn = @() TableReader(table); + testCase.verifyError(fcn, "MATLAB:validation:UnableToConvert"); + fcn = @() TableReader(["a", "b"]); + testCase.verifyError(fcn, "MATLAB:validation:IncompatibleSize"); + end + + function InvalidWriterWriteType(testCase) + import arrow.io.csv.* + writer = TableWriter(testCase.Filename); + fcn = @() writer.write("text"); + testCase.verifyError(fcn, "MATLAB:validation:UnableToConvert"); + end + + function WriterFilenameNoSetter(testCase) + import arrow.io.csv.* + writer = TableWriter(testCase.Filename); + fcn = @() setfield(writer, "Filename", "filename.csv"); + testCase.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function ReaderFilenameNoSetter(testCase) + import arrow.io.csv.* + reader = TableReader(testCase.Filename); + fcn = @() setfield(reader, "Filename", "filename.csv"); + testCase.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + end + +end \ No newline at end of file diff --git a/matlab/test/arrow/io/csv/tRoundTrip.m b/matlab/test/arrow/io/csv/tRoundTrip.m new file mode 100644 index 0000000000000..cb35822580106 --- /dev/null +++ b/matlab/test/arrow/io/csv/tRoundTrip.m @@ -0,0 +1,62 @@ +%TROUNDTRIP Round trip tests for CSV. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. +classdef tRoundTrip < CSVTest + + properties (TestParameter) + NumRows = { ... + 2, ... + 10, ... + 100 ... + } + WithNulls = { ... + true, ... + false ... + } + ColumnNames = {... + ["A", "B", "C"], ... + ["😀", "🌲", "🥭", " ", "ABC"], ... + [" ", " ", " "] + } + end + + methods(Test) + + function Numeric(testCase, NumRows, WithNulls, ColumnNames) + arrowTable = testCase.makeArrowTable(... + Type="numeric", ... + NumRows=NumRows, ... + WithNulls=WithNulls, ... + ColumnNames=ColumnNames ... + ); + + testCase.verifyRoundTrip(arrowTable); + end + + function String(testCase, NumRows, ColumnNames) + arrowTable = testCase.makeArrowTable(... + Type="string", ... + NumRows=NumRows, ... + WithNulls=false, ... + ColumnNames=ColumnNames ... 
+                );
+
+            testCase.verifyRoundTrip(arrowTable);
+        end
+
+    end
+
+end
\ No newline at end of file
diff --git a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
index 40c6b5a51d4fe..294612dda370f 100644
--- a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
+++ b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
@@ -70,10 +70,10 @@ set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_SOURCES "${CMAKE_SOURCE_DIR}/src/cpp/a
 "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/type/proxy/wrap.cc"
 "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/feather/proxy/writer.cc"
 "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/feather/proxy/reader.cc"
+ "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/csv/proxy/table_writer.cc"
+ "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/io/csv/proxy/table_reader.cc"
 "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/index/validate.cc")
-
-
 set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_FACTORY_INCLUDE_DIR "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/proxy")
 set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_FACTORY_SOURCES "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/proxy/factory.cc")
 set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_LIBRARY_INCLUDE_DIRS ${MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_LIBRARY_ROOT_INCLUDE_DIR}

From 7b30ba48e7f3605507d1daecbd041c16b667178a Mon Sep 17 00:00:00 2001
From: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Date: Wed, 20 Sep 2023 15:43:55 -0400
Subject: [PATCH 49/96] GH-37653: [MATLAB] Add `arrow.array.StructArray` MATLAB
 class (#37806)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### Rationale for this change

Now that many of the commonly-used "primitive" array types have been added to the MATLAB Interface, we can implement the `arrow.array.StructArray` class.

### What changes are included in this PR?

Added the `arrow.array.StructArray` MATLAB class.

*Methods* of `arrow.array.StructArray` include:

- `fromArrays(arrays, nvpairs)`
- `field(i)` -> get the field specified by `i` as an `arrow.array.Array`. `i` can be a positive integer index or a field name.
- `toMATLAB()` -> convert to a MATLAB `table`
- `table()` -> convert to a MATLAB `table`

*Properties* of `arrow.array.StructArray` include:

- `Type`
- `Length`
- `NumFields`
- `FieldNames`
- `Valid`

**Example Usage**

```matlab
>> a = arrow.array([1, 2, 3, 4]);
>> b = arrow.array(["A", "B", "C", "D"]);
>> s = arrow.array.StructArray.fromArrays(a, b, FieldNames=["A", "B"])

s =

-- is_valid: all not null
-- child 0 type: double
  [
    1,
    2,
    3,
    4
  ]
-- child 1 type: string
  [
    "A",
    "B",
    "C",
    "D"
  ]

% Convert StructArray to a MATLAB table
>> t = toMATLAB(s)

t =

  4×2 table

    A     B
    _    ___

    1    "A"
    2    "B"
    3    "C"
    4    "D"
```

### Are these changes tested?

Yes. Added a new test class `tStructArray.m`.

### Are there any user-facing changes?

Yes. Users can now construct an `arrow.array.StructArray` instance.

### Notes

1. Although [`struct`](https://www.mathworks.com/help/matlab/ref/struct.html) is a MATLAB datatype, `StructArray`'s `toMATLAB` method returns a MATLAB `table`. We went with this design because the layout of MATLAB `table`s more closely resembles that of `StructArray`s. MATLAB `table`s ensure a consistent schema and the data is laid out in a columnar format. In a future PR, we plan on adding a `struct` method to `StructArray`, which will return a MATLAB `struct` array.
2. I removed the virtual `toMATLAB` method from `proxy::Array` because the nested array MATLAB classes will implement their `toMATLAB` methods by invoking the `toMATLAB` method on their field arrays. There's no need for the C++ proxy classes of nested arrays to have a `toMATLAB` method.
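To make Note 1 concrete, here is a small sketch of the null-handling behavior (the values are illustrative; `Valid` accepts either a logical vector or a list of valid indices, as implemented in `parseValid.m` below):

```matlab
a = arrow.array([1, 2, 3, 4]);
b = arrow.array(["A", "B", "C", "D"]);
% Element 2 is not listed in Valid, so it is marked null.
s = arrow.array.StructArray.fromArrays(a, b, FieldNames=["A", "B"], Valid=[1, 3, 4]);
t = toMATLAB(s);
% Row 2 of t holds each field's NullSubstitutionValue: NaN for the double
% variable "A" and string(missing) for the string variable "B".
```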
### Future Directions

1. Add a `fromMATLAB` static method to create `StructArray`s from MATLAB `table`s and MATLAB `struct` arrays.
2. Add a `fromTable` static method to create `StructArray`s from `arrow.tabular.Table`s.
3. Add a `fromRecordBatch` static method to create `StructArray`s from `arrow.tabular.RecordBatch`s.

* Closes: #37653

Authored-by: Sarah Gilmore
Signed-off-by: Kevin Gurney
---
 .../src/cpp/arrow/matlab/array/proxy/array.cc |   1 -
 .../src/cpp/arrow/matlab/array/proxy/array.h  |   2 -
 .../arrow/matlab/array/proxy/boolean_array.cc |   4 +-
 .../arrow/matlab/array/proxy/boolean_array.h  |   2 +-
 .../arrow/matlab/array/proxy/numeric_array.h  |   6 +-
 .../arrow/matlab/array/proxy/string_array.cc  |   4 +-
 .../arrow/matlab/array/proxy/string_array.h   |   2 +-
 .../arrow/matlab/array/proxy/struct_array.cc  | 199 +++++++++++++
 .../arrow/matlab/array/proxy/struct_array.h   |  44 +++
 .../src/cpp/arrow/matlab/array/proxy/wrap.cc  |   3 +
 matlab/src/cpp/arrow/matlab/error/error.h     |   2 +-
 matlab/src/cpp/arrow/matlab/proxy/factory.cc  |   2 +
 matlab/src/matlab/+arrow/+array/Array.m       |   5 +-
 .../src/matlab/+arrow/+array/BooleanArray.m   |   6 +-
 .../src/matlab/+arrow/+array/ChunkedArray.m   |   3 +-
 matlab/src/matlab/+arrow/+array/Date32Array.m |   2 +-
 matlab/src/matlab/+arrow/+array/Date64Array.m |   2 +-
 .../src/matlab/+arrow/+array/Float32Array.m   |   2 +-
 .../src/matlab/+arrow/+array/Float64Array.m   |   2 +-
 matlab/src/matlab/+arrow/+array/Int16Array.m  |   2 +-
 matlab/src/matlab/+arrow/+array/Int32Array.m  |   2 +-
 matlab/src/matlab/+arrow/+array/Int64Array.m  |   2 +-
 matlab/src/matlab/+arrow/+array/Int8Array.m   |   2 +-
 .../src/matlab/+arrow/+array/NumericArray.m   |   2 +-
 matlab/src/matlab/+arrow/+array/StringArray.m |   6 +-
 matlab/src/matlab/+arrow/+array/StructArray.m | 146 +++++++++
 matlab/src/matlab/+arrow/+array/Time32Array.m |   2 +-
 matlab/src/matlab/+arrow/+array/Time64Array.m |   2 +-
 .../src/matlab/+arrow/+array/TimestampArray.m |   2 +-
 matlab/src/matlab/+arrow/+array/UInt16Array.m |   2 +-
 matlab/src/matlab/+arrow/+array/UInt32Array.m |   2 +-
 matlab/src/matlab/+arrow/+array/UInt64Array.m |   2 +-
 matlab/src/matlab/+arrow/+array/UInt8Array.m  |   2 +-
 .../+tabular/createAllSupportedArrayTypes.m   |  11 +
 .../+arrow/+internal/+validate/parseValid.m   |  46 +++
 .../+internal/+validate/parseValidElements.m  |  25 +-
 .../+arrow/+type/+traits/StructTraits.m       |  17 +-
 matlab/src/matlab/+arrow/+type/StructType.m   |  32 +-
 matlab/test/arrow/array/tStructArray.m        | 277 ++++++++++++++++++
 matlab/test/arrow/type/traits/tStructTraits.m |  10 +-
 .../cmake/BuildMatlabArrowInterface.cmake     |   1 +
 41 files changed, 803 insertions(+), 85 deletions(-)
 create mode 100644 matlab/src/cpp/arrow/matlab/array/proxy/struct_array.cc
 create mode 100644 matlab/src/cpp/arrow/matlab/array/proxy/struct_array.h
 create mode 100644 matlab/src/matlab/+arrow/+array/StructArray.m
 create mode 100644 matlab/src/matlab/+arrow/+internal/+validate/parseValid.m
 create mode 100644 matlab/test/arrow/array/tStructArray.m

diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/array.cc b/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
index ed6152259891d..5fa533632f928 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/array.cc
@@ -31,7 +31,6 @@ namespace arrow::matlab::array::proxy {
 // Register Proxy methods.
 REGISTER_METHOD(Array, toString);
- REGISTER_METHOD(Array, toMATLAB);
 REGISTER_METHOD(Array, getLength);
 REGISTER_METHOD(Array, getValid);
 REGISTER_METHOD(Array, getType);
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/array.h b/matlab/src/cpp/arrow/matlab/array/proxy/array.h
index 185e107f75391..46e1fa5a81380 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/array.h
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/array.h
@@ -42,8 +42,6 @@ class Array : public libmexclass::proxy::Proxy {
 void getType(libmexclass::proxy::method::Context& context);
- virtual void toMATLAB(libmexclass::proxy::method::Context& context) = 0;
-
 void isEqual(libmexclass::proxy::method::Context& context);
 std::shared_ptr<arrow::Array> array;
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.cc b/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.cc
index 5be0cfb5a3d13..6a6e478274823 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.cc
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.cc
@@ -25,7 +25,9 @@ namespace arrow::matlab::array::proxy {
 BooleanArray::BooleanArray(std::shared_ptr<arrow::BooleanArray> array)
- : arrow::matlab::array::proxy::Array{std::move(array)} {}
+ : arrow::matlab::array::proxy::Array{std::move(array)} {
+ REGISTER_METHOD(BooleanArray, toMATLAB);
+ }
 libmexclass::proxy::MakeResult BooleanArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
 ::matlab::data::StructArray opts = constructor_arguments[0];
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.h b/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.h
index 775673c29eada..edc00b178e42a 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.h
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.h
@@ -31,7 +31,7 @@ namespace arrow::matlab::array::proxy {
 static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);
 protected:
- void toMATLAB(libmexclass::proxy::method::Context& context) override;
+ void toMATLAB(libmexclass::proxy::method::Context& context);
 };
}
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h b/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
index f9da38dbaa062..4b4ddb6588678 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
@@ -40,7 +40,9 @@ class NumericArray : public arrow::matlab::array::proxy::Array {
 public:
 NumericArray(const std::shared_ptr<arrow::NumericArray<ArrowType>> numeric_array)
- : arrow::matlab::array::proxy::Array{std::move(numeric_array)} {}
+ : arrow::matlab::array::proxy::Array{std::move(numeric_array)} {
+ REGISTER_METHOD(NumericArray, toMATLAB);
+ }
 static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
 using MatlabBuffer = arrow::matlab::buffer::MatlabBuffer;
@@ -67,7 +69,7 @@
 }
 protected:
- void toMATLAB(libmexclass::proxy::method::Context& context) override {
+ void toMATLAB(libmexclass::proxy::method::Context& context) {
 using CType = typename arrow::TypeTraits<ArrowType>::CType;
 using NumericArray = arrow::NumericArray<ArrowType>;
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
index c583e8851a3ac..7160e88a3c8a0 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
@@ -28,7 +28,9 @@ namespace arrow::matlab::array::proxy {
 StringArray::StringArray(const std::shared_ptr<arrow::StringArray> string_array)
- : arrow::matlab::array::proxy::Array(std::move(string_array)) {}
+ : arrow::matlab::array::proxy::Array(std::move(string_array)) {
+ REGISTER_METHOD(StringArray, toMATLAB);
+ }
 libmexclass::proxy::MakeResult StringArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
 namespace mda = ::matlab::data;
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/string_array.h b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
index bdcfedd7cdda3..4cc01f0a02f8c 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
@@ -32,7 +32,7 @@ namespace arrow::matlab::array::proxy {
 static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);
 protected:
- void toMATLAB(libmexclass::proxy::method::Context& context) override;
+ void toMATLAB(libmexclass::proxy::method::Context& context);
 };
}
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/struct_array.cc b/matlab/src/cpp/arrow/matlab/array/proxy/struct_array.cc
new file mode 100644
index 0000000000000..c6d9e47a9b0c4
--- /dev/null
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/struct_array.cc
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/matlab/array/proxy/struct_array.h"
+#include "arrow/matlab/array/proxy/wrap.h"
+#include "arrow/matlab/bit/pack.h"
+#include "arrow/matlab/error/error.h"
+#include "arrow/matlab/index/validate.h"
+
+#include "arrow/util/utf8.h"
+
+#include "libmexclass/proxy/ProxyManager.h"
+
+namespace arrow::matlab::array::proxy {
+
+    StructArray::StructArray(std::shared_ptr<arrow::StructArray> struct_array)
+        : proxy::Array{std::move(struct_array)} {
+        REGISTER_METHOD(StructArray, getNumFields);
+        REGISTER_METHOD(StructArray, getFieldByIndex);
+        REGISTER_METHOD(StructArray, getFieldByName);
+        REGISTER_METHOD(StructArray, getFieldNames);
+    }
+
+    libmexclass::proxy::MakeResult StructArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
+        namespace mda = ::matlab::data;
+        using libmexclass::proxy::ProxyManager;
+
+        mda::StructArray opts = constructor_arguments[0];
+        const mda::TypedArray<uint64_t> arrow_array_proxy_ids = opts[0]["ArrayProxyIDs"];
+        const mda::StringArray field_names_mda = opts[0]["FieldNames"];
+        const mda::TypedArray<bool> validity_bitmap_mda = opts[0]["Valid"];
+
+        std::vector<std::shared_ptr<arrow::Array>> arrow_arrays;
+        arrow_arrays.reserve(arrow_array_proxy_ids.getNumberOfElements());
+
+        // Retrieve all of the Arrow Array Proxy instances from the libmexclass ProxyManager.
+        for (const auto& arrow_array_proxy_id : arrow_array_proxy_ids) {
+            auto proxy = ProxyManager::getProxy(arrow_array_proxy_id);
+            auto arrow_array_proxy = std::static_pointer_cast<proxy::Array>(proxy);
+            auto arrow_array = arrow_array_proxy->unwrap();
+            arrow_arrays.push_back(arrow_array);
+        }
+
+        // Convert the utf-16 encoded field names into utf-8 encoded strings
+        std::vector<std::string> field_names;
+        field_names.reserve(field_names_mda.getNumberOfElements());
+        for (const auto& field_name : field_names_mda) {
+            const auto field_name_utf16 = std::u16string(field_name);
+            MATLAB_ASSIGN_OR_ERROR(const auto field_name_utf8,
+                                   arrow::util::UTF16StringToUTF8(field_name_utf16),
+                                   error::UNICODE_CONVERSION_ERROR_ID);
+            field_names.push_back(field_name_utf8);
+        }
+
+        // Pack the validity bitmap values.
+        MATLAB_ASSIGN_OR_ERROR(auto validity_bitmap_buffer,
+                               bit::packValid(validity_bitmap_mda),
+                               error::BITPACK_VALIDITY_BITMAP_ERROR_ID);
+
+        // Create the StructArray
+        MATLAB_ASSIGN_OR_ERROR(auto array,
+                               arrow::StructArray::Make(arrow_arrays, field_names, validity_bitmap_buffer),
+                               error::STRUCT_ARRAY_MAKE_FAILED);
+
+        // Construct the StructArray Proxy
+        auto struct_array = std::static_pointer_cast<arrow::StructArray>(array);
+        return std::make_shared<StructArray>(std::move(struct_array));
+    }
+
+    void StructArray::getNumFields(libmexclass::proxy::method::Context& context) {
+        namespace mda = ::matlab::data;
+
+        mda::ArrayFactory factory;
+        const auto num_fields = array->type()->num_fields();
+        context.outputs[0] = factory.createScalar(num_fields);
+    }
+
+    void StructArray::getFieldByIndex(libmexclass::proxy::method::Context& context) {
+        namespace mda = ::matlab::data;
+        using namespace libmexclass::proxy;
+
+        mda::StructArray args = context.inputs[0];
+        const mda::TypedArray<int32_t> index_mda = args[0]["Index"];
+        const auto matlab_index = int32_t(index_mda[0]);
+
+        auto struct_array = std::static_pointer_cast<arrow::StructArray>(array);
+
+        const auto num_fields = struct_array->type()->num_fields();
+
+        // Validate there is at least 1 field
+        MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(
+            index::validateNonEmptyContainer(num_fields),
+            context, error::INDEX_EMPTY_CONTAINER);
+
+        // Validate the matlab index provided is within the range [1, num_fields]
+        MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(
+            index::validateInRange(matlab_index, num_fields),
+            context, error::INDEX_OUT_OF_RANGE);
+
+        // Note: MATLAB uses 1-based indexing, so subtract 1.
+        const int32_t index = matlab_index - 1;
+
+        auto field_array = struct_array->field(index);
+
+        // Wrap the array within a proxy object if possible.
+        MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto field_array_proxy,
+                                            proxy::wrap(field_array),
+                                            context, error::UNKNOWN_PROXY_FOR_ARRAY_TYPE);
+        const auto field_array_proxy_id = ProxyManager::manageProxy(field_array_proxy);
+        const auto type_id = field_array->type_id();
+
+        // Return a struct with two fields: ProxyID and TypeID. The MATLAB
+        // layer will use these values to construct the appropriate MATLAB
+        // arrow.array.Array subclass.
+        mda::ArrayFactory factory;
+        mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
+        output[0]["ProxyID"] = factory.createScalar(field_array_proxy_id);
+        output[0]["TypeID"] = factory.createScalar(static_cast<int32_t>(type_id));
+        context.outputs[0] = output;
+    }
+
+    void StructArray::getFieldByName(libmexclass::proxy::method::Context& context) {
+        namespace mda = ::matlab::data;
+        using libmexclass::proxy::ProxyManager;
+
+        mda::StructArray args = context.inputs[0];
+
+        const mda::StringArray name_mda = args[0]["Name"];
+        const auto name_utf16 = std::u16string(name_mda[0]);
+        MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto name,
+                                            arrow::util::UTF16StringToUTF8(name_utf16),
+                                            context, error::UNICODE_CONVERSION_ERROR_ID);
+
+
+        auto struct_array = std::static_pointer_cast<arrow::StructArray>(array);
+        auto field_array = struct_array->GetFieldByName(name);
+        if (!field_array) {
+            // Return an error if we could not query the field by name.
+            const auto msg = "Could not find field named " + name + ".";
+            context.error = libmexclass::error::Error{
+                error::ARROW_TABULAR_SCHEMA_AMBIGUOUS_FIELD_NAME, msg};
+            return;
+        }
+
+        // Wrap the array within a proxy object if possible.
+        MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto field_array_proxy,
+                                            proxy::wrap(field_array),
+                                            context, error::UNKNOWN_PROXY_FOR_ARRAY_TYPE);
+        const auto field_array_proxy_id = ProxyManager::manageProxy(field_array_proxy);
+        const auto type_id = field_array->type_id();
+
+        // Return a struct with two fields: ProxyID and TypeID. The MATLAB
+        // layer will use these values to construct the appropriate MATLAB
+        // arrow.array.Array subclass.
+        mda::ArrayFactory factory;
+        mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
+        output[0]["ProxyID"] = factory.createScalar(field_array_proxy_id);
+        output[0]["TypeID"] = factory.createScalar(static_cast<int32_t>(type_id));
+        context.outputs[0] = output;
+    }
+
+    void StructArray::getFieldNames(libmexclass::proxy::method::Context& context) {
+        namespace mda = ::matlab::data;
+
+        const auto& fields = array->type()->fields();
+        const auto num_fields = fields.size();
+        std::vector<mda::MATLABString> names;
+        names.reserve(num_fields);
+
+        for (size_t i = 0; i < num_fields; ++i) {
+            auto str_utf8 = fields[i]->name();
+
+            // MATLAB strings are UTF-16 encoded. Must convert UTF-8
+            // encoded field names before returning to MATLAB.
+            MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto str_utf16,
+                                                arrow::util::UTF8StringToUTF16(str_utf8),
+                                                context, error::UNICODE_CONVERSION_ERROR_ID);
+            const mda::MATLABString matlab_string = mda::MATLABString(std::move(str_utf16));
+            names.push_back(matlab_string);
+        }
+
+        mda::ArrayFactory factory;
+        context.outputs[0] = factory.createArray({1, num_fields}, names.begin(), names.end());
+    }
+}
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/struct_array.h b/matlab/src/cpp/arrow/matlab/array/proxy/struct_array.h
new file mode 100644
index 0000000000000..cfb548c4e50df
--- /dev/null
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/struct_array.h
@@ -0,0 +1,44 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/matlab/array/proxy/array.h"
+
+namespace arrow::matlab::array::proxy {
+
+class StructArray : public arrow::matlab::array::proxy::Array {
+ public:
+    StructArray(std::shared_ptr<arrow::StructArray> struct_array);
+
+    ~StructArray() {}
+
+    static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);
+
+ protected:
+
+    void getNumFields(libmexclass::proxy::method::Context& context);
+
+    void getFieldByIndex(libmexclass::proxy::method::Context& context);
+
+    void getFieldByName(libmexclass::proxy::method::Context& context);
+
+    void getFieldNames(libmexclass::proxy::method::Context& context);
+
+};
+
+}
diff --git a/matlab/src/cpp/arrow/matlab/array/proxy/wrap.cc b/matlab/src/cpp/arrow/matlab/array/proxy/wrap.cc
index a8e3f239919cc..b14f4b18711cb 100644
--- a/matlab/src/cpp/arrow/matlab/array/proxy/wrap.cc
+++ b/matlab/src/cpp/arrow/matlab/array/proxy/wrap.cc
@@ -21,6 +21,7 @@
 #include "arrow/matlab/array/proxy/boolean_array.h"
 #include "arrow/matlab/array/proxy/numeric_array.h"
 #include "arrow/matlab/array/proxy/string_array.h"
+#include "arrow/matlab/array/proxy/struct_array.h"
 namespace arrow::matlab::array::proxy {
@@ -61,6 +62,8 @@ namespace arrow::matlab::array::proxy {
 return std::make_shared<proxy::NumericArray<arrow::Date64Type>>(std::static_pointer_cast<arrow::Date64Array>(array));
 case ID::STRING:
 return std::make_shared<proxy::StringArray>(std::static_pointer_cast<arrow::StringArray>(array));
+ case ID::STRUCT:
+ return std::make_shared<proxy::StructArray>(std::static_pointer_cast<arrow::StructArray>(array));
 default:
 return arrow::Status::NotImplemented("Unsupported DataType: " + array->type()->ToString());
 }
diff --git a/matlab/src/cpp/arrow/matlab/error/error.h b/matlab/src/cpp/arrow/matlab/error/error.h
index ada9954353d9b..347bc25b5f3a6 100644
--- a/matlab/src/cpp/arrow/matlab/error/error.h
+++ b/matlab/src/cpp/arrow/matlab/error/error.h
@@ -195,7 +195,7 @@ namespace arrow::matlab::error {
 static const char* CHUNKED_ARRAY_MAKE_FAILED = "arrow:chunkedarray:MakeFailed";
 static const char* CHUNKED_ARRAY_NUMERIC_INDEX_WITH_EMPTY_CHUNKED_ARRAY = "arrow:chunkedarray:NumericIndexWithEmptyChunkedArray";
 static const char* CHUNKED_ARRAY_INVALID_NUMERIC_CHUNK_INDEX = "arrow:chunkedarray:InvalidNumericChunkIndex";
-
+ static const char* STRUCT_ARRAY_MAKE_FAILED = "arrow:array:StructArrayMakeFailed";
 static const char* INDEX_EMPTY_CONTAINER = "arrow:index:EmptyContainer";
 static const char* INDEX_OUT_OF_RANGE = "arrow:index:OutOfRange";
}
diff --git a/matlab/src/cpp/arrow/matlab/proxy/factory.cc b/matlab/src/cpp/arrow/matlab/proxy/factory.cc
index d1f46c7e2f71f..62ed84fedcf6a 100644
--- a/matlab/src/cpp/arrow/matlab/proxy/factory.cc
+++ b/matlab/src/cpp/arrow/matlab/proxy/factory.cc
@@ -21,6 +21,7 @@
 #include "arrow/matlab/array/proxy/timestamp_array.h"
 #include "arrow/matlab/array/proxy/time32_array.h"
 #include "arrow/matlab/array/proxy/time64_array.h"
+#include "arrow/matlab/array/proxy/struct_array.h"
 #include "arrow/matlab/array/proxy/chunked_array.h"
 #include "arrow/matlab/tabular/proxy/record_batch.h"
 #include "arrow/matlab/tabular/proxy/table.h"
@@ -57,6 +58,7 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const ClassName& class_name,
 REGISTER_PROXY(arrow.array.proxy.Int64Array     , arrow::matlab::array::proxy::NumericArray<arrow::Int64Type>);
 REGISTER_PROXY(arrow.array.proxy.BooleanArray   , arrow::matlab::array::proxy::BooleanArray);
 REGISTER_PROXY(arrow.array.proxy.StringArray    , arrow::matlab::array::proxy::StringArray);
+ REGISTER_PROXY(arrow.array.proxy.StructArray    , arrow::matlab::array::proxy::StructArray);
 REGISTER_PROXY(arrow.array.proxy.TimestampArray , arrow::matlab::array::proxy::NumericArray<arrow::TimestampType>);
 REGISTER_PROXY(arrow.array.proxy.Time32Array    , arrow::matlab::array::proxy::NumericArray<arrow::Time32Type>);
 REGISTER_PROXY(arrow.array.proxy.Time64Array    , arrow::matlab::array::proxy::NumericArray<arrow::Time64Type>);
diff --git a/matlab/src/matlab/+arrow/+array/Array.m b/matlab/src/matlab/+arrow/+array/Array.m
index 4505d4b006ad8..436d5b80aa6a8 100644
--- a/matlab/src/matlab/+arrow/+array/Array.m
+++ b/matlab/src/matlab/+arrow/+array/Array.m
@@ -21,12 +21,9 @@
 Proxy
 end
-    properties (Dependent)
+    properties(Dependent, SetAccess=private, GetAccess=public)
 Length
 Valid % Validity bitmap
-    end
-
-    properties(Dependent, SetAccess=private, GetAccess=public)
 Type(1, 1) arrow.type.Type
 end
diff --git a/matlab/src/matlab/+arrow/+array/BooleanArray.m b/matlab/src/matlab/+arrow/+array/BooleanArray.m
index b9ef36b5a70c9..dc38ef93e545c 100644
--- a/matlab/src/matlab/+arrow/+array/BooleanArray.m
+++ b/matlab/src/matlab/+arrow/+array/BooleanArray.m
@@ -16,8 +16,8 @@
 classdef BooleanArray < arrow.array.Array
 % arrow.array.BooleanArray
-    properties (Hidden, SetAccess=private)
-        NullSubstitionValue = false;
+    properties (Hidden, GetAccess=public, SetAccess=private)
+        NullSubstitutionValue = false;
 end
 methods
@@ -35,7 +35,7 @@
 function matlabArray = toMATLAB(obj)
 matlabArray = obj.Proxy.toMATLAB();
-            matlabArray(~obj.Valid) = obj.NullSubstitionValue;
+            matlabArray(~obj.Valid) = obj.NullSubstitutionValue;
 end
 end
diff --git a/matlab/src/matlab/+arrow/+array/ChunkedArray.m b/matlab/src/matlab/+arrow/+array/ChunkedArray.m
index 96d7bb57a4021..ede95323f4865 100644
--- a/matlab/src/matlab/+arrow/+array/ChunkedArray.m
+++ b/matlab/src/matlab/+arrow/+array/ChunkedArray.m
@@ -66,7 +66,8 @@
 for ii = 1:obj.NumChunks
 chunk = obj.chunk(ii);
 endIndex = startIndex + chunk.Length - 1;
-                data(startIndex:endIndex) = toMATLAB(chunk);
+                % Use 2D indexing to support tabular MATLAB types.
+ data(startIndex:endIndex, :) = toMATLAB(chunk); startIndex = endIndex + 1; end end diff --git a/matlab/src/matlab/+arrow/+array/Date32Array.m b/matlab/src/matlab/+arrow/+array/Date32Array.m index a462bd4f85ac1..cfe56bc67fb94 100644 --- a/matlab/src/matlab/+arrow/+array/Date32Array.m +++ b/matlab/src/matlab/+arrow/+array/Date32Array.m @@ -17,7 +17,7 @@ classdef Date32Array < arrow.array.Array - properties(Access=private) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = NaT end diff --git a/matlab/src/matlab/+arrow/+array/Date64Array.m b/matlab/src/matlab/+arrow/+array/Date64Array.m index f5da26bbb5594..c67b82a5bbc47 100644 --- a/matlab/src/matlab/+arrow/+array/Date64Array.m +++ b/matlab/src/matlab/+arrow/+array/Date64Array.m @@ -17,7 +17,7 @@ classdef Date64Array < arrow.array.Array - properties(Access=private) + properties(Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = NaT end diff --git a/matlab/src/matlab/+arrow/+array/Float32Array.m b/matlab/src/matlab/+arrow/+array/Float32Array.m index fe90db335b5aa..d12e772c41428 100644 --- a/matlab/src/matlab/+arrow/+array/Float32Array.m +++ b/matlab/src/matlab/+arrow/+array/Float32Array.m @@ -16,7 +16,7 @@ classdef Float32Array < arrow.array.NumericArray % arrow.array.Float32Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = single(NaN); end diff --git a/matlab/src/matlab/+arrow/+array/Float64Array.m b/matlab/src/matlab/+arrow/+array/Float64Array.m index ecf91e28954b5..028331b4f99c0 100644 --- a/matlab/src/matlab/+arrow/+array/Float64Array.m +++ b/matlab/src/matlab/+arrow/+array/Float64Array.m @@ -16,7 +16,7 @@ classdef Float64Array < arrow.array.NumericArray % arrow.array.Float64Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = NaN; end diff --git a/matlab/src/matlab/+arrow/+array/Int16Array.m b/matlab/src/matlab/+arrow/+array/Int16Array.m index 53c96c6eeb85c..aee94b39c8969 100644 --- a/matlab/src/matlab/+arrow/+array/Int16Array.m +++ b/matlab/src/matlab/+arrow/+array/Int16Array.m @@ -16,7 +16,7 @@ classdef Int16Array < arrow.array.NumericArray % arrow.array.Int16Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = int16(0) end diff --git a/matlab/src/matlab/+arrow/+array/Int32Array.m b/matlab/src/matlab/+arrow/+array/Int32Array.m index d85bcaf627f7b..a0c0c76afa0e7 100644 --- a/matlab/src/matlab/+arrow/+array/Int32Array.m +++ b/matlab/src/matlab/+arrow/+array/Int32Array.m @@ -16,7 +16,7 @@ classdef Int32Array < arrow.array.NumericArray % arrow.array.Int32Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = int32(0) end diff --git a/matlab/src/matlab/+arrow/+array/Int64Array.m b/matlab/src/matlab/+arrow/+array/Int64Array.m index 72199df88ded1..1f8b1c793984a 100644 --- a/matlab/src/matlab/+arrow/+array/Int64Array.m +++ b/matlab/src/matlab/+arrow/+array/Int64Array.m @@ -16,7 +16,7 @@ classdef Int64Array < arrow.array.NumericArray % arrow.array.Int64Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = int64(0); end diff --git a/matlab/src/matlab/+arrow/+array/Int8Array.m b/matlab/src/matlab/+arrow/+array/Int8Array.m index 0e9d8eec0edf5..02e21178ffe49 100644 --- a/matlab/src/matlab/+arrow/+array/Int8Array.m +++ b/matlab/src/matlab/+arrow/+array/Int8Array.m 
@@ -16,7 +16,7 @@ classdef Int8Array < arrow.array.NumericArray % arrow.array.Int8Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = int8(0); end diff --git a/matlab/src/matlab/+arrow/+array/NumericArray.m b/matlab/src/matlab/+arrow/+array/NumericArray.m index 8f465ce425e23..088ccfd6aa53f 100644 --- a/matlab/src/matlab/+arrow/+array/NumericArray.m +++ b/matlab/src/matlab/+arrow/+array/NumericArray.m @@ -16,7 +16,7 @@ classdef NumericArray < arrow.array.Array % arrow.array.NumericArray - properties(Abstract, Access=protected) + properties(Abstract, Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue; end diff --git a/matlab/src/matlab/+arrow/+array/StringArray.m b/matlab/src/matlab/+arrow/+array/StringArray.m index 18fdec9ac70c3..e016aeb704a4d 100644 --- a/matlab/src/matlab/+arrow/+array/StringArray.m +++ b/matlab/src/matlab/+arrow/+array/StringArray.m @@ -16,8 +16,8 @@ classdef StringArray < arrow.array.Array % arrow.array.StringArray - properties (Hidden, SetAccess=private) - NullSubstitionValue = string(missing); + properties (Hidden, GetAccess=public, SetAccess=private) + NullSubstitutionValue = string(missing); end methods @@ -35,7 +35,7 @@ function matlabArray = toMATLAB(obj) matlabArray = obj.Proxy.toMATLAB(); - matlabArray(~obj.Valid) = obj.NullSubstitionValue; + matlabArray(~obj.Valid) = obj.NullSubstitutionValue; end end diff --git a/matlab/src/matlab/+arrow/+array/StructArray.m b/matlab/src/matlab/+arrow/+array/StructArray.m new file mode 100644 index 0000000000000..589e39fecd015 --- /dev/null +++ b/matlab/src/matlab/+arrow/+array/StructArray.m @@ -0,0 +1,146 @@ +% arrow.array.StructArray + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. 
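+%
+% Example usage (illustrative values; see matlab/test/arrow/array/tStructArray.m
+% for tested behavior):
+%
+%   a = arrow.array([1, 2, 3]);
+%   b = arrow.array(["x", "y", "z"]);
+%   s = arrow.array.StructArray.fromArrays(a, b, FieldNames=["A", "B"]);
+%   f = s.field("A");  % extract one field as an arrow.array.Array
+%   t = toMATLAB(s);   % convert to a MATLAB table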
+
+classdef StructArray < arrow.array.Array
+
+    properties (Dependent, GetAccess=public, SetAccess=private)
+        NumFields
+        FieldNames
+    end
+
+    properties (Hidden, Dependent, GetAccess=public, SetAccess=private)
+        NullSubstitutionValue
+    end
+
+    methods
+        function obj = StructArray(proxy)
+            arguments
+                proxy(1, 1) libmexclass.proxy.Proxy {validate(proxy, "arrow.array.proxy.StructArray")}
+            end
+            import arrow.internal.proxy.validate
+            obj@arrow.array.Array(proxy);
+        end
+
+        function numFields = get.NumFields(obj)
+            numFields = obj.Proxy.getNumFields();
+        end
+
+        function fieldNames = get.FieldNames(obj)
+            fieldNames = obj.Proxy.getFieldNames();
+        end
+
+        function F = field(obj, idx)
+            import arrow.internal.validate.*
+
+            idx = index.numericOrString(idx, "int32", AllowNonScalar=false);
+
+            if isnumeric(idx)
+                args = struct(Index=idx);
+                fieldStruct = obj.Proxy.getFieldByIndex(args);
+            else
+                args = struct(Name=idx);
+                fieldStruct = obj.Proxy.getFieldByName(args);
+            end
+
+            traits = arrow.type.traits.traits(arrow.type.ID(fieldStruct.TypeID));
+            proxy = libmexclass.proxy.Proxy(Name=traits.ArrayProxyClassName, ID=fieldStruct.ProxyID);
+            F = traits.ArrayConstructor(proxy);
+        end
+
+        function T = toMATLAB(obj)
+            T = table(obj);
+        end
+
+        function T = table(obj)
+            import arrow.tabular.internal.*
+
+            numFields = obj.NumFields;
+            matlabArrays = cell(1, numFields);
+
+            invalid = ~obj.Valid;
+            numInvalid = nnz(invalid);
+
+            for ii = 1:numFields
+                arrowArray = obj.field(ii);
+                matlabArray = toMATLAB(arrowArray);
+                if numInvalid ~= 0
+                    % MATLAB tables do not support null values themselves.
+                    % So, to encode the StructArray's null values, we
+                    % iterate over each variable in the resulting MATLAB
+                    % table, and for each variable, we set the value of all
+                    % null elements to the "NullSubstitutionValue" that
+                    % corresponds to the variable's type (e.g. NaN for
+                    % double, NaT for datetime, etc.).
+                    matlabArray(invalid, :) = repmat(arrowArray.NullSubstitutionValue, [numInvalid 1]);
+                end
+                matlabArrays{ii} = matlabArray;
+            end
+
+            fieldNames = [obj.Type.Fields.Name];
+            validVariableNames = makeValidVariableNames(fieldNames);
+            validDimensionNames = makeValidDimensionNames(validVariableNames);
+
+            T = table(matlabArrays{:}, ...
+                VariableNames=validVariableNames, ...
+                DimensionNames=validDimensionNames);
+        end
+
+        function nullSubVal = get.NullSubstitutionValue(obj)
+            % Return a cell array containing each field's type-specific
+            % "null" value. For example, NaN is the type-specific null
+            % value for Float32Arrays and Float64Arrays.
+            numFields = obj.NumFields;
+            nullSubVal = cell(1, numFields);
+            for ii = 1:obj.NumFields
+                nullSubVal{ii} = obj.field(ii).NullSubstitutionValue;
+            end
+        end
+    end
+
+    methods (Static)
+        function array = fromArrays(arrowArrays, opts)
+            arguments(Repeating)
+                arrowArrays(1, 1) arrow.array.Array
+            end
+            arguments
+                opts.FieldNames(1, :) string {mustBeNonmissing} = compose("Field%d", 1:numel(arrowArrays))
+                opts.Valid
+            end
+
+            import arrow.tabular.internal.validateArrayLengths
+            import arrow.tabular.internal.validateColumnNames
+            import arrow.array.internal.getArrayProxyIDs
+            import arrow.internal.validate.parseValid
+
+            if numel(arrowArrays) == 0
+                error("arrow:struct:ZeroFields", ...
+ "Must supply at least one field array."); + end + + validateArrayLengths(arrowArrays); + validateColumnNames(opts.FieldNames, numel(arrowArrays)); + validElements = parseValid(opts, arrowArrays{1}.Length); + + arrayProxyIDs = getArrayProxyIDs(arrowArrays); + args = struct(ArrayProxyIDs=arrayProxyIDs, ... + FieldNames=opts.FieldNames, Valid=validElements); + proxyName = "arrow.array.proxy.StructArray"; + proxy = arrow.internal.proxy.create(proxyName, args); + array = arrow.array.StructArray(proxy); + end + end +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+array/Time32Array.m b/matlab/src/matlab/+arrow/+array/Time32Array.m index 85babd26a721a..ae40a3a0b740c 100644 --- a/matlab/src/matlab/+arrow/+array/Time32Array.m +++ b/matlab/src/matlab/+arrow/+array/Time32Array.m @@ -17,7 +17,7 @@ classdef Time32Array < arrow.array.Array - properties(Access=private) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = seconds(NaN); end diff --git a/matlab/src/matlab/+arrow/+array/Time64Array.m b/matlab/src/matlab/+arrow/+array/Time64Array.m index f85eeb1f8f0c9..cd4b948324272 100644 --- a/matlab/src/matlab/+arrow/+array/Time64Array.m +++ b/matlab/src/matlab/+arrow/+array/Time64Array.m @@ -17,7 +17,7 @@ classdef Time64Array < arrow.array.Array - properties(Access=private) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = seconds(NaN); end diff --git a/matlab/src/matlab/+arrow/+array/TimestampArray.m b/matlab/src/matlab/+arrow/+array/TimestampArray.m index 80198f965fe92..9289d0a099f7c 100644 --- a/matlab/src/matlab/+arrow/+array/TimestampArray.m +++ b/matlab/src/matlab/+arrow/+array/TimestampArray.m @@ -16,7 +16,7 @@ classdef TimestampArray < arrow.array.Array % arrow.array.TimestampArray - properties(Access=private) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = NaT; end diff --git a/matlab/src/matlab/+arrow/+array/UInt16Array.m b/matlab/src/matlab/+arrow/+array/UInt16Array.m index 9d3f33c279175..d5487ee130d93 100644 --- a/matlab/src/matlab/+arrow/+array/UInt16Array.m +++ b/matlab/src/matlab/+arrow/+array/UInt16Array.m @@ -16,7 +16,7 @@ classdef UInt16Array < arrow.array.NumericArray % arrow.array.UInt16Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = uint16(0) end diff --git a/matlab/src/matlab/+arrow/+array/UInt32Array.m b/matlab/src/matlab/+arrow/+array/UInt32Array.m index 5235d4fb15576..43c1caac3b791 100644 --- a/matlab/src/matlab/+arrow/+array/UInt32Array.m +++ b/matlab/src/matlab/+arrow/+array/UInt32Array.m @@ -16,7 +16,7 @@ classdef UInt32Array < arrow.array.NumericArray % arrow.array.UInt32Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = uint32(0) end diff --git a/matlab/src/matlab/+arrow/+array/UInt64Array.m b/matlab/src/matlab/+arrow/+array/UInt64Array.m index 2d69bd031ac31..047e7102dd5c5 100644 --- a/matlab/src/matlab/+arrow/+array/UInt64Array.m +++ b/matlab/src/matlab/+arrow/+array/UInt64Array.m @@ -16,7 +16,7 @@ classdef UInt64Array < arrow.array.NumericArray % arrow.array.UInt64Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = uint64(0) end diff --git a/matlab/src/matlab/+arrow/+array/UInt8Array.m b/matlab/src/matlab/+arrow/+array/UInt8Array.m index 3d007376bc89a..901a003161220 100644 --- a/matlab/src/matlab/+arrow/+array/UInt8Array.m +++ 
b/matlab/src/matlab/+arrow/+array/UInt8Array.m @@ -16,7 +16,7 @@ classdef UInt8Array < arrow.array.NumericArray % arrow.array.UInt8Array - properties (Access=protected) + properties (Hidden, GetAccess=public, SetAccess=private) NullSubstitutionValue = uint8(0) end diff --git a/matlab/src/matlab/+arrow/+internal/+test/+tabular/createAllSupportedArrayTypes.m b/matlab/src/matlab/+arrow/+internal/+test/+tabular/createAllSupportedArrayTypes.m index c0bedaf2faf39..d3a751ca46731 100644 --- a/matlab/src/matlab/+arrow/+internal/+test/+tabular/createAllSupportedArrayTypes.m +++ b/matlab/src/matlab/+arrow/+internal/+test/+tabular/createAllSupportedArrayTypes.m @@ -23,6 +23,10 @@ opts.NumRows(1, 1) {mustBeFinite, mustBeNonnegative} = 3; end + % Seed the random number generator to ensure + % reproducible results in tests. + rng(1); + import arrow.type.ID import arrow.array.* @@ -59,6 +63,13 @@ matlabData{ii} = randomDatetimes(opts.NumRows); cmd = compose("%s.fromMATLAB(matlabData{ii})", name); arrowArrays{ii} = eval(cmd); + elseif name == "arrow.array.StructArray" + dates = randomDatetimes(opts.NumRows); + strings = randomStrings(opts.NumRows); + timestampArray = arrow.array(dates); + stringArray = arrow.array(strings); + arrowArrays{ii} = StructArray.fromArrays(timestampArray, stringArray); + matlabData{ii} = table(dates, strings, VariableNames=["Field1", "Field2"]); else error("arrow:test:SupportedArrayCase", ... "Missing if-branch for array class " + name); diff --git a/matlab/src/matlab/+arrow/+internal/+validate/parseValid.m b/matlab/src/matlab/+arrow/+internal/+validate/parseValid.m new file mode 100644 index 0000000000000..3281e24ec1963 --- /dev/null +++ b/matlab/src/matlab/+arrow/+internal/+validate/parseValid.m @@ -0,0 +1,46 @@ +%PARSEVALID Utility function for parsing the Valid name-value pair. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +function validElements = parseValid(opts, numElements) + if ~isfield(opts, "Valid") + % If Valid is not a field in opts, return an empty logical array. + validElements = logical.empty(0, 1); + return; + end + + valid = opts.Valid; + if islogical(valid) + validElements = reshape(valid, [], 1); + if ~isscalar(validElements) + % Verify the logical vector has the correct number of elements + validateattributes(validElements, "logical", {'numel', numElements}); + elseif validElements == false + validElements = false(numElements, 1); + else % validElements == true + % Return an empty logical to represent all elements are valid. + validElements = logical.empty(0, 1); + end + else + % valid is a list of indices. 
Verify the indices are numeric, + % integers, and within the range [1, numElements] + validateattributes(valid, "numeric", {'integer', '>', 0, '<=', numElements}); + % Create a logical vector that contains true values at the indices + % specified by opts.Valid. + validElements = false([numElements 1]); + validElements(valid) = true; + end +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+internal/+validate/parseValidElements.m b/matlab/src/matlab/+arrow/+internal/+validate/parseValidElements.m index 4081f4092740b..8a43dbb4d78e1 100644 --- a/matlab/src/matlab/+arrow/+internal/+validate/parseValidElements.m +++ b/matlab/src/matlab/+arrow/+internal/+validate/parseValidElements.m @@ -21,7 +21,7 @@ % precedence over InferNulls. if isfield(opts, "Valid") - validElements = parseValid(numel(data), opts.Valid); + validElements = arrow.internal.validate.parseValid(opts, numel(data)); else validElements = parseInferNulls(data, opts.InferNulls); end @@ -33,29 +33,6 @@ end end -function validElements = parseValid(numElements, valid) - if islogical(valid) - validElements = reshape(valid, [], 1); - if ~isscalar(validElements) - % Verify the logical vector has the correct number of elements - validateattributes(validElements, "logical", {'numel', numElements}); - elseif validElements == false - validElements = false(numElements, 1); - else % validElements == true - % Return an empty logical to represent all elements are valid. - validElements = logical.empty(0, 1); - end - else - % valid is a list of indices. Verify the indices are numeric, - % integers, and within the range 1 < indices < numElements. - validateattributes(valid, "numeric", {'integer', '>', 0, '<=', numElements}); - % Create a logical vector that contains true values at the indices - % specified by opts.Valid. 
- validElements = false([numElements 1]); - validElements(valid) = true; - end -end - function validElements = parseInferNulls(data, inferNulls) if inferNulls && ~(isinteger(data) || islogical(data)) % Only call ismissing on data types that have a "missing" value, diff --git a/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m b/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m index a8ed98f8ae468..0f8b7b3a2a663 100644 --- a/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m +++ b/matlab/src/matlab/+arrow/+type/+traits/StructTraits.m @@ -16,21 +16,18 @@ classdef StructTraits < arrow.type.traits.TypeTraits properties (Constant) - % TODO: When arrow.array.StructArray is implemented, set these - % properties appropriately - ArrayConstructor = missing - ArrayClassName = missing - ArrayProxyClassName = missing + ArrayConstructor = @arrow.array.StructArray + ArrayClassName = "arrow.array.StructArray" + ArrayProxyClassName = "arrow.array.proxy.StructArray" + + % TODO: Implement fromMATLAB ArrayStaticConstructor = missing TypeConstructor = @arrow.type.StructType TypeClassName = "arrow.type.StructType" TypeProxyClassName = "arrow.type.proxy.StructType" - - % TODO: When arrow.array.StructArray is implemented, set these - % properties appropriately - MatlabConstructor = missing - MatlabClassName = missing + MatlabConstructor = @table + MatlabClassName = "table" end end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+type/StructType.m b/matlab/src/matlab/+arrow/+type/StructType.m index 6c1318f6376f3..331ac75a2ee16 100644 --- a/matlab/src/matlab/+arrow/+type/StructType.m +++ b/matlab/src/matlab/+arrow/+type/StructType.m @@ -33,14 +33,28 @@ end methods (Hidden) - % TODO: Consider using a mixin approach to add this behavior. For - % example, ChunkedArray's toMATLAB method could check if its - % Type inherits from a mixin called "Preallocateable" (or something - % more descriptive). If so, we can call preallocateMATLABArray - % in the toMATLAB method. - function preallocateMATLABArray(~) - error("arrow:type:UnsupportedFunction", ... - "preallocateMATLABArray is not supported for StructType"); - end + function data = preallocateMATLABArray(obj, numElements) + import arrow.tabular.internal.* + + fields = obj.Fields; + + % Construct the VariableNames and VariableDimensionNames + fieldNames = [fields.Name]; + validVariableNames = makeValidVariableNames(fieldNames); + validDimensionNames = makeValidDimensionNames(validVariableNames); + + % Recursively call preallocateMATLABArray to handle + % preallocation of nested types + variableData = cell(1, numel(fields)); + for ii = 1:numel(fields) + type = fields(ii).Type; + variableData{ii} = preallocateMATLABArray(type, numElements); + end + + % Return a table with the appropriate schema and dimensions + data = table(variableData{:}, ... + VariableNames=validVariableNames, ... + DimensionNames=validDimensionNames); + end end end \ No newline at end of file diff --git a/matlab/test/arrow/array/tStructArray.m b/matlab/test/arrow/array/tStructArray.m new file mode 100644 index 0000000000000..639df65befbf5 --- /dev/null +++ b/matlab/test/arrow/array/tStructArray.m @@ -0,0 +1,277 @@ +%TSTRUCTARRAY Unit tests for arrow.array.StructArray + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. 
+% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +classdef tStructArray < matlab.unittest.TestCase + + properties + Float64Array = arrow.array([1 NaN 3 4 5]); + StringArray = arrow.array(["A" "B" "C" "D" missing]); + end + + methods (Test) + function Basic(tc) + import arrow.array.StructArray + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + tc.verifyInstanceOf(array, "arrow.array.StructArray"); + end + + function FieldNames(tc) + % Verify the FieldNames property is set to the expected value. + import arrow.array.StructArray + + % Default field names used + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + tc.verifyEqual(array.FieldNames, ["Field1", "Field2"]); + + % Field names provided + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["A", "B"]); + tc.verifyEqual(array.FieldNames, ["A", "B"]); + + % Duplicate field names provided + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["C", "C"]); + tc.verifyEqual(array.FieldNames, ["C", "C"]); + end + + function FieldNamesError(tc) + % Verify the FieldNames nv-pair errors when expected. + import arrow.array.StructArray + + % Wrong type provided + fcn = @() StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames={table table}); + tc.verifyError(fcn, "MATLAB:validation:UnableToConvert"); + + % Wrong number of field names provided + fcn = @() StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames="A"); + tc.verifyError(fcn, "arrow:tabular:WrongNumberColumnNames"); + + % Missing string provided + fcn = @() StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["A" missing]); + tc.verifyError(fcn, "MATLAB:validators:mustBeNonmissing"); + end + + function FieldNamesNoSetter(tc) + % Verify the FieldNames property is read-only. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]); + fcn = @() setfield(array, "FieldNames", ["A", "B"]); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function NumFields(tc) + % Verify the NumFields property is set to the expected value. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + tc.verifyEqual(array.NumFields, int32(2)); + end + + function NumFieldsNoSetter(tc) + % Verify the NumFields property is read-only. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + fcn = @() setfield(array, "NumFields", 10); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function Valid(tc) + % Verify the Valid property is set to the expected value. 
+ import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + expectedValid = true([5 1]); + tc.verifyEqual(array.Valid, expectedValid); + + % Supply the Valid nv-pair + valid = [true true false true false]; + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, Valid=valid); + tc.verifyEqual(array.Valid, valid'); + end + + function ValidNVPairError(tc) + % Verify the Valid nv-pair errors when expected. + import arrow.array.StructArray + + % Provided an invalid index + fcn = @() StructArray.fromArrays(tc.Float64Array, tc.StringArray, Valid=10); + tc.verifyError(fcn, "MATLAB:notLessEqual"); + + % Provided a logical vector with more elements than the array + % length + fcn = @() StructArray.fromArrays(tc.Float64Array, tc.StringArray, Valid=false([7 1])); + tc.verifyError(fcn, "MATLAB:incorrectNumel"); + end + + function ValidNoSetter(tc) + % Verify the Valid property is read-only. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + fcn = @() setfield(array, "Valid", false); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function Length(tc) + % Verify the Length property is set to the expected value. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + tc.verifyEqual(array.Length, int64(5)); + end + + function LengthNoSetter(tc) + % Verify the Length property is read-only. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray); + fcn = @() setfield(array, "Length", 1); + tc.verifyError(fcn, "MATLAB:class:SetProhibited"); + end + + function Type(tc) + % Verify the Type property is set to the expected value. + import arrow.array.StructArray + + array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]); + field1 = arrow.field("X", arrow.float64()); + field2 = arrow.field("Y", arrow.string()); + expectedType = arrow.struct(field1, field2); + tc.verifyEqual(array.Type, expectedType); + end + + function TypeNoSetter(tc) + % Verify the Type property is read-only. 
+            import arrow.array.StructArray
+
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray);
+            fcn = @() setfield(array, "Type", tc.Float64Array.Type);
+            tc.verifyError(fcn, "MATLAB:class:SetProhibited");
+        end
+
+        function FieldByIndex(tc)
+            import arrow.array.StructArray
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray);
+
+            % Extract 1st field
+            field1 = array.field(1);
+            tc.verifyEqual(field1, tc.Float64Array);
+
+            % Extract 2nd field
+            field2 = array.field(2);
+            tc.verifyEqual(field2, tc.StringArray);
+        end
+
+        function FieldByIndexError(tc)
+            import arrow.array.StructArray
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray);
+
+            % Supply a nonscalar vector
+            fcn = @() array.field([1 2]);
+            tc.verifyError(fcn, "arrow:badsubscript:NonScalar");
+
+            % Supply a noninteger
+            fcn = @() array.field(1.1);
+            tc.verifyError(fcn, "arrow:badsubscript:NonInteger");
+        end
+
+        function FieldByName(tc)
+            import arrow.array.StructArray
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray);
+
+            % Extract 1st field
+            field1 = array.field("Field1");
+            tc.verifyEqual(field1, tc.Float64Array);
+
+            % Extract 2nd field
+            field2 = array.field("Field2");
+            tc.verifyEqual(field2, tc.StringArray);
+        end
+
+        function FieldByNameError(tc)
+            import arrow.array.StructArray
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray);
+
+            % Supply a nonscalar string array
+            fcn = @() array.field(["Field1" "Field2"]);
+            tc.verifyError(fcn, "arrow:badsubscript:NonScalar");
+
+            % Supply a nonexistent field name
+            fcn = @() array.field("B");
+            tc.verifyError(fcn, "arrow:tabular:schema:AmbiguousFieldName");
+        end
+
+        function toMATLAB(tc)
+            % Verify toMATLAB returns the expected MATLAB table
+            import arrow.array.StructArray
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]);
+            expectedTable = table(toMATLAB(tc.Float64Array), toMATLAB(tc.StringArray), VariableNames=["X", "Y"]);
+            actualTable = toMATLAB(array);
+            tc.verifyEqual(actualTable, expectedTable);
+
+            % Verify table elements that correspond to "null" values
+            % in the StructArray are set to the type-specific null values.
+            valid = [1 2 5];
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"], Valid=valid);
+            float64NullValue = tc.Float64Array.NullSubstitutionValue;
+            stringNullValue = tc.StringArray.NullSubstitutionValue;
+            expectedTable([3 4], :) = repmat({float64NullValue stringNullValue}, [2 1]);
+            actualTable = toMATLAB(array);
+            tc.verifyEqual(actualTable, expectedTable);
+        end
+
+        function table(tc)
+            % Verify the table method returns the expected MATLAB table
+            import arrow.array.StructArray
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]);
+            expectedTable = table(toMATLAB(tc.Float64Array), toMATLAB(tc.StringArray), VariableNames=["X", "Y"]);
+            actualTable = table(array);
+            tc.verifyEqual(actualTable, expectedTable);
+
+            % Verify table elements that correspond to "null" values
+            % in the StructArray are set to the type-specific null values.
+            valid = [1 2 5];
+            array = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"], Valid=valid);
+            float64NullValue = tc.Float64Array.NullSubstitutionValue;
+            stringNullValue = tc.StringArray.NullSubstitutionValue;
+            expectedTable([3 4], :) = repmat({float64NullValue stringNullValue}, [2 1]);
+            actualTable = table(array);
+            tc.verifyEqual(actualTable, expectedTable);
+        end
+
+        function IsEqualTrue(tc)
+            % Verify isequal returns true when expected.
+            import arrow.array.StructArray
+            array1 = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]);
+            array2 = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]);
+            tc.verifyTrue(isequal(array1, array2));
+        end
+
+        function IsEqualFalse(tc)
+            % Verify isequal returns false when expected.
+            import arrow.array.StructArray
+            array1 = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["X", "Y"]);
+            array2 = StructArray.fromArrays(tc.StringArray, tc.Float64Array, FieldNames=["X", "Y"]);
+            array3 = StructArray.fromArrays(tc.Float64Array, tc.StringArray, FieldNames=["A", "B"]);
+            % StructArrays have the same FieldNames but the Fields have different types.
+            tc.verifyFalse(isequal(array1, array2));
+            % Fields of the StructArrays have the same types but the StructArrays have different FieldNames.
+            tc.verifyFalse(isequal(array1, array3));
+        end
+
+    end
+end
\ No newline at end of file
diff --git a/matlab/test/arrow/type/traits/tStructTraits.m b/matlab/test/arrow/type/traits/tStructTraits.m
index 6a97b1e1852d6..07833aca162b5 100644
--- a/matlab/test/arrow/type/traits/tStructTraits.m
+++ b/matlab/test/arrow/type/traits/tStructTraits.m
@@ -17,15 +17,15 @@
     properties
         TraitsConstructor = @arrow.type.traits.StructTraits
-        ArrayConstructor = missing
-        ArrayClassName = missing
-        ArrayProxyClassName = missing
+        ArrayConstructor = @arrow.array.StructArray
+        ArrayClassName = "arrow.array.StructArray"
+        ArrayProxyClassName = "arrow.array.proxy.StructArray"
         ArrayStaticConstructor = missing
         TypeConstructor = @arrow.type.StructType
         TypeClassName = "arrow.type.StructType"
         TypeProxyClassName = "arrow.type.proxy.StructType"
-        MatlabConstructor = missing
-        MatlabClassName = missing
+        MatlabConstructor = @table
+        MatlabClassName = "table"
     end
 end
\ No newline at end of file
diff --git a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
index 294612dda370f..149a688b27e15 100644
--- a/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
+++ b/matlab/tools/cmake/BuildMatlabArrowInterface.cmake
@@ -47,6 +47,7 @@ set(MATLAB_ARROW_LIBMEXCLASS_CLIENT_PROXY_SOURCES "${CMAKE_SOURCE_DIR}/src/cpp/a
                                                   "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/timestamp_array.cc"
                                                   "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/time32_array.cc"
                                                   "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/time64_array.cc"
+                                                  "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/struct_array.cc"
                                                   "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/chunked_array.cc"
                                                   "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/array/proxy/wrap.cc"
                                                   "${CMAKE_SOURCE_DIR}/src/cpp/arrow/matlab/tabular/proxy/record_batch.cc"

From 6eb08dd886e6dec691fae5704a3f7e9e89552ae4 Mon Sep 17 00:00:00 2001
From: Tim Schaub
Date: Wed, 20 Sep 2023 19:47:18 -0600
Subject: [PATCH 50/96] GH-35775: [Go][Parquet] Allow key value file metadata
 to be written after writing row groups (#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row
groups. Currently, it is not possible to set the key value file metadata after creating a writer. With the changes in this branch, key value pairs may be added any time before closing the writer. ### What changes are included in this PR? This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`. ### Are these changes tested? Tests are added for the new functionality. ### Are there any user-facing changes? The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`. This is a breaking change. Although the field was exported, setting it did not result in new key value metadata being written. Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option. The `WithWriteMetadata` option can still be used to provide the initial key value metadata values. In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer. The `FileMetadata` field on the parquet `file.Writer` has been removed. Previously, setting this field value had no effect. **This PR includes breaking changes to public APIs.** The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct. Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer. The `FileMetadata` field on the parquet `file.Writer` has been removed. * Closes: #35775 Authored-by: Tim Schaub Signed-off-by: Matt Topol --- go/parquet/file/file_writer.go | 62 +++++++++++++++---------- go/parquet/file/file_writer_test.go | 29 ++++++++++++ go/parquet/metadata/file.go | 5 ++ go/parquet/metadata/metadata_test.go | 35 ++++++++++++++ go/parquet/pqarrow/encode_arrow_test.go | 45 ++++++++++++++++++ go/parquet/pqarrow/file_writer.go | 5 ++ 6 files changed, 156 insertions(+), 25 deletions(-) diff --git a/go/parquet/file/file_writer.go b/go/parquet/file/file_writer.go index 64a21473c293a..c6289434bbe6e 100644 --- a/go/parquet/file/file_writer.go +++ b/go/parquet/file/file_writer.go @@ -41,23 +41,24 @@ type Writer struct { // The Schema of this writer Schema *schema.Schema - // The current FileMetadata to write - FileMetadata *metadata.FileMetaData - // The current keyvalue metadata - KeyValueMetadata metadata.KeyValueMetadata } -type WriteOption func(*Writer) +type writerConfig struct { + props *parquet.WriterProperties + keyValueMetadata metadata.KeyValueMetadata +} + +type WriteOption func(*writerConfig) func WithWriterProps(props *parquet.WriterProperties) WriteOption { - return func(w *Writer) { - w.props = props + return func(c *writerConfig) { + c.props = props } } func WithWriteMetadata(meta metadata.KeyValueMetadata) WriteOption { - return func(w *Writer) { - w.KeyValueMetadata = meta + return func(c *writerConfig) { + c.keyValueMetadata = meta } } @@ -66,19 +67,23 @@ func WithWriteMetadata(meta metadata.KeyValueMetadata) WriteOption { // If props is nil, then the default Writer Properties will be used. If the key value metadata is not nil, // it will be added to the file. 
func NewParquetWriter(w io.Writer, sc *schema.GroupNode, opts ...WriteOption) *Writer { + config := &writerConfig{} + for _, o := range opts { + o(config) + } + if config.props == nil { + config.props = parquet.NewWriterProperties() + } + fileSchema := schema.NewSchema(sc) fw := &Writer{ + props: config.props, sink: &utils.TellWrapper{Writer: w}, open: true, Schema: fileSchema, } - for _, o := range opts { - o(fw) - } - if fw.props == nil { - fw.props = parquet.NewWriterProperties() - } - fw.metadata = *metadata.NewFileMetadataBuilder(fw.Schema, fw.props, fw.KeyValueMetadata) + + fw.metadata = *metadata.NewFileMetadataBuilder(fw.Schema, fw.props, config.keyValueMetadata) fw.startFile() return fw } @@ -154,6 +159,11 @@ func (fw *Writer) startFile() { } } +// AppendKeyValueMetadata appends a key/value pair to the existing key/value metadata +func (fw *Writer) AppendKeyValueMetadata(key string, value string) error { + return fw.metadata.AppendKeyValueMetadata(key, value) +} + // Close closes any open row group writer and writes the file footer. Subsequent // calls to close will have no effect. func (fw *Writer) Close() (err error) { @@ -180,11 +190,12 @@ func (fw *Writer) Close() (err error) { fileEncryptProps := fw.props.FileEncryptionProperties() if fileEncryptProps == nil { // non encrypted file - if fw.FileMetadata, err = fw.metadata.Finish(); err != nil { + fileMetadata, err := fw.metadata.Finish() + if err != nil { return err } - _, err = writeFileMetadata(fw.FileMetadata, fw.sink) + _, err = writeFileMetadata(fileMetadata, fw.sink) return err } @@ -193,12 +204,12 @@ func (fw *Writer) Close() (err error) { return nil } -func (fw *Writer) closeEncryptedFile(props *parquet.FileEncryptionProperties) (err error) { +func (fw *Writer) closeEncryptedFile(props *parquet.FileEncryptionProperties) error { // encrypted file with encrypted footer if props.EncryptedFooter() { - fw.FileMetadata, err = fw.metadata.Finish() + fileMetadata, err := fw.metadata.Finish() if err != nil { - return + return err } footerLen := int64(0) @@ -211,7 +222,7 @@ func (fw *Writer) closeEncryptedFile(props *parquet.FileEncryptionProperties) (e footerLen += n footerEncryptor := fw.fileEncryptor.GetFooterEncryptor() - n, err = writeEncryptedFileMetadata(fw.FileMetadata, fw.sink, footerEncryptor, true) + n, err = writeEncryptedFileMetadata(fileMetadata, fw.sink, footerEncryptor, true) if err != nil { return err } @@ -224,11 +235,12 @@ func (fw *Writer) closeEncryptedFile(props *parquet.FileEncryptionProperties) (e return err } } else { - if fw.FileMetadata, err = fw.metadata.Finish(); err != nil { - return + fileMetadata, err := fw.metadata.Finish() + if err != nil { + return err } footerSigningEncryptor := fw.fileEncryptor.GetFooterSigningEncryptor() - if _, err = writeEncryptedFileMetadata(fw.FileMetadata, fw.sink, footerSigningEncryptor, false); err != nil { + if _, err = writeEncryptedFileMetadata(fileMetadata, fw.sink, footerSigningEncryptor, false); err != nil { return err } } diff --git a/go/parquet/file/file_writer_test.go b/go/parquet/file/file_writer_test.go index 0cca1cd40d4c9..af083ebe60e4f 100644 --- a/go/parquet/file/file_writer_test.go +++ b/go/parquet/file/file_writer_test.go @@ -30,6 +30,7 @@ import ( "github.com/apache/arrow/go/v14/parquet/internal/testutils" "github.com/apache/arrow/go/v14/parquet/schema" "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" "github.com/stretchr/testify/suite" ) @@ -371,6 +372,34 @@ func TestAllNulls(t *testing.T) { assert.Equal(t, []int16{0, 0, 
0}, defLevels[:]) } +func TestKeyValueMetadata(t *testing.T) { + fields := schema.FieldList{ + schema.NewInt32Node("unused", parquet.Repetitions.Optional, -1), + } + sc, _ := schema.NewGroupNode("root", parquet.Repetitions.Required, fields, -1) + sink := encoding.NewBufferWriter(0, memory.DefaultAllocator) + + writer := file.NewParquetWriter(sink, sc) + + testKey := "testKey" + testValue := "testValue" + writer.AppendKeyValueMetadata(testKey, testValue) + writer.Close() + + buffer := sink.Finish() + defer buffer.Release() + props := parquet.NewReaderProperties(memory.DefaultAllocator) + props.BufferedStreamEnabled = true + + reader, err := file.NewParquetReader(bytes.NewReader(buffer.Bytes()), file.WithReadProps(props)) + assert.NoError(t, err) + + metadata := reader.MetaData() + got := metadata.KeyValueMetadata().FindValue(testKey) + require.NotNil(t, got) + assert.Equal(t, testValue, *got) +} + func createSerializeTestSuite(typ reflect.Type) suite.TestingSuite { return &SerializeTestSuite{PrimitiveTypedTest: testutils.NewPrimitiveTypedTest(typ)} } diff --git a/go/parquet/metadata/file.go b/go/parquet/metadata/file.go index efe3c01c25b33..dddd95c5df670 100644 --- a/go/parquet/metadata/file.go +++ b/go/parquet/metadata/file.go @@ -95,6 +95,11 @@ func (f *FileMetaDataBuilder) AppendRowGroup() *RowGroupMetaDataBuilder { return f.currentRgBldr } +// AppendKeyValueMetadata appends a key/value pair to the existing key/value metadata +func (f *FileMetaDataBuilder) AppendKeyValueMetadata(key string, value string) error { + return f.kvmeta.Append(key, value) +} + // Finish will finalize the metadata of the number of rows, row groups, // version etc. This will clear out this filemetadatabuilder so it can // be re-used diff --git a/go/parquet/metadata/metadata_test.go b/go/parquet/metadata/metadata_test.go index 0db64d88ab0f4..b685dd2223274 100644 --- a/go/parquet/metadata/metadata_test.go +++ b/go/parquet/metadata/metadata_test.go @@ -272,6 +272,41 @@ func TestKeyValueMetadata(t *testing.T) { assert.True(t, faccessor.KeyValueMetadata().Equals(kvmeta)) } +func TestKeyValueMetadataAppend(t *testing.T) { + props := parquet.NewWriterProperties(parquet.WithVersion(parquet.V1_0)) + + fields := schema.FieldList{ + schema.NewInt32Node("int_col", parquet.Repetitions.Required, -1), + schema.NewFloat32Node("float_col", parquet.Repetitions.Required, -1), + } + root, err := schema.NewGroupNode("schema", parquet.Repetitions.Repeated, fields, -1) + require.NoError(t, err) + schema := schema.NewSchema(root) + + kvmeta := metadata.NewKeyValueMetadata() + key1 := "test_key1" + value1 := "test_value1" + require.NoError(t, kvmeta.Append(key1, value1)) + + fbuilder := metadata.NewFileMetadataBuilder(schema, props, kvmeta) + + key2 := "test_key2" + value2 := "test_value2" + require.NoError(t, fbuilder.AppendKeyValueMetadata(key2, value2)) + faccessor, err := fbuilder.Finish() + require.NoError(t, err) + + kv := faccessor.KeyValueMetadata() + + got1 := kv.FindValue(key1) + require.NotNil(t, got1) + assert.Equal(t, value1, *got1) + + got2 := kv.FindValue(key2) + require.NotNil(t, got2) + assert.Equal(t, value2, *got2) +} + func TestApplicationVersion(t *testing.T) { version := metadata.NewAppVersion("parquet-mr version 1.7.9") version1 := metadata.NewAppVersion("parquet-mr version 1.8.0") diff --git a/go/parquet/pqarrow/encode_arrow_test.go b/go/parquet/pqarrow/encode_arrow_test.go index 654d3d813cf85..3c20cf2d4757b 100644 --- a/go/parquet/pqarrow/encode_arrow_test.go +++ b/go/parquet/pqarrow/encode_arrow_test.go @@ 
-360,6 +360,51 @@ func simpleRoundTrip(t *testing.T, tbl arrow.Table, rowGroupSize int64) {
 	}
 }
 
+func TestWriteKeyValueMetadata(t *testing.T) {
+	kv := map[string]string{
+		"key1": "value1",
+		"key2": "value2",
+		"key3": "value3",
+	}
+
+	sc := arrow.NewSchema([]arrow.Field{
+		{Name: "int32", Type: arrow.PrimitiveTypes.Int32, Nullable: true},
+	}, nil)
+	bldr := array.NewRecordBuilder(memory.DefaultAllocator, sc)
+	defer bldr.Release()
+	for _, b := range bldr.Fields() {
+		b.AppendNull()
+	}
+
+	rec := bldr.NewRecord()
+	defer rec.Release()
+
+	props := parquet.NewWriterProperties(
+		parquet.WithVersion(parquet.V1_0),
+	)
+	var buf bytes.Buffer
+	fw, err := pqarrow.NewFileWriter(sc, &buf, props, pqarrow.DefaultWriterProps())
+	require.NoError(t, err)
+	err = fw.Write(rec)
+	require.NoError(t, err)
+
+	for key, value := range kv {
+		require.NoError(t, fw.AppendKeyValueMetadata(key, value))
+	}
+
+	err = fw.Close()
+	require.NoError(t, err)
+
+	reader, err := file.NewParquetReader(bytes.NewReader(buf.Bytes()))
+	require.NoError(t, err)
+
+	for key, value := range kv {
+		got := reader.MetaData().KeyValueMetadata().FindValue(key)
+		require.NotNil(t, got)
+		assert.Equal(t, value, *got)
+	}
+}
+
 func TestWriteEmptyLists(t *testing.T) {
 	sc := arrow.NewSchema([]arrow.Field{
 		{Name: "f1", Type: arrow.ListOf(arrow.FixedWidthTypes.Date32)},
diff --git a/go/parquet/pqarrow/file_writer.go b/go/parquet/pqarrow/file_writer.go
index 052220e716c77..aa0bae7b1fdfb 100644
--- a/go/parquet/pqarrow/file_writer.go
+++ b/go/parquet/pqarrow/file_writer.go
@@ -272,6 +272,11 @@ func (fw *FileWriter) WriteTable(tbl arrow.Table, chunkSize int64) error {
 	return nil
 }
 
+// AppendKeyValueMetadata appends a key/value pair to the existing key/value metadata
+func (fw *FileWriter) AppendKeyValueMetadata(key string, value string) error {
+	return fw.wr.AppendKeyValueMetadata(key, value)
+}
+
 // Close flushes out the data and closes the file. It can be called multiple times,
 // subsequent calls after the first will have no effect.
 func (fw *FileWriter) Close() error {

From 6cd34f3cc4a5b75b51cab926348c6c7fc6172955 Mon Sep 17 00:00:00 2001
From: Chris Jordan-Squire <788080+chrisjordansquire@users.noreply.github.com>
Date: Wed, 20 Sep 2023 23:48:54 -0400
Subject: [PATCH 51/96] GH-35095: [C++] Prevent write after close in
 arrow::ipc::IpcFormatWriter (#37783)

This addresses GH-35095 by adding a flag to IpcFormatWriter to track when a
writer has been closed, and checking this flag before writes.

### Rationale for this change

This addresses #35095, preventing stream and file IPC writers from writing
record batches once the IPC writer has been closed.

### What changes are included in this PR?

A flag so that an IpcFormatWriter can track when it has been closed, a check
before writes in IpcFormatWriter, and two tests to confirm the new behavior
works as expected.

### Are these changes tested?

Yes, the changes are tested. The two tests were added, and the C++ test suite
was run. No unexpected failures appeared.

### Are there any user-facing changes?

Other than newly returning an invalid status when writing after close, no,
there should not be any user-facing changes.
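
For illustration, here is a minimal sketch of the behavior this change enforces (not part of the patch; it assumes only the public Arrow C++ IPC API — `MakeStreamWriter` and `BufferOutputStream` — with a caller-supplied schema and batch):

```cpp
#include <arrow/io/memory.h>
#include <arrow/ipc/writer.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Demonstrates that WriteRecordBatch() is rejected once the writer is closed.
arrow::Status WriteAfterCloseDemo(const std::shared_ptr<arrow::Schema>& schema,
                                  const std::shared_ptr<arrow::RecordBatch>& batch) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
  ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeStreamWriter(sink, schema));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));  // OK: writer is open
  ARROW_RETURN_NOT_OK(writer->Close());
  // With this change, the next write returns an Invalid status
  // ("Destination already closed") instead of silently appending bytes.
  arrow::Status st = writer->WriteRecordBatch(*batch);
  return st.IsInvalid() ? arrow::Status::OK()
                        : arrow::Status::UnknownError("expected Invalid");
}
```
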
* Closes: #35095

Lead-authored-by: Chris Jordan-Squire
Co-authored-by: Sutou Kouhei
Signed-off-by: Sutou Kouhei
---
 cpp/src/arrow/ipc/read_write_test.cc | 19 +++++++++++++++++++
 cpp/src/arrow/ipc/writer.cc          |  8 +++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/ipc/read_write_test.cc b/cpp/src/arrow/ipc/read_write_test.cc
index 69b827b8fe78d..3ae007c20efe7 100644
--- a/cpp/src/arrow/ipc/read_write_test.cc
+++ b/cpp/src/arrow/ipc/read_write_test.cc
@@ -1519,6 +1519,22 @@ class ReaderWriterMixin : public ExtensionTypesMixin {
     }
   }
 
+  void TestWriteAfterClose() {
+    // Part of GH-35095.
+    std::shared_ptr<RecordBatch> batch_ints;
+    ASSERT_OK(MakeIntRecordBatch(&batch_ints));
+
+    auto schema = batch_ints->schema();
+
+    WriterHelper writer_helper;
+    ASSERT_OK(writer_helper.Init(schema, IpcWriteOptions::Defaults()));
+    ASSERT_OK(writer_helper.WriteBatch(batch_ints));
+    ASSERT_OK(writer_helper.Finish());
+
+    // Write after close raises status
+    ASSERT_RAISES(Invalid, writer_helper.WriteBatch(batch_ints));
+  }
+
   void TestWriteDifferentSchema() {
     // Test writing batches with a different schema than the RecordBatchWriter
     // was initialized with.
@@ -1991,6 +2007,9 @@ TEST_F(TestFileFormatGenerator, DictionaryRoundTrip) { TestDictionaryRoundtrip()
 TEST_F(TestFileFormatGeneratorCoalesced, DictionaryRoundTrip) {
   TestDictionaryRoundtrip();
 }
+TEST_F(TestFileFormat, WriteAfterClose) { TestWriteAfterClose(); }
+
+TEST_F(TestStreamFormat, WriteAfterClose) { TestWriteAfterClose(); }
 
 TEST_F(TestStreamFormat, DifferentSchema) { TestWriteDifferentSchema(); }
 
diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc
index 1d230601566a0..e4b49ed56464e 100644
--- a/cpp/src/arrow/ipc/writer.cc
+++ b/cpp/src/arrow/ipc/writer.cc
@@ -1070,6 +1070,9 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
   Status WriteRecordBatch(
       const RecordBatch& batch,
       const std::shared_ptr<const KeyValueMetadata>& custom_metadata) override {
+    if (closed_) {
+      return Status::Invalid("Destination already closed");
+    }
     if (!batch.schema()->Equals(schema_, false /* check_metadata */)) {
       return Status::Invalid("Tried to write record batch with different schema");
     }
@@ -1101,7 +1104,9 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
 
   Status Close() override {
     RETURN_NOT_OK(CheckStarted());
-    return payload_writer_->Close();
+    RETURN_NOT_OK(payload_writer_->Close());
+    closed_ = true;
+    return Status::OK();
   }
 
   Status Start() {
@@ -1213,6 +1218,7 @@ class ARROW_EXPORT IpcFormatWriter : public RecordBatchWriter {
   std::unordered_map<int64_t, std::shared_ptr<Array>> last_dictionaries_;
 
   bool started_ = false;
+  bool closed_ = false;
   IpcWriteOptions options_;
   WriteStats stats_;
 };

From 79e49dbfb71efc70555417ba19cb612eb50924e8 Mon Sep 17 00:00:00 2001
From: Alenka Frim
Date: Thu, 21 Sep 2023 11:32:06 +0200
Subject: [PATCH 52/96] GH-37803: [CI][Dev][Python] Release and merge script
 errors (#37819)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What changes are included in this PR?

Pin the version of `setuptools_scm` to `<8.0.0`.
* Closes: #37803 Authored-by: AlenkaF Signed-off-by: Raúl Cumplido --- ci/conda_env_archery.txt | 2 +- ci/conda_env_crossbow.txt | 2 +- ci/conda_env_python.txt | 2 +- dev/archery/setup.py | 2 +- dev/tasks/conda-recipes/arrow-cpp/meta.yaml | 4 ++-- python/pyproject.toml | 2 +- python/requirements-build.txt | 2 +- python/requirements-wheel-build.txt | 2 +- python/setup.py | 2 +- 9 files changed, 10 insertions(+), 10 deletions(-) diff --git a/ci/conda_env_archery.txt b/ci/conda_env_archery.txt index ace7a42acb026..40875e0a55039 100644 --- a/ci/conda_env_archery.txt +++ b/ci/conda_env_archery.txt @@ -25,7 +25,7 @@ jira pygit2 pygithub ruamel.yaml -setuptools_scm +setuptools_scm<8.0.0 toolz # benchmark diff --git a/ci/conda_env_crossbow.txt b/ci/conda_env_crossbow.txt index 347294650ca28..59b799720f12b 100644 --- a/ci/conda_env_crossbow.txt +++ b/ci/conda_env_crossbow.txt @@ -21,5 +21,5 @@ jinja2 jira pygit2 ruamel.yaml -setuptools_scm +setuptools_scm<8.0.0 toolz diff --git a/ci/conda_env_python.txt b/ci/conda_env_python.txt index 4ae5c3614a1dc..d914229ec58c0 100644 --- a/ci/conda_env_python.txt +++ b/ci/conda_env_python.txt @@ -28,4 +28,4 @@ pytest-faulthandler pytest-lazy-fixture s3fs>=2021.8.0 setuptools -setuptools_scm +setuptools_scm<8.0.0 diff --git a/dev/archery/setup.py b/dev/archery/setup.py index 08e41225f673a..e2c89ae204bd6 100755 --- a/dev/archery/setup.py +++ b/dev/archery/setup.py @@ -30,7 +30,7 @@ extras = { 'benchmark': ['pandas'], 'crossbow': ['github3.py', jinja_req, 'pygit2>=1.6.0', 'requests', - 'ruamel.yaml', 'setuptools_scm'], + 'ruamel.yaml', 'setuptools_scm<8.0.0'], 'crossbow-upload': ['github3.py', jinja_req, 'ruamel.yaml', 'setuptools_scm'], 'docker': ['ruamel.yaml', 'python-dotenv'], diff --git a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml index ac4b29eb5ee7e..fbe40af3dae01 100644 --- a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml +++ b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml @@ -244,7 +244,7 @@ outputs: - numpy - python - setuptools - - setuptools_scm + - setuptools_scm <8.0.0 run: # - {{ pin_subpackage('libarrow', exact=True) }} - libarrow ={{ version }}=*_{{ PKG_BUILDNUM }}_{{ build_ext }} @@ -327,7 +327,7 @@ outputs: - numpy - python - setuptools - - setuptools_scm + - setuptools_scm <8.0.0 run: - {{ pin_subpackage('pyarrow', exact=True) }} - python diff --git a/python/pyproject.toml b/python/pyproject.toml index 7e61304585809..a1de6ac4f1c7e 100644 --- a/python/pyproject.toml +++ b/python/pyproject.toml @@ -19,7 +19,7 @@ requires = [ "cython >= 0.29.31,<3", "oldest-supported-numpy>=0.14", - "setuptools_scm", + "setuptools_scm < 8.0.0", "setuptools >= 40.1.0", "wheel" ] diff --git a/python/requirements-build.txt b/python/requirements-build.txt index 6378d1b94e1bb..efd653ec470d5 100644 --- a/python/requirements-build.txt +++ b/python/requirements-build.txt @@ -1,4 +1,4 @@ cython>=0.29.31,<3 oldest-supported-numpy>=0.14 -setuptools_scm +setuptools_scm<8.0.0 setuptools>=38.6.0 diff --git a/python/requirements-wheel-build.txt b/python/requirements-wheel-build.txt index e4f5243fbc2fe..00504b0c731a1 100644 --- a/python/requirements-wheel-build.txt +++ b/python/requirements-wheel-build.txt @@ -1,5 +1,5 @@ cython>=0.29.31,<3 oldest-supported-numpy>=0.14 -setuptools_scm +setuptools_scm<8.0.0 setuptools>=58 wheel diff --git a/python/setup.py b/python/setup.py index abd9d03cfb17e..062aac307b1e4 100755 --- a/python/setup.py +++ b/python/setup.py @@ -492,7 +492,7 @@ def has_ext_modules(foo): 'pyarrow/_generated_version.py'), 
'version_scheme': guess_next_dev_version }, - setup_requires=['setuptools_scm', 'cython >= 0.29.31,<3'] + setup_requires, + setup_requires=['setuptools_scm < 8.0.0', 'cython >= 0.29.31,<3'] + setup_requires, install_requires=install_requires, tests_require=['pytest', 'pandas', 'hypothesis'], python_requires='>=3.8', From 9d6d501fb654a93496070b59fbe7f36f4f1ed604 Mon Sep 17 00:00:00 2001 From: Benjamin Kietzman Date: Thu, 21 Sep 2023 08:00:57 -0400 Subject: [PATCH 53/96] GH-35627: [Format][Integration] Add string-view to arrow format (#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from https://github.com/apache/arrow/pull/35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: #35627 Authored-by: Benjamin Kietzman Signed-off-by: Benjamin Kietzman --- dev/archery/archery/integration/datagen.py | 105 +++++++++++++++++++ docs/source/format/Columnar.rst | 112 ++++++++++++++++++--- format/Message.fbs | 16 +++ format/Schema.fbs | 24 +++++ 4 files changed, 243 insertions(+), 14 deletions(-) diff --git a/dev/archery/archery/integration/datagen.py b/dev/archery/archery/integration/datagen.py index 53f7ba58bff99..5ac32da56a8de 100644 --- a/dev/archery/archery/integration/datagen.py +++ b/dev/archery/archery/integration/datagen.py @@ -665,6 +665,26 @@ def _get_type(self): return OrderedDict([('name', 'largeutf8')]) +class BinaryViewField(BinaryField): + + @property + def column_class(self): + return BinaryViewColumn + + def _get_type(self): + return OrderedDict([('name', 'binaryview')]) + + +class StringViewField(StringField): + + @property + def column_class(self): + return StringViewColumn + + def _get_type(self): + return OrderedDict([('name', 'utf8view')]) + + class Schema(object): def __init__(self, fields, metadata=None): @@ -744,6 +764,74 @@ class LargeStringColumn(_BaseStringColumn, _LargeOffsetsMixin): pass +class BinaryViewColumn(PrimitiveColumn): + + def _encode_value(self, x): + return frombytes(binascii.hexlify(x).upper()) + + def _get_buffers(self): + views = [] + data_buffers = [] + # a small default data buffer size is used so we can exercise + # arrays with multiple data buffers with small data sets + DEFAULT_BUFFER_SIZE = 32 + INLINE_SIZE = 12 + + for i, v in enumerate(self.values): + if not self.is_valid[i]: + v = b'' + assert isinstance(v, bytes) + + if len(v) <= INLINE_SIZE: + # Append an inline view, skip data buffer management. 
+ views.append(OrderedDict([ + ('SIZE', len(v)), + ('INLINED', self._encode_value(v)), + ])) + continue + + if len(data_buffers) == 0: + # No data buffers have been added yet; + # add this string whole (we may append to it later). + offset = 0 + data_buffers.append(v) + elif len(data_buffers[-1]) + len(v) > DEFAULT_BUFFER_SIZE: + # Appending this string to the current active data buffer + # would overflow the default buffer size; add it whole. + offset = 0 + data_buffers.append(v) + else: + # Append this string to the current active data buffer. + offset = len(data_buffers[-1]) + data_buffers[-1] += v + + # the prefix is always 4 bytes so it may not be utf-8 + # even if the whole string view is + prefix = frombytes(binascii.hexlify(v[:4]).upper()) + + views.append(OrderedDict([ + ('SIZE', len(v)), + ('PREFIX_HEX', prefix), + ('BUFFER_INDEX', len(data_buffers) - 1), + ('OFFSET', offset), + ])) + + return [ + ('VALIDITY', [int(x) for x in self.is_valid]), + ('VIEWS', views), + ('VARIADIC_DATA_BUFFERS', [ + frombytes(binascii.hexlify(b).upper()) + for b in data_buffers + ]), + ] + + +class StringViewColumn(BinaryViewColumn): + + def _encode_value(self, x): + return frombytes(x) + + class FixedSizeBinaryColumn(PrimitiveColumn): def _encode_value(self, x): @@ -1568,6 +1656,15 @@ def generate_run_end_encoded_case(): return _generate_file("run_end_encoded", fields, batch_sizes) +def generate_binary_view_case(): + fields = [ + BinaryViewField('bv'), + StringViewField('sv'), + ] + batch_sizes = [0, 7, 256] + return _generate_file("binary_view", fields, batch_sizes) + + def generate_nested_large_offsets_case(): fields = [ LargeListField('large_list_nullable', get_field('item', 'int32')), @@ -1763,6 +1860,14 @@ def _temp_path(): .skip_tester('JS') .skip_tester('Rust'), + generate_binary_view_case() + .skip_tester('C++') + .skip_tester('C#') + .skip_tester('Go') + .skip_tester('Java') + .skip_tester('JS') + .skip_tester('Rust'), + generate_extension_case() .skip_tester('C#') # TODO: ensure the extension is registered in the C++ entrypoint diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 3390f1b7b5f2c..afbe2a08ee28c 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -21,7 +21,7 @@ Arrow Columnar Format ********************* -*Version: 1.3* +*Version: 1.4* The "Arrow Columnar Format" includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol @@ -108,6 +108,10 @@ the different physical layouts defined by Arrow: * **Variable-size Binary**: a sequence of values each having a variable byte length. Two variants of this layout are supported using 32-bit and 64-bit length encoding. +* **Views of Variable-size Binary**: a sequence of values each having a + variable byte length. In contrast to Variable-size Binary, the values + of this layout are distributed across potentially multiple buffers + instead of densely and sequentially packed in a single buffer. * **Fixed-size List**: a nested layout where each value has the same number of elements taken from a child data type. * **Variable-size List**: a nested layout where each value is a @@ -350,6 +354,51 @@ will be represented as follows: :: |----------------|-----------------------| | joemark | unspecified (padding) | +Variable-size Binary View Layout +-------------------------------- + +.. versionadded:: Arrow Columnar Format 1.4 + +Each value in this layout consists of 0 or more bytes. 
These bytes'
+locations are indicated using a **views** buffer, which may point to one
+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:
+
+::
+
+    * Short strings, length <= 12
+    | Bytes 0-3  | Bytes 4-15                            |
+    |------------|---------------------------------------|
+    | length     | data (padded with 0)                  |
+
+    * Long strings, length > 12
+    | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
+    |------------|------------|------------|-------------|
+    | length     | prefix     | buf. index | offset      |
+
+In both the long and short string cases, the first four bytes encode the
+length of the string and can be used to determine how the rest of the view
+should be interpreted.
+
+In the short string case, the string's bytes are inlined: stored inside the
+view itself, in the twelve bytes that follow the length.
+
+In the long string case, a buffer index indicates which data buffer
+stores the data bytes and an offset indicates where in that buffer the
+data bytes begin. Buffer index 0 refers to the first data buffer, i.e.
+the first buffer **after** the validity buffer and the views buffer.
+The half-open range ``[offset, offset + length)`` must be entirely contained
+within the indicated buffer. A copy of the first four bytes of the string is
+stored inline in the prefix, after the length. This prefix enables a
+profitable fast path for string comparisons, which are frequently determined
+within the first four bytes.
+
+All integers (length, buffer index, and offset) are signed.
+
+This layout is adapted from TU Munich's `UmbraDB`_.
+
 .. _variable-size-list-layout:
 
 Variable-size List Layout
@@ -880,19 +929,20 @@ For the avoidance of ambiguity, we provide listing the order and type of memory
 buffers for each layout.
 
 .. csv-table:: Buffer Layouts
-   :header: "Layout Type", "Buffer 0", "Buffer 1", "Buffer 2"
-   :widths: 30, 20, 20, 20
-
-   "Primitive",validity,data,
-   "Variable Binary",validity,offsets,data
-   "List",validity,offsets,
-   "Fixed-size List",validity,,
-   "Struct",validity,,
-   "Sparse Union",type ids,,
-   "Dense Union",type ids,offsets,
-   "Null",,,
-   "Dictionary-encoded",validity,data (indices),
-   "Run-end encoded",,,
+   :header: "Layout Type", "Buffer 0", "Buffer 1", "Buffer 2", "Variadic Buffers"
+   :widths: 30, 20, 20, 20, 20
+
+   "Primitive",validity,data,,
+   "Variable Binary",validity,offsets,data,
+   "Variable Binary View",validity,views,,data
+   "List",validity,offsets,,
+   "Fixed-size List",validity,,,
+   "Struct",validity,,,
+   "Sparse Union",type ids,,,
+   "Dense Union",type ids,offsets,,
+   "Null",,,,
+   "Dictionary-encoded",validity,data (indices),,
+   "Run-end encoded",,,,
 
 Logical Types
 =============
@@ -1071,6 +1121,39 @@ bytes. Since this metadata can be used to communicate in-memory pointer
 addresses between libraries, it is recommended to set ``size`` to the actual
 memory size rather than the padded size.
 
+Variadic buffers
+^^^^^^^^^^^^^^^^
+
+Some types such as Utf8View are represented using a variable number of buffers.
+For each such Field in the pre-ordered flattened logical schema, there will be
+an entry in ``variadicBufferCounts`` to indicate the number of variadic buffers
+which belong to that Field in the current RecordBatch.
+
+For example, consider the schema ::
+
+    col1: Struct<a: Int32, b: BinaryView, c: Float64>
+    col2: Utf8View
+
+This has two fields with variadic buffers, so ``variadicBufferCounts`` will
+have two entries in each RecordBatch. For a RecordBatch of this schema with
For a RecordBatch of this schema with +``variadicBufferCounts = [3, 2]``, the flattened buffers would be:: + + buffer 0: col1 validity + buffer 1: col1.a validity + buffer 2: col1.a values + buffer 3: col1.b validity + buffer 4: col1.b views + buffer 5: col1.b data + buffer 6: col1.b data + buffer 7: col1.b data + buffer 8: col1.c validity + buffer 9: col1.c values + buffer 10: col2 validity + buffer 11: col2 views + buffer 12: col2 data + buffer 13: col2 data + + Byte Order (`Endianness`_) --------------------------- @@ -1346,3 +1429,4 @@ the Arrow spec. .. _Endianness: https://en.wikipedia.org/wiki/Endianness .. _SIMD: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-introduction-to-the-simd-data-layout-templates .. _Parquet: https://parquet.apache.org/docs/ +.. _UmbraDB: https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf diff --git a/format/Message.fbs b/format/Message.fbs index 170ea8fbced89..92a629f3f9d95 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -99,6 +99,22 @@ table RecordBatch { /// Optional compression of the message body compression: BodyCompression; + + /// Some types such as Utf8View are represented using a variable number of buffers. + /// For each such Field in the pre-ordered flattened logical schema, there will be + /// an entry in variadicBufferCounts to indicate the number of number of variadic + /// buffers which belong to that Field in the current RecordBatch. + /// + /// For example, the schema + /// col1: Struct + /// col2: Utf8View + /// contains two Fields with variadic buffers so variadicBufferCounts will have + /// two entries, the first counting the variadic buffers of `col1.b` and the + /// second counting `col2`'s. + /// + /// This field may be omitted if and only if the schema contains no Fields with + /// a variable number of buffers, such as BinaryView and Utf8View. + variadicBufferCounts: [long]; } /// For sending dictionary encoding information. Any Field can be diff --git a/format/Schema.fbs b/format/Schema.fbs index ce29c25b7d1c8..fdaf623931760 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -22,6 +22,7 @@ /// Version 1.1 - Add Decimal256. /// Version 1.2 - Add Interval MONTH_DAY_NANO. /// Version 1.3 - Add Run-End Encoded. +/// Version 1.4 - Add BinaryView, Utf8View, and variadicBufferCounts. namespace org.apache.arrow.flatbuf; @@ -171,6 +172,27 @@ table LargeUtf8 { table LargeBinary { } +/// Logically the same as Utf8, but the internal representation uses a view +/// struct that contains the string length and either the string's entire data +/// inline (for small strings) or an inlined prefix, an index of another buffer, +/// and an offset pointing to a slice in that buffer (for non-small strings). +/// +/// Since it uses a variable number of data buffers, each Field with this type +/// must have a corresponding entry in `variadicBufferCounts`. +table Utf8View { +} + +/// Logically the same as Binary, but the internal representation uses a header +/// struct that contains the string length and either the string's entire data +/// inline (for small strings) or an inlined prefix, an index of another buffer, +/// and an offset pointing to a slice in that buffer (for non-small strings). +/// +/// Since it uses a variable number of data buffers, each Field with this type +/// must have a corresponding entry in `variadicBufferCounts`. 
+table BinaryView {
+}
+
+
 table FixedSizeBinary {
   /// Number of bytes per value
   byteWidth: int;
@@ -427,6 +449,8 @@ union Type {
   LargeUtf8,
   LargeList,
   RunEndEncoded,
+  BinaryView,
+  Utf8View,
 }
 
 /// ----------------------------------------------------------------------

From e83c23bb86156ef32d073c8eb507d601847350e1 Mon Sep 17 00:00:00 2001
From: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Date: Thu, 21 Sep 2023 11:06:38 -0400
Subject: [PATCH 54/96] GH-36730: [Python] Add support for Cython 3.0.0 (#37097)

### Rationale for this change

Cython 3.0.0 is the latest release. PyArrow should work with Cython 3.0.0. **Cython 3 is not enabled in this diff.**

### What changes are included in this PR?

* Don't use `vector[XXX]&&`
* Add a declaration for `postincrement`
  * See also: https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#c-postincrement-postdecrement-operator
* Ignore `C4551` warning (function call missing argument list) with MSVC
  * See also: https://github.com/cython/cython/issues/4445
* Add missing `const` to `CLocation`'s static methods.
* Don't use `StopIteration` to stop a generator
  * See also: https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#python-3-syntax-semantics
* Non-extern `cdef` functions will now propagate Python exceptions automatically unless explicitly labeled `noexcept`
* Function binding in Cython is now enabled by default. Class methods that are used as wrappers for pickling should be converted to staticmethods.
* numpydoc now validates more Cython 3 objects than Cython <3
  * Enum types are now being validated, and some unhelpful validation checks on Enums are now ignored
* Added a Cython <3 nightly CI job

Note:
* Cython 3.0.0, 3.0.1, and 3.0.2 have an issue when compiling with debug mode https://github.com/cython/cython/issues/5552

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: #36730

Lead-authored-by: Dane Pitkin
Co-authored-by: Sutou Kouhei
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: AlenkaF
---
 ci/docker/conda-python-cython2.dockerfile   | 24 +++++++++++++++++
 dev/archery/archery/lang/python.py          |  9 +++++++
 dev/tasks/tasks.yml                         |  8 ++++++
 docker-compose.yml                          | 25 ++++++++++++++++++
 python/CMakeLists.txt                       | 29 +++++++++++++--------
 python/pyarrow/_dataset.pyx                 |  3 ++-
 python/pyarrow/_flight.pyx                  | 10 ++++---
 python/pyarrow/_substrait.pyx               |  3 ++-
 python/pyarrow/includes/libarrow_flight.pxd | 15 +++++++----
 python/pyarrow/ipc.pxi                      | 15 ++++++-----
 python/pyarrow/scalar.pxi                   |  4 +--
 python/pyarrow/tests/test_dataset.py        |  6 ++++-
 python/pyarrow/tests/test_scalars.py        |  4 +++
 13 files changed, 124 insertions(+), 31 deletions(-)
 create mode 100644 ci/docker/conda-python-cython2.dockerfile

diff --git a/ci/docker/conda-python-cython2.dockerfile b/ci/docker/conda-python-cython2.dockerfile
new file mode 100644
index 0000000000000..d67ef677276c7
--- /dev/null
+++ b/ci/docker/conda-python-cython2.dockerfile
@@ -0,0 +1,24 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +ARG repo +ARG arch +ARG python=3.8 +FROM ${repo}:${arch}-conda-python-${python} + +RUN mamba install -q -y "cython<3" && \ + mamba clean --all diff --git a/dev/archery/archery/lang/python.py b/dev/archery/archery/lang/python.py index 8600a0d7c48c0..d4c1853d097b2 100644 --- a/dev/archery/archery/lang/python.py +++ b/dev/archery/archery/lang/python.py @@ -16,6 +16,7 @@ # under the License. from contextlib import contextmanager +from enum import EnumMeta import inspect import tokenize @@ -112,6 +113,10 @@ def inspect_signature(obj): class NumpyDoc: + IGNORE_VALIDATION_ERRORS_FOR_TYPE = { + # Enum function signatures should never be documented + EnumMeta: ["PR01"] + } def __init__(self, symbols=None): if not have_numpydoc: @@ -229,6 +234,10 @@ def callback(obj): continue if disallow_rules and errcode in disallow_rules: continue + if any(isinstance(obj, obj_type) and errcode in errcode_list + for obj_type, errcode_list + in NumpyDoc.IGNORE_VALIDATION_ERRORS_FOR_TYPE.items()): + continue errors.append((errcode, errmsg)) if len(errors): diff --git a/dev/tasks/tasks.yml b/dev/tasks/tasks.yml index ed238778635d3..29e038a922412 100644 --- a/dev/tasks/tasks.yml +++ b/dev/tasks/tasks.yml @@ -1286,6 +1286,14 @@ tasks: PYTHON: "3.10" image: conda-python-substrait + test-conda-python-3.10-cython2: + ci: github + template: docker-tests/github.linux.yml + params: + env: + PYTHON: "3.10" + image: conda-python-cython2 + test-debian-11-python-3: ci: azure template: docker-tests/azure.linux.yml diff --git a/docker-compose.yml b/docker-compose.yml index a79b13c0a5f91..8ae06900c57f9 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -119,6 +119,7 @@ x-hierarchy: - conda-python: - conda-python-pandas: - conda-python-docs + - conda-python-cython2 - conda-python-dask - conda-python-hdfs - conda-python-java-integration @@ -1349,6 +1350,30 @@ services: /arrow/ci/scripts/java_build.sh /arrow /build /tmp/dist/java && /arrow/ci/scripts/java_cdata_integration.sh /arrow /tmp/dist/java" ] + conda-python-cython2: + # Usage: + # docker-compose build conda + # docker-compose build conda-cpp + # docker-compose build conda-python + # docker-compose build conda-python-cython2 + # docker-compose run --rm conda-python-cython2 + image: ${REPO}:${ARCH}-conda-python-${PYTHON}-cython2 + build: + context: . 
+ dockerfile: ci/docker/conda-python-cython2.dockerfile + cache_from: + - ${REPO}:${ARCH}-conda-python-${PYTHON}-cython2 + args: + repo: ${REPO} + arch: ${ARCH} + python: ${PYTHON} + shm_size: *shm-size + environment: + <<: [*common, *ccache] + PYTEST_ARGS: # inherit + volumes: *conda-volumes + command: *python-conda-command + ################################## R ######################################## ubuntu-r: diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 242ba8448f4a6..29f8d2da72f3a 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -168,37 +168,44 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${PYARROW_CXXFLAGS}") if(MSVC) # MSVC version of -Wno-return-type-c-linkage - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4190") + string(APPEND CMAKE_CXX_FLAGS " /wd4190") # Cython generates some bitshift expressions that MSVC does not like in # __Pyx_PyFloat_DivideObjC - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4293") + string(APPEND CMAKE_CXX_FLAGS " /wd4293") # Converting to/from C++ bool is pretty wonky in Cython. The C4800 warning # seem harmless, and probably not worth the effort of working around it - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4800") + string(APPEND CMAKE_CXX_FLAGS " /wd4800") # See https://github.com/cython/cython/issues/2731. Change introduced in # Cython 0.29.1 causes "unsafe use of type 'bool' in operation" - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4804") + string(APPEND CMAKE_CXX_FLAGS " /wd4804") + + # See https://github.com/cython/cython/issues/4445. + # + # Cython 3 emits "(void)__Pyx_PyObject_CallMethod0;" to suppress a + # "unused function" warning but the code emits another "function + # call missing argument list" warning. + string(APPEND CMAKE_CXX_FLAGS " /wd4551") else() # Enable perf and other tools to work properly - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer") + string(APPEND CMAKE_CXX_FLAGS " -fno-omit-frame-pointer") # Suppress Cython warnings - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable -Wno-maybe-uninitialized") + string(APPEND CMAKE_CXX_FLAGS " -Wno-unused-variable -Wno-maybe-uninitialized") if(CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang") # Cython warnings in clang - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-constant-logical-operand") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-missing-declarations") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-sometimes-uninitialized") + string(APPEND CMAKE_CXX_FLAGS " -Wno-parentheses-equality") + string(APPEND CMAKE_CXX_FLAGS " -Wno-constant-logical-operand") + string(APPEND CMAKE_CXX_FLAGS " -Wno-missing-declarations") + string(APPEND CMAKE_CXX_FLAGS " -Wno-sometimes-uninitialized") # We have public Cython APIs which return C++ types, which are in an extern # "C" blog (no symbol mangling) and clang doesn't like this - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-return-type-c-linkage") + string(APPEND CMAKE_CXX_FLAGS " -Wno-return-type-c-linkage") endif() endif() diff --git a/python/pyarrow/_dataset.pyx b/python/pyarrow/_dataset.pyx index d29fa125e2061..48ee676915311 100644 --- a/python/pyarrow/_dataset.pyx +++ b/python/pyarrow/_dataset.pyx @@ -1078,7 +1078,8 @@ cdef class FileSystemDataset(Dataset): @classmethod def from_paths(cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None): - """A Dataset created from a list of paths on a particular filesystem. 
+ """ + A Dataset created from a list of paths on a particular filesystem. Parameters ---------- diff --git a/python/pyarrow/_flight.pyx b/python/pyarrow/_flight.pyx index 42b221ed72a1b..79aa24e4ce8e3 100644 --- a/python/pyarrow/_flight.pyx +++ b/python/pyarrow/_flight.pyx @@ -988,8 +988,10 @@ cdef class _MetadataRecordBatchReader(_Weakrefable, _ReadPandasMixin): cdef shared_ptr[CMetadataRecordBatchReader] reader def __iter__(self): - while True: - yield self.read_chunk() + return self + + def __next__(self): + return self.read_chunk() @property def schema(self): @@ -1699,7 +1701,9 @@ cdef class FlightClient(_Weakrefable): def close(self): """Close the client and disconnect.""" - check_flight_status(self.client.get().Close()) + client = self.client.get() + if client != NULL: + check_flight_status(client.Close()) def __del__(self): # Not ideal, but close() wasn't originally present so diff --git a/python/pyarrow/_substrait.pyx b/python/pyarrow/_substrait.pyx index 4efad2c4d1bc5..067cb5f91681b 100644 --- a/python/pyarrow/_substrait.pyx +++ b/python/pyarrow/_substrait.pyx @@ -27,9 +27,10 @@ from pyarrow.includes.libarrow cimport * from pyarrow.includes.libarrow_substrait cimport * +# TODO GH-37235: Fix exception handling cdef CDeclaration _create_named_table_provider( dict named_args, const std_vector[c_string]& names, const CSchema& schema -): +) noexcept: cdef: c_string c_name shared_ptr[CTable] c_in_table diff --git a/python/pyarrow/includes/libarrow_flight.pxd b/python/pyarrow/includes/libarrow_flight.pxd index 4bddd2d080f5f..c4cf5830c4128 100644 --- a/python/pyarrow/includes/libarrow_flight.pxd +++ b/python/pyarrow/includes/libarrow_flight.pxd @@ -118,16 +118,16 @@ cdef extern from "arrow/flight/api.h" namespace "arrow" nogil: c_bool Equals(const CLocation& other) @staticmethod - CResult[CLocation] Parse(c_string& uri_string) + CResult[CLocation] Parse(const c_string& uri_string) @staticmethod - CResult[CLocation] ForGrpcTcp(c_string& host, int port) + CResult[CLocation] ForGrpcTcp(const c_string& host, int port) @staticmethod - CResult[CLocation] ForGrpcTls(c_string& host, int port) + CResult[CLocation] ForGrpcTls(const c_string& host, int port) @staticmethod - CResult[CLocation] ForGrpcUnix(c_string& path) + CResult[CLocation] ForGrpcUnix(const c_string& path) cdef cppclass CFlightEndpoint" arrow::flight::FlightEndpoint": CFlightEndpoint() @@ -172,7 +172,9 @@ cdef extern from "arrow/flight/api.h" namespace "arrow" nogil: CResult[unique_ptr[CFlightInfo]] Next() cdef cppclass CSimpleFlightListing" arrow::flight::SimpleFlightListing": - CSimpleFlightListing(vector[CFlightInfo]&& info) + # This doesn't work with Cython >= 3 + # CSimpleFlightListing(vector[CFlightInfo]&& info) + CSimpleFlightListing(const vector[CFlightInfo]& info) cdef cppclass CFlightPayload" arrow::flight::FlightPayload": shared_ptr[CBuffer] descriptor @@ -310,7 +312,10 @@ cdef extern from "arrow/flight/api.h" namespace "arrow" nogil: cdef cppclass CCallHeaders" arrow::flight::CallHeaders": cppclass const_iterator: pair[c_string, c_string] operator*() + # For Cython < 3 const_iterator operator++() + # For Cython >= 3 + const_iterator operator++(int) bint operator==(const_iterator) bint operator!=(const_iterator) const_iterator cbegin() diff --git a/python/pyarrow/ipc.pxi b/python/pyarrow/ipc.pxi index a8398597fe6cd..53e521fc11468 100644 --- a/python/pyarrow/ipc.pxi +++ b/python/pyarrow/ipc.pxi @@ -436,8 +436,10 @@ cdef class MessageReader(_Weakrefable): return result def __iter__(self): - while True: - yield 
self.read_next_message() + return self + + def __next__(self): + return self.read_next_message() def read_next_message(self): """ @@ -656,11 +658,10 @@ cdef class RecordBatchReader(_Weakrefable): # cdef block is in lib.pxd def __iter__(self): - while True: - try: - yield self.read_next_batch() - except StopIteration: - return + return self + + def __next__(self): + return self.read_next_batch() @property def schema(self): diff --git a/python/pyarrow/scalar.pxi b/python/pyarrow/scalar.pxi index e07949c675524..9a66dc81226d4 100644 --- a/python/pyarrow/scalar.pxi +++ b/python/pyarrow/scalar.pxi @@ -819,8 +819,8 @@ cdef class MapScalar(ListScalar): Iterate over this element's values. """ arr = self.values - if array is None: - raise StopIteration + if arr is None: + return for k, v in zip(arr.field(self.type.key_field.name), arr.field(self.type.item_field.name)): yield (k.as_py(), v.as_py()) diff --git a/python/pyarrow/tests/test_dataset.py b/python/pyarrow/tests/test_dataset.py index e0988f2752033..39c3c43daea37 100644 --- a/python/pyarrow/tests/test_dataset.py +++ b/python/pyarrow/tests/test_dataset.py @@ -1615,9 +1615,13 @@ def test_fragments_repr(tempdir, dataset): # partitioned parquet dataset fragment = list(dataset.get_fragments())[0] assert ( + # Ordering of partition items is non-deterministic repr(fragment) == "" + "partition=[key=xxx, group=1]>" or + repr(fragment) == + "" ) # single-file parquet dataset (no partition information in repr) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index 5f6c8c813f12a..8a1dcfb057f74 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -700,6 +700,10 @@ def test_map(pickle_module): for i, j in zip(s, v): assert i == j + # test iteration with missing values + for _ in pa.scalar(None, type=ty): + pass + assert s.as_py() == v assert s[1] == ( pa.scalar('b', type=pa.string()), From 772a01c080ad57eb11e9323f5347472b769d45de Mon Sep 17 00:00:00 2001 From: Junming Chen Date: Sat, 23 Sep 2023 00:40:31 +0800 Subject: [PATCH 55/96] GH-36420: [C++] Add An Enum Option For SetLookup Options (#36739) ### Rationale for this change As #36420 says, we want to add an SQL-compatible `is_in` variant with different null-handling logic. After a discussion with @ ianmcook and @ bkietz, we decided to support an enum option `null_matching_behavior` for SetLookup, which adds two null-handling semantics for `is_in` and does not add any new behavior for `index_in`. The enum option `null_matching_behavior` will replace `skip_nulls` in the future. ### What changes are included in this PR? Add an enum parameter `null_matching_behavior` for SetLookupOptions. ### Are these changes tested? Two kinds of tests are implemented: - Replace the default parameter `skip_nulls` with `null_matching_behavior` in the `is_in` and `index_in` tests - Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in` Since `skip_nulls` has not been removed yet, the old tests with `skip_nulls` are preserved. Once `skip_nulls` is fully deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` in these old tests. ### Are there any user-facing changes? No. Backward compatibility is currently preserved. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely. 
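For illustration, a minimal sketch of how the new option is meant to be called from C++. The `SqlCompatibleIsIn` helper name is hypothetical; the sketch only assumes the pre-existing `arrow::compute::IsIn` convenience function plus the `NullMatchingBehavior` values introduced in this patch:

```cpp
// Hypothetical helper, sketched for illustration only (not part of this diff).
#include <arrow/compute/api.h>

arrow::Result<arrow::Datum> SqlCompatibleIsIn(const arrow::Datum& values,
                                              const arrow::Datum& value_set) {
  using arrow::compute::SetLookupOptions;
  // INCONCLUSIVE gives SQL IN semantics: a null input yields null, and when
  // value_set contains a null, unmatched non-null inputs also yield null.
  SetLookupOptions options(value_set, SetLookupOptions::INCONCLUSIVE);
  return arrow::compute::IsIn(values, options);
}
```

For callers that still set the deprecated flag, `skip_nulls=true` maps onto `SKIP` and `skip_nulls=false` onto `MATCH`, which is exactly what `GetNullMatchingBehavior()` in the diff below computes.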
* Closes: #36420 Lead-authored-by: Junming Chen Co-authored-by: Sutou Kouhei Co-authored-by: Benjamin Kietzman Signed-off-by: Benjamin Kietzman --- c_glib/arrow-glib/compute.cpp | 17 +- cpp/src/arrow/compute/api_scalar.cc | 52 +- cpp/src/arrow/compute/api_scalar.h | 34 +- cpp/src/arrow/compute/expression_test.cc | 5 +- .../compute/kernels/scalar_set_lookup.cc | 104 ++- .../compute/kernels/scalar_set_lookup_test.cc | 756 +++++++++++++++++- cpp/src/arrow/util/reflection_internal.h | 24 + python/pyarrow/_compute.pyx | 2 +- 8 files changed, 942 insertions(+), 52 deletions(-) diff --git a/c_glib/arrow-glib/compute.cpp b/c_glib/arrow-glib/compute.cpp index 7fe005f94a5bb..9692f277d183f 100644 --- a/c_glib/arrow-glib/compute.cpp +++ b/c_glib/arrow-glib/compute.cpp @@ -3346,7 +3346,7 @@ garrow_set_lookup_options_get_property(GObject *object, g_value_set_object(value, priv->value_set); break; case PROP_SET_LOOKUP_OPTIONS_SKIP_NULLS: - g_value_set_boolean(value, options->skip_nulls); + g_value_set_boolean(value, options->skip_nulls.has_value() && options->skip_nulls.value()); break; default: G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); @@ -3398,13 +3398,11 @@ garrow_set_lookup_options_class_init(GArrowSetLookupOptionsClass *klass) * * Since: 6.0.0 */ - spec = g_param_spec_boolean("skip-nulls", - "Skip NULLs", - "Whether NULLs are skipped or not", - options.skip_nulls, - static_cast(G_PARAM_READWRITE)); - g_object_class_install_property(gobject_class, - PROP_SET_LOOKUP_OPTIONS_SKIP_NULLS, + auto skip_nulls = (options.skip_nulls.has_value() && options.skip_nulls.value()); + spec = + g_param_spec_boolean("skip-nulls", "Skip NULLs", "Whether NULLs are skipped or not", + skip_nulls, static_cast(G_PARAM_READWRITE)); + g_object_class_install_property(gobject_class, PROP_SET_LOOKUP_OPTIONS_SKIP_NULLS, spec); } @@ -6458,9 +6456,10 @@ garrow_set_lookup_options_new_raw( arrow_copied_options.get()); auto value_set = garrow_datum_new_raw(&(arrow_copied_set_lookup_options->value_set)); + auto skip_nulls = (arrow_options->skip_nulls.has_value() && arrow_options->skip_nulls.value()); auto options = g_object_new(GARROW_TYPE_SET_LOOKUP_OPTIONS, "value-set", value_set, - "skip-nulls", arrow_options->skip_nulls, + "skip-nulls", skip_nulls, NULL); return GARROW_SET_LOOKUP_OPTIONS(options); } diff --git a/cpp/src/arrow/compute/api_scalar.cc b/cpp/src/arrow/compute/api_scalar.cc index d7a61d0a55985..eaec940556361 100644 --- a/cpp/src/arrow/compute/api_scalar.cc +++ b/cpp/src/arrow/compute/api_scalar.cc @@ -275,6 +275,29 @@ struct EnumTraits } }; +template <> +struct EnumTraits + : BasicEnumTraits { + static std::string name() { return "SetLookupOptions::NullMatchingBehavior"; } + static std::string value_name(compute::SetLookupOptions::NullMatchingBehavior value) { + switch (value) { + case compute::SetLookupOptions::NullMatchingBehavior::MATCH: + return "MATCH"; + case compute::SetLookupOptions::NullMatchingBehavior::SKIP: + return "SKIP"; + case compute::SetLookupOptions::NullMatchingBehavior::EMIT_NULL: + return "EMIT_NULL"; + case compute::SetLookupOptions::NullMatchingBehavior::INCONCLUSIVE: + return "INCONCLUSIVE"; + } + return ""; + } +}; + } // namespace internal namespace compute { @@ -286,6 +309,7 @@ using ::arrow::internal::checked_cast; namespace internal { namespace { +using ::arrow::internal::CoercedDataMember; using ::arrow::internal::DataMember; static auto kArithmeticOptionsType = GetFunctionOptionsType( DataMember("check_overflow", &ArithmeticOptions::check_overflow)); @@ -344,7 +368,8 @@ 
static auto kRoundToMultipleOptionsType = GetFunctionOptionsType( DataMember("value_set", &SetLookupOptions::value_set), - DataMember("skip_nulls", &SetLookupOptions::skip_nulls)); + CoercedDataMember("null_matching_behavior", &SetLookupOptions::null_matching_behavior, + &SetLookupOptions::GetNullMatchingBehavior)); static auto kSliceOptionsType = GetFunctionOptionsType( DataMember("start", &SliceOptions::start), DataMember("stop", &SliceOptions::stop), DataMember("step", &SliceOptions::step)); @@ -540,8 +565,29 @@ constexpr char RoundToMultipleOptions::kTypeName[]; SetLookupOptions::SetLookupOptions(Datum value_set, bool skip_nulls) : FunctionOptions(internal::kSetLookupOptionsType), value_set(std::move(value_set)), - skip_nulls(skip_nulls) {} -SetLookupOptions::SetLookupOptions() : SetLookupOptions({}, false) {} + skip_nulls(skip_nulls) { + if (skip_nulls) { + this->null_matching_behavior = SetLookupOptions::SKIP; + } else { + this->null_matching_behavior = SetLookupOptions::MATCH; + } +} +SetLookupOptions::SetLookupOptions( + Datum value_set, SetLookupOptions::NullMatchingBehavior null_matching_behavior) + : FunctionOptions(internal::kSetLookupOptionsType), + value_set(std::move(value_set)), + null_matching_behavior(std::move(null_matching_behavior)) {} +SetLookupOptions::SetLookupOptions() + : SetLookupOptions({}, SetLookupOptions::NullMatchingBehavior::MATCH) {} +SetLookupOptions::NullMatchingBehavior SetLookupOptions::GetNullMatchingBehavior() const { + if (!this->skip_nulls.has_value()) { + return this->null_matching_behavior; + } else if (this->skip_nulls.value()) { + return SetLookupOptions::SKIP; + } else { + return SetLookupOptions::MATCH; + } +} constexpr char SetLookupOptions::kTypeName[]; SliceOptions::SliceOptions(int64_t start, int64_t stop, int64_t step) diff --git a/cpp/src/arrow/compute/api_scalar.h b/cpp/src/arrow/compute/api_scalar.h index 0a06a2829f0da..9f12471ddca14 100644 --- a/cpp/src/arrow/compute/api_scalar.h +++ b/cpp/src/arrow/compute/api_scalar.h @@ -268,19 +268,49 @@ class ARROW_EXPORT ExtractRegexOptions : public FunctionOptions { /// Options for IsIn and IndexIn functions class ARROW_EXPORT SetLookupOptions : public FunctionOptions { public: - explicit SetLookupOptions(Datum value_set, bool skip_nulls = false); + /// How to handle null values. + enum NullMatchingBehavior { + /// MATCH, any null in `value_set` is successfully matched in + /// the input. + MATCH, + /// SKIP, any null in `value_set` is ignored and nulls in the input + /// produce null (IndexIn) or false (IsIn) values in the output. + SKIP, + /// EMIT_NULL, any null in `value_set` is ignored and nulls in the + /// input produce null (IndexIn and IsIn) values in the output. + EMIT_NULL, + /// INCONCLUSIVE, null values are regarded as unknown values, which is + /// sql-compatible. nulls in the input produce null (IndexIn and IsIn) + /// values in the output. Besides, if `value_set` contains a null, + /// non-null unmatched values in the input also produce null values + /// (IndexIn and IsIn) in the output. + INCONCLUSIVE + }; + + explicit SetLookupOptions(Datum value_set, NullMatchingBehavior = MATCH); SetLookupOptions(); + + // DEPRECATED(will be removed after removing of skip_nulls) + explicit SetLookupOptions(Datum value_set, bool skip_nulls); + static constexpr char const kTypeName[] = "SetLookupOptions"; /// The set of values to look up input values into. 
Datum value_set; + + NullMatchingBehavior null_matching_behavior; + + // DEPRECATED(will be removed after removing of skip_nulls) + NullMatchingBehavior GetNullMatchingBehavior() const; + + // DEPRECATED(use null_matching_behavior instead) /// Whether nulls in `value_set` count for lookup. /// /// If true, any null in `value_set` is ignored and nulls in the input /// produce null (IndexIn) or false (IsIn) values in the output. /// If false, any null in `value_set` is successfully matched in /// the input. - bool skip_nulls; + std::optional skip_nulls; }; /// Options for struct_field function diff --git a/cpp/src/arrow/compute/expression_test.cc b/cpp/src/arrow/compute/expression_test.cc index b852f6f6b0cdb..44159e76600fb 100644 --- a/cpp/src/arrow/compute/expression_test.cc +++ b/cpp/src/arrow/compute/expression_test.cc @@ -263,8 +263,9 @@ TEST(Expression, ToString) { auto in_12 = call("index_in", {field_ref("beta")}, compute::SetLookupOptions{ArrayFromJSON(int32(), "[1,2]")}); - EXPECT_EQ(in_12.ToString(), - "index_in(beta, {value_set=int32:[\n 1,\n 2\n], skip_nulls=false})"); + EXPECT_EQ( + in_12.ToString(), + "index_in(beta, {value_set=int32:[\n 1,\n 2\n], null_matching_behavior=MATCH})"); EXPECT_EQ(and_(field_ref("a"), field_ref("b")).ToString(), "(a and b)"); EXPECT_EQ(or_(field_ref("a"), field_ref("b")).ToString(), "(a or b)"); diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc index 00d391653d240..e2d5583e36e6b 100644 --- a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc +++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc @@ -44,6 +44,7 @@ struct SetLookupState : public SetLookupStateBase { explicit SetLookupState(MemoryPool* pool) : memory_pool(pool) {} Status Init(const SetLookupOptions& options) { + this->null_matching_behavior = options.GetNullMatchingBehavior(); if (options.value_set.is_array()) { const ArrayData& value_set = *options.value_set.array(); memo_index_to_value_index.reserve(value_set.length); @@ -66,7 +67,8 @@ struct SetLookupState : public SetLookupStateBase { } else { return Status::Invalid("value_set should be an array or chunked array"); } - if (!options.skip_nulls && lookup_table->GetNull() >= 0) { + if (this->null_matching_behavior != SetLookupOptions::SKIP && + lookup_table->GetNull() >= 0) { null_index = memo_index_to_value_index[lookup_table->GetNull()]; } value_set_type = options.value_set.type(); @@ -117,19 +119,23 @@ struct SetLookupState : public SetLookupStateBase { // be mapped back to indices in the value_set. 
std::vector memo_index_to_value_index; int32_t null_index = -1; + SetLookupOptions::NullMatchingBehavior null_matching_behavior; }; template <> struct SetLookupState : public SetLookupStateBase { explicit SetLookupState(MemoryPool*) {} - Status Init(const SetLookupOptions& options) { - value_set_has_null = (options.value_set.length() > 0) && !options.skip_nulls; + Status Init(SetLookupOptions& options) { + null_matching_behavior = options.GetNullMatchingBehavior(); + value_set_has_null = (options.value_set.length() > 0) && + this->null_matching_behavior != SetLookupOptions::SKIP; value_set_type = null(); return Status::OK(); } bool value_set_has_null; + SetLookupOptions::NullMatchingBehavior null_matching_behavior; }; // TODO: Put this concept somewhere reusable @@ -270,14 +276,20 @@ struct IndexInVisitor { : ctx(ctx), data(data), out(out), out_bitmap(out->buffers[0].data) {} Status Visit(const DataType& type) { - DCHECK_EQ(type.id(), Type::NA); + DCHECK(false) << "IndexIn " << type; + return Status::NotImplemented("IndexIn has no implementation with value type ", type); + } + + Status Visit(const NullType&) { const auto& state = checked_cast&>(*ctx->state()); if (data.length != 0) { - // skip_nulls is honored for consistency with other types - bit_util::SetBitsTo(out_bitmap, out->offset, out->length, state.value_set_has_null); + bit_util::SetBitsTo(out_bitmap, out->offset, out->length, + state.null_matching_behavior == SetLookupOptions::MATCH && + state.value_set_has_null); // Set all values to 0, which will be unmasked only if null is in the value_set + // and null_matching_behavior is equal to MATCH std::memset(out->GetValues(1), 0x00, out->length * sizeof(int32_t)); } return Status::OK(); } @@ -305,7 +317,8 @@ struct IndexInVisitor { bitmap_writer.Next(); }, [&]() { - if (state.null_index != -1) { + if (state.null_index != -1 && + state.null_matching_behavior == SetLookupOptions::MATCH) { bitmap_writer.Set(); // value_set included null @@ -379,49 +392,86 @@ Status ExecIndexIn(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) { return IndexInVisitor(ctx, batch[0].array, out->array_span_mutable()).Execute(); } -// ---------------------------------------------------------------------- - // IsIn writes the results into a preallocated boolean data bitmap struct IsInVisitor { KernelContext* ctx; const ArraySpan& data; ArraySpan* out; + uint8_t* out_boolean_bitmap; + uint8_t* out_null_bitmap; IsInVisitor(KernelContext* ctx, const ArraySpan& data, ArraySpan* out) - : ctx(ctx), data(data), out(out) {} + : ctx(ctx), + data(data), + out(out), + out_boolean_bitmap(out->buffers[1].data), + out_null_bitmap(out->buffers[0].data) {} Status Visit(const DataType& type) { - DCHECK_EQ(type.id(), Type::NA); + DCHECK(false) << "IsIn " << type; + return Status::NotImplemented("IsIn has no implementation with value type ", type); + } + + Status Visit(const NullType&) { const auto& state = checked_cast&>(*ctx->state()); - // skip_nulls is honored for consistency with other types - bit_util::SetBitsTo(out->buffers[1].data, out->offset, out->length, - state.value_set_has_null); + + if (state.null_matching_behavior == SetLookupOptions::MATCH && + state.value_set_has_null) { + bit_util::SetBitsTo(out_boolean_bitmap, out->offset, out->length, true); + bit_util::SetBitsTo(out_null_bitmap, out->offset, out->length, true); + } else if (state.null_matching_behavior == SetLookupOptions::SKIP || + (!state.value_set_has_null && + state.null_matching_behavior == SetLookupOptions::MATCH)) { + 
bit_util::SetBitsTo(out_boolean_bitmap, out->offset, out->length, false); + bit_util::SetBitsTo(out_null_bitmap, out->offset, out->length, true); + } else { + bit_util::SetBitsTo(out_null_bitmap, out->offset, out->length, false); + } return Status::OK(); } template Status ProcessIsIn(const SetLookupState& state, const ArraySpan& input) { using T = typename GetViewType::T; - FirstTimeBitmapWriter writer(out->buffers[1].data, out->offset, out->length); + FirstTimeBitmapWriter writer_boolean(out_boolean_bitmap, out->offset, out->length); + FirstTimeBitmapWriter writer_null(out_null_bitmap, out->offset, out->length); + bool value_set_has_null = state.null_index != -1; VisitArraySpanInline( input, [&](T v) { - if (state.lookup_table->Get(v) != -1) { - writer.Set(); - } else { - writer.Clear(); + if (state.lookup_table->Get(v) != -1) { // true + writer_boolean.Set(); + writer_null.Set(); + } else if (state.null_matching_behavior == SetLookupOptions::INCONCLUSIVE && + value_set_has_null) { // null + writer_boolean.Clear(); + writer_null.Clear(); + } else { // false + writer_boolean.Clear(); + writer_null.Set(); } - writer.Next(); + writer_boolean.Next(); + writer_null.Next(); }, [&]() { - if (state.null_index != -1) { - writer.Set(); - } else { - writer.Clear(); + if (state.null_matching_behavior == SetLookupOptions::MATCH && + value_set_has_null) { // true + writer_boolean.Set(); + writer_null.Set(); + } else if (state.null_matching_behavior == SetLookupOptions::SKIP || + (!value_set_has_null && state.null_matching_behavior == + SetLookupOptions::MATCH)) { // false + writer_boolean.Clear(); + writer_null.Set(); + } else { // null + writer_boolean.Clear(); + writer_null.Clear(); } - writer.Next(); + writer_boolean.Next(); + writer_null.Next(); }); - writer.Finish(); + writer_boolean.Finish(); + writer_null.Finish(); return Status::OK(); } @@ -598,7 +648,7 @@ void RegisterScalarSetLookup(FunctionRegistry* registry) { ScalarKernel isin_base; isin_base.init = InitSetLookup; isin_base.exec = ExecIsIn; - isin_base.null_handling = NullHandling::OUTPUT_NOT_NULL; + isin_base.null_handling = NullHandling::COMPUTED_PREALLOCATE; auto is_in = std::make_shared("is_in", Arity::Unary(), is_in_doc); AddBasicSetLookupKernels(isin_base, /*output_type=*/boolean(), is_in.get()); diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc b/cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc index d1645eb8d9a49..89e10d1b54103 100644 --- a/cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc +++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc @@ -50,7 +50,67 @@ namespace compute { void CheckIsIn(const std::shared_ptr input, const std::shared_ptr& value_set, const std::string& expected_json, - bool skip_nulls = false) { + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + auto expected = ArrayFromJSON(boolean(), expected_json); + + ASSERT_OK_AND_ASSIGN(Datum actual_datum, + IsIn(input, SetLookupOptions(value_set, null_matching_behavior))); + std::shared_ptr actual = actual_datum.make_array(); + ValidateOutput(actual_datum); + AssertArraysEqual(*expected, *actual, /*verbose=*/true); +} + +void CheckIsIn(const std::shared_ptr& type, const std::string& input_json, + const std::string& value_set_json, const std::string& expected_json, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + auto input = ArrayFromJSON(type, input_json); + auto value_set = ArrayFromJSON(type, value_set_json); + CheckIsIn(input, 
value_set, expected_json, null_matching_behavior); +} + +void CheckIsInChunked(const std::shared_ptr& input, + const std::shared_ptr& value_set, + const std::shared_ptr& expected, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + ASSERT_OK_AND_ASSIGN(Datum actual_datum, + IsIn(input, SetLookupOptions(value_set, null_matching_behavior))); + auto actual = actual_datum.chunked_array(); + ValidateOutput(actual_datum); + + // Output contiguous in a single chunk + ASSERT_EQ(1, actual->num_chunks()); + ASSERT_TRUE(actual->Equals(*expected)); +} + +void CheckIsInDictionary(const std::shared_ptr& type, + const std::shared_ptr& index_type, + const std::string& input_dictionary_json, + const std::string& input_index_json, + const std::string& value_set_json, + const std::string& expected_json, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + auto dict_type = dictionary(index_type, type); + auto indices = ArrayFromJSON(index_type, input_index_json); + auto dict = ArrayFromJSON(type, input_dictionary_json); + + ASSERT_OK_AND_ASSIGN(auto input, DictionaryArray::FromArrays(dict_type, indices, dict)); + auto value_set = ArrayFromJSON(type, value_set_json); + auto expected = ArrayFromJSON(boolean(), expected_json); + + ASSERT_OK_AND_ASSIGN(Datum actual_datum, + IsIn(input, SetLookupOptions(value_set, null_matching_behavior))); + std::shared_ptr actual = actual_datum.make_array(); + ValidateOutput(actual_datum); + AssertArraysEqual(*expected, *actual, /*verbose=*/true); +} + +void CheckIsIn(const std::shared_ptr input, + const std::shared_ptr& value_set, const std::string& expected_json, + bool skip_nulls) { auto expected = ArrayFromJSON(boolean(), expected_json); ASSERT_OK_AND_ASSIGN(Datum actual_datum, @@ -62,7 +122,7 @@ void CheckIsIn(const std::shared_ptr input, void CheckIsIn(const std::shared_ptr& type, const std::string& input_json, const std::string& value_set_json, const std::string& expected_json, - bool skip_nulls = false) { + bool skip_nulls) { auto input = ArrayFromJSON(type, input_json); auto value_set = ArrayFromJSON(type, value_set_json); CheckIsIn(input, value_set, expected_json, skip_nulls); @@ -70,8 +130,7 @@ void CheckIsIn(const std::shared_ptr& type, const std::string& input_j void CheckIsInChunked(const std::shared_ptr& input, const std::shared_ptr& value_set, - const std::shared_ptr& expected, - bool skip_nulls = false) { + const std::shared_ptr& expected, bool skip_nulls) { ASSERT_OK_AND_ASSIGN(Datum actual_datum, IsIn(input, SetLookupOptions(value_set, skip_nulls))); auto actual = actual_datum.chunked_array(); @@ -87,7 +146,7 @@ void CheckIsInDictionary(const std::shared_ptr& type, const std::string& input_dictionary_json, const std::string& input_index_json, const std::string& value_set_json, - const std::string& expected_json, bool skip_nulls = false) { + const std::string& expected_json, bool skip_nulls) { auto dict_type = dictionary(index_type, type); auto indices = ArrayFromJSON(index_type, input_index_json); auto dict = ArrayFromJSON(type, input_dictionary_json); @@ -185,18 +244,43 @@ TYPED_TEST(TestIsInKernelPrimitive, IsIn) { /*skip_nulls=*/false); CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, 1]", "[false, true, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, 1]", "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, 1]", "[false, true, true, false, true]", + 
/*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, 1]", "[null, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, 1]", "[null, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Nulls in right array CheckIsIn(type, "[0, 1, 2, 3, 2]", "[2, null, 1]", "[false, true, true, false, true]", /*skip_nulls=*/false); CheckIsIn(type, "[0, 1, 2, 3, 2]", "[2, null, 1]", "[false, true, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[0, 1, 2, 3, 2]", "[2, null, 1]", "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[0, 1, 2, 3, 2]", "[2, null, 1]", "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[0, 1, 2, 3, 2]", "[2, null, 1]", "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[0, 1, 2, 3, 2]", "[2, null, 1]", "[null, true, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Nulls in both the arrays CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, null, 1]", "[true, true, true, false, true]", /*skip_nulls=*/false); CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, null, 1]", "[false, true, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, null, 1]", "[true, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, null, 1]", + "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, null, 1]", "[null, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[2, null, 1]", "[null, true, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, "[null, 1, 2, 3, 2]", "[null, 2, 2, null, 1, 1]", @@ -204,6 +288,18 @@ TYPED_TEST(TestIsInKernelPrimitive, IsIn) { /*skip_nulls=*/false); CheckIsIn(type, "[null, 1, 2, 3, 2]", "[null, 2, 2, null, 1, 1]", "[false, true, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[null, 2, 2, null, 1, 1]", + "[true, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[null, 2, 2, null, 1, 1]", + "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[null, 2, 2, null, 1, 1]", + "[null, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[null, 1, 2, 3, 2]", "[null, 2, 2, null, 1, 1]", + "[null, true, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Empty Arrays CheckIsIn(type, "[]", "[]", "[]"); @@ -217,11 +313,30 @@ TEST_F(TestIsInKernel, NullType) { CheckIsIn(type, "[]", "[]", "[]"); CheckIsIn(type, "[null, null]", "[null]", "[false, false]", /*skip_nulls=*/true); + CheckIsIn(type, "[null, null]", "[null]", "[false, false]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[null, null]", "[null]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[null, null]", "[null]", "[null, null]", + 
/*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); + CheckIsIn(type, "[null, null]", "[]", "[false, false]", /*skip_nulls=*/true); + CheckIsIn(type, "[null, null]", "[]", "[false, false]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[null, null]", "[]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[null, null]", "[]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, "[null, null, null]", "[null, null]", "[true, true, true]"); CheckIsIn(type, "[null, null]", "[null, null]", "[false, false]", /*skip_nulls=*/true); + CheckIsIn(type, "[null, null]", "[null, null]", "[false, false]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[null, null]", "[null, null]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[null, null]", "[null, null]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } TEST_F(TestIsInKernel, TimeTimestamp) { @@ -232,12 +347,36 @@ TEST_F(TestIsInKernel, TimeTimestamp) { "[true, true, false, true, true]", /*skip_nulls=*/false); CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", "[true, false, false, true, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, false, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, null, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, null, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", "[true, true, false, true, true]", /*skip_nulls=*/false); CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", "[true, false, false, true, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, false, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, null, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, null, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } // Disallow mixing timezone-aware and timezone-naive values @@ -260,12 +399,36 @@ TEST_F(TestIsInKernel, TimeDuration) { "[true, true, false, true, true]", /*skip_nulls=*/false); CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", "[true, false, false, true, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, false, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, null, false, true, true]", + 
/*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, null]", + "[true, null, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", "[true, true, false, true, true]", /*skip_nulls=*/false); CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", "[true, false, false, true, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, false, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, null, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[1, null, 5, 1, 2]", "[2, 1, 1, null, 2]", + "[true, null, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } // Different units, cast value_set to values will fail, then cast values to value_set @@ -285,17 +448,53 @@ TEST_F(TestIsInKernel, Boolean) { "[false, true, false, false, true]", /*skip_nulls=*/false); CheckIsIn(type, "[true, false, null, true, false]", "[false]", "[false, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[true, false, null, true, false]", "[false]", + "[false, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[true, false, null, true, false]", "[false]", + "[false, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[true, false, null, true, false]", "[false]", + "[false, true, null, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[true, false, null, true, false]", "[false]", + "[false, true, null, false, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); CheckIsIn(type, "[true, false, null, true, false]", "[false, null]", "[false, true, true, false, true]", /*skip_nulls=*/false); CheckIsIn(type, "[true, false, null, true, false]", "[false, null]", "[false, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[true, false, null, true, false]", "[false, null]", + "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[true, false, null, true, false]", "[false, null]", + "[false, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[true, false, null, true, false]", "[false, null]", + "[false, true, null, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[true, false, null, true, false]", "[false, null]", + "[null, true, null, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, "[true, false, null, true, false]", "[null, false, false, null]", "[false, true, true, false, true]", /*skip_nulls=*/false); CheckIsIn(type, "[true, false, null, true, false]", "[null, false, false, null]", "[false, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, "[true, false, null, true, false]", "[null, false, false, null]", + "[false, true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, "[true, false, null, true, false]", "[null, false, 
false, null]", + "[false, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, "[true, false, null, true, false]", "[null, false, false, null]", + "[false, true, null, false, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, "[true, false, null, true, false]", "[null, false, false, null]", + "[null, true, null, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } TYPED_TEST_SUITE(TestIsInKernelBinary, BaseBinaryArrowTypes); @@ -309,6 +508,18 @@ TYPED_TEST(TestIsInKernelBinary, Binary) { CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", ""])", "[true, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", ""])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", ""])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", ""])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", ""])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", "", null])", "[true, true, false, true, true]", @@ -316,6 +527,18 @@ TYPED_TEST(TestIsInKernelBinary, Binary) { CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", "", null])", "[true, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", "", null])", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", "", null])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", "", null])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"(["aaa", "", null])", + "[true, true, null, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", @@ -324,6 +547,18 @@ TYPED_TEST(TestIsInKernelBinary, Binary) { CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", R"([null, "aaa", "aaa", "", "", null])", "[true, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", + R"([null, "aaa", "aaa", "", "", null])", "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", + R"([null, "aaa", "aaa", "", "", null])", "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", + R"([null, "aaa", "aaa", "", "", null])", "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["aaa", "", "cc", null, ""])", + R"([null, "aaa", "aaa", "", "", null])", "[true, true, null, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } TEST_F(TestIsInKernel, FixedSizeBinary) { @@ -335,6 +570,18 @@ TEST_F(TestIsInKernel, FixedSizeBinary) { CheckIsIn(type, R"(["aaa", "bbb", 
"ccc", null, "bbb"])", R"(["aaa", "bbb"])", "[true, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb"])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb"])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb"])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb"])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb", null])", "[true, true, false, true, true]", @@ -342,6 +589,18 @@ TEST_F(TestIsInKernel, FixedSizeBinary) { CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb", null])", "[true, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb", null])", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb", null])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb", null])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", R"(["aaa", "bbb", null])", + "[true, true, null, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", @@ -352,6 +611,22 @@ TEST_F(TestIsInKernel, FixedSizeBinary) { R"(["aaa", null, "aaa", "bbb", "bbb", null])", "[true, true, false, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", + R"(["aaa", null, "aaa", "bbb", "bbb", null])", + "[true, true, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", + R"(["aaa", null, "aaa", "bbb", "bbb", null])", + "[true, true, false, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", + R"(["aaa", null, "aaa", "bbb", "bbb", null])", + "[true, true, false, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["aaa", "bbb", "ccc", null, "bbb"])", + R"(["aaa", null, "aaa", "bbb", "bbb", null])", + "[true, true, null, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); ASSERT_RAISES(Invalid, IsIn(ArrayFromJSON(fixed_size_binary(3), R"(["abc"])"), @@ -366,6 +641,18 @@ TEST_F(TestIsInKernel, Decimal) { CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9"])", "[true, false, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9"])", + "[true, false, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9"])", + "[true, false, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + 
CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9"])", + "[true, false, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9"])", + "[true, false, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9", null])", "[true, false, true, true, true]", @@ -373,6 +660,18 @@ TEST_F(TestIsInKernel, Decimal) { CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", R"(["12.3", "78.9", null])", "[true, false, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"(["12.3", "78.9", null])", "[true, false, true, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"(["12.3", "78.9", null])", "[true, false, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"(["12.3", "78.9", null])", "[true, false, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"(["12.3", "78.9", null])", "[true, null, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in right array CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", @@ -383,6 +682,22 @@ TEST_F(TestIsInKernel, Decimal) { R"([null, "12.3", "12.3", "78.9", "78.9", null])", "[true, false, true, false, true]", /*skip_nulls=*/true); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"([null, "12.3", "12.3", "78.9", "78.9", null])", + "[true, false, true, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"([null, "12.3", "12.3", "78.9", "78.9", null])", + "[true, false, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"([null, "12.3", "12.3", "78.9", "78.9", null])", + "[true, false, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsIn(type, R"(["12.3", "45.6", "78.9", null, "12.3"])", + R"([null, "12.3", "12.3", "78.9", "78.9", null])", + "[true, null, true, null, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); CheckIsIn(ArrayFromJSON(decimal128(4, 2), R"(["12.30", "45.60", "78.90"])"), ArrayFromJSON(type, R"(["12.3", "78.9"])"), "[true, false, true]"); @@ -405,6 +720,20 @@ TEST_F(TestIsInKernel, DictionaryArray) { /*value_set_json=*/"[4.1, 42, -1.0]", /*expected_json=*/"[true, true, false, true]", /*skip_nulls=*/false); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 2, null, 0]", + /*value_set_json=*/R"(["A", "B", "C"])", + /*expected_json=*/"[true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsInDictionary(/*type=*/float32(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/"[4.1, -1.0, 42, 9.8]", + /*input_index_json=*/"[1, 2, null, 0]", + /*value_set_json=*/"[4.1, 42, -1.0]", + /*expected_json=*/"[true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); // With nulls and skip_nulls=false CheckIsInDictionary(/*type=*/utf8(), @@ -428,6 
+757,27 @@ TEST_F(TestIsInKernel, DictionaryArray) { /*value_set_json=*/R"(["C", "B", "A"])", /*expected_json=*/"[false, false, false, true, false]", /*skip_nulls=*/false); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[true, false, true, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[true, false, true, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A"])", + /*expected_json=*/"[false, false, false, true, false]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); // With nulls and skip_nulls=true CheckIsInDictionary(/*type=*/utf8(), @@ -451,6 +801,73 @@ TEST_F(TestIsInKernel, DictionaryArray) { /*value_set_json=*/R"(["C", "B", "A"])", /*expected_json=*/"[false, false, false, true, false]", /*skip_nulls=*/true); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[true, false, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[false, false, false, true, false]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A"])", + /*expected_json=*/"[false, false, false, true, false]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + + // With nulls and null_matching_behavior=EMIT_NULL + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[true, false, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[null, false, null, true, null]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A"])", + /*expected_json=*/"[null, false, null, true, null]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + + // With nulls and null_matching_behavior=INCONCLUSIVE + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + 
/*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[true, null, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[null, null, null, true, null]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A"])", + /*expected_json=*/"[null, false, null, true, null]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // With duplicates in value_set CheckIsInDictionary(/*type=*/utf8(), @@ -474,6 +891,41 @@ TEST_F(TestIsInKernel, DictionaryArray) { /*value_set_json=*/R"(["C", "C", "B", "A", null, null, "B"])", /*expected_json=*/"[true, false, false, true, true]", /*skip_nulls=*/true); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 2, null, 0]", + /*value_set_json=*/R"(["A", "A", "B", "A", "B", "C"])", + /*expected_json=*/"[true, true, false, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "C", "B", "A", null, null, "B"])", + /*expected_json=*/"[true, false, true, true, true]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "C", "B", "A", null, null, "B"])", + /*expected_json=*/"[true, false, false, true, true]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "C", "B", "A", null, null, "B"])", + /*expected_json=*/"[true, false, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + CheckIsInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "C", "B", "A", null, null, "B"])", + /*expected_json=*/"[true, null, null, true, true]", + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } } @@ -487,14 +939,38 @@ TEST_F(TestIsInKernel, ChunkedArrayInvoke) { CheckIsInChunked(input, value_set, expected, /*skip_nulls=*/false); CheckIsInChunked(input, value_set, expected, /*skip_nulls=*/true); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::SKIP); + expected = ChunkedArrayFromJSON( + boolean(), {"[true, true, true, true, false]", "[true, null, true, false]"}); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + expected = 
ChunkedArrayFromJSON( + boolean(), {"[true, true, true, true, false]", "[true, null, true, false]"}); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); value_set = ChunkedArrayFromJSON(utf8(), {R"(["", "def"])", R"([null])"}); expected = ChunkedArrayFromJSON( boolean(), {"[false, true, true, false, false]", "[true, true, false, false]"}); CheckIsInChunked(input, value_set, expected, /*skip_nulls=*/false); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::MATCH); expected = ChunkedArrayFromJSON( boolean(), {"[false, true, true, false, false]", "[true, false, false, false]"}); CheckIsInChunked(input, value_set, expected, /*skip_nulls=*/true); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::SKIP); + expected = ChunkedArrayFromJSON( + boolean(), {"[false, true, true, false, false]", "[true, null, false, false]"}); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + expected = ChunkedArrayFromJSON( + boolean(), {"[null, true, true, null, null]", "[true, null, null, null]"}); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); // Duplicates in value_set value_set = @@ -502,9 +978,21 @@ TEST_F(TestIsInKernel, ChunkedArrayInvoke) { expected = ChunkedArrayFromJSON( boolean(), {"[false, true, true, false, false]", "[true, true, false, false]"}); CheckIsInChunked(input, value_set, expected, /*skip_nulls=*/false); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::MATCH); expected = ChunkedArrayFromJSON( boolean(), {"[false, true, true, false, false]", "[true, false, false, false]"}); CheckIsInChunked(input, value_set, expected, /*skip_nulls=*/true); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::SKIP); + expected = ChunkedArrayFromJSON( + boolean(), {"[false, true, true, false, false]", "[true, null, false, false]"}); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::EMIT_NULL); + expected = ChunkedArrayFromJSON( + boolean(), {"[null, true, true, null, null]", "[true, null, null, null]"}); + CheckIsInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::INCONCLUSIVE); } // ---------------------------------------------------------------------- @@ -514,7 +1002,70 @@ class TestIndexInKernel : public ::testing::Test { public: void CheckIndexIn(const std::shared_ptr& input, const std::shared_ptr& value_set, - const std::string& expected_json, bool skip_nulls = false) { + const std::string& expected_json, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + std::shared_ptr expected = ArrayFromJSON(int32(), expected_json); + + SetLookupOptions options(value_set, null_matching_behavior); + ASSERT_OK_AND_ASSIGN(Datum actual_datum, IndexIn(input, options)); + std::shared_ptr actual = actual_datum.make_array(); + ValidateOutput(actual_datum); + AssertArraysEqual(*expected, *actual, /*verbose=*/true); + } + + void CheckIndexIn(const std::shared_ptr& type, const std::string& input_json, + const std::string& value_set_json, const std::string& expected_json, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + std::shared_ptr input = ArrayFromJSON(type, input_json); + std::shared_ptr value_set = 
ArrayFromJSON(type, value_set_json); + return CheckIndexIn(input, value_set, expected_json, null_matching_behavior); + } + + void CheckIndexInChunked(const std::shared_ptr& input, + const std::shared_ptr& value_set, + const std::shared_ptr& expected, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + ASSERT_OK_AND_ASSIGN( + Datum actual, + IndexIn(input, SetLookupOptions(value_set, null_matching_behavior))); + ASSERT_EQ(Datum::CHUNKED_ARRAY, actual.kind()); + ValidateOutput(actual); + + auto actual_chunked = actual.chunked_array(); + + // Output contiguous in a single chunk + ASSERT_EQ(1, actual_chunked->num_chunks()); + ASSERT_TRUE(actual_chunked->Equals(*expected)); + } + + void CheckIndexInDictionary( + const std::shared_ptr& type, const std::shared_ptr& index_type, + const std::string& input_dictionary_json, const std::string& input_index_json, + const std::string& value_set_json, const std::string& expected_json, + SetLookupOptions::NullMatchingBehavior null_matching_behavior = + SetLookupOptions::MATCH) { + auto dict_type = dictionary(index_type, type); + auto indices = ArrayFromJSON(index_type, input_index_json); + auto dict = ArrayFromJSON(type, input_dictionary_json); + + ASSERT_OK_AND_ASSIGN(auto input, + DictionaryArray::FromArrays(dict_type, indices, dict)); + auto value_set = ArrayFromJSON(type, value_set_json); + auto expected = ArrayFromJSON(int32(), expected_json); + + SetLookupOptions options(value_set, null_matching_behavior); + ASSERT_OK_AND_ASSIGN(Datum actual_datum, IndexIn(input, options)); + std::shared_ptr actual = actual_datum.make_array(); + ValidateOutput(actual_datum); + AssertArraysEqual(*expected, *actual, /*verbose=*/true); + } + + void CheckIndexIn(const std::shared_ptr& input, + const std::shared_ptr& value_set, + const std::string& expected_json, bool skip_nulls) { std::shared_ptr expected = ArrayFromJSON(int32(), expected_json); SetLookupOptions options(value_set, skip_nulls); @@ -526,7 +1077,7 @@ class TestIndexInKernel : public ::testing::Test { void CheckIndexIn(const std::shared_ptr& type, const std::string& input_json, const std::string& value_set_json, const std::string& expected_json, - bool skip_nulls = false) { + bool skip_nulls) { std::shared_ptr input = ArrayFromJSON(type, input_json); std::shared_ptr value_set = ArrayFromJSON(type, value_set_json); return CheckIndexIn(input, value_set, expected_json, skip_nulls); @@ -553,7 +1104,7 @@ class TestIndexInKernel : public ::testing::Test { const std::string& input_dictionary_json, const std::string& input_index_json, const std::string& value_set_json, - const std::string& expected_json, bool skip_nulls = false) { + const std::string& expected_json, bool skip_nulls) { auto dict_type = dictionary(index_type, type); auto indices = ArrayFromJSON(index_type, input_index_json); auto dict = ArrayFromJSON(type, input_dictionary_json); @@ -656,6 +1207,16 @@ TYPED_TEST(TestIndexInKernelPrimitive, SkipNulls) { /*value_set=*/"[1, 3]", /*expected=*/"[null, 0, null, 1, null]", /*skip_nulls=*/true); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, 3]", + /*expected=*/"[null, 0, null, 1, null]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, 3]", + /*expected=*/"[null, 0, null, 1, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Same with duplicates in value_set this->CheckIndexIn(type, /*input=*/"[0, 1, 2, 3, null]", @@ -667,6 
+1228,16 @@ TYPED_TEST(TestIndexInKernelPrimitive, SkipNulls) { /*value_set=*/"[1, 1, 3, 3]", /*expected=*/"[null, 0, null, 2, null]", /*skip_nulls=*/true); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, 1, 3, 3]", + /*expected=*/"[null, 0, null, 2, null]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, 1, 3, 3]", + /*expected=*/"[null, 0, null, 2, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Nulls in value_set this->CheckIndexIn(type, @@ -679,12 +1250,27 @@ TYPED_TEST(TestIndexInKernelPrimitive, SkipNulls) { /*value_set=*/"[1, 1, null, null, 3, 3]", /*expected=*/"[null, 0, null, 4, null]", /*skip_nulls=*/true); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, null, 3]", + /*expected=*/"[null, 0, null, 2, 1]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, 1, null, null, 3, 3]", + /*expected=*/"[null, 0, null, 4, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Same with duplicates in value_set this->CheckIndexIn(type, /*input=*/"[0, 1, 2, 3, null]", /*value_set=*/"[1, 1, null, null, 3, 3]", /*expected=*/"[null, 0, null, 4, 2]", /*skip_nulls=*/false); + this->CheckIndexIn(type, + /*input=*/"[0, 1, 2, 3, null]", + /*value_set=*/"[1, 1, null, null, 3, 3]", + /*expected=*/"[null, 0, null, 4, 2]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); } TEST_F(TestIndexInKernel, NullType) { @@ -695,6 +1281,10 @@ TEST_F(TestIndexInKernel, NullType) { CheckIndexIn(null(), "[null, null]", "[null]", "[null, null]", /*skip_nulls=*/true); CheckIndexIn(null(), "[null, null]", "[]", "[null, null]", /*skip_nulls=*/true); + CheckIndexIn(null(), "[null, null]", "[null]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIndexIn(null(), "[null, null]", "[]", "[null, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); } TEST_F(TestIndexInKernel, TimeTimestamp) { @@ -979,6 +1569,11 @@ TEST_F(TestIndexInKernel, FixedSizeBinary) { /*value_set=*/R"(["aaa", null, "bbb", "ccc"])", /*expected=*/R"([2, null, null, 0, 3, 0])", /*skip_nulls=*/true); + CheckIndexIn(fixed_size_binary(3), + /*input=*/R"(["bbb", null, "ddd", "aaa", "ccc", "aaa"])", + /*value_set=*/R"(["aaa", null, "bbb", "ccc"])", + /*expected=*/R"([2, null, null, 0, 3, 0])", + /*null_matching_behavior=*/SetLookupOptions::SKIP); CheckIndexIn(fixed_size_binary(3), /*input=*/R"(["bbb", null, "ddd", "aaa", "ccc", "aaa"])", @@ -989,6 +1584,11 @@ TEST_F(TestIndexInKernel, FixedSizeBinary) { /*value_set=*/R"(["aaa", "bbb", "ccc"])", /*expected=*/R"([1, null, null, 0, 2, 0])", /*skip_nulls=*/true); + CheckIndexIn(fixed_size_binary(3), + /*input=*/R"(["bbb", null, "ddd", "aaa", "ccc", "aaa"])", + /*value_set=*/R"(["aaa", "bbb", "ccc"])", + /*expected=*/R"([1, null, null, 0, 2, 0])", + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Duplicates in value_set CheckIndexIn(fixed_size_binary(3), @@ -1000,6 +1600,11 @@ TEST_F(TestIndexInKernel, FixedSizeBinary) { /*value_set=*/R"(["aaa", "aaa", null, null, "bbb", "bbb", "ccc"])", /*expected=*/R"([4, null, null, 0, 6, 0])", /*skip_nulls=*/true); + CheckIndexIn(fixed_size_binary(3), + /*input=*/R"(["bbb", null, "ddd", "aaa", "ccc", "aaa"])", + /*value_set=*/R"(["aaa", "aaa", null, null, "bbb", "bbb", "ccc"])", + /*expected=*/R"([4, null, null, 0, 6, 0])", + 
/*null_matching_behavior=*/SetLookupOptions::SKIP); // Empty input array CheckIndexIn(fixed_size_binary(5), R"([])", R"(["bbbbb", null, "aaaaa", "ccccc"])", @@ -1026,6 +1631,11 @@ TEST_F(TestIndexInKernel, MonthDayNanoInterval) { /*value_set=*/R"([null, [4, 5, 6], [5, -1, 5]])", /*expected=*/R"([2, 0, 1, 2, null])", /*skip_nulls=*/false); + CheckIndexIn(type, + /*input=*/R"([[5, -1, 5], null, [4, 5, 6], [5, -1, 5], [1, 2, 3]])", + /*value_set=*/R"([null, [4, 5, 6], [5, -1, 5]])", + /*expected=*/R"([2, 0, 1, 2, null])", + /*null_matching_behavior=*/SetLookupOptions::MATCH); // Duplicates in value_set CheckIndexIn( @@ -1034,6 +1644,12 @@ TEST_F(TestIndexInKernel, MonthDayNanoInterval) { /*value_set=*/R"([null, null, [0, 0, 0], [0, 0, 0], [7, 8, 0], [7, 8, 0]])", /*expected=*/R"([4, 0, 2, 4, null])", /*skip_nulls=*/false); + CheckIndexIn( + type, + /*input=*/R"([[7, 8, 0], null, [0, 0, 0], [7, 8, 0], [0, 0, 1]])", + /*value_set=*/R"([null, null, [0, 0, 0], [0, 0, 0], [7, 8, 0], [7, 8, 0]])", + /*expected=*/R"([4, 0, 2, 4, null])", + /*null_matching_behavior=*/SetLookupOptions::MATCH); } TEST_F(TestIndexInKernel, Decimal) { @@ -1048,6 +1664,16 @@ TEST_F(TestIndexInKernel, Decimal) { /*value_set=*/R"([null, "11", "12"])", /*expected=*/R"([2, null, 1, 2, null])", /*skip_nulls=*/true); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"([null, "11", "12"])", + /*expected=*/R"([2, 0, 1, 2, null])", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"([null, "11", "12"])", + /*expected=*/R"([2, null, 1, 2, null])", + /*null_matching_behavior=*/SetLookupOptions::SKIP); CheckIndexIn(type, /*input=*/R"(["12", null, "11", "12", "13"])", @@ -1059,6 +1685,16 @@ TEST_F(TestIndexInKernel, Decimal) { /*value_set=*/R"(["11", "12"])", /*expected=*/R"([1, null, 0, 1, null])", /*skip_nulls=*/true); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"(["11", "12"])", + /*expected=*/R"([1, null, 0, 1, null])", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"(["11", "12"])", + /*expected=*/R"([1, null, 0, 1, null])", + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Duplicates in value_set CheckIndexIn(type, @@ -1076,6 +1712,21 @@ TEST_F(TestIndexInKernel, Decimal) { /*value_set=*/R"([null, "11", "12"])", /*expected=*/R"([2, 0, 1, 2, null])", /*skip_nulls=*/false); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"([null, null, "11", "11", "12", "12"])", + /*expected=*/R"([4, 0, 2, 4, null])", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"([null, null, "11", "11", "12", "12"])", + /*expected=*/R"([4, null, 2, 4, null])", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIndexIn(type, + /*input=*/R"(["12", null, "11", "12", "13"])", + /*value_set=*/R"([null, "11", "12"])", + /*expected=*/R"([2, 0, 1, 2, null])", + /*null_matching_behavior=*/SetLookupOptions::MATCH); CheckIndexIn( ArrayFromJSON(decimal256(3, 1), R"(["12.0", null, "11.0", "12.0", "13.0"])"), @@ -1099,6 +1750,20 @@ TEST_F(TestIndexInKernel, DictionaryArray) { /*value_set_json=*/"[4.1, 42, -1.0]", /*expected_json=*/"[2, 1, null, 0]", /*skip_nulls=*/false); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + 
/*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 2, null, 0]", + /*value_set_json=*/R"(["A", "B", "C"])", + /*expected_json=*/"[1, 2, null, 0]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexInDictionary(/*type=*/float32(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/"[4.1, -1.0, 42, 9.8]", + /*input_index_json=*/"[1, 2, null, 0]", + /*value_set_json=*/"[4.1, 42, -1.0]", + /*expected_json=*/"[2, 1, null, 0]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); // With nulls and skip_nulls=false CheckIndexInDictionary(/*type=*/utf8(), @@ -1122,6 +1787,27 @@ TEST_F(TestIndexInKernel, DictionaryArray) { /*value_set_json=*/R"(["C", "B", "A"])", /*expected_json=*/"[null, null, null, 2, null]", /*skip_nulls=*/false); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[1, null, 3, 2, 1]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[3, null, 3, 2, 3]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A"])", + /*expected_json=*/"[null, null, null, 2, null]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); // With nulls and skip_nulls=true CheckIndexInDictionary(/*type=*/utf8(), @@ -1145,6 +1831,27 @@ TEST_F(TestIndexInKernel, DictionaryArray) { /*value_set_json=*/R"(["C", "B", "A"])", /*expected_json=*/"[null, null, null, 2, null]", /*skip_nulls=*/true); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[1, null, null, 2, 1]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A", null])", + /*expected_json=*/"[null, null, null, 2, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "B", "A"])", + /*expected_json=*/"[null, null, null, 2, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); // With duplicates in value_set CheckIndexInDictionary(/*type=*/utf8(), @@ -1168,6 +1875,27 @@ TEST_F(TestIndexInKernel, DictionaryArray) { /*value_set_json=*/R"(["C", "C", "B", "B", "A", "A", null])", /*expected_json=*/"[null, null, null, 4, null]", /*skip_nulls=*/true); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", "B", "C", "D"])", + /*input_index_json=*/"[1, 2, null, 0]", + /*value_set_json=*/R"(["A", "A", "B", "B", "C", "C"])", + /*expected_json=*/"[2, 4, null, 0]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + 
CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "C", "B", "B", "A", "A", null])", + /*expected_json=*/"[6, null, 6, 4, 6]", + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexInDictionary(/*type=*/utf8(), + /*index_type=*/index_ty, + /*input_dictionary_json=*/R"(["A", null, "C", "D"])", + /*input_index_json=*/"[1, 3, null, 0, 1]", + /*value_set_json=*/R"(["C", "C", "B", "B", "A", "A", null])", + /*expected_json=*/"[null, null, null, 4, null]", + /*null_matching_behavior=*/SetLookupOptions::SKIP); } } @@ -1181,21 +1909,33 @@ TEST_F(TestIndexInKernel, ChunkedArrayInvoke) { CheckIndexInChunked(input, value_set, expected, /*skip_nulls=*/false); CheckIndexInChunked(input, value_set, expected, /*skip_nulls=*/true); + CheckIndexInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::MATCH); + CheckIndexInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Null in value_set value_set = ChunkedArrayFromJSON(utf8(), {R"(["ghi", "def"])", R"([null, "abc"])"}); expected = ChunkedArrayFromJSON(int32(), {"[3, 1, 0, 3, null]", "[1, 2, 3, null]"}); CheckIndexInChunked(input, value_set, expected, /*skip_nulls=*/false); + CheckIndexInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::MATCH); expected = ChunkedArrayFromJSON(int32(), {"[3, 1, 0, 3, null]", "[1, null, 3, null]"}); CheckIndexInChunked(input, value_set, expected, /*skip_nulls=*/true); + CheckIndexInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::SKIP); // Duplicates in value_set value_set = ChunkedArrayFromJSON( utf8(), {R"(["ghi", "ghi", "def"])", R"(["def", null, null, "abc"])"}); expected = ChunkedArrayFromJSON(int32(), {"[6, 2, 0, 6, null]", "[2, 4, 6, null]"}); CheckIndexInChunked(input, value_set, expected, /*skip_nulls=*/false); + CheckIndexInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::MATCH); expected = ChunkedArrayFromJSON(int32(), {"[6, 2, 0, 6, null]", "[2, null, 6, null]"}); CheckIndexInChunked(input, value_set, expected, /*skip_nulls=*/true); + CheckIndexInChunked(input, value_set, expected, + /*null_matching_behavior=*/SetLookupOptions::SKIP); } TEST(TestSetLookup, DispatchBest) { diff --git a/cpp/src/arrow/util/reflection_internal.h b/cpp/src/arrow/util/reflection_internal.h index d7de913bafd88..5d281a265ff71 100644 --- a/cpp/src/arrow/util/reflection_internal.h +++ b/cpp/src/arrow/util/reflection_internal.h @@ -71,6 +71,30 @@ constexpr DataMemberProperty DataMember(std::string_view name, return {name, ptr}; } +template +struct CoercedDataMemberProperty { + using Class = C; + using Type = T; + + constexpr Type get(const Class& obj) const { return (obj.*get_coerced_)(); } + + void set(Class* obj, Type value) const { (*obj).*ptr_for_set_ = std::move(value); } + + constexpr std::string_view name() const { return name_; } + + std::string_view name_; + Type Class::*ptr_for_set_; + Type (Class::*get_coerced_)() const; +}; + +template +constexpr CoercedDataMemberProperty CoercedDataMember(std::string_view name, + Type Class::*ptr, + Type (Class::*get)() + const) { + return {name, ptr, get}; +} + template struct PropertyTuple { template diff --git a/python/pyarrow/_compute.pyx b/python/pyarrow/_compute.pyx index 609307528d2ec..25f77d8160ea8 100644 --- a/python/pyarrow/_compute.pyx +++ 
b/python/pyarrow/_compute.pyx @@ -2366,7 +2366,7 @@ cdef class Expression(_Weakrefable): 1, 2, 3 - ], skip_nulls=false})> + ], null_matching_behavior=MATCH})> """ def __init__(self): From 7b14b2b2712bc483cd7d14bbc6c38e26d27074ac Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 25 Sep 2023 09:05:40 +0100 Subject: [PATCH 56/96] GH-36638: [R] Error with create_package_with_all_dependencies() on Windows (#37226) ### Rationale for this change Fix the directory path that stops `create_package_with_all_dependencies()` from working on Windows. ### What changes are included in this PR? Character replacement in the script ensures that on Windows a WSL-compatible path is used and that paths are normalised. ### Are these changes tested? I tested it locally and it works; there are no tests for this function. ### Are there any user-facing changes? Nope * Closes: #36638 Authored-by: Nic Crane Signed-off-by: Nic Crane --- r/R/install-arrow.R | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/r/R/install-arrow.R b/r/R/install-arrow.R index 7017d4f39b876..715478f6d0542 100644 --- a/r/R/install-arrow.R +++ b/r/R/install-arrow.R @@ -215,19 +215,25 @@ create_package_with_all_dependencies <- function(dest_file = NULL, source_file = untar_dir <- tempfile() on.exit(unlink(untar_dir, recursive = TRUE), add = TRUE) utils::untar(source_file, exdir = untar_dir) - tools_dir <- file.path(untar_dir, "arrow/tools") + tools_dir <- file.path(normalizePath(untar_dir, winslash = "/"), "arrow/tools") download_dependencies_sh <- file.path(tools_dir, "download_dependencies_R.sh") # If you change this path, also need to edit nixlibs.R download_dir <- file.path(tools_dir, "thirdparty_dependencies") dir.create(download_dir) download_script <- tempfile(fileext = ".R") + + if (isTRUE(Sys.info()["sysname"] == "Windows")) { + download_dependencies_sh <- wslify_path(download_dependencies_sh) + } + parse_versions_success <- system2( "bash", c(download_dependencies_sh, download_dir), stdout = download_script, stderr = FALSE ) == 0 + if (!parse_versions_success) { - stop("Failed to parse versions.txt") + stop(paste("Failed to parse versions.txt; view ", download_script, "for more information", collapse = "")) } # `source` the download_script to use R to download all the dependency bundles source(download_script) @@ -250,3 +256,14 @@ create_package_with_all_dependencies <- function(dest_file = NULL, source_file = } invisible(dest_file) } + +# Convert a Windows path to a WSL path +# e.g. wslify_path("C:/Users/user/AppData/") returns "/mnt/c/Users/user/AppData" +wslify_path <- function(path) { + m <- regexpr("[A-Z]:/", path) + drive_expr <- regmatches(path, m) + drive_letter <- strsplit(drive_expr, ":/")[[1]] + wslified_drive <- paste0("/mnt/", tolower(drive_letter)) + end_path <- strsplit(path, drive_expr)[[1]][-1] + file.path(wslified_drive, end_path) +} From 64acef8c5ce9731c6c73c5be386535e7af3af5f8 Mon Sep 17 00:00:00 2001 From: David Li Date: Mon, 25 Sep 2023 08:45:08 -0400 Subject: [PATCH 57/96] GH-37722: [Java][FlightRPC] Deprecate stateful login methods (#37833) ### Rationale for this change The existence of these interfaces confuses users and leads them to antipatterns. ### What changes are included in this PR? Deprecate (but not for removal) the old interfaces. ### Are these changes tested? N/A ### Are there any user-facing changes? Yes.
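As a purely illustrative sketch (this snippet is not part of the patch), the stateless, token-based flow the new javadoc points to looks roughly like the following; the endpoint, the credentials, and the assumption that the server has basic authentication enabled are all hypothetical:

```java
import java.util.Optional;

import org.apache.arrow.flight.Criteria;
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.grpc.CredentialCallOption;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class BasicAuthExample {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator();
         FlightClient client = FlightClient.builder(
             allocator, Location.forGrpcInsecure("localhost", 8815)).build()) {
      // Handshake once; the server returns a bearer token wrapped in a call option.
      Optional<CredentialCallOption> token =
          client.authenticateBasicToken("user", "password");
      // Attach the token to each call instead of relying on per-connection
      // login state, which is what the deprecated ClientAuthHandler did.
      token.ifPresent(t -> client.listFlights(Criteria.ALL, t));
    }
  }
}
```

The point of the header-based flow is that credentials travel with every call, so it works with stateless and load-balanced deployments.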
* Closes: #37722 Authored-by: David Li Signed-off-by: David Li --- .../org/apache/arrow/flight/auth/ClientAuthHandler.java | 7 +++++++ .../org/apache/arrow/flight/auth/ServerAuthHandler.java | 9 +++++++++ 2 files changed, 16 insertions(+) diff --git a/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ClientAuthHandler.java b/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ClientAuthHandler.java index 985e10aa4dd4b..af7da86e009e6 100644 --- a/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ClientAuthHandler.java +++ b/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ClientAuthHandler.java @@ -19,9 +19,16 @@ import java.util.Iterator; +import org.apache.arrow.flight.FlightClient; + /** * Implement authentication for Flight on the client side. + * + * @deprecated As of 14.0.0. This implements a stateful "login" flow that does not play well with + * distributed or stateless systems. It will not be removed, but should not be used. Instead + * see {@link FlightClient#authenticateBasicToken(String, String)}. */ +@Deprecated public interface ClientAuthHandler { /** * Handle the initial handshake with the server. diff --git a/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ServerAuthHandler.java b/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ServerAuthHandler.java index 3a978b131f26c..378027c9287fe 100644 --- a/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ServerAuthHandler.java +++ b/java/flight/flight-core/src/main/java/org/apache/arrow/flight/auth/ServerAuthHandler.java @@ -20,9 +20,18 @@ import java.util.Iterator; import java.util.Optional; +import org.apache.arrow.flight.FlightServer; +import org.apache.arrow.flight.auth2.CallHeaderAuthenticator; + /** * Interface for Server side authentication handlers. + * + * @deprecated As of 14.0.0. This implements a stateful "login" flow that does not play well with + * distributed or stateless systems. It will not be removed, but should not be used. Instead, + * see {@link FlightServer.Builder#headerAuthenticator(CallHeaderAuthenticator)} + * and {@link CallHeaderAuthenticator}. */ +@Deprecated public interface ServerAuthHandler { /** From e55f912ecbce21ad46043bdffe32712272d6268a Mon Sep 17 00:00:00 2001 From: Curt Hagenlocher Date: Mon, 25 Sep 2023 06:04:18 -0700 Subject: [PATCH 58/96] GH-36795: [C#] Implement support for dense and sparse unions (#36797) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### What changes are included in this PR? Support dense and sparse unions in the C# implementation. Adds Archery support for C# unions. ### Are these changes tested? Yes ### Are there any user-facing changes? Unions are now supported in the C# implementation. **This PR includes breaking changes to public APIs.** The public APIs for the UnionArray and UnionType were changed fairly substantially. As these were previously not implemented properly, the impact of the changes ought to be minimal. The ChunkedArray and Column classes were changed to hold IArrowArrays instead of Arrays. To accommodate this, a constructor was added which may introduce ambiguity in calling code. This could be avoided by changing the overloaded constructor to instead be a factory method. This didn't seem worthwhile but could be reconsidered. The metadata version was finally increased to V5.
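As a purely illustrative sketch against the API surface added in this PR (the field names and values are invented, and the snippet is not taken from the new tests), a dense union can now be constructed like this:

```csharp
using System;
using Apache.Arrow;
using Apache.Arrow.Types;

class DenseUnionExample
{
    static void Main()
    {
        // Union of two children; type code 0 selects the int32 child, 1 the string child.
        var type = new UnionType(
            new[]
            {
                new Field("ints", Int32Type.Default, nullable: true),
                new Field("strings", StringType.Default, nullable: true),
            },
            new[] { 0, 1 },
            UnionMode.Dense);

        IArrowArray ints = new Int32Array.Builder().Append(1).Append(3).Build();
        IArrowArray strings = new StringArray.Builder().Append("two").Build();

        // One type id per logical element, plus one offset into the selected child.
        ArrowBuffer typeIds = new ArrowBuffer.Builder<byte>()
            .Append((byte)0).Append((byte)1).Append((byte)0).Build();
        ArrowBuffer offsets = new ArrowBuffer.Builder<int>()
            .Append(0).Append(0).Append(1).Build();

        // Logically the union holds [1, "two", 3].
        var union = new DenseUnionArray(
            type, length: 3, new[] { ints, strings }, typeIds, offsets);
        Console.WriteLine(union.Length); // 3
    }
}
```

The dense layout stores an offset into the selected child for every element, while the sparse layout keeps all children at full length and needs only the type-ids buffer.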
* Closes: #36795 Authored-by: Curt Hagenlocher Signed-off-by: David Li --- csharp/src/Apache.Arrow/Arrays/Array.cs | 13 +-- .../Arrays/ArrayDataConcatenator.cs | 62 ++++++++++- .../Arrays/ArrayDataTypeComparer.cs | 12 +- .../Apache.Arrow/Arrays/ArrowArrayFactory.cs | 16 ++- .../Apache.Arrow/Arrays/DenseUnionArray.cs | 52 +++++++++ .../Arrays/PrimitiveArrayBuilder.cs | 3 + .../Apache.Arrow/Arrays/SparseUnionArray.cs | 46 ++++++++ csharp/src/Apache.Arrow/Arrays/UnionArray.cs | 77 ++++++++++--- .../src/Apache.Arrow/C/CArrowArrayImporter.cs | 38 +++++++ .../Apache.Arrow/C/CArrowSchemaExporter.cs | 18 +++ .../Apache.Arrow/C/CArrowSchemaImporter.cs | 56 +++++++--- csharp/src/Apache.Arrow/ChunkedArray.cs | 30 +++-- csharp/src/Apache.Arrow/Column.cs | 24 ++-- .../Extensions/FlatbufExtensions.cs | 10 ++ .../Apache.Arrow/Interfaces/IArrowArray.cs | 4 - .../Ipc/ArrowReaderImplementation.cs | 75 ++++++++----- .../src/Apache.Arrow/Ipc/ArrowStreamWriter.cs | 19 +++- .../Ipc/ArrowTypeFlatbufferBuilder.cs | 14 ++- .../src/Apache.Arrow/Ipc/MessageSerializer.cs | 4 + csharp/src/Apache.Arrow/Table.cs | 4 +- csharp/src/Apache.Arrow/Types/UnionType.cs | 11 +- .../IntegrationCommand.cs | 63 ++++++++++- .../Apache.Arrow.IntegrationTest/JsonFile.cs | 4 + .../Apache.Arrow.Tests/ArrayTypeComparer.cs | 19 +++- .../ArrowArrayConcatenatorTests.cs | 104 +++++++++++++++++- .../Apache.Arrow.Tests/ArrowReaderVerifier.cs | 19 ++++ .../CDataInterfacePythonTests.cs | 36 ++++-- csharp/test/Apache.Arrow.Tests/ColumnTests.cs | 2 +- csharp/test/Apache.Arrow.Tests/TableTests.cs | 10 +- csharp/test/Apache.Arrow.Tests/TestData.cs | 64 +++++++++++ dev/archery/archery/integration/datagen.py | 3 +- docs/source/status.rst | 4 +- 32 files changed, 797 insertions(+), 119 deletions(-) create mode 100644 csharp/src/Apache.Arrow/Arrays/DenseUnionArray.cs create mode 100644 csharp/src/Apache.Arrow/Arrays/SparseUnionArray.cs diff --git a/csharp/src/Apache.Arrow/Arrays/Array.cs b/csharp/src/Apache.Arrow/Arrays/Array.cs index a453b0807267f..0838134b19c6d 100644 --- a/csharp/src/Apache.Arrow/Arrays/Array.cs +++ b/csharp/src/Apache.Arrow/Arrays/Array.cs @@ -62,16 +62,7 @@ internal static void Accept(T array, IArrowArrayVisitor visitor) public Array Slice(int offset, int length) { - if (offset > Length) - { - throw new ArgumentException($"Offset {offset} cannot be greater than Length {Length} for Array.Slice"); - } - - length = Math.Min(Data.Length - offset, length); - offset += Data.Offset; - - ArrayData newData = Data.Slice(offset, length); - return ArrowArrayFactory.BuildArray(newData) as Array; + return ArrowArrayFactory.Slice(this, offset, length) as Array; } public void Dispose() @@ -88,4 +79,4 @@ protected virtual void Dispose(bool disposing) } } } -} \ No newline at end of file +} diff --git a/csharp/src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs b/csharp/src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs index 8859ecd7f05b9..806defdc7ce66 100644 --- a/csharp/src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs +++ b/csharp/src/Apache.Arrow/Arrays/ArrayDataConcatenator.cs @@ -49,7 +49,8 @@ private class ArrayDataConcatenationVisitor : IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, - IArrowTypeVisitor + IArrowTypeVisitor, + IArrowTypeVisitor { public ArrayData Result { get; private set; } private readonly IReadOnlyList _arrayDataList; @@ -123,6 +124,33 @@ public void Visit(StructType type) Result = new ArrayData(type, _arrayDataList[0].Length, _arrayDataList[0].NullCount, 0, _arrayDataList[0].Buffers, children); } + public void 
Visit(UnionType type) + { + int bufferCount = type.Mode switch + { + UnionMode.Sparse => 1, + UnionMode.Dense => 2, + _ => throw new InvalidOperationException($"Unsupported union mode for concatenation: {type.Mode}"), + }; + + CheckData(type, bufferCount); + List children = new List(type.Fields.Count); + + for (int i = 0; i < type.Fields.Count; i++) + { + children.Add(Concatenate(SelectChildren(i), _allocator)); + } + + ArrowBuffer[] buffers = new ArrowBuffer[bufferCount]; + buffers[0] = ConcatenateUnionTypeBuffer(); + if (bufferCount > 1) + { + buffers[1] = ConcatenateUnionOffsetBuffer(); + } + + Result = new ArrayData(type, _totalLength, _totalNullCount, 0, buffers, children); + } + public void Visit(IArrowType type) { throw new NotImplementedException($"Concatenation for {type.Name} is not supported yet."); @@ -231,6 +259,38 @@ private ArrowBuffer ConcatenateOffsetBuffer() return builder.Build(_allocator); } + private ArrowBuffer ConcatenateUnionTypeBuffer() + { + var builder = new ArrowBuffer.Builder(_totalLength); + + foreach (ArrayData arrayData in _arrayDataList) + { + builder.Append(arrayData.Buffers[0]); + } + + return builder.Build(_allocator); + } + + private ArrowBuffer ConcatenateUnionOffsetBuffer() + { + var builder = new ArrowBuffer.Builder(_totalLength); + int baseOffset = 0; + + foreach (ArrayData arrayData in _arrayDataList) + { + ReadOnlySpan span = arrayData.Buffers[1].Span.CastTo(); + foreach (int offset in span) + { + builder.Append(baseOffset + offset); + } + + // The next offset must start from the current last offset. + baseOffset += span[arrayData.Length]; + } + + return builder.Build(_allocator); + } + private List SelectChildren(int index) { var children = new List(_arrayDataList.Count); diff --git a/csharp/src/Apache.Arrow/Arrays/ArrayDataTypeComparer.cs b/csharp/src/Apache.Arrow/Arrays/ArrayDataTypeComparer.cs index 8a6bfed29abb6..6b54ec1edb573 100644 --- a/csharp/src/Apache.Arrow/Arrays/ArrayDataTypeComparer.cs +++ b/csharp/src/Apache.Arrow/Arrays/ArrayDataTypeComparer.cs @@ -27,7 +27,8 @@ internal sealed class ArrayDataTypeComparer : IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, - IArrowTypeVisitor + IArrowTypeVisitor, + IArrowTypeVisitor { private readonly IArrowType _expectedType; private bool _dataTypeMatch; @@ -122,6 +123,15 @@ public void Visit(StructType actualType) } } + public void Visit(UnionType actualType) + { + if (_expectedType is UnionType expectedType + && CompareNested(expectedType, actualType)) + { + _dataTypeMatch = true; + } + } + private static bool CompareNested(NestedType expectedType, NestedType actualType) { if (expectedType.Fields.Count != actualType.Fields.Count) diff --git a/csharp/src/Apache.Arrow/Arrays/ArrowArrayFactory.cs b/csharp/src/Apache.Arrow/Arrays/ArrowArrayFactory.cs index f82037bff47b1..aa407203d1858 100644 --- a/csharp/src/Apache.Arrow/Arrays/ArrowArrayFactory.cs +++ b/csharp/src/Apache.Arrow/Arrays/ArrowArrayFactory.cs @@ -62,7 +62,7 @@ public static IArrowArray BuildArray(ArrayData data) case ArrowTypeId.Struct: return new StructArray(data); case ArrowTypeId.Union: - return new UnionArray(data); + return UnionArray.Create(data); case ArrowTypeId.Date64: return new Date64Array(data); case ArrowTypeId.Date32: @@ -91,5 +91,19 @@ public static IArrowArray BuildArray(ArrayData data) throw new NotSupportedException($"An ArrowArray cannot be built for type {data.DataType.TypeId}."); } } + + public static IArrowArray Slice(IArrowArray array, int offset, int length) + { + if (offset > array.Length) + { + throw new ArgumentException($"Offset {offset} cannot be 
greater than Length {array.Length} for Array.Slice"); + } + + length = Math.Min(array.Data.Length - offset, length); + offset += array.Data.Offset; + + ArrayData newData = array.Data.Slice(offset, length); + return BuildArray(newData); + } } } diff --git a/csharp/src/Apache.Arrow/Arrays/DenseUnionArray.cs b/csharp/src/Apache.Arrow/Arrays/DenseUnionArray.cs new file mode 100644 index 0000000000000..1aacbe11f08b9 --- /dev/null +++ b/csharp/src/Apache.Arrow/Arrays/DenseUnionArray.cs @@ -0,0 +1,52 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +using Apache.Arrow.Types; +using System; +using System.Collections.Generic; +using System.Linq; + +namespace Apache.Arrow +{ + public class DenseUnionArray : UnionArray + { + public ArrowBuffer ValueOffsetBuffer => Data.Buffers[1]; + + public ReadOnlySpan ValueOffsets => ValueOffsetBuffer.Span.CastTo(); + + public DenseUnionArray( + IArrowType dataType, + int length, + IEnumerable children, + ArrowBuffer typeIds, + ArrowBuffer valuesOffsetBuffer, + int nullCount = 0, + int offset = 0) + : base(new ArrayData( + dataType, length, nullCount, offset, new[] { typeIds, valuesOffsetBuffer }, + children.Select(child => child.Data))) + { + _fields = children.ToArray(); + ValidateMode(UnionMode.Dense, Type.Mode); + } + + public DenseUnionArray(ArrayData data) + : base(data) + { + ValidateMode(UnionMode.Dense, Type.Mode); + data.EnsureBufferCount(2); + } + } +} diff --git a/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs b/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs index a50d4b52c3257..67fe46633c18f 100644 --- a/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs +++ b/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs @@ -137,6 +137,9 @@ public TBuilder Append(T value) return Instance; } + public TBuilder Append(T? value) => + (value == null) ? AppendNull() : Append(value.Value); + public TBuilder Append(ReadOnlySpan span) { int len = ValueBuffer.Length; diff --git a/csharp/src/Apache.Arrow/Arrays/SparseUnionArray.cs b/csharp/src/Apache.Arrow/Arrays/SparseUnionArray.cs new file mode 100644 index 0000000000000..b79c44c979e47 --- /dev/null +++ b/csharp/src/Apache.Arrow/Arrays/SparseUnionArray.cs @@ -0,0 +1,46 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +using Apache.Arrow.Types; +using System.Collections.Generic; +using System.Linq; + +namespace Apache.Arrow +{ + public class SparseUnionArray : UnionArray + { + public SparseUnionArray( + IArrowType dataType, + int length, + IEnumerable children, + ArrowBuffer typeIds, + int nullCount = 0, + int offset = 0) + : base(new ArrayData( + dataType, length, nullCount, offset, new[] { typeIds }, + children.Select(child => child.Data))) + { + _fields = children.ToArray(); + ValidateMode(UnionMode.Sparse, Type.Mode); + } + + public SparseUnionArray(ArrayData data) + : base(data) + { + ValidateMode(UnionMode.Sparse, Type.Mode); + data.EnsureBufferCount(1); + } + } +} diff --git a/csharp/src/Apache.Arrow/Arrays/UnionArray.cs b/csharp/src/Apache.Arrow/Arrays/UnionArray.cs index 8bccea2b59e31..0a7ae288fd0c5 100644 --- a/csharp/src/Apache.Arrow/Arrays/UnionArray.cs +++ b/csharp/src/Apache.Arrow/Arrays/UnionArray.cs @@ -15,37 +15,88 @@ using Apache.Arrow.Types; using System; +using System.Collections.Generic; +using System.Threading; namespace Apache.Arrow { - public class UnionArray: Array + public abstract class UnionArray : IArrowArray { - public UnionType Type => Data.DataType as UnionType; + protected IReadOnlyList _fields; - public UnionMode Mode => Type.Mode; + public IReadOnlyList Fields => + LazyInitializer.EnsureInitialized(ref _fields, () => InitializeFields()); + + public ArrayData Data { get; } - public ArrowBuffer TypeBuffer => Data.Buffers[1]; + public UnionType Type => (UnionType)Data.DataType; - public ArrowBuffer ValueOffsetBuffer => Data.Buffers[2]; + public UnionMode Mode => Type.Mode; + + public ArrowBuffer TypeBuffer => Data.Buffers[0]; public ReadOnlySpan TypeIds => TypeBuffer.Span; - public ReadOnlySpan ValueOffsets => ValueOffsetBuffer.Span.CastTo().Slice(0, Length + 1); + public int Length => Data.Length; + + public int Offset => Data.Offset; - public UnionArray(ArrayData data) - : base(data) + public int NullCount => Data.NullCount; + + public bool IsValid(int index) => NullCount == 0 || Fields[TypeIds[index]].IsValid(index); + + public bool IsNull(int index) => !IsValid(index); + + protected UnionArray(ArrayData data) { + Data = data; data.EnsureDataType(ArrowTypeId.Union); - data.EnsureBufferCount(3); } - public IArrowArray GetChild(int index) + public static UnionArray Create(ArrayData data) { - // TODO: Implement - throw new NotImplementedException(); + return ((UnionType)data.DataType).Mode switch + { + UnionMode.Dense => new DenseUnionArray(data), + UnionMode.Sparse => new SparseUnionArray(data), + _ => throw new InvalidOperationException("unknown union mode in array creation") + }; } - public override void Accept(IArrowArrayVisitor visitor) => Accept(this, visitor); + public void Accept(IArrowArrayVisitor visitor) => Array.Accept(this, visitor); + public void Dispose() + { + Dispose(true); + GC.SuppressFinalize(this); + } + + protected virtual void Dispose(bool disposing) + { + if (disposing) + { + Data.Dispose(); + } + } + + protected static void ValidateMode(UnionMode expected, UnionMode actual) + { + if (expected != actual) + { + throw new 
ArgumentException( + $"Specified union mode <{actual}> does not match expected mode <{expected}>", + "Mode"); + } + } + + private IReadOnlyList InitializeFields() + { + IArrowArray[] result = new IArrowArray[Data.Children.Length]; + for (int i = 0; i < Data.Children.Length; i++) + { + result[i] = ArrowArrayFactory.BuildArray(Data.Children[i]); + } + return result; + } } } diff --git a/csharp/src/Apache.Arrow/C/CArrowArrayImporter.cs b/csharp/src/Apache.Arrow/C/CArrowArrayImporter.cs index 9b7bcb7abe5a5..da1b0f31b8f08 100644 --- a/csharp/src/Apache.Arrow/C/CArrowArrayImporter.cs +++ b/csharp/src/Apache.Arrow/C/CArrowArrayImporter.cs @@ -170,6 +170,15 @@ private ArrayData GetAsArrayData(CArrowArray* cArray, IArrowType type) buffers = new ArrowBuffer[] { ImportValidityBuffer(cArray) }; break; case ArrowTypeId.Union: + UnionType unionType = (UnionType)type; + children = ProcessStructChildren(cArray, unionType.Fields); + buffers = unionType.Mode switch + { + UnionMode.Dense => ImportDenseUnionBuffers(cArray), + UnionMode.Sparse => ImportSparseUnionBuffers(cArray), + _ => throw new InvalidOperationException("unknown union mode in import") + }; + break; case ArrowTypeId.Map: break; case ArrowTypeId.Null: @@ -286,6 +295,35 @@ private ArrowBuffer[] ImportFixedSizeListBuffers(CArrowArray* cArray) return buffers; } + private ArrowBuffer[] ImportDenseUnionBuffers(CArrowArray* cArray) + { + if (cArray->n_buffers != 2) + { + throw new InvalidOperationException("Dense union arrays are expected to have exactly two buffers"); + } + int length = checked((int)cArray->length); + int offsetsLength = length * 4; + + ArrowBuffer[] buffers = new ArrowBuffer[2]; + buffers[0] = new ArrowBuffer(AddMemory((IntPtr)cArray->buffers[0], 0, length)); + buffers[1] = new ArrowBuffer(AddMemory((IntPtr)cArray->buffers[1], 0, offsetsLength)); + + return buffers; + } + + private ArrowBuffer[] ImportSparseUnionBuffers(CArrowArray* cArray) + { + if (cArray->n_buffers != 1) + { + throw new InvalidOperationException("Sparse union arrays are expected to have exactly one buffer"); + } + + ArrowBuffer[] buffers = new ArrowBuffer[1]; + buffers[0] = new ArrowBuffer(AddMemory((IntPtr)cArray->buffers[0], 0, checked((int)cArray->length))); + + return buffers; + } + private ArrowBuffer[] ImportFixedWidthBuffers(CArrowArray* cArray, int bitWidth) { if (cArray->n_buffers != 2) diff --git a/csharp/src/Apache.Arrow/C/CArrowSchemaExporter.cs b/csharp/src/Apache.Arrow/C/CArrowSchemaExporter.cs index 66142da331ac8..c1a12362a942a 100644 --- a/csharp/src/Apache.Arrow/C/CArrowSchemaExporter.cs +++ b/csharp/src/Apache.Arrow/C/CArrowSchemaExporter.cs @@ -124,6 +124,23 @@ public static unsafe void ExportSchema(Schema schema, CArrowSchema* out_schema) _ => throw new InvalidDataException($"Unsupported time unit for export: {unit}"), }; + private static string FormatUnion(UnionType unionType) + { + StringBuilder builder = new StringBuilder(); + builder.Append(unionType.Mode switch + { + UnionMode.Sparse => "+us:", + UnionMode.Dense => "+ud:", + _ => throw new InvalidDataException($"Unsupported union mode for export: {unionType.Mode}"), + }); + for (int i = 0; i < unionType.TypeIds.Length; i++) + { + if (i > 0) { builder.Append(','); } + builder.Append(unionType.TypeIds[i]); + } + return builder.ToString(); + } + private static string GetFormat(IArrowType datatype) { switch (datatype) @@ -170,6 +187,7 @@ private static string GetFormat(IArrowType datatype) case FixedSizeListType fixedListType: return $"+w:{fixedListType.ListSize}"; + case StructType _: 
return "+s"; + case UnionType u: return FormatUnion(u); // Dictionary case DictionaryType dictionaryType: return GetFormat(dictionaryType.IndexType); diff --git a/csharp/src/Apache.Arrow/C/CArrowSchemaImporter.cs b/csharp/src/Apache.Arrow/C/CArrowSchemaImporter.cs index 2a750d5e8250d..42c8cdd5ef548 100644 --- a/csharp/src/Apache.Arrow/C/CArrowSchemaImporter.cs +++ b/csharp/src/Apache.Arrow/C/CArrowSchemaImporter.cs @@ -184,21 +184,7 @@ public ArrowType GetAsType() } else if (format == "+s") { - var child_schemas = new ImportedArrowSchema[_cSchema->n_children]; - - for (int i = 0; i < _cSchema->n_children; i++) - { - if (_cSchema->GetChild(i) == null) - { - throw new InvalidDataException("Expected struct type child to be non-null."); - } - child_schemas[i] = new ImportedArrowSchema(_cSchema->GetChild(i), isRoot: false); - } - - - List childFields = child_schemas.Select(schema => schema.GetAsField()).ToList(); - - return new StructType(childFields); + return new StructType(ParseChildren("struct")); } else if (format.StartsWith("+w:")) { @@ -265,6 +251,30 @@ public ArrowType GetAsType() return new FixedSizeBinaryType(width); } + // Unions + if (format.StartsWith("+ud:") || format.StartsWith("+us:")) + { + UnionMode unionMode = format[2] == 'd' ? UnionMode.Dense : UnionMode.Sparse; + List typeIds = new List(); + int pos = 4; + do + { + int next = format.IndexOf(',', pos); + if (next < 0) { next = format.Length; } + + int code; + if (!int.TryParse(format.Substring(pos, next - pos), out code)) + { + throw new InvalidDataException($"Invalid type code for union import: {format.Substring(pos, next - pos)}"); + } + typeIds.Add(code); + + pos = next + 1; + } while (pos < format.Length); + + return new UnionType(ParseChildren("union"), typeIds, unionMode); + } + return format switch { // Primitives @@ -324,6 +334,22 @@ public Schema GetAsSchema() } } + private List ParseChildren(string typeName) + { + var child_schemas = new ImportedArrowSchema[_cSchema->n_children]; + + for (int i = 0; i < _cSchema->n_children; i++) + { + if (_cSchema->GetChild(i) == null) + { + throw new InvalidDataException($"Expected {typeName} type child to be non-null."); + } + child_schemas[i] = new ImportedArrowSchema(_cSchema->GetChild(i), isRoot: false); + } + + return child_schemas.Select(schema => schema.GetAsField()).ToList(); + } + private unsafe static IReadOnlyDictionary GetMetadata(byte* metadata) { if (metadata == null) diff --git a/csharp/src/Apache.Arrow/ChunkedArray.cs b/csharp/src/Apache.Arrow/ChunkedArray.cs index 5f25acfe04a2f..f5909f5adfe48 100644 --- a/csharp/src/Apache.Arrow/ChunkedArray.cs +++ b/csharp/src/Apache.Arrow/ChunkedArray.cs @@ -15,7 +15,6 @@ using System; using System.Collections.Generic; -using Apache.Arrow; using Apache.Arrow.Types; namespace Apache.Arrow @@ -25,7 +24,7 @@ namespace Apache.Arrow /// public class ChunkedArray { - private IList Arrays { get; } + private IList Arrays { get; } public IArrowType DataType { get; } public long Length { get; } public long NullCount { get; } @@ -35,9 +34,16 @@ public int ArrayCount get => Arrays.Count; } - public Array Array(int index) => Arrays[index]; + public Array Array(int index) => Arrays[index] as Array; + + public IArrowArray ArrowArray(int index) => Arrays[index]; public ChunkedArray(IList arrays) + : this(Cast(arrays)) + { + } + + public ChunkedArray(IList arrays) { Arrays = arrays ?? 
throw new ArgumentNullException(nameof(arrays)); if (arrays.Count < 1) @@ -45,14 +51,14 @@ public ChunkedArray(IList arrays) throw new ArgumentException($"Count must be at least 1. Got {arrays.Count} instead"); } DataType = arrays[0].Data.DataType; - foreach (Array array in arrays) + foreach (IArrowArray array in arrays) { Length += array.Length; NullCount += array.NullCount; } } - public ChunkedArray(Array array) : this(new[] { array }) { } + public ChunkedArray(Array array) : this(new IArrowArray[] { array }) { } public ChunkedArray Slice(long offset, long length) { @@ -69,10 +75,10 @@ public ChunkedArray Slice(long offset, long length) curArrayIndex++; } - IList newArrays = new List(); + IList newArrays = new List(); while (curArrayIndex < numArrays && length > 0) { - newArrays.Add(Arrays[curArrayIndex].Slice((int)offset, + newArrays.Add(ArrowArrayFactory.Slice(Arrays[curArrayIndex], (int)offset, length > Arrays[curArrayIndex].Length ? Arrays[curArrayIndex].Length : (int)length)); length -= Arrays[curArrayIndex].Length - offset; offset = 0; @@ -86,6 +92,16 @@ public ChunkedArray Slice(long offset) return Slice(offset, Length - offset); } + private static IArrowArray[] Cast(IList arrays) + { + IArrowArray[] arrowArrays = new IArrowArray[arrays.Count]; + for (int i = 0; i < arrays.Count; i++) + { + arrowArrays[i] = arrays[i]; + } + return arrowArrays; + } + // TODO: Flatten for Structs } } diff --git a/csharp/src/Apache.Arrow/Column.cs b/csharp/src/Apache.Arrow/Column.cs index 4eaf9a559e75d..0709b9142cafd 100644 --- a/csharp/src/Apache.Arrow/Column.cs +++ b/csharp/src/Apache.Arrow/Column.cs @@ -28,19 +28,23 @@ public class Column public ChunkedArray Data { get; } public Column(Field field, IList arrays) + : this(field, new ChunkedArray(arrays), doValidation: true) + { + } + + public Column(Field field, IList arrays) + : this(field, new ChunkedArray(arrays), doValidation: true) { - Data = new ChunkedArray(arrays); - Field = field; - if (!ValidateArrayDataTypes()) - { - throw new ArgumentException($"{Field.DataType} must match {Data.DataType}"); - } } - private Column(Field field, ChunkedArray arrays) + private Column(Field field, ChunkedArray data, bool doValidation = false) { + Data = data; Field = field; - Data = arrays; + if (doValidation && !ValidateArrayDataTypes()) + { + throw new ArgumentException($"{Field.DataType} must match {Data.DataType}"); + } } public long Length => Data.Length; @@ -64,12 +68,12 @@ private bool ValidateArrayDataTypes() for (int i = 0; i < Data.ArrayCount; i++) { - if (Data.Array(i).Data.DataType.TypeId != Field.DataType.TypeId) + if (Data.ArrowArray(i).Data.DataType.TypeId != Field.DataType.TypeId) { return false; } - Data.Array(i).Data.DataType.Accept(dataTypeComparer); + Data.ArrowArray(i).Data.DataType.Accept(dataTypeComparer); if (!dataTypeComparer.DataTypeMatch) { diff --git a/csharp/src/Apache.Arrow/Extensions/FlatbufExtensions.cs b/csharp/src/Apache.Arrow/Extensions/FlatbufExtensions.cs index d2a70bca9e4ec..35c5b3e55157d 100644 --- a/csharp/src/Apache.Arrow/Extensions/FlatbufExtensions.cs +++ b/csharp/src/Apache.Arrow/Extensions/FlatbufExtensions.cs @@ -80,6 +80,16 @@ public static Types.TimeUnit ToArrow(this Flatbuf.TimeUnit unit) throw new ArgumentException($"Unexpected Flatbuf TimeUnit", nameof(unit)); } } + + public static Types.UnionMode ToArrow(this Flatbuf.UnionMode mode) + { + return mode switch + { + Flatbuf.UnionMode.Dense => Types.UnionMode.Dense, + Flatbuf.UnionMode.Sparse => Types.UnionMode.Sparse, + _ => throw new 
ArgumentException($"Unsupported Flatbuf UnionMode", nameof(mode)), + }; + } } } diff --git a/csharp/src/Apache.Arrow/Interfaces/IArrowArray.cs b/csharp/src/Apache.Arrow/Interfaces/IArrowArray.cs index 50fbc3af6dd72..9bcee36ef4eaf 100644 --- a/csharp/src/Apache.Arrow/Interfaces/IArrowArray.cs +++ b/csharp/src/Apache.Arrow/Interfaces/IArrowArray.cs @@ -32,9 +32,5 @@ public interface IArrowArray : IDisposable ArrayData Data { get; } void Accept(IArrowArrayVisitor visitor); - - //IArrowArray Slice(int offset); - - //IArrowArray Slice(int offset, int length); } } diff --git a/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs b/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs index c9c1b21673316..d3115da52cc6c 100644 --- a/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs +++ b/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs @@ -116,11 +116,11 @@ protected RecordBatch CreateArrowObjectFromMessage( break; case Flatbuf.MessageHeader.DictionaryBatch: Flatbuf.DictionaryBatch dictionaryBatch = message.Header().Value; - ReadDictionaryBatch(dictionaryBatch, bodyByteBuffer, memoryOwner); + ReadDictionaryBatch(message.Version, dictionaryBatch, bodyByteBuffer, memoryOwner); break; case Flatbuf.MessageHeader.RecordBatch: Flatbuf.RecordBatch rb = message.Header().Value; - List arrays = BuildArrays(Schema, bodyByteBuffer, rb); + List arrays = BuildArrays(message.Version, Schema, bodyByteBuffer, rb); return new RecordBatch(Schema, memoryOwner, arrays, (int)rb.Length); default: // NOTE: Skip unsupported message type @@ -136,7 +136,11 @@ internal static ByteBuffer CreateByteBuffer(ReadOnlyMemory buffer) return new ByteBuffer(new ReadOnlyMemoryBufferAllocator(buffer), 0); } - private void ReadDictionaryBatch(Flatbuf.DictionaryBatch dictionaryBatch, ByteBuffer bodyByteBuffer, IMemoryOwner memoryOwner) + private void ReadDictionaryBatch( + MetadataVersion version, + Flatbuf.DictionaryBatch dictionaryBatch, + ByteBuffer bodyByteBuffer, + IMemoryOwner memoryOwner) { long id = dictionaryBatch.Id; IArrowType valueType = DictionaryMemo.GetDictionaryType(id); @@ -149,7 +153,7 @@ private void ReadDictionaryBatch(Flatbuf.DictionaryBatch dictionaryBatch, ByteBu Field valueField = new Field("dummy", valueType, true); var schema = new Schema(new[] { valueField }, default); - IList arrays = BuildArrays(schema, bodyByteBuffer, recordBatch.Value); + IList arrays = BuildArrays(version, schema, bodyByteBuffer, recordBatch.Value); if (arrays.Count != 1) { @@ -167,6 +171,7 @@ private void ReadDictionaryBatch(Flatbuf.DictionaryBatch dictionaryBatch, ByteBu } private List BuildArrays( + MetadataVersion version, Schema schema, ByteBuffer messageBuffer, Flatbuf.RecordBatch recordBatchMessage) @@ -187,8 +192,8 @@ private List BuildArrays( Flatbuf.FieldNode fieldNode = recordBatchEnumerator.CurrentNode; ArrayData arrayData = field.DataType.IsFixedPrimitive() - ? LoadPrimitiveField(ref recordBatchEnumerator, field, in fieldNode, messageBuffer, bufferCreator) - : LoadVariableField(ref recordBatchEnumerator, field, in fieldNode, messageBuffer, bufferCreator); + ? LoadPrimitiveField(version, ref recordBatchEnumerator, field, in fieldNode, messageBuffer, bufferCreator) + : LoadVariableField(version, ref recordBatchEnumerator, field, in fieldNode, messageBuffer, bufferCreator); arrays.Add(ArrowArrayFactory.BuildArray(arrayData)); } while (recordBatchEnumerator.MoveNextNode()); @@ -225,6 +230,7 @@ private IBufferCreator GetBufferCreator(BodyCompression? 
compression) } private ArrayData LoadPrimitiveField( + MetadataVersion version, ref RecordBatchEnumerator recordBatchEnumerator, Field field, in Flatbuf.FieldNode fieldNode, @@ -245,31 +251,44 @@ private ArrayData LoadPrimitiveField( throw new InvalidDataException("Null count length must be >= 0"); // TODO:Localize exception message } - if (field.DataType.TypeId == ArrowTypeId.Null) + int buffers; + switch (field.DataType.TypeId) { - return new ArrayData(field.DataType, fieldLength, fieldNullCount, 0, System.Array.Empty()); - } - - ArrowBuffer nullArrowBuffer = BuildArrowBuffer(bodyData, recordBatchEnumerator.CurrentBuffer, bufferCreator); - if (!recordBatchEnumerator.MoveNextBuffer()) - { - throw new Exception("Unable to move to the next buffer."); + case ArrowTypeId.Null: + return new ArrayData(field.DataType, fieldLength, fieldNullCount, 0, System.Array.Empty()); + case ArrowTypeId.Union: + if (version < MetadataVersion.V5) + { + if (fieldNullCount > 0) + { + if (recordBatchEnumerator.CurrentBuffer.Length > 0) + { + // With older metadata we can get a validity bitmap. Fixing up union data is hard, + // so we will just quit. + throw new NotSupportedException("Cannot read pre-1.0.0 Union array with top-level validity bitmap"); + } + } + recordBatchEnumerator.MoveNextBuffer(); + } + buffers = ((UnionType)field.DataType).Mode == Types.UnionMode.Dense ? 2 : 1; + break; + case ArrowTypeId.Struct: + case ArrowTypeId.FixedSizeList: + buffers = 1; + break; + default: + buffers = 2; + break; } - ArrowBuffer[] arrowBuff; - if (field.DataType.TypeId == ArrowTypeId.Struct || field.DataType.TypeId == ArrowTypeId.FixedSizeList) + ArrowBuffer[] arrowBuff = new ArrowBuffer[buffers]; + for (int i = 0; i < buffers; i++) { - arrowBuff = new[] { nullArrowBuffer }; - } - else - { - ArrowBuffer valueArrowBuffer = BuildArrowBuffer(bodyData, recordBatchEnumerator.CurrentBuffer, bufferCreator); + arrowBuff[i] = BuildArrowBuffer(bodyData, recordBatchEnumerator.CurrentBuffer, bufferCreator); recordBatchEnumerator.MoveNextBuffer(); - - arrowBuff = new[] { nullArrowBuffer, valueArrowBuffer }; } - ArrayData[] children = GetChildren(ref recordBatchEnumerator, field, bodyData, bufferCreator); + ArrayData[] children = GetChildren(version, ref recordBatchEnumerator, field, bodyData, bufferCreator); IArrowArray dictionary = null; if (field.DataType.TypeId == ArrowTypeId.Dictionary) @@ -282,6 +301,7 @@ private ArrayData LoadPrimitiveField( } private ArrayData LoadVariableField( + MetadataVersion version, ref RecordBatchEnumerator recordBatchEnumerator, Field field, in Flatbuf.FieldNode fieldNode, @@ -316,7 +336,7 @@ private ArrayData LoadVariableField( } ArrowBuffer[] arrowBuff = new[] { nullArrowBuffer, offsetArrowBuffer, valueArrowBuffer }; - ArrayData[] children = GetChildren(ref recordBatchEnumerator, field, bodyData, bufferCreator); + ArrayData[] children = GetChildren(version, ref recordBatchEnumerator, field, bodyData, bufferCreator); IArrowArray dictionary = null; if (field.DataType.TypeId == ArrowTypeId.Dictionary) @@ -329,6 +349,7 @@ private ArrayData LoadVariableField( } private ArrayData[] GetChildren( + MetadataVersion version, ref RecordBatchEnumerator recordBatchEnumerator, Field field, ByteBuffer bodyData, @@ -345,8 +366,8 @@ private ArrayData[] GetChildren( Field childField = type.Fields[index]; ArrayData child = childField.DataType.IsFixedPrimitive() - ? 
LoadPrimitiveField(ref recordBatchEnumerator, childField, in childFieldNode, bodyData, bufferCreator) - : LoadVariableField(ref recordBatchEnumerator, childField, in childFieldNode, bodyData, bufferCreator); + ? LoadPrimitiveField(version, ref recordBatchEnumerator, childField, in childFieldNode, bodyData, bufferCreator) + : LoadVariableField(version, ref recordBatchEnumerator, childField, in childFieldNode, bodyData, bufferCreator); children[index] = child; } diff --git a/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs b/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs index a5d8db3f509d7..2b3815af71142 100644 --- a/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs +++ b/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs @@ -55,6 +55,7 @@ internal class ArrowRecordBatchFlatBufferBuilder : IArrowArrayVisitor, IArrowArrayVisitor, IArrowArrayVisitor, + IArrowArrayVisitor, IArrowArrayVisitor, IArrowArrayVisitor, IArrowArrayVisitor, @@ -156,6 +157,22 @@ public void Visit(StructArray array) } } + public void Visit(UnionArray array) + { + _buffers.Add(CreateBuffer(array.TypeBuffer)); + + ArrowBuffer? offsets = (array as DenseUnionArray)?.ValueOffsetBuffer; + if (offsets != null) + { + _buffers.Add(CreateBuffer(offsets.Value)); + } + + for (int i = 0; i < array.Fields.Count; i++) + { + array.Fields[i].Accept(this); + } + } + public void Visit(DictionaryArray array) { // Dictionary is serialized separately in Dictionary serialization. @@ -218,7 +235,7 @@ public void Visit(IArrowArray array) private readonly bool _leaveOpen; private readonly IpcOptions _options; - private protected const Flatbuf.MetadataVersion CurrentMetadataVersion = Flatbuf.MetadataVersion.V4; + private protected const Flatbuf.MetadataVersion CurrentMetadataVersion = Flatbuf.MetadataVersion.V5; private static readonly byte[] s_padding = new byte[64]; diff --git a/csharp/src/Apache.Arrow/Ipc/ArrowTypeFlatbufferBuilder.cs b/csharp/src/Apache.Arrow/Ipc/ArrowTypeFlatbufferBuilder.cs index 203aa72d93ea3..b11467538dd04 100644 --- a/csharp/src/Apache.Arrow/Ipc/ArrowTypeFlatbufferBuilder.cs +++ b/csharp/src/Apache.Arrow/Ipc/ArrowTypeFlatbufferBuilder.cs @@ -120,7 +120,9 @@ public void Visit(FixedSizeListType type) public void Visit(UnionType type) { - throw new NotImplementedException(); + Result = FieldType.Build( + Flatbuf.Type.Union, + Flatbuf.Union.CreateUnion(Builder, ToFlatBuffer(type.Mode), Flatbuf.Union.CreateTypeIdsVector(Builder, type.TypeIds))); } public void Visit(StringType type) @@ -279,5 +281,15 @@ private static Flatbuf.TimeUnit ToFlatBuffer(TimeUnit unit) return result; } + + private static Flatbuf.UnionMode ToFlatBuffer(Types.UnionMode mode) + { + return mode switch + { + Types.UnionMode.Dense => Flatbuf.UnionMode.Dense, + Types.UnionMode.Sparse => Flatbuf.UnionMode.Sparse, + _ => throw new ArgumentException($"unsupported union mode <{mode}>", nameof(mode)), + }; + } } } diff --git a/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs b/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs index 8ca69b61165bf..6249063ba81f4 100644 --- a/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs +++ b/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs @@ -203,6 +203,10 @@ private static Types.IArrowType GetFieldArrowType(Flatbuf.Field field, Field[] c case Flatbuf.Type.Struct_: Debug.Assert(childFields != null); return new Types.StructType(childFields); + case Flatbuf.Type.Union: + Debug.Assert(childFields != null); + Flatbuf.Union unionMetadata = field.Type().Value; + return new Types.UnionType(childFields, unionMetadata.GetTypeIdsArray(), 
unionMetadata.Mode.ToArrow()); default: throw new InvalidDataException($"Arrow primitive '{field.TypeType}' is unsupported."); } diff --git a/csharp/src/Apache.Arrow/Table.cs b/csharp/src/Apache.Arrow/Table.cs index 0b9f31557bec8..939ec23f54ff2 100644 --- a/csharp/src/Apache.Arrow/Table.cs +++ b/csharp/src/Apache.Arrow/Table.cs @@ -37,10 +37,10 @@ public static Table TableFromRecordBatches(Schema schema, IList rec List columns = new List(nColumns); for (int icol = 0; icol < nColumns; icol++) { - List columnArrays = new List(nBatches); + List columnArrays = new List(nBatches); for (int jj = 0; jj < nBatches; jj++) { - columnArrays.Add(recordBatches[jj].Column(icol) as Array); + columnArrays.Add(recordBatches[jj].Column(icol)); } columns.Add(new Column(schema.GetFieldByIndex(icol), columnArrays)); } diff --git a/csharp/src/Apache.Arrow/Types/UnionType.cs b/csharp/src/Apache.Arrow/Types/UnionType.cs index 293271018aa26..23fa3b45ab278 100644 --- a/csharp/src/Apache.Arrow/Types/UnionType.cs +++ b/csharp/src/Apache.Arrow/Types/UnionType.cs @@ -24,20 +24,21 @@ public enum UnionMode Dense } - public sealed class UnionType : ArrowType + public sealed class UnionType : NestedType { public override ArrowTypeId TypeId => ArrowTypeId.Union; public override string Name => "union"; public UnionMode Mode { get; } - - public IEnumerable TypeCodes { get; } + + public int[] TypeIds { get; } public UnionType( - IEnumerable fields, IEnumerable typeCodes, + IEnumerable fields, IEnumerable typeIds, UnionMode mode = UnionMode.Sparse) + : base(fields.ToArray()) { - TypeCodes = typeCodes.ToList(); + TypeIds = typeIds.ToArray(); Mode = mode; } diff --git a/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs b/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs index abf7451e5e98c..1e76ee505a516 100644 --- a/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs +++ b/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs @@ -128,7 +128,7 @@ private RecordBatch CreateRecordBatch(Schema schema, JsonRecordBatch jsonRecordB for (int i = 0; i < jsonRecordBatch.Columns.Count; i++) { JsonFieldData data = jsonRecordBatch.Columns[i]; - Field field = schema.GetFieldByName(data.Name); + Field field = schema.FieldsList[i]; ArrayCreator creator = new ArrayCreator(data); field.DataType.Accept(creator); arrays.Add(creator.Array); @@ -188,6 +188,7 @@ private static IArrowType ToArrowType(JsonArrowType type, Field[] children) "list" => ToListArrowType(type, children), "fixedsizelist" => ToFixedSizeListArrowType(type, children), "struct" => ToStructArrowType(type, children), + "union" => ToUnionArrowType(type, children), "null" => NullType.Default, _ => throw new NotSupportedException($"JsonArrowType not supported: {type.Name}") }; @@ -281,6 +282,17 @@ private static IArrowType ToStructArrowType(JsonArrowType type, Field[] children return new StructType(children); } + private static IArrowType ToUnionArrowType(JsonArrowType type, Field[] children) + { + UnionMode mode = type.Mode switch + { + "SPARSE" => UnionMode.Sparse, + "DENSE" => UnionMode.Dense, + _ => throw new NotSupportedException($"Union mode not supported: {type.Mode}"), + }; + return new UnionType(children, type.TypeIds, mode); + } + private class ArrayCreator : IArrowTypeVisitor, IArrowTypeVisitor, @@ -306,6 +318,7 @@ private class ArrayCreator : IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, + IArrowTypeVisitor, IArrowTypeVisitor { private JsonFieldData JsonFieldData { get; set; } @@ -556,6 +569,43 @@ public void 
Visit(StructType type) Array = new StructArray(arrayData); } + public void Visit(UnionType type) + { + ArrowBuffer[] buffers; + if (type.Mode == UnionMode.Dense) + { + buffers = new ArrowBuffer[2]; + buffers[1] = GetOffsetBuffer(); + } + else + { + buffers = new ArrowBuffer[1]; + } + buffers[0] = GetTypeIdBuffer(); + + ArrayData[] children = GetChildren(type); + + int nullCount = 0; + ArrayData arrayData = new ArrayData(type, JsonFieldData.Count, nullCount, 0, buffers, children); + Array = UnionArray.Create(arrayData); + } + + private ArrayData[] GetChildren(NestedType type) + { + ArrayData[] children = new ArrayData[type.Fields.Count]; + + var data = JsonFieldData; + for (int i = 0; i < children.Length; i++) + { + JsonFieldData = data.Children[i]; + type.Fields[i].DataType.Accept(this); + children[i] = Array.Data; + } + JsonFieldData = data; + + return children; + } + private static byte[] ConvertHexStringToByteArray(string hexString) { byte[] data = new byte[hexString.Length / 2]; @@ -619,11 +669,22 @@ private void GenerateLongArray(Func valueOffsets = new ArrowBuffer.Builder(JsonFieldData.Offset.Length); valueOffsets.AppendRange(JsonFieldData.Offset); return valueOffsets.Build(default); } + private ArrowBuffer GetTypeIdBuffer() + { + ArrowBuffer.Builder typeIds = new ArrowBuffer.Builder(JsonFieldData.TypeId.Length); + for (int i = 0; i < JsonFieldData.TypeId.Length; i++) + { + typeIds.Append(checked((byte)JsonFieldData.TypeId[i])); + } + return typeIds.Build(default); + } + private ArrowBuffer GetValidityBuffer(out int nullCount) { if (JsonFieldData.Validity == null) diff --git a/csharp/test/Apache.Arrow.IntegrationTest/JsonFile.cs b/csharp/test/Apache.Arrow.IntegrationTest/JsonFile.cs index f0f63d3e19b8c..112eeabcb9931 100644 --- a/csharp/test/Apache.Arrow.IntegrationTest/JsonFile.cs +++ b/csharp/test/Apache.Arrow.IntegrationTest/JsonFile.cs @@ -71,6 +71,10 @@ public class JsonArrowType // FixedSizeList fields public int ListSize { get; set; } + // union fields + public string Mode { get; set; } + public int[] TypeIds { get; set; } + [JsonExtensionData] public Dictionary ExtensionData { get; set; } } diff --git a/csharp/test/Apache.Arrow.Tests/ArrayTypeComparer.cs b/csharp/test/Apache.Arrow.Tests/ArrayTypeComparer.cs index 77584aefb1bf4..c8bcc3cee0f99 100644 --- a/csharp/test/Apache.Arrow.Tests/ArrayTypeComparer.cs +++ b/csharp/test/Apache.Arrow.Tests/ArrayTypeComparer.cs @@ -28,7 +28,8 @@ public class ArrayTypeComparer : IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, - IArrowTypeVisitor + IArrowTypeVisitor, + IArrowTypeVisitor { private readonly IArrowType _expectedType; @@ -114,6 +115,22 @@ public void Visit(StructType actualType) CompareNested(expectedType, actualType); } + public void Visit(UnionType actualType) + { + Assert.IsAssignableFrom(_expectedType); + UnionType expectedType = (UnionType)_expectedType; + + Assert.Equal(expectedType.Mode, actualType.Mode); + + Assert.Equal(expectedType.TypeIds.Length, actualType.TypeIds.Length); + for (int i = 0; i < expectedType.TypeIds.Length; i++) + { + Assert.Equal(expectedType.TypeIds[i], actualType.TypeIds[i]); + } + + CompareNested(expectedType, actualType); + } + private static void CompareNested(NestedType expectedType, NestedType actualType) { Assert.Equal(expectedType.Fields.Count, actualType.Fields.Count); diff --git a/csharp/test/Apache.Arrow.Tests/ArrowArrayConcatenatorTests.cs b/csharp/test/Apache.Arrow.Tests/ArrowArrayConcatenatorTests.cs index 36cffe7eb4da1..f5a2c345e2ae6 100644 --- 
a/csharp/test/Apache.Arrow.Tests/ArrowArrayConcatenatorTests.cs +++ b/csharp/test/Apache.Arrow.Tests/ArrowArrayConcatenatorTests.cs @@ -77,6 +77,22 @@ private static IEnumerable, IArrowArray>> GenerateTestDa new Field.Builder().Name("Ints").DataType(Int32Type.Default).Nullable(true).Build() }), new FixedSizeListType(Int32Type.Default, 1), + new UnionType( + new List{ + new Field.Builder().Name("Strings").DataType(StringType.Default).Nullable(true).Build(), + new Field.Builder().Name("Ints").DataType(Int32Type.Default).Nullable(true).Build() + }, + new[] { 0, 1 }, + UnionMode.Sparse + ), + new UnionType( + new List{ + new Field.Builder().Name("Strings").DataType(StringType.Default).Nullable(true).Build(), + new Field.Builder().Name("Ints").DataType(Int32Type.Default).Nullable(true).Build() + }, + new[] { 0, 1 }, + UnionMode.Dense + ), }; foreach (IArrowType type in targetTypes) @@ -119,7 +135,8 @@ private class TestDataGenerator : IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, - IArrowTypeVisitor + IArrowTypeVisitor, + IArrowTypeVisitor { private List> _baseData; @@ -392,6 +409,91 @@ public void Visit(StructType type) ExpectedArray = new StructArray(type, 3, new List { resultStringArray, resultInt32Array }, nullBitmapBuffer, 1); } + public void Visit(UnionType type) + { + bool isDense = type.Mode == UnionMode.Dense; + + StringArray.Builder stringResultBuilder = new StringArray.Builder().Reserve(_baseDataTotalElementCount); + Int32Array.Builder intResultBuilder = new Int32Array.Builder().Reserve(_baseDataTotalElementCount); + ArrowBuffer.Builder typeResultBuilder = new ArrowBuffer.Builder().Reserve(_baseDataTotalElementCount); + ArrowBuffer.Builder offsetResultBuilder = new ArrowBuffer.Builder().Reserve(_baseDataTotalElementCount); + int resultNullCount = 0; + + for (int i = 0; i < _baseDataListCount; i++) + { + List dataList = _baseData[i]; + StringArray.Builder stringBuilder = new StringArray.Builder().Reserve(dataList.Count); + Int32Array.Builder intBuilder = new Int32Array.Builder().Reserve(dataList.Count); + ArrowBuffer.Builder typeBuilder = new ArrowBuffer.Builder().Reserve(dataList.Count); + ArrowBuffer.Builder offsetBuilder = new ArrowBuffer.Builder().Reserve(dataList.Count); + int nullCount = 0; + + for (int j = 0; j < dataList.Count; j++) + { + byte index = (byte)Math.Max(j % 3, 1); + int? intValue = (index == 1) ? dataList[j] : null; + string stringValue = (index == 1) ? 
null : dataList[j]?.ToString(); + typeBuilder.Append(index); + + if (isDense) + { + if (index == 0) + { + offsetBuilder.Append(stringBuilder.Length); + offsetResultBuilder.Append(stringResultBuilder.Length); + stringBuilder.Append(stringValue); + stringResultBuilder.Append(stringValue); + } + else + { + offsetBuilder.Append(intBuilder.Length); + offsetResultBuilder.Append(intResultBuilder.Length); + intBuilder.Append(intValue); + intResultBuilder.Append(intValue); + } + } + else + { + stringBuilder.Append(stringValue); + stringResultBuilder.Append(stringValue); + intBuilder.Append(intValue); + intResultBuilder.Append(intValue); + } + + if (dataList[j] == null) + { + nullCount++; + resultNullCount++; + } + } + + ArrowBuffer[] buffers; + if (isDense) + { + buffers = new[] { typeBuilder.Build(), offsetBuilder.Build() }; + } + else + { + buffers = new[] { typeBuilder.Build() }; + } + TestTargetArrayList.Add(UnionArray.Create(new ArrayData( + type, dataList.Count, nullCount, 0, buffers, + new[] { stringBuilder.Build().Data, intBuilder.Build().Data }))); + } + + ArrowBuffer[] resultBuffers; + if (isDense) + { + resultBuffers = new[] { typeResultBuilder.Build(), offsetResultBuilder.Build() }; + } + else + { + resultBuffers = new[] { typeResultBuilder.Build() }; + } + ExpectedArray = UnionArray.Create(new ArrayData( + type, _baseDataTotalElementCount, resultNullCount, 0, resultBuffers, + new[] { stringResultBuilder.Build().Data, intResultBuilder.Build().Data })); + } public void Visit(IArrowType type) { diff --git a/csharp/test/Apache.Arrow.Tests/ArrowReaderVerifier.cs b/csharp/test/Apache.Arrow.Tests/ArrowReaderVerifier.cs index e588eab51e1fc..8b41763a70ac8 100644 --- a/csharp/test/Apache.Arrow.Tests/ArrowReaderVerifier.cs +++ b/csharp/test/Apache.Arrow.Tests/ArrowReaderVerifier.cs @@ -91,6 +91,7 @@ private class ArrayComparer : IArrowArrayVisitor, IArrowArrayVisitor, IArrowArrayVisitor, + IArrowArrayVisitor, IArrowArrayVisitor, IArrowArrayVisitor, IArrowArrayVisitor, @@ -151,6 +152,24 @@ public void Visit(StructArray array) } } + public void Visit(UnionArray array) + { + Assert.IsAssignableFrom(_expectedArray); + UnionArray expectedArray = (UnionArray)_expectedArray; + + Assert.Equal(expectedArray.Mode, array.Mode); + Assert.Equal(expectedArray.Length, array.Length); + Assert.Equal(expectedArray.NullCount, array.NullCount); + Assert.Equal(expectedArray.Offset, array.Offset); + Assert.Equal(expectedArray.Data.Children.Length, array.Data.Children.Length); + Assert.Equal(expectedArray.Fields.Count, array.Fields.Count); + + for (int i = 0; i < array.Fields.Count; i++) + { + array.Fields[i].Accept(new ArrayComparer(expectedArray.Fields[i], _strictCompare)); + } + } + public void Visit(DictionaryArray array) { Assert.IsAssignableFrom(_expectedArray); diff --git a/csharp/test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs b/csharp/test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs index 29b1b9e7db74a..f28b89a9cd17e 100644 --- a/csharp/test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs +++ b/csharp/test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs @@ -112,6 +112,9 @@ private static Schema GetTestSchema() .Field(f => f.Name("dict_string_ordered").DataType(new DictionaryType(Int32Type.Default, StringType.Default, true)).Nullable(false)) .Field(f => f.Name("list_dict_string").DataType(new ListType(new DictionaryType(Int32Type.Default, StringType.Default, false))).Nullable(false)) + .Field(f => f.Name("dense_union").DataType(new UnionType(new[] { new Field("i64", Int64Type.Default, false), new 
Field("f32", FloatType.Default, true), }, new[] { 0, 1 }, UnionMode.Dense))) + .Field(f => f.Name("sparse_union").DataType(new UnionType(new[] { new Field("i32", Int32Type.Default, true), new Field("f64", DoubleType.Default, false), }, new[] { 0, 1 }, UnionMode.Sparse))) + // Checking wider characters. .Field(f => f.Name("hello 你好 😄").DataType(BooleanType.Default).Nullable(true)) @@ -172,6 +175,9 @@ private static IEnumerable GetPythonFields() yield return pa.field("dict_string_ordered", pa.dictionary(pa.int32(), pa.utf8(), true), false); yield return pa.field("list_dict_string", pa.list_(pa.dictionary(pa.int32(), pa.utf8(), false)), false); + yield return pa.field("dense_union", pa.dense_union(List(pa.field("i64", pa.int64(), false), pa.field("f32", pa.float32(), true)))); + yield return pa.field("sparse_union", pa.sparse_union(List(pa.field("i32", pa.int32(), true), pa.field("f64", pa.float64(), false)))); + yield return pa.field("hello 你好 😄", pa.bool_(), true); } } @@ -485,22 +491,29 @@ public unsafe void ImportRecordBatch() pa.array(List(0.0, 1.4, 2.5, 3.6, 4.7)), pa.array(new PyObject[] { List(1, 2), List(3, 4), PyObject.None, PyObject.None, List(5, 4, 3) }), pa.StructArray.from_arrays( - new PyList(new PyObject[] - { + List( List(10, 9, null, null, null), List("banana", "apple", "orange", "cherry", "grape"), - List(null, 4.3, -9, 123.456, 0), - }), + List(null, 4.3, -9, 123.456, 0) + ), new[] { "fld1", "fld2", "fld3" }), pa.DictionaryArray.from_arrays( pa.array(List(1, 0, 1, 1, null)), - pa.array(List("foo", "bar")) - ), + pa.array(List("foo", "bar"))), pa.FixedSizeListArray.from_arrays( pa.array(List(1, 2, 3, 4, null, 6, 7, null, null, null)), 2), + pa.UnionArray.from_dense( + pa.array(List(0, 1, 1, 0, 0), type: "int8"), + pa.array(List(0, 0, 1, 1, 2), type: "int32"), + List( + pa.array(List(1, 4, null)), + pa.array(List("two", "three")) + ), + /* field name */ List("i32", "s"), + /* type codes */ List(3, 2)), }), - new[] { "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8" }); + new[] { "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9" }); dynamic batch = table.to_batches()[0]; @@ -568,6 +581,10 @@ public unsafe void ImportRecordBatch() Assert.Equal(new long[] { 1, 2, 3, 4, 0, 6, 7, 0, 0, 0 }, col8a.Values.ToArray()); Assert.True(col8a.IsValid(3)); Assert.False(col8a.IsValid(9)); + + UnionArray col9 = (UnionArray)recordBatch.Column("col9"); + Assert.Equal(5, col9.Length); + Assert.True(col9 is DenseUnionArray); } [SkippableFact] @@ -789,6 +806,11 @@ private static PyObject List(params string[] values) return new PyList(values.Select(i => i == null ? 
PyObject.None : new PyString(i)).ToArray()); } + private static PyObject List(params PyObject[] values) + { + return new PyList(values); + } + sealed class TestArrayStream : IArrowArrayStream { private readonly RecordBatch[] _batches; diff --git a/csharp/test/Apache.Arrow.Tests/ColumnTests.cs b/csharp/test/Apache.Arrow.Tests/ColumnTests.cs index b90c681622d5f..2d867b79176aa 100644 --- a/csharp/test/Apache.Arrow.Tests/ColumnTests.cs +++ b/csharp/test/Apache.Arrow.Tests/ColumnTests.cs @@ -39,7 +39,7 @@ public void TestColumn() Array intArrayCopy = MakeIntArray(10); Field field = new Field.Builder().Name("f0").DataType(Int32Type.Default).Build(); - Column column = new Column(field, new[] { intArray, intArrayCopy }); + Column column = new Column(field, new IArrowArray[] { intArray, intArrayCopy }); Assert.True(column.Name == field.Name); Assert.True(column.Field == field); diff --git a/csharp/test/Apache.Arrow.Tests/TableTests.cs b/csharp/test/Apache.Arrow.Tests/TableTests.cs index b4c4b1faed190..8b07a38c1b8c0 100644 --- a/csharp/test/Apache.Arrow.Tests/TableTests.cs +++ b/csharp/test/Apache.Arrow.Tests/TableTests.cs @@ -30,7 +30,7 @@ public static Table MakeTableWithOneColumnOfTwoIntArrays(int lengthOfEachArray) Field field = new Field.Builder().Name("f0").DataType(Int32Type.Default).Build(); Schema s0 = new Schema.Builder().Field(field).Build(); - Column column = new Column(field, new List { intArray, intArrayCopy }); + Column column = new Column(field, new List { intArray, intArrayCopy }); Table table = new Table(s0, new List { column }); return table; } @@ -60,7 +60,7 @@ public void TestTableFromRecordBatches() Table table1 = Table.TableFromRecordBatches(recordBatch1.Schema, recordBatches); Assert.Equal(20, table1.RowCount); - Assert.Equal(24, table1.ColumnCount); + Assert.Equal(26, table1.ColumnCount); FixedSizeBinaryType type = new FixedSizeBinaryType(17); Field newField1 = new Field(type.Name, type, false); @@ -86,13 +86,13 @@ public void TestTableAddRemoveAndSetColumn() Array nonEqualLengthIntArray = ColumnTests.MakeIntArray(10); Field field1 = new Field.Builder().Name("f1").DataType(Int32Type.Default).Build(); - Column nonEqualLengthColumn = new Column(field1, new[] { nonEqualLengthIntArray}); + Column nonEqualLengthColumn = new Column(field1, new IArrowArray[] { nonEqualLengthIntArray }); Assert.Throws(() => table.InsertColumn(-1, nonEqualLengthColumn)); Assert.Throws(() => table.InsertColumn(1, nonEqualLengthColumn)); Array equalLengthIntArray = ColumnTests.MakeIntArray(20); Field field2 = new Field.Builder().Name("f2").DataType(Int32Type.Default).Build(); - Column equalLengthColumn = new Column(field2, new[] { equalLengthIntArray}); + Column equalLengthColumn = new Column(field2, new IArrowArray[] { equalLengthIntArray }); Column existingColumn = table.Column(0); Table newTable = table.InsertColumn(0, equalLengthColumn); @@ -118,7 +118,7 @@ public void TestBuildFromRecordBatch() RecordBatch batch = TestData.CreateSampleRecordBatch(schema, 10); Table table = Table.TableFromRecordBatches(schema, new[] { batch }); - Assert.NotNull(table.Column(0).Data.Array(0) as Int64Array); + Assert.NotNull(table.Column(0).Data.ArrowArray(0) as Int64Array); } } diff --git a/csharp/test/Apache.Arrow.Tests/TestData.cs b/csharp/test/Apache.Arrow.Tests/TestData.cs index 41507311f6a04..9e2061e3428a9 100644 --- a/csharp/test/Apache.Arrow.Tests/TestData.cs +++ b/csharp/test/Apache.Arrow.Tests/TestData.cs @@ -60,6 +60,8 @@ public static RecordBatch CreateSampleRecordBatch(int length, int columnSetCount 
builder.Field(CreateField(new DictionaryType(Int32Type.Default, StringType.Default, false), i)); builder.Field(CreateField(new FixedSizeBinaryType(16), i)); builder.Field(CreateField(new FixedSizeListType(Int32Type.Default, 3), i)); + builder.Field(CreateField(new UnionType(new[] { CreateField(StringType.Default, i), CreateField(Int32Type.Default, i) }, new[] { 0, 1 }, UnionMode.Sparse), i)); + builder.Field(CreateField(new UnionType(new[] { CreateField(StringType.Default, i), CreateField(Int32Type.Default, i) }, new[] { 0, 1 }, UnionMode.Dense), -i)); } //builder.Field(CreateField(HalfFloatType.Default)); @@ -125,6 +127,7 @@ private class ArrayCreator : IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, + IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, IArrowTypeVisitor, @@ -315,6 +318,67 @@ public void Visit(StructType type) Array = new StructArray(type, Length, childArrays, nullBitmap.Build()); } + public void Visit(UnionType type) + { + int[] lengths = new int[type.Fields.Count]; + if (type.Mode == UnionMode.Sparse) + { + for (int i = 0; i < lengths.Length; i++) + { + lengths[i] = Length; + } + } + else + { + int totalLength = Length; + int oneLength = Length / lengths.Length; + for (int i = 1; i < lengths.Length; i++) + { + lengths[i] = oneLength; + totalLength -= oneLength; + } + lengths[0] = totalLength; + } + + ArrayData[] childArrays = new ArrayData[type.Fields.Count]; + for (int i = 0; i < childArrays.Length; i++) + { + childArrays[i] = CreateArray(type.Fields[i], lengths[i]).Data; + } + + ArrowBuffer.Builder typeIdBuilder = new ArrowBuffer.Builder(Length); + byte index = 0; + for (int i = 0; i < Length; i++) + { + typeIdBuilder.Append(index); + index++; + if (index == lengths.Length) + { + index = 0; + } + } + + ArrowBuffer[] buffers; + if (type.Mode == UnionMode.Sparse) + { + buffers = new ArrowBuffer[1]; + } + else + { + ArrowBuffer.Builder offsetBuilder = new ArrowBuffer.Builder(Length); + for (int i = 0; i < Length; i++) + { + offsetBuilder.Append(i / lengths.Length); + } + + buffers = new ArrowBuffer[2]; + buffers[1] = offsetBuilder.Build(); + } + buffers[0] = typeIdBuilder.Build(); + + Array = UnionArray.Create(new ArrayData(type, Length, 0, 0, buffers, childArrays)); + } + public void Visit(DictionaryType type) { Int32Array.Builder indicesBuilder = new Int32Array.Builder().Reserve(Length); diff --git a/dev/archery/archery/integration/datagen.py b/dev/archery/archery/integration/datagen.py index 5ac32da56a8de..299881c4b613a 100644 --- a/dev/archery/archery/integration/datagen.py +++ b/dev/archery/archery/integration/datagen.py @@ -1833,8 +1833,7 @@ def _temp_path(): .skip_tester('C#') .skip_tester('JS'), - generate_unions_case() - .skip_tester('C#'), + generate_unions_case(), generate_custom_metadata_case() .skip_tester('C#'), diff --git a/docs/source/status.rst b/docs/source/status.rst index 36c29fcdc4da6..6314fd4c8d31f 100644 --- a/docs/source/status.rst +++ b/docs/source/status.rst @@ -83,9 +83,9 @@ Data Types +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ | Map | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ -| Dense Union | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | | +| Dense Union | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ -| Sparse Union | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | | +| Sparse Union | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | 
+-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ From 3c013da56fd55122072ffaca3e23afb12c290075 Mon Sep 17 00:00:00 2001 From: James Duong Date: Mon, 25 Sep 2023 06:17:26 -0700 Subject: [PATCH 59/96] GH-37795: [Java][FlightSQL] Add mock FlightSqlProducer and tests (#37837) ### Rationale for this change Clarify how to write a FlightSqlProducer with examples and helper classes. This is more in line with what's available to help developers write a FlightProducer. ### What changes are included in this PR? Add helper classes for creating a No-op Flight SQL producer and a partially implemented FlightSqlProducer that can process metadata requests. Add a mock flight producer and tests for it based on the new FlightSqlProducer partial implementations. Clean up missed closes of FlightStreams in TestFlightSql. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: #37795 Authored-by: James Duong Signed-off-by: David Li --- .../flight/sql/BasicFlightSqlProducer.java | 109 +++++ .../flight/sql/NoOpFlightSqlProducer.java | 221 +++++++++ .../apache/arrow/flight/TestFlightSql.java | 432 +++++++----------- .../arrow/flight/TestFlightSqlStreams.java | 288 ++++++++++++ .../flight/sql/util/FlightStreamUtils.java | 129 ++++++ 5 files changed, 917 insertions(+), 262 deletions(-) create mode 100644 java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/BasicFlightSqlProducer.java create mode 100644 java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/NoOpFlightSqlProducer.java create mode 100644 java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSqlStreams.java create mode 100644 java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/util/FlightStreamUtils.java diff --git a/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/BasicFlightSqlProducer.java b/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/BasicFlightSqlProducer.java new file mode 100644 index 0000000000000..ea99191f28e13 --- /dev/null +++ b/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/BasicFlightSqlProducer.java @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.flight.sql; + +import java.util.List; + +import org.apache.arrow.flight.FlightDescriptor; +import org.apache.arrow.flight.FlightEndpoint; +import org.apache.arrow.flight.FlightInfo; +import org.apache.arrow.flight.sql.impl.FlightSql; +import org.apache.arrow.vector.types.pojo.Schema; + +import com.google.protobuf.Message; + +/** + * A {@link FlightSqlProducer} that implements getting FlightInfo for each metadata request.
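+ *
+ * <p>A minimal sketch of a subclass (hypothetical; only {@code determineEndpoints} has to be
+ * supplied, mirroring the FlightSqlTestProducer added later in this patch):
+ * <pre>{@code
+ * public class SingleNodeSqlProducer extends BasicFlightSqlProducer {
+ *   @Override
+ *   protected <T extends Message> List<FlightEndpoint> determineEndpoints(
+ *       T request, FlightDescriptor descriptor, Schema schema) {
+ *     // Hypothetical single-process server: one endpoint whose ticket repacks the request.
+ *     return Collections.singletonList(
+ *         new FlightEndpoint(new Ticket(Any.pack(request).toByteArray())));
+ *   }
+ * }
+ * }</pre>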
+ */ +public abstract class BasicFlightSqlProducer extends NoOpFlightSqlProducer { + + @Override + public FlightInfo getFlightInfoSqlInfo(FlightSql.CommandGetSqlInfo request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_SQL_INFO_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoTypeInfo(FlightSql.CommandGetXdbcTypeInfo request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_TYPE_INFO_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoCatalogs(FlightSql.CommandGetCatalogs request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_CATALOGS_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoSchemas(FlightSql.CommandGetDbSchemas request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_SCHEMAS_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoTables(FlightSql.CommandGetTables request, CallContext context, + FlightDescriptor descriptor) { + if (request.getIncludeSchema()) { + return generateFlightInfo(request, descriptor, Schemas.GET_TABLES_SCHEMA); + } + return generateFlightInfo(request, descriptor, Schemas.GET_TABLES_SCHEMA_NO_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoTableTypes(FlightSql.CommandGetTableTypes request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_TABLE_TYPES_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoPrimaryKeys(FlightSql.CommandGetPrimaryKeys request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_PRIMARY_KEYS_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoExportedKeys(FlightSql.CommandGetExportedKeys request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_EXPORTED_KEYS_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoImportedKeys(FlightSql.CommandGetImportedKeys request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_IMPORTED_KEYS_SCHEMA); + } + + @Override + public FlightInfo getFlightInfoCrossReference(FlightSql.CommandGetCrossReference request, CallContext context, + FlightDescriptor descriptor) { + return generateFlightInfo(request, descriptor, Schemas.GET_CROSS_REFERENCE_SCHEMA); + } + + /** + * Return a list of FlightEndpoints for the given request and FlightDescriptor. This method should validate that + * the request is supported by this FlightSqlProducer. 
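+ * For example (illustrative only, not prescribed by this interface), an implementation could
+ * reject an unsupported request by throwing
+ * {@code CallStatus.INVALID_ARGUMENT.withDescription("unsupported request").toRuntimeException()}
+ * rather than returning endpoints it cannot serve.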
+ */ + protected abstract <T extends Message> + List<FlightEndpoint> determineEndpoints(T request, FlightDescriptor flightDescriptor, Schema schema); + + protected <T extends Message> FlightInfo generateFlightInfo(T request, FlightDescriptor descriptor, Schema schema) { + final List<FlightEndpoint> endpoints = determineEndpoints(request, descriptor, schema); + return new FlightInfo(schema, descriptor, endpoints, -1, -1); + } +} diff --git a/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/NoOpFlightSqlProducer.java b/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/NoOpFlightSqlProducer.java new file mode 100644 index 0000000000000..a02cee64bd855 --- /dev/null +++ b/java/flight/flight-sql/src/main/java/org/apache/arrow/flight/sql/NoOpFlightSqlProducer.java @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.flight.sql; + +import org.apache.arrow.flight.CallStatus; +import org.apache.arrow.flight.Criteria; +import org.apache.arrow.flight.FlightDescriptor; +import org.apache.arrow.flight.FlightInfo; +import org.apache.arrow.flight.FlightStream; +import org.apache.arrow.flight.PutResult; +import org.apache.arrow.flight.Result; +import org.apache.arrow.flight.SchemaResult; +import org.apache.arrow.flight.sql.impl.FlightSql; + +/** + * A {@link FlightSqlProducer} that throws on all FlightSql-specific operations.
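+ *
+ * <p>One way to use it (a hypothetical sketch, with the override body elided): extend the class
+ * and override only the operations a server actually supports, leaving everything else
+ * UNIMPLEMENTED:
+ * <pre>{@code
+ * FlightSqlProducer producer = new NoOpFlightSqlProducer() {
+ *   @Override
+ *   public void getStreamTableTypes(CallContext context, ServerStreamListener listener) {
+ *     // stream a VectorSchemaRoot of table types, then call listener.completed()
+ *   }
+ * };
+ * }</pre>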
+ */ +public class NoOpFlightSqlProducer implements FlightSqlProducer { + @Override + public void createPreparedStatement(FlightSql.ActionCreatePreparedStatementRequest request, + CallContext context, StreamListener listener) { + listener.onError(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public void closePreparedStatement(FlightSql.ActionClosePreparedStatementRequest request, + CallContext context, StreamListener listener) { + listener.onError(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoStatement(FlightSql.CommandStatementQuery command, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public FlightInfo getFlightInfoPreparedStatement(FlightSql.CommandPreparedStatementQuery command, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public SchemaResult getSchemaStatement(FlightSql.CommandStatementQuery command, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamStatement(FlightSql.TicketStatementQuery ticket, + CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public void getStreamPreparedStatement(FlightSql.CommandPreparedStatementQuery command, + CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public Runnable acceptPutStatement(FlightSql.CommandStatementUpdate command, CallContext context, + FlightStream flightStream, StreamListener ackStream) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public Runnable acceptPutPreparedStatementUpdate(FlightSql.CommandPreparedStatementUpdate command, + CallContext context, FlightStream flightStream, + StreamListener ackStream) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public Runnable acceptPutPreparedStatementQuery(FlightSql.CommandPreparedStatementQuery command, CallContext context, + FlightStream flightStream, StreamListener ackStream) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public FlightInfo getFlightInfoSqlInfo(FlightSql.CommandGetSqlInfo request, CallContext context, + FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamSqlInfo(FlightSql.CommandGetSqlInfo command, CallContext context, + ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoTypeInfo(FlightSql.CommandGetXdbcTypeInfo request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamTypeInfo(FlightSql.CommandGetXdbcTypeInfo request, + CallContext context, 
ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoCatalogs(FlightSql.CommandGetCatalogs request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamCatalogs(CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoSchemas(FlightSql.CommandGetDbSchemas request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamSchemas(FlightSql.CommandGetDbSchemas command, + CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoTables(FlightSql.CommandGetTables request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamTables(FlightSql.CommandGetTables command, CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoTableTypes(FlightSql.CommandGetTableTypes request, CallContext context, + FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamTableTypes(CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoPrimaryKeys(FlightSql.CommandGetPrimaryKeys request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamPrimaryKeys(FlightSql.CommandGetPrimaryKeys command, + CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public FlightInfo getFlightInfoExportedKeys(FlightSql.CommandGetExportedKeys request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public FlightInfo getFlightInfoImportedKeys(FlightSql.CommandGetImportedKeys request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public FlightInfo getFlightInfoCrossReference(FlightSql.CommandGetCrossReference request, + CallContext context, FlightDescriptor descriptor) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public void getStreamExportedKeys(FlightSql.CommandGetExportedKeys command, + CallContext context, ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public void 
getStreamImportedKeys(FlightSql.CommandGetImportedKeys command, CallContext context, + ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public void getStreamCrossReference(FlightSql.CommandGetCrossReference command, CallContext context, + ServerStreamListener listener) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + @Override + public void close() throws Exception { + + } + + @Override + public void listFlights(CallContext context, Criteria criteria, StreamListener listener) { + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } +} diff --git a/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSql.java b/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSql.java index 6da915a8ffb14..7635b80ecd0fd 100644 --- a/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSql.java +++ b/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSql.java @@ -20,7 +20,7 @@ import static java.util.Arrays.asList; import static java.util.Collections.emptyList; import static java.util.Collections.singletonList; -import static java.util.Objects.isNull; +import static org.apache.arrow.flight.sql.util.FlightStreamUtils.getResults; import static org.apache.arrow.util.AutoCloseables.close; import static org.hamcrest.CoreMatchers.containsString; import static org.hamcrest.CoreMatchers.is; @@ -29,16 +29,12 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertThrows; -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.nio.channels.Channels; import java.sql.SQLException; import java.util.ArrayList; import java.util.Arrays; import java.util.LinkedHashMap; import java.util.List; import java.util.Map; -import java.util.Objects; import java.util.Optional; import java.util.stream.IntStream; @@ -52,18 +48,9 @@ import org.apache.arrow.flight.sql.util.TableRef; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.BitVector; -import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.IntVector; -import org.apache.arrow.vector.UInt1Vector; -import org.apache.arrow.vector.UInt4Vector; -import org.apache.arrow.vector.VarBinaryVector; import org.apache.arrow.vector.VarCharVector; import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.complex.DenseUnionVector; -import org.apache.arrow.vector.complex.ListVector; -import org.apache.arrow.vector.ipc.ReadChannel; -import org.apache.arrow.vector.ipc.message.MessageSerializer; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.FieldType; @@ -657,197 +644,202 @@ public void testGetSqlInfoResultsWithThreeArgs() throws Exception { } @Test - public void testGetCommandExportedKeys() { - final FlightStream stream = + public void testGetCommandExportedKeys() throws Exception { + try (final FlightStream stream = sqlClient.getStream( sqlClient.getExportedKeys(TableRef.of(null, null, "FOREIGNTABLE")) - .getEndpoints().get(0).getTicket()); - - final List> results = getResults(stream); - - final List> matchers = asList( - nullValue(String.class), // pk_catalog_name - is("APP"), // pk_schema_name - is("FOREIGNTABLE"), // 
pk_table_name - is("ID"), // pk_column_name - nullValue(String.class), // fk_catalog_name - is("APP"), // fk_schema_name - is("INTTABLE"), // fk_table_name - is("FOREIGNID"), // fk_column_name - is("1"), // key_sequence - containsString("SQL"), // fk_key_name - containsString("SQL"), // pk_key_name - is("3"), // update_rule - is("3")); // delete_rule - - final List assertions = new ArrayList<>(); - Assertions.assertEquals(1, results.size()); - for (int i = 0; i < matchers.size(); i++) { - final String actual = results.get(0).get(i); - final Matcher expected = matchers.get(i); - assertions.add(() -> MatcherAssert.assertThat(actual, expected)); + .getEndpoints().get(0).getTicket())) { + + final List> results = getResults(stream); + + final List> matchers = asList( + nullValue(String.class), // pk_catalog_name + is("APP"), // pk_schema_name + is("FOREIGNTABLE"), // pk_table_name + is("ID"), // pk_column_name + nullValue(String.class), // fk_catalog_name + is("APP"), // fk_schema_name + is("INTTABLE"), // fk_table_name + is("FOREIGNID"), // fk_column_name + is("1"), // key_sequence + containsString("SQL"), // fk_key_name + containsString("SQL"), // pk_key_name + is("3"), // update_rule + is("3")); // delete_rule + + final List assertions = new ArrayList<>(); + Assertions.assertEquals(1, results.size()); + for (int i = 0; i < matchers.size(); i++) { + final String actual = results.get(0).get(i); + final Matcher expected = matchers.get(i); + assertions.add(() -> MatcherAssert.assertThat(actual, expected)); + } + Assertions.assertAll(assertions); } - Assertions.assertAll(assertions); } @Test - public void testGetCommandImportedKeys() { - final FlightStream stream = + public void testGetCommandImportedKeys() throws Exception { + try (final FlightStream stream = sqlClient.getStream( sqlClient.getImportedKeys(TableRef.of(null, null, "INTTABLE")) - .getEndpoints().get(0).getTicket()); - - final List> results = getResults(stream); - - final List> matchers = asList( - nullValue(String.class), // pk_catalog_name - is("APP"), // pk_schema_name - is("FOREIGNTABLE"), // pk_table_name - is("ID"), // pk_column_name - nullValue(String.class), // fk_catalog_name - is("APP"), // fk_schema_name - is("INTTABLE"), // fk_table_name - is("FOREIGNID"), // fk_column_name - is("1"), // key_sequence - containsString("SQL"), // fk_key_name - containsString("SQL"), // pk_key_name - is("3"), // update_rule - is("3")); // delete_rule - - Assertions.assertEquals(1, results.size()); - final List assertions = new ArrayList<>(); - for (int i = 0; i < matchers.size(); i++) { - final String actual = results.get(0).get(i); - final Matcher expected = matchers.get(i); - assertions.add(() -> MatcherAssert.assertThat(actual, expected)); + .getEndpoints().get(0).getTicket())) { + + final List> results = getResults(stream); + + final List> matchers = asList( + nullValue(String.class), // pk_catalog_name + is("APP"), // pk_schema_name + is("FOREIGNTABLE"), // pk_table_name + is("ID"), // pk_column_name + nullValue(String.class), // fk_catalog_name + is("APP"), // fk_schema_name + is("INTTABLE"), // fk_table_name + is("FOREIGNID"), // fk_column_name + is("1"), // key_sequence + containsString("SQL"), // fk_key_name + containsString("SQL"), // pk_key_name + is("3"), // update_rule + is("3")); // delete_rule + + Assertions.assertEquals(1, results.size()); + final List assertions = new ArrayList<>(); + for (int i = 0; i < matchers.size(); i++) { + final String actual = results.get(0).get(i); + final Matcher expected = matchers.get(i); + 
assertions.add(() -> MatcherAssert.assertThat(actual, expected)); + } + Assertions.assertAll(assertions); } - Assertions.assertAll(assertions); } @Test - public void testGetTypeInfo() { + public void testGetTypeInfo() throws Exception { FlightInfo flightInfo = sqlClient.getXdbcTypeInfo(); - FlightStream stream = sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket()); - - final List> results = getResults(stream); - - final List> matchers = ImmutableList.of( - asList("BIGINT", "-5", "19", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", - "BIGINT", "0", "0", - null, null, "10", null), - asList("LONG VARCHAR FOR BIT DATA", "-4", "32700", "X'", "'", emptyList().toString(), "1", "false", "0", "true", - "false", "false", - "LONG VARCHAR FOR BIT DATA", null, null, null, null, null, null), - asList("VARCHAR () FOR BIT DATA", "-3", "32672", "X'", "'", singletonList("length").toString(), "1", "false", - "2", "true", "false", - "false", "VARCHAR () FOR BIT DATA", null, null, null, null, null, null), - asList("CHAR () FOR BIT DATA", "-2", "254", "X'", "'", singletonList("length").toString(), "1", "false", "2", - "true", "false", "false", - "CHAR () FOR BIT DATA", null, null, null, null, null, null), - asList("LONG VARCHAR", "-1", "32700", "'", "'", emptyList().toString(), "1", "true", "1", "true", "false", - "false", - "LONG VARCHAR", null, null, null, null, null, null), - asList("CHAR", "1", "254", "'", "'", singletonList("length").toString(), "1", "true", "3", "true", "false", - "false", "CHAR", null, null, - null, null, null, null), - asList("NUMERIC", "2", "31", null, null, Arrays.asList("precision", "scale").toString(), "1", "false", "2", - "false", "true", "false", - "NUMERIC", "0", "31", null, null, "10", null), - asList("DECIMAL", "3", "31", null, null, Arrays.asList("precision", "scale").toString(), "1", "false", "2", - "false", "true", "false", - "DECIMAL", "0", "31", null, null, "10", null), - asList("INTEGER", "4", "10", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", - "INTEGER", "0", "0", - null, null, "10", null), - asList("SMALLINT", "5", "5", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", - "SMALLINT", "0", - "0", null, null, "10", null), - asList("FLOAT", "6", "52", null, null, singletonList("precision").toString(), "1", "false", "2", "false", - "false", "false", "FLOAT", null, - null, null, null, "2", null), - asList("REAL", "7", "23", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "false", - "REAL", null, null, - null, null, "2", null), - asList("DOUBLE", "8", "52", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "false", - "DOUBLE", null, - null, null, null, "2", null), - asList("VARCHAR", "12", "32672", "'", "'", singletonList("length").toString(), "1", "true", "3", "true", - "false", "false", "VARCHAR", - null, null, null, null, null, null), - asList("BOOLEAN", "16", "1", null, null, emptyList().toString(), "1", "false", "2", "true", "false", "false", - "BOOLEAN", null, - null, null, null, null, null), - asList("DATE", "91", "10", "DATE'", "'", emptyList().toString(), "1", "false", "2", "true", "false", "false", - "DATE", "0", "0", - null, null, "10", null), - asList("TIME", "92", "8", "TIME'", "'", emptyList().toString(), "1", "false", "2", "true", "false", "false", - "TIME", "0", "0", - null, null, "10", null), - asList("TIMESTAMP", "93", "29", "TIMESTAMP'", "'", emptyList().toString(), "1", "false", "2", "true", 
"false", - "false", - "TIMESTAMP", "0", "9", null, null, "10", null), - asList("OBJECT", "2000", null, null, null, emptyList().toString(), "1", "false", "2", "true", "false", "false", - "OBJECT", null, - null, null, null, null, null), - asList("BLOB", "2004", "2147483647", null, null, singletonList("length").toString(), "1", "false", "0", null, - "false", null, "BLOB", null, - null, null, null, null, null), - asList("CLOB", "2005", "2147483647", "'", "'", singletonList("length").toString(), "1", "true", "1", null, - "false", null, "CLOB", null, - null, null, null, null, null), - asList("XML", "2009", null, null, null, emptyList().toString(), "1", "true", "0", "false", "false", "false", - "XML", null, null, - null, null, null, null)); - MatcherAssert.assertThat(results, is(matchers)); + try (FlightStream stream = sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket())) { + + final List> results = getResults(stream); + + final List> matchers = ImmutableList.of( + asList("BIGINT", "-5", "19", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", + "BIGINT", "0", "0", + null, null, "10", null), + asList("LONG VARCHAR FOR BIT DATA", "-4", "32700", "X'", "'", emptyList().toString(), "1", "false", "0", + "true", "false", "false", + "LONG VARCHAR FOR BIT DATA", null, null, null, null, null, null), + asList("VARCHAR () FOR BIT DATA", "-3", "32672", "X'", "'", singletonList("length").toString(), "1", "false", + "2", "true", "false", + "false", "VARCHAR () FOR BIT DATA", null, null, null, null, null, null), + asList("CHAR () FOR BIT DATA", "-2", "254", "X'", "'", singletonList("length").toString(), "1", "false", "2", + "true", "false", "false", + "CHAR () FOR BIT DATA", null, null, null, null, null, null), + asList("LONG VARCHAR", "-1", "32700", "'", "'", emptyList().toString(), "1", "true", "1", "true", "false", + "false", + "LONG VARCHAR", null, null, null, null, null, null), + asList("CHAR", "1", "254", "'", "'", singletonList("length").toString(), "1", "true", "3", "true", "false", + "false", "CHAR", null, null, + null, null, null, null), + asList("NUMERIC", "2", "31", null, null, Arrays.asList("precision", "scale").toString(), "1", "false", "2", + "false", "true", "false", + "NUMERIC", "0", "31", null, null, "10", null), + asList("DECIMAL", "3", "31", null, null, Arrays.asList("precision", "scale").toString(), "1", "false", "2", + "false", "true", "false", + "DECIMAL", "0", "31", null, null, "10", null), + asList("INTEGER", "4", "10", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", + "INTEGER", "0", "0", + null, null, "10", null), + asList("SMALLINT", "5", "5", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", + "SMALLINT", "0", + "0", null, null, "10", null), + asList("FLOAT", "6", "52", null, null, singletonList("precision").toString(), "1", "false", "2", "false", + "false", "false", "FLOAT", null, + null, null, null, "2", null), + asList("REAL", "7", "23", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "false", + "REAL", null, null, + null, null, "2", null), + asList("DOUBLE", "8", "52", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "false", + "DOUBLE", null, + null, null, null, "2", null), + asList("VARCHAR", "12", "32672", "'", "'", singletonList("length").toString(), "1", "true", "3", "true", + "false", "false", "VARCHAR", + null, null, null, null, null, null), + asList("BOOLEAN", "16", "1", null, null, emptyList().toString(), "1", 
"false", "2", "true", "false", "false", + "BOOLEAN", null, + null, null, null, null, null), + asList("DATE", "91", "10", "DATE'", "'", emptyList().toString(), "1", "false", "2", "true", "false", "false", + "DATE", "0", "0", + null, null, "10", null), + asList("TIME", "92", "8", "TIME'", "'", emptyList().toString(), "1", "false", "2", "true", "false", "false", + "TIME", "0", "0", + null, null, "10", null), + asList("TIMESTAMP", "93", "29", "TIMESTAMP'", "'", emptyList().toString(), "1", "false", "2", "true", "false", + "false", + "TIMESTAMP", "0", "9", null, null, "10", null), + asList("OBJECT", "2000", null, null, null, emptyList().toString(), "1", "false", "2", "true", "false", + "false", "OBJECT", null, + null, null, null, null, null), + asList("BLOB", "2004", "2147483647", null, null, singletonList("length").toString(), "1", "false", "0", null, + "false", null, "BLOB", null, + null, null, null, null, null), + asList("CLOB", "2005", "2147483647", "'", "'", singletonList("length").toString(), "1", "true", "1", null, + "false", null, "CLOB", null, + null, null, null, null, null), + asList("XML", "2009", null, null, null, emptyList().toString(), "1", "true", "0", "false", "false", "false", + "XML", null, null, + null, null, null, null)); + MatcherAssert.assertThat(results, is(matchers)); + } } @Test - public void testGetTypeInfoWithFiltering() { + public void testGetTypeInfoWithFiltering() throws Exception { FlightInfo flightInfo = sqlClient.getXdbcTypeInfo(-5); - FlightStream stream = sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket()); + try (FlightStream stream = sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket())) { - final List> results = getResults(stream); + final List> results = getResults(stream); - final List> matchers = ImmutableList.of( - asList("BIGINT", "-5", "19", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", - "BIGINT", "0", "0", - null, null, "10", null)); - MatcherAssert.assertThat(results, is(matchers)); + final List> matchers = ImmutableList.of( + asList("BIGINT", "-5", "19", null, null, emptyList().toString(), "1", "false", "2", "false", "false", "true", + "BIGINT", "0", "0", + null, null, "10", null)); + MatcherAssert.assertThat(results, is(matchers)); + } } @Test - public void testGetCommandCrossReference() { + public void testGetCommandCrossReference() throws Exception { final FlightInfo flightInfo = sqlClient.getCrossReference(TableRef.of(null, null, "FOREIGNTABLE"), TableRef.of(null, null, "INTTABLE")); - final FlightStream stream = sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket()); - - final List> results = getResults(stream); - - final List> matchers = asList( - nullValue(String.class), // pk_catalog_name - is("APP"), // pk_schema_name - is("FOREIGNTABLE"), // pk_table_name - is("ID"), // pk_column_name - nullValue(String.class), // fk_catalog_name - is("APP"), // fk_schema_name - is("INTTABLE"), // fk_table_name - is("FOREIGNID"), // fk_column_name - is("1"), // key_sequence - containsString("SQL"), // fk_key_name - containsString("SQL"), // pk_key_name - is("3"), // update_rule - is("3")); // delete_rule - - Assertions.assertEquals(1, results.size()); - final List assertions = new ArrayList<>(); - for (int i = 0; i < matchers.size(); i++) { - final String actual = results.get(0).get(i); - final Matcher expected = matchers.get(i); - assertions.add(() -> MatcherAssert.assertThat(actual, expected)); + try (final FlightStream stream = 
sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket())) { + + final List> results = getResults(stream); + + final List> matchers = asList( + nullValue(String.class), // pk_catalog_name + is("APP"), // pk_schema_name + is("FOREIGNTABLE"), // pk_table_name + is("ID"), // pk_column_name + nullValue(String.class), // fk_catalog_name + is("APP"), // fk_schema_name + is("INTTABLE"), // fk_table_name + is("FOREIGNID"), // fk_column_name + is("1"), // key_sequence + containsString("SQL"), // fk_key_name + containsString("SQL"), // pk_key_name + is("3"), // update_rule + is("3")); // delete_rule + + Assertions.assertEquals(1, results.size()); + final List assertions = new ArrayList<>(); + for (int i = 0; i < matchers.size(); i++) { + final String actual = results.get(0).get(i); + final Matcher expected = matchers.get(i); + assertions.add(() -> MatcherAssert.assertThat(actual, expected)); + } + Assertions.assertAll(assertions); } - Assertions.assertAll(assertions); } @Test @@ -878,90 +870,6 @@ public void testCreateStatementResults() throws Exception { } } - List> getResults(FlightStream stream) { - final List> results = new ArrayList<>(); - while (stream.next()) { - try (final VectorSchemaRoot root = stream.getRoot()) { - final long rowCount = root.getRowCount(); - for (int i = 0; i < rowCount; ++i) { - results.add(new ArrayList<>()); - } - - root.getSchema().getFields().forEach(field -> { - try (final FieldVector fieldVector = root.getVector(field.getName())) { - if (fieldVector instanceof VarCharVector) { - final VarCharVector varcharVector = (VarCharVector) fieldVector; - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - final Text data = varcharVector.getObject(rowIndex); - results.get(rowIndex).add(isNull(data) ? null : data.toString()); - } - } else if (fieldVector instanceof IntVector) { - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - Object data = fieldVector.getObject(rowIndex); - results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); - } - } else if (fieldVector instanceof VarBinaryVector) { - final VarBinaryVector varbinaryVector = (VarBinaryVector) fieldVector; - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - final byte[] data = varbinaryVector.getObject(rowIndex); - final String output; - try { - output = isNull(data) ? - null : - MessageSerializer.deserializeSchema( - new ReadChannel(Channels.newChannel(new ByteArrayInputStream(data)))).toJson(); - } catch (final IOException e) { - throw new RuntimeException("Failed to deserialize schema", e); - } - results.get(rowIndex).add(output); - } - } else if (fieldVector instanceof DenseUnionVector) { - final DenseUnionVector denseUnionVector = (DenseUnionVector) fieldVector; - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - final Object data = denseUnionVector.getObject(rowIndex); - results.get(rowIndex).add(isNull(data) ? 
null : Objects.toString(data)); - } - } else if (fieldVector instanceof ListVector) { - for (int i = 0; i < fieldVector.getValueCount(); i++) { - if (!fieldVector.isNull(i)) { - List elements = (List) ((ListVector) fieldVector).getObject(i); - List values = new ArrayList<>(); - - for (Text element : elements) { - values.add(element.toString()); - } - results.get(i).add(values.toString()); - } - } - - } else if (fieldVector instanceof UInt4Vector) { - final UInt4Vector uInt4Vector = (UInt4Vector) fieldVector; - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - final Object data = uInt4Vector.getObject(rowIndex); - results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); - } - } else if (fieldVector instanceof UInt1Vector) { - final UInt1Vector uInt1Vector = (UInt1Vector) fieldVector; - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - final Object data = uInt1Vector.getObject(rowIndex); - results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); - } - } else if (fieldVector instanceof BitVector) { - for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { - Object data = fieldVector.getObject(rowIndex); - results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); - } - } else { - throw new UnsupportedOperationException("Not yet implemented"); - } - } - }); - } - } - - return results; - } - @Test public void testExecuteUpdate() { Assertions.assertAll( diff --git a/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSqlStreams.java b/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSqlStreams.java new file mode 100644 index 0000000000000..4672e0a141832 --- /dev/null +++ b/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/TestFlightSqlStreams.java @@ -0,0 +1,288 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.arrow.flight; + +import static java.util.Arrays.asList; +import static java.util.Collections.emptyList; +import static java.util.Collections.singletonList; +import static org.apache.arrow.flight.sql.util.FlightStreamUtils.getResults; +import static org.apache.arrow.util.AutoCloseables.close; +import static org.apache.arrow.vector.types.Types.MinorType.INT; +import static org.hamcrest.CoreMatchers.is; + +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.flight.sql.BasicFlightSqlProducer; +import org.apache.arrow.flight.sql.FlightSqlClient; +import org.apache.arrow.flight.sql.FlightSqlProducer; +import org.apache.arrow.flight.sql.impl.FlightSql; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.IntVector; +import org.apache.arrow.vector.VarCharVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.arrow.vector.util.Text; +import org.hamcrest.MatcherAssert; +import org.junit.jupiter.api.AfterAll; +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.Test; + +import com.google.common.collect.ImmutableList; +import com.google.protobuf.Any; +import com.google.protobuf.Message; + +public class TestFlightSqlStreams { + + /** + * A limited {@link FlightSqlProducer} for testing GetTables, GetTableTypes, GetSqlInfo, and limited SQL commands. + */ + private static class FlightSqlTestProducer extends BasicFlightSqlProducer { + + // Note that for simplicity the getStream* implementations are blocking, but a proper FlightSqlProducer should + // have non-blocking implementations of getStream*. + + private static final String FIXED_QUERY = "SELECT 1 AS c1 FROM test_table"; + private static final Schema FIXED_SCHEMA = new Schema(asList( + Field.nullable("c1", Types.MinorType.INT.getType()))); + + private BufferAllocator allocator; + + FlightSqlTestProducer(BufferAllocator allocator) { + this.allocator = allocator; + } + + @Override + protected List determineEndpoints(T request, FlightDescriptor flightDescriptor, + Schema schema) { + if (request instanceof FlightSql.CommandGetTables || + request instanceof FlightSql.CommandGetTableTypes || + request instanceof FlightSql.CommandGetXdbcTypeInfo || + request instanceof FlightSql.CommandGetSqlInfo) { + return Collections.singletonList(new FlightEndpoint(new Ticket(Any.pack(request).toByteArray()))); + } else if (request instanceof FlightSql.CommandStatementQuery && + ((FlightSql.CommandStatementQuery) request).getQuery().equals(FIXED_QUERY)) { + + // Tickets from CommandStatementQuery requests should be built using TicketStatementQuery then packed() into + // a ticket. The content of the statement handle is specific to the FlightSqlProducer. It does not need to + // be the query. It can be a query ID for example. 
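+      // For example (hypothetical, not part of this test), a producer that tracks
+      // running queries in a server-side registry could hand out an opaque ID:
+      //
+      //   TicketStatementQuery.newBuilder()
+      //       .setStatementHandle(ByteString.copyFromUtf8(queryId))
+      //       .build();
+      //
+      // This producer simply reuses the query text itself as the handle.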
+ FlightSql.TicketStatementQuery ticketStatementQuery = FlightSql.TicketStatementQuery.newBuilder() + .setStatementHandle(((FlightSql.CommandStatementQuery) request).getQueryBytes()) + .build(); + return Collections.singletonList(new FlightEndpoint(new Ticket(Any.pack(ticketStatementQuery).toByteArray()))); + } + throw CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException(); + } + + @Override + public FlightInfo getFlightInfoStatement(FlightSql.CommandStatementQuery command, + CallContext context, FlightDescriptor descriptor) { + return generateFlightInfo(command, descriptor, FIXED_SCHEMA); + } + + @Override + public void getStreamStatement(FlightSql.TicketStatementQuery ticket, + CallContext context, ServerStreamListener listener) { + final String query = ticket.getStatementHandle().toStringUtf8(); + if (!query.equals(FIXED_QUERY)) { + listener.error(CallStatus.UNIMPLEMENTED.withDescription("Not implemented.").toRuntimeException()); + } + + try (VectorSchemaRoot root = VectorSchemaRoot.create(FIXED_SCHEMA, allocator)) { + root.setRowCount(1); + ((IntVector) root.getVector("c1")).setSafe(0, 1); + listener.start(root); + listener.putNext(); + listener.completed(); + } + } + + @Override + public void getStreamSqlInfo(FlightSql.CommandGetSqlInfo command, CallContext context, + ServerStreamListener listener) { + try (VectorSchemaRoot root = VectorSchemaRoot.create(Schemas.GET_SQL_INFO_SCHEMA, allocator)) { + root.setRowCount(0); + listener.start(root); + listener.putNext(); + listener.completed(); + } + } + + @Override + public void getStreamTypeInfo(FlightSql.CommandGetXdbcTypeInfo request, + CallContext context, ServerStreamListener listener) { + try (VectorSchemaRoot root = VectorSchemaRoot.create(Schemas.GET_TYPE_INFO_SCHEMA, allocator)) { + root.setRowCount(1); + ((VarCharVector) root.getVector("type_name")).setSafe(0, new Text("Integer")); + ((IntVector) root.getVector("data_type")).setSafe(0, INT.ordinal()); + ((IntVector) root.getVector("column_size")).setSafe(0, 400); + root.getVector("literal_prefix").setNull(0); + root.getVector("literal_suffix").setNull(0); + root.getVector("create_params").setNull(0); + ((IntVector) root.getVector("nullable")).setSafe(0, FlightSql.Nullable.NULLABILITY_NULLABLE.getNumber()); + ((BitVector) root.getVector("case_sensitive")).setSafe(0, 1); + ((IntVector) root.getVector("nullable")).setSafe(0, FlightSql.Searchable.SEARCHABLE_FULL.getNumber()); + ((BitVector) root.getVector("unsigned_attribute")).setSafe(0, 1); + root.getVector("fixed_prec_scale").setNull(0); + ((BitVector) root.getVector("auto_increment")).setSafe(0, 1); + ((VarCharVector) root.getVector("local_type_name")).setSafe(0, new Text("Integer")); + root.getVector("minimum_scale").setNull(0); + root.getVector("maximum_scale").setNull(0); + ((IntVector) root.getVector("sql_data_type")).setSafe(0, INT.ordinal()); + root.getVector("datetime_subcode").setNull(0); + ((IntVector) root.getVector("num_prec_radix")).setSafe(0, 10); + root.getVector("interval_precision").setNull(0); + + listener.start(root); + listener.putNext(); + listener.completed(); + } + } + + @Override + public void getStreamTables(FlightSql.CommandGetTables command, CallContext context, + ServerStreamListener listener) { + try (VectorSchemaRoot root = VectorSchemaRoot.create(Schemas.GET_TABLES_SCHEMA_NO_SCHEMA, allocator)) { + root.setRowCount(1); + root.getVector("catalog_name").setNull(0); + root.getVector("db_schema_name").setNull(0); + ((VarCharVector) root.getVector("table_name")).setSafe(0, new 
Text("test_table")); + ((VarCharVector) root.getVector("table_type")).setSafe(0, new Text("TABLE")); + + listener.start(root); + listener.putNext(); + listener.completed(); + } + } + + @Override + public void getStreamTableTypes(CallContext context, ServerStreamListener listener) { + try (VectorSchemaRoot root = VectorSchemaRoot.create(Schemas.GET_TABLE_TYPES_SCHEMA, allocator)) { + root.setRowCount(1); + ((VarCharVector) root.getVector("table_type")).setSafe(0, new Text("TABLE")); + + listener.start(root); + listener.putNext(); + listener.completed(); + } + } + } + + private static BufferAllocator allocator; + + private static FlightServer server; + private static FlightSqlClient sqlClient; + + @BeforeAll + public static void setUp() throws Exception { + allocator = new RootAllocator(Integer.MAX_VALUE); + + final Location serverLocation = Location.forGrpcInsecure("localhost", 0); + server = FlightServer.builder(allocator, serverLocation, new FlightSqlTestProducer(allocator)) + .build() + .start(); + + final Location clientLocation = Location.forGrpcInsecure("localhost", server.getPort()); + sqlClient = new FlightSqlClient(FlightClient.builder(allocator, clientLocation).build()); + } + + @AfterAll + public static void tearDown() throws Exception { + close(sqlClient, server, allocator); + } + + @Test + public void testGetTablesResultNoSchema() throws Exception { + try (final FlightStream stream = + sqlClient.getStream( + sqlClient.getTables(null, null, null, null, false) + .getEndpoints().get(0).getTicket())) { + Assertions.assertAll( + () -> MatcherAssert.assertThat(stream.getSchema(), is(FlightSqlProducer.Schemas.GET_TABLES_SCHEMA_NO_SCHEMA)), + () -> { + final List> results = getResults(stream); + final List> expectedResults = ImmutableList.of( + // catalog_name | schema_name | table_name | table_type | table_schema + asList(null, null, "test_table", "TABLE")); + MatcherAssert.assertThat(results, is(expectedResults)); + } + ); + } + } + + @Test + public void testGetTableTypesResult() throws Exception { + try (final FlightStream stream = + sqlClient.getStream(sqlClient.getTableTypes().getEndpoints().get(0).getTicket())) { + Assertions.assertAll( + () -> MatcherAssert.assertThat(stream.getSchema(), is(FlightSqlProducer.Schemas.GET_TABLE_TYPES_SCHEMA)), + () -> { + final List> tableTypes = getResults(stream); + final List> expectedTableTypes = ImmutableList.of( + // table_type + singletonList("TABLE") + ); + MatcherAssert.assertThat(tableTypes, is(expectedTableTypes)); + } + ); + } + } + + @Test + public void testGetSqlInfoResults() throws Exception { + final FlightInfo info = sqlClient.getSqlInfo(); + try (final FlightStream stream = sqlClient.getStream(info.getEndpoints().get(0).getTicket())) { + Assertions.assertAll( + () -> MatcherAssert.assertThat(stream.getSchema(), is(FlightSqlProducer.Schemas.GET_SQL_INFO_SCHEMA)), + () -> MatcherAssert.assertThat(getResults(stream), is(emptyList())) + ); + } + } + + @Test + public void testGetTypeInfo() throws Exception { + FlightInfo flightInfo = sqlClient.getXdbcTypeInfo(); + + try (FlightStream stream = sqlClient.getStream(flightInfo.getEndpoints().get(0).getTicket())) { + + final List> results = getResults(stream); + + final List> matchers = ImmutableList.of( + asList("Integer", "4", "400", null, null, "3", "true", null, "true", null, "true", + "Integer", null, null, "4", null, "10", null)); + + MatcherAssert.assertThat(results, is(matchers)); + } + } + + @Test + public void testExecuteQuery() throws Exception { + try (final FlightStream stream 
= sqlClient + .getStream(sqlClient.execute(FlightSqlTestProducer.FIXED_QUERY).getEndpoints().get(0).getTicket())) { + Assertions.assertAll( + () -> MatcherAssert.assertThat(stream.getSchema(), is(FlightSqlTestProducer.FIXED_SCHEMA)), + () -> MatcherAssert.assertThat(getResults(stream), is(singletonList(singletonList("1")))) + ); + } + } +} diff --git a/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/util/FlightStreamUtils.java b/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/util/FlightStreamUtils.java new file mode 100644 index 0000000000000..fbbe9ef01816e --- /dev/null +++ b/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/util/FlightStreamUtils.java @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.flight.sql.util; + +import static java.util.Objects.isNull; + +import java.io.ByteArrayInputStream; +import java.io.IOException; +import java.nio.channels.Channels; +import java.util.ArrayList; +import java.util.List; +import java.util.Objects; + +import org.apache.arrow.flight.FlightStream; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.IntVector; +import org.apache.arrow.vector.UInt1Vector; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.VarBinaryVector; +import org.apache.arrow.vector.VarCharVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.DenseUnionVector; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.ipc.ReadChannel; +import org.apache.arrow.vector.ipc.message.MessageSerializer; +import org.apache.arrow.vector.util.Text; + +public class FlightStreamUtils { + + public static List> getResults(FlightStream stream) { + final List> results = new ArrayList<>(); + while (stream.next()) { + try (final VectorSchemaRoot root = stream.getRoot()) { + final long rowCount = root.getRowCount(); + for (int i = 0; i < rowCount; ++i) { + results.add(new ArrayList<>()); + } + + root.getSchema().getFields().forEach(field -> { + try (final FieldVector fieldVector = root.getVector(field.getName())) { + if (fieldVector instanceof VarCharVector) { + final VarCharVector varcharVector = (VarCharVector) fieldVector; + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + final Text data = varcharVector.getObject(rowIndex); + results.get(rowIndex).add(isNull(data) ? null : data.toString()); + } + } else if (fieldVector instanceof IntVector) { + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + Object data = fieldVector.getObject(rowIndex); + results.get(rowIndex).add(isNull(data) ? 
null : Objects.toString(data)); + } + } else if (fieldVector instanceof VarBinaryVector) { + final VarBinaryVector varbinaryVector = (VarBinaryVector) fieldVector; + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + final byte[] data = varbinaryVector.getObject(rowIndex); + final String output; + try { + output = isNull(data) ? + null : + MessageSerializer.deserializeSchema( + new ReadChannel(Channels.newChannel(new ByteArrayInputStream(data)))).toJson(); + } catch (final IOException e) { + throw new RuntimeException("Failed to deserialize schema", e); + } + results.get(rowIndex).add(output); + } + } else if (fieldVector instanceof DenseUnionVector) { + final DenseUnionVector denseUnionVector = (DenseUnionVector) fieldVector; + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + final Object data = denseUnionVector.getObject(rowIndex); + results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); + } + } else if (fieldVector instanceof ListVector) { + for (int i = 0; i < fieldVector.getValueCount(); i++) { + if (!fieldVector.isNull(i)) { + List<Text> elements = (List<Text>) ((ListVector) fieldVector).getObject(i); + List<String> values = new ArrayList<>(); + + for (Text element : elements) { + values.add(element.toString()); + } + results.get(i).add(values.toString()); + } + } + + } else if (fieldVector instanceof UInt4Vector) { + final UInt4Vector uInt4Vector = (UInt4Vector) fieldVector; + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + final Object data = uInt4Vector.getObject(rowIndex); + results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); + } + } else if (fieldVector instanceof UInt1Vector) { + final UInt1Vector uInt1Vector = (UInt1Vector) fieldVector; + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + final Object data = uInt1Vector.getObject(rowIndex); + results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); + } + } else if (fieldVector instanceof BitVector) { + for (int rowIndex = 0; rowIndex < rowCount; rowIndex++) { + Object data = fieldVector.getObject(rowIndex); + results.get(rowIndex).add(isNull(data) ? null : Objects.toString(data)); + } + } else { + throw new UnsupportedOperationException("Not yet implemented"); + } + } + }); + } + } + + return results; + } +} From 0f94eb64c45a3a8f24395616941971e751b58765 Mon Sep 17 00:00:00 2001 From: hrishisd Date: Mon, 25 Sep 2023 10:55:29 -0400 Subject: [PATCH 60/96] GH-37829: [Java] Avoid resizing data buffer twice when appending variable length vectors (#37844) ### Rationale for this change This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffer when appending many small elements to a variable-length vector with a large data buffer. ### What changes are included in this PR? The new behavior resizes only the offset and validity buffers when growing the target vector to hold the total number of combined elements. The data buffer is resized based on the total required data size of the combined elements. ### Are these changes tested? Yes.
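To make the failure mode concrete, here is a rough sketch of the problematic pattern (illustrative sizes only; it assumes the same Arrow Java classes exercised in the test diff below):

```java
try (BufferAllocator allocator = new RootAllocator();
     VarCharVector target = new VarCharVector("target", allocator);
     VarCharVector delta = new VarCharVector("delta", allocator)) {
  // Target: one ~1 MiB value -> tiny offset/validity buffers, large data buffer.
  byte[] big = new byte[1 << 20];
  target.allocateNew(big.length, 1);
  target.setSafe(0, big);
  target.setValueCount(1);

  // Delta: ~1M one-byte values -> large offset/validity buffers, small data buffer.
  int n = 1 << 20;
  delta.allocateNew(n, n);
  for (int i = 0; i < n; i++) {
    delta.setSafe(i, new byte[] {'a'});
  }
  delta.setValueCount(n);

  // Growing the target's value capacity from 1 to ~1M takes ~20 doublings.
  // Previously each doubling also doubled the ~1 MiB data buffer, overshooting
  // the per-buffer allocation limit; now only the offset and validity buffers
  // are doubled, and the data buffer grows to the actual combined data size.
  new VectorAppender(target).visit(delta, null);
}
```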
I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. ### Are there any user-facing changes? No. * Closes: #37829 Authored-by: hrishisd Signed-off-by: David Li --- .../arrow/vector/util/VectorAppender.java | 4 +-- .../arrow/vector/util/TestVectorAppender.java | 27 ++++++++++++++++++- .../util/TestVectorSchemaRootAppender.java | 2 +- 3 files changed, 29 insertions(+), 4 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/VectorAppender.java b/java/vector/src/main/java/org/apache/arrow/vector/util/VectorAppender.java index 9f73732ccfdd3..c5de380f9c173 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/VectorAppender.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/VectorAppender.java @@ -116,7 +116,7 @@ public ValueVector visit(BaseVariableWidthVector deltaVector, Void value) { // make sure there is enough capacity while (targetVector.getValueCapacity() < newValueCount) { - targetVector.reAlloc(); + ((BaseVariableWidthVector) targetVector).reallocValidityAndOffsetBuffers(); } while (targetVector.getDataBuffer().capacity() < newValueCapacity) { ((BaseVariableWidthVector) targetVector).reallocDataBuffer(); @@ -170,7 +170,7 @@ public ValueVector visit(BaseLargeVariableWidthVector deltaVector, Void value) { // make sure there is enough capacity while (targetVector.getValueCapacity() < newValueCount) { - targetVector.reAlloc(); + ((BaseLargeVariableWidthVector) targetVector).reallocValidityAndOffsetBuffers(); } while (targetVector.getDataBuffer().capacity() < newValueCapacity) { ((BaseLargeVariableWidthVector) targetVector).reallocDataBuffer(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorAppender.java b/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorAppender.java index 25d26623d5c05..ab36ea2fd2129 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorAppender.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorAppender.java @@ -21,11 +21,14 @@ import static junit.framework.TestCase.assertTrue; import static org.junit.jupiter.api.Assertions.assertThrows; +import java.nio.charset.StandardCharsets; import java.util.Arrays; +import java.util.Collections; import java.util.List; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.BaseValueVector; import org.apache.arrow.vector.BigIntVector; import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.Float4Vector; @@ -63,7 +66,8 @@ public class TestVectorAppender { @Before public void prepare() { - allocator = new RootAllocator(1024 * 1024); + // Permit allocating 4 vectors of max size. 
+ allocator = new RootAllocator(4 * BaseValueVector.MAX_ALLOCATION_SIZE); } @After @@ -185,6 +189,27 @@ public void testAppendEmptyVariableWidthVector() { } } + @Test + public void testAppendLargeAndSmallVariableVectorsWithinLimit() { + int sixteenthOfMaxAllocation = Math.toIntExact(BaseValueVector.MAX_ALLOCATION_SIZE / 16); + try (VarCharVector target = makeVarCharVec(1, sixteenthOfMaxAllocation); + VarCharVector delta = makeVarCharVec(sixteenthOfMaxAllocation, 1)) { + new VectorAppender(delta).visit(target, null); + new VectorAppender(target).visit(delta, null); + } + } + + private VarCharVector makeVarCharVec(int numElements, int bytesPerElement) { + VarCharVector v = new VarCharVector("text", allocator); + v.allocateNew((long) numElements * bytesPerElement, numElements); + for (int i = 0; i < numElements; i++) { + String s = String.join("", Collections.nCopies(bytesPerElement, "a")); + v.setSafe(i, s.getBytes(StandardCharsets.US_ASCII)); + } + v.setValueCount(numElements); + return v; + } + @Test public void testAppendLargeVariableWidthVector() { final int length1 = 5; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorSchemaRootAppender.java b/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorSchemaRootAppender.java index ab0ee3a2075a3..6309d385870c9 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorSchemaRootAppender.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/util/TestVectorSchemaRootAppender.java @@ -50,7 +50,7 @@ public void shutdown() { } @Test - public void testVectorScehmaRootAppend() { + public void testVectorSchemaRootAppend() { final int length1 = 5; final int length2 = 3; final int length3 = 2; From 1ae243628611a43812747f3bc1505072a139b1c8 Mon Sep 17 00:00:00 2001 From: sgilmore10 <74676073+sgilmore10@users.noreply.github.com> Date: Mon, 25 Sep 2023 11:56:13 -0400 Subject: [PATCH 61/96] GH-37825: [MATLAB] Improve `arrow.type.Field` display (#37826) ### Rationale for this change We should improve the display of `arrow.type.Field`, which currently looks like this: ```matlab >> arrow.field("A", arrow.int32()) ans = A: int32 ``` This display isn't very "MATLAB-like". For instance, it doesn't display the object's class type. This display would be better: ```matlab >> arrow.field("A", arrow.int32()) ans = Field with properties: Name: "A" Type: [1x1 arrow.type.Int32Type] ``` ### What changes are included in this PR? 1. Added `getPropertyGroups` method to `Field`. This method is inherited from the superclass `matlab.mixin.CustomDisplay`. 2. Removed `displayScalarObject` method from `Field`. This method is also inherited from `matlab.mixin.CustomDisplay`. By implementing `getPropertyGroups`, we no longer need to override `displayScalarObject` and can use the default implementation of this method in `CustomDisplay`. 3. Removed `toString()` method from `Field`. This method was private, and only used by `displayScalarObject`. Since `displayScalarObject` has been removed, `toString()` can be deleted too. 4. Converted the helper test methods (`makeLinkString`, `makeDimensionString`, `verifyDisplay`) in `tTypeDisplay` into standalone functions. Test classes other than `tTypeDisplay.m` can now use these utilities as well. ### Are these changes tested? Yes. Added a `TestDisplay` unit test to `tField.m`. ### Are there any user-facing changes? Yes. `arrow.type.Field` objects are now displayed differently in the Command Window. ### Future Directions 1. Update the display of `arrow.tabular.Schema`. 2. 
Update the display of `arrow.array.Array`. 3. Update the display of `arrow.tabular.Table`. 4. Update the display of `arrow.tabular.RecordBatch`. * Closes: #37825 Authored-by: Sarah Gilmore Signed-off-by: Kevin Gurney --- .../src/cpp/arrow/matlab/type/proxy/field.cc | 11 --- .../src/cpp/arrow/matlab/type/proxy/field.h | 1 - .../+test/+display/makeDimensionString.m | 22 +++++ .../+internal/+test/+display/makeLinkString.m | 36 ++++++++ .../+arrow/+internal/+test/+display/verify.m | 32 +++++++ matlab/src/matlab/+arrow/+type/Field.m | 12 +-- matlab/test/arrow/type/tField.m | 28 ++++++ matlab/test/arrow/type/tTypeDisplay.m | 85 ++++++++----------- 8 files changed, 157 insertions(+), 70 deletions(-) create mode 100644 matlab/src/matlab/+arrow/+internal/+test/+display/makeDimensionString.m create mode 100644 matlab/src/matlab/+arrow/+internal/+test/+display/makeLinkString.m create mode 100644 matlab/src/matlab/+arrow/+internal/+test/+display/verify.m diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/field.cc b/matlab/src/cpp/arrow/matlab/type/proxy/field.cc index 7df0e7d6ef304..138771a35c327 100644 --- a/matlab/src/cpp/arrow/matlab/type/proxy/field.cc +++ b/matlab/src/cpp/arrow/matlab/type/proxy/field.cc @@ -32,7 +32,6 @@ namespace arrow::matlab::type::proxy { Field::Field(std::shared_ptr field) : field{std::move(field)} { REGISTER_METHOD(Field, getName); REGISTER_METHOD(Field, getType); - REGISTER_METHOD(Field, toString); } std::shared_ptr Field::unwrap() { @@ -64,16 +63,6 @@ namespace arrow::matlab::type::proxy { context.outputs[0] = output; } - void Field::toString(libmexclass::proxy::method::Context& context) { - namespace mda = ::matlab::data; - mda::ArrayFactory factory; - - const auto str_utf8 = field->ToString(); - MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto str_utf16, arrow::util::UTF8StringToUTF16(str_utf8), context, error::UNICODE_CONVERSION_ERROR_ID); - auto str_mda = factory.createScalar(str_utf16); - context.outputs[0] = str_mda; - } - libmexclass::proxy::MakeResult Field::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) { namespace mda = ::matlab::data; using FieldProxy = arrow::matlab::type::proxy::Field; diff --git a/matlab/src/cpp/arrow/matlab/type/proxy/field.h b/matlab/src/cpp/arrow/matlab/type/proxy/field.h index 4256fd21a0a23..3526a6c422ac3 100644 --- a/matlab/src/cpp/arrow/matlab/type/proxy/field.h +++ b/matlab/src/cpp/arrow/matlab/type/proxy/field.h @@ -36,7 +36,6 @@ class Field : public libmexclass::proxy::Proxy { protected: void getName(libmexclass::proxy::method::Context& context); void getType(libmexclass::proxy::method::Context& context); - void toString(libmexclass::proxy::method::Context& context); std::shared_ptr field; }; diff --git a/matlab/src/matlab/+arrow/+internal/+test/+display/makeDimensionString.m b/matlab/src/matlab/+arrow/+internal/+test/+display/makeDimensionString.m new file mode 100644 index 0000000000000..4281667543634 --- /dev/null +++ b/matlab/src/matlab/+arrow/+internal/+test/+display/makeDimensionString.m @@ -0,0 +1,22 @@ +%MAKEDIMENSIONSTRING Utility function for creating a string representation +%of dimensions. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. 
You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +function dimensionString = makeDimensionString(arraySize) + dimensionString = string(arraySize); + dimensionString = join(dimensionString, char(215)); +end diff --git a/matlab/src/matlab/+arrow/+internal/+test/+display/makeLinkString.m b/matlab/src/matlab/+arrow/+internal/+test/+display/makeLinkString.m new file mode 100644 index 0000000000000..df6a11612043c --- /dev/null +++ b/matlab/src/matlab/+arrow/+internal/+test/+display/makeLinkString.m @@ -0,0 +1,36 @@ +%MAKELINKSTRING Utility function for creating hyperlinks. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +function link = makeLinkString(opts) + arguments + opts.FullClassName(1, 1) string + opts.ClassName(1, 1) string + % When displaying heterogeneous arrays, only the name of the + % closest shared ancestor class is displayed in bold. All other + % class names are not bolded. + opts.BoldFont(1, 1) logical + end + + if opts.BoldFont + link = compose("<a href=""matlab:helpPopup %s"" style=""font-weight:bold"">%s</a>", ... + opts.FullClassName, opts.ClassName); + else + link = compose("<a href=""matlab:helpPopup %s"">%s</a>", ... + opts.FullClassName, opts.ClassName); + end +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+internal/+test/+display/verify.m b/matlab/src/matlab/+arrow/+internal/+test/+display/verify.m new file mode 100644 index 0000000000000..d9a420663b783 --- /dev/null +++ b/matlab/src/matlab/+arrow/+internal/+test/+display/verify.m @@ -0,0 +1,32 @@ +%VERIFY Utility function used to verify object display. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. See the License for the specific language governing +% permissions and limitations under the License. + +function verify(testCase, actualDisplay, expectedDisplay) + % When the MATLAB GUI is running, '×' (char(215)) is used as + % the delimiter between dimension values.
However, when the + % GUI is not running, 'x' (char(120)) is used as the delimiter. + % To account for this discrepancy, check if actualDisplay + % contains char(215). If not, replace all instances of + % char(215) in expectedDisplay with char(120). + + tf = contains(actualDisplay, char(215)); + if ~tf + idx = strfind(expectedDisplay, char(215)); + expectedDisplay(idx) = char(120); + end + testCase.verifyEqual(actualDisplay, expectedDisplay); +end diff --git a/matlab/src/matlab/+arrow/+type/Field.m b/matlab/src/matlab/+arrow/+type/Field.m index f67ba69fe9826..d6e03f61fbea1 100644 --- a/matlab/src/matlab/+arrow/+type/Field.m +++ b/matlab/src/matlab/+arrow/+type/Field.m @@ -91,16 +91,10 @@ end end - methods (Access = private) - function str = toString(obj) - str = obj.Proxy.toString(); - end - end - methods (Access=protected) - function displayScalarObject(obj) - disp(obj.toString()); + function groups = getPropertyGroups(~) + targets = ["Name", "Type"]; + groups = matlab.mixin.util.PropertyGroup(targets); end end - end diff --git a/matlab/test/arrow/type/tField.m b/matlab/test/arrow/type/tField.m index 1a89c0077b5ae..f84034d032c23 100644 --- a/matlab/test/arrow/type/tField.m +++ b/matlab/test/arrow/type/tField.m @@ -231,5 +231,33 @@ function TestIsEqualNonScalarFalse(testCase) % Compare arrow.type.Field array and a string array testCase.verifyFalse(isequal(f1, strings(size(f1)))); end + + function TestDisplay(testCase) + % Verify the display of Field objects. + % + % Example: + % + % Field with properties: + % + % Name: FieldA + % Type: [1x2 arrow.type.Int32Type] + + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + import arrow.internal.test.display.makeDimensionString + + field = arrow.field("B", arrow.timestamp(TimeZone="America/Anchorage")); %#ok + classnameLink = makeLinkString(FullClassName="arrow.type.Field", ClassName="Field", BoldFont=true); + header = " " + classnameLink + " with properties:" + newline; + body = strjust(pad(["Name:"; "Type:"])); + dimensionString = makeDimensionString([1 1]); + fieldString = compose("[%s %s]", dimensionString, "arrow.type.TimestampType"); + body = body + " " + ["""B"""; fieldString]; + body = " " + body; + footer = string(newline); + expectedDisplay = char(strjoin([header body' footer], newline)); + actualDisplay = evalc('disp(field)'); + verify(testCase, actualDisplay, expectedDisplay); + end end end diff --git a/matlab/test/arrow/type/tTypeDisplay.m b/matlab/test/arrow/type/tTypeDisplay.m index f84c5ab56e270..6f5a4bcd97717 100644 --- a/matlab/test/arrow/type/tTypeDisplay.m +++ b/matlab/test/arrow/type/tTypeDisplay.m @@ -50,6 +50,10 @@ function EmptyTypeDisplay(testCase) % % ID + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + import arrow.internal.test.display.makeDimensionString + type = arrow.type.Type.empty(0, 1); typeLink = makeLinkString(FullClassName="arrow.type.Type", ClassName="Type", BoldFont=true); dimensionString = makeDimensionString(size(type)); @@ -59,7 +63,7 @@ function EmptyTypeDisplay(testCase) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(type)'); - testCase.verifyDisplay(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function NonScalarArrayDifferentTypes(testCase) @@ -71,6 +75,10 @@ function NonScalarArrayDifferentTypes(testCase) % % ID + import arrow.internal.test.display.verify + import 
arrow.internal.test.display.makeLinkString + import arrow.internal.test.display.makeDimensionString + float32Type = arrow.float32(); timestampType = arrow.timestamp(); typeArray = [float32Type timestampType]; @@ -88,7 +96,7 @@ function NonScalarArrayDifferentTypes(testCase) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(typeArray)'); - testCase.verifyDisplay(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function NonScalarArraySameTypes(testCase) @@ -102,6 +110,10 @@ function NonScalarArraySameTypes(testCase) % TimeUnit % TimeZone + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + import arrow.internal.test.display.makeDimensionString + timestampType1 = arrow.timestamp(TimeZone="Pacific/Fiji"); timestampType2 = arrow.timestamp(TimeUnit="Second"); typeArray = [timestampType1 timestampType2]; @@ -114,7 +126,7 @@ function NonScalarArraySameTypes(testCase) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(typeArray)'); - testCase.verifyDisplay(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function TestTypeDisplaysOnlyID(testCase, TypeDisplaysOnlyID) @@ -127,6 +139,9 @@ function TestTypeDisplaysOnlyID(testCase, TypeDisplaysOnlyID) % % ID: Boolean + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + type = TypeDisplaysOnlyID; fullClassName = string(class(type)); className = reverse(extractBefore(reverse(fullClassName), ".")); @@ -136,7 +151,7 @@ function TestTypeDisplaysOnlyID(testCase, TypeDisplaysOnlyID) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(type)'); - testCase.verifyDisplay(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function TestTimeType(testCase, TimeType) @@ -149,6 +164,9 @@ function TestTimeType(testCase, TimeType) % ID: Time32 % TimeUnit: Second + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + type = TimeType; fullClassName = string(class(type)); className = reverse(extractBefore(reverse(fullClassName), ".")); @@ -161,7 +179,7 @@ function TestTimeType(testCase, TimeType) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(type)'); - testCase.verifyEqual(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function TestDateType(testCase, DateType) @@ -174,6 +192,9 @@ function TestDateType(testCase, DateType) % ID: Date32 % DateUnit: Day + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + type = DateType; fullClassName = string(class(type)); className = reverse(extractBefore(reverse(fullClassName), ".")); @@ -186,7 +207,7 @@ function TestDateType(testCase, DateType) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(type)'); - testCase.verifyEqual(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function TimestampTypeDisplay(testCase) @@ -200,6 +221,9 @@ function TimestampTypeDisplay(testCase) % TimeUnit: Second % TimeZone: "America/Anchorage" + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + type = 
arrow.timestamp(TimeUnit="Second", TimeZone="America/Anchorage"); %#ok classnameLink = makeLinkString(FullClassName="arrow.type.TimestampType", ClassName="TimestampType", BoldFont=true); header = " " + classnameLink + " with properties:" + newline; @@ -209,7 +233,7 @@ function TimestampTypeDisplay(testCase) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(type)'); - testCase.verifyEqual(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end function StructTypeDisplay(testCase) @@ -222,6 +246,10 @@ function StructTypeDisplay(testCase) % ID: Struct % Fields: [1x2 arrow.type.Field] + import arrow.internal.test.display.verify + import arrow.internal.test.display.makeLinkString + import arrow.internal.test.display.makeDimensionString + fieldA = arrow.field("A", arrow.int32()); fieldB = arrow.field("B", arrow.timestamp(TimeZone="America/Anchorage")); type = arrow.struct(fieldA, fieldB); %#ok @@ -235,48 +263,7 @@ function StructTypeDisplay(testCase) footer = string(newline); expectedDisplay = char(strjoin([header body' footer], newline)); actualDisplay = evalc('disp(type)'); - testCase.verifyDisplay(actualDisplay, expectedDisplay); + verify(testCase, actualDisplay, expectedDisplay); end end - - methods - function verifyDisplay(testCase, actualDisplay, expectedDisplay) - % When the MATLAB GUI is running, '×' (char(215)) is used as - % the delimiter between dimension values. However, when the - % GUI is not running, 'x' (char(120)) is used as the delimiter. - % To account for this discrepancy, check if actualDisplay - % contains char(215). If not, replace all instances of - % char(215) in expectedDisplay with char(120). - - tf = contains(actualDisplay, char(215)); - if ~tf - idx = strfind(expectedDisplay, char(215)); - expectedDisplay(idx) = char(120); - end - testCase.verifyEqual(actualDisplay, expectedDisplay); - end - end -end - -function link = makeLinkString(opts) - arguments - opts.FullClassName(1, 1) string - opts.ClassName(1, 1) string - % When displaying heterogeneous arrays, only the name of the - % closest shared anscestor class is displayed in bold. All other - % class names are not bolded. - opts.BoldFont(1, 1) logical - end - - if opts.BoldFont - link = compose("%s", ... - opts.FullClassName, opts.ClassName); - else - link = compose("%s", opts.FullClassName, opts.ClassName); - end end - -function dimensionString = makeDimensionString(arraySize) - dimensionString = string(arraySize); - dimensionString = join(dimensionString, char(215)); -end \ No newline at end of file From 5ca26e89228c272305aa2070ce8eb17a54e17640 Mon Sep 17 00:00:00 2001 From: sgilmore10 <74676073+sgilmore10@users.noreply.github.com> Date: Mon, 25 Sep 2023 16:26:48 -0400 Subject: [PATCH 62/96] GH-37782: [C++] Add `CanReferenceFieldsByNames` method to `arrow::StructArray` (#37823) ### Rationale for this change `arrow::Schema` has a method called `CanReferenceFieldsByNames` which callers can use prior to calling `GetFieldByName`. It would be nice if `arrow::StructArray` also had `CanReferenceFieldsByNames` as a method. I also think it would be nice to add a `CanReferenceFieldByName` method that accepts a `std::string` instead of a `std::vector` to `StructArray` and `Schema`. That way, users wouldn't have to create a `std::vector` containing one `std::string` when they just have one field name. ### What changes are included in this PR? 1. Added `CanReferenceFieldsByNames` method to `StructArray` 2. 
Added `CanReferenceFieldByName` method to `StructArray` 3. Added `CanReferenceFieldByName` method to `Schema` ### Are these changes tested? Yes. I added unit tests for `CanReferenceFieldsByNames` and `CanReferenceFieldByName` to `array_struct_test.cc` and `type_test.cc`. ### Are there any user-facing changes? Yes. `CanReferenceFieldsByNames` and `CanReferenceFieldByName` can be called on a `StructArray`. Users can also call `CanReferenceFieldByName` on a `Schema`. * Closes: #37782 Authored-by: Sarah Gilmore Signed-off-by: Sutou Kouhei --- cpp/src/arrow/array/array_nested.cc | 16 ++++++++ cpp/src/arrow/array/array_nested.h | 6 +++ cpp/src/arrow/array/array_struct_test.cc | 52 ++++++++++++++++++++++++ cpp/src/arrow/type.cc | 14 ++++--- cpp/src/arrow/type.h | 3 ++ cpp/src/arrow/type_test.cc | 18 ++++++++ 6 files changed, 104 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/array/array_nested.cc b/cpp/src/arrow/array/array_nested.cc index df60074c78470..d8308c824953a 100644 --- a/cpp/src/arrow/array/array_nested.cc +++ b/cpp/src/arrow/array/array_nested.cc @@ -627,6 +627,22 @@ std::shared_ptr<Array> StructArray::GetFieldByName(const std::string& name) cons return i == -1 ? nullptr : field(i); } +Status StructArray::CanReferenceFieldByName(const std::string& name) const { + if (GetFieldByName(name) == nullptr) { + return Status::Invalid("Field named '", name, + "' not found or not unique in the struct."); + } + return Status::OK(); +} + +Status StructArray::CanReferenceFieldsByNames( + const std::vector<std::string>& names) const { + for (const auto& name : names) { + ARROW_RETURN_NOT_OK(CanReferenceFieldByName(name)); + } + return Status::OK(); +} + Result<ArrayVector> StructArray::Flatten(MemoryPool* pool) const { ArrayVector flattened; flattened.resize(data_->child_data.size()); diff --git a/cpp/src/arrow/array/array_nested.h b/cpp/src/arrow/array/array_nested.h index 47c1db039ccc9..8d5cc95fec00d 100644 --- a/cpp/src/arrow/array/array_nested.h +++ b/cpp/src/arrow/array/array_nested.h @@ -404,6 +404,12 @@ class ARROW_EXPORT StructArray : public Array { /// Returns null if name not found std::shared_ptr<Array> GetFieldByName(const std::string& name) const; + /// Indicate if field named `name` can be found unambiguously in the struct. + Status CanReferenceFieldByName(const std::string& name) const; + + /// Indicate if fields named `names` can be found unambiguously in the struct.
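+  /// For example, a caller that needs several columns can validate all the
+  /// names up front, e.g. `CanReferenceFieldsByNames({"a", "b"})`, instead of
+  /// checking each GetFieldByName() result for null.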
+ Status CanReferenceFieldsByNames(const std::vector<std::string>& names) const; + /// \brief Flatten this array as a vector of arrays, one for each field /// /// \param[in] pool The pool to allocate null bitmaps from, if necessary diff --git a/cpp/src/arrow/array/array_struct_test.cc b/cpp/src/arrow/array/array_struct_test.cc index 318c83860e009..73d53a7efa59b 100644 --- a/cpp/src/arrow/array/array_struct_test.cc +++ b/cpp/src/arrow/array/array_struct_test.cc @@ -303,6 +303,58 @@ TEST(StructArray, FlattenOfSlice) { ASSERT_OK(arr->ValidateFull()); } +TEST(StructArray, CanReferenceFieldByName) { + auto a = ArrayFromJSON(int8(), "[4, 5]"); + auto b = ArrayFromJSON(int16(), "[6, 7]"); + auto c = ArrayFromJSON(int32(), "[8, 9]"); + auto d = ArrayFromJSON(int64(), "[10, 11]"); + auto children = std::vector<std::shared_ptr<Array>>{a, b, c, d}; + + auto f0 = field("f0", int8()); + auto f1 = field("f1", int16()); + auto f2 = field("f2", int32()); + auto f3 = field("f1", int64()); + auto type = struct_({f0, f1, f2, f3}); + + auto arr = std::make_shared<StructArray>(type, 2, children); + + ASSERT_OK(arr->CanReferenceFieldByName("f0")); + ASSERT_OK(arr->CanReferenceFieldByName("f2")); + // Not found + ASSERT_RAISES(Invalid, arr->CanReferenceFieldByName("nope")); + + // Duplicates + ASSERT_RAISES(Invalid, arr->CanReferenceFieldByName("f1")); +} + +TEST(StructArray, CanReferenceFieldsByNames) { + auto a = ArrayFromJSON(int8(), "[4, 5]"); + auto b = ArrayFromJSON(int16(), "[6, 7]"); + auto c = ArrayFromJSON(int32(), "[8, 9]"); + auto d = ArrayFromJSON(int64(), "[10, 11]"); + auto children = std::vector<std::shared_ptr<Array>>{a, b, c, d}; + + auto f0 = field("f0", int8()); + auto f1 = field("f1", int16()); + auto f2 = field("f2", int32()); + auto f3 = field("f1", int64()); + auto type = struct_({f0, f1, f2, f3}); + + auto arr = std::make_shared<StructArray>(type, 2, children); + + ASSERT_OK(arr->CanReferenceFieldsByNames({"f0", "f2"})); + ASSERT_OK(arr->CanReferenceFieldsByNames({"f2", "f0"})); + + // Not found + ASSERT_RAISES(Invalid, arr->CanReferenceFieldsByNames({"nope"})); + ASSERT_RAISES(Invalid, arr->CanReferenceFieldsByNames({"f0", "nope"})); + // Duplicates + ASSERT_RAISES(Invalid, arr->CanReferenceFieldsByNames({"f1"})); + ASSERT_RAISES(Invalid, arr->CanReferenceFieldsByNames({"f0", "f1"})); + // Both + ASSERT_RAISES(Invalid, arr->CanReferenceFieldsByNames({"f0", "f1", "nope"})); +} + // ---------------------------------------------------------------------------------- // Struct test class TestStructBuilder : public ::testing::Test { diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 3d294a3fa8642..47bf52660ffe9 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -1847,14 +1847,18 @@ std::vector<int> Schema::GetAllFieldIndices(const std::string& name) const { return result; } +Status Schema::CanReferenceFieldByName(const std::string& name) const { + if (GetFieldByName(name) == nullptr) { + return Status::Invalid("Field named '", name, + "' not found or not unique in the schema."); + } + return Status::OK(); +} + Status Schema::CanReferenceFieldsByNames(const std::vector<std::string>& names) const { for (const auto& name : names) { - if (GetFieldByName(name) == nullptr) { - return Status::Invalid("Field named '", name, - "' not found or not unique in the schema."); - } + ARROW_RETURN_NOT_OK(CanReferenceFieldByName(name)); } - return Status::OK(); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 718540d449226..19910979287cc 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -2048,6 +2048,9 @@ class ARROW_EXPORT Schema : public
detail::Fingerprintable, /// Return the indices of all fields having this name std::vector<int> GetAllFieldIndices(const std::string& name) const; + /// Indicate if field named `name` can be found unambiguously in the schema. + Status CanReferenceFieldByName(const std::string& name) const; + /// Indicate if fields named `names` can be found unambiguously in the schema. Status CanReferenceFieldsByNames(const std::vector<std::string>& names) const; diff --git a/cpp/src/arrow/type_test.cc b/cpp/src/arrow/type_test.cc index c55b33b4151e4..3dbefdcf0c564 100644 --- a/cpp/src/arrow/type_test.cc +++ b/cpp/src/arrow/type_test.cc @@ -548,6 +548,24 @@ TEST_F(TestSchema, GetFieldDuplicates) { ASSERT_EQ(results.size(), 0); } +TEST_F(TestSchema, CanReferenceFieldByName) { + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f2 = field("f2", utf8()); + auto f3 = field("f1", list(int16())); + + auto schema = ::arrow::schema({f0, f1, f2, f3}); + + ASSERT_OK(schema->CanReferenceFieldByName("f0")); + ASSERT_OK(schema->CanReferenceFieldByName("f2")); + + // Not found + ASSERT_RAISES(Invalid, schema->CanReferenceFieldByName("nope")); + + // Duplicates + ASSERT_RAISES(Invalid, schema->CanReferenceFieldByName("f1")); +} + TEST_F(TestSchema, CanReferenceFieldsByNames) { auto f0 = field("f0", int32()); auto f1 = field("f1", uint8(), false); From ebc23687cb0376e24a0f002fe710db5ad891c674 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Tue, 26 Sep 2023 05:57:30 +0900 Subject: [PATCH 63/96] GH-37849: [C++] Add cpp/src/**/*.cmake to cmake-format targets (#37850) ### Rationale for this change In general, all our `.cmake` files should be `cmake-format` targets. ### What changes are included in this PR? Add missing patterns for `cpp/src/**/*.cmake`. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: #37849 Authored-by: Sutou Kouhei Signed-off-by: Sutou Kouhei --- cpp/src/arrow/arrow-config.cmake | 9 ++++----- dev/archery/archery/utils/lint.py | 1 + 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/arrow-config.cmake b/cpp/src/arrow/arrow-config.cmake index 8c9173c1710cb..c18c9eff37279 100644 --- a/cpp/src/arrow/arrow-config.cmake +++ b/cpp/src/arrow/arrow-config.cmake @@ -19,8 +19,7 @@ message(WARNING "find_package(arrow) is deprecated. Use find_package(Arrow) inst find_package(Arrow CONFIG) include(FindPackageHandleStandardArgs) -find_package_handle_standard_args(arrow - REQUIRED_VARS - ARROW_INCLUDE_DIR - VERSION_VAR - ARROW_VERSION) +find_package_handle_standard_args( + arrow + REQUIRED_VARS ARROW_INCLUDE_DIR + VERSION_VAR ARROW_VERSION) diff --git a/dev/archery/archery/utils/lint.py b/dev/archery/archery/utils/lint.py index 18c93a5b8b71b..3efe5994055db 100644 --- a/dev/archery/archery/utils/lint.py +++ b/dev/archery/archery/utils/lint.py @@ -149,6 +149,7 @@ def cmake_linter(src, fix=False): include_patterns=[ 'ci/**/*.cmake', 'cpp/CMakeLists.txt', + 'cpp/src/**/*.cmake', 'cpp/src/**/*.cmake.in', 'cpp/src/**/CMakeLists.txt', 'cpp/examples/**/CMakeLists.txt', From cc51e68c5b3f9372b6410f9496b9cb53437201e5 Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Tue, 26 Sep 2023 09:14:02 +0200 Subject: [PATCH 64/96] GH-37789: [Integration][Go] Go C Data Interface integration testing (#37788) ### Rationale for this change We want to enable integration testing of the Arrow Go implementation of the C Data Interface, so as to ensure interoperability. ### What changes are included in this PR? 1.
Enable C Data Interface integration testing for the Arrow Go implementation 2. Fix compatibility issues found by the integration tests ### Are these changes tested? Yes, by construction. ### Are there any user-facing changes? Bugfixes in the Arrow Go C Data Interface implementation. * Closes: #37789 Authored-by: Antoine Pitrou Signed-off-by: Antoine Pitrou --- .github/workflows/go.yml | 2 +- ci/scripts/go_build.sh | 19 ++ dev/archery/archery/integration/runner.py | 7 +- dev/archery/archery/integration/tester_go.py | 134 +++++++++++- docker-compose.yml | 1 + go/arrow/cdata/cdata.go | 2 +- go/arrow/cdata/cdata_exports.go | 38 ++-- go/arrow/cdata/cdata_test.go | 12 +- go/arrow/internal/arrjson/reader.go | 10 + .../internal/cdata_integration/entrypoints.go | 192 ++++++++++++++++++ 10 files changed, 391 insertions(+), 26 deletions(-) create mode 100644 go/arrow/internal/cdata_integration/entrypoints.go diff --git a/.github/workflows/go.yml b/.github/workflows/go.yml index 3c695891b48d6..ad8fedb9bd9e4 100644 --- a/.github/workflows/go.yml +++ b/.github/workflows/go.yml @@ -232,7 +232,7 @@ jobs: name: AMD64 Windows 2019 Go ${{ matrix.go }} runs-on: windows-2019 if: ${{ !contains(github.event.pull_request.title, 'WIP') }} - timeout-minutes: 15 + timeout-minutes: 25 strategy: fail-fast: false matrix: diff --git a/ci/scripts/go_build.sh b/ci/scripts/go_build.sh index 3c8cc0f4ee2e2..2a38901337c56 100755 --- a/ci/scripts/go_build.sh +++ b/ci/scripts/go_build.sh @@ -41,3 +41,22 @@ pushd ${source_dir}/parquet go install -v ./... popd + +if [[ -n "${ARROW_GO_INTEGRATION}" ]]; then + pushd ${source_dir}/arrow/internal/cdata_integration + + case "$(uname)" in + Linux) + go_lib="arrow_go_integration.so" + ;; + Darwin) + go_lib="arrow_go_integration.so" + ;; + MINGW*) + go_lib="arrow_go_integration.dll" + ;; + esac + go build -tags cdata_integration,assert -buildmode=c-shared -o ${go_lib} . + + popd +fi diff --git a/dev/archery/archery/integration/runner.py b/dev/archery/archery/integration/runner.py index 2fd1d2d7f0c44..a780d33cbf323 100644 --- a/dev/archery/archery/integration/runner.py +++ b/dev/archery/archery/integration/runner.py @@ -70,6 +70,7 @@ def __init__(self, json_files, self.serial = serial self.gold_dirs = gold_dirs self.failures: List[Outcome] = [] + self.skips: List[Outcome] = [] self.match = match if self.match is not None: @@ -207,6 +208,8 @@ def case_wrapper(test_case): self.failures.append(outcome.failure) if self.stop_on_error: break + elif outcome.skipped: + self.skips.append(outcome) else: with ThreadPoolExecutor() as executor: @@ -215,6 +218,8 @@ def case_wrapper(test_case): self.failures.append(outcome.failure) if self.stop_on_error: break + elif outcome.skipped: + self.skips.append(outcome) def _compare_ipc_implementations( self, @@ -638,7 +643,7 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True, log(f'{exc_type}: {exc_value}') log() - log(fail_count, "failures") + log(f"{fail_count} failures, {len(runner.skips)} skips") if fail_count > 0: sys.exit(1) diff --git a/dev/archery/archery/integration/tester_go.py b/dev/archery/archery/integration/tester_go.py index fea33cd0ac6c1..6fa26ea02b8e7 100644 --- a/dev/archery/archery/integration/tester_go.py +++ b/dev/archery/archery/integration/tester_go.py @@ -16,11 +16,14 @@ # under the License. import contextlib +import functools import os import subprocess -from .tester import Tester +from . 
import cdata +from .tester import Tester, CDataExporter, CDataImporter from .util import run_cmd, log +from ..utils.source import ARROW_ROOT_DEFAULT # FIXME(sbinet): revisit for Go modules @@ -39,12 +42,21 @@ "localhost", ] +_dll_suffix = ".dll" if os.name == "nt" else ".so" + +_DLL_PATH = os.path.join( + ARROW_ROOT_DEFAULT, + "go/arrow/internal/cdata_integration") +_INTEGRATION_DLL = os.path.join(_DLL_PATH, "arrow_go_integration" + _dll_suffix) + class GoTester(Tester): PRODUCER = True CONSUMER = True FLIGHT_SERVER = True FLIGHT_CLIENT = True + C_DATA_EXPORTER = True + C_DATA_IMPORTER = True name = 'Go' @@ -119,3 +131,123 @@ def flight_request(self, port, json_path=None, scenario_name=None): if self.debug: log(' '.join(cmd)) run_cmd(cmd) + + def make_c_data_exporter(self): + return GoCDataExporter(self.debug, self.args) + + def make_c_data_importer(self): + return GoCDataImporter(self.debug, self.args) + + +_go_c_data_entrypoints = """ + const char* ArrowGo_ExportSchemaFromJson(const char* json_path, + uintptr_t out); + const char* ArrowGo_ImportSchemaAndCompareToJson( + const char* json_path, uintptr_t c_schema); + + const char* ArrowGo_ExportBatchFromJson(const char* json_path, + int num_batch, + uintptr_t out); + const char* ArrowGo_ImportBatchAndCompareToJson( + const char* json_path, int num_batch, uintptr_t c_array); + + int64_t ArrowGo_BytesAllocated(); + void ArrowGo_RunGC(); + void ArrowGo_FreeError(const char*); + """ + + +@functools.lru_cache +def _load_ffi(ffi, lib_path=_INTEGRATION_DLL): + ffi.cdef(_go_c_data_entrypoints) + dll = ffi.dlopen(lib_path) + return dll + + +class _CDataBase: + + def __init__(self, debug, args): + self.debug = debug + self.args = args + self.ffi = cdata.ffi() + self.dll = _load_ffi(self.ffi) + + def _pointer_to_int(self, c_ptr): + return self.ffi.cast('uintptr_t', c_ptr) + + def _check_go_error(self, go_error): + """ + Check a `const char*` error return from an integration entrypoint. + + A null means success, a non-empty string is an error message. + The string is dynamically allocated on the Go side. + """ + assert self.ffi.typeof(go_error) is self.ffi.typeof("const char*") + if go_error != self.ffi.NULL: + try: + error = self.ffi.string(go_error).decode('utf8', + errors='replace') + raise RuntimeError( + f"Go C Data Integration call failed: {error}") + finally: + self.dll.ArrowGo_FreeError(go_error) + + def _run_gc(self): + self.dll.ArrowGo_RunGC() + + +class GoCDataExporter(CDataExporter, _CDataBase): + # Note: the Arrow Go C Data export functions expect their output + # ArrowStream or ArrowArray argument to be zero-initialized. + # This is currently ensured through the use of `ffi.new`. 
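+    # A hypothetical call sequence from the harness side, for orientation
+    # (assumes the ArrowSchema struct declared by cdata.ffi(); not part of
+    # this class):
+    #
+    #   c_schema = ffi.new("struct ArrowSchema*")  # zero-initialized by cffi
+    #   exporter.export_schema_from_json("schema.json", c_schema)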
+ + def export_schema_from_json(self, json_path, c_schema_ptr): + go_error = self.dll.ArrowGo_ExportSchemaFromJson( + str(json_path).encode(), self._pointer_to_int(c_schema_ptr)) + self._check_go_error(go_error) + + def export_batch_from_json(self, json_path, num_batch, c_array_ptr): + go_error = self.dll.ArrowGo_ExportBatchFromJson( + str(json_path).encode(), num_batch, + self._pointer_to_int(c_array_ptr)) + self._check_go_error(go_error) + + @property + def supports_releasing_memory(self): + return True + + def record_allocation_state(self): + self._run_gc() + return self.dll.ArrowGo_BytesAllocated() + + def compare_allocation_state(self, recorded, gc_until): + def pred(): + return self.record_allocation_state() == recorded + + return gc_until(pred) + + +class GoCDataImporter(CDataImporter, _CDataBase): + + def import_schema_and_compare_to_json(self, json_path, c_schema_ptr): + go_error = self.dll.ArrowGo_ImportSchemaAndCompareToJson( + str(json_path).encode(), self._pointer_to_int(c_schema_ptr)) + self._check_go_error(go_error) + + def import_batch_and_compare_to_json(self, json_path, num_batch, + c_array_ptr): + go_error = self.dll.ArrowGo_ImportBatchAndCompareToJson( + str(json_path).encode(), num_batch, + self._pointer_to_int(c_array_ptr)) + self._check_go_error(go_error) + + @property + def supports_releasing_memory(self): + return True + + def gc_until(self, predicate): + for i in range(10): + if predicate(): + return True + self._run_gc() + return False diff --git a/docker-compose.yml b/docker-compose.yml index 8ae06900c57f9..62e5aee0a841c 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -1732,6 +1732,7 @@ services: <<: [*common, *ccache] # tell archery where the arrow binaries are located ARROW_CPP_EXE_PATH: /build/cpp/debug + ARROW_GO_INTEGRATION: 1 ARCHERY_INTEGRATION_WITH_RUST: 0 command: ["/arrow/ci/scripts/rust_build.sh /arrow /build && diff --git a/go/arrow/cdata/cdata.go b/go/arrow/cdata/cdata.go index bc8fc6e987b93..dc8825a7edb67 100644 --- a/go/arrow/cdata/cdata.go +++ b/go/arrow/cdata/cdata.go @@ -197,7 +197,7 @@ func importSchema(schema *CArrowSchema) (ret arrow.Field, err error) { // handle types with params via colon typs := strings.Split(f, ":") - defaulttz := "UTC" + defaulttz := "" switch typs[0] { case "tss": tz := typs[1] diff --git a/go/arrow/cdata/cdata_exports.go b/go/arrow/cdata/cdata_exports.go index ae6247494b100..187c2deb9755f 100644 --- a/go/arrow/cdata/cdata_exports.go +++ b/go/arrow/cdata/cdata_exports.go @@ -368,34 +368,36 @@ func exportArray(arr arrow.Array, out *CArrowArray, outSchema *CArrowSchema) { exportField(arrow.Field{Type: arr.DataType()}, outSchema) } + nbuffers := len(arr.Data().Buffers()) + buf_offset := 0 + // Some types don't have validity bitmaps, but we keep them shifted + // to make processing easier in other contexts. This means that + // we have to adjust when exporting. + has_validity_bitmap := internal.DefaultHasValidityBitmap(arr.DataType().ID()) + if nbuffers > 0 && !has_validity_bitmap { + nbuffers-- + buf_offset++ + } + out.dictionary = nil out.null_count = C.int64_t(arr.NullN()) out.length = C.int64_t(arr.Len()) out.offset = C.int64_t(arr.Data().Offset()) - out.n_buffers = C.int64_t(len(arr.Data().Buffers())) - - if out.n_buffers > 0 { - var ( - nbuffers = len(arr.Data().Buffers()) - bufs = arr.Data().Buffers() - ) - // unions don't have validity bitmaps, but we keep them shifted - // to make processing easier in other contexts. 
This means that - // we have to adjust for union arrays - if !internal.DefaultHasValidityBitmap(arr.DataType().ID()) { - out.n_buffers-- - nbuffers-- - bufs = bufs[1:] - } + out.n_buffers = C.int64_t(nbuffers) + out.buffers = nil + + if nbuffers > 0 { + bufs := arr.Data().Buffers() buffers := allocateBufferPtrArr(nbuffers) - for i := range bufs { - buf := bufs[i] + for i, buf := range bufs[buf_offset:] { if buf == nil || buf.Len() == 0 { - if i > 0 || !internal.DefaultHasValidityBitmap(arr.DataType().ID()) { + if i > 0 || !has_validity_bitmap { // apache/arrow#33936: export a dummy buffer to be friendly to // implementations that don't import NULL properly buffers[i] = (*C.void)(unsafe.Pointer(&C.kGoCdataZeroRegion)) } else { + // null pointer permitted for the validity bitmap + // (assuming null count is 0) buffers[i] = nil } continue diff --git a/go/arrow/cdata/cdata_test.go b/go/arrow/cdata/cdata_test.go index a0c2f25496a6b..af05649b1c541 100644 --- a/go/arrow/cdata/cdata_test.go +++ b/go/arrow/cdata/cdata_test.go @@ -184,13 +184,17 @@ func TestImportTemporalSchema(t *testing.T) { {arrow.FixedWidthTypes.MonthInterval, "tiM"}, {arrow.FixedWidthTypes.DayTimeInterval, "tiD"}, {arrow.FixedWidthTypes.MonthDayNanoInterval, "tin"}, - {arrow.FixedWidthTypes.Timestamp_s, "tss:"}, + {arrow.FixedWidthTypes.Timestamp_s, "tss:UTC"}, + {&arrow.TimestampType{Unit: arrow.Second}, "tss:"}, {&arrow.TimestampType{Unit: arrow.Second, TimeZone: "Europe/Paris"}, "tss:Europe/Paris"}, - {arrow.FixedWidthTypes.Timestamp_ms, "tsm:"}, + {arrow.FixedWidthTypes.Timestamp_ms, "tsm:UTC"}, + {&arrow.TimestampType{Unit: arrow.Millisecond}, "tsm:"}, {&arrow.TimestampType{Unit: arrow.Millisecond, TimeZone: "Europe/Paris"}, "tsm:Europe/Paris"}, - {arrow.FixedWidthTypes.Timestamp_us, "tsu:"}, + {arrow.FixedWidthTypes.Timestamp_us, "tsu:UTC"}, + {&arrow.TimestampType{Unit: arrow.Microsecond}, "tsu:"}, {&arrow.TimestampType{Unit: arrow.Microsecond, TimeZone: "Europe/Paris"}, "tsu:Europe/Paris"}, - {arrow.FixedWidthTypes.Timestamp_ns, "tsn:"}, + {arrow.FixedWidthTypes.Timestamp_ns, "tsn:UTC"}, + {&arrow.TimestampType{Unit: arrow.Nanosecond}, "tsn:"}, {&arrow.TimestampType{Unit: arrow.Nanosecond, TimeZone: "Europe/Paris"}, "tsn:Europe/Paris"}, } diff --git a/go/arrow/internal/arrjson/reader.go b/go/arrow/internal/arrjson/reader.go index 34b9b6e10ec4a..c8056ef1dc744 100644 --- a/go/arrow/internal/arrjson/reader.go +++ b/go/arrow/internal/arrjson/reader.go @@ -82,6 +82,8 @@ func (r *Reader) Release() { r.recs[i] = nil } } + r.memo.Clear() + r.memo = nil } } func (r *Reader) Schema() *arrow.Schema { return r.schema } @@ -96,6 +98,14 @@ func (r *Reader) Read() (arrow.Record, error) { return rec, nil } +func (r *Reader) ReadAt(index int) (arrow.Record, error) { + if index >= r.NumRecords() { + return nil, io.EOF + } + rec := r.recs[index] + return rec, nil +} + var ( _ arrio.Reader = (*Reader)(nil) ) diff --git a/go/arrow/internal/cdata_integration/entrypoints.go b/go/arrow/internal/cdata_integration/entrypoints.go new file mode 100644 index 0000000000000..629b8a762a689 --- /dev/null +++ b/go/arrow/internal/cdata_integration/entrypoints.go @@ -0,0 +1,192 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +//go:build cdata_integration +// +build cdata_integration + +package main + +import ( + "fmt" + "os" + "runtime" + "unsafe" + + "github.com/apache/arrow/go/v14/arrow/array" + "github.com/apache/arrow/go/v14/arrow/cdata" + "github.com/apache/arrow/go/v14/arrow/internal/arrjson" + "github.com/apache/arrow/go/v14/arrow/memory" +) + +// #include +// #include +import "C" + +var alloc = memory.NewCheckedAllocator(memory.NewGoAllocator()) + +//export ArrowGo_BytesAllocated +func ArrowGo_BytesAllocated() int64 { + return int64(alloc.CurrentAlloc()) +} + +//export ArrowGo_RunGC +func ArrowGo_RunGC() { + runtime.GC() +} + +//export ArrowGo_FreeError +func ArrowGo_FreeError(cError *C.char) { + C.free(unsafe.Pointer(cError)) +} + +// When used in a defer() statement, this functions catches an incoming +// panic and converts it into a regular error. This avoids crashing the +// archery integration process and lets other tests proceed. +// Not all panics may be caught and some will still crash the process, though. +func handlePanic(err *error) { + if e := recover(); e != nil { + // Add a prefix while wrapping the panic-error + *err = fmt.Errorf("panic: %w", e.(error)) + } +} + +func newJsonReader(cJsonPath *C.char) (*arrjson.Reader, error) { + jsonPath := C.GoString(cJsonPath) + + f, err := os.Open(jsonPath) + if err != nil { + return nil, fmt.Errorf("could not open JSON file %q: %w", jsonPath, err) + } + defer f.Close() + + jsonReader, err := arrjson.NewReader(f, arrjson.WithAllocator(alloc)) + if err != nil { + return nil, fmt.Errorf("could not open JSON file reader from file %q: %w", jsonPath, err) + } + return jsonReader, nil +} + +func exportSchemaFromJson(cJsonPath *C.char, out *cdata.CArrowSchema) error { + jsonReader, err := newJsonReader(cJsonPath) + if err != nil { + return err + } + defer jsonReader.Release() + schema := jsonReader.Schema() + defer handlePanic(&err) + cdata.ExportArrowSchema(schema, out) + return err +} + +func importSchemaAndCompareToJson(cJsonPath *C.char, cSchema *cdata.CArrowSchema) error { + jsonReader, err := newJsonReader(cJsonPath) + if err != nil { + return err + } + defer jsonReader.Release() + schema := jsonReader.Schema() + importedSchema, err := cdata.ImportCArrowSchema(cSchema) + if err != nil { + return err + } + if !schema.Equal(importedSchema) || !schema.Metadata().Equal(importedSchema.Metadata()) { + return fmt.Errorf( + "Schemas are different:\n- Json Schema: %s\n- Imported Schema: %s", + schema.String(), + importedSchema.String()) + } + return nil +} + +func exportBatchFromJson(cJsonPath *C.char, num_batch int, out *cdata.CArrowArray) error { + // XXX this function exports a single batch at a time, but the JSON reader + // reads all batches at construction. 
+ jsonReader, err := newJsonReader(cJsonPath) + if err != nil { + return err + } + defer jsonReader.Release() + batch, err := jsonReader.ReadAt(num_batch) + if err != nil { + return err + } + defer handlePanic(&err) + cdata.ExportArrowRecordBatch(batch, out, nil) + return err +} + +func importBatchAndCompareToJson(cJsonPath *C.char, num_batch int, cArray *cdata.CArrowArray) error { + jsonReader, err := newJsonReader(cJsonPath) + if err != nil { + return err + } + defer jsonReader.Release() + schema := jsonReader.Schema() + batch, err := jsonReader.ReadAt(num_batch) + if err != nil { + return err + } + + importedBatch, err := cdata.ImportCRecordBatchWithSchema(cArray, schema) + if err != nil { + return err + } + defer importedBatch.Release() + if !array.RecordEqual(batch, importedBatch) { + return fmt.Errorf( + "Batches are different:\n- Json Batch: %v\n- Imported Batch: %v", + batch, importedBatch) + } + return nil +} + +//export ArrowGo_ExportSchemaFromJson +func ArrowGo_ExportSchemaFromJson(cJsonPath *C.char, out uintptr) *C.char { + err := exportSchemaFromJson(cJsonPath, cdata.SchemaFromPtr(out)) + if err != nil { + return C.CString(err.Error()) + } + return nil +} + +//export ArrowGo_ExportBatchFromJson +func ArrowGo_ExportBatchFromJson(cJsonPath *C.char, num_batch int, out uintptr) *C.char { + err := exportBatchFromJson(cJsonPath, num_batch, cdata.ArrayFromPtr(out)) + if err != nil { + return C.CString(err.Error()) + } + return nil +} + +//export ArrowGo_ImportSchemaAndCompareToJson +func ArrowGo_ImportSchemaAndCompareToJson(cJsonPath *C.char, cSchema uintptr) *C.char { + err := importSchemaAndCompareToJson(cJsonPath, cdata.SchemaFromPtr(cSchema)) + if err != nil { + return C.CString(err.Error()) + } + return nil +} + +//export ArrowGo_ImportBatchAndCompareToJson +func ArrowGo_ImportBatchAndCompareToJson(cJsonPath *C.char, num_batch int, cArray uintptr) *C.char { + err := importBatchAndCompareToJson(cJsonPath, num_batch, cdata.ArrayFromPtr(cArray)) + if err != nil { + return C.CString(err.Error()) + } + return nil +} + +func main() {} From 38922eded5797afca8ade33145bf59140ada1663 Mon Sep 17 00:00:00 2001 From: James Duong Date: Tue, 26 Sep 2023 05:44:53 -0700 Subject: [PATCH 65/96] GH-37703: [Java] Method for setting exact number of records in ListVector (#37838) ### Rationale for this change There is currently a setInitialCapacity() function that can be used to set a number of records and density factor when setting the capacity on a ListVector. A developer may want to specify the exact total number of records instead and can use the new methods introduced here. ### What changes are included in this PR? Add setInitialTotalCapacity() to BaseRepeatedVector, ListVector, DensityAwareVector, and LargeListVector to specify the exact total number of records in the backing vector. This is an alternative to using the density argument in setInitialCapacity() that allows the caller to precisely specify the capacity. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. 
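As a usage sketch (illustrative only, not part of the change itself — it mirrors the new `testTotalCapacity` test added below; the `FieldType` setup and the wrapper class are assumptions for the example):

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.types.Types.MinorType;
import org.apache.arrow.vector.types.pojo.FieldType;

public class TotalCapacityExample {
  public static void main(String[] args) {
    FieldType type = FieldType.nullable(MinorType.INT.getType());
    try (BufferAllocator allocator = new RootAllocator();
         ListVector vector = new ListVector("list", allocator, type, null)) {
      vector.addOrGetVector(type);             // materialize the inner data vector
      // Reserve space for exactly 10 records holding 100 elements in total,
      // instead of passing a per-record density estimate to setInitialCapacity().
      vector.setInitialTotalCapacity(10, 100);
      vector.allocateNewSafe();                // the allocator may round capacities up
    }
  }
}
```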
* Closes: #37703 Authored-by: James Duong Signed-off-by: David Li --- .../complex/BaseRepeatedValueVector.java | 21 ++++++++++++++++++ .../arrow/vector/complex/LargeListVector.java | 21 ++++++++++++++++++ .../arrow/vector/complex/ListVector.java | 22 +++++++++++++++++++ .../arrow/vector/TestLargeListVector.java | 20 +++++++++++++++++ .../apache/arrow/vector/TestListVector.java | 20 +++++++++++++++++ 5 files changed, 104 insertions(+) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index 62d4a1299dead..95deceb4e75ca 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -205,6 +205,27 @@ public void setInitialCapacity(int numRecords, double density) { } } + /** + * Specialized version of setInitialTotalCapacity() for ListVector. This is + * used by some callers when they want to explicitly control and be + * conservative about memory allocated for inner data vector. This is + * very useful when we are working with memory constraints for a query + * and have a fixed amount of memory reserved for the record batch. In + * such cases, we are likely to face OOM or related problems when + * we reserve memory for a record batch with value count x and + * do setInitialCapacity(x) such that each vector allocates only + * what is necessary and not the default amount but the multiplier + * forces the memory requirement to go beyond what was needed. + * + * @param numRecords value count + * @param totalNumberOfElements the total number of elements to to allow + * for in this vector across all records. + */ + public void setInitialTotalCapacity(int numRecords, int totalNumberOfElements) { + offsetAllocationSizeInBytes = (numRecords + 1) * OFFSET_WIDTH; + vector.setInitialCapacity(totalNumberOfElements); + } + @Override public int getValueCapacity() { final int offsetValueCapacity = Math.max(getOffsetBufferValueCapacity() - 1, 0); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java index 6ef5f994fc6f4..acb058cda3cb8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java @@ -196,6 +196,27 @@ public void setInitialCapacity(int numRecords, double density) { } } + /** + * Specialized version of setInitialTotalCapacity() for ListVector. This is + * used by some callers when they want to explicitly control and be + * conservative about memory allocated for inner data vector. This is + * very useful when we are working with memory constraints for a query + * and have a fixed amount of memory reserved for the record batch. In + * such cases, we are likely to face OOM or related problems when + * we reserve memory for a record batch with value count x and + * do setInitialCapacity(x) such that each vector allocates only + * what is necessary and not the default amount but the multiplier + * forces the memory requirement to go beyond what was needed. + * + * @param numRecords value count + * @param totalNumberOfElements the total number of elements to to allow + * for in this vector across all records. 
+ */ + public void setInitialTotalCapacity(int numRecords, int totalNumberOfElements) { + offsetAllocationSizeInBytes = (numRecords + 1) * OFFSET_WIDTH; + vector.setInitialCapacity(totalNumberOfElements); + } + /** * Get the density of this ListVector. * @return density diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 52e5307e13908..0d6ff11f8ccf3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -148,6 +148,28 @@ public void setInitialCapacity(int numRecords, double density) { super.setInitialCapacity(numRecords, density); } + /** + * Specialized version of setInitialTotalCapacity() for ListVector. This is + * used by some callers when they want to explicitly control and be + * conservative about memory allocated for inner data vector. This is + * very useful when we are working with memory constraints for a query + * and have a fixed amount of memory reserved for the record batch. In + * such cases, we are likely to face OOM or related problems when + * we reserve memory for a record batch with value count x and + * do setInitialCapacity(x) such that each vector allocates only + * what is necessary and not the default amount but the multiplier + * forces the memory requirement to go beyond what was needed. + * + * @param numRecords value count + * @param totalNumberOfElements the total number of elements to to allow + * for in this vector across all records. + */ + @Override + public void setInitialTotalCapacity(int numRecords, int totalNumberOfElements) { + validityAllocationSizeInBytes = getValidityBufferSizeFromCount(numRecords); + super.setInitialTotalCapacity(numRecords, totalNumberOfElements); + } + /** * Get the density of this ListVector. * @return density diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestLargeListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestLargeListVector.java index c1d60da4d5988..adf86183c0ada 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestLargeListVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestLargeListVector.java @@ -972,6 +972,26 @@ public void testIsEmpty() { } } + @Test + public void testTotalCapacity() { + final FieldType type = FieldType.nullable(MinorType.INT.getType()); + try (final LargeListVector vector = new LargeListVector("list", allocator, type, null)) { + // Force the child vector to be allocated based on the type + // (this is a bad API: we have to track and repeat the type twice) + vector.addOrGetVector(type); + + // Specify the allocation size but do not actually allocate + vector.setInitialTotalCapacity(10, 100); + + // Finally actually do the allocation + vector.allocateNewSafe(); + + // Note: allocator rounds up and can be greater than the requested allocation. 
+ assertTrue(vector.getValueCapacity() >= 10); + assertTrue(vector.getDataVector().getValueCapacity() >= 100); + } + } + private void writeIntValues(UnionLargeListWriter writer, int[] values) { writer.startList(); for (int v: values) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java index f0f19058eef20..2a1228c2a38c2 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java @@ -1115,6 +1115,26 @@ public void testIsEmpty() { } } + @Test + public void testTotalCapacity() { + final FieldType type = FieldType.nullable(MinorType.INT.getType()); + try (final ListVector vector = new ListVector("list", allocator, type, null)) { + // Force the child vector to be allocated based on the type + // (this is a bad API: we have to track and repeat the type twice) + vector.addOrGetVector(type); + + // Specify the allocation size but do not actually allocate + vector.setInitialTotalCapacity(10, 100); + + // Finally actually do the allocation + vector.allocateNewSafe(); + + // Note: allocator rounds up and can be greater than the requested allocation. + assertTrue(vector.getValueCapacity() >= 10); + assertTrue(vector.getDataVector().getValueCapacity() >= 100); + } + } + private void writeIntValues(UnionListWriter writer, int[] values) { writer.startList(); for (int v: values) { From 5978729277e164c5b4dfd8916bb410b9e67a04c7 Mon Sep 17 00:00:00 2001 From: mwish Date: Tue, 26 Sep 2023 22:23:19 +0800 Subject: [PATCH 66/96] GH-37293: [C++][Parquet] Encoding: Add Benchmark for DELTA_BYTE_ARRAY (#37641) ### Rationale for this change Add benchmark for DELTA_BYTE_ARRAY in parquet, and do tiny optimization. ### What changes are included in this PR? Add benchmark for DELTA_BYTE_ARRAY in parquet, and do tiny optimization. ### Are these changes tested? no ### Are there any user-facing changes? no * Closes: #37293 Lead-authored-by: mwish Co-authored-by: Antoine Pitrou Signed-off-by: Antoine Pitrou --- cpp/src/parquet/encoding.cc | 8 +- cpp/src/parquet/encoding_benchmark.cc | 108 ++++++++++++++++++++++++++ 2 files changed, 114 insertions(+), 2 deletions(-) diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc index e3c8ab196f45e..0564ea2b93f3f 100644 --- a/cpp/src/parquet/encoding.cc +++ b/cpp/src/parquet/encoding.cc @@ -3300,7 +3300,11 @@ class DeltaByteArrayDecoderImpl : public DecoderImpl, virtual public TypedDecode void SetData(int num_values, const uint8_t* data, int len) override { num_values_ = num_values; - decoder_ = std::make_shared<::arrow::bit_util::BitReader>(data, len); + if (decoder_) { + decoder_->Reset(data, len); + } else { + decoder_ = std::make_shared<::arrow::bit_util::BitReader>(data, len); + } prefix_len_decoder_.SetDecoder(num_values, decoder_); // get the number of encoded prefix lengths @@ -3323,7 +3327,7 @@ class DeltaByteArrayDecoderImpl : public DecoderImpl, virtual public TypedDecode // TODO: read corrupted files written with bug(PARQUET-246). 
last_value_ should be set // to last_value_in_previous_page_ when decoding a new page(except the first page) - last_value_ = ""; + last_value_.clear(); } int DecodeArrow(int num_values, int null_count, const uint8_t* valid_bits, diff --git a/cpp/src/parquet/encoding_benchmark.cc b/cpp/src/parquet/encoding_benchmark.cc index 6726810911fd5..717c716330563 100644 --- a/cpp/src/parquet/encoding_benchmark.cc +++ b/cpp/src/parquet/encoding_benchmark.cc @@ -737,6 +737,114 @@ static void BM_DeltaLengthDecodingSpacedByteArray(benchmark::State& state) { BENCHMARK(BM_PlainDecodingSpacedByteArray)->Apply(ByteArrayCustomArguments); BENCHMARK(BM_DeltaLengthDecodingSpacedByteArray)->Apply(ByteArrayCustomArguments); +struct DeltaByteArrayState { + int32_t min_size = 0; + int32_t max_size; + int32_t array_length; + int32_t total_data_size = 0; + double prefixed_probability; + std::vector buf; + + explicit DeltaByteArrayState(const benchmark::State& state) + : max_size(static_cast(state.range(0))), + array_length(static_cast(state.range(1))), + prefixed_probability(state.range(2) / 100.0) {} + + std::vector MakeRandomByteArray(uint32_t seed) { + std::default_random_engine gen(seed); + std::uniform_int_distribution dist_size(min_size, max_size); + std::uniform_int_distribution dist_byte(0, 255); + std::bernoulli_distribution dist_has_prefix(prefixed_probability); + std::uniform_real_distribution dist_prefix_length(0, 1); + + std::vector out(array_length); + buf.resize(max_size * array_length); + auto buf_ptr = buf.data(); + total_data_size = 0; + + for (int32_t i = 0; i < array_length; ++i) { + int len = dist_size(gen); + out[i].len = len; + out[i].ptr = buf_ptr; + + bool do_prefix = i > 0 && dist_has_prefix(gen); + int prefix_len = 0; + if (do_prefix) { + int max_prefix_len = std::min(len, static_cast(out[i - 1].len)); + prefix_len = + static_cast(std::ceil(max_prefix_len * dist_prefix_length(gen))); + } + for (int j = 0; j < prefix_len; ++j) { + buf_ptr[j] = out[i - 1].ptr[j]; + } + for (int j = prefix_len; j < len; ++j) { + buf_ptr[j] = static_cast(dist_byte(gen)); + } + buf_ptr += len; + total_data_size += len; + } + return out; + } +}; + +static void BM_DeltaEncodingByteArray(benchmark::State& state) { + DeltaByteArrayState delta_state(state); + std::vector values = delta_state.MakeRandomByteArray(/*seed=*/42); + + auto encoder = MakeTypedEncoder(Encoding::DELTA_BYTE_ARRAY); + const int64_t plain_encoded_size = + delta_state.total_data_size + 4 * delta_state.array_length; + int64_t encoded_size = 0; + + for (auto _ : state) { + encoder->Put(values.data(), static_cast(values.size())); + encoded_size = encoder->FlushValues()->size(); + } + state.SetItemsProcessed(state.iterations() * delta_state.array_length); + state.SetBytesProcessed(state.iterations() * delta_state.total_data_size); + state.counters["compression_ratio"] = + static_cast(plain_encoded_size) / encoded_size; +} + +static void BM_DeltaDecodingByteArray(benchmark::State& state) { + DeltaByteArrayState delta_state(state); + std::vector values = delta_state.MakeRandomByteArray(/*seed=*/42); + + auto encoder = MakeTypedEncoder(Encoding::DELTA_BYTE_ARRAY); + encoder->Put(values.data(), static_cast(values.size())); + std::shared_ptr buf = encoder->FlushValues(); + + const int64_t plain_encoded_size = + delta_state.total_data_size + 4 * delta_state.array_length; + const int64_t encoded_size = buf->size(); + + auto decoder = MakeTypedDecoder(Encoding::DELTA_BYTE_ARRAY); + for (auto _ : state) { + decoder->SetData(delta_state.array_length, 
buf->data(), + static_cast(buf->size())); + decoder->Decode(values.data(), static_cast(values.size())); + ::benchmark::DoNotOptimize(values); + } + state.SetItemsProcessed(state.iterations() * delta_state.array_length); + state.SetBytesProcessed(state.iterations() * delta_state.total_data_size); + state.counters["compression_ratio"] = + static_cast(plain_encoded_size) / encoded_size; +} + +static void ByteArrayDeltaCustomArguments(benchmark::internal::Benchmark* b) { + for (int max_string_length : {8, 64, 1024}) { + for (int batch_size : {512, 2048}) { + for (int prefixed_percent : {10, 90, 99}) { + b->Args({max_string_length, batch_size, prefixed_percent}); + } + } + } + b->ArgNames({"max-string-length", "batch-size", "prefixed-percent"}); +} + +BENCHMARK(BM_DeltaEncodingByteArray)->Apply(ByteArrayDeltaCustomArguments); +BENCHMARK(BM_DeltaDecodingByteArray)->Apply(ByteArrayDeltaCustomArguments); + static void BM_RleEncodingBoolean(benchmark::State& state) { std::vector values(state.range(0), true); auto encoder = MakeEncoder(Type::BOOLEAN, Encoding::RLE); From 2895af492236e5dc42ac665406de8648a9aac6db Mon Sep 17 00:00:00 2001 From: Tim Schaub Date: Tue, 26 Sep 2023 09:01:36 -0600 Subject: [PATCH 67/96] GH-37845: [Go][Parquet] Check the number of logical fields instead of physical columns (#37846) ### Rationale for this change This makes it so trying to read with a column chunk reader consistently returns an error if the index is outside the bounds of the logical fields (currently it panics in some cases and returns an error in others). ### What changes are included in this PR? This makes it so the column chunk reader checks the number of logical fields instead of the number of physical columns when checking if an index is out of range. ### Are these changes tested? The new test will panics without the accompanying code change. ### Are there any user-facing changes? Applications that used to panic will now have an error to handle instead. 
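As a sketch of the resulting behavior (reader construction is elided — `inspectBounds` is an illustrative helper modeled on the new test below, not part of the change):

```go
package main

import (
	"context"
	"fmt"

	"github.com/apache/arrow/go/v14/parquet/pqarrow"
)

// inspectBounds sketches the new behavior: out-of-range logical field
// indexes now consistently yield an error instead of sometimes panicking.
func inspectBounds(arrowReader *pqarrow.FileReader) {
	ctx := pqarrow.NewArrowWriteContext(context.Background(), nil)
	rowGroup := arrowReader.RowGroup(0)
	if _, err := rowGroup.Column(-1).Read(ctx); err != nil {
		// e.g. "invalid column index chosen -1, there are only N columns"
		fmt.Println(err)
	}
}

func main() {} // *pqarrow.FileReader construction elided; see the accompanying test
```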
* Closes: #37845 Authored-by: Tim Schaub Signed-off-by: Matt Topol --- go/parquet/pqarrow/file_reader.go | 4 +- go/parquet/pqarrow/file_reader_test.go | 67 ++++++++++++++++++++++++++ 2 files changed, 69 insertions(+), 2 deletions(-) diff --git a/go/parquet/pqarrow/file_reader.go b/go/parquet/pqarrow/file_reader.go index d54e365b55e0c..d91010c62c19d 100755 --- a/go/parquet/pqarrow/file_reader.go +++ b/go/parquet/pqarrow/file_reader.go @@ -394,8 +394,8 @@ func (fr *FileReader) ReadRowGroups(ctx context.Context, indices, rowGroups []in } func (fr *FileReader) getColumnReader(ctx context.Context, i int, colFactory itrFactory) (*ColumnReader, error) { - if i < 0 || i >= fr.rdr.MetaData().Schema.NumColumns() { - return nil, fmt.Errorf("invalid column index chosen %d, there are only %d columns", i, fr.rdr.MetaData().Schema.NumColumns()) + if i < 0 || i >= len(fr.Manifest.Fields) { + return nil, fmt.Errorf("invalid column index chosen %d, there are only %d columns", i, len(fr.Manifest.Fields)) } ctx = context.WithValue(ctx, rdrCtxKey{}, readerCtx{ diff --git a/go/parquet/pqarrow/file_reader_test.go b/go/parquet/pqarrow/file_reader_test.go index 2b4aa8ab78dbe..d1f3ae1c984a2 100644 --- a/go/parquet/pqarrow/file_reader_test.go +++ b/go/parquet/pqarrow/file_reader_test.go @@ -19,9 +19,11 @@ package pqarrow_test import ( "bytes" "context" + "fmt" "io" "os" "path/filepath" + "strings" "testing" "github.com/apache/arrow/go/v14/arrow" @@ -216,3 +218,68 @@ func TestFileReaderWriterMetadata(t *testing.T) { assert.Equal(t, []string{"foo", "bar"}, kvMeta.Keys()) assert.Equal(t, []string{"bar", "baz"}, kvMeta.Values()) } + +func TestFileReaderColumnChunkBoundsErrors(t *testing.T) { + schema := arrow.NewSchema([]arrow.Field{ + {Name: "zero", Type: arrow.PrimitiveTypes.Float64}, + {Name: "g", Type: arrow.StructOf( + arrow.Field{Name: "one", Type: arrow.PrimitiveTypes.Float64}, + arrow.Field{Name: "two", Type: arrow.PrimitiveTypes.Float64}, + arrow.Field{Name: "three", Type: arrow.PrimitiveTypes.Float64}, + )}, + }, nil) + + // generate Parquet data with four columns + // that are represented by two logical fields + data := `[ + { + "zero": 1, + "g": { + "one": 1, + "two": 1, + "three": 1 + } + }, + { + "zero": 2, + "g": { + "one": 2, + "two": 2, + "three": 2 + } + } + ]` + + record, _, err := array.RecordFromJSON(memory.DefaultAllocator, schema, strings.NewReader(data)) + require.NoError(t, err) + + output := &bytes.Buffer{} + writer, err := pqarrow.NewFileWriter(schema, output, parquet.NewWriterProperties(), pqarrow.DefaultWriterProps()) + require.NoError(t, err) + + require.NoError(t, writer.Write(record)) + require.NoError(t, writer.Close()) + + fileReader, err := file.NewParquetReader(bytes.NewReader(output.Bytes())) + require.NoError(t, err) + + arrowReader, err := pqarrow.NewFileReader(fileReader, pqarrow.ArrowReadProperties{BatchSize: 1024}, memory.DefaultAllocator) + require.NoError(t, err) + + // assert that errors are returned for indexes outside the bounds of the logical fields (instead of the physical columns) + ctx := pqarrow.NewArrowWriteContext(context.Background(), nil) + assert.Greater(t, fileReader.NumRowGroups(), 0) + for rowGroupIndex := 0; rowGroupIndex < fileReader.NumRowGroups(); rowGroupIndex += 1 { + rowGroupReader := arrowReader.RowGroup(rowGroupIndex) + for fieldNum := 0; fieldNum < schema.NumFields(); fieldNum += 1 { + _, err := rowGroupReader.Column(fieldNum).Read(ctx) + assert.NoError(t, err, "reading field num: %d", fieldNum) + } + + _, subZeroErr := 
rowGroupReader.Column(-1).Read(ctx) + assert.Error(t, subZeroErr) + + _, tooHighErr := rowGroupReader.Column(schema.NumFields()).Read(ctx) + assert.ErrorContains(t, tooHighErr, fmt.Sprintf("there are only %d columns", schema.NumFields())) + } +} From c07f5bceacd5efbdfd2ab3a673916d2ff46078c8 Mon Sep 17 00:00:00 2001 From: James Duong Date: Tue, 26 Sep 2023 12:18:52 -0700 Subject: [PATCH 68/96] GH-37705: [Java] Extra input methods for VarChar writers (#37883) ### Rationale for this change Improve the convenience of using VarCharWriter and LargeVarCharWriter interfaces. Also allow users to avoid unnecessary overhead creating Arrow buffers when writing String and Text data. ### What changes are included in this PR? Add write() methods for Text and String types. Ensure these methods are part of the writer interfaces and not just the Impls.### Are these changes tested? ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: #37705 Authored-by: James Duong Signed-off-by: David Li --- .../templates/AbstractFieldWriter.java | 10 +++ .../codegen/templates/ComplexWriters.java | 21 ++++- .../complex/writer/TestSimpleWriter.java | 81 +++++++++++++++---- 3 files changed, 94 insertions(+), 18 deletions(-) diff --git a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java index 5e6580b6131c1..bb4ee45eaa073 100644 --- a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java @@ -142,6 +142,16 @@ public void write(${name}Holder holder) { } + <#if minor.class?ends_with("VarChar")> + public void write${minor.class}(${friendlyType} value) { + fail("${name}"); + } + + public void write${minor.class}(String value) { + fail("${name}"); + } + + public void writeNull() { diff --git a/java/vector/src/main/codegen/templates/ComplexWriters.java b/java/vector/src/main/codegen/templates/ComplexWriters.java index 4ae4c4f75f208..51a52a6e3070d 100644 --- a/java/vector/src/main/codegen/templates/ComplexWriters.java +++ b/java/vector/src/main/codegen/templates/ComplexWriters.java @@ -44,7 +44,11 @@ public class ${eName}WriterImpl extends AbstractFieldWriter { final ${name}Vector vector; - public ${eName}WriterImpl(${name}Vector vector) { +<#if minor.class?ends_with("VarChar")> + private final Text textBuffer = new Text(); + + +public ${eName}WriterImpl(${name}Vector vector) { this.vector = vector; } @@ -120,11 +124,19 @@ public void write(Nullable${minor.class}Holder h) { } - <#if minor.class == "VarChar"> + <#if minor.class?ends_with("VarChar")> + @Override public void write${minor.class}(${friendlyType} value) { vector.setSafe(idx(), value); vector.setValueCount(idx()+1); } + + @Override + public void write${minor.class}(String value) { + textBuffer.set(value); + vector.setSafe(idx(), textBuffer); + vector.setValueCount(idx()+1); + } <#if minor.class?starts_with("Decimal")> @@ -256,6 +268,11 @@ public interface ${eName}Writer extends BaseWriter { public void writeTo${minor.class}(ByteBuffer value, int offset, int length); +<#if minor.class?ends_with("VarChar")> + public void write${minor.class}(${friendlyType} value); + + public void write${minor.class}(String value); + } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java index 7c06509b23c87..ef918b13fb691 100644 --- 
a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestSimpleWriter.java @@ -22,9 +22,14 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.LargeVarBinaryVector; +import org.apache.arrow.vector.LargeVarCharVector; import org.apache.arrow.vector.VarBinaryVector; +import org.apache.arrow.vector.VarCharVector; import org.apache.arrow.vector.complex.impl.LargeVarBinaryWriterImpl; +import org.apache.arrow.vector.complex.impl.LargeVarCharWriterImpl; import org.apache.arrow.vector.complex.impl.VarBinaryWriterImpl; +import org.apache.arrow.vector.complex.impl.VarCharWriterImpl; +import org.apache.arrow.vector.util.Text; import org.junit.After; import org.junit.Assert; import org.junit.Before; @@ -45,9 +50,9 @@ public void terminate() throws Exception { } @Test - public void testWriteByteArrayToVarBinary() { + public void testWriteByteArrayToVarBinary() throws Exception { try (VarBinaryVector vector = new VarBinaryVector("test", allocator); - VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + VarBinaryWriter writer = new VarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; writer.writeToVarBinary(input); byte[] result = vector.get(0); @@ -56,9 +61,9 @@ public void testWriteByteArrayToVarBinary() { } @Test - public void testWriteByteArrayWithOffsetToVarBinary() { + public void testWriteByteArrayWithOffsetToVarBinary() throws Exception { try (VarBinaryVector vector = new VarBinaryVector("test", allocator); - VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + VarBinaryWriter writer = new VarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; writer.writeToVarBinary(input, 1, 1); byte[] result = vector.get(0); @@ -67,9 +72,9 @@ public void testWriteByteArrayWithOffsetToVarBinary() { } @Test - public void testWriteByteBufferToVarBinary() { + public void testWriteByteBufferToVarBinary() throws Exception { try (VarBinaryVector vector = new VarBinaryVector("test", allocator); - VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + VarBinaryWriter writer = new VarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; ByteBuffer buffer = ByteBuffer.wrap(input); writer.writeToVarBinary(buffer); @@ -79,9 +84,9 @@ public void testWriteByteBufferToVarBinary() { } @Test - public void testWriteByteBufferWithOffsetToVarBinary() { + public void testWriteByteBufferWithOffsetToVarBinary() throws Exception { try (VarBinaryVector vector = new VarBinaryVector("test", allocator); - VarBinaryWriterImpl writer = new VarBinaryWriterImpl(vector)) { + VarBinaryWriter writer = new VarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; ByteBuffer buffer = ByteBuffer.wrap(input); writer.writeToVarBinary(buffer, 1, 1); @@ -91,9 +96,9 @@ public void testWriteByteBufferWithOffsetToVarBinary() { } @Test - public void testWriteByteArrayToLargeVarBinary() { + public void testWriteByteArrayToLargeVarBinary() throws Exception { try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); - LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + LargeVarBinaryWriter writer = new LargeVarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; writer.writeToLargeVarBinary(input); byte[] result = vector.get(0); @@ -102,9 +107,9 @@ public void testWriteByteArrayToLargeVarBinary() { } @Test - 
public void testWriteByteArrayWithOffsetToLargeVarBinary() { + public void testWriteByteArrayWithOffsetToLargeVarBinary() throws Exception { try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); - LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + LargeVarBinaryWriter writer = new LargeVarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; writer.writeToLargeVarBinary(input, 1, 1); byte[] result = vector.get(0); @@ -113,9 +118,9 @@ public void testWriteByteArrayWithOffsetToLargeVarBinary() { } @Test - public void testWriteByteBufferToLargeVarBinary() { + public void testWriteByteBufferToLargeVarBinary() throws Exception { try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); - LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + LargeVarBinaryWriter writer = new LargeVarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; ByteBuffer buffer = ByteBuffer.wrap(input); writer.writeToLargeVarBinary(buffer); @@ -125,9 +130,9 @@ public void testWriteByteBufferToLargeVarBinary() { } @Test - public void testWriteByteBufferWithOffsetToLargeVarBinary() { + public void testWriteByteBufferWithOffsetToLargeVarBinary() throws Exception { try (LargeVarBinaryVector vector = new LargeVarBinaryVector("test", allocator); - LargeVarBinaryWriterImpl writer = new LargeVarBinaryWriterImpl(vector)) { + LargeVarBinaryWriter writer = new LargeVarBinaryWriterImpl(vector)) { byte[] input = new byte[] { 0x01, 0x02 }; ByteBuffer buffer = ByteBuffer.wrap(input); writer.writeToLargeVarBinary(buffer, 1, 1); @@ -135,4 +140,48 @@ public void testWriteByteBufferWithOffsetToLargeVarBinary() { Assert.assertArrayEquals(new byte[] { 0x02 }, result); } } + + @Test + public void testWriteStringToVarChar() throws Exception { + try (VarCharVector vector = new VarCharVector("test", allocator); + VarCharWriter writer = new VarCharWriterImpl(vector)) { + String input = "testInput"; + writer.writeVarChar(input); + String result = vector.getObject(0).toString(); + Assert.assertEquals(input, result); + } + } + + @Test + public void testWriteTextToVarChar() throws Exception { + try (VarCharVector vector = new VarCharVector("test", allocator); + VarCharWriter writer = new VarCharWriterImpl(vector)) { + String input = "testInput"; + writer.writeVarChar(new Text(input)); + String result = vector.getObject(0).toString(); + Assert.assertEquals(input, result); + } + } + + @Test + public void testWriteStringToLargeVarChar() throws Exception { + try (LargeVarCharVector vector = new LargeVarCharVector("test", allocator); + LargeVarCharWriter writer = new LargeVarCharWriterImpl(vector)) { + String input = "testInput"; + writer.writeLargeVarChar(input); + String result = vector.getObject(0).toString(); + Assert.assertEquals(input, result); + } + } + + @Test + public void testWriteTextToLargeVarChar() throws Exception { + try (LargeVarCharVector vector = new LargeVarCharVector("test", allocator); + LargeVarCharWriter writer = new LargeVarCharWriterImpl(vector)) { + String input = "testInput"; + writer.writeLargeVarChar(new Text(input)); + String result = vector.getObject(0).toString(); + Assert.assertEquals(input, result); + } + } } From 23c1bf9ff74b749c11385b38911d6aa8d85e180c Mon Sep 17 00:00:00 2001 From: Dominik Moritz Date: Tue, 26 Sep 2023 15:36:22 -0400 Subject: [PATCH 69/96] MINOR: [JS] update link to issues (#37882) Update link to use GitHub for issues. 
--- js/package.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/js/package.json b/js/package.json index 1ee0e11bca5b9..11bbe24f0244c 100644 --- a/js/package.json +++ b/js/package.json @@ -35,7 +35,7 @@ "author": "Apache Software Foundation", "license": "Apache-2.0", "bugs": { - "url": "https://issues.apache.org/jira/projects/ARROW" + "url": "https://github.com/apache/arrow/issues" }, "homepage": "https://github.com/apache/arrow/blob/main/js/README.md", "files": [ From 517d849b5813788ea23cf938629f668cba8b4fb8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ra=C3=BAl=20Cumplido?= Date: Tue, 26 Sep 2023 22:48:44 +0200 Subject: [PATCH 70/96] GH-37858: [Docs][JS] Fix check of remote URL to generate JS docs (#37870) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### Rationale for this change JS Docs are currently not being generated. ### What changes are included in this PR? Use a regex check instead of an equality to cover both remote set with `.git` and without for upstream. Added also a fix to generate docs from origin from forks if necessary. ### Are these changes tested? Via archery ### Are there any user-facing changes? No * Closes: #37858 Authored-by: Raúl Cumplido Signed-off-by: Sutou Kouhei --- ci/scripts/js_build.sh | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/ci/scripts/js_build.sh b/ci/scripts/js_build.sh index c97733257a721..d61f74f0b7ca1 100755 --- a/ci/scripts/js_build.sh +++ b/ci/scripts/js_build.sh @@ -32,12 +32,14 @@ yarn lint:ci yarn build if [ "${BUILD_DOCS_JS}" == "ON" ]; then - if [ "$(git config --get remote.origin.url)" == "https://github.com/apache/arrow.git" ]; then - yarn doc - elif [ "$(git config --get remote.upstream.url)" == "https://github.com/apache/arrow.git" ]; then - yarn doc --gitRemote upstream - elif [ "$(git config --get remote.apache.url)" == "git@github.com:apache/arrow.git" ]; then + # If apache or upstream are defined use those as remote. + # Otherwise use origin which could be a fork on PRs. + if [ "$(git config --get remote.apache.url)" == "git@github.com:apache/arrow.git" ]; then yarn doc --gitRemote apache + elif [[ "$(git config --get remote.upstream.url)" =~ "https://github.com/apache/arrow" ]]; then + yarn doc --gitRemote upstream + elif [[ "$(basename -s .git $(git config --get remote.origin.url))" == "arrow" ]]; then + yarn doc else echo "Failed to build docs because the remote is not set correctly. Please set the origin or upstream remote to https://github.com/apache/arrow.git or the apache remote to git@github.com:apache/arrow.git." exit 0 From e038498c70207df1ac64b1aa276a5fd5e3cd306b Mon Sep 17 00:00:00 2001 From: James Duong Date: Tue, 26 Sep 2023 13:52:07 -0700 Subject: [PATCH 71/96] GH-25659: [Java] Add DefaultVectorComparators for Large types (#37887) ### Rationale for this change Support additional vector types in DefaultVectorComparators to make arrow-algorithm easier to use. ### What changes are included in this PR? Add DefaultVectorComparators for large vector types (LargeVarCharVector and LargeVarBinaryVector). ### Are these changes tested? Yes. ### Are there any user-facing changes? No. 
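As a usage sketch (illustrative only; it follows the new test below — the wrapper class, sample strings, and the note on the pre-change exception are not part of the diff):

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.algorithm.sort.DefaultVectorComparators;
import org.apache.arrow.algorithm.sort.VectorValueComparator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.LargeVarCharVector;

public class LargeComparatorExample {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         LargeVarCharVector vec = new LargeVarCharVector("test", allocator)) {
      vec.allocateNew();
      vec.setSafe(0, "apple".getBytes(StandardCharsets.UTF_8));
      vec.setSafe(1, "banana".getBytes(StandardCharsets.UTF_8));
      vec.setValueCount(2);

      // Before this change, Large* vectors fell through to the
      // "no default comparator" IllegalArgumentException.
      VectorValueComparator<LargeVarCharVector> comparator =
          DefaultVectorComparators.createDefaultComparator(vec);
      comparator.attachVector(vec);
      System.out.println(comparator.compare(0, 1) < 0); // true: lexicographic, nulls first
    }
  }
}
```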
* Closes: #25659 Authored-by: James Duong Signed-off-by: David Li --- .../sort/DefaultVectorComparators.java | 16 ++++++------ .../sort/TestDefaultVectorComparator.java | 26 +++++++++++++++++++ 2 files changed, 34 insertions(+), 8 deletions(-) diff --git a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java index 99d66f94261ee..4f9c8b7d71bab 100644 --- a/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java +++ b/java/algorithm/src/main/java/org/apache/arrow/algorithm/sort/DefaultVectorComparators.java @@ -25,7 +25,6 @@ import org.apache.arrow.memory.util.ArrowBufPointer; import org.apache.arrow.memory.util.ByteFunctionHelpers; import org.apache.arrow.vector.BaseFixedWidthVector; -import org.apache.arrow.vector.BaseVariableWidthVector; import org.apache.arrow.vector.BigIntVector; import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.DateDayVector; @@ -50,6 +49,7 @@ import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.UInt8Vector; import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VariableWidthVector; import org.apache.arrow.vector.complex.BaseRepeatedValueVector; /** @@ -112,7 +112,7 @@ public static VectorValueComparator createDefaultComp } else if (vector instanceof TimeStampVector) { return (VectorValueComparator) new TimeStampComparator(); } - } else if (vector instanceof BaseVariableWidthVector) { + } else if (vector instanceof VariableWidthVector) { return (VectorValueComparator) new VariableWidthComparator(); } else if (vector instanceof BaseRepeatedValueVector) { VectorValueComparator innerComparator = @@ -675,14 +675,14 @@ public VectorValueComparator createNew() { } /** - * Default comparator for {@link org.apache.arrow.vector.BaseVariableWidthVector}. + * Default comparator for {@link org.apache.arrow.vector.VariableWidthVector}. * The comparison is in lexicographic order, with null comes first. 
*/ - public static class VariableWidthComparator extends VectorValueComparator { + public static class VariableWidthComparator extends VectorValueComparator { - private ArrowBufPointer reusablePointer1 = new ArrowBufPointer(); + private final ArrowBufPointer reusablePointer1 = new ArrowBufPointer(); - private ArrowBufPointer reusablePointer2 = new ArrowBufPointer(); + private final ArrowBufPointer reusablePointer2 = new ArrowBufPointer(); @Override public int compare(int index1, int index2) { @@ -699,7 +699,7 @@ public int compareNotNull(int index1, int index2) { } @Override - public VectorValueComparator createNew() { + public VectorValueComparator createNew() { return new VariableWidthComparator(); } } @@ -743,7 +743,7 @@ public int compareNotNull(int index1, int index2) { @Override public VectorValueComparator createNew() { VectorValueComparator newInnerComparator = innerComparator.createNew(); - return new RepeatedValueComparator(newInnerComparator); + return new RepeatedValueComparator<>(newInnerComparator); } @Override diff --git a/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java b/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java index 62051197740d8..bdae85110aa62 100644 --- a/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java +++ b/java/algorithm/src/test/java/org/apache/arrow/algorithm/sort/TestDefaultVectorComparator.java @@ -35,6 +35,8 @@ import org.apache.arrow.vector.Float8Vector; import org.apache.arrow.vector.IntVector; import org.apache.arrow.vector.IntervalDayVector; +import org.apache.arrow.vector.LargeVarBinaryVector; +import org.apache.arrow.vector.LargeVarCharVector; import org.apache.arrow.vector.SmallIntVector; import org.apache.arrow.vector.TimeMicroVector; import org.apache.arrow.vector.TimeMilliVector; @@ -47,6 +49,9 @@ import org.apache.arrow.vector.UInt2Vector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.UInt8Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VarBinaryVector; +import org.apache.arrow.vector.VarCharVector; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.testing.ValueVectorDataPopulator; import org.apache.arrow.vector.types.TimeUnit; @@ -911,4 +916,25 @@ public void testCheckNullsOnCompareIsTrueWithEmptyVectors() { assertTrue(comparator.checkNullsOnCompare()); } } + + @Test + public void testVariableWidthDefaultComparators() { + try (VarCharVector vec = new VarCharVector("test", allocator)) { + verifyVariableWidthComparatorReturned(vec); + } + try (VarBinaryVector vec = new VarBinaryVector("test", allocator)) { + verifyVariableWidthComparatorReturned(vec); + } + try (LargeVarCharVector vec = new LargeVarCharVector("test", allocator)) { + verifyVariableWidthComparatorReturned(vec); + } + try (LargeVarBinaryVector vec = new LargeVarBinaryVector("test", allocator)) { + verifyVariableWidthComparatorReturned(vec); + } + } + + private static void verifyVariableWidthComparatorReturned(V vec) { + VectorValueComparator comparator = DefaultVectorComparators.createDefaultComparator(vec); + assertEquals(DefaultVectorComparators.VariableWidthComparator.class, comparator.getClass()); + } } From 7dc9f69a8a77345d0ec7920af9224ef96d7f5f78 Mon Sep 17 00:00:00 2001 From: Alenka Frim Date: Wed, 27 Sep 2023 16:23:26 +0200 Subject: [PATCH 72/96] GH-36590: [Docs] Support Pydata Sphinx Theme 0.14.0 (#36591) MIME-Version: 1.0 Content-Type: 
text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Preview: http://crossbow.voltrondata.com/pr_docs/36591/ ### Rationale for this change The Pydata Sphinx Theme that we use for our documentation has been pinned due to bigger changes in the theme layout. It needs to be unpinned and our layout needs to be updated. ### What changes are included in this PR? Update of the Pydata Sphinx Theme and changes to our layout/structure: - dark/light mode - top menu bar - search button in the top right navigation bar - drop down from the theme layout in the top right navigation bar - version warnings bar from the theme layout - main landing page and the landing page for the dev docs ⚠️ Needs an update of the [versions.json](https://github.com/apache/arrow-site/blob/AlenkaF-patch-1/docs/_static/versions.json) ### Are these changes tested? Yes, locally. Will also add docs preview via GitHub actions. ### Are there any user-facing changes? No. * Closes: #32451 * Closes: #36590 Lead-authored-by: AlenkaF Co-authored-by: Sutou Kouhei Signed-off-by: Joris Van den Bossche --- ci/conda_env_sphinx.txt | 2 +- dev/release/01-prepare-test.rb | 3 +- dev/release/post-11-bump-versions-test.rb | 3 +- dev/release/utils-update-docs-versions.py | 10 +- docs/requirements.txt | 3 +- docs/source/_static/arrow-dark.png | Bin 0 -> 91541 bytes docs/source/_static/theme_overrides.css | 85 +------- docs/source/_static/versions.json | 47 +++-- docs/source/_static/versionwarning.js | 2 + docs/source/_templates/docs-sidebar.html | 25 --- docs/source/_templates/layout.html | 10 - docs/source/_templates/version-switcher.html | 60 ------ docs/source/c_glib/index.rst | 2 + docs/source/conf.py | 37 +++- docs/source/cpp/index.rst | 55 +++-- .../continuous_integration/index.rst | 1 + .../continuous_integration/overview.rst | 2 +- docs/source/developers/contributing.rst | 190 ----------------- .../developers/images/book-open-solid.svg | 2 +- docs/source/developers/images/bug-solid.svg | 2 +- docs/source/developers/images/code-solid.svg | 2 +- docs/source/developers/images/users-solid.svg | 2 +- docs/source/developers/index.rst | 198 +++++++++++++++++- docs/source/developers/overview.rst | 6 +- docs/source/developers/release.rst | 2 + docs/source/format/index.rst | 7 +- docs/source/index.rst | 115 ++++++---- docs/source/java/index.rst | 3 + docs/source/js/index.rst | 2 + docs/source/python/index.rst | 8 +- docs/source/r/index.rst | 2 + 31 files changed, 416 insertions(+), 472 deletions(-) create mode 100644 docs/source/_static/arrow-dark.png delete mode 100644 docs/source/_templates/docs-sidebar.html delete mode 100644 docs/source/_templates/version-switcher.html delete mode 100644 docs/source/developers/contributing.rst diff --git a/ci/conda_env_sphinx.txt b/ci/conda_env_sphinx.txt index bd08937ae81be..af1bfe9b780f4 100644 --- a/ci/conda_env_sphinx.txt +++ b/ci/conda_env_sphinx.txt @@ -20,7 +20,7 @@ breathe doxygen ipython numpydoc -pydata-sphinx-theme==0.8 +pydata-sphinx-theme sphinx-autobuild sphinx-design sphinx-copybutton diff --git a/dev/release/01-prepare-test.rb b/dev/release/01-prepare-test.rb index 1062e8b06c090..54437e9da60ce 100644 --- a/dev/release/01-prepare-test.rb +++ b/dev/release/01-prepare-test.rb @@ -170,7 +170,8 @@ def test_version_pre_tag "+ \"name\": \"#{@release_compatible_version} (stable)\",", "+ {", "+ \"name\": \"#{@previous_compatible_version}\",", - "+ \"version\": \"#{@previous_compatible_version}/\"", + "+ \"version\": \"#{@previous_compatible_version}/\",", + "+ \"url\": 
\"https://arrow.apache.org/docs/#{@previous_compatible_version}/\"", "+ },", ], ], diff --git a/dev/release/post-11-bump-versions-test.rb b/dev/release/post-11-bump-versions-test.rb index 0ef4646236740..8253472ccc5b9 100644 --- a/dev/release/post-11-bump-versions-test.rb +++ b/dev/release/post-11-bump-versions-test.rb @@ -148,7 +148,8 @@ def test_version_post_tag "+ \"name\": \"#{@release_compatible_version} (stable)\",", "+ {", "+ \"name\": \"#{@previous_compatible_version}\",", - "+ \"version\": \"#{@previous_compatible_version}/\"", + "+ \"version\": \"#{@previous_compatible_version}/\",", + "+ \"url\": \"https://arrow.apache.org/docs/#{@previous_compatible_version}/\"", "+ },", ], ], diff --git a/dev/release/utils-update-docs-versions.py b/dev/release/utils-update-docs-versions.py index 6e0137b7c84df..7ca4059214db5 100644 --- a/dev/release/utils-update-docs-versions.py +++ b/dev/release/utils-update-docs-versions.py @@ -50,11 +50,15 @@ # Create new versions new_versions = [ {"name": f"{dev_compatible_version} (dev)", - "version": "dev/"}, + "version": "dev/", + "url": "https://arrow.apache.org/docs/dev/"}, {"name": f"{stable_compatible_version} (stable)", - "version": ""}, + "version": "", + "url": "https://arrow.apache.org/docs/", + "preferred": True}, {"name": previous_compatible_version, - "version": f"{previous_compatible_version}/"}, + "version": f"{previous_compatible_version}/", + "url": f"https://arrow.apache.org/docs/{previous_compatible_version}/"}, *old_versions[2:], ] with open(main_versions_path, 'w') as json_file: diff --git a/docs/requirements.txt b/docs/requirements.txt index a4e5f7197b553..37a50d51dd54c 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -5,10 +5,9 @@ breathe ipython numpydoc -pydata-sphinx-theme==0.8 +pydata-sphinx-theme sphinx-autobuild sphinx-design sphinx-copybutton -sphinxcontrib-jquery sphinx==6.2 pandas diff --git a/docs/source/_static/arrow-dark.png b/docs/source/_static/arrow-dark.png new file mode 100644 index 0000000000000000000000000000000000000000..618204a2370a56120e6809228498cac5d98ca54c GIT binary patch literal 91541 zcmeFZbzD?k7dA|c3uD_f49s8G7@F7L zJ9%lH34Gu<$?LhIp>Z-?|3UZ4k?;V9(zDjmb<7^U=L1ehm*-qOLPA1Z+`L@8yd1z29IoCD zZcn{999$Xx9^}_JG8V4pF4j(N){YLe*W*4lb98qTqo==q(Z9d`o~N6&<$vDf;QG(9 zfCX}0{|grnCpXu>V*_6my}m1~=3;FDy!m>3aURjXp8W0JKkpIcx__r&9_*-&`6#|E9i*)l zI7BWbsajrRk?$^>jZ4bD!-oOu!Ol0~)cV2ZiaEka#$MES=jQ7m?0mJ-tk2@?G&Eu@ z$``z-d`qJL2bS}tNK$j_W|200@(5pM50cadjWCaw6x#K?y1tM3Im7%XIYs6Fix#*u z6ZUiF7|35Kp!ku#*~wELIZ*5M**6@uY8Ts#LtF|H_x*ilna6FN;|ugvaPqHC3ovL% zXgG{mTa;0CPfK&F4tdmW{@)=6_J|l;E^%mFL9;>q0SH1qS-8oqf8l6=+*HVG)*l{l z-i#xLTFE^S%p_fNyexff62IJf%wzq)|j6f`N7a|H@<}_p!gU$NL-f7aZIw zSdMsq@9ZKhK+yAR+}>xY{iOPDq(uXp(T5-7*|wk9tQIf(Aoush73MR(aeSv-99$#pv8_RRmrGR%kAp9Drr42rpO3DqLBrZW*WfH>Cj-Rg!86TA(pu@ zzoDR{G`D)L~@DeB#(VqEw1pUFZ9IdbLqtU5HD7wS$3v9~jvS^AYSga0NmVMW@Mr=^$TLpk2_^Wr%mi15^X8x4C2=koZ2 zm-n2?161rjg>4bZN5+>SaVTXnihQ;5c>w*w?i12lao>3J4E+yj^nf&;znfzStfVxf zSm)Ho0Bd5Yc$gDm%N3hwA4nZ``;RbhYbef}vWMA;%3>fZhG6hO_t~|?q?FTuzFWQY zY}X!>N+&s(@jyjImhQhX5cZl#KK1EkXZw4X7h+HXN_IP4Bknpe(1HOa!Q=SzaLl&C zHwY=;sk#vU-3ne*4)8*Bofilxl4Rm!K&FkrjMWnp5RgB7(v}z)H9-5@U6(iZtj{NS z0U?Ynb;K;-d=8E_iopak6GK_3g&FY}2``4a`pid}C4P!wQ{pXmVqGf-MNDfoNQR61 z+f321M_x-pKiLh=YDhRs&SwaG9afND;Fm3JZgY;pf{d--BVsiIV>lPPp;$4%?pDNh zz(`4oK^an-29T!ThJtYtVLyMj{+$AY1Hmca<5a7e6ee*aiqt4`J2_yaw`FYc?5|W_z1w@dIHK?YC#C8 zL;Ii`4?qjL4f^wvT}lHCe@%eo)~p8+7GZuvtX={ysBg%*b4cR5HU5g9v&;s74uwn>Gaak|hyf zL&ZlQ_5T>71+rYg{rAh+T9C&M=Xw-#0@XF-lB*@88-&H 
diff --git a/docs/source/_static/arrow-dark.png b/docs/source/_static/arrow-dark.png
new file mode 100644
index 0000000000000000000000000000000000000000..618204a2370a56120e6809228498cac5d98ca54c
GIT binary patch
literal 91541
[91541 bytes of base85-encoded PNG data omitted]
zJt`^4MHsPyt?atbSe{mxkIbOe&;NzlRIYDmS0-|ktqk_ISK+XGY$D<4JlI5Ple zxXcDd-VX5WcV0NS!B6@QJiubJN!bBZ%Y;SKw*62Wg`DFS)=}x;tL$PzqkH;0WYJNS2fX8wUEW%W36ktjPuc-J2 z_q)Qzfe6JWMyuW9{(kda1OUD0P&|(RGg%^~)tY75FL@+V?jNoIg9af*NeC{%lr-i6 zM-+V>62?%ckS&Q13407(0s0dS0HT3X8^tLjC!ebb89SJpC4LXWK+aH%`vZPzU^+po zGm90N_g2y!lqd`CDhk{V`U6y|K74)CpRy3v2psQ z&1qFYsp3p?-Vh}iY7E+)Gi)w9krJ%11A#-@r%e&?&tVC_-zNgYx}oGc8j875o9FpH z{yH#2ia30;!%z@#YQ;k}HYsdC+NvrT;wYqmTC}qz|F~j?YItRxLuq~QZYa=<3|Kg+#YMj!`pQ~elUZia~v?57o`GD83a56F5IfD2JilaC)LnSM+V>w z{+NIJXFn>O3Ba&Yt@RX;u#WsJ*zypk=kJx+7}7Ju-sGJ20(b-$sQ!@OOphR8FKfX$ zW8Dsp-s6mtN+)z)iOuyqc$PEYuXR?Vr`G=@u>K1T{g0JEM(HX}ZWYI zok<-R2{xV10Q!@UTb3)4DC~UU@%?qSkLSR0z-N?gTZb#ew2fSZsAz!5sVzmT|I|_S zxF7h08lxlxNTB|XfWq%ixiG!TLljoej*OLHXwunfDG#`3Ce`c0CnR1{zJp`G@d!M* zfR-6PcKT0Cs0OwNQfWUxvezMHbjl8qE)Fb=AGNK24b@*XZE1P3#YY%)xVOpu7cPg5 z<^xAn*ZFqg^igjo$xo2bg<>fNN&bxXc7A9#Iso7Z_Ib22m@k^w?!yI}b)M@ml^L&%mU3KI5kw(E~V2=ym`2vhdHk)h@h^`t_sG69P+;@F!G zo()FM(b?D7gDue$^#54hV*ihfDvlus#*f0`DEna)B~&ovR+8g6KtrqAh2rN~d=5`! z?MnTn?6)1ypH>G3<} zpcy!B&o@%)*8V4us4|y07+}JkyQfwbm(7@ZBW5?c#sgR%c%vhfINKn_1W zj`g=oquET1k}6CIo`_S~(X*db#n1nO*W#xln+%7LKVvMB*?= z;0M&F2Uh_?cgFIxn{8zPEJztXot&Twz$1z%-%2nxk`SAV3>N*G1tt}|evaO_*pWB= zsSlU<9rxQLfv{;~@cJ$m;bjS&G7Iona5Di{ZKXA`Ea|u5VVEJ0-_g`K97t}c@k*Ek z8gB?dA0jIK&IAm4$st6B9S=uqKf{)wHaSkeWTS{y@<5j1RPf>Jp;!x(t&cOdxts1Q zkP01gMljCfCmYC!GuE#8a)HwVbHOm!MGdr@9$$r_jEph(ROpC@Pw7TnCT*PsctXvr znzSO%&wz~lP}l+j5WpX#2u6kjAhBjRAup?hLf&q_ApnDC0j-I!c$<7Gkq4eXAoJ|+Ybo*`)b^Zfq{sC{=8{QJuS zc#UXzo-as_r;N6}V792?Oqh2JC3L*#apGMO@cy|-pqaJL%9-r7wVhx)_kj(_aUioi zr3Ql>I8$MgWtrwa?DUe9eE|@o_?VBE2nHa8xZ-($?G|uA;Z?NTDMm%!i=c-DI(EtQ zrGv_Mpi91kF@-&VVk}a8diPsp5;0>EefGn}u4r}pQt3Za-6~m_fMgAD1Yg}VoS*-D z{tf^?fXIH~HZnjW){9=$E~bP_>78x(gs4JSsaj*u^4hoy=0?OBSCV12Pd#_GkEED6 zNLbw1rBPB>u_h@1=}TwYD!VCxF;C+&RY4}(Xla78kU^049|`otPQ|8Zd5Bn%0BfS` zwJTU!1^|0oJ-w6Iw8H>N3<6onxK^Q1%-0U;pEfEs*A?ISfQ5!}0qV-7@ZeJ$Sw?K| z0!EDmu(E|zonG_=z)w8Y@{7O<-S*YJ-|7L}$a8*u*a&##+w84ZLEL)t{HY}(l)&w> zuw$npn;g9dd+;NYxVL9{DDZc~i zYCS-IZYLO`r2ups>b3P)1*c|0l;;5zeE7nC@CN{>i)X+a+$kt=JDKAQkzc{pFML~u z9qBLtwDwWzuBZ`E4Xh~oM*{#6EDJQZV^k zRa`OEAnOZ!Y_4jG1j-QDa)T#?yqdL1Lf3?lsK8M?{{mLx8jT_B&^Ze*5(4v)hiCM_b4XVg>+diT*1|Y^qSfpzG$qnie8MZ5g0O_DBf>Uof@na0=$Yr=FIpEw&8!X_HGPqB^isoZGC5wm< z>DK8d9Wi~VkTU>i0KvQk;abNcxe8LRjJ8?~zdo(98r@n%Pe1@x;Q)b^$Y&|rh!csF zcPeZJUw=fA#-mX$zOLW$lU;5H1Im+?=-?BTy(6-Qpw@&L8JY~x|Xh)L(}0c6)hbQL8Q?Tk3^IuGPzg6&oTF7dy`u|pso zW+Zz)$T$KQ!(AT#;EZk6r3Lq!^p~leOv0&o>_ZUg@Rnv>IT=84uz~m}^xB#+r6Pzb zf>@tu8iPs*1pKZ+%-cJ|&zJ0wzSqtbbTQmw*5wCa_6#m6$BQ`)JQqnobaYuD*e^ms zi{E+5MT&cz*a|y!)vNcFI0CNzrIRlqVifaBaLez8G&O26*lrY_;$X^<0EzAFc3 zVddMvuiIGH0adbJEZ&~o3YS;G!?J~8`GIa=9s`%98Rm8B2_JI-@CAb&m%n}D$07nM zhXE9amp}41W`TrR7%*xAQRaj1Pd<13^6498pNbz@e0==iGftu^$lAA$$}4aao&bL4 z=avO<3_Dah33(kZe;$ZsJh#N`CgeQ&zG|=b0)IkOaURjDYM_2** zK_NIJ9jp~9uLhp~Px%bMr_@_}xPr(;U*4hz0Z_}ALCr&1oKpoQa$t`dc;5t1J%Vfm zOb5+}4T-PH)|}yw$0gnZ^11^9XlAm3-ICTfNI8fJ74c;%VZ@a?YYXF1W;4v>)4cPAlfXpla-WMW) z_D58>mi{CIVrsn#q!jz?DIfOKp2Qd?q20Pu_P$O1k4Ob z8a`QuHG}#v3EHulhrY=e@Dym1Vo{6k%H(akIPznIGfmxjj>Uum0@>XqI^Ua zPj33C;=K2kO0{tJFIFNe?K-e>4kQGgdg2NE%Xc1gyP>;&;}oM1wA5~tFL9j<<{?7zCGp(neGk6a{HIfW!feR+&Z>NR$964&#-R% ze;)*g#s<4+2VBOwvn;E%Izh4!wP@~4-e=YRE$=34;!AVy;52Q(Hhk0fT2O!1K!(}% zsS7G`%3QVo<{QBJ-osPxN*ZOR23Px@_xK_cez=kOwX>9I;n; z)0Y%y0-7O@8ZS1luwMMkr1!F^*CDWCm!Y%;!xTB!#MxjT>eu(@G$t4+L1`{J7EL@ literal 0 HcmV?d00001 diff --git a/docs/source/_static/theme_overrides.css b/docs/source/_static/theme_overrides.css index e64d40f1116e5..eeba0ef4cce6e 100644 --- a/docs/source/_static/theme_overrides.css +++ 
b/docs/source/_static/theme_overrides.css @@ -21,43 +21,15 @@ /* Customizing with theme CSS variables */ :root { - --pst-color-active-navigation: 215, 70, 51; - --pst-color-link-hover: 215, 70, 51; - --pst-color-headerlink: 215, 70, 51; - /* Use normal text color (like h3, ..) instead of primary color */ - --pst-color-h1: var(--color-text-base); - --pst-color-h2: var(--color-text-base); - /* Use softer blue from bootstrap's default info color */ - --pst-color-info: 23, 162, 184; - --pst-header-height: 0px; -} - -code { - color: rgb(215, 70, 51); -} - -.footer { - text-align: center; -} - -/* Ensure the logo is properly displayed */ - -.navbar-brand { - height: auto; - width: auto; -} - -a.navbar-brand img { - height: auto; - width: auto; - max-height: 15vh; - max-width: 100%; + /* Change header height to make the logo a bit larger */ + --pst-header-height: 6rem; + /* Make headings more bold */ + --pst-font-weight-heading: 600; } /* Contibuting landing page overview cards */ .contrib-card { - background: #fff; border-radius: 0; padding: 30px 10px 20px 10px; margin: 10px 0px; @@ -70,12 +42,12 @@ a.navbar-brand img { .contrib-card .sd-card-img-top { margin: 2px; height: 75px; + background: none !important; } .contrib-card .sd-card-title { - /* color: rgb(var(--pst-color-h1)) !important; */ + color: var(--pst-color-primary); font-size: var(--pst-font-size-h3); - /* font-weight: bold; */ padding: 1rem 0rem 0.5rem 0rem; } @@ -112,48 +84,3 @@ dl.cpp.enumerator { p.breathe-sectiondef-title { margin-top: 1rem; } - -/* Limit the max height of the sidebar navigation section. Because in our -custimized template, there is more content above the navigation, i.e. -larger logo: if we don't decrease the max-height, it will overlap with -the footer. -Details: min(15vh, 110px) for the logo size, 8rem for search box etc*/ - -@media (min-width:720px) { - @supports (position:-webkit-sticky) or (position:sticky) { - .bd-links { - max-height: calc(100vh - min(15vh, 110px) - 8rem) - } - } -} - -/* Styling to get the version dropdown and search box side-by-side on wide screens */ - -#version-search-wrapper { - width: inherit; - display: flex; - flex-wrap: wrap; - justify-content: left; - align-items: center; -} - -#version-button { - padding-left: 0.5rem; - padding-right: 1rem; -} - -#search-box { - flex: 1 0 12em; -} - -/* Fix table text wrapping in RTD theme, - * see https://rackerlabs.github.io/docs-rackspace/tools/rtd-tables.html - */ - -@media screen { - table.docutils td { - /* !important prevents the common CSS stylesheets from overriding - this as on RTD they are loaded after this stylesheet */ - white-space: normal !important; - } -} diff --git a/docs/source/_static/versions.json index f91b0a17e7774..8d9c5878c8213 100644 --- a/docs/source/_static/versions.json +++ b/docs/source/_static/versions.json @@ -1,62 +1,73 @@ [ { "name": "14.0 (dev)", - "version": "dev/" + "version": "dev/", + "url": "https://arrow.apache.org/docs/dev/" }, { "name": "13.0 (stable)", - "version": "" + "version": "", + "url": "https://arrow.apache.org/docs/", + "preferred": true }, { "name": "12.0", - "version": "12.0/" - }, - { - "name": "12.0", - "version": "12.0/" + "version": "12.0/", + "url": "https://arrow.apache.org/docs/12.0/" }, { "name": "11.0", - "version": "11.0/" + "version": "11.0/", + "url": "https://arrow.apache.org/docs/11.0/" }, { "name": "10.0", - "version": "10.0/" + "version": "10.0/", + "url": "https://arrow.apache.org/docs/10.0/" }, { "name": "9.0", - "version": "9.0/" +
"version": "9.0/", + "url": "https://arrow.apache.org/docs/9.0/" }, { "name": "8.0", - "version": "8.0/" + "version": "8.0/", + "url": "https://arrow.apache.org/docs/8.0/" }, { "name": "7.0", - "version": "7.0/" + "version": "7.0/", + "url": "https://arrow.apache.org/docs/7.0/" }, { "name": "6.0", - "version": "6.0/" + "version": "6.0/", + "url": "https://arrow.apache.org/docs/6.0/" }, { "name": "5.0", - "version": "5.0/" + "version": "5.0/", + "url": "https://arrow.apache.org/docs/5.0/" }, { "name": "4.0", - "version": "4.0/" + "version": "4.0/", + "url": "https://arrow.apache.org/docs/4.0/" }, { "name": "3.0", - "version": "3.0/" + "version": "3.0/", + "url": "https://arrow.apache.org/docs/3.0/" }, { "name": "2.0", - "version": "2.0/" + "version": "2.0/", + "url": "https://arrow.apache.org/docs/2.0/" }, { "name": "1.0", - "version": "1.0/" + "version": "1.0/", + "url": "https://arrow.apache.org/docs/1.0/" } ] diff --git a/docs/source/_static/versionwarning.js index 601b93b75ddd8..e53c160ed98f7 100644 --- a/docs/source/_static/versionwarning.js +++ b/docs/source/_static/versionwarning.js @@ -17,6 +17,8 @@ (function() { // adapted 2022-11 from https://mne.tools/versionwarning.js + // Not used anymore for versions 14.0.0 and higher + // Kept for older docs versions (13.0.0 and lower) if (location.hostname == 'arrow.apache.org') { $.getJSON("https://arrow.apache.org/docs/_static/versions.json", function(data){ var latestStable = data[1].name.replace(" (stable)",""); diff --git a/docs/source/_templates/docs-sidebar.html b/docs/source/_templates/docs-sidebar.html deleted file mode 100644 index 26d42a82f1d5c..0000000000000 --- a/docs/source/_templates/docs-sidebar.html +++ /dev/null @@ -1,25 +0,0 @@ - - - - - -
- -{% include "version-switcher.html" %} - - - -
- - diff --git a/docs/source/_templates/layout.html b/docs/source/_templates/layout.html index ca39e8e5a8fae..956e0142c5062 100644 --- a/docs/source/_templates/layout.html +++ b/docs/source/_templates/layout.html @@ -22,13 +22,3 @@ {% endblock %} - -{# Silence the navbar #} -{% block docs_navbar %} -{% endblock %} - -{# Add version warnings #} -{% block footer %} - {{ super() }} - -{% endblock %} diff --git a/docs/source/_templates/version-switcher.html b/docs/source/_templates/version-switcher.html deleted file mode 100644 index 24a8c15ac0102..0000000000000 --- a/docs/source/_templates/version-switcher.html +++ /dev/null @@ -1,60 +0,0 @@ - - - diff --git a/docs/source/c_glib/index.rst b/docs/source/c_glib/index.rst index 56db23f2a2040..b10524eb2e8a5 100644 --- a/docs/source/c_glib/index.rst +++ b/docs/source/c_glib/index.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +.. _c-glib: + C/GLib docs =========== diff --git a/docs/source/conf.py b/docs/source/conf.py index 23b7070c4a84e..e9e8969f55254 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -115,7 +115,6 @@ 'IPython.sphinxext.ipython_console_highlighting', 'IPython.sphinxext.ipython_directive', 'numpydoc', - "sphinxcontrib.jquery", 'sphinx_design', 'sphinx_copybutton', 'sphinx.ext.autodoc', @@ -288,16 +287,37 @@ # further. For a list of options available for each theme, see the # documentation. # + html_theme_options = { "show_toc_level": 2, "use_edit_page_button": True, + "logo": { + "image_light": "_static/arrow.png", + "image_dark": "_static/arrow-dark.png", + }, + "header_links_before_dropdown": 2, + "header_dropdown_text": "Language implementations", + "navbar_end": ["version-switcher", "theme-switcher", "navbar-icon-links"], + "icon_links": [ + { + "name": "GitHub", + "url": "https://github.com/apache/arrow", + "icon": "fa-brands fa-square-github", + }, + { + "name": "Twitter", + "url": "https://twitter.com/ApacheArrow", + "icon": "fa-brands fa-square-twitter", + }, + ], + "show_version_warning_banner": True, + "switcher": { + "json_url": "/docs/_static/versions.json", + "version_match": version, + }, } html_context = { - "switcher_json_url": "/docs/_static/versions.json", - "switcher_template_url": "https://arrow.apache.org/docs/{version}", - # for local testing - # "switcher_template_url": "http://0.0.0.0:8000/docs/{version}", "github_user": "apache", "github_repo": "arrow", "github_version": "main", @@ -319,7 +339,7 @@ # The name of an image file (relative to this directory) to place at the top # of the sidebar. # -html_logo = "_static/arrow.png" +# html_logo = "_static/arrow.png" # The name of an image file (relative to this directory) to use as a favicon of # the docs. This file should be a Windows icon file (.ico) being 16x16 or @@ -354,10 +374,9 @@ # Custom sidebar templates, maps document names to template names. # -html_sidebars = { +# html_sidebars = { # '**': ['sidebar-logo.html', 'sidebar-search-bs.html', 'sidebar-nav-bs.html'], - '**': ['docs-sidebar.html'], -} +# } # The base URL which points to the root of the HTML documentation, # used for canonical url diff --git a/docs/source/cpp/index.rst b/docs/source/cpp/index.rst index e06453e202979..6d4d4aaa8148c 100644 --- a/docs/source/cpp/index.rst +++ b/docs/source/cpp/index.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +.. _cpp: + C++ Implementation ================== @@ -25,9 +27,9 @@ Welcome to the Apache Arrow C++ implementation documentation! 
:padding: 2 2 0 0 :class-container: sd-text-center - .. grid-item-card:: Basic understanding + .. grid-item-card:: Getting started :class-card: contrib-card - :shadow: md + :shadow: none Start here to gain a basic understanding of Arrow with an installation and linking guide, documentation of @@ -37,14 +39,14 @@ Welcome to the Apache Arrow C++ implementation documentation! .. button-link:: getting_started.html :click-parent: - :color: secondary + :color: primary :expand: - Getting started + To Getting started .. grid-item-card:: User Guide :class-card: contrib-card - :shadow: md + :shadow: none Explore more specific topics and underlying concepts of Arrow C++ @@ -53,19 +55,19 @@ Welcome to the Apache Arrow C++ implementation documentation! .. button-link:: user_guide.html :click-parent: - :color: secondary + :color: primary :expand: - User Guide + To the User Guide .. grid:: 2 :gutter: 4 :padding: 2 2 0 0 :class-container: sd-text-center - .. grid-item-card:: Examples of use + .. grid-item-card:: Examples :class-card: contrib-card - :shadow: md + :shadow: none Find the description and location of the examples using Arrow C++ library @@ -74,14 +76,14 @@ Welcome to the Apache Arrow C++ implementation documentation! .. button-link:: examples/index.html :click-parent: - :color: secondary + :color: primary :expand: - Examples + To the Examples - .. grid-item-card:: Reference documentation + .. grid-item-card:: API Reference :class-card: contrib-card - :shadow: md + :shadow: none Explore Arrow’s API reference documentation @@ -89,10 +91,32 @@ Welcome to the Apache Arrow C++ implementation documentation! .. button-link:: api.html :click-parent: - :color: secondary + :color: primary + :expand: + + To the API Reference + +.. grid:: 1 + :gutter: 4 + :padding: 2 2 0 0 + :class-container: sd-text-center + + .. grid-item-card:: Cookbook + :class-card: contrib-card + :shadow: none + + Collection of recipes which demonstrate how to + solve many common tasks that users might need + to perform when working with arrow data + + +++ + + .. button-link:: https://arrow.apache.org/cookbook/cpp/ + :click-parent: + :color: primary :expand: - API Reference + To the Cookbook .. toctree:: :maxdepth: 2 @@ -102,3 +126,4 @@ Welcome to the Apache Arrow C++ implementation documentation! user_guide Examples api + C++ cookbook diff --git a/docs/source/developers/continuous_integration/index.rst b/docs/source/developers/continuous_integration/index.rst index 6e8e26981c549..f988b5ab69d50 100644 --- a/docs/source/developers/continuous_integration/index.rst +++ b/docs/source/developers/continuous_integration/index.rst @@ -15,6 +15,7 @@ .. specific language governing permissions and limitations .. under the License. +.. _continuous_integration: ********************** Continuous Integration diff --git a/docs/source/developers/continuous_integration/overview.rst b/docs/source/developers/continuous_integration/overview.rst index 1d82e845a3360..3e155bf6001e9 100644 --- a/docs/source/developers/continuous_integration/overview.rst +++ b/docs/source/developers/continuous_integration/overview.rst @@ -20,7 +20,7 @@ Continuous Integration ====================== -Continuous Integration for Arrow is fairly complex as it needs to run across different combinations of package managers, compilers, versions of multiple sofware libraries, operating systems, and other potential sources of variation. In this article, we will give an overview of its main components and the relevant files and directories. 
+Continuous Integration for Arrow is fairly complex as it needs to run across different combinations of package managers, compilers, versions of multiple software libraries, operating systems, and other potential sources of variation. In this article, we will give an overview of its main components and the relevant files and directories. Some files central to Arrow CI are: diff --git a/docs/source/developers/contributing.rst b/docs/source/developers/contributing.rst deleted file mode 100644 index 6dc2a4e0147d6..0000000000000 --- a/docs/source/developers/contributing.rst +++ /dev/null @@ -1,190 +0,0 @@ -.. Licensed to the Apache Software Foundation (ASF) under one -.. or more contributor license agreements. See the NOTICE file -.. distributed with this work for additional information -.. regarding copyright ownership. The ASF licenses this file -.. to you under the Apache License, Version 2.0 (the -.. "License"); you may not use this file except in compliance -.. with the License. You may obtain a copy of the License at - -.. http://www.apache.org/licenses/LICENSE-2.0 - -.. Unless required by applicable law or agreed to in writing, -.. software distributed under the License is distributed on an -.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -.. KIND, either express or implied. See the License for the -.. specific language governing permissions and limitations -.. under the License. - -.. highlight:: console - -.. _contributing: - -**************************** -Contributing to Apache Arrow -**************************** - -**Thanks for your interest in the Apache Arrow project.** - -Arrow is a large project and may seem overwhelming when you're -first getting involved. Contributing code is great, but that's -probably not the first place to start. There are lots of ways to -make valuable contributions to the project and community. - -This page provides some orientation for how to get involved. It also offers -some recommendations on how to get the best results when engaging with the -community. - -Code of Conduct -=============== - -All participation in the Apache Arrow project is governed by the ASF's -`Code of Conduct `_. - -.. grid:: 2 - :gutter: 4 - :padding: 2 2 0 0 - :class-container: sd-text-center - - .. grid-item-card:: Community - :img-top: ./images/users-solid.svg - :class-card: contrib-card - :shadow: md - - A good first step to getting involved in the Arrow project is to join - the mailing lists and participate in discussions where you can. - - +++ - - .. button-link:: https://arrow.apache.org/community/ - :click-parent: - :color: secondary - :expand: - - Apache Arrow Community - - .. grid-item-card:: Bug reports - :img-top: ./images/bug-solid.svg - :class-card: contrib-card - :shadow: md - - Alerting us to unexpected behavior and missing features, even - if you can't solve the problems yourself, help us understand - and prioritize work to improve the libraries. - - +++ - - .. button-ref:: bug-reports - :ref-type: ref - :click-parent: - :color: secondary - :expand: - - Bugs and Features - -.. dropdown:: Communicating through the mailing lists - :animate: fade-in-slide-down - :class-title: sd-fs-5 - :class-container: sd-shadow-md - - Projects in The Apache Software Foundation ("the ASF") use public, archived - mailing lists to create a public record of each project's development - activities and decision-making process. 
- - While lacking the immediacy of chat or other forms of communication, - the mailing lists give participants the opportunity to slow down and be - thoughtful in their responses, and they help developers who are spread across - many timezones to participate more equally. - - Read more on the `Apache Arrow Community `_ - page. - -.. dropdown:: Improve documentation - :animate: fade-in-slide-down - :class-title: sd-fs-5 - :class-container: sd-shadow-md - - A great way to contribute to the project is to improve documentation. If you - found some docs to be incomplete or inaccurate, share your hard-earned knowledge - with the rest of the community. - - Documentation improvements are also a great way to gain some experience with - our submission and review process, discussed below, without requiring a lot - of local development environment setup. In fact, many documentation-only changes - can be made directly in the GitHub web interface by clicking the "edit" button. - This will handle making a fork and a pull request for you. - - * :ref:`documentation` - * :ref:`building-docs` - -.. grid:: 2 - :gutter: 4 - :padding: 2 2 0 0 - :class-container: sd-text-center - - .. grid-item-card:: New Contributors - :img-top: ./images/book-open-solid.svg - :class-card: contrib-card - :shadow: md - - First time contributing? - - The New Contributor's Guide provides necessary information for - contributing to the Apache Arrow project. - - +++ - - .. button-ref:: guide-introduction - :ref-type: ref - :click-parent: - :color: secondary - :expand: - - New Contributor's guide - - .. grid-item-card:: Overview - :img-top: ./images/code-solid.svg - :class-card: contrib-card - :shadow: md - - A short overview of the contributing process we follow - and some additional information you might need if you are not - new to the contributing process in general. - +++ - - .. button-ref:: contrib-overview - :ref-type: ref - :click-parent: - :color: secondary - :expand: - - Contributing overview - -Language specific -================= - -Connection to the specific language development pages: - -.. tab-set:: - - .. tab-item:: C++ - - * :ref:`cpp-development` - * :ref:`C++ Development Guidelines ` - * :ref:`building-arrow-cpp` - - .. tab-item:: Java - - * :doc:`java/index` - - .. tab-item:: Python - - * :ref:`python-development` - - .. tab-item:: R - - * `Arrow R Package: Developer environment setup `_ - * `Arrow R Package: Common developer workflow tasks `_ - - .. 
tab-item:: Ruby - - * `Red Arrow - Apache Arrow Ruby `_ diff --git a/docs/source/developers/images/book-open-solid.svg b/docs/source/developers/images/book-open-solid.svg index cbc8ed27256ca..9586e249be060 100644 --- a/docs/source/developers/images/book-open-solid.svg +++ b/docs/source/developers/images/book-open-solid.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/source/developers/images/bug-solid.svg b/docs/source/developers/images/bug-solid.svg index f842cb240544f..49cc04a1f0f6e 100644 --- a/docs/source/developers/images/bug-solid.svg +++ b/docs/source/developers/images/bug-solid.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/source/developers/images/code-solid.svg b/docs/source/developers/images/code-solid.svg index 725f767148b2c..4bbd567528ef8 100644 --- a/docs/source/developers/images/code-solid.svg +++ b/docs/source/developers/images/code-solid.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/source/developers/images/users-solid.svg b/docs/source/developers/images/users-solid.svg index a04d7fe2fd4a0..4bdf638a70f89 100644 --- a/docs/source/developers/images/users-solid.svg +++ b/docs/source/developers/images/users-solid.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/docs/source/developers/index.rst b/docs/source/developers/index.rst index f6a15a6c5452e..c2f10c9e95c47 100644 --- a/docs/source/developers/index.rst +++ b/docs/source/developers/index.rst @@ -15,12 +15,206 @@ .. specific language governing permissions and limitations .. under the License. +.. highlight:: console + +.. _developers: + +Development +=========== + +Connection to the specific language development pages: + +.. tab-set:: + + .. tab-item:: C++ + + * :ref:`cpp-development` + * :ref:`C++ Development Guidelines ` + * :ref:`building-arrow-cpp` + + .. tab-item:: Java + + * :doc:`java/index` + + .. tab-item:: Python + + * :ref:`python-development` + + .. tab-item:: R + + * `Arrow R Package: Developer environment setup `_ + * `Arrow R Package: Common developer workflow tasks `_ + + .. tab-item:: Ruby + + * `Red Arrow - Apache Arrow Ruby `_ + +.. _contributing: + +Contributing to Apache Arrow +============================ + +**Thanks for your interest in the Apache Arrow project.** + +Arrow is a large project and may seem overwhelming when you're +first getting involved. Contributing code is great, but that's +probably not the first place to start. There are lots of ways to +make valuable contributions to the project and community. + +This page provides some orientation for how to get involved. It also offers +some recommendations on how to get the best results when engaging with the +community. + +Code of Conduct +--------------- + +All participation in the Apache Arrow project is governed by the ASF's +`Code of Conduct `_. + +.. grid:: 2 + :gutter: 4 + :padding: 2 2 0 0 + :class-container: sd-text-center + + .. grid-item-card:: Apache Arrow Community + :img-top: ./images/users-solid.svg + :class-card: contrib-card + :shadow: none + + A good first step to getting involved in the Arrow project is to join + the mailing lists and participate in discussions where you can. + + +++ + + .. button-link:: https://arrow.apache.org/community/ + :click-parent: + :color: primary + :expand: + + To Apache Arrow Community + + .. 
grid-item-card:: Bug reports and feature requests + :img-top: ./images/bug-solid.svg + :class-card: contrib-card + :shadow: none + + Alerting us to unexpected behavior and missing features, even + if you can't solve the problems yourself, helps us understand + and prioritize work to improve the libraries. + + +++ + + .. button-ref:: bug-reports + :ref-type: ref + :click-parent: + :color: primary + :expand: + + To Bug reports and feature requests + +.. dropdown:: Communicating through the mailing lists + :animate: fade-in-slide-down + :class-title: sd-fs-5 + :class-container: sd-shadow-none + + Projects in The Apache Software Foundation ("the ASF") use public, archived + mailing lists to create a public record of each project's development + activities and decision-making process. + + While lacking the immediacy of chat or other forms of communication, + the mailing lists give participants the opportunity to slow down and be + thoughtful in their responses, and they help developers who are spread across + many timezones to participate more equally. + + Read more on the `Apache Arrow Community <https://arrow.apache.org/community/>`_ + page. + +.. dropdown:: Improve documentation + :animate: fade-in-slide-down + :class-title: sd-fs-5 + + A great way to contribute to the project is to improve documentation. If you + find some docs to be incomplete or inaccurate, share your hard-earned knowledge + with the rest of the community. + + Documentation improvements are also a great way to gain some experience with + our submission and review process, discussed below, without requiring a lot + of local development environment setup. In fact, many documentation-only changes + can be made directly in the GitHub web interface by clicking the "edit" button. + This will handle making a fork and a pull request for you. + + * :ref:`documentation` + * :ref:`building-docs` + +.. grid:: 2 + :gutter: 4 + :padding: 2 2 0 0 + :class-container: sd-text-center + + .. grid-item-card:: New Contributor's guide + :img-top: ./images/book-open-solid.svg + :class-card: contrib-card + + First time contributing? + + The New Contributor's Guide provides necessary information for + contributing to the Apache Arrow project. + + +++ + + .. button-ref:: guide-introduction + :ref-type: ref + :click-parent: + :color: primary + :expand: + + To the New Contributor's guide + + .. grid-item-card:: Contributing Overview + :img-top: ./images/code-solid.svg + :class-card: contrib-card + + A short overview of the contributing process we follow + and some additional information you might need if you are not + new to the contributing process in general. + +++ + + .. button-ref:: contrib-overview + :ref-type: ref + :click-parent: + :color: primary + :expand: + + To Contributing overview + +.. dropdown:: Continuous Integration + :animate: fade-in-slide-down + :class-title: sd-fs-5 + :class-container: sd-shadow-none + + Continuous Integration needs to run across different combinations of package managers, compilers, versions of multiple + software libraries, operating systems, and other potential sources of variation. + + Read more on the :ref:`continuous_integration` page. + +.. dropdown:: Benchmarks + :animate: fade-in-slide-down + :class-title: sd-fs-5 + :class-container: sd-shadow-none + + How to use the benchmark suite can be found on the :ref:`benchmarks` page. + +..
dropdown:: Release Guide + :animate: fade-in-slide-down + :class-title: sd-fs-5 + :class-container: sd-shadow-none + + For detailed information on the steps followed to perform a release, see :ref:`release`. + .. toctree:: :maxdepth: 2 - :caption: Development :hidden: - contributing bug_reports guide/index overview diff --git a/docs/source/developers/overview.rst b/docs/source/developers/overview.rst index 272f3dbd98074..c7bc4273313bc 100644 --- a/docs/source/developers/overview.rst +++ b/docs/source/developers/overview.rst @@ -45,7 +45,7 @@ checklist for using ``git``: .. dropdown:: How to squash local commits? :animate: fade-in-slide-down - :class-container: sd-shadow-md + :class-container: sd-shadow-none Abort the rebase with: @@ -78,7 +78,7 @@ checklist for using ``git``: .. dropdown:: Setting rebase to be default :animate: fade-in-slide-down - :class-container: sd-shadow-md + :class-container: sd-shadow-none If you set the following in your repo's ``.git/config``, the ``--rebase`` option can be omitted from the ``git pull`` command, as it is implied by default. @@ -136,7 +136,7 @@ will merge the pull request. This is done with a .. dropdown:: Details on squash merge :animate: fade-in-slide-down - :class-container: sd-shadow-md + :class-container: sd-shadow-none A pull request is merged with a squash merge so that all of your commits will be registered as a single commit to the main branch; this simplifies the diff --git a/docs/source/developers/release.rst b/docs/source/developers/release.rst index 066400b33ffb5..6924c2d714e8b 100644 --- a/docs/source/developers/release.rst +++ b/docs/source/developers/release.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +.. _release: + ======================== Release Management Guide ======================== diff --git a/docs/source/format/index.rst b/docs/source/format/index.rst index 1771b36d76128..ae2baf128b472 100644 --- a/docs/source/format/index.rst +++ b/docs/source/format/index.rst @@ -15,10 +15,13 @@ .. specific language governing permissions and limitations .. under the License. +.. _format: + +Specifications and Protocols +============================ + .. toctree:: :maxdepth: 2 - :caption: Specifications and Protocols - :hidden: Versioning Columnar diff --git a/docs/source/index.rst b/docs/source/index.rst index b348d3dab22b7..e8cdf50c5b1ec 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +:html_theme.sidebar_secondary.remove: + Apache Arrow ============ @@ -35,11 +37,71 @@ such topics as: **To learn how to use Arrow refer to the documentation specific to your target environment.** +.. grid:: 2 + :gutter: 4 + :padding: 2 2 0 0 + :class-container: sd-text-center + + .. grid-item-card:: Specifications and Protocols + :class-card: contrib-card + :shadow: none + + Read about the Apache Arrow format + specifications and protocols. + + +++ + + .. button-ref:: format + :ref-type: ref + :click-parent: + :color: primary + :expand: + + To the Specifications and Protocols + + .. grid-item-card:: Development + :class-card: contrib-card + :shadow: none + + Find documentation on contributions, reviews, + building the libraries from source, building the + documentation, continuous integration, benchmarks, + and the release process. + + +++ + + ..
button-ref:: developers + :ref-type: ref + :click-parent: + :color: primary + :expand: + + To the Development + +.. _toc.columnar: + +.. toctree:: + :maxdepth: 2 + :hidden: + + format/index + +.. _toc.development: + +.. toctree:: + :maxdepth: 2 + :hidden: + + developers/index + +Implementations +--------------- + .. _toc.usage: .. toctree:: :maxdepth: 1 - :caption: Supported Environments C/GLib C++ @@ -55,52 +117,15 @@ target environment.** Rust status +Examples +-------- + .. _toc.cookbook: .. toctree:: :maxdepth: 1 - :caption: Cookbooks - - C++ - Java - Python - R -.. _toc.columnar: - -.. toctree:: - :maxdepth: 2 - :caption: Specifications and Protocols - - format/Versioning - format/Columnar - format/CanonicalExtensions - format/Flight - format/FlightSql - format/Integration - format/CDataInterface - format/CStreamInterface - format/CDeviceDataInterface - format/ADBC - format/Other - format/Changing - format/Glossary - -.. _toc.development: - -.. toctree:: - :maxdepth: 2 - :caption: Development - - developers/contributing - developers/bug_reports - developers/guide/index - developers/overview - developers/reviewing - developers/cpp/index - developers/java/index - developers/python - developers/continuous_integration/index - developers/benchmarks - developers/documentation - developers/release + C++ cookbook + Java cookbook + Python cookbook + R cookbook diff --git a/docs/source/java/index.rst b/docs/source/java/index.rst index 9b555e297b0f9..cf93b0e897832 100644 --- a/docs/source/java/index.rst +++ b/docs/source/java/index.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +.. _java: + Java Implementation =================== @@ -41,3 +43,4 @@ on the Arrow format and other language bindings see the :doc:`parent documentati cdata jdbc Reference (javadoc) + Java cookbook diff --git a/docs/source/js/index.rst b/docs/source/js/index.rst index 77813c1372dfe..2ab205a08b850 100644 --- a/docs/source/js/index.rst +++ b/docs/source/js/index.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +.. _js: + JavaScript docs =============== diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst index b80cbc7de594e..6a3de3d42b149 100644 --- a/docs/source/python/index.rst +++ b/docs/source/python/index.rst @@ -15,8 +15,13 @@ .. specific language governing permissions and limitations .. under the License. +.. _python: + +Python +====== + PyArrow - Apache Arrow Python bindings -====================================== +-------------------------------------- This is the documentation of the Python API of Apache Arrow. @@ -62,3 +67,4 @@ files into Arrow structures. api getting_involved benchmarks + Python cookbook diff --git a/docs/source/r/index.rst b/docs/source/r/index.rst index b799544bb6bb3..8ccbec132ad3d 100644 --- a/docs/source/r/index.rst +++ b/docs/source/r/index.rst @@ -15,6 +15,8 @@ .. specific language governing permissions and limitations .. under the License. +.. _r: + R docs ====== From 286487010b43da384dbeec941d2b49f66638a90a Mon Sep 17 00:00:00 2001 From: Danyaal Khan Date: Wed, 27 Sep 2023 16:42:22 +0100 Subject: [PATCH 73/96] GH-37377: [C#] Throw OverflowException on overflow in TimestampArray.ConvertTo() (#37388) Throw `OverflowException` on overflow in `TimestampArray.ConvertTo()` when `DataType.Unit` is `Nanosecond` and `ticks` is large, instead of silently overflowing and returning the wrong value. 
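For readers unfamiliar with C# overflow semantics, here is a minimal standalone sketch (not Arrow library code) of the behavior this change targets. A signed 64-bit nanosecond count since the Unix epoch overflows for instants later than roughly the year 2262, and wrapping the multiplication in `checked(...)` turns the previous silent wraparound into an `OverflowException`; `DateTimeOffset.MaxValue` is just an arbitrary out-of-range input.

```csharp
using System;

class CheckedOverflowSketch
{
    static void Main()
    {
        // An arbitrary timestamp far beyond what Int64 nanoseconds can represent
        // (anything after ~2262-04-11 overflows a signed 64-bit nanosecond count).
        DateTimeOffset value = DateTimeOffset.MaxValue;
        long ticks = value.UtcTicks - DateTimeOffset.UnixEpoch.UtcTicks;

        // Old behavior: the multiplication wraps around silently and yields
        // a meaningless (here negative) number of "nanoseconds".
        Console.WriteLine(unchecked(ticks * 100));

        // New behavior: checked() raises OverflowException instead.
        try
        {
            Console.WriteLine(checked(ticks * 100));
        }
        catch (OverflowException)
        {
            Console.WriteLine("OverflowException thrown, as this patch intends");
        }
    }
}
```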
* Closes: #37377 Authored-by: Danyaal Khan Signed-off-by: Weston Pace --- csharp/src/Apache.Arrow/Arrays/TimestampArray.cs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/csharp/src/Apache.Arrow/Arrays/TimestampArray.cs b/csharp/src/Apache.Arrow/Arrays/TimestampArray.cs index 0269768f490bb..0dc5726d01734 100644 --- a/csharp/src/Apache.Arrow/Arrays/TimestampArray.cs +++ b/csharp/src/Apache.Arrow/Arrays/TimestampArray.cs @@ -76,7 +76,7 @@ protected override long ConvertTo(DateTimeOffset value) switch (DataType.Unit) { case TimeUnit.Nanosecond: - return ticks * 100; + return checked(ticks * 100); case TimeUnit.Microsecond: return ticks / 10; case TimeUnit.Millisecond: From aca1d3eeed3775c2f02e9f5d59d62478267950b1 Mon Sep 17 00:00:00 2001 From: ismail simsek Date: Wed, 27 Sep 2023 17:51:05 +0200 Subject: [PATCH 74/96] GH-35770: [Go][Documentation] Update TimestampType zero value as seconds in comment (#37905) ### Rationale for this change To clear the confusion around the zero value of `TimestampType` ### What changes are included in this PR? Just a comment change `nanosecond -> second` ### Are these changes tested? No need to test ### Are there any user-facing changes? No Closes: https://github.com/apache/arrow/issues/35770 * Closes: #35770 Authored-by: ismail simsek Signed-off-by: Matt Topol --- go/arrow/datatype_fixedwidth.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/go/arrow/datatype_fixedwidth.go b/go/arrow/datatype_fixedwidth.go index 7f62becdc2884..fc0b3aea56e70 100644 --- a/go/arrow/datatype_fixedwidth.go +++ b/go/arrow/datatype_fixedwidth.go @@ -347,7 +347,7 @@ type TemporalWithUnit interface { } // TimestampType is encoded as a 64-bit signed integer since the UNIX epoch (2017-01-01T00:00:00Z). -// The zero-value is a nanosecond and time zone neutral. Time zone neutral can be +// The zero-value is a second and time zone neutral. Time zone neutral can be // considered UTC without having "UTC" as a time zone. type TimestampType struct { Unit TimeUnit From c9674bcc16411d7ecfd3b5587f544758b9fc7062 Mon Sep 17 00:00:00 2001 From: h-vetinari Date: Thu, 28 Sep 2023 22:18:07 +1100 Subject: [PATCH 75/96] GH-37621: [Packaging][Conda] Sync conda recipes with feedstocks (#37624) Syncing after the release of 13.0.0 + a couple of migrations (state as of https://github.com/conda-forge/arrow-cpp-feedstock/pull/1168 & https://github.com/conda-forge/r-arrow-feedstock/pull/68) Relevant updates: * we're not building twice for different protobuf versions anymore * new abseil version (fixes #36908) * we've finally upgraded the aws-sdk to 1.11 * the default R versions (on unix) are now 4.2 & 4.3. Also some further hardening of the activation scripts & clean-ups for dependencies & test skips. * Closes: #37621 Lead-authored-by: H. 
Vetinari Co-authored-by: h-vetinari Signed-off-by: Joris Van den Bossche --- .../linux_64_cuda_compiler_version11.2.yaml | 20 +++--- .../linux_64_cuda_compiler_versionNone.yaml | 20 +++--- ...nux_aarch64_cuda_compiler_version11.2.yaml | 20 +++--- ...nux_aarch64_cuda_compiler_versionNone.yaml | 20 +++--- ...nux_ppc64le_cuda_compiler_version11.2.yaml | 20 +++--- ...nux_ppc64le_cuda_compiler_versionNone.yaml | 20 +++--- .../conda-recipes/.ci_support/osx_64_.yaml | 22 +++---- .../conda-recipes/.ci_support/osx_arm64_.yaml | 20 +++--- ...r_base4.1.yaml => linux_64_r_base4.3.yaml} | 2 +- ...e4.1.yaml => linux_aarch64_r_base4.3.yaml} | 2 +- ...4_r_base4.1.yaml => osx_64_r_base4.3.yaml} | 2 +- ..._base4.1.yaml => osx_arm64_r_base4.3.yaml} | 2 +- .../win_64_cuda_compiler_version11.2.yaml | 22 +++---- .../win_64_cuda_compiler_versionNone.yaml | 22 +++---- dev/tasks/conda-recipes/arrow-cpp/activate.sh | 17 +++-- .../conda-recipes/arrow-cpp/build-arrow.sh | 4 +- .../conda-recipes/arrow-cpp/build-pyarrow.sh | 4 ++ dev/tasks/conda-recipes/arrow-cpp/meta.yaml | 65 ++++++++----------- dev/tasks/tasks.yml | 24 +++---- 19 files changed, 151 insertions(+), 177 deletions(-) rename dev/tasks/conda-recipes/.ci_support/r/{linux_64_r_base4.1.yaml => linux_64_r_base4.3.yaml} (98%) rename dev/tasks/conda-recipes/.ci_support/r/{linux_aarch64_r_base4.1.yaml => linux_aarch64_r_base4.3.yaml} (98%) rename dev/tasks/conda-recipes/.ci_support/r/{osx_64_r_base4.1.yaml => osx_64_r_base4.3.yaml} (98%) rename dev/tasks/conda-recipes/.ci_support/r/{osx_arm64_r_base4.1.yaml => osx_arm64_r_base4.3.yaml} (98%) diff --git a/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_version11.2.yaml b/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_version11.2.yaml index 1cdcec199e7ba..042e2364d1c49 100644 --- a/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_version11.2.yaml +++ b/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_version11.2.yaml @@ -1,7 +1,7 @@ aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -33,20 +33,18 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -67,7 +65,7 @@ snappy: target_platform: - linux-64 thrift_cpp: -- 0.18.1 +- 0.19.0 ucx: - 1.14.0 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_versionNone.yaml b/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_versionNone.yaml index 5be5b58a73932..9885e6db38cd7 100644 --- a/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_versionNone.yaml +++ b/dev/tasks/conda-recipes/.ci_support/linux_64_cuda_compiler_versionNone.yaml @@ -1,7 +1,7 @@ aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -33,20 +33,18 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -67,7 +65,7 @@ snappy: target_platform: - linux-64 thrift_cpp: -- 0.18.1 +- 0.19.0 ucx: - 1.14.0 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_version11.2.yaml b/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_version11.2.yaml index 
1677b03564c08..788b584504ec4 100644 --- a/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_version11.2.yaml +++ b/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_version11.2.yaml @@ -1,9 +1,9 @@ BUILD: - aarch64-conda_cos7-linux-gnu aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -37,20 +37,18 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -71,7 +69,7 @@ snappy: target_platform: - linux-aarch64 thrift_cpp: -- 0.18.1 +- 0.19.0 ucx: - 1.14.0 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_versionNone.yaml b/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_versionNone.yaml index 88fdf1254e661..a1e4b8571abaf 100644 --- a/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_versionNone.yaml +++ b/dev/tasks/conda-recipes/.ci_support/linux_aarch64_cuda_compiler_versionNone.yaml @@ -1,9 +1,9 @@ BUILD: - aarch64-conda_cos7-linux-gnu aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -37,20 +37,18 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -71,7 +69,7 @@ snappy: target_platform: - linux-aarch64 thrift_cpp: -- 0.18.1 +- 0.19.0 ucx: - 1.14.0 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_version11.2.yaml b/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_version11.2.yaml index 3585db7b99baa..e21c4cbe853f8 100644 --- a/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_version11.2.yaml +++ b/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_version11.2.yaml @@ -1,7 +1,7 @@ aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -33,20 +33,18 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -67,7 +65,7 @@ snappy: target_platform: - linux-ppc64le thrift_cpp: -- 0.18.1 +- 0.19.0 ucx: - 1.14.0 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_versionNone.yaml b/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_versionNone.yaml index c13a522254286..89f1049ebdd84 100644 --- a/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_versionNone.yaml +++ b/dev/tasks/conda-recipes/.ci_support/linux_ppc64le_cuda_compiler_versionNone.yaml @@ -1,7 +1,7 @@ aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -33,20 +33,18 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -67,7 +65,7 @@ snappy: target_platform: - linux-ppc64le thrift_cpp: -- 0.18.1 +- 0.19.0 ucx: - 1.14.0 zip_keys: diff --git 
a/dev/tasks/conda-recipes/.ci_support/osx_64_.yaml b/dev/tasks/conda-recipes/.ci_support/osx_64_.yaml index dd4a230760ef2..2a5f8c5b36bd3 100644 --- a/dev/tasks/conda-recipes/.ci_support/osx_64_.yaml +++ b/dev/tasks/conda-recipes/.ci_support/osx_64_.yaml @@ -1,9 +1,9 @@ MACOSX_DEPLOYMENT_TARGET: -- '10.9' +- '10.13' aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -27,22 +27,20 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 macos_machine: - x86_64-apple-darwin13.4.0 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -63,7 +61,7 @@ snappy: target_platform: - osx-64 thrift_cpp: -- 0.18.1 +- 0.19.0 zip_keys: - - c_compiler_version - cxx_compiler_version diff --git a/dev/tasks/conda-recipes/.ci_support/osx_arm64_.yaml b/dev/tasks/conda-recipes/.ci_support/osx_arm64_.yaml index 6a6713a54fe86..211b71226cae8 100644 --- a/dev/tasks/conda-recipes/.ci_support/osx_arm64_.yaml +++ b/dev/tasks/conda-recipes/.ci_support/osx_arm64_.yaml @@ -1,9 +1,9 @@ MACOSX_DEPLOYMENT_TARGET: - '11.0' aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' c_compiler: @@ -27,22 +27,20 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 macos_machine: - arm64-apple-darwin20.0.0 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -63,7 +61,7 @@ snappy: target_platform: - osx-arm64 thrift_cpp: -- 0.18.1 +- 0.19.0 zip_keys: - - c_compiler_version - cxx_compiler_version diff --git a/dev/tasks/conda-recipes/.ci_support/r/linux_64_r_base4.1.yaml b/dev/tasks/conda-recipes/.ci_support/r/linux_64_r_base4.3.yaml similarity index 98% rename from dev/tasks/conda-recipes/.ci_support/r/linux_64_r_base4.1.yaml rename to dev/tasks/conda-recipes/.ci_support/r/linux_64_r_base4.3.yaml index e63767cbe9771..a4d06c9f20cdd 100644 --- a/dev/tasks/conda-recipes/.ci_support/r/linux_64_r_base4.1.yaml +++ b/dev/tasks/conda-recipes/.ci_support/r/linux_64_r_base4.3.yaml @@ -19,7 +19,7 @@ pin_run_as_build: min_pin: x.x max_pin: x.x r_base: -- '4.1' +- '4.3' target_platform: - linux-64 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/r/linux_aarch64_r_base4.1.yaml b/dev/tasks/conda-recipes/.ci_support/r/linux_aarch64_r_base4.3.yaml similarity index 98% rename from dev/tasks/conda-recipes/.ci_support/r/linux_aarch64_r_base4.1.yaml rename to dev/tasks/conda-recipes/.ci_support/r/linux_aarch64_r_base4.3.yaml index 2b80b020fdc0b..028b190bb1ef5 100644 --- a/dev/tasks/conda-recipes/.ci_support/r/linux_aarch64_r_base4.1.yaml +++ b/dev/tasks/conda-recipes/.ci_support/r/linux_aarch64_r_base4.3.yaml @@ -23,7 +23,7 @@ pin_run_as_build: min_pin: x.x max_pin: x.x r_base: -- '4.1' +- '4.3' target_platform: - linux-aarch64 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/r/osx_64_r_base4.1.yaml b/dev/tasks/conda-recipes/.ci_support/r/osx_64_r_base4.3.yaml similarity index 98% rename from dev/tasks/conda-recipes/.ci_support/r/osx_64_r_base4.1.yaml rename to dev/tasks/conda-recipes/.ci_support/r/osx_64_r_base4.3.yaml index 6be6c2f5462c5..7b8b62d8e00bb 100644 --- a/dev/tasks/conda-recipes/.ci_support/r/osx_64_r_base4.1.yaml +++ b/dev/tasks/conda-recipes/.ci_support/r/osx_64_r_base4.3.yaml @@ -19,7 +19,7 @@ pin_run_as_build: min_pin: x.x 
max_pin: x.x r_base: -- '4.1' +- '4.3' target_platform: - osx-64 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/r/osx_arm64_r_base4.1.yaml b/dev/tasks/conda-recipes/.ci_support/r/osx_arm64_r_base4.3.yaml similarity index 98% rename from dev/tasks/conda-recipes/.ci_support/r/osx_arm64_r_base4.1.yaml rename to dev/tasks/conda-recipes/.ci_support/r/osx_arm64_r_base4.3.yaml index 0ce856fcccf5c..a8e8aab83d598 100644 --- a/dev/tasks/conda-recipes/.ci_support/r/osx_arm64_r_base4.1.yaml +++ b/dev/tasks/conda-recipes/.ci_support/r/osx_arm64_r_base4.3.yaml @@ -19,7 +19,7 @@ pin_run_as_build: min_pin: x.x max_pin: x.x r_base: -- '4.1' +- '4.3' target_platform: - osx-arm64 zip_keys: diff --git a/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_version11.2.yaml b/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_version11.2.yaml index f75d92e276d9e..32da33c072019 100644 --- a/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_version11.2.yaml +++ b/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_version11.2.yaml @@ -1,11 +1,9 @@ aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' -c_ares: -- '1' c_compiler: - vs2019 channel_sources: @@ -27,24 +25,22 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libcrc32c: - '1.1' libcurl: - '8' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -65,7 +61,7 @@ snappy: target_platform: - win-64 thrift_cpp: -- 0.18.1 +- 0.19.0 zip_keys: - - cuda_compiler - cuda_compiler_version diff --git a/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_versionNone.yaml b/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_versionNone.yaml index 6d8fb15b15a2a..6a33b86b9d65e 100644 --- a/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_versionNone.yaml +++ b/dev/tasks/conda-recipes/.ci_support/win_64_cuda_compiler_versionNone.yaml @@ -1,11 +1,9 @@ aws_crt_cpp: -- 0.20.3 +- 0.23.1 aws_sdk_cpp: -- 1.10.57 +- 1.11.156 bzip2: - '1' -c_ares: -- '1' c_compiler: - vs2019 channel_sources: @@ -27,24 +25,22 @@ glog: google_cloud_cpp: - '2.12' libabseil: -- '20230125' +- '20230802' libcrc32c: - '1.1' libcurl: - '8' libgrpc: -- '1.54' -- '1.56' +- '1.57' libprotobuf: -- '3.21' -- 4.23.3 +- 4.23.4 lz4_c: - 1.9.3 numpy: -- '1.21' +- '1.22' - '1.23' -- '1.21' -- '1.21' +- '1.22' +- '1.22' openssl: - '3' orc: @@ -65,7 +61,7 @@ snappy: target_platform: - win-64 thrift_cpp: -- 0.18.1 +- 0.19.0 zip_keys: - - cuda_compiler - cuda_compiler_version diff --git a/dev/tasks/conda-recipes/arrow-cpp/activate.sh b/dev/tasks/conda-recipes/arrow-cpp/activate.sh index 8757612781bbe..19d037ff4127a 100644 --- a/dev/tasks/conda-recipes/arrow-cpp/activate.sh +++ b/dev/tasks/conda-recipes/arrow-cpp/activate.sh @@ -23,6 +23,13 @@ _la_log "Beginning libarrow activation." # where the GDB wrappers get installed _la_gdb_prefix="$CONDA_PREFIX/share/gdb/auto-load" +# If the directory is not writable, nothing can be done +if [ ! -w "$_la_gdb_prefix" ]; then + _la_log 'No rights to modify $_la_gdb_prefix, cannot create symlink!' + _la_log 'Unless you plan to use the GDB debugger with libarrow, this warning can be safely ignored.' + return +fi + # this needs to be in sync with ARROW_GDB_INSTALL_DIR in build.sh _la_placeholder="replace_this_section_with_absolute_slashed_path_to_CONDA_PREFIX" # the paths here are intentionally stacked, see #935, resp. 
@@ -44,7 +51,7 @@ for _la_target in "$_la_orig_install_dir/"*.py; do # If the file doesn't exist, skip this iteration of the loop. # (This happens when no files are found, in which case the # loop runs with target equal to the pattern itself.) - _la_log 'Folder $_la_orig_install_dir seems to not contain .py files, skipping' + _la_log 'Folder $_la_orig_install_dir seems to not contain .py files, skipping.' continue fi _la_symlink="$_la_symlink_dir/$(basename "$_la_target")" @@ -54,13 +61,13 @@ for _la_target in "$_la_orig_install_dir/"*.py; do _la_log 'symlink $_la_symlink already exists and points to $_la_target, skipping.' continue fi - _la_log 'Creating symlink $_la_symlink pointing to $_la_target' + _la_log 'Creating symlink $_la_symlink pointing to $_la_target.' mkdir -p "$_la_symlink_dir" || true # this check also creates the symlink; if it fails, we enter the if-branch. if ! ln -sf "$_la_target" "$_la_symlink"; then - echo -n "${BASH_SOURCE[0]} ERROR: Failed to create symlink from " - echo -n "'$_la_target' to '$_la_symlink'" - echo + echo -n "${BASH_SOURCE[0]} WARNING: Failed to create symlink from " + echo "'$_la_target' to '$_la_symlink'!" + echo "Unless you plan to use the GDB debugger with libarrow, this warning can be safely ignored." continue fi done diff --git a/dev/tasks/conda-recipes/arrow-cpp/build-arrow.sh b/dev/tasks/conda-recipes/arrow-cpp/build-arrow.sh index dc588f9473870..ef0b038812a01 100755 --- a/dev/tasks/conda-recipes/arrow-cpp/build-arrow.sh +++ b/dev/tasks/conda-recipes/arrow-cpp/build-arrow.sh @@ -30,7 +30,7 @@ fi # Enable CUDA support if [[ ! -z "${cuda_compiler_version+x}" && "${cuda_compiler_version}" != "None" ]] then - EXTRA_CMAKE_ARGS=" ${EXTRA_CMAKE_ARGS} -DARROW_CUDA=ON -DCUDA_TOOLKIT_ROOT_DIR=${CUDA_HOME} -DCMAKE_LIBRARY_PATH=${CONDA_BUILD_SYSROOT}/lib" + EXTRA_CMAKE_ARGS=" ${EXTRA_CMAKE_ARGS} -DARROW_CUDA=ON -DCUDAToolkit_ROOT=${CUDA_HOME} -DCMAKE_LIBRARY_PATH=${CONDA_BUILD_SYSROOT}/lib" else EXTRA_CMAKE_ARGS=" ${EXTRA_CMAKE_ARGS} -DARROW_CUDA=OFF" fi @@ -43,8 +43,8 @@ if [[ "${build_platform}" != "${target_platform}" ]]; then fi EXTRA_CMAKE_ARGS="${EXTRA_CMAKE_ARGS} -DCLANG_EXECUTABLE=${BUILD_PREFIX}/bin/${CONDA_TOOLCHAIN_HOST}-clang" EXTRA_CMAKE_ARGS="${EXTRA_CMAKE_ARGS} -DLLVM_LINK_EXECUTABLE=${BUILD_PREFIX}/bin/llvm-link" + EXTRA_CMAKE_ARGS="${EXTRA_CMAKE_ARGS} -DARROW_JEMALLOC_LG_PAGE=16" sed -ie "s;protoc-gen-grpc.*$;protoc-gen-grpc=${BUILD_PREFIX}/bin/grpc_cpp_plugin\";g" ../src/arrow/flight/CMakeLists.txt - sed -ie 's;"--with-jemalloc-prefix\=je_arrow_";"--with-jemalloc-prefix\=je_arrow_" "--with-lg-page\=16";g' ../cmake_modules/ThirdpartyToolchain.cmake fi # disable -fno-plt, which causes problems with GCC on PPC diff --git a/dev/tasks/conda-recipes/arrow-cpp/build-pyarrow.sh b/dev/tasks/conda-recipes/arrow-cpp/build-pyarrow.sh index 9c12321a1c115..f39e06874ca0e 100755 --- a/dev/tasks/conda-recipes/arrow-cpp/build-pyarrow.sh +++ b/dev/tasks/conda-recipes/arrow-cpp/build-pyarrow.sh @@ -24,6 +24,10 @@ BUILD_EXT_FLAGS="" # Enable CUDA support if [[ ! 
-z "${cuda_compiler_version+x}" && "${cuda_compiler_version}" != "None" ]]; then export PYARROW_WITH_CUDA=1 + if [[ "${build_platform}" != "${target_platform}" ]]; then + export CUDAToolkit_ROOT=${CUDA_HOME} + export CMAKE_LIBRARY_PATH=${CONDA_BUILD_SYSROOT}/lib + fi else export PYARROW_WITH_CUDA=0 fi diff --git a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml index fbe40af3dae01..371b62245bb72 100644 --- a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml +++ b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml @@ -61,7 +61,7 @@ outputs: build: string: h{{ PKG_HASH }}_{{ PKG_BUILDNUM }}_{{ build_ext }} run_exports: - - {{ pin_subpackage("libarrow", max_pin="x.x.x") }} + - {{ pin_subpackage("libarrow", max_pin="x") }} ignore_run_exports_from: - {{ compiler("cuda") }} # [cuda_compiler_version != "None"] # arrow only uses headers, apparently @@ -114,6 +114,8 @@ outputs: - libgrpc - libprotobuf - libutf8proc + # gandiva requires shared libllvm + - llvm # [unix] - lz4-c - nlohmann_json # gandiva depends on openssl @@ -133,8 +135,6 @@ outputs: # its host deps (which aren't yet covered above) leak into the build here - libcrc32c # [win] - libcurl # [win] - # same for libgrpc (before 1.55.0, which is coupled with libprotobuf 4.23.x) - - c-ares # [win and libprotobuf == "3.21"] run_constrained: - apache-arrow-proc =*={{ build_ext }} # make sure we don't co-install with old version of old package name @@ -198,8 +198,6 @@ outputs: requirements: host: - {{ pin_subpackage('libarrow', exact=True) }} - # avoid wrappers for different builds colliding due to identical hashes - - libprotobuf run: - {{ pin_subpackage('libarrow', exact=True) }} test: @@ -235,9 +233,7 @@ outputs: - cmake - ninja host: - # we're building for two protobuf versions, cannot pin exactly - # - {{ pin_subpackage('libarrow', exact=True) }} - - libarrow ={{ version }}=*_{{ PKG_BUILDNUM }}_{{ build_ext }} + - {{ pin_subpackage('libarrow', exact=True) }} - clangdev {{ llvm_version }} - llvmdev {{ llvm_version }} - cython <3 @@ -246,8 +242,7 @@ outputs: - setuptools - setuptools_scm <8.0.0 run: - # - {{ pin_subpackage('libarrow', exact=True) }} - - libarrow ={{ version }}=*_{{ PKG_BUILDNUM }}_{{ build_ext }} + - {{ pin_subpackage('libarrow', exact=True) }} - {{ pin_compatible('numpy') }} - python run_constrained: @@ -336,28 +331,28 @@ outputs: # crossbow CI: reduce to one python version, except on (unemulated) linux, where it's fast enough {% if linux64 or py == 311 %} - # {% if not (aarch64 or ppc64le) or py in (310, 311) %} - # only run the full test suite for one python version when in emulation (each run takes ~45min); - # there's essentially zero divergence in behaviour across python versions anyway, and otherwise - # CUDA builds for aarch/ppc consistently run out of disk space on azure for some reason + # {% if not (aarch64 or ppc64le) or py == 311 %} + # only run the full test suite for one python version when in emulation + # (each run can take up to ~45min); there's essentially zero divergence + # in behaviour across python versions anyway test: requires: - # vary protobuf version in test suite (historically, test failures only have a very - # weak dependency on python version, so we don't lose coverage by doing half & half) - - libprotobuf <4 # [py % 2 == 0] # test_cpp_extension_in_python requires a compiler - {{ compiler("cxx") }} # [linux] - # temporary pin due to missing fixture - - pytest <7.4.0 + - pytest - pytest-lazy-fixture - backports.zoneinfo # [py<39] + - boto3 - cffi - cloudpickle - 
cython <3 - fastparquet - fsspec - hypothesis + # currently disabled due to GH-37692 + # - minio-server - pandas + - s3fs >=2023 - scipy # these are generally (far) behind on migrating abseil/grpc/protobuf, # and using them as test dependencies blocks the migrator unnecessarily @@ -372,8 +367,8 @@ outputs: source_files: - testing/data commands: - - cd ${SP_DIR}/pyarrow/tests # [unix] - - cd %SP_DIR%\pyarrow\tests # [win] + - cd ${SP_DIR} # [unix] + - cd %SP_DIR% # [win] - export ARROW_TEST_DATA="${SRC_DIR}/testing/data" # [unix] - set "ARROW_TEST_DATA=%SRC_DIR%\testing\data" # [win] @@ -382,34 +377,26 @@ outputs: {% set tests_to_skip = tests_to_skip + " or test_cuda" %} # skip tests that raise SIGINT and crash the test suite {% set tests_to_skip = tests_to_skip + " or (test_csv and test_cancellation)" %} # [linux] - {% set tests_to_skip = tests_to_skip + " or (test_flight and test_interrupt)" %} # [linux] - # tests that may crash the agent due to out-of-bound memory writes or other risky stuff - {% set tests_to_skip = tests_to_skip + " or test_debug_memory_pool" %} # [aarch64 or ppc64le] - # cannot pass -D_LIBCPP_DISABLE_AVAILABILITY to test suite for our older macos sdk - {% set tests_to_skip = tests_to_skip + " or test_cpp_extension_in_python" %} # [osx] + # skip test that intentionally writes out of bounds and then expects no error message + {% set tests_to_skip = tests_to_skip + " or test_debug_memory_pool_disabled[system_memory_pool]" %} # [osx] # skip tests that make invalid(-for-conda) assumptions about the compilers setup {% set tests_to_skip = tests_to_skip + " or test_cython_api" %} # [unix] {% set tests_to_skip = tests_to_skip + " or test_visit_strings" %} # [unix] # skip tests that cannot succeed in emulation {% set tests_to_skip = tests_to_skip + " or test_debug_memory_pool_disabled" %} # [aarch64 or ppc64le] {% set tests_to_skip = tests_to_skip + " or test_env_var_io_thread_count" %} # [aarch64 or ppc64le] + # XMinioInvalidObjectName on osx/win: "Object name contains unsupported characters" + {% set tests_to_skip = tests_to_skip + " or test_write_to_dataset_with_partitions_s3fs" %} # [osx or win] # vvvvvvv TESTS THAT SHOULDN'T HAVE TO BE SKIPPED vvvvvvv - {% set tests_to_skip = tests_to_skip + " or test_extension_to_pandas_storage_type" %} - # segfaults on OSX: to investigate ASAP - {% set tests_to_skip = tests_to_skip + " or test_flight" %} # [osx] + # currently broken + {% set tests_to_skip = tests_to_skip + " or test_fastparquet_cross_compatibility" %} # gandiva tests are segfaulting on ppc - {% set tests_to_skip = tests_to_skip + " or test_gandiva" %} # [ppc64le] - # test failures on ppc + {% set tests_to_skip = tests_to_skip + " or test_gandiva" %} # [ppc64le] + # test failures on ppc (both failing with: Float value was truncated converting to int32) {% set tests_to_skip = tests_to_skip + " or test_safe_cast_from_float_with_nans_to_int" %} # [ppc64le] - # gandiva tests are segfaulting on ppc - {% set tests_to_skip = tests_to_skip + " or test_float_with_null_as_integer" %} # [ppc64le] - # test is broken; header is in $PREFIX, not $SP_DIR - {% set tests_to_skip = tests_to_skip + " or (test_misc and test_get_include)" %} # [unix] - # flaky tests that fail occasionally - {% set tests_to_skip = tests_to_skip + " or test_total_bytes_allocated " %} # [linux] - {% set tests_to_skip = tests_to_skip + " or test_feather_format " %} # [linux] + {% set tests_to_skip = tests_to_skip + " or test_float_with_null_as_integer" %} # [ppc64le] # ^^^^^^^ TESTS THAT SHOULDN'T HAVE TO BE 
SKIPPED ^^^^^^^ - - pytest -rfEs -k "not ({{ tests_to_skip }})" + - pytest pyarrow/ -rfEs -k "not ({{ tests_to_skip }})" {% endif %} about: diff --git a/dev/tasks/tasks.yml b/dev/tasks/tasks.yml index 29e038a922412..859ff8ddb5b44 100644 --- a/dev/tasks/tasks.yml +++ b/dev/tasks/tasks.yml @@ -246,15 +246,15 @@ tasks: # generated and to be synced regularly from the feedstock. We have no way # yet to generate them inside the arrow repository automatically. - conda-linux-x64-cpu-r41: + conda-linux-x64-cpu-r43: ci: azure template: conda-recipes/azure.linux.yml params: config: linux_64_cuda_compiler_versionNone - r_config: linux_64_r_base4.1 + r_config: linux_64_r_base4.3 artifacts: - libarrow-{no_rc_version}-(h[a-z0-9]+)_0_cpu.conda - - r-arrow-{no_rc_version}-r41(h[a-z0-9]+)_0.conda + - r-arrow-{no_rc_version}-r43(h[a-z0-9]+)_0.conda conda-linux-x64-cpu-r42: ci: azure @@ -292,15 +292,15 @@ tasks: ########################### Conda Linux (aarch64) ########################### - conda-linux-aarch64-cpu-r41: + conda-linux-aarch64-cpu-r43: ci: azure template: conda-recipes/azure.linux.yml params: config: linux_aarch64_cuda_compiler_versionNone - r_config: linux_aarch64_r_base4.1 + r_config: linux_aarch64_r_base4.3 artifacts: - libarrow-{no_rc_version}-(h[a-z0-9]+)_0_cpu.conda - - r-arrow-{no_rc_version}-r41(h[a-z0-9]+)_0.conda + - r-arrow-{no_rc_version}-r43(h[a-z0-9]+)_0.conda conda-linux-aarch64-cpu-r42: ci: azure @@ -364,15 +364,15 @@ tasks: ############################## Conda OSX (x64) ############################## - conda-osx-x64-cpu-r41: + conda-osx-x64-cpu-r43: ci: azure template: conda-recipes/azure.osx.yml params: config: osx_64_ - r_config: osx_64_r_base4.1 + r_config: osx_64_r_base4.3 artifacts: - libarrow-{no_rc_version}-(h[a-z0-9]+)_0_cpu.conda - - r-arrow-{no_rc_version}-r41(h[a-z0-9]+)_0.conda + - r-arrow-{no_rc_version}-r43(h[a-z0-9]+)_0.conda conda-osx-x64-cpu-r42: ci: azure @@ -398,15 +398,15 @@ tasks: ############################# Conda OSX (arm64) ############################# - conda-osx-arm64-cpu-r41: + conda-osx-arm64-cpu-r43: ci: azure template: conda-recipes/azure.osx.yml params: config: osx_arm64_ - r_config: osx_arm64_r_base4.1 + r_config: osx_arm64_r_base4.3 artifacts: - libarrow-{no_rc_version}-(h[a-z0-9]+)_0_cpu.conda - - r-arrow-{no_rc_version}-r41(h[a-z0-9]+)_0.conda + - r-arrow-{no_rc_version}-r43(h[a-z0-9]+)_0.conda conda-osx-arm64-cpu-r42: ci: azure From 284dddd129fae072ec3e7ae269520b54c8decdb4 Mon Sep 17 00:00:00 2001 From: James Duong Date: Thu, 28 Sep 2023 06:22:26 -0700 Subject: [PATCH 76/96] GH-37863: [Java] Add typed getters for StructVector (#37916) ### Rationale for this change Add methods for getting child vectors as specific vector subtypes for convenience. ### What changes are included in this PR? Add child getters which let the caller specify the target vector type to StructVector. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. 
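To make the convenience concrete, here is a minimal usage sketch (the field name, allocator setup, and `main` wrapper are illustrative, not part of this change) contrasting the old unchecked cast with the new typed getters:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.complex.StructVector;
import org.apache.arrow.vector.types.Types.MinorType;
import org.apache.arrow.vector.types.pojo.FieldType;

public class TypedGetterExample {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         StructVector struct = StructVector.empty("s", allocator)) {
      struct.addOrGet("id", FieldType.nullable(MinorType.INT.getType()), IntVector.class);

      // Before: callers had to cast the untyped child themselves.
      IntVector byCast = (IntVector) struct.getChild("id");

      // After: the target type is passed in and checked; a mismatch throws
      // ClassCastException with a descriptive message instead of failing later.
      IntVector byName = struct.getChild("id", IntVector.class);
      IntVector byOrdinal = struct.getVectorById(0, IntVector.class);
    }
  }
}
```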
* Closes: #37863 Authored-by: James Duong Signed-off-by: David Li --- .../vector/complex/NonNullableStructVector.java | 12 ++++++++++++ .../org/apache/arrow/vector/TestStructVector.java | 8 ++++++++ 2 files changed, 20 insertions(+) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java index 4da2668121af6..7d724656cdab7 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java @@ -374,6 +374,18 @@ public ValueVector getVectorById(int id) { return getChildByOrdinal(id); } + /** + * Gets a child vector by ordinal position and casts to the specified class. + */ + public <V extends ValueVector> V getVectorById(int id, Class<V> clazz) { + ValueVector untyped = getVectorById(id); + if (clazz.isInstance(untyped)) { + return clazz.cast(untyped); + } + throw new ClassCastException("Id " + id + " had the wrong type. Expected " + clazz.getCanonicalName() + + " but was " + untyped.getClass().getCanonicalName()); + } + @Override public void setValueCount(int valueCount) { for (final ValueVector v : getChildren()) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestStructVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestStructVector.java index b4c30480000c8..552d5752f236f 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestStructVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestStructVector.java @@ -282,4 +282,12 @@ public void testAddChildVectorsWithDuplicatedFieldNamesForConflictPolicyReplace( } } + @Test + public void testTypedGetters() { + try (final StructVector s1 = StructVector.empty("s1", allocator)) { + s1.addOrGet("struct_child", FieldType.nullable(MinorType.INT.getType()), IntVector.class); + assertEquals(IntVector.class, s1.getChild("struct_child", IntVector.class).getClass()); + assertEquals(IntVector.class, s1.getVectorById(0, IntVector.class).getClass()); + } + } } From 26667340f2e72c84107c9be28e68aa88dcb064ff Mon Sep 17 00:00:00 2001 From: James Duong Date: Thu, 28 Sep 2023 06:22:45 -0700 Subject: [PATCH 77/96] GH-37864: [Java] Remove unnecessary throws from OrcReader (#37913) ### Rationale for this change Make OrcReader more friendly to use with try-with-resources blocks and AutoCloseables by removing an unnecessary throws modifier on close(). ### What changes are included in this PR? Removes an unused throws specifier on OrcReader#close(). ### Are these changes tested? Yes. ### Are there any user-facing changes? No. 
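In practice this means an OrcReader in a try-with-resources block no longer forces callers to handle a checked `Exception` that `close()` could never actually throw. A minimal sketch (assuming the `OrcReader(String, BufferAllocator)` constructor; the file path is purely illustrative):

```java
import java.io.IOException;

import org.apache.arrow.adapter.orc.OrcReader;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class OrcCloseExample {
  public static void main(String[] args) throws IOException {
    try (BufferAllocator allocator = new RootAllocator();
         // Before: close() was declared "throws Exception", so this block
         // required a catch or a "throws Exception" on the enclosing method.
         OrcReader reader = new OrcReader("/tmp/data.orc", allocator)) {
      System.out.println("stripes: " + reader.getNumberOfStripes());
    }
  }
}
```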
* Closes: #37864 Authored-by: James Duong Signed-off-by: David Li --- .../src/main/java/org/apache/arrow/adapter/orc/OrcReader.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java b/java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java index b42ddb48433b5..648e17e9c374c 100644 --- a/java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java +++ b/java/adapter/orc/src/main/java/org/apache/arrow/adapter/orc/OrcReader.java @@ -84,7 +84,7 @@ public int getNumberOfStripes() throws IllegalArgumentException { } @Override - public void close() throws Exception { + public void close() { jniWrapper.close(nativeInstanceId); } } From 019d06df56ba3215148465554948a4d93fd9c707 Mon Sep 17 00:00:00 2001 From: Laurent Goujon Date: Thu, 28 Sep 2023 07:50:51 -0700 Subject: [PATCH 78/96] GH-37893: [Java] Move Types.proto in a subfolder (#37894) ### Rationale for this change Types.proto is a Gandiva protobuf definition used by Gandiva to exchange data between Java and C++. This file is packaged automatically within the arrow-gandiva jar, but because of its generic name it may cause conflicts in other people's projects. ### What changes are included in this PR? Move `Types.proto` into `gandiva/types.proto` (also matching the convention of using lowercase filenames) so that it becomes less likely to cause a conflict. ### Are these changes tested? The change should have no impact on the feature itself. The resulting jar was checked manually to confirm that `types.proto` is no longer located at the root of the archive. ### Are there any user-facing changes? No user-facing changes, but developers who were referencing Gandiva's `Types.proto` in their project may have to change their `import` directive.
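For downstream projects, the adjustment looks roughly like the following (a hypothetical consumer `.proto` file; `MyExpression` and its field are made up for illustration, while `gandiva.types.TreeNode` is one of the messages defined in the moved file):

```proto
// my_exprs.proto -- hypothetical downstream definition
syntax = "proto3";

// Before this change: import "Types.proto";
// (a generic name that could collide with another Types.proto on the proto path)
import "gandiva/types.proto";

message MyExpression {
  // The package moved from "types" to "gandiva.types", so qualified
  // references change as well (previously: types.TreeNode).
  gandiva.types.TreeNode root = 1;
}
```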
**This PR includes breaking changes to public APIs.** * Closes: #37893 Authored-by: Laurent Goujon Signed-off-by: David Li --- java/gandiva/CMakeLists.txt | 10 +- .../{Types.proto => gandiva/types.proto} | 4 +- .../main/cpp/expression_registry_helper.cc | 87 +++++----- java/gandiva/src/main/cpp/jni_common.cc | 152 +++++++++--------- 4 files changed, 126 insertions(+), 127 deletions(-) rename java/gandiva/proto/{Types.proto => gandiva/types.proto} (99%) diff --git a/java/gandiva/CMakeLists.txt b/java/gandiva/CMakeLists.txt index 629ab2fb347d8..2aa8d92959e42 100644 --- a/java/gandiva/CMakeLists.txt +++ b/java/gandiva/CMakeLists.txt @@ -29,21 +29,21 @@ add_jar(arrow_java_jni_gandiva_jar arrow_java_jni_gandiva_headers) set(GANDIVA_PROTO_OUTPUT_DIR ${CMAKE_CURRENT_BINARY_DIR}) -set(GANDIVA_PROTO_OUTPUT_FILES "${GANDIVA_PROTO_OUTPUT_DIR}/Types.pb.cc" - "${GANDIVA_PROTO_OUTPUT_DIR}/Types.pb.h") +set(GANDIVA_PROTO_OUTPUT_FILES "${GANDIVA_PROTO_OUTPUT_DIR}/gandiva/types.pb.cc" + "${GANDIVA_PROTO_OUTPUT_DIR}/gandiva/types.pb.h") set_source_files_properties(${GANDIVA_PROTO_OUTPUT_FILES} PROPERTIES GENERATED TRUE) set(GANDIVA_PROTO_DIR ${CMAKE_CURRENT_SOURCE_DIR}/proto) -get_filename_component(GANDIVA_PROTO_FILE_ABSOLUTE ${GANDIVA_PROTO_DIR}/Types.proto - ABSOLUTE) +get_filename_component(GANDIVA_PROTO_FILE_ABSOLUTE + ${GANDIVA_PROTO_DIR}/gandiva/types.proto ABSOLUTE) find_package(Protobuf REQUIRED) add_custom_command(OUTPUT ${GANDIVA_PROTO_OUTPUT_FILES} COMMAND protobuf::protoc --proto_path ${GANDIVA_PROTO_DIR} --cpp_out ${GANDIVA_PROTO_OUTPUT_DIR} ${GANDIVA_PROTO_FILE_ABSOLUTE} DEPENDS ${GANDIVA_PROTO_FILE_ABSOLUTE} - COMMENT "Running Protobuf compiler on Types.proto" + COMMENT "Running Protobuf compiler on gandiva/types.proto" VERBATIM) add_custom_target(garrow_java_jni_gandiva_proto ALL DEPENDS ${GANDIVA_PROTO_OUTPUT_FILES}) diff --git a/java/gandiva/proto/Types.proto b/java/gandiva/proto/gandiva/types.proto similarity index 99% rename from java/gandiva/proto/Types.proto rename to java/gandiva/proto/gandiva/types.proto index eb0d996b92e63..4ce342681d614 100644 --- a/java/gandiva/proto/Types.proto +++ b/java/gandiva/proto/gandiva/types.proto @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. 
-syntax = "proto2"; -package types; +syntax = "proto3"; +package gandiva.types; option java_package = "org.apache.arrow.gandiva.ipc"; option java_outer_classname = "GandivaTypes"; diff --git a/java/gandiva/src/main/cpp/expression_registry_helper.cc b/java/gandiva/src/main/cpp/expression_registry_helper.cc index 6765df3b9727f..66b97c8b9ef44 100644 --- a/java/gandiva/src/main/cpp/expression_registry_helper.cc +++ b/java/gandiva/src/main/cpp/expression_registry_helper.cc @@ -20,121 +20,120 @@ #include #include #include - -#include "Types.pb.h" -#include "org_apache_arrow_gandiva_evaluator_ExpressionRegistryJniHelper.h" +#include +#include using gandiva::DataTypePtr; using gandiva::ExpressionRegistry; -types::TimeUnit MapTimeUnit(arrow::TimeUnit::type& unit) { +gandiva::types::TimeUnit MapTimeUnit(arrow::TimeUnit::type& unit) { switch (unit) { case arrow::TimeUnit::MILLI: - return types::TimeUnit::MILLISEC; + return gandiva::types::TimeUnit::MILLISEC; case arrow::TimeUnit::SECOND: - return types::TimeUnit::SEC; + return gandiva::types::TimeUnit::SEC; case arrow::TimeUnit::MICRO: - return types::TimeUnit::MICROSEC; + return gandiva::types::TimeUnit::MICROSEC; case arrow::TimeUnit::NANO: - return types::TimeUnit::NANOSEC; + return gandiva::types::TimeUnit::NANOSEC; } // satisfy gcc. should be unreachable. - return types::TimeUnit::SEC; + return gandiva::types::TimeUnit::SEC; } -void ArrowToProtobuf(DataTypePtr type, types::ExtGandivaType* gandiva_data_type) { +void ArrowToProtobuf(DataTypePtr type, gandiva::types::ExtGandivaType* gandiva_data_type) { switch (type->id()) { case arrow::Type::BOOL: - gandiva_data_type->set_type(types::GandivaType::BOOL); + gandiva_data_type->set_type(gandiva::types::GandivaType::BOOL); break; case arrow::Type::UINT8: - gandiva_data_type->set_type(types::GandivaType::UINT8); + gandiva_data_type->set_type(gandiva::types::GandivaType::UINT8); break; case arrow::Type::INT8: - gandiva_data_type->set_type(types::GandivaType::INT8); + gandiva_data_type->set_type(gandiva::types::GandivaType::INT8); break; case arrow::Type::UINT16: - gandiva_data_type->set_type(types::GandivaType::UINT16); + gandiva_data_type->set_type(gandiva::types::GandivaType::UINT16); break; case arrow::Type::INT16: - gandiva_data_type->set_type(types::GandivaType::INT16); + gandiva_data_type->set_type(gandiva::types::GandivaType::INT16); break; case arrow::Type::UINT32: - gandiva_data_type->set_type(types::GandivaType::UINT32); + gandiva_data_type->set_type(gandiva::types::GandivaType::UINT32); break; case arrow::Type::INT32: - gandiva_data_type->set_type(types::GandivaType::INT32); + gandiva_data_type->set_type(gandiva::types::GandivaType::INT32); break; case arrow::Type::UINT64: - gandiva_data_type->set_type(types::GandivaType::UINT64); + gandiva_data_type->set_type(gandiva::types::GandivaType::UINT64); break; case arrow::Type::INT64: - gandiva_data_type->set_type(types::GandivaType::INT64); + gandiva_data_type->set_type(gandiva::types::GandivaType::INT64); break; case arrow::Type::HALF_FLOAT: - gandiva_data_type->set_type(types::GandivaType::HALF_FLOAT); + gandiva_data_type->set_type(gandiva::types::GandivaType::HALF_FLOAT); break; case arrow::Type::FLOAT: - gandiva_data_type->set_type(types::GandivaType::FLOAT); + gandiva_data_type->set_type(gandiva::types::GandivaType::FLOAT); break; case arrow::Type::DOUBLE: - gandiva_data_type->set_type(types::GandivaType::DOUBLE); + gandiva_data_type->set_type(gandiva::types::GandivaType::DOUBLE); break; case arrow::Type::STRING: - 
gandiva_data_type->set_type(types::GandivaType::UTF8); + gandiva_data_type->set_type(gandiva::types::GandivaType::UTF8); break; case arrow::Type::BINARY: - gandiva_data_type->set_type(types::GandivaType::BINARY); + gandiva_data_type->set_type(gandiva::types::GandivaType::BINARY); break; case arrow::Type::DATE32: - gandiva_data_type->set_type(types::GandivaType::DATE32); + gandiva_data_type->set_type(gandiva::types::GandivaType::DATE32); break; case arrow::Type::DATE64: - gandiva_data_type->set_type(types::GandivaType::DATE64); + gandiva_data_type->set_type(gandiva::types::GandivaType::DATE64); break; case arrow::Type::TIMESTAMP: { - gandiva_data_type->set_type(types::GandivaType::TIMESTAMP); + gandiva_data_type->set_type(gandiva::types::GandivaType::TIMESTAMP); std::shared_ptr cast_time_stamp_type = std::dynamic_pointer_cast(type); arrow::TimeUnit::type unit = cast_time_stamp_type->unit(); - types::TimeUnit time_unit = MapTimeUnit(unit); + gandiva::types::TimeUnit time_unit = MapTimeUnit(unit); gandiva_data_type->set_timeunit(time_unit); break; } case arrow::Type::TIME32: { - gandiva_data_type->set_type(types::GandivaType::TIME32); + gandiva_data_type->set_type(gandiva::types::GandivaType::TIME32); std::shared_ptr cast_time_32_type = std::dynamic_pointer_cast(type); arrow::TimeUnit::type unit = cast_time_32_type->unit(); - types::TimeUnit time_unit = MapTimeUnit(unit); + gandiva::types::TimeUnit time_unit = MapTimeUnit(unit); gandiva_data_type->set_timeunit(time_unit); break; } case arrow::Type::TIME64: { - gandiva_data_type->set_type(types::GandivaType::TIME32); + gandiva_data_type->set_type(gandiva::types::GandivaType::TIME32); std::shared_ptr cast_time_64_type = std::dynamic_pointer_cast(type); arrow::TimeUnit::type unit = cast_time_64_type->unit(); - types::TimeUnit time_unit = MapTimeUnit(unit); + gandiva::types::TimeUnit time_unit = MapTimeUnit(unit); gandiva_data_type->set_timeunit(time_unit); break; } case arrow::Type::NA: - gandiva_data_type->set_type(types::GandivaType::NONE); + gandiva_data_type->set_type(gandiva::types::GandivaType::NONE); break; case arrow::Type::DECIMAL: { - gandiva_data_type->set_type(types::GandivaType::DECIMAL); + gandiva_data_type->set_type(gandiva::types::GandivaType::DECIMAL); gandiva_data_type->set_precision(0); gandiva_data_type->set_scale(0); break; } case arrow::Type::INTERVAL_MONTHS: - gandiva_data_type->set_type(types::GandivaType::INTERVAL); - gandiva_data_type->set_intervaltype(types::IntervalType::YEAR_MONTH); + gandiva_data_type->set_type(gandiva::types::GandivaType::INTERVAL); + gandiva_data_type->set_intervaltype(gandiva::types::IntervalType::YEAR_MONTH); break; case arrow::Type::INTERVAL_DAY_TIME: - gandiva_data_type->set_type(types::GandivaType::INTERVAL); - gandiva_data_type->set_intervaltype(types::IntervalType::DAY_TIME); + gandiva_data_type->set_type(gandiva::types::GandivaType::INTERVAL); + gandiva_data_type->set_intervaltype(gandiva::types::IntervalType::DAY_TIME); break; default: // un-supported types. 
test ensures that @@ -146,10 +145,10 @@ void ArrowToProtobuf(DataTypePtr type, types::ExtGandivaType* gandiva_data_type) JNIEXPORT jbyteArray JNICALL Java_org_apache_arrow_gandiva_evaluator_ExpressionRegistryJniHelper_getGandivaSupportedDataTypes( // NOLINT JNIEnv* env, jobject types_helper) { - types::GandivaDataTypes gandiva_data_types; + gandiva::types::GandivaDataTypes gandiva_data_types; auto supported_types = ExpressionRegistry::supported_types(); for (auto const& type : supported_types) { - types::ExtGandivaType* gandiva_data_type = gandiva_data_types.add_datatype(); + gandiva::types::ExtGandivaType* gandiva_data_type = gandiva_data_types.add_datatype(); ArrowToProtobuf(type, gandiva_data_type); } auto size = static_cast(gandiva_data_types.ByteSizeLong()); @@ -169,15 +168,15 @@ JNIEXPORT jbyteArray JNICALL Java_org_apache_arrow_gandiva_evaluator_ExpressionRegistryJniHelper_getGandivaSupportedFunctions( // NOLINT JNIEnv* env, jobject types_helper) { ExpressionRegistry expr_registry; - types::GandivaFunctions gandiva_functions; + gandiva::types::GandivaFunctions gandiva_functions; for (auto function = expr_registry.function_signature_begin(); function != expr_registry.function_signature_end(); function++) { - types::FunctionSignature* function_signature = gandiva_functions.add_function(); + gandiva::types::FunctionSignature* function_signature = gandiva_functions.add_function(); function_signature->set_name((*function).base_name()); - types::ExtGandivaType* return_type = function_signature->mutable_returntype(); + gandiva::types::ExtGandivaType* return_type = function_signature->mutable_returntype(); ArrowToProtobuf((*function).ret_type(), return_type); for (auto& param_type : (*function).param_types()) { - types::ExtGandivaType* proto_param_type = function_signature->add_paramtypes(); + gandiva::types::ExtGandivaType* proto_param_type = function_signature->add_paramtypes(); ArrowToProtobuf(param_type, proto_param_type); } } diff --git a/java/gandiva/src/main/cpp/jni_common.cc b/java/gandiva/src/main/cpp/jni_common.cc index 43db266ff56f5..a5dff9981ce89 100644 --- a/java/gandiva/src/main/cpp/jni_common.cc +++ b/java/gandiva/src/main/cpp/jni_common.cc @@ -35,13 +35,13 @@ #include #include #include +#include +#include -#include "Types.pb.h" #include "config_holder.h" #include "env_helper.h" #include "id_to_module_map.h" #include "module_holder.h" -#include "org_apache_arrow_gandiva_evaluator_JniWrapper.h" using gandiva::ConditionPtr; using gandiva::DataTypePtr; @@ -65,7 +65,7 @@ using gandiva::FilterHolder; using gandiva::ProjectorHolder; // forward declarations -NodePtr ProtoTypeToNode(const types::TreeNode& node); +NodePtr ProtoTypeToNode(const gandiva::types::TreeNode& node); static jint JNI_VERSION = JNI_VERSION_1_6; @@ -131,11 +131,11 @@ void JNI_OnUnload(JavaVM* vm, void* reserved) { env->DeleteGlobalRef(vector_expander_ret_class_); } -DataTypePtr ProtoTypeToTime32(const types::ExtGandivaType& ext_type) { +DataTypePtr ProtoTypeToTime32(const gandiva::types::ExtGandivaType& ext_type) { switch (ext_type.timeunit()) { - case types::SEC: + case gandiva::types::SEC: return arrow::time32(arrow::TimeUnit::SECOND); - case types::MILLISEC: + case gandiva::types::MILLISEC: return arrow::time32(arrow::TimeUnit::MILLI); default: std::cerr << "Unknown time unit: " << ext_type.timeunit() << " for time32\n"; @@ -143,11 +143,11 @@ DataTypePtr ProtoTypeToTime32(const types::ExtGandivaType& ext_type) { } } -DataTypePtr ProtoTypeToTime64(const types::ExtGandivaType& ext_type) { +DataTypePtr 
ProtoTypeToTime64(const gandiva::types::ExtGandivaType& ext_type) { switch (ext_type.timeunit()) { - case types::MICROSEC: + case gandiva::types::MICROSEC: return arrow::time64(arrow::TimeUnit::MICRO); - case types::NANOSEC: + case gandiva::types::NANOSEC: return arrow::time64(arrow::TimeUnit::NANO); default: std::cerr << "Unknown time unit: " << ext_type.timeunit() << " for time64\n"; @@ -155,15 +155,15 @@ DataTypePtr ProtoTypeToTime64(const types::ExtGandivaType& ext_type) { } } -DataTypePtr ProtoTypeToTimestamp(const types::ExtGandivaType& ext_type) { +DataTypePtr ProtoTypeToTimestamp(const gandiva::types::ExtGandivaType& ext_type) { switch (ext_type.timeunit()) { - case types::SEC: + case gandiva::types::SEC: return arrow::timestamp(arrow::TimeUnit::SECOND); - case types::MILLISEC: + case gandiva::types::MILLISEC: return arrow::timestamp(arrow::TimeUnit::MILLI); - case types::MICROSEC: + case gandiva::types::MICROSEC: return arrow::timestamp(arrow::TimeUnit::MICRO); - case types::NANOSEC: + case gandiva::types::NANOSEC: return arrow::timestamp(arrow::TimeUnit::NANO); default: std::cerr << "Unknown time unit: " << ext_type.timeunit() << " for timestamp\n"; @@ -171,11 +171,11 @@ DataTypePtr ProtoTypeToTimestamp(const types::ExtGandivaType& ext_type) { } } -DataTypePtr ProtoTypeToInterval(const types::ExtGandivaType& ext_type) { +DataTypePtr ProtoTypeToInterval(const gandiva::types::ExtGandivaType& ext_type) { switch (ext_type.intervaltype()) { - case types::YEAR_MONTH: + case gandiva::types::YEAR_MONTH: return arrow::month_interval(); - case types::DAY_TIME: + case gandiva::types::DAY_TIME: return arrow::day_time_interval(); default: std::cerr << "Unknown interval type: " << ext_type.intervaltype() << "\n"; @@ -183,59 +183,59 @@ DataTypePtr ProtoTypeToInterval(const types::ExtGandivaType& ext_type) { } } -DataTypePtr ProtoTypeToDataType(const types::ExtGandivaType& ext_type) { +DataTypePtr ProtoTypeToDataType(const gandiva::types::ExtGandivaType& ext_type) { switch (ext_type.type()) { - case types::NONE: + case gandiva::types::NONE: return arrow::null(); - case types::BOOL: + case gandiva::types::BOOL: return arrow::boolean(); - case types::UINT8: + case gandiva::types::UINT8: return arrow::uint8(); - case types::INT8: + case gandiva::types::INT8: return arrow::int8(); - case types::UINT16: + case gandiva::types::UINT16: return arrow::uint16(); - case types::INT16: + case gandiva::types::INT16: return arrow::int16(); - case types::UINT32: + case gandiva::types::UINT32: return arrow::uint32(); - case types::INT32: + case gandiva::types::INT32: return arrow::int32(); - case types::UINT64: + case gandiva::types::UINT64: return arrow::uint64(); - case types::INT64: + case gandiva::types::INT64: return arrow::int64(); - case types::HALF_FLOAT: + case gandiva::types::HALF_FLOAT: return arrow::float16(); - case types::FLOAT: + case gandiva::types::FLOAT: return arrow::float32(); - case types::DOUBLE: + case gandiva::types::DOUBLE: return arrow::float64(); - case types::UTF8: + case gandiva::types::UTF8: return arrow::utf8(); - case types::BINARY: + case gandiva::types::BINARY: return arrow::binary(); - case types::DATE32: + case gandiva::types::DATE32: return arrow::date32(); - case types::DATE64: + case gandiva::types::DATE64: return arrow::date64(); - case types::DECIMAL: + case gandiva::types::DECIMAL: // TODO: error handling return arrow::decimal(ext_type.precision(), ext_type.scale()); - case types::TIME32: + case gandiva::types::TIME32: return ProtoTypeToTime32(ext_type); - case 
types::TIME64: + case gandiva::types::TIME64: return ProtoTypeToTime64(ext_type); - case types::TIMESTAMP: + case gandiva::types::TIMESTAMP: return ProtoTypeToTimestamp(ext_type); - case types::INTERVAL: + case gandiva::types::INTERVAL: return ProtoTypeToInterval(ext_type); - case types::FIXED_SIZE_BINARY: - case types::LIST: - case types::STRUCT: - case types::UNION: - case types::DICTIONARY: - case types::MAP: + case gandiva::types::FIXED_SIZE_BINARY: + case gandiva::types::LIST: + case gandiva::types::STRUCT: + case gandiva::types::UNION: + case gandiva::types::DICTIONARY: + case gandiva::types::MAP: std::cerr << "Unhandled data type: " << ext_type.type() << "\n"; return nullptr; @@ -245,7 +245,7 @@ DataTypePtr ProtoTypeToDataType(const types::ExtGandivaType& ext_type) { } } -FieldPtr ProtoTypeToField(const types::Field& f) { +FieldPtr ProtoTypeToField(const gandiva::types::Field& f) { const std::string& name = f.name(); DataTypePtr type = ProtoTypeToDataType(f.type()); bool nullable = true; @@ -256,7 +256,7 @@ FieldPtr ProtoTypeToField(const types::Field& f) { return field(name, type, nullable); } -NodePtr ProtoTypeToFieldNode(const types::FieldNode& node) { +NodePtr ProtoTypeToFieldNode(const gandiva::types::FieldNode& node) { FieldPtr field_ptr = ProtoTypeToField(node.field()); if (field_ptr == nullptr) { std::cerr << "Unable to create field node from protobuf\n"; @@ -266,12 +266,12 @@ NodePtr ProtoTypeToFieldNode(const types::FieldNode& node) { return TreeExprBuilder::MakeField(field_ptr); } -NodePtr ProtoTypeToFnNode(const types::FunctionNode& node) { +NodePtr ProtoTypeToFnNode(const gandiva::types::FunctionNode& node) { const std::string& name = node.functionname(); NodeVector children; for (int i = 0; i < node.inargs_size(); i++) { - const types::TreeNode& arg = node.inargs(i); + const gandiva::types::TreeNode& arg = node.inargs(i); NodePtr n = ProtoTypeToNode(arg); if (n == nullptr) { @@ -291,7 +291,7 @@ NodePtr ProtoTypeToFnNode(const types::FunctionNode& node) { return TreeExprBuilder::MakeFunction(name, children, return_type); } -NodePtr ProtoTypeToIfNode(const types::IfNode& node) { +NodePtr ProtoTypeToIfNode(const gandiva::types::IfNode& node) { NodePtr cond = ProtoTypeToNode(node.cond()); if (cond == nullptr) { std::cerr << "Unable to create cond node for if node\n"; @@ -319,11 +319,11 @@ NodePtr ProtoTypeToIfNode(const types::IfNode& node) { return TreeExprBuilder::MakeIf(cond, then_node, else_node, return_type); } -NodePtr ProtoTypeToAndNode(const types::AndNode& node) { +NodePtr ProtoTypeToAndNode(const gandiva::types::AndNode& node) { NodeVector children; for (int i = 0; i < node.args_size(); i++) { - const types::TreeNode& arg = node.args(i); + const gandiva::types::TreeNode& arg = node.args(i); NodePtr n = ProtoTypeToNode(arg); if (n == nullptr) { @@ -335,11 +335,11 @@ NodePtr ProtoTypeToAndNode(const types::AndNode& node) { return TreeExprBuilder::MakeAnd(children); } -NodePtr ProtoTypeToOrNode(const types::OrNode& node) { +NodePtr ProtoTypeToOrNode(const gandiva::types::OrNode& node) { NodeVector children; for (int i = 0; i < node.args_size(); i++) { - const types::TreeNode& arg = node.args(i); + const gandiva::types::TreeNode& arg = node.args(i); NodePtr n = ProtoTypeToNode(arg); if (n == nullptr) { @@ -351,7 +351,7 @@ NodePtr ProtoTypeToOrNode(const types::OrNode& node) { return TreeExprBuilder::MakeOr(children); } -NodePtr ProtoTypeToInNode(const types::InNode& node) { +NodePtr ProtoTypeToInNode(const gandiva::types::InNode& node) { NodePtr field = 
ProtoTypeToNode(node.node()); if (node.has_intvalues()) { @@ -417,7 +417,7 @@ NodePtr ProtoTypeToInNode(const types::InNode& node) { return nullptr; } -NodePtr ProtoTypeToNullNode(const types::NullNode& node) { +NodePtr ProtoTypeToNullNode(const gandiva::types::NullNode& node) { DataTypePtr data_type = ProtoTypeToDataType(node.type()); if (data_type == nullptr) { std::cerr << "Unknown type " << data_type->ToString() << " for null node\n"; @@ -427,7 +427,7 @@ NodePtr ProtoTypeToNullNode(const types::NullNode& node) { return TreeExprBuilder::MakeNull(data_type); } -NodePtr ProtoTypeToNode(const types::TreeNode& node) { +NodePtr ProtoTypeToNode(const gandiva::types::TreeNode& node) { if (node.has_fieldnode()) { return ProtoTypeToFieldNode(node.fieldnode()); } @@ -494,7 +494,7 @@ NodePtr ProtoTypeToNode(const types::TreeNode& node) { return nullptr; } -ExpressionPtr ProtoTypeToExpression(const types::ExpressionRoot& root) { +ExpressionPtr ProtoTypeToExpression(const gandiva::types::ExpressionRoot& root) { NodePtr root_node = ProtoTypeToNode(root.root()); if (root_node == nullptr) { std::cerr << "Unable to create expression node from expression protobuf\n"; @@ -510,7 +510,7 @@ ExpressionPtr ProtoTypeToExpression(const types::ExpressionRoot& root) { return TreeExprBuilder::MakeExpression(root_node, field); } -ConditionPtr ProtoTypeToCondition(const types::Condition& condition) { +ConditionPtr ProtoTypeToCondition(const gandiva::types::Condition& condition) { NodePtr root_node = ProtoTypeToNode(condition.root()); if (root_node == nullptr) { return nullptr; @@ -519,7 +519,7 @@ ConditionPtr ProtoTypeToCondition(const types::Condition& condition) { return TreeExprBuilder::MakeCondition(root_node); } -SchemaPtr ProtoTypeToSchema(const types::Schema& schema) { +SchemaPtr ProtoTypeToSchema(const gandiva::types::Schema& schema) { std::vector fields; for (int i = 0; i < schema.columns_size(); i++) { @@ -608,11 +608,11 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_gandiva_evaluator_JniWrapper_build std::shared_ptr projector; std::shared_ptr holder; - types::Schema schema; + gandiva::types::Schema schema; jsize schema_len = env->GetArrayLength(schema_arr); jbyte* schema_bytes = env->GetByteArrayElements(schema_arr, 0); - types::ExpressionList exprs; + gandiva::types::ExpressionList exprs; jsize exprs_len = env->GetArrayLength(exprs_arr); jbyte* exprs_bytes = env->GetByteArrayElements(exprs_arr, 0); @@ -643,7 +643,7 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_gandiva_evaluator_JniWrapper_build goto err_out; } - // convert types::Schema to arrow::Schema + // convert gandiva::types::Schema to arrow::Schema schema_ptr = ProtoTypeToSchema(schema); if (schema_ptr == nullptr) { ss << "Unable to construct arrow schema object from schema protobuf\n"; @@ -666,13 +666,13 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_gandiva_evaluator_JniWrapper_build } switch (selection_vector_type) { - case types::SV_NONE: + case gandiva::types::SV_NONE: mode = gandiva::SelectionVector::MODE_NONE; break; - case types::SV_INT16: + case gandiva::types::SV_INT16: mode = gandiva::SelectionVector::MODE_UINT16; break; - case types::SV_INT32: + case gandiva::types::SV_INT32: mode = gandiva::SelectionVector::MODE_UINT32; break; } @@ -809,17 +809,17 @@ Java_org_apache_arrow_gandiva_evaluator_JniWrapper_evaluateProjector( reinterpret_cast(sel_vec_addr), sel_vec_size); int output_row_count = 0; switch (sel_vec_type) { - case types::SV_NONE: { + case gandiva::types::SV_NONE: { output_row_count = num_rows; break; } - case 
types::SV_INT16: { + case gandiva::types::SV_INT16: { status = gandiva::SelectionVector::MakeImmutableInt16( sel_vec_rows, selection_buffer, &selection_vector); output_row_count = sel_vec_rows; break; } - case types::SV_INT32: { + case gandiva::types::SV_INT32: { status = gandiva::SelectionVector::MakeImmutableInt32( sel_vec_rows, selection_buffer, &selection_vector); output_row_count = sel_vec_rows; @@ -909,11 +909,11 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_gandiva_evaluator_JniWrapper_build std::shared_ptr filter; std::shared_ptr holder; - types::Schema schema; + gandiva::types::Schema schema; jsize schema_len = env->GetArrayLength(schema_arr); jbyte* schema_bytes = env->GetByteArrayElements(schema_arr, 0); - types::Condition condition; + gandiva::types::Condition condition; jsize condition_len = env->GetArrayLength(condition_arr); jbyte* condition_bytes = env->GetByteArrayElements(condition_arr, 0); @@ -943,7 +943,7 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_gandiva_evaluator_JniWrapper_build goto err_out; } - // convert types::Schema to arrow::Schema + // convert gandiva::types::Schema to arrow::Schema schema_ptr = ProtoTypeToSchema(schema); if (schema_ptr == nullptr) { ss << "Unable to construct arrow schema object from schema protobuf\n"; @@ -1008,15 +1008,15 @@ JNIEXPORT jint JNICALL Java_org_apache_arrow_gandiva_evaluator_JniWrapper_evalua } auto selection_vector_type = - static_cast(jselection_vector_type); + static_cast(jselection_vector_type); auto out_buffer = std::make_shared( reinterpret_cast(out_buf_addr), out_buf_size); switch (selection_vector_type) { - case types::SV_INT16: + case gandiva::types::SV_INT16: status = gandiva::SelectionVector::MakeInt16(num_rows, out_buffer, &selection_vector); break; - case types::SV_INT32: + case gandiva::types::SV_INT32: status = gandiva::SelectionVector::MakeInt32(num_rows, out_buffer, &selection_vector); break; From 853d8491addff3a10fc40950823a2942bb9fbf98 Mon Sep 17 00:00:00 2001 From: Gang Wu Date: Fri, 29 Sep 2023 00:07:16 +0800 Subject: [PATCH 79/96] GH-34950: [C++][Parquet] Support encryption for page index (#36574) ### Rationale for this change Parquet modular encryption requires the page index to be encrypted if its column chunk is encrypted; this feature has been missing until now. ### What changes are included in this PR? Support both encryption and decryption of the Parquet page index. ### Are these changes tested? Added round-trip tests in write_configurations_test.cc and read_configurations_test.cc. ### Are there any user-facing changes? No.
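As a rough end-to-end sketch of how this surfaces in the API (the schema setup, key literal, and file plumbing are illustrative assumptions, not code from this PR), encrypting a file and reading its page index back might look like:

```cpp
#include <memory>
#include <string>
#include <utility>

#include "arrow/io/api.h"
#include "parquet/api/reader.h"
#include "parquet/api/writer.h"
#include "parquet/encryption/encryption.h"
#include "parquet/page_index.h"

// Sketch: write a footer-key-encrypted file (page index included), read it back.
void RoundTrip(std::shared_ptr<arrow::io::OutputStream> sink,
               std::shared_ptr<arrow::io::RandomAccessFile> source,
               std::shared_ptr<parquet::schema::GroupNode> schema) {
  const std::string kFooterKey = "0123456789012345";  // illustrative 16-byte AES key

  // Write side: with encryption configured, the page index written by
  // enable_write_page_index() is now encrypted as well.
  std::shared_ptr<parquet::FileEncryptionProperties> encryption =
      parquet::FileEncryptionProperties::Builder(kFooterKey).build();
  std::shared_ptr<parquet::WriterProperties> writer_props =
      parquet::WriterProperties::Builder()
          .encryption(encryption)
          ->enable_write_page_index()
          ->build();
  auto writer = parquet::ParquetFileWriter::Open(sink, schema, writer_props);
  // ... append row groups and column data here ...
  writer->Close();

  // Read side: with matching decryption properties, column/offset indexes are
  // decrypted transparently; without them, touching an encrypted column's
  // page index throws (as the read_configurations_test.cc round trip checks).
  std::shared_ptr<parquet::FileDecryptionProperties> decryption =
      parquet::FileDecryptionProperties::Builder().footer_key(kFooterKey)->build();
  parquet::ReaderProperties reader_props = parquet::default_reader_properties();
  reader_props.file_decryption_properties(std::move(decryption));
  auto reader = parquet::ParquetFileReader::Open(source, reader_props);
  auto page_index_reader = reader->GetPageIndexReader();
  auto rg0 = page_index_reader->RowGroup(0);
  auto column_index = rg0->GetColumnIndex(0);  // per-page stats, decrypted
  auto offset_index = rg0->GetOffsetIndex(0);  // page locations, decrypted
}
```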
* Closes: #34950 Authored-by: Gang Wu Signed-off-by: Antoine Pitrou --- cpp/src/parquet/column_reader.cc | 8 +- cpp/src/parquet/column_writer.cc | 4 +- .../parquet/encryption/encryption_internal.h | 2 + .../encryption/internal_file_decryptor.cc | 55 ++++++ .../encryption/internal_file_decryptor.h | 13 ++ .../encryption/read_configurations_test.cc | 58 ++++-- .../encryption/test_encryption_util.cc | 178 +++++++++++++++++- .../parquet/encryption/test_encryption_util.h | 16 +- cpp/src/parquet/encryption/type_fwd.h | 28 +++ .../encryption/write_configurations_test.cc | 2 +- cpp/src/parquet/file_reader.cc | 34 +--- cpp/src/parquet/file_writer.cc | 6 +- cpp/src/parquet/metadata.cc | 7 +- cpp/src/parquet/metadata.h | 8 +- cpp/src/parquet/page_index.cc | 105 +++++++---- cpp/src/parquet/page_index.h | 26 +-- cpp/src/parquet/thrift_internal.h | 7 +- cpp/src/parquet/type_fwd.h | 3 + 18 files changed, 439 insertions(+), 121 deletions(-) create mode 100644 cpp/src/parquet/encryption/type_fwd.h diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc index 6fe1ce9da60fe..fa013dd2ea583 100644 --- a/cpp/src/parquet/column_reader.cc +++ b/cpp/src/parquet/column_reader.cc @@ -363,10 +363,8 @@ void SerializedPageReader::UpdateDecryption(const std::shared_ptr& de int8_t module_type, std::string* page_aad) { ARROW_DCHECK(decryptor != nullptr); if (crypto_ctx_.start_decrypt_with_dictionary_page) { - std::string aad = encryption::CreateModuleAad( - decryptor->file_aad(), module_type, crypto_ctx_.row_group_ordinal, - crypto_ctx_.column_ordinal, kNonPageOrdinal); - decryptor->UpdateAad(aad); + UpdateDecryptor(decryptor, crypto_ctx_.row_group_ordinal, crypto_ctx_.column_ordinal, + module_type); } else { encryption::QuickUpdatePageAad(page_ordinal_, page_aad); decryptor->UpdateAad(*page_aad); @@ -449,7 +447,7 @@ std::shared_ptr SerializedPageReader::NextPage() { current_page_header_ = format::PageHeader(); deserializer.DeserializeMessage(reinterpret_cast(view.data()), &header_size, ¤t_page_header_, - crypto_ctx_.meta_decryptor); + crypto_ctx_.meta_decryptor.get()); break; } catch (std::exception& e) { // Failed to deserialize. 
Double the allowed page header size and try again diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc index ae9216ba7c312..a0aedeee9e968 100644 --- a/cpp/src/parquet/column_writer.cc +++ b/cpp/src/parquet/column_writer.cc @@ -330,7 +330,7 @@ class SerializedPageWriter : public PageWriter { UpdateEncryption(encryption::kDictionaryPageHeader); } const int64_t header_size = - thrift_serializer_->Serialize(&page_header, sink_.get(), meta_encryptor_); + thrift_serializer_->Serialize(&page_header, sink_.get(), meta_encryptor_.get()); PARQUET_THROW_NOT_OK(sink_->Write(output_data_buffer, output_data_len)); @@ -422,7 +422,7 @@ class SerializedPageWriter : public PageWriter { UpdateEncryption(encryption::kDataPageHeader); } const int64_t header_size = - thrift_serializer_->Serialize(&page_header, sink_.get(), meta_encryptor_); + thrift_serializer_->Serialize(&page_header, sink_.get(), meta_encryptor_.get()); PARQUET_THROW_NOT_OK(sink_->Write(output_data_buffer, output_data_len)); /// Collect page index diff --git a/cpp/src/parquet/encryption/encryption_internal.h b/cpp/src/parquet/encryption/encryption_internal.h index 4ed5b5cf61243..77921d8731d25 100644 --- a/cpp/src/parquet/encryption/encryption_internal.h +++ b/cpp/src/parquet/encryption/encryption_internal.h @@ -40,6 +40,8 @@ constexpr int8_t kDataPageHeader = 4; constexpr int8_t kDictionaryPageHeader = 5; constexpr int8_t kColumnIndex = 6; constexpr int8_t kOffsetIndex = 7; +constexpr int8_t kBloomFilterHeader = 8; +constexpr int8_t kBloomFilterBitset = 9; /// Performs AES encryption operations with GCM or CTR ciphers. class AesEncryptor { diff --git a/cpp/src/parquet/encryption/internal_file_decryptor.cc b/cpp/src/parquet/encryption/internal_file_decryptor.cc index 87bfc2bd12047..19e4845c8732d 100644 --- a/cpp/src/parquet/encryption/internal_file_decryptor.cc +++ b/cpp/src/parquet/encryption/internal_file_decryptor.cc @@ -16,8 +16,10 @@ // under the License. #include "parquet/encryption/internal_file_decryptor.h" +#include "arrow/util/logging.h" #include "parquet/encryption/encryption.h" #include "parquet/encryption/encryption_internal.h" +#include "parquet/metadata.h" namespace parquet { @@ -215,4 +217,57 @@ std::shared_ptr InternalFileDecryptor::GetColumnDecryptor( return column_data_map_[column_path]; } +namespace { + +std::shared_ptr GetColumnDecryptor( + const ColumnCryptoMetaData* crypto_metadata, InternalFileDecryptor* file_decryptor, + const std::function( + InternalFileDecryptor* file_decryptor, const std::string& column_path, + const std::string& column_key_metadata, const std::string& aad)>& func, + bool metadata) { + if (crypto_metadata == nullptr) { + return nullptr; + } + + if (file_decryptor == nullptr) { + throw ParquetException("RowGroup is noted as encrypted but no file decryptor"); + } + + if (crypto_metadata->encrypted_with_footer_key()) { + return metadata ? 
file_decryptor->GetFooterDecryptorForColumnMeta() + : file_decryptor->GetFooterDecryptorForColumnData(); + } + + // The column is encrypted with its own key + const std::string& column_key_metadata = crypto_metadata->key_metadata(); + const std::string column_path = crypto_metadata->path_in_schema()->ToDotString(); + return func(file_decryptor, column_path, column_key_metadata, /*aad=*/""); +} + +} // namespace + +std::shared_ptr GetColumnMetaDecryptor( + const ColumnCryptoMetaData* crypto_metadata, InternalFileDecryptor* file_decryptor) { + return GetColumnDecryptor(crypto_metadata, file_decryptor, + &InternalFileDecryptor::GetColumnMetaDecryptor, + /*metadata=*/true); +} + +std::shared_ptr GetColumnDataDecryptor( + const ColumnCryptoMetaData* crypto_metadata, InternalFileDecryptor* file_decryptor) { + return GetColumnDecryptor(crypto_metadata, file_decryptor, + &InternalFileDecryptor::GetColumnDataDecryptor, + /*metadata=*/false); +} + +void UpdateDecryptor(const std::shared_ptr& decryptor, + int16_t row_group_ordinal, int16_t column_ordinal, + int8_t module_type) { + ARROW_DCHECK(!decryptor->file_aad().empty()); + const std::string aad = + encryption::CreateModuleAad(decryptor->file_aad(), module_type, row_group_ordinal, + column_ordinal, kNonPageOrdinal); + decryptor->UpdateAad(aad); +} + } // namespace parquet diff --git a/cpp/src/parquet/encryption/internal_file_decryptor.h b/cpp/src/parquet/encryption/internal_file_decryptor.h index 2f9c3952aff2d..0b27effda8822 100644 --- a/cpp/src/parquet/encryption/internal_file_decryptor.h +++ b/cpp/src/parquet/encryption/internal_file_decryptor.h @@ -31,6 +31,7 @@ class AesDecryptor; class AesEncryptor; } // namespace encryption +class ColumnCryptoMetaData; class FileDecryptionProperties; class PARQUET_EXPORT Decryptor { @@ -110,4 +111,16 @@ class InternalFileDecryptor { bool metadata = false); }; +/// Utility to get column meta decryptor of an encrypted column. +std::shared_ptr GetColumnMetaDecryptor( + const ColumnCryptoMetaData* crypto_metadata, InternalFileDecryptor* file_decryptor); + +/// Utility to get column data decryptor of an encrypted column. +std::shared_ptr GetColumnDataDecryptor( + const ColumnCryptoMetaData* crypto_metadata, InternalFileDecryptor* file_decryptor); + +void UpdateDecryptor(const std::shared_ptr& decryptor, + int16_t row_group_ordinal, int16_t column_ordinal, + int8_t module_type); + } // namespace parquet diff --git a/cpp/src/parquet/encryption/read_configurations_test.cc b/cpp/src/parquet/encryption/read_configurations_test.cc index 10de7198ac5ff..695696db293fb 100644 --- a/cpp/src/parquet/encryption/read_configurations_test.cc +++ b/cpp/src/parquet/encryption/read_configurations_test.cc @@ -36,7 +36,7 @@ * The unit-test is called multiple times, each time to decrypt parquet files using * different decryption configuration as described below. * In each call two encrypted files are read: one temporary file that was generated using - * encryption-write-configurations-test.cc test and will be deleted upon + * write_configurations_test.cc test and will be deleted upon * reading it, while the second resides in * parquet-testing/data repository. Those two encrypted files were encrypted using the * same encryption configuration. @@ -59,8 +59,8 @@ * read the footer + all non-encrypted columns. 
* (pairs with encryption configuration 3) * - * The encrypted parquet files that is read was encrypted using one of the configurations - * below: + * The encrypted parquet files that are read were encrypted using one of the + * configurations below: * * - Encryption configuration 1: Encrypt all columns and the footer with the same key. * (uniform encryption) @@ -166,7 +166,11 @@ class TestDecryptionConfiguration vector_of_decryption_configurations_.push_back(NULL); } - void DecryptFile(std::string file, int decryption_config_num) { + void DecryptFileInternal( + const std::string& file, int decryption_config_num, + std::function&)> + decrypt_func) { std::string exception_msg; std::shared_ptr file_decryption_properties; // if we get decryption_config_num = x then it means the actual number is x+1 @@ -176,18 +180,40 @@ class TestDecryptionConfiguration vector_of_decryption_configurations_[decryption_config_num]->DeepClone(); } - decryptor_.DecryptFile(file, file_decryption_properties); + decrypt_func(std::move(file), std::move(file_decryption_properties)); + } + + void DecryptFile(const std::string& file, int decryption_config_num) { + DecryptFileInternal( + file, decryption_config_num, + [&](const std::string& file, + const std::shared_ptr& file_decryption_properties) { + decryptor_.DecryptFile(file, file_decryption_properties); + }); + } + + void DecryptPageIndex(const std::string& file, int decryption_config_num) { + DecryptFileInternal( + file, decryption_config_num, + [&](const std::string& file, + const std::shared_ptr& file_decryption_properties) { + decryptor_.DecryptPageIndex(file, file_decryption_properties); + }); } // Check that the decryption result is as expected. - void CheckResults(const std::string file_name, unsigned decryption_config_num, - unsigned encryption_config_num) { + void CheckResults(const std::string& file_name, unsigned decryption_config_num, + unsigned encryption_config_num, bool file_has_page_index) { // Encryption_configuration number five contains aad_prefix and // disable_aad_prefix_storage. // An exception is expected to be thrown if the file is not decrypted with aad_prefix. if (encryption_config_num == 5) { if (decryption_config_num == 1 || decryption_config_num == 3) { EXPECT_THROW(DecryptFile(file_name, decryption_config_num - 1), ParquetException); + if (file_has_page_index) { + EXPECT_THROW(DecryptPageIndex(file_name, decryption_config_num - 1), + ParquetException); + } return; } } @@ -196,6 +222,10 @@ class TestDecryptionConfiguration if (decryption_config_num == 2) { if (encryption_config_num != 5 && encryption_config_num != 4) { EXPECT_THROW(DecryptFile(file_name, decryption_config_num - 1), ParquetException); + if (file_has_page_index) { + EXPECT_THROW(DecryptPageIndex(file_name, decryption_config_num - 1), + ParquetException); + } return; } } @@ -205,6 +235,9 @@ class TestDecryptionConfiguration return; } EXPECT_NO_THROW(DecryptFile(file_name, decryption_config_num - 1)); + if (file_has_page_index) { + EXPECT_NO_THROW(DecryptPageIndex(file_name, decryption_config_num - 1)); + } } // Returns true if file exists. Otherwise returns false. @@ -217,14 +250,13 @@ class TestDecryptionConfiguration // Read encrypted parquet file. 
// The test reads two parquet files that were encrypted using the same encryption // configuration: -// one was generated in encryption-write-configurations-test.cc tests and is deleted +// one was generated in write_configurations_test.cc tests and is deleted // once the file is read and the second exists in parquet-testing/data folder. // The name of the files are passed as parameters to the unit-test. TEST_P(TestDecryptionConfiguration, TestDecryption) { int encryption_config_num = std::get<0>(GetParam()); const char* param_file_name = std::get<1>(GetParam()); - // Decrypt parquet file that was generated in encryption-write-configurations-test.cc - // test. + // Decrypt parquet file that was generated in write_configurations_test.cc test. std::string tmp_file_name = "tmp_" + std::string(param_file_name); std::string file_name = temp_dir->path().ToString() + tmp_file_name; if (!fexists(file_name)) { @@ -237,7 +269,8 @@ TEST_P(TestDecryptionConfiguration, TestDecryption) { // parquet file. for (unsigned index = 0; index < vector_of_decryption_configurations_.size(); ++index) { unsigned decryption_config_num = index + 1; - CheckResults(file_name, decryption_config_num, encryption_config_num); + CheckResults(file_name, decryption_config_num, encryption_config_num, + /*file_has_page_index=*/true); } // Delete temporary test file. ASSERT_EQ(std::remove(file_name.c_str()), 0); @@ -255,7 +288,8 @@ TEST_P(TestDecryptionConfiguration, TestDecryption) { // parquet file. for (unsigned index = 0; index < vector_of_decryption_configurations_.size(); ++index) { unsigned decryption_config_num = index + 1; - CheckResults(file_name, decryption_config_num, encryption_config_num); + CheckResults(file_name, decryption_config_num, encryption_config_num, + /*file_has_page_index=*/false); } } diff --git a/cpp/src/parquet/encryption/test_encryption_util.cc b/cpp/src/parquet/encryption/test_encryption_util.cc index 694ed3cf42d9e..4fa215312f265 100644 --- a/cpp/src/parquet/encryption/test_encryption_util.cc +++ b/cpp/src/parquet/encryption/test_encryption_util.cc @@ -19,14 +19,17 @@ // Parquet column chunk within a row group. It could be extended in the future // to iterate through all data pages in all chunks in a file. 
+#include #include -#include - +#include "arrow/io/file.h" #include "arrow/testing/future_util.h" +#include "arrow/util/unreachable.h" + #include "parquet/encryption/test_encryption_util.h" #include "parquet/file_reader.h" #include "parquet/file_writer.h" +#include "parquet/page_index.h" #include "parquet/test_util.h" using ::arrow::io::FileOutputStream; @@ -206,6 +209,7 @@ void FileEncryptor::EncryptFile( WriterProperties::Builder prop_builder; prop_builder.compression(parquet::Compression::UNCOMPRESSED); prop_builder.encryption(encryption_configurations); + prop_builder.enable_write_page_index(); std::shared_ptr writer_properties = prop_builder.build(); PARQUET_ASSIGN_OR_THROW(auto out_file, FileOutputStream::Open(file)); @@ -340,8 +344,8 @@ void ReadAndVerifyColumn(RowGroupReader* rg_reader, RowGroupMetadata* rg_md, } void FileDecryptor::DecryptFile( - std::string file, - std::shared_ptr file_decryption_properties) { + const std::string& file, + const std::shared_ptr& file_decryption_properties) { std::string exception_msg; parquet::ReaderProperties reader_properties = parquet::default_reader_properties(); if (file_decryption_properties) { @@ -353,7 +357,7 @@ void FileDecryptor::DecryptFile( source, ::arrow::io::ReadableFile::Open(file, reader_properties.memory_pool())); auto file_reader = parquet::ParquetFileReader::Open(source, reader_properties); - CheckFile(file_reader.get(), file_decryption_properties.get()); + CheckFile(file_reader.get(), file_decryption_properties); if (file_decryption_properties) { reader_properties.file_decryption_properties(file_decryption_properties->DeepClone()); @@ -361,14 +365,15 @@ void FileDecryptor::DecryptFile( auto fut = parquet::ParquetFileReader::OpenAsync(source, reader_properties); ASSERT_FINISHES_OK(fut); ASSERT_OK_AND_ASSIGN(file_reader, fut.MoveResult()); - CheckFile(file_reader.get(), file_decryption_properties.get()); + CheckFile(file_reader.get(), file_decryption_properties); file_reader->Close(); PARQUET_THROW_NOT_OK(source->Close()); } -void FileDecryptor::CheckFile(parquet::ParquetFileReader* file_reader, - FileDecryptionProperties* file_decryption_properties) { +void FileDecryptor::CheckFile( + parquet::ParquetFileReader* file_reader, + const std::shared_ptr& file_decryption_properties) { // Get the File MetaData std::shared_ptr file_metadata = file_reader->metadata(); @@ -509,4 +514,161 @@ void FileDecryptor::CheckFile(parquet::ParquetFileReader* file_reader, } } +void FileDecryptor::DecryptPageIndex( + const std::string& file, + const std::shared_ptr& file_decryption_properties) { + std::string exception_msg; + parquet::ReaderProperties reader_properties = parquet::default_reader_properties(); + if (file_decryption_properties) { + reader_properties.file_decryption_properties(file_decryption_properties->DeepClone()); + } + + std::shared_ptr<::arrow::io::RandomAccessFile> source; + PARQUET_ASSIGN_OR_THROW( + source, ::arrow::io::ReadableFile::Open(file, reader_properties.memory_pool())); + + auto file_reader = parquet::ParquetFileReader::Open(source, reader_properties); + CheckPageIndex(file_reader.get(), file_decryption_properties); + + ASSERT_NO_FATAL_FAILURE(file_reader->Close()); + PARQUET_THROW_NOT_OK(source->Close()); +} + +template +void AssertColumnIndex(const std::shared_ptr& column_index, + const std::vector& expected_null_counts, + const std::vector& expected_min_values, + const std::vector& expected_max_values) { + auto typed_column_index = + std::dynamic_pointer_cast>(column_index); + ASSERT_NE(typed_column_index, nullptr); 
+template <typename DType, typename T = typename DType::c_type>
+void AssertColumnIndex(const std::shared_ptr<ColumnIndex>& column_index,
+                       const std::vector<int64_t>& expected_null_counts,
+                       const std::vector<T>& expected_min_values,
+                       const std::vector<T>& expected_max_values) {
+  auto typed_column_index =
+      std::dynamic_pointer_cast<TypedColumnIndex<DType>>(column_index);
+  ASSERT_NE(typed_column_index, nullptr);
+  ASSERT_EQ(typed_column_index->null_counts(), expected_null_counts);
+  if constexpr (std::is_same_v<T, FLBA>) {
+    ASSERT_EQ(typed_column_index->min_values().size(), expected_min_values.size());
+    ASSERT_EQ(typed_column_index->max_values().size(), expected_max_values.size());
+    for (size_t i = 0; i < expected_min_values.size(); ++i) {
+      ASSERT_EQ(
+          FixedLenByteArrayToString(typed_column_index->min_values()[i], kFixedLength),
+          FixedLenByteArrayToString(expected_min_values[i], kFixedLength));
+    }
+    for (size_t i = 0; i < expected_max_values.size(); ++i) {
+      ASSERT_EQ(
+          FixedLenByteArrayToString(typed_column_index->max_values()[i], kFixedLength),
+          FixedLenByteArrayToString(expected_max_values[i], kFixedLength));
+    }
+  } else {
+    ASSERT_EQ(typed_column_index->min_values(), expected_min_values);
+    ASSERT_EQ(typed_column_index->max_values(), expected_max_values);
+  }
+}
+
+void FileDecryptor::CheckPageIndex(
+    parquet::ParquetFileReader* file_reader,
+    const std::shared_ptr<FileDecryptionProperties>& file_decryption_properties) {
+  std::shared_ptr<PageIndexReader> page_index_reader = file_reader->GetPageIndexReader();
+  ASSERT_NE(page_index_reader, nullptr);
+
+  const std::shared_ptr<FileMetaData> file_metadata = file_reader->metadata();
+  const int num_row_groups = file_metadata->num_row_groups();
+  const int num_columns = file_metadata->num_columns();
+  ASSERT_EQ(num_columns, 8);
+
+  // We cannot read page index of encrypted columns in the plaintext mode
+  std::vector<int32_t> need_row_groups(num_row_groups);
+  std::iota(need_row_groups.begin(), need_row_groups.end(), 0);
+  std::vector<int32_t> need_columns;
+  if (file_decryption_properties == nullptr) {
+    need_columns = {0, 1, 2, 3, 6, 7};
+  } else {
+    need_columns = {0, 1, 2, 3, 4, 5, 6, 7};
+  }
+
+  // Provide hint of requested columns to avoid accessing encrypted columns without
+  // decryption properties.
+  page_index_reader->WillNeed(
+      need_row_groups, need_columns,
+      PageIndexSelection{/*column_index=*/true, /*offset_index=*/true});
+
+  // Iterate over all the RowGroups in the file.
+  for (int r = 0; r < num_row_groups; ++r) {
+    auto row_group_page_index_reader = page_index_reader->RowGroup(r);
+    ASSERT_NE(row_group_page_index_reader, nullptr);
+
+    for (int c = 0; c < num_columns; ++c) {
+      // Skip reading encrypted columns without decryption properties.
+      if (file_decryption_properties == nullptr && (c == 4 || c == 5)) {
+        continue;
+      }
+
+      constexpr size_t kExpectedNumPages = 1;
+
+      // Check offset index.
+      auto offset_index = row_group_page_index_reader->GetOffsetIndex(c);
+      ASSERT_NE(offset_index, nullptr);
+      ASSERT_EQ(offset_index->page_locations().size(), kExpectedNumPages);
+      const auto& first_page = offset_index->page_locations()[0];
+      ASSERT_EQ(first_page.first_row_index, 0);
+      ASSERT_GT(first_page.compressed_page_size, 0);
+
+      // Int96 column does not have column index.
+ if (c == 3) { + continue; + } + + // Check column index + auto column_index = row_group_page_index_reader->GetColumnIndex(c); + ASSERT_NE(column_index, nullptr); + ASSERT_EQ(column_index->null_pages().size(), kExpectedNumPages); + ASSERT_EQ(column_index->null_pages()[0], false); + ASSERT_EQ(column_index->encoded_min_values().size(), kExpectedNumPages); + ASSERT_EQ(column_index->encoded_max_values().size(), kExpectedNumPages); + ASSERT_TRUE(column_index->has_null_counts()); + + switch (c) { + case 0: { + AssertColumnIndex(column_index, /*expected_null_counts=*/{0}, + /*expected_min_values=*/{false}, + /*expected_max_values=*/{true}); + } break; + case 1: { + AssertColumnIndex(column_index, /*expected_null_counts=*/{0}, + /*expected_min_values=*/{0}, + /*expected_max_values=*/{49}); + } break; + case 2: { + AssertColumnIndex(column_index, /*expected_null_counts=*/{0}, + /*expected_min_values=*/{0}, + /*expected_max_values=*/{99000000000000}); + } break; + case 4: { + AssertColumnIndex(column_index, /*expected_null_counts=*/{0}, + /*expected_min_values=*/{0.0F}, + /*expected_max_values=*/{53.9F}); + } break; + case 5: { + AssertColumnIndex(column_index, /*expected_null_counts=*/{0}, + /*expected_min_values=*/{0.0}, + /*expected_max_values=*/{54.4444439}); + } break; + case 6: { + AssertColumnIndex( + column_index, /*expected_null_counts=*/{25}, + /*expected_min_values=*/{ByteArray("parquet000")}, + /*expected_max_values=*/{ByteArray("parquet048")}); + } break; + case 7: { + const std::vector kExpectedMinValue(kFixedLength, 0); + const std::vector kExpectedMaxValue(kFixedLength, 49); + AssertColumnIndex( + column_index, /*expected_null_counts=*/{0}, + /*expected_min_values=*/{FLBA(kExpectedMinValue.data())}, + /*expected_max_values=*/{FLBA(kExpectedMaxValue.data())}); + } break; + default: + ::arrow::Unreachable("Unexpected column index " + std::to_string(c)); + } + } + } +} + } // namespace parquet::encryption::test diff --git a/cpp/src/parquet/encryption/test_encryption_util.h b/cpp/src/parquet/encryption/test_encryption_util.h index 19c230ee5ff99..86aa0ff07cf84 100644 --- a/cpp/src/parquet/encryption/test_encryption_util.h +++ b/cpp/src/parquet/encryption/test_encryption_util.h @@ -113,12 +113,20 @@ class FileEncryptor { class FileDecryptor { public: - void DecryptFile(std::string file_name, - std::shared_ptr file_decryption_properties); + void DecryptFile( + const std::string& file_name, + const std::shared_ptr& file_decryption_properties); + void DecryptPageIndex( + const std::string& file_name, + const std::shared_ptr& file_decryption_properties); private: - void CheckFile(parquet::ParquetFileReader* file_reader, - FileDecryptionProperties* file_decryption_properties); + void CheckFile( + parquet::ParquetFileReader* file_reader, + const std::shared_ptr& file_decryption_properties); + void CheckPageIndex( + parquet::ParquetFileReader* file_reader, + const std::shared_ptr& file_decryption_properties); }; } // namespace encryption::test diff --git a/cpp/src/parquet/encryption/type_fwd.h b/cpp/src/parquet/encryption/type_fwd.h new file mode 100644 index 0000000000000..623811718482c --- /dev/null +++ b/cpp/src/parquet/encryption/type_fwd.h @@ -0,0 +1,28 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +namespace parquet { + +class Decryptor; +class Encryptor; + +class InternalFileDecryptor; +class InternalFileEncryptor; + +} // namespace parquet diff --git a/cpp/src/parquet/encryption/write_configurations_test.cc b/cpp/src/parquet/encryption/write_configurations_test.cc index e262003db3e6a..f27da82694874 100644 --- a/cpp/src/parquet/encryption/write_configurations_test.cc +++ b/cpp/src/parquet/encryption/write_configurations_test.cc @@ -33,7 +33,7 @@ * This file contains unit-tests for writing encrypted Parquet files with * different encryption configurations. * The files are saved in temporary folder and will be deleted after reading - * them in encryption-read-configurations-test.cc test. + * them in read_configurations_test.cc test. * * A detailed description of the Parquet Modular Encryption specification can be found * here: diff --git a/cpp/src/parquet/file_reader.cc b/cpp/src/parquet/file_reader.cc index 08d493b0bca2f..5247b9d4b543d 100644 --- a/cpp/src/parquet/file_reader.cc +++ b/cpp/src/parquet/file_reader.cc @@ -227,37 +227,19 @@ class SerializedRowGroup : public RowGroupReader::Contents { always_compressed); } - if (file_decryptor_ == nullptr) { - throw ParquetException("RowGroup is noted as encrypted but no file decryptor"); - } + // The column is encrypted + std::shared_ptr meta_decryptor = + GetColumnMetaDecryptor(crypto_metadata.get(), file_decryptor_.get()); + std::shared_ptr data_decryptor = + GetColumnDataDecryptor(crypto_metadata.get(), file_decryptor_.get()); + ARROW_DCHECK_NE(meta_decryptor, nullptr); + ARROW_DCHECK_NE(data_decryptor, nullptr); constexpr auto kEncryptedRowGroupsLimit = 32767; if (i > kEncryptedRowGroupsLimit) { throw ParquetException("Encrypted files cannot contain more than 32767 row groups"); } - // The column is encrypted - std::shared_ptr meta_decryptor; - std::shared_ptr data_decryptor; - // The column is encrypted with footer key - if (crypto_metadata->encrypted_with_footer_key()) { - meta_decryptor = file_decryptor_->GetFooterDecryptorForColumnMeta(); - data_decryptor = file_decryptor_->GetFooterDecryptorForColumnData(); - CryptoContext ctx(col->has_dictionary_page(), row_group_ordinal_, - static_cast(i), meta_decryptor, data_decryptor); - return PageReader::Open(stream, col->num_values(), col->compression(), properties_, - always_compressed, &ctx); - } - - // The column is encrypted with its own key - std::string column_key_metadata = crypto_metadata->key_metadata(); - const std::string column_path = crypto_metadata->path_in_schema()->ToDotString(); - - meta_decryptor = - file_decryptor_->GetColumnMetaDecryptor(column_path, column_key_metadata); - data_decryptor = - file_decryptor_->GetColumnDataDecryptor(column_path, column_key_metadata); - CryptoContext ctx(col->has_dictionary_page(), row_group_ordinal_, static_cast(i), meta_decryptor, data_decryptor); return PageReader::Open(stream, col->num_values(), col->compression(), properties_, @@ 
-330,7 +312,7 @@ class SerializedFile : public ParquetFileReader::Contents { } if (!page_index_reader_) { page_index_reader_ = PageIndexReader::Make(source_.get(), file_metadata_, - properties_, file_decryptor_); + properties_, file_decryptor_.get()); } return page_index_reader_; } diff --git a/cpp/src/parquet/file_writer.cc b/cpp/src/parquet/file_writer.cc index 2a6a88df2dd0a..9a92d4525d23d 100644 --- a/cpp/src/parquet/file_writer.cc +++ b/cpp/src/parquet/file_writer.cc @@ -471,10 +471,6 @@ class FileSerializer : public ParquetFileWriter::Contents { void WritePageIndex() { if (page_index_builder_ != nullptr) { - if (properties_->file_encryption_properties()) { - throw ParquetException("Encryption is not supported with page index"); - } - // Serialize page index after all row groups have been written and report // location to the file metadata. PageIndexLocation page_index_location; @@ -533,7 +529,7 @@ class FileSerializer : public ParquetFileWriter::Contents { } if (properties_->page_index_enabled()) { - page_index_builder_ = PageIndexBuilder::Make(&schema_); + page_index_builder_ = PageIndexBuilder::Make(&schema_, file_encryptor_.get()); } } }; diff --git a/cpp/src/parquet/metadata.cc b/cpp/src/parquet/metadata.cc index 8aedf5b926add..4ef2151fee59d 100644 --- a/cpp/src/parquet/metadata.cc +++ b/cpp/src/parquet/metadata.cc @@ -211,7 +211,7 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl { ThriftDeserializer deserializer(properties_); deserializer.DeserializeMessage( reinterpret_cast(column->encrypted_column_metadata.c_str()), - &len, &decrypted_metadata_, decryptor); + &len, &decrypted_metadata_, decryptor.get()); column_metadata_ = &decrypted_metadata_; } else { throw ParquetException( @@ -603,7 +603,8 @@ class FileMetaData::FileMetaDataImpl { ThriftDeserializer deserializer(properties_); deserializer.DeserializeMessage(reinterpret_cast(metadata), - metadata_len, metadata_.get(), footer_decryptor); + metadata_len, metadata_.get(), + footer_decryptor.get()); metadata_len_ = *metadata_len; if (metadata_->__isset.created_by) { @@ -705,7 +706,7 @@ class FileMetaData::FileMetaDataImpl { encryption::kGcmTagLength)); } else { // either plaintext file (when encryptor is null) // or encrypted file with encrypted footer - serializer.Serialize(metadata_.get(), dst, encryptor); + serializer.Serialize(metadata_.get(), dst, encryptor.get()); } } diff --git a/cpp/src/parquet/metadata.h b/cpp/src/parquet/metadata.h index e62b2d187a20b..6609cff48bac2 100644 --- a/cpp/src/parquet/metadata.h +++ b/cpp/src/parquet/metadata.h @@ -25,6 +25,7 @@ #include #include +#include "parquet/encryption/type_fwd.h" #include "parquet/platform.h" #include "parquet/properties.h" #include "parquet/schema.h" @@ -34,15 +35,10 @@ namespace parquet { class ColumnDescriptor; class EncodedStatistics; +class FileCryptoMetaData; class Statistics; class SchemaDescriptor; -class FileCryptoMetaData; -class InternalFileDecryptor; -class Decryptor; -class Encryptor; -class FooterSigningEncryptor; - namespace schema { class ColumnPath; diff --git a/cpp/src/parquet/page_index.cc b/cpp/src/parquet/page_index.cc index 9bae90e5540bd..ec99af17f05a1 100644 --- a/cpp/src/parquet/page_index.cc +++ b/cpp/src/parquet/page_index.cc @@ -17,6 +17,9 @@ #include "parquet/page_index.h" #include "parquet/encoding.h" +#include "parquet/encryption/encryption_internal.h" +#include "parquet/encryption/internal_file_decryptor.h" +#include "parquet/encryption/internal_file_encryptor.h" #include "parquet/exception.h" #include "parquet/metadata.h" #include 
"parquet/schema.h" @@ -192,13 +195,13 @@ class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { const ReaderProperties& properties, int32_t row_group_ordinal, const RowGroupIndexReadRange& index_read_range, - std::shared_ptr file_decryptor) + InternalFileDecryptor* file_decryptor) : input_(input), row_group_metadata_(std::move(row_group_metadata)), properties_(properties), row_group_ordinal_(row_group_ordinal), index_read_range_(index_read_range), - file_decryptor_(std::move(file_decryptor)) {} + file_decryptor_(file_decryptor) {} /// Read column index of a column chunk. std::shared_ptr GetColumnIndex(int32_t i) override { @@ -207,11 +210,6 @@ class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { } auto col_chunk = row_group_metadata_->ColumnChunk(i); - std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); - if (crypto_metadata != nullptr) { - ParquetException::NYI("Cannot read encrypted column index yet"); - } - auto column_index_location = col_chunk->GetColumnIndexLocation(); if (!column_index_location.has_value()) { return nullptr; @@ -232,8 +230,17 @@ class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { // uint32_t uint32_t length = static_cast(column_index_location->length); auto descr = row_group_metadata_->schema()->Column(i); + + // Get decryptor of column index if encrypted. + std::shared_ptr decryptor = parquet::GetColumnMetaDecryptor( + col_chunk->crypto_metadata().get(), file_decryptor_); + if (decryptor != nullptr) { + UpdateDecryptor(decryptor, row_group_ordinal_, /*column_ordinal=*/i, + encryption::kColumnIndex); + } + return ColumnIndex::Make(*descr, column_index_buffer_->data() + buffer_offset, length, - properties_); + properties_, decryptor.get()); } /// Read offset index of a column chunk. @@ -243,11 +250,6 @@ class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { } auto col_chunk = row_group_metadata_->ColumnChunk(i); - std::unique_ptr crypto_metadata = col_chunk->crypto_metadata(); - if (crypto_metadata != nullptr) { - ParquetException::NYI("Cannot read encrypted offset index yet"); - } - auto offset_index_location = col_chunk->GetOffsetIndexLocation(); if (!offset_index_location.has_value()) { return nullptr; @@ -267,8 +269,17 @@ class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { // OffsetIndex::Make() requires the type of serialized thrift message to be // uint32_t uint32_t length = static_cast(offset_index_location->length); + + // Get decryptor of offset index if encrypted. + std::shared_ptr decryptor = + GetColumnMetaDecryptor(col_chunk->crypto_metadata().get(), file_decryptor_); + if (decryptor != nullptr) { + UpdateDecryptor(decryptor, row_group_ordinal_, /*column_ordinal=*/i, + encryption::kOffsetIndex); + } + return OffsetIndex::Make(offset_index_buffer_->data() + buffer_offset, length, - properties_); + properties_, decryptor.get()); } private: @@ -325,7 +336,7 @@ class RowGroupPageIndexReaderImpl : public RowGroupPageIndexReader { RowGroupIndexReadRange index_read_range_; /// File-level decryptor. - std::shared_ptr file_decryptor_; + InternalFileDecryptor* file_decryptor_; /// Buffer to hold the raw bytes of the page index. /// Will be set lazily when the corresponding page index is accessed for the 1st time. 
@@ -338,11 +349,11 @@ class PageIndexReaderImpl : public PageIndexReader { PageIndexReaderImpl(::arrow::io::RandomAccessFile* input, std::shared_ptr file_metadata, const ReaderProperties& properties, - std::shared_ptr file_decryptor) + InternalFileDecryptor* file_decryptor) : input_(input), file_metadata_(std::move(file_metadata)), properties_(properties), - file_decryptor_(std::move(file_decryptor)) {} + file_decryptor_(file_decryptor) {} std::shared_ptr RowGroup(int i) override { if (i < 0 || i >= file_metadata_->num_row_groups()) { @@ -418,7 +429,7 @@ class PageIndexReaderImpl : public PageIndexReader { const ReaderProperties& properties_; /// File-level decrypter. - std::shared_ptr file_decryptor_; + InternalFileDecryptor* file_decryptor_; /// Coalesced read ranges of page index of row groups that have been suggested by /// WillNeed(). Key is the row group ordinal. @@ -524,9 +535,9 @@ class ColumnIndexBuilderImpl final : public ColumnIndexBuilder { column_index_.__set_boundary_order(ToThrift(boundary_order)); } - void WriteTo(::arrow::io::OutputStream* sink) const override { + void WriteTo(::arrow::io::OutputStream* sink, Encryptor* encryptor) const override { if (state_ == BuilderState::kFinished) { - ThriftSerializer{}.Serialize(&column_index_, sink); + ThriftSerializer{}.Serialize(&column_index_, sink, encryptor); } } @@ -634,9 +645,9 @@ class OffsetIndexBuilderImpl final : public OffsetIndexBuilder { } } - void WriteTo(::arrow::io::OutputStream* sink) const override { + void WriteTo(::arrow::io::OutputStream* sink, Encryptor* encryptor) const override { if (state_ == BuilderState::kFinished) { - ThriftSerializer{}.Serialize(&offset_index_, sink); + ThriftSerializer{}.Serialize(&offset_index_, sink, encryptor); } } @@ -654,7 +665,9 @@ class OffsetIndexBuilderImpl final : public OffsetIndexBuilder { class PageIndexBuilderImpl final : public PageIndexBuilder { public: - explicit PageIndexBuilderImpl(const SchemaDescriptor* schema) : schema_(schema) {} + explicit PageIndexBuilderImpl(const SchemaDescriptor* schema, + InternalFileEncryptor* file_encryptor) + : schema_(schema), file_encryptor_(file_encryptor) {} void AppendRowGroup() override { if (finished_) { @@ -724,12 +737,31 @@ class PageIndexBuilderImpl final : public PageIndexBuilder { } } + std::shared_ptr GetColumnMetaEncryptor(int row_group_ordinal, + int column_ordinal, + int8_t module_type) const { + std::shared_ptr encryptor; + if (file_encryptor_ != nullptr) { + const auto column_path = schema_->Column(column_ordinal)->path()->ToDotString(); + encryptor = file_encryptor_->GetColumnMetaEncryptor(column_path); + if (encryptor != nullptr) { + encryptor->UpdateAad(encryption::CreateModuleAad( + encryptor->file_aad(), module_type, row_group_ordinal, column_ordinal, + kNonPageOrdinal)); + } + } + return encryptor; + } + template void SerializeIndex( const std::vector>>& page_index_builders, ::arrow::io::OutputStream* sink, std::map>>* location) const { const auto num_columns = static_cast(schema_->num_columns()); + constexpr int8_t module_type = std::is_same_v + ? encryption::kColumnIndex + : encryption::kOffsetIndex; /// Serialize the same kind of page index row group by row group. 
for (size_t row_group = 0; row_group < page_index_builders.size(); ++row_group) { @@ -743,9 +775,13 @@ class PageIndexBuilderImpl final : public PageIndexBuilder { for (size_t column = 0; column < num_columns; ++column) { const auto& column_page_index_builder = row_group_page_index_builders[column]; if (column_page_index_builder != nullptr) { + /// Get encryptor if encryption is enabled. + std::shared_ptr encryptor = GetColumnMetaEncryptor( + static_cast(row_group), static_cast(column), module_type); + /// Try serializing the page index. PARQUET_ASSIGN_OR_THROW(int64_t pos_before_write, sink->Tell()); - column_page_index_builder->WriteTo(sink); + column_page_index_builder->WriteTo(sink, encryptor.get()); PARQUET_ASSIGN_OR_THROW(int64_t pos_after_write, sink->Tell()); int64_t len = pos_after_write - pos_before_write; @@ -769,6 +805,7 @@ class PageIndexBuilderImpl final : public PageIndexBuilder { } const SchemaDescriptor* schema_; + InternalFileEncryptor* file_encryptor_; std::vector>> column_index_builders_; std::vector>> offset_index_builders_; bool finished_ = false; @@ -832,11 +869,12 @@ RowGroupIndexReadRange PageIndexReader::DeterminePageIndexRangesInRowGroup( std::unique_ptr ColumnIndex::Make(const ColumnDescriptor& descr, const void* serialized_index, uint32_t index_len, - const ReaderProperties& properties) { + const ReaderProperties& properties, + Decryptor* decryptor) { format::ColumnIndex column_index; ThriftDeserializer deserializer(properties); deserializer.DeserializeMessage(reinterpret_cast(serialized_index), - &index_len, &column_index); + &index_len, &column_index, decryptor); switch (descr.physical_type()) { case Type::BOOLEAN: return std::make_unique>(descr, @@ -871,20 +909,20 @@ std::unique_ptr ColumnIndex::Make(const ColumnDescriptor& descr, std::unique_ptr OffsetIndex::Make(const void* serialized_index, uint32_t index_len, - const ReaderProperties& properties) { + const ReaderProperties& properties, + Decryptor* decryptor) { format::OffsetIndex offset_index; ThriftDeserializer deserializer(properties); deserializer.DeserializeMessage(reinterpret_cast(serialized_index), - &index_len, &offset_index); + &index_len, &offset_index, decryptor); return std::make_unique(offset_index); } std::shared_ptr PageIndexReader::Make( ::arrow::io::RandomAccessFile* input, std::shared_ptr file_metadata, - const ReaderProperties& properties, - std::shared_ptr file_decryptor) { + const ReaderProperties& properties, InternalFileDecryptor* file_decryptor) { return std::make_shared(input, std::move(file_metadata), - properties, std::move(file_decryptor)); + properties, file_decryptor); } std::unique_ptr ColumnIndexBuilder::Make( @@ -917,8 +955,9 @@ std::unique_ptr OffsetIndexBuilder::Make() { return std::make_unique(); } -std::unique_ptr PageIndexBuilder::Make(const SchemaDescriptor* schema) { - return std::make_unique(schema); +std::unique_ptr PageIndexBuilder::Make( + const SchemaDescriptor* schema, InternalFileEncryptor* file_encryptor) { + return std::make_unique(schema, file_encryptor); } std::ostream& operator<<(std::ostream& out, const PageIndexSelection& selection) { diff --git a/cpp/src/parquet/page_index.h b/cpp/src/parquet/page_index.h index b6ea5fd6abc08..f2ed77cb97c3b 100644 --- a/cpp/src/parquet/page_index.h +++ b/cpp/src/parquet/page_index.h @@ -18,6 +18,7 @@ #pragma once #include "arrow/io/interfaces.h" +#include "parquet/encryption/type_fwd.h" #include "parquet/types.h" #include @@ -25,14 +26,8 @@ namespace parquet { -class ColumnDescriptor; class EncodedStatistics; -class 
FileMetaData; -class InternalFileDecryptor; struct PageIndexLocation; -class ReaderProperties; -class RowGroupMetaData; -class RowGroupPageIndexReader; /// \brief ColumnIndex is a proxy around format::ColumnIndex. class PARQUET_EXPORT ColumnIndex { @@ -41,7 +36,8 @@ class PARQUET_EXPORT ColumnIndex { static std::unique_ptr Make(const ColumnDescriptor& descr, const void* serialized_index, uint32_t index_len, - const ReaderProperties& properties); + const ReaderProperties& properties, + Decryptor* decryptor = NULLPTR); virtual ~ColumnIndex() = default; @@ -126,7 +122,8 @@ class PARQUET_EXPORT OffsetIndex { /// \brief Create a OffsetIndex from a serialized thrift message. static std::unique_ptr Make(const void* serialized_index, uint32_t index_len, - const ReaderProperties& properties); + const ReaderProperties& properties, + Decryptor* decryptor = NULLPTR); virtual ~OffsetIndex() = default; @@ -187,7 +184,7 @@ class PARQUET_EXPORT PageIndexReader { static std::shared_ptr Make( ::arrow::io::RandomAccessFile* input, std::shared_ptr file_metadata, const ReaderProperties& properties, - std::shared_ptr file_decryptor = NULLPTR); + InternalFileDecryptor* file_decryptor = NULLPTR); /// \brief Get the page index reader of a specific row group. /// \param[in] i row group ordinal to get page index reader. @@ -283,7 +280,9 @@ class PARQUET_EXPORT ColumnIndexBuilder { /// not write any data to the sink. /// /// \param[out] sink output stream to write the serialized message. - virtual void WriteTo(::arrow::io::OutputStream* sink) const = 0; + /// \param[in] encryptor encryptor to encrypt the serialized column index. + virtual void WriteTo(::arrow::io::OutputStream* sink, + Encryptor* encryptor = NULLPTR) const = 0; /// \brief Create a ColumnIndex directly. /// @@ -322,7 +321,9 @@ class PARQUET_EXPORT OffsetIndexBuilder { /// \brief Serialize the offset index thrift message. /// /// \param[out] sink output stream to write the serialized message. - virtual void WriteTo(::arrow::io::OutputStream* sink) const = 0; + /// \param[in] encryptor encryptor to encrypt the serialized offset index. + virtual void WriteTo(::arrow::io::OutputStream* sink, + Encryptor* encryptor = NULLPTR) const = 0; /// \brief Create an OffsetIndex directly. virtual std::unique_ptr Build() const = 0; @@ -332,7 +333,8 @@ class PARQUET_EXPORT OffsetIndexBuilder { class PARQUET_EXPORT PageIndexBuilder { public: /// \brief API convenience to create a PageIndexBuilder. - static std::unique_ptr Make(const SchemaDescriptor* schema); + static std::unique_ptr Make( + const SchemaDescriptor* schema, InternalFileEncryptor* file_encryptor = NULLPTR); virtual ~PageIndexBuilder() = default; diff --git a/cpp/src/parquet/thrift_internal.h b/cpp/src/parquet/thrift_internal.h index 5824a82d5b86d..7491f118d32a0 100644 --- a/cpp/src/parquet/thrift_internal.h +++ b/cpp/src/parquet/thrift_internal.h @@ -403,7 +403,7 @@ class ThriftDeserializer { // set to the actual length of the header. 
template void DeserializeMessage(const uint8_t* buf, uint32_t* len, T* deserialized_msg, - const std::shared_ptr& decryptor = NULLPTR) { + Decryptor* decryptor = NULLPTR) { if (decryptor == NULLPTR) { // thrift message is not encrypted DeserializeUnencryptedMessage(buf, len, deserialized_msg); @@ -495,7 +495,7 @@ class ThriftSerializer { template int64_t Serialize(const T* obj, ArrowOutputStream* out, - const std::shared_ptr& encryptor = NULLPTR) { + Encryptor* encryptor = NULLPTR) { uint8_t* out_buffer; uint32_t out_length; SerializeToBuffer(obj, &out_length, &out_buffer); @@ -523,8 +523,7 @@ class ThriftSerializer { } int64_t SerializeEncryptedObj(ArrowOutputStream* out, uint8_t* out_buffer, - uint32_t out_length, - const std::shared_ptr& encryptor) { + uint32_t out_length, Encryptor* encryptor) { auto cipher_buffer = std::static_pointer_cast(AllocateBuffer( encryptor->pool(), static_cast(encryptor->CiphertextSizeDelta() + out_length))); diff --git a/cpp/src/parquet/type_fwd.h b/cpp/src/parquet/type_fwd.h index 3e66f32fc0322..da0d0f7bdee96 100644 --- a/cpp/src/parquet/type_fwd.h +++ b/cpp/src/parquet/type_fwd.h @@ -69,6 +69,9 @@ struct ParquetVersion { }; class FileMetaData; +class RowGroupMetaData; + +class ColumnDescriptor; class SchemaDescriptor; class ReaderProperties; From e9730f5971480b942c7394846162c4dfa9145aa9 Mon Sep 17 00:00:00 2001 From: Antoine Pitrou Date: Thu, 28 Sep 2023 18:24:26 +0200 Subject: [PATCH 80/96] GH-37934: [Doc][Integration] Document C Data Interface testing (#37935) ### Rationale for this change gh-37537 added integration testing for the C Data Interface, but the documentation was not updated. ### What changes are included in this PR? Add documentation for C Data Interface integration testing. ### Are these changes tested? N/A, only doc changes. ### Are there any user-facing changes? No. * Closes: #37934 Authored-by: Antoine Pitrou Signed-off-by: Antoine Pitrou --- docs/source/developers/java/development.rst | 10 +- docs/source/format/Integration.rst | 114 ++++++++++++++++---- 2 files changed, 97 insertions(+), 27 deletions(-) diff --git a/docs/source/developers/java/development.rst b/docs/source/developers/java/development.rst index 1094d02f1c140..ce7e1704f641c 100644 --- a/docs/source/developers/java/development.rst +++ b/docs/source/developers/java/development.rst @@ -84,11 +84,13 @@ UI Benchmark: Integration Testing =================== -Integration tests can be run via Archery: +Integration tests can be run :ref:`via Archery `. +For example, assuming you only built Arrow Java and want to run the IPC +integration tests, you would do: -.. code-block:: +.. code-block:: console - $ archery integration --with-java true --with-cpp false --with-js false --with-csharp false --with-go false --with-rust false + $ archery integration --run-ipc --with-java 1 Code Style ========== @@ -104,4 +106,4 @@ This checks the code style of all source code under the current directory or fro .. _benchmark: https://github.com/ursacomputing/benchmarks .. _archery: https://github.com/apache/arrow/blob/main/dev/conbench_envs/README.md#L188 .. _conbench: https://github.com/conbench/conbench -.. _checkstyle: https://github.com/apache/arrow/blob/main/java/dev/checkstyle/checkstyle.xml \ No newline at end of file +.. 
_checkstyle: https://github.com/apache/arrow/blob/main/java/dev/checkstyle/checkstyle.xml diff --git a/docs/source/format/Integration.rst b/docs/source/format/Integration.rst index 5f2341b9c469c..e1160b287e77c 100644 --- a/docs/source/format/Integration.rst +++ b/docs/source/format/Integration.rst @@ -20,32 +20,98 @@ Integration Testing =================== +To ensure Arrow implementations are interoperable between each other, +the Arrow project includes cross-language integration tests which are +regularly run as Continuous Integration tasks. + +The integration tests exercise compliance with several Arrow specifications: +the :ref:`IPC format `, the :ref:`Flight RPC ` protocol, +and the :ref:`C Data Interface `. + +Strategy +-------- + Our strategy for integration testing between Arrow implementations is: -* Test datasets are specified in a custom human-readable, JSON-based format - designed exclusively for Arrow's integration tests -* Each implementation provides a testing executable capable of converting - between the JSON and the binary Arrow file representation -* Each testing executable is used to generate binary Arrow file representations - from the JSON-based test datasets. These results are then used to call the - testing executable of each other implementation to validate the contents - against the corresponding JSON file. - - *ie.* the C++ testing executable generates binary arrow files from JSON - specified datasets. The resulting files are then used as input to the Java - testing executable for validation, confirming that the Java implementation - can correctly read what the C++ implementation wrote. +* Test datasets are specified in a custom human-readable, + :ref:`JSON-based format ` designed exclusively + for Arrow's integration tests. + +* The JSON files are generated by the integration test harness. Different + files are used to represent different data types and features, such as + numerics, lists, dictionary encoding, etc. This makes it easier to pinpoint + incompatibilities than if all data types were represented in a single file. + +* Each implementation provides entry points capable of converting + between the JSON and the Arrow in-memory representation, and of exposing + Arrow in-memory data using the desired format. + +* Each format (whether Arrow IPC, Flight or the C Data Interface) is tested for + all supported pairs of (producer, consumer) implementations. The producer + typically reads a JSON file, converts it to in-memory Arrow data, and exposes + this data using the format under test. The consumer reads the data in the + said format and converts it back to Arrow in-memory data; it also reads + the same JSON file as the producer, and validates that both datasets are + identical. + +Example: IPC format +~~~~~~~~~~~~~~~~~~~ + +Let's say we are testing Arrow C++ as a producer and Arrow Java as a consumer +of the Arrow IPC format. Testing a JSON file would go as follows: + +#. A C++ executable reads the JSON file, converts it into Arrow in-memory data + and writes an Arrow IPC file (the file paths are typically given on the command + line). + +#. A Java executable reads the JSON file, converts it into Arrow in-memory data; + it also reads the Arrow IPC file generated by C++. Finally, it validates that + both Arrow in-memory datasets are equal. + +Example: C Data Interface +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Now, let's say we are testing Arrow Go as a producer and Arrow C# as a consumer +of the Arrow C Data Interface. + +#. 
The integration testing harness allocates a C + :ref:`ArrowArray ` structure on the heap. + +#. A Go in-process entrypoint (for example a C-compatible function call) + reads a JSON file and exports one of its :term:`record batches ` + into the ``ArrowArray`` structure. + +#. A C# in-process entrypoint reads the same JSON file, converts the + same record batch into Arrow in-memory data; it also imports the + record batch exported by Arrow Go in the ``ArrowArray`` structure. + It validates that both record batches are equal, and then releases the + imported record batch. + +#. Depending on the implementation languages' abilities, the integration + testing harness may assert that memory consumption remained identical + (i.e., that the exported record batch didn't leak). + +#. At the end, the integration testing harness deallocates the ``ArrowArray`` + structure. + +.. _running_integration_tests: Running integration tests ------------------------- The integration test data generator and runner are implemented inside -the :ref:`Archery ` utility. +the :ref:`Archery ` utility. You need to install the ``integration`` +component of archery: + +.. code:: console + + $ pip install -e "dev/archery[integration]" The integration tests are run using the ``archery integration`` command. -.. code-block:: shell +.. code-block:: console - archery integration --help + $ archery integration --help In order to run integration tests, you'll first need to build each component you want to include. See the respective developer docs for C++, Java, etc. @@ -56,26 +122,26 @@ testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON`` to your cmake command. Depending on which components you have built, you can enable and add them to -the archery test run. For example, if you only have the C++ project built, run: +the archery test run. For example, if you only have the C++ project built +and want to run the Arrow IPC integration tests, run: .. code-block:: shell - archery integration --with-cpp=1 - + archery integration --run-ipc --with-cpp=1 For Java, it may look like: .. code-block:: shell - VERSION=0.11.0-SNAPSHOT + VERSION=14.0.0-SNAPSHOT export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar - archery integration --with-cpp=1 --with-java=1 + archery integration --run-ipc --with-cpp=1 --with-java=1 -To run all tests, including Flight integration tests, do: +To run all tests, including Flight and C Data Interface integration tests, do: .. code-block:: shell - archery integration --with-all --run-flight + archery integration --with-all --run-flight --run-ipc --run-c-data Note that we run these tests in continuous integration, and the CI job uses docker-compose. You may also run the docker-compose job locally, or at least @@ -85,6 +151,8 @@ certain tests. See :ref:`docker-builds` for more information about the project's ``docker-compose`` configuration. +.. _format_json_integration: + JSON test data format --------------------- @@ -415,7 +483,7 @@ will have count 28. For "null" type, ``BufferData`` does not contain any buffers. 
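For concreteness, a heavily abridged sketch of such a JSON file (a single
``int`` field and one three-row batch; the field name and values are
illustrative only, not taken from the generated test suite):

.. code-block:: json

   {
     "schema": {
       "fields": [
         {"name": "foo", "nullable": true, "children": [],
          "type": {"name": "int", "bitWidth": 32, "isSigned": true}}
       ]
     },
     "batches": [
       {
         "count": 3,
         "columns": [
           {"name": "foo", "count": 3, "VALIDITY": [1, 0, 1], "DATA": [1, 0, 3]}
         ]
       }
     ]
   }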
 Archery Integration Test Cases
---------------------------------------
+------------------------------

 This list can make it easier to understand what manual testing may need to be
 done for any future Arrow Format changes by knowing what cases the automated

From 79abb7362f671f484675b89f19566df861c45f6f Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Thu, 28 Sep 2023 21:40:45 +0100
Subject: [PATCH 81/96] GH-37842: [R] Implement infer_schema.data.frame()
 (#37843)

### Rationale for this change

Users will be able to easily see the schema that their `data.frame` object
will have when it is converted into an Arrow table.

### What changes are included in this PR?

Implements an `infer_schema()` method for `data.frame` objects.

Before:

``` r
library(arrow)
schema(mtcars)
#> Error in UseMethod("infer_schema"): no applicable method for 'infer_schema' applied to an object of class "data.frame"
```

After:

``` r
library(arrow)
schema(mtcars)
#> Schema
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#>
#> See $metadata for additional Schema metadata
```

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes
* Closes: #37842

Authored-by: Nic Crane
Signed-off-by: Nic Crane
---
 r/NAMESPACE                    | 1 +
 r/R/schema.R                   | 3 +++
 r/tests/testthat/test-schema.R | 5 +++++
 3 files changed, 9 insertions(+)

diff --git a/r/NAMESPACE b/r/NAMESPACE
index 21f88b4180d24..d49255f781f94 100644
--- a/r/NAMESPACE
+++ b/r/NAMESPACE
@@ -112,6 +112,7 @@ S3method(infer_schema,ArrowTabular)
 S3method(infer_schema,Dataset)
 S3method(infer_schema,RecordBatchReader)
 S3method(infer_schema,arrow_dplyr_query)
+S3method(infer_schema,data.frame)
 S3method(infer_type,ArrowDatum)
 S3method(infer_type,Expression)
 S3method(infer_type,blob)
diff --git a/r/R/schema.R b/r/R/schema.R
index 1ad18e314191e..ac0604b2b345c 100644
--- a/r/R/schema.R
+++ b/r/R/schema.R
@@ -285,6 +285,9 @@ infer_schema.Dataset <- function(x) x$schema
 #' @export
 infer_schema.arrow_dplyr_query <- function(x) implicit_schema(x)

+#' @export
+infer_schema.data.frame <- function(x) schema(!!!lapply(x, infer_type))
+
 #' @export
 names.Schema <- function(x) x$names

diff --git a/r/tests/testthat/test-schema.R b/r/tests/testthat/test-schema.R
index db91cee330960..b1dc06592955e 100644
--- a/r/tests/testthat/test-schema.R
+++ b/r/tests/testthat/test-schema.R
@@ -295,9 +295,14 @@ test_that("schema name assignment", {
 test_that("schema extraction", {
   skip_if_not_available("dataset")
+
   tbl <- arrow_table(example_data)
+  expect_equal(schema(example_data), tbl$schema)
   expect_equal(schema(tbl), tbl$schema)

+  expect_equal(schema(data.frame(a = 1, a = "x", check.names = FALSE)), schema(a = double(), a = string()))
+  expect_equal(schema(data.frame()), schema())
+
   ds <- InMemoryDataset$create(example_data)
   expect_equal(schema(ds), ds$schema)

From 9d3806a27747241c5daf2ecc901986c16962ead2 Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Thu, 28 Sep 2023 21:41:56 +0100
Subject: [PATCH 82/96] GH-34640: [R] Can't read in partitioning column in CSV
 datasets when both (non-hive) partition and schema supplied (#37658)

### Rationale for this change

It wasn't possible to use the partitioning column of a CSV dataset when both
a schema and a partitioning variable were supplied.

### What changes are included in this PR?
This PR updates the code which creates the `CSVReadOptions` object and makes sure we don't pass in the partition variable column name as a column name there, as previously this was resulting in an error. ### Are these changes tested? Yes ### Are there any user-facing changes? Yes * Closes: #34640 Lead-authored-by: Nic Crane Co-authored-by: Dewey Dunnington Signed-off-by: Nic Crane --- r/R/dataset-factory.R | 2 +- r/R/dataset-format.R | 31 +++++++++++++++----------- r/tests/testthat/test-dataset-csv.R | 34 +++++++++++++++++++++++++++++ 3 files changed, 53 insertions(+), 14 deletions(-) diff --git a/r/R/dataset-factory.R b/r/R/dataset-factory.R index adb7353a043b9..d3d4f639e3729 100644 --- a/r/R/dataset-factory.R +++ b/r/R/dataset-factory.R @@ -49,7 +49,7 @@ DatasetFactory$create <- function(x, } if (is.character(format)) { - format <- FileFormat$create(match.arg(format), ...) + format <- FileFormat$create(match.arg(format), partitioning = partitioning, ...) } else { assert_is(format, "FileFormat") } diff --git a/r/R/dataset-format.R b/r/R/dataset-format.R index e1f434d60cd50..5dd00b9344014 100644 --- a/r/R/dataset-format.R +++ b/r/R/dataset-format.R @@ -74,13 +74,14 @@ FileFormat <- R6Class("FileFormat", type = function() dataset___FileFormat__type_name(self) ) ) -FileFormat$create <- function(format, schema = NULL, ...) { + +FileFormat$create <- function(format, schema = NULL, partitioning = NULL, ...) { opt_names <- names(list(...)) if (format %in% c("csv", "text", "txt") || any(opt_names %in% c("delim", "delimiter"))) { - CsvFileFormat$create(schema = schema, ...) + CsvFileFormat$create(schema = schema, partitioning = partitioning, ...) } else if (format == "tsv") { # This delimiter argument is ignored. - CsvFileFormat$create(delimiter = "\t", schema = schema, ...) + CsvFileFormat$create(delimiter = "\t", schema = schema, partitioning = partitioning, ...) } else if (format == "parquet") { ParquetFileFormat$create(...) } else if (format %in% c("ipc", "arrow", "feather")) { # These are aliases for the same thing @@ -189,16 +190,19 @@ JsonFileFormat$create <- function(...) { #' #' @export CsvFileFormat <- R6Class("CsvFileFormat", inherit = FileFormat) -CsvFileFormat$create <- function(...) { +CsvFileFormat$create <- function(..., partitioning = NULL) { + dots <- list(...) 
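  # A sketch of the user-facing behaviour this fix enables: supplying both
  # `schema` and `partitioning` to open_dataset() for a CSV dataset, e.g.
  #
  #   ds <- open_dataset(csv_dir, format = "csv", skip = 1,
  #                      schema = target_schema,
  #                      partitioning = schema(part = int32()))
  #
  # `csv_dir` and `target_schema` are placeholders here, matching the new
  # test added below in test-dataset-csv.R.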
- options <- check_csv_file_format_args(dots) - check_schema(options[["schema"]], options[["read_options"]]$column_names) + + options <- check_csv_file_format_args(dots, partitioning = partitioning) + check_schema(options[["schema"]], partitioning, options[["read_options"]]$column_names) dataset___CsvFileFormat__Make(options$parse_options, options$convert_options, options$read_options) } # Check all arguments are valid -check_csv_file_format_args <- function(args) { +check_csv_file_format_args <- function(args, partitioning = NULL) { + options <- list( parse_options = args$parse_options, convert_options = args$convert_options, @@ -223,7 +227,7 @@ check_csv_file_format_args <- function(args) { } if (is.null(args$read_options)) { - options$read_options <- do.call(csv_file_format_read_opts, args) + options$read_options <- do.call(csv_file_format_read_opts, c(args, list(partitioning = partitioning))) } else if (is.list(args$read_options)) { options$read_options <- do.call(CsvReadOptions$create, args$read_options) } @@ -339,7 +343,7 @@ check_ambiguous_options <- function(passed_opts, opts1, opts2) { } } -check_schema <- function(schema, column_names) { +check_schema <- function(schema, partitioning, column_names) { if (!is.null(schema) && !inherits(schema, "Schema")) { abort(paste0( "`schema` must be an object of class 'Schema' not '", @@ -348,7 +352,7 @@ check_schema <- function(schema, column_names) { )) } - schema_names <- names(schema) + schema_names <- setdiff(names(schema), names(partitioning)) if (!is.null(schema) && !identical(schema_names, column_names)) { missing_from_schema <- setdiff(column_names, schema_names) @@ -451,7 +455,8 @@ csv_file_format_convert_opts <- function(...) { do.call(CsvConvertOptions$create, opts) } -csv_file_format_read_opts <- function(schema = NULL, ...) { +csv_file_format_read_opts <- function(schema = NULL, partitioning = NULL, ...) { + opts <- list(...) # Filter out arguments meant for CsvParseOptions/CsvConvertOptions arrow_opts <- c(names(formals(CsvParseOptions$create)), "parse_options") @@ -477,9 +482,9 @@ csv_file_format_read_opts <- function(schema = NULL, ...) 
{ if (!is.null(schema) && null_or_true(opts[["column_names"]]) && null_or_true(opts[["col_names"]])) { if (any(is_readr_opt)) { - opts[["col_names"]] <- names(schema) + opts[["col_names"]] <- setdiff(names(schema), names(partitioning)) } else { - opts[["column_names"]] <- names(schema) + opts[["column_names"]] <- setdiff(names(schema), names(partitioning)) } } diff --git a/r/tests/testthat/test-dataset-csv.R b/r/tests/testthat/test-dataset-csv.R index c83c30ff904ff..ff1712646a472 100644 --- a/r/tests/testthat/test-dataset-csv.R +++ b/r/tests/testthat/test-dataset-csv.R @@ -593,3 +593,37 @@ test_that("CSVReadOptions field access", { expect_equal(options$block_size, 1048576L) expect_equal(options$encoding, "UTF-8") }) + +test_that("GH-34640 - CSV datasets are read in correctly when both schema and partitioning supplied", { + target_schema <- schema( + int = int32(), dbl = float32(), lgl = bool(), chr = utf8(), + fct = utf8(), ts = timestamp(unit = "s"), part = int8() + ) + + ds <- open_dataset( + csv_dir, + partitioning = schema(part = int32()), + format = "csv", + schema = target_schema, + skip = 1 + ) + expect_r6_class(ds$format, "CsvFileFormat") + expect_r6_class(ds$filesystem, "LocalFileSystem") + expect_identical(names(ds), c(names(df1), "part")) + expect_identical(names(collect(ds)), c(names(df1), "part")) + + expect_identical(dim(ds), c(20L, 7L)) + expect_equal(schema(ds), target_schema) + + expect_equal( + ds %>% + select(string = chr, integer = int, part) %>% + filter(integer > 6 & part == 5) %>% + collect() %>% + summarize(mean = mean(as.numeric(integer))), + df1 %>% + select(string = chr, integer = int) %>% + filter(integer > 6) %>% + summarize(mean = mean(integer)) + ) +}) From 3ceb2f1ed871afbecfc481e5cabfd5e846dc5cd6 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Thu, 28 Sep 2023 23:09:43 +0200 Subject: [PATCH 83/96] GH-37803: [Python][CI] Pin setuptools_scm to fix release verification scripts (#37930) Follow-up on https://github.com/apache/arrow/pull/37819, which missed one place to add a pin for the release verification scripts * Closes: #37803 Authored-by: Joris Van den Bossche Signed-off-by: Sutou Kouhei --- dev/release/verify-release-candidate.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh index 77b996766f78c..ae28ebe792404 100755 --- a/dev/release/verify-release-candidate.sh +++ b/dev/release/verify-release-candidate.sh @@ -666,7 +666,7 @@ test_python() { show_header "Build and test Python libraries" # Build and test Python - maybe_setup_virtualenv "cython<3" numpy setuptools_scm setuptools || exit 1 + maybe_setup_virtualenv "cython<3" numpy "setuptools_scm<8.0.0" setuptools || exit 1 maybe_setup_conda --file ci/conda_env_python.txt || exit 1 if [ "${USE_CONDA}" -gt 0 ]; then From befcc90defcc6b2fc35e8b3226b1ee38851e0cdc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Franti=C5=A1ek=20Ne=C4=8Das?= Date: Thu, 28 Sep 2023 23:40:54 +0200 Subject: [PATCH 84/96] GH-21815: [JS] Add support for Duration type (#37341) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ### Rationale for this change The `Duration` type is currently not supported and trying to deserialize a Table containing the type (e.g. using `tableFromIPC`) fails with `Unrecognized type` error. This PR aims to fix that. ### What changes are included in this PR? 
- definition of the `Duration` data type - updates to the visitor classes so that things like parsing work correctly - test coverage for the type - documentation update ### Are these changes tested? Yes, I extended the data generator with the new type so that the type is tested by the existing tests. ### Are there any user-facing changes? Yes, I've updated the documentation status page. I also noticed that it was outdated for JavaScript, i.e. there is already support for `Decimal` type so I updated this as well. Closes: https://github.com/apache/arrow/issues/21815 Closes: https://github.com/apache/arrow/issues/35439 * Closes: #21815 Lead-authored-by: František Necas Co-authored-by: ptaylor Signed-off-by: Dominik Moritz --- dev/archery/archery/integration/datagen.py | 3 +- docs/source/status.rst | 4 +- js/src/Arrow.dom.ts | 2 + js/src/Arrow.ts | 2 + js/src/builder.ts | 6 +-- js/src/builder/duration.ts | 46 ++++++++++++++++++++++ js/src/data.ts | 11 ++++++ js/src/enum.ts | 7 +++- js/src/interfaces.ts | 16 ++++++++ js/src/ipc/metadata/json.ts | 6 ++- js/src/ipc/metadata/message.ts | 7 +++- js/src/type.ts | 34 ++++++++++++++++ js/src/visitor.ts | 27 ++++++++++++- js/src/visitor/builderctor.ts | 6 +++ js/src/visitor/bytelength.ts | 5 ++- js/src/visitor/get.ts | 30 ++++++++++++++ js/src/visitor/indexof.ts | 11 ++++++ js/src/visitor/iterator.ts | 11 ++++++ js/src/visitor/jsontypeassembler.ts | 3 ++ js/src/visitor/jsonvectorassembler.ts | 6 ++- js/src/visitor/set.ts | 31 +++++++++++++++ js/src/visitor/typeassembler.ts | 6 +++ js/src/visitor/typecomparator.ts | 18 +++++++++ js/src/visitor/typector.ts | 5 +++ js/src/visitor/vectorassembler.ts | 6 ++- js/src/visitor/vectorloader.ts | 5 ++- js/test/data/tables.ts | 3 +- js/test/generate-test-data.ts | 20 +++++++++- js/test/unit/builders/builder-tests.ts | 4 ++ js/test/unit/generated-data-tests.ts | 4 ++ js/test/unit/visitor-tests.ts | 11 ++++++ 31 files changed, 338 insertions(+), 18 deletions(-) create mode 100644 js/src/builder/duration.ts diff --git a/dev/archery/archery/integration/datagen.py b/dev/archery/archery/integration/datagen.py index 299881c4b613a..8d0cc6b0b01a8 100644 --- a/dev/archery/archery/integration/datagen.py +++ b/dev/archery/archery/integration/datagen.py @@ -1805,8 +1805,7 @@ def _temp_path(): generate_datetime_case(), generate_duration_case() - .skip_tester('C#') - .skip_tester('JS'), # TODO(ARROW-5239): Intervals + JS + .skip_tester('C#'), generate_interval_case() .skip_tester('C#') diff --git a/docs/source/status.rst b/docs/source/status.rst index 6314fd4c8d31f..e2b3852e2229f 100644 --- a/docs/source/status.rst +++ b/docs/source/status.rst @@ -46,7 +46,7 @@ Data Types +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ | Decimal128 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ -| Decimal256 | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | | +| Decimal256 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ | Date32/64 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ @@ -54,7 +54,7 @@ Data Types +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ | Timestamp | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ -| Duration | ✓ | ✓ | ✓ | | | ✓ | ✓ | | +| Duration | ✓ | ✓ | ✓ | ✓ | | ✓ 
| ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ | Interval | ✓ | ✓ | ✓ | | | ✓ | ✓ | | +-------------------+-------+-------+-------+------------+-------+-------+-------+-------+ diff --git a/js/src/Arrow.dom.ts b/js/src/Arrow.dom.ts index 2fdef60c1fb55..451bf6acb6186 100644 --- a/js/src/Arrow.dom.ts +++ b/js/src/Arrow.dom.ts @@ -59,6 +59,7 @@ export { Union, DenseUnion, SparseUnion, Dictionary, Interval, IntervalDayTime, IntervalYearMonth, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, FixedSizeList, Map_, MapRow, Table, makeTable, tableFromArrays, @@ -86,6 +87,7 @@ export { FixedSizeListBuilder, FloatBuilder, Float16Builder, Float32Builder, Float64Builder, IntervalBuilder, IntervalDayTimeBuilder, IntervalYearMonthBuilder, + DurationBuilder, DurationSecondBuilder, DurationMillisecondBuilder, DurationMicrosecondBuilder, DurationNanosecondBuilder, IntBuilder, Int8Builder, Int16Builder, Int32Builder, Int64Builder, Uint8Builder, Uint16Builder, Uint32Builder, Uint64Builder, ListBuilder, MapBuilder, diff --git a/js/src/Arrow.ts b/js/src/Arrow.ts index 4a6394c266b1b..714861e764ccb 100644 --- a/js/src/Arrow.ts +++ b/js/src/Arrow.ts @@ -48,6 +48,7 @@ export { Union, DenseUnion, SparseUnion, Dictionary, Interval, IntervalDayTime, IntervalYearMonth, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, FixedSizeList, Map_ } from './type.js'; @@ -75,6 +76,7 @@ export { IntBuilder, Int8Builder, Int16Builder, Int32Builder, Int64Builder, Uint export { TimeBuilder, TimeSecondBuilder, TimeMillisecondBuilder, TimeMicrosecondBuilder, TimeNanosecondBuilder } from './builder/time.js'; export { TimestampBuilder, TimestampSecondBuilder, TimestampMillisecondBuilder, TimestampMicrosecondBuilder, TimestampNanosecondBuilder } from './builder/timestamp.js'; export { IntervalBuilder, IntervalDayTimeBuilder, IntervalYearMonthBuilder } from './builder/interval.js'; +export { DurationBuilder, DurationSecondBuilder, DurationMillisecondBuilder, DurationMicrosecondBuilder, DurationNanosecondBuilder } from './builder/duration.js'; export { Utf8Builder } from './builder/utf8.js'; export { BinaryBuilder } from './builder/binary.js'; export { ListBuilder } from './builder/list.js'; diff --git a/js/src/builder.ts b/js/src/builder.ts index 90fe3ddcc9477..93510eedf84ff 100644 --- a/js/src/builder.ts +++ b/js/src/builder.ts @@ -21,7 +21,7 @@ import { MapRow, kKeys } from './row/map.js'; import { DataType, strideForType, Float, Int, Decimal, FixedSizeBinary, - Date_, Time, Timestamp, Interval, + Date_, Time, Timestamp, Interval, Duration, Utf8, Binary, List, Map_, } from './type.js'; import { createIsValidFunction } from './builder/valid.js'; @@ -290,7 +290,7 @@ export abstract class Builder { } else if (valueOffsets = _offsets?.flush(length)) { // Variable-width primitives (Binary, Utf8), and Lists // Binary, Utf8 data = _values?.flush(_offsets.last()); - } else { // Fixed-width primitives (Int, Float, Decimal, Time, Timestamp, and Interval) + } else { // Fixed-width primitives (Int, Float, Decimal, Time, Timestamp, Duration and Interval) data = _values?.flush(length); } @@ -342,7 +342,7 @@ export abstract class Builder { (Builder.prototype as any)._isValid = () => true; /** @ignore */ -export abstract class FixedWidthBuilder extends Builder { +export abstract class FixedWidthBuilder extends Builder { constructor(opts: BuilderOptions) { super(opts); this._values = new DataBufferBuilder(new 
this.ArrayType(0), this.stride); diff --git a/js/src/builder/duration.ts b/js/src/builder/duration.ts new file mode 100644 index 0000000000000..968899ea55b91 --- /dev/null +++ b/js/src/builder/duration.ts @@ -0,0 +1,46 @@ + +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +import { FixedWidthBuilder } from '../builder.js'; +import { Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond } from '../type.js'; +import { setDuration, setDurationSecond, setDurationMillisecond, setDurationMicrosecond, setDurationNanosecond } from '../visitor/set.js'; + +/** @ignore */ +export class DurationBuilder extends FixedWidthBuilder { } + +(DurationBuilder.prototype as any)._setValue = setDuration; + +/** @ignore */ +export class DurationSecondBuilder extends DurationBuilder { } + +(DurationSecondBuilder.prototype as any)._setValue = setDurationSecond; + +/** @ignore */ +export class DurationMillisecondBuilder extends DurationBuilder { } + +(DurationMillisecondBuilder.prototype as any)._setValue = setDurationMillisecond; + +/** @ignore */ +export class DurationMicrosecondBuilder extends DurationBuilder { } + +(DurationMicrosecondBuilder.prototype as any)._setValue = setDurationMicrosecond; + +/** @ignore */ +export class DurationNanosecondBuilder extends DurationBuilder { } + +(DurationNanosecondBuilder.prototype as any)._setValue = setDurationNanosecond; diff --git a/js/src/data.ts b/js/src/data.ts index dc423cdb01e1c..1e9df71cff8a7 100644 --- a/js/src/data.ts +++ b/js/src/data.ts @@ -257,6 +257,7 @@ import { Int, Date_, Interval, + Duration, Time, Timestamp, Union, DenseUnion, SparseUnion, @@ -390,6 +391,13 @@ class MakeDataVisitor extends Visitor { const { ['length']: length = data.length / strideForType(type), ['nullCount']: nullCount = props['nullBitmap'] ? -1 : 0, } = props; return new Data(type, offset, length, nullCount, [undefined, data, nullBitmap]); } + public visitDuration(props: DurationDataProps) { + const { ['type']: type, ['offset']: offset = 0 } = props; + const nullBitmap = toUint8Array(props['nullBitmap']); + const data = toArrayBufferView(type.ArrayType, props['data']); + const { ['length']: length = data.length, ['nullCount']: nullCount = props['nullBitmap'] ? 
-1 : 0, } = props; + return new Data(type, offset, length, nullCount, [undefined, data, nullBitmap]); + } public visitFixedSizeList(props: FixedSizeListDataProps) { const { ['type']: type, ['offset']: offset = 0, ['child']: child = new MakeDataVisitor().visit({ type: type.valueType }) } = props; const nullBitmap = toUint8Array(props['nullBitmap']); @@ -424,6 +432,7 @@ interface Date_DataProps extends DataProps_ { data?: DataBuf interface TimeDataProps extends DataProps_ { data?: DataBuffer } interface TimestampDataProps extends DataProps_ { data?: DataBuffer } interface IntervalDataProps extends DataProps_ { data?: DataBuffer } +interface DurationDataProps extends DataProps_ { data?: DataBuffer } interface FixedSizeBinaryDataProps extends DataProps_ { data?: DataBuffer } interface BinaryDataProps extends DataProps_ { valueOffsets: ValueOffsetsBuffer; data?: DataBuffer } interface Utf8DataProps extends DataProps_ { valueOffsets: ValueOffsetsBuffer; data?: DataBuffer } @@ -446,6 +455,7 @@ export type DataProps = ( T extends Time /* */ ? TimeDataProps : T extends Timestamp /* */ ? TimestampDataProps : T extends Interval /* */ ? IntervalDataProps : + T extends Duration /* */ ? DurationDataProps : T extends FixedSizeBinary /* */ ? FixedSizeBinaryDataProps : T extends Binary /* */ ? BinaryDataProps : T extends Utf8 /* */ ? Utf8DataProps : @@ -471,6 +481,7 @@ export function makeData(props: Date_DataProps): Data; export function makeData(props: TimeDataProps): Data; export function makeData(props: TimestampDataProps): Data; export function makeData(props: IntervalDataProps): Data; +export function makeData(props: DurationDataProps): Data; export function makeData(props: FixedSizeBinaryDataProps): Data; export function makeData(props: BinaryDataProps): Data; export function makeData(props: Utf8DataProps): Data; diff --git a/js/src/enum.ts b/js/src/enum.ts index f5856bc06afbe..4e207dd37cec1 100644 --- a/js/src/enum.ts +++ b/js/src/enum.ts @@ -137,7 +137,7 @@ export enum MessageHeader { * nested type consisting of other data types, or another data type (e.g. a * timestamp encoded as an int64). * - * **Note**: Only enum values 0-17 (NONE through Map) are written to an Arrow + * **Note**: Only enum values 0-18 (NONE through Duration) are written to an Arrow * IPC payload. * * The rest of the values are specified here so TypeScript can narrow the type @@ -174,6 +174,7 @@ export enum Type { FixedSizeBinary = 15, /** Fixed-size binary. Each value occupies the same number of bytes */ FixedSizeList = 16, /** Fixed-size list. Each value occupies the same number of bytes */ Map = 17, /** Map of named logical types */ + Duration = 18, /** Measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds. 
*/ Dictionary = -1, /** Dictionary aka Category type */ Int8 = -2, @@ -201,6 +202,10 @@ export enum Type { SparseUnion = -24, IntervalDayTime = -25, IntervalYearMonth = -26, + DurationSecond = -27, + DurationMillisecond = -28, + DurationMicrosecond = -29, + DurationNanosecond = -30 } export enum BufferType { diff --git a/js/src/interfaces.ts b/js/src/interfaces.ts index 8d61295919046..95c5adbb2a25e 100644 --- a/js/src/interfaces.ts +++ b/js/src/interfaces.ts @@ -31,6 +31,7 @@ import type { IntBuilder, Int8Builder, Int16Builder, Int32Builder, Int64Builder, import type { TimeBuilder, TimeSecondBuilder, TimeMillisecondBuilder, TimeMicrosecondBuilder, TimeNanosecondBuilder } from './builder/time.js'; import type { TimestampBuilder, TimestampSecondBuilder, TimestampMillisecondBuilder, TimestampMicrosecondBuilder, TimestampNanosecondBuilder } from './builder/timestamp.js'; import type { IntervalBuilder, IntervalDayTimeBuilder, IntervalYearMonthBuilder } from './builder/interval.js'; +import type { DurationBuilder, DurationSecondBuilder, DurationMillisecondBuilder, DurationMicrosecondBuilder, DurationNanosecondBuilder } from './builder/duration.js'; import type { Utf8Builder } from './builder/utf8.js'; import type { BinaryBuilder } from './builder/binary.js'; import type { ListBuilder } from './builder/list.js'; @@ -222,6 +223,11 @@ export type TypeToDataType = { [Type.Interval]: type.Interval; [Type.IntervalDayTime]: type.IntervalDayTime; [Type.IntervalYearMonth]: type.IntervalYearMonth; + [Type.Duration]: type.Duration; + [Type.DurationSecond]: type.DurationSecond; + [Type.DurationMillisecond]: type.DurationMillisecond; + [Type.DurationMicrosecond]: type.DurationMicrosecond; + [Type.DurationNanosecond]: type.DurationNanosecond; [Type.Map]: type.Map_; [Type.List]: type.List; [Type.Struct]: type.Struct; @@ -270,6 +276,11 @@ type TypeToBuilder = { [Type.Interval]: IntervalBuilder; [Type.IntervalDayTime]: IntervalDayTimeBuilder; [Type.IntervalYearMonth]: IntervalYearMonthBuilder; + [Type.Duration]: DurationBuilder; + [Type.DurationSecond]: DurationSecondBuilder; + [Type.DurationMillisecond]: DurationMillisecondBuilder; + [Type.DurationMicrosecond]: DurationMicrosecondBuilder; + [Type.DurationNanosecond]: DurationNanosecondBuilder; [Type.Map]: MapBuilder; [Type.List]: ListBuilder; [Type.Struct]: StructBuilder; @@ -318,6 +329,11 @@ type DataTypeToBuilder = { [Type.Interval]: T extends type.Interval ? IntervalBuilder : never; [Type.IntervalDayTime]: T extends type.IntervalDayTime ? IntervalDayTimeBuilder : never; [Type.IntervalYearMonth]: T extends type.IntervalYearMonth ? IntervalYearMonthBuilder : never; + [Type.Duration]: T extends type.Duration ? DurationBuilder : never; + [Type.DurationSecond]: T extends type.DurationSecond ? DurationSecondBuilder : never; + [Type.DurationMillisecond]: T extends type.DurationMillisecond ? DurationMillisecondBuilder : never; + [Type.DurationMicrosecond]: T extends type.DurationMicrosecond ? DurationMicrosecondBuilder : never; + [Type.DurationNanosecond]: T extends type.DurationNanosecond ? DurationNanosecondBuilder : never; [Type.Map]: T extends type.Map_ ? MapBuilder : never; [Type.List]: T extends type.List ? ListBuilder : never; [Type.Struct]: T extends type.Struct ? 
StructBuilder : never; diff --git a/js/src/ipc/metadata/json.ts b/js/src/ipc/metadata/json.ts index e5995110f084b..f1f306730ddba 100644 --- a/js/src/ipc/metadata/json.ts +++ b/js/src/ipc/metadata/json.ts @@ -22,7 +22,7 @@ import { DataType, Dictionary, TimeBitWidth, Utf8, Binary, Decimal, FixedSizeBinary, List, FixedSizeList, Map_, Struct, Union, - Bool, Null, Int, Float, Date_, Time, Interval, Timestamp, IntBitWidth, Int32, TKeys, + Bool, Null, Int, Float, Date_, Time, Interval, Timestamp, IntBitWidth, Int32, TKeys, Duration, } from '../../type.js'; import { DictionaryBatch, RecordBatch, FieldNode, BufferRegion } from './message.js'; @@ -185,6 +185,10 @@ function typeFromJSON(f: any, children?: Field[]): DataType { const t = f['type']; return new Interval(IntervalUnit[t['unit']] as any); } + case 'duration': { + const t = f['type']; + return new Duration(TimeUnit[t['unit']] as any); + } case 'union': { const t = f['type']; const [m, ...ms] = (t['mode'] + '').toLowerCase(); diff --git a/js/src/ipc/metadata/message.ts b/js/src/ipc/metadata/message.ts index 6465d3d064720..27c9b92d6897b 100644 --- a/js/src/ipc/metadata/message.ts +++ b/js/src/ipc/metadata/message.ts @@ -36,6 +36,7 @@ import { Date as _Date } from '../../fb/date.js'; import { Time as _Time } from '../../fb/time.js'; import { Timestamp as _Timestamp } from '../../fb/timestamp.js'; import { Interval as _Interval } from '../../fb/interval.js'; +import { Duration as _Duration } from '../../fb/duration.js'; import { Union as _Union } from '../../fb/union.js'; import { FixedSizeBinary as _FixedSizeBinary } from '../../fb/fixed-size-binary.js'; import { FixedSizeList as _FixedSizeList } from '../../fb/fixed-size-list.js'; @@ -57,7 +58,7 @@ import { DataType, Dictionary, TimeBitWidth, Utf8, Binary, Decimal, FixedSizeBinary, List, FixedSizeList, Map_, Struct, Union, - Bool, Null, Int, Float, Date_, Time, Interval, Timestamp, IntBitWidth, Int32, TKeys, + Bool, Null, Int, Float, Date_, Time, Interval, Timestamp, IntBitWidth, Int32, TKeys, Duration, } from '../../type.js'; /** @@ -466,6 +467,10 @@ function decodeFieldType(f: _Field, children?: Field[]): DataType { const t = f.type(new _Interval())!; return new Interval(t.unit()); } + case Type['Duration']: { + const t = f.type(new _Duration())!; + return new Duration(t.unit()); + } case Type['Union']: { const t = f.type(new _Union())!; return new Union(t.mode(), t.typeIdsArray() || [], children || []); diff --git a/js/src/type.ts b/js/src/type.ts index 1dc90c47cbd10..34bbf45bca728 100644 --- a/js/src/type.ts +++ b/js/src/type.ts @@ -63,6 +63,7 @@ export abstract class DataType { construct /** @ignore */ export class IntervalYearMonth extends Interval_ { constructor() { super(IntervalUnit.YEAR_MONTH); } } +/** @ignore */ +type Durations = Type.Duration | Type.DurationSecond | Type.DurationMillisecond | Type.DurationMicrosecond | Type.DurationNanosecond; +/** @ignore */ +export interface Duration extends DataType { + TArray: BigInt64Array; + TValue: bigint; + ArrayType: BigInt64Array; +} + +/** @ignore */ +export class Duration extends DataType { + constructor(public readonly unit: TimeUnit) { + super(); + } + public get typeId() { return Type.Duration as T; } + public toString() { return `Duration<${TimeUnit[this.unit]}>`; } + protected static [Symbol.toStringTag] = ((proto: Duration) => { + (proto).unit = null; + (proto).ArrayType = BigInt64Array; + return proto[Symbol.toStringTag] = 'Duration'; + })(Duration.prototype); +} + +/** @ignore */ +export class DurationSecond extends Duration { 
constructor() { super(TimeUnit.SECOND); }} +/** @ignore */ +export class DurationMillisecond extends Duration { constructor() { super(TimeUnit.MILLISECOND); }} +/** @ignore */ +export class DurationMicrosecond extends Duration { constructor() { super(TimeUnit.MICROSECOND); }} +/** @ignore */ +export class DurationNanosecond extends Duration { constructor() { super(TimeUnit.NANOSECOND); }} + + /** @ignore */ export interface List extends DataType { TArray: Array; diff --git a/js/src/visitor.ts b/js/src/visitor.ts index 3be50a6d3eacf..c63640b038e47 100644 --- a/js/src/visitor.ts +++ b/js/src/visitor.ts @@ -16,7 +16,7 @@ // under the License. import { Type, Precision, DateUnit, TimeUnit, IntervalUnit, UnionMode } from './enum.js'; -import { DataType, Float, Int, Date_, Interval, Time, Timestamp, Union, } from './type.js'; +import { DataType, Float, Int, Date_, Interval, Time, Timestamp, Union, Duration } from './type.js'; export abstract class Visitor { public visitMany(nodes: any[], ...args: any[][]) { @@ -47,6 +47,7 @@ export abstract class Visitor { public visitUnion(_node: any, ..._args: any[]): any { return null; } public visitDictionary(_node: any, ..._args: any[]): any { return null; } public visitInterval(_node: any, ..._args: any[]): any { return null; } + public visitDuration(_node: any, ... _args: any[]): any { return null; } public visitFixedSizeList(_node: any, ..._args: any[]): any { return null; } public visitMap(_node: any, ..._args: any[]): any { return null; } } @@ -113,6 +114,11 @@ function getVisitFnByTypeId(visitor: Visitor, dtype: Type, throwIfNotFound = tru case Type.Interval: fn = visitor.visitInterval; break; case Type.IntervalDayTime: fn = visitor.visitIntervalDayTime || visitor.visitInterval; break; case Type.IntervalYearMonth: fn = visitor.visitIntervalYearMonth || visitor.visitInterval; break; + case Type.Duration: fn = visitor.visitDuration; break; + case Type.DurationSecond: fn = visitor.visitDurationSecond || visitor.visitDuration; break; + case Type.DurationMillisecond: fn = visitor.visitDurationMillisecond || visitor.visitDuration; break; + case Type.DurationMicrosecond: fn = visitor.visitDurationMicrosecond || visitor.visitDuration; break; + case Type.DurationNanosecond: fn = visitor.visitDurationNanosecond || visitor.visitDuration; break; case Type.FixedSizeList: fn = visitor.visitFixedSizeList; break; case Type.Map: fn = visitor.visitMap; break; } @@ -180,6 +186,15 @@ function inferDType(type: T): Type { } // @ts-ignore return Type.Interval; + case Type.Duration: + switch ((type as any as Duration).unit) { + case TimeUnit.SECOND: return Type.DurationSecond; + case TimeUnit.MILLISECOND: return Type.DurationMillisecond; + case TimeUnit.MICROSECOND: return Type.DurationMicrosecond; + case TimeUnit.NANOSECOND: return Type.DurationNanosecond; + } + // @ts-ignore + return Type.Duration; case Type.Map: return Type.Map; case Type.List: return Type.List; case Type.Struct: return Type.Struct; @@ -239,6 +254,11 @@ export interface Visitor { visitInterval(node: any, ...args: any[]): any; visitIntervalDayTime?(node: any, ...args: any[]): any; visitIntervalYearMonth?(node: any, ...args: any[]): any; + visitDuration(node: any, ...args: any[]): any; + visitDurationSecond(node: any, ...args: any[]): any; + visitDurationMillisecond(node: any, ...args: any[]): any; + visitDurationMicrosecond(node: any, ...args: any[]): any; + visitDurationNanosecond(node: any, ...args: any[]): any; visitFixedSizeList(node: any, ...args: any[]): any; visitMap(node: any, ...args: any[]): any; 
} @@ -270,3 +290,8 @@ export interface Visitor { (Visitor.prototype as any).visitSparseUnion = null; (Visitor.prototype as any).visitIntervalDayTime = null; (Visitor.prototype as any).visitIntervalYearMonth = null; +(Visitor.prototype as any).visitDuration = null; +(Visitor.prototype as any).visitDurationSecond = null; +(Visitor.prototype as any).visitDurationMillisecond = null; +(Visitor.prototype as any).visitDurationMicrosecond = null; +(Visitor.prototype as any).visitDurationNanosecond = null; diff --git a/js/src/visitor/builderctor.ts b/js/src/visitor/builderctor.ts index 9ce9ae4d4a797..2d20f2a8efd5c 100644 --- a/js/src/visitor/builderctor.ts +++ b/js/src/visitor/builderctor.ts @@ -30,6 +30,7 @@ import { FixedSizeBinaryBuilder } from '../builder/fixedsizebinary.js'; import { FixedSizeListBuilder } from '../builder/fixedsizelist.js'; import { FloatBuilder, Float16Builder, Float32Builder, Float64Builder } from '../builder/float.js'; import { IntervalBuilder, IntervalDayTimeBuilder, IntervalYearMonthBuilder } from '../builder/interval.js'; +import { DurationBuilder, DurationSecondBuilder, DurationMillisecondBuilder, DurationMicrosecondBuilder, DurationNanosecondBuilder } from '../builder/duration.js'; import { IntBuilder, Int8Builder, Int16Builder, Int32Builder, Int64Builder, Uint8Builder, Uint16Builder, Uint32Builder, Uint64Builder } from '../builder/int.js'; import { ListBuilder } from '../builder/list.js'; import { MapBuilder } from '../builder/map.js'; @@ -91,6 +92,11 @@ export class GetBuilderCtor extends Visitor { public visitInterval() { return IntervalBuilder; } public visitIntervalDayTime() { return IntervalDayTimeBuilder; } public visitIntervalYearMonth() { return IntervalYearMonthBuilder; } + public visitDuration() { return DurationBuilder; } + public visitDurationSecond() { return DurationSecondBuilder; } + public visitDurationMillisecond() { return DurationMillisecondBuilder; } + public visitDurationMicrosecond() { return DurationMicrosecondBuilder; } + public visitDurationNanosecond() { return DurationNanosecondBuilder; } public visitFixedSizeList() { return FixedSizeListBuilder; } public visitMap() { return MapBuilder; } } diff --git a/js/src/visitor/bytelength.ts b/js/src/visitor/bytelength.ts index 862808ad54ee9..72d6148a52fd8 100644 --- a/js/src/visitor/bytelength.ts +++ b/js/src/visitor/bytelength.ts @@ -25,7 +25,7 @@ import { TypeToDataType } from '../interfaces.js'; import { Type, TimeUnit, UnionMode } from '../enum.js'; import { DataType, Dictionary, - Float, Int, Date_, Interval, Time, Timestamp, + Float, Int, Date_, Interval, Time, Timestamp, Duration, Bool, Null, Utf8, Binary, Decimal, FixedSizeBinary, List, FixedSizeList, Map_, Struct, Union, DenseUnion, SparseUnion, } from '../type.js'; @@ -75,6 +75,9 @@ export class GetByteLengthVisitor extends Visitor { public visitInterval(data: Data, _: number) { return (data.type.unit + 1) * 4; } + public visitDuration(____: Data, _: number) { + return 8; + } public visitStruct(data: Data, i: number) { return data.children.reduce((total, child) => total + instance.visit(child, i), 0); } diff --git a/js/src/visitor/get.ts b/js/src/visitor/get.ts index 12f8325470bac..5aaaedf51a37e 100644 --- a/js/src/visitor/get.ts +++ b/js/src/visitor/get.ts @@ -34,6 +34,7 @@ import { Interval, IntervalDayTime, IntervalYearMonth, Time, TimeSecond, TimeMillisecond, TimeMicrosecond, TimeNanosecond, Timestamp, TimestampSecond, TimestampMillisecond, TimestampMicrosecond, TimestampNanosecond, + Duration, DurationSecond, DurationMillisecond, 
DurationMicrosecond, DurationNanosecond, Union, DenseUnion, SparseUnion, } from '../type.js'; @@ -84,6 +85,11 @@ export interface GetVisitor extends Visitor { visitInterval(data: Data, index: number): T['TValue'] | null; visitIntervalDayTime(data: Data, index: number): T['TValue'] | null; visitIntervalYearMonth(data: Data, index: number): T['TValue'] | null; + visitDuration(data: Data, index: number): T['TValue'] | null; + visitDurationSecond(data: Data, index: number): T['TValue'] | null; + visitDurationMillisecond(data: Data, index: number): T['TValue'] | null; + visitDurationMicrosecond(data: Data, index: number): T['TValue'] | null; + visitDurationNanosecond(data: Data, index: number): T['TValue'] | null; visitFixedSizeList(data: Data, index: number): T['TValue'] | null; visitMap(data: Data, index: number): T['TValue'] | null; } @@ -279,6 +285,25 @@ const getIntervalYearMonth = ({ values }: Data, return int32s; }; +/** @ignore */ +const getDurationSecond = ({ values }: Data, index: number): T['TValue'] => values[index]; +/** @ignore */ +const getDurationMillisecond = ({ values }: Data, index: number): T['TValue'] => values[index]; +/** @ignore */ +const getDurationMicrosecond = ({ values }: Data, index: number): T['TValue'] => values[index]; +/** @ignore */ +const getDurationNanosecond = ({ values }: Data, index: number): T['TValue'] => values[index]; +/* istanbul ignore next */ +/** @ignore */ +const getDuration = (data: Data, index: number): T['TValue'] => { + switch (data.type.unit) { + case TimeUnit.SECOND: return getDurationSecond(data as Data, index); + case TimeUnit.MILLISECOND: return getDurationMillisecond(data as Data, index); + case TimeUnit.MICROSECOND: return getDurationMicrosecond(data as Data, index); + case TimeUnit.NANOSECOND: return getDurationNanosecond(data as Data, index); + } +}; + /** @ignore */ const getFixedSizeList = (data: Data, index: number): T['TValue'] => { const { stride, children } = data; @@ -328,6 +353,11 @@ GetVisitor.prototype.visitDictionary = wrapGet(getDictionary); GetVisitor.prototype.visitInterval = wrapGet(getInterval); GetVisitor.prototype.visitIntervalDayTime = wrapGet(getIntervalDayTime); GetVisitor.prototype.visitIntervalYearMonth = wrapGet(getIntervalYearMonth); +GetVisitor.prototype.visitDuration = wrapGet(getDuration); +GetVisitor.prototype.visitDurationSecond = wrapGet(getDurationSecond); +GetVisitor.prototype.visitDurationMillisecond = wrapGet(getDurationMillisecond); +GetVisitor.prototype.visitDurationMicrosecond = wrapGet(getDurationMicrosecond); +GetVisitor.prototype.visitDurationNanosecond = wrapGet(getDurationNanosecond); GetVisitor.prototype.visitFixedSizeList = wrapGet(getFixedSizeList); GetVisitor.prototype.visitMap = wrapGet(getMap); diff --git a/js/src/visitor/indexof.ts b/js/src/visitor/indexof.ts index 654134c6dff04..28dcff20d3bd3 100644 --- a/js/src/visitor/indexof.ts +++ b/js/src/visitor/indexof.ts @@ -31,6 +31,7 @@ import { Interval, IntervalDayTime, IntervalYearMonth, Time, TimeSecond, TimeMillisecond, TimeMicrosecond, TimeNanosecond, Timestamp, TimestampSecond, TimestampMillisecond, TimestampMicrosecond, TimestampNanosecond, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, Union, DenseUnion, SparseUnion, } from '../type.js'; @@ -81,6 +82,11 @@ export interface IndexOfVisitor extends Visitor { visitInterval(data: Data, value: T['TValue'] | null, index?: number): number; visitIntervalDayTime(data: Data, value: T['TValue'] | null, index?: number): number; 
visitIntervalYearMonth(data: Data, value: T['TValue'] | null, index?: number): number; + visitDuration(data: Data, value: T['TValue'] | null, index?: number): number; + visitDurationSecond(data: Data, value: T['TValue'] | null, index?: number): number; + visitDurationMillisecond(data: Data, value: T['TValue'] | null, index?: number): number; + visitDurationMicrosecond(data: Data, value: T['TValue'] | null, index?: number): number; + visitDurationNanosecond(data: Data, value: T['TValue'] | null, index?: number): number; visitFixedSizeList(data: Data, value: T['TValue'] | null, index?: number): number; visitMap(data: Data, value: T['TValue'] | null, index?: number): number; } @@ -191,6 +197,11 @@ IndexOfVisitor.prototype.visitDictionary = indexOfValue; IndexOfVisitor.prototype.visitInterval = indexOfValue; IndexOfVisitor.prototype.visitIntervalDayTime = indexOfValue; IndexOfVisitor.prototype.visitIntervalYearMonth = indexOfValue; +IndexOfVisitor.prototype.visitDuration = indexOfValue; +IndexOfVisitor.prototype.visitDurationSecond = indexOfValue; +IndexOfVisitor.prototype.visitDurationMillisecond = indexOfValue; +IndexOfVisitor.prototype.visitDurationMicrosecond = indexOfValue; +IndexOfVisitor.prototype.visitDurationNanosecond = indexOfValue; IndexOfVisitor.prototype.visitFixedSizeList = indexOfValue; IndexOfVisitor.prototype.visitMap = indexOfValue; diff --git a/js/src/visitor/iterator.ts b/js/src/visitor/iterator.ts index 48021a78e86f6..e38bb907695d0 100644 --- a/js/src/visitor/iterator.ts +++ b/js/src/visitor/iterator.ts @@ -28,6 +28,7 @@ import { Interval, IntervalDayTime, IntervalYearMonth, Time, TimeSecond, TimeMillisecond, TimeMicrosecond, TimeNanosecond, Timestamp, TimestampSecond, TimestampMillisecond, TimestampMicrosecond, TimestampNanosecond, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, Union, DenseUnion, SparseUnion, } from '../type.js'; import { ChunkedIterator } from '../util/chunk.js'; @@ -79,6 +80,11 @@ export interface IteratorVisitor extends Visitor { visitInterval(vector: Vector): IterableIterator; visitIntervalDayTime(vector: Vector): IterableIterator; visitIntervalYearMonth(vector: Vector): IterableIterator; + visitDuration(vector: Vector): IterableIterator; + visitDurationSecond(vector: Vector): IterableIterator; + visitDurationMillisecond(vector: Vector): IterableIterator; + visitDurationMicrosecond(vector: Vector): IterableIterator; + visitDurationNanosecond(vector: Vector): IterableIterator; visitFixedSizeList(vector: Vector): IterableIterator; visitMap(vector: Vector): IterableIterator; } @@ -177,6 +183,11 @@ IteratorVisitor.prototype.visitDictionary = vectorIterator; IteratorVisitor.prototype.visitInterval = vectorIterator; IteratorVisitor.prototype.visitIntervalDayTime = vectorIterator; IteratorVisitor.prototype.visitIntervalYearMonth = vectorIterator; +IteratorVisitor.prototype.visitDuration = vectorIterator; +IteratorVisitor.prototype.visitDurationSecond = vectorIterator; +IteratorVisitor.prototype.visitDurationMillisecond = vectorIterator; +IteratorVisitor.prototype.visitDurationMicrosecond = vectorIterator; +IteratorVisitor.prototype.visitDurationNanosecond = vectorIterator; IteratorVisitor.prototype.visitFixedSizeList = vectorIterator; IteratorVisitor.prototype.visitMap = vectorIterator; diff --git a/js/src/visitor/jsontypeassembler.ts b/js/src/visitor/jsontypeassembler.ts index d83edfc24fbd8..6e6cfb07413c3 100644 --- a/js/src/visitor/jsontypeassembler.ts +++ b/js/src/visitor/jsontypeassembler.ts @@ -63,6 +63,9 @@ 
export class JSONTypeAssembler extends Visitor { public visitInterval({ typeId, unit }: T) { return { 'name': ArrowType[typeId].toLowerCase(), 'unit': IntervalUnit[unit] }; } + public visitDuration({ typeId, unit }: T) { + return { 'name': ArrowType[typeId].toLocaleLowerCase(), 'unit': TimeUnit[unit]}; + } public visitList({ typeId }: T) { return { 'name': ArrowType[typeId].toLowerCase() }; } diff --git a/js/src/visitor/jsonvectorassembler.ts b/js/src/visitor/jsonvectorassembler.ts index 7a617f4afe2c4..55a6b4e2ea390 100644 --- a/js/src/visitor/jsonvectorassembler.ts +++ b/js/src/visitor/jsonvectorassembler.ts @@ -26,7 +26,7 @@ import { UnionMode, DateUnit, TimeUnit } from '../enum.js'; import { BitIterator, getBit, getBool } from '../util/bit.js'; import { DataType, - Float, Int, Date_, Interval, Time, Timestamp, Union, + Float, Int, Date_, Interval, Time, Timestamp, Union, Duration, Bool, Null, Utf8, Binary, Decimal, FixedSizeBinary, List, FixedSizeList, Map_, Struct, IntArray, } from '../type.js'; @@ -52,6 +52,7 @@ export interface JSONVectorAssembler extends Visitor { visitStruct(data: Data): { children: any[] }; visitUnion(data: Data): { children: any[]; TYPE_ID: number[] }; visitInterval(data: Data): { DATA: number[] }; + visitDuration(data: Data): { DATA: string[] }; visitFixedSizeList(data: Data): { children: any[] }; visitMap(data: Data): { children: any[] }; } @@ -146,6 +147,9 @@ export class JSONVectorAssembler extends Visitor { public visitInterval(data: Data) { return { 'DATA': [...data.values] }; } + public visitDuration(data: Data) { + return { 'DATA': [...bigNumsToStrings(data.values, 2)]}; + } public visitFixedSizeList(data: Data) { return { 'children': this.visitMany(data.type.children, data.children) diff --git a/js/src/visitor/set.ts b/js/src/visitor/set.ts index c2d4319911afe..1a0eddc556899 100644 --- a/js/src/visitor/set.ts +++ b/js/src/visitor/set.ts @@ -32,6 +32,7 @@ import { Interval, IntervalDayTime, IntervalYearMonth, Time, TimeSecond, TimeMillisecond, TimeMicrosecond, TimeNanosecond, Timestamp, TimestampSecond, TimestampMillisecond, TimestampMicrosecond, TimestampNanosecond, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, Union, DenseUnion, SparseUnion, } from '../type.js'; @@ -82,6 +83,11 @@ export interface SetVisitor extends Visitor { visitInterval(data: Data, index: number, value: T['TValue']): void; visitIntervalDayTime(data: Data, index: number, value: T['TValue']): void; visitIntervalYearMonth(data: Data, index: number, value: T['TValue']): void; + visitDuration(data: Data, index: number, value: T['TValue']): void; + visitDurationSecond(data: Data, index: number, value: T['TValue']): void; + visitDurationMillisecond(data: Data, index: number, value: T['TValue']): void; + visitDurationMicrosecond(data: Data, index: number, value: T['TValue']): void; + visitDurationNanosecond(data: Data, index: number, value: T['TValue']): void; visitFixedSizeList(data: Data, index: number, value: T['TValue']): void; visitMap(data: Data, index: number, value: T['TValue']): void; } @@ -308,6 +314,26 @@ export const setIntervalDayTime = ({ values }: Data({ values }: Data, index: number, value: T['TValue']): void => { values[index] = (value[0] * 12) + (value[1] % 12); }; +/** @ignore */ +export const setDurationSecond = ({ values }: Data, index: number, value: T['TValue']): void => { values[index] = value; }; +/** @ignore */ +export const setDurationMillisecond = ({ values }: Data, index: number, value: T['TValue']): void => { 
values[index] = value; }; +/** @ignore */ +export const setDurationMicrosecond = ({ values }: Data, index: number, value: T['TValue']): void => { values[index] = value; }; +/** @ignore */ +export const setDurationNanosecond = ({ values }: Data, index: number, value: T['TValue']): void => { values[index] = value; }; +/* istanbul ignore next */ +/** @ignore */ +export const setDuration = (data: Data, index: number, value: T['TValue']): void => { + switch (data.type.unit) { + case TimeUnit.SECOND: return setDurationSecond(data as Data, index, value as DurationSecond['TValue']); + case TimeUnit.MILLISECOND: return setDurationMillisecond(data as Data, index, value as DurationMillisecond['TValue']); + case TimeUnit.MICROSECOND: return setDurationMicrosecond(data as Data, index, value as DurationMicrosecond['TValue']); + case TimeUnit.NANOSECOND: return setDurationNanosecond(data as Data, index, value as DurationNanosecond['TValue']); + } +}; + + /** @ignore */ const setFixedSizeList = (data: Data, index: number, value: T['TValue']): void => { const { stride } = data; @@ -364,6 +390,11 @@ SetVisitor.prototype.visitDictionary = wrapSet(setDictionary); SetVisitor.prototype.visitInterval = wrapSet(setIntervalValue); SetVisitor.prototype.visitIntervalDayTime = wrapSet(setIntervalDayTime); SetVisitor.prototype.visitIntervalYearMonth = wrapSet(setIntervalYearMonth); +SetVisitor.prototype.visitDuration = wrapSet(setDuration); +SetVisitor.prototype.visitDurationSecond = wrapSet(setDurationSecond); +SetVisitor.prototype.visitDurationMillisecond = wrapSet(setDurationMillisecond); +SetVisitor.prototype.visitDurationMicrosecond = wrapSet(setDurationMicrosecond); +SetVisitor.prototype.visitDurationNanosecond = wrapSet(setDurationNanosecond); SetVisitor.prototype.visitFixedSizeList = wrapSet(setFixedSizeList); SetVisitor.prototype.visitMap = wrapSet(setMap); diff --git a/js/src/visitor/typeassembler.ts b/js/src/visitor/typeassembler.ts index c84e3930f64f5..c2262d20531b9 100644 --- a/js/src/visitor/typeassembler.ts +++ b/js/src/visitor/typeassembler.ts @@ -32,6 +32,7 @@ import { Date } from '../fb/date.js'; import { Time } from '../fb/time.js'; import { Timestamp } from '../fb/timestamp.js'; import { Interval } from '../fb/interval.js'; +import { Duration } from '../fb/duration.js'; import { List } from '../fb/list.js'; import { Struct_ as Struct } from '../fb/struct-.js'; import { Union } from '../fb/union.js'; @@ -109,6 +110,11 @@ export class TypeAssembler extends Visitor { Interval.addUnit(b, node.unit); return Interval.endInterval(b); } + public visitDuration(node: T, b: Builder) { + Duration.startDuration(b); + Duration.addUnit(b, node.unit); + return Duration.endDuration(b); + } public visitList(_node: T, b: Builder) { List.startList(b); return List.endList(b); diff --git a/js/src/visitor/typecomparator.ts b/js/src/visitor/typecomparator.ts index a77c4020961ce..1de8e218dae4f 100644 --- a/js/src/visitor/typecomparator.ts +++ b/js/src/visitor/typecomparator.ts @@ -28,6 +28,7 @@ import { Interval, IntervalDayTime, IntervalYearMonth, Time, TimeSecond, TimeMillisecond, TimeMicrosecond, TimeNanosecond, Timestamp, TimestampSecond, TimestampMillisecond, TimestampMicrosecond, TimestampNanosecond, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, Union, DenseUnion, SparseUnion, } from '../type.js'; @@ -77,6 +78,11 @@ export interface TypeComparator extends Visitor { visitInterval(type: T, other?: DataType | null): other is T; visitIntervalDayTime(type: T, other?: DataType | 
null): other is T; visitIntervalYearMonth(type: T, other?: DataType | null): other is T; + visitDuration(type: T, other?: DataType | null): other is T; + visitDurationSecond(type: T, other?: DataType | null): other is T; + visitDurationMillisecond(type: T, other?: DataType | null): other is T; + visitDurationMicrosecond(type: T, other?: DataType | null): other is T; + visitDurationNanosecond(type: T, other?: DataType | null): other is T; visitFixedSizeList(type: T, other?: DataType | null): other is T; visitMap(type: T, other?: DataType | null): other is T; } @@ -202,6 +208,13 @@ function compareInterval(type: T, other?: DataType | null): ); } +function compareDuration(type: T, other?: DataType | null): other is T { + return (type === other) || ( + compareConstructor(type, other) && + type.unit === other.unit + ); +} + function compareFixedSizeList(type: T, other?: DataType | null): other is T { return (type === other) || ( compareConstructor(type, other) && @@ -261,6 +274,11 @@ TypeComparator.prototype.visitDictionary = compareDictionary; TypeComparator.prototype.visitInterval = compareInterval; TypeComparator.prototype.visitIntervalDayTime = compareInterval; TypeComparator.prototype.visitIntervalYearMonth = compareInterval; +TypeComparator.prototype.visitDuration = compareDuration; +TypeComparator.prototype.visitDurationSecond = compareDuration; +TypeComparator.prototype.visitDurationMillisecond = compareDuration; +TypeComparator.prototype.visitDurationMicrosecond = compareDuration; +TypeComparator.prototype.visitDurationNanosecond = compareDuration; TypeComparator.prototype.visitFixedSizeList = compareFixedSizeList; TypeComparator.prototype.visitMap = compareMap; diff --git a/js/src/visitor/typector.ts b/js/src/visitor/typector.ts index c825a61dbadfb..077f66592fbfb 100644 --- a/js/src/visitor/typector.ts +++ b/js/src/visitor/typector.ts @@ -74,6 +74,11 @@ export class GetDataTypeConstructor extends Visitor { public visitInterval() { return type.Interval; } public visitIntervalDayTime() { return type.IntervalDayTime; } public visitIntervalYearMonth() { return type.IntervalYearMonth; } + public visitDuration() { return type.Duration; } + public visitDurationSecond() { return type.DurationSecond; } + public visitDurationMillisecond() { return type.DurationMillisecond; } + public visitDurationMicrosecond() { return type.DurationMicrosecond; } + public visitDurationNanosecond() { return type.DurationNanosecond; } public visitFixedSizeList() { return type.FixedSizeList; } public visitMap() { return type.Map_; } } diff --git a/js/src/visitor/vectorassembler.ts b/js/src/visitor/vectorassembler.ts index dbf778c4c3631..949463272e718 100644 --- a/js/src/visitor/vectorassembler.ts +++ b/js/src/visitor/vectorassembler.ts @@ -26,7 +26,7 @@ import { packBools, truncateBitmap } from '../util/bit.js'; import { BufferRegion, FieldNode } from '../ipc/metadata/message.js'; import { DataType, Dictionary, - Float, Int, Date_, Interval, Time, Timestamp, Union, + Float, Int, Date_, Interval, Time, Timestamp, Union, Duration, Bool, Null, Utf8, Binary, Decimal, FixedSizeBinary, List, FixedSizeList, Map_, Struct, } from '../type.js'; @@ -51,6 +51,7 @@ export interface VectorAssembler extends Visitor { visitStruct(data: Data): this; visitUnion(data: Data): this; visitInterval(data: Data): this; + visitDuration(data: Data): this; visitFixedSizeList(data: Data): this; visitMap(data: Data): this; } @@ -195,7 +196,7 @@ function assembleBoolVector(this: VectorAssembler, data: Data } /** @ignore */ -function 
assembleFlatVector(this: VectorAssembler, data: Data) { +function assembleFlatVector(this: VectorAssembler, data: Data) { return addBuffer.call(this, data.values.subarray(0, data.length * data.stride)); } @@ -243,5 +244,6 @@ VectorAssembler.prototype.visitList = assembleListVector; VectorAssembler.prototype.visitStruct = assembleNestedVector; VectorAssembler.prototype.visitUnion = assembleUnion; VectorAssembler.prototype.visitInterval = assembleFlatVector; +VectorAssembler.prototype.visitDuration = assembleFlatVector; VectorAssembler.prototype.visitFixedSizeList = assembleListVector; VectorAssembler.prototype.visitMap = assembleListVector; diff --git a/js/src/visitor/vectorloader.ts b/js/src/visitor/vectorloader.ts index cb4bc2829274f..db34edad9a1c1 100644 --- a/js/src/visitor/vectorloader.ts +++ b/js/src/visitor/vectorloader.ts @@ -115,6 +115,9 @@ export class VectorLoader extends Visitor { public visitInterval(type: T, { length, nullCount } = this.nextFieldNode()) { return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), data: this.readData(type) }); } + public visitDuration(type: T, { length, nullCount } = this.nextFieldNode()) { + return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), data: this.readData(type) }); + } public visitFixedSizeList(type: T, { length, nullCount } = this.nextFieldNode()) { return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), 'child': this.visit(type.children[0]) }); } @@ -157,7 +160,7 @@ export class JSONVectorLoader extends VectorLoader { const { sources } = this; if (DataType.isTimestamp(type)) { return toArrayBufferView(Uint8Array, Int64.convertArray(sources[offset] as string[])); - } else if ((DataType.isInt(type) || DataType.isTime(type)) && type.bitWidth === 64) { + } else if ((DataType.isInt(type) || DataType.isTime(type)) && type.bitWidth === 64 || DataType.isDuration(type)) { return toArrayBufferView(Uint8Array, Int64.convertArray(sources[offset] as string[])); } else if (DataType.isDate(type) && type.unit === DateUnit.MILLISECOND) { return toArrayBufferView(Uint8Array, Int64.convertArray(sources[offset] as string[])); diff --git a/js/test/data/tables.ts b/js/test/data/tables.ts index e4d859e0a69b7..28aed7e4feccf 100644 --- a/js/test/data/tables.ts +++ b/js/test/data/tables.ts @@ -30,7 +30,8 @@ const valueVectorGeneratorNames = [ 'float16', 'float32', 'float64', 'utf8', 'binary', 'fixedSizeBinary', 'dateDay', 'dateMillisecond', 'timestampSecond', 'timestampMillisecond', 'timestampMicrosecond', 'timestampNanosecond', 'timeSecond', 'timeMillisecond', 'timeMicrosecond', 'timeNanosecond', 'decimal', - 'dictionary', 'intervalDayTime', 'intervalYearMonth' + 'dictionary', 'intervalDayTime', 'intervalYearMonth', + 'durationSecond', 'durationMillisecond', 'durationMicrosecond', 'durationNanosecond', ]; const vectorGeneratorNames = [...valueVectorGeneratorNames, ...listVectorGeneratorNames, ...nestedVectorGeneratorNames]; diff --git a/js/test/generate-test-data.ts b/js/test/generate-test-data.ts index a03b22c54c770..15fb715a31f95 100644 --- a/js/test/generate-test-data.ts +++ b/js/test/generate-test-data.ts @@ -36,6 +36,7 @@ import { Union, DenseUnion, SparseUnion, Dictionary, Interval, IntervalDayTime, IntervalYearMonth, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, FixedSizeList, Map_, DateUnit, TimeUnit, UnionMode, @@ -58,6 +59,7 @@ interface TestDataVectorGenerator extends Visitor { visit(type: T, 
length?: number, nullCount?: number): GeneratedVector; visit(type: T, length?: number, nullCount?: number): GeneratedVector; visit(type: T, length?: number, nullCount?: number): GeneratedVector; + visit(type: T, length?: number, nullCount?: number): GeneratedVector; visit(type: T, length?: number, nullCount?: number, child?: Vector): GeneratedVector; visit(type: T, length?: number, nullCount?: number, child?: Vector): GeneratedVector; visit(type: T, length?: number, nullCount?: number, dictionary?: Vector): GeneratedVector; @@ -84,6 +86,7 @@ interface TestDataVectorGenerator extends Visitor { visitUnion: typeof generateUnion; visitDictionary: typeof generateDictionary; visitInterval: typeof generateInterval; + visitDuration: typeof generateDuration; visitFixedSizeList: typeof generateFixedSizeList; visitMap: typeof generateMap; } @@ -108,6 +111,7 @@ TestDataVectorGenerator.prototype.visitStruct = generateStruct; TestDataVectorGenerator.prototype.visitUnion = generateUnion; TestDataVectorGenerator.prototype.visitDictionary = generateDictionary; TestDataVectorGenerator.prototype.visitInterval = generateInterval; +TestDataVectorGenerator.prototype.visitDuration = generateDuration; TestDataVectorGenerator.prototype.visitFixedSizeList = generateFixedSizeList; TestDataVectorGenerator.prototype.visitMap = generateMap; @@ -230,11 +234,15 @@ export const sparseUnion = (length = 100, nullCount = Math.trunc(length * 0.2), export const dictionary = (length = 100, nullCount = Math.trunc(length * 0.2), dict: T = new Utf8(), keys: TKey = new Int32()) => vectorGenerator.visit(new Dictionary(dict, keys), length, nullCount); export const intervalDayTime = (length = 100, nullCount = Math.trunc(length * 0.2)) => vectorGenerator.visit(new IntervalDayTime(), length, nullCount); export const intervalYearMonth = (length = 100, nullCount = Math.trunc(length * 0.2)) => vectorGenerator.visit(new IntervalYearMonth(), length, nullCount); +export const durationSecond = (length = 100, nullCount = Math.trunc(length * 0.2)) => vectorGenerator.visit(new DurationSecond(), length, nullCount); +export const durationMillisecond = (length = 100, nullCount = Math.trunc(length * 0.2)) => vectorGenerator.visit(new DurationMillisecond(), length, nullCount); +export const durationMicrosecond = (length = 100, nullCount = Math.trunc(length * 0.2)) => vectorGenerator.visit(new DurationMicrosecond(), length, nullCount); +export const durationNanosecond = (length = 100, nullCount = Math.trunc(length * 0.2)) => vectorGenerator.visit(new DurationNanosecond(), length, nullCount); export const fixedSizeList = (length = 100, nullCount = Math.trunc(length * 0.2), listSize = 2, child = defaultListChild) => vectorGenerator.visit(new FixedSizeList(listSize, child), length, nullCount); export const map = (length = 100, nullCount = Math.trunc(length * 0.2), child: Field> = defaultMapChild()) => vectorGenerator.visit(new Map_(child), length, nullCount); export const vecs = { - null_, bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, utf8, binary, fixedSizeBinary, dateDay, dateMillisecond, timestampSecond, timestampMillisecond, timestampMicrosecond, timestampNanosecond, timeSecond, timeMillisecond, timeMicrosecond, timeNanosecond, decimal, list, struct, denseUnion, sparseUnion, dictionary, intervalDayTime, intervalYearMonth, fixedSizeList, map + null_, bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, utf8, binary, fixedSizeBinary, dateDay, dateMillisecond, 
timestampSecond, timestampMillisecond, timestampMicrosecond, timestampNanosecond, timeSecond, timeMillisecond, timeMicrosecond, timeNanosecond, decimal, list, struct, denseUnion, sparseUnion, dictionary, intervalDayTime, intervalYearMonth, fixedSizeList, map, durationSecond, durationMillisecond, durationMicrosecond, durationNanosecond } as { [k: string]: (...args: any[]) => any }; function generateNull(this: TestDataVectorGenerator, type: T, length = 100): GeneratedVector { @@ -421,6 +429,16 @@ function generateInterval(this: TestDataVectorGenerator, typ return { values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, data })]) }; } +function generateDuration(this: TestDataVectorGenerator, type: T, length = 100, nullCount = Math.trunc(length * 0.2)): GeneratedVector { + const nullBitmap = createBitmap(length, nullCount); + const multiple = type.unit === TimeUnit.NANOSECOND ? 1000000000 : + type.unit === TimeUnit.MICROSECOND ? 1000000 : + type.unit === TimeUnit.MILLISECOND ? 1000 : 1; + const values: bigint[] = []; + const data = createTime64(length, nullBitmap, multiple, values); + return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, data })]) }; +} + function generateList(this: TestDataVectorGenerator, type: T, length = 100, nullCount = Math.trunc(length * 0.2), child = this.visit(type.children[0].type, length * 3, nullCount * 3)): GeneratedVector { const childVec = child.vector; const nullBitmap = createBitmap(length, nullCount); diff --git a/js/test/unit/builders/builder-tests.ts b/js/test/unit/builders/builder-tests.ts index a73183a7a5d47..b261e4f815e3a 100644 --- a/js/test/unit/builders/builder-tests.ts +++ b/js/test/unit/builders/builder-tests.ts @@ -64,6 +64,10 @@ describe('Generated Test Data', () => { describe('DictionaryBuilder', () => { validateBuilder(generate.dictionary); }); describe('IntervalDayTimeBuilder', () => { validateBuilder(generate.intervalDayTime); }); describe('IntervalYearMonthBuilder', () => { validateBuilder(generate.intervalYearMonth); }); + describe('DurationSecondBuilder', () => { validateBuilder(generate.durationSecond); }); + describe('DurationMillisecondBuilder', () => { validateBuilder(generate.durationMillisecond); }); + describe('DurationMicrosecondBuilder', () => { validateBuilder(generate.durationMicrosecond); }); + describe('DurationNanosecondBuilder', () => { validateBuilder(generate.durationNanosecond); }); describe('FixedSizeListBuilder', () => { validateBuilder(generate.fixedSizeList); }); describe('MapBuilder', () => { validateBuilder(generate.map); }); }); diff --git a/js/test/unit/generated-data-tests.ts b/js/test/unit/generated-data-tests.ts index 90cf0d598aa6f..d64c7c188d3ed 100644 --- a/js/test/unit/generated-data-tests.ts +++ b/js/test/unit/generated-data-tests.ts @@ -58,6 +58,10 @@ describe('Generated Test Data', () => { describe('Dictionary', () => { validateVector(generate.dictionary()); }); describe('IntervalDayTime', () => { validateVector(generate.intervalDayTime()); }); describe('IntervalYearMonth', () => { validateVector(generate.intervalYearMonth()); }); + describe('DurationSecond', () => { validateVector(generate.durationSecond()); }); + describe('DurationMillisecond', () => { validateVector(generate.durationMillisecond()); }); + describe('DurationMicrosecond', () => { validateVector(generate.durationMicrosecond()); }); + describe('DurationNanosecond', () => { validateVector(generate.durationNanosecond()); }); describe('FixedSizeList', () => { 
validateVector(generate.fixedSizeList()); }); describe('Map', () => { validateVector(generate.map()); }); }); diff --git a/js/test/unit/visitor-tests.ts b/js/test/unit/visitor-tests.ts index 645fcc60f8d90..8a7ba1ed778aa 100644 --- a/js/test/unit/visitor-tests.ts +++ b/js/test/unit/visitor-tests.ts @@ -25,6 +25,7 @@ import { Interval, IntervalDayTime, IntervalYearMonth, Time, TimeSecond, TimeMillisecond, TimeMicrosecond, TimeNanosecond, Timestamp, TimestampSecond, TimestampMillisecond, TimestampMicrosecond, TimestampNanosecond, + Duration, DurationSecond, DurationMillisecond, DurationMicrosecond, DurationNanosecond, Union, DenseUnion, SparseUnion, } from 'apache-arrow'; @@ -46,6 +47,7 @@ class BasicVisitor extends Visitor { public visitUnion(type: T) { return (this.type = type); } public visitDictionary(type: T) { return (this.type = type); } public visitInterval(type: T) { return (this.type = type); } + public visitDuration(type: T) { return (this.type = type); } public visitFixedSizeList(type: T) { return (this.type = type); } public visitMap(type: T) { return (this.type = type); } } @@ -86,6 +88,10 @@ class FeatureVisitor extends Visitor { public visitDictionary(type: T) { return (this.type = type); } public visitIntervalDayTime(type: T) { return (this.type = type); } public visitIntervalYearMonth(type: T) { return (this.type = type); } + public visitDurationSecond(type: T) { return (this.type = type); } + public visitDurationMillisecond(type: T) { return (this.type = type); } + public visitDurationMicrosecond(type: T) { return (this.type = type); } + public visitDurationNanosecond(type: T) { return (this.type = type); } public visitFixedSizeList(type: T) { return (this.type = type); } public visitMap(type: T) { return (this.type = type); } } @@ -109,6 +115,7 @@ describe('Visitor', () => { test(`visits Union types`, () => validateBasicVisitor(new Union(0, [] as any[], [] as any[]))); test(`visits Dictionary types`, () => validateBasicVisitor(new Dictionary(null as any, null as any))); test(`visits Interval types`, () => validateBasicVisitor(new Interval(0))); + test(`visits Duration types`, () => validateBasicVisitor(new Duration(0))); test(`visits FixedSizeList types`, () => validateBasicVisitor(new FixedSizeList(2, null as any))); test(`visits Map types`, () => validateBasicVisitor(new Map_(new Field('', new Struct<{ key: Utf8; value: Int }>([ new Field('key', new Utf8()), new Field('value', new Int8()) @@ -158,6 +165,10 @@ describe('Visitor', () => { test(`visits IntervalDayTime types`, () => validateFeatureVisitor(new IntervalDayTime())); test(`visits IntervalYearMonth types`, () => validateFeatureVisitor(new IntervalYearMonth())); test(`visits FixedSizeList types`, () => validateFeatureVisitor(new FixedSizeList(2, null as any))); + test(`visits DurationSecond types`, () => validateFeatureVisitor(new DurationSecond())); + test(`visits DurationMillisecond types`, () => validateFeatureVisitor(new DurationMillisecond())); + test(`visits DurationMicrosecond types`, () => validateFeatureVisitor(new DurationMicrosecond())); + test(`visits DurationNanosecond types`, () => validateFeatureVisitor(new DurationNanosecond())); test(`visits Map types`, () => validateFeatureVisitor(new Map_(new Field('', new Struct<{ key: Utf8; value: Int }>([ new Field('key', new Utf8()), new Field('value', new Int8()) ] as any[]))))); From 80b351cc99881dc864c9dc8850a9e67379a3f657 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 29 Sep 2023 09:53:39 +0100 Subject: [PATCH 85/96] MINOR: [R] Bump versions file 
for navigating between docs versions (#37928) ### Rationale for this change Versions files needed bumping with the 13.0.0.1 release ### What changes are included in this PR? Increasing version numbers ### Are these changes tested? No ### Are there any user-facing changes? No Authored-by: Nic Crane Signed-off-by: Nic Crane --- r/pkgdown/assets/versions.html | 5 +++-- r/pkgdown/assets/versions.json | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/r/pkgdown/assets/versions.html b/r/pkgdown/assets/versions.html index 31f393a27785d..8ba513a98c85b 100644 --- a/r/pkgdown/assets/versions.html +++ b/r/pkgdown/assets/versions.html @@ -1,7 +1,8 @@
-12.0.1.9000 (dev)
-12.0.1.1 (release)
+13.0.0.9000 (dev)
+13.0.0.1 (release)
+12.0.1.1
 11.0.0.3
 10.0.1
 9.0.0
diff --git a/r/pkgdown/assets/versions.json b/r/pkgdown/assets/versions.json index 565f67b9730a4..b7c6984e3c660 100644 --- a/r/pkgdown/assets/versions.json +++ b/r/pkgdown/assets/versions.json @@ -4,7 +4,7 @@ "version": "dev/" }, { - "name": "13.0.0 (release)", + "name": "13.0.0.1 (release)", "version": "" }, { From c703b874417d1165b8afc0a42265d5513d5786d7 Mon Sep 17 00:00:00 2001 From: Jacob Wujciak-Jens Date: Fri, 29 Sep 2023 11:20:59 +0200 Subject: [PATCH 86/96] MINOR: [CI][Dev] Fix crossbow badge url (#37946) GitHub has changed the badge URL, so all crossbow badges were showing up as grey. Authored-by: Jacob Wujciak-Jens Signed-off-by: Sutou Kouhei --- dev/archery/archery/crossbow/reports.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/archery/archery/crossbow/reports.py b/dev/archery/archery/crossbow/reports.py index 1cf19841c6939..ea10e75ad3478 100644 --- a/dev/archery/archery/crossbow/reports.py +++ b/dev/archery/archery/crossbow/reports.py @@ -284,7 +284,7 @@ class CommentReport(Report): 'github': _markdown_badge.format( title='Github Actions', badge=( - 'https://github.com/{repo}/workflows/Crossbow/' + 'https://github.com/{repo}/actions/workflows/crossbow.yml/' 'badge.svg?branch={branch}' ), ), From 141559816eea8474ed96680a60eceae574d499c3 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 29 Sep 2023 13:31:38 +0100 Subject: [PATCH 87/96] GH-37950: [R] tests fail on R < 4.0 due to test calling data.frame() without specifying stringsAsFactors=FALSE (#37951) ### Rationale for this change Tests fail on R < 4.0 builds due to the change in the default value of the `data.frame()` parameter `stringsAsFactors` between older and newer versions of R. ### What changes are included in this PR? Update a test using `data.frame()` to explicitly specify the value of `stringsAsFactors` as `FALSE`. ### Are these changes tested? No ### Are there any user-facing changes? No * Closes: #37950 Authored-by: Nic Crane Signed-off-by: Nic Crane --- r/tests/testthat/test-schema.R | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/r/tests/testthat/test-schema.R b/r/tests/testthat/test-schema.R index b1dc06592955e..15342add38fae 100644 --- a/r/tests/testthat/test-schema.R +++ b/r/tests/testthat/test-schema.R @@ -300,7 +300,11 @@ test_that("schema extraction", { expect_equal(schema(example_data), tbl$schema) expect_equal(schema(tbl), tbl$schema) - expect_equal(schema(data.frame(a = 1, a = "x", check.names = FALSE)), schema(a = double(), a = string())) + expect_equal( + schema(data.frame(a = 1, a = "x", check.names = FALSE, stringsAsFactors = FALSE)), + schema(a = double(), a = string()) + ) + expect_equal(schema(data.frame()), schema()) ds <- InMemoryDataset$create(example_data)
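For context on the R behavior this patch works around: `data.frame()` defaulted to `stringsAsFactors = TRUE` before R 4.0.0 and to `FALSE` from 4.0.0 onward, so the unmodified test sees a factor column on older R. A minimal sketch of the difference (illustrative only, not part of the patch):

    df <- data.frame(a = 1, a = "x", check.names = FALSE)
    class(df[[2]])
    # R < 4.0:  "factor"    -- schema() then infers a dictionary type, not utf8
    # R >= 4.0: "character" -- schema() infers utf8, matching schema(a = double(), a = string())

    # Setting stringsAsFactors = FALSE explicitly makes the result the same on every R version
    df <- data.frame(a = 1, a = "x", check.names = FALSE, stringsAsFactors = FALSE)
    class(df[[2]])  # always "character"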
From 72c6497dd02f1d0c3caeff54fdc7bf4c3b846a70 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 29 Sep 2023 13:32:41 +0100 Subject: [PATCH 88/96] GH-37813: [R] add quoted_na argument to open_delim_dataset() (#37828) ### Rationale for this change The `open_delim_dataset()` family of functions was implemented to have the same arguments as the `read_delim_arrow()` functions where possible, but `quoted_na` was missed. ### What changes are included in this PR? Adding `quoted_na` to those functions. ### Are these changes tested? Yes ### Are there any user-facing changes? Yes. **This PR includes breaking changes to public APIs.** Empty strings in input datasets are now read in by `open_delim_dataset()` and its derivatives as NAs by default, not as empty strings. * Closes: #37813 Authored-by: Nic Crane Signed-off-by: Nic Crane
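A quick sketch of the new argument's effect, mirroring the test added in test-dataset-csv.R below (the temporary directory and data are illustrative only):

    dst_dir <- tempfile()
    dir.create(dst_dir)
    writeLines("text,num\none,1\ntwo,2\n,3\nfour,4", file.path(dst_dir, "data.csv"))

    # Default (quoted_na = TRUE): empty fields are read as missing values,
    # matching read_csv_arrow()
    dplyr::collect(open_csv_dataset(dst_dir, quoted_na = TRUE))$text   # "one" "two" NA "four"

    # quoted_na = FALSE keeps them as empty strings
    dplyr::collect(open_csv_dataset(dst_dir, quoted_na = FALSE))$text  # "one" "two" "" "four"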
{ is_arrow_opt <- !is.na(match(opt_names, arrow_opts)) is_readr_opt <- !is.na(match(opt_names, readr_opts)) - check_ambiguous_options(opt_names, arrow_opts, readr_opts) null_or_true <- function(x) { diff --git a/r/R/dataset.R b/r/R/dataset.R index b7728ff897fff..9d91839c220bb 100644 --- a/r/R/dataset.R +++ b/r/R/dataset.R @@ -240,7 +240,6 @@ open_dataset <- function(sources, #' @section Options currently supported by [read_delim_arrow()] which are not supported here: #' * `file` (instead, please specify files in `sources`) #' * `col_select` (instead, subset columns after dataset creation) -#' * `quoted_na` #' * `as_data_frame` (instead, convert to data frame after dataset creation) #' * `parse_options` #' @@ -276,7 +275,8 @@ open_delim_dataset <- function(sources, skip = 0L, convert_options = NULL, read_options = NULL, - timestamp_parsers = NULL) { + timestamp_parsers = NULL, + quoted_na = TRUE) { open_dataset( sources = sources, schema = schema, @@ -296,7 +296,8 @@ open_delim_dataset <- function(sources, skip = skip, convert_options = convert_options, read_options = read_options, - timestamp_parsers = timestamp_parsers + timestamp_parsers = timestamp_parsers, + quoted_na = quoted_na ) } @@ -318,7 +319,8 @@ open_csv_dataset <- function(sources, skip = 0L, convert_options = NULL, read_options = NULL, - timestamp_parsers = NULL) { + timestamp_parsers = NULL, + quoted_na = TRUE) { mc <- match.call() mc$delim <- "," mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow")) @@ -343,7 +345,8 @@ open_tsv_dataset <- function(sources, skip = 0L, convert_options = NULL, read_options = NULL, - timestamp_parsers = NULL) { + timestamp_parsers = NULL, + quoted_na = TRUE) { mc <- match.call() mc$delim <- "\t" mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow")) diff --git a/r/man/open_delim_dataset.Rd b/r/man/open_delim_dataset.Rd index 2bfd047040a8b..cf08302cc6436 100644 --- a/r/man/open_delim_dataset.Rd +++ b/r/man/open_delim_dataset.Rd @@ -24,7 +24,8 @@ open_delim_dataset( skip = 0L, convert_options = NULL, read_options = NULL, - timestamp_parsers = NULL + timestamp_parsers = NULL, + quoted_na = TRUE ) open_csv_dataset( @@ -44,7 +45,8 @@ open_csv_dataset( skip = 0L, convert_options = NULL, read_options = NULL, - timestamp_parsers = NULL + timestamp_parsers = NULL, + quoted_na = TRUE ) open_tsv_dataset( @@ -64,7 +66,8 @@ open_tsv_dataset( skip = 0L, convert_options = NULL, read_options = NULL, - timestamp_parsers = NULL + timestamp_parsers = NULL, + quoted_na = TRUE ) } \arguments{ @@ -178,6 +181,11 @@ starting from the beginning of this vector. Possible values are: \item a character vector of \link[base:strptime]{strptime} parse strings \item a list of \link{TimestampParser} objects }} + +\item{quoted_na}{Should missing values inside quotes be treated as missing +values (the default) or strings. (Note that this is different from the +the Arrow C++ default for the corresponding convert option, +\code{strings_can_be_null}.)} } \description{ A wrapper around \link{open_dataset} which explicitly includes parameters mirroring \code{\link[=read_csv_arrow]{read_csv_arrow()}}, @@ -189,7 +197,6 @@ for opening single files and functions for opening datasets. 
\itemize{ \item \code{file} (instead, please specify files in \code{sources}) \item \code{col_select} (instead, subset columns after dataset creation) -\item \code{quoted_na} \item \code{as_data_frame} (instead, convert to data frame after dataset creation) \item \code{parse_options} } diff --git a/r/tests/testthat/test-dataset-csv.R b/r/tests/testthat/test-dataset-csv.R index ff1712646a472..e8e7c61fc8848 100644 --- a/r/tests/testthat/test-dataset-csv.R +++ b/r/tests/testthat/test-dataset-csv.R @@ -220,7 +220,7 @@ test_that("readr parse options", { # With not yet supported readr parse options expect_error( - open_dataset(tsv_dir, partitioning = "part", delim = "\t", quoted_na = TRUE), + open_dataset(tsv_dir, partitioning = "part", delim = "\t", col_select = "integer"), "supported" ) @@ -253,7 +253,7 @@ test_that("readr parse options", { tsv_dir, partitioning = "part", format = "text", - quo = "\"", + del = "," ), "Ambiguous" ) @@ -561,6 +561,16 @@ test_that("open_delim_dataset params passed through to open_dataset", { expect_named(ds, c("int", "dbl", "lgl", "chr", "fct", "ts")) + # quoted_na + dst_dir <- make_temp_dir() + dst_file <- file.path(dst_dir, "data.csv") + writeLines("text,num\none,1\ntwo,2\n,3\nfour,4", dst_file) + ds <- open_csv_dataset(dst_dir, quoted_na = TRUE) %>% collect() + expect_equal(ds$text, c("one", "two", NA, "four")) + + ds <- open_csv_dataset(dst_dir, quoted_na = FALSE) %>% collect() + expect_equal(ds$text, c("one", "two", "", "four")) + # timestamp_parsers skip("GH-33708: timestamp_parsers don't appear to be working properly") From a0041027902ac71f00572d5b31022ee98502ed6a Mon Sep 17 00:00:00 2001 From: James Duong Date: Fri, 29 Sep 2023 06:11:12 -0700 Subject: [PATCH 89/96] GH-37702: [Java] Add vector validation consistent with C++ (#37942) ### Rationale for this change Make vector validation code more consistent with C++. Add missing checks and have the entry point be the same so that the code is easier to read/write when working with both languages. ### What changes are included in this PR? Make vector validation more consistent with Array::Validate() in C++: * Add validate() and validateFull() instance methods to vectors. * Validate that VarCharVector and LargeVarCharVector contents are valid UTF-8. * Validate that DecimalVector and Decimal256Vector contents fit within the supplied precision and scale. * Validate that NullVectors contain only nulls. * Validate that FixedSizeBinaryVector values have the correct length. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. 
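A minimal usage sketch of the new entry points (not part of the diff itself; the allocator setup and values are illustrative, mirroring the added `TestValidateVectorFull` cases):

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

public class ValidateExample {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector vector = new VarCharVector("v", allocator)) {
      vector.setSafe(0, "abc".getBytes(StandardCharsets.UTF_8));
      vector.setValueCount(1);
      vector.validate();     // cheap structural checks (buffer capacities, metadata)
      vector.validateFull(); // also walks the data: offsets, UTF-8 contents, etc.
    }
  }
}
```

Both methods delegate to `ValueVectorUtility` and throw `ValidateUtil.ValidateException` on failure, as exercised by the new tests.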
* Closes: #37702 Authored-by: James Duong Signed-off-by: David Li --- .../arrow/vector/BaseFixedWidthVector.java | 7 ++ .../vector/BaseLargeVariableWidthVector.java | 7 ++ .../arrow/vector/BaseVariableWidthVector.java | 7 ++ .../apache/arrow/vector/Decimal256Vector.java | 13 ++++ .../apache/arrow/vector/DecimalVector.java | 13 ++++ .../arrow/vector/FixedSizeBinaryVector.java | 13 ++++ .../arrow/vector/LargeVarCharVector.java | 12 +++ .../org/apache/arrow/vector/ValueVector.java | 9 +++ .../apache/arrow/vector/VarCharVector.java | 12 +++ .../arrow/vector/util/DecimalUtility.java | 10 ++- .../org/apache/arrow/vector/util/Text.java | 50 ++++++++++--- .../validate/ValidateVectorDataVisitor.java | 5 ++ .../vector/validate/TestValidateVector.java | 14 ++++ .../validate/TestValidateVectorFull.java | 74 +++++++++++++++++++ 14 files changed, 233 insertions(+), 13 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java index 223ae9aa8cb1c..04a038a0b5dfd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java @@ -550,6 +550,13 @@ private void setReaderAndWriterIndex() { } } + /** + * Validate the scalar values held by this vector. + */ + public void validateScalars() { + // No validation by default. + } + /** * Construct a transfer pair of this vector and another vector of same type. * @param ref name of the target vector diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseLargeVariableWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseLargeVariableWidthVector.java index 90694db830cd6..4d5a8a5119c53 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseLargeVariableWidthVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseLargeVariableWidthVector.java @@ -643,6 +643,13 @@ public ArrowBuf[] getBuffers(boolean clear) { return buffers; } + /** + * Validate the scalar values held by this vector. + */ + public void validateScalars() { + // No validation by default. + } + /** * Construct a transfer pair of this vector and another vector of same type. * @param ref name of the target vector diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java index 2a89590bf8440..d7f5ff05a935d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java @@ -685,6 +685,13 @@ public ArrowBuf[] getBuffers(boolean clear) { return buffers; } + /** + * Validate the scalar values held by this vector. + */ + public void validateScalars() { + // No validation by default. + } + /** * Construct a transfer pair of this vector and another vector of same type. 
* @param ref name of the target vector diff --git a/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java b/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java index 70a895ff40496..79a9badc3955d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/Decimal256Vector.java @@ -35,6 +35,7 @@ import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.DecimalUtility; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.validate.ValidateUtil; /** @@ -527,6 +528,18 @@ public void setSafe(int index, int isSet, long start, ArrowBuf buffer) { set(index, isSet, start, buffer); } + @Override + public void validateScalars() { + for (int i = 0; i < getValueCount(); ++i) { + BigDecimal value = getObject(i); + if (value != null) { + ValidateUtil.validateOrThrow(DecimalUtility.checkPrecisionAndScaleNoThrow(value, getPrecision(), getScale()), + "Invalid value for Decimal256Vector at position " + i + ". Value does not fit in precision " + + getPrecision() + " and scale " + getScale() + "."); + } + } + } + /*----------------------------------------------------------------* | | | vector transfer | diff --git a/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java b/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java index 6a3ec60afc52e..d1a3bfc3afb10 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java @@ -35,6 +35,7 @@ import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.DecimalUtility; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.validate.ValidateUtil; /** * DecimalVector implements a fixed width vector (16 bytes) of @@ -526,6 +527,18 @@ public void setSafe(int index, int isSet, long start, ArrowBuf buffer) { set(index, isSet, start, buffer); } + @Override + public void validateScalars() { + for (int i = 0; i < getValueCount(); ++i) { + BigDecimal value = getObject(i); + if (value != null) { + ValidateUtil.validateOrThrow(DecimalUtility.checkPrecisionAndScaleNoThrow(value, getPrecision(), getScale()), + "Invalid value for DecimalVector at position " + i + ". 
Value does not fit in precision " + + getPrecision() + " and scale " + getScale() + "."); + } + } + } + /*----------------------------------------------------------------* | | | vector transfer | diff --git a/java/vector/src/main/java/org/apache/arrow/vector/FixedSizeBinaryVector.java b/java/vector/src/main/java/org/apache/arrow/vector/FixedSizeBinaryVector.java index 3ce2bb77ccc55..967d560d78dea 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/FixedSizeBinaryVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/FixedSizeBinaryVector.java @@ -31,6 +31,7 @@ import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.validate.ValidateUtil; /** * FixedSizeBinaryVector implements a fixed width vector of @@ -320,6 +321,18 @@ public static byte[] get(final ArrowBuf buffer, final int index, final int byteW return dst; } + @Override + public void validateScalars() { + for (int i = 0; i < getValueCount(); ++i) { + byte[] value = get(i); + if (value != null) { + ValidateUtil.validateOrThrow(value.length == byteWidth, + "Invalid value for FixedSizeBinaryVector at position " + i + ". The length was " + + value.length + " but the length of each element should be " + byteWidth + "."); + } + } + } + /*----------------------------------------------------------------* | | | vector transfer | diff --git a/java/vector/src/main/java/org/apache/arrow/vector/LargeVarCharVector.java b/java/vector/src/main/java/org/apache/arrow/vector/LargeVarCharVector.java index 1f8d9b7d3a85c..e9472c9f2c71e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/LargeVarCharVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/LargeVarCharVector.java @@ -27,6 +27,7 @@ import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.Text; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.validate.ValidateUtil; /** * LargeVarCharVector implements a variable width vector of VARCHAR @@ -261,6 +262,17 @@ public void setSafe(int index, Text text) { setSafe(index, text.getBytes(), 0, text.getLength()); } + @Override + public void validateScalars() { + for (int i = 0; i < getValueCount(); ++i) { + byte[] value = get(i); + if (value != null) { + ValidateUtil.validateOrThrow(Text.validateUTF8NoThrow(value), + "Non-UTF-8 data in LargeVarCharVector at position " + i + "."); + } + } + } + /*----------------------------------------------------------------* | | | vector transfer | diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index aa29c29314e33..462b512c65436 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -29,6 +29,7 @@ import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.util.ValueVectorUtility; /** * An abstraction that is used to store a sequence of values in an individual column. @@ -282,4 +283,12 @@ public interface ValueVector extends Closeable, Iterable<ValueVector> { * @return the name of the vector. 
*/ String getName(); + + default void validate() { + ValueVectorUtility.validate(this); + } + + default void validateFull() { + ValueVectorUtility.validateFull(this); + } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java b/java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java index bc5c68b29f310..2c83893819a1e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java @@ -29,6 +29,7 @@ import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.Text; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.validate.ValidateUtil; /** * VarCharVector implements a variable width vector of VARCHAR @@ -261,6 +262,17 @@ public void setSafe(int index, Text text) { setSafe(index, text.getBytes(), 0, text.getLength()); } + @Override + public void validateScalars() { + for (int i = 0; i < getValueCount(); ++i) { + byte[] value = get(i); + if (value != null) { + ValidateUtil.validateOrThrow(Text.validateUTF8NoThrow(value), + "Non-UTF-8 data in VarCharVector at position " + i + "."); + } + } + } + /*----------------------------------------------------------------* | | | vector transfer | diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java index 137ac746f4aee..a81169b8f7d73 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java @@ -95,11 +95,19 @@ public static boolean checkPrecisionAndScale(BigDecimal value, int vectorPrecisi } if (value.precision() > vectorPrecision) { throw new UnsupportedOperationException("BigDecimal precision can not be greater than that in the Arrow " + - "vector: " + value.precision() + " > " + vectorPrecision); + "vector: " + value.precision() + " > " + vectorPrecision); } return true; } + /** + * Check that the BigDecimal scale equals the vectorScale and that the BigDecimal precision is + * less than or equal to the vectorPrecision. Return true if so, otherwise return false. + */ + public static boolean checkPrecisionAndScaleNoThrow(BigDecimal value, int vectorPrecision, int vectorScale) { + return value.scale() == vectorScale && value.precision() <= vectorPrecision; + } + /** * Check that the decimal scale equals the vectorScale and that the decimal precision is * less than or equal to the vectorPrecision. If not, then an UnsupportedOperationException is diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java index b479305c6e39b..778af0ca956df 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java @@ -30,6 +30,7 @@ import java.text.CharacterIterator; import java.text.StringCharacterIterator; import java.util.Arrays; +import java.util.Optional; import com.fasterxml.jackson.core.JsonGenerationException; import com.fasterxml.jackson.core.JsonGenerator; @@ -466,6 +467,16 @@ public static ByteBuffer encode(String string, boolean replace) private static final int TRAIL_BYTE = 2; + /** + * Check if a byte array contains valid utf-8. + * + * @param utf8 byte array + * @return true if the input is valid UTF-8. False otherwise. 
+ */ + public static boolean validateUTF8NoThrow(byte[] utf8) { + return !validateUTF8Internal(utf8, 0, utf8.length).isPresent(); + } + /** * Check if a byte array contains valid utf-8. * * @param utf8 byte array @@ -484,8 +495,22 @@ public static void validateUTF8(byte[] utf8) throws MalformedInputException { * @param len the length of the byte sequence * @throws MalformedInputException if the byte array contains invalid bytes */ - public static void validateUTF8(byte[] utf8, int start, int len) - throws MalformedInputException { + public static void validateUTF8(byte[] utf8, int start, int len) throws MalformedInputException { + Optional<Integer> result = validateUTF8Internal(utf8, start, len); + if (result.isPresent()) { + throw new MalformedInputException(result.get()); + } + } + + /** + * Check to see if a byte array is valid utf-8. + * + * @param utf8 the array of bytes + * @param start the offset of the first byte in the array + * @param len the length of the byte sequence + * @return the position where a malformed byte occurred or Optional.empty() if the byte array was valid UTF-8. + */ + private static Optional<Integer> validateUTF8Internal(byte[] utf8, int start, int len) { int count = start; int leadByte = 0; int length = 0; @@ -501,51 +526,51 @@ public static void validateUTF8(byte[] utf8, int start, int len) switch (length) { case 0: // check for ASCII if (leadByte > 0x7F) { - throw new MalformedInputException(count); + return Optional.of(count); } break; case 1: if (leadByte < 0xC2 || leadByte > 0xDF) { - throw new MalformedInputException(count); + return Optional.of(count); } state = TRAIL_BYTE_1; break; case 2: if (leadByte < 0xE0 || leadByte > 0xEF) { - throw new MalformedInputException(count); + return Optional.of(count); } state = TRAIL_BYTE_1; break; case 3: if (leadByte < 0xF0 || leadByte > 0xF4) { - throw new MalformedInputException(count); + return Optional.of(count); } state = TRAIL_BYTE_1; break; default: // too long! Longest valid UTF-8 is 4 bytes (lead + three) // or if < 0 we got a trail byte in the lead byte position - throw new MalformedInputException(count); + return Optional.of(count); } // switch (length) break; case TRAIL_BYTE_1: if (leadByte == 0xF0 && aByte < 0x90) { - throw new MalformedInputException(count); + return Optional.of(count); } if (leadByte == 0xF4 && aByte > 0x8F) { - throw new MalformedInputException(count); + return Optional.of(count); } if (leadByte == 0xE0 && aByte < 0xA0) { - throw new MalformedInputException(count); + return Optional.of(count); } if (leadByte == 0xED && aByte > 0x9F) { - throw new MalformedInputException(count); + return Optional.of(count); } // falls through to regular trail-byte test!! 
case TRAIL_BYTE: if (aByte < 0x80 || aByte > 0xBF) { - throw new MalformedInputException(count); + return Optional.of(count); } if (--length == 0) { state = LEAD_BYTE; @@ -558,6 +583,7 @@ public static void validateUTF8(byte[] utf8, int start, int len) } // switch (state) count++; } + return Optional.empty(); } /** diff --git a/java/vector/src/main/java/org/apache/arrow/vector/validate/ValidateVectorDataVisitor.java b/java/vector/src/main/java/org/apache/arrow/vector/validate/ValidateVectorDataVisitor.java index cdeb4f1eaa1ca..6d33be7a0dbac 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/validate/ValidateVectorDataVisitor.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/validate/ValidateVectorDataVisitor.java @@ -85,18 +85,21 @@ private void validateTypeBuffer(ArrowBuf typeBuf, int valueCount) { @Override public Void visit(BaseFixedWidthVector vector, Void value) { + vector.validateScalars(); return null; } @Override public Void visit(BaseVariableWidthVector vector, Void value) { validateOffsetBuffer(vector, vector.getValueCount()); + vector.validateScalars(); return null; } @Override public Void visit(BaseLargeVariableWidthVector vector, Void value) { validateLargeOffsetBuffer(vector, vector.getValueCount()); + vector.validateScalars(); return null; } @@ -169,6 +172,8 @@ public Void visit(DenseUnionVector vector, Void value) { @Override public Void visit(NullVector vector, Void value) { + ValidateUtil.validateOrThrow(vector.getNullCount() == vector.getValueCount(), + "NullVector should have only null entries."); return null; } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVector.java b/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVector.java index 2354b281ed41d..20492036dab99 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVector.java @@ -251,6 +251,20 @@ public void testDenseUnionVector() { } } + @Test + public void testBaseFixedWidthVectorInstanceMethod() { + try (final IntVector vector = new IntVector("v", allocator)) { + vector.validate(); + setVector(vector, 1, 2, 3); + vector.validate(); + + vector.getDataBuffer().capacity(0); + ValidateUtil.ValidateException e = assertThrows(ValidateUtil.ValidateException.class, + () -> vector.validate()); + assertTrue(e.getMessage().contains("Not enough capacity for fixed width data buffer")); + } + } + private void writeStructVector(NullableStructWriter writer, int value1, long value2) { writer.start(); writer.integer("f0").writeInt(value1); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVectorFull.java b/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVectorFull.java index 4241a0d9cff93..ca71a622bb8ea 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVectorFull.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/validate/TestValidateVectorFull.java @@ -23,11 +23,14 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertThrows; +import java.nio.charset.StandardCharsets; import java.util.Arrays; import org.apache.arrow.memory.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.Decimal256Vector; +import org.apache.arrow.vector.DecimalVector; import org.apache.arrow.vector.Float4Vector; import 
org.apache.arrow.vector.IntVector; import org.apache.arrow.vector.LargeVarCharVector; @@ -231,4 +234,75 @@ public void testDenseUnionVector() { assertTrue(e.getMessage().contains("Dense union vector offset exceeds sub-vector boundary")); } } + + @Test + public void testBaseVariableWidthVectorInstanceMethod() { + try (final VarCharVector vector = new VarCharVector("v", allocator)) { + vector.validateFull(); + setVector(vector, "aaa", "bbb", "ccc"); + vector.validateFull(); + + ArrowBuf offsetBuf = vector.getOffsetBuffer(); + offsetBuf.setInt(0, 100); + offsetBuf.setInt(4, 50); + + ValidateUtil.ValidateException e = assertThrows(ValidateUtil.ValidateException.class, + vector::validateFull); + assertTrue(e.getMessage().contains("The values in positions 0 and 1 of the offset buffer are decreasing")); + } + } + + @Test + public void testValidateVarCharUTF8() { + try (final VarCharVector vector = new VarCharVector("v", allocator)) { + vector.validateFull(); + setVector(vector, "aaa".getBytes(StandardCharsets.UTF_8), "bbb".getBytes(StandardCharsets.UTF_8), + new byte[] {(byte) 0xFF, (byte) 0xFE}); + ValidateUtil.ValidateException e = assertThrows(ValidateUtil.ValidateException.class, + vector::validateFull); + assertTrue(e.getMessage().contains("UTF")); + } + } + + @Test + public void testValidateLargeVarCharUTF8() { + try (final LargeVarCharVector vector = new LargeVarCharVector("v", allocator)) { + vector.validateFull(); + setVector(vector, "aaa".getBytes(StandardCharsets.UTF_8), "bbb".getBytes(StandardCharsets.UTF_8), + new byte[] {(byte) 0xFF, (byte) 0xFE}); + ValidateUtil.ValidateException e = assertThrows(ValidateUtil.ValidateException.class, + vector::validateFull); + assertTrue(e.getMessage().contains("UTF")); + } + } + + @Test + public void testValidateDecimal() { + try (final DecimalVector vector = new DecimalVector(Field.nullable("v", + new ArrowType.Decimal(2, 0, DecimalVector.TYPE_WIDTH * 8)), allocator)) { + vector.validateFull(); + setVector(vector, 1L); + vector.validateFull(); + vector.clear(); + setVector(vector, Long.MAX_VALUE); + ValidateUtil.ValidateException e = assertThrows(ValidateUtil.ValidateException.class, + vector::validateFull); + assertTrue(e.getMessage().contains("Decimal")); + } + } + + @Test + public void testValidateDecimal256() { + try (final Decimal256Vector vector = new Decimal256Vector(Field.nullable("v", + new ArrowType.Decimal(2, 0, Decimal256Vector.TYPE_WIDTH * 8)), allocator)) { + vector.validateFull(); + setVector(vector, 1L); + vector.validateFull(); + vector.clear(); + setVector(vector, Long.MAX_VALUE); + ValidateUtil.ValidateException e = assertThrows(ValidateUtil.ValidateException.class, + vector::validateFull); + assertTrue(e.getMessage().contains("Decimal")); + } + } } From 816eac4579775b45771fd0f462b73f5a68a87d4b Mon Sep 17 00:00:00 2001 From: mwish Date: Sat, 30 Sep 2023 01:11:33 +0800 Subject: [PATCH 90/96] GH-37851: [C++] IPC: ArrayLoader style enhancement (#37872) ### Rationale for this change Enhance the style of `arrow::ipc::ArrayLoader`'s `SkipField`. ### What changes are included in this PR? Set `out_` to `nullptr` when `SkipField` is called. ### Are these changes tested? No ### Are there any user-facing changes? 
No * Closes: #37851 Authored-by: mwish Signed-off-by: Benjamin Kietzman --- cpp/src/arrow/ipc/reader.cc | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 0def0e036e3c1..6e801e1f8adb7 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -205,7 +205,10 @@ class ArrayLoader { } } - Status LoadType(const DataType& type) { return VisitTypeInline(type, this); } + Status LoadType(const DataType& type) { + DCHECK_NE(out_, nullptr); + return VisitTypeInline(type, this); + } Status Load(const Field* field, ArrayData* out) { if (max_recursion_depth_ <= 0) { @@ -223,6 +226,9 @@ class ArrayLoader { skip_io_ = true; Status status = Load(field, &dummy); skip_io_ = false; + // GH-37851: Reset state. Load will set `out_` to `&dummy`, which would + // be a dangling pointer. + out_ = nullptr; return status; } @@ -258,6 +264,7 @@ class ArrayLoader { } Status LoadCommon(Type::type type_id) { + DCHECK_NE(out_, nullptr); // This only contains the length and null count, which we need to figure // out what to do with the buffers. For example, if null_count == 0, then // we can skip that buffer without reading from shared memory @@ -276,6 +283,7 @@ class ArrayLoader { template <typename T> Status LoadPrimitive(Type::type type_id) { + DCHECK_NE(out_, nullptr); out_->buffers.resize(2); RETURN_NOT_OK(LoadCommon(type_id)); @@ -290,6 +298,7 @@ class ArrayLoader { template <typename T> Status LoadBinary(Type::type type_id) { + DCHECK_NE(out_, nullptr); out_->buffers.resize(3); RETURN_NOT_OK(LoadCommon(type_id)); @@ -299,6 +308,7 @@ class ArrayLoader { template <typename TYPE> Status LoadList(const TYPE& type) { + DCHECK_NE(out_, nullptr); out_->buffers.resize(2); RETURN_NOT_OK(LoadCommon(type.id())); @@ -313,6 +323,7 @@ class ArrayLoader { } Status LoadChildren(const std::vector<std::shared_ptr<Field>>& child_fields) { + DCHECK_NE(out_, nullptr); ArrayData* parent = out_; parent->child_data.resize(child_fields.size()); @@ -2010,7 +2021,7 @@ class StreamDecoder::StreamDecoderImpl : public StreamDecoderInternal { }; StreamDecoder::StreamDecoder(std::shared_ptr<Listener> listener, IpcReadOptions options) { - impl_.reset(new StreamDecoderImpl(std::move(listener), options)); + impl_ = std::make_unique<StreamDecoderImpl>(std::move(listener), options); } StreamDecoder::~StreamDecoder() {} From 1bd3cba84a1187e594655b2ba4880214dcc73e63 Mon Sep 17 00:00:00 2001 From: Angela Li Date: Fri, 29 Sep 2023 15:37:39 -0400 Subject: [PATCH 91/96] MINOR: [R][Documentation] Add default descriptions in CsvParseOptions$create() docs (#37909) ### Rationale for this change Add more function documentation for folks who want to manually change the `quote_char` or `escape_char` options in `CsvParseOptions$create()`, as I did in #37908. I had to go through the source code in arrow/r/R/csv.R - [line 587](https://github.com/apache/arrow/blob/7dc9f69a8a77345d0ec7920af9224ef96d7f5f78/r/R/csv.R#L587) - to find the default argument, which was a pain. ### What changes are included in this PR? Documentation changes. ### Are these changes tested? No - not changing underlying code. Maintainers might need to rebuild the package to include these changes in the man/ folder. ### Are there any user-facing changes? Yes, on documentation of methods. 
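For orientation, a short sketch (not from this PR; the option values shown are simply the defaults the updated docs now state) of the call these docs describe:

```r
library(arrow)

# Spelling out the documented defaults explicitly:
parse_options <- CsvParseOptions$create(
  delimiter = ",",
  quoting = TRUE,
  quote_char = '"',   # default quoting character
  escaping = FALSE,
  escape_char = "\\"  # default escaping character, used when escaping = TRUE
)
```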
Lead-authored-by: Angela Li Co-authored-by: Nic Crane Signed-off-by: Nic Crane --- r/R/csv.R | 4 ++-- r/man/CsvReadOptions.Rd | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/r/R/csv.R b/r/R/csv.R index b119d16a84c06..116c620f83490 100644 --- a/r/R/csv.R +++ b/r/R/csv.R @@ -404,10 +404,10 @@ CsvTableReader$create <- function(file, #' #' - `delimiter` Field delimiting character (default `","`) #' - `quoting` Logical: are strings quoted? (default `TRUE`) -#' - `quote_char` Quoting character, if `quoting` is `TRUE` +#' - `quote_char` Quoting character, if `quoting` is `TRUE` (default `'"'`) #' - `double_quote` Logical: are quotes inside values double-quoted? (default `TRUE`) #' - `escaping` Logical: whether escaping is used (default `FALSE`) -#' - `escape_char` Escaping character, if `escaping` is `TRUE` +#' - `escape_char` Escaping character, if `escaping` is `TRUE` (default `"\\"`) #' - `newlines_in_values` Logical: are values allowed to contain CR (`0x0d`) #' and LF (`0x0a`) characters? (default `FALSE`) #' - `ignore_empty_lines` Logical: should empty lines be ignored (default) or diff --git a/r/man/CsvReadOptions.Rd b/r/man/CsvReadOptions.Rd index a18ff959ce7e5..6ebb2355184c1 100644 --- a/r/man/CsvReadOptions.Rd +++ b/r/man/CsvReadOptions.Rd @@ -52,10 +52,10 @@ The order of application is as follows: \itemize{ \item \code{delimiter} Field delimiting character (default \code{","}) \item \code{quoting} Logical: are strings quoted? (default \code{TRUE}) -\item \code{quote_char} Quoting character, if \code{quoting} is \code{TRUE} +\item \code{quote_char} Quoting character, if \code{quoting} is \code{TRUE} (default \code{'"'}) \item \code{double_quote} Logical: are quotes inside values double-quoted? (default \code{TRUE}) \item \code{escaping} Logical: whether escaping is used (default \code{FALSE}) -\item \code{escape_char} Escaping character, if \code{escaping} is \code{TRUE} +\item \code{escape_char} Escaping character, if \code{escaping} is \code{TRUE} (default \code{"\\\\"}) \item \code{newlines_in_values} Logical: are values allowed to contain CR (\code{0x0d}) and LF (\code{0x0a}) characters? (default \code{FALSE}) \item \code{ignore_empty_lines} Logical: should empty lines be ignored (default) or From 00efb06dc0de9c40907576811ebb546198b7f528 Mon Sep 17 00:00:00 2001 From: sgilmore10 <74676073+sgilmore10@users.noreply.github.com> Date: Fri, 29 Sep 2023 16:16:50 -0400 Subject: [PATCH 92/96] GH-37835: [MATLAB] Improve `arrow.tabular.Schema` display (#37836) ### Rationale for this change We would like to change how `arrow.tabular.Schema`s are displayed in the Command Window. Below is the current display: ```matlab >> field1 = arrow.field("A", arrow.time32()); >> field2 = arrow.field("B", arrow.boolean()); >>s = arrow.schema([field1 field2]) s = A: time32[s] B: bool ``` This display is not very MATLAB-like. ### What changes are included in this PR? 1. Updated the display of `arrow.tabular.Schema`. Below is the new display: ```matlab >> field1 = arrow.field("A", arrow.time32()); >> field2 = arrow.field("B", arrow.boolean()); >> s = arrow.schema([field1 field2]) s = Arrow Schema with 2 fields: A: Time32 | B: Boolean ``` When MATLAB is opened in desktop mode, `Schema`, `Time32`, and `Boolean` are hyperlinks users can click on to view the help text for the different class types. ### Are these changes tested? Yes. Added three new test cases to `tSchema.m`: 1. `TestDisplaySchemaZeroFields` 2. `TestDisplaySchemaOneField` 3. 
`TestDisplaySchemaMultipleFields` ### Are there any user-facing changes? Yes. `arrow.tabular.Schema`s will be displayed differently in the Command Window. ### Notes Once #37826 is merged, I will rebase my changes and mark this PR as ready for review. * Closes: #37835 Authored-by: Sarah Gilmore Signed-off-by: Kevin Gurney --- .../cpp/arrow/matlab/tabular/proxy/schema.cc | 11 ---- .../cpp/arrow/matlab/tabular/proxy/schema.h | 1 - .../+arrow/+tabular/+internal/displaySchema.m | 50 +++++++++++++++ matlab/src/matlab/+arrow/+tabular/Schema.m | 25 +++++--- matlab/test/arrow/tabular/tSchema.m | 63 +++++++++++++++++++ 5 files changed, 131 insertions(+), 19 deletions(-) create mode 100644 matlab/src/matlab/+arrow/+tabular/+internal/displaySchema.m diff --git a/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc b/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc index ec1ac1eecb2fd..023381e005969 100644 --- a/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc +++ b/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.cc @@ -34,7 +34,6 @@ namespace arrow::matlab::tabular::proxy { REGISTER_METHOD(Schema, getFieldByName); REGISTER_METHOD(Schema, getNumFields); REGISTER_METHOD(Schema, getFieldNames); - REGISTER_METHOD(Schema, toString); } libmexclass::proxy::MakeResult Schema::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) { @@ -141,14 +140,4 @@ namespace arrow::matlab::tabular::proxy { context.outputs[0] = field_names_mda; } - void Schema::toString(libmexclass::proxy::method::Context& context) { - namespace mda = ::matlab::data; - mda::ArrayFactory factory; - - const auto str_utf8 = schema->ToString(); - MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto str_utf16, arrow::util::UTF8StringToUTF16(str_utf8), context, error::UNICODE_CONVERSION_ERROR_ID); - auto str_mda = factory.createScalar(str_utf16); - context.outputs[0] = str_mda; - } - } diff --git a/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.h b/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.h index 30883bc2a85ac..9ca4a94e53071 100644 --- a/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.h +++ b/matlab/src/cpp/arrow/matlab/tabular/proxy/schema.h @@ -39,7 +39,6 @@ namespace arrow::matlab::tabular::proxy { void getFieldByName(libmexclass::proxy::method::Context& context); void getNumFields(libmexclass::proxy::method::Context& context); void getFieldNames(libmexclass::proxy::method::Context& context); - void toString(libmexclass::proxy::method::Context& context); std::shared_ptr schema; }; diff --git a/matlab/src/matlab/+arrow/+tabular/+internal/displaySchema.m b/matlab/src/matlab/+arrow/+tabular/+internal/displaySchema.m new file mode 100644 index 0000000000000..8d6740b195abc --- /dev/null +++ b/matlab/src/matlab/+arrow/+tabular/+internal/displaySchema.m @@ -0,0 +1,50 @@ +%DISPLAYSCHEMA Generates arrow.tabular.Schema display text. + +% Licensed to the Apache Software Foundation (ASF) under one or more +% contributor license agreements. See the NOTICE file distributed with +% this work for additional information regarding copyright ownership. +% The ASF licenses this file to you under the Apache License, Version +% 2.0 (the "License"); you may not use this file except in compliance +% with the License. You may obtain a copy of the License at +% +% http://www.apache.org/licenses/LICENSE-2.0 +% +% Unless required by applicable law or agreed to in writing, software +% distributed under the License is distributed on an "AS IS" BASIS, +% WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +% implied. 
See the License for the specific language governing +% permissions and limitations under the License. + +function text = displaySchema(schema) + fields = schema.Fields; + names = [fields.Name]; + types = [fields.Type]; + typeIDs = string([types.ID]); + + % Use <empty> as the sentinel for field names with zero characters. + idx = strlength(names) == 0; + names(idx) = "<empty>"; + + if usejava("desktop") + % When in desktop mode, the Command Window can interpret HTML tags + % to display bold font and hyperlinks. + names = compose("<strong>%s</strong>", names); + classNames = arrayfun(@(type) string(class(type)), types); + + % Creates a string array with the following form: + % + % ["arrow.type.BooleanType" "Boolean" "arrow.type.StringType" "String" ...] + % + % This string array is passed to the compose call below. The + % format specifier operator supplied to compose contains two + % formatting operators (%s), so compose uses two elements from the + % string array (classNameAndIDs) at a time. + classNameAndIDs = strings([1 numel(typeIDs) * 2]); + classNameAndIDs(1:2:end-1) = classNames; + classNameAndIDs(2:2:end) = typeIDs; + typeIDs = compose("<a href=""matlab:helpPopup %s"" style=""font-weight:bold"">%s</a>", classNameAndIDs); + end + + text = names + ": " + typeIDs; + text = " " + strjoin(text, " | "); +end \ No newline at end of file diff --git a/matlab/src/matlab/+arrow/+tabular/Schema.m b/matlab/src/matlab/+arrow/+tabular/Schema.m index f679b1e0bc22c..3ee40f0e14293 100644 --- a/matlab/src/matlab/+arrow/+tabular/Schema.m +++ b/matlab/src/matlab/+arrow/+tabular/Schema.m @@ -97,18 +97,29 @@ end end - methods (Access = private) + methods (Access=protected) - function str = toString(obj) - str = obj.Proxy.toString(); + function header = getHeader(obj) + name = matlab.mixin.CustomDisplay.getClassNameForHeader(obj); + numFields = obj.NumFields; + if numFields == 0 + header = compose(" Arrow %s with 0 fields" + newline, name); + elseif numFields == 1 + header = compose(" Arrow %s with %d field:" + newline, name, numFields); + else + header = compose(" Arrow %s with %d fields:" + newline, name, numFields); + end end - end + function displayScalarObject(obj) + disp(getHeader(obj)); + numFields = obj.NumFields; - methods (Access=protected) + if numFields > 0 + text = arrow.tabular.internal.displaySchema(obj); + disp(text + newline); + end - function displayScalarObject(obj) - disp(obj.toString()); end end diff --git a/matlab/test/arrow/tabular/tSchema.m b/matlab/test/arrow/tabular/tSchema.m index e4c706d9a3d6c..bb95c1823b9fc 100644 --- a/matlab/test/arrow/tabular/tSchema.m +++ b/matlab/test/arrow/tabular/tSchema.m @@ -526,7 +526,70 @@ function TestIsEqualFalse(testCase) % Compare schema to double testCase.verifyFalse(isequal(schema4, 5)); + end + + function TestDisplaySchemaZeroFields(testCase) + import arrow.internal.test.display.makeLinkString + + schema = arrow.schema(arrow.type.Field.empty(0, 0)); %#ok<NASGU> + classnameLink = makeLinkString(FullClassName="arrow.tabular.Schema",... + ClassName="Schema", BoldFont=true); + expectedDisplay = " Arrow " + classnameLink + " with 0 fields" + newline; + expectedDisplay = char(expectedDisplay + newline); + actualDisplay = evalc('disp(schema)'); + testCase.verifyEqual(actualDisplay, char(expectedDisplay)); + end + + function TestDisplaySchemaOneField(testCase) + import arrow.internal.test.display.makeLinkString + + schema = arrow.schema(arrow.field("TestField", arrow.boolean())); %#ok<NASGU> + classnameLink = makeLinkString(FullClassName="arrow.tabular.Schema",... 
+ ClassName="Schema", BoldFont=true); + header = " Arrow " + classnameLink + " with 1 field:" + newline; + indent = " "; + + if usejava("desktop") + type = makeLinkString(FullClassName="arrow.type.BooleanType", ... + ClassName="Boolean", BoldFont=true); + name = "TestField: "; + fieldLine = indent + name + type + newline; + else + fieldLine = indent + "TestField: Boolean" + newline; + end + expectedDisplay = join([header, fieldLine], newline); + expectedDisplay = char(expectedDisplay + newline); + actualDisplay = evalc('disp(schema)'); + testCase.verifyEqual(actualDisplay, char(expectedDisplay)); + end + function TestDisplaySchemaField(testCase) + import arrow.internal.test.display.makeLinkString + + field1 = arrow.field("Field1", arrow.timestamp()); + field2 = arrow.field("Field2", arrow.string()); + schema = arrow.schema([field1, field2]); %#ok + classnameLink = makeLinkString(FullClassName="arrow.tabular.Schema",... + ClassName="Schema", BoldFont=true); + header = " Arrow " + classnameLink + " with 2 fields:" + newline; + + indent = " "; + if usejava("desktop") + type1 = makeLinkString(FullClassName="arrow.type.TimestampType", ... + ClassName="Timestamp", BoldFont=true); + field1String = "Field1: " + type1; + type2 = makeLinkString(FullClassName="arrow.type.StringType", ... + ClassName="String", BoldFont=true); + field2String = "Field2: " + type2; + fieldLine = indent + field1String + " | " + field2String + newline; + else + fieldLine = indent + "Field1: Timestamp | Field2: String" + newline; + end + + expectedDisplay = join([header, fieldLine], newline); + expectedDisplay = char(expectedDisplay + newline); + actualDisplay = evalc('disp(schema)'); + testCase.verifyEqual(actualDisplay, char(expectedDisplay)); end end From ffdb9274abb7401289a671274b52ae10ce830b00 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 2 Oct 2023 04:42:13 +0900 Subject: [PATCH 93/96] MINOR: [JS] Bump eslint-plugin-jest from 27.2.3 to 27.4.2 in /js (#37963) Bumps [eslint-plugin-jest](https://github.com/jest-community/eslint-plugin-jest) from 27.2.3 to 27.4.2.
Release notes (sourced from eslint-plugin-jest's releases and changelog):

  • v27.4.2 (2023-09-29), bug fixes: make rule message punctuation consistent (#1444) (84121ee)
  • v27.4.1 (2023-09-29), bug fixes: no-focused-tests: make reporting location consistent (#1443) (a871775)
  • v27.4.0 (2023-09-15), features: valid-title: support ignoring leading and trailing whitespace (#1433) (bc96473)
  • v27.3.0 (2023-09-15), features: ... (truncated)
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Sutou Kouhei --- js/package.json | 2 +- js/yarn.lock | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/js/package.json b/js/package.json index 11bbe24f0244c..d3b06ae43f23f 100644 --- a/js/package.json +++ b/js/package.json @@ -82,7 +82,7 @@ "esbuild": "0.19.2", "esbuild-plugin-alias": "0.2.1", "eslint": "8.42.0", - "eslint-plugin-jest": "27.2.3", + "eslint-plugin-jest": "27.4.2", "eslint-plugin-unicorn": "47.0.0", "esm": "https://github.com/jsg2021/esm/releases/download/v3.x.x-pr883/esm-3.x.x-pr883.tgz", "glob": "10.2.7", diff --git a/js/yarn.lock b/js/yarn.lock index 66ede59a598b1..9805f706946f2 100644 --- a/js/yarn.lock +++ b/js/yarn.lock @@ -2766,10 +2766,10 @@ escape-string-regexp@^4.0.0: resolved "https://registry.yarnpkg.com/escape-string-regexp/-/escape-string-regexp-4.0.0.tgz#14ba83a5d373e3d311e5afca29cf5bfad965bf34" integrity sha512-TtpcNJ3XAzx3Gq8sWRzJaVajRs0uVxA2YAkdb1jm2YkPz4G6egUFAyA3n5vtEIZefPk5Wa4UXbKuS5fKkJWdgA== -eslint-plugin-jest@27.2.3: - version "27.2.3" - resolved "https://registry.yarnpkg.com/eslint-plugin-jest/-/eslint-plugin-jest-27.2.3.tgz#6f8a4bb2ca82c0c5d481d1b3be256ab001f5a3ec" - integrity sha512-sRLlSCpICzWuje66Gl9zvdF6mwD5X86I4u55hJyFBsxYOsBCmT5+kSUjf+fkFWVMMgpzNEupjW8WzUqi83hJAQ== +eslint-plugin-jest@27.4.2: + version "27.4.2" + resolved "https://registry.yarnpkg.com/eslint-plugin-jest/-/eslint-plugin-jest-27.4.2.tgz#181d999ac67a9b6040db1d27935887cf5a2882ed" + integrity sha512-3Nfvv3wbq2+PZlRTf2oaAWXWwbdBejFRBR2O8tAO67o+P8zno+QGbcDYaAXODlreXVg+9gvWhKKmG2rgfb8GEg== dependencies: "@typescript-eslint/utils" "^5.10.0" From ff98e7d43de44caa6650a5cfbc395986809900f2 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 2 Oct 2023 04:43:07 +0900 Subject: [PATCH 94/96] MINOR: [JS] Bump memfs from 4.2.1 to 4.5.0 in /js (#37964) Bumps [memfs](https://github.com/streamich/memfs) from 4.2.1 to 4.5.0.
Release notes (sourced from memfs's releases and changelog):

  • v4.5.0 (2023-09-25), features: volume: fromJSON now accepts Buffer as volume content (#880) (9c0a6ff)
  • v4.4.0 (2023-09-22), features: volume: toJSON now accepts the asBuffer parameter (#952) (91a3742); volume: implement readv and writev (#946) (966e17e)
  • v4.3.0 (2023-09-15), features: ... (truncated)
  • v4.2.3 (2023-09-15), bug fixes: add missing nanosecond-precision properties to Stats (#943) (b9d4c6d)
  • v4.2.2 (2023-09-15), bug fixes: ... (truncated)
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Sutou Kouhei --- js/package.json | 2 +- js/yarn.lock | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/js/package.json b/js/package.json index d3b06ae43f23f..e3c3b946a0b12 100644 --- a/js/package.json +++ b/js/package.json @@ -99,7 +99,7 @@ "ix": "5.0.0", "jest": "29.6.2", "jest-silent-reporter": "0.5.0", - "memfs": "4.2.1", + "memfs": "4.5.0", "mkdirp": "3.0.1", "multistream": "4.1.0", "randomatic": "3.1.1", diff --git a/js/yarn.lock b/js/yarn.lock index 9805f706946f2..d16bcf191e9d6 100644 --- a/js/yarn.lock +++ b/js/yarn.lock @@ -4931,10 +4931,10 @@ math-random@^1.0.1: resolved "https://registry.yarnpkg.com/math-random/-/math-random-1.0.4.tgz#5dd6943c938548267016d4e34f057583080c514c" integrity sha512-rUxjysqif/BZQH2yhd5Aaq7vXMSx9NdEsQcyA07uEzIvxgI7zIr33gGsh+RU0/XjmQpCW7RsVof1vlkvQVCK5A== -memfs@4.2.1: - version "4.2.1" - resolved "https://registry.yarnpkg.com/memfs/-/memfs-4.2.1.tgz#8c5a48707a460dde8e734b15e405e8377db2bec5" - integrity sha512-CINEB6cNAAhLUfRGrB4lj2Pj47ygerEmw3jxPb6R1gkD6Jfp484gJLteQ6MzqIjGWtFWuVzDl+KN7HiipMuKSw== +memfs@4.5.0: + version "4.5.0" + resolved "https://registry.yarnpkg.com/memfs/-/memfs-4.5.0.tgz#03082709987760022275e0d3bc0f24545b7fe279" + integrity sha512-8QePW5iXi/ZCySFTo39h3ujKGT0rYVnZywuSo5AzR7POAuy4uBEFZKziYkkrlGdWuxACUxKAJ0L/sry3DSG+TA== dependencies: json-joy "^9.2.0" thingies "^1.11.1" From a381c05d596cddd341437de6b277520345f9bb8e Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 2 Oct 2023 04:44:34 +0900 Subject: [PATCH 95/96] MINOR: [JS] Bump del from 7.0.0 to 7.1.0 in /js (#37967) Bumps [del](https://github.com/sindresorhus/del) from 7.0.0 to 7.1.0.
Release notes (sourced from del's releases):

  • v7.1.0: Add path to onProgress event (#155) f5d31e6. Full diff: https://github.com/sindresorhus/del/compare/v7.0.0...v7.1.0
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Sutou Kouhei --- js/package.json | 2 +- js/yarn.lock | 16 +--------------- 2 files changed, 2 insertions(+), 16 deletions(-) diff --git a/js/package.json b/js/package.json index e3c3b946a0b12..dfb17d24c2f41 100644 --- a/js/package.json +++ b/js/package.json @@ -77,7 +77,7 @@ "async-done": "2.0.0", "benny": "3.7.1", "cross-env": "7.0.3", - "del": "7.0.0", + "del": "7.1.0", "del-cli": "5.1.0", "esbuild": "0.19.2", "esbuild-plugin-alias": "0.2.1", diff --git a/js/yarn.lock b/js/yarn.lock index d16bcf191e9d6..647b696931bb0 100644 --- a/js/yarn.lock +++ b/js/yarn.lock @@ -2495,21 +2495,7 @@ del-cli@5.1.0: del "^7.1.0" meow "^10.1.3" -del@7.0.0: - version "7.0.0" - resolved "https://registry.yarnpkg.com/del/-/del-7.0.0.tgz#79db048bec96f83f344b46c1a66e35d9c09fe8ac" - integrity sha512-tQbV/4u5WVB8HMJr08pgw0b6nG4RGt/tj+7Numvq+zqcvUFeMaIWWOUFltiU+6go8BSO2/ogsB4EasDaj0y68Q== - dependencies: - globby "^13.1.2" - graceful-fs "^4.2.10" - is-glob "^4.0.3" - is-path-cwd "^3.0.0" - is-path-inside "^4.0.0" - p-map "^5.5.0" - rimraf "^3.0.2" - slash "^4.0.0" - -del@^7.1.0: +del@7.1.0, del@^7.1.0: version "7.1.0" resolved "https://registry.yarnpkg.com/del/-/del-7.1.0.tgz#0de0044d556b649ff05387f1fa7c885e155fd1b6" integrity sha512-v2KyNk7efxhlyHpjEvfyxaAihKKK0nWCuf6ZtqZcFFpQRG0bJ12Qsr0RpvsICMjAAZ8DOVCxrlqpxISlMHC4Kg== From b0eb0216be3ac365e31ba34151cdaaf193c96244 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 2 Oct 2023 06:46:59 +0900 Subject: [PATCH 96/96] MINOR: [JS] Bump google-closure-compiler from 20230502.0.0 to 20230802.0.0 in /js (#37965) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bumps [google-closure-compiler](https://github.com/google/closure-compiler-npm) from 20230502.0.0 to 20230802.0.0.
Release notes (sourced from google-closure-compiler's releases):

  • v20230802.0.0: Closure-compiler 20230802 release. What's Changed and New Contributors lists truncated; full changelog: https://github.com/google/closure-compiler-npm/compare/v20230502.0.0...v20230802.0.0
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sutou Kouhei
---
 js/package.json |  2 +-
 js/yarn.lock    | 56 ++++++++++++++++++++++++-------------------------
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/js/package.json b/js/package.json
index dfb17d24c2f41..14f26c74d29f3 100644
--- a/js/package.json
+++ b/js/package.json
@@ -86,7 +86,7 @@
   "eslint-plugin-unicorn": "47.0.0",
   "esm": "https://github.com/jsg2021/esm/releases/download/v3.x.x-pr883/esm-3.x.x-pr883.tgz",
   "glob": "10.2.7",
-  "google-closure-compiler": "20230502.0.0",
+  "google-closure-compiler": "20230802.0.0",
   "gulp": "4.0.2",
   "gulp-esbuild": "0.11.1",
   "gulp-json-transform": "0.4.8",
diff --git a/js/yarn.lock b/js/yarn.lock
index 647b696931bb0..f027be218245f 100644
--- a/js/yarn.lock
+++ b/js/yarn.lock
@@ -3447,40 +3447,40 @@ glogg@^1.0.0:
   dependencies:
     sparkles "^1.0.0"
 
-google-closure-compiler-java@^20230502.0.0:
-  version "20230502.0.0"
-  resolved "https://registry.yarnpkg.com/google-closure-compiler-java/-/google-closure-compiler-java-20230502.0.0.tgz#111240655adf9d64a0ac7eb16f73e896f3f9cefd"
-  integrity sha512-2nMQPQz2ppU9jvHhz2zpUP5jBDAqZp4gFVOEvirEyfUuLLkHwAvU2Tl1c7xaKX+Z4uMxpxttxcwdIjQhV2g8eQ==
-
-google-closure-compiler-linux@^20230502.0.0:
-  version "20230502.0.0"
-  resolved "https://registry.yarnpkg.com/google-closure-compiler-linux/-/google-closure-compiler-linux-20230502.0.0.tgz#c71114611b7ca47febd6feb1289ae152ca020b92"
-  integrity sha512-4NDgPKJXQHUxEyJoVFPVMQPJs5at7ThOXa9u3+9UeYk2K+vtW5wVZlmW07VOy8Mk/O/n2dp+Vl+wuE35BIiHAA==
-
-google-closure-compiler-osx@^20230502.0.0:
-  version "20230502.0.0"
-  resolved "https://registry.yarnpkg.com/google-closure-compiler-osx/-/google-closure-compiler-osx-20230502.0.0.tgz#9ea082f0c6ad40b829802f0993f2e5b4b0e079e8"
-  integrity sha512-jB13dcbu8O02cG3JcCCVZku1oI0ZirJc/Sr9xcGHY5MMyw3qEMlXb3IU97W6UXLcg2wCRawMWadOwL9K4L9lfQ==
-
-google-closure-compiler-windows@^20230502.0.0:
-  version "20230502.0.0"
-  resolved "https://registry.yarnpkg.com/google-closure-compiler-windows/-/google-closure-compiler-windows-20230502.0.0.tgz#81eef5de8b86364716b77a2d8068afba8b0e8244"
-  integrity sha512-wW5/liBxejvUViiBNo8/C9Vnhw+Lm+n3RdfE4spNkmdH9bcpKM+KQBLrPPakW17P3HbAPOPZ0L1RsrmyLYA5Cg==
-
-google-closure-compiler@20230502.0.0:
-  version "20230502.0.0"
-  resolved "https://registry.yarnpkg.com/google-closure-compiler/-/google-closure-compiler-20230502.0.0.tgz#65b19e673255b4b4dad4271724932e0970b11a97"
-  integrity sha512-C2WZkuRnXpNjU2nc0W/Cgxm6t2VlwEyUJOTaGHaLr6qZCXK0L1uhOneKWN2X7AORKdzyLW6Tq8ONxRc7eODGJg==
+google-closure-compiler-java@^20230802.0.0:
+  version "20230802.0.0"
+  resolved "https://registry.yarnpkg.com/google-closure-compiler-java/-/google-closure-compiler-java-20230802.0.0.tgz#5de4679f3d014b6b66471a48fb82c2772db4c872"
+  integrity sha512-PWKLMLwj7pR/U0yYbiy649LLqAscu+F1gyY4Y/jK6CmSLb8cIJbL8BTJd00828TzTNfWnYwxbkcQw0y9C2YsGw==
+
+google-closure-compiler-linux@^20230802.0.0:
+  version "20230802.0.0"
+  resolved "https://registry.yarnpkg.com/google-closure-compiler-linux/-/google-closure-compiler-linux-20230802.0.0.tgz#1acaf12ef386e5c1dcb5ff5796d4ae9f48ebce46"
+  integrity sha512-F13U4iSXiWeGtHOFS25LVem1s6zI+pJvXVPVR7zSib5ppoUJ0JXnABJQezUR3FnpxmnkALG4oIGW0syH9zPLZA==
+
+google-closure-compiler-osx@^20230802.0.0:
+  version "20230802.0.0"
+  resolved "https://registry.yarnpkg.com/google-closure-compiler-osx/-/google-closure-compiler-osx-20230802.0.0.tgz#10746ecfa81ad6eecc4d42d4ce9d0ed3ca8071e7"
+  integrity sha512-ANAi/ux92Tt+Na7vFDLeK2hRzotjC5j+nxoPtE0OcuNcbjji5dREKoJxkq7r0YwRTCzAFZszK5ip/NPdTOdCEg==
+
+google-closure-compiler-windows@^20230802.0.0:
+  version "20230802.0.0"
+  resolved "https://registry.yarnpkg.com/google-closure-compiler-windows/-/google-closure-compiler-windows-20230802.0.0.tgz#d57968dc24d5e0d538840b4313e1bec7c71b18d6"
+  integrity sha512-ZQPujoNiiUyTGl8zEGR/0yAygWnbMtX/NQ/S/EHVgq5nmYkvDEVuiVbgpPAmO9lzBTq0hvUTRRATZbTU2ISxgA==
+
+google-closure-compiler@20230802.0.0:
+  version "20230802.0.0"
+  resolved "https://registry.yarnpkg.com/google-closure-compiler/-/google-closure-compiler-20230802.0.0.tgz#849181359823f8c9130faec9a1597377680823d6"
+  integrity sha512-o2fYoc8lqOBdhm95Ick0vWrtwH2Icd5yLZhbTcQ0T7NfGiBepYvx1BB63hR8ebgzEZemz9Fh+O6Kg/3Mjm28ww==
   dependencies:
     chalk "4.x"
-    google-closure-compiler-java "^20230502.0.0"
+    google-closure-compiler-java "^20230802.0.0"
     minimist "1.x"
     vinyl "2.x"
     vinyl-sourcemaps-apply "^0.2.0"
   optionalDependencies:
-    google-closure-compiler-linux "^20230502.0.0"
-    google-closure-compiler-osx "^20230502.0.0"
-    google-closure-compiler-windows "^20230502.0.0"
+    google-closure-compiler-linux "^20230802.0.0"
+    google-closure-compiler-osx "^20230802.0.0"
+    google-closure-compiler-windows "^20230802.0.0"
 
 graceful-fs@^4.0.0, graceful-fs@^4.1.11, graceful-fs@^4.1.2, graceful-fs@^4.1.6, graceful-fs@^4.2.0, graceful-fs@^4.2.10, graceful-fs@^4.2.4, graceful-fs@^4.2.9:
   version "4.2.11"
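
The yarn.lock stanza above shows the pattern this package uses to ship native builds: the JVM build (`google-closure-compiler-java`) is a regular dependency, while the Linux/macOS/Windows native builds are `optionalDependencies`, so each platform installs only the binary it can run. Below is a minimal TypeScript sketch of how a wrapper package can pick the installed binary at runtime; the function name is hypothetical and this is not google-closure-compiler's actual source.

```typescript
// Hypothetical sketch of optionalDependencies-based platform selection;
// not taken from google-closure-compiler's sources.
import { createRequire } from "node:module";

const require = createRequire(import.meta.url);

// Map Node's process.platform values to the per-platform packages
// listed under optionalDependencies in the stanza above.
const PLATFORM_PACKAGES: Record<string, string> = {
  linux: "google-closure-compiler-linux",
  darwin: "google-closure-compiler-osx",
  win32: "google-closure-compiler-windows",
};

export function resolveCompilerBinary(): string {
  const pkg = PLATFORM_PACKAGES[process.platform];
  if (pkg !== undefined) {
    try {
      // Resolves only if the optional dependency actually installed here;
      // npm/yarn silently skip optional deps that don't match the platform.
      return require.resolve(pkg);
    } catch {
      // Native build unavailable; fall through to the JAR.
    }
  }
  // google-closure-compiler-java is a hard dependency, so it is always
  // present and serves as the cross-platform fallback.
  return require.resolve("google-closure-compiler-java");
}
```

Note also the version discipline visible in the diff: js/package.json pins the wrapper exactly (`20230802.0.0`), while the wrapper's own platform packages use caret ranges (`^20230802.0.0`), letting them float within that compiler release.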