chore: Bump to apache-arrow@17 (#3113)
ibgreen authored Oct 2, 2024
1 parent 6ca3d45 commit 4a2f920
Showing 10 changed files with 126 additions and 83 deletions.
38 changes: 38 additions & 0 deletions docs/arrowjs/developer-guide/data-types.md
@@ -18,3 +18,41 @@ function toDate(timestamp) {
return new Date((timestamp[1] * Math.pow(2, 32) + timestamp[0]) / 1000);
}
```

## Data Types Reference

At the heart of Arrow is a set of well-known logical [data types](/docs/arrowjs/developer-guide/data-types), ensuring each Column in an Arrow Table is strongly-typed. These data types define how a Column's underlying buffers should be constructed and read, and include configurable (and custom) metadata fields for further annotating a Column. A Schema describing each Column's name and data type is encoded alongside each Column's data buffers, allowing you to consume an Arrow data source without knowing the data types or column layout beforehand.

Each data type falls into one of three rough categories: Fixed-width types, variable-width types, or composite types that contain other Arrow data types. All data types can represent null values, which are stored in a separate validity [bitmask](<https://en.wikipedia.org/wiki/Mask_(computing)>). Follow the links below for a more detailed description of each data type.

### Fixed-width Data Types

Fixed-width data types describe physical primitive values (bytes or bits of some fixed size), or logical values that can be represented as primitive values. In addition to an optional [`Uint8Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array) validity bitmask, these data types have a physical data buffer (a [`TypedArray`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray#TypedArray_objects) corresponding to the data type's physical element width).

- [Null](/docs/arrowjs/developer-guide/data-types) - A column of NULL values having no physical storage
- [Bool](/docs/arrowjs/developer-guide/data-types) - Booleans as either 0 or 1 (bit-packed, LSB-ordered)
- [Int](/docs/arrowjs/developer-guide/data-types) - Signed or unsigned 8, 16, 32, or 64-bit little-endian integers
- [Float](/docs/arrowjs/developer-guide/data-types) - 2, 4, or 8-byte floating point values
- [Decimal](/docs/arrowjs/developer-guide/data-types) - Precision-and-scale-based 128-bit decimal values
- [FixedSizeBinary](/docs/arrowjs/developer-guide/data-types) - A list of fixed-size binary sequences, where each value occupies the same number of bytes
- [Date](/docs/arrowjs/developer-guide/data-types) - Date as signed 32-bit integer days or 64-bit integer milliseconds since the UNIX epoch
- [Time](/docs/arrowjs/developer-guide/data-types) - Time as signed 32 or 64-bit integers, representing either seconds, milliseconds, microseconds, or nanoseconds since midnight (00:00:00)
- [Timestamp](/docs/arrowjs/developer-guide/data-types) - Exact timestamp as signed 64-bit integers, representing either seconds, milliseconds, microseconds, or nanoseconds since the UNIX epoch
- [Interval](/docs/arrowjs/developer-guide/data-types) - Time intervals as pairs of either (year, month) or (day, time) in SQL style
- [FixedSizeList](/docs/arrowjs/developer-guide/data-types) - Fixed-size sequences of another logical Arrow data type
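
As a rough illustration of how a fixed-width column pairs a data buffer with a bit-packed, LSB-ordered validity bitmask, here is a minimal sketch. The helpers `isValid` and `readInt32Column` are hypothetical names used only for this example; they are not part of the Arrow JS API.

```typescript
// Toy illustration only: `isValid` and `readInt32Column` are hypothetical
// helpers, not Arrow JS APIs.

/** Returns true if row `index` is marked non-null in an LSB-ordered bitmask. */
function isValid(bitmask: Uint8Array, index: number): boolean {
  const byte = bitmask[index >> 3]; // 8 rows are packed into each byte
  return ((byte >> (index & 7)) & 1) === 1; // bit 0 of a byte is its first row
}

/** Reads a fixed-width Int32 column, yielding null wherever the validity bit is 0. */
function readInt32Column(values: Int32Array, bitmask: Uint8Array | null): (number | null)[] {
  const out: (number | null)[] = [];
  for (let i = 0; i < values.length; i++) {
    // A missing bitmask is treated as "every row is valid".
    out.push(bitmask === null || isValid(bitmask, i) ? values[i] : null);
  }
  return out;
}

const values = new Int32Array([10, 20, 30, 40]);
const bitmask = new Uint8Array([0b00001011]); // rows 0, 1 and 3 valid; row 2 null
const column = readInt32Column(values, bitmask);
```

Note that the data buffer still holds a slot for the null row (the `30` above); the bitmask alone decides whether that slot is meaningful.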

### Variable-width Data Types

Variable-width types describe lists of values with different widths, including binary blobs, Utf8 code-points, or slices of another underlying Arrow data type. These types store the values contiguously in memory, and have a physical [`Int32Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Int32Array) of offsets that describe the start and end indices of each list element.

- [List](/docs/arrowjs/developer-guide/data-types) - Variable-length sequences of another logical Arrow data type
- [Utf8](/docs/arrowjs/developer-guide/data-types) - Variable-length byte sequences of UTF8 code-points (strings)
- [Binary](/docs/arrowjs/developer-guide/data-types) - Variable-length byte sequences (no guarantee of UTF8-ness)
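
The offsets layout described above can be sketched without the library. This is a toy example under the assumption of a `List` of 32-bit integers; `readList` is a hypothetical helper, not an Arrow JS function.

```typescript
// Toy illustration only: `readList` is a hypothetical helper, not an Arrow JS API.

/** Returns element `index` of a variable-width list column as a zero-copy view. */
function readList(values: Int32Array, offsets: Int32Array, index: number): Int32Array {
  const start = offsets[index]; // inclusive start of this element
  const end = offsets[index + 1]; // exclusive end, which is also the next element's start
  return values.subarray(start, end);
}

// Three list elements stored contiguously: [1, 2], [] and [3, 4, 5, 6].
const listValues = new Int32Array([1, 2, 3, 4, 5, 6]);
// N elements need N + 1 offsets; equal neighbours denote an empty element.
const listOffsets = new Int32Array([0, 2, 2, 6]);

const lists = [0, 1, 2].map((i) => Array.from(readList(listValues, listOffsets, i)));
```

The same idea applies to `Utf8` and `Binary`, where the values buffer holds bytes rather than integers.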

### Composite Data Types

Composite types don't have physical data buffers of their own. They contain other Arrow data types and delegate work to them.

- [Union](/docs/arrowjs/developer-guide/data-types) - Union of logical child data types
- [Map](/docs/arrowjs/developer-guide/data-types) - Map of named logical child data types
- [Struct](/docs/arrowjs/developer-guide/data-types) - Struct of ordered logical child data types
43 changes: 11 additions & 32 deletions docs/arrowjs/developer-guide/introduction.md
@@ -15,40 +15,19 @@ The Arrow library is organized into separate components responsible for creating
- [IPC Readers and Writers](/docs/arrowjs/developer-guide/reading-and-writing) - Classes to read and write the Arrow IPC (inter-process communication) binary file and stream formats
- [Fields, Schemas, RecordBatches, Tables, and Columns](/docs/arrowjs/developer-guide/schemas) - Classes to describe, manipulate, read, and write groups of strongly-typed Vectors or Columns

## Data Types

At the heart of Arrow is set of well-known logical [data types](/docs/arrowjs/developer-guide/data-types), ensuring each Column in an Arrow Table is strongly-typed. These data types define how a Column's underlying buffers should be constructed and read, and includes configurable (and custom) metadata fields for further annotating a Column. A Schema describing each Column's name and data type is encoded alongside each Column's data buffers, allowing you to consume an Arrow data source without knowing the data types or column layout beforehand.

Each data type falls into one of three rough categories: Fixed-width types, variable-width types, or composite types that contain other Arrow data types. All data types can represent null values, which are stored in a separate validity [bitmask](<https://en.wikipedia.org/wiki/Mask_(computing)>). Follow the links below for a more detailed description of each data type.

### Fixed-width Data Types
## Concepts

Fixed-width data types describe physical primitive values (bytes or bits of some fixed size), or logical values that can be represented as primitive values. In addition to an optional [`Uint8Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array) validity bitmask, these data types have a physical data buffer (a [`TypedArray`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray#TypedArray_objects) corresponding to the data type's physical element width).
It helps to define some terminology first:

- [Null](/docs/arrowjs/developer-guide/data-types) - A column of NULL values having no physical storage
- [Bool](/docs/arrowjs/developer-guide/data-types) - Booleans as either 0 or 1 (bit-packed, LSB-ordered)
- [Int](/docs/arrowjs/developer-guide/data-types) - Signed or unsigned 8, 16, 32, or 64-bit little-endian integers
- [Float](/docs/arrowjs/developer-guide/data-types) - 2, 4, or 8-byte floating point values
- [Decimal](/docs/arrowjs/developer-guide/data-types) - Precision-and-scale-based 128-bit decimal values
- [FixedSizeBinary](/docs/arrowjs/developer-guide/data-types) - A list of fixed-size binary sequences, where each value occupies the same number of bytes
- [Date](/docs/arrowjs/developer-guide/data-types) - Date as signed 32-bit integer days or 64-bit integer milliseconds since the UNIX epoch
- [Time](/docs/arrowjs/developer-guide/data-types) - Time as signed 32 or 64-bit integers, representing either seconds, millisecond, microseconds, or nanoseconds since midnight (00:00:00)
- [Timestamp](/docs/arrowjs/developer-guide/data-types) - Exact timestamp as signed 64-bit integers, representing either seconds, milliseconds, microseconds, or nanoseconds since the UNIX epoch
- [Interval](/docs/arrowjs/developer-guide/data-types) - Time intervals as pairs of either (year, month) or (day, time) in SQL style
- [FixedSizeList](/docs/arrowjs/developer-guide/data-types) - Fixed-size sequences of another logical Arrow data type
- `Data`: a collection of rows in contiguous Arrow memory. This is called "Array" in most Arrow implementations but is called `Data` in Arrow JS to avoid shadowing the JS `Array` type. A `Data` can have one or more underlying buffers, but those buffers all represent the same data. E.g. integer storage like a `Data` of type `Uint8` has two buffers: one for the raw data (directly viewable as a `Uint8Array`) and another for the nullability bitmask, with one bit per row indicating whether the row is null. Nested types can have more buffers. E.g. points can be represented as a `Data` of struct type, where there is one buffer for the `x` coordinates and another buffer for the `y` coordinates.
- `Vector`: a collection of rows in batches. This is essentially a list of `Data`.
- `Field`: metadata that describes an individual `Data` or `Vector`. This contains `name: string`, data type, `nullable: bool`, and `metadata: Map<string, string>`.
- `Schema`: metadata that describes a named collection of `Data` or `Vector`. This is essentially `List<Field>`, but it can also store optional associated `metadata: Map<string, string>`.
- `RecordBatch`: an ordered and named collection of `Data` instances. This is essentially a `List<Data>` plus a `Schema`.
- `Table`: an ordered and named collection of `Vector` instances. This is essentially a `List<Vector>` plus a `Schema`.
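
To make the `Data` / `Vector` relationship above concrete, here is a deliberately simplified model. `ToyData`, `ToyVector`, `vectorLength` and `vectorGet` are hypothetical names invented for this sketch; the real Arrow JS classes carry much more (types, null bitmaps, offsets) and should be consulted directly.

```typescript
// Toy model only: these types are hypothetical simplifications,
// not the Arrow JS `Data` and `Vector` classes.

/** One contiguous chunk of rows (here just a single values buffer). */
interface ToyData {
  values: Float64Array;
}

/** A logical column made of several contiguous chunks. */
interface ToyVector {
  chunks: ToyData[];
}

/** Total number of rows across all chunks. */
function vectorLength(vector: ToyVector): number {
  return vector.chunks.reduce((total, chunk) => total + chunk.values.length, 0);
}

/** Looks up a row by its global index by walking the chunks in order. */
function vectorGet(vector: ToyVector, index: number): number | undefined {
  let remaining = index;
  for (const chunk of vector.chunks) {
    if (remaining < chunk.values.length) {
      return chunk.values[remaining];
    }
    remaining -= chunk.values.length; // skip past this chunk
  }
  return undefined; // index is beyond the last row
}

const toyVector: ToyVector = {
  chunks: [{values: new Float64Array([1, 2, 3])}, {values: new Float64Array([4, 5])}]
};
```

In this model a `RecordBatch` would correspond to one chunk per column, and a `Table` to one `ToyVector` per column, each paired with a schema.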

### Variable-width Data Types

Variable-width types describe lists of values with different widths, including binary blobs, Utf8 code-points, or slices of another underlying Arrow data type. These types store the values contiguously in memory, and have a physical [`Int32Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Int32Array) of offsets that describe the start and end indicies of each list element.

- [List](/docs/arrowjs/developer-guide/data-types) - Variable-length sequences of another logical Arrow data type
- [Utf8](/docs/arrowjs/developer-guide/data-types) - Variable-length byte sequences of UTF8 code-points (strings)
- [Binary](/docs/arrowjs/developer-guide/data-types) - Variable-length byte sequences (no guarantee of UTF8-ness)

### Composite Data Types
## Data Types

Composite types don't have physical data buffers of their own. They contain other Arrow data types and delegate work to them.
At the heart of Arrow is a set of well-known logical [data types](/docs/arrowjs/developer-guide/data-types), ensuring each Column in an Arrow Table is strongly-typed. These data types define how a Column's underlying buffers should be constructed and read, and include configurable (and custom) metadata fields for further annotating a Column. A Schema describing each Column's name and data type is encoded alongside each Column's data buffers, allowing you to consume an Arrow data source without knowing the data types or column layout beforehand.

- [Union](/docs/arrowjs/developer-guide/data-types) - Union of logical child data types
- [Map](/docs/arrowjs/developer-guide/data-types) - Map of named logical child data types
- [Struct](/docs/arrowjs/developer-guide/data-types) - Struct of ordered logical child data types
Each data type falls into one of three rough categories: Fixed-width types, variable-width types, or composite types that contain other Arrow data types. All data types can represent null values, which are stored in a separate validity [bitmask](<https://en.wikipedia.org/wiki/Mask_(computing)>). Follow the links below for a more detailed description of each data type.
33 changes: 17 additions & 16 deletions docs/arrowjs/roadmap.md
@@ -1,24 +1,25 @@
# Roadmap
# Usability Feedback

What's Next for Apache Arrow in Javascript
As loaders.gl and the rest of the vis.gl / Open Visualization frameworks increase their usage of Arrow JS, we are facing some challenges with the library.
This can be considered an open letter of feedback to Arrow JS maintainers.

There are a lot of features we'd like to add over the next few Javascript releases:
## General packaging and documentation

- **Inline predicates**: Function calls in the inner loop of a scan over millions of records can be very expensive. We can potentially save that time by generating a new scan function with the predicates inlined when a filter is created.
- **Semver** - Conforming to semantic versioning (semver) conventions would be a big improvement. Depending on a specific arrowjs version in a vis.gl project is hard as there will soon be a new major version. We need to identify ranges of major versions that are likely to work starting from the last breaking version.
- **Arrow JS release notes** - Solid clean arrowjs release notes written for an end user would help a lot. loaders.gl maintains a page that tries to make sense of the commit lists but keeping it current is a challenge.
- **Roadmap info** - when breaking changes are being worked on
- **Upgrade guides** -
- **Updated docs** - The Arrow JS docs on the web are outdated. It is hard enough for the vis.gl maintainers to learn the API, and the docs have not been great for the average vis.gl user.

- **Cache filter results**: Right now every time we do a scan on a filtered DataFrame we re-check the predicate on every row. There should be an (optional?) lazily computed index to store the predicate results for subsequent re-use.
## Feature wish list

- **Friendlier API**: I shouldn't have to write a custom scan function just to take a look at the results of a filter! Every DataFrame should have a toJSON() function (See ARROW-2202).
These features are perhaps more specific to the usage patterns in the vis.gl frameworks, but could still be general improvements to the library.

- **node.js ↔ (Python, C++, Java, ...) interaction**: A big benefit of Arrow's common in-memory format is that different tools can operate on the same memory. Unfortunately we're pretty closed off in the browser, but node doesn't have that problem! Finishing ARROW-1700, node.js Plasma store client should make this type of interaction possible.
### Pure JS representation of parsed Arrow data.

Have an idea? Tell us! Generally JIRAs are preferred but we'll take GitHub issues too. If you just want to discuss something, reach out on the mailing list or slack. But PRs are the best of all, we can always use more contributors!
loaders.gl's philosophy is to return pure JavaScript structures, rather than classes.
The Arrow JS type system (schemas etc.) could be represented in this way; in fact, loaders.gl maintains such an alternative representation.
This reduces the need for serialization and deserialization.
Having a helper class that can be instantiated on top of the pure data structure is of course fine.
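
A minimal sketch of what such a pure-JavaScript representation might look like follows. The shapes `PlainField` and `PlainSchema` and the helper class `SchemaView` are hypothetical and chosen only for illustration; they are not the actual loaders.gl or Arrow JS types.

```typescript
// Hypothetical plain-object shapes, for illustration only.
type PlainField = {
  name: string;
  type: string; // e.g. 'float32' or 'utf8'; a real design would likely use a richer type
  nullable: boolean;
  metadata?: Record<string, string>;
};

type PlainSchema = {
  fields: PlainField[];
  metadata?: Record<string, string>;
};

/** Optional helper class instantiated on top of the plain data structure. */
class SchemaView {
  constructor(readonly schema: PlainSchema) {}

  getField(name: string): PlainField | undefined {
    return this.schema.fields.find((field) => field.name === name);
  }
}

const plainSchema: PlainSchema = {
  fields: [
    {name: 'x', type: 'float32', nullable: false},
    {name: 'label', type: 'utf8', nullable: true}
  ]
};

// Plain objects survive structured cloning or JSON without custom (de)serialization.
const roundTripped: PlainSchema = JSON.parse(JSON.stringify(plainSchema));
const schemaView = new SchemaView(roundTripped);
```

Because the schema is plain data, it can be posted to and from a worker as-is, and the helper class can be re-created cheaply on either side.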

## Feature Completeness

Ideally each Apache Arrow language binding would offer the same set of features, at least to the extent that the language/platform in question allows. In practice however, not all features have been implemented in all language bindings.

In comparison with the C++ Arrow API bindings, there are some missing features in the JavaScript bindings:

- Tensors are not yet supported.
- No explicit support for Apache Arrow Flight
TBA...
4 changes: 4 additions & 0 deletions docs/arrowjs/upgrade-guide.md
@@ -6,6 +6,10 @@ Also Apache Arrow JS follows a common cross-language versioning number scheme wh

The biggest changes were made in Apache Arrow JS Version 9.0 (based on feedback from loaders.gl users).

## Upgrading to v16.0

- No significant changes in Apache Arrow JS

## Upgrading to v15.0

- No significant changes in Apache Arrow JS
27 changes: 27 additions & 0 deletions docs/arrowjs/whats-new.md
@@ -18,6 +18,33 @@ especially if your app uses multiple JavaScript packages dependent on arrow. You
different arrow js versions or the build may break due to version requirement incompatibilities.
:::

## v17.0

July 16, 2024

- [Apache Arrow 17.0.0](https://arrow.apache.org/release/17.0.0.html)

## v16.1

May 14, 2024

- GH-40407 - [JS] Fix string coercion in MapRowProxyHandler.ownKeys (#40408)
- GH-39131 - [JS] Add at() for array like types (#40730)
- GH-39482 - [JS] Refactor imports (#39483)
- GH-40959 - [JS] Store Timestamps in 64 bits (#40960)
- GH-40989 - [JS] Update dependencies (#40990)

## v16.0

April 20, 2024

- [Apache Arrow 16.0.0](https://arrow.apache.org/release/16.0.0.html)
- GH-40718 - [JS] Fix set visitor in vectors for js dates (#40725)
- GH-40851 - [JS] Fix nullcount and make vectors created from typed arrays not nullable (#40852)
- GH-40891 - [JS] Store Dates as TimestampMillisecond (#40892)
- GH-41015 - [JS][Benchmarking] allow JS benchmarks to run more portably (#41031)
- GH-40784 - [JS] Use bigIntToNumber (#40785)

## v15.0

Jan 21, 2024
15 changes: 2 additions & 13 deletions docs/developer-guide/loader-categories.md
@@ -12,26 +12,15 @@ The fact that loaders belong to categories enable applications to flexibly regis

For instance, once an application has added support for one loader in a category, other loaders in the same category can be registered during application startup.

Original code

```typescript
import {parse, registerLoaders} from '@loaders.gl/core';
import {PCDLoader} from '@loaders.gl/pcd';
registerLoaders([PCDLoader]);
import {PCDLoader} from '@loaders.gl/pcd';
async function loadPointCloud(url) {
const pointCloud = await parse(fetch(url));
const pointCloud = await parse(fetch(url), PCDLoader);
// Use some WebGL framework to render the parsed cloud
}
```

Now support for additional point cloud formats can be added to the application without touching the original code:

```typescript
import {LASLoader} from '@loaders.gl/las';
import {DracoLoader} from '@loaders.gl/draco';
registerLoaders([LASLoader, DracoLoader]);
```

## Data Format

Each category documents the returned data format. loaders and writers reference the category documentation.
2 changes: 1 addition & 1 deletion modules/arrow/package.json
@@ -63,7 +63,7 @@
"@loaders.gl/wkt": "^5.0.0-alpha.0",
"@loaders.gl/worker-utils": "^5.0.0-alpha.0",
"@math.gl/polygon": "^4.1.0",
"apache-arrow": ">= 15.0.0"
"apache-arrow": ">= 16.1.0"
},
"peerDependencies": {
"@loaders.gl/core": "^4.0.0"
5 changes: 2 additions & 3 deletions modules/parquet/package.json
@@ -87,12 +87,11 @@
"@types/node": "^10.14.15",
"@types/node-int64": "^0.4.29",
"@types/thrift": "^0.10.8",
"@types/varint": "^5.0.0",
"apache-arrow": "^15.0.0"
"@types/varint": "^5.0.0"
},
"peerDependencies": {
"@loaders.gl/core": "^4.0.0",
"apache-arrow": ">= 15.0.0"
"apache-arrow": ">= 16.1.0"
},
"gitHead": "3213679d79e6ff2814d48fd3337acfa446c74099"
}
2 changes: 1 addition & 1 deletion package.json
@@ -60,7 +60,7 @@
"apache-arrow version should be aligned with parquet-wasm version"
],
"resolutions": {
"apache-arrow": "^15.0.0",
"apache-arrow": "^17.0.0",
"typescript": "^5.6.0",
"@cmfcmf/docusaurus-search-local": "~1.1.0",
"@docusaurus/module-type-aliases": "~3.3.2",