
Unaligned writes to fs.ReadStream internal Buffer pool lead to misaligned reads #24817

Closed
trxcllnt opened this issue Dec 3, 2018 · 2 comments
Labels
buffer Issues and PRs related to the buffer subsystem. fs Issues and PRs related to the fs subsystem / file system.

Comments

@trxcllnt
Contributor

trxcllnt commented Dec 3, 2018

  • Version: v11.3.0
  • Platform: Linux 4.15.0-42-generic Ubuntu x86_64
  • Subsystem: fs

The internal buffer pooling inside fs.ReadStream doesn't align its pooled writes to any byte boundaries, leading to issues consuming chunks where alignment is required (such as creating typed array views over the underlying ArrayBuffer).

In the simplest case, picture two fs.ReadStreams that are consumed concurrently. One is consumed with no alignment requirement, but the second must be consumed aligned to an 8-byte boundary (for example, as chunks of Float64s). Reads from the unaligned stream will cause the pool.used offset to advance by positions that aren't multiples of 8, such that the next read from the aligned stream won't be written at a multiple-of-8 offset into the pool.
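The failure mode can be reproduced in isolation, without fs.ReadStream at all (a minimal sketch): a Float64Array view over an ArrayBuffer throws whenever its byte offset isn't a multiple of 8, which is exactly the state an unaligned read leaves the pool in.

```js
// A Float64Array view requires byteOffset % 8 === 0; an unaligned pool
// offset (simulated here as 3) makes the view construction throw.
const pool = new ArrayBuffer(64);

const aligned = new Float64Array(pool, 0, 4); // fine: offset 0 is 8-byte aligned
console.log(aligned.length); // 4

let error;
try {
  new Float64Array(pool, 3, 4); // offset 3 is not a multiple of 8
} catch (e) {
  error = e;
}
console.log(error instanceof RangeError); // true
```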

Attached is a repro script demonstrating the issue. The byte lengths of the test files have been carefully selected to reliably reproduce the problem.

One possible fix is to add an alignment member variable to fs.ReadStream, so that writes into the pool are aligned correctly for each stream instance. An alternative is to always write at some base alignment (128 bytes is common), but this uses slightly more memory and is less flexible than the former.
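Either fix reduces to rounding the pool's write offset up to the required alignment before each read. A hypothetical sketch (`alignUp` is illustrative, not a Node internal), using the standard power-of-two bit trick:

```js
// Round `offset` up to the next multiple of `alignment` (a power of two).
function alignUp(offset, alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}

// e.g. the offending pool offset from the error output below:
console.log(alignUp(51050, 8)); // 51056
console.log(alignUp(51056, 8)); // 51056 (already aligned, unchanged)
```

Skipping at most `alignment - 1` bytes per read is the only cost; the pool is recycled anyway once exhausted.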

```sh
$ node createReadStreamTest.js
advanceUnaligned done
RangeError: start offset of Float64Array should be a multiple of 8. Received 51050
    at asTypedArray (createReadStreamTest.js:63:15)
    at readN (createReadStreamTest.js:36:5)
    at process.internalTickCallback (internal/process/next_tick.js:77:7)
```


```js
// createReadStreamTest.js
const fs = require('fs');

(async () => {

    // File sizes chosen carefully to reliably reproduce the misalignment.
    await makeTestFile(`${__dirname}/bytesA`, 116586);
    await makeTestFile(`${__dirname}/bytesB`, 3961808);

    const streamA = fs.createReadStream(`${__dirname}/bytesA`);
    const streamB = fs.createReadStream(`${__dirname}/bytesB`);

    // Drain streamA with arbitrary (unaligned) read sizes first, then
    // attempt 8-byte-aligned reads from streamB out of the same pool.
    await advanceUnaligned(streamA);
    await advanceAligned8s(streamB);

    [streamA, streamB].forEach((s) => s.close());
})()
.then(() => 0, (err) => console.error(err) || 1)
.then((code) => process.exit(code));

async function makeTestFile(p, nBytes) {
    const exists = async (p) => {
        try { return !!(await fs.promises.stat(p)); }
        catch (e) { return false; }
    };
    if (!(await exists(p))) {
        const buffer = Uint8Array.from({ length: nBytes }, (_, i) => i);
        await fs.promises.writeFile(p, Buffer.from(buffer.buffer, 0, buffer.byteLength));
    }
}

// Read n bytes from the stream and validate the chunk's alignment.
async function readN(stream, Arr, n) {
    let buf = stream.read(n);
    if (!buf) {
        await onEvent(stream, 'readable');
        buf = stream.read(n);
    }
    asTypedArray(Arr, buf);
}

async function advanceUnaligned(stream) {
    const end = onEvent(stream, 'end');
    while (1) {
        if ('end' === (await Promise.race([end, readN(stream, Uint16Array, (Math.random() * 100 | 0) * 6)]))) { break; }
        if ('end' === (await Promise.race([end, readN(stream, Uint16Array, (Math.random() * 100 | 0) * 4)]))) { break; }
        if ('end' === (await Promise.race([end, readN(stream, Uint16Array, (Math.random() * 100 | 0) * 2)]))) { break; }
    }
    console.log(`advanceUnaligned done`);
}

async function advanceAligned8s(stream) {
    const end = onEvent(stream, 'end');
    while (1) {
        if ('end' === (await Promise.race([end, readN(stream, Float64Array, (Math.random() * 100 | 0) * 8)]))) { break; }
        if ('end' === (await Promise.race([end, readN(stream, Float64Array, (Math.random() * 100 | 0) * 8)]))) { break; }
        if ('end' === (await Promise.race([end, readN(stream, Float64Array, (Math.random() * 100 | 0) * 8)]))) { break; }
    }
    console.log(`advanceAligned8s done`);
}

function onEvent(stream, event) { return new Promise((r) => stream.once(event, () => r(event))); }

// Throws if the chunk's byteOffset into the shared pool isn't aligned
// to the typed array's element size.
function asTypedArray(ArrayCtor, buf) {
    const { buffer, byteOffset, byteLength } = buf || new Uint8Array(ArrayCtor.BYTES_PER_ELEMENT);
    if (byteOffset % ArrayCtor.BYTES_PER_ELEMENT !== 0) {
        throw new RangeError(`start offset of ${ArrayCtor.name} should be a multiple of ${ArrayCtor.BYTES_PER_ELEMENT}. Received ${byteOffset}`);
    }
    return new ArrayCtor(buffer, byteOffset, byteLength / ArrayCtor.BYTES_PER_ELEMENT);
}
```
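Until the pool behavior changes, a userland workaround (a sketch; `viewOrCopy` is a hypothetical helper, not part of any library) is to copy a chunk into un-pooled memory before creating the view, paying the copy cost only when alignment is actually violated:

```js
// Return a typed-array view over `buf`, copying only if misaligned.
function viewOrCopy(ArrayCtor, buf) {
  const bpe = ArrayCtor.BYTES_PER_ELEMENT;
  let { buffer, byteOffset, byteLength } = buf;
  if (byteOffset % bpe !== 0) {
    // allocUnsafeSlow gives the Buffer its own ArrayBuffer, so its
    // byteOffset is 0 and therefore aligned for any element size.
    const copy = Buffer.allocUnsafeSlow(byteLength);
    buf.copy(copy);
    ({ buffer, byteOffset, byteLength } = copy);
  }
  return new ArrayCtor(buffer, byteOffset, Math.floor(byteLength / bpe));
}
```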
@lmeyerov

lmeyerov commented Dec 4, 2018

A third option comes from POSIX, where clients provide their own buffers and guarantee alignment by allocating them with something like valloc().

+1. As JavaScript, and especially V8, continues to add optimizations around binary data, enabling zero-copy aligned memory access is valuable.

@bnoordhuis bnoordhuis added buffer Issues and PRs related to the buffer subsystem. fs Issues and PRs related to the fs subsystem / file system. labels Dec 4, 2018
@bnoordhuis
Member

PR welcome. 285d8c6 from 6.5 years ago can be used as inspiration. :-)

wesm pushed a commit to apache/arrow that referenced this issue Jan 13, 2019
…ts, dependencies

It's the big one; The Great ArrowJS Refactor of 2018. Thanks for bearing with me through yet another huge PR. [Check out this sweet gif](https://user-images.githubusercontent.com/178183/50551046-19a94d00-0c30-11e9-80ed-74b9290e8c49.gif) of all the new features in action. With streaming getting to a good place, we've already started working on demos/integrations with other projects like [uber/deck.gl](https://github.com/Pessimistress/deck.gl/tree/a5940e20cb1659a44cba7839082b0803a997a12f/test/apps/arrow) 🎉

### The JIRAs

In addition to everything I detail below, this PR closes the following JIRAs:

* [ARROW-2828](https://issues.apache.org/jira/browse/ARROW-2828): Refactor Vector Data classes
* [ARROW-2839](https://issues.apache.org/jira/browse/ARROW-2839): Support whatwg/streams in IPC reader/writer
* [ARROW-2235](https://issues.apache.org/jira/browse/ARROW-2235): Add tests for IPC messages split across multiple buffers
* [ARROW-3337](https://issues.apache.org/jira/browse/ARROW-3337): IPC writer doesn't serialize the dictionary of nested Vectors
* [ARROW-3689](https://issues.apache.org/jira/browse/ARROW-3689): Upgrade to TS 3.1
* [ARROW-3560](https://issues.apache.org/jira/browse/ARROW-3560): Remove @std/esm
* [ARROW-3561](https://issues.apache.org/jira/browse/ARROW-3561): Update ts-jest
* [ARROW-2778](https://issues.apache.org/jira/browse/ARROW-2778): Add Utf8Vector.from
* [ARROW-2766](https://issues.apache.org/jira/browse/ARROW-2766): Add ability to construct a Table from a list of Arrays/TypedArrays

### The stats

The gulp scripts have been updated to parallelize as much as possible. These are the numbers from my Intel Core i7-8700K CPU @ 3.70GHz × 12 running Ubuntu 18.04 and node v11.6.0:

```sh
$ time npm run build
[22:11:04] Finished 'build' after 39 s

real	0m40.341s
user	4m55.428s
sys	0m5.559s
```
```sh
$ npm run test:coverage
=============================== Coverage summary ===============================
Statements   : 90.45% ( 4321/4777 )
Branches     : 76.7% ( 1570/2047 )
Functions    : 84.62% ( 1106/1307 )
Lines        : 91.5% ( 3777/4128 )
================================================================================

Test Suites: 21 passed, 21 total
Tests:       5644 passed, 5644 total
Snapshots:   0 total
Time:        16.023s
```

### The fixes

* `Vector#indexOf(value)` works for all DataTypes
* `Vector#set(i, value)` now works for all DataTypes
* Reading from node streams is now fully zero-copy
* The IPC writers now serialize dictionaries of nested Vectors correctly (ARROW-3337)
* DictionaryBatches marked as `isDelta` now correctly update the dictionaries for all Vectors that point to that dictionary, even if they were created before the delta batch arrived
* A few `arrow2csv` fixes:
  * Ignore `stdin` if it's a TTY
  * Now read all the Arrow formats from `stdin`
  * Always show the `help` text when we don't understand the input
  * Proper backpressure support to play nicely with other Unix utilities like `head` and `less`
* [Fixes an unfiled bug](trxcllnt@070ec98) we encountered last week where JS would throw an error creating RowProxies for a Table or Struct with duplicate column names

### The upgrades

* New zero-copy Message/RecordBatchReaders!
  * [`RecordBatchReader.from()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/from-inference-tests.ts#L37) will peek at the underlying bytes, and return the correct implementation based on whether the data is an Arrow File, Stream, or JSON
  * [`RecordBatchFileReader`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/file-reader-tests.ts#L74) now supports random-access seek, enabling more efficient web-worker/multi-process workflows
  * [`RecordBatchStreamReader`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L119) can now read multiple tables from the same underlying socket
  * `MessageReader` now [guarantees/enforces](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/ipc/message.ts#L126) message body byte alignment (this one even surfaced bugs in [node core](nodejs/node#24817) and the [DOM streams polyfill](MattiasBuelens/web-streams-polyfill#3))
* New RecordBatchWriters
  * Adds RecordBatchJSONWriter, RecordBatchFileWriter and RecordBatchStreamWriter
  * Adds static `RecordBatchWriter.writeAll()` method to easily write a Table or stream of RecordBatches
  * Both sync and async flushes based on the WritableSink
* Full integration with platform I/O primitives
  * We can still synchronously read JSON, Buffers, `Iterable<Buffer>`, or `AsyncIterable<Buffer>`
  * In node, we can now read from any [`ReadableStream`](https://nodejs.org/docs/latest/api/stream.html#stream_class_stream_readable), [`fs.FileHandle`](https://nodejs.org/docs/latest/api/fs.html#fs_class_filehandle)
  * In the browser, we can read from any [`ReadableStream` or `ReadableByteStream`](https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream), or the [`Response`](https://developer.mozilla.org/en-US/docs/Web/API/Response) returned from the `fetch()` API. (Wrapping the [FileReader](https://developer.mozilla.org/en-US/docs/Web/API/FileReader) is still todo)
  * We also [accept Promises](https://github.com/Pessimistress/deck.gl/blob/a5940e20cb1659a44cba7839082b0803a997a12f/test/apps/arrow/loader.js#L20) of any of the above
  * New convenience methods for integrating with node or DOM streams
    * [`throughNode()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-node-tests.ts#L54)/[`throughDOM()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L50)
    * [`toReadableNodeStream()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-node-tests.ts#L69)/[`toReadableDOMStream()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L65)
    * [`pipe()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/writer/streams-node-tests.ts#L91)/[`pipeTo()`/`pipeThrough()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/writer/streams-dom-tests.ts#L92)
* Generic type parameters inherited from `DataType` now flow recursively
  ```js
  const table = Table.from<{ str: Utf8, i32: Int32, bools: List<Bool> }>(data);
  table.get(0); // will be of type { str: string, i32: number, bools: BoolVector }
  ```
* New simplified [`Data` class](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/data.ts)
* New simplified, faster `Visitor` class with support for optional, more narrow [`visitT` implementations](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/visitor.ts#L181)
  * New specialized Visitor implementations to enable runtime reflection (e.g. dynamically lookup the Vector constructor for a given DataType)
* New abstract `Chunked` base class for the applicative (concat) operation
  * public `chunkedInst.chunks` field is the list of inner chunks
* New `Column` class extends `Chunked`, combines `Field` with the chunks (provides access to the field `name` from the Schema)
* `RecordBatch#concat(...batchesOrTables)` now returns a Table
* Table now extends `Chunked`, so it inherits:
  * `Table#slice(from, to)`
  * `Table#concat(...batchesOrTables)`
  * `Table#getChildAt(i)` exists, alias of `getColumnAt(i)`
* `Table#getColumn[At]()` returns a Column
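The `MessageReader` alignment guarantee listed above relies on the Arrow IPC rule that message bodies are padded to 8-byte boundaries; the padding arithmetic is the same bit trick in every implementation (an illustrative helper, not the actual arrow-js source):

```js
// Pad a byte length up to the next 8-byte boundary, per the Arrow IPC format.
const padTo8 = (n) => (n + 7) & ~7;

console.log(padTo8(0));  // 0
console.log(padTo8(1));  // 8
console.log(padTo8(16)); // 16
```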

### The breaking changes

* All the old IPC functions are gone, but the new APIs will live for much longer
* `Table#batches` is now `Table#chunks`, which it inherits from `Chunked` (maybe controversial, open to aliasing)
* `Table#batchesUnion` is now just... the Table instance itself (also maybe controversial, open to aliasing)
* `DataType#TType` is now `DataType#typeId` -- it should have always been this, was a typo. Easy to alias if necessary.
* The complicated View classes are now gone, logic centralized as specialized [`Visitors`](https://github.com/trxcllnt/arrow/tree/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/visitor)

### The tests

* **Tests no longer rely on any C++ or Java generated integration files**
* Integration tests have been moved into `bin/integration.js`, and they finish much quicker
* The tsconfig files have been tweaked to speed up test run time and improve the async debugging experience
* A streaming `RecordBatchJSONWriter` has been implemented so we can easily debug and validate written output
  * The JSON results are also tested against the corresponding binary representation, similar to the integration tests
* A [suite of test-data helpers](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/generate-test-data.ts) has been added to auto-generate data for validation at runtime
  * They produce the underlying Arrow VectorData buffers, as well as the expected plain-JS-value representation [for verification](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/generated-data-tests.ts#L23)
  * This allows us to test all possible type configuration combinations, e.g. [all types Dictionary-encode](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/data/tables.ts#L61), all types serialize when nested, etc.
* A [suite of IO test helpers](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/ipc/helpers.ts#L36) has been added
  * We use [`memfs`](https://www.npmjs.com/package/memfs) to mock the file system, which contributes to test performance improvements
  * This enables us to [easily test](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/ipc/reader/file-reader-tests.ts#L38) all the flavors of io primitives across node and browser environments
* A vscode debugging launch configuration has been added to ease the process of contributing more tests (and because I've been asked for mine so often)

### The build

* Faster
* Node 11+ (needs `Symbol.asyncIterator` enabled)
* Closure-compiler upgrades and build enhancements mean we can auto-generate the externs file during compilation, rather than maintaining it by hand

### Misc

* Added `arrow2csv` to `js/bin/arrow2csv`, so anybody with the JS project dependencies installed can easily view a CSV-ish thing (`cat foo.arrow | js/bin/arrow2csv.js`)

### Todos

* Docs/Recipes/Examples
* Highlight/write more tools (like `arrow2csv`)
* Flesh out the RecordBatchWriters a bit more
* Gather feedback on the new RecordBatchReader APIs

Author: ptaylor <paul.e.taylor@me.com>
Author: Paul Taylor <paul.e.taylor@me.com>

Closes #3290 from trxcllnt/js-data-refactor and squashes the following commits:

2ef150f <ptaylor> bind getByteWidth to the vector type
9acfaa3 <ptaylor> handle the case where collapsed Uint8Arrays fully overlap
6a97ee0 <ptaylor> perf: defer creating rowProxy on nested types, use Array instead of Object for creating Data instances
2cad760 <ptaylor> pipe directly to stdout to ensure backpressure is preserved
f006a26 <ptaylor> ensure schema and field always have a metadata map
8dc5d2c <ptaylor> fix Float64 Array typings
162c7d8 <ptaylor> fix arrow2csv left-pad measurement for new bignum/decimal output
64dc015 <ptaylor> teach closure about Symbol.toPrimitive
ca0db9e <ptaylor> fix lint
ec12cdd <ptaylor> add a small BigNum mixin to make working with Int64 and Decimal values a bit easier
62578b9 <ptaylor> fix bug where valueToString function would return undefined (JSON.striingify(undefined) === undefined)
4b58bde <ptaylor> fix visitor method overload type signatures
d165413 <ptaylor> don't print comma that includes system paths
708f1b4 <ptaylor> move stride to data, fix chunked slicing, remove intermediate binding and getters in favor of direct property accesses
78ecc4c <ptaylor> use the textencoders from the global instead of Buffer for perf testing
47f0677 <ptaylor> perf: use a closure instead of binding
380dbc7 <ptaylor> add a single-chunk column type
6bcaad6 <ptaylor> fix lint
f7d2b2e <ptaylor> add getters for the dictionary and indices of chunked dictionary vectors
aaf42c8 <Paul Taylor> Consolidated JS data handling refactor
trxcllnt added a commit to trxcllnt/node that referenced this issue Apr 14, 2019
Prevents alignment issues when creating a typed array from a buffer.

Fixes: nodejs#24817
targos pushed a commit that referenced this issue May 4, 2019
Prevents alignment issues when creating a typed array from a buffer.

Fixes: #24817

PR-URL: #24838
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>