-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unaligned writes to fs.ReadStream internal Buffer pool leads to misaligned reads #24817
Comments
A third option comes from POSIX, where client provide their own buffer, and guarantee alignment by allocating it with something like valloc(). +1 . As JavaScript and especially v8 continue to support optimizations around binary data, enabling zero-copy aligned memory access is good. |
bnoordhuis
added
buffer
Issues and PRs related to the buffer subsystem.
fs
Issues and PRs related to the fs subsystem / file system.
labels
Dec 4, 2018
PR welcome. 285d8c6 from 6.5 years ago can be used as inspiration. :-) |
3 tasks
wesm
pushed a commit
to apache/arrow
that referenced
this issue
Jan 13, 2019
…ts, dependencies It's the big one; The Great ArrowJS Refactor of 2018. Thanks for bearing with me through yet another huge PR. [Check out this sweet gif](https://user-images.githubusercontent.com/178183/50551046-19a94d00-0c30-11e9-80ed-74b9290e8c49.gif) of all the new features in action. With streaming getting to a good place, we've already started working on demos/integrations with other projects like [uber/deck.gl](https://github.com/Pessimistress/deck.gl/tree/a5940e20cb1659a44cba7839082b0803a997a12f/test/apps/arrow) 🎉 ### The JIRAs In addition to everything I detail below, this PR closes the following JIRAs: * [ARROW-2828](https://issues.apache.org/jira/browse/ARROW-2828): Refactor Vector Data classes * [ARROW-2839](https://issues.apache.org/jira/browse/ARROW-2839): Support whatwg/streams in IPC reader/writer * [ARROW-2235](https://issues.apache.org/jira/browse/ARROW-2235): Add tests for IPC messages split across multiple buffers * [ARROW-3337](https://issues.apache.org/jira/browse/ARROW-3337): IPC writer doesn't serialize the dictionary of nested Vectors * [ARROW-3689](https://issues.apache.org/jira/browse/ARROW-3689): Upgrade to TS 3.1 * [ARROW-3560](https://issues.apache.org/jira/browse/ARROW-3560): Remove @std/esm * [ARROW-3561](https://issues.apache.org/jira/browse/ARROW-3561): Update ts-jest * [ARROW-2778](https://issues.apache.org/jira/browse/ARROW-2778): Add Utf8Vector.from * [ARROW-2766](https://issues.apache.org/jira/browse/ARROW-2766): Add ability to construct a Table from a list of Arrays/TypedArrays ### The stats The gulp scripts have been updated to parallelize as much as possible. These are the numbers from my Intel Core i7-8700K CPU @ 3.70GHz × 12 running Ubuntu 18.04 and node v11.6.0: ```sh $ time npm run build [22:11:04] Finished 'build' after 39 s real 0m40.341s user 4m55.428s sys 0m5.559s ``` ```sh $ npm run test:coverage =============================== Coverage summary =============================== Statements : 90.45% ( 4321/4777 ) Branches : 76.7% ( 1570/2047 ) Functions : 84.62% ( 1106/1307 ) Lines : 91.5% ( 3777/4128 ) ================================================================================ Test Suites: 21 passed, 21 total Tests: 5644 passed, 5644 total Snapshots: 0 total Time: 16.023s ``` ### The fixes * `Vector#indexOf(value)` works for all DataTypes * `Vector#set(i, value)` now works for all DataTypes * Reading from node streams is now fully zero-copy * The IPC writers now serialize dictionaries of nested Vectors correctly (ARROW-3337) * DictionaryBatches marked as `isDelta` now correctly updates the dictionaries for all Vectors that point to that dictionary, even if they were created before the delta batch arrived * A few `arrow2csv` fixes: * Ignore `stdin` if it's a TTY * Now read all the Arrow formats from `stdin` * Always show the `help` text when we don't understand the input * Proper backpressure support to play nicely with other Unix utilities like `head` and `less` * [Fixes an unfiled bug](trxcllnt@070ec98) we encountered last week where JS would throw an error creating RowProxies for a Table or Struct with duplicate column names ### The upgrades * New zero-copy Message/RecordBatchReaders! * [`RecordBatchReader.from()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/from-inference-tests.ts#L37) will peek at the underlying bytes, and return the correct implementation based on whether the data is an Arrow File, Stream, or JSON * [`RecordBatchFileReader`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/file-reader-tests.ts#L74) now supports random-access seek, enabling more efficient web-worker/multi-process workflows * [`RecordBatchStreamReader`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L119) can now read multiple tables from the same underlying socket * `MessageReader` now [guarantees/enforces](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/ipc/message.ts#L126) message body byte alignment (this one even surfaced bugs in [node core](nodejs/node#24817) and the [DOM streams polyfill](MattiasBuelens/web-streams-polyfill#3)) * New RecordBatchWriters * Adds RecordBatchJSONWriter, RecordBatchFileWriter and RecordBatchStreamWriter * Adds static `RecordBatchWriter.writeAll()` method to easily write a Table or stream of RecordBatches * Both sync and async flushes based on the WritableSink * Full integration with platform I/O primitives * We can still synchronously read JSON, Buffers, `Iterable<Buffer>`, or `AsyncIterable<Buffer>` * In node, we can now read from any [`ReadableStream`](https://nodejs.org/docs/latest/api/stream.html#stream_class_stream_readable), [`fs.FileHandle`](https://nodejs.org/docs/latest/api/fs.html#fs_class_filehandle) * In the browser, we can read from any [`ReadableStream` or `ReadableByteStream`](https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream), or the [`Response`](https://developer.mozilla.org/en-US/docs/Web/API/Response) returned from the `fetch()` API. (Wrapping the [FileReader](https://developer.mozilla.org/en-US/docs/Web/API/FileReader) is still todo) * We also [accept Promises](https://github.com/Pessimistress/deck.gl/blob/a5940e20cb1659a44cba7839082b0803a997a12f/test/apps/arrow/loader.js#L20) of any of the above * New convenience methods for integrating with node or DOM streams * [`throughNode()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-node-tests.ts#L54)/[`throughDOM()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L50) * [`toReadableNodeStream()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-node-tests.ts#L69)/[`toReadableDOMStream()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L65) * [`pipe()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/writer/streams-node-tests.ts#L91)/[`pipeTo()`/`pipeThrough()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/writer/streams-dom-tests.ts#L92) * Generic type parameters inherited from `DataType` now flow recursively ```js const table = Table.from<{ str: Utf8, i32: Int32, bools: List<Bool> }>(data); table.get(0); // will be of type { str: string, i32: number, bools: BoolVector } ``` * New simplified [`Data` class](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/data.ts) * New simplified, faster `Visitor` class with support for optional, more narrow [`visitT` implementations](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/visitor.ts#L181) * New specialized Visitor implementations to enable runtime reflection (e.g. dynamically lookup the Vector constructor for a given DataType) * New abstract `Chunked` base class for the applicative (concat) operation * public `chunkedInst.chunks` field is the list of inner chunks * New `Column` class extends `Chunked`, combines `Field` with the chunks (provides access to the field `name` from the Schema) * `RecordBatch#concat(...batchesOrTables)` now returns a Table * Table now extends `Chunked`, so it inherits: * `Table#slice(from, to)` * `Table#concat(...batchesOrTables)` * `Table#getChildAt(i)` exists, alias of `getColumnAt(i)` * `Table#getColumn[At]()` returns a Column ### The breaking changes * All the old IPC functions are gone, but the new APIs will live for much longer * `Table#batches` is now `Table#chunks`, which it inherits from `Chunked` (maybe controversial, open to aliasing) * `Table#batchesUnion` is now just... the Table instance itself (also maybe controversial, open to aliasing) * `DataType#TType` is now `DataType#typeId` -- it should have always been this, was a typo. Easy to alias if necessary. * The complicated View classes are now gone, logic centralized as specialized [`Visitors`](https://github.com/trxcllnt/arrow/tree/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/visitor) ### The tests * **Tests no longer rely on any C++ or Java generated integration files** * Integration tests have been moved into `bin/integration.js`, and they finish much quicker * The tsconfig files have been tweaked to speed up test run time and improve the async debugging experience * A streaming `RecordBatchJSONWriter` has been implemented so we can easily debug and validate written output * The JSON results are also tested against the corresponding binary representation, similar to the integration tests * A [suite of test-data helpers](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/generate-test-data.ts) have been added to auto-generate data for validation at runtime * They produce the underlying Arrow VectorData buffers, as well as the expected plain-JS-value representation [for verification](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/generated-data-tests.ts#L23) * This allows us to test all possible type configuration combinations, e.g. [all types Dictionary-encode](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/data/tables.ts#L61), all types serialize when nested, etc. * A [suite of IO test helpers](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/ipc/helpers.ts#L36) has been added * We use [`memfs`](https://www.npmjs.com/package/memfs) to mock the file system, which contributes to test performance improvements * This enables us to [easily test](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/ipc/reader/file-reader-tests.ts#L38) all the flavors of io primitives across node and browser environments * A vscode debugging launch configuration has been added to ease the process of contributing more tests (and because I've been asked for mine so often) ### The build * Faster * Node 11+ (needs `Symbol.asyncIterator` enabled) * Closure-compiler upgrades and build enhancements mean we can auto-generate the externs file during compilation, rather than maintaining it by hand ### Misc * Added `arrow2csv` to `js/bin/arrow2csv`, so anybody with the JS project dependencies installed can easily view a CSV-ish thing (`cat foo.arrow | js/bin/arrow2csv.js`) ### Todos * Docs/Recipes/Examples * Highlight/write more tools (like `arrow2csv`) * Flesh out the RecordBatchWriters a bit more * Gather feedback on the new RecordBatchReader APIs Author: ptaylor <paul.e.taylor@me.com> Author: Paul Taylor <paul.e.taylor@me.com> Closes #3290 from trxcllnt/js-data-refactor and squashes the following commits: 2ef150f <ptaylor> bind getByteWidth to the vector type 9acfaa3 <ptaylor> handle the case where collapsed Uint8Arrays fully overlap 6a97ee0 <ptaylor> perf: defer creating rowProxy on nested types, use Array instead of Object for creating Data instances 2cad760 <ptaylor> pipe directly to stdout to ensure backpressure is preserved f006a26 <ptaylor> ensure schema and field always have a metadata map 8dc5d2c <ptaylor> fix Float64 Array typings 162c7d8 <ptaylor> fix arrow2csv left-pad measurement for new bignum/decimal output 64dc015 <ptaylor> teach closure about Symbol.toPrimitive ca0db9e <ptaylor> fix lint ec12cdd <ptaylor> add a small BigNum mixin to make working with Int64 and Decimal values a bit easier 62578b9 <ptaylor> fix bug where valueToString function would return undefined (JSON.striingify(undefined) === undefined) 4b58bde <ptaylor> fix visitor method overload type signatures d165413 <ptaylor> don't print comma that includes system paths 708f1b4 <ptaylor> move stride to data, fix chunked slicing, remove intermediate binding and getters in favor of direct property accesses 78ecc4c <ptaylor> use the textencoders from the global instead of Buffer for perf testing 47f0677 <ptaylor> perf: use a closure instead of binding 380dbc7 <ptaylor> add a single-chunk column type 6bcaad6 <ptaylor> fix lint f7d2b2e <ptaylor> add getters for the dictionary and indices of chunked dictionary vectors aaf42c8 <Paul Taylor> Consolidated JS data handling refactor
trxcllnt
added a commit
to trxcllnt/node
that referenced
this issue
Apr 14, 2019
Prevents alignment issues when creating a typed array from a buffer. Fixes: nodejs#24817
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
The internal buffer pooling inside
fs.ReadStream
doesn't align its pooled writes to any byte boundaries, leading to issues consuming chunks where alignment is required (such as creating typed array views over the underlying ArrayBuffer).In the simplest case, picture two fs.ReadStreams that are consumed concurrently. One is consumed with no alignment requirement, but the second must be consumed aligned to an 8-byte boundary (for example, as chunks of Float64s). Reads from the unaligned stream will cause the
pool.used
offset to advance by positions that aren't multiples of 8, such that the next read from the aligned stream won't be written at a multiple-of-8 offset into the pool.Attached is a repro script demonstrating the issue. The byte lengths of the test files have been carefully selected to reliably reproduce the problem.
One possible fix is to add an alignment member variable to the fs.ReadStream, such that writes into the pool are aligned correctly for each stream instance. Another alternative is to always write on some base address alignment (128 is common), but this will use slightly more memory and is less flexible than the former.
The text was updated successfully, but these errors were encountered: