ARROW-1990: [JS] C++ Refactor, Add DataFrame #1482

Closed
wants to merge 75 commits into from
Changes from 18 commits
Commits
74e828a
fix typings issues (ARROW-1903)
trxcllnt Jan 11, 2018
6ff18e9
ship es2015 commonJS in main package to avoid confusion
trxcllnt Jan 11, 2018
62db338
update dependencies and add es6+ umd targets to jest transform ignore…
trxcllnt Jan 11, 2018
61dc699
WIP -- refactor types to closer match arrow-cpp
trxcllnt Jan 11, 2018
d18d915
fix struct and map rows
trxcllnt Jan 11, 2018
a1edac2
Add perf tests for table scans
Jan 5, 2018
30f0330
Add basic DataFrame impl ...
Jan 9, 2018
796f45d
add DataFrame filter and count ops
Jan 10, 2018
4d9e8c0
Add concept of predicates for filtering dataframes
Jan 10, 2018
aa999f8
Add DictionaryVector optimization for equals predicate
Jan 10, 2018
2744c63
Remove Chunked/Simple DataFrame distinction
Jan 10, 2018
6a41d68
clean up table benchmarks
Jan 11, 2018
e8979ba
Refactor DataFrame to extend Vector<StructRow>
Jan 11, 2018
1d60aa1
Moved DataFrame ops to Table. DataFrame is now an interface
Jan 11, 2018
a9fff89
Move Table out of the Vector hierarchy
Jan 11, 2018
a788db3
Cleanup
Jan 11, 2018
2e118ab
linter
Jan 11, 2018
2f4a349
Minor tweaks
Jan 12, 2018
6719147
Add DataFrame.countBy operation
Jan 15, 2018
7244887
Add table unit tests...
Jan 15, 2018
20717d5
Fixed countBy(string)
Jan 15, 2018
edcbdbe
cleanup
Jan 16, 2018
e20decd
Add license headers
Jan 16, 2018
fa7c17a
passing all tests except es5 umd mangler ones
trxcllnt Jan 18, 2018
e3f629d
fix rest of the mangling issues
trxcllnt Jan 18, 2018
f3f3b86
rename table.ts to recordbatch.ts in preparation for merging latest
trxcllnt Jan 19, 2018
d2b18d5
Merge remote-tracking branch 'ccri/table-scan-perf' into js-cpp-refac…
trxcllnt Jan 19, 2018
6c91ed4
Merge branch 'master' of github.com:apache/arrow into js-cpp-refactor…
trxcllnt Jan 19, 2018
e761eee
Rename asJSON to toJSON
Jan 16, 2018
0126dc4
Don't recompute total length
Jan 17, 2018
0620cfd
use Math.fround
Jan 19, 2018
e859e13
fix package.json bin entry
trxcllnt Jan 20, 2018
700a47c
export visitors
trxcllnt Jan 20, 2018
e81082f
export vector views, allow cloning data as another type
trxcllnt Jan 22, 2018
b7f5bfb
rename numRows to length, add table.getColumn()
trxcllnt Jan 22, 2018
87334a5
Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-r…
trxcllnt Jan 22, 2018
614b688
add asEpochMs to date and timestamp vectors
trxcllnt Jan 23, 2018
5bb63af
Don't read OFFSET vector for FixedSizeList
Jan 23, 2018
e33c068
Merge pull request #2 from ccri/fixed-size-list
trxcllnt Jan 23, 2018
c0fd2f9
use the dictionary of the last chunked vector list for chunked dictio…
trxcllnt Jan 23, 2018
e537789
make it easier to run all integration tests from local data
trxcllnt Jan 24, 2018
fe31ee0
slice the flat data values before returning an iterator of them
trxcllnt Jan 24, 2018
40b3638
run integration tests with local data for coverage stats
trxcllnt Jan 24, 2018
54d4f5b
lazily allocate table and recordbatch columns, support NestedView's g…
trxcllnt Jan 24, 2018
a00415e
Fix perf
Jan 24, 2018
c8cd286
Add Table.fromStruct
Jan 24, 2018
a5f200f
Merge pull request #3 from ccri/table-from-struct
trxcllnt Jan 25, 2018
7e43b78
add test:integration npm script
trxcllnt Jan 25, 2018
18807c6
rename ChunkData's fields so it's more clear they're not semantically…
trxcllnt Jan 25, 2018
f1dead0
compute chunked nested childData list correctly
trxcllnt Jan 25, 2018
8ddce0a
check bounds in getChildAt(i) to avoid NPEs
trxcllnt Jan 25, 2018
7bc7363
Fix exception for empty Table
Jan 25, 2018
272d293
Merge pull request #4 from ccri/empty-table
trxcllnt Jan 25, 2018
016ba78
Merge pull request #1 from trxcllnt/js-cpp-refactor
TheNeuralBit Jan 25, 2018
9769773
fix vector perf tests
trxcllnt Jan 25, 2018
f3cde1a
fix lint
trxcllnt Jan 25, 2018
579ab1f
Merge pull request #2 from trxcllnt/js-cpp-refactor
TheNeuralBit Jan 25, 2018
25e6af7
Create predicate namespace
Jan 25, 2018
dc7c728
add more view, predicate externs
trxcllnt Jan 26, 2018
25cdc4a
add src/predicate to the list of exports we should save from uglify
trxcllnt Jan 26, 2018
e148ee4
Merge branch 'extern-woes' into js-cpp-refactor
trxcllnt Jan 26, 2018
f7bb0ed
Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-r…
trxcllnt Jan 26, 2018
f6adfb3
Create predicate namespace
Jan 25, 2018
5a91fab
add more view, predicate externs
trxcllnt Jan 26, 2018
4a41b18
add src/predicate to the list of exports we should save from uglify
trxcllnt Jan 26, 2018
8cf2473
Merge remote-tracking branch 'origin/master' into table-scan-perf
Jan 26, 2018
5bdf17f
Fix perf
Jan 26, 2018
1910962
Add optional bind callback to scan
Jan 26, 2018
dbe7f81
Add more Table unit tests
Jan 26, 2018
10c48ad
Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-r…
trxcllnt Jan 27, 2018
3d5240a
fix more externs
trxcllnt Jan 27, 2018
fe300df
fix closure es5/umd toString() iterator
trxcllnt Jan 27, 2018
16b9ccb
Merge pull request #4 from trxcllnt/js-cpp-refactor
TheNeuralBit Jan 29, 2018
04b1838
even more table tests
Jan 29, 2018
52f1e0e
<, > are not commutative, misc cleanup
Jan 29, 2018
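
Commits aa999f8 ("Add DictionaryVector optimization for equals predicate") and 6719147 ("Add DataFrame.countBy operation") both rely on dictionary encoding. The sketch below illustrates the general idea only, under stated assumptions: `makeDictionaryColumn`, `countEquals`, and `countBy` are hypothetical helpers over plain arrays, not the Arrow JS API and not necessarily how this PR implements either feature. The point is that the literal is resolved to an integer key once, so the per-row work is an integer comparison rather than a string decode.

```javascript
// Hypothetical stand-in for a dictionary-encoded column (not the Arrow JS API):
// `dictionary` holds each distinct value once, `keys` holds one index per row.
function makeDictionaryColumn(values) {
  const dictionary = [];
  const lookup = new Map();
  const keys = new Int32Array(values.length);
  values.forEach((value, i) => {
    if (!lookup.has(value)) {
      lookup.set(value, dictionary.length);
      dictionary.push(value);
    }
    keys[i] = lookup.get(value);
  });
  return { dictionary, keys };
}

// Equals predicate: find the literal's key once, then compare integers per row.
function countEquals(column, literal) {
  const key = column.dictionary.indexOf(literal);
  if (key === -1) return 0; // literal is not in the dictionary, so no row matches
  let count = 0;
  for (let i = 0; i < column.keys.length; i++) {
    if (column.keys[i] === key) count++;
  }
  return count;
}

// countBy over the same layout: one counter slot per dictionary entry.
function countBy(column) {
  const counts = new Uint32Array(column.dictionary.length);
  for (let i = 0; i < column.keys.length; i++) counts[column.keys[i]]++;
  const result = {};
  column.dictionary.forEach((value, k) => { result[value] = counts[k]; });
  return result;
}

const origins = makeDictionaryColumn(['Seattle', 'Boston', 'Seattle']);
console.log(countEquals(origins, 'Seattle')); // 2
console.log(countBy(origins));                // { Seattle: 2, Boston: 1 }
```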
1 change: 1 addition & 0 deletions js/package.json
@@ -12,6 +12,7 @@
"clean": "gulp clean",
"debug": "gulp debug",
"perf": "node ./perf/index.js",
"create:perfdata": "python ./test/data/tables/generate.py ./test/data/tables/tracks.arrow",
"release": "./npm-release.sh",
"clean:all": "run-p clean clean:testdata",
"clean:testdata": "gulp clean:testdata",
112 changes: 104 additions & 8 deletions js/perf/index.js
@@ -16,21 +16,21 @@
// under the License.

// Use the ES5 UMD target as perf baseline
-// const { Table, readVectors } = require('../targets/es5/umd');
-// const { Table, readVectors } = require('../targets/es5/cjs');
-const { Table, readVectors } = require('../targets/es2015/umd');
-// const { Table, readVectors } = require('../targets/es2015/cjs');
+// const { col, Table, readVectors } = require('../targets/es5/umd');
+// const { col, Table, readVectors } = require('../targets/es5/cjs');
+// const { col, Table, readVectors } = require('../targets/es2015/umd');
+const { col, Table, readVectors } = require('../targets/es2015/cjs');

const config = require('./config');
const Benchmark = require('benchmark');

const suites = [];

for (let { name, buffers} of config) {
-const parseSuite = new Benchmark.Suite(`Parse ${name}`, { async: true });
-const sliceSuite = new Benchmark.Suite(`Slice ${name} vectors`, { async: true });
-const iterateSuite = new Benchmark.Suite(`Iterate ${name} vectors`, { async: true });
-const getByIndexSuite = new Benchmark.Suite(`Get ${name} values by index`, { async: true });
+const parseSuite = new Benchmark.Suite(`Parse "${name}"`, { async: true });
+const sliceSuite = new Benchmark.Suite(`Slice "${name}" vectors`, { async: true });
+const iterateSuite = new Benchmark.Suite(`Iterate "${name}" vectors`, { async: true });
+const getByIndexSuite = new Benchmark.Suite(`Get "${name}" values by index`, { async: true });
parseSuite.add(createFromTableTest(name, buffers));
parseSuite.add(createReadVectorsTest(name, buffers));
for (const vector of Table.from(buffers).columns) {
@@ -41,6 +41,25 @@ for (let { name, buffers} of config) {
suites.push(getByIndexSuite, iterateSuite, sliceSuite, parseSuite);
}

for (let {name, buffers, countBys, counts} of require('./table_config')) {
const table = Table.from(buffers);

const dfCountBySuite = new Benchmark.Suite(`DataFrame Count By "${name}"`, { async: true });
for (const countBy of countBys) {
dfCountBySuite.add(createDataFrameCountByTest(table, countBy));
}

const dfFilterCountSuite = new Benchmark.Suite(`DataFrame Filter-Scan Count "${name}"`, { async: true });
const dfDirectCountSuite = new Benchmark.Suite(`DataFrame Direct Count "${name}"`, { async: true });

for (const test of counts) {
dfFilterCountSuite.add(createDataFrameFilterCountTest(table, test.col, test.test, test.value));
dfDirectCountSuite.add(createDataFrameDirectCountTest(table, test.col, test.test, test.value));
}

suites.push(dfCountBySuite, dfFilterCountSuite, dfDirectCountSuite);
}

console.log('Running apache-arrow performance tests...\n');

run();
@@ -109,3 +128,80 @@ function createGetByIndexTest(vector) {
}
};
}

function createDataFrameDirectCountTest(table, column, test, value) {
let colidx = table.columns.findIndex((c)=>c.name === column);
// Declared here so the benchmark closures below do not create implicit globals.
let op, sum;

if (test == 'gteq') {
op = function () {
sum = 0;
for (let batch = -1; ++batch < table.lengths.length;) {
const length = table.lengths[batch];

// load batches
const columns = table.batches[batch];

// yield all indices
for (let idx = -1; ++idx < length;) {
sum += (columns[colidx].get(idx) >= value);
}
}
}
} else if (test == 'eq') {
op = function() {
sum = 0;
for (let batch = -1; ++batch < table.lengths.length;) {
const length = table.lengths[batch];

// load batches
const columns = table.batches[batch];

// yield all indices
for (let idx = -1; ++idx < length;) {
sum += (columns[colidx].get(idx) == value);
}
}
}
} else {
throw new Error(`Unrecognized test "${test}"`);
}

return {
async: true,
name: `name: '${column}', length: ${table.length}, type: ${table.columns[colidx].type}, test: ${test}, value: ${value}`,
fn: op
};
}

function createDataFrameCountByTest(table, column) {
let colidx = table.columns.findIndex((c)=>c.name === column);

return {
async: true,
name: `name: '${column}', length: ${table.length}, type: ${table.columns[colidx].type}`,
fn() {
table.countBy(column);
}
};
}

function createDataFrameFilterCountTest(table, column, test, value) {
let colidx = table.columns.findIndex((c)=>c.name === column);
let df;

if (test == 'gteq') {
df = table.filter(col(column).gteq(value));
} else if (test == 'eq') {
df = table.filter(col(column).eq(value));
} else {
throw new Error(`Unrecognized test "${test}"`);
}

return {
async: true,
name: `name: '${column}', length: ${table.length}, type: ${table.columns[colidx].type}, test: ${test}, value: ${value}`,
fn() {
df.count();
}
};
}
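
The filter-count benchmark above builds its predicate with `col(column).gteq(value)` and then calls `df.count()`. As a rough illustration of how such a predicate could be evaluated, here is a minimal sketch. It is an assumption, not the code in this PR: the `Col` and `Predicate` classes, the `bind` method, and the `filterCount` helper are hypothetical, and each batch is modelled as a plain `{ columnName: array }` object. Binding resolves the column once per batch (in the spirit of commit 1910962, "Add optional bind callback to scan") so the inner loop does no name lookup.

```javascript
// Hypothetical predicate builder. The names col/gteq/eq mirror the benchmark's
// usage, but the internals are assumptions rather than this PR's implementation.
class Predicate {
  constructor(name, test) {
    this.name = name;
    this.test = test;
  }
  // Resolve the column once per batch; the returned function checks one row.
  bind(batch) {
    const values = batch[this.name];
    return (idx) => this.test(values[idx]);
  }
}

class Col {
  constructor(name) {
    this.name = name;
  }
  gteq(value) {
    return new Predicate(this.name, (x) => x >= value);
  }
  eq(value) {
    return new Predicate(this.name, (x) => x === value);
  }
}

const col = (name) => new Col(name);

// Count matching rows across all batches.
function filterCount(batches, predicate) {
  let count = 0;
  for (const batch of batches) {
    const matches = predicate.bind(batch);
    const length = batch[predicate.name].length;
    for (let idx = 0; idx < length; idx++) {
      if (matches(idx)) count++;
    }
  }
  return count;
}

const sampleBatches = [{ lat: [-1, 0, 2] }, { lat: [5, -3] }];
console.log(filterCount(sampleBatches, col('lat').gteq(0))); // 3
```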
48 changes: 48 additions & 0 deletions js/perf/table_config.js
@@ -0,0 +1,48 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

const fs = require('fs');
const path = require('path');
const glob = require('glob');

const config = [];
const filenames = glob.sync(path.resolve(__dirname, `../test/data/tables/`, `*.arrow`));

const countBys = {
"tracks": ['origin', 'destination']
};
const counts = {
"tracks": [
{col: 'lat', test: 'gteq', value: 0 },
{col: 'lng', test: 'gteq', value: 0 },
{col: 'origin', test: 'eq', value: 'Seattle'},
]
};

for (const filename of filenames) {
const { name } = path.parse(filename);
if (name in counts) {
config.push({
name,
buffers: [fs.readFileSync(filename)],
countBys: countBys[name],
counts: counts[name],
});
}
}

module.exports = config;
22 changes: 22 additions & 0 deletions js/src/Arrow.externs.ts
@@ -50,6 +50,20 @@ Table.prototype.key;
Table.prototype.select;
/** @type {?} */
Table.prototype.toString;
/** @type {?} */
Table.prototype.lengths;
/** @type {?} */
Table.prototype.batches;
/** @type {?} */
Table.prototype.countBy;
/** @type {?} */
Table.prototype.scan;
/** @type {?} */
Table.prototype.get;

let CountByResult = function() {};
/** @type {?} */
CountByResult.prototype.asJSON;

let Vector = function() {};
/** @type {?} */
@@ -82,3 +96,11 @@ let DictionaryVector = function() {};
DictionaryVector.prototype.getKey;
/** @type {?} */
DictionaryVector.prototype.getValue;

let Col = function() {};
/** @type {?} */
Col.prototype.gteq;
/** @type {?} */
Col.prototype.lteq;
/** @type {?} */
Col.prototype.eq;
14 changes: 11 additions & 3 deletions js/src/Arrow.ts
@@ -15,7 +15,8 @@
// specific language governing permissions and limitations
// under the License.

-import { Table } from './vector/table';
+import { Table, TableRow, CountByResult } from './table';
+import { lit, col, Col, Value } from './predicate';
import { Vector } from './vector/vector';
import { Utf8Vector } from './vector/utf8';
import { DictionaryVector } from './vector/dictionary';
@@ -53,7 +54,9 @@ Table['fromAsync'] = Table.fromAsync;
BoolVector['pack'] = BoolVector.pack;

export { read, readAsync };
-export { Table, Vector, StructRow };
+export { Table, TableRow, CountByResult };
+export { lit, col, Col, Value };
+export { Vector, StructRow };
export { Uint64, Int64, Int128 };
export { NumericVectorConstructor } from './vector/numeric';
export { List, TypedArray, TypedArrayConstructor } from './vector/types';
@@ -89,9 +92,13 @@ try {
const Arrow = eval('exports');
if (typeof Arrow === 'object') {
// string indexers tell closure compiler not to rename these properties
Arrow['lit'] = lit;
Arrow['col'] = col;
Arrow['Col'] = Col;
Arrow['read'] = read;
Arrow['readAsync'] = readAsync;
Arrow['Value'] = Value;
Arrow['Table'] = Table;
Arrow['readAsync'] = readAsync;
Arrow['Vector'] = Vector;
Arrow['StructRow'] = StructRow;
Arrow['BoolVector'] = BoolVector;
@@ -115,6 +122,7 @@ try {
Arrow['Float32Vector'] = Float32Vector;
Arrow['Float64Vector'] = Float64Vector;
Arrow['DecimalVector'] = DecimalVector;
Arrow['CountByResult'] = CountByResult;
Arrow['TimestampVector'] = TimestampVector;
Arrow['DictionaryVector'] = DictionaryVector;
Arrow['FixedSizeListVector'] = FixedSizeListVector;
6 changes: 3 additions & 3 deletions js/src/bin/arrow2csv.ts
@@ -97,8 +97,8 @@ files.forEach((source) => {
printTable(table);
});

-function printTable(table: Arrow.Table<any>) {
-let header = [...table.columns.map((_, i) => table.key(i))].map(stringify);
+function printTable(table: Arrow.Table) {
+let header = [...table.columns.map((c) => c.name)].map(stringify);
let maxColumnWidths = header.map(x => x.length);
// Pass one to convert to strings and count max column widths
for (let i = -1, n = table.length - 1; ++i < n;) {
@@ -132,4 +132,4 @@ function stringify(x: any) {
: `${x}`;
}

-})();
+})();