ARROW-1990: [JS] C++ Refactor, Add DataFrame
This PR moves the `Table` class out of the Vector hierarchy and adds optimized dataframe operations to it. It currently implements an optimized `scan()` method, along with `filter(predicate)`, `count()`, and `countBy(column_name)` (the latter only works on dictionary-encoded columns).

Some usage examples, based on the file generated by `js/test/data/tables/generate.py`:
``` js
> let table = Table.from(...);
> table.count()
1000000
> table.filter(col('lat').gteq(0)).count()
499718
> table.countBy('origin').toJSON()
{ Charlottesville: 166839,
  'New York': 166251,
  'San Francisco': 166642,
  Seattle: 166659,
  'Terre Haute': 166756,
  'Washington, DC': 166853 }
> table.filter(col('lng').gteq(0)).countBy('origin').toJSON()
{ Charlottesville: 83109,
  'New York': 83221,
  'San Francisco': 83515,
  Seattle: 83362,
  'Terre Haute': 83314,
  'Washington, DC': 83479 }
```
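The `scan()` method mentioned above is the lower-level primitive the other operations build on. Below is a minimal sketch of calling it directly; the `(idx)`/`(batch)` callback signatures and the `getColumn` accessor are assumptions inferred from the "Add optional bind callback to scan" and "add table.getColumn()" commits, not documented API.
``` js
// Minimal sketch: count rows with non-negative latitude via table.scan().
// Assumptions (not confirmed by this PR's text): scan(next, bind) calls
// bind(batch) whenever the current record batch changes and next(idx) once
// per row index, and a column vector can be resolved with getColumn(name).
let lat = null;
let positive = 0;
table.scan(
    (idx) => {                          // next: runs for every row in the current batch
        if (lat.get(idx) >= 0) { positive++; }
    },
    (batch) => {                        // bind: re-resolve column vectors per batch
        lat = batch.getColumn('lat');   // hypothetical accessor name
    }
);
console.log(positive);                  // expected to match table.filter(col('lat').gteq(0)).count()
```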
There are performance tests for the dataframe operations; to run them, first generate the test data by running `npm run create:perfdata`.

The PR also includes @trxcllnt's refactor of the JS implementation, which makes it more closely resemble the C++ implementation. This refactor resolves multiple JIRAs: ARROW-1903, ARROW-1898, ARROW-1502, ARROW-1952 (partially), and ARROW-1985.

Author: Paul Taylor <paul.e.taylor@me.com>
Author: Brian Hulette <brian.hulette@ccri.com>
Author: Brian Hulette <hulettbh@gmail.com>

Closes apache#1482 from TheNeuralBit/table-scan-perf and squashes the following commits:

52f1e0e [Brian Hulette] <, > are not commutative, misc cleanup
04b1838 [Brian Hulette] even more table tests
16b9ccb [Brian Hulette] Merge pull request #4 from trxcllnt/js-cpp-refactor
fe300df [Paul Taylor] fix closure es5/umd toString() iterator
3d5240a [Paul Taylor] fix more externs
10c48ad [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
dbe7f81 [Brian Hulette] Add more Table unit tests
1910962 [Brian Hulette] Add optional bind callback to scan
5bdf17f [Brian Hulette] Fix perf
8cf2473 [Brian Hulette] Merge remote-tracking branch 'origin/master' into table-scan-perf
4a41b18 [Paul Taylor] add src/predicate to the list of exports we should save from uglify
5a91fab [Paul Taylor] add more view, predicate externs
f6adfb3 [Brian Hulette] Create predicate namespace
f7bb0ed [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
e148ee4 [Paul Taylor] Merge branch 'extern-woes' into js-cpp-refactor
25cdc4a [Paul Taylor] add src/predicate to the list of exports we should save from uglify
dc7c728 [Paul Taylor] add more view, predicate externs
25e6af7 [Brian Hulette] Create predicate namespace
579ab1f [Brian Hulette] Merge pull request #2 from trxcllnt/js-cpp-refactor
f3cde1a [Paul Taylor] fix lint
9769773 [Paul Taylor] fix vector perf tests
016ba78 [Brian Hulette] Merge pull request #1 from trxcllnt/js-cpp-refactor
272d293 [Paul Taylor] Merge pull request #4 from ccri/empty-table
7bc7363 [Brian Hulette] Fix exception for empty Table
8ddce0a [Paul Taylor] check bounds in getChildAt(i) to avoid NPEs
f1dead0 [Paul Taylor] compute chunked nested childData list correctly
18807c6 [Paul Taylor] rename ChunkData's fields so it's more clear they're not semantically similar to other similarly named fields
7e43b78 [Paul Taylor] add test:integration npm script
a5f200f [Paul Taylor] Merge pull request #3 from ccri/table-from-struct
c8cd286 [Brian Hulette] Add Table.fromStruct
a00415e [Brian Hulette] Fix perf
54d4f5b [Paul Taylor] lazily allocate table and recordbatch columns, support NestedView's getChildAt(i) method in ChunkedView
40b3638 [Paul Taylor] run integration tests with local data for coverage stats
fe31ee0 [Paul Taylor] slice the flat data values before returning an iterator of them
e537789 [Paul Taylor] make it easier to run all integration tests from local data
c0fd2f9 [Paul Taylor] use the dictionary of the last chunked vector list for chunked dictionary vectors
e33c068 [Paul Taylor] Merge pull request #2 from ccri/fixed-size-list
5bb63af [Brian Hulette] Don't read OFFSET vector for FixedSizeList
614b688 [Paul Taylor] add asEpochMs to date and timestamp vectors
87334a5 [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
b7f5bfb [Paul Taylor] rename numRows to length, add table.getColumn()
e81082f [Paul Taylor] export vector views, allow cloning data as another type
700a47c [Paul Taylor] export visitors
e859e13 [Paul Taylor] fix package.json bin entry
0620cfd [Brian Hulette] use Math.fround
0126dc4 [Brian Hulette] Don't recompute total length
e761eee [Brian Hulette] Rename asJSON to toJSON
6c91ed4 [Paul Taylor] Merge branch 'master' of github.com:apache/arrow into js-cpp-refactor-merge_with-table-scan-perf
d2b18d5 [Paul Taylor] Merge remote-tracking branch 'ccri/table-scan-perf' into js-cpp-refactor-merge_with-table-scan-perf
f3f3b86 [Paul Taylor] rename table.ts to recordbatch.ts in preparation for merging latest
e3f629d [Paul Taylor] fix rest of the mangling issues
fa7c17a [Paul Taylor] passing all tests except es5 umd mangler ones
e20decd [Brian Hulette] Add license headers
edcbdbe [Brian Hulette] cleanup
20717d5 [Brian Hulette] Fixed countBy(string)
7244887 [Brian Hulette] Add table unit tests...
6719147 [Brian Hulette] Add DataFrame.countBy operation
2f4a349 [Brian Hulette] Minor tweaks
2e118ab [Brian Hulette] linter
a788db3 [Brian Hulette] Cleanup
a9fff89 [Brian Hulette] Move Table out of the Vector hierarchy
1d60aa1 [Brian Hulette] Moved DataFrame ops to Table. DataFrame is now an interface
e8979ba [Brian Hulette] Refactor DataFrame to extend Vector<StructRow>
6a41d68 [Brian Hulette] clean up table benchmarks
2744c63 [Brian Hulette] Remove Chunked/Simple DataFrame distinction
aa999f8 [Brian Hulette] Add DictionaryVector optimization for equals predicate
4d9e8c0 [Brian Hulette] Add concept of predicates for filtering dataframes
796f45d [Brian Hulette] add DataFrame filter and count ops
30f0330 [Brian Hulette] Add basic DataFrame impl ...
a1edac2 [Brian Hulette] Add perf tests for table scans
d18d915 [Paul Taylor] fix struct and map rows
61dc699 [Paul Taylor] WIP -- refactor types to closer match arrow-cpp
62db338 [Paul Taylor] update dependencies and add es6+ umd targets to jest transform ignore patterns to fix ci
6ff18e9 [Paul Taylor] ship es2015 commonJS in main package to avoid confusion
74e828a [Paul Taylor] fix typings issues (ARROW-1903)
trxcllnt authored and wesm committed Feb 1, 2018
1 parent f84af8f commit e327747
Showing 67 changed files with 6,025 additions and 3,065 deletions.
56 changes: 50 additions & 6 deletions js/bin/integration.js
@@ -17,6 +17,8 @@
// specific language governing permissions and limitations
// under the License.

var fs = require('fs');
var glob = require('glob');
var path = require('path');
var gulp = require.resolve(path.join(`..`, `node_modules/gulp/bin/gulp.js`));
var child_process = require(`child_process`);
@@ -29,12 +31,14 @@ var optionList = [
{
type: String,
name: 'arrow', alias: 'a',
description: 'The Arrow file to read/write'
multiple: true, defaultValue: [],
description: 'The Arrow file[s] to read/write'
},
{
type: String,
name: 'json', alias: 'j',
description: 'The JSON file to read/write'
multiple: true, defaultValue: [],
description: 'The JSON file[s] to read/write'
}
];

@@ -66,20 +70,60 @@ function print_usage() {
process.exit(1);
}

if (!argv.arrow || !argv.json || !argv.mode) {
let jsonPaths = argv.json;
let arrowPaths = argv.arrow;

if (!argv.mode) {
return print_usage();
}

let mode = argv.mode.toUpperCase();
if (mode === 'VALIDATE' && !jsonPaths.length) {
jsonPaths = glob.sync(path.resolve(__dirname, `../test/data/json/`, `*.json`));
if (!arrowPaths.length) {
[jsonPaths, arrowPaths] = jsonPaths.reduce(([jsonPaths, arrowPaths], jsonPath) => {
const { name } = path.parse(jsonPath);
for (const source of ['cpp', 'java']) {
for (const format of ['file', 'stream']) {
const arrowPath = path.resolve(__dirname, `../test/data/${source}/${format}/${name}.arrow`);
if (fs.existsSync(arrowPath)) {
jsonPaths.push(jsonPath);
arrowPaths.push(arrowPath);
console.log('-j', jsonPath, '-a', arrowPath, '\\');
}
}
}
return [jsonPaths, arrowPaths];
}, [[], []]);
}
} else if (!jsonPaths.length) {
return print_usage();
}

switch (argv.mode.toUpperCase()) {
switch (mode) {
case 'VALIDATE':
const args = [`test`, `-i`].concat(argv._unknown || []);
jsonPaths.forEach((p, i) => {
args.push('-j', p, '-a', arrowPaths[i]);
});
child_process.spawnSync(
gulp,
[`test`, `-i`].concat(process.argv.slice(2)),
gulp, args,
{
cwd: path.resolve(__dirname, '..'),
stdio: ['ignore', 'inherit', 'inherit']
}
);
// for (let i = -1, n = jsonPaths.length; ++i < n;) {
// const jsonPath = jsonPaths[i];
// const arrowPath = arrowPaths[i];
// child_process.spawnSync(
// gulp, args.concat(['-j', jsonPath, '-a', arrowPath]),
// {
// cwd: path.resolve(__dirname, '..'),
// stdio: ['ignore', 'inherit', 'inherit']
// }
// );
// }
break;
default:
print_usage();
31 changes: 27 additions & 4 deletions js/gulp/argv.js
@@ -15,20 +15,22 @@
// specific language governing permissions and limitations
// under the License.

const fs = require('fs');
const glob = require('glob');
const path = require('path');

const argv = require(`command-line-args`)([
{ name: `all`, type: Boolean },
{ name: 'update', alias: 'u', type: Boolean },
{ name: 'verbose', alias: 'v', type: Boolean },
{ name: `target`, type: String, defaultValue: `` },
{ name: `module`, type: String, defaultValue: `` },
{ name: `coverage`, type: Boolean, defaultValue: false },
{ name: `json_file`, alias: `j`, type: String, defaultValue: null },
{ name: `arrow_file`, alias: `a`, type: String, defaultValue: null },
{ name: `integration`, alias: `i`, type: Boolean, defaultValue: false },
{ name: `targets`, alias: `t`, type: String, multiple: true, defaultValue: [] },
{ name: `modules`, alias: `m`, type: String, multiple: true, defaultValue: [] },
{ name: `sources`, alias: `s`, type: String, multiple: true, defaultValue: [`cpp`, `java`] },
{ name: `formats`, alias: `f`, type: String, multiple: true, defaultValue: [`file`, `stream`] },
{ name: `json_files`, alias: `j`, type: String, multiple: true, defaultValue: [] },
{ name: `arrow_files`, alias: `a`, type: String, multiple: true, defaultValue: [] },
], { partial: true });

const { targets, modules } = argv;
@@ -38,4 +40,25 @@ argv.module && !modules.length && modules.push(argv.module);
(argv.all || !targets.length) && targets.push(`all`);
(argv.all || !modules.length) && modules.push(`all`);

if (argv.coverage && (!argv.json_files || !argv.json_files.length)) {

let [jsonPaths, arrowPaths] = glob
.sync(path.resolve(__dirname, `../test/data/json/`, `*.json`))
.reduce((paths, jsonPath) => {
const { name } = path.parse(jsonPath);
const [jsonPaths, arrowPaths] = paths;
['cpp', 'java'].forEach((source) => ['file', 'stream'].forEach((format) => {
const arrowPath = path.resolve(__dirname, `../test/data/${source}/${format}/${name}.arrow`);
if (fs.existsSync(arrowPath)) {
jsonPaths.push(jsonPath);
arrowPaths.push(arrowPath);
}
}));
return paths;
}, [[], []]);

argv.json_files = jsonPaths;
argv.arrow_files = arrowPaths;
}

module.exports = { argv, targets, modules };
8 changes: 4 additions & 4 deletions js/gulp/closure-task.js
@@ -36,7 +36,7 @@ const closureTask = ((cache) => memoizeTask(cache, function closure(target, form
const src = targetDir(target, `cls`);
const out = targetDir(target, format);
const entry = path.join(src, mainExport);
const externs = path.join(src, `${mainExport}.externs`);
const externs = path.join(`src/Arrow.externs.js`);
return observableFromStreams(
gulp.src([
/* external libs first --> */ `node_modules/tslib/package.json`,
@@ -46,7 +46,6 @@
`node_modules/text-encoding-utf-8/package.json`,
`node_modules/text-encoding-utf-8/src/encoding.js`,
/* then sources globs --> */ `${src}/**/*.js`,
/* and exclusions last --> */ `!${src}/Arrow.externs.js`,
], { base: `./` }),
sourcemaps.init(),
closureCompiler(createClosureArgs(entry, externs)),
@@ -60,14 +59,15 @@
}))({});

const createClosureArgs = (entry, externs) => ({
externs,
third_party: true,
warning_level: `QUIET`,
dependency_mode: `STRICT`,
rewrite_polyfills: false,
externs: `${externs}.js`,
entry_point: `${entry}.js`,
module_resolution: `NODE`,
// formatting: `PRETTY_PRINT`, debug: true,
// formatting: `PRETTY_PRINT`,
// debug: true,
compilation_level: `ADVANCED`,
allow_method_call_decomposing: true,
package_json_entry_names: `module,jsnext:main,main`,
23 changes: 13 additions & 10 deletions js/gulp/package-task.js
@@ -45,10 +45,11 @@ const createMainPackageJson = (target, format) => (orig) => ({
...createTypeScriptPackageJson(target, format)(orig),
name: npmPkgName,
main: mainExport,
types: `${mainExport}.d.ts`,
module: `${mainExport}.mjs`,
dist: `${mainExport}.es5.min.js`,
[`dist:es2015`]: `${mainExport}.es2015.min.js`,
[`@std/esm`]: { esm: `mjs` }
[`@std/esm`]: { esm: `mjs`, warnings: false, sourceMap: true }
});

const createTypeScriptPackageJson = (target, format) => (orig) => ({
@@ -63,18 +64,20 @@ const createTypeScriptPackageJson = (target, format) => (orig) => ({

const createScopedPackageJSON = (target, format) => (({ name, ...orig }) =>
conditionallyAddStandardESMEntry(target, format)(
packageJSONFields.reduce(
(xs, key) => ({ ...xs, [key]: xs[key] || orig[key] }),
{ name: `${npmOrgName}/${packageName(target, format)}`,
version: undefined, main: `${mainExport}.js`, types: `${mainExport}.d.ts`,
dist: undefined, [`dist:es2015`]: undefined, module: undefined, [`@std/esm`]: undefined }
)
packageJSONFields.reduce(
(xs, key) => ({ ...xs, [key]: xs[key] || orig[key] }),
{
name: `${npmOrgName}/${packageName(target, format)}`,
version: undefined, main: `${mainExport}.js`, types: `${mainExport}.d.ts`,
dist: undefined, [`dist:es2015`]: undefined, module: undefined, [`@std/esm`]: undefined
}
)
)
);

const conditionallyAddStandardESMEntry = (target, format) => (packageJSON) => (
format !== `esm`
? packageJSON
: { ...packageJSON, [`@std/esm`]: { esm: `js` } }
format !== `esm` && format !== `cls`
? packageJSON
: { ...packageJSON, [`@std/esm`]: { esm: `js`, warnings: false, sourceMap: true } }
);

10 changes: 5 additions & 5 deletions js/gulp/test-task.js
@@ -44,15 +44,15 @@ const testOptions = {
const testTask = ((cache, execArgv, testOptions) => memoizeTask(cache, function test(target, format, debug = false) {
const opts = { ...testOptions };
const args = !debug ? [...execArgv] : [...debugArgv, ...execArgv];
args.push(`test/${argv.integration ? `integration/*` : `unit/*`}`);
if (!argv.coverage) {
args.push(`test/${argv.integration ? `integration/*` : `unit/*`}`);
}
opts.env = { ...opts.env,
TEST_TARGET: target,
TEST_MODULE: format,
JSON_PATH: argv.json_file,
ARROW_PATH: argv.arrow_file,
TEST_TS_SOURCE: !!argv.coverage,
TEST_SOURCES: JSON.stringify(Array.isArray(argv.sources) ? argv.sources : [argv.sources]),
TEST_FORMATS: JSON.stringify(Array.isArray(argv.formats) ? argv.formats : [argv.formats]),
JSON_PATHS: JSON.stringify(Array.isArray(argv.json_files) ? argv.json_files : [argv.json_files]),
ARROW_PATHS: JSON.stringify(Array.isArray(argv.arrow_files) ? argv.arrow_files : [argv.arrow_files]),
};
return !debug ?
child_process.spawn(jest, args, opts) :
8 changes: 4 additions & 4 deletions js/gulp/typescript-task.js
@@ -34,7 +34,7 @@ const typescriptTask = ((cache) => memoizeTask(cache, function typescript(target
const tsProject = ts.createProject(path.join(`tsconfig`, tsconfigFile), { typescript: require(`typescript`) });
const { stream: { js, dts } } = observableFromStreams(
tsProject.src(), sourcemaps.init(),
tsProject(ts.reporter.fullReporter(true))
tsProject(ts.reporter.defaultReporter())
);
const writeDTypes = observableFromStreams(dts, gulp.dest(out));
const writeJS = observableFromStreams(js, sourcemaps.write(), gulp.dest(out));
@@ -52,12 +52,12 @@ function maybeCopyRawJSArrowFormatFiles(target, format) {
return Observable.empty();
}
return Observable.defer(async () => {
const outFormatDir = path.join(targetDir(target, format), `format`, `fb`);
const outFormatDir = path.join(targetDir(target, format), `fb`);
await del(path.join(outFormatDir, '*.js'));
await observableFromStreams(
gulp.src(path.join(`src`, `format`, `fb`, `*_generated.js`)),
gulp.src(path.join(`src`, `fb`, `*_generated.js`)),
gulpRename((p) => { p.basename = p.basename.replace(`_generated`, ``); }),
gulp.dest(outFormatDir)
).toPromise();
});
}
}
23 changes: 16 additions & 7 deletions js/gulp/uglify-task.js
@@ -29,7 +29,7 @@ const webpack = require(`webpack`);
const { memoizeTask } = require('./memoize-task');
const { Observable, ReplaySubject } = require('rxjs');
const UglifyJSPlugin = require(`uglifyjs-webpack-plugin`);
const esmRequire = require(`@std/esm`)(module, { cjs: true, esm: `js` });
const esmRequire = require(`@std/esm`)(module, { cjs: true, esm: `js`, warnings: false });

const uglifyTask = ((cache, commonConfig) => memoizeTask(cache, function uglifyJS(target, format) {

@@ -84,11 +84,20 @@ module.exports = uglifyTask;
module.exports.uglifyTask = uglifyTask;

const reservePublicNames = ((ESKeywords) => function reservePublicNames(target, format) {
const publicModulePath = `../${targetDir(target, format)}/${mainExport}.js`;
return [
...ESKeywords,
...reserveExportedNames(esmRequire(publicModulePath))
const src = targetDir(target, format);
const publicModulePaths = [
`../${src}/data.js`,
`../${src}/type.js`,
`../${src}/table.js`,
`../${src}/vector.js`,
`../${src}/util/int.js`,
`../${src}/predicate.js`,
`../${src}/recordbatch.js`,
`../${src}/${mainExport}.js`,
];
return publicModulePaths.reduce((keywords, publicModulePath) => [
...keywords, ...reserveExportedNames(esmRequire(publicModulePath, { warnings: false }))
], [...ESKeywords]);
})(ESKeywords);

// Reflect on the Arrow modules to come up with a list of keys to save from Uglify's
Expand All @@ -104,8 +113,8 @@ const reserveExportedNames = (entryModule) => (
.map((name) => [name, entryModule[name]])
.reduce((reserved, [name, value]) => {
const fn = function() {};
const ownKeys = value && Object.getOwnPropertyNames(value) || [];
const protoKeys = typeof value === `function` && Object.getOwnPropertyNames(value.prototype) || [];
const ownKeys = value && typeof value === 'object' && Object.getOwnPropertyNames(value) || [];
const protoKeys = typeof value === `function` && Object.getOwnPropertyNames(value.prototype || {}) || [];
const publicNames = [...ownKeys, ...protoKeys].filter((x) => x !== `default` && x !== `undefined` && !(x in fn));
return [...reserved, name, ...publicNames];
}, []
7 changes: 3 additions & 4 deletions js/gulp/util.js
@@ -87,7 +87,7 @@ const ESKeywords = [
// EventTarget
`addListener`, `removeListener`, `addEventListener`, `removeEventListener`,
// Arrow properties
`low`, `high`, `data`, `index`, `field`, `validity`, `columns`, `fieldNode`, `subarray`,
`low`, `high`, `data`, `index`, `field`, `columns`, 'numCols', 'numRows', `values`, `valueOffsets`, `nullBitmap`, `subarray`
];

function taskName(target, format) {
Expand All @@ -108,14 +108,13 @@ function targetDir(target, format) {

function logAndDie(e) {
if (e) {
console.error(e);
process.exit(1);
}
}

function observableFromStreams(...streams) {
const pumped = streams.length <= 1 ? streams[0]
: pump(...streams, logAndDie);
if (streams.length <= 0) { return Observable.empty(); }
const pumped = streams.length <= 1 ? streams[0] : pump(...streams, logAndDie);
const fromEvent = Observable.fromEvent.bind(null, pumped);
const streamObs = fromEvent(`data`)
.merge(fromEvent(`error`).flatMap((e) => Observable.throw(e)))
