Add tests and CI for optional pyarrow module #1711

wjones127 · 2022-01-30T21:06:47Z

Which issue does this PR close?

Closes #1635.

Rationale for this change

The build was reported as broken, so we should test it.

In addition, datafusion-contrib/datafusion-python#21 will benefit from the completed implementation of roundtrip conversion.

What changes are included in this PR?

Adds a new CI job to tests pyarrow module.
Completes to_pyarrow implementation for ScalarValue
Adds unit tests to pyarrow conversion logic.

Are there any user-facing changes?

ScalarValue.to_pyarrow will no longer throw a not implemented error. Does that count as an API change?

wjones127 · 2022-01-30T23:26:35Z

cc @alamb @kszucs

houqp · 2022-01-31T07:19:58Z

@wjones127 looks like the newly added CI job is failing.

wjones127 · 2022-01-31T16:43:47Z

@wjones127 looks like the newly added CI job is failing.

@houqp Thanks. I got it passing now in my branch: https://github.com/wjones127/arrow-datafusion/runs/5008841849?check_suite_focus=true

alamb

https://github.com/apache/arrow-datafusion/runs/5009634913?check_suite_focus=true

I tried to run this locally and I got an error (even after following the suggestion of setting PYO3_PYTHON=$(which python)).

PYO3_PYTHON=$(which python) cargo test --features=pyarrow -p datafusion -- pyarrow
    Finished test [unoptimized + debuginfo] target(s) in 0.16s
     Running unittests (target/debug/deps/datafusion-59322809a250018b)

running 2 tests
test pyarrow::tests::test_py_scalar ... FAILED
test pyarrow::tests::test_roundtrip ... FAILED

failures:

---- pyarrow::tests::test_py_scalar stdout ----
thread 'pyarrow::tests::test_py_scalar' panicked at 'pyarrow not found
Executable: /Users/alamb/Software/arrow-datafusion/target/debug/deps/datafusion-59322809a250018b
Python path: ["/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python39.zip", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages"]
HINT: try `pip install pyarrow`
                         HINT: if wrong Python path, try `PYO3_PYTHON=$(which python)`: PyErr { type: <class 'ModuleNotFoundError'>, value: ModuleNotFoundError("No module named 'pyarrow'"), traceback: Some(<traceback object at 0x110e69740>) }', datafusion/src/pyarrow.rs:97:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- pyarrow::tests::test_roundtrip stdout ----
thread 'pyarrow::tests::test_roundtrip' panicked at 'pyarrow not found
Executable: /Users/alamb/Software/arrow-datafusion/target/debug/deps/datafusion-59322809a250018b
Python path: ["/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python39.zip", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages"]
HINT: try `pip install pyarrow`
                         HINT: if wrong Python path, try `PYO3_PYTHON=$(which python)`: PyErr { type: <class 'ModuleNotFoundError'>, value: ModuleNotFoundError("No module named 'pyarrow'"), traceback: Some(<traceback object at 0x110e69740>) }', datafusion/src/pyarrow.rs:97:26


failures:
    pyarrow::tests::test_py_scalar
    pyarrow::tests::test_roundtrip

test result: FAILED. 0 passed; 2 failed; 0 ignored; 0 measured; 903 filtered out; finished in 0.02s

error: test failed, to rerun pass '-p datafusion --lib'

Any thoughts @wjones127 ?

alamb · 2022-01-31T20:02:15Z

datafusion/src/pyarrow.rs

+                    locals.get_item("python_path").unwrap().extract().unwrap();
+
+                Err(err).expect(
+                    format!(


alamb · 2022-01-31T20:10:30Z

Thank you for the contribution @wjones127 ❤️

wjones127 · 2022-01-31T21:14:39Z

Any thoughts @wjones127 ?

TBH, I get this too on my local. It seems to be using your homebrew Python 3.9 but not the version of Python that you've installed pyarrow in. (On my local, I was trying to use a pyenv-installed Python, but it didn't seem to want to use that.) To get it to pass, I set PYO3_PYTHON to the path of my Homebrew-installed Python (basically gave up on my Pyenv installs).

At the very least, it seems to work in CI without much configuration.

I assumed this is specific to my Mac OS + PyEnv setup, and that PYO3_PYTHON=$(which python) would cover other cases. But on second thought, maybe in all cases where that command would find the right Python, PyO3 will already do the right thing. So perhaps I need to change the instructions? Something like Try setting PYO3_PYTHON=<full-path-to-python>?

alamb · 2022-01-31T21:29:36Z

I also tried

PYO3_PRINT_CONFIG=1 PYO3_PYTHON=$(which python) cargo test --features=pyarrow -p datafusion -- pyarrow
error: failed to run custom build command for `pyo3 v0.15.1`

Caused by:
  process didn't exit successfully: `/Users/alamb/Software/arrow-datafusion/target/debug/build/pyo3-0d2dc5a0a08d5a5d/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-env-changed=PYO3_CROSS
  cargo:rerun-if-env-changed=PYO3_CROSS_LIB_DIR
  cargo:rerun-if-env-changed=PYO3_CROSS_PYTHON_VERSION
  cargo:rerun-if-env-changed=PYO3_PRINT_CONFIG

  -- PYO3_PRINT_CONFIG=1 is set, printing configuration and halting compile --
  implementation=CPython
  version=3.9
  shared=true
  abi3=false
  lib_name=python3.9
  lib_dir=/usr/local/opt/python@3.9/Frameworks/Python.framework/Versions/3.9/lib
  executable=/Users/alamb/Software/virtual_envs/arrow_dev/bin/python
  pointer_width=64
  build_flags=WITH_THREAD
  suppress_build_script_link_lines=false

So it seems like maybe pyo3 is picking up the right executable (my virtual env) but not the correct libdir 🤔

wjones127 · 2022-01-31T21:48:51Z

So it seems like maybe pyo3 is picking up the right executable (my virtual env) but not the correct libdir 🤔

Have you checked that your virtualenv has the python3.9.lib file? I think some installations just have the Python binary and no libraries.

Otherwise, I do think there is something on MacOS about shared libraries vs the Frameworks folder libraries that I don't yet comprehend, but might be key here.

wjones127 · 2022-02-03T05:27:20Z

So it seems like maybe pyo3 is picking up the right executable (my virtual env) but not the correct libdir 🤔

@alamb Actually this seems to be a known issue: PyO3/pyo3#1741 (comment). Python doesn't seem to do the virtual environment path computation for embedded systems on Mac OS. A workaround is to extract that into PYTHON_PATH like so:

export PYTHONPATH=$(python -c "import sys; print(sys.path[-1])")

This is orthogonal to the original issue I had, which is that you must use a framework based Python install. That's default in Homebrew and Python.org installers, but not in pyenv.

I have updated the error help to reflect these two issues.

…sion into 1635-pyarrow-test

alamb · 2022-02-03T15:23:38Z

Sorry -- I missed this one -- I just started CI again. When that passes I'll plan to merge it in. Thanks again @wjones127

* feat: add join type for logical plan display (#1674) * (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689) * Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684) * Move tests from context.rs to information_schema.rs * Fix up tests to compile * Move timestamp related tests out of context.rs and into sql integration test (#1696) * Move some tests out of context.rs and into sql * Move support test out of context.rs and into sql tests * Fixup tests and make them compile * Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691) * Memory manager no longer track consumers, update aggregatedMetricsSet * Easy memory tracking with metrics * use tracking metrics in SPMS * tests * fix * doc * Update datafusion/src/physical_plan/sorts/sort.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * make tracker AtomicUsize Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Implement TableProvider for DataFrameImpl (#1699) * Add TableProvider impl for DataFrameImpl * Add physical plan in * Clean up plan construction and names construction * Remove duplicate comments * Remove unused parameter * Add test * Remove duplicate limit comment * Use cloned instead of individual clone * Reduce the amount of code to get a schema Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add comments to test * Fix plan comparison * Compare only the results of execution * Remove println * Refer to df_impl instead of table in test Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix the register_table test to use the correct result set for comparison * Consolidate group/agg exprs * Format * Remove outdated comment Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * refine test in repartition.rs & coalesce_batches.rs (#1707) * Fuzz test for spillable sort (#1706) * Lazy TempDir creation in DiskManager (#1695) * Incorporate dyn scalar kernels (#1685) * Rebase * impl ToNumeric for ScalarValue * Update macro to be based on * Add floats * Cleanup * Newline * add annotation for select_to_plan (#1714) * Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700) * Change physical_expr creation API * Refactor API usage to avoid creating ExecutionContextState * Fixup ballista * clippy! * Fix can not load parquet table form spark in datafusion-cli. (#1665) * fix can not load parquet table form spark * add Invalid file in log. * fix fmt * add upper bound for pub fn (#1713) Signed-off-by: remzi <13716567376yh@gmail.com> * Create SchemaAdapter trait to map table schema to file schemas (#1709) * Create SchemaAdapter trait to map table schema to file schemas * Linting fix * Remove commented code * approx_quantile() aggregation function (#1539) * feat: implement TDigest for approx quantile Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory. A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an apprixmate function across the entire quantile range. The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation, and Mn02's Rust port are Apache 2.0 licensed. [TDigest]: https://arxiv.org/abs/1902.04023 [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h * feat: approx_quantile aggregation Adds the ApproxQuantile physical expression, plumbing & test cases. The function signature is: approx_quantile(column, quantile) Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1. * feat: approx_quantile dataframe function Adds the approx_quantile() dataframe function, and exports it in the prelude. * refactor: bastilla approx_quantile support Adds bastilla wire encoding for approx_quantile. Adding support for this required modifying the AggregateExprNode proto message to support propigating multiple LogicalExprNode aggregate arguments - all the existing aggregations take a single argument, so this wasn't needed before. This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here: https://developers.google.com/protocol-buffers/docs/proto3#updating Specifically, adding "repeated" to an existing message field: "For ... message fields, optional is compatible with repeated" No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr. * refactor: use input type as return type Casts the calculated quantile value to the same type as the input data. * fixup! refactor: bastilla approx_quantile support * refactor: rebase onto main * refactor: validate quantile value Ensures the quantile values is between 0 and 1, emitting a plan error if not. * refactor: rename to approx_percentile_cont * refactor: clippy lints * suppport bitwise and as an example (#1653) * suppport bitwise and as an example * Use $OP in macro rather than `&` * fix: change signature to &dyn Array * fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: substr - correct behaivour with negative start pos (#1660) * minor: fix cargo run --release error (#1723) * Convert boolean case expressions to boolean logic (#1719) * Convert boolean case expressions to boolean logic * Review feedback * substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720) * Substitute parking_lot::Mutex for std::sync::Mutex * enable parking_lot feature in tokio * Add Expression Simplification API (#1717) * Add Expression Simplification API * fmt * Add tests and CI for optional pyarrow module (#1711) * Implement other side of conversion * Add test workflow * Add (failing) tests * Get unit tests passing * Use python -m pip * Debug LD_LIBRARY_PATH * Set LIBRARY_PATH * Update help with better info * Update parking_lot requirement from 0.11 to 0.12 (#1735) Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version. - [Release notes](https://github.com/Amanieu/parking_lot/releases) - [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md) - [Commits](Amanieu/parking_lot@0.11.0...0.12.0) --- updated-dependencies: - dependency-name: parking_lot dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Prevent repartitioning of certain operator's direct children (#1731) (#1732) * Prevent repartitioning of certain operator's direct children (#1731) * Update ballista tests * Don't repartition children of RepartitionExec * Revert partition restriction on Repartition and Projection * Review feedback * Lint * API to get Expr's type and nullability without a `DFSchema` (#1726) * API to get Expr type and nullability without a `DFSchema` * Add test * publically export * Improve docs * Fix typos in crate documentation (#1739) * add `cargo check --release` to ci (#1737) * remote test * Update .github/workflows/rust.yml Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Move optimize test out of context.rs (#1742) * Move optimize test out of context.rs * Update * use clap 3 style args parsing for datafusion cli (#1749) * use clap 3 style args parsing for datafusion cli * upgrade cli version * Add partitioned_csv setup code to sql_integration test (#1743) * use ordered-float 2.10 (#1756) Signed-off-by: Andy Grove <agrove@apache.org> * #1768 Support TimeUnit::Second in hasher (#1769) * Support TimeUnit::Second in hasher * fix linter * format (#1745) * Create built-in scalar functions programmatically (#1734) * create build-in scalar functions programatically Signed-off-by: remzi <13716567376yh@gmail.com> * solve conflict Signed-off-by: remzi <13716567376yh@gmail.com> * fix spelling mistake Signed-off-by: remzi <13716567376yh@gmail.com> * rename to call_fn Signed-off-by: remzi <13716567376yh@gmail.com> * [split/1] split datafusion-common module (#1751) * split datafusion-common module * pyarrow * Update datafusion-common/README.md Co-authored-by: Andy Grove <agrove@apache.org> * Update datafusion/Cargo.toml * include publishing Co-authored-by: Andy Grove <agrove@apache.org> * fix: Case insensitive unquoted identifiers (#1747) * move dfschema and column (#1758) * add datafusion-expr module (#1759) * move column, dfschema, etc. to common module (#1760) * include window frames and operator into datafusion-expr (#1761) * move signature, type signature, and volatility to split module (#1763) * [split/10] split up expr for rewriting, visiting, and simplification traits (#1774) * split up expr for rewriting, visiting, and simplification * add docs * move built-in scalar functions (#1764) * split expr type and null info to be expr-schemable (#1784) * rewrite predicates before pushing to union inputs (#1781) * move accumulator and columnar value (#1765) * move accumulator and columnar value (#1762) * fix bad data type in test_try_cast_decimal_to_decimal * added projections for avro columns Co-authored-by: xudong.w <wxd963996380@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com> Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Co-authored-by: Matthew Turner <matthew.m.turner@outlook.com> Co-authored-by: Yang <37145547+Ted-Jiang@users.noreply.github.com> Co-authored-by: Remzi Yang <59198230+HaoYang670@users.noreply.github.com> Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com> Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: Kun Liu <liukun@apache.org> Co-authored-by: Dmitry Patsura <talk@dmtry.me> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: r.4ntix <r.4ntix@gmail.com> Co-authored-by: Jiayu Liu <Jimexist@users.noreply.github.com> Co-authored-by: Andy Grove <agrove@apache.org> Co-authored-by: Rich <jychen7@users.noreply.github.com> Co-authored-by: Marko Mikulicic <mmikulicic@gmail.com> Co-authored-by: Eduard Karacharov <13005055+korowa@users.noreply.github.com>

wjones127 added 3 commits January 30, 2022 10:46

Implement other side of conversion

5948803

Add test workflow

bab0dbc

Add (failing) tests

4bd8e7e

github-actions bot added the datafusion Changes in the datafusion crate label Jan 30, 2022

Get unit tests passing

8db4147

wjones127 marked this pull request as ready for review January 30, 2022 23:24

wjones127 added 3 commits January 31, 2022 07:37

Use python -m pip

c8ddfa6

Debug LD_LIBRARY_PATH

4c89436

Set LIBRARY_PATH

eeabd8e

alamb approved these changes Jan 31, 2022

View reviewed changes

datafusion/src/pyarrow.rs

locals.get_item("python_path").unwrap().extract().unwrap();

Err(err).expect(

format!(

Copy link

Contributor

alamb Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Update help with better info

78ac9ff

Merge branch '1635-pyarrow-test' of github.com:wjones127/arrow-datafu…

0764a3b

…sion into 1635-pyarrow-test

wjones127 force-pushed the 1635-pyarrow-test branch from d545c46 to 0764a3b Compare February 3, 2022 05:29

alamb merged commit d1ebdbf into apache:master Feb 3, 2022

wjones127 deleted the 1635-pyarrow-test branch February 3, 2022 16:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tests and CI for optional pyarrow module #1711

Add tests and CI for optional pyarrow module #1711

wjones127 commented Jan 30, 2022

wjones127 commented Jan 30, 2022

houqp commented Jan 31, 2022

wjones127 commented Jan 31, 2022

alamb left a comment

alamb Jan 31, 2022

alamb commented Jan 31, 2022

wjones127 commented Jan 31, 2022

alamb commented Jan 31, 2022 •

edited

Loading

wjones127 commented Jan 31, 2022

wjones127 commented Feb 3, 2022

alamb commented Feb 3, 2022

Add tests and CI for optional pyarrow module #1711

Add tests and CI for optional pyarrow module #1711

Conversation

wjones127 commented Jan 30, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

wjones127 commented Jan 30, 2022

houqp commented Jan 31, 2022

wjones127 commented Jan 31, 2022

alamb left a comment

Choose a reason for hiding this comment

alamb Jan 31, 2022

Choose a reason for hiding this comment

alamb commented Jan 31, 2022

wjones127 commented Jan 31, 2022

alamb commented Jan 31, 2022 • edited Loading

wjones127 commented Jan 31, 2022

wjones127 commented Feb 3, 2022

alamb commented Feb 3, 2022

alamb commented Jan 31, 2022 •

edited

Loading