Extract Python dependencies in an intrinsic #18854

Merged: 35 commits, merged May 9, 2023

@thejcannon thejcannon commented Apr 28, 2023

This PR does two things:

  • Introduces a new crate to the engine: dep_inference. A module inside it is dedicated to Python and uses tree-sitter and tree-sitter-python to parse Python dependencies. tree-sitter was chosen because it supports both Python 2 and 3, supports other languages, and is resistant to syntax errors.
  • Leverages the new crate in an intrinsic. The new behavior must be explicitly opted into (or out of) and will eventually be the only way to do the inference.

Timings

Helper script:

#!/bin/bash
# Replace the year in every copyright header with a random number so each file's content changes
find src/python/pants -type f -name "*.py" -not -name "__init__.py" | xargs sed -i s/'Copyright [0123456789][0123456789][0123456789][0123456789]'/"Copyright $RANDOM"/
# Wait for the kernel really quick
sleep 1
# Wait for the inotify notifications to stop
while true; do
  mtime=$(stat -c %Y .pants.d/pants.log)
  now=$(date +%s)
  diff=$((now - mtime))
  if (( diff >= 5 )); then
    break
  fi
  sleep $((5 - diff))
done

Timings follow. ./dirty_files.sh (the helper script above) sets up the worst-case scenario by touching every copyright header. I'm on a 64-core machine, so I limit parallelism to run as if we only had 8 cores.

Findings:

  • In the worst case (the extraction process is not in the process cache), the Rust parser is far faster: roughly 12.5x in the benchmark below (about 2.9 s versus 36.3 s).
  • In the best case (the process cache is hot), the two are comparable. Put another way, the time it takes to execute the rule code and look up the process in the process cache is roughly the time it takes just to parse the file again.

Worst case (completely cold cache)

$ hyperfine --prepare ./dirty_files.sh --runs 4 --warmup 1 'pants --rule-threads-core=4 --process-execution-local-parallelism=8 --no-python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::' 'pants --rule-threads-core=4 --process-execution-local-parallelism=8 --python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::'
Benchmark 1: pants --rule-threads-core=4 --process-execution-local-parallelism=8 --no-python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::
  Time (mean ± σ):     36.335 s ±  1.286 s    [User: 0.754 s, System: 0.151 s]
  Range (min … max):   34.698 s … 37.645 s    4 runs
 
Benchmark 2: pants --rule-threads-core=4 --process-execution-local-parallelism=8 --python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::
  Time (mean ± σ):      2.899 s ±  0.096 s    [User: 0.758 s, System: 0.131 s]
  Range (min … max):    2.764 s …  2.990 s    4 runs
 
Summary
  'pants --rule-threads-core=4 --process-execution-local-parallelism=8 --python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::' ran
   12.54 ± 0.61 times faster than 'pants --rule-threads-core=4 --process-execution-local-parallelism=8 --no-python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::'

Best Case (hot cache, but no daemon)

$ hyperfine --runs 4 --warmup 1 'pants --no-pantsd --rule-threads-core=4 --process-execution-local-parallelism=8 --no-python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::' 'pants --no-pantsd --rule-threads-core=4 --process-execution-local-parallelism=8 --python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::'
Benchmark 1: pants --no-pantsd --rule-threads-core=4 --process-execution-local-parallelism=8 --no-python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::
  Time (mean ± σ):     20.589 s ±  0.319 s    [User: 20.303 s, System: 2.002 s]
  Range (min … max):   20.167 s … 20.934 s    4 runs
 
Benchmark 2: pants --no-pantsd --rule-threads-core=4 --process-execution-local-parallelism=8 --python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::
  Time (mean ± σ):     19.273 s ±  0.347 s    [User: 18.881 s, System: 1.669 s]
  Range (min … max):   18.940 s … 19.759 s    4 runs
 
Summary
  'pants --no-pantsd --rule-threads-core=4 --process-execution-local-parallelism=8 --python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::' ran
    1.07 ± 0.03 times faster than 'pants --no-pantsd --rule-threads-core=4 --process-execution-local-parallelism=8 --no-python-infer-use-rust-parser --filter-target-type=python_source  dependencies ::'
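
As a sanity check on the two Summary lines above, the reported ratios can be reproduced from the printed means and standard deviations. The sketch below assumes first-order propagation of relative errors, which is consistent with the figures printed here. The `Timing` type, the `relative_speed` function, and the constant names are made up for illustration and are not part of Pants or hyperfine.

```rust
/// Mean and standard deviation of one benchmark, in seconds.
/// (Hypothetical helper type, for illustration only.)
#[derive(Clone, Copy)]
pub struct Timing {
    pub mean: f64,
    pub stddev: f64,
}

/// Returns (speedup, uncertainty) of `fast` relative to `slow`.
/// Assumption: the uncertainty is the ratio times the quadrature sum
/// of the two relative standard deviations.
pub fn relative_speed(slow: Timing, fast: Timing) -> (f64, f64) {
    let ratio = slow.mean / fast.mean;
    let rel_err =
        ((slow.stddev / slow.mean).powi(2) + (fast.stddev / fast.mean).powi(2)).sqrt();
    (ratio, ratio * rel_err)
}

// Values copied from the benchmark output above.
// "LEGACY" = --no-python-infer-use-rust-parser, "RUST" = --python-infer-use-rust-parser.
pub const COLD_LEGACY: Timing = Timing { mean: 36.335, stddev: 1.286 };
pub const COLD_RUST: Timing = Timing { mean: 2.899, stddev: 0.096 };
pub const HOT_LEGACY: Timing = Timing { mean: 20.589, stddev: 0.319 };
pub const HOT_RUST: Timing = Timing { mean: 19.273, stddev: 0.347 };
```

With the printed values, `relative_speed(COLD_LEGACY, COLD_RUST)` comes out at about 12.5 ± 0.6 and `relative_speed(HOT_LEGACY, HOT_RUST)` at about 1.07 ± 0.03, matching the summaries up to rounding of the displayed means.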

@thejcannon thejcannon left a comment

I'll handle the Django plugin and the dep inference helper union in a separate PR

@@ -109,11 +109,11 @@ def test_normal_imports(rule_runner: RuleRunner) -> None:
ignored1 as alias1, # pants: no-infer-dep

Changes to this file will be most important, as they are potential changes in behavior.

@thejcannon thejcannon changed the title from "WIP: Extract Py3 deps in an intrinsic" to "Extract Python dependencies in an intrinsic" May 4, 2023
@thejcannon thejcannon requested review from stuhood and huonw May 4, 2023 02:08
@thejcannon thejcannon marked this pull request as ready for review May 4, 2023 02:08

@stuhood stuhood left a comment

Looks great!

As mentioned in Slack, I'm fine with beginning to do more of this in-process. As you mentioned, there is nothing stopping someone from writing a native wheel 3rdparty plugin to do the same thing, but this eases distribution/release.

@thejcannon thejcannon force-pushed the intrinsic-parsing branch from 0024d65 to df2b964 May 4, 2023 20:49
@thejcannon

Whew! Nice comments. Got through 'em


@huonw huonw left a comment

I can only review the Rust code (rather than the big picture of how it fits into Pants). I've also not looked through all the 'resolved' comments, so I might've been repeating some things.

My general comment is: nice!

Broadly speaking though, it feels like there's quite a lot of unwrapping. I've commented on some specific ones, but there's also a lot of name.named_child(...).unwrap() or equivalent. I'd be nervous about this making dep inference fragile to invalid code, or syntax changes, or other unexpected problems, and bringing down the whole pantsd when they fail.

Some alternatives, depending on the situation:

  • propagate errors with ? etc.
  • .expect("<short explanation>") or unwrap_or_else(|| panic!("foo {some_value} bar")) to at least give more context
  • ignore silently or with some logging (e.g. if the code is invalid, just skip over)
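
The three alternatives above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Node` here is a hypothetical stand-in with a `named_child` method that returns an `Option`, and the function names are invented for the example.

```rust
/// Hypothetical stand-in for a syntax-tree node; NOT the tree-sitter API.
pub struct Node {
    pub kind: &'static str,
    pub text: &'static str,
    pub children: Vec<Node>,
}

impl Node {
    /// A lookup that can fail: returns `None` when the child is absent.
    pub fn named_child(&self, index: usize) -> Option<&Node> {
        self.children.get(index)
    }
}

/// Alternative 1: propagate the failure to the caller with `?`.
pub fn import_name(import: &Node) -> Result<&str, String> {
    let name = import
        .named_child(0)
        .ok_or_else(|| format!("`{}` node has no name child", import.kind))?;
    Ok(name.text)
}

/// Alternative 2: still panic, but explain which expectation was violated.
pub fn import_name_or_panic(import: &Node) -> &str {
    import
        .named_child(0)
        .unwrap_or_else(|| panic!("`{}` node has no name child", import.kind))
        .text
}

/// Alternative 3: silently skip nodes that do not have the expected shape.
pub fn collect_imports(imports: &[Node]) -> Vec<&str> {
    imports
        .iter()
        .filter_map(|import| import.named_child(0).map(|name| name.text))
        .collect()
}
```

Which alternative fits depends on the call site: propagating keeps the decision with the caller, while skipping trades completeness for robustness against invalid code.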


#[derive(Serialize, Deserialize)]
pub struct ParsedPythonDependencies {
  pub imports: HashMap<String, (u64, bool)>,

@huonw huonw May 5, 2023

Using HashMap<&str, ...> here may not work directly because from .. import abc might be referring to a module that is never named literally.

I guess this could be std::borrow::Cow<'file_contents, str> to allow &str references into the raw file contents when available, but still support dynamically constructed Strings too.

(But, also, not important for this PR.)
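
For reference, the Cow idea could look roughly like the sketch below. The `Imports` alias and both helper functions are hypothetical, and the meaning of the `(u64, bool)` tuple is not spelled out in the excerpt above, so it is treated here as an opaque (number, flag) pair.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Hypothetical variant of the `imports` map whose keys may borrow
/// from the file contents (lifetime `'a`) or own their String.
pub type Imports<'a> = HashMap<Cow<'a, str>, (u64, bool)>;

/// The full import name appears literally in the source, so the key
/// can be a borrowed slice of the file contents (no allocation).
pub fn add_literal<'a>(imports: &mut Imports<'a>, name: &'a str, line: u64) {
    imports.insert(Cow::Borrowed(name), (line, false));
}

/// For something like `from .. import abc`, the full module path never
/// appears literally in the file, so it is built at runtime and owned.
pub fn add_relative<'a>(imports: &mut Imports<'a>, package: &str, name: &str, line: u64) {
    imports.insert(Cow::Owned(format!("{}.{}", package, name)), (line, false));
}
```

Lookups are unaffected by which variant a key uses: `imports.get("os.path")` works for both borrowed and owned keys because `Cow<str>` borrows as `str`.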


@stuhood stuhood left a comment

Once the build.rs script is fixed to avoid needing to commit the generated code, I'll shipit. Thanks!


@stuhood stuhood left a comment

Thanks!

@thejcannon

Oh wait, we still need to port or fall back for the Django dep inference plugin 😅

@huonw

huonw commented May 5, 2023

Given this is behind an opt-in flag, maybe that could be a follow-up: people who are using Django inference can simply not opt in for now?

@thejcannon

I made sure we fall back to the old behavior if the Django framework is enabled, so we can port it in a follow-up.

@thejcannon thejcannon merged commit 3a20af9 into pantsbuild:main May 9, 2023
@thejcannon thejcannon deleted the intrinsic-parsing branch May 9, 2023 15:00
huonw added a commit that referenced this pull request Nov 19, 2023
There have been a few new crates added recently (#18854, #19958), but we
didn't update `[workspace].members` and `[workspace].default-members` in
`src/rust/engine/Cargo.toml` to match. In addition, the older `protos`
and `grpc_util` weren't listed. This syncs up the lists.

The lists of `members` and `default-members` now exactly match, except
for `fs/brfs` as commented. They also match the `Cargo.toml`s that exist
on disk.