
Avoid resolving all targets for first-party module mapping #11459

Closed

gatesn opened this issue Jan 13, 2021 · 12 comments

gatesn commented Jan 13, 2021

I've found that Pants can run quite slowly in my large Python monorepo. I think this is because there are lots of unnecessary imports that make the transitive closure of dependencies cover a huge portion of the repository.

An example of this is when a core library shows a usage example using application code (perhaps in a main guard). Anyone who now depends on that library also pulls in the application code. Enough of these types of imports and we end up in a pretty bad place...

I know this is less than ideal, so I'm trying to track down where these spurious imports are happening. If the `dependencies --transitive` goal showed from/to pairs, that would let me reconstruct the dependency graph and track down the imports.

stuhood (Member) commented Jan 13, 2021

I think this is because there are lots of unnecessary imports that make the transitive closure of dependencies cover a huge portion of the repository.

Unnecessary imports are definitely a problem! If you aren't already, I would strongly recommend using a linter in the lint goal (such as flake8) that will detect unused imports, and would recommend using dependency inference with mostly-empty BUILD files, so that stale explicit dependency declarations are much less likely.

But which command(s) in particular are running slowly? There is a known performance issue that affects a few commands, but I want to independently narrow in on which one you're seeing, if possible!

gatesn commented Jan 14, 2021

Maybe it's worth renaming this issue actually, since I think it is more general performance issues / setup help I'm looking for. Happy to move to Slack too if you prefer.

I managed to track down a few of the spurious imports using findimports. As I say, it's not that the imports are unused. It's that they're often in a main guard showing examples of how to use the library. Removing these has hugely reduced the transitive dependency graph. Dependency inference is an awesome feature by the way! The BUILD files are all empty with no explicit deps.

It's worth noting that my current setup is to have BUILD files only in the top-level directories with recursive globs. Given dependency inference provides file-granularity dependencies, and goals can often be run with file targets, what's the benefit of a BUILD file in every directory? It seems like a lot to maintain in a large monorepo.

What I'm seeing, though (and this is the bit that seems to be taking all the time; is there a profile output similar to ./gradlew --profile?), is that every time I run Pants it shows warnings for unmatched globs. That's true: the globs are over-specified. But it suggests that even when I run Pants against a single target, it's still walking the entire file tree?

I'm running pants 2.2.0.dev1

gatesn commented Jan 14, 2021

Is this perhaps where dependency inference looks at the whole repo?

all_expanded_targets = await Get(Targets, AddressSpecs([DescendantAddresses("")]))

[edit] It seems to be, based on my understanding of https://www.pantsbuild.org/v2.0/docs/rules-api-and-target-api#how-to-resolve-targets. What's the reason for resolving all targets vs. accepting Targets as a parameter to the rule and resolving transitively?

Eric-Arellano (Contributor) commented

Happy to move to Slack too if you prefer.

Either works. We're sometimes faster to reply on Slack but either works.

As I say, it's not that the imports are unused. It's that they're often in a main guard showing examples of how to use the library.

Hm, you could move the examples into comments, perhaps? Once you uncomment the block to run the example, Pants's dependency inference should kick in automatically and the example will be runnable.

Given dependency inference provides file-granularity dependencies, and goals can often be run with file targets, what's the benefit of a BUILD file in every directory?

Indeed, this is a workflow we wanted to enable and a key reason that we added dependency inference + file targets.

It's still not all the way where we want it: the main issue now is that adding explicit metadata to a target applies that metadata to all files in the target. Say you have a target with 100 files, and 1 file depends on a database library that Pants cannot infer, so you add it to the dependencies field; now all 100 files and their transitive dependees depend on it, even where it's not necessary. Instead, you'd need to split out more fine-grained targets. This is why we still recommend 1 BUILD file per directory, to mitigate this risk; but otherwise it's fine, and we want to fix this problem robustly.

But it's suggesting that even when I run Pants against a single target it's still walking the entire file tree?

Do you have Pantsd enabled (the default)? There are two stages to dependency inference, as you found:

  1. Create a module mapping. Because in a monorepo you can import from anywhere, this must be a global view of the world. We need to know about every 3rdparty requirement you have + every first-party module. We create a dictionary mapping my.module.foo -> the source(s) for it.
    • This is re-calculated every time a Python file name changes (resulting in a different module name), or a file is added/deleted. It also happens when BUILD files have been touched, even if you purely added a comment, because we cannot know beforehand whether you changed targets, e.g. added new ones. It should not be impacted by changing the content of any file, outside of BUILD files.
  2. For the relevant files, parse their imports using the AST. Then look up in the dictionaries from step 1 where each import is coming from.
    • This happens every time a file's name or content has changed, as we do not know if import statements have been tweaked.
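The two steps above can be roughly sketched in plain Python. All names here are illustrative stand-ins, not the actual Pants rules:

```python
import ast
from pathlib import PurePath

# Step 1 (sketch): build a global map of module name -> owning file(s).
def build_module_mapping(files, source_roots):
    mapping = {}
    for f in files:
        path = PurePath(f)
        for root in source_roots:
            try:
                rel = path.relative_to(root)
            except ValueError:
                continue
            parts = list(rel.with_suffix("").parts)
            if parts and parts[-1] == "__init__":
                parts = parts[:-1]  # src/python/project/__init__.py -> project
            mapping.setdefault(".".join(parts), []).append(f)
            break
    return mapping

# Step 2 (sketch): parse a file's imports with the AST, then look each one up.
def infer_deps(source_code, mapping):
    deps = set()
    for node in ast.walk(ast.parse(source_code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            deps.update(mapping.get(name, []))
    return deps
```

Note that only step 1 is global; step 2 runs per relevant file and consults the precomputed dictionary.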

Step 1 can indeed get expensive in a large repo. Pantsd is pretty crucial to performance to avoid needing to rerun step 1 every single run.

It sounds like step 1 is running more than you would expect?

Is this perhaps where dependency inference looks at the whole repo?

Yep! Great find.

I'm running pants 2.2.0.dev1

2.2.0rc2 will be more polished btw. Change pants_version in pants.toml.

@gatesn
Copy link
Author

gatesn commented Jan 14, 2021

Hm, you could move the examples into comments perhaps? Once you uncomment out the block to run the example, Pants's dependency inference should kick in automatically and the example will be runnable.

I'm happy to just delete all of this to be honest! So these shouldn't be a problem.

Yup, I'm running with Pantsd but it is restarting more than I expect. I sometimes also get messages saying "Filesystem changed during execution" (something like that), but it doesn't tell me what changed? And I'm not sure I did change anything... maybe my exclusion patterns are wrong somehow.

  1. Create a module mapping.

So am I right in thinking this step scales with the monorepo, rather than the amount of work to be done? That might be a bit of an issue.

My understanding of this step is that it walks the file tree collecting Python source files, strips the source root, strips `__init__.py`, then replaces slashes with dots:

def create_from_stripped_path(cls, path: PurePath) -> PythonModule:

Would it therefore be feasible to instead do the reverse? When parsing the AST, take the module name and produce a list of potential source files, e.g. replace dots with slashes, add {.py,.pyi,.pyx}, maybe add /__init__.py, then check for file existence in each of the source roots. (Probably checking against the third-party module list first, since that just requires executing BUILD files.)

I think this should scale with the number of modules imported by the selected targets rather than with the repo as a whole.
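A minimal sketch of that reverse lookup (the helper name and signature are hypothetical):

```python
from pathlib import PurePath

# Hypothetical helper: given an imported module name, enumerate every path that
# could provide it, across all source roots. These candidates would then be
# checked for existence, rather than precomputing a map over the whole repo.
def candidate_paths(module, source_roots, suffixes=(".py", ".pyi", ".pyx")):
    rel = PurePath(*module.split("."))
    candidates = []
    for root in source_roots:
        for suffix in suffixes:
            candidates.append(str(PurePath(root) / rel.with_suffix(suffix)))
        # A module can also be a package, provided by its __init__.py.
        candidates.append(str(PurePath(root) / rel / "__init__.py"))
    return candidates
```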

Though there do seem to be quite a few rules that pull in targets from the entire repo, so maybe this isn't the only case? https://github.com/pantsbuild/pants/search?q=DescendantAddresses%28%22%22%29

Eric-Arellano (Contributor) commented

And I'm not sure I did change anything... maybe my exclusion patterns are wrong somehow.

That's plausible! You'd want to tweak the pants_ignore option for this. It defaults to your .gitignore's contents, but you may sometimes want to add more to it. See https://www.pantsbuild.org/docs/troubleshooting#pants-cannot-find-a-file-in-your-project.

I'm running with Pantsd but it is restarting more than I expect

Hm, to clarify: Pantsd itself is restarting? Or only the module_mapping is being invalidated? The former is indeed very costly, and we've been chipping away at reducing the need to restart. For example, ctrl-c used to restart pantsd, but as of newer 2.2 versions it no longer does.

It's plausible also that the module_mapping is being invalidated more than we expect - if that's the case, that would be a big win to figure out why and how to fix.

So am I right in thinking this step scales with the monorepo, rather than the amount of work to be done

Yes, this is correct.

Would it therefore be feasible to instead do the reverse?

I don't think it's very feasible because of source roots. If you see a module project.my_lib, you have no way of knowing beforehand where that might live; it could be src/py/project/my_lib.py, src/python/project/my_lib.py, project/my_lib.py, foo/bar/baz/project/my_lib.py. So, you must consult every source root known by Pants.

That is, I think it will always make sense to precompute this global view of what files are out there. It is plausible that we don't want to eagerly normalize, though, i.e. we could lazily convert the files to Python modules, although I suspect that wouldn't be much of a saving.

--

We do have one known performance issue that occurs with transitive dependencies: #11270. Transitive deps get used a lot, for things like typecheck, run, and test, so you often pay this cost.

Improving this should make ./pants --no-pantsd dependencies --transitive :: faster, but it will have no impact on ./pants --no-pantsd dependencies ::. (--no-pantsd used for benchmarking purposes.)

gatesn commented Jan 14, 2021

(Sorry, accidentally pressed enter so I'm editing instead of posting again)

Hm, to clarify, Pantsd itself is restarting? Or only the module_mapping is restarting?

I'm not sure to be honest, I'll keep an eye out to check.

I don't think it's very feasible because of source roots. If you see a module project.my_lib, you have no way of knowing beforehand where that might live; it could be src/py/project/my_lib.py, src/python/project/my_lib.py, project/my_lib.py, foo/bar/baz/project/my_lib.py. So, you must consult every source root known by Pants.

Suppose a repo has 10–100k files; I think it makes a lot of sense to check every source root rather than pre-compute over the whole repo, particularly when some of the more core libraries have no first-party dependencies at all.

I guess it also depends on the project layout. I currently only have a single source root "/".

I can also imagine a heuristic for sorting source roots (i.e. based on where you found other modules with the same prefix) would shortcut a lot of the expense.

@gatesn gatesn changed the title Show source -> dest map for transitive dependencies/dependees Avoid resolving all targets for first-party module mapping Jan 18, 2021

gatesn commented Jan 18, 2021

I was taking a look at how to support this alternative behaviour for building a first party module map. I got a little stuck on a couple of things:

  1. It's not obvious whether it's possible to use Subsystem config objects to change the shape of the rule graph. e.g. if I pass a flag, "--lazy-module-mapping" can I swap out the rule that provides FirstPartyModuleMappingImpl with another one?
  2. Regardless of 1, I'm not quite sure whether it's OK in this case to use os.path.exists, or whether that breaks caching in some way; I'd have to think more. If it's not OK, then it sounds like a lot of context switching between Rust and Python, and maybe we'd need some bulk method where we construct a list of potential locations and defer to the Rust engine to return the first path in that list that exists, e.g. something like Get(FirstPathExists, FirstPathExistsRequest(['root1/mod.py', 'root1/mod/__init__.py', 'root2/mod.py', 'root2/mod/__init__.py']))

Eric-Arellano (Contributor) commented

I can also imagine a heuristic for sorting source roots (i.e. based on where you found other modules with the same prefix) would shortcut a lot of the expense.

Unfortunately, this wouldn't work for correctness: we can't rely on "first one found". If more than one file and/or third-party requirement shares the same module name, we must not use any of them, because it is ambiguous which one the user intended. So we must be confident we have exhausted all possible sources of that particular module name.
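That constraint can be sketched as follows (names are illustrative, not the Pants API):

```python
# Inference may only use a module owner when exactly one candidate provides
# that module name; with zero or multiple owners we must not guess.
def resolve_unambiguously(module, first_party, third_party):
    owners = first_party.get(module, []) + third_party.get(module, [])
    if len(owners) == 1:
        return owners[0]
    # Ambiguous (or unknown): the user must disambiguate with an explicit
    # dependency in a BUILD file.
    return None
```

This is why any lazy lookup still has to exhaust every source root before trusting a match: stopping at the first hit would silently resolve ambiguous imports.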

It's not obvious whether it's possible to use Subsystem config objects to change the shape of the rule graph. e.g. if I pass a flag, "--lazy-module-mapping" can I swap out the rule that provides FirstPartyModuleMappingImpl with another one?

Indeed, this is possible! It's a key mechanism we wanted with the Rules API. An example:

if dependencies_subsystem.transitive:
    transitive_targets = await Get(
        TransitiveTargets, TransitiveTargetsRequest(addresses, include_special_cased_deps=True)
    )
    targets = Targets(transitive_targets.dependencies)
else:
    target_roots = await Get(UnexpandedTargets, Addresses, addresses)
    dependencies_per_target_root = await MultiGet(
        Get(
            Targets,
            DependenciesRequest(tgt.get(DependenciesField), include_special_cased_deps=True),
        )
        for tgt in target_roots
    )
    targets = Targets(itertools.chain.from_iterable(dependencies_per_target_root))

You could either have two dedicated rules for each approach, where each rule takes a different type as its parameter, like FirstPartyImplApproach1 vs. FirstPartyImplApproach2, or use the same rule and have the "request" object gain a new argument to determine the functionality. For example, FirstPartyImpl(..., lazy_mapping=True).

The rule to create the first party module mapping can also directly request the Subsystem in its @rule signature and can have that conditional logic directly in it.
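A plain-Python sketch of the "request object gains a flag" variant. The class and function names, and the lazy strategy itself, are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FirstPartyModuleMappingRequest:
    # Hypothetical flag mirroring a --lazy-module-mapping option.
    lazy_mapping: bool = False

def first_party_module_mapping(request, all_files, imported_modules, source_roots):
    if request.lazy_mapping:
        # Only consider paths derivable from the imports actually seen.
        relevant = {
            f"{root}/{module.replace('.', '/')}.py"
            for module in imported_modules
            for root in source_roots
        }
        return [f for f in all_files if f in relevant]
    # Eager strategy: map the whole repo, as today.
    return list(all_files)
```

A real rule would branch the same way, with the subsystem or request flag selecting which strategy's `Get`s to await.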

--

That does remind me of a tricky requirement of this dependency inference idea, though. We need to support the hook for plugin authors to use Python dependency inference. For example, Protobuf adds on to the global module mapping here https://github.com/pantsbuild/pants/blob/master/src/python/pants/backend/codegen/protobuf/python/python_protobuf_module_mapper.py, so that when we encounter an import like path.to.my_module_pb2, we know it maps back to a protobuf_library() target.

That file would need to be made lazy too. For the sake of a proof-of-concept, it's fine to ignore that for now. But, we would need a solution for what that would look like.

I'm not quite sure whether it's ok in this case to use os.path.exists

Unfortunately, it is not safe. You would use await Get(Paths, PathGlobs) for this and check the Paths result. An example:

paths = await Get(Paths, PathGlobs([str(path / mf) for mf in marker_filenames]))
if len(paths.files) > 0:
    return OptionalSourceRoot(SourceRoot(str(path)))

--

Taking a step back, is this still the approach you're taking?

Would it therefore be feasible to instead do the reverse? When parsing the AST, take the module name and produce a list of potential source files, e.g. replace dots with slashes, add {.py,.pyi,.pyx}, maybe add /__init__.py, then check for file existence in each of the source roots.

Before spending too much time battling the Rules API's bad error messages, it may be fruitful to sketch out what the algorithm will look like so that we can help think it through.

For example, I have the import pants.util.strutil, with the below source roots. How will your algorithm figure out which targets, if any, include a module pants.util.strutil?

3rdparty/python
build-support/bin
build-support/migration-support
pants-plugins
src/assets
src/python
src/rust
src/rust/engine/process_execution/src/nailgun
src/rust/engine/src/externs
testprojects/3rdparty/python
testprojects/pants-plugins/3rdparty/python
testprojects/pants-plugins/src/python
testprojects/pants-plugins/tests/python
testprojects/src/python
testprojects/tests/python
tests/python
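For concreteness, here is what a reverse lookup would fan out into for pants.util.strutil against those roots (illustrative arithmetic only, not a proposed implementation):

```python
# The 16 source roots listed above; one import fans out into
# 16 roots x 4 candidate forms = 64 existence checks.
source_roots = [
    "3rdparty/python", "build-support/bin", "build-support/migration-support",
    "pants-plugins", "src/assets", "src/python", "src/rust",
    "src/rust/engine/process_execution/src/nailgun", "src/rust/engine/src/externs",
    "testprojects/3rdparty/python", "testprojects/pants-plugins/3rdparty/python",
    "testprojects/pants-plugins/src/python", "testprojects/pants-plugins/tests/python",
    "testprojects/src/python", "testprojects/tests/python", "tests/python",
]
rel = "pants.util.strutil".replace(".", "/")
candidates = [
    f"{root}/{rel}{form}"
    for root in source_roots
    for form in (".py", ".pyi", ".pyx", "/__init__.py")
]
assert len(candidates) == 64
assert "src/python/pants/util/strutil.py" in candidates
```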

(I hope none of this reads as dismissive - I'm really glad you're exploring this and looking at ways to improve dependency inference! This has the potential to greatly improve Pants.)

gatesn commented Jan 19, 2021

Not at all dismissive - I really appreciate the time and depth you're putting into this. I'm trying to balance 'doing something useful' with bootstrapping myself on Pants internals and the rules API so I suspect many of my questions don't even make sense!

Thinking through this more, I don't believe there's any way to provide dependency inference that doesn't scale with the size of the repository in some way. Currently it scales with O(F), the number of files. My suggestion (and you're correct that we'd need to exhaustively stat files instead of just finding the first) scales with O(S × I), the number of source roots multiplied by the number of first-party imports. The shape of the repository determines which of these is best.
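To put rough, made-up numbers on that tradeoff:

```python
# Illustrative figures only; real costs depend entirely on the repo's shape.
F = 100_000  # Python files in the repo -> eager module map is O(F)
S = 16       # source roots
I = 200      # first-party imports in the selected targets' closure
checks = 4   # {.py, .pyi, .pyx, /__init__.py} per (root, import) pair

eager_cost = F               # walk every file once
lazy_cost = S * I * checks   # 12,800 existence checks
assert lazy_cost < eager_cost  # lazy wins here; a small repo with many
                               # roots or imports could flip the comparison
```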

If you’re on board with the idea that Pants should support both of these strategies behind a config flag then I'm happy to look further into implementing the latter.

I'm still not entirely convinced myself though. For instance, when running CI it's obvious that we'll want to checksum all files in the repo at some point during the build. It's only when running locally and invalidating inference that we take the performance hit. So I guess the question is when / how often do we invalidate inference?

Is my understanding of the current implementation correct that inference is invalidated whenever a Python file is added/removed/renamed in the repository (due to validating import ambiguity)?

stuhood (Member) commented Jan 20, 2021

See https://news.ycombinator.com/item?id=24937228 for some discussion of the costs involved here! In the best case, "finding all modules in the repository" is equivalent to "listing all files in the repository", which should be a sub-second operation for most repositories (equivalent to the un-cached time of ./pants list ::, which also lists all files). To the extent that the operation currently takes longer than that, we are very interested in optimizing it!

Eric-Arellano (Contributor) commented

A couple months later...

Is my understanding of the current implementation correct that inference is invalidated whenever a Python file is added/removed/renamed in the repository (due to validating import ambiguity)?

Yes: only changes to file names (not content) should invalidate the mapping. If we're seeing differently, that'd be great to identify and fix.

I don't believe there's any way to provide dependency inference that doesn't scale with the size of the repository in some way.

Agreed. Going to close this issue as not solvable because we must consider the universe of all targets.

But, to be clear, very open to performance enhancements and ensuring the invalidation behaves how we want.

Thanks for asking such great questions!
