Avoid resolving all targets for first-party module mapping #11459
Unnecessary imports are definitely a problem! If you aren't already, I would strongly recommend using a linter.

But which command(s) in particular are running slowly? There is a known performance issue that affects a few commands, but I want to independently narrow in on which one you're seeing, if possible!
Maybe it's worth renaming this issue, actually, since I think it's more general performance issues / setup help I'm looking for. Happy to move to Slack too if you prefer.

I managed to track down a few of the spurious imports using findimports. As I say, it's not that the imports are unused; it's that they're often in a main guard showing examples of how to use the library. Removing these has hugely reduced the transitive dependency graph. Dependency inference is an awesome feature, by the way! The BUILD files are all empty with no explicit deps.

It's worth noting that my current setup is to have BUILD files only in the top-level directories, with recursive globs. Given that dependency inference provides file-granularity dependencies, and goals can often be run with file targets, what's the benefit of a BUILD file in every directory? It seems like a lot to maintain in a large monorepo.

What I'm seeing, though, is one step that seems to be taking all the time. Is there a profile output I could look at?

I'm running Pants 2.2.0.dev1.
Is this perhaps where dependency inference looks at the whole repo?
[edit] It seems to be, based on my understanding of https://www.pantsbuild.org/v2.0/docs/rules-api-and-target-api#how-to-resolve-targets. What's the reason for resolving all targets vs. accepting `Targets` as a parameter to the rule and resolving transitively?
Either works. We're sometimes faster to reply on Slack, but either is fine.
Hm, you could perhaps move the examples into comments? Once you uncomment the block to run the example, Pants's dependency inference should kick in automatically and the example will be runnable.
Indeed, this is a workflow we wanted to enable and a key reason we added dependency inference + file targets. It's still not all the way where we want it: the main issue now is that adding explicit metadata to a target applies that metadata to all files in the target. Say you have a target with 100 files, and one file has a dependency on a database library that Pants cannot infer; once you add it to the target's `dependencies` field, all 100 files pick up that dependency.
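For illustration, roughly what that looks like in a BUILD file (a hypothetical sketch with made-up paths and addresses, not taken from any real repo):

```python
# src/python/util/BUILD -- hypothetical example.
python_library(
    name="util",
    sources=["**/*.py"],  # ~100 files owned by this one target
    # Only db_client.py really needs this, but because the dependency is
    # declared on the target, every file matched by the glob picks it up.
    dependencies=["3rdparty/python:sqlalchemy"],
)
```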
Do you have Pantsd enabled (the default)? There are two stages to dependency inference, as you found:

1. Build a mapping from every first-party module in the repo to the file(s) that provide it.
2. Parse the imports of the files being operated on and look each one up in that mapping.
Step 1 can indeed get expensive in a large repo. Pantsd is pretty crucial to performance to avoid needing to rerun step 1 every single run. It sounds like step 1 is running more than you would expect?
Yep! Great find.
2.2.0rc2 will be more polished, btw. Change `pants_version` in your `pants.toml` to pick it up.
I'm happy to just delete all of this, to be honest! So these shouldn't be a problem.

Yup, I'm running with Pantsd, but it's restarting more than I expect. I sometimes also get messages saying something like "Filesystem changed during execution", but it doesn't tell me what changed, and I'm not sure I did change anything... maybe my exclusion patterns are wrong somehow.
So am I right in thinking this step scales with the size of the monorepo, rather than with the amount of work to be done? That might be a bit of an issue. My understanding of this step is that it walks the file tree taking Python source files, strips the source root, strips `__init__.py`, then replaces slashes with dots.
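In standalone form, my mental model of that normalization is something like this (a simplified sketch, not the actual Pants code; the source-root handling here is an assumption):

```python
import os
from typing import List, Optional


def module_for_source_file(path: str, source_roots: List[str]) -> Optional[str]:
    """Map a Python source file to the first-party module it provides:
    strip the source root, drop __init__.py / the .py suffix, and turn
    path separators into dots."""
    for root in source_roots:
        prefix = "" if root in ("", "/", ".") else root.rstrip("/") + "/"
        if not path.startswith(prefix):
            continue
        rel = path[len(prefix):]
        if rel.endswith("/__init__.py"):
            rel = rel[: -len("/__init__.py")]
        elif rel.endswith(".py"):
            rel = rel[: -len(".py")]
        else:
            return None
        return rel.replace(os.sep, ".").replace("/", ".")
    return None


# module_for_source_file("src/python/project/util.py", ["src/python"])
#   -> "project.util"
```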
Would it therefore be feasible to do the reverse instead? When parsing the AST, take the module name and produce a list of potential source files, e.g. replace dots with slashes, append `.py` (or `/__init__.py`), prefix each source root, and check which of those files exist. I think this should scale with the number of modules imported by the selected targets rather than with the repo as a whole. Though there do seem to be quite a few rules that pull in targets from the entire repo, so maybe this isn't the only case? https://github.com/pantsbuild/pants/search?q=DescendantAddresses%28%22%22%29
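To make the "produce a list of potential source files" step concrete (illustrative only; in practice the existence checks would go through the engine's cached filesystem APIs):

```python
from typing import List


def candidate_files(module: str, source_roots: List[str]) -> List[str]:
    """All paths that could provide `module`, two per source root."""
    rel = module.replace(".", "/")
    paths = []
    for root in source_roots:
        base = "" if root in ("", "/", ".") else root.rstrip("/") + "/"
        paths.append(f"{base}{rel}.py")
        paths.append(f"{base}{rel}/__init__.py")
    return paths


# candidate_files("project.util", ["src/python", "tests/python"]) ->
#   ["src/python/project/util.py", "src/python/project/util/__init__.py",
#    "tests/python/project/util.py", "tests/python/project/util/__init__.py"]
```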
That's plausible! You'd want to tweak the `pants_ignore` option in `pants.toml`.
Hm, to clarify: is Pantsd itself restarting, or only the module mapping being recomputed? The former is indeed very costly, and we've been chipping away at reducing the need to restart. For example, Ctrl-C used to restart Pantsd, but as of newer 2.2 versions it no longer does. It's also plausible that the module mapping is being invalidated more than we expect; if that's the case, it would be a big win to figure out why and fix it.
Yes, this is correct.
I don't think it's very feasible, because of source roots. If you see a module like `project.util`, you don't know which source root it lives under, so you would have to probe every source root for every import. That is, I think it will always make sense to precompute this global view of what files are out there. It is plausible that we don't want to eagerly normalize, though: we could lazily convert the files to Python modules, although I suspect that wouldn't be much of a saving.

We do have one known performance issue that occurs with transitive dependencies: #11270. Transitive deps get used a lot, for example when running tests. Improving this should make those use cases faster.
(Sorry, accidentally pressed enter so I'm editing instead of posting again)
I'm not sure to be honest, I'll keep an eye out to check.
Suppose a repo has 10-100k files; I think it makes a lot of sense to check every source root rather than pre-compute over the whole repo, particularly when some of the more core libraries have no first-party dependencies at all. I guess it also depends on the project layout; I currently only have a single source root, "/". I can also imagine that a heuristic for ordering source roots (e.g. based on where you found other modules with the same prefix) would shortcut a lot of the expense.
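A sketch of what I mean by that heuristic (purely illustrative):

```python
from collections import Counter, defaultdict
from typing import Dict, List


class SourceRootHeuristic:
    """Order source roots so that roots where modules sharing a top-level
    prefix were previously found are probed first."""

    def __init__(self, source_roots: List[str]) -> None:
        self._roots = source_roots
        self._hits: Dict[str, Counter] = defaultdict(Counter)

    def ordered_for(self, module: str) -> List[str]:
        hits = self._hits[module.split(".", 1)[0]]
        # Stable sort: ties keep the configured source-root order.
        return sorted(self._roots, key=lambda root: -hits[root])

    def record_hit(self, module: str, root: str) -> None:
        self._hits[module.split(".", 1)[0]][root] += 1
```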
I was taking a look at how to support this alternative behaviour for building a first-party module map, and I got a little stuck on a couple of things:

- Is it OK to stop at the first file found that provides a module, or do we need to find every possible owner?
- Can the two strategies be exposed as separate rules that a caller selects between?
- Is it safe to work out the source root for a candidate file ourselves, rather than asking the engine?
Unfortunately, this wouldn't work for correctness: we can't rely on "first one found". If more than one file and/or third-party requirement shares the same module name, we must not use any of them, because it is ambiguous which one the user intended. So, we must be confident we have exhausted all possible sources of that particular module name.
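In code terms, the lookup has to have roughly this shape (illustrative only):

```python
from typing import Dict, List, Optional


def infer_owner(module: str, providers_by_module: Dict[str, List[str]]) -> Optional[str]:
    """Return the single provider of `module`, or None if the module is
    unknown or ambiguous (more than one file or requirement provides it)."""
    providers = providers_by_module.get(module, [])
    if len(providers) != 1:
        # Ambiguous or unknown: refuse to guess rather than silently pick one.
        return None
    return providers[0]
```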
Indeed, this is possible! It's a key mechanism we wanted with the Rules API. An example: `src/python/pants/backend/project_info/dependencies.py`, lines 71-85 (at commit c55fd82).
You could have two dedicated rules, one per approach, where each rule takes a different type as its parameter. The rule that creates the first-party module mapping can also directly request the inputs it needs, rather than taking them as parameters.

That does remind me of a tricky requirement of this dependency inference idea, though: we need to support the hook for plugin authors to extend Python dependency inference. For example, Protobuf adds on to the global module mapping here: https://github.com/pantsbuild/pants/blob/master/src/python/pants/backend/codegen/protobuf/python/python_protobuf_module_mapper.py, so that when we encounter an import of a module generated from Protobuf, we know it maps back to the Protobuf target that generates it. That mapper would need to be made lazy too. For the sake of a proof-of-concept it's fine to ignore that for now, but we would need a solution for what that would look like.
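A very rough sketch of that "two rules keyed by different parameter types" shape (all of the request/result names are made up, and the imports reflect my recollection of the 2.x Rules API, so treat this as a starting point):

```python
from dataclasses import dataclass
from typing import Tuple

from pants.engine.rules import collect_rules, rule


@dataclass(frozen=True)
class EagerModuleMapRequest:
    """Ask for the precomputed, repo-wide first-party module mapping."""


@dataclass(frozen=True)
class LazyModuleMapRequest:
    """Ask only for the modules actually imported by the files at hand."""
    modules: Tuple[str, ...]


@dataclass(frozen=True)
class FirstPartyModuleMap:
    # (module name, owning file) pairs; kept deliberately simple here.
    entries: Tuple[Tuple[str, str], ...]


@rule
async def eager_module_map(request: EagerModuleMapRequest) -> FirstPartyModuleMap:
    # Placeholder: the real rule would resolve every target, as today.
    return FirstPartyModuleMap(entries=())


@rule
async def lazy_module_map(request: LazyModuleMapRequest) -> FirstPartyModuleMap:
    # Placeholder: the real rule would probe candidate files per source root
    # for request.modules only.
    return FirstPartyModuleMap(entries=())


def rules():
    return collect_rules()


# A caller would then pick a strategy via the parameter type, e.g.:
#   await Get(FirstPartyModuleMap, EagerModuleMapRequest())
#   await Get(FirstPartyModuleMap, LazyModuleMapRequest(modules=("project.util",)))
```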
Unfortunately, it is not safe. You would use what's in `src/python/pants/source/source_root.py`, lines 276-278 (at commit c55fd82), instead.
Taking a step back: is this still the approach you're taking?
Before spending too much time battling the Rules API's bad error messages, it may be fruitful to sketch out what the algorithm will look like, so that we can help think it through. For example: take a particular import in a particular file and walk through exactly how it would get resolved.
(I hope none of this reads as dismissive - I'm really glad you're exploring this and looking at ways to improve dependency inference! This has the potential to greatly improve Pants.)
Not at all dismissive - I really appreciate the time and depth you're putting into this. I'm trying to balance "doing something useful" with bootstrapping myself on Pants internals and the Rules API, so I suspect many of my questions don't even make sense!

Thinking through this more, I don't believe there's any way to provide dependency inference that doesn't scale with the size of the repository in some way. Currently it scales with O(F), the number of files. My suggestion (and you're correct that we'd need to exhaustively stat files instead of just finding the first) scales as O(S·I), the number of source roots multiplied by the number of first-party imports. The shape of the repository determines which of these is better.

If you're on board with the idea that Pants should support both of these strategies behind a config flag, then I'm happy to look further into implementing the latter. I'm still not entirely convinced myself, though. For instance, when running CI it's obvious that we'll want to checksum all files in the repo at some point during the build; it's only when running locally and invalidating inference that we take the performance hit. So I guess the question is: when / how often do we invalidate inference? Is my understanding of the current implementation correct that inference is invalidated whenever a Python file is added/removed/renamed in the repository (due to validating import ambiguity)?
See https://news.ycombinator.com/item?id=24937228 for some discussion of the costs involved here! In the best case, "finding all modules in the repository" is equivalent to "listing all files in the repository", which should be a sub-second operation for most repositories.
A couple months later...
Yes, that should be the case: only changes to file names (not to content) invalidate the mapping. If we're seeing something different, it would be great to identify and fix.
Agreed. Going to close this issue as not solvable, because we must consider the universe of all targets. But, to be clear, we're very open to performance enhancements and to ensuring the invalidation behaves how we want. Thanks for asking such great questions!
I've found that Pants can run quite slowly in my large Python monorepo. I think this is because there are lots of unnecessary imports that make the transitive closure of dependencies cover a huge portion of the repository.
An example of this is when a core library shows a usage example using application code (perhaps in a main guard). Anyone who now depends on that library also pulls in the application code. Enough of these types of imports and we end up in a pretty bad place...
I know this is less than ideal, so I'm trying to track down where these spurious imports are happening. If the `dependencies --dependencies-transitive` goal showed from/to pairs, that would let me reconstruct the dependency graph and track down the imports.