Exporting project's dependency graph #20242

AlexTereshenkov · 2023-11-25T21:46:55Z

AlexTereshenkov
Nov 25, 2023
Collaborator

Introduction

Pants already provides a rich set of tools to query the dependency graph to find out information about dependencies and dependents of individual build targets (be it a file, a package, or a glob pattern):

dependencies goal lets you list the dependencies:

$ pants dependencies cheeseshop/cli/cli.py                 
cheeseshop/cli/cli.py
cheeseshop/cli/utils/utils.py
cheeseshop/repository/package.py
cheeseshop/repository/properties.py
cheeseshop/repository/query.py
cheeseshop/repository/repository.py
cheeseshop/version.py
cheeseshop:project-version
requirements#click
requirements#loguru

dependents (formerly known as dependees) goal lets you list the dependents (also known as reverse dependencies):

$ pants dependents cheeseshop/repository/parsing/casts.py  
cheeseshop/repository/package.py
cheeseshop/repository/parsing:parsing
tests/repository/parsing/test_casts.py:tests

It is also possible to pass multiple files at once, in which case the results are merged into a single list, e.g.

$ pants dependencies cheeseshop/repository/*.py            
cheeseshop/configs.py
cheeseshop/repository/package.py
cheeseshop/repository/parsing/casts.py
cheeseshop/repository/parsing/exceptions.py
cheeseshop/repository/properties.py
requirements#loguru
requirements#packaging
requirements#requests
requirements#typing-extensions

Motivation

When dependencies are listed for multiple targets at once, such as the individual source files above, the output does not tell you which module in the cheeseshop/repository package depends on what.

Running a Pants goal on each individual file is very inefficient: each invocation of Pants has an overhead, so it's preferable to get all the work done within a single Pants call. It is also possible that a command would be run in an environment without a pantsd process already running and/or any cache available. So even though this works, it will prove to be unreasonably slow for even a medium-sized codebase:

$ for filename in cheeseshop/repository/*.py; do
  echo "--${filename}--"; pants dependencies "${filename}"
done

--cheeseshop/repository/package.py--
cheeseshop/repository/parsing/casts.py
cheeseshop/repository/properties.py
requirements#packaging
requirements#typing-extensions

--cheeseshop/repository/query.py--
cheeseshop/repository/package.py
requirements#packaging

--cheeseshop/repository/repository.py--
cheeseshop/configs.py
cheeseshop/repository/package.py
cheeseshop/repository/parsing/exceptions.py
requirements#loguru
requirements#requests

It would therefore be more helpful to list the dependencies of each file separately, so that they can be told apart, using a new goal that constructs the graph only once:

$ pants <goal> <options> cheeseshop/repository/*.py
{
    "cheeseshop/repository/__init__.py": [],
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
        "requirements#packaging",
        "requirements#typing-extensions"
    ],
    "cheeseshop/repository/properties.py": [],
    "cheeseshop/repository/query.py": [
        "cheeseshop/repository/package.py",
        "requirements#packaging"
    ],
    "cheeseshop/repository/repository.py": [
        "cheeseshop/configs.py",
        "cheeseshop/repository/package.py",
        "cheeseshop/repository/parsing/exceptions.py",
        "requirements#loguru",
        "requirements#requests"
    ],
    "cheeseshop/repository/types.py": []
}

This new goal would return an adjacency-list representation of the dependency graph as a dictionary of lists. The output is JSON compatible, which makes it trivial to filter and query the graph using standard tooling such as jq and the standard library of most programming languages.

More importantly, this data structure can be loaded into third-party tooling such as networkx to query and manipulate the graph; see networkx.convert.from_dict_of_lists:
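To illustrate the "standard library" point, here is a minimal sketch that filters the exported mapping with nothing but Python's json module. The sample data is a trimmed copy of the output above; users_of is a hypothetical helper name, not part of Pants.

```python
import json

# Trimmed copy of the sample output above; in practice this would be read
# from a file such as depgraph.json.
DEPGRAPH_JSON = """
{
    "cheeseshop/repository/__init__.py": [],
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
        "requirements#packaging",
        "requirements#typing-extensions"
    ],
    "cheeseshop/repository/query.py": [
        "cheeseshop/repository/package.py",
        "requirements#packaging"
    ]
}
"""


def users_of(graph, address):
    """Return the sources that list `address` among their direct dependencies."""
    return sorted(src for src, deps in graph.items() if address in deps)


graph = json.loads(DEPGRAPH_JSON)

# Which files use the `packaging` requirement directly?
packaging_users = users_of(graph, "requirements#packaging")

# Which files have no dependencies at all?
leaves = sorted(src for src, deps in graph.items() if not deps)

print(packaging_users)
print(leaves)
```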

$ pants <goal> <options> cheeseshop/repository/*.py > depgraph.json
$ python3
>>> import json
>>> import networkx
>>> with open("depgraph.json") as fh:
...     g = networkx.from_dict_of_lists(json.load(fh), create_using=networkx.DiGraph)
>>> networkx.shortest_path(g, "cheeseshop/repository/query.py", "cheeseshop/repository/properties.py")
['cheeseshop/repository/query.py', 'cheeseshop/repository/package.py', 'cheeseshop/repository/properties.py']

Having the dependency graph exported makes it possible to cheaply answer a variety of useful questions such as:

are there any build targets that no one depends on?
what is the longest path in the graph?
what source module leads to most tests?
what test module has most dependencies?
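As a rough illustration of how cheap the first two questions become, here is a sketch in plain Python (networkx offers equivalents, but this keeps the example dependency-free). The graph is a trimmed, first-party-only version of the sample output above, and unused_targets and longest_path are hypothetical helper names. The longest-path helper assumes the graph has no cycles, which is not guaranteed for real Python import graphs.

```python
from functools import lru_cache

# Trimmed, first-party-only version of the sample adjacency mapping.
graph = {
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
    ],
    "cheeseshop/repository/properties.py": [],
    "cheeseshop/repository/query.py": ["cheeseshop/repository/package.py"],
    "cheeseshop/repository/repository.py": ["cheeseshop/repository/package.py"],
}


def unused_targets(graph):
    """Targets (keys) that never appear in anyone's dependency list."""
    depended_on = {dep for deps in graph.values() for dep in deps}
    return sorted(set(graph) - depended_on)


def longest_path(graph):
    """Longest dependency chain; assumes the graph is acyclic."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        best = (node,)
        for dep in graph.get(node, ()):
            candidate = (node,) + longest_from(dep)
            if len(candidate) > len(best):
                best = candidate
        return best

    return list(max((longest_from(node) for node in graph), key=len))


print(unused_targets(graph))
print(longest_path(graph))
```

Note that "no one depends on" is only meaningful relative to the set of targets that was exported; a target outside that set may still depend on one of them.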

Having the graph exported also opens up the opportunity to visualize the whole graph or its parts using visualization libraries such as graphviz:

$ python3 -m venv .venv && source .venv/bin/activate
$ pip install networkx pydot

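import json
import networkx
from networkx.drawing.nx_pydot import write_dot

# Load the exported adjacency mapping into a directed graph.
with open("depgraph.json") as fh:
    g = networkx.from_dict_of_lists(json.load(fh), create_using=networkx.DiGraph)

# Write the graph out in DOT format for Graphviz.
write_dot(g, "graph.dot")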

$ dot -Tpng graph.dot > graph.png

Having the graph exported into a JSON data structure is enough to be able to perform any query/manipulation with the graph, but for practical reasons, it may be helpful to provide additional functionality available out-of-the-box to avoid forcing users to write additional programs. This could mean:

listing the dependents (reverse dependencies)
listing the dependencies/dependents transitively
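For context, this is roughly the kind of program users would otherwise have to write themselves on top of a dependencies-only export. It is a sketch under the assumption that the export contains every relevant target; invert and transitive are hypothetical helper names.

```python
from collections import deque

graph = {
    "cheeseshop/repository/query.py": [
        "cheeseshop/repository/package.py",
        "requirements#packaging",
    ],
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
    ],
    "cheeseshop/repository/repository.py": ["cheeseshop/repository/package.py"],
}


def invert(graph):
    """Turn a dependencies mapping into a dependents mapping."""
    dependents = {node: [] for node in graph}
    for source, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(source)
    return {node: sorted(sources) for node, sources in dependents.items()}


def transitive(graph, start):
    """Everything reachable from `start` by following edges (excluding `start`)."""
    seen = set()
    queue = deque(graph.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen or node == start:
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return sorted(seen)


direct_dependents = invert(graph)["cheeseshop/repository/package.py"]
transitive_deps = transitive(graph, "cheeseshop/repository/query.py")
transitive_dependents = transitive(
    invert(graph), "cheeseshop/repository/parsing/casts.py"
)
```

The caveat is that dependents computed this way only reflect edges between the targets that were exported, which is one reason to have Pants compute them instead.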

Implementation

Practically, fetching dependencies (direct or transitive) is trivial:

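# Inside an async @rule, where `target` is a Target that has a Dependencies field.
# Direct dependencies of the target:
direct_deps_request_result = await Get(Targets, DependenciesRequest(target[Dependencies]))
deps = [str(d.address) for d in FrozenOrderedSet(direct_deps_request_result)]
...
# Transitive dependencies (the full closure) of the same target:
transitive_deps_request_result = await Get(TransitiveTargets, TransitiveTargetsRequest([target.address]))
dependencies = transitive_deps_request_result.dependencies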

and so is fetching dependents:

dependees = await Get(
    Dependents,
    DependentsRequest(
        (target.address,),
        transitive=True,
        include_roots=False,
    ),
)

Fetching dependencies for multiple targets would likely happen in a single MultiGet call, filling a mapping of build targets to their dependencies, which would be the output of the new goal.

With the naming of the goal and the options being subject to change, this is what the user interface might look like:

# fetching direct dependencies of two files
$ pants dep-graph --dependencies cheeseshop/repository/query.py cheeseshop/repository/package.py                                
{
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
        "requirements#packaging",
        "requirements#typing-extensions"
    ],
    "cheeseshop/repository/query.py": [
        "cheeseshop/repository/package.py",
        "requirements#packaging"
    ]
}

# fetching transitive dependencies of two files
$ pants dep-graph --dependencies --transitive cheeseshop/repository/query.py cheeseshop/repository/package.py
{
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/parsing/exceptions.py",
        "cheeseshop/repository/properties.py",
        "requirements#loguru",
        "requirements#packaging",
        "requirements#python-dateutil",
        "requirements#typing-extensions",
        "requirements/requirements.lock:_python-default_lockfile",
        "requirements/requirements.txt"
    ],
    "cheeseshop/repository/query.py": [
        "cheeseshop/repository/package.py",
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/parsing/exceptions.py",
        "cheeseshop/repository/properties.py",
        "requirements#loguru",
        "requirements#packaging",
        "requirements#python-dateutil",
        "requirements#typing-extensions",
        "requirements/requirements.lock:_python-default_lockfile",
        "requirements/requirements.txt"
    ]
}

# fetching direct dependents of two files
$ pants dep-graph --dependents cheeseshop/repository/query.py cheeseshop/repository/package.py
{
    "cheeseshop/repository/package.py": [
        "cheeseshop/cli/cli.py",
        "cheeseshop/repository/__init__.py",
        "cheeseshop/repository/query.py",
        "cheeseshop/repository/repository.py",
        "tests/repository/test_query.py"
    ],
    "cheeseshop/repository/query.py": [
        "cheeseshop/cli/cli.py",
        "cheeseshop/repository/__init__.py",
        "tests/repository/test_query.py"
    ]
}

# fetching transitive dependents of two files
$ pants dep-graph --dependents --transitive cheeseshop/repository/query.py cheeseshop/repository/package.py
{
    "cheeseshop/repository/package.py": [
        "//:cheeseshop-query-wheel",
        "cheeseshop/cli/__init__.py",
        "cheeseshop/cli/cli.py",
        "cheeseshop/cli:cheeseshop-query",
        "cheeseshop/repository/__init__.py",
        "cheeseshop/repository/query.py",
        "cheeseshop/repository/repository.py",
        "tests/cli/test_cli.py",
        "tests/repository/test_query.py",
        "tests/repository/test_repository.py"
    ],
    "cheeseshop/repository/query.py": [
        "//:cheeseshop-query-wheel",
        "cheeseshop/cli/__init__.py",
        "cheeseshop/cli/cli.py",
        "cheeseshop/cli:cheeseshop-query",
        "cheeseshop/repository/__init__.py",
        "tests/cli/test_cli.py",
        "tests/repository/test_query.py"
    ]
}

Existing implementations

For comparison, the dependency graph export functionality is available in Bazel via the query command; see Display a graph of the result to learn more. The graph can be exported into a variety of formats such as a DOT file or XML. Even though DOT files can be loaded into a graph data structure, for instance using networkx.drawing.nx_pydot.from_pydot, it may be preferable to have the graph available in JSON, as JSON is a lot easier to process with standard operating system tooling than DOT.

Buck2 implements a query command with similar functionality: it can export the dependency graph, listing all paths between nodes or the dependencies of individual targets, with the output format set to either JSON or DOT. Supporting both formats in the new Pants goal may be desirable as well, since it relieves the user of converting the data. However, initially supporting only JSON seems reasonable.

Proof of concept

A plugin was written for Pants 2.16 a few months ago and has been used in production (with only minor adjustments to accommodate corporate needs) since then. See the source code and the published PyPI wheel to install it in a Pants 2.16 repository. Check out this PR's branch to experiment locally.

huonw · 2023-11-25T23:01:06Z

huonw
Nov 25, 2023
Collaborator

Getting good graph introspection is nice!

This seems quite similar to the pants peek goal, just with a subset of the info there. Could you describe a bit how this improves/changes that?

1 reply

AlexTereshenkov Nov 25, 2023
Collaborator Author

Thanks for taking a look! Yes, indeed, the peek goal does produce the dependencies for individual build targets in JSON format. So

{
    "address": "tests/repository/parsing/test_casts.py:tests",
    "target_type": "python_test",
    "batch_compatibility_tag": null,
    "dependencies": [
      "cheeseshop/repository/parsing/casts.py",
      "cheeseshop/repository/parsing/exceptions.py",
      "requirements#python-dateutil",
      "requirements:requirements-test#pytest"
    ],
    "dependencies_raw": null,
    "description": null,
    "environment": "__local__",
    "extra_env_vars": null,
    "interpreter_constraints": null,
    "resolve": null,
    "run_goal_use_sandbox": null,
    "runtime_package_dependencies": null,
    "skip_black": false,
    "skip_docformatter": false,
    "skip_flake8": false,
    "skip_isort": false,
    "skip_mypy": false,
    "skip_tests": false,
    "source_raw": "test_casts.py",
    "sources": [
      "tests/repository/parsing/test_casts.py"
    ],
    "sources_fingerprint": "bdfc88a5f817a631689a90ee725c2e1e33d0f4a52896b168bc133877e8fc44f0",
    "tags": null,
    "timeout": null,
    "xdist_concurrency": null
  }

could be converted into

{
    "tests/repository/parsing/test_casts.py:tests": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/parsing/exceptions.py",
        "requirements#python-dateutil",
        "requirements:requirements-test#pytest"
    ]
}

with a few lines of Python code. However, the peek goal does not provide

dependents of a target (direct/transitive)
transitive dependencies of a target

Arguably, this information can be derived from adjacency lists that contain dependencies only, but this would require every user to write their own script to munge the data.
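For what it's worth, the peek-to-adjacency conversion mentioned above could look roughly like this. It is a minimal sketch that assumes the peek JSON is a list of objects with "address" and "dependencies" keys, as in the excerpt shown (trimmed here); peek_to_adjacency is a made-up helper name.

```python
import json

# Trimmed version of the peek excerpt above, wrapped in a list.
PEEK_OUTPUT = """
[
  {
    "address": "tests/repository/parsing/test_casts.py:tests",
    "target_type": "python_test",
    "dependencies": [
      "cheeseshop/repository/parsing/casts.py",
      "cheeseshop/repository/parsing/exceptions.py",
      "requirements#python-dateutil",
      "requirements:requirements-test#pytest"
    ],
    "dependencies_raw": null
  }
]
"""


def peek_to_adjacency(peek_entries):
    """Keep only the address -> dependencies part of each peek entry."""
    return {
        entry["address"]: list(entry.get("dependencies") or [])
        for entry in peek_entries
    }


adjacency = peek_to_adjacency(json.loads(PEEK_OUTPUT))
print(json.dumps(adjacency, indent=4))
```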

I was hoping that in the future the graph export mechanism could be extended to provide more sophisticated filtering, for instance:

applying a scope (e.g. "list transitive dependencies for all these targets but ignore any dependencies coming from these packages as if the targets didn't depend on them")
applying a query depth (e.g. "list transitive dependencies for all these targets going three steps away only")
etc
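To make the "scope" idea a bit more concrete, here is one possible reading of it as a sketch over the exported mapping: dependencies whose address starts with an ignored prefix are dropped and never traversed. The function name and the prefix-matching rule are assumptions for illustration only, not a proposed Pants interface.

```python
def transitive_ignoring(graph, start, ignored_prefixes=()):
    """Transitive dependencies of `start`, skipping anything under the ignored prefixes."""
    ignored = tuple(ignored_prefixes)
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            # Pretend ignored dependencies do not exist: neither report nor follow them.
            if dep.startswith(ignored) or dep in seen or dep == start:
                continue
            seen.add(dep)
            stack.append(dep)
    return sorted(seen)


graph = {
    "cheeseshop/repository/repository.py": [
        "cheeseshop/configs.py",
        "cheeseshop/repository/package.py",
        "cheeseshop/repository/parsing/exceptions.py",
    ],
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
    ],
    "cheeseshop/repository/parsing/casts.py": [
        "cheeseshop/repository/parsing/exceptions.py",
    ],
}

scoped = transitive_ignoring(
    graph,
    "cheeseshop/repository/repository.py",
    ignored_prefixes=["cheeseshop/repository/parsing/"],
)
```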

This is not something that I plan to ship as part of the new goal addition, but it feels like it would be much more sensible to keep it in a separate goal that deals specifically with the graph. The scope or query depth do not seem to apply well to the peek goal, which is why I propose having a separate goal for this. Bazel and Buck2 provide the query command, which combines Pants' peek and the new goal I suggest adding. Perhaps naming the new goal query would be sensible, too.

Does this help?

kaos · 2023-11-26T01:35:24Z

kaos
Nov 26, 2023
Collaborator

Have you considered adding a new output format option to the dependencies/dependents goals? As you say, you can run pants dependents --transitive src/*.py, only you won't know which entry belongs to which file. However, if we support outputting this in, say, a JSON format, it could be structured precisely as you suggest here, without us having to come up with new goal names for it.

3 replies

AlexTereshenkov Nov 26, 2023
Collaborator Author

Thanks for reading the document! Yes, this is a great option, indeed. The user interface change is minimal; it's just another flag, isn't it? We could do that if folks are uncomfortable about adding a new goal. As you can see from the document, I have a somewhat ambitious plan for what the graph export could look like and what it could support. :)

What I am worried about is that once we have added the --output flag for the dependencies and dependents goals, we'd like to add more sophisticated querying mechanisms (depth, scope, etc.), and I am not sure whether those concepts should be part of those goals. Perhaps they should, though; it's just that I think it may be easier not to overcomplicate the code/logic of those goals and to have a separate goal instead. So I am kind of torn :)

Whatever we decide though, the document provides the motivation for why we want to export this data in the first place, so I am happy with a compromise where we don't add a new goal.

kaos Nov 27, 2023
Collaborator

What I'm most reluctant about is the proliferation of closely related goals. However, if we turn this on its head, what if a new dep-graph goal were the primary goal for dependency-related stuff (on the implementation level, still keeping dependencies/dependents as the high-level, easy-to-use goals for what they do)?

kaos Nov 27, 2023
Collaborator

I think I agree with Benjy's reasoning about keeping the dep graph output focused and concise, and as such I think we can fit it in without too much trouble on the existing goal(s).

benjyw · 2023-11-26T19:47:20Z

benjyw
Nov 26, 2023
Maintainer Sponsor

This data exists (along with a lot of other data) in peek output, no? So maybe this is a new output format on peek?

1 reply

benjyw Nov 26, 2023
Maintainer Sponsor

Oh I see you've addressed this.

benjyw · 2023-11-26T19:49:45Z

benjyw
Nov 26, 2023
Maintainer Sponsor

What is the use case for providing transitive info? I would expect that a simple adjacency list is enough for graph visualization etc?

9 replies

AlexTereshenkov Nov 26, 2023
Collaborator Author

Ditto depth? What problem are we solving?

Depth is not in the list of urgent features to have, but it may be useful.

How can specifying an upper bound on the depth of the search be helpful? If the depth is omitted, the search is unbounded, that is, it computes the entire transitive closure of dependencies. What if you only care about your package dependencies and everything the direct dependencies depend on?

A use case would be to have an interactive exploration of the graph where as you move around the graph, the nearest neighbors are fetched to keep the querying fast and avoid consuming too much memory. Arguably, this won't be relevant for small to medium codebases, but may be vital for large codebases with thousands of build targets and a tight dependency graph.

Graph visualization may also benefit from having a subset of the graph where you can stop at a few steps from your starting point to keep the graph picture sane. With direct dependencies listed, you can visualize the whole graph, but to keep the image reasonably small to explore, you'd need to write custom logic to cut off the dependencies which is what I'd like to avoid.
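The "custom logic to cut off the dependencies" would be something along these lines: a breadth-first walk over the exported mapping that stops after a given number of steps. This is only a sketch; dependencies_within and its max_depth parameter are hypothetical, with None standing in for an unbounded (transitive) search and 1 for direct dependencies.

```python
from collections import deque


def dependencies_within(graph, start, max_depth=None):
    """Map each dependency reachable from `start` to its distance, up to `max_depth` steps."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Do not expand nodes that already sit at the depth limit.
        if max_depth is not None and distances[node] >= max_depth:
            continue
        for dep in graph.get(node, ()):
            if dep not in distances:
                distances[dep] = distances[node] + 1
                queue.append(dep)
    del distances[start]
    return distances


graph = {
    "cheeseshop/cli/cli.py": ["cheeseshop/repository/query.py"],
    "cheeseshop/repository/query.py": ["cheeseshop/repository/package.py"],
    "cheeseshop/repository/package.py": ["cheeseshop/repository/properties.py"],
}

direct = dependencies_within(graph, "cheeseshop/cli/cli.py", max_depth=1)
two_steps = dependencies_within(graph, "cheeseshop/cli/cli.py", max_depth=2)
everything = dependencies_within(graph, "cheeseshop/cli/cli.py")
```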

benjyw Nov 26, 2023
Maintainer Sponsor

That can still be in dependencies/dependents: we can deprecate --transitive and use --depth=9999 or something, and default to --depth=1

kaos Nov 27, 2023
Collaborator

we don't need to deprecate --transitive as that is a nice way to express an unbounded depth, which could be awkward to set using --depth=*

AlexTereshenkov Dec 3, 2023
Collaborator Author

Thank you! I really like the idea of extending the dependencies and dependents goals with the format option - we can add more formats easily, following what Bazel and Buck do (JSON, DOT, XML, etc). merged is great too which would be exactly how we list the results now.

AlexTereshenkov Dec 3, 2023
Collaborator Author

Let's set the depth discussion aside, as this is something we can add much later! I have a ton of other ideas I'd love to add to Pants in terms of querying the dependency graph, so more separate discussions are coming :)

benjyw · 2023-11-26T19:51:14Z

benjyw
Nov 26, 2023
Maintainer Sponsor

If a new goal, I think graph is a good name.

0 replies

AlexTereshenkov · 2023-12-03T20:21:40Z

AlexTereshenkov
Dec 3, 2023
Collaborator Author

Let's settle on the following implementation plan:

Add --format option to the dependencies and dependents goals to export the results into a different format.
Option values:

merged: the current output (all targets in a single list)
json: a mapping {"input dependency": ["its dependencies"]}

More supported formats will be added later.

The --transitive flag is respected, just as every other current option.
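To spell out how the two formats relate, here is a small sketch that renders the same per-target mapping either way. It is an illustration in plain Python, not the Pants implementation; render is a hypothetical function, and treating merged as the sorted union of all dependency lists is my reading of "all targets in a single list".

```python
import json


def render(mapping, fmt):
    """Render a {target: [dependencies]} mapping in the requested format."""
    if fmt == "json":
        return json.dumps(mapping, indent=4, sort_keys=True)
    if fmt == "merged":
        # Union of every dependency list, one address per line.
        merged = sorted({dep for deps in mapping.values() for dep in deps})
        return "\n".join(merged)
    raise ValueError(f"unsupported format: {fmt}")


mapping = {
    "cheeseshop/repository/package.py": [
        "cheeseshop/repository/parsing/casts.py",
        "cheeseshop/repository/properties.py",
        "requirements#packaging",
        "requirements#typing-extensions",
    ],
    "cheeseshop/repository/query.py": [
        "cheeseshop/repository/package.py",
        "requirements#packaging",
    ],
}

merged_output = render(mapping, "merged")
json_output = render(mapping, "json")
print(merged_output)
```

The json output keeps the information about which input each dependency belongs to, while merged collapses it, which is exactly the limitation described in the Motivation section.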

Do we have a consensus? :) @kaos @benjyw

1 reply

benjyw Dec 4, 2023
Maintainer Sponsor

Sounds great to me! And incidentally this should also fix #20181, while we're at it.
