Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call tree analysis with graph databases (e.g. Neo4j) v2 #3128

Merged
merged 1 commit into from
Sep 21, 2021

Conversation

galderz
Copy link
Contributor

@galderz galderz commented Jan 18, 2021

This PR is a second version of the one proposed here.

It enhances the print analysis call tree to generate extra files that make it easy to import into graph databases such as Neo4j. It generates the following files:

  • CSV file with all methods.
  • CSV file with all virtual methods.
  • CSV file with all direct call relationships.
  • CSV file with all virtual call relationships.
  • CSV file with all overriden by relationships.
  • Neo4j Cypher that loads all the CSV files.

To load the data into Neo4j, just call:

cp ${REPORTS_DIR}/*.csv ${NEO_HOME}/import
time cat $(find ${REPORTS_DIR} -name '*.cypher') | ${NEO_HOME}/bin/cypher-shell -u neo4j -p xxx

By loading the method universe into a graph database, issues such as infinispan/infinispan-quarkus#44 can be more easily debugged. The following query can help find the Infinispan source code path(s) that lead to the problematic location:

match p=((m:Method {name: "cloneKey"}) <- [r*1..4] - (n:Method))
with p
where last(nodes(p)).type starts with "org.infinispan" 
return nodes(p), relationships(p)

The Neo4j browser would present the output of this query in this way:

Screen Shot 2021-01-18 at 11 46 08

Other interesting queries for this use case are:

  • Find a method name and all its incoming relationships. This is to find out what brings a given method into the universe:
match (m:Method {name: "cloneKey"}) <- [r] - (o) return *
  • Find a methods incoming relationships 2 layers deep:
match (m:Method {name: "cloneKey"}) <- [r*1..2] - (o) return *
  • Find a method with with its incoming relationships, filtered by only its DIRECT and VIRTUAL incoming relationships:
match (m:Method {name: "cloneKey"}) <- [:DIRECT|:VIRTUAL] - (o) return *

This PR improves on the previous PR:

  • Adding an extra display field has been added that can be selected as field to show in the Neo4j browser. This field is a shortened version of the package+class name+method name. The package is shortened with a letter per package, the class name is built out of the upper case letters (e.g. camel case) and the method name is included as is. This field helps share info with others more easily. With a live Neo4j instance, it is possible to click on each node and get more info.
  • It's functionality complete so no outstanding work remains.
  • Loads fast into Neo4j. Even for complex applications, such as Infinispan server based on Quarkus (~150k nodes, 450k relationships), it takes less than 30 seconds to load data.

At this stage I'd like to hear feedback from other GraalVM contributors (@christianwimmer @olpaw ?) on the proposal above as well as the following points:

  1. Can piggy back on the existing configuration option, or a new one is required?
  2. Can we keep all files along with existing report under the reports folder...etc?
  3. Where should this be documented?

@zakkak
Copy link
Collaborator

zakkak commented Jan 25, 2021

Hi @galderz I have been experimenting with this, and I noticed the following.

When navigating the txt call tree it's trivial to follow the call chain back to the root, while in the neo4j graph it is not.
In theory one could get the call chain from the root down to the method they are interested in by running a cypher query like this:

match p= (vm:VM) - [*] -> (n:Method {name: "<init>", type: "sun.java2d.cmm.lcms.LCMS"})
return p

This however ends up being extremely heavy, and the reason is that in neo4j we model the call tree as a graph and not as a tree.
In neo4j each node represents a method which may be invoked by multiple other methods and may invoke multiple methods in its turn.
On the contrary, in the txt format each line (tree node) represents a specific call site of a method which has a single incoming edge (so there is a single parent and thus it's trivial to walk up the tree).

So to illustrate this we get the following graphs:

Graph corresponding to the txt representation

txt

Graph corresponding to the neo4j representation

neo4j

In the former case it's trivial to find the callers of C (D1, B1, and A), while in the latter case there is not enough info to know how we reached C by just walking up from C. As real graphs are significantly bigger and more complex than this example finding the path between A and C becomes increasingly heavy, and thus impractical.

The trivial approach to solve this would be to replicate the txt representation and have a node per line, this however doesn't allow one to visit a method and see all the possible incoming and outgoing edges to and from that method. Instead they will get multiple nodes which might not be ideal for some queries/use-cases.

WDYT?

@d-kozak
Copy link
Contributor

d-kozak commented Jan 28, 2021

In general, I really like this idea. I believe that having both automated and visual way of processing call trees, points-to analysis results, class initialization chains etc is the way to go. One possible problem I can see is having neo4j specific code in native image codebase. One way to avoid this would be to have a unified tool-friendly output format for all relevant debug data, which would be precisely specified and documented. What do you think about it, @christianwimmer ?

@schauder
Copy link

One possible problem I can see is having neo4j specific code in native image codebase. One way to avoid this would be to have a unified tool-friendly output format for all relevant debug data, which would be precisely specified and documented.

It just generates a Cypher script, right? And Cypher is standardised by now: https://en.wikipedia.org/wiki/Cypher_(query_language)

@schauder
Copy link

@zakkak Aren't W, X and Z callers too, which you are likely to miss, due to the distributed nature of the txt representation?

Probably I'm misunderstanding something. Clarification would be appreciated.

@zakkak
Copy link
Collaborator

zakkak commented Jan 28, 2021

@zakkak Aren't W, X and Z callers too ...?

B, W, X and Z are all indeed callers of D.
Now assume that D contains an if statement that when its condition is true it invokes C.
When D is invoked from B the condition happens to be true and thus C is invoked, however when invoked from W, X, or Z that condition is false thus C doesn't get invoked.
So we know that there is no call-chain linking W,X, or Z to C and thus we are not interested in them.

... which you are likely to miss, due to the distributed nature of the txt representation?

In the distributed representation the info is still there but you get multiple graphs instead of a single one for the same method. If for example you want to find the callers of D you will get the following graphs as a result:

  1. A -> B1 -> D1
  2. W -> D2
  3. X -> D3
  4. Z -> D4

HTH

@schauder
Copy link

@zakkak Interesting, then I misunderstood the format all the time (that's for two days ;-). Could you share two txt snippets, one demonstrating the situation from your example, and one where the calls from W, X and Z actually (might) call C as well?

If this is going to far off topic feel free to recommend a different channel

@christianwimmer
Copy link

Thanks for the PR, using a generic graph database is certainly interesting. There is also the GraalVM specific "Dashboard" dump format for the points-to information, see

Regarding the choice of graph database: It should not be specific to just Neo4J. Instead, there should be a separate option to enable that dumping, which takes a specific database name as the parameter. So you would say something like -H:ExportAnalysisCallTree=neo4j, and adding an abstraction so that different handlers can

  1. change what Cypher code gets emitted (since that is completely dependent on the graph databse), and
  2. change which column names are emitted for the CSV files (because different graph databases use keywords different to StartId and EndId).

@zakkak
Copy link
Collaborator

zakkak commented Jan 28, 2021

@zakkak Interesting, then I misunderstood the format all the time (that's for two days ;-). Could you share two txt snippets, one demonstrating the situation from your example, and one where the calls from W, X and Z actually (might) call C as well?

In the process of doing this I stumbled on #3166 (so I am afraid I can't do it till it gets sorted out)

Apparently I had too much faith in the static analysis and it looks like I was wrong about being able to walk up the tree in the text representation (I didn't know how to read the call-tree properly). So it looks like my suggestion is void :(.

@schauder thanks for pushing me in the right direction :)

@zakkak
Copy link
Collaborator

zakkak commented Jan 28, 2021

For the record.
I was assuming that since LCMS.<init> appears only one time in the txt call-tree, there is also only a single call-chain leading to it.

However further inspection (now that I know what to look for) suggests that I was wrong.

The text below is part of the calltree that leads to LCMS.<init>:

...
│   │       │           │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id=23885 @bci=15
│   │       │           │   │   │   │   ├── directly calls java.awt.color.CMMException.<init>(java.lang.String):void id=28915 @bci=93
│   │       │           │   │   │   │   │   └── directly calls java.lang.RuntimeException.<init>(java.lang.String):void id-ref=738 @bci=2
│   │       │           │   │   │   │   ├── directly calls java.lang.Class.forName(java.lang.String):java.lang.Class id-ref=1093 @bci=35
│   │       │           │   │   │   │   ├── directly calls java.lang.Class.getConstructor(java.lang.Class[]):java.lang.reflect.Constructor id-ref=6414 @bci=44
│   │       │           │   │   │   │   ├── directly calls java.lang.reflect.Constructor.newInstance(java.lang.Object[]):java.lang.Object id-ref=6415 @bci=51
│   │       │           │   │   │   │   ├── directly calls java.security.AccessController.doPrivileged(java.security.PrivilegedAction):java.lang.Object id-ref=1088 @bci=21
│   │       │           │   │   │   │   ├── directly calls java.security.AccessController.doPrivileged(java.security.PrivilegedAction):java.lang.Object id-ref=1088 @bci=108
│   │       │           │   │   │   │   ├── directly calls sun.java2d.cmm.CMMServiceProvider.getColorManagementModule():sun.java2d.cmm.PCMM id=28916 @bci=75
│   │       │           │   │   │   │   │   ├── virtually calls sun.java2d.cmm.CMMServiceProvider.getModule():sun.java2d.cmm.PCMM @bci=7
│   │       │           │   │   │   │   │   │   └── is overridden by sun.java2d.cmm.lcms.LcmsServiceProvider.getModule():sun.java2d.cmm.PCMM id=33250
│   │       │           │   │   │   │   │   │       └── directly calls sun.java2d.cmm.lcms.LCMS.getModule():sun.java2d.cmm.PCMM id=37085 @bci=0
│   │       │           │   │   │   │   │   │           ├── directly calls java.security.AccessController.doPrivileged(java.security.PrivilegedAction):java.lang.Object id-ref=1088 @bci=17
│   │       │           │   │   │   │   │   │           ├── directly calls sun.java2d.cmm.lcms.LCMS$1.<init>():void id=39949 @bci=14
│   │       │           │   │   │   │   │   │           ├── directly calls sun.java2d.cmm.lcms.LCMS.<init>():void id=39950 @bci=34
│   │       │           │   │   │   │   │   │           └── directly calls sun.java2d.cmm.lcms.LCMS.initLCMS(java.lang.Class, java.lang.Class, java.lang.Class):void id=39951 @bci=27
│   │       │           │   │   │   │   │   │               └── directly calls java.lang.UnsupportedOperationException.<init>(java.lang.String):void id-ref=939 @bci=6

If we take the id of the top line (23885) and search for in the call-tree txt file we get:

│   │       │   │       │       │           │           │   │   │   │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=65 
│   │       │   │       │       │           │           │   │   │   │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=67 
│   │       │   │       │       │           │           │   │   │   │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=54 
│   │       │   │       │       │           │           │   │   │   │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=54 
│   │       │   │       │       │           │           │   │   │   │   │   │   │   │   │       ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=22 
│   │       │   │       │       │           │           │   │   │   │       │       ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=9 
│   │       │           │   │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=0 
│   │       │           │   │   │   ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id=23885 @bci=15 
│   │       │           │   │   │                       ├── directly calls sun.java2d.cmm.CMSManager.getModule():sun.java2d.cmm.PCMM id-ref=23885 @bci=100

Which indicates that there are at least 9 call chains leading to LCMS.<init> and it's safe to assume that moving up the tree other ids are also being cross-referenced, leading to even more call chains (in an exponential manner!?).

@hongsup2
Copy link

I'm very happy to see that people in Java Community see the power of graph visualization and analysis, as I have been in the field for more than 10 years, advocating it for many different use cases.

I have a general concern, however, about this PR containing vendor-specific codes. More specifically, the PR produces not only data files (as CSV files) but also a loader script for a specific database: Cypher for Neo4J. The CSV data files are good. CSV is a popular way for exporting/importing data that are widely supported by virtually every graph database or library.

The Cypher loading script, however, is very vendor-specific. Cypher is a proprietary language for Neo4J and does not work for other graph databases -- AWS Neptune, TigerGraph, Oracle Property Graph, Stellagraph, dGraph,, etc ... Note that those other systems can indeed load the CSV files and execute the example graph queries without problems, but only with its own different syntax.

Although @christianwimmer proposed to have a abstraction layer for vendor-specific code, I am not sure that is the best way to resolve this. Sure, I can write a code to generate SQL*Loader code to dump the data into Oracle Property Graph. Somebody in TigerGraph can come and write GSQL script to do the same. Neptune person can have another go ....

But is this really worth to keep all of them in GraalVM? What if some vendor goes out of business in future? What if some vendors make changes to their query language?

I think it best that the GraalVM only produces CSV datafiles. To be more precise, we just remove "printCypher" part of this PR. Every graph database vendor (or it supporters) can write up its own blog articles on how their system can load up call-tree CSV files, run graph queries and do visualization with their own syntax.

What do you think?

@michael-simons
Copy link

Hey @hongsup2 even though I am a Neo4j employee I would agree with your statement and I would like to add a couple of thoughts:

The CSV files are versatile and highly useful, not only for the teams at RedHat, but also on Neo4j and VMWare working with similar tasks for Spring Data.

What @galderz managed to create with the loader script is great and of course not in vain. The loader script can of course be somewhere referenced ("one way of doing it…").

In addition to the vendor specifics, a different analysis might need the graph in a different shape, as already has been pointed out here.

@michael-simons
Copy link

  1. change which column names are emitted for the CSV files (because different graph databases use keywords different to StartId and EndId).

Well those are arbitrary names anyway, also for our product.

@galderz
Copy link
Contributor Author

galderz commented Feb 1, 2021

Thanks everyone for the input. It seems to me the best option would be to just produce generic CSV. Then, as part of documentation, show how one would do import it to say, Neo4j.

My only doubt is then, where the logic to do this should live...

Thanks for the PR, using a generic graph database is certainly interesting. There is also the GraalVM specific "Dashboard" dump format for the points-to information, see

@christianwimmer Any reason to keep the call tree analysis output and logic separate from the dashboard that you mention? Should be used one instead of the other? It's the first time I've heard of the dashboard and json/bgv outputs available there. Where should the code that produces the CSV files be generated from? From inside the CallTreePrinter when PrintAnalysisCallTree is passed? Or from the dashboard?

@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch from 486a119 to 5c63f6c Compare February 1, 2021 13:11
@christianwimmer
Copy link

Where should the code that produces the CSV files be generated from? From inside the CallTreePrinter when PrintAnalysisCallTree is passed? Or from the dashboard?

That depends on the level of detail that you want your output to have. If you are only interested in call edges, then CallTreePrinter is the simpler and easier to use place. But keep in mind that this code has been written as a quick-and-dirty tracing output for developers - feel free to restructure it if you want.

The dashboard output provides much more detailed information about why a certain call graph edge is present. If you feel like you are going in that direction and level of detail, then extending the dashboard output is better.

The Dashboard itself is linked from https://www.graalvm.org/docs/tools/ if you want to try it. Especially the static analysis output is still quite experimental though.

@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch from 8817f05 to 0b81c76 Compare February 15, 2021 09:22
@galderz
Copy link
Contributor Author

galderz commented Feb 15, 2021

Played around with the dashboard but it didn't feel the right place. Instead I've moved the code into CallTreePrinter and removed the cypher generation, so it only generates the CSV files now. I've updated the PR.

The next step would be to document this and show how to use it with an example graph db, e.g. neo4j. Where would that go?

@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch 2 times, most recently from c9f70ba to 140920a Compare March 1, 2021 07:34
@cstancu
Copy link
Member

cstancu commented Mar 6, 2021

@zakkak the reports documentation has some details that should clarify the confusion around the call tree textual representation, especially the logic behind the use of id=<method-id> and id-ref=<method-id>.

@galderz
Copy link
Contributor Author

galderz commented Mar 15, 2021

@cstancu @christianwimmer Can this PR be integrated unless you have specific objectives on where to put documentation about using the CSV files? As said above, this documentation is mostly graph tool specific (e.g. how to combine get the csv files imported into neo4j). Is there place for that? @michael-simons do you have any ideas?

@galderz
Copy link
Contributor Author

galderz commented Mar 15, 2021

@cstancu Is Reports.md the best place for the explaining the csv files? Would you also add graph tool specific instructions there?

@galderz
Copy link
Contributor Author

galderz commented Mar 23, 2021

I've added a commit to add some documentation to Reports.md. Can someone please give this a final review? @christianwimmer @cstancu?

@Sanne
Copy link
Contributor

Sanne commented Jun 6, 2021

Just a reminder about this - I am once again looking into call graphs, and it's really hard to crack with the current text file format as I'm looking into non-trivial applications: just the section listing all possible code places overriding toString and equals weighs in at a multiple Gigabytes.

Clearly I have a problem of "too large" - but that's what I'm aiming to fix: as it often happens, a minor change in our code or a minor upgrade in GraalVM can suddenly trigger a significant additional overhead, and when this happens the tooling we have to track it down makes it tricky.

I'm rebuilding a custom fork to include this, but would realy love if this could be unnecessary.

@Sanne
Copy link
Contributor

Sanne commented Jun 8, 2021

I've been using this all day yesterday and it's amazing.

It would have been far better to keep the ability to generate cypher scripts though - I needed to re-import data multiple times to track a complex issue, and this idea of having only CSVs is very time consuming. Could we at least produce reports with a constant name, or have an option for that?

Currently the produced filenames encode both the base name of the built binary and timestamps - scripting this to get in an efficient workspace is very inconvenient.

@zakkak
Copy link
Collaborator

zakkak commented Jun 8, 2021

FWIW: @Sanne you might find #3227 useful as well.

@christianwimmer christianwimmer requested a review from cstancu June 22, 2021 05:41
@cstancu cstancu self-assigned this Jun 22, 2021
}

ReportUtils.report("call tree for vm entry point", path + File.separatorChar + "reports", "csv_call_tree_vm_" + reportName, "csv",
CallTreePrinter::printVMEntryPoint);
Copy link
Member

@cstancu cstancu Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly you need all the .csv files to inspect the graphs. In that case using this report method is not ideal as it will timestamp each file (as @Sanne also noticed), potentially with a different timestamp. You could instead use com.oracle.graal.pointsto.reports.ReportUtils#report(java.lang.String, java.nio.file.Path, java.util.function.Consumer<java.io.PrintWriter>) and provide a stable name for each file. To keep files organized you can write all .csv files for a run into a directory and timestamp the directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to follow convention with the timestamp, but indeed as @Sanne noted, it makes scripting complicated. Timestamp the directory and you have the same issue. I think the report folder is just fine as is, but we can use symlinks to link to the latest csv files, e.g. report/csv_call_tree_vm.csv links to report/csv_call_tree_vm_123457890. Whenever a new execution happens, the symlink is updated to the latest one. This would help with scripting providing stable paths.

@@ -147,3 +147,16 @@ The call tree report name has the structure `call_tree_<image_name>_<date_time>.
The object tree report name has the structure: `object_tree_<image_name>_<date_time>.txt`.
The image name is the name of the generated image, which can be set with the `-H:Name=<name>` option.
The `<date_time>` is in the `yyyyMMdd_HHmmss` format.

#### CSV files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Reports.md has moved to ../docs/reference-manual/native-image/Reports.md. Can you please rebase and add this documentation there?

@cstancu
Copy link
Member

cstancu commented Jun 22, 2021

@galderz please see my comments above. When those are solved I think this is ready to merge.

@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch 2 times, most recently from 95e5ae3 to 8896e6f Compare June 29, 2021 11:41
@galderz
Copy link
Contributor Author

galderz commented Jun 29, 2021

@cstancu I've rebased, the documentation should be in the right place.

I've also gone ahead and implemented the symbolic link idea to provide stable paths for latest csv files. This should make it easier to script graph db vendor specific import steps.

@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch from 8896e6f to 5f09a8b Compare July 19, 2021 14:01
@galderz
Copy link
Contributor Author

galderz commented Jul 19, 2021

There was a conflict with a recent commit. I've fixed that now. @cstancu can you have a look?

@galderz
Copy link
Contributor Author

galderz commented Jul 19, 2021

Hmmm, hold on. I can't see the symbolic link changes any more...

@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch from 5f09a8b to db1420d Compare July 19, 2021 14:24
@galderz
Copy link
Contributor Author

galderz commented Jul 19, 2021

Fixed. I had fixed it in the wrong place.

@galderz
Copy link
Contributor Author

galderz commented Aug 18, 2021

Any updates on this? @cstancu?

* Methods and virtual methods are represented with graph nodes.
* Direct, virtual and overriden-by relationships have been mapped.
* Bytecode indexes are part of the relationships.
* A method can interact with others multiple times,
with each bytecode index indicating the origin of the call
with the origin method.
@galderz galderz force-pushed the t_call_tree_cypher_v3_csv branch from db1420d to a641ae4 Compare August 25, 2021 06:53
@galderz
Copy link
Contributor Author

galderz commented Sep 20, 2021

Your monthly reminder: any updates on this?

@cstancu
Copy link
Member

cstancu commented Sep 20, 2021

Sorry for the tardiness in my reply, I will have it integrated shortly!

@graalvmbot graalvmbot merged commit 477fb02 into oracle:master Sep 21, 2021
@Sanne
Copy link
Contributor

Sanne commented Sep 21, 2021

Awesome, thanks all!

@galderz
Copy link
Contributor Author

galderz commented Sep 22, 2021

Thx @cstancu!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants