
Create a way to export the query execution plan #278

Closed
newgene opened this issue Sep 8, 2021 · 23 comments · Fixed by #405 or biothings/bte_trapi_query_graph_handler#89
Labels
enhancement New feature or request


@newgene
Member

newgene commented Sep 8, 2021

Similar to the EXPLAIN option in many database servers, can we create an optional explain parameter, which will return only the planned API request execution plan, without actually retrieving the API responses?

Before we actually implement this, we need to check whether the current logs output can already serve this purpose (though I personally find it very hard to navigate the entire set of log messages).

It may be a bit complicated with the new iterative query execution algorithm, but the execution plan can be made without going through that process: just output the possible API requests for each edge (without specifying the actual execution order).

@newgene newgene added the enhancement New feature or request label Sep 8, 2021
@newgene
Member Author

newgene commented Sep 8, 2021

Assigning this to @tokebe and @marcodarko. We can start with @tokebe looking into what's needed in the call-apis package, then work with @marcodarko to see if anything is needed from the new query algorithm code in the query_graph_handler package.

@tokebe
Member

tokebe commented Sep 21, 2021

@newgene @marcodarko On the call-apis side of things, all that's really needed is a new class function or two which returns the created axios query configs, which can then either be returned as-is or transformed for clarity in the response. I could make a PR with that for @marcodarko to look at, and then we can go from there?
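A rough sketch of what that could look like (class and method names here are illustrative, not the actual call-apis API):

```javascript
// Illustrative sketch only -- not the real call-apis implementation.
// The idea: expose a method that returns the assembled axios-style
// request configs instead of executing them.
class APIQueryDispatcher {
  constructor(edges) {
    this.edges = edges; // metaKG edges resolved for the query (assumed shape)
  }

  // Build an axios-style config for one edge without sending it.
  _buildConfig(edge) {
    return {
      method: edge.method || "get",
      url: edge.server + edge.path,
      params: edge.params || {},
    };
  }

  // New: return every planned request config for inspection.
  getQueryConfigs() {
    return this.edges.map((edge) => this._buildConfig(edge));
  }
}

const dispatcher = new APIQueryDispatcher([
  { server: "https://mychem.info", path: "/v1/query", params: { q: "CHEMBL744" } },
]);
const configs = dispatcher.getQueryConfigs();
console.log(configs[0].url); // → "https://mychem.info/v1/query"
```

The configs could then be returned as-is or transformed for clarity before being placed in the response.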

@colleenXu
Collaborator

My understanding is that the algorithm actually decides the order of execution using the number of IDs at each step (as nodes get populated after one-hops).

It would change the algorithm to decide based on the number of operations possible between hops (without considering the number of IDs?).

@colleenXu
Collaborator

Definitely related: biothings/bte_trapi_query_graph_handler#72

@tokebe
Member

tokebe commented Jan 12, 2022

Relevant code:

  • _edgeSupported runs through the query edges (assuming each edge returns results) to ensure that smartapi edges exist for each query edge. This general pattern could be extended to implement this issue.
  • query.js in call-apis appears to have some implementation regarding dryrun_only, which may be relevant/worth extending for this purpose.
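That pattern could be extended into a dry run along these lines (a minimal sketch with illustrative names and shapes, not the real query_graph_handler internals):

```javascript
// Sketch: walk the query edges, collect matching smartapi/metaKG
// operations, and return the plan instead of executing sub-queries.
// All names and object shapes here are assumptions for illustration.
function dryRunPlan(queryEdges, metaKG) {
  const plan = [];
  for (const qEdge of queryEdges) {
    const ops = metaKG.filter(
      (op) => op.subject === qEdge.subject && op.object === qEdge.object
    );
    if (ops.length === 0) {
      throw new Error(`No smartapi edges support query edge ${qEdge.id}`);
    }
    plan.push({ edge: qEdge.id, apis: ops.map((op) => op.api) });
  }
  return plan; // stop here: no sub-queries are actually sent
}

const plan = dryRunPlan(
  [{ id: "e00", subject: "biolink:Gene", object: "biolink:SmallMolecule" }],
  [{ subject: "biolink:Gene", object: "biolink:SmallMolecule", api: "MyChem.info API" }]
);
console.log(plan);
```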

@ericz1803
Contributor

@tokebe was right, _edgeSupported already sort of does a dry run by running through the query edges and getting the smartapi edges. Using this, I logged info about each edge as well as the individual queries made.

@newgene @colleenXu @tokebe Is this the format we are looking for, and is there any other info that should be logged?

...
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "Edge e02: biolink:Gene > biolink:SmallMolecule",
    "code": null
},
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "APIs called: Multiomics Wellness KP API,BioThings SEMMEDDB API,MyChem.info API,Text Mining Targeted Association API,BioThings DGIdb API,Automat Pharos (trapi v-1.2.0),Automat Cord19 (trapi v-1.2.0),Automat HMDB (trapi v-1.2.0),Automat Gtopdb (trapi v-1.2.0),Automat Hetio (trapi v-1.2.0),Automat CTD (trapi v-1.2.0) (181 queries total)",
    "code": null
},
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "Multiomics Wellness KP API: Gene > correlated_with > SmallMolecule",
    "code": null
},
...

@tokebe
Member

tokebe commented Jan 31, 2022

@ericz1803 Just to clarify, the logs will list all APIs queried for a query edge, and then for each of those APIs list the specific sub-query edge? This sounds pretty good to me. @newgene and @colleenXu may have other specific intentions/requirements.

I definitely think using the existing TRAPI log format is the best way to do this as it keeps our response nice and TRAPI-compliant. I'd suggest changing the log levels to INFO to reflect that this is information specifically for/requested by the user, as it should make parsing out the execution plan logs from any potential debug logs simpler.
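Assuming the standard TRAPI log shape shown earlier in this thread, separating the plan logs out would then be as simple as filtering on level:

```javascript
// Sketch: pull execution-plan logs (INFO) out of a TRAPI-style log
// array using the {timestamp, level, message, code} shape shown above.
const logs = [
  { timestamp: "2022-01-31T07:12:20.195Z", level: "DEBUG", message: "internal detail", code: null },
  { timestamp: "2022-01-31T07:12:20.195Z", level: "INFO", message: "Edge e02: biolink:Gene > biolink:SmallMolecule", code: null },
];

// Keep only the user-facing plan logs.
const planLogs = logs.filter((log) => log.level === "INFO");
console.log(planLogs.map((log) => log.message));
```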

@colleenXu
Collaborator

In the example above, it looks like all 181 sub-queries will have a log describing the semantics of the operation each is doing, which feels like too much.

Some thoughts:

  • I thought the main goal was to know the query-execution plan WITHOUT actually waiting the seconds to minutes for the whole query to execute.
  • BTE does walk through the execution to make sure we have operations/metaKG edges for each edge. The goal is then to ONLY DO THIS STEP AND THEN STOP EXECUTION, and then probably give a TRAPI-compliant response with an empty object for message.knowledge_graph.nodes / message.knowledge_graph.edges and an empty list for message.results. The info we'd want would be in the logs.
  • The logs would at minimum give the edges in order of execution (this can be in 1 log total or 1 log per edge) and whether each edge is executed in the reverse direction or not.
  • I think it's not necessary, but nice, to include logs for:
    • which APIs aren't available during execution (one of the early current logs)
    • which APIs are used for sub-querying during each edge's execution / the number of sub-queries possible

@tokebe
Member

tokebe commented Jan 31, 2022

Just to respond to what I can here:

  • This doesn't appear to actually execute the queries; it just assembles them and logs the planned API and edge -- I should have been clearer in my previous comment.
  • I agree that it may be a bit much to list every separate API call and its association to the query, however it could prove valuable in some debugging cases. Perhaps we might want to include an additional parameter to specify verbosity in that respect?
    • We definitely, at minimum, want the edge execution order, whether it's reversed, AND the list of APIs to be called, as that's the minimum original intent of this issue. I think some way for the user to specify whether they want the individual sub-query plans would be a significant nice-to-have.

@ericz1803
Contributor

ericz1803 commented Feb 1, 2022

Addressing a few things:

I thought the main goal was to know the query-execution plan WITHOUT actually waiting the seconds-minutes for the whole query to execute.

@colleenXu Yup, just to clarify, it doesn't actually execute the query; it just gets the APIs for each edge.

which APIs are used for sub-querying for each edge's execution / number of sub-queries possible

I only included the log for one of them, but for each edge, it logs every sub-query it will make as well.

which APIs aren't available during execution (one of the early current logs)

This message is still logged (I just didn't include it in the snippet)

I'd suggest changing the log levels to INFO

Will do for sure

Perhaps we might want to include an additional parameter to specify verbosity in that respect?

@tokebe Do you know if the logging system right now has levels of verbosity? I could put the execution summary at the INFO level and the subquery stuff at a level above INFO and the user could choose which level of verbosity they want.

We definitely, at minimum, want the edge execution order, if it's reversed, AND the list of APIs to be called, as that's the minimum original intent of this issue.

  • list of APIs
  • reversed: I'll add this to the logs (easy to do)
  • edge execution order: I'll implement @colleenXu's idea of doing entity count based on number of APIs in an edge (shouldn't be too difficult)

@colleenXu
Collaborator

colleenXu commented Feb 1, 2022

@ericz1803 I'm a bit confused. Comparing my comment to this #278 (comment), my perspective is:

  • Jackson and Eric have both confirmed that this work will not involve actually executing the whole query. Great, that was just a check.
  • To confirm, we're discussing edge-management (ex: that e2 (reverse) is executed first, and then e1), and we aren't discussing sub-queries (the order in which APIs are queried).
    • I thought this example log already showed "which APIs are used for sub-querying for each edge's execution / number of sub-queries possible". Perhaps the list of APIs and the "181 queries total" phrase are referring to the entire query and not a single QEdge?
{
    "timestamp": "2022-01-31T07:12:20.195Z",
    "level": "DEBUG",
    "message": "APIs called: Multiomics Wellness KP API,BioThings SEMMEDDB API,MyChem.info API,Text Mining Targeted Association API,BioThings DGIdb API,Automat Pharos (trapi v-1.2.0),Automat Cord19 (trapi v-1.2.0),Automat HMDB (trapi v-1.2.0),Automat Gtopdb (trapi v-1.2.0),Automat Hetio (trapi v-1.2.0),Automat CTD (trapi v-1.2.0) (181 queries total)",
    "code": null
},

@colleenXu
Collaborator

colleenXu commented Feb 1, 2022

maybe this will help? The pink arrows show what I'm thinking about regarding "edge execution order".
[Screenshot: query graph with pink arrows indicating edge execution order]

And this is the query I was using (it takes about 1 min 17 sec to run). It's a simpler version of D2 and might be good for testing this functionality. The C2/C3 queries are also good for testing but are much more complicated.


{
  "message": {
    "query_graph": {
      "edges": {
        "e00": {
          "subject": "n00",
          "object": "n01"
        },
        "e01": {
          "subject": "n01",
          "object": "n02"
        }
      },
      "nodes": {
        "n00": {
          "ids": ["DRUGBANK:DB00215"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n01": {
          "categories": ["biolink:Gene"]
        },
        "n02": {
          "ids": [
              "DOID:5844", 
              "DOID:1936", 
              "DOID:3393"
            ],
          "categories": ["biolink:Disease"]
        }
      }
    }
  }
}

@tokebe
Member

tokebe commented Feb 1, 2022

Regarding edge execution order and reversed status: As an example, _edgesSupported() already does the correct order by using an instance of EdgeManager and leveraging manager.getNext(), which handles order (See these lines). BTE does already log the edge execution and whether it's reversed in the main edge execution here (as shown in @colleenXu's screenshot), so in theory all that would need to be done is for @ericz1803 to copy that log into his code that gets the execution plan as edges are being checked.

@ericz1803
Contributor

Just made a pull request based on these comments.

Some things that might be worth discussing still:

  • I didn't address this with my PR, but with the current LogEntry class/system, there is no way to choose which log level to log. We should consider switching to an out-of-the-box logging solution or beefing up our existing one.
  • With more query parameters, is there anywhere we are documenting them? Also, it can be kind of tedious and error-prone to manually edit the smartapi.yaml file every time we add a new query parameter.

@tokebe
Member

tokebe commented Feb 7, 2022

@ericz1803 Regarding your first point, log levels in BTE match the TRAPI standard defined here. The log level is selected by the first argument when creating a new LogEntry(), which you did in your PR -- there's nothing else that needs to be done there. The user can select which log levels to retrieve when making the query.
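For reference, the pattern looks roughly like this (a minimal stand-in for the real LogEntry class, assuming only that the level is the first constructor argument as described; the other details are illustrative):

```javascript
// Minimal TRAPI-style log entry sketch -- not the exact BTE class.
// The level is passed as the first constructor argument.
class LogEntry {
  constructor(level, code, message) {
    this.level = level;
    this.code = code;
    this.message = message;
  }

  // Serialize into the {timestamp, level, message, code} shape
  // used in TRAPI responses.
  getLog() {
    return {
      timestamp: new Date().toISOString(),
      level: this.level,
      message: this.message,
      code: this.code,
    };
  }
}

const entry = new LogEntry("INFO", null, "Edge e00 executed in reverse").getLog();
console.log(entry.level); // → "INFO"
```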

Regarding your second point, the most important place to add changes is smartapi.yaml, however I'm not sure that there's any place it's documented otherwise -- I would have thought the SmartAPI documentation online might update with changes to that file, but I guess not? @newgene Is there anywhere else we can document new parameters/etc for our endpoints?

@newgene
Member Author

newgene commented Feb 7, 2022

Generally, yes, smartapi metadata is the place to document end-user facing query endpoints and parameters. It's typically a manual edit, which is fine for us, since the changes to its metadata are not too often.

When the feature is stable and deployed, we can leave a comment here or create a new issue to update the metadata. @colleenXu handles many smartapi metadata updates, so she may chime in on the optimal process for our team as well.

@colleenXu
Collaborator

My understanding is that we currently have to manually edit the smartapi.yaml file in BTE main repo (used to validate what we send to the endpoints) and in the registry repo (used by the Smartapi registry, less important, just for documentation purposes...).

We've previously talked about only keeping 1 file and removing the other one...

@colleenXu
Collaborator

colleenXu commented Feb 8, 2022

Also, I know @ericz1803 and @tokebe discussed changing the endpoint to "explain", given the opening post here by @newgene.

But could we instead use something different like "edgeplan" or "dryrun"? I feel like "explain" is often used to mean "explain-type" querying (IDs on QNodes, attached together by a QGraph). Related to #379.

@ericz1803
Contributor

@tokebe I made the changes for "dryrun" -> "explain" and changing the logging level to "DEBUG".

I second @colleenXu's point that "explain" could lead to some confusion. For the smartapi.yaml file, I meant that it doesn't feel great to have to put the parameter in 6 different places, but maybe it's unavoidable with how the query urls are set up.

@tokebe
Member

tokebe commented Feb 9, 2022

@ericz1803 Apologies for the redundant work, given @colleenXu's comment dryrun does make more sense and should be used instead.

As for smartapi.yaml, yes, unfortunately this is just an unavoidable consequence of how it's structured/how that structure is used.

@ericz1803
Contributor

All good, I made the change back to dryrun. If there's nothing else, the two PRs should be ready to merge.
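For anyone following along, a dryrun request could then look something like this (the server URL here is a placeholder, and the query-string placement of the parameter is an assumption, not the documented BTE interface):

```javascript
// Hypothetical usage sketch: attach the dryrun flag as a query-string
// parameter on the TRAPI /query endpoint. URL and parameter placement
// are assumptions for illustration.
const url = new URL("https://example-bte-instance.io/v1/query");
url.searchParams.set("dryrun", "true");
console.log(url.toString()); // → "https://example-bte-instance.io/v1/query?dryrun=true"
// The TRAPI query graph itself would go in the POST body as usual.
```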

@colleenXu
Collaborator

colleenXu commented Feb 23, 2022

noting (this was originally sent in Slack):

I've reviewed these PRs for dryrun, and I think the dryrun parameter isn't working correctly with the asyncquery endpoints -- it looks to me like the query executes completely rather than stopping early (only logging the query plan).

It looks like it's working as expected for the /query endpoints. But...
The query plan it gives still isn't what actually happens in execution, because it can't anticipate how many output IDs there will be after executing each edge.
In reality, the edge-manager uses the number of "input" IDs for each edge to decide the execution order (doing lower numbers earlier).
Maybe this behavior is okay, since I'm not sure how to anticipate the output of each step beforehand.

I suggest: check with Chunlei on whether the current behavior is okay (it only works on the /query endpoints, and it only gives a "possible" query execution plan rather than the actual one), since he's the one who originally opened this issue.

@colleenXu
Collaborator

from discussion today:

the dryrun plan accurately says which edge will be executed first, and it assumes that results/output IDs are found for each edge. This means that the order of the edges isn't necessarily correct after the first edge listed:

  • results/output IDs might not be found (sub-queries don't find any hits)
  • as mentioned above, the edge-manager algorithm will look at the number of "input" IDs and pick the next edge based on the lowest number of input IDs
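The heuristic described above can be sketched like this (illustrative only, not the actual EdgeManager code):

```javascript
// Sketch: pick the next edge to execute by the lowest count of "input"
// IDs among edges not yet executed. Names and shapes are illustrative.
function getNext(edges) {
  return edges
    .filter((edge) => !edge.executed)
    .reduce((best, edge) =>
      edge.inputIDs.length < best.inputIDs.length ? edge : best
    );
}

const edges = [
  { id: "e01", inputIDs: ["A", "B", "C"], executed: false },
  { id: "e02", inputIDs: ["X"], executed: false },
];
console.log(getNext(edges).id); // → "e02"
```

Because input-ID counts beyond the first hop depend on the results of earlier hops, a dryrun can only guess at them, which is why the plan is reliable for the first edge but not necessarily for later ones.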
