Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clear language / documentation of vocab + data structures #379

Closed
colleenXu opened this issue Dec 22, 2021 · 23 comments
Closed

clear language / documentation of vocab + data structures #379

colleenXu opened this issue Dec 22, 2021 · 23 comments

Comments

@colleenXu
Copy link
Collaborator

colleenXu commented Dec 22, 2021

I think there's a bit of confusion (maybe this is just on my part?) on vocab that makes it difficult to discuss what data structures BTE modules are actually handling.

There's:

  • Hit (having an output ID in the field specified by x-bte or TRAPI) vs Record (1 piece of data from a response) vs result (a compiled collection of nodes + edges that matches the query graph)
  • Restraint (the cut-off code?) vs constraint (the node property filtering?) vs capping vs other-filtering (promiscuous nodes?)
  • QEdge (from query graph) vs Edge (1 connection between nodes built from data we get from sub-queries) vs MetaEdge (how to do sub-queries)
  • 1 ID/curie vs an entity/Node (that can have multiple corresponding IDs + 1 primary ID) vs the data structures for ID resolution

Other things that are probably confusing to newcomers but are "normal" now:

  • QGraph and how it relates to knowledge graph and results in the response
  • node category vs node semantic type (same thing)
  • primary ID vs other IDs
  • edge ID vs subject-object-predicate triple
  • predicate vs relation
@ariutta
Copy link
Collaborator

ariutta commented Feb 2, 2022

Discussion regarding vocabulary in query_results.js for the query graph handler:
biothings/bte_trapi_query_graph_handler#88 (comment)

@colleenXu
Copy link
Collaborator Author

colleenXu commented Feb 2, 2022

This will be discussed further as we discuss "how to write tests for the results-assembly process"

and we want to discuss case...https://betterprogramming.pub/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841

@tokebe
Copy link
Member

tokebe commented Feb 2, 2022

Just adding a few notes here:

  • We want to stick to camelCase as it's pretty much the standard in javascript, with PascalCase for class names
  • While we do want descriptive names for each thing, we probably want to keep to shorter variable names for code readability. There's a balance we can probably find between clarity and length, for instance the class QNode is short for QueryNode, which is relatively easy to infer, while keeping the name distinct from others using the word node.
    • I bring this up because I suspect that something like unconsolidatedResultsByResultID will tend to blend together a little when scanning a file, especially when mixed with other relatively similar names like unconsolidatedResults and consolidatedResults

@tokebe
Copy link
Member

tokebe commented Feb 2, 2022

I also just want to define the scope for which we're going to try and do this -- along the lines of in-code language, I assume the main areas of focus will be the query_graph_handler and call-apis packages, while overall we're trying to develop a clear vocabulary to be used in just about any discussion. Does this sound acceptable, at least for a start?

It might be best to go file by file and identify specific variable names/concept names that we want to discuss, and build our final list through some basic discussion over these?

@ariutta
Copy link
Collaborator

ariutta commented Feb 3, 2022

Regarding unconsolidatedResultsByResultID, shorter variable names are better, as long as they remain understandable, because variables that are too long can be unwieldy. Here's one opinion:

Give as descriptive a name as possible, within reason. Do not worry about saving horizontal space as it is far more important to make your code immediately understandable by a new reader. Do not use abbreviations that are ambiguous or unfamiliar to readers outside your project, and do not abbreviate by deleting letters within a word.

https://google.github.io/styleguide/jsguide.html#naming-rules-common-to-all-identifiers

Regarding variable names for maps/objects, I've seen conventions like valueByKey, valueForKey, keyToValue and keyToValueMap. There are arguments pro/con for each one, e.g.,

  1. keyToValue matches the left to right order of {"key": value}, but a quick scan of a variable name like idToRecords for data like {'abc': [record0, record1, ...]} could suggest idToRecords is an array because of the final s.
  2. keyToValueMap makes it clear this isn't a function, but naming it Map could be a bit confusing if it's actually just a "normal" object, not an actual JS Map.

I don't have any strong preferences here, as long as it's all consistent and understandable.

@colleenXu
Copy link
Collaborator Author

note on CURIE (vs ID) and prefix stuffs: https://cthoyt.com/2021/09/14/curies.html. We use prefixes from biolink-model (either what biolink model has at the top or what they list under semantic types for nodes)

@tokebe
Copy link
Member

tokebe commented Mar 1, 2022

Going to begin posting items for discussion here as I move through the process of renaming.

First up: As far as I can tell, smartAPIEdge and metaEdge refer to the same concept -- should we note that the two can be used interchangeably, or pick only one to use? Or am I mistaken about them being the same?

@colleenXu
Copy link
Collaborator Author

@tokebe it might help to link to code lines where the terms are used, so we can see context.

Brainstorming on what I think this first concept is...

  • I think I've been calling the thing "metaKGEdges" or "operations" when communicating....
  • historically, I think we called these things "smartAPIEdge" or something like this, because the TRAPI "meta knowledge graph" concept wasn't implemented/didn't exist. So BTE only used the info from the smartAPI registry yaml's x-bte kgs operation annotation...
  • which brings up a point: do we need different terms to describe the stuff from x-bte annotation VS TRAPI meta_knowledge_graph?

@tokebe
Copy link
Member

tokebe commented Mar 1, 2022

@colleenXu The majority of uses of smartAPIEdge are in qedge2bteedge, while metaEdge is not actually used in-code, but has been used in discussion. I think the best idea will be to use metaEdge instead of smartAPIEdge in basically all cases.

In my exploratory refactoring I'm also using metaXEdge to differentiate when a metaEdge has been paired with a queryExecutionEdge (qXEdge for short).

As far as I can tell, we don't really need differentiation of x-bte annotation vs TRAPI outside of what currently exists, as once the metaKG is retrieved for execution in qedge2bteedge, this doesn't need to be distinguished. The smartapi-kg package differentiates these I think as far as is needed.

@colleenXu
Copy link
Collaborator Author

I'll leave it to your discretion then! I'm not clear on the "queryExecutionEdge / pairing with metaEdge". Like what's the diff between queryExecutionEdge and qEdge (from the QGraph / query request body)? And does the metaEdge change once it's paired, enough that we want to use a diff name?

I imagine that these are concepts/constructs in the code that I'm not fully clear on...

@tokebe
Copy link
Member

tokebe commented Mar 1, 2022

A qEdge exists on the query graph, while a qXEdge is an internal representation of a qEdge with additional properties and methods, which seemed sufficiently different to define separately just to keep track of the changing data types in the code.

@tokebe
Copy link
Member

tokebe commented Mar 3, 2022

Another question for vocabulary:

bteEdges are essentially metaXEdges that have been duplicated and had additional information attached so that queries may be generated using them. They have a 1:1 relationship with sub-queries (excepting paginated sub-queries, which are technically made multiple times, but are the same query related to one bteEdge).

However, the name sort of conflicts with the bteGraph, the edges of which by convention would be bteEdges, which could lead to some potential confusion. Should we rename bteEdges to something else? What would be an appropriate name?

Perhaps they could be named subQueryEdge or APIEdge to make their 1:1 relationship with subQueries more apparent?

@tokebe
Copy link
Member

tokebe commented Mar 3, 2022

@ariutta regarding uniqueNodeID vs simply nodeID, Would it be acceptable to stick with uniqueNodeID? I think that name better implies the purpose of the concept and makes it clearly separate from something like qNodeID.

@tokebe
Copy link
Member

tokebe commented Mar 3, 2022

Additionally, would it be acceptable to use ucResult and cResult as short forms for unconsolidatedResult and consolidatedResult, given that function names applicable first occurrences use the long form for clarity?

@tokebe
Copy link
Member

tokebe commented Mar 3, 2022

@colleenXu Another question: Since results and the kg in a response from BTE are supposed to be somewhat separate, what do we want to use for internal language for response kg nodes/edges? Simply kgNode and kgEdge, or is there something more specific that would make sense?

@ariutta
Copy link
Collaborator

ariutta commented Mar 3, 2022

Regarding uniqueNodeID vs nodeID, I'm fine either way. Technically, an ID should be unique, so it sounded a bit redundant, but, as you point out, just nodeID alone could be confusing.

@colleenXu
Copy link
Collaborator Author

colleenXu commented Mar 3, 2022

  1. maybe "edge" isn't the best noun here, and something like an "operation" / "action" / "plan" is better? I agree on using something more descriptive (what it does or what it helps with) is better than "bteEdge" which is very generic (it's an edge thing and this tool is called BTE :P)
  2. I've been okay with using Node/Edge, similar to TRAPI....or picking a longer name that makes it clear that we're working with the "trapi response's KG" (kgNode / kgEdge, trapiNode / trapiEdge)

Maybe this is useful context? ⬇️ @tokebe @marcodarko @ariutta

I think Kevin originally designed the internal representation of BTE's info (after transforming raw sub-query responses) as independent of the TRAPI standard. I remember that there was an internal "table" of info (pandas dataframe) or "graph" representation (networkx graph) in the original Python-based tool. There was then a step to build a TRAPI response with its KG/results format for export. If this design carried on to the current tool, that may help explain some of the internal code...

@tokebe
Copy link
Member

tokebe commented Mar 3, 2022

For 2. In that case, perhaps we should try to keep it somewhat in-line with TRAPI and say that kgNodes/kgEdges, unless otherwise stated, are TRAPI response nodes/edges?

@tokebe
Copy link
Member

tokebe commented Mar 4, 2022

The above mentioned PRs contain all the changes I've currently made. We'll likely also need a pass on api-response-transform and biomedical_id_resolver, and possibly smartapi-kg, however I felt that these should be a second pass after we've adequately established the majority of our standardized vocab through these more 'central' packages.

@tokebe
Copy link
Member

tokebe commented Mar 10, 2022

@colleenXu One final question regarding this pass, some TRAPI logs reference SmartAPI edges when they are really talking about metaEdges -- how should we handle this? We could:

  1. Not change it (BTE is trying to find smartAPI edges...)
  2. Change to metaKG (not quite the convention, but more clear to user than metaEdge) (BTE is trying to find metaKG edges...)
  3. Use metaKG but explain what that means (BTE is trying to find metaKG (smartAPI + x-bte) edges...)
  4. Something else?

@colleenXu
Copy link
Collaborator Author

colleenXu commented Mar 14, 2022

@tokebe
I'm thinking of 3, but something like BTE is trying to find metaKG edges from smartAPI registry (x-bte annotation). Maybe that will make what is happening clearer to the developer?

sorry for the late reply.

@tokebe
Copy link
Member

tokebe commented Mar 15, 2022

Just for ease-of-access here is the finalized list of vocabulary replacements in the first pass:

View Table
New standard term, short-form Term meaning Old term (to replace) Where old term is used
queryEdge, qEdge Edge in TRAPI query graph Edge lots of places
queryEdgeID, qEdgeID ID of TRAPI query graph edge, e.g. "e01" edge_id query_graph_handler.query_graph
queryNodeID, qNodeID ID of TRAPI query graph node, e.g. "n01" node_id query_graph_handler.query_graph
queryEdgeHash, qEdgeHash A uniquely identifying (across BTE queries) hash of a qEdge - -
queryExecutionEdge, qXEdge Internal UpdatedExeEdge representation of a qEdge to be executed edge, queryEdge, qEdge lots of places
queryExecutionEdgeHash, qXEdgeHash A uniquely identifying (across BTE queries) hash of a qXEdge hashedEdgeID query_graph_handler.cache_handler
APIEdge Internal representation of a smartAPIEdge/qXEdge combo, per subquery bteEdge lots of places
metaEdge An edge in the metaKG smartapi_edge query_graph_handler.qedge2bteedge
metaExecutionEdge, metaXEdge A metaEdge paire with a qXEdge (such as the reasoner_edge property) smartapi_edge query_graph_handler.qedge2bteedge
Curie A "compact URI" -- consisting of a "prefix" and an "identifier" separated by a colon -- used to uniquely identify an entity ID lots of places
PrimaryCurie The 'main' Curie identifying an entity chosen from many valid equivalent CURIEs, usually chosen based on a ranked priority of prefixes PrimaryID lots of places
record A single unit of transformed data from a sub-query response hit, record, result lots of places
recordHash A uniquely identifying hash of a record edgeID query_graph_handler.graph
uniqueNodeID A unique identifier for a node while generating the results graph uniqueNodeID query_graph_handler.query_results
queryGraphSolution A collection of records representing a valid solution to the query graph preresult query_graph_handler.query_results
consolidatedSolution A consolidated set of all records representing a valid solution to the query graph (needs clarification) consolidatedPreresult query_graph_handler.query_results
trapiResultID A unique identifier for a result in the TRAPI response from BTE uniqueResultID query_graph_handler.query_results
result 1 item of the array in the TRAPI response (message.results) - -

@tokebe
Copy link
Member

tokebe commented Aug 24, 2022

Closing as this issue has been well-addressed and should be considered as a coding style directive moving forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants