Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactoring "records" to help test results-assembly #402

Closed
colleenXu opened this issue Feb 2, 2022 · 13 comments
Closed

refactoring "records" to help test results-assembly #402

colleenXu opened this issue Feb 2, 2022 · 13 comments

Comments

@colleenXu
Copy link
Collaborator

colleenXu commented Feb 2, 2022

  • Involves query-handler module. we're thinking of creating a class (Record) and creating methods to do what we want and/or building a way to get records as a "readable json" format that can be exported / imported for tests...
  • would make them easier to serialize / cache (w/ less memory usage)
  • Would make it easier to write tests for results-assembly code

We've tried working on this with biothings/bte_trapi_query_graph_handler#88, but put it aside in order to get its fixes deployed

related to #379 and a long-term effort to get "power users" like Colleen able to write tests for module-behavior and for overall-behavior.

@colleenXu
Copy link
Collaborator Author

Right now, there isn't a separate issue for the "testing" goal. This is because we're still figuring out what this effort involves: would we need separate processes for writing tests for each module and for overall-behavior?

Also, Andrew drew this diagram when trying to understand the "test writing process" for results assembly
Screen Shot 2022-02-02 at 9 46 03 AM

@colleenXu colleenXu changed the title refactoring "records" refactoring "records" to help test results-assembly Feb 2, 2022
@colleenXu
Copy link
Collaborator Author

colleenXu commented Feb 2, 2022

also regarding testing results-assembly: Jackson's code of running queries and checking: https://suwulab.slack.com/archives/CC218TEKC/p1643749685831709

  • check that every returned edge is used at least once in at least one result's edge_bindings
  • check that every returned node is used at least once in at least one result's node_bindings
  • check that every query node and query edge appears at least once in at least on result

@tokebe
Copy link
Member

tokebe commented Mar 7, 2022

Just want to note that once the vocab refactor is done we'll be in a pretty good spot to take a look at this. Records basically come into existence from api-response-transform and are lightly changed with annotation, etc. in the call-apis package.

There may be some work to do in the api-response-transform package or it may be better to create a new Record class in the query_graph_handler that handles the newer implementation and leaves the transformer largely unchanged. This will become more clear after the refactor is done and some discussion is had.

@colleenXu
Copy link
Collaborator Author

colleenXu commented Mar 8, 2022

I agree that the vocab refactor will help us understand what's going on.

I believe @marcodarko and @ariutta have previously been thinking about this and trying things out. But their attempt didn't fully work...

@ariutta
Copy link
Collaborator

ariutta commented Mar 23, 2022

QueryGraphHelper is already almost a Record class. Every method except _generateHash takes the input record:
https://github.com/biothings/bte_trapi_query_graph_handler/blob/main/src/helper.js

If the methods with input record were pulled out and refactored, we'd have a Record class. Then we could add a toString method for serialization. We could additionally create a method to "re-hydrate" the serialized output into a new instance of Record.

@tokebe
Copy link
Member

tokebe commented Mar 23, 2022

This is definitely the general idea -- the main addition to this I would like to consider is possibly refactoring the internal data structure of a Record -- both to conform to new vocab standards and to avoid the current cyclic references in parts of the record object (or at least mitigate this as a problem during serialization/"rehydration").

@ariutta
Copy link
Collaborator

ariutta commented Mar 23, 2022

That sounds great

@ariutta
Copy link
Collaborator

ariutta commented Mar 23, 2022

Would toJSON be more JS-appropriate than toString?

@tokebe
Copy link
Member

tokebe commented Mar 23, 2022

(edit: clarification)

I think, given the cyclic references, we'll need a modified structure to output to that can circumvent these references (which can later be rebuilt). So I agree that toJSON or perhaps a more specific/accurate verb might be the better choice. Anything we output to can easily be converted to a string afterwards.

@ariutta
Copy link
Collaborator

ariutta commented Mar 23, 2022

Yeah, toJSON sounds good. One benefit: JSON.stringify automatically looks for a toJSON method.

The cyclic references are tricky. That's where @marcodarko and I ran into trouble earlier.

@tokebe
Copy link
Member

tokebe commented Apr 8, 2022

Below is the general interface for a FrozenRecord -- what you would expect from .toJSON().

interface FrozenRecord {
  subject: FrozenNode;
  object: FrozenNode;
  predicate: string;
  edgeAttributes: EdgeAttribute[];
  publications: string[];
  recordHash?: string; // always supplied by Record, not required from user
  api: string;
  apiInforesCurie: string;
  metaEdgeSource: string;
}

interface FrozenNode {
  original: string;
  obj?: NodeNormalizerResultObj; // always supplied by Record, not required from user
  qNodeID: string;
  isSet: boolean;
  curie: string;
  UMLS: string;
  semanticType: string;
  label: string;
  equivalentCuries?: string[]; // always supplied by Record, not necessarily required from user
  names: string[];
  attributes: any;
}

Most of these values are actually pulled from the apiEdge or the qXEdge. The idea here is that when making a test record, you no longer have to interact with these more "under the hood" parts and can instead use this interface to deal with the data you care about. When initializing a Record instance, the instance will generate a "fake" apiEdge and qXEdge to store the information so that the instance will behave normally and be compatible with the rest of BTE.

Meanwhile, in internal cases like caching, a much more minimal set can be frozen, with often-repeated parts like the apiEdge being pulled out to lower redundant information. When pulled from cache, this can be re-assembled properly into fully functional Record instances.

As it happens, the cyclic reference problem is not really an issue -- we don't need to reference one record through the qXEdge of another record ever, or other similar cases, so the cyclic reference is only there for convenience and can be broken without harming anything (this is also what allows "fake" qXEdges to be made for testing).

@tokebe
Copy link
Member

tokebe commented May 9, 2022

Closing as this has been addressed by biothings/api-respone-transform.js/pull/29 and related.

@colleenXu
Copy link
Collaborator Author

colleenXu commented May 11, 2022

Related to this issue?

An idea: Have a flag (like USE_THREADING = false?) that'll allow records from a query to a file as json (each hop/QEdge execution as a separate file)? Then can modify + use these records as "data" for a test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants