boolean search #5

Closed
hongsudt opened this issue Sep 8, 2018 · 2 comments

hongsudt commented Sep 8, 2018

Create boolean search UX for GUDMAP.

The app should allow users to keep supplying a group of the following parameters:

  • presence/not-detected
  • anatomical term
  • a range of stages

There should be add and delete buttons that allow users to add a new group or delete an existing one.

Once the user clicks submit, the app should redirect to the recordset page with the proper facet parameters.

Take a look at http://legacy.gudmap.org/gudmap/pages/boolean_test.html for the current boolean search that we are trying to mimic.

See this issue in rbk-project for details: https://github.com/informatics-isi-edu/rbk-project/issues/463

Here is the link to the help page: http://legacy.gudmap.org/Help/Boolean_Syntax_Help.html


RFSH commented Oct 17, 2018

Overview

Based on my understanding, this is the structure of the tables that we should use for this feature (this is based on rbk-dev; rbk-www has different column and table names):

image

We want to generate a query based on the Gene_Expression:Specimen table. The query will be a conjunction of multiple groups of filters. The filters are based on the following columns:

  • name in Anatomy (exact match)
  • Strength in Specimen_Expression (exact match)
  • Pattern in Specimen_Expression (exact match)
  • Pattern_Location in Specimen_Expression (exact match)
  • Order in Developmental_Stage table (range)

Populating Dropdown Values

The first thing the app should do is send requests to ermrest to get the existing values for each of these columns and show them in a picker:

  • For the values in the Specimen_Expression:

      /ermrest/catalog/2/attributegroup/Gene_Expression:Specimen_Expression/Strength
      /ermrest/catalog/2/attributegroup/Gene_Expression:Specimen_Expression/Pattern
      /ermrest/catalog/2/attributegroup/Gene_Expression:Specimen_Expression/Pattern_Location
    
  • For the Developmental_Stage table: users want to see the Name column, but what we send to ermrest to get the end result is the Order column, so we should project both of these values:

      /ermrest/catalog/2/attribute/Common:Developmental_Stage/Name,Order
    
  • For the Anatomy table: users want to pick based on the name column, but to keep the ermrest query simple we want the id value. With the id we don't need the join with Anatomy at all and can generate the filter directly from the Specimen_Expression table. So we need both name and id. This is a huge table, so we might want to limit the set of terms that we show in the picker.

      /ermrest/catalog/2/attribute/Common:Anatomy/id,name
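
The requests listed above could be assembled roughly like this. It is only a sketch: the function names `dropdownRequestPaths` and `fetchDropdownValues` are made up for illustration, the table and column names are the rbk-dev ones from this comment, and it assumes that each request returns a JSON array of rows and that a browser-style `fetch` is available.

```javascript
// Hypothetical helper: returns the ERMrest path for each picker.
// Table and column names follow rbk-dev as described in this comment.
function dropdownRequestPaths(catalogId) {
  const base = "/ermrest/catalog/" + catalogId;
  const se = base + "/attributegroup/Gene_Expression:Specimen_Expression/";
  return {
    strength: se + "Strength",
    pattern: se + "Pattern",
    patternLocation: se + "Pattern_Location",
    // Name is shown to the user, Order is what the final query needs.
    stage: base + "/attribute/Common:Developmental_Stage/Name,Order",
    // name is shown to the user, id is what the final query needs.
    anatomy: base + "/attribute/Common:Anatomy/id,name"
  };
}

// Hypothetical helper: fetches the rows for every picker in parallel.
// Assumes each response body is a JSON array of row objects.
async function fetchDropdownValues(origin, catalogId) {
  const paths = dropdownRequestPaths(catalogId);
  const keys = Object.keys(paths);
  const rows = await Promise.all(keys.map(async (key) => {
    const response = await fetch(origin + paths[key], {
      headers: { Accept: "application/json" }
    });
    if (!response.ok) {
      throw new Error("Request failed for " + key + ": " + response.status);
    }
    return response.json();
  }));
  const result = {};
  keys.forEach((key, i) => { result[key] = rows[i]; });
  return result;
}
```

Keeping the path construction separate from the network call makes it easy to switch to the rbk-www names later without touching the fetch logic.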
    

Query

Now that we know how to get the existing values, here is the query itself. For one group of these filters, the ermrest query would look like this (split over multiple lines to improve readability):

https://dev.rebuildingakidney.org/ermrest/catalog/2/entity/M:=Gene_Expression:Specimen

/(Developmental_Stage)=(Common:Developmental_Stage:ID)/Order::geq::min&Order::leq::max/$M

/(RID)=(Specimen_Expression:Specimen)

/Strength=str_val&Region=region_val&Pattern=pat_val&Pattern_Location=pat_loc_val/$M

rbk-dev url

Based on this, the request for two filter groups will look like this:

https://dev.rebuildingakidney.org/ermrest/catalog/2/entity/M:=Gene_Expression:Specimen

/(Developmental_Stage)=(Common:Developmental_Stage:ID)/Order::geq::min1&Order::leq::max1/$M
/(RID)=(Specimen_Expression:Specimen)
/Strength=str_val1&Region=region_val1&Pattern=pat_val1&Pattern_Location=pat_loc_val1/$M

/(Developmental_Stage)=(Common:Developmental_Stage:ID)/Order::geq::min2&Order::leq::max2/$M
/(RID)=(Specimen_Expression:Specimen)
/Strength=str_val2&Region=region_val2&Pattern=pat_val2&Pattern_Location=pat_loc_val2/$M
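
As a rough sketch, the path for any number of groups could be built by generating one segment per group and joining the segments with `/`. The function names and the shape of the `group` object (`stageMin`, `stageMax`, `strength`, `region`, `pattern`, `patternLocation`) are assumptions made for this example. The column names are the rbk-dev ones used above, and values are percent-encoded with `encodeURIComponent`.

```javascript
// Hypothetical group shape:
// { stageMin, stageMax, strength, region, pattern, patternLocation }
// Builds the path segment for one filter group, ending back on $M.
function buildGroupPath(group) {
  const enc = encodeURIComponent;
  return [
    "(Developmental_Stage)=(Common:Developmental_Stage:ID)",
    "Order::geq::" + enc(group.stageMin) + "&Order::leq::" + enc(group.stageMax),
    "$M",
    "(RID)=(Specimen_Expression:Specimen)",
    "Strength=" + enc(group.strength) +
      "&Region=" + enc(group.region) +
      "&Pattern=" + enc(group.pattern) +
      "&Pattern_Location=" + enc(group.patternLocation),
    "$M"
  ].join("/");
}

// Conjunction of groups: each group starts from $M, so the segments chain.
function buildBooleanSearchPath(groups) {
  return groups.map(buildGroupPath).join("/");
}
```

Because every group resets to `$M` at the end, appending another group narrows the same set of Specimen rows, which is what gives the conjunction.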

Integration With Chaise

Now that we know how to construct the query, we need to pass it to chaise to show the result. We recently added a custom facet feature to ermrestjs/chaise. You can pass a JSON blob in the /*::cfacets::BLOB part of your URL. The custom facet JSON looks like this:

{
  "displayname": "the query in simpler terms",
  "ermrest_path": "a path that ermrest understands"
}

This allows you to pass an ermrest_path to ermrestjs that will be appended to the URL. It will be shown as a separate filter at the top of the page and won't be merged into the other facets, so you can use this feature to pass the generated query. In ermrestjs we always use the alias $M to refer to the main table that will be projected, so you should use this alias too (as I did while explaining the queries). For example, the JSON that you need to pass to chaise for the first example would be:

var customFacet = {
  "displayname": "p{in \"renal vesicle\" TS23..TS23}",
  "ermrest_path": "(Developmental_Stage)=(Common:Developmental_Stage:ID)/Order::geq::23&Order::leq::23/$M/(RID)=(Specimen_Expression:Specimen)/Strength=present&Region=EMAPA%3A27678&Pattern=regional&Pattern_Location=proximal/$M"
}

This raises the question of how we want to generate the displayname, which I'm not sure about. Regardless, once you have created the object, you can use the existing ERMrest.createPath() function to do the facet encoding for you.

window.location = window.origin + "/chaise/recordset/" + ERMrest.createPath("2", "Gene_Expression", "Specimen", null, customFacet)
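
One way to keep the redirect testable is to wrap it in a small function that receives the path encoder as an argument. This is just a sketch: `buildRecordsetUrl` is a made-up name, the displayname is supplied by the caller because how to generate it is still open, and the arguments passed to `createPath` are the same ones used in the line above.

```javascript
// Hypothetical wrapper around the redirect URL construction.
// `createPath` is expected to behave like ERMrest.createPath as used above:
// (catalogId, schemaName, tableName, facets, customFacets) -> encoded path.
function buildRecordsetUrl(origin, createPath, displayname, ermrestPath) {
  const customFacet = {
    displayname: displayname,
    ermrest_path: ermrestPath
  };
  return origin + "/chaise/recordset/" +
    createPath("2", "Gene_Expression", "Specimen", null, customFacet);
}
```

In the app this could be called with `window.location.origin` and a small arrow function that forwards its arguments to `ERMrest.createPath`, and the result assigned to `window.location`.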

Concerns

URL Length

As you might have noticed, this query is very long and will eventually hit the URL length limit. Currently we set a limit of 2048 characters for the path in chaise/ermrestjs, because IE reportedly limits URLs to 2048 characters (I didn't test this in IE; it is what I found by searching online). I tested this query in Chrome on rbk-dev and it had no problem with longer paths. The actual URL length limit on rbk-dev seems to be around 4000 characters (mostly set by apache), so we can increase the chaise limit if we want to. Either way, this app should check the URL length and limit the number of groups accordingly.

For each group we use 164 characters for constants (joins and column names). The average length of the variable values is around 40 characters. So the total length of the query will be about 204 times the number of groups, which means that with the 2048-character limit we can have approximately 10 groups.

If we create a view for Specimen_Expression and name it se, with columns s (for Specimen), st (for Strength), r (for Region), pl (for Pattern_Location), and p (for Pattern), we can save 55 characters per group. Each group will then have a length of about 149 characters, which allows approximately 13 groups.
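
The estimate above is a floor division of the path limit by the per-group length. A minimal sketch, using the numbers from this comment (`maxGroups` and `fitsWithinLimit` are hypothetical names, and like the estimate it ignores the characters used by the rest of the URL):

```javascript
// Rough upper bound on the number of groups that fit in the path limit.
function maxGroups(pathLimit, constantChars, averageValueChars) {
  return Math.floor(pathLimit / (constantChars + averageValueChars));
}

// Exact check the app could run on the generated path before redirecting.
function fitsWithinLimit(path, pathLimit) {
  return path.length <= pathLimit;
}

// Current column names: 164 constant characters per group.
const groupsWithCurrentNames = maxGroups(2048, 164, 40); // 10
// With the shortened view: 164 - 55 = 109 constant characters per group.
const groupsWithShortView = maxGroups(2048, 109, 40); // 13
```

Since the 40-character average is only an estimate, the app should still run the exact check on the real path instead of relying on the group count alone.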

Performance

Each group adds two extra joins, so adding more and more groups makes the query more complicated. Testing this was a bit tricky: I had to make sure that the first few groups did not drastically limit the set of results. For example, if the first two groups return an empty result, then adding more joins wouldn't change the performance at all. The longest query that I could find with different values for Region was this link. It consists of 10 groups and returns 511 rows. It doesn't seem to have any performance issues, although we might want to do more testing with bigger result sets.


RFSH commented Mar 5, 2019

As explained in the previous comment, we're doing the Anatomy queries based on ID, but we're showing the name column to users.

But name is not unique, and because of that we validate its uniqueness in boolean search. This validation can be confusing to users. Besides, we still allow them to search based on an "invalid" Region: we tell users that we're searching based on "name" while the search is actually done using the first matching "ID".

We should fix this by changing our queries to add the extra join and use the "name" instead. So if the query was

https://dev.rebuildingakidney.org/ermrest/catalog/2/entity/M:=Gene_Expression:Specimen

/(Stage_ID)=(Common:Developmental_Stage:ID)/Ordinal::geq::min&Ordinal::leq::max/$M

/(RID)=(Specimen_Expression:Specimen)

/Strength=str_val&Region=region_id_val&Pattern=pat_val&Pattern_Location=pat_loc_val/$M

rbk-dev url

It should be changed to

https://dev.rebuildingakidney.org/ermrest/catalog/2/entity/M:=Gene_Expression:Specimen

/(Stage_ID)=(Common:Developmental_Stage:ID)/Ordinal::geq::min&Ordinal::leq::max/$M

/(RID)=(Specimen_Expression:Specimen)

/Strength=str_val&Pattern=pat_val&Pattern_Location=pat_loc_val

/(Region)=(Vocabulary:Anatomy:ID)/Name=region_name_val/$M

rbk-dev url
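
A sketch of how the revised group path could be generated, with the Region filter replaced by a join to Anatomy and a filter on Name. The function name and the `group` fields (`stageMin`, `stageMax`, `strength`, `pattern`, `patternLocation`, `regionName`) are assumptions for this example. The table and column names are the ones in the query above, with the Pattern_Location casing taken from the column list earlier in this thread.

```javascript
// Hypothetical group shape:
// { stageMin, stageMax, strength, pattern, patternLocation, regionName }
// Filters on the Anatomy name through an extra join instead of Region=id.
function buildGroupPathByRegionName(group) {
  const enc = encodeURIComponent;
  return [
    "(Stage_ID)=(Common:Developmental_Stage:ID)",
    "Ordinal::geq::" + enc(group.stageMin) + "&Ordinal::leq::" + enc(group.stageMax),
    "$M",
    "(RID)=(Specimen_Expression:Specimen)",
    "Strength=" + enc(group.strength) +
      "&Pattern=" + enc(group.pattern) +
      "&Pattern_Location=" + enc(group.patternLocation),
    // Extra join: Specimen_Expression.Region -> Anatomy, then match by name.
    "(Region)=(Vocabulary:Anatomy:ID)",
    "Name=" + enc(group.regionName),
    "$M"
  ].join("/");
}
```

With this shape a name that matches several Anatomy rows selects all of them, so the app no longer has to pick the first matching ID.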

One obvious downside of doing this is adding the extra join.
