Add NA for genomic-data-bin-counts #11006

alisman · 2024-09-23T14:08:46Z (edited by fuzhaoyuan)

This is intended to serve as documentation of the special approach used to get NA counts for study view endpoints during the development of ClickHouse RFC80. The first implementation, for the genomic-data-counts endpoint, is in #10807. These endpoints also take this approach: generic-assay-bin-counts (#11039) and clinical-data(-bin)-counts (PR tbd).

Some thoughts that might be helpful before we get to the recipe:

The way I see cBioPortal's backend structure (for study view), in steps:
1. The study view filter queries the database to select samples
2. Each endpoint examines these samples to extract specific properties and returns them to the frontend

Then, for the endpoints whose job is to count certain properties, we have to find the count for NA, which is a special value that is complicated to handle because:
- It may stand for either "not available" or "not altered"
- Its representations (e.g., '', 'NA', 'N/A') are not stored for every sample; whether a row exists depends on data availability

Given the backend structure, this turns the question into 2 independent parts:
- Filtering: How do we handle the case when the user selects 'NA' in the frontend filters?
- Counting: How do we get the count of NA for a specific property?

The ideas for these 2 questions are pretty straightforward:
1. For filtering: we find a way to select samples with NA values, and add them to the samples with non-NA values
2. For counting: we compute NA count = total selected samples count - non-NA count
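To make the two ideas concrete, here is a minimal in-memory sketch. It is not the portal's code: the function names and the `stored_values` dict (a stand-in for a table that only holds rows for samples with data) are assumptions made for illustration.

```python
from collections import Counter

NA = "NA"


def count_with_na(selected_sample_ids, stored_values):
    """Count values for the selected samples, adding an NA bucket.

    stored_values maps sample ID -> stored non-NA value; samples
    without data are simply absent (hypothetical stand-in for a table).
    """
    selected = set(selected_sample_ids)
    counts = Counter(stored_values[s] for s in selected if s in stored_values)
    non_na_count = sum(counts.values())
    # NA count = total selected samples count - non-NA count
    counts[NA] = len(selected) - non_na_count
    return dict(counts)


def filter_samples(sample_ids, stored_values, chosen_values):
    """Select samples matching the chosen values, where 'NA' means 'no stored value'."""
    chosen = set(chosen_values)
    non_na_chosen = chosen - {NA}
    # Samples with a stored value that the user picked
    result = {s for s in sample_ids if stored_values.get(s) in non_na_chosen}
    if NA in chosen:
        # Add the samples that have no stored value at all
        result |= {s for s in sample_ids if s not in stored_values}
    return result
```

With four selected samples of which three have stored values, the NA bucket ends up holding the one remaining sample, and selecting 'NA' in the filter returns exactly that sample.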

Now, to the recipe:

Ingredients:

Tables with data on those properties that either contain no NA values or offer a way to filter NA values out
- Without NA values: WHERE alteration_value != 'NA'
- "A way to filter out NA values":
```
<include refid="normalizeAttributeValue">
    <property name="attribute_value" value="alteration_value"/>
</include> != 'NA'
```
Endpoints with filters that contain the information about which samples to select. These filters can be determined by checking the request sent to the endpoint, for example:
- For endpoint genomic-data-counts, the filter for counting and the filter for filtering are both GenomicDataFilter, which contains {hugo gene symbol, profile type}
- For endpoint genomic-data-bin-counts, the filter for counting is GenomicDataBinCountFilter (which also contains {hugo gene symbol, profile type}) and the filter for filtering is GenomicDataFilter. In this case we only need to pass GenomicDataBinCountFilter from the controller to the mapper for counting. The study view filter will carry GenomicDataFilter for filtering.
- Similarly, for endpoint generic-assay-data-bin-counts, the filter for counting is GenericAssayDataBinCountFilter (which contains {stable id, profile type}) and the filter for filtering is GenericAssayDataFilter. We only need to pass GenericAssayDataBinCountFilter from the controller to the mapper for counting. The study view filter will carry GenericAssayDataFilter for filtering.
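As a rough illustration of the "way to filter out NA values" ingredient, the sketch below collapses the different NA spellings into one canonical 'NA' before comparing. The real `normalizeAttributeValue` fragment is not shown in this description, so the set of spellings (only the three forms mentioned earlier, plus a missing value) and the case-insensitive matching are assumptions.

```python
# Spellings treated as NA in this sketch (assumption: only the forms
# listed in this description; the real SQL fragment may handle more).
NA_SPELLINGS = {"", "NA", "N/A"}


def normalize_attribute_value(raw):
    """Map every NA spelling (or a missing value) to the canonical 'NA'."""
    if raw is None:
        return "NA"
    text = raw.strip()
    # Case-insensitive comparison is an assumption of this sketch
    return "NA" if text.upper() in NA_SPELLINGS else text


def keep_non_na(values):
    """Equivalent of `normalized_value != 'NA'` applied to a list of raw values."""
    return [v for v in values if normalize_attribute_value(v) != "NA"]
```

Filtering on the normalized value means a single `!= 'NA'` check covers every stored spelling.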

Cook:

Create the code path with the above-mentioned filter: controller -> service -> repository -> mapper -> counting SQL

For counting: recall we need NA count = total selected samples count - non-NA count. The first step is to get the "non-NA count":

Get all non-NA value samples: use the information {hugo gene symbol, profile type} from GenomicDataFilter.
- Bind can be used to create a variable: <bind name="profileType" value="genomicDataBinFilters[0].profileType" />
- Construct the WHERE clause carefully to keep the query at O(n)
- Note that when the user selects 'NA' only, this query returns an empty result, meaning there is no attributeId or value. We still need the count in this case, which is supposed to be 0. We will handle it later.

WITH genomic_numerical_query AS (
    SELECT
        concat(hugo_gene_symbol, profile_type) AS attributeId,
        <include refid="normalizeAttributeValue">
            <property name="attribute_value" value="alteration_value"/>
        </include> AS value,
        cast(count(value) as INTEGER) AS count
    FROM genetic_alteration_derived
    <where>
        <include refid="normalizeAttributeValue">
            <property name="attribute_value" value="alteration_value"/>
        </include> != 'NA' AND
        profile_type = #{profileType} AND
        <include refid="applyStudyViewFilter">
            <property name="filter_type" value="'SAMPLE_ID_ONLY'"/>
        </include>
        <foreach item="genomicDataBinFilter" collection="genomicDataBinFilters" open=" AND (" separator=" OR " close=")">
            hugo_gene_symbol = #{genomicDataBinFilter.hugoGeneSymbol}
        </foreach>
    </where>
    GROUP BY hugo_gene_symbol, profile_type, value
)

Sum to get the count of all non-NA value samples. Nice! We have "non-NA count" now.

genomic_numerical_sum AS (
    SELECT
        attributeId,
        sum(count) as genomic_numerical_count
    FROM genomic_numerical_query
    GROUP BY attributeId
)

Subtract the non-NA count from the total selected sample count to get the NA count. The total selected sample count comes straight from the study view filter. We still have the above-mentioned 'empty' problem to handle: the fix is to supply default values taken directly from the filter via the coalesce() function. Then, whenever the non-NA query returns an empty result, we still have every property we need to construct the 'NA'-only object with count = total selected sample count - 0. In the end we UNION the first non-NA query with this NA query.

SELECT
  coalesce((SELECT attributeId FROM genomic_numerical_sum LIMIT 1), concat(#{genomicDataBinFilters[0].hugoGeneSymbol}, #{profileType})) as attributeId,
  'NA' as value,
  cast(((SELECT * FROM (<include refid="getTotalSampleCount"/>)) - coalesce((SELECT genomic_numerical_count FROM genomic_numerical_sum LIMIT 1), 0)) as INTEGER) as count
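The whole counting pipeline (non-NA counts, their sum, and the coalesce() fallback) can be tried out end to end with the sketch below. It uses SQLite from Python instead of ClickHouse and MyBatis, and the `selected_sample` table (a stand-in for the study view filter), the simplified `genetic_alteration` schema, and the function names are all assumptions for illustration only.

```python
import sqlite3

COUNT_SQL = """
WITH non_na AS (
    SELECT hugo_gene_symbol || profile_type AS attribute_id,
           alteration_value AS attr_value,
           COUNT(*) AS n
    FROM genetic_alteration
    WHERE alteration_value != 'NA'
      AND hugo_gene_symbol = :gene
      AND profile_type = :profile
      AND sample_unique_id IN (SELECT sample_unique_id FROM selected_sample)
    GROUP BY hugo_gene_symbol, profile_type, alteration_value
),
non_na_sum AS (
    SELECT attribute_id, SUM(n) AS non_na_count
    FROM non_na
    GROUP BY attribute_id
)
SELECT attribute_id, attr_value, n FROM non_na
UNION ALL
SELECT
    -- Fall back to values from the filter when the non-NA query is empty
    COALESCE((SELECT attribute_id FROM non_na_sum LIMIT 1), :gene || :profile),
    'NA',
    (SELECT COUNT(*) FROM selected_sample)
        - COALESCE((SELECT non_na_count FROM non_na_sum LIMIT 1), 0)
"""


def build_demo_db():
    """Create an in-memory database with a tiny, made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE selected_sample (sample_unique_id TEXT PRIMARY KEY);
        CREATE TABLE genetic_alteration (
            sample_unique_id TEXT, hugo_gene_symbol TEXT,
            profile_type TEXT, alteration_value TEXT);
    """)
    conn.executemany(
        "INSERT INTO genetic_alteration VALUES (?, ?, ?, ?)",
        [("s1", "TP53", "cna", "2"), ("s2", "TP53", "cna", "2"),
         ("s3", "TP53", "cna", "-1"), ("s4", "TP53", "cna", "NA")],
    )
    return conn


def na_aware_counts(conn, selected_ids, gene, profile):
    """Return {(attribute_id, value): count}, always including an NA row."""
    conn.execute("DELETE FROM selected_sample")
    conn.executemany("INSERT INTO selected_sample VALUES (?)",
                     [(s,) for s in selected_ids])
    rows = conn.execute(COUNT_SQL, {"gene": gene, "profile": profile})
    return {(attr, value): n for attr, value, n in rows}
```

In the demo data, s4 stores an explicit 'NA' and a sample such as s5 has no row at all; both end up in the NA bucket. When only such samples are selected, the non-NA part is empty and the NA row is still produced thanks to the fallbacks.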

For filtering: recall we need to find a way to select samples with NA values and add them to the samples with non-NA values; the latter are easy because all non-NA values are stored in the database. This requires considering all user selection cases: 1) the user selects 'NA' only, 2) the user selects non-NA values only, 3) the user selects both 'NA' and non-NA values. We can combine the results either directly in the WHERE clause or with UNION, depending on the scenario:
1. Determine whether to combine results in the WHERE clause or with UNION. The main difference is whether the endpoint handles categorical or numerical values. For categorical values, we can simply add value IS NULL to the WHERE clause. For numerical values, binning requires non-NULL values, so we need to combine them with the NULL values by UNION.
2. If UNION, check whether the user selects NA or a numerical value
```
<bind name="userSelectsNA" value="false" />
<bind name="userSelectsNumericalValue" value="false" />
<foreach item="dataFilterValue" collection="genomicDataFilter.values">
    <choose>
        <when test="dataFilterValue.value == 'NA'">
            <bind name="userSelectsNA" value="true" />
        </when>
        <otherwise>
            <bind name="userSelectsNumericalValue" value="true" />
        </otherwise>
    </choose>
</foreach>
```
3. If UNION, prepare NA values. The idea is to LEFT JOIN the sample table with all non-NA values; any sample that has no match has no value, so its value must be NULL. To do this we need to 1) filter out the incomplete NA values stored in the table, either because the table has no NA values or by using "the way to filter out NA values" mentioned above, and 2) narrow down the right side of the JOIN beforehand to make the JOIN as fast as possible. These 2 steps are done in <include refid="selectAllNumericalGeneticAlterations"/>. Then we get clean NA values by specifying WHERE alteration_value IS null (or using "the way to filter out NA values") on this LEFT JOIN.
```
<if test="userSelectsNA">
  SELECT DISTINCT sd.sample_unique_id
  FROM sample_derived sd
      LEFT JOIN (<include refid="selectAllNumericalGeneticAlterations"/>) AS genomic_numerical_query ON sd.sample_unique_id = genomic_numerical_query.sample_unique_id
  WHERE alteration_value IS null
</if>
```
4. If UNION, prepare non-NA values and UNION them in the end. For this part, just carefully filter out incomplete NA values as we did above.
5. If WHERE, prepare non-NA values and NA values together. We can reuse the non-NA query from endpoint counting and add a condition to the WHERE clause that looks for NULL: <when test="dataFilterValue.value == 'NA'">alteration_value IS null</when>
```
WITH cna_query AS (
  SELECT sample_unique_id, alteration_value
  FROM genetic_alteration_derived
  WHERE profile_type = #{genomicDataFilter.profileType}
      AND hugo_gene_symbol = #{genomicDataFilter.hugoGeneSymbol}
      AND cancer_study_identifier IN
  <foreach item="studyId" collection="studyViewFilterHelper.studyViewFilter.studyIds" open="(" separator="," close=")">
      #{studyId}
  </foreach>
  <foreach item="dataFilterValue" collection="genomicDataFilter.values" open="AND (" separator=" OR " close=")">
      <choose>
          
          <when test="dataFilterValue.value == 'NA'">alteration_value IS null</when>
          
          <otherwise>alteration_value == #{dataFilterValue.value}</otherwise>
      </choose>
  </foreach>
```
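The UNION path described above (detect what the user selected, anti-join for the NA samples, plain match for the non-NA samples) can be sketched as follows. This is a simplified stand-in run on SQLite from Python: the two-column tables, the exact-value matching (instead of the start/end ranges used for binning), and the function names are assumptions, not the mapper's actual SQL.

```python
import sqlite3

NA_SAMPLES_SQL = """
    SELECT DISTINCT sd.sample_unique_id
    FROM sample_derived sd
    LEFT JOIN (SELECT sample_unique_id, alteration_value
               FROM genetic_alteration_derived
               WHERE alteration_value != 'NA') AS q
      ON sd.sample_unique_id = q.sample_unique_id
    WHERE q.alteration_value IS NULL
"""


def build_demo_db():
    """Create an in-memory database with a tiny, made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sample_derived (sample_unique_id TEXT PRIMARY KEY);
        CREATE TABLE genetic_alteration_derived (
            sample_unique_id TEXT, alteration_value TEXT);
    """)
    conn.executemany("INSERT INTO sample_derived VALUES (?)",
                     [("s1",), ("s2",), ("s3",), ("s4",)])
    conn.executemany(
        "INSERT INTO genetic_alteration_derived VALUES (?, ?)",
        [("s1", "2"), ("s2", "-1"), ("s3", "NA")],  # s4 has no row at all
    )
    return conn


def select_samples(conn, chosen_values):
    """Return the sample IDs matching the chosen values, 'NA' included."""
    user_selects_na = "NA" in chosen_values
    non_na_values = [v for v in chosen_values if v != "NA"]

    parts, params = [], []
    if user_selects_na:
        # Samples with no non-NA row on the right side of the LEFT JOIN
        parts.append(NA_SAMPLES_SQL)
    if non_na_values:
        placeholders = ", ".join("?" for _ in non_na_values)
        parts.append(f"""
            SELECT DISTINCT sample_unique_id
            FROM genetic_alteration_derived
            WHERE alteration_value IN ({placeholders})
        """)
        params.extend(non_na_values)
    if not parts:
        return set()

    sql = " UNION ".join(parts)
    return {row[0] for row in conn.execute(sql, params)}
```

All three selection cases go through the same function: 'NA' only runs just the anti-join, non-NA only runs just the value match, and a mixed selection UNIONs the two.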

Thank you for reading this far. Hope this is helpful. ;-)

src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewFilterMapper.xml

haynescd

lgtm

src/main/resources/org/cbioportal/persistence/mybatisclickhouse/StudyViewFilterMapper.xml

alisman · 2024-09-26T14:26:55Z

+                    <if test="dataFilterValue.start != null or dataFilterValue.end != null">
+                        <choose>
+                            <when test="dataFilterValue.start == dataFilterValue.end">
+                                AND abs(


Let's think about whether we can cast to decimal and replace this.

sonarcloud · 2024-10-02T16:32:36Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

fuzhaoyuan force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch 2 times, most recently from 5d2874c to d744f1d Compare September 23, 2024 20:35

haynescd previously approved these changes Sep 25, 2024

fuzhaoyuan self-assigned this Sep 26, 2024

fuzhaoyuan dismissed haynescd’s stale review via adc4179 September 27, 2024 15:26

fuzhaoyuan force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch from adc4179 to dea9647 Compare September 27, 2024 16:38

alisman force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch from dea9647 to 437cd1f Compare September 27, 2024 17:53

fuzhaoyuan and others added 7 commits October 2, 2024 12:24

Add NA for genomic-data-bin-counts

5823caa

Filtering with NA for genomic-data-bin-counts

ea6af5a

Fix when subqueries return empty for NA counts

eb226a9

Refinement and comments

2fcc6d1

Address comments

b7c6725

Fix intersection with parens (#11032)

c725df9

* Fix intersection with parens

Demo rfc80 poc genomic data filter tests (#11036)

3a8536d

* add genomic data filter tests testing for missing NAs
* add NA genomic data filter tests

Co-authored-by: Bryan Lai <laib1@mskcc.org>

fuzhaoyuan force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch from e114db0 to 3a8536d Compare October 2, 2024 16:24

alisman merged commit ceed01f into demo-rfc80-poc Oct 2, 2024
14 of 19 checks passed

alisman deleted the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch October 2, 2024 17:39
