Add NA for genomic-data-bin-counts #11006

Merged

@alisman alisman commented Sep 23, 2024

This is intended to serve as documentation of the special approach used to get NA counts for study view endpoints during the development of ClickHouse RFC80. The first implementation, for the genomic-data-counts endpoint, is in #10807. These endpoints also take this approach: generic-assay-bin-counts (#11039) and clinical-data(-bin)-counts (PR tbd).

Some thoughts that might be helpful before we get to the recipe:

  • The way I see cBioPortal's backend structure (for study view), in steps:
    1. The study view filter queries the database to select samples
    2. Each endpoint examines these samples, extracts specific properties, and returns them to the frontend
  • For endpoints whose job is to count certain properties, we also have to find the count for NA, a special value that is complicated to handle because:
    • It may stand for either "not available" or "not altered"
    • Its representations (e.g., '', 'NA', 'N/A') are not stored for every sample; whether a row exists depends on data availability
  • Given the backend structure, this splits the question into 2 independent parts:
    • Filtering: How do we handle the case when the user selects 'NA' in the frontend filters?
    • Counting: How do we get the count of NA for a specific property?
  • The ideas for these 2 questions are pretty straightforward:
    1. For filtering: we find a way to select samples with NA values, and add them to the samples with non-NA values
    2. For counting: we compute NA count = total selected samples count - non-NA count
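
The two ideas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic and set logic, not the backend code: the in-memory dictionary stands in for the database, and the names `NA_FORMS`, `na_count`, and `filter_samples` are hypothetical.

```python
# Hypothetical stand-ins for the database; all names here are illustrative only.
NA_FORMS = {"", "NA", "N/A"}  # the NA representations mentioned above


def is_na(value):
    """A sample is NA if it has no stored row (None) or stores an NA form."""
    return value is None or value in NA_FORMS


def na_count(selected_samples, values_by_sample):
    """Counting: NA count = total selected samples count - non-NA count."""
    non_na = sum(1 for s in selected_samples if not is_na(values_by_sample.get(s)))
    return len(selected_samples) - non_na


def filter_samples(selected_samples, values_by_sample, wanted_values, include_na):
    """Filtering: samples with a wanted non-NA value, plus NA samples if requested."""
    result = set()
    for s in selected_samples:
        value = values_by_sample.get(s)
        if is_na(value):
            if include_na:
                result.add(s)
        elif value in wanted_values:
            result.add(s)
    return result


samples = ["s1", "s2", "s3", "s4"]
values = {"s1": "2", "s2": "NA", "s3": "-1"}  # s4 has no stored row at all

print(na_count(samples, values))                     # s2 and s4 count as NA
print(filter_samples(samples, values, {"2"}, True))  # s1 plus the NA samples
```

Note that s4 is counted as NA even though nothing is stored for it, which is why the count is derived by subtraction rather than by counting stored 'NA' rows.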

Now, to the recipe:

Ingredients:

  • Tables with data on those properties, either without NA values or with a way to filter out NA values:
    <include refid="normalizeAttributeValue">
        <property name="attribute_value" value="alteration_value"/>
    </include> != 'NA'
  • Endpoints with filters that contain information about which samples to select. These filters can be determined by checking the request sent to the endpoint, for example:
    • For endpoint genomic-data-counts, the filter for both counting and filtering is GenomicDataFilter, which contains {hugo gene symbol, profile type}
    • For endpoint genomic-data-bin-counts, the filter for counting is GenomicDataBinCountFilter (which also contains {hugo gene symbol, profile type}) and the filter for filtering is GenomicDataFilter. In this case we only need to pass GenomicDataBinCountFilter from the controller down to the mapper for counting. The study view filter will have GenomicDataFilter for filtering.
    • Similarly, for endpoint generic-assay-data-bin-counts, the filter for counting is GenericAssayDataBinCountFilter (which contains {stable id, profile type}) and the filter for filtering is GenericAssayDataFilter. We only need to pass GenericAssayDataBinCountFilter from the controller down to the mapper for counting. The study view filter will have GenericAssayDataFilter for filtering.
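
As a rough illustration of the first ingredient ("a way to filter out NA values"), the sketch below normalizes the NA representations to a single 'NA' marker and then drops them. The actual body of the normalizeAttributeValue SQL fragment is not shown in this PR description, so `normalize_attribute_value` here is an assumed Python analogue of it, not its real definition.

```python
# Assumed analogue of the normalizeAttributeValue fragment; the real SQL
# fragment is not shown in this description and may behave differently.
NA_FORMS = frozenset({"", "NA", "N/A"})


def normalize_attribute_value(raw):
    """Map every NA representation (and a missing value) to the marker 'NA'."""
    if raw is None or raw in NA_FORMS:
        return "NA"
    return raw


# Hypothetical (sample id, alteration value) rows as they might be stored.
rows = [("s1", "0.5"), ("s2", "N/A"), ("s3", ""), ("s4", "-1.2")]

# Equivalent in spirit to: WHERE <normalized alteration_value> != 'NA'
non_na_rows = [(s, v) for s, v in rows if normalize_attribute_value(v) != "NA"]
print(non_na_rows)
```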

Cook:

  • Create the code path with the above-mentioned filter: controller -> service -> repository -> mapper -> counting SQL

  • For counting: recall that NA count = total selected samples count - non-NA count. The first step is to get the "non-NA count":

    1. Get all non-NA value samples: use the information {hugo gene symbol, profile type} from GenomicDataFilter.
      • Bind can be used to create a variable: <bind name="profileType" value="genomicDataBinFilters[0].profileType" />
      • Construct the WHERE clause carefully to keep the query at O(n)
      • Note that when the user selects 'NA' only, this query returns an empty result, meaning there is neither an attributeId nor a value. In this case we still need the non-NA count, which should be 0. We will handle this later.
    WITH genomic_numerical_query AS (
        SELECT
            concat(hugo_gene_symbol, profile_type) AS attributeId,
            <include refid="normalizeAttributeValue">
                <property name="attribute_value" value="alteration_value"/>
            </include> AS value,
            cast(count(value) as INTEGER) AS count
        FROM genetic_alteration_derived
        <where>
            <include refid="normalizeAttributeValue">
                <property name="attribute_value" value="alteration_value"/>
            </include> != 'NA' AND
            profile_type = #{profileType} AND
            <include refid="applyStudyViewFilter">
                <property name="filter_type" value="'SAMPLE_ID_ONLY'"/>
            </include>
            <foreach item="genomicDataBinFilter" collection="genomicDataBinFilters" open=" AND (" separator=" OR " close=")">
                hugo_gene_symbol = #{genomicDataBinFilter.hugoGeneSymbol}
            </foreach>
        </where>
        GROUP BY hugo_gene_symbol, profile_type, value
    )
    2. Sum to get the count of all non-NA value samples. Nice! We have the "non-NA count" now.
    genomic_numerical_sum AS (
        SELECT
            attributeId,
            sum(count) as genomic_numerical_count
        FROM genomic_numerical_query
        GROUP BY attributeId
    )
    3. Subtract the non-NA count from the total selected sample count to get the NA count. The total selected sample count comes straight from the study view filter. We still have the above-mentioned 'empty' problem to handle, and the way to do it is to provide default values taken directly from the filter by using the coalesce() function. Then, whenever the non-NA query returns empty results, we still have all the properties we need to construct the 'NA'-only object with count = total selected sample count - 0. In the end we UNION the first non-NA query with this NA query.
    SELECT
      coalesce((SELECT attributeId FROM genomic_numerical_sum LIMIT 1), concat(#{genomicDataBinFilters[0].hugoGeneSymbol}, #{profileType})) as attributeId,
      'NA' as value,
      cast(((SELECT * FROM (<include refid="getTotalSampleCount"/>)) - coalesce((SELECT genomic_numerical_count FROM genomic_numerical_sum LIMIT 1), 0)) as INTEGER) as count
  • For filtering: recall that we need to select samples with NA values and add them to the samples with non-NA values, since all non-NA values are certainly available in the database. This requires considering all user selection cases: 1) the user selects 'NA' only, 2) the user selects non-NA values only, 3) the user selects both 'NA' and non-NA values. We can combine the results either directly in the WHERE clause or with UNION, depending on the scenario:

    1. Determine whether to use the WHERE clause or UNION for combining results. The main difference is whether the endpoint handles categorical or numerical values. For categorical values, we can simply add value IS NULL to the WHERE clause. Numerical values need to be binned, so they have to be non-NULL, and we therefore combine them with the NULL values via UNION.
    2. If UNION, check whether the user selects NA and/or numerical values
    <bind name="userSelectsNA" value="false" />
    <bind name="userSelectsNumericalValue" value="false" />
    <foreach item="dataFilterValue" collection="genomicDataFilter.values">
        <choose>
            <when test="dataFilterValue.value == 'NA'">
                <bind name="userSelectsNA" value="true" />
            </when>
            <otherwise>
                <bind name="userSelectsNumericalValue" value="true" />
            </otherwise>
        </choose>
    </foreach>
    3. If UNION, prepare the NA values. The idea is to LEFT JOIN the sample table with all non-NA values; any sample without a matching value must then be NULL. So we need to 1) filter out the incomplete NA values stored in the table, relying either on the table having no NA values or on the above-mentioned "way to filter out NA values", and 2) narrow down the right side of the JOIN beforehand to make the JOIN as fast as possible. Both steps are done in <include refid="selectAllNumericalGeneticAlterations"/>. We then get clean NA values by specifying WHERE alteration_value IS null (or using "the way to filter out NA values") on this LEFT JOIN.
    <if test="userSelectsNA">
      SELECT DISTINCT sd.sample_unique_id
      FROM sample_derived sd
          LEFT JOIN (<include refid="selectAllNumericalGeneticAlterations"/>) AS genomic_numerical_query ON sd.sample_unique_id = genomic_numerical_query.sample_unique_id
      WHERE alteration_value IS null
    </if>
    4. If UNION, prepare the non-NA values and UNION them in at the end. For this part, just carefully filter out incomplete NA values as we did above.
    5. If WHERE, prepare the non-NA and NA values together. We can reuse the non-NA query from the counting endpoint and just add a NULL check to the WHERE clause: <when test="dataFilterValue.value == 'NA'">alteration_value IS null</when>
    WITH cna_query AS (
      SELECT sample_unique_id, alteration_value
      FROM genetic_alteration_derived
      WHERE profile_type = #{genomicDataFilter.profileType}
          AND hugo_gene_symbol = #{genomicDataFilter.hugoGeneSymbol}
          AND cancer_study_identifier IN
      <foreach item="studyId" collection="studyViewFilterHelper.studyViewFilter.studyIds" open="(" separator="," close=")">
          #{studyId}
      </foreach>
      <foreach item="dataFilterValue" collection="genomicDataFilter.values" open="AND (" separator=" OR " close=")">
          <choose>
              <!-- NA value samples -->
              <when test="dataFilterValue.value == 'NA'">alteration_value IS null</when>
              <!-- non-NA value samples -->
              <otherwise>alteration_value == #{dataFilterValue.value}</otherwise>
          </choose>
      </foreach>
    
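The counting and UNION-style filtering steps above can be tried out end to end with a small, self-contained sketch. It uses Python's built-in sqlite3 module instead of ClickHouse and MyBatis, so the dynamic XML is replaced by plain named parameters. The two tables are heavily simplified assumptions (only the columns used here), the whole sample_derived table stands in for the study view filter's selected samples, and the helper names `na_count_row` and `na_samples` are hypothetical.

```python
import sqlite3

# Simplified, assumed schema: only the columns needed for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_derived (sample_unique_id TEXT PRIMARY KEY);
CREATE TABLE genetic_alteration_derived (
    sample_unique_id TEXT, hugo_gene_symbol TEXT,
    profile_type TEXT, alteration_value TEXT
);
INSERT INTO sample_derived VALUES ('s1'), ('s2'), ('s3'), ('s4');
INSERT INTO genetic_alteration_derived VALUES
    ('s1', 'TP53', 'rna_seq', '0.5'),
    ('s2', 'TP53', 'rna_seq', 'NA'),   -- an incomplete, stored NA value
    ('s3', 'TP53', 'rna_seq', '1.7');  -- s4 has no row at all
""")

# Counting: non-NA query -> sum -> NA row with coalesce() defaults.
NA_COUNT_SQL = """
WITH genomic_numerical_query AS (
    SELECT hugo_gene_symbol || profile_type AS attributeId,
           alteration_value AS value,
           count(*) AS cnt
    FROM genetic_alteration_derived
    WHERE alteration_value != 'NA'
      AND profile_type = :profile_type
      AND hugo_gene_symbol = :gene
    GROUP BY hugo_gene_symbol, profile_type, alteration_value
),
genomic_numerical_sum AS (
    SELECT attributeId, sum(cnt) AS genomic_numerical_count
    FROM genomic_numerical_query
    GROUP BY attributeId
)
SELECT coalesce((SELECT attributeId FROM genomic_numerical_sum LIMIT 1),
                :gene || :profile_type) AS attributeId,
       'NA' AS value,
       (SELECT count(*) FROM sample_derived)
       - coalesce((SELECT genomic_numerical_count
                   FROM genomic_numerical_sum LIMIT 1), 0) AS cnt
"""

# Filtering: LEFT JOIN samples against non-NA values; unmatched samples are NA.
NA_SAMPLES_SQL = """
SELECT DISTINCT sd.sample_unique_id
FROM sample_derived sd
LEFT JOIN (
    SELECT sample_unique_id, alteration_value
    FROM genetic_alteration_derived
    WHERE alteration_value != 'NA'
      AND profile_type = :profile_type
      AND hugo_gene_symbol = :gene
) AS q ON sd.sample_unique_id = q.sample_unique_id
WHERE q.alteration_value IS NULL
ORDER BY sd.sample_unique_id
"""


def na_count_row(gene, profile_type):
    """Return the (attributeId, 'NA', count) row for one gene and profile."""
    params = {"gene": gene, "profile_type": profile_type}
    return conn.execute(NA_COUNT_SQL, params).fetchone()


def na_samples(gene, profile_type):
    """Return the ids of samples that have no non-NA value."""
    params = {"gene": gene, "profile_type": profile_type}
    return [row[0] for row in conn.execute(NA_SAMPLES_SQL, params)]


print(na_count_row("TP53", "rna_seq"))  # two of the four samples are non-NA
print(na_count_row("EGFR", "rna_seq"))  # empty non-NA query, defaults kick in
print(na_samples("TP53", "rna_seq"))    # stored 'NA' and missing rows alike
```

For EGFR the non-NA query is empty, so both coalesce() calls fall back to their defaults and the NA count equals the total sample count. In the filtering query, s2 (stored 'NA') and s4 (no row) are both returned, because the stored NA is removed from the right side of the JOIN before joining.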

Thank you for reading this far. Hope this is helpful. ;-)

@fuzhaoyuan fuzhaoyuan force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch 2 times, most recently from 5d2874c to d744f1d Compare September 23, 2024 20:35
haynescd previously approved these changes Sep 25, 2024
@haynescd haynescd left a comment

lgtm

<if test="dataFilterValue.start != null or dataFilterValue.end != null">
<choose>
<when test="dataFilterValue.start == dataFilterValue.end">
AND abs(

Let's think about whether we can cast to decimal and replace this.

@fuzhaoyuan fuzhaoyuan self-assigned this Sep 26, 2024
@fuzhaoyuan fuzhaoyuan force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch from adc4179 to dea9647 Compare September 27, 2024 16:38
@alisman alisman force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch from dea9647 to 437cd1f Compare September 27, 2024 17:53
@fuzhaoyuan fuzhaoyuan force-pushed the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch from e114db0 to 3a8536d Compare October 2, 2024 16:24
sonarcloud bot commented Oct 2, 2024

@alisman alisman merged commit ceed01f into demo-rfc80-poc Oct 2, 2024
14 of 19 checks passed
@alisman alisman deleted the demo-rfc80-poc-na-count-for-genomic-and-generic-assay branch October 2, 2024 17:39