Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

all values for pos in vcfRecord are strings #97

Closed
cmungall opened this issue May 27, 2024 · 8 comments
Closed

all values for pos in vcfRecord are strings #97

cmungall opened this issue May 27, 2024 · 8 comments

Comments

@cmungall
Copy link
Member

pos should be an int:
https://phenopacket-schema.readthedocs.io/en/latest/variant.html#vcfrecord

But all the values here are strings

(found by checking using linkml-validate)

E.g

find . -name "*.json" -exec grep -h '"pos"' {} \;  | head
                  "pos": "124756209",
                  "pos": "124758482",
                  "pos": "124756540",
                  "pos": "124758482",
                  "pos": "124756540",
                  "pos": "124762120",
                  "pos": "124756704",
                  "pos": "124762120",
                  "pos": "124756540",
                  "pos": "124758482",
@ielis
Copy link
Member

ielis commented May 27, 2024

Hi @cmungall

I think there is no bug here. The Phenopacket Schema defines VcfRecord as following:

message VcfRecord {
  string genome_assembly = 1;
  string chrom = 2;
  uint64 pos = 3;
  string id = 4;
  string ref = 5;
  string alt = 6;
  string qual = 7;
  string filter = 8;
  string info = 9;
}

Note uint64 as the pos field. Due to some constraints related to Javascript, protobuf encodes uint64 into str when serializing to JSON.

This is also why the JSON-schema of phenopacket-tools requires pos to be a str when in JSON/YAML format.

So, it is a bit odd, but I think str is correct, if protobuf is the source of truth.

The only issue is... ... that RTD requires an int 😱

@cmungall
Copy link
Member Author

I thought we agreed the documentation was the SoT? It clearly states positions are integers.

IMO this is really odd. I'd strongly recommend merging this PR and before the current behavior becomes enshrined. For the protobuf users (surely a minority? Everyone uses json/jsonschema/pydantic) we can just have a minimal shim layer.

@cmungall
Copy link
Member Author

Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine. But I think freeing ourselves from limitations of protobuf is best

@pnrobinson
Copy link
Member

@cmungall
thanks for finding this. I am not sure what to do since the software is now pretty entrenched and it is working in many places.
In any case, the place to fix it, if we want to, is not here, it would be upstream in the library code! I think that given the software that we are now using for about 5 other projects, this is probably something we might need to live with, or is there some easy solution that will not cause other issues? @ielis

@cmungall
Copy link
Member Author

cmungall commented May 28, 2024 via email

@ielis
Copy link
Member

ielis commented May 28, 2024

I am not worried too much about our use cases - we can update our projects. I worry more about the public use and I think we should do what is allowed by the semantic versioning in order to fix errors and bugs.

It is important to decide what is the SoT, I get it wrong all the time. Then, regardless of choosing RTD or protobuf, we have plenty issues that should be fixed, some of them waiting for resolution for almost 2 years. We should find a way how to "touch" the schema, this is what I perceive as a very important issue. Nothing in the software world is static. Nothing.

In this particular case, I think that fixing the issue will be a "breaking" change. Either update the RTD to str or drop Protobuf. I think it would be better to change the RTD and to free ourselves from Protobuf from Phenopacket Schema v3.0.0, after discussing with GA4GH people.


Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine.

To the best of my knowledge, uint32 is serialized to JSON in the same way as uint32 - as a str. I ran a unit test in Java, checked the output, and observed a str despite the Protobuf documentation advertising that int32, fixed32, uint32 are mapped to a number.

In other words, running this:

package org.phenopackets.schema.v2;

import com.google.protobuf.util.JsonFormat;
import org.ga4gh.vrsatile.v1.VcfRecord;
import org.junit.jupiter.api.Test;

public class VcfRecordPosTest {

    @Test
    public void printPos() throws Exception {
        VcfRecord record = VcfRecord.newBuilder()
                .setChrom("chr1")
                .setPos(123_456)
                .setRef("C")
                .setAlt("G")
                .build();
        String json = JsonFormat.printer().print(record);
        System.err.println(json);
    }
}

produces this:

{
  "chrom": "chr1",
  "pos": "123456",
  "ref": "C",
  "alt": "G"
}

@cmungall
Copy link
Member Author

cmungall commented May 28, 2024 via email

@pnrobinson
Copy link
Member

We discussed this at length two years ago -- I think I saw the same thing when I was working on the JSON Schema. Our decision was basically that while it is not elegant, there is zero chance that anybody will actually treat the position as a string and so it is not a biggie.
I think that it would be more productive to figure out what we need to do for version 3 of the Schema -- for instance, it would be easier to use linkML with a "fresh" project. I think that what we need now is really implementations and algorithms and analyses, and I think it would create a lot of unnecessary churn if we change the file structure right now! I will close this, but let's meet to plan for the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants