Fill missing levels in variables for goat-cli #4

Open
Euphrasiologist opened this issue May 4, 2022 · 0 comments

In pulling the public JSON:

curl -X 'GET' \
'https://goat.genomehubs.org/api/v0.0.1/resultFields?result=taxon&taxonomy=ncbi' \
-H 'accept: application/json' > vars.json 2> /dev/null

cat vars.json | jq

This returns all of the variables, which is nice. However, the variables listed below have no enum in their constraint, which hinders some useful parsing in goat-cli. Full list:

  • bioproject
  • biosample
  • busco_lineage
  • in_progress
  • insdc_open
  • insdc_submitted
  • published
  • sample_acquired
  • sample_collected
  • sample_collected_by
  • sample_sex
  • sex_determination

To be explicit, biosample (rendered in md) has a length constraint of 32, but no enumerated values:

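| group | name | constraint | display_group | organelle | separator | source | source_url_stub | type | display_level | display_name | key | summary | traverse | traverse_direction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| taxon | biosample | {len: 32} | assembly | nucleus | [;] | NCBI Datasets | https://www.ncbi.nlm.nih.gov/assembly/ | keyword | 2 | Biosample | biosample | [list] | list | up |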

whereas family_representative does:

| group | name | display_group | display_level | constraint | summary | traverse | traverse_direction | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| taxon | family_representative | target_lists | 2 | {enum: [asg, cbp, ebpn, cfgp, dtol, ebpn, endemixit, erga, eurofish, gaga, squalomix, metainvert, vgp, agi, arg, gap, gbr, omg, tsi, b10k]} | list | list | up | keyword |
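
The difference between these two entries can be checked programmatically. The following is a minimal Python sketch, not goat-cli code: it assumes the field metadata has been loaded into a dict keyed by field name, which may not match the actual layout of vars.json. `FIELDS` and `fields_without_enum` are hypothetical names used only for illustration.

```python
# Hypothetical excerpt of field metadata, keyed by field name. Only the
# "type" and "constraint" values are copied from the two entries above;
# the real layout of vars.json may differ.
FIELDS = {
    "biosample": {"type": "keyword", "constraint": {"len": 32}},
    "family_representative": {
        "type": "keyword",
        # enum list shortened here for brevity
        "constraint": {"enum": ["asg", "cbp", "ebpn", "cfgp", "dtol"]},
    },
}


def fields_without_enum(fields):
    """Return the sorted names of fields whose constraint has no enum list."""
    missing = []
    for name, meta in fields.items():
        # Treat a missing or null constraint the same as an empty one.
        constraint = meta.get("constraint") or {}
        if not constraint.get("enum"):
            missing.append(name)
    return sorted(missing)


print(fields_without_enum(FIELDS))  # ['biosample']
```

Running the same check over the full response should reproduce the list of twelve variables above, provided the assumed layout is adapted to the real one.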

As you mentioned @rjchallis:

"But part of the problem here is that it doesn't make sense to use an enum to restrict the input values for fields like bioproject and biosample as the potential list is so long. I think a better solution may be to apply a regex constraint on these fields, or to export the list of unique values from the index (this can be cached so only needs to be generated once per release) either as part of this endpoint or something similar to the sources report that includes counts per value."
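
To make the regex suggestion concrete, here is a rough Python sketch of how a client could validate a value against a constraint. The `enum` and `len` keys appear in the metadata above; the `regex` key is hypothetical (the API does not currently return one), `len` is assumed to be a maximum length, and the pattern is only an illustration of the usual INSDC BioSample accession shape. `is_valid` and `BIOSAMPLE` are made-up names.

```python
import re


def is_valid(value, constraint):
    """Check one value against a constraint dict.

    "enum" and "len" are keys seen in the field metadata; "regex" is a
    hypothetical key illustrating the proposal, and "len" is assumed to
    mean a maximum length.
    """
    if "enum" in constraint and value not in constraint["enum"]:
        return False
    if "len" in constraint and len(value) > constraint["len"]:
        return False
    if "regex" in constraint and re.fullmatch(constraint["regex"], value) is None:
        return False
    return True


# Illustrative pattern only; the real constraint would come from the API.
BIOSAMPLE = {"len": 32, "regex": r"SAM[NED][A-Z]?[0-9]+"}

# "SAMEA0000001" is a made-up, accession-shaped string, not a real record.
print(is_valid("SAMEA0000001", BIOSAMPLE))      # True
print(is_valid("not-an-accession", BIOSAMPLE))  # False
```

With a constraint like this, goat-cli could reject malformed input for fields such as bioproject and biosample without needing the full list of values.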
