
Human readable names for queries #13

Open
chrisgorgo opened this issue Jan 11, 2016 · 20 comments

Comments

@chrisgorgo

I was wondering if we could switch to a more user-friendly naming scheme for queries. Currently, long uninformative hashes are used, which makes code that uses them (for example via nidm-api) difficult for humans to parse.

We can alternatively wait until singularity ;)

@vsoch
Member

vsoch commented Jan 11, 2016

I also like this idea - for example, for experiment factory experiments we have our experiment "tag" as unique ID, and it is very intuitive and easy to find what you are looking for. It would be easy to have some kind of version in the name, in the case of multiple of the same query with different versions.

@nicholsn
Contributor

I'm not sure it matters since it is just the filename; as soon as you open the file, there are all the human-readable titles and descriptions. I kinda like the consistency of using uuids and pushing all the readable information into the file metadata. It seems redundant to have that info duplicated.

@vsoch
Member

vsoch commented Jan 11, 2016

Let's say you are a developer and you want to update your query. You go to the repo and there are 100 of them...

@chrisgorgo
Author

Exactly, it makes reading code harder.


@nicholsn
Contributor

personally, after there are 100 of them I wouldn't remember the file name anyways and would end up parsing the files and printing out titles and descriptions to find what I was looking for.

nidm list queries --> prints table of filename, title, description
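Such a listing command could be a few lines of code. A minimal sketch, assuming each query lives in a JSON file carrying `title` and `description` fields (this file layout is an assumption for illustration, not the actual nidm-query schema):

```python
import glob
import json
import os

def list_queries(query_dir):
    """Print a filename / title / description row for each query file.

    Assumes each *.json file has 'title' and 'description' keys;
    this layout is illustrative, not the actual nidm-query schema.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(query_dir, "*.json"))):
        with open(path) as fp:
            meta = json.load(fp)
        rows.append((os.path.basename(path),
                     meta.get("title", ""),
                     meta.get("description", "")))
    for name, title, desc in rows:
        print(f"{name:40} {title:30} {desc}")
    return rows
```

With something like this behind `nidm list queries`, the filename itself would matter less, since the table is what a human would actually scan.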


@chrisgorgo
Author

What is easier to understand:

results = do_query("7950f524-90e8-4d54-ad6d-7b22af2e895d")

or

results = do_query("get_peak_coordinates")

@vsoch
Member

vsoch commented Jan 11, 2016

I am in agreement with @chrisfilo. It is a detail that will make development much easier. Since the files need to exist in the same folder, that alone ensures uniqueness in naming.

@nicholsn
Contributor

well I don't understand either without looking at the query, but I see what you mean insofar as the filenames in the query library are curated to be informative. I still think including relevant metadata is important to describe the query when it's not being accessed programmatically.

Initially, I had a metadata file that indexes all the queries called meta.ttl (pardon the ttl, it could be json: https://github.com/nicholsn/niquery/blob/master/niquery/sparql/meta.ttl), and was thinking that queries would be accessed more interactively so you could hide the uuid.


@chrisgorgo
Author

Yes, we should keep the metadata, but using human-readable names will make developers' lives easier.

@nicholsn
Contributor

sure, go for it. can you two decide on a recommended style?

for example all-lowercase-with-hyphens-and-version-1.0.0.json


@vsoch
Member

vsoch commented Jan 11, 2016

I would say use underscores, because you can't use hyphens in python function names.

  all_lowercase_with_version_1.0.0.json
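A quick check of the reasoning above, using Python's own identifier rules (purely illustrative):

```python
# Hyphens disqualify a name as a Python identifier, underscores
# do not - hence the suggestion of underscored query names.
print("get-peak-coordinates".isidentifier())  # False
print("get_peak_coordinates".isidentifier())  # True
```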

@satra

satra commented Jan 11, 2016

this sounds just like our terms discussions. readable names simply don't scale. i would really like to see us build tools that query the metadata quickly or provide interactive interfaces for editing queries. perhaps we are not at the point yet, but there is a reason why issues on github, questions on stack overflow and google docs all don't have readable names. (stack overflow uses a slug for readability, but the id is what makes things unique)

so instead of punting the interaction between scalable and readability, i think we should put in the effort during the upcoming sprint to have tools that allow us to address this (independent of what the query url looks like). for example, web service/api/command line tool for querying queries.

@nicholsn
Contributor

+1


@chrisgorgo
Author

Issues on github, questions on stack overflow, and google docs are all examples of instances. Indeed, that's where numeric identifiers make sense. However, we are talking about queries, which are considered methods. Those should have human readable names, and all of the examples you gave opt for such a solution. For example, the path for editing a comment on stackoverflow is:

http://stackoverflow.com/posts/34729781/edit

it is NOT

http://stackoverflow.com/posts/34729781/39834-343-683

where 39834-343-683 would correspond to the edit function. This is just not practical. Similar examples can be given for programming languages where functions and methods have human readable names.

@satra

satra commented Jan 11, 2016

isn't a query-id simply an instance of a query? if so, all i'm suggesting is that we provide something like:

nidm.nidash.org/query/query-id/edit

as a web service, or something equivalent for other things

i don't think we are talking of queries as methods here (i can see how it can be seen as such - but i don't think of it that way). anyone can create a query and we will have a collection of queries that an api/web service can call, but they are still instances (they have versions, they will apply to certain versions of the model, they will only work on certain versions of data, etc.).

@vsoch
Member

vsoch commented Jan 11, 2016

The nidm-api by default serves a REST API, and the current format to view a query is:

  http://localhost:8088/api/7950f524-90e8-4d54-ad6d-7b22af2e895d

and this generates:

(screenshot of the query response omitted)

The issue still comes up about how the developer finds the query_id. To have to do that extra step every time, and to have to provide more methods to look up / search with the API does not make sense when we can just use strings with underscores that a human can remember.

There are two use cases right now for the API. Either someone uses the REST API and must make a call like the above to retrieve the query and do something with it, or the developer uses our python tool to do the query. The second looks like this:

First we retrieve all queries in a dictionary, with lookup key the unique id

  from nidm.query import Queries, do_query
  all_queries = Queries()
  results = Queries(components="results")

Then we would need to just know the qid. This adds an extra annoying step to figuring out the qid every single time.

  # Select a query from results that we like
  qid = "7950f524-90e8-4d54-ad6d-7b22af2e895d"

  # Here is a ttl file that I want to query, nidm-results
  ttl_file = "nidm.ttl"

  result = do_query(ttl_file=ttl_file,query=results.query_dict[qid]["sparql"])

The result is a pandas data frame. I would even suggest we simplify the above further to be more like what @chrisfilo suggested:

  results = do_query("get_peak_coordinates", ttl_file=ttl_file)

In the eyes of the developer, the query is a method. It is run to retrieve a particular result object. The purpose of the nidm-api, period, is to extend NIDM to developers. This means making it as easy as possible for them to use. Insisting on a long string of letters and numbers with the only justification that it scales better is not logical, and in fact it makes life a lot harder for the exact audience we are intending this tool for. It also makes it harder for the people writing the query objects. If I go to the github repo now to find the "get_peak_coordinates" query - where is it? It's not intuitive. Scalability might be an issue if these things were made en masse in an automated way, but they aren't. We are going to have a limited set because they are made by humans. This means they can give them a name that makes sense. I do not see any benefit in having such cryptic names when the entire purpose is to make this more user friendly.
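A name-based lookup could simply wrap the existing uuid-keyed store. A sketch under assumptions: the `QUERY_INDEX` mapping, its contents, and `resolve_query` are hypothetical illustrations, not the actual nidm-api:

```python
# Hypothetical sketch: map human-readable names onto the existing
# uuid-keyed query store, so callers never have to see the uuid.
QUERY_INDEX = {
    "get_peak_coordinates": {
        "uid": "7950f524-90e8-4d54-ad6d-7b22af2e895d",
        "sparql": "SELECT ?coord WHERE { ?peak prov:atLocation ?coord }",
    },
}

def resolve_query(name):
    """Return the SPARQL text for a readable query name."""
    try:
        return QUERY_INDEX[name]["sparql"]
    except KeyError:
        known = ", ".join(sorted(QUERY_INDEX))
        raise KeyError(f"unknown query {name!r}; known queries: {known}")
```

A `do_query("get_peak_coordinates", ttl_file=ttl_file)` call would then just resolve the name first and run the SPARQL as before.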

@satra

satra commented Jan 11, 2016

given the nidm-api, not just around nidm-results, the set of possible queries one can make is immense, especially as we allow people to fork/modify queries (by whatever interface - not necessarily a script).

The issue still comes up about how the developer finds the query_id.

in any scenario where the number of queries exceeds a handful of known ones, a developer will have to look into the metadata of a query to find the query-id or run a query to find a query-id using some matching criteria. get_peak_coordinates only works when there is one. as an example, what if i simply wanted to constrain this to FSL results or to FSL results processed with FSL > 5.0.5, or any other set of constraints. in any such event, the number of queries increases at a rapid rate (a la jsfiddle or gists). and this is just speaking about get_peak_coordinates. just the number of queries that i have run around nidm results for freesurfer coupled with other phenotypic data would go beyond a handful.

i completely agree that if the goal of nidm-api is to only expose a finite set of specific queries, those should simply be methods of the API, but if the goal is to run a generic method such as do_query and expect a set of different datatypes (arrays, dataframes, graphs) depending on the query, then we really have to think further. in fact, in the former scenario do_query itself should be called something else like get_peak_coordinates. in the latter scenario, do_query needs to be able to return different datatypes. i personally think that nidm-api can only be as generic as the gdata api; anything more specific (such as get_peak_coordinates) becomes modules on top of the base api.

if a developer has to use a query the developer needs to understand the nuances of the query, and no amount of human readable name is going to help the developer. that is why i predicated my previous post saying, independent of how the query-id looks we really need to have tools to search through the set of queries and for forking/editing said queries.

i'm completely for the api being easy for developers. what i'm speaking against is the notion that naming a few queries to be human readable is the solution to the problem.

@vsoch
Member

vsoch commented Jan 11, 2016

in any scenario where the number of queries exceed a handful known ones, a developer will have to look into the metadata of a query to find the query-id or run a query to find a query-id using some matching criteria. get_peak_coordinates only works when there is one. as an example, what if i simply wanted to constrain this to FSL results or to FSL results processed with FSL > 5.0.5, or any other set of constraints.

Isn't that what variables are for? The queries can have specific variables.
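For instance, a single parameterized query could cover the FSL-version constraints above instead of many near-duplicate fixed queries. A sketch, assuming a simple template mechanism; the SPARQL text and variable names are illustrative, not actual nidm-query content:

```python
from string import Template

# Illustrative sketch: one parameterized query instead of many
# near-duplicates. Template text and variable names are assumptions.
PEAKS_BY_SOFTWARE = Template("""
SELECT ?peak ?coord WHERE {
    ?peak prov:atLocation ?coord .
    ?software nidm_softwareName ?name .
    FILTER(?name = "$software")
}
""")

query = PEAKS_BY_SOFTWARE.substitute(software="FSL")
print('"FSL"' in query)  # True
```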

in the latter scenario, do_query needs to be able to return different datatypes.

The datatype returned is not integrated into the query, the user selects datatype to be returned as a variable of the do_query function in the nidm-api. The API always will retrieve the output of the query in some format, and parse to what the user wants.

if a developer has to use a query the developer needs to understand the nuances of the query,

I disagree. If I am a developer all I need is to know the data that I want to retrieve from the input file (such as turtle nidm-results) and the arguments that I can give.

we really need to have tools to search through the set of queries and for forking/editing said queries.

I think that is why we have them on github - to implement our own version of forking / editing seems like re-inventing the wheel. I agree a search function added to the nidm-api to search through the query data structures would be neat.

i'm completely for the api being easy for developers. what i'm speaking against is the notion that naming a few queries to be human readable is the solution to the problem.

I don't think I am suggesting it is a "solution," but it's making it just a little bit harder for people who just want to query some nidm-object to retrieve the data they need.

@nicholsn
Contributor

Isn't that what variables are for? The queries can have specific variables.

true, but you may want to lock in a query to fixed parameters rather than making it a template.

The datatype returned is not integrated into the query, the user selects datatype to be returned as a variable of the do_query function in the nidm-api. The API always will retrieve the output of the query in some format, and parse to what the user wants.

If you write a CONSTRUCT query a graph is returned, ASK returns a boolean, and SELECT returns a table... I think that might be what @satra is referring to. Also, the format of the output should/can be handled using content negotiation on the server side. The client shouldn't necessarily be required to handle this (e.g., csv, ttl, json-ld), but in some cases it makes total sense (e.g., dataframes).

I disagree. If I am a developer all I need is to know the data that I want to retrieve from the input file (such as turtle nidm-results) and the arguments that I can give.

This is true for an API endpoint, but queries feel a bit more malleable than this.

... It's kind of interesting: should a query really be thought of as a method/function, or something else? What I had in mind for nidm-api is something much more flexible and dynamic, but it sounds like what you are after is a traditional API that has a very limited scope for specific functionality.

I think that is why we have them on github - to implement our own version of forking / editing seems like re-inventing the wheel. I agree a search function added to the nidm-api to search through the query data structures would be neat.

right, github could be the backend, but what about a frontend for forking and editing queries like:
http://xiphoid.biostr.washington.edu:8080/QueryManager/QueryManager.html#qid=71

I don't think I am suggesting it is a "solution," but it's making it just a little bit harder for people who just want to query some nidm-object to retrieve the data they need.

I guess it's a tradeoff; I would suspect anyone who 'just wants to query some nidm-object' wouldn't be happy with either uuids or filenames and would want a tool to help them sift through - but for now we have like 6 queries... so...

@cmaumet
Member

cmaumet commented Jan 19, 2016

Very interesting discussion!

in any scenario where the number of queries exceed a handful known ones, a developer will have to look into the metadata of a query to find the query-id or run a query to find a query-id using some matching criteria. get_peak_coordinates only works when there is one. as an example, what if i simply wanted to constrain this to FSL results or to FSL results processed with FSL > 5.0.5, or any other set of constraints.

This relates to one point that is not entirely clear for me right now: how do we handle variants of the same query within nidm-query? For example, the get_peak_coordinates query has already existed in several "flavours", e.g. also returning optional peak fwer, also returning statistic type, also returning contrast name... To be extreme, we could even go all the way to the top of the tree and include the type of HRF that was used... The question is where do we stop and how do we decide which of those variants is the one we want in nidm-query? Or do we want all of them?

Isn't that what variables are for? The queries can have specific variables.

@vsoch: this could be part of the solution, but I am not clear how specific variables could be defined for a given query. Could you give me more details or, even better, a small example of what you had in mind?
