Standardizing what is returned by functions #218
Comments
Thanks @Aariq for separating this topic! Prioritizing data frames is a good idea! Regarding the |
In cases when the data is just one dataframe per query, why not just one tidy dataframe with a |
So you are suggesting that e.g. the |
I don't think we should force users to learn For I'd prefer single tidy dataframes for all query functions. I haven't delved into all of the query functions deep enough to know if that makes sense for all of them (e.g. the functions that return complex nested lists), but many of the structures listed above can be easily converted to tidy dataframes. I think if you need |
For the A related question is what we do with |
Thanks @Aariq for the nice overview. I totally agree to keep the output of the |
I agree with at least one row for each query string, even if a query string returns no results. I'd also prefer single tidy |
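A minimal sketch of the "at least one row per query" idea, assuming the per-query results already exist as a named list of data frames. The helper `bind_queries()`, the `id` column, and the example values are hypothetical and not part of webchem. Base R data frames are used to keep the sketch dependency-free; `tibble::as_tibble()` could wrap the result.

```r
# Hypothetical per-query results: one data frame per query string.
# The second query matched nothing, so its data frame has zero rows.
results <- list(
  "glyphosate" = data.frame(id = c("A1", "A2"), stringsAsFactors = FALSE),
  "not-a-chemical" = data.frame(id = character(0), stringsAsFactors = FALSE)
)

# Bind all results into one data frame with a `query` column.
# Queries without results are padded with a single all-NA row,
# so every query string appears at least once in the output.
bind_queries <- function(x) {
  rows <- lapply(names(x), function(q) {
    df <- x[[q]]
    if (nrow(df) == 0) {
      df <- df[NA_integer_, , drop = FALSE]  # one NA row, same columns
      rownames(df) <- NULL
    }
    data.frame(query = q, df, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

tidy <- bind_queries(results)
```

With this shape, `tidy[is.na(tidy$id), "query"]` lists the queries that returned nothing.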
Yes, I agree that we should treat the APIs with care, and it's probably better to query |
I think we agree that query functions should return tidy tibbles whenever easily possible. So I think next steps for this issue are to identify functions that can easily return tidy tibbles that don't already, and create an issue or multiple issues that contributors can take on. For the functions that return more complex objects, we might want to put those on the back burner for now and hopefully gain some inspiration as we tidy up outputs from other functions. Also, a lot of these things will be breaking changes for existing code if people update webchem, so we should take into consideration how to handle that. Honestly I'm not sure the best way to deal with that, but changes to data structure output by functions should at the very least be well documented in NEWS.md. |
I think it is safe to start with |
@stitam This sounds good to me. So any changes to the output structure of functions should be accompanied by a warning for now (I think dplyr is a good package to look for inspiration. It has changed a lot and done a good job at reminding users to update their code). Users can get the old behavior with something like |
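One possible shape for such a transition period, as a rough sketch only. The function `fake_query()` and its `legacy` argument are made up for illustration; they are not webchem functions and are not meant to reproduce the option suggested in the comment above.

```r
# Hypothetical query function whose output structure has changed.
# `nchar()` stands in for data that would come from a web service.
fake_query <- function(query, legacy = FALSE) {
  values <- nchar(query)
  if (legacy) {
    # Old structure: a named list with one element per query
    return(stats::setNames(as.list(values), query))
  }
  # New structure: warn so that users know to update their code
  warning(
    "fake_query() now returns a data frame; ",
    "use `legacy = TRUE` to get the old list output.",
    call. = FALSE
  )
  data.frame(query = query, value = values, stringsAsFactors = FALSE)
}

old <- fake_query(c("a", "bb"), legacy = TRUE)
new <- suppressWarnings(fake_query(c("a", "bb")))
```

A warning on every call can get noisy, so a real implementation would probably want to emit it only once per session.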
To continue our discussion to make the package more consistent, some thoughts here on the initial list at the top by @Aariq:
Get ID functions
I think the
query = '1071-83-6'
webchem::get_chebiid(query) # OK
webchem::get_cid(query) # OK
webchem::get_csid(query, apikey = apikey) # (almost) OK
webchem::get_etoxid(query, from = 'cas') # OK
webchem::get_wdid(query) # OK
@stitam I have found that verbose = TRUE doesn't return a message for get_csid(). On purpose?
Query database functions
I think we haven't really agreed on something here. In my opinion it would be best to return a list of tibbles or lists (in case it's not a good idea to turn the data into a tibble). As described in #295, we should also discuss how to handle
Object Orientation
Some
PS: Once we have agreed on something, I would like to implement #289 and #295 |
"I have found that verbose = TRUE doesn't return a message for get_csid(). On purpose?" No. When I joined webchem I started with the ChemSpider functions; I didn't see what verbose was for and may have accidentally removed it. But I suggest keeping it that way for now, because I have a major edit for ChemSpider that I didn't want to open a PR for before v1.1. |
Regarding functions that actually return chemical data: I don't like the idea of returning a single tibble, but I do like the idea of returning a list of tibbles. The most atomised solution would be a tibble for each property. The value here could be that each property needs different columns, e.g. molecular weight may only have a |
I also meant that in my comment. Sorry for the confusion: A list of Not sure about a |
A tibble for each property doesn't make sense---that might as well be a named vector. Whenever data is rectangular, I think a tibble should be returned as that is easier to work with than nested lists. If all the data is rectangular, I'm ambivalent about whether multiple queries should return a list of tibbles or a single tibble with a |
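For rectangular data the two shapes are cheap to convert between, which may make the choice less critical. A small base R sketch, with made-up column names and values:

```r
# A single data frame with a `query` column (illustrative values only)
long <- data.frame(
  query = c("a", "a", "b"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Single data frame -> named list of data frames, one per query
as_list <- split(long[, "value", drop = FALSE], long$query)

# Named list of data frames -> single data frame with a `query` column
back <- do.call(
  rbind,
  unname(Map(
    function(q, df) data.frame(query = q, df, stringsAsFactors = FALSE),
    names(as_list),
    as_list
  ))
)
rownames(back) <- NULL
```

Note that `split()` orders the list by the sorted query values, so the original query order is only preserved here because the input was already sorted.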
I agree @Aariq , as long as one can do Generally we don't have to update the functions immediately and maybe it's good to talk again before we change anything that big. |
There is a huge amount of variation in what object type and structure is returned by get_* and database querying functions in webchem. Some of this variation might be necessary (e.g. because the data contained in a database is varied), but I think it would be ideal for functions to return data.frames whenever possible.
Here's a quick and dirty overview of what formats are currently returned by functions in webchem:
Get ID functions
get_chebiid: list of dataframes (one per query)
get_cid: list of vectors
get_csid: dataframe (one row per query?)
get_etoxid: dataframe with column for query.
get_wdid: with match = "all", list of a single vector with attribute. For match = "first", a data frame
Query database functions
aw_query: nested list (one list per query)
chebi_comp_entity: complex nested list (each compound has a nested list with some elements being dataframes and others being lists of vectors)
chebi_lite_entity: list of dataframes (even with a single query)
ci_query: complex nested list
cs_compinfo: a dataframe with one row per query
cts_compinfo: complex list
etox_basic: complex list
etox_targets, etox_tests: list with a dataframe and a source URL (character vector) nested under each query
fn_percept: named vector (one element per query)
nist_ri: dataframe with column for query
pan_query: nested list. For each query, the list seems like all length 1 vectors. If that's always true, there's no reason for this not to be a data frame.
pc_prop: data.frame
pc_synonyms: list of character vectors (one vector per query) unless choices != NULL, then a dataframe with queries as row names
srs_query: list of dataframes (one data frame per query) which includes list columns
Ideas
Something that might help us think through this is if someone came up with a "homework" problem that needs to be solved with multiple data sources and ID translations. Then we can each share our solutions (e.g. as vignettes in our own forks of webchem) and do some thinking on what would result in the least friction for the most users. The results of this could be a vignette for our package and/or supplemental for the webchem publication.
Whenever possible, I think we should:
tibbles (originally: data frames) whenever possible (except maybe functions that always return one element per query, which might be more usable as named vectors).
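To illustrate the pan_query point from the overview: if every element under a query really is a length 1 vector, flattening the nested list is short. This is a sketch on a made-up nested list, not actual pan_query output; the element names `name` and `mass` are assumptions.

```r
# Hypothetical nested result: one list per query, all elements length 1
nested <- list(
  "cmpd-1" = list(name = "Compound one", mass = 100.5),
  "cmpd-2" = list(name = "Compound two", mass = 200.25)
)

# Check the assumption before flattening: every element has length 1
all_scalar <- all(vapply(
  nested,
  function(x) all(lengths(x) == 1L),
  logical(1)
))
stopifnot(all_scalar)

# One row per query; the list elements become columns
flat <- do.call(rbind, lapply(names(nested), function(q) {
  data.frame(query = q, nested[[q]], stringsAsFactors = FALSE)
}))
```

If some queries carry elements that others lack, the `rbind()` step would fail, so a real function would need to align the columns first.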