Standardizing what is returned by functions #218

Aariq · 2020-02-26T22:31:46Z

There is a huge amount of variation in what object type and structure is returned by get_* and database querying functions in webchem. Some of this variation might be necessary (e.g. because the data contained in a database is varied), but I think it would be ideal for functions to return data.frames whenever possible.

Here's a quick and dirty overview of what formats are currently returned by functions in webchem:

Get ID functions

get_chebiid: list of dataframes (one per query)
get_cid: list of vectors
get_csid: dataframe (one row per query?)
get_etoxid: dataframe with column for query.
get_wdid: with match = "all", list of a single vector with attribute. For match = "first", a data frame

Query database functions

aw_query: nested list (one list per query)
chebi_comp_entity: complex nested list (each compound has a nested list with some elements being dataframes and others being lists of vectors)
chebi_lite_entity: list of dataframes (even with a single query)
ci_query: complex nested list
cs_compinfo: a dataframe with one row per query
cts_compinfo: complex list
etox_basic: complex list
etox_targets, etox_tests: list with a dataframe and a source URL(character vector) nested under each query
fn_percept: named vector (one element per query)
nist_ri: dataframe with column for query
pan_query: nested list. For each query, the list seems like all length 1 vectors. If that's always true, there's no reason for this not to be a data frame.
pc_prop: data.frame
pc_synonyms: list of character vectors (one vector per query) unless choices != NULL, then a dataframe with queries as row names
srs_query: list of dataframes (one data frame per query) which includes list columns

Ideas

Something that might help us think through this is if someone came up with a "homework" problem that needs to be solved with multiple data sources and ID translations. Then we can each share our solutions (e.g. as vignettes in our own forks of webchem) and do some thinking on what would result in the least friction for the most users. The results of this could be a vignette for our package and/or supplemental for the webchem publication.

Whenever possible, I think we should:

keep output consistent regardless of arguments or number of queries
prioritize ~~data frames~~ tibbles whenever possible (except maybe functions that always return one element per query, which might be more usable as named vectors).

The text was updated successfully, but these errors were encountered:

stitam · 2020-02-27T21:02:12Z

Thanks @Aariq for separating this topic! Prioritizing data frames is a good idea! Regarding the get_ functions, you convinced me @andschar , I agree with the list of data frames instead of a single data frame. Also I think it would be great to force the same columns and give the get_ outputs a common class. As for the query functions, each database returns different data so I presume we won't be able to define a rigid structure there, but again a list of data frames sounds easy to handle I think. And in case of multiple ids, a nested list of data frames.

Aariq · 2020-02-27T21:04:42Z

In cases when the data is just one dataframe per query, why not just one tidy dataframe with a query column? Wouldn't that be easier to work with?

stitam · 2020-02-27T21:30:41Z

So you are suggesting that e.g. the get_() functions should output a single data frame instead of a list of data frames? I don't have a strong opinion towards either. If a get_() function returns a list of data frames, it is somewhat easier to visualise if there are multiple hits for a query as @andchar mentioned, also post processing can be performed easily by lapply() and the list of data frames can be easily combined into a single data frame when needed. However, after get_() I usually use a database specific query function, where I need a vector of ids to search for. With a list fo data frames I would have to add an aditional line of code to coerce the list of data frames into a single data frame before I use the ids for the query, which is not very convenient. If get_() returns a single data frame, I can just reference the id column and I'm done. On the other hand, if I want to do post processing with the singe data frame get_() output, I might have to split the data frame, which is again another line of code.

Aariq · 2020-02-27T22:08:38Z

I don't think we should force users to learn lapply() or the purrr package to use webchem.

For get_* functions, I'd prefer named vectors, but I realize that some of them return additional information, so tidy data frames would be consistent and easy to convert to vectors like you mentioned.

I'd prefer single tidy dataframes for all query functions. I haven't delved into all of the query functions deep enough to know if that makes sense for all of them (e.g. the functions that return complex nested lists), but many of the structures listed above can be easily converted to tidy dataframes.

I think if you need *apply() functions to make the object returned usable, then webchem probably needs to be doing more so the user can do less. In those cases maybe there needs to be multiple functions, or arguments that specify what type of data to get, or additional helper extractor functions to deal with the complex lists. I think possibly even data frames with list-columns like what srs_query() returns might be easier for a beginner/intermediate R user to figure out than complex nested lists.

stitam · 2020-02-28T09:15:06Z

For the get_* functions I'd prefer tidy data frames over named vectors because I think the extra columns help us identify query issues, e.g when we don't get an exact match or we get multiple results. Regarding single data frame output vs list of data frames, a list of dfs can be combined easily e.g. with dplyr::bind_rows(), while a single df can be split easily with split() so I think we just have to pick one and stick to it. Maybe a single data frame is more user friendly than a list of data frames.

A related question is what we do with chooser(). In a previous discussion I suggested that the get_* functions could always return all results to be easy on the servers, and then the user could plug those results into a separate function (maybe choose_id() would be a better name for it in this workflow) to reduce the number of results to one per query. True, the user could do this with their own data manipulation tools, but I would personally prefer to have a convenience function with some standard filtering options like first, best, ask, na.

andschar · 2020-02-29T07:46:06Z

Thanks @Aariq for the nice overview. I totally agree to keep the output of the get_*-functions consistent and I also prefer data.frame(s) over (named) vectors. Whether it should be one data.frame with all the results or a list of data.frame's I'm not completely sure anymore. As I mentioned in #193 a list of data.frame's could be easily bound, e.g. with data.table::rbindlist(l), dplyr::bind_rows(l) or base::do.call(rbind, l), but true, we could do this for the users as well. Hm, @stitam you are right, we should probably just pick one style. If we do so I'd like to highlight that it's important to also include query strings with no results in the data.frame.

stitam · 2020-03-01T08:45:22Z

I agree with min one row for each query string, even if a query string returns no results. I'd also prefer single tidy ~~dataframe~~ tibble instead of a list of data frames. Also, I think the get_* functions should return "all" results, and the results should be filtered by a separate function to return one result per query string. What do you think about this?

andschar · 2020-03-01T16:32:48Z

Yes I agree, to treat the APIs with care and it's probably better to query "all" results and refine them with a separate function locally. Moreover, such a function should be easy to write once all get_* functions return the same object? Hm, however, as you mentioned earlier @stitam we should think about whether the output should have at least some equal columns. E.g. there should be something like a 'score' column that looks the same for every API output to choose the "best" entry. Or, is it more clever to define not only a "get_*" S3 class, but also an API-specific S3 class according to the respective outputs. Something like class(out) <- c('get', 'chebi')?

Aariq · 2020-03-06T21:17:53Z

I think we agree that query functions should return tidy tibbles whenever easily possible.

So I think next steps for this issue are to identify functions that can easily return tidy tibbles that don't already, and create an issue or multiple issues that contributors can take on. For the functions that return more complex objects, we might want to put those on the back burner for now and hopefully gain some inspiration as we tidy up outputs from other functions.

Also, a lot of these things will be breaking changes for existing code if people update webchem, so we should take into consideration how to handle that. Honestly I'm not sure the best way to deal with that, but changes to data structure output by functions should at the very least be well documented in NEWS.md.

stitam · 2020-03-16T17:00:44Z

I think it is safe to start with get_ functions. Most of them already return data frames and tibbles are also data frames so there would be no CI issue. Once we gain inspiration for the rest, we can add an argument that controls whether to output the original structure or a tibble, then in the next release we can set that argument to default to tibble, and then in a subsequent release we can remove the old structure if we feel like it. What do you think?

Aariq · 2020-03-17T20:17:01Z

@stitam This sounds good to me. So any changes to the output structure of functions should be accompanied by a warning for now (I think dplyr is a good package to look for inspiration. It has changed a lot and done a good job at reminding users to update their code). Users can get the old behavior with something like output = c("tibble", "old").

andschar · 2020-09-29T12:27:57Z

To continue our discussion to make the package more consistent, some thoughts here on the initial list at the top by @Aariq:

Get ID functions

I think the get_*() functions are quite finished yet. They all return tibbles, no matter how many strings are queried and they all have a query column.

query = '1071-83-6'
webchem::get_chebiid(query) # OK
webchem::get_cid(query) # OK
webchem::get_csid(query, apikey = apikey) # (almost) OK
webchem::get_etoxid(query, from = 'cas') # OK
webchem::get_wdid(query) # OK

@stitam I have found that verbose = TRUE doesn't return a message for get_csid(). On purpose?

Query database functions

I think we haven't really agreed on something here. In my opinion it would be best to return a list of tibbles or lists (in case it's not a good idea to turn the data into a tibble. As described in #295, we should also discuss how to handle NA returns. Should we return an one-length NA vector or also a tibble (e.g. tibble(query = query)). I prefer the latter, because the data type would be consistent. So, we'd have only list of tibbles or (partly) lists and not lists of tibbles, NAs or (partly) lists.

Object Orientation

Some webchem functions, mostly older ones still have class() attributes assigned. Should we use this or discard OO completely? See also #295

PS: Once we have agreed on something, I would like to implement #289 and #295

stitam · 2020-09-29T21:13:45Z

"I have found that verbose = TRUE doesn't return a message for get_csid(). On purpose?" No, when I joined webchem I started with the chemspider functions and I didn't see what verbose is for and I may have accidentally removed it. But I suggest to keep it that way because I have a major edit for chemspider I just didn't want to open a PR before v1.1.

stitam · 2020-09-29T21:22:25Z

Regarding functions that actually return chemical data: I don't like the idea of returning a single tibble, but I do like the idea of returning a list of tibbles. The most atomised solution would be a tibble for each property. The value here could be that each property needs different columns, e.g. molecular weight may only have a value, dimension and reference, but e.g. nist_ri() returns a very different set of columns. If this is harmonised for each property, that would help us build powerful extractors later on as well.

andschar · 2020-09-30T14:42:31Z

Regarding functions that actually return chemical data: I don't like the idea of returning a single tibble

I also meant that in my comment. Sorry for the confusion: A list of tibbles() or sub-lists (in case data isn't made for tables (e.g. MOL-file structure). As I said for no match, a simple tibble(query = query) is fine.

Not sure about a tibble for each property. chebi_comp_entity(1) has some properties clustered together, and others as single tibble/data.frame s. Thinking about extractor-functions, we could also use S3 methods again.

Aariq · 2020-09-30T20:20:20Z

A tibble for each property doesn't make sense---that might as well be a named vector. Whenever data is rectangular, I think a tibble should be returned as that is easier to work with than nested lists. If all the data is rectangular, I'm ambivalent about whether multiple queries should return a list of tibbles or a single tibble with a query column. Personally, I'd probably use a single tibble most often. On the other hand, not all query functions return rectangular data, so a list of tibbles is probably more consistent across the package. Plus bind_rows(out_list) is easy to do

andschar · 2020-10-01T08:29:06Z

I agree @Aariq , as long as one can do bind_rows() or rbindlist() it's fine.

Generally we don't have to update the functions immediately and maybe it's good to talk again before we change anything that big.

Aariq mentioned this issue Feb 26, 2020

get_*() consistency discussion #193

Closed

Aariq added the consistent api Uniformity across functions and data sources label Feb 26, 2020

andschar mentioned this issue Mar 25, 2020

Functions handle NAs in input incorrectly #224

Closed

stitam mentioned this issue Mar 26, 2020

fixed etox_basic #229

Merged

4 tasks

Aariq mentioned this issue Sep 15, 2020

return tibble from cir_query() #289

Closed

andschar mentioned this issue Sep 29, 2020

improve bcpc_query() #295

Open

Aariq mentioned this issue Feb 15, 2022

Check for possible edge case in bcpc_query() #351

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardizing what is returned by functions #218

Standardizing what is returned by functions #218

Aariq commented Feb 26, 2020 •

edited

Loading

stitam commented Feb 27, 2020 •

edited

Loading

Aariq commented Feb 27, 2020

stitam commented Feb 27, 2020

Aariq commented Feb 27, 2020

stitam commented Feb 28, 2020

andschar commented Feb 29, 2020

stitam commented Mar 1, 2020

andschar commented Mar 1, 2020 •

edited

Loading

Aariq commented Mar 6, 2020

stitam commented Mar 16, 2020

Aariq commented Mar 17, 2020

andschar commented Sep 29, 2020 •

edited

Loading

stitam commented Sep 29, 2020 •

edited

Loading

stitam commented Sep 29, 2020

andschar commented Sep 30, 2020

Aariq commented Sep 30, 2020 •

edited

Loading

andschar commented Oct 1, 2020

Standardizing what is returned by functions #218

Standardizing what is returned by functions #218

Comments

Aariq commented Feb 26, 2020 • edited Loading

Get ID functions

Query database functions

Ideas

stitam commented Feb 27, 2020 • edited Loading

Aariq commented Feb 27, 2020

stitam commented Feb 27, 2020

Aariq commented Feb 27, 2020

stitam commented Feb 28, 2020

andschar commented Feb 29, 2020

stitam commented Mar 1, 2020

andschar commented Mar 1, 2020 • edited Loading

Aariq commented Mar 6, 2020

stitam commented Mar 16, 2020

Aariq commented Mar 17, 2020

andschar commented Sep 29, 2020 • edited Loading

Get ID functions

Query database functions

Object Orientation

stitam commented Sep 29, 2020 • edited Loading

stitam commented Sep 29, 2020

andschar commented Sep 30, 2020

Aariq commented Sep 30, 2020 • edited Loading

andschar commented Oct 1, 2020

Aariq commented Feb 26, 2020 •

edited

Loading

stitam commented Feb 27, 2020 •

edited

Loading

andschar commented Mar 1, 2020 •

edited

Loading

andschar commented Sep 29, 2020 •

edited

Loading

stitam commented Sep 29, 2020 •

edited

Loading

Aariq commented Sep 30, 2020 •

edited

Loading