Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output behavior when only the module name is asked for #179

Closed
gaow opened this issue Apr 15, 2019 · 13 comments
Closed

Output behavior when only the module name is asked for #179

gaow opened this issue Apr 15, 2019 · 13 comments

Comments

@gaow
Copy link
Member

gaow commented Apr 15, 2019

The test data-set is on: /project/mstephens/SuSiE/susie_z/output. The query is:

dscout = dscquery('r_compare_data_signal', targets = 'sim_gaussian.pve score_susie')

The expected behavior is to see score_susie.output.file column in the output. But we see

> head(dscout)
  DSC sim_gaussian.pve
1   1             0.05
2   1             0.05
3   1             0.05
4   1             0.05
5   1             0.05
6   1             0.05

Notice running dsc-query the output seems fine:

> dsc-query r_compare_data_signal -o test.csv --target "sim_gaussian.pve score_susie" --force
> head test.csv

DSC,sim_gaussian.pve,score_susie.output.file
1,0.05,score_susie/small_data_151_sim_gaussian_25_get_sumstats_2_susie_bhat_3_score_susie_2
1,0.05,score_susie/small_data_152_sim_gaussian_25_get_sumstats_2_susie_bhat_3_score_susie_2
1,0.05,score_susie/small_data_153_sim_gaussian_25_get_sumstats_2_susie_bhat_3_score_susie_2
1,0.05,score_susie/small_data_154_sim_gaussian_25_get_sumstats_2_susie_bhat_3_score_susie_2
...

so we might have done something problematic filtering it downstream.

@gaow
Copy link
Member Author

gaow commented Apr 15, 2019

@pcarbo the above ticket was from Yuxin's slack DM. It seems we should set omit.filenames = FALSE by default, or at least give some warning here because filenames appear to be specifically asked for here? Not sure if there is a good motivation to set it to TRUE as of now -- one argument is with output in list format we should never need to access file names. But there are certainly cases when accessing file names is desirable.

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

@gaow What happens when you set omit.filenames = FALSE? The default is omit.filenames = TRUE.

@gaow
Copy link
Member Author

gaow commented Apr 15, 2019

@pcarbo it works with omit.filenames = FALSE. I'm arguing we should have it by default.

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

@gaow What do you mean by this:

Filenames appear to be specifically asked for here?

What makes you think Yuxin is requesting the file names?

@gaow
Copy link
Member Author

gaow commented Apr 15, 2019

What makes you think Yuxin is requesting the file names?

This is what I told her when she asked me in DM how one can access file names directly.

I said "filenames appear to be specifically asked for" because susie_score is not a module group. In this case filename is the only information we'll provide when no variable is specified.

However now I am wondering since we have the list-based output, should we just load everything into the output unless the output method is data.frame, in which case we'll return output file names instead?

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

However now I am wondering since we have the list-based output, should we just load everything into the output unless the output method is data.frame, in which case we'll return output file names instead?

@gaow This is the current behaviour.

@gaow
Copy link
Member Author

gaow commented Apr 15, 2019

This is the current behaviour.

Really? Then the example above would suggest a bug because the expected behavior would be to load everything from score_susie and return a list. So I do not think it is the current behavior. I'm talking about the case when no specific variable is asked for -- should we just load everything from that module into the output? We definitely have to rename omit.filenames to something else, eg something like omit.unspecified, and expect the following behavior:

  1. omit.unspecified = F, return.type = 'list' will load everything in susie_score and return a list with an element called susie_score.output.data (please ignore the naming convention for now we can discuss that later) containing everything in that module
  2. omit.unspecified = F, return.type = 'data.frame' will return a dataframe with a column called susie_score.output.file that contains file names.
  3. omit.unspecified = T will ignore susie_score column regardless of return.type

Does it make sense?

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

Then the example above would suggest a bug because the expected behavior would be to load
everything from score_susie and return a list.

This would be the behaviour for return.type = "list".

But the default is return.type = "auto".

@gaow gaow changed the title Output filename in DSC query when only the module name is asked for Output behavior when only the module name is asked for Apr 15, 2019
@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

I'm guess I'm not sure what you mean by "everything".

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

@gaow Can you explain what would be your proposed default behaviour for these two examples?

out <- dscquery("mydsc",targets = "score")
out <- dscquery("mydsc",targets = "score.delta")

(Assuming "score" is a module group.)

@gaow
Copy link
Member Author

gaow commented Apr 15, 2019

For out <- dscquery("mydsc",targets = "score.delta") the behavior is the same for all my 1-3 above:

score score.delta
score_name  <some_number>

For out <- dscquery("mydsc",targets = "score") it will become something like:

score  score.output.data
score_name <something>

where <something> is

  1. the result of readRDS("something.rds") when return.type=list or return.type=auto
  2. just the corresponding file name when return type is data.frame
  3. the 2nd column will not exist if omit.unspecified=T

1-3 in this post correspond to my proposal 1-3 above.

The point is to allow loading the entire data object when return.type is set to auto or list, just in case the users want to access something they did not specifically returned earlier as DSC module output but now want to extract them. Currently we do not provide a way to do it. And since this can potentially cause large objected loaded, we possibly want to set the default back to omit = T

@pcarbo
Copy link
Member

pcarbo commented Apr 15, 2019

@gaow I think that is potentially useful behaviour. On the other hand, I see some disadvantages (e.g., could use up a huge amount of memory if the outputs are large). I think we would need some more time to study usage before changing dscquery to accommodate this. In any case, this doesn't seem like high priority at this stage.

@gaow
Copy link
Member Author

gaow commented Apr 15, 2019

Sure let's keep the ticket open. I think an apparent way around the disadvantage is to set omit.unspecified to T again by default (I was editing my comment above added this point but was not quick enough to have you see it earlier :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants