Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip empty first dimension for default windows. #313

Merged

Conversation

jeromekelleher
Copy link
Member

This implements a subset of #200. The idea is that, when we don't specify windows then it's a pain to have to write stat[0] to get what you actually want. This seems like a good usablilty feature to add, to me --- this is what you'd actually want the library to do, right?

More concretely, take the following example:

afs = ts.allele_frequency_spectrum(mode="branch")
print("afs: ", afs)
print("diversity", ts.diversity([ts.samples()], mode="branch"))

afs = ts.allele_frequency_spectrum(mode="branch", windows=[0, ts.sequence_length])
print("afs:", afs)
print("diversity",
    ts.diversity([ts.samples()], mode="branch", windows=[0, ts.sequence_length]))

gives

afs:  [0.  2.5 0.  0. ]
diversity [3.33333333]
afs: [[0.  2.5 0.  0. ]]
diversity [[3.33333333]]

In the first case, we don't specify any windows so we're only interested in a single window and we remove the empty first dimension. In the second case, we explicitly specify the windows and so we keep the dimension in place.

I agree this is going to be a bit confusing to explain, but it seems worth it. The mistakes that are made will be less annoying, in the long run, than having to add an extra [0] to the end of each call when you just want one window. If, as @petrelharp says people are rarely interested in a single window, then they won't be affected by this default behaviour.

The reason I'm bringing this up now is because it is a long-term decision. Because we've specified the default value for windows, we can't change the default behaviour after we ship 0.2.0. So we're making the decision here that we'll never want to do this, which to me seems a real shame as it's quite neat and elegant (IMO).

Because we don't have a default value for, e.g., diversity now we don't have to worry about the other half of #200 (stripping off the empty dimensions when we're only looking at one sample set in, e.g., diversity until later).

FWIW, the implementation is easy and it doesn't break many of the tests.

docs/stats.rst Outdated
@@ -70,6 +70,12 @@ e.g., the sites of the SNPs.
Windowing
*********

By default, statistics
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started writing this, and then decided to see how easy it would be to implement before getting into it.

@petrelharp
Copy link
Contributor

You are probably right. I do worry that it's strange that specifying a non-default argument changes the output dimensions, but at least None is a different argument. Ok, go for it.

Re: the other half; I don't think we need to drop dimensions for a default indexes (when there is one); since np.array([2.0]) acts a lot like 2.0. But, do we want to set up default indexes for other stats?

@jeromekelleher
Copy link
Member Author

You are probably right. I do worry that it's strange that specifying a non-default argument changes the output dimensions, but at least None is a different argument. Ok, go for it.

Hooray! OK, I'll flesh this out and ping you.

Re: the other half; I don't think we need to drop dimensions for a default indexes (when there is one); since np.array([2.0]) acts a lot like 2.0. But, do we want to set up default indexes for other stats?

Hmm, that's a good question. I'm not sure --- maybe this is an argument for dropping the default indexes value for Fst until after 0.2.0? The way I'm thinking about it at the moment is that post 0.2.0 we can possibly refine the semantics at bit such that the dimensions of your arguments determines the dimensions of your output. So, for example, we'd have

ts.diversity([ts.samples()])   ->   [x]
ts.diversity(ts.samples())     ->   x

Here, the user is giving clear information that they're only interested in one sample set in the second example, so we strip off the first dimension and just return the value. Possibly, the right behaviour then is for ts.diversity() to return x also. I'm not sure how this interacts with the indexes argument, but I'm not sure I want to tie us down to something without thinking it through properly either.

I know you have reservations about playing fast and loose with the dimensions like this, so I'm not saying this is something I'd definitely want to do. But, I don't definitely not want to do it either, so keeping the door open by not specifying default values for things we're not sure about might be good.

@petrelharp
Copy link
Contributor

Re: indexes - sounds good; also see #315 in the meantime.

@jeromekelleher
Copy link
Member Author

OK, turns out we couldn't really punt this one down the road and need to deal with it now. Here's my proposal for how we strip off empty dimensions @petrelharp --- the new tests should explain what it's doing. If we think this is a good idea I'll tidy up the rest of the tests and fix up the documentation.

I haven't looked at the derived stats like Fst, Tajimas D and the covariance stats etc. These may need to be treated differently.

python/tskit/trees.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor

Ok, so in words, we drop a dimension if

  1. it is a one-way stat and sample_sets is actually just a list of samples
  2. it is a k>1-way stat and indexes is actually just a k-tuple of indexes
  3. it is a k>1-way stat and indexes=None and there are exactly k sample sets

This is nice and elegant. I like it. It also makes the interface a bit more confusing, but I say we go for it.

One wrinkle is that in the future we will probably have statistics that can take variaible numbers of indexes, so that e.g. indexes = [(0, 1, 2), (0, 1)] would be valid. (For instance, a multi-pop Fst.) I think we can just adjust things at that point, but we should make sure we don't make this impossible.

Do you have an opinion about dropping dimensions when windows=None?

@petrelharp
Copy link
Contributor

I haven't looked at the derived stats like Fst, Tajimas D and the covariance stats etc. These may need to be treated differently.

They should work the same, but we might need different code to make that happen...

@jeromekelleher
Copy link
Member Author

This is nice and elegant. I like it. It also makes the interface a bit more confusing, but I say we go for it.

Excellent!

Do you have an opinion about dropping dimensions when windows=None?

Yes, this is already implemented --- we drop the first dimension when windows is None. See the _default_windows tests for examples.

@petrelharp
Copy link
Contributor

Yes, this is already implemented --- we drop the first dimension when windows is None. See the _default_windows tests for examples.

Oh, duh. =) Yes, looks great!

@codecov
Copy link

codecov bot commented Aug 21, 2019

Codecov Report

Merging #313 into master will decrease coverage by 1.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
- Coverage   87.47%   86.44%   -1.03%     
==========================================
  Files          19       20       +1     
  Lines       10282    14122    +3840     
  Branches     1902     2777     +875     
==========================================
+ Hits         8994    12208    +3214     
- Misses        764      987     +223     
- Partials      524      927     +403
Flag Coverage Δ
#c_tests 87.5% <100%> (+0.02%) ⬆️
#python_c_tests 90.3% <100%> (?)
#python_tests 99.24% <100%> (ø) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.67% <100%> (+0.02%) ⬆️
python/_tskitmodule.c 83.6% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23fa1a8...e86a72b. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 21, 2019

Codecov Report

Merging #313 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   86.45%   86.47%   +0.01%     
==========================================
  Files          20       20              
  Lines       14015    14035      +20     
  Branches     2748     2751       +3     
==========================================
+ Hits        12117    12137      +20     
  Misses        979      979              
  Partials      919      919
Flag Coverage Δ
#c_tests 87.55% <100%> (+0.02%) ⬆️
#python_c_tests 90.31% <100%> (+0.02%) ⬆️
#python_tests 99.23% <100%> (ø) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.66% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28a56cb...74b1742. Read the comment docs.

@jeromekelleher jeromekelleher force-pushed the default-windows-strip-dims branch from e86a72b to 59ceea2 Compare August 22, 2019 10:37
@codecov
Copy link

codecov bot commented Aug 22, 2019

Codecov Report

Merging #313 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   86.45%   86.47%   +0.01%     
==========================================
  Files          20       20              
  Lines       14015    14035      +20     
  Branches     2748     2751       +3     
==========================================
+ Hits        12117    12137      +20     
  Misses        979      979              
  Partials      919      919
Flag Coverage Δ
#c_tests 87.55% <100%> (+0.02%) ⬆️
#python_c_tests 90.31% <100%> (+0.02%) ⬆️
#python_tests 99.23% <100%> (ø) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.66% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c156b7...59ceea2. Read the comment docs.

@jeromekelleher jeromekelleher force-pushed the default-windows-strip-dims branch from e86a72b to 59ceea2 Compare August 22, 2019 11:08
@jeromekelleher
Copy link
Member Author

I think we're nearly there with this @petrelharp. I've tidied up the loose ends, and written some high-level docs. I think I just need to make another pass through the methods, probably linking back to some general descipription of the behaviour or parameters one-way and k-way statistics to document the behaviour in various cases.

What do you think?

@jeromekelleher jeromekelleher force-pushed the default-windows-strip-dims branch from 0dbd172 to c8cb6d6 Compare August 22, 2019 15:13
@jeromekelleher
Copy link
Member Author

Note to self: need to update the AFS values in the tutorial, as these are now wrong (dimension stripping).

Also, does this close #229? They are examples --- if somewhat boring ones.

@petrelharp
Copy link
Contributor

This looks great, especially the tutorial. I think this closes #229, also.

@jeromekelleher
Copy link
Member Author

TODO:

  • Fix general stats interface like in Replace mean_descendants with a node stat #328 (also move around code like there, it's good to have the stats functions together).
  • Update documentation. Need to describe behaviour of, e.g., windows, concisely and precisely somewhere, and put in links to this in each function.

This was referenced Aug 23, 2019
@petrelharp
Copy link
Contributor

Most (all?) of those TODOs are now done over in #330; maybe you want to merge that one in here before continuing.

@jeromekelleher jeromekelleher force-pushed the default-windows-strip-dims branch from 855e827 to 74b1742 Compare August 23, 2019 08:05
@jeromekelleher jeromekelleher merged commit 47ec75f into tskit-dev:master Aug 23, 2019
@jeromekelleher jeromekelleher deleted the default-windows-strip-dims branch August 23, 2019 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants