Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Added "seed" parameter to subsample-ids #317

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

VinzentRisch
Copy link
Contributor

@VinzentRisch VinzentRisch commented Oct 23, 2024

solves #320

  • Adds seed parameter to subsample-ids function.
  • Seed is per default set to 1. It can be set to "random" what creates a random 32 bit int as the seed.
  • Setting the seed to default 1 was done to ensure that the seed is logged in provenance. This improved reproducability.
  • At this point it is not possible to have a random seed that is created in the function that will be recorded in provenance.

Copy link
Member

@gregcaporaso gregcaporaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VinzentRisch, I thought I left a comment on this earlier today, but I see to have lost it so I'm thinking maybe it didn't save and I navigated away from the page. But, apologies if this feedback is showing up twice.

I completely agree that we need to be storing random seeds in provenance, but the approach implemented here will be problematic. The vast majority of users won't notice or override this parameter setting, in which case the result will be that the same seed is used all the time.

We have support for passing random seeds on the command line in q2-boots - see how we handle this here:
caporaso-lab/q2-boots@9195f9c

Could you update this PR to use this approach?

I created a framework issue to develop an approach for better supporting this:
qiime2/qiime2#823

@gregcaporaso
Copy link
Member

Oh, I see - my previous comment was on #321. This feedback is relevant to both of the PRs.

Copy link
Member

@gregcaporaso gregcaporaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VinzentRisch, could you update this PR to mirror the approaches we've taken to this functionality and the tests in #321?



def subsample_ids(table: biom.Table, subsampling_depth: int,
axis: str) -> biom.Table:
axis: str, seed: Union[int, str] = 1) -> biom.Table:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
axis: str, seed: Union[int, str] = 1) -> biom.Table:
axis: str, seed: int = None) -> biom.Table:

Comment on lines +25 to +29
# Generate a random seed if seed is None
if seed == "random":
rng = np.random.default_rng()
seed = rng.integers(0, 2 ** 32 - 1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should all be able to go away since this happens in Table.subsample - right?

Suggested change
# Generate a random seed if seed is None
if seed == "random":
rng = np.random.default_rng()
seed = rng.integers(0, 2 ** 32 - 1)

@@ -61,7 +61,9 @@
function=q2_feature_table.subsample_ids,
inputs={'table': FeatureTable[Frequency]},
parameters={'subsampling_depth': Int % Range(1, None),
'axis': Str % Choices(['sample', 'feature'])},
'axis': Str % Choices(['sample', 'feature']),
'seed': Int % Range(0, 2**32) | Str % Choices(["random"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'seed': Int % Range(0, 2**32) | Str % Choices(["random"])
'seed': Int % Range(0, None)
Suggested change
'seed': Int % Range(0, 2**32) | Str % Choices(["random"])
'seed': Int % Range(0, 2**32) | Str % Choices(["random"])

Comment on lines +77 to +80
'seed': ('Set the seed for the subsampling. Using the same seed with '
'the same table will always lead to the same result. Using '
'"random", sets the seed to a random number. The random '
'seed will not be logged in provenance.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edits to the text. I'd leave the bit about provenance out, since we'll ideally be fixing that very shortly and it's common to all actions that take seeds right now.

Suggested change
'seed': ('Set the seed for the subsampling. Using the same seed with '
'the same table will always lead to the same result. Using '
'"random", sets the seed to a random number. The random '
'seed will not be logged in provenance.')
'seed': ('Set the seed for the subsampling. Using the same seed with '
'the same table will always lead to the same result. By '
'default a random seed will be used.')

@@ -21,7 +21,7 @@ def test_subsample_samples(self):
t = Table(np.array([[0, 1, 3], [1, 1, 2]]),
['O1', 'O2'],
['S1', 'S2', 'S3'])
a = subsample_ids(t, 2, 'sample')
a = subsample_ids(t, 2, 'sample', 'random')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you adapt these tests to be similar to the ones we put together for #321?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Development
Development

Successfully merging this pull request may close these issues.

Add "seed" parameter to subsample-ids
2 participants