Approximate proportional sampling in BucketingSampler; remaining_duration, remaining_cuts, num_cuts properties for samplers. #372
Conversation
        return self._orig_items.is_lazy

    @property
    def remaining_duration(self) -> Optional[float]:
Is there a reason why this is a property and not a function? E.g. does it indicate that it's expected to be fast to compute?
Yeah, that's the reason.
LGTM, except my only concern is what might happen if, due to floating point roundoff, we get inaccuracies near the last batch remaining. Can you convince yourself that at least it won't lead to a crash? Discarding a few cuts is OK.
I share the concern -- I ran it multiple times with different seeds on 100k+ items without issues, but I'm still not sure. I'll sleep on it and add some safeguards.
Yeah, I don't think testing like that is sufficient; there needs to be logic that handles the case where the duration is wrong. You could temporarily initialize the total duration to the real total duration plus a random number, to test whether that logic works.
I think I've convinced myself that this logic is OK. It ensures that the duration is non-negative (via the property in DataSource), so even if it were incorrect, it would only affect the sampling probabilities. I also followed your suggestion to add random numbers (0-100 s) to the total duration, and it passes all tests and is able to iterate through a 3 million item CutSet.

BTW, using the same CutSet, I checked that if we compute the total duration (3020 h) and then subtract the durations one-by-one in a randomized order, the errors accumulate to only ~2e-7 seconds. For extra safety, I added a flag to disable the proportional sampling in case it causes any issues.
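The roundoff experiment described above can be reproduced with a small self-contained sketch (illustrative names only; this is not Lhotse's code). It subtracts per-item durations from a precomputed total in randomized order and then applies the non-negative clamp that the safeguard relies on:

```python
import random

# Simulate per-cut durations (seconds) and their precomputed total.
durations = [random.uniform(1.0, 30.0) for _ in range(100_000)]
total = sum(durations)

# Subtract durations one-by-one in a randomized order, as the sampler would.
remaining = total
random.shuffle(durations)
for d in durations:
    remaining -= d

# Float roundoff leaves only a tiny residual instead of exactly 0.0.
residual = abs(remaining)

# The safeguard: clamp so the value can never go negative and break sampling.
safe_remaining = max(0.0, remaining)
```

Even over 100k subtractions the residual stays many orders of magnitude below a single cut's duration, which is why the clamp is enough to keep the sampling probabilities sane.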
LGTM!
lhotse/dataset/sampling/bucketing.py
Outdated
        counts = [s.remaining_cuts for _, s in self._nondepleted_samplers_with_idxs]
        if any(c is None for c in counts):
            return None
        return sum(counts)
BTW you could make this more efficient with try-except.
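The try-except variant the reviewer suggests might look like the sketch below (standalone; the stand-in samplers are hypothetical). Instead of materializing a list and scanning it for `None`, `sum()` is allowed to raise `TypeError` the moment it hits a `None`, so the common all-eager case needs only a single pass:

```python
from types import SimpleNamespace
from typing import Optional


def remaining_cuts(samplers_with_idxs) -> Optional[int]:
    try:
        # Single pass over a generator; no intermediate list.
        return sum(s.remaining_cuts for _, s in samplers_with_idxs)
    except TypeError:
        # Some sub-sampler reported None (e.g. a lazy manifest).
        return None


# Tiny stand-ins for sub-samplers, just for illustration:
eager = [(0, SimpleNamespace(remaining_cuts=5)), (1, SimpleNamespace(remaining_cuts=7))]
lazy = [(0, SimpleNamespace(remaining_cuts=5)), (1, SimpleNamespace(remaining_cuts=None))]

print(remaining_cuts(eager))  # 12
print(remaining_cuts(lazy))   # None
```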
I checked the efficiency of the approximate proportional sampling: for a sampler with 7700 batches, I measured at which step the first bucket gets depleted. With 'equal_len' buckets, that step goes up from 2500 to 5100; with 'equal_duration', it goes up from 6800-7200 to 7600. It seems to be working well. @danpovey
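The idea being measured above can be illustrated with a toy sketch (not Lhotse's actual implementation): each step picks a bucket with probability proportional to its remaining duration, so larger buckets deplete later and the first depletion is pushed toward the end of the epoch. The roundoff guard at the bottom mirrors the floating-point concern discussed earlier in this thread:

```python
import random


def pick_bucket(remaining_durations):
    """Pick a bucket index with probability proportional to its remaining duration."""
    total = sum(remaining_durations)
    r = random.uniform(0.0, total)
    acc = 0.0
    for idx, d in enumerate(remaining_durations):
        acc += d
        if r <= acc:
            return idx
    # Guard: float roundoff may leave r marginally above the accumulated sum.
    return len(remaining_durations) - 1


# A bucket holding 90% of the remaining duration should be picked ~90% of the time.
random.seed(0)
counts = [0, 0]
for _ in range(10_000):
    counts[pick_bucket([90.0, 10.0])] += 1
```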
@csukuangfj I added the `num_cuts` property to the sampler that you were asking for. Take note that it may be `None` when the `CutSet` is opened as a lazy manifest.
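Since `num_cuts` may be `None`, callers should guard for the lazy case before using the value; a minimal usage sketch (the `describe` helper and the stand-in sampler are hypothetical, only the `num_cuts` property comes from the PR):

```python
from types import SimpleNamespace


def describe(sampler) -> str:
    # num_cuts is None when the underlying CutSet is a lazy manifest.
    n = sampler.num_cuts
    if n is None:
        return "num_cuts unknown (lazy CutSet)"
    return f"{n} cuts"


print(describe(SimpleNamespace(num_cuts=42)))    # "42 cuts"
print(describe(SimpleNamespace(num_cuts=None)))  # "num_cuts unknown (lazy CutSet)"
```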