Group arrayset classes #162

Draft · rlizzo wants to merge 3 commits into master

Conversation

rlizzo (Member) commented Nov 11, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

In a recent conversation with @elistevens, it became obvious that our current level of support for existing ML workflows is lacking. This PR is a very early mock-up of how we might support tf/torch data loader workflows as first class citizens in hangar.

Essentially, we need to build an interface which allows selection, batching, shuffling, and sharding of arraysets / samples for ML training pipelines. Both pre-computed and runtime rebalancing must be supported.

Description

Describe your changes in detail:

A significant amount of this has been adapted from the torch dataset sampler methods. The code at the time of writing is a total hack, and shouldn't be regarded as more than a proof of concept for how we could approach sampling / shuffling / weighting / batching over samples.
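For anyone unfamiliar with the pattern being borrowed, the torch-style sampler contract is roughly the following (a simplified sketch only; the names mirror torch.utils.data conventions and are not the exact code added in this PR):

```python
from typing import Iterator, Sequence


class Sampler:
    """Base contract: a sampler only yields sample keys (or indices), never data."""

    def __iter__(self) -> Iterator:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError


class SequentialSampler(Sampler):
    """Yield keys in their original order; other samplers shuffle, weight, or batch."""

    def __init__(self, keys: Sequence):
        self.keys = keys

    def __iter__(self) -> Iterator:
        return iter(self.keys)

    def __len__(self) -> int:
        return len(self.keys)
```

Because the contract is just "iterate over keys", samplers compose freely (e.g. a batch sampler wrapping a random sampler), which is exactly what the example further down relies on.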

By wrapping the aset object at runtime in an addon class (currently called GroupedArraysetDataReader), we can use just the book-keeping records to create informational structures about what exactly each arrayset contains (with almost no data IO required from disk). Filtering these values outside of the arrayset worker decouples the ability to perform runtime balancing from a (potentially) blocking operation running in the arrayset backends...
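To make the "book-keeping only" point concrete, the grouping step can be thought of as nothing more than inverting a sample-name → value mapping pulled from the records (a hypothetical sketch; the real GroupedArraysetDataReader internals may differ):

```python
from collections import defaultdict


def build_group_map(sample_values: dict) -> dict:
    """Invert {sample_name: group_value} into {group_value: [sample_names, ...]}.

    `sample_values` would come straight from the checkout's book-keeping records,
    so no tensor data needs to be read from disk. Array-valued group data would
    first need converting to a hashable form (e.g. a tuple of its elements).
    """
    groups = defaultdict(list)
    for sample_name, value in sample_values.items():
        groups[value].append(sample_name)
    return dict(groups)


# e.g. build_group_map({'samp_idx_0': 8, 'samp_idx_13': 1})
#      -> {8: ['samp_idx_0'], 1: ['samp_idx_13']}
```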

I'm thinking the best solution for distributed computation is to just totally bypass the arrayset objects themselves - they are cheap to create, and can be created individually on each node to read from the same checkout simultaneously.
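On that point, a node could derive its own shard of work purely from the sorted sample keys and its rank before ever opening a reader (hypothetical helper; the rank/world_size naming just follows the usual torch.distributed convention):

```python
def shard_keys(sample_keys, rank: int, world_size: int) -> list:
    """Deterministically assign every `world_size`-th key to this node."""
    ordered = sorted(sample_keys)      # identical ordering on every node
    return ordered[rank::world_size]   # node-local slice; no data IO involved


# Each node would then open its own checkout / arrayset reader and fetch
# only the keys returned for its rank.
```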

A very rough example below shows usage of the wrapper class, weighted balancing of some "flavor" value common to many sample keys (which presumably would be used as a feature during some computation on larger data stored in a different arrayset), and finally a randomized subselection / batching of the sample keys which would be sent off for computation.

```python
from hangar.dataloaders import GroupedArraysetDataReader
from hangar.dataloaders.sampler import (
    BatchSampler, RandomSampler, SequentialSampler, SubsetRandomSampler, WeightedRandomSampler)

gaset = GroupedArraysetDataReader(aset)
gaset.group_samples
```

Output:

```
Mapping: Group Data Value -> Sample Name

 [8] :: ['samp_idx_0', 'samp_idx_36', 'samp_idx_54', 'samp_idx_62', 'samp_idx_67', 'samp_idx_83', 'samp_idx_93']
 [7] :: ['samp_idx_1', 'samp_idx_11', 'samp_idx_14', 'samp_idx_19', 'samp_idx_25', 'samp_idx_29', 'samp_idx_3', 'samp_idx_31', 'samp_idx_44', 'samp_idx_74', 'samp_idx_87', 'samp_idx_94']
 [6] :: ['samp_idx_10', 'samp_idx_105', 'samp_idx_12', 'samp_idx_2', 'samp_idx_27', 'samp_idx_30', 'samp_idx_32', 'samp_idx_33', 'samp_idx_43', 'samp_idx_45', 'samp_idx_51', 'samp_idx_53', 'samp_idx_56', 'samp_idx_61', 'samp_idx_70', 'samp_idx_75', 'samp_idx_84', 'samp_idx_89']
 [4] :: ['samp_idx_100', 'samp_idx_101', 'samp_idx_102', 'samp_idx_15', 'samp_idx_21', 'samp_idx_22', 'samp_idx_23', 'samp_idx_38', 'samp_idx_41', 'samp_idx_42', 'samp_idx_46', 'samp_idx_6', 'samp_idx_63', 'samp_idx_64', 'samp_idx_69', 'samp_idx_7', 'samp_idx_77', 'samp_idx_82', 'samp_idx_96', 'samp_idx_99']
 [2] :: ['samp_idx_103', 'samp_idx_16', 'samp_idx_26', 'samp_idx_68', 'samp_idx_78', 'samp_idx_85', 'samp_idx_86', 'samp_idx_9', 'samp_idx_90']
 [3] :: ['samp_idx_104', 'samp_idx_28', 'samp_idx_40', 'samp_idx_49', 'samp_idx_58', 'samp_idx_59', 'samp_idx_76', 'samp_idx_88']
 [1] :: ['samp_idx_13', 'samp_idx_24', 'samp_idx_47', 'samp_idx_48', 'samp_idx_57', 'samp_idx_91', 'samp_idx_95', 'samp_idx_97']
 [5] :: ['samp_idx_17', 'samp_idx_18', 'samp_idx_39', 'samp_idx_4', 'samp_idx_5', 'samp_idx_50', 'samp_idx_65', 'samp_idx_72', 'samp_idx_8', 'samp_idx_80', 'samp_idx_81', 'samp_idx_98']
 [0] :: ['samp_idx_20', 'samp_idx_34', 'samp_idx_35', 'samp_idx_37', 'samp_idx_52', 'samp_idx_55', 'samp_idx_60', 'samp_idx_66', 'samp_idx_71', 'samp_idx_73', 'samp_idx_79', 'samp_idx_92']
```
```python
WS = WeightedRandomSampler(weights=[0.2, 0.1, 0.5, 0.4, 0.3, 0.6, 0.1, 0.5, 0.3],
                           num_samples=4,
                           replacement=True,
                           group_names=list(gaset.groups))
generated = []
for grp in WS:
    print('')
    print(f'GROUP: {grp}')
    samples = gaset.group_samples[grp]
    RS = SubsetRandomSampler(samples)
    BS = BatchSampler(RS, 3, drop_last=False)
    generated.append(list(BS))
    print(f'GENERATED: {generated[-1]}')
```

Output:

```
GROUP: [1]
GENERATED: [['samp_idx_97', 'samp_idx_24', 'samp_idx_48'], ['samp_idx_91', 'samp_idx_47', 'samp_idx_57'], ['samp_idx_95', 'samp_idx_13']]

GROUP: [3]
GENERATED: [['samp_idx_40', 'samp_idx_88', 'samp_idx_28'], ['samp_idx_49', 'samp_idx_76', 'samp_idx_104'], ['samp_idx_59', 'samp_idx_58']]

GROUP: [5]
GENERATED: [['samp_idx_17', 'samp_idx_65', 'samp_idx_72'], ['samp_idx_8', 'samp_idx_98', 'samp_idx_18'], ['samp_idx_39', 'samp_idx_81', 'samp_idx_4'], ['samp_idx_80', 'samp_idx_5', 'samp_idx_50']]

GROUP: [6]
GENERATED: [['samp_idx_70', 'samp_idx_45', 'samp_idx_61'], ['samp_idx_10', 'samp_idx_2', 'samp_idx_105'], ['samp_idx_89', 'samp_idx_56', 'samp_idx_53'], ['samp_idx_12', 'samp_idx_75', 'samp_idx_27'], ['samp_idx_51', 'samp_idx_43', 'samp_idx_84'], ['samp_idx_32', 'samp_idx_33', 'samp_idx_30']]
```

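From here, the generated key batches would presumably be handed to whatever actually reads the tensors, roughly along these lines (sketch only; it assumes the arrayset supports dict-style access by sample key, that samples in a batch share a shape, and `model_step` is just a stand-in for the user's training code):

```python
import numpy as np


def materialize(aset, key_batches):
    """Turn batches of sample keys into stacked arrays ready for training."""
    for batch_keys in key_batches:
        yield np.stack([aset[key] for key in batch_keys])


# for batch in materialize(aset, generated[-1]):
#     model_step(batch)   # hypothetical training step
```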
I'd like to get a good discussion going here, since this is an attempt to build a solution for a problem I don't have on a day-to-day basis. Any thoughts?

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwork CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

rlizzo added the enhancement (New feature or request) and WIP (Don't merge; Work in Progress) labels on Nov 11, 2019
codecov bot commented Dec 5, 2019

Codecov Report

Merging #162 into master will decrease coverage by 4.44%.
The diff coverage is 16.29%.

@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage   95.22%   90.77%   -4.44%     
==========================================
  Files          66       69       +3     
  Lines       11881    12054     +173     
  Branches     1011     1042      +31     
==========================================
- Hits        11313    10942     -371     
- Misses        371      917     +546     
+ Partials      197      195       -2     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/hangar/__init__.py | 100.00% <ø> (ø) |
| src/hangar/dataloaders/common.py | 12.90% <ø> (-78.01%) ⬇️ |
| src/hangar/dataloaders/sampler.py | 0.00% <0.00%> (ø) |
| src/hangar/dataloaders/grouper.py | 39.68% <39.68%> (ø) |
| src/hangar/arrayset.py | 94.89% <100.00%> (-0.21%) ⬇️ |
| src/hangar/dataloaders/__init__.py | 100.00% <100.00%> (ø) |
| tests/test_dataloaders.py | 15.85% <0.00%> (-84.15%) ⬇️ |
| src/hangar/dataloaders/torchloader.py | 20.00% <0.00%> (-80.00%) ⬇️ |
| src/hangar/dataloaders/tfloader.py | 60.00% <0.00%> (-40.00%) ⬇️ |
| ... and 6 more | |

rlizzo added this to the v0.5.0 milestone on Dec 5, 2019
tensorwerk deleted 3 comments from the lgtm-com bot on Dec 5, 2019
lgtm-com bot commented Dec 5, 2019

This pull request introduces 5 alerts when merging e63339f into d267c0a - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 2 for Non-standard exception raised in special method

CLAassistant commented Apr 10, 2020

CLA assistant check
All committers have signed the CLA.
