Group arrayset classes #162

Draft · rlizzo wants to merge 3 commits into master

Conversation

rlizzo (Member) commented Nov 11, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

In a recent conversation with @elistevens, it became obvious that our current level of support for existing ML workflows is lacking. This PR is a very early mock-up of how we might support tf/torch data loader workflows as first class citizens in hangar.

Essentially, we need to build an interface which allows selection, batching, shuffling, and sharding of arraysets / samples for ML training pipelines. Both pre-computed and runtime rebalancing must be supported.

Description

Describe your changes in detail:

A significant amount of this has been adapted from the torch dataset sampler methods. The code at the time of writing is a total hack, and shouldn't be regarded as more than a proof of concept for how we could approach sampling / shuffling / weighting / batching over samples.
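For anyone unfamiliar with the pattern being borrowed, the torch-style sampler contract is roughly the following (a simplified sketch only; the names mirror torch.utils.data conventions and are not the exact code added in this PR):

```python
from typing import Iterator, Sequence


class Sampler:
    """Base contract: a sampler only yields sample keys (or indices), never data."""

    def __iter__(self) -> Iterator:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError


class SequentialSampler(Sampler):
    """Yield keys in their original order; other samplers shuffle, weight, or batch."""

    def __init__(self, keys: Sequence):
        self.keys = keys

    def __iter__(self) -> Iterator:
        return iter(self.keys)

    def __len__(self) -> int:
        return len(self.keys)
```

Because the contract is just "iterate over keys", samplers compose freely (e.g. a batch sampler wrapping a random sampler), which is exactly what the example further down relies on.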

By wrapping the aset object at runtime in an addon class (currently called GroupedArraysetDataReader), we can use just the book-keeping records to create informational structures about what exactly each arrayset contains (with almost no data IO required from disk). Filtering these values outside of the arrayset worker decouples the ability to perform runtime balancing from a (potentially) blocking operation running in the arrayset backends...
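To make the "book-keeping only" point concrete, the grouping step can be thought of as nothing more than inverting a sample-name → value mapping pulled from the records (a hypothetical sketch; the real GroupedArraysetDataReader internals may differ):

```python
from collections import defaultdict


def build_group_map(sample_values: dict) -> dict:
    """Invert {sample_name: group_value} into {group_value: [sample_names, ...]}.

    `sample_values` would come straight from the checkout's book-keeping records,
    so no tensor data needs to be read from disk. Array-valued group data would
    first need converting to a hashable form (e.g. a tuple of its elements).
    """
    groups = defaultdict(list)
    for sample_name, value in sample_values.items():
        groups[value].append(sample_name)
    return dict(groups)


# e.g. build_group_map({'samp_idx_0': 8, 'samp_idx_13': 1})
#      -> {8: ['samp_idx_0'], 1: ['samp_idx_13']}
```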

I'm thinking the best solution for distributed computation is to just totally bypass the arrayset objects themselves - they are cheap to create, and can be created individually on each node to read from the same checkout simultaneously.
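On that point, a node could derive its own shard of work purely from the sorted sample keys and its rank before ever opening a reader (hypothetical helper; the rank/world_size naming just follows the usual torch.distributed convention):

```python
def shard_keys(sample_keys, rank: int, world_size: int) -> list:
    """Deterministically assign every `world_size`-th key to this node."""
    ordered = sorted(sample_keys)      # identical ordering on every node
    return ordered[rank::world_size]   # node-local slice; no data IO involved


# Each node would then open its own checkout / arrayset reader and fetch
# only the keys returned for its rank.
```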

A very rough example below shows usage of the wrapper class, weighted balancing of some "flavor" value common to many sample keys (which presumably would be used as a feature during some computation on larger data stored in a different arrayset), and finally a randomized subselection / batching of the sample keys which would be sent off for computation.

```python
from hangar.dataloaders import GroupedArraysetDataReader
from hangar.dataloaders.sampler import (
    BatchSampler, RandomSampler, SequentialSampler, SubsetRandomSampler, WeightedRandomSampler)

gaset = GroupedArraysetDataReader(aset)
gaset.group_samples
```

Output:

```
Mapping: Group Data Value -> Sample Name

 [8] :: ['samp_idx_0', 'samp_idx_36', 'samp_idx_54', 'samp_idx_62', 'samp_idx_67', 'samp_idx_83', 'samp_idx_93']
 [7] :: ['samp_idx_1', 'samp_idx_11', 'samp_idx_14', 'samp_idx_19', 'samp_idx_25', 'samp_idx_29', 'samp_idx_3', 'samp_idx_31', 'samp_idx_44', 'samp_idx_74', 'samp_idx_87', 'samp_idx_94']
 [6] :: ['samp_idx_10', 'samp_idx_105', 'samp_idx_12', 'samp_idx_2', 'samp_idx_27', 'samp_idx_30', 'samp_idx_32', 'samp_idx_33', 'samp_idx_43', 'samp_idx_45', 'samp_idx_51', 'samp_idx_53', 'samp_idx_56', 'samp_idx_61', 'samp_idx_70', 'samp_idx_75', 'samp_idx_84', 'samp_idx_89']
 [4] :: ['samp_idx_100', 'samp_idx_101', 'samp_idx_102', 'samp_idx_15', 'samp_idx_21', 'samp_idx_22', 'samp_idx_23', 'samp_idx_38', 'samp_idx_41', 'samp_idx_42', 'samp_idx_46', 'samp_idx_6', 'samp_idx_63', 'samp_idx_64', 'samp_idx_69', 'samp_idx_7', 'samp_idx_77', 'samp_idx_82', 'samp_idx_96', 'samp_idx_99']
 [2] :: ['samp_idx_103', 'samp_idx_16', 'samp_idx_26', 'samp_idx_68', 'samp_idx_78', 'samp_idx_85', 'samp_idx_86', 'samp_idx_9', 'samp_idx_90']
 [3] :: ['samp_idx_104', 'samp_idx_28', 'samp_idx_40', 'samp_idx_49', 'samp_idx_58', 'samp_idx_59', 'samp_idx_76', 'samp_idx_88']
 [1] :: ['samp_idx_13', 'samp_idx_24', 'samp_idx_47', 'samp_idx_48', 'samp_idx_57', 'samp_idx_91', 'samp_idx_95', 'samp_idx_97']
 [5] :: ['samp_idx_17', 'samp_idx_18', 'samp_idx_39', 'samp_idx_4', 'samp_idx_5', 'samp_idx_50', 'samp_idx_65', 'samp_idx_72', 'samp_idx_8', 'samp_idx_80', 'samp_idx_81', 'samp_idx_98']
 [0] :: ['samp_idx_20', 'samp_idx_34', 'samp_idx_35', 'samp_idx_37', 'samp_idx_52', 'samp_idx_55', 'samp_idx_60', 'samp_idx_66', 'samp_idx_71', 'samp_idx_73', 'samp_idx_79', 'samp_idx_92']
```
```python
WS = WeightedRandomSampler(weights=[0.2, 0.1, 0.5, 0.4, 0.3, 0.6, 0.1, 0.5, 0.3],
                           num_samples=4,
                           replacement=True,
                           group_names=list(gaset.groups))
generated = []
for grp in WS:
    print('')
    print(f'GROUP: {grp}')
    samples = gaset.group_samples[grp]
    RS = SubsetRandomSampler(samples)
    BS = BatchSampler(RS, 3, drop_last=False)
    generated.append(list(BS))
    print(f'GENERATED: {generated[-1]}')
```

Output:

```
GROUP: [1]
GENERATED: [['samp_idx_97', 'samp_idx_24', 'samp_idx_48'], ['samp_idx_91', 'samp_idx_47', 'samp_idx_57'], ['samp_idx_95', 'samp_idx_13']]

GROUP: [3]
GENERATED: [['samp_idx_40', 'samp_idx_88', 'samp_idx_28'], ['samp_idx_49', 'samp_idx_76', 'samp_idx_104'], ['samp_idx_59', 'samp_idx_58']]

GROUP: [5]
GENERATED: [['samp_idx_17', 'samp_idx_65', 'samp_idx_72'], ['samp_idx_8', 'samp_idx_98', 'samp_idx_18'], ['samp_idx_39', 'samp_idx_81', 'samp_idx_4'], ['samp_idx_80', 'samp_idx_5', 'samp_idx_50']]

GROUP: [6]
GENERATED: [['samp_idx_70', 'samp_idx_45', 'samp_idx_61'], ['samp_idx_10', 'samp_idx_2', 'samp_idx_105'], ['samp_idx_89', 'samp_idx_56', 'samp_idx_53'], ['samp_idx_12', 'samp_idx_75', 'samp_idx_27'], ['samp_idx_51', 'samp_idx_43', 'samp_idx_84'], ['samp_idx_32', 'samp_idx_33', 'samp_idx_30']]
```

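From here, the generated key batches would presumably be handed to whatever actually reads the tensors, roughly along these lines (sketch only; it assumes the arrayset supports dict-style access by sample key, that samples in a batch share a shape, and `model_step` is just a stand-in for the user's training code):

```python
import numpy as np


def materialize(aset, key_batches):
    """Turn batches of sample keys into stacked arrays ready for training."""
    for batch_keys in key_batches:
        yield np.stack([aset[key] for key in batch_keys])


# for batch in materialize(aset, generated[-1]):
#     model_step(batch)   # hypothetical training step
```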
I'd like to get a good discussion going here, since this is an attempt to build a solution for a problem I don't have on a day-to-day basis. Any thoughts?

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwork CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

rlizzo added the enhancement (New feature or request) and WIP (Don't merge; Work in Progress) labels on Nov 11, 2019
codecov bot commented Dec 5, 2019

Codecov Report

Merging #162 into master will decrease coverage by 4.44%.
The diff coverage is 16.29%.

@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage   95.22%   90.77%   -4.44%     
==========================================
  Files          66       69       +3     
  Lines       11881    12054     +173     
  Branches     1011     1042      +31     
==========================================
- Hits        11313    10942     -371     
- Misses        371      917     +546     
+ Partials      197      195       -2     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/hangar/__init__.py | 100.00% <ø> (ø) |
| src/hangar/dataloaders/common.py | 12.90% <ø> (-78.01%) ⬇️ |
| src/hangar/dataloaders/sampler.py | 0.00% <0.00%> (ø) |
| src/hangar/dataloaders/grouper.py | 39.68% <39.68%> (ø) |
| src/hangar/arrayset.py | 94.89% <100.00%> (-0.21%) ⬇️ |
| src/hangar/dataloaders/__init__.py | 100.00% <100.00%> (ø) |
| tests/test_dataloaders.py | 15.85% <0.00%> (-84.15%) ⬇️ |
| src/hangar/dataloaders/torchloader.py | 20.00% <0.00%> (-80.00%) ⬇️ |
| src/hangar/dataloaders/tfloader.py | 60.00% <0.00%> (-40.00%) ⬇️ |
| ... and 6 more | |

rlizzo added this to the v0.5.0 milestone on Dec 5, 2019
tensorwerk deleted 3 comments from the lgtm-com bot on Dec 5, 2019
lgtm-com bot commented Dec 5, 2019

This pull request introduces 5 alerts when merging e63339f into d267c0a - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 2 for Non-standard exception raised in special method

CLAassistant commented Apr 10, 2020

CLA assistant check
All committers have signed the CLA.
