Group arrayset classes #162
Draft
rlizzo wants to merge 3 commits into tensorwerk:master from rlizzo:group-arrayset-classes
Conversation
rlizzo added the enhancement (New feature or request) and WIP (Don't merge; Work in Progress) labels on Nov 11, 2019
rlizzo force-pushed the group-arrayset-classes branch from 1f31ad5 to 53e07f4 on November 26, 2019 09:27
rlizzo force-pushed the group-arrayset-classes branch from 0678537 to 9587e93 on December 5, 2019 00:05
Codecov Report

```
@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage   95.22%   90.77%    -4.44%
==========================================
  Files          66       69        +3
  Lines       11881    12054      +173
  Branches     1011     1042       +31
==========================================
- Hits        11313    10942      -371
- Misses        371      917      +546
+ Partials      197      195       -2
```
rlizzo force-pushed the group-arrayset-classes branch from 9587e93 to e63339f on December 5, 2019 22:04
This pull request introduces 5 alerts when merging e63339f into d267c0a - view on LGTM.com
Motivation and Context
Why is this change required? What problem does it solve?:
In a recent conversation with @elistevens, it became obvious that our current level of support for existing ML workflows is lacking. This PR is a very early mock-up of how we might support tf/torch data-loader workflows as first-class citizens in Hangar.
Essentially, we need to build an interface which allows selection, batching, shuffling, & sharding of arraysets/samples for ML training pipelines. Both pre-compute and runtime rebalancing must be supported.
Description
Describe your changes in detail:
A significant amount of this has been adapted from the torch dataset sampler methods. The code at the time of writing is a total hack, and shouldn't be regarded as more than a proof of concept for how we could approach sampling / shuffling / weighting / batching over samples.

By wrapping the aset object at runtime in an addon class (currently called GroupedArraysetDataReader), we can use just the book-keeping records to build informational structures describing exactly what each arrayset contains, with almost no data IO required from disk. Filtering these values outside the arrayset worker decouples runtime balancing from a (potentially) blocking operation running in the arrayset backends... I'm thinking the best solution for distributed computation is to bypass the arrayset objects entirely - they are cheap to create, and can be created individually on each node to read from the same checkout simultaneously.
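To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. This is not the code in the PR: the constructor arguments, the `group_fn` callback, and the reliance on `aset.keys()` are all illustrative assumptions. The only requirement it places on the arrayset is an iterable of sample keys, so grouping touches book-keeping records rather than bulk data:

```python
from collections import defaultdict
import random

class GroupedArraysetDataReader:
    """Illustrative sketch: bucket sample keys by a cheap-to-compute
    group label using only key records, with no bulk data IO."""

    def __init__(self, aset, group_fn):
        # `aset` only needs to expose an iterable of sample keys;
        # `group_fn` maps a key to its group label.
        self._groups = defaultdict(list)
        for key in aset.keys():
            self._groups[group_fn(key)].append(key)

    @property
    def group_names(self):
        return list(self._groups)

    def sample(self, n, weights=None):
        # Runtime rebalancing: weighted choice over groups, then a
        # uniform pick of a sample key within each chosen group.
        chosen = random.choices(self.group_names, weights=weights, k=n)
        return [random.choice(self._groups[g]) for g in chosen]
```

Because construction only walks key records, a reader like this would be cheap to rebuild per-node against the same checkout, in line with the distributed approach suggested above.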
A very rough example below shows usage of the wrapper class: weighted balancing of some "flavor" value common to many sample keys (which would presumably be used as a feature during some computation on larger data stored in a different arrayset), and finally a randomized subselection / batching of the sample keys which would be sent off for computation.
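The original example code did not survive the capture of this page, so the following is only a hedged reconstruction of the usage just described. The "flavor" labels, target weights, and batch size are invented for illustration, and a plain dict stands in for the flavor arrayset (a dict exposes `.keys()` just as the sketch above expects):

```python
import random

# Stand-in for a small "flavor" arrayset: sample key -> categorical label.
flavors = {f'sample_{i}': random.choice(['a', 'b', 'c']) for i in range(1000)}

reader = GroupedArraysetDataReader(flavors, group_fn=lambda k: flavors[k])

# Weighted rebalancing: oversample flavor 'a' relative to 'b' and 'c'.
target = {'a': 0.5, 'b': 0.3, 'c': 0.2}
weights = [target[g] for g in reader.group_names]

# Randomized subselection / batching of the sample keys; each batch of
# keys would then be used to read bulk data from a different arrayset.
keys = reader.sample(n=256, weights=weights)
batches = [keys[i:i + 32] for i in range(0, len(keys), 32)]
print(f'{len(batches)} batches of {len(batches[0])} keys')
```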
I'd like to get a good discussion going here, since this is an attempt to build a solution for a problem I don't face on a day-to-day basis. Any thoughts?
Types of changes
What types of changes does your code introduce? Put an x in all the boxes that apply:
Is this PR ready for review, or a work in progress?
How Has This Been Tested?
Put an x in the boxes that apply:
Checklist: