-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API Redesign #115
API Redesign #115
Conversation
Codecov Report
@@ Coverage Diff @@
## master #115 +/- ##
==========================================
+ Coverage 92.25% 92.47% +0.22%
==========================================
Files 50 51 +1
Lines 8622 9448 +826
Branches 843 930 +87
==========================================
+ Hits 7954 8737 +783
- Misses 492 517 +25
- Partials 176 194 +18
|
I actually think that's slightly confusing, since to specify multiple # case 1
samples = dset['foo', 'bar']['1', '2']
# case 2
samples = dset[:]['1']
# case 3
samples = dset[...]['1', '2']
# case 4
samples = dset['foo']['1'] While With Another IdeaWe actually may want to take some minor inspiration from the xarray project, as it is the de-facto standard for working with labelled ndim arrays, or as they say:
While we wouldn't want a "one-to-one" mapping of API and access conventions, their approach to positional indexing is essentially the same as the proposed # vectorized indexing
samples = dset[['foo', ['1', '2']], ['bar', ['3', '4']]`? We could also allow "dict style" indexing: I would specifically avoid their "Pandas style" named indexing (using One more Issue
# checkout has arraysets 'foo', 'bar', 'baz'.
# 'foo' has samples '1', '2'
# 'bar' has samples '1', '2'
# 'baz' has samples '1' <-- missing '2'
# Case 1 (expected output)
>>> res = dset[:, '1']
>>> print(res)
('foo'=np.array([1]), 'bar'=np.array([1, 1]), 'baz'=np.array([1, 1, 1]))
# Case 2A
>>> res = dset[:, '2']
KeyError('2' not in arrayset 'baz')
# Case 2B
>>> res = dset[:, '2']
>>> print(res)
('foo'=np.array([2]), 'bar'=np.array([2, 2]), 'baz'=None)
# Case 2C
>>> res = dset[:, '2']
>>> print(res)
('foo'=np.array([2]), 'bar'=np.array([2, 2])) I was leaning toward suggesting |
I think that the talk of how to layer syntactic sugar over the underlying operations is happening way too early in the design process. Trying to decide how many square brackets you need to get the data out shouldn't happen until the release after you've got verbose methods implemented and released. Also, keep in mind that 99% of your users are never going to use this API. Data will be written, committed, and then they'll call the API to convert it to a Dataset they can feed into their deep learning framework of choice. I'm not saying that because I think that it's okay to design it poorly, just that it's not a great use of time right now. There's an unfortunate trend in a lot of the more research/data science tooling to paper over data problems, which I think runs entirely counter to the hangar "data integrity is paramount" ethos. If I ask for something, but some of the keys are missing, I should get an exception. (FWIW, I feel the same way about missing dimensions, but that's a different PR) I do think that it's okay to have a keyword arg that enables silently discarding rows that don't meet the column constraints, but I should have to enable it explicitly (and I think that needing it points to deeper problems in the data pipeline). |
Hey, great job @rlizzo! Various coments: I agree with @rlizzo's comment on double brackets; I also started with @hhsecond's approach but switched to multi-indexing later on as it was simpler to grasp. When I say that, I'm referring to the fact that our users post 0.3 coming to Hangar will be primarily exposed this API to accessing / writing datasets, as it doesn't require further terminology or object hierarchy than just what you checkout, and it suffices for the majority of the use cases. @rlizzo it's great that we can get consistent with XArrray. I personally think @elistevens it's a good warning re: being too soon, but I don't agree with this. One thing we know is that we need to simplify Hangar's API to make users grow. I do think this is the API that 99% of users should use, because it won't require them to go deep into Hangar's design details to make it useful. The current API will be for advanced users in my head. |
@hhsecond, can you take a look at this and let me know what you think about testing for these new access methods? Not sure if we should be repeating many of the low level tests or just writing new ones for the high level API? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far. I'll take a thorough look once you finish making changes?
arrays stored in each arrayset via a common sample key. Each sample | ||
key is returned values as an individual element in the | ||
List. The sample order is returned in the same order it wasw recieved. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could also return a NamedTuple (outside of List) on this case dset[('foo', 'bar', 'baz'), '1']
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to have that in some cases, but since there isn't anyway to set some modifiable flag
during dict style access it would have to be permenant. Two levels of indexing to pull elements which could be stored sequentially in a list would drive me nuts...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what I am saying is, It is already returning a NamedTuple in the case dset[('foo', 'bar', 'baz'), '1']
which is not documented in the Returns
field. Or Am I missing something here?
Thanks for starting. I'm gonna say go ahead and give it a look now! I'll come back to the rest of the tests once you give it a once over. |
@rlizzo We need something to replace |
Do we have a plan to enable users to loop over the dataset where |
Why? The standard API isn't going anywhere, and is really important to work with anything other than a pre-built repo/dataset. Is there some reason we shouldn't just keep |
It's not on my radar right now, and I'm not sure what it would even return?? I'm actually very wary of actually treating the from my point of view, these changes are mearly a conveience for a common use case users will face, they aren't changing the fundumental nature of what the Does that make sense? Maybe I'm seeing this wrong though. What's your point of view? |
Documentation updated and all tests pass with good coverage. Ready to be merged! |
It is just more intuitive to get all the relevant information from the high-level API itself, IMO |
Yep, make sense. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Motivation and Context
Why is this change required? What problem does it solve?:
To simplify user interface with arraysets and provide some concept of a
dataset
as a view across arraysets.NOTE: This initial PR is a proof of concept only, and will require extensive discussion before the final design is agreed upon
If it fixes an open issue, please link to the issue here:
related to #79 and many conversations on the Hangar Users Slack Channel
Description
Describe your changes in detail:
Added
CheckoutIndexer
class which is inhereted inReaderCheckout
andWriterCheckout
to enable the following API. (originally proposed by @lantiga and @elistevens)Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:Is this PR ready for review, or a work in progress?
How Has This Been Tested?
Put an
x
in the boxes that apply:Checklist: