-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to load only specific streams #24
Conversation
Good idea, but the interface has grown organically for some time so before we add more needed parameters we should think about what should be possible and what's the best way to offer that. Right now, I think a class to wrap an XDF file (optionally supporting the builder pattern) would be a good idea, so the regular use case (e.g. |
I think our main public interface should remain functional, i.e. the function That said, I changed the |
In the long run, i agree with @tstenner a better organisation of the repository, possibly into submodules, and an object oriented approach, could prevent such troubles. Actually, currently we would run into merging conflicts almost everytime for developments on different functionalities. But this is a different issue, isn't it? That said, i agree with @cbrnr insofar as the option to load only specific streams is worthwile the cost of addition to the function arguments, as it would be required to increase loading speed. I disagree with the notion that loading only specific streams is the most common use case. Yet, i believe this is a worthwhile option useful e.g. for debugging. But consider that changing it would affect how streams are synced, as depending on which streams you load, you can get different results for syncing them sample-wise. See #1 This would warrant a discussion, although i'd lean towards only syncing streams that are set to be loaded. |
pyxdf/pyxdf.py
Outdated
@@ -210,7 +214,11 @@ def load_xdf(filename, | |||
if not os.path.exists(filename): | |||
raise Exception('file %s does not exist.' % filename) | |||
|
|||
# dict of returned streams, in order of apparance, indexed by stream id | |||
# load_only contains the streams to load | |||
if isinstance(load_only, int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check if string or not iterable. I don't think we have to be bound to StreamIDs being ints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The StreamID is the only value guaranteed to be unique for a stream, and a StreamID is always an integer. I don't understand why we should also handle a string - could you elaborate please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python doesn't enforce typing so users can pass anything into the load_only argument. Here you are doing a bit of type-checking (among types that are implicitly allowed). Why not go all the way and do full type-checking?
Your line of code is basically saying "if it's not an iterable make it an iterable". I was proposing to explicitly check if it was an iterable. Unfortunately, str
objects are also considered iterable, so usually you want to exclude those from the check.
After you make sure the argument is a proper iterable, you can then check that each item is an int.
This kind of type-checking is a bit of an anti-pattern in Python and it's considered better-practice to do duck-typing (try, except, else), but I've never been too concerned with that and it was a bigger change than what I was proposing here.
pyxdf/pyxdf.py
Outdated
@@ -62,6 +62,7 @@ def __init__(self, xml): | |||
|
|||
|
|||
def load_xdf(filename, | |||
load_only=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer an argument named 'extract_streams', and for the preferred argument value to be a list of dicts, where each dict has to have at least one of the keys "name", "type", "source_id".
This mirrors stream resolving in liblsl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can discuss the parameter name; load_only
is not really a good choice, but I don't like extract_streams
either. This doesn't work nicely with None
, because extract_streams=None
sounds like we don't want to extract any stream. With load_only=None
, at least to me it sounds like we don't want to load any specific stream, so we load all. What about load_streams
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, for the same reason as before (namely that the only unique value associated with a stream is its stream_id
), I'd very much prefer not to have the option to provide other fields. This makes parsing and error-handling a lot more complicated, and is also a lot more complicated from a user's perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A compromise could be a helper function to convert the liblsl-like keywords into a streamid for a given file.
I agree with @cboulay and would furthermore say that a liblsl-like approach feels to me actually less complicated from an end-users perspective. Similar to #23, streamids are arbitrary and can change between sessions. If the streams can not be uniquely resolved by keywords, i would expect to load all matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A compromise could be a helper function to convert the liblsl-like keywords into a streamid for a given file.
Sure, this would be an option.
I agree with @cboulay and would furthermore say that a liblsl-like approach feels to me actually less complicated from an end-users perspective. Similar to #23, streamids are arbitrary and can change between sessions. If the streams can not be uniquely resolved by keywords, i would expect to load all matching.
Only because you are using LSL. I only use XDF files without LSL, so it is much more confusing to me. Furthermore, stream IDs are arbitrary and can change between sessions is not a good argument here, because we only deal with a single XDF file here - stream IDs do not change within a single file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only because you are using LSL.
I can see that point.
stream IDs do not change within a single file.
But between files, so if i analyze multiple files over the course of weeks (which we often do), i would have to catch any changing streamids, while giving a dictionary of keywords is native and not arbitrary. Anyways, this is a functionality our lab will use for sure, so i wrote a quick parser/matcher for that:
It does retrieve a) the stream info from a file and b) return streamids based on matching a list of dictionary with those. Tests are in /pyxdf/test/test_resolve_stream_id. Please be aware the the filepath to the minimal test.xdf was hardcoded - sorry for the hack.
https://github.com/xdf-modules/xdf-Python/tree/resolve_streamid
Looking forward to your feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm not saying that this is not important, it just should be handled outside load_xdf
, which loads a single XDF file.
I already have something along these lines here: https://github.com/cbrnr/mnelab/blob/master/mnelab/utils/xdf.py
Basically, you would parse_chunks(parse_xdf(xdffile))
to get a list of important fields for all streams. From this list, you could resolve on whatever field you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about stream_ids
as a keyword? I think this will help get rid of some confusion among the vast majority of xdf users who are coming from LSL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for stream_ids
!
Thanks for the feedback. @agricolab, I don't think @tstenner was talking about restructuring the repository - this already happened and we have our own XDF-Python repository. Also, I shouldn't have said that loading only specific streams is the most common use case. What I meant is that the argument which streams to load should directly follow the file name because these belong together. The other parameters concern the post-processing of the signals, so they should be bunched together as well. With the current suggestion, this is the case: first we have Regarding the synchronization business, I don't understand how the absence of one or more streams could affect the synchronization of the remaining streams. The result of excluding a stream from loading should behave as if that stream was never recorded and not stored in the file. But in any case, I think we agree that having this option is a useful addition, right? |
I agree that having this option is a useful addition. Regarding restructuring, what i meant is as follows: Currently, almost everything happens in one file and in one function. This can cause conflicts when merging different functionalities. When rewriting (or however we want to call that) into an object-oriented approach, we could consider to restructure the python-repo into different submodules to make merging easier. It would also limit issues about the order of arguments. Anyways, in the end they are all keyword arguments anyways, so order is a small problem. I guess it is more about that the signature is already very heavy. But this discussion it outside of the scope of this PR. Regarding the synchronization, the pull request for synchronization of streams to be aligned sample-wise (see #1) currently selects the fastest streams and aligns all other to it. Omitting a stream therefore would affect this functionalities behavior. Imho not aligning to a not loaded stream is the behavior i would expect, i just wanted to make it transparent. |
OK, got it. I also think we should discuss this elsewhere, but IMO forcing everything into an object-oriented paradigm isn't the best solution. We can offer it optionally, but the basic
Ah, I see. So yes, this is what I think should happen. If you have a suggestion how to make the behavior more explicit, feel free to edit the docstring of the function. For example, you could mention this in the description of the new |
The majority of pyxdf users are going to be people coming from the LSL ecosystem. With the addition of this argument, I foresee the following conversation. User: Hi, I'm trying to load my xdf file but it's not working. Little things like this make tools seem like they aren't really intended for a wide community, but only for a few select users for their own pet projects. I think being able to select which streams are imported is a wonderful feature, but as is I don't see how the feature will be used by anyone except people who know their stream IDs a priori, which as far as I know is only Clemens. Yes you can figure out what the stream id is after parsing the header, but Edit: And if we're providing functionality to select a subset of streams in |
To hopefully tilt the scales and end the tangential conversation I'll add my vote: If we were to add an OO interface then I would prefer not to do it in a sub-repo (already quite a few in xdf). There is nothing stopping someone from making a separate package (i.e. as a wrapper around pyxdf), which could be added to the xdf-modules org, and after it matures a bit we can put it back into pyxdf while still advertising the functional interface but having the OO interface in there for power users who read the docs. |
I've already begun extracting smaller functions so there are less merge conflicts and more options for Cython conversions.
I'd like to extend this: even with the addition, there's no way to know which streams are in the file except to load all/some of them ( In any case, we need the functions to extract a list of stream headers and to map a filter ( |
Like I said, there will be a separate function for this purpose.
I prefer to have a separate function handle the listing of streams and not change |
I've added a
Here's the dialog as it should now play out: User: Hi, I'm trying to load my xdf file but it's not working. |
Great. In agreement with the wise words of Robert C Martin, a function should have a strong verb, e.g. Ideally, there should be no need for a manual selection, i.e. i prefer @cboulay request for handing over parameters. Ideally, there would now be a helper function returning all ids matching a dictionary. This boils down to whether matching should be done witihin load_xdf or outside in an helper function, and how granular we want PRs to be. Feel free to add
|
Thanks @agricolab, I've added this function. |
OK, this seems to be working pretty well - please let me know if you have further comments, otherwise I'd be happy if this was merged. |
Here's the new workflow for loading specific streams. Let's say we only want to load the audio stream and the stream named "MousePosition" from our file.
This workflow could be simplified if we allow the
However, Furthermore, we could think about changing the first parameter of Let me know what you think. |
A bit longer, but is applicable to all kinds of metadata fields. |
@tstenner true, but I'm not sure if this is user-friendly. If I know the stream IDs I'd prefer the simpler option. |
But those who know the stream ids probably also know list comprehensions, whereas other users will wonder why they can say |
I agree with @tstenner. I see no good reason to give stream-ids special favor as an argument. |
As I've said several times before, stream IDs are the only unique identifiers of a stream - everything else (name, type, uid, session_id, source_id, ...) is not guaranteed to be unique. Therefore, they are special by definition, and I would like to have a way to select streams based on their stream IDs without going through multiple hoops (list of dicts wrapped inside list comprehension). This was the original idea of this PR, which then evolved into a discussion what else would be nice to have, after which I've added the list of dicts based matching. I really don't understand the problem here - no one has to use the simpler way. It doesn't break any code. It makes things simpler if I want to make a selection based on stream IDs. Everything is clearly documented. If someone wonders why there is just If your problem is that we have only one argument and multiple ways to use it, we could split the current proposal into two parameters - I'll remove the special status if both of you think I should, but I'd be much happier if you could convince me why I should do that. Since so far only myself will be using the stream ID selection, I can live with the list comprehension, but I thought that other users might also do that. If not, I'll remove it. |
The Could you rename the parameter to |
Yes, they should, but they are not guaranteed to be unique. Re renaming, I also thought that |
Good to see we've resolved the important questions and can go back to bikeshedding :-) What about |
😄 true! I like |
@tstenner @agricolab if you are happy please go ahead and merge. |
threw |
Does this also happen on master? |
no, cbrnr/master loads fine |
I'll take a look. |
Should be fixed. |
@tstenner @agricolab @cboulay anyone up for a (hopefully) final round of review? |
Tested loading with defaults and for selecting subsets. Runs fine, raises as expected ValueError when no matching stream was found. Imho ready to merge. |
I'm sort of on vacation right now. I might take a look but don't wait for me. |
@tstenner do you want to take a look? Otherwise I think I'm going to merge because there do not seem to be any major objections any more. |
Thanks everyone for your input! |
With the
load_only
argument, users can specify a stream ID or a list of stream IDs if they only want to load these specific streams.