Add option to load only specific streams #24

Merged
merged 14 commits into from
Jul 8, 2019

Conversation

Contributor

@cbrnr cbrnr commented May 16, 2019

With the load_only argument, users can specify a stream ID or a list of stream IDs if they only want to load these specific streams.

@cbrnr
Contributor Author

cbrnr commented May 16, 2019

@cboulay @tstenner let me know what you think.

@tstenner
Contributor

Good idea, but the interface has grown organically for some time so before we add more needed parameters we should think about what should be possible and what's the best way to offer that.

Right now, I think a class to wrap an XDF file (optionally supporting the builder pattern) would be a good idea, so the regular use case (e.g. XdfFile('bar.xdf').get_postprocessed_data()) would be almost the same as before (with load_xdf transparently calling that in the background), while more advanced use cases such as yours become possible, e.g. a = XdfFile('bar.xdf'); si = a.get_stream_info(); a.load(streams=[s.streamid for s in si if s.type == 'EEG']); a.postprocess(dejitter=False); a.get_postprocessed_data().

@cbrnr
Contributor Author

cbrnr commented May 20, 2019

I think our main public interface should remain functional, i.e. the function load_xdf. Adding an object-oriented API can be done in parallel, but we should not aim to replace the current function.

That said, I changed the load_xdf signature on purpose, because I think the most common use case will be to load a given XDF file (filename) with only specific streams (load_only). The remaining parameters are pretty much optional for special cases, so they should go at the end of the parameter list.

@agricolab
Member

agricolab commented May 21, 2019

In the long run, I agree with @tstenner: a better organisation of the repository, possibly into submodules, and an object-oriented approach could prevent such troubles. Currently, we would run into merge conflicts almost every time different functionalities are developed in parallel. But this is a different issue, isn't it?

That said, I agree with @cbrnr insofar as the option to load only specific streams is worth the cost of an additional function argument, as it would be required to increase loading speed. I disagree with the notion that loading only specific streams is the most common use case. Yet I believe this is a worthwhile option, useful e.g. for debugging. But consider that it would affect how streams are synced: depending on which streams you load, you can get different results when syncing them sample-wise (see #1). This would warrant a discussion, although I'd lean towards only syncing streams that are set to be loaded.

pyxdf/pyxdf.py Outdated
@@ -210,7 +214,11 @@ def load_xdf(filename,
if not os.path.exists(filename):
raise Exception('file %s does not exist.' % filename)

# dict of returned streams, in order of apparance, indexed by stream id
# load_only contains the streams to load
if isinstance(load_only, int):
Contributor

@cboulay cboulay May 21, 2019

Should check if string or not iterable. I don't think we have to be bound to StreamIDs being ints.

Contributor Author

The StreamID is the only value guaranteed to be unique for a stream, and a StreamID is always an integer. I don't understand why we should also handle a string - could you elaborate please?

Contributor

Python doesn't enforce typing so users can pass anything into the load_only argument. Here you are doing a bit of type-checking (among types that are implicitly allowed). Why not go all the way and do full type-checking?

Your line of code is basically saying "if it's not an iterable make it an iterable". I was proposing to explicitly check if it was an iterable. Unfortunately, str objects are also considered iterable, so usually you want to exclude those from the check.

After you make sure the argument is a proper iterable, you can then check that each item is an int.

This kind of type-checking is a bit of an anti-pattern in Python and it's considered better-practice to do duck-typing (try, except, else), but I've never been too concerned with that and it was a bigger change than what I was proposing here.
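The explicit check described in this comment could be sketched roughly as follows. This is only an illustration, not the code from the PR: normalize_stream_ids is a hypothetical helper name, and rejecting bool values is an extra assumption.

```python
from collections.abc import Iterable


def normalize_stream_ids(load_only):
    """Return a list of integer stream IDs, or None to load every stream.

    Hypothetical helper, not part of pyxdf.
    """
    if load_only is None:
        return None
    # str (and bytes) are iterable too, so reject them before the generic check
    if isinstance(load_only, (str, bytes)):
        raise TypeError("stream IDs must be integers, not strings")
    if not isinstance(load_only, Iterable):
        load_only = [load_only]  # wrap a single value in a list
    ids = list(load_only)
    for item in ids:
        # bool is a subclass of int, so exclude it explicitly (assumption)
        if isinstance(item, bool) or not isinstance(item, int):
            raise TypeError(f"stream ID {item!r} is not an integer")
    return ids
```

With this sketch, normalize_stream_ids(2) and normalize_stream_ids([2]) give the same list, while a stream name such as 'Biosemi' fails early with a TypeError instead of silently matching nothing.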

pyxdf/pyxdf.py Outdated
@@ -62,6 +62,7 @@ def __init__(self, xml):


def load_xdf(filename,
load_only=None,
Contributor

I would prefer an argument named 'extract_streams', and for the preferred argument value to be a list of dicts, where each dict has to have at least one of the keys "name", "type", "source_id".
This mirrors stream resolving in liblsl.

Contributor Author

Sure, we can discuss the parameter name; load_only is not really a good choice, but I don't like extract_streams either. This doesn't work nicely with None, because extract_streams=None sounds like we don't want to extract any stream. With load_only=None, at least to me it sounds like we don't want to load any specific stream, so we load all. What about load_streams instead?

Contributor Author

Also, for the same reason as before (namely that the only unique value associated with a stream is its stream_id), I'd very much prefer not to have the option to provide other fields. This makes parsing and error-handling a lot more complicated, and is also a lot more complicated from a user's perspective.

Member

@agricolab agricolab May 22, 2019

A compromise could be a helper function to convert the liblsl-like keywords into a streamid for a given file.

I agree with @cboulay and would furthermore say that a liblsl-like approach actually feels less complicated from an end user's perspective. Similar to #23, stream IDs are arbitrary and can change between sessions. If the streams cannot be uniquely resolved by keywords, I would expect all matching streams to be loaded.

Contributor Author

A compromise could be a helper function to convert the liblsl-like keywords into a streamid for a given file.

Sure, this would be an option.

I agree with @cboulay and would furthermore say that a liblsl-like approach actually feels less complicated from an end user's perspective. Similar to #23, stream IDs are arbitrary and can change between sessions. If the streams cannot be uniquely resolved by keywords, I would expect all matching streams to be loaded.

Only because you are using LSL. I only use XDF files without LSL, so it is much more confusing to me. Furthermore, "stream IDs are arbitrary and can change between sessions" is not a good argument here, because we only deal with a single XDF file - stream IDs do not change within a single file.

Member

@agricolab agricolab May 22, 2019

Only because you are using LSL.

I can see that point.

stream IDs do not change within a single file.

But they do change between files, so if I analyze multiple files over the course of weeks (which we often do), I would have to catch any changing stream IDs, while giving a dictionary of keywords is native and not arbitrary. Anyway, this is a functionality our lab will use for sure, so I wrote a quick parser/matcher for that:
It a) retrieves the stream info from a file and b) returns stream IDs by matching a list of dictionaries against it. Tests are in /pyxdf/test/test_resolve_stream_id. Please be aware that the file path to the minimal test.xdf is hardcoded - sorry for the hack.
https://github.com/xdf-modules/xdf-Python/tree/resolve_streamid

Looking forward to your feedback.

Contributor Author

Sure, I'm not saying that this is not important, it just should be handled outside load_xdf, which loads a single XDF file.

I already have something along these lines here: https://github.com/cbrnr/mnelab/blob/master/mnelab/utils/xdf.py
Basically, you would parse_chunks(parse_xdf(xdffile)) to get a list of important fields for all streams. From this list, you could resolve on whatever field you want.

Contributor

How about stream_ids as a keyword? I think this will help get rid of some confusion among the vast majority of xdf users who are coming from LSL.

Contributor Author

+1 for stream_ids!

@cbrnr
Contributor Author

cbrnr commented May 22, 2019

Thanks for the feedback. @agricolab, I don't think @tstenner was talking about restructuring the repository - this already happened and we have our own XDF-Python repository.

Also, I shouldn't have said that loading only specific streams is the most common use case. What I meant is that the argument which streams to load should directly follow the file name because these belong together. The other parameters concern the post-processing of the signals, so they should be bunched together as well. With the current suggestion, this is the case: first we have filename and load_only, and then we have all parameters related to post-processing.

Regarding the synchronization business, I don't understand how the absence of one or more streams could affect the synchronization of the remaining streams. The result of excluding a stream from loading should behave as if that stream was never recorded and not stored in the file. But in any case, I think we agree that having this option is a useful addition, right?

@agricolab
Member

I agree that having this option is a useful addition.

Regarding restructuring, what I meant is as follows: Currently, almost everything happens in one file and in one function. This can cause conflicts when merging different functionalities. When rewriting (or whatever we want to call it) into an object-oriented approach, we could consider restructuring the Python repo into different submodules to make merging easier. It would also limit issues about the order of arguments. Anyway, in the end they are all keyword arguments, so order is a small problem. I guess it is more that the signature is already very heavy. But this discussion is outside the scope of this PR.

Regarding the synchronization, the pull request for aligning streams sample-wise (see #1) currently selects the fastest stream and aligns all others to it. Omitting a stream would therefore affect this functionality's behavior. IMHO, not aligning to a stream that is not loaded is the behavior I would expect; I just wanted to make it transparent.
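To illustrate the point about alignment, here is a toy sketch. It assumes that "fastest" means the highest nominal_srate and uses made-up stream entries; pick_reference is a hypothetical name and this is not the code from the synchronization PR.

```python
def pick_reference(streams):
    """Return the stream_id of the stream with the highest nominal sampling rate."""
    return max(streams, key=lambda s: s["nominal_srate"])["stream_id"]


# Made-up example streams (only the fields needed for the illustration)
streams = [
    {"stream_id": 1, "nominal_srate": 0},
    {"stream_id": 2, "nominal_srate": 5000},
    {"stream_id": 3, "nominal_srate": 1000},
]

# Reference stream when everything is loaded
all_ref = pick_reference(streams)

# Reference stream when stream 2 is excluded from loading
subset_ref = pick_reference([s for s in streams if s["stream_id"] != 2])
```

Here the reference changes from stream 2 to stream 3 once stream 2 is left out, so the remaining streams would be aligned to a different time base.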

@cbrnr
Contributor Author

cbrnr commented May 22, 2019

Regarding restructuring, what I meant is as follows: Currently, almost everything happens in one file and in one function. This can cause conflicts when merging different functionalities. When rewriting (or whatever we want to call it) into an object-oriented approach, we could consider restructuring the Python repo into different submodules to make merging easier. It would also limit issues about the order of arguments. Anyway, in the end they are all keyword arguments, so order is a small problem. I guess it is more that the signature is already very heavy. But this discussion is outside the scope of this PR.

OK, got it. I also think we should discuss this elsewhere, but IMO forcing everything into an object-oriented paradigm isn't the best solution. We can offer it optionally, but the basic load_xdf function should remain our primary way to load XDF files.

Regarding the synchronization, the pull request for aligning streams sample-wise (see #1) currently selects the fastest stream and aligns all others to it. Omitting a stream would therefore affect this functionality's behavior. IMHO, not aligning to a stream that is not loaded is the behavior I would expect; I just wanted to make it transparent.

Ah, I see. So yes, this is what I think should happen. If you have a suggestion how to make the behavior more explicit, feel free to edit the docstring of the function. For example, you could mention this in the description of the new load_only argument (or whatever it will be called).

@cboulay
Contributor

cboulay commented May 22, 2019

The majority of pyxdf users are going to be people coming from the LSL ecosystem. With the addition of this argument, I foresee the following conversation.

User: Hi, I'm trying to load my xdf file but it's not working.
Me: OK, what's the problem?
(After a day of back-and-forth on Slack, probably longer...)
User: Here is my code: load_xdf(load_only='Biosemi').
Me: Oh, that argument isn't for stream names (or stream types), it's only for stream ids.
User: How do I know the stream id?
Me: You don't.
User: Oh, how do I use this argument, which is obviously important because it's the first argument?
Me: You can't. Only Clemens can.

Little things like this make tools seem like they aren't really intended for a wide community, but only for a few select users for their own pet projects.

I think being able to select which streams are imported is a wonderful feature, but as is I don't see how the feature will be used by anyone except people who know their stream IDs a priori, which as far as I know is only Clemens.

Yes you can figure out what the stream id is after parsing the header, but parse_chunks(parse_xdf(xdffile)) is not user-friendly.

Edit: And if we're providing functionality to select a subset of streams in load_xdf, then that functionality should be extended to things that can be found in the header.

@cboulay
Contributor

cboulay commented May 22, 2019

To hopefully tilt the scales and end the tangential conversation I'll add my vote:
I prefer to keep things functional.

If we were to add an OO interface then I would prefer not to do it in a sub-repo (already quite a few in xdf). There is nothing stopping someone from making a separate package (i.e. as a wrapper around pyxdf), which could be added to the xdf-modules org, and after it matures a bit we can put it back into pyxdf while still advertising the functional interface but having the OO interface in there for power users who read the docs.

@cbrnr
Contributor Author

cbrnr commented May 23, 2019

OK, let me try to summarize:

  1. We keep our load_xdf function for now and extend it with the new argument to load only specific streams. We can add an OO API if anyone feels that it is needed in parallel/in addition. This could even be just a separate module in this package/repository.
  2. Stream IDs are the only unique identifiers of recorded streams (at least within a single XDF file). Since load_xdf always loads a single file, we need not be concerned that stream IDs of the same sources change across different files. Therefore, an argument (let's call it stream_ids) is needed to selectively load specific streams.
  3. Currently, there is no apparent way to find out which stream IDs are contained in an XDF file. There are multiple ways to tackle this problem. One possible solution is a small function which takes an XDF file name and returns a list of all streams and associated meta information such as StreamID, channel_count, channel_format, and nominal_srate (these are the only mandatory fields). In addition, this function would list optional fields such as name, type, source_id, created_at, uid, session_id, hostname, and so on (if present).

Here's an example of what this function might return:
[Screenshot: example output of the proposed function]
Given this information, I think it is not too difficult to find out the stream IDs a user wants to load.

In a next step, we could add another utility function that automates this process more. For example, this function could take a list of dicts, or a dict (or whatever type we find best suited), with stream properties that the user wants to load (e.g. {"type": "eeg"} would mean that I want to load all streams of type "eeg").

However, I don't want to do all of this in this PR. So if you agree on this plan, I would like to go ahead and first start with adding the new parameter in this PR and continue adding the helper functions in separate PRs.

@tstenner
Contributor

Regarding restructuring, what I meant is as follows: Currently, almost everything happens in one file and in one function. This can cause conflicts when merging different functionalities.

I've already begun extracting smaller functions so there are fewer merge conflicts and more options for Cython conversions.

User: How do I know the stream id?
Me: You don't.
User: Oh, how do I use this argument, which is obviously important because it's the first argument?
Me: You can't. Only Clemens can.

I'd like to extend this: even with the addition, there's no way to know which streams are in the file except to load all/some of them (load_xdf('foo.xdf', load_only={'type': 'EEG'})), see which ones you need and then load the entire file again.

In any case, we need the functions to extract a list of stream headers and to map a filter ({'type': 'EEG'}) to stream_ids. It doesn't matter much if those functions are member functions in a class or freestanding functions, but since we need to open the file to get a list of streams and might as well keep it open the object oriented approach makes sense for general file handling. The load_xdf function can create the object and call the appropriate functions transparently so the existing code will continue working and 95% of all users can use this convenience function.

@cbrnr
Contributor Author

cbrnr commented May 24, 2019

I'd like to extend this: even with the addition, there's no way to know which streams are in the file except to load all/some of them (load_xdf('foo.xdf', load_only={'type': 'EEG'})), see which ones you need and then load the entire file again.

Like I said, there will be a separate function for this purpose.

In any case, we need the functions to extract a list of stream headers and to map a filter ({'type': 'EEG'}) to stream_ids. It doesn't matter much if those functions are member functions in a class or freestanding functions, but since we need to open the file to get a list of streams and might as well keep it open the object oriented approach makes sense for general file handling. The load_xdf function can create the object and call the appropriate functions transparently so the existing code will continue working and 95% of all users can use this convenience function.

I prefer to have a separate function handle the listing of streams and not change load_xdf. Regarding its stream_ids argument, I would like to accept only integers or a list of integers as a first step. As I've explained before, I'd like to add the additional functionality later.

@cbrnr
Contributor Author

cbrnr commented May 24, 2019

I've added a stream_info function (feel free to suggest a better name) which collects information on all streams contained in an XDF file in a list of dicts, e.g.:

[{'channel_count': 1,
  'channel_format': 'string',
  'created_at': '14719.059184428999',
  'hostname': 'BrainflightNB',
  'name': 'BrainVision RDA Markers',
  'nominal_srate': 0,
  'session_id': 'default',
  'source_id': 'RDA 127.0.0.1:51244 Marker',
  'stream_id': 2,
  'type': 'Markers',
  'uid': 'c9f2b436-1c1f-4917-8629-078be1b9bcaa'},
 {'channel_count': 64,
  'channel_format': 'float32',
  'created_at': '14719.056604071',
  'hostname': 'BrainflightNB',
  'name': 'BrainVision RDA',
  'nominal_srate': 5000,
  'session_id': 'default',
  'source_id': 'RDA 127.0.0.1:51244',
  'stream_id': 3,
  'type': 'EEG',
  'uid': '927e4d1b-8c73-425f-8365-50c122c6a127'},
 {'channel_count': 1,
  'channel_format': 'string',
  'created_at': '14564.287683467999',
  'hostname': 'NB_7_64_02',
  'name': 'Keyboard',
  'nominal_srate': 0,
  'session_id': 'default',
  'source_id': 'KeyboardCapture_NB_7_64_02',
  'stream_id': 1,
  'type': 'Markers',
  'uid': '62d00708-2832-4d95-a3d0-7aea1238ce24'}]

Here's the dialog as it should now play out:

User: Hi, I'm trying to load my xdf file but it's not working.
Me: OK, what's the problem?
(After a day of back-and-forth on Slack, probably longer...)
User: Here is my code: load_xdf('myfile.xdf', stream_ids='Biosemi').
Me: Oh, that argument isn't for stream names (or stream types), it's only for stream ids. It says so in the docstring.
User: How do I know the stream id?
Me: You can find out with stream_info('myfile.xdf').
User: Oh, great, this works really nicely! That way, I found out that my Biosemi stream has ID 2, so calling load_xdf('myfile.xdf', stream_ids=2) did the job.

@agricolab
Member

Great. In agreement with the wise words of Robert C. Martin, a function name should have a strong verb, e.g. resolve_streams, which would also have the strongest similarity to pylsl. An alternative could be a class instantiation, e.g. StreamInfo(filename). The latter is probably for a future OO implementation of pyxdf, so I'd lean towards the former, a function named with a verb.

Ideally, there should be no need for a manual selection, i.e. I prefer @cboulay's request for handing over parameters. Ideally, there would now be a helper function returning all IDs matching a dictionary. This boils down to whether matching should be done within load_xdf or outside in a helper function, and how granular we want PRs to be. Feel free to add:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed May 22 10:53:49 2019

@author: Robert Guggenberger
"""
from typing import List, Optional
#%%
def match_streaminfos(stream_infos: List[dict], parameters: Optional[List[dict]] = None):
    """Return the sorted, unique stream IDs matching any of the requests."""
    matches = set()
    for request in parameters or []:
        for info in stream_infos:
            # a stream matches only if every requested key/value pair agrees
            if all(info.get(key) == value for key, value in request.items()):
                matches.add(info['stream_id'])
    return sorted(matches)  # unique values in a deterministic order
# %%

stream_infos = [{'channel_count': 1,
              'channel_format': 'string',
              'created_at': '14719.059184428999',
              'hostname': 'BrainflightNB',
              'name': 'BrainVision RDA Markers',
              'nominal_srate': 0,
              'session_id': 'default',
              'source_id': 'RDA 127.0.0.1:51244 Marker',
              'stream_id': 2,
              'type': 'Markers',
              'uid': 'c9f2b436-1c1f-4917-8629-078be1b9bcaa'},
             {'channel_count': 64,
              'channel_format': 'float32',
              'created_at': '14719.056604071',
              'hostname': 'BrainflightNB',
              'name': 'BrainVision RDA',
              'nominal_srate': 5000,
              'session_id': 'default',
              'source_id': 'RDA 127.0.0.1:51244',
              'stream_id': 3,
              'type': 'EEG',
              'uid': '927e4d1b-8c73-425f-8365-50c122c6a127'},
             {'channel_count': 64,
              'channel_format': 'float32',
              'created_at': '33546.38583',
              'hostname': 'AnotherPC',
              'name': 'eego mysports',
              'nominal_srate': 1000,
              'session_id': 'default',
              'source_id': 'AntNeuro',
              'stream_id': 4,
              'type': 'EEG',
              'uid': '635eecae-8382-bcec-993c-1c2738ee2aa4'},
             {'channel_count': 1,
              'channel_format': 'string',
              'created_at': '14564.287683467999',
              'hostname': 'NB_7_64_02',
              'name': 'Keyboard',
              'nominal_srate': 0,
              'session_id': 'default',
              'source_id': 'KeyboardCapture_NB_7_64_02',
              'stream_id': 1,
              'type': 'Markers',
              'uid': '62d00708-2832-4d95-a3d0-7aea1238ce24'}]


    
def test_selection_single():          
    parameters = [{'name': 'BrainVision RDA Markers'}]
    sid = match_streaminfos(stream_infos, parameters)
    assert sid ==[2]
    
    parameters = [{'name': 'BrainVision RDA'}]
    sid = match_streaminfos(stream_infos, parameters)
    assert sid == [3]
    
    parameters = [{'name': 'DoesnotWork'}]
    sid =  match_streaminfos(stream_infos, parameters)
    assert sid == []
    
    
def test_selection_multiple_returns():

    parameters = [{'type': 'EEG'}]
    sid =  match_streaminfos(stream_infos, parameters)
    assert sid ==[3,4]
    
    parameters = [{'name': 'BrainVision RDA'}, {'name': 'BrainVision RDA Markers'}]
    sid =  match_streaminfos(stream_infos, parameters)
    assert sid ==[2, 3]
    
    parameters = [{'name': 'BrainVision RDA'}, {'name': 'DoesnotWork'}, {'name': 'BrainVision RDA Markers'}]
    sid =  match_streaminfos(stream_infos, parameters)
    assert sid ==[2, 3]
    
def test_selection_multiple_parms():
    
    
    parameters = [{'name': 'BrainVision RDA', 'type': 'EEG'}]
    sid =  match_streaminfos(stream_infos, parameters)
    assert sid == [3]
    
    parameters = [{'name': 'BrainVision RDA', 'type': 'Doesnotmatch'}]
    sid =  match_streaminfos(stream_infos, parameters)
    assert sid ==[]

@cbrnr
Contributor Author

cbrnr commented May 24, 2019

Thanks @agricolab, I've added this function.

@cbrnr
Contributor Author

cbrnr commented Jun 18, 2019

OK, this seems to be working pretty well - please let me know if you have further comments, otherwise I'd be happy if this was merged.

@cbrnr
Contributor Author

cbrnr commented Jun 19, 2019

Here's the new workflow for loading specific streams. Let's say we only want to load the audio stream and the stream named "MousePosition" from our file.

  1. Initially, we don't know anything about the streams contained in our file xdf_sample.xdf. Therefore, we will resolve the streams with
    stream_infos = resolve_streams("xdf_sample.xdf")
    
    This creates a list of dicts containing information on each stream. It looks as follows:
    [{'channel_count': 2,
      'channel_format': 'float32',
      'created_at': '29519.444814558999',
      'hostname': 'Jordan',
      'name': 'MousePosition',
      'nominal_srate': 0,
      'session_id': 'default',
      'source_id': 'MousePositionCapture_Jordan',
      'stream_id': 3,
      'type': 'Position',
      'uid': '9f781aae-4b73-4ce6-bff2-7e9852ecc963'},
     {'channel_count': 1,
      'channel_format': 'string',
      'created_at': '29519.438182786002',
      'hostname': 'Jordan',
      'name': 'MouseButtons',
      'nominal_srate': 0,
      'session_id': 'default',
      'source_id': 'MouseButtonCapture_Jordan',
      'stream_id': 4,
      'type': 'Markers',
      'uid': 'd2c48fea-c04f-4777-b1b0-dc80eb51b7e2'},
     {'channel_count': 1,
      'channel_format': 'string',
      'created_at': '29536.104270822001',
      'hostname': 'Jordan',
      'name': 'Keyboard',
      'nominal_srate': 0,
      'session_id': 'default',
      'source_id': 'KeyboardCapture_Jordan',
      'stream_id': 1,
      'type': 'Markers',
      'uid': '4451d0a6-fba8-4a97-b008-b7178a34c22b'},
     {'channel_count': 2,
      'channel_format': 'float32',
      'created_at': '29501.938893857001',
      'hostname': 'Jordan',
      'name': 'AudioCaptureWin',
      'nominal_srate': 44100,
      'session_id': 'default',
      'source_id': 'Jordan{0.0.1.00000000}.{15571e14-3f37-459b-ab48-c92d4bca6d92}',
      'stream_id': 2,
      'type': 'Audio',
      'uid': '57bae396-734c-4613-9835-4490cc3aed61'}]
    
  2. Nice, we could manually search through these fields and collect the required stream IDs. This is not too difficult because this is a small XDF file, and we could readily come up with stream IDs 2 and 3. We could feed these IDs directly as an argument to load_xdf, but in general such a manual procedure will be cumbersome.
  3. Instead, we now search for streams with desired attributes using the stream_infos object we just created:
    stream_ids = match_streaminfos(stream_infos, [{"name": "MousePosition"}, {"type": "Audio"}])
    
    This will return the stream IDs of all matches, which in our case is [2, 3] as expected.
  4. We can now load the specific streams with streams, header = load_xdf("xdf_sample.xdf", stream_ids=stream_ids).

This workflow could be simplified if we allow the stream_ids argument to be a query that we just fed into match_streaminfos. Then the process of loading only specific streams would be:

streams, header = load_xdf("xdf_sample.xdf", stream_ids=[{"name": "MousePosition"}, {"type": "Audio"}])

However, stream_ids is not an optimal parameter name if we allow this (suggestions?).

Furthermore, we could think about changing the first parameter of match_streaminfos to be the file name instead of the object returned by resolve_streams.

Let me know what you think.
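The simplification discussed above, a single argument that takes either plain stream IDs or a list of query dicts, could look roughly like this. It is a sketch only: resolve_selection is a hypothetical name and the dispatch rules are assumptions, not the implementation in this PR.

```python
def resolve_selection(selection, stream_infos):
    """Translate a user selection into a sorted list of stream IDs."""
    if selection is None:  # no selection means every stream
        return sorted(info["stream_id"] for info in stream_infos)
    if isinstance(selection, int):  # a single stream ID
        return [selection]
    items = list(selection)
    if all(isinstance(item, int) for item in items):
        return sorted(items)
    if all(isinstance(item, dict) for item in items):
        return sorted({
            info["stream_id"]
            for info in stream_infos
            for query in items
            # a stream matches a query if every key/value pair agrees
            if all(info.get(key) == value for key, value in query.items())
        })
    raise TypeError("selection must be an int, a list of ints, or a list of dicts")


# Trimmed-down version of the stream_infos shown above
stream_infos = [
    {"stream_id": 3, "name": "MousePosition", "type": "Position"},
    {"stream_id": 4, "name": "MouseButtons", "type": "Markers"},
    {"stream_id": 1, "name": "Keyboard", "type": "Markers"},
    {"stream_id": 2, "name": "AudioCaptureWin", "type": "Audio"},
]

selected = resolve_selection([{"name": "MousePosition"}, {"type": "Audio"}], stream_infos)
```

A string such as "Biosemi" falls through to the TypeError, which gives the user an immediate hint that the argument expects IDs or query dicts.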

@tstenner
Contributor

stream_ids = [{'stream_id': stream_id} for stream_id in [4,5,8,9]]

A bit longer, but is applicable to all kinds of metadata fields.

@cbrnr
Contributor Author

cbrnr commented Jun 25, 2019

@tstenner true, but I'm not sure if this is user-friendly. If I know the stream IDs I'd prefer the simpler option.

@tstenner
Contributor

But those who know the stream ids probably also know list comprehensions, whereas other users will wonder why they can say stream_ids=[1,2,3] but not stream_names=['MyEEG', 'PresentationMarkers']

@agricolab
Member

I agree with @tstenner. I see no good reason to give stream-ids special favor as an argument.

@cbrnr
Contributor Author

cbrnr commented Jun 26, 2019

As I've said several times before, stream IDs are the only unique identifiers of a stream - everything else (name, type, uid, session_id, source_id, ...) is not guaranteed to be unique. Therefore, they are special by definition, and I would like to have a way to select streams based on their stream IDs without going through multiple hoops (list of dicts wrapped inside list comprehension). This was the original idea of this PR, which then evolved into a discussion what else would be nice to have, after which I've added the list of dicts based matching.

I really don't understand the problem here - no one has to use the simpler way. It doesn't break any code. It makes things simpler if I want to make a selection based on stream IDs. Everything is clearly documented. If someone wonders why there is just stream_ids and not stream_names they are welcome to open an issue here.

If your problem is that we have only one argument and multiple ways to use it, we could split the current proposal into two parameters - stream_ids just for stream IDs (as the name implies) and something else (e.g. match_streams) for the list of dict approach. To keep things simple, we could enforce that only one of these two parameters can be set (and if both are set we raise an error).

I'll remove the special status if both of you think I should, but I'd be much happier if you could convince me why I should do that. Since so far I am the only one who will be using the stream ID selection, I can live with the list comprehension, but I thought that other users might also want it. If not, I'll remove it.
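For reference, the two-parameter split could look roughly like the sketch below. The function `resolve_selection` is hypothetical and only illustrates the "exactly one of the two may be set" rule; the parameter names `stream_ids` and `match_streams` are the ones floated above, and stream metadata is assumed to be a list of flat dicts.

```python
def resolve_selection(stream_infos, stream_ids=None, match_streams=None):
    """Return the stream IDs to load, given at most one kind of selector."""
    if stream_ids is not None and match_streams is not None:
        raise ValueError("Set either stream_ids or match_streams, not both.")
    if stream_ids is not None:
        wanted = set(stream_ids)
        return [s["stream_id"] for s in stream_infos if s["stream_id"] in wanted]
    if match_streams is not None:
        # A stream is selected if all fields of at least one query dict match.
        return [
            s["stream_id"]
            for s in stream_infos
            if any(all(s.get(k) == v for k, v in q.items()) for q in match_streams)
        ]
    # No selector given: keep every stream.
    return [s["stream_id"] for s in stream_infos]


# Hypothetical metadata, for illustration only.
infos = [
    {"stream_id": 1, "name": "MyEEG", "type": "EEG"},
    {"stream_id": 2, "name": "PresentationMarkers", "type": "Markers"},
]
ids_only = resolve_selection(infos, stream_ids=[2])
by_type = resolve_selection(infos, match_streams=[{"type": "EEG"}])
```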

@tstenner
Contributor

The uids and stream names should be unique, too. I'm not entirely happy with putting an implementation detail in the public API, but it's the best solution for your use case.

Could you rename the parameter to stream_filter? The current documentation is fine and explains the allowed parameter types very well, but in the most common case (dicts) I wouldn't think to look at the stream_ids parameter. Also, this would allow us to add support for filter functions (e.g. stream_filter=lambda streaminfo: streaminfo['name'].endswith('Foobar')) in the future with a name that still fits.
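A rough sketch of what a single parameter accepting IDs, dicts, or a filter function might look like follows. The helper `apply_stream_filter` is hypothetical and not part of pyxdf; it assumes stream metadata is a list of flat dicts with a `stream_id` key.

```python
def apply_stream_filter(stream_infos, stream_filter=None):
    """Return the streams selected by stream_filter.

    stream_filter may be None (keep everything), a callable taking one
    metadata dict, or a list whose items are stream IDs or query dicts.
    """
    if stream_filter is None:
        return list(stream_infos)
    if callable(stream_filter):
        return [s for s in stream_infos if stream_filter(s)]
    selected = []
    for s in stream_infos:
        for item in stream_filter:
            if isinstance(item, dict):
                hit = all(s.get(k) == v for k, v in item.items())
            else:
                hit = s["stream_id"] == item
            if hit:
                selected.append(s)
                break  # one match is enough; avoid adding a stream twice
    return selected


# Hypothetical metadata, for illustration only.
infos = [
    {"stream_id": 1, "name": "AmpFoobar", "type": "EEG"},
    {"stream_id": 2, "name": "Markers", "type": "Markers"},
]
by_func = apply_stream_filter(infos, lambda info: info["name"].endswith("Foobar"))
by_mixed = apply_stream_filter(infos, [2, {"type": "EEG"}])
```

Note that a lambda body is a single expression, so the predicate is written without `return`.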

@cbrnr
Contributor Author

cbrnr commented Jun 27, 2019

> The uids and stream names should be unique, too. I'm not entirely happy with putting an implementation detail in the public API, but it's the best solution for your use case.

Yes, they should, but they are not guaranteed to be unique.

Re renaming, I also thought that stream_ids is not a good parameter name anymore. How do you like match_streams? Or if you want to have "filter" in the name, what about filter_streams (I think a verb is better than a noun here)?

@tstenner
Contributor

Good to see we've resolved the important questions and can go back to bikeshedding :-)

What about select_streams or which_streams? I have opinions, but I've worked too much on pylsl to know what first-time users look for so I don't want to cast any vote.

@cbrnr
Contributor Author

cbrnr commented Jun 27, 2019

😄 true! I like select_streams.

@cbrnr
Contributor Author

cbrnr commented Jun 28, 2019

@tstenner @agricolab if you are happy please go ahead and merge.

@agricolab
Member

from pyxdf import load_xdf
streams, info = load_xdf("example-files/minimal.xdf")

threw TypeError: 'NoneType' object is not iterable in line 209
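The report does not show the cause, but one common way to hit this error after adding an optional selection parameter is iterating over (or testing membership in) a default of None. The sketch below illustrates that failure mode and a guard against it; `normalize_selection` and `should_load` are hypothetical helpers, not the code at line 209.

```python
def normalize_selection(select_streams=None):
    """Turn the user's selection into None (load all) or a list of stream IDs."""
    if select_streams is None:
        return None  # None means "no restriction"
    if isinstance(select_streams, int):
        return [select_streams]
    return list(select_streams)


def should_load(stream_id, selection):
    """Load every stream when there is no selection, else only listed IDs."""
    return selection is None or stream_id in selection


# Reproduce the error message by iterating over None directly.
try:
    for _ in None:
        pass
except TypeError as exc:
    message = str(exc)
```

Checking for None before iterating keeps the default call (no selection) working the same as before the parameter was added.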

@cbrnr
Contributor Author

cbrnr commented Jul 2, 2019

Does this also happen on master?

@agricolab
Member

agricolab commented Jul 2, 2019

no, cbrnr/master loads fine

@cbrnr
Contributor Author

cbrnr commented Jul 2, 2019

I'll take a look.

@cbrnr
Contributor Author

cbrnr commented Jul 2, 2019

Should be fixed.

@cbrnr
Contributor Author

cbrnr commented Jul 2, 2019

@tstenner @agricolab @cboulay anyone up for a (hopefully) final round of review?

@agricolab
Member

Tested loading with defaults and with selected subsets. Runs fine and, as expected, raises a ValueError when no matching stream is found. IMHO ready to merge.

@cboulay
Contributor

cboulay commented Jul 2, 2019

I'm sort of on vacation right now. I might take a look but don't wait for me.

@cbrnr
Contributor Author

cbrnr commented Jul 8, 2019

@tstenner do you want to take a look? Otherwise I think I'm going to merge because there do not seem to be any major objections any more.

pyxdf/pyxdf.py (outdated review comment, resolved)
@cbrnr merged commit a0b8501 into xdf-modules:master on Jul 8, 2019
@cbrnr deleted the load_subset branch on July 8, 2019 11:46
@cbrnr
Contributor Author

cbrnr commented Jul 8, 2019

Thanks everyone for your input!

@agricolab mentioned this pull request on Sep 23, 2019

4 participants