Streams in load_xdf should not be ordered by default #35

dojeda · 2019-02-26T10:30:51Z

I recently found that there was a difference between the order of the streams in the Matlab and Python loader functions. I read the xdf specification, and there is no mention on the order of the streams. However, _load_xdf does this:

    streams = [s for s in streams.values()]
    sort_data = [s['info']['name'][0] for s in streams]
    streams = [x for _, x in sorted(zip(sort_data, streams))]

Sorting the streams by their name may be useful, but it may break for a user that expects stream 0 to be of a particular type, stream 1 to be of another particular type, and so on, because this might be the way the data is saved from their acquisition procedure. Moreover, this sorting is not documented, so it comes as a surprise.

What I propose is to either drop this sort (a breaking change) or, more conservatively, add a keyword parameter (which defaults to the current behavior) to perform this sorting:

def load_xdf(filename, ..., sort_streams=True):
    ...
    # at the end of the function...
    streams = [s for s in streams.values()]
    if sort_streams:
        streams.sort(key=lambda s: s['info']['name'][0])
    ...

(I simplified the 3-line code to sort the indices)

If you guys agree with this, I can file a PR with this change.

The text was updated successfully, but these errors were encountered:

cboulay · 2019-02-26T12:21:18Z

I think the issue here is that in an earlier step the streams are collected in a dictionary which has no order, meaning that without sorting they would come back in random order each time the same file is loaded. I’m on mobile now so I can’t easily confirm. If this is the case then maybe the streams should be first loaded into an OrderedDict.
That being said, the way LabRecorder stores streams doesn’t impose any order. Two files from the same session can have different stream ordering on disk so xdf users should NEVER rely on stream order.

dojeda · 2019-02-26T12:52:45Z

Thanks for the input @cboulay.

You are correct on fact that streams are saved in a dictionary, but since Python 3.7, dictionaries preserve insertion order. It is still strongly suggested to use an OrderedDict to keep a common behavior with <3.7 versions and for expressiveness. The dictionary where streams are saved is indexed by their StreamId, so this could alternatively be used as the order when converting the dictionary to a list. As I finish writing this paragraph, I realize that it, IMO it would make more sense to return streams as a dictionary without converting it or sorting it.

That being said, the way LabRecorder stores streams doesn’t impose any order. Two files from the same session can have different stream ordering on disk so xdf users should NEVER rely on stream order.

That's very interesting. Do you have any documentation that claims this so I could read a bit further on this and have a solid proof to argue against relying on order?

Addresses sccn#35 but note: apparently, one should not count on a specific order of streams (it can be arbitrary). I am still not sure if this warning concerns the StreamId as well.

cboulay · 2019-02-26T17:12:26Z

Each stream, including its header, is written by a stream-specific thread. See here. Relying on thread ordering is a bad idea and the discussion could end here.

Maybe because of the threading architecture, maybe for other reasons, no attention is paid to stream ordering elsewhere in LabRecorder. Streams are ordered according to the order that lsl::resolve_streams() finds them. But, if you add streams later and then tell LabRecorder to refresh the streams, new streams are added at the end of the list.

dojeda · 2019-02-26T18:02:56Z

Ok, I am convinced to not rely on stream ordering and the purpose of this issue is now weak to me.
However, I think it would be a good idea to at least inform that the returned list of streams pf load_xdf does provide a specific order.

cboulay · 2019-02-26T18:18:21Z

I think we should probably just delete the lines that do the sorting. I'll leave this issue open until I find time to delete the sorting (and test the result). If you want to make a PR for this small change then please go ahead.

You're right that returning a dictionary would be better than returning a list with dubious ordering, but this would be a breaking change and thus come at a high cost for a relatively small benefit.

dojeda · 2019-02-27T09:29:34Z

Ok I can do the PR later today.
One question though: I git-blamed the file and it seems that this particular part comes from a merge from Intheon's version. Do you know anything or perhaps someone over there that may want to chime in?

agricolab · 2019-02-27T15:11:47Z

I also stumbled upon this recasting part at the end of load_xdf. It makes development tricky. One option to ensure backwards compatibility is to make sorting an option with default true; or making load_xdf a load_xdf_as_dict() which does everything but sort.

raymundfueloep · 2019-02-28T10:09:55Z

Hello,
I found this issue here due to a crash that I encountered: if the XDF to be loaded contains two streams with identical names, the old sorting (sorted(zip(sort_data, streams))) tries to compare the actual stream objects, which fails. Using streams.sort(key=lambda s: s['info']['name'][0]) avoids that crash.

dojeda · 2019-02-28T11:02:31Z

All right, I actually did not do a PR because I am not sure what the consensus is.

I have also encountered the same problem described by Raymund; I have some XDF files with streams that have the same name and the current sorting falls back to the second tuple element comparison (the streams) and fails.

So far what I gather is that:

Chadwick proposes removing the sorting.
Robert proposes a new functionload_xdf_as_dict that calls load_xdf and then returns a dict, retaining backward compatibility.
At the very minimum, the current comparison needs to be fixed to account for streams for repeated names.

I lean towards the dict option and I would only need to know:

What key should be used? I propose the StreamId but I am not sure if this makes sense.
Would you agree to add a deprecation warning on the load_xdf or should we leave the two functions? (cf There should be one-- and preferably only one --obvious way to do it.)

agricolab · 2019-03-05T07:59:42Z

Robert proposes a new function load_xdf_as_dict that calls load_xdf and then returns a dict

Sorry, my original description was a little bit off. My original suggestion should have been(for backwards compatibility) to keep load_xdf returning a list, but it passes all arguments to load_xdf_as_dict(*args) -> dictand then sort_streams(streams:dict, sort_key:str) -> OrderedDict. The sort_key argument could be e.g. StreamId or type. This sorting is followed by recast_streams(streams:OrderedDict) -> list. In that case, everyone can call the low-level function for the desired output.

While sorting makes a lot of sense imho for a a list, it is pretty unnecessary for a dictionary, I find the recasting from dict to list at the end of load_xdf unnecessary. I personally could live with a break in backwards compatibility, end even if the output would be a regular dict instead of an OrderedDict. Consider also that recasting a dict as a list is pretty simple, especially if there is a backwards-compatibility function that does that. So, i personally would prefer load_xdf returning by default an unsorted dict of streams, not a list. That means, add a deprecation warning for the time being. But i'd tend to agree with what @cboulay think's is better.

Point against my view: We might also consider that if xdf is to become BIDS-compatible, and if there is a use case where we want to sync all streams sample-wise and store everything as a single stream, we might need to find a way to ensure that all signals are sorted identical across all calls to load_xdf.

cbrnr · 2019-03-21T11:59:55Z

What about storing the StreamId of each stream under the info key? Then we could stick with the current list of streams, and sort by StreamId. This would solve the current issue of duplicate stream names, because each stream must have a unique StreamId AFAIK.

cbrnr · 2019-03-21T12:27:23Z

The best solution, however, would be to directly return the ordered dict, because it already contains the stream IDs as keys. How terrible would it be to make this breaking change? If this is properly documented, people using the old list return type would need to change only very little to accommodate for the new dict return type.

cbrnr · 2019-03-21T12:30:10Z

So basically I'm +1 on @agricolab's suggestion to make the breaking change (and even use a plain dict instead of an OrderedDict because we don't care about insertion order, all we need are the keys and values of the dict).

cbrnr · 2019-04-16T10:29:49Z

Fixed in xdf-modules/pyxdf#3.

cbrnr · 2019-04-18T06:53:25Z

This can be closed. If you would like to continue discussing the type of the returned object (e.g. if you'd rather have a dict instead of a list), please open a new issue in the new repository https://github.com/xdf-modules/xdf-Python.

dojeda mentioned this issue Feb 26, 2019

Feature request: read headers only #37

Open

cbrnr mentioned this issue Mar 26, 2019

Do not sort streams #42

Closed

cboulay closed this as completed Apr 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streams in load_xdf should not be ordered by default #35

Streams in load_xdf should not be ordered by default #35

dojeda commented Feb 26, 2019 •

edited

Loading

cboulay commented Feb 26, 2019

dojeda commented Feb 26, 2019

cboulay commented Feb 26, 2019

dojeda commented Feb 26, 2019

cboulay commented Feb 26, 2019

dojeda commented Feb 27, 2019

agricolab commented Feb 27, 2019

raymundfueloep commented Feb 28, 2019

dojeda commented Feb 28, 2019

agricolab commented Mar 5, 2019

cbrnr commented Mar 21, 2019

cbrnr commented Mar 21, 2019

cbrnr commented Mar 21, 2019

cbrnr commented Apr 16, 2019

cbrnr commented Apr 18, 2019

Streams in load_xdf should not be ordered by default #35

Streams in load_xdf should not be ordered by default #35

Comments

dojeda commented Feb 26, 2019 • edited Loading

cboulay commented Feb 26, 2019

dojeda commented Feb 26, 2019

cboulay commented Feb 26, 2019

dojeda commented Feb 26, 2019

cboulay commented Feb 26, 2019

dojeda commented Feb 27, 2019

agricolab commented Feb 27, 2019

raymundfueloep commented Feb 28, 2019

dojeda commented Feb 28, 2019

agricolab commented Mar 5, 2019

cbrnr commented Mar 21, 2019

cbrnr commented Mar 21, 2019

cbrnr commented Mar 21, 2019

cbrnr commented Apr 16, 2019

cbrnr commented Apr 18, 2019

dojeda commented Feb 26, 2019 •

edited

Loading