Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streams in load_xdf should not be ordered by default #35

Closed
dojeda opened this issue Feb 26, 2019 · 15 comments
Closed

Streams in load_xdf should not be ordered by default #35

dojeda opened this issue Feb 26, 2019 · 15 comments

Comments

@dojeda
Copy link

dojeda commented Feb 26, 2019

I recently found that there was a difference between the order of the streams in the Matlab and Python loader functions. I read the xdf specification, and there is no mention on the order of the streams. However, _load_xdf does this:

    streams = [s for s in streams.values()]
    sort_data = [s['info']['name'][0] for s in streams]
    streams = [x for _, x in sorted(zip(sort_data, streams))]

Sorting the streams by their name may be useful, but it may break for a user that expects stream 0 to be of a particular type, stream 1 to be of another particular type, and so on, because this might be the way the data is saved from their acquisition procedure. Moreover, this sorting is not documented, so it comes as a surprise.

What I propose is to either drop this sort (a breaking change) or, more conservatively, add a keyword parameter (which defaults to the current behavior) to perform this sorting:

def load_xdf(filename, ..., sort_streams=True):
    ...
    # at the end of the function...
    streams = [s for s in streams.values()]
    if sort_streams:
        streams.sort(key=lambda s: s['info']['name'][0])
    ...

(I simplified the 3-line code to sort the indices)

If you guys agree with this, I can file a PR with this change.

@cboulay
Copy link
Collaborator

cboulay commented Feb 26, 2019

I think the issue here is that in an earlier step the streams are collected in a dictionary which has no order, meaning that without sorting they would come back in random order each time the same file is loaded. I’m on mobile now so I can’t easily confirm. If this is the case then maybe the streams should be first loaded into an OrderedDict.
That being said, the way LabRecorder stores streams doesn’t impose any order. Two files from the same session can have different stream ordering on disk so xdf users should NEVER rely on stream order.

@dojeda
Copy link
Author

dojeda commented Feb 26, 2019

Thanks for the input @cboulay.

You are correct on fact that streams are saved in a dictionary, but since Python 3.7, dictionaries preserve insertion order. It is still strongly suggested to use an OrderedDict to keep a common behavior with <3.7 versions and for expressiveness. The dictionary where streams are saved is indexed by their StreamId, so this could alternatively be used as the order when converting the dictionary to a list. As I finish writing this paragraph, I realize that it, IMO it would make more sense to return streams as a dictionary without converting it or sorting it.

That being said, the way LabRecorder stores streams doesn’t impose any order. Two files from the same session can have different stream ordering on disk so xdf users should NEVER rely on stream order.

That's very interesting. Do you have any documentation that claims this so I could read a bit further on this and have a solid proof to argue against relying on order?

dojeda added a commit to dojeda/xdf that referenced this issue Feb 26, 2019
Addresses sccn#35 but note:
apparently, one should not count on a specific order of streams (it can be
arbitrary). I am still not sure if this warning concerns the StreamId as well.
@cboulay
Copy link
Collaborator

cboulay commented Feb 26, 2019

Each stream, including its header, is written by a stream-specific thread. See here. Relying on thread ordering is a bad idea and the discussion could end here.

Maybe because of the threading architecture, maybe for other reasons, no attention is paid to stream ordering elsewhere in LabRecorder. Streams are ordered according to the order that lsl::resolve_streams() finds them. But, if you add streams later and then tell LabRecorder to refresh the streams, new streams are added at the end of the list.

@dojeda
Copy link
Author

dojeda commented Feb 26, 2019

Ok, I am convinced to not rely on stream ordering and the purpose of this issue is now weak to me.
However, I think it would be a good idea to at least inform that the returned list of streams pf load_xdf does provide a specific order.

@cboulay
Copy link
Collaborator

cboulay commented Feb 26, 2019

I think we should probably just delete the lines that do the sorting. I'll leave this issue open until I find time to delete the sorting (and test the result). If you want to make a PR for this small change then please go ahead.

You're right that returning a dictionary would be better than returning a list with dubious ordering, but this would be a breaking change and thus come at a high cost for a relatively small benefit.

@dojeda
Copy link
Author

dojeda commented Feb 27, 2019

Ok I can do the PR later today.
One question though: I git-blamed the file and it seems that this particular part comes from a merge from Intheon's version. Do you know anything or perhaps someone over there that may want to chime in?

@agricolab
Copy link

I also stumbled upon this recasting part at the end of load_xdf. It makes development tricky. One option to ensure backwards compatibility is to make sorting an option with default true; or making load_xdf a load_xdf_as_dict() which does everything but sort.

@raymundfueloep
Copy link

Hello,
I found this issue here due to a crash that I encountered: if the XDF to be loaded contains two streams with identical names, the old sorting (sorted(zip(sort_data, streams))) tries to compare the actual stream objects, which fails. Using streams.sort(key=lambda s: s['info']['name'][0]) avoids that crash.

@dojeda
Copy link
Author

dojeda commented Feb 28, 2019

All right, I actually did not do a PR because I am not sure what the consensus is.

I have also encountered the same problem described by Raymund; I have some XDF files with streams that have the same name and the current sorting falls back to the second tuple element comparison (the streams) and fails.

So far what I gather is that:

  • Chadwick proposes removing the sorting.
  • Robert proposes a new functionload_xdf_as_dict that calls load_xdf and then returns a dict, retaining backward compatibility.
  • At the very minimum, the current comparison needs to be fixed to account for streams for repeated names.

I lean towards the dict option and I would only need to know:

  • What key should be used? I propose the StreamId but I am not sure if this makes sense.
  • Would you agree to add a deprecation warning on the load_xdf or should we leave the two functions? (cf There should be one-- and preferably only one --obvious way to do it.)

@agricolab
Copy link

Robert proposes a new function load_xdf_as_dict that calls load_xdf and then returns a dict

Sorry, my original description was a little bit off. My original suggestion should have been(for backwards compatibility) to keep load_xdf returning a list, but it passes all arguments to load_xdf_as_dict(*args) -> dictand then sort_streams(streams:dict, sort_key:str) -> OrderedDict. The sort_key argument could be e.g. StreamId or type. This sorting is followed by recast_streams(streams:OrderedDict) -> list. In that case, everyone can call the low-level function for the desired output.

While sorting makes a lot of sense imho for a a list, it is pretty unnecessary for a dictionary, I find the recasting from dict to list at the end of load_xdf unnecessary. I personally could live with a break in backwards compatibility, end even if the output would be a regular dict instead of an OrderedDict. Consider also that recasting a dict as a list is pretty simple, especially if there is a backwards-compatibility function that does that. So, i personally would prefer load_xdf returning by default an unsorted dict of streams, not a list. That means, add a deprecation warning for the time being. But i'd tend to agree with what @cboulay think's is better.

Point against my view: We might also consider that if xdf is to become BIDS-compatible, and if there is a use case where we want to sync all streams sample-wise and store everything as a single stream, we might need to find a way to ensure that all signals are sorted identical across all calls to load_xdf.

@cbrnr
Copy link
Contributor

cbrnr commented Mar 21, 2019

What about storing the StreamId of each stream under the info key? Then we could stick with the current list of streams, and sort by StreamId. This would solve the current issue of duplicate stream names, because each stream must have a unique StreamId AFAIK.

@cbrnr
Copy link
Contributor

cbrnr commented Mar 21, 2019

The best solution, however, would be to directly return the ordered dict, because it already contains the stream IDs as keys. How terrible would it be to make this breaking change? If this is properly documented, people using the old list return type would need to change only very little to accommodate for the new dict return type.

@cbrnr
Copy link
Contributor

cbrnr commented Mar 21, 2019

So basically I'm +1 on @agricolab's suggestion to make the breaking change (and even use a plain dict instead of an OrderedDict because we don't care about insertion order, all we need are the keys and values of the dict).

@cbrnr
Copy link
Contributor

cbrnr commented Apr 16, 2019

Fixed in xdf-modules/pyxdf#3.

@cbrnr
Copy link
Contributor

cbrnr commented Apr 18, 2019

This can be closed. If you would like to continue discussing the type of the returned object (e.g. if you'd rather have a dict instead of a list), please open a new issue in the new repository https://github.com/xdf-modules/xdf-Python.

@cboulay cboulay closed this as completed Apr 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants