ENH: sample chunk type optimized for vectorized reading #19
Comments
CC @chkothe who wanted to see an example implementation.
Also CC @agricolab, @cbrnr since this might be useful for a pure python reader.
I'm not sure a new chunk type is the best solution. Have you tried Cythonizing the existing type 3 reader?
I've experimented with both Cython and pybind11 (C++ to parse the data chunks), and while the speedups were impressive, they were still slower (not to mention more complex) than the pure python implementation sketched above. I'll upload the test code and data files some time next week. With liblsl we've had our fair share of binary distribution problems (both for Python and Matlab), so a chunk type that enables both fast loading with the base language as well as the option to …
There’s a lot of concern about feature parity with the matlab importer. Since it is still early days, libxdf can be created in a way that keeps it manylinux2010 compatible.
Neither of my comments above is a reason in itself to not change the chunk format. As long as things stay backwards compatible, I am for it.
From my point of view, it's absolutely necessary that all three readers (C++ libxdf, xdf-Python and xdf-Matlab) have a reference implementation before anything is set in stone. For Matlab, the …
Testing with 10 chunks, 64 channels and 100000 samples each. You can find the code at https://github.com/tstenner/xdf/tree/chunk7 (readers) and https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7 (writer; run …)
So even though compiled extensions speed everything up considerably, the readers that rely only on the languages' built-in (compiled) binary reading functions are even faster.
Most readers either print an error message about the new chunk format or skip the chunks, but a conversion utility could help in the immediate transition. The (as far as I can see) feature-complete matlab reader for chunk 7 is only 37 lines long, so it's quite easy to implement.
Added data for the Matlab string chunk reader
Can you summarize again why chunk3 reading is slow? I didn't really understand it from your initial post. Is it the nested nature of the samples structure (i.e. that you can't read the whole chunk in one go because you don't know how many bytes it consists of, mainly because of the presence/absence of timestamps)? Can LSL outlets stream chunk7 data? Is there more buffering going on for chunk7 vs. chunk3? What are the pros and cons of both chunk types? Having said that, XDF is extensible as the name implies, so if there are good reasons why we need something in addition to chunk3 then we should go for it. Personally, I could live with 7 seconds loading time of the plain Python chunk3 reader, but I guess there will be much longer files. The compiled extensions (Cython/MEX) are already much faster, so it is worth discussing whether these are already fast enough. It is a bit odd that MEX is twice as fast as Cython, so I'm assuming that the Cython implementation could be further tweaked. Furthermore, why is the MATLAB chunk7 reader more than twice as fast as the Python implementation?
Just a comment on the speed: Matlab has very well-optimized matrix handling and is always working to improve performance there, so it's probably always going to be better than Python's. As suggested in this post: https://stackoverflow.com/questions/2133031/is-matlab-faster-than-python NumPy can be compiled against ATLAS, which will help. Don't know why it isn't distributed this way by default. Still, 2x faster seems like a lot. I still think that having a(n optimized) C backend is the best option here. For my money, what's lost in ease of distribution is gained in maintenance and performance. But, I'm not doing any work on this, so my vote doesn't count for much :P
NumPy is also heavily optimized, and they even use the same LAPACK/BLAS libraries in the background. Anaconda links NumPy against Intel MKL, which speeds up things even more (in most use cases). So like you're saying, I'm a bit skeptical: when MATLAB is 2x faster than Python/NumPy, I tend to think that the Python code could probably be improved (and maybe it is the code outside of NumPy that could be optimized).
Distributing C bindings is a lot of work for Python, because you need to create binary wheels for all platforms if you want pip install to just work. That's why I normally prefer pure Python or, if necessary, Cython. But of course then you need to write separate readers for all languages (like we currently do). So yes, having only one C reader with bindings to other languages would be a good option (LSL does that BTW) - but currently we don't have anyone with the time to work on that.
Chad managed to get pylsl working quite well on pip, but I gather it was no small effort and there have been a few hiccups along the way (e.g. certain versions not working on certain operating systems). Now it is running quite nicely. The way to do it properly would be to integrate CI the same way liblsl does to automate the process. This is a huge amount of work. The (rather obvious) question is whether or not it is less work than maintaining two different languages in parallel. Again, I'm just talking here. I think I've contributed exactly 1 line of code to load_xdf.m over the years.
Yes, that's the question, and like you I'm also just talking here 😄. I can't answer this question, because I don't know the pyxdf roadmap nor who is currently actively involved. I'm planning to add XDF support to MNELAB, that's why I'm interested in the current state of development. And for my purposes, I'd very much prefer pure Python code (unless the pip package is working flawlessly like pylsl, then I don't care).
The use of varlength ints (and the optional per-sample timestamps) means the byte layout of a chunk isn't known in advance, so the reader has to branch for every sample instead of reading the whole chunk in one go.
The chunk format is independent of the stream data, i.e. the same data could be saved as either chunk3 or chunk7 in the xdf file. The only difference between the chunk types is the metadata at the chunk beginning and the layout of the data.
I hadn't talked about it here before, but in theory there could be an xdf optimization tool that reorders and merges chunks so the data could be …
Speed is only half the problem – while playing with the format I've caused several crashes without even trying, and Matlab/Python handle corrupt files better (i.e. an exception instead of a segfault). Compiled extensions also need to be compiled and distributed, and I've had more fun with that than I'd have liked.
Could be several things. It's a minimal prototype in C++ that I didn't aim for speed with, and I think it copies the data more often than needed. But even with the optimized mex file, the chunk3 reader is slower than a pure matlab implementation for chunk7.
Good question, but I suppose it's because Matlab has an advantage with the JIT compiler for everything except the read operation.
For Matlab, the chunk7 reader has ~37 lines of code, all of which can be debugged/profiled in Matlab, and if something goes wrong the user gets an error message with an explanation and the line number.
For Windows and OS X, but for Linux we'd need to build against an older CentOS version that's hard to get nowadays.
Based on the previous community contributions, most users know either Matlab or Python and a few know C / C++. I would trust most of them to tweak, test and distribute Matlab or Python code, but with C++ (especially with constraints w.r.t. distribution and cross-platform compatibility) it's going to be a lot harder.
@tstenner There's some sort of ANSI standard associated with the XDF specification. I don't think the research community has to be locked into that specification forever, but I'm pretty sure the only chance a change to the file format will be accepted upstream is if all the major tools that use XDF are compatible with both formats in a completely invisible way, and that precludes using a converter tool. So the Python and Matlab importers will both have to be compatible with both formats, and LabRecorder should have an option to save in the old format (I think it's OK to default to the new format).

Side note: Using a converter is not a good option for some people (like me) who have clinical data. The raw data is sacred and can't be modified. Using a converter simply doubles the amount of disk space required and breaks the load-preprocess-save_to_h5 sequence into a slow step and a fast step, instead of one slow step. However, I'm sure a converter tool would be useful to many people, though I think that could be put off until all of the importers are updated and accepted upstream.

@cbrnr I have large multi-GB files and the current import times are quite painful.
@cboulay a new chunk format won't speed up loading of existing chunk3 files, so an optimized chunk3 reader would probably be a good idea regardless of how we handle the proposed chunk7 type. I don't know about any official standardization of XDF - do you have more details on that? The way I imagined the process of extending the XDF specification is that the community decides that e.g. chunk7 is necessary and this will be integrated into XDF 1.1 (currently we have XDF 1.0). Therefore, readers compatible with XDF 1.0 don't have to be compatible with the new extension, but due to the XDF format, they can just skip unknown chunk types.

@tstenner thanks for the summary. What is the disadvantage of chunk7? Do outlets need to buffer more or do more processing before pushing out samples in this format? Is the chunk7 type less flexible than chunk3? Regarding compiled extensions, I think this is relatively painless if you use Cython. As I've mentioned before, I think we should try to optimize chunk3 reading regardless of our chunk7 discussion.
I'm already on it :-)
Agreed, but users with clinical data are (hopefully) more technically inclined and have more resources so an additional compilation step is doable for them. Also, the chunk7 format is slightly more robust to errors (wrong offsets) so there's less risk of unreadable files.
So far nobody (apart from me) uses it :-)
I took a shot at it and found it difficult to keep the normal implementation and the Cython annotations separate so it would run without Cython, whereas pybind11 was relatively easy to keep separate (and optional). For widespread usage, I agree that Cython is a better idea.
One major difference though is that in chunk7, you need to supply either all timestamps (for each sample), or none. In chunk3, each sample can have a timestamp or not. Can you think of a recording setup where this might be a problem? Or does a chunk7 writer automatically timestamp all samples according to the nominal sampling rate? How many samples would you typically write in a chunk7? AFAIK the main idea of chunk3 was to provide timestamps when the data are actually sent out, and this is not possible with every sample because most devices push out an array of samples at once.
I've never used pybind11, but yes, when you cythonize a module you can't easily keep a pure Python implementation simultaneously. I don't think that's even necessary (maybe just as a fallback); it's sufficient if we have a nice Cython implementation.
Also in chunk7, timestamps can be left out (i.e. all 8 bytes are 0), so the chunk3 timestamp sequence can still be represented in chunk7. For high sampling rates, I could see the point of writing only a small subset of timestamps, but in my practical experience almost all timestamps get written anyway. For those cases, maybe a sparse representation would be an option. I was hoping for some experts (@cboulay?) to chime in on whether their experimental data benefits from having only a few timestamps saved.
The C++ implementation only has one function for each data type that reads in the string and populates the timestamp and data numpy arrays. On module load, the reader tries to import these functions. We could factor out the chunk reader functions and on module load try to load sped-up replacements (either pybind11 or Cython) and monkeypatch the included functions (see the sketch below).

One other thing regarding disadvantages: most readers (and writers) support 64bit lengths in theory, but fail in one way or another when they are actually in a file, so I capped the sample and channel counts to 32bit, i.e. 2^32-1. This will probably be quoted back at me in the near future, but I think 4 billion samples per chunk are enough for everybody.
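A minimal sketch of that optional-speedup idea, with hypothetical module and function names (not the actual pyxdf layout):

```python
# Pure Python fallback; the function and module names here are hypothetical.
def read_chunk3(buf, stream_info):
    """Slow but dependency-free chunk 3 parser (pure Python fallback)."""
    raise NotImplementedError("pure Python implementation goes here")

# On module load, try to monkeypatch in a compiled replacement.
try:
    import _xdf_fast                      # hypothetical pybind11/Cython extension
    read_chunk3 = _xdf_fast.read_chunk3   # transparently swap in the fast version
except ImportError:
    pass                                  # extension not available: keep the fallback
```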
Ah, now I get it. Left-out timestamps are represented as 0. I think this is good enough, I wouldn't be concerned about storage space. Practically speaking, if a device pushes out 1000 samples every second in a package, you could timestamp the first sample and have 999 zeros for the other samples. Correct?
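A minimal sketch of how a reader could then reconstruct the zeroed timestamps from the nominal rate (assuming, as in the example above, that the first sample carries a real stamp; pyxdf's actual handling may differ):

```python
import numpy as np

def fill_zero_timestamps(timestamps, srate):
    """Replace 0.0 placeholders by extrapolating from the most recent
    explicitly written timestamp at the nominal sampling rate."""
    ts = np.asarray(timestamps, dtype=float)
    idx = np.arange(ts.size)
    # index of the last written timestamp at or before each sample
    last = np.maximum.accumulate(np.where(ts != 0.0, idx, -1))
    return ts[last] + (idx - last) / srate

print(fill_zero_timestamps([10.0, 0, 0, 0, 10.004], srate=1000.0))
# -> approximately [10.  10.001  10.002  10.003  10.004]
```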
I don't know what the best practice is here, but as long as it works and is maintainable I'm fine with this approach.
For the near future this will probably be enough, but as you say, this will probably catch up faster than you think. So if somehow possible, I'd go for 64 bits - you are talking about two header fields
Correct. A transparent compression algorithm (gzip? zstd?) would help with the file size issue, but so far nothing's been standardized in this matter.
Right. This is per-chunk, so even in the best case (2^32 samples of a single uint8 channel), the chunk alone would contain at least 32 GB of timestamps and 4 GB of sample data. Or to put it another way: at a sample rate of 1 million samples/second this would still be over an hour of data. End users, especially scientists, are always full of surprises, but it should be a long while before they reach those amounts of data.
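The arithmetic behind those numbers (float64 timestamps, one uint8 channel, and an assumed extreme rate of 1 MHz):

```python
n = 2**32                        # samples in a maximally full chunk
print(n * 8 / 2**30)             # timestamp bytes   -> 32.0 GiB
print(n * 1 / 2**30)             # one uint8 channel -> 4.0 GiB
print(n / 1_000_000 / 3600)      # at 1e6 samples/s  -> ~1.19 hours
```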
Agreed, but can you still make these 2 fields 64bit? It surely doesn't hurt.
The channel count for lsl streams is already limited to 31 bits, so it wouldn't make sense for it to be larger in the xdf file.
Alright, then leave them as 32 bits.
Current state of chunk 3
The current sample chunk type (tag 3) requires at least one branch per sample (2 for string samples), so both the Python and the Matlab implementation require compiled extensions.
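For illustration, a rough Python sketch of the per-sample loop a chunk 3 parser needs for a numeric (double) stream, following the XDF 1.0 layout after the chunk header and stream ID (not the actual pyxdf code):

```python
import struct

def read_varlen_int(f):
    """XDF variable-length integer: one length byte (1, 4 or 8), then the value."""
    n_bytes = f.read(1)[0]
    return int.from_bytes(f.read(n_bytes), 'little')

def parse_chunk3_samples(f, n_channels):
    """Per-sample loop for a double-valued stream; the branch on the optional
    timestamp is what prevents reading the whole chunk in one vectorized call."""
    n_samples = read_varlen_int(f)
    timestamps, values = [], []
    for _ in range(n_samples):
        ts_bytes = f.read(1)[0]                  # 0 or 8
        if ts_bytes:                             # branch taken once per sample
            timestamps.append(struct.unpack('<d', f.read(8))[0])
        else:
            timestamps.append(None)              # recomputed from the nominal rate later
        values.append(struct.unpack('<' + 'd' * n_channels, f.read(8 * n_channels)))
    return timestamps, values
```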
Also, the current code to "compress" timestamps isn't helping much due to the usual floating point equality problems, even under perfect conditions:
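A minimal sketch of the effect with made-up numbers (not the actual LabRecorder code): even perfectly regular timestamps rarely compare exactly equal to the running prediction, so the writer ends up storing most of them anyway.

```python
srate, t0 = 500.0, 123456.789                  # made-up nominal rate and start time
stamps = [t0 + i / srate for i in range(1000)] # "perfect" regularly spaced timestamps

pred, stored = stamps[0], 1                    # the first timestamp is always written
for s in stamps[1:]:
    pred += 1.0 / srate                        # predicted next timestamp
    if s != pred:                              # exact float comparison rarely holds
        stored += 1
        pred = s                               # store the stamp and resync the prediction
print(f"{stored} of {len(stamps)} timestamps stored")
```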
Proposal: new chunk type
My proposal for a new chunk type therefore looks as follows:
Each chunk would require at most 3 reads: the fixed-size header fields, the timestamp array, and the sample data (see the reader sketch below).
The chunk includes sample count, channel count and data type for increased robustness (errors can be detected earlier) and so data can be read without an XML parser.
For the timestamp array, a value of 0 (0x0000000000000000 in IEEE 754 representation) indicates a left out timestamp that would be recomputed as usual. Since this chunk format is more compression friendly due to the reduced number of reads / writes, a gzip compressed array would also require less space.
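A minimal sketch of such a vectorized reader, assuming a payload of uint32 sample count, uint32 channel count and a one-byte type index, followed by the float64 timestamp array and the dense sample data (the exact field order and widths are defined in the linked branches, so treat this as illustration only):

```python
import struct
import numpy as np

def read_chunk7_payload(buf, dtype='<f8'):
    """Three reads per chunk: fixed header fields, timestamps, sample data."""
    # read 1: fixed-size header fields (hypothetical order/widths)
    n_samples, n_channels, type_index = struct.unpack_from('<IIB', buf, 0)
    offset = struct.calcsize('<IIB')
    # read 2: all timestamps in one go (0.0 marks an omitted timestamp)
    timestamps = np.frombuffer(buf, dtype='<f8', count=n_samples, offset=offset)
    offset += timestamps.nbytes
    # read 3: all sample values in one go, reshaped to (samples, channels)
    data = np.frombuffer(buf, dtype=dtype, count=n_samples * n_channels, offset=offset)
    return timestamps, data.reshape(n_samples, n_channels)
```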
Implementation
The writer is implemented in https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7, which also contains the main serialization code.
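For orientation, a Python sketch of what the serialization amounts to under the same hypothetical layout as the reader sketch above (the real writer is the C++ code in the linked LabRecorder branch):

```python
import struct
import numpy as np

def write_chunk7_payload(f, timestamps, data, type_index):
    """Mirror of the reader sketch: header fields, timestamp array, dense data."""
    data = np.ascontiguousarray(data)                 # shape (n_samples, n_channels)
    n_samples, n_channels = data.shape
    f.write(struct.pack('<IIB', n_samples, n_channels, type_index))
    f.write(np.asarray(timestamps, dtype='<f8').tobytes())  # 0.0 marks an omitted stamp
    f.write(data.tobytes())                           # one contiguous block of samples
```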
The reader is implemented in https://github.com/tstenner/xdf/commit/412998f3ed66cc7e85a93d77a1d7c96f81073ff0
Benchmark (more below)
A single XDF file with one 64-channel EEG (double) stream was created with 30 chunks, each containing 10K samples.
In both cases, the new chunks are smaller (156.3 vs. 156.0 MiB, 36.0 vs. 34.9 MiB compressed with gzip -6).
Open questions
[ ] layout for string chunks
[ ] optional padding so data starts at page boundaries (e.g. for mmap)?
[ ] should the redundant type index be included?