Data compression #8
This is a very important discussion to have indeed, and I agree with everything that you said. At the SCIL we compress all tractograms using linearization only by default, and I personally worked on it. That's an easy 10x speed-up / size reduction at minimal cost (provided downstream algorithms take it into account). One additional point: all per-vertex / per-streamline data in TRK are float32, so sizes can be divided by a further factor of 2 (float16 or int16), 4 (int8), or 32 (1-bit). This can add up to a lot. A linearized, float16 tractogram plus metadata with well-thought-out dtypes can be 20-30x smaller than a typical 0.2mm step size TRK. Whatever the final specification of the file format ends up being, I believe the most important speed-up will come from reducing streamline data via lossy compression rather than from the other format design decisions.
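To make the dtype arithmetic above concrete, here is a minimal sketch (assuming NumPy; the array shapes and names are hypothetical) of downcasting streamline vertices from float32 to float16 and checking the cost in precision:

```python
import numpy as np

# Hypothetical dense tractogram block: 1000 streamlines x 200 vertices x 3
# coordinates, stored as float32 (the TRK convention).
rng = np.random.default_rng(0)
vertices_f32 = rng.uniform(-90, 90, size=(1000, 200, 3)).astype(np.float32)

# Lossy downcast: float16 halves storage; for coordinates of ~100 mm its
# rounding error stays below ~0.05 mm, well under a typical 0.1 mm tolerance.
vertices_f16 = vertices_f32.astype(np.float16)

max_err = np.abs(vertices_f32 - vertices_f16.astype(np.float32)).max()
print(f"float32: {vertices_f32.nbytes / 1e6:.1f} MB")
print(f"float16: {vertices_f16.nbytes / 1e6:.1f} MB")
print(f"worst-case rounding error: {max_err:.4f} mm")
```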
@Lestropie also, I thought a bit about QFib. I don't know if you are familiar with QFib (lossy) compression. It basically starts from a single point and then steps forward using a parcellated sphere: instead of writing the displacement vector at each step, you write a single integer indexing into the sphere's directions (see the sketch below). This effectively divides the size of the file by 3, AND it can be combined with linearisation. However, this obviously comes with a slow-down on access, because you have to compute the real positions on-the-fly. I think that could be a nice future addition if people REALLY want smaller files, and it could be compatible with the file format. I would personally recommend this only for data storage or data transfer (and then uncompressing back to a standard positions.3.float16 for manipulation). Basically, my hypothetical recommendations:
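For illustration only (this is not the actual QFib codec, and the names `fibonacci_sphere`, `encode`, and `decode` are hypothetical), here is a minimal sketch of the core idea under the simplifying assumption of a fixed step length:

```python
import numpy as np

def fibonacci_sphere(n=256):
    """A fixed codebook of n roughly uniform unit directions."""
    i = np.arange(n)
    phi = np.arccos(1 - 2 * (i + 0.5) / n)   # polar angle
    theta = np.pi * (1 + 5**0.5) * i          # golden-angle azimuth
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def encode(streamline, sphere):
    """Origin plus one uint8 index per step, instead of 3 floats per vertex."""
    steps = np.diff(streamline, axis=0)
    lengths = np.linalg.norm(steps, axis=1, keepdims=True)
    dirs = steps / lengths
    idx = np.argmax(dirs @ sphere.T, axis=1).astype(np.uint8)  # nearest codeword
    return streamline[0], float(lengths.mean()), idx

def decode(origin, step_len, idx, sphere):
    """Re-integrate the quantised directions into 3D positions. Note that
    quantisation error accumulates along the streamline, so this is lossy."""
    pts = np.vstack([np.zeros(3), np.cumsum(sphere[idx] * step_len, axis=0)])
    return origin + pts

sphere = fibonacci_sphere()
sl = np.cumsum(np.full((100, 3), 0.5), axis=0).astype(np.float32)  # toy streamline
origin, step, codes = encode(sl, sphere)
print(decode(origin, step, codes, sphere)[:3])
```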
@frheault What exactly do you mean by "linearisation"? Are you referring to simply dropping vertices that can be dropped without sacrificing too much precision?
@soichih Pretty much, it consists in removing 'mostly' collinear points. The idea is that having 100 points on a straight line is not worth it if you can drop to 5-10 points within a margin of error of 0.1mm.
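As a rough illustration of this kind of linearisation (a naive single-pass variant, not the exact SCIL algorithm; `linearize` and `point_to_segment_dist` are hypothetical names), one can extend a straight chord along the streamline and commit a vertex only when some skipped point deviates from the chord by more than a tolerance:

```python
import numpy as np

def point_to_segment_dist(p, a, b):
    """Distance from points p (N, 3) to the segment from a to b."""
    ab = b - a
    t = np.clip((p - a) @ ab / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t[:, None] * ab), axis=1)

def linearize(streamline, tol=0.1):
    """Greedy single pass: grow the chord anchor->j until a skipped vertex
    deviates by more than `tol` mm, then commit the previous vertex."""
    kept, anchor = [0], 0
    for j in range(2, len(streamline)):
        skipped = streamline[anchor + 1:j]
        if point_to_segment_dist(skipped, streamline[anchor], streamline[j]).max() > tol:
            kept.append(j - 1)
            anchor = j - 1
    kept.append(len(streamline) - 1)
    return streamline[kept]

# Gently curved toy streamline sampled every 0.2 mm.
t = np.arange(0, 50, 0.2)
sl = np.stack([t, 5 * np.sin(t / 20), np.zeros_like(t)], axis=1)
print(len(sl), "->", len(linearize(sl)), "vertices")
```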
Thanks, @frheault, that makes sense. While on this subject though: is there an algorithm that compresses tractography by turning coordinates into a smaller set of Bezier/spline control points? I think a curve would fit the original data points better, assuming that brain tractograms are normally smooth curves. Also, do you know of an algorithm that not only compresses each streamline, but also compresses across multiple streamlines? As the number of fibers increases, there will be more and more similar streamlines. An algorithm could then apply similarity-based Huffman encoding so that full vertex information is stored for only a few streamlines. If there is no such algorithm, I'd like to write one over the holiday break for fun.
As for an approach based on Bezier/splines: I remember a review of my paper telling me to mention one (I don't think I did). The QFib approach does slow things down, but at least conversion back to a full 3D point representation is likely fast. And I can imagine a way to make it fit inside the TRX structure with small modifications. For your last idea, I actually did implement something like that for bundles; it does not work for whole tractograms and scales very badly. I think an approach with QuickBundles at multiple clustering distances (e.g. ranging from None to 2, 4, 8, 16, 32mm) would create a simple 'multi-resolution' representation that could then be hidden 'easily' in the data_per_streamline, making it easy for the user to load the appropriate/desired resolution. I don't know of anything actually dedicated to this idea. I could easily imagine an algorithm that stores reference streamlines while the others are represented as vectors of deltas/differences (a sketch of that idea follows below). There is a lot of room for great compression algorithms; however, the hard part (if you want it to catch on) is to make one that is compatible with TCK/TRK or maybe a future TRX. If you are planning on working on a streamline compression algorithm, I would 'challenge' or 'recommend' that you try to make it compatible with a representation that would 'fit' within the specification of TRK (with metadata) or TRX (which could be a bit more flexible). That way, if it works, it could be used right away by those who wish to further compress their data.
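A minimal sketch of that reference-plus-delta idea (hypothetical, not an existing implementation; it assumes all streamlines in a cluster are resampled to the same number of points): store each cluster's centroid at full precision, and store members as quantised integer offsets from it.

```python
import numpy as np

def delta_encode(cluster, centroid, scale=0.05):
    """Store members as int16 offsets from the centroid, in units of
    `scale` mm (0.05 mm here, so worst-case rounding error ~0.025 mm)."""
    deltas = np.round((cluster - centroid) / scale)
    assert np.abs(deltas).max() < 2**15, "offsets too large for int16"
    return deltas.astype(np.int16)

def delta_decode(deltas, centroid, scale=0.05):
    return centroid + deltas.astype(np.float32) * scale

# Toy cluster: 50 similar streamlines, each resampled to 64 points.
rng = np.random.default_rng(1)
centroid = np.cumsum(rng.uniform(-1, 1, size=(64, 3)), axis=0).astype(np.float32)
cluster = centroid + rng.normal(0, 2, size=(50, 64, 3)).astype(np.float32)

codes = delta_encode(cluster, centroid)
err = np.abs(delta_decode(codes, centroid) - cluster).max()
print(f"raw: {cluster.nbytes} B, encoded: {codes.nbytes} B, max err {err:.3f} mm")
```

With tighter clusters the offsets shrink, so int8 would suffice and quarter the vertex payload instead of halving it.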
I think maybe this concept needs to be broken into two separate components. Your response here suggests that you are interpreting the proposal as a transformation of streamlines from an ordered set of vertices in 3D to some fundamentally different basis, e.g. polynomials in each axis, Fourier coefficients, or the like. As you say, this has been done in certain contexts, and I agree that it would be too severe a change in representation to be considered as a core component of something like TRX. If I instead interpret the original question through my own personal lens:
In MRtrix3 we have for many years implicitly interpreted streamlines as points on a Hermite spline (publication; see Appendix 3):
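For reference, a standard cubic Hermite segment between consecutive vertices $\mathbf{p}_i$ and $\mathbf{p}_{i+1}$ with tangents $\mathbf{m}_i$ and $\mathbf{m}_{i+1}$ takes the form below (the exact tangent scaling used by MRtrix3 is defined in the cited appendix and may differ):

$$\mathbf{p}(t) = (2t^3 - 3t^2 + 1)\,\mathbf{p}_i + (t^3 - 2t^2 + t)\,\mathbf{m}_i + (-2t^3 + 3t^2)\,\mathbf{p}_{i+1} + (t^3 - t^2)\,\mathbf{m}_{i+1}, \qquad t \in [0, 1],$$

with Catmull-Rom-style tangents estimated from neighbouring vertices, e.g. $\mathbf{m}_i = (\mathbf{p}_{i+1} - \mathbf{p}_{i-1})/2$.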
I think what @soichih is actually reaching for here is, instead of doing a basic linearisation where consecutive vertices are skipped if they are sufficiently collinear with the previous sample & tangent, to re-pose the problem as one of minimising the number of vertices stored for a streamline whilst keeping the distance between the continuous spline inferred from those samples and the continuous spline inferred from the original densely-sampled streamline below some threshold. This would be a really cool project. It's similar to the concept of linearisation, but it would involve optimising the complete stored representation of each streamline rather than a single pass through the data retaining or rejecting vertices, and it would be based on an inferred smooth trajectory rather than a piecewise linear one. Since the stored output would still be an ordered set of 3D vertices, existing file formats could still be used; the only condition for correct interpretation of the stored data would be that the receiving software impose the appropriate spline model.
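A naive greedy sketch of that optimisation (hypothetical; a real implementation would need a smarter search strategy and the receiving software's exact spline model): repeatedly remove whichever interior vertex least perturbs the Catmull-Rom spline, stopping once the deviation from the spline through the original dense samples would exceed a threshold.

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def catmull_rom(points, n_samples=200):
    """Densely sample a Catmull-Rom-style spline through `points` (K, 3)."""
    t = np.arange(len(points), dtype=float)
    tangents = np.gradient(points, axis=0)  # central-difference tangents
    spline = CubicHermiteSpline(t, points, tangents, axis=0)
    return spline(np.linspace(0, len(points) - 1, n_samples))

def max_deviation(a, b):
    """One-sided approximation: max distance from samples of a to curve b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).max()

def reduce_vertices(streamline, tol=0.1):
    reference = catmull_rom(streamline)
    kept = streamline.copy()
    while len(kept) > 3:
        # Cost of dropping each interior vertex; commit the cheapest removal.
        costs = [max_deviation(catmull_rom(np.delete(kept, i, axis=0)), reference)
                 for i in range(1, len(kept) - 1)]
        best = int(np.argmin(costs))
        if costs[best] > tol:
            break
        kept = np.delete(kept, best + 1, axis=0)
    return kept

t = np.arange(0, 25, 0.5)  # toy streamline, 0.5 mm steps
sl = np.stack([t, 10 * np.sin(t / 15), 0.05 * t], axis=1)
print(len(sl), "->", len(reduce_vertices(sl)), "vertices")
```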
Given recent comments in dipy/dipy#2229, and further back in nipy/nibabel#942, I think it is worth dedicating a discussion thread to the issue of data compression.
I think it is fair to say that this particular format proposal has prioritised extensibility and simplicity over compression. But for some users / developers the latter may be the higher priority. So here we can discuss whether the gains in this department from the current TRX proposal are adequate, whether it is possible to bootstrap any kind of explicit data compression into this format proposal, or whether a drastically different proposal would be necessary in order to satisfy such demands.
Please amend or correct as necessary, and comment as you see fit.
Where compression may be possible in TRX:
Point 5 may be interesting to explore: there have been a number of manuscripts published explicitly on storage reduction of streamline vertex information, so having the option to use one such format within TRX for vertex position information would theoretically be possible. It would make the technical specification considerably more complex, but technically one could defer to external specifications for datasets that use such encodings. I'm not a fan of this personally; I'm simply stating that it's a prospect.
In the converse case, i.e. a format that prioritises compression from the ground up: a couple of these already exist, but they don't have the extensibility of TRX. Trying to extend the capabilities of compressed streamline data storage could look something like bootstrapping compression into the .trk format, or it could be something entirely novel. It probably falls upon someone who prioritises such to propose something that still has a chance of satisfying the desires of those prioritising extensibility. But this is likely one of multiple dimensions of difference that caused the original discussion in nipy/nibabel#942 to go slightly astray, so it warrants greater precision here.