-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lazy index building for XTC and TRR tajectories #3793
Comments
Do XTC and TRR contain reliable information about the total number of frames (and total time) in the last (readable) frame? |
I do not know how @Tsjerk's magic trick really works and how reliable it is. In my experience, it was fairly reliable, but I was only using it on very typical trajectories that had the same properties as the ones for which Tjerk wrote the script. I might still have the script somewhere in my archives somewhere? |
In an XTC file, the header contains the size of the coordinate section in bytes, and the header itself has a fixed size and starts with the magic number 1993. For indexing (and reversing) an XTC file, I use this to jump through the file with relative seek calls. This works well. I thought MDAnalysis had already adopted that trick. I have been playing around with guesstimating the positions of all frames from the first and last frame sizes. Whether this is reliable depends on the trajectory. If, e.g., PBC jumps are removed, then the drift in size of coordinates causes a drift in the size of the frames, making it hard to get proper guesstimates. For TRR the case is different. These have no compression, so sizes are fixed. However, a frame may contain coordinates, velocities, forces, in any possible combination. That means that frame sizes may be different throughout te file, and it is really hard to determine the frame positions in the file through a simple calculation. Again, this means that stepping through the TRR with fseeks is the most straightforward; the header contains all information required to determine the next frame position. For both XTC and TRR this thus means: find frame start, read header, determine position of next frame, fseek, repeat. |
Oh, reading through the original issue: 1. Finding the last frame is done by reading the first and saving the header tag (8 bytes: magic number and number of atoms), determine the size of a frame, then going to a position estimated 3/2 frames from the end, reading in everything up to the end and rfind the tag. 2. For parallel processing, the different nodes may be given filepointers spaced throughout the file, find the first frame/header and from there traverse up to the position where another worker started. |
Is your feature request related to a problem?
Reading an XTC or TRR trajectory for the first time can be very slow, especially for large files. This is due to the building of the index that allows us to access random frames.
We experienced issues due to the indexing with very large files (#3789) or with read-only file systems where the index had to be rebuilt each time.
Describe the solution you'd like
In most cases, however, we do not randomly access random frames. Instead, we iterate over the trajectory in order. There we do not need the index, so the overhead due to the first index building is unnecessary. We could build the index as we go by saving the offset when we encounter a new frame as we read the file.
This could cause issues with the parallel reading of trajectories. It could also make the reading slower in some cases, which is already an issue (#3731).
It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: