Issue 4039 large dcd #4048

Merged: 7 commits, Mar 29, 2023
package/CHANGELOG (8 additions, 0 deletions)

@@ -46,6 +46,14 @@ Changes
Deprecations


??/??/?? richardjgowers

* 2.4.3

Fixes
* Fixed DCD reading for large (>2Gb) files (Issue #4039). This was broken
for versions 2.4.0, 2.4.1 and 2.4.2

01/04/23 IAlibay

* 2.4.2
package/MDAnalysis/lib/formats/libdcd.pxd (3 additions, 3 deletions)

@@ -115,11 +115,11 @@ cdef class DCDFile:
  # The current DCD frame
  cdef int current_frame
  # size of the first DCD frame
- cdef readonly int _firstframesize
+ cdef readonly fio_size_t _firstframesize
  # Size of a DCD frame
- cdef readonly int _framesize
+ cdef readonly fio_size_t _framesize
  # Size of the DCD header
- cdef readonly int _header_size
+ cdef readonly fio_size_t _header_size
  # Is the file open?
  cdef int is_open
  # Have we reached the end of the file
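The diff above widens three byte-offset fields from a C `int` to `fio_size_t`, the file-offset type of the bundled fio I/O layer (its 64-bit width is an assumption here, not quoted from the sources). A minimal stdlib sketch of why the width matters for >2Gb files:

```python
import ctypes

# Widths of the C types involved (platform-typical values; this is an
# illustration, not taken from the MDAnalysis sources).
print(ctypes.sizeof(ctypes.c_int))       # typically 4 bytes -> max 2**31 - 1
print(ctypes.sizeof(ctypes.c_longlong))  # 8 bytes -> max 2**63 - 1

two_gib = 2 * 1024**3
print(two_gib > 2**31 - 1)  # True: a >2Gb byte offset cannot fit a 32-bit int
```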
package/MDAnalysis/lib/formats/libdcd.pyx (3 additions, 1 deletion)

@@ -391,7 +391,9 @@ cdef class DCDFile:
  if frame == 0:
      offset = self._header_size
  else:
-     offset = self._header_size + self._firstframesize + self._framesize * (frame - 1)
+     offset = self._header_size
+     offset += self._firstframesize

Comment (Member):

Does this ensure that the overflow cannot happen?

Btw, frame was declared as int in the method's signature. Should that be changed, too, or is that a Python int with infinite size?

Comment (Member, author):

I've repro'd the exact bug (with the contentious test) and this fixes it. I've not looked at the raw C and followed all the types, but by eye, promoting some variables to the correct datatype seemed to jiggle it into place.

Comment (Member):

@orbeckst the `def(blah: int)` syntax in Cython allows it to switch between int or PyInt depending on how much it knows about types. I think @richardjgowers' approach of changing the size of the directly declared C types is the correct one.

+     offset += self._framesize * (frame - 1)

  cdef int ok = fio_fseek(self.fp, offset, _whence_vals['FIO_SEEK_SET'])
  if ok != 0:
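The overflow the reviewer asks about comes from the offset arithmetic itself: with 32-bit intermediates, `header + firstframesize + framesize * (frame - 1)` wraps once it passes 2**31 - 1. A sketch with made-up frame sizes (the numbers are hypothetical, not from Issue #4039) shows the wrapped value `fio_fseek` would have received:

```python
# Hypothetical DCD geometry -- illustrative numbers only.
header_size = 276        # bytes
framesize = 40_000       # bytes per frame
frame = 60_000           # frame index being seeked to

offset = header_size + framesize + framesize * (frame - 1)
print(offset)            # 2400000276 -- past the 2 GiB mark

def wrap32(n):
    """Reduce n to signed 32-bit two's-complement, like C int overflow."""
    return (n + 2**31) % 2**32 - 2**31

# What the same arithmetic yields if every intermediate is a 32-bit int:
print(wrap32(offset))    # negative: seeking there fails or reads garbage
```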
testsuite/MDAnalysisTests/coordinates/test_dcd.py (36 additions, 0 deletions)

@@ -20,6 +20,7 @@
# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
#
import os
from pathlib import Path
import numpy as np

@@ -436,3 +437,38 @@ def test_pathlib():
# we really only care that pathlib
# object handling worked
assert u.atoms.n_atoms == 3341


@pytest.fixture

Comment (Member):

Make it a module-level fixture so that it really only runs once? Unfortunately it will need to use the tmpdir factory.


def large_dcdfile(tmpdir):
# creates a >2Gb DCD file
fsize = 3.8 # mb

Comment (Member):

To be super-flexible, get the size from DCD itself. Totally optional.


nreps_reqs = int(2100 // fsize) # times to duplicate traj to hit 2.1Gb

newf = str(tmpdir / "jabba.dcd")

Comment (Member):

Name approved!



u = mda.Universe(PSF, DCD)

with mda.Writer(newf, n_atoms=len(u.atoms)) as w:
for _ in range(nreps_reqs):
for ts in u.trajectory:
w.write(u.atoms)

yield newf, nreps_reqs


@pytest.mark.skipif(
    not os.environ.get("LARGEDCD", False), reason="Skipping large file test"
)

Comment (Member):

If the env var is supposed to skip the test then better call it SKIPLARGEFILETESTS or something like that. In any case, update CI so that it runs somewhere.

def test_large_dcdfile(large_dcdfile):
DCD_large, nreps = large_dcdfile

u_small = mda.Universe(PSF, DCD)
u = mda.Universe(PSF, DCD_large)

assert len(u.trajectory) == len(u_small.trajectory) * nreps

u_small.trajectory[-1]
u.trajectory[-1]

assert_array_almost_equal(u.atoms.positions, u_small.atoms.positions)
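
The module-level fixture the reviewer suggests would look roughly like this; `tmp_path_factory` is pytest's session-scoped tmpdir factory, so a `scope="module"` fixture may use it. The stub helper and sizes below are assumptions for illustration -- the real fixture would keep the `mda.Writer` loop from the diff above:

```python
import pytest

def make_stub_dcd(dirpath, size_bytes=1024):
    # Stand-in for the expensive repeated-trajectory write; the real
    # fixture would loop mda.Writer over the source trajectory instead.
    path = dirpath / "jabba.dcd"
    path.write_bytes(b"\0" * size_bytes)
    return path

@pytest.fixture(scope="module")
def large_dcdfile(tmp_path_factory):
    # scope="module": the >2Gb file is built once per test module,
    # not once per test function.
    return make_stub_dcd(tmp_path_factory.mktemp("large_dcd"))
```

With a module-scoped fixture, every test in the module that requests `large_dcdfile` shares the single generated file, which is the point of the suggestion given how long the real write takes.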