Issue 4039 large dcd #4048
Merged · 7 commits · Mar 29, 2023
Conversation

richardjgowers (Member):
Fixes #4039

Changes made in this Pull Request:

  • use fio_size_t for variables related to filesize

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Commits:

  • skipped by pytest by default unless LARGEDCD env var set
  • use fio_size_t for all variables related to filesizes
  • fixes for #4039
@IAlibay (Member) commented Mar 1, 2023

Yup, this works re: the bugfix; let's just make sure we squash merge.

@IAlibay (Member) previously requested changes, Mar 1, 2023:

Do we actually care about that skipif so much?

yield newf, nreps_reqs


@pytest.mark.skipif(not os.environ.get('LARGEDCD', False),
Member: That's kinda confusing logic, and it looks undocumented. Are we really expecting to use it?

Member Author: I can just remove the test if you like; it was handy while I was fixing the bug.

@IAlibay (Member), Mar 1, 2023: The test would be good to keep; I just don't really know why you'd need a skipif that isn't documented. Did we not already have a high-memory flag from the EDR tests? Can we just use that instead?

Member: We should run the test in at least one runner every time.

And as I said in the original issue, eventually every reader should be tested with a large trajectory so that we have a better chance of catching this kind of issue.

@IAlibay (Member) commented Mar 1, 2023

FYI the lint failure is optional; I might make it print comments instead when I get time to play with the GH API again.

codecov bot commented Mar 1, 2023

Codecov Report

Patch coverage: 100.00%, with no change in project coverage.

Comparison: base (5794c82) 93.57% vs. head (9cbe8e8) 93.57%.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4048   +/-   ##
========================================
  Coverage    93.57%   93.57%           
========================================
  Files          192      192           
  Lines        25133    25135    +2     
  Branches      4056     4056           
========================================
+ Hits         23517    23521    +4     
+ Misses        1095     1094    -1     
+ Partials       521      520    -1     
Impacted Files Coverage Δ
package/MDAnalysis/lib/formats/libdcd.pyx 90.85% <100.00%> (-0.26%) ⬇️

... and 1 file with indirect coverage changes


@orbeckst (Member) left a review:

Minor comments

@@ -436,3 +437,38 @@ def test_pathlib():
# we really only care that pathlib
# object handling worked
assert u.atoms.n_atoms == 3341


@pytest.fixture
Member: Make this a module-level fixture so that it really only runs once? Unfortunately it will need to use the tmpdir factory.
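As an editorial illustration of that suggestion, here is a minimal sketch of a module-scoped variant, assuming the file is built by repeatedly re-writing the small PSF/DCD reference trajectory (~3.8 MB) from MDAnalysisTests.datafiles until it passes ~2.1 GB; the names large_dcdfile, newf, and nreps_reqs mirror the diff, everything else is illustrative rather than the PR's actual code:

import pytest
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD


@pytest.fixture(scope="module")
def large_dcdfile(tmpdir_factory):
    # Module-scoped so the >2 GB file is only written once per test module;
    # module-scoped fixtures cannot use the function-scoped tmpdir fixture,
    # hence tmpdir_factory.
    newf = str(tmpdir_factory.mktemp("large_dcd").join("jabba.dcd"))
    fsize = 3.8  # MB, approximate size of the reference DCD trajectory
    nreps_reqs = int(2100 // fsize)  # repeats needed to pass ~2.1 GB
    u = mda.Universe(PSF, DCD)
    with mda.Writer(newf, n_atoms=u.atoms.n_atoms) as w:
        for _ in range(nreps_reqs):
            for ts in u.trajectory:
                w.write(u.atoms)
    yield newf, nreps_reqs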

@@ -391,7 +391,9 @@ cdef class DCDFile:
         if frame == 0:
             offset = self._header_size
         else:
-            offset = self._header_size + self._firstframesize + self._framesize * (frame - 1)
+            offset = self._header_size
+            offset += self._firstframesize
Member: Does this ensure that the overflow cannot happen?

By the way, frames was declared as int in the method's signature. Should that be changed too, or is that a Python int with unlimited size?

Member Author: I've reproduced the exact bug (with the contentious test) and this fixes it. I've not looked at the raw C and followed all the types, but by eye, promoting some variables to the correct datatype seemed to jiggle it into place.

Member: @orbeckst the def(blah: int) syntax in Cython allows it to switch between a C int and a Python int depending on how much it knows about the types. I think @richardjgowers' approach of changing the size of the directly declared C types is the correct one.
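To make the overflow concrete, a small illustrative sketch (the header and frame sizes are made up, not the real DCD layout; fio_size_t is typically a 64-bit offset type on platforms with large-file support): with 32-bit arithmetic the seek offset wraps negative once the target byte position passes 2 GiB, while 64-bit arithmetic stays correct.

import numpy as np

header_size = 276        # hypothetical header size in bytes
firstframesize = 50_000  # hypothetical size of the first frame
framesize = 50_000       # hypothetical size of subsequent frames
frame = 50_000           # a frame index well past the 2 GiB boundary

# 32-bit arithmetic: the product wraps around (NumPy may warn about the
# overflow), so the offset handed to fseek() would come out negative.
bad = (np.int32(header_size) + np.int32(firstframesize)
       + np.int32(framesize) * np.int32(frame - 1))

# 64-bit arithmetic, i.e. the effect of promoting the variables to a
# 64-bit type such as fio_size_t.
good = (np.int64(header_size) + np.int64(firstframesize)
        + np.int64(framesize) * np.int64(frame - 1))

print(bad, good)  # a negative number vs. 2500000276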

fsize = 3.8 # mb
nreps_reqs = int(2100 // fsize) # times to duplicate traj to hit 2.1Gb

newf = str(tmpdir / "jabba.dcd")
Member: Name approved!

@pytest.fixture
def large_dcdfile(tmpdir):
# creates a >2Gb DCD file
fsize = 3.8 # mb
Member: To be super-flexible, get the size from DCD itself. Totally optional.
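A sketch of that optional tweak, deriving the repeat count from the reference file's actual size rather than the hard-coded 3.8 MB (the ~2.1 GB target from the snippet above is kept; the variable names are illustrative):

import os
from MDAnalysisTests.datafiles import DCD

target_bytes = int(2.1 * 1024 ** 3)           # aim a little past the 2 GiB boundary
fsize_bytes = os.path.getsize(DCD)            # actual size of the reference trajectory
nreps_reqs = -(-target_bytes // fsize_bytes)  # ceiling division: repeats needed to exceed the target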



@pytest.mark.skipif(
not os.environ.get("LARGEDCD", False), reason="Skipping large file test"
Member: If the env var is supposed to skip the test, then better to call it SKIPLARGEFILETESTS or something like that. In any case, update CI so that it runs somewhere.
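For reference, a sketch of what an opt-in gate along these lines could look like (the environment variable name and the test name here are illustrative, not what the PR ships; the PR itself uses LARGEDCD):

import os

import pytest


@pytest.mark.skipif(
    not os.environ.get("ENABLE_LARGEFILE_TESTS", False),
    reason="set ENABLE_LARGEFILE_TESTS=1 to run tests that create >2 GB files",
)
def test_seek_large_dcd(large_dcdfile):
    # The body would exercise seeking in the >2 GB file produced by the
    # large_dcdfile fixture; elided here.
    ...

Running it on CI then only requires exporting that variable in at least one job's environment.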


@orbeckst (Member) commented Mar 1, 2023

@IAlibay if you don't want to be in charge please assign someone else, but given that this is related to releases etc I thought you'd be the best person.

@IAlibay (Member) commented Mar 1, 2023

> @IAlibay if you don't want to be in charge please assign someone else, but given that this is related to releases etc I thought you'd be the best person.

No worries, I'm happy to be in charge of merging; I'll make sure I don't forget to release.

@IAlibay (Member) commented Mar 1, 2023

So one thing to be aware of here: there are only 14 GB of disk space available on GitHub runners. We'll need to be absolutely sure we clean up space, especially when dealing with pytest-xdist.
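One way to address the disk-space concern (a sketch, not part of the PR as shown): have the fixture delete the big file in its teardown, so each worker cleans up immediately rather than waiting for pytest's temporary-directory housekeeping.

import os

import pytest


@pytest.fixture(scope="module")
def large_dcdfile(tmpdir_factory):
    newf = str(tmpdir_factory.mktemp("large_dcd").join("jabba.dcd"))
    # ... write the >2 GB trajectory here, as in the sketch above ...
    yield newf
    # Teardown: remove the ~2 GB file right away so a 14 GB runner (or
    # several pytest-xdist workers, each with its own copy) does not run
    # out of disk.
    if os.path.exists(newf):
        os.remove(newf)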

@richardjgowers (Member Author):
I'm not sure it's a good idea to run a large-file test for every format on every run. They're slow to create, for one (2 GB of I/O), and it's probably not necessary.

I'm (we're) not going to have time to solve the entire issue of testing large-file I/O in this PR, but I might suggest that we take this patch and fix a popular format in a bugfix release.

@IAlibay (Member) commented Mar 1, 2023

> I'm not sure it's a good idea to run a large-file test for every format on every run. They're slow to create, for one (2 GB of I/O), and it's probably not necessary.
>
> I'm (we're) not going to have time to solve the entire issue of testing large-file I/O in this PR, but I might suggest that we take this patch and fix a popular format in a bugfix release.

fair, do you want to just raise an issue with the current state of things?

@orbeckst (Member) commented Mar 1, 2023

I don't suggest that this PR should solve testing big files for every format.

But I think the PR should make some changes to the CI files to ensure that this test is run, either in at least one runner or, at an absolute minimum, in the cron. @IAlibay might have a better idea of when we should run it. But I'd want to avoid seeing such a bad regression again.

@IAlibay (Member) commented Mar 1, 2023

May I counter @orbeckst and ask that we don't deal with CI here? This is a nicely cherry-pickable PR; add CI changes and it's going to be a pain (CI changes a decent chunk between releases).

@IAlibay (Member) commented Mar 1, 2023

I'm happy to take on the responsibility of fixing up a CI entry for this if @richardjgowers would prefer not to open a second PR.

@IAlibay (Member) commented Mar 1, 2023

Also note that I approve but am leaving the review red so that I can fix things up here directly and we don't need a second pre-2.4.3 PR.

@hmacdope (Member) commented Mar 1, 2023

> I'm not sure it's a good idea to run a large-file test for every format on every run. They're slow to create, for one (2 GB of I/O), and it's probably not necessary.
>
> I'm (we're) not going to have time to solve the entire issue of testing large-file I/O in this PR, but I might suggest that we take this patch and fix a popular format in a bugfix release.

I agree; I was more thinking that this is a potential class of bug we should investigate, especially with the XDR reader that I Cythonised in #3892. We (I) can raise an issue and work from there?

@hmacdope (Member) left a review:

Thanks @richardjgowers!


@orbeckst (Member) left a review:

Given that @IAlibay prefers the PR in this form for easier handling, and given that he has also committed to getting the test to run on CI somehow, I have no further objections.

@orbeckst mentioned this pull request on Mar 9, 2023
@orbeckst (Member) commented Mar 9, 2023

@IAlibay, can we merge this into develop and you cherry-pick once you are ready for the 2.4.2 hotfix (#4061)?

I'd be keen to close #4048 and #4039.

@IAlibay (Member) commented Mar 9, 2023

> @IAlibay, can we merge this into develop and you cherry-pick once you are ready for the 2.4.2 hotfix (#4061)?
>
> I'd be keen to close #4048 and #4039.

I still need to make a couple of changes; I'll try to deal with this tomorrow. Sorry about the delay.

@orbeckst (Member):

Progress?

@IAlibay (Member) commented Mar 23, 2023

> Progress?

Sorry, I've been swamped lately, and I mostly need a good empty half-day to do the release. Depending on how metrics generation and fixing the darker lint goes, I'll try to do it today or tomorrow.

@orbeckst (Member):

I understand that the release requires a solid chunk of time. My (poorly worded) question was more along the lines of what needs to be done so the PR can be merged into develop, given that you said:

> I still need to make a couple of changes

Once it's merged, we can at least offer a working development version, and you can cherry-pick from develop whenever you can fit it in. At least that's how I understood your comment on the process.

@IAlibay (Member) commented Mar 23, 2023

> Once it's merged, we can at least offer a working development version, and you can cherry-pick from develop whenever you can fit it in. At least that's how I understood your comment on the process.

Yeah, sorry, there's a significant element of "I don't fully remember what I need fixed to cherry-pick easily" (I have the medium-term memory of a goldfish lately...), so I was trying to get a bit of time to review what I needed before making a mess of this.

I've booked off the evening for this, so let's try to get this done now.

@IAlibay dismissed their stale review on March 29, 2023 at 14:24

completed

@IAlibay merged commit 628e0f7 into develop on Mar 29, 2023
@IAlibay deleted the issue-4039_large_DCD branch on March 29, 2023 at 16:24
IAlibay pushed a commit that referenced this pull request Mar 29, 2023
Fixes #4039
* Fixes DCD seeking for large (2Gb+) files.
@IAlibay added the defect label on Sep 21, 2023
Linked issue closed by this pull request: Error when loading multiple large DCD trajectories (#4039)