
Speeding up and reducing the memory footprint of the trainer tests #344

Merged
merged 2 commits into releases/v0.2 from tests/fix_memory
Jan 1, 2022

Conversation

calebrob6
Member

Our "tests/trainers/" unit tests currently use the following resources (based on /usr/bin/time --verbose pytest tests/trainers/ on my machine):

User time (seconds): 682.60
System time (seconds): 269.87
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.86
Maximum resident set size (kbytes): 6274388

This PR monkeypatches/replaces the models (ResNet18s and U-Nets with ResNet18 backbones) in the generic tests with smaller models. This reduces overall resource usage to:

User time (seconds): 503.18
System time (seconds): 212.22
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.41
Maximum resident set size (kbytes): 5501672

For "test_classification.py" specifically the before/after is:

User time (seconds): 53.64
System time (seconds): 35.74
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.28
Maximum resident set size (kbytes): 1941160

to

User time (seconds): 24.94
System time (seconds): 16.05
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.39
Maximum resident set size (kbytes): 1254892

Similarly, for "test_segmentation.py":

User time (seconds): 251.97
System time (seconds): 51.93
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.06
Maximum resident set size (kbytes): 2774676

to

User time (seconds): 228.96
System time (seconds): 38.08
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.87
Maximum resident set size (kbytes): 1403244

@github-actions github-actions bot added the testing Continuous integration testing label Dec 31, 2021
@adamjstewart
Collaborator

This isn't as much of a speed/memory difference as I was expecting. Any thoughts on why things are still quite slow and use a lot of memory?

@isaaccorley
Collaborator

For classification, we could probably make the stride something large so we do fewer multiplications during convolution. I wonder whether that would make a difference.
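To gauge the effect, here is a rough back-of-the-envelope multiply count for a single conv layer (plain Python, hypothetical sizes; the real savings depend on the whole network):

```python
def conv2d_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_multiplies(size: int, kernel: int, stride: int, in_ch: int, out_ch: int) -> int:
    """Approximate multiply count: one full kernel application per output pixel."""
    out = conv2d_output_size(size, kernel, stride)
    return out * out * kernel * kernel * in_ch * out_ch


# Hypothetical 64x64 input, 3x3 kernel, 3 -> 16 channels
small_stride = conv2d_multiplies(64, 3, stride=1, in_ch=3, out_ch=16)
large_stride = conv2d_multiplies(64, 3, stride=4, in_ch=3, out_ch=16)
print(small_stride // large_stride)  # → 15: stride 4 cuts multiplies ~16x
```

Since the output shrinks roughly quadratically with stride, a stride of s cuts the multiply count by about s².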

@calebrob6
Member Author

For "test_classification.py", if I monkeypatch timm in every method where we use ClassificationTask, then resource usage drops a little further:

User time (seconds): 17.32
System time (seconds): 10.92
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.26
Maximum resident set size (kbytes): 1079080

However, even if I replace the models with no-ops (this is kind of hacky, so I haven't committed it), the resource usage doesn't drop any lower. This means that the datasets/datamodules are responsible for the remaining time/memory.

This is a decent improvement though: it drops CI test time (sampling from recent Ubuntu 3.6 runs) from ~2m20s to ~1m35s (a ~32% decrease).

@calebrob6
Member Author

> if I monkeypatch timm in every method we use ClassificationTask

How can I do this once for the whole file?

@calebrob6
Member Author

With the shrinking of the OSCD test data, "test_segmentation.py" is now:

User time (seconds): 74.53
System time (seconds): 25.18
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.36
Maximum resident set size (kbytes): 1139396

(a huge improvement)

isaaccorley previously approved these changes Jan 1, 2022
Collaborator

@isaaccorley left a comment


LGTM. Not sure whether we need docstrings for the classes in test_utils.

@@ -30,6 +32,8 @@ def test_trainer(self, name: str, classname: Type[LightningDataModule]) -> None:
model_kwargs = conf_dict["module"]
model = RegressionTask(**model_kwargs)

model.model = RegressionTestModel()
Member Author


I could monkeypatch torchvision.models.resnet18 here to avoid having the ResNet18 object actually created in the constructor, but it seems like that wouldn't save much time/RAM, and it will change when we upgrade RegressionTask.
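For reference, a dummy replacement model in this style might look something like the following (a sketch only; the actual RegressionTestModel in the PR may differ):

```python
import torch
import torch.nn as nn


class RegressionTestModel(nn.Module):
    """A tiny stand-in for the ResNet18 used by RegressionTask (sketch)."""

    def __init__(self, in_channels: int = 3) -> None:
        super().__init__()
        # One 1x1 conv + global pool: orders of magnitude cheaper than ResNet18
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Produces a (batch, 1) regression output, matching the task's interface
        return self.pool(self.conv(x)).flatten(1)
```

The key property is that it accepts the same input shape and produces the same output shape as the real model, so the training loop under test runs unchanged.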

@adamjstewart
Collaborator

Tested again with the latest commit and latest main:

main

$ /usr/bin/time -v pytest
User time (seconds): 735.98
System time (seconds): 289.61
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.87
Maximum resident set size (kbytes): 8144368

tests/fix_memory

$ /usr/bin/time -v pytest
User time (seconds): 620.70
System time (seconds): 210.79
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09.44
Maximum resident set size (kbytes): 7420892

So that's a ~13% (10 s) reduction in wall clock time and a ~9% (700 MB) reduction in peak RAM. This is definitely an improvement, but it's unclear whether it's enough to avoid all OOM errors on Windows. I've still seen at least one Windows OOM error on main even after the reduction in OSCD pad size.

@calebrob6
Member Author

Collection of comments:

  • From here -- Windows and Linux CI machines have 7 GB of RAM
  • @adamjstewart do you know how I can monkeypatch timm for the entirety of "test_classification.py"? (or at a test class level?)
  • The RAM usage should improve once we get rid of the other trainer tests
  • I'll poke around some more today to check for any huge test files lying around, and to measure the RAM usage of each part of the tests

@adamjstewart
Collaborator

@adamjstewart do you know how I can monkeypatch timm for the entirety of "test_classification.py"? (or at a test class level?)

You would probably want to create a fixture in conftest.py. This file can be placed at any level in the repo (tests, tests/trainers, etc.) and defines fixtures available to all tests in the same directory and its subdirectories.

@calebrob6
Member Author

I don't think that I want to make a fixture. I want to make it such that when the code in ClassificationTask calls timm.create_model, my code gets called instead. Currently, I do that within a single test via:

monkeypatch.setattr(...)  # type: ignore[attr-defined]

but I would need to repeat this line of code in every test.

@adamjstewart
Collaborator

Fixtures don't need to have a return value; they can monkeypatch a section of code and then yield, allowing test functions to use the monkeypatched library.

@calebrob6
Member Author

Oh cool, so it would be something like:

def monkeypatch_timm() -> None:
  # do the patching
  yield

Then I can add that as a parameter to any test where I want to use the monkeypatched timm?

@adamjstewart
Collaborator

Yep, plus @pytest.fixture of course. Might need to play around with the scope.
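Putting those pieces together, such a fixture could look roughly like this (a sketch, not the PR's actual code; a stand-in `timm` namespace is used here so the snippet is self-contained, whereas in a real conftest.py you would simply `import timm`):

```python
import types
from typing import Any, Generator

import pytest

# Stand-in for the real `timm` module so this sketch runs on its own;
# in conftest.py you would `import timm` instead.
timm = types.SimpleNamespace(create_model=lambda name, **kw: f"real:{name}")


@pytest.fixture
def monkeypatch_timm(monkeypatch: pytest.MonkeyPatch) -> Generator[None, None, None]:
    """Swap timm.create_model for a cheap dummy for the duration of a test."""

    def dummy_create_model(name: str, **kwargs: Any) -> Any:
        # Return whatever tiny model the tests need instead of a real backbone
        return f"dummy:{name}"

    monkeypatch.setattr(timm, "create_model", dummy_create_model)
    yield  # the patch is active while the test runs and is undone afterwards
```

Any test that lists `monkeypatch_timm` as a parameter then sees the patched `timm.create_model`; placing the fixture in tests/trainers/conftest.py makes it available to every test in that directory.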

@adamjstewart
Collaborator

Can you rebase now that dataset-specific trainers (and tests) are gone? That should hopefully show further reduction in memory usage.

@calebrob6
Member Author

calebrob6 commented Jan 1, 2022

Now, after the rebase:

main

Command being timed: "pytest tests/trainers/"
User time (seconds): 360.17
System time (seconds): 184.81
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:31.12
Maximum resident set size (kbytes): 6773776

tests/fix_memory

Command being timed: "pytest tests/trainers/"
User time (seconds): 263.35
System time (seconds): 130.11
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.80
Maximum resident set size (kbytes): 4314044

@adamjstewart adamjstewart added this to the 0.2.1 milestone Jan 1, 2022
@adamjstewart adamjstewart changed the base branch from main to releases/v0.2 January 1, 2022 22:57
Collaborator

@adamjstewart adamjstewart left a comment


Was going to wait until after 0.2.0, but given that those tests are consistently failing, I think we need this now. Changed the base to releases/v0.2 so that these changes can be merged into that PR branch.

@adamjstewart adamjstewart merged commit 4d9a729 into releases/v0.2 Jan 1, 2022
@adamjstewart adamjstewart deleted the tests/fix_memory branch January 1, 2022 22:59
@adamjstewart adamjstewart modified the milestones: 0.2.1, 0.2.0 Jan 1, 2022
isaaccorley added a commit that referenced this pull request Jan 2, 2022
* 0.2.0 release

* Fix notebooks

* Fix minimal dependency tests

* Fix integration tests

* Fix integration tests

* Try to avoid running GitHub Actions twice on release PRs

* Revert "Try to avoid running GitHub Actions twice on release PRs"

This reverts commit a1ac7ab.

* GeoDatasets use intersection, not addition

* Adding stack_samples to benchmarks

* Fix zero division error in SEN12MS tests

* Replaces test models with dummy models (#344)

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* lc values must be < num_classes

* updated indices tutorial with latest indices

Co-authored-by: Caleb Robinson <calebrob6@gmail.com>
Co-authored-by: isaaccorley <22203655+isaaccorley@users.noreply.github.com>
@adamjstewart adamjstewart added utilities Utilities for working with geospatial data and removed utilities Utilities for working with geospatial data labels Jan 2, 2022
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023