Improve performance by avoiding loading the GMT library repeatedly #2930

Merged
merged 12 commits into from
Jan 2, 2024

Conversation

seisman
Member

@seisman seisman commented Dec 28, 2023

Description of proposed changes

The load_libgmt function can search for the GMT library and load it so that we can call GMT API functions in PyGMT. The function is called in the Session class's get_libgmt_func function.

Although the get_libgmt_func function is called multiple times in Session's different methods, the GMT library is only loaded once for each session, as shown below:

pygmt/pygmt/clib/session.py

Lines 310 to 311 in bb82345

if not hasattr(self, "_libgmt"):
    self._libgmt = load_libgmt()

However, when wrapping GMT modules, we always create a new session and then call the GMT module like this

with Session() as lib:
    lib.call_module(...)

It means the GMT library is searched and loaded multiple times, as reported in #867.
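
To make the problem concrete, here is a minimal sketch of why a per-instance cache does not help when every wrapped call creates a new session. `fake_load_libgmt`, `ToySession` and `run_wrapped_modules` are hypothetical stand-ins written for illustration, not PyGMT code; only the `hasattr` check mirrors the snippet quoted above.

```python
load_calls = 0  # counts how often the (fake) library is "loaded"


def fake_load_libgmt():
    """Hypothetical stand-in for load_libgmt; it only counts calls."""
    global load_calls
    load_calls += 1
    return object()


class ToySession:
    """Toy class that caches the library per instance, like the snippet above."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False

    def get_libgmt_func(self, name):
        # The cache lives on the instance, so it is lost with each new object.
        if not hasattr(self, "_libgmt"):
            self._libgmt = fake_load_libgmt()
        return name


def run_wrapped_modules(n_calls):
    """Mimic n wrapped module calls, each creating a fresh session."""
    for _ in range(n_calls):
        with ToySession() as lib:
            lib.get_libgmt_func("first")
            lib.get_libgmt_func("second")  # cached within this session
    return load_calls
```

Within one session the second lookup reuses the cached handle, but each new `ToySession` starts without the attribute, so the number of loads grows with the number of sessions.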

To verify that the GMT library is loaded multiple times, add a print statement such as print("Loading GMT library") at the start of the load_libgmt function and then call PyGMT. The output below shows that the GMT library is loaded multiple times:

In [1]: import pygmt
Loading GMT library
Loading GMT library

In [2]: fig = pygmt.Figure()
Loading GMT library

In [3]: fig.basemap(region=[0, 10, 0, 10], frame=True, projection="X10c")
Loading GMT library
Loading GMT library

In [4]: fig.savefig("map.png")
Loading GMT library
Loading GMT library

Searching for and loading the GMT library multiple times is unnecessary and time-consuming (about 120 ms per load for me on macOS).

In [1]: from pygmt.clib.loading import load_libgmt

In [2]: %timeit load_libgmt()
118 ms ± 5.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
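
Outside IPython, a similar measurement can be sketched with the standard-library `timeit` module. `slow_loader` below is a hypothetical stand-in that just sleeps for 10 ms, and `best_time_ms` is a helper name made up for this example.

```python
import time
import timeit


def slow_loader():
    """Hypothetical stand-in for an expensive loader (sleeps for 10 ms)."""
    time.sleep(0.01)


def best_time_ms(func, repeat=3, number=5):
    """Return the best per-call time in milliseconds over several repeats."""
    # timeit.repeat returns one total time (in seconds) per repeat,
    # each covering `number` calls of func.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000.0
```

With PyGMT installed, passing `load_libgmt` instead of `slow_loader` should give a number comparable to the `%timeit` result above, though the exact value depends on the machine.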

This PR fixes the issue by moving the load_libgmt call outside the Session class, so that the GMT library is only searched for and loaded once when importing pygmt.
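
The idea can be sketched as follows, assuming a module-level handle that every session reuses. The names `fake_load_libgmt`, `_libgmt`, `ToySession` and `handles_from_sessions` are hypothetical, and the actual change in this PR may differ in naming and placement.

```python
load_calls = 0  # counts how often the (fake) library is "loaded"


def fake_load_libgmt():
    """Hypothetical stand-in for load_libgmt; it only counts calls."""
    global load_calls
    load_calls += 1
    return object()


# Module level: runs once, when this module is first imported or executed.
_libgmt = fake_load_libgmt()


class ToySession:
    """Toy session that reuses the module-level handle."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False

    def get_libgmt(self):
        # No per-session search or load; just hand back the shared handle.
        return _libgmt


def handles_from_sessions(n_sessions):
    """Open n sessions and collect the library handle each one sees."""
    handles = []
    for _ in range(n_sessions):
        with ToySession() as lib:
            handles.append(lib.get_libgmt())
    return handles
```

However many sessions are opened, the loader runs once and all sessions see the same object.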

This PR addresses the following comment in PR #867 so it fixes #867:

So it might be time to rethink loading libgmt in the class instead of as a global instead to avoid searching for it every time.

#867 also proposed another alternative solution:

Or we figure out a way to not call begin at import time.

This solution may help address issue #217, but it's unclear what exactly we should do and it definitely needs big refactors, so we can explore the solution later.

It's worth noting that the small changes in this PR speed up our tests significantly! For example, the "Run tests" step now takes only 85 seconds (on Linux), compared to 250 seconds in the main branch.

@seisman seisman changed the title clib: Search and load the GMT library only one time POC + WIP: clib: Search and load the GMT library only one time Dec 28, 2023
@seisman seisman marked this pull request as ready for review December 28, 2023 14:07

codspeed-hq bot commented Dec 28, 2023

CodSpeed Performance Report

Merging #2930 will not alter performance

Comparing clib/load-libgmt (11cb76b) with main (88ab1ca)

Summary

✅ 64 untouched benchmarks

@seisman
Member Author

seisman commented Dec 28, 2023

I'm a little confused about why we don't see any improvement in the benchmark report above.

@weiji14
Member

weiji14 commented Dec 29, 2023

If you click into the link - https://codspeed.io/GenericMappingTools/pygmt/branches/clib/load-libgmt, it shows that most tests improve by 1-2%. Since most unit tests only call a few GMT commands, they shouldn't show too much of a gain, but still, I think it's worth it.

Maybe wait for #2924? There are a few clib/session related tests in that PR which might show a clearer improvement.

@seisman
Member Author

seisman commented Dec 29, 2023

If you click into the link - https://codspeed.io/GenericMappingTools/pygmt/branches/clib/load-libgmt, it shows that most tests improve by 1-2%. Since most unit tests only call a few GMT commands, they shouldn't show too much of a gain, but still, I think it's worth it.

Maybe wait for #2924? There are a few clib/session related tests in that PR which might show a clearer improvement.

Just tried to benchmark locally.

In the main branch:

In [1]: import pygmt

In [2]: fig = pygmt.Figure()

In [3]: %timeit fig.basemap(region=[0, 10, 0, 10], frame=True, projection="X10c")
237 ms ± 9.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And in this branch:

In [1]: import pygmt

In [2]: fig = pygmt.Figure()

In [3]: %timeit fig.basemap(region=[0, 10, 0, 10], frame=True, projection="X10c")
6.12 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, the Figure.basemap call is 6.12 ms vs 237 ms. The reduced time (~230 ms) is just the time used to load the GMT library twice. So I expect to see that even the simplest test should have a time reduction of at least 200 ms. But it doesn't match the CodSpeed reports.
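
As a quick check of that arithmetic, the numbers quoted in this thread can be combined as below. The variable names are only for illustration, and the 118 ms figure is the single `load_libgmt()` timing from the PR description.

```python
main_ms = 237.0    # Figure.basemap on the main branch
branch_ms = 6.12   # Figure.basemap on this branch
load_ms = 118.0    # one load_libgmt() call, from the PR description

saved_ms = main_ms - branch_ms     # time saved per basemap call
loads_equiv = saved_ms / load_ms   # saved time expressed in library loads
speedup = main_ms / branch_ms      # ratio of the two timings

print(f"saved: {saved_ms:.2f} ms")
print(f"equivalent loads: {loads_equiv:.2f}")
print(f"speedup: {speedup:.1f}x")
```

The saved time comes out close to two loads, which is consistent with the two "Loading GMT library" lines printed for `fig.basemap` in the description.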

@seisman seisman added the maintenance Boring but important stuff for the core devs label Dec 29, 2023
@seisman seisman added this to the 0.11.0 milestone Dec 29, 2023
@seisman seisman changed the title POC + WIP: clib: Search and load the GMT library only one time clib: Search and load the GMT library only one time Dec 29, 2023
@seisman
Member Author

seisman commented Dec 29, 2023

I've added a unit test in commit f589ff8. The unit test passes in this branch and fails in the main branch:

_______________________________ TestLibgmtCount.test_libgmt_load_counter _________________________________

self = <pygmt.tests.test_clib_loading.TestLibgmtCount object at 0x7f212fc958b0>

    @pytest.mark.usefixtures("_mock_ctypes")
    def test_libgmt_load_counter(self):
        """
        Make sure that the GMT library is not loaded in every session.
        """
        with Session() as lib:
            _ = lib
        with Session() as lib:
            _ = lib
>       assert self.counter == 0  # ctypes.CDLL is not called after two sessions.
E       assert 2 == 0
E        +  where 2 = <pygmt.tests.test_clib_loading.TestLibgmtCount object at 0x7f212fc958b0>.counter

pygmt/tests/test_clib_loading.py:242: AssertionError
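
For readers unfamiliar with this kind of test, the counting idea can be sketched with `unittest.mock`. This is a self-contained illustration under stated assumptions, not the test added in f589ff8: `toy_load_library`, `PerSessionLoader`, `SharedLoader` and `count_cdll_calls` are hypothetical names, and the real test uses a pytest fixture named `_mock_ctypes` whose implementation is not shown here.

```python
import ctypes
from unittest import mock


def toy_load_library(name="libtoy"):
    """Hypothetical loader that goes through ctypes.CDLL."""
    return ctypes.CDLL(name)


class PerSessionLoader:
    """Toy session that loads the library on every instantiation."""

    def __init__(self):
        self.lib = toy_load_library()


_shared_lib = object()  # pretend this handle was loaded once at import time


class SharedLoader:
    """Toy session that reuses a handle loaded elsewhere."""

    def __init__(self):
        self.lib = _shared_lib


def count_cdll_calls(session_factory, n_sessions=2):
    """Create sessions with ctypes.CDLL mocked out and count its calls."""
    with mock.patch.object(ctypes, "CDLL") as fake_cdll:
        for _ in range(n_sessions):
            session_factory()
        return fake_cdll.call_count
```

Two `PerSessionLoader` instances trigger two mocked `CDLL` calls, matching the `assert 2 == 0` failure above, while two `SharedLoader` instances trigger none.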

@seisman seisman changed the title clib: Search and load the GMT library only one time Improve performance by avoiding loading the GMT library repeatly Dec 29, 2023
@seisman seisman added enhancement Improving an existing feature needs review This PR has higher priority and needs review. and removed maintenance Boring but important stuff for the core devs labels Dec 30, 2023
Member

@weiji14 weiji14 left a comment

This solution may help address issue #217, but it's unclear what exactly we should do and it definitely needs big refactors, so we can explore the solution later.

I can see the value of not loading GMT repeatedly in terms of speed, but I'm still a little unsure about making this a global variable. There are also issues like #1242 and #1582 where there has been flakiness with using a single GMT session (though the changes here aren't at the session level, but at the libgmt loading level), and we'll need to come up with some really good unit tests to make sure that there aren't any side effects from this change.

Could we make a unit test based on #217 (comment), and test to see if the change in this PR allows for this multiprocessing/parallel code to work:

import pygmt
import multiprocessing as mp

def gmt_func(n):
    fig = pygmt.Figure()
    fig.coast(...) 
    fig.show()

# gmt_func(n=5) #calls the function without multiprocessing - uncomment as desired to see that the function works on its own

with mp.Pool(2) as p: #calls the process with multiprocessing, using 2 cores
    a = p.map(gmt_func, [x for x in range(0,2)])

@@ -208,6 +209,45 @@ def test_brokenlib_brokenlib_workinglib(self):
assert check_libgmt(load_libgmt(lib_fullnames=lib_fullnames)) is None


class TestLibgmtCount:
Member

Maybe better to put this unit test in test_session_management.py?

Member Author

Prefer to keep it in test_clib_loading.py because this unit test is actually not related to session management.

Member

Ok, I kinda wanted to move the test since test_clib_loading.py has 300+ lines of code, while test_session_management.py has <100 lines, and we're kinda checking that Session() doesn't reload libgmt here, but up to you 🙂

@seisman
Member Author

seisman commented Dec 30, 2023

Could we make a unit test based on #217 (comment), and test to see if the change in this PR allows for this multiprocessing/parallel code to work:

No, it definitely won't work. To allow multiprocessing for PyGMT, we need to call gmt begin in each process. Currently, gmt begin is only called once when importing pygmt. That's why the trick in #217 (comment) works.

@weiji14
Member

weiji14 commented Dec 30, 2023

Could we make a unit test based on #217 (comment), and test to see if the change in this PR allows for this multiprocessing/parallel code to work:

No, it definitely won't work. To allow multiprocessing for PyGMT, we need to call gmt begin in each process. Currently, gmt begin is only called once when importing pygmt. That's why the trick in #217 (comment) works.

Hmm, you're right. Is it ok for each process to have its own 'global' libgmt though? I suppose the changes here only affect the initial 'load GMT library' level, and not the 'GMT session' level, so it's probably ok?

seisman and others added 2 commits December 31, 2023 23:11
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
@seisman
Member Author

seisman commented Jan 1, 2024

Could we make a unit test based on #217 (comment), and test to see if the change in this PR allows for this multiprocessing/parallel code to work:

import pygmt
import multiprocessing as mp

def gmt_func(n):
    fig = pygmt.Figure()
    fig.coast(...) 
    fig.show()

# gmt_func(n=5) #calls the function without multiprocessing - uncomment as desired to see that the function works on its own

with mp.Pool(2) as p: #calls the process with multiprocessing, using 2 cores
    a = p.map(gmt_func, [x for x in range(0,2)])

In 48c00ae, I've added a test to make sure that the workaround in #217 still works after this change. But, as mentioned in #217 (comment), the workaround doesn't work for Windows, so it's marked as xfail on Windows.

@seisman seisman changed the title Improve performance by avoiding loading the GMT library repeatly Improve performance by avoiding loading the GMT library repeatedly Jan 1, 2024
@seisman
Member Author

seisman commented Jan 1, 2024

Could we make a unit test based on #217 (comment), and test to see if the change in this PR allows for this multiprocessing/parallel code to work:

import pygmt
import multiprocessing as mp

def gmt_func(n):
    fig = pygmt.Figure()
    fig.coast(...) 
    fig.show()

# gmt_func(n=5) #calls the function without multiprocessing - uncomment as desired to see that the function works on its own

with mp.Pool(2) as p: #calls the process with multiprocessing, using 2 cores
    a = p.map(gmt_func, [x for x in range(0,2)])

In 48c00ae, I've added a test to make sure that the workaround in #217 still works after this change. But, as mentioned in #217 (comment), the workaround doesn't work for Windows, so it's marked as xfail on Windows.

I've decided to remove this multiprocessing test from this PR, and will add it in PR #2938 and also fix the Windows issue.

@seisman seisman merged commit b561e9d into main Jan 2, 2024
19 checks passed
@seisman seisman deleted the clib/load-libgmt branch January 2, 2024 08:12
@seisman seisman removed the needs review This PR has higher priority and needs review. label Jan 2, 2024
@seisman seisman mentioned this pull request Jan 30, 2024
9 tasks
Labels
enhancement Improving an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve how PyGMT loads the GMT library
2 participants