Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of temporary files from pygmt functions and plotting methods #2730

Closed
27 of 31 tasks
seisman opened this issue Oct 9, 2023 · 7 comments
Closed
27 of 31 tasks
Labels
enhancement Improving an existing feature
Milestone

Comments

@seisman
Copy link
Member

seisman commented Oct 9, 2023

One of the project goals is to interface with GMT C API directly using ctypes without system calls. It lets us work on data in memory, which is more efficient than working on files on disk.

However, currently PyGMT still writes tables/grids/CPTs into temporary files. With PRs #2729 and #2398 (not merged yet), we can write data into memory and then working on these data-in-memory directly. Thus, it's possible to get rid of temporary files completely.

This issue report is the central place to track all the functions/methods that need to be refactored.

Modules writting tables

Modules writting grids

Most modules are refactored in #2398, except grdcut.

@seisman
Copy link
Member Author

seisman commented Dec 19, 2023

Just tried to see if getting rid of temporary files can make PyGMT faster. The result are quit promising. For grd2xyz, PR #2729 is ~35% faster (130 ms in PR #2729 vs 200 ms in the main branch):

For the main branch:

In [1]: import pygmt

In [2]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
200 ms ± 3.61 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
213 ms ± 17.9 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
217 ms ± 6.09 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

For PR #2729:

In [2]: import pygmt

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
146 ms ± 15.6 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
132 ms ± 1.81 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [5]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
131 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

@weiji14
Copy link
Member

weiji14 commented Dec 19, 2023

Nice, good to see those benchmarks! I've been wondering if we should use pytest-benchmark to get an idea of speedup over time. There's this pytest-codspeed/CodSpeedHQ/action tool that would allow us to track these benchmarks over time, but downside is:

  1. We'll need to manually mark the tests to benchmark with @pytest.mark.benchmark
  2. The tests might run slower, because the benchmarks runs a single test multiple times to get the mean/standard deviation values. Edit: Actually no, codspeed only runs each benchmark once, see https://codspeed.io/blog/pinpoint-performance-regressions-with-ci-integrated-differential-profiling.

Maybe we can selectively marking some of the tests we want to track from the list above at #2730 (comment)? I can try to set up the CI for this in the coming weekend.

@seisman
Copy link
Member Author

seisman commented Dec 20, 2023

I've been wondering if we should use pytest-benchmark to get an idea of speedup over time. There's this pytest-codspeed/CodSpeedHQ/action tool that would allow us to track these benchmarks over time, but downside is:

  1. We'll need to manually mark the tests to benchmark with @pytest.mark.benchmark
  2. The tests might run slower, because the benchmarks runs a single test multiple times to get the mean/standard deviation values.

Sounds interesting. Manually marking tests sounds OK to me. For your point 2, since performance is not a high-priority issue, maybe we only do the benchmarks in the main branch. Technically, it's possible to override the @pytest.mark.benchmark marker, so that it does nothing if not using CI in the main branch.

@seisman
Copy link
Member Author

seisman commented Mar 11, 2024

Here are some more completed benchmarks:

import numpy as np
import pandas as pd
from pygmt.clib import Session
from pygmt.helpers import GMTTempFile


# Create a bunch of test files with different number of nrows
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    data = np.random.random((nrows, 3))
    np.savetxt(fname=f"test-{nrows}.txt", X=data)

def tmpfile(fname):
    """
    The old way: write to a temporary file and then read it using pd.read_csv.
    """
    with Session() as lib:
        with GMTTempFile(suffix=".txt") as tmpfile:
            lib.call_module("read", f"{fname} {tmpfile.name} -Td")
            df = pd.read_csv(tmpfile.name, sep=" ", comment=">")

def vfile(fname):
    """
    The new way: write to a virtual file and then convert it to pandas.
    """
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            lib.call_module("read", f"{fname} {vouttbl} -Td")
            df = lib.virtualfile_to_dataset(output_type="pandas", vfname=vouttbl)

Calling the above functions using IPython's magic command %timeit:

for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    print(f"ncols={nrows}")
    fname = f"test-{nrows}.txt"
    %timeit tmpfile(fname)
    %timeit vfile(fname)
    print()

Here are the outputs (on macOS):

nrows=1
18.9 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.5 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=10
18.4 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.3 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=100
17.7 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=1000
22.3 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
16.3 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nrows=10000
41.1 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
23 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=100000
304 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
116 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

nrows=1000000
2.94 s ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
862 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, we can conclude that the vfile method is faster than the tmpfile method, specially for large data files.

@weiji14
Copy link
Member

weiji14 commented Mar 16, 2024

  • x2sys_cross

I had a look at refactoring x2sys_cross to use virtualfiles instead of temporary files, but it's a little tricky because:

  • Input: Cannot pass in virtualfiles as input as mentioned at Wrap x2sys_init and x2sys_cross #546 (comment) and Passing in virtual files into the supplementary x2sys modules gmt#3717, since GMT doesn't support virtualfiles to X2SYS modules

  • Output: The virtualfile_to_dataset method from clib: Add virtualfile_to_dataset method for converting virtualfile to a dataset #3083 was able to produce a pandas.DataFrame output, but the column names were missing. The logic for handling x2sys_cross's output is actually complicated:

    # Read temporary csv output to a pandas table
    if outfile == tmpfile.name: # if outfile isn't set, return pd.DataFrame
    # Read the tab-separated ASCII table
    date_format_kwarg = (
    {"date_format": "ISO8601"}
    if Version(pd.__version__) >= Version("2.0.0")
    else {}
    )
    table = pd.read_csv(
    tmpfile.name,
    sep="\t",
    header=2, # Column names are on 2nd row
    comment=">", # Skip the 3rd row with a ">"
    parse_dates=[2, 3], # Datetimes on 3rd and 4th column
    **date_format_kwarg, # Parse dates in ISO8601 format on pandas>=2
    )
    # Remove the "# " from "# x" in the first column
    table = table.rename(columns={table.columns[0]: table.columns[0][2:]})
    elif outfile != tmpfile.name: # if outfile is set, output in outfile only
    table = None

Important things to handle are:

  1. Datetime columns need to be parsed correctly as datetime64 dtype
  2. x2sys_cross may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and -Qe (external COEs) is selected. Unsure how this is handled in GMT virtualfiles (note that we actually just merge all the multi-segments into one table when pandas.DataFrame output is selected, output to file will preserve the segments though).
  3. Last two column names can either be z_X/z_M or z_1/z_2 depending on whether trackvalues/-Z argument is set.

It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2 since it will involve checking how GMT outputs virtualfiles in x2sys_cross. We'll need to do some careful checking to ensure the refactoring doesn't modify the output and makes it incorrect.

@seisman
Copy link
Member Author

seisman commented Mar 18, 2024

  1. Last two column names can either be z_X/z_M or z_1/z_2 depending on whether trackvalues/-Z argument is set.

The column names are available as table header in GMT_DATASET. So we can parse the table header and get the column names automatically. See https://github.com/GenericMappingTools/pygmt/pull/3117/files for a proof of concept. We just need to borrow some ideas from the pd.read_csv method and carefully design the API.

@seisman
Copy link
Member Author

seisman commented Apr 17, 2024

I'm closing the issue since we have refactored most wrappers using virtual files for output. The only exceptions are info/grdinfo/x2sys_cross/grdcut and will be tracked in their own issue/PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants