Get rid of temporary files from pygmt functions and plotting methods #2730
Just tried to see if getting rid of temporary files can make PyGMT faster. The results are quite promising.

For the main branch:

```
In [1]: import pygmt

In [2]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
200 ms ± 3.61 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
213 ms ± 17.9 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
217 ms ± 6.09 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```

For PR #2729:

```
In [2]: import pygmt

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
146 ms ± 15.6 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
132 ms ± 1.81 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [5]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
131 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```
Nice, good to see those benchmarks! I've been wondering if we should use pytest-benchmark to get an idea of speedup over time. There's this
Maybe we can selectively mark some of the tests we want to track from the list above at #2730 (comment)? I can try to set up the CI for this in the coming weekend. A rough sketch of what a marked test could look like is below.
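For instance, a minimal sketch of a marked benchmark test using pytest-benchmark; the test name and body are illustrative, not an existing PyGMT test:

```python
# Illustrative sketch of marking a test for benchmarking with pytest-benchmark.
# PyGMT's actual benchmark setup may differ.
import pytest

import pygmt


@pytest.mark.benchmark(group="grd2xyz")  # marker lets CI select tests via "-m benchmark"
def test_benchmark_grd2xyz(benchmark):
    # The "benchmark" fixture calls the function repeatedly and records timing stats.
    xyz = benchmark(pygmt.grd2xyz, "@earth_relief_01d_g", output_type="pandas")
    assert not xyz.empty
```

Marked tests could then be run selectively in CI, e.g. `pytest -m benchmark --benchmark-only`.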
Sounds interesting. Manually marking tests sounds OK to me. For your point 2, since performance is not a high-priority issue, maybe we only do the benchmarks in the main branch. Technically, it's possible to override the
Here are some more complete benchmarks:

```python
import numpy as np
import pandas as pd

from pygmt.clib import Session
from pygmt.helpers import GMTTempFile

# Create a bunch of test files with different numbers of rows.
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    data = np.random.random((nrows, 3))
    np.savetxt(fname=f"test-{nrows}.txt", X=data)


def tmpfile(fname):
    """
    The old way: write to a temporary file and then read it using pd.read_csv.
    """
    with Session() as lib:
        with GMTTempFile(suffix=".txt") as tmpfile:
            lib.call_module("read", f"{fname} {tmpfile.name} -Td")
            df = pd.read_csv(tmpfile.name, sep=" ", comment=">")


def vfile(fname):
    """
    The new way: write to a virtual file and then convert it to pandas.
    """
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            lib.call_module("read", f"{fname} {vouttbl} -Td")
            df = lib.virtualfile_to_dataset(output_type="pandas", vfname=vouttbl)
```

Calling the above functions using IPython's magic command:

```
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    print(f"nrows={nrows}")
    fname = f"test-{nrows}.txt"
    %timeit tmpfile(fname)
    %timeit vfile(fname)
    print()
```

Here are the outputs (on macOS):
So, we can conclude that the
I had a look at refactoring
Important things to handle are:
It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2 since it will involve checking how GMT outputs virtualfiles in
The column names are available in the table header of GMT_DATASET, so we can parse the table header and get the column names automatically. See https://github.com/GenericMappingTools/pygmt/pull/3117/files for a proof of concept. We just need to borrow some ideas from the
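As a rough illustration of the header-parsing idea (the header string and helper below are hypothetical, not the proof-of-concept code from #3117):

```python
# Hypothetical sketch: recover column names from a GMT table header line.
# GMT keeps the original column names in the dataset's table header,
# e.g. "# longitude\tlatitude\televation"; stripping the comment marker
# and splitting on whitespace yields the pandas column labels.
def parse_column_names(header: str) -> list[str]:
    return header.strip().lstrip("#").split()


print(parse_column_names("# longitude\tlatitude\televation"))
# ['longitude', 'latitude', 'elevation']
```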
I'm closing the issue since we have refactored most wrappers using virtual files for output. The only exceptions are
One of the project goals is to interface with the GMT C API directly using ctypes, without system calls. This lets us work on data in memory, which is more efficient than working on files on disk.
However, PyGMT currently still writes tables/grids/CPTs into temporary files. With PRs #2729 and #2398 (not merged yet), we can write data into memory and then work on the in-memory data directly. Thus, it's possible to get rid of temporary files completely.
This issue report is the central place to track all the functions/methods that need to be refactored.
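For context, a minimal sketch contrasting the two approaches (illustrative only; the subprocess variant is exactly what PyGMT avoids, and the grdinfo call is just an example module):

```python
# System call vs. direct ctypes call into the GMT C API.
import subprocess

from pygmt.clib import Session

# System call: spawns a child process, and input/output must round-trip
# through files on disk.
subprocess.run(["gmt", "grdinfo", "@earth_relief_01d_g"], check=True)

# ctypes route: runs in-process via libgmt; combined with virtual files,
# data can stay in memory instead of touching the disk.
with Session() as lib:
    lib.call_module("grdinfo", "@earth_relief_01d_g")
```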
Modules writing tables

- blockm*: pygmt.blockm*: Add 'output_type' parameter for output in pandas/numpy/file formats #3103
- filter1d: pygmt.filter1d: Improve performance by storing output in virtual files #3085
- grdinfo: [Will be tracked in Better return values for grdinfo #593]
- grdtrack: pygmt.grdtrack: Add 'output_type' parameter for output in pandas/numpy/file formats #3106
- triangulate: pygmt.triangulate.delaunay_triples: Improve performance by storing output in virtual files #3107
- grdvolume: pygmt.grdvolume: Refactor to store output in virtual files instead of temporary files #3102
- select: pygmt.select: Improve performance by storing output in virtual files #3108
- which: pygmt.which: Refactor to get rid of temporary files #3148
- grdhisteq: pygmt.grdhisteq.compute_bins: Refactor to store output in virtual files instead of temporary files #3109
- x2sys_cross: [Will be tracked in x2sys_cross: Refactor to get rid of temporary files and have consistent table-like output behavior #3160]
- info: [Will be tracked in Better return value for pygmt.info #3159]
- project: pygmt.project: Add 'output_type' parameter for output in pandas/numpy/file formats #3110
- grd2xyz: pygmt.grd2xyz: Improve performance by storing output in virtual files #3097

Modules writing grids
Most modules are refactored in #2398, except grdcut.

- dimfilter
- grdclip
- sphinterpolate
- grdsample
- grdfilter
- grdproject
- grdgradient
- grdfill
- triangulate
- sphdistance
- xyz2grd
- grdcut [**BREAKING** pygmt.grdcut: Refactor to store output in virtualfiles for grids #3115]
- grdhisteq
- nearneighbor
- sph2grd
- grdlandmask
- binstats
- surface