Get rid of temporary files from pygmt functions and plotting methods #2730
Just tried to see if getting rid of temporary files can make PyGMT faster. The results are quite promising.

For the main branch:

```
In [1]: import pygmt

In [2]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
200 ms ± 3.61 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
213 ms ± 17.9 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
217 ms ± 6.09 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```

For PR #2729:

```
In [2]: import pygmt

In [3]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
146 ms ± 15.6 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [4]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
132 ms ± 1.81 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)

In [5]: %timeit -n 10 -r 5 xyz = pygmt.grd2xyz("@earth_relief_01d_g", output_type="pandas")
131 ms ± 1.42 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)
```
Nice, good to see those benchmarks! I've been wondering if we should use pytest-benchmark to get an idea of speedup over time. There's this
Maybe we can selectively mark some of the tests we want to track from the list above at #2730 (comment)? I can try to set up the CI for this in the coming weekend. A rough sketch of what a marked test could look like is below.
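For instance, a minimal sketch of a marked benchmark test using pytest-benchmark; the test name and body are illustrative, not an existing PyGMT test:

```python
# Illustrative sketch of marking a test for benchmarking with pytest-benchmark.
# PyGMT's actual benchmark setup may differ.
import pytest

import pygmt


@pytest.mark.benchmark(group="grd2xyz")  # marker lets CI select tests via "-m benchmark"
def test_benchmark_grd2xyz(benchmark):
    # The "benchmark" fixture calls the function repeatedly and records timing stats.
    xyz = benchmark(pygmt.grd2xyz, "@earth_relief_01d_g", output_type="pandas")
    assert not xyz.empty
```

Marked tests could then be run selectively in CI, e.g. `pytest -m benchmark --benchmark-only`.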
Sounds interesting. Manually marking tests sounds OK to me. For your point 2, since performance is not a high-priority issue, maybe we only do the benchmarks in the main branch. Technically, it's possible to override the
Here are some more complete benchmarks:

```python
import numpy as np
import pandas as pd

from pygmt.clib import Session
from pygmt.helpers import GMTTempFile

# Create a bunch of test files with different numbers of rows.
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    data = np.random.random((nrows, 3))
    np.savetxt(fname=f"test-{nrows}.txt", X=data)


def tmpfile(fname):
    """
    The old way: write to a temporary file and then read it using pd.read_csv.
    """
    with Session() as lib:
        with GMTTempFile(suffix=".txt") as tmpfile:
            lib.call_module("read", f"{fname} {tmpfile.name} -Td")
            df = pd.read_csv(tmpfile.name, sep=" ", comment=">")


def vfile(fname):
    """
    The new way: write to a virtual file and then convert it to pandas.
    """
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            lib.call_module("read", f"{fname} {vouttbl} -Td")
            df = lib.virtualfile_to_dataset(output_type="pandas", vfname=vouttbl)
```

Calling the above functions using IPython's magic command:

```
for nrows in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    print(f"nrows={nrows}")
    fname = f"test-{nrows}.txt"
    %timeit tmpfile(fname)
    %timeit vfile(fname)
    print()
```

Here are the outputs (on macOS):
So, we can conclude that the
I had a look at refactoring
Important things to handle are:
It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2 since it will involve checking how GMT outputs virtualfiles in
The column names are available in the table header of GMT_DATASET, so we can parse the table header and get the column names automatically. See https://github.com/GenericMappingTools/pygmt/pull/3117/files for a proof of concept. We just need to borrow some ideas from the
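As a rough illustration of the header-parsing idea (the header string and helper below are hypothetical, not the proof-of-concept code from #3117):

```python
# Hypothetical sketch: recover column names from a GMT table header line.
# GMT keeps the original column names in the dataset's table header,
# e.g. "# longitude\tlatitude\televation"; stripping the comment marker
# and splitting on whitespace yields the pandas column labels.
def parse_column_names(header: str) -> list[str]:
    return header.strip().lstrip("#").split()


print(parse_column_names("# longitude\tlatitude\televation"))
# ['longitude', 'latitude', 'elevation']
```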
I'm closing the issue since we have refactored most wrappers using virtual files for output. The only exceptions are
One of the project goals is to interface with the GMT C API directly using ctypes, without system calls. This lets us work on data in memory, which is more efficient than working on files on disk.
However, PyGMT currently still writes tables/grids/CPTs into temporary files. With PRs #2729 and #2398 (not merged yet), we can write data into memory and then work on the in-memory data directly. Thus, it's possible to get rid of temporary files completely.
This issue report is the central place to track all the functions/methods that need to be refactored.
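For context, a minimal sketch contrasting the two approaches (illustrative only; the subprocess variant is exactly what PyGMT avoids, and the grdinfo call is just an example module):

```python
# System call vs. direct ctypes call into the GMT C API.
import subprocess

from pygmt.clib import Session

# System call: spawns a child process, and input/output must round-trip
# through files on disk.
subprocess.run(["gmt", "grdinfo", "@earth_relief_01d_g"], check=True)

# ctypes route: runs in-process via libgmt; combined with virtual files,
# data can stay in memory instead of touching the disk.
with Session() as lib:
    lib.call_module("grdinfo", "@earth_relief_01d_g")
```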
Modules writing tables

- blockm*: pygmt.blockm*: Add 'output_type' parameter for output in pandas/numpy/file formats #3103
- filter1d: pygmt.filter1d: Improve performance by storing output in virtual files #3085
- grdinfo: [Will be tracked in Better return values for grdinfo #593]
- grdtrack: pygmt.grdtrack: Add 'output_type' parameter for output in pandas/numpy/file formats #3106
- triangulate: pygmt.triangulate.delaunay_triples: Improve performance by storing output in virtual files #3107
- grdvolume: pygmt.grdvolume: Refactor to store output in virtual files instead of temporary files #3102
- select: pygmt.select: Improve performance by storing output in virtual files #3108
- which: pygmt.which: Refactor to get rid of temporary files #3148
- grdhisteq: pygmt.grdhisteq.compute_bins: Refactor to store output in virtual files instead of temporary files #3109
- x2sys_cross: [Will be tracked in x2sys_cross: Refactor to get rid of temporary files and have consistent table-like output behavior #3160]
- info: [Will be tracked in Better return value for pygmt.info #3159]
- project: pygmt.project: Add 'output_type' parameter for output in pandas/numpy/file formats #3110
- grd2xyz: pygmt.grd2xyz: Improve performance by storing output in virtual files #3097

Modules writing grids
Most modules are refactored in #2398, except grdcut.

- dimfilter
- grdclip
- sphinterpolate
- grdsample
- grdfilter
- grdproject
- grdgradient
- grdfill
- triangulate
- sphdistance
- xyz2grd
- grdcut [**BREAKING** pygmt.grdcut: Refactor to store output in virtualfiles for grids #3115]
- grdhisteq
- nearneighbor
- sph2grd
- grdlandmask
- binstats
- surface