Functions that import sample GMT remote files #1436
Functions to load the file are nice, though I am not sure that we should export them or include them in the API docs, since there's no stability guarantee for those files (see https://docs.generic-mapping-tools.org/dev/datasets/remote-data.html#cache-file-updates).

xref GenericMappingTools/gmtserver-admin#100, which has some discussion of alternate options for data files used in examples.

It looks like the consensus is that cache files should be used only in examples. Should the PyGMT examples just call the remote files directly, or should there be a Python function to load the remote files first before they're used in the examples?
This is a good question. I am not sure of the best solution here, but my current thinking is that we should either have a separate namespace for the functions related to GMT's cache data, relative to the earth relief, age, etc. grids, or a single public function like …

I think functions that take a … I'm not familiar with what it means to move the functions to a separate namespace. Does this mean there would be a separate class that would have sub-functions to call different datasets?
100% agree
Not a new class, just a different module structure. So there would still be …
As far as I can tell with the seaborn model, it would look something like this:
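A plausible sketch of that seaborn-style layout (the file and function names below are assumptions, not the actual PyGMT package structure):

```
pygmt/
└── datasets/
    ├── __init__.py         # re-exports the public loader functions
    ├── earth_relief.py     # load_earth_relief(), ...
    └── samples.py          # load_japan_quakes(), load_sample_bathymetry(), ...
```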
Then the function would look something like:
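A minimal sketch of what such a function could look like. The name `load_sample_data`, the registry contents, and the `_which` resolver hook are assumptions for illustration, not actual PyGMT API; in PyGMT the resolver would be `pygmt.which(fname=url, download="c")`, which downloads and caches the remote file.

```python
import pandas as pd


def load_sample_data(name, _which=None):
    """Load a GMT sample dataset into a pandas.DataFrame (illustrative sketch).

    ``_which`` lets callers inject a cache resolver (normally ``pygmt.which``)
    that maps a remote-file name like ``@tut_quakes.ngdc`` to a local path.
    """
    registry = {
        "japan_quakes": dict(url="@tut_quakes.ngdc", header=1, sep=r"\s+"),
    }
    entry = dict(registry[name])  # copy so popping "url" doesn't mutate the registry
    url = entry.pop("url")        # the remaining keys are read_csv options
    fname = _which(url) if _which is not None else url
    return pd.read_csv(fname, **entry)
```

Popping the `url` key before calling `read_csv` matters, because `read_csv` has no `url` keyword.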
Does that look about right?
That looks right. It would also be necessary to decide whether to deprecate …
👍 for me. The seaborn `load_dataset` function is essentially:

```python
def load_dataset(name, **kwargs):
    path = f"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/{name}.csv"
    df = pd.read_csv(path, **kwargs)
    if name == "flights":
        months = df["month"].str[:3]
        df["month"] = pd.Categorical(months, months.unique())
    if name == "titanic":
        df["class"] = pd.Categorical(df["class"], ["First", "Second", "Third"])
        df["deck"] = pd.Categorical(df["deck"], list("ABCDEFG"))
    return df
```

So we could probably get away with one big function with lots of if-statements, rather than have too many small functions. Alternatively, the datasets could be described in a dictionary:

```python
dataset_dict = {
    "japan_quakes": dict(url="@tut_quakes.ngdc", header=1, sep=r"\s+"),
    "sample_bathymetry": dict(
        url="@tut_ship.xyz",
        sep="\t",
        header=None,
        names=["longitude", "latitude", "bathymetry"],
    ),
}
fname = which(fname=dataset_dict[name]["url"], download="c")
dataframe = pd.read_csv(fname, **dataset_dict[name])
```

There's also …
I like the dictionary option: basically a switch/case for Python. I agree with just working on the table datasets for now.
Although the dictionary structure makes the code more compact, I feel that the if-statement option is easier to read, especially for new contributors who want to add examples. BTW, the above code doesn't work, because …
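For comparison, a minimal sketch of the if-statement style being discussed. The function name and the set of datasets are illustrative assumptions; `pygmt.which` with `download="c"` is the real PyGMT call that downloads and caches a remote file, and the imports are placed inside the branches so the dispatch itself can be exercised without fetching anything.

```python
import pandas as pd


def load_sample_table(name):
    """Illustrative if-statement dispatch for sample tables (not actual PyGMT API)."""
    if name == "japan_quakes":
        from pygmt import which

        fname = which(fname="@tut_quakes.ngdc", download="c")  # download and cache
        return pd.read_csv(fname, header=1, sep=r"\s+")
    if name == "sample_bathymetry":
        from pygmt import which

        fname = which(fname="@tut_ship.xyz", download="c")
        return pd.read_csv(
            fname,
            sep="\t",
            header=None,
            names=["longitude", "latitude", "bathymetry"],
        )
    raise ValueError(f"Unknown sample dataset: {name}")
```

Each branch spells out its own `read_csv` call, which is more verbose than the dictionary but leaves each dataset's options in one obvious place for new contributors to copy.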
I think the …

Please note that we already have a …

Can we close the issue?
I've recently opened some pull requests (#1386 and #1420) that import the sample remote files, and I've realized there are a lot of sample files on the GMT server that are used in the GMT examples. Should functions be added that import remote files (e.g. "ternary.txt" and "EGM96_to_36.txt" for `ternary` and `sph2grd`) whenever they are used in a GMT module example? I think it's more Pythonic to have the datasets available as a DataFrame/array than just the text file, and I think those sample files are a great resource when writing a test or making an example, but I could also see it as bad practice to add all of these functions that import little-used sample files.