GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data #3131

Merged
merged 12 commits into from
Mar 29, 2024
10 changes: 6 additions & 4 deletions pygmt/datatypes/dataset.py
@@ -13,8 +13,8 @@ class _GMT_DATASET(ctp.Structure):  # noqa: N801
     """
     GMT dataset structure for holding multiple tables (files).

-    This class is only meant for internal use by PyGMT and is not exposed to users.
-    See the GMT source code gmt_resources.h for the original C struct definitions.
+    This class is only meant for internal use and is not exposed to users. See the GMT
+    source code ``gmt_resources.h`` for the original C struct definitions.

     Examples
     --------
@@ -151,6 +151,8 @@ def to_dataframe(self) -> pd.DataFrame:
         the same. The same column in all segments of all tables are concatenated. The
         trailing text column is also concatenated as a single string column.

+        If the object has no data, an empty DataFrame will be returned.
+
         Returns
         -------
         df
@@ -185,8 +187,8 @@ def to_dataframe(self) -> pd.DataFrame:
         >>> df.dtypes.to_list()
         [dtype('float64'), dtype('float64'), dtype('float64'), string[python]]
         """
-        # Deal with numeric columns
         vectors = []
+        # Deal with numeric columns
         for icol in range(self.n_columns):
             colvector = []
             for itbl in range(self.n_tables):
@@ -211,5 +213,5 @@ def to_dataframe(self) -> pd.DataFrame:
                 pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype())
             )

-        df = pd.concat(objs=vectors, axis=1)
+        df = pd.concat(objs=vectors, axis=1) if vectors else pd.DataFrame()
Member

@weiji14 weiji14 Mar 22, 2024

An empty pd.DataFrame() won't have any columns. Should there still be columns returned (even if there are no rows)? How would this work with #3117 for example?

Member Author

@seisman seisman Mar 22, 2024

A DataFrame with columns but no rows is still empty. So I guess it's fine.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: []

In [4]: df = pd.DataFrame(columns=None)

In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: []

In [6]: df = pd.DataFrame(columns=["col1", "col2"])

In [7]: df
Out[7]:
Empty DataFrame
Columns: [col1, col2]
Index: []

In [8]: df.empty
Out[8]: True

Member

Column names are set here like so:

pygmt/pygmt/clib/session.py

Lines 1853 to 1861 in 1eb6dec

# Read the virtual file as a GMT dataset and convert to pandas.DataFrame
result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe()
if output_type == "numpy":  # numpy.ndarray output
    return result.to_numpy()
# Assign column names
if column_names is not None:
    result.columns = column_names
return result  # pandas.DataFrame output

So we would do something like:

import pandas as pd

df = pd.DataFrame(columns=None)
assert df.empty
df.columns = ["col1", "col2"]

But setting column names to ["col1", "col2"] errors with:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.columns = ["col1", "col2"]

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:6310, in NDFrame.__setattr__(self, name, value)
   6308 try:
   6309     object.__getattribute__(self, name)
-> 6310     return object.__setattr__(self, name, value)
   6311 except AttributeError:
   6312     pass

File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:813, in NDFrame._set_axis(self, axis, labels)
    808 """
    809 This is called from the cython code when we set the `index` attribute
    810 directly, e.g. `series.index = [1, 2, 3]`.
    811 """
    812 labels = ensure_index(labels)
--> 813 self._mgr.set_axis(axis, labels)
    814 self._clear_item_cache()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
    236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
    237     # Caller is responsible for ensuring we have an Index object.
--> 238     self._validate_set_axis(axis, new_labels)
    239     self.axes[axis] = new_labels

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
     95     pass
     97 elif new_len != old_len:
---> 98     raise ValueError(
     99         f"Length mismatch: Expected axis has {old_len} elements, new "
    100         f"values have {new_len} elements"
    101     )

ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements
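
For anyone following along, here is a minimal pandas-only sketch of the mismatch above. It assumes nothing about PyGMT itself; the column names are just placeholders. Assigning names to a frame that was built without columns fails, while passing the names at construction time gives an empty frame that already carries them.

```python
import pandas as pd

# A DataFrame built without arguments has a zero-length column axis.
df = pd.DataFrame()

# Assigning two names to a zero-length axis raises ValueError.
try:
    df.columns = ["col1", "col2"]
    error_message = None
except ValueError as err:
    error_message = str(err)

# Passing the names when constructing the frame avoids the problem:
# the result has two columns and no rows, and still counts as empty.
named = pd.DataFrame(columns=["col1", "col2"])
```

So whichever place ends up creating the empty frame probably needs to know the column names at that point.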

Member Author

I think we need to refactor GMT_DATASET.to_dataframe() to accept more pandas-specific parameters (e.g., column_names, index_col). Then virtualfile_to_dataset will call it like:

result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe(columns=column_names)
if output_type == "numpy":  # numpy.ndarray output
    return result.to_numpy()

return result  # pandas.DataFrame output
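
A rough sketch of what the tail of such a refactored method might look like, using plain pandas only. The helper name `vectors_to_dataframe` and its signature are hypothetical, made up for illustration, and are not the actual PyGMT API; `vectors` stands for the list of per-column Series built earlier in `to_dataframe()`.

```python
import pandas as pd


def vectors_to_dataframe(vectors, column_names=None):
    """
    Hypothetical stand-in for the last step of to_dataframe().

    ``vectors`` is a list of pandas.Series, one per column. ``column_names``
    is an optional list of names with the same length.
    """
    if not vectors:
        # No data: build the empty frame with the requested names directly,
        # instead of assigning them afterwards (which would raise ValueError).
        return pd.DataFrame(columns=column_names)
    df = pd.concat(objs=vectors, axis=1)
    if column_names is not None:
        df.columns = column_names
    return df


empty = vectors_to_dataframe([], column_names=["x", "y"])
full = vectors_to_dataframe(
    [pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])], column_names=["x", "y"]
)
```

With this shape, the caller no longer has to touch `result.columns`, so the empty case and the non-empty case go through the same code path.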

Member

Right, so moving these lines that handle the column names from virtualfile_to_dataset to to_dataframe:

pygmt/pygmt/clib/session.py

Lines 1861 to 1863 in 62eb5d6

# Assign column names
if column_names is not None:
    result.columns = column_names

Do you want to just do that in #3117? Or have a separate PR to handle this?

Member Author

Better to do it in a separate PR so that #3117 can focus on parsing the column names from header.

Member Author

> Right, so moving these lines that handle the column names from virtualfile_to_dataset to to_dataframe:
>
> pygmt/pygmt/clib/session.py
>
> Lines 1861 to 1863 in 62eb5d6
>
> # Assign column names
> if column_names is not None:
>     result.columns = column_names
>
> Do you want to just do that in #3117? Or have a separate PR to handle this?

Done in #3140, so need to refactor this PR after #3140 is merged.

         return df
83 changes: 83 additions & 0 deletions pygmt/tests/test_datatypes_dataset.py
@@ -0,0 +1,83 @@
"""
Tests for GMT_DATASET data type.
"""

from pathlib import Path

import pandas as pd
import pytest
from pygmt.clib import Session
from pygmt.helpers import GMTTempFile


def dataframe_from_pandas(filepath_or_buffer, sep=r"\s+", comment="#", header=None):
    """
    Read tabular data as a pandas.DataFrame object using pandas.read_csv().

    The parameters have the same meaning as in ``pandas.read_csv()``.
    """
    try:
        df = pd.read_csv(filepath_or_buffer, sep=sep, comment=comment, header=header)
    except pd.errors.EmptyDataError:
        # Return an empty DataFrame if the file has no data
        return pd.DataFrame()

    # By default, pandas reads text strings with whitespaces as multiple columns, but
    # GMT concatenates all trailing text as a single string column. Need to find all
    # string columns (with dtype="object") and combine them into a single string column.
    string_columns = df.select_dtypes(include=["object"]).columns
    if len(string_columns) > 1:
        df[string_columns[0]] = df[string_columns].apply(lambda x: " ".join(x), axis=1)
        df = df.drop(string_columns[1:], axis=1)
    # Convert 'object' to 'string' type
    df = df.convert_dtypes(
        convert_string=True,
        convert_integer=False,
        convert_boolean=False,
        convert_floating=False,
    )
    return df


def dataframe_from_gmt(fname):
Member Author

For reference, GMT provides two special/undocumented modules, read and write (source code in gmt/src/gmtread.c and gmt/src/gmtwrite.c), that can read a file into a GMT object (e.g., reading a tabular file as GMT_DATASET, or reading a grid as GMT_GRID). Currently, we frequently use the special read module in the doctests of the pygmt.clib.session module (similar to lines 46-50 below). We may want to make it public in the future, as is already done in GMT.jl (https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtread-Tuple{String} and https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtwrite).

    """
    Read tabular data as a pandas.DataFrame using a GMT virtual file.
    """
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            lib.call_module("read", f"{fname} {vouttbl} -Td")
            df = lib.virtualfile_to_dataset(vfname=vouttbl)
            return df


@pytest.mark.benchmark
def test_dataset():
    """
    Test the basic functionality of GMT_DATASET.
    """
    with GMTTempFile(suffix=".txt") as tmpfile:
        with Path(tmpfile.name).open(mode="w") as fp:
            print(">", file=fp)
            print("1.0 2.0 3.0 TEXT1 TEXT23", file=fp)
            print("4.0 5.0 6.0 TEXT4 TEXT567", file=fp)
            print(">", file=fp)
            print("7.0 8.0 9.0 TEXT8 TEXT90", file=fp)
            print("10.0 11.0 12.0 TEXT123 TEXT456789", file=fp)

        df = dataframe_from_gmt(tmpfile.name)
        expected_df = dataframe_from_pandas(tmpfile.name, comment=">")
        pd.testing.assert_frame_equal(df, expected_df)


def test_dataset_empty():
    """
    Make sure that an empty DataFrame is returned if a file has no data.
    """
    with GMTTempFile(suffix=".txt") as tmpfile:
        with Path(tmpfile.name).open(mode="w") as fp:
            print("# This is a comment line.", file=fp)

        df = dataframe_from_gmt(tmpfile.name)
        assert df.empty  # Empty DataFrame
        expected_df = dataframe_from_pandas(tmpfile.name)
        pd.testing.assert_frame_equal(df, expected_df)
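
As a side note on why dataframe_from_pandas has to merge string columns, the following pandas-only sketch may help. It reads two of the rows from test_dataset from an in-memory buffer rather than a GMT temp file, and it selects the text columns with `exclude=["number"]` instead of `include=["object"]`; that substitution is an assumption made here so the sketch does not depend on which dtype pandas assigns to strings.

```python
import io

import pandas as pd

text = "1.0 2.0 3.0 TEXT1 TEXT23\n4.0 5.0 6.0 TEXT4 TEXT567\n"

# Whitespace splitting yields five columns: three numeric and two text.
raw = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

# GMT treats everything after the numeric columns as one trailing text
# column, so join the text columns back together with single spaces.
text_cols = raw.select_dtypes(exclude=["number"]).columns
combined = raw.copy()
combined[text_cols[0]] = raw[text_cols].apply(" ".join, axis=1)
combined = combined.drop(columns=text_cols[1:])
```

After the merge, the frame has three numeric columns plus a single text column, which is the layout the GMT side is expected to produce.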