GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data #3131
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
""" | ||
Tests for GMT_DATASET data type. | ||
""" | ||
|
||
from pathlib import Path | ||
|
||
import pandas as pd | ||
import pytest | ||
from pygmt.clib import Session | ||
from pygmt.helpers import GMTTempFile | ||
|
||
|
||
def dataframe_from_pandas(filepath_or_buffer, sep=r"\s+", comment="#", header=None): | ||
""" | ||
Read tabular data as a pandas.DataFrame object using pandas.read_csv(). | ||
|
||
|
||
The parameters have the same meaning as in ``pandas.read_csv()``. | ||
""" | ||
try: | ||
df = pd.read_csv(filepath_or_buffer, sep=sep, comment=comment, header=header) | ||
except pd.errors.EmptyDataError: | ||
# Return an empty DataFrame if the file has no data | ||
|
||
return pd.DataFrame() | ||
|
||
# By default, pandas reads text strings with whitespaces as multiple columns, but | ||
# GMT concatenates all trailing text as a single string column. Need to find all | ||
# string columns (with dtype="object") and combine them into a single string column. | ||
|
||
string_columns = df.select_dtypes(include=["object"]).columns | ||
if len(string_columns) > 1: | ||
df[string_columns[0]] = df[string_columns].apply(lambda x: " ".join(x), axis=1) | ||
df = df.drop(string_columns[1:], axis=1) | ||
# Convert 'object' to 'string' type | ||
df = df.convert_dtypes( | ||
convert_string=True, | ||
convert_integer=False, | ||
convert_boolean=False, | ||
convert_floating=False, | ||
) | ||
return df | ||
|
||
|
||
def dataframe_from_gmt(fname): | ||
For reference, GMT provides two special/undocumented modules |
||
""" | ||
Read tabular data as a pandas.DataFrame using a GMT virtual file. | ||
|
||
""" | ||
with Session() as lib: | ||
with lib.virtualfile_out(kind="dataset") as vouttbl: | ||
lib.call_module("read", f"{fname} {vouttbl} -Td") | ||
df = lib.virtualfile_to_dataset(vfname=vouttbl) | ||
return df | ||
|
||
|
||
@pytest.mark.benchmark | ||
def test_dataset(): | ||
""" | ||
Test the basic functionality of GMT_DATASET. | ||
""" | ||
with GMTTempFile(suffix=".txt") as tmpfile: | ||
with Path(tmpfile.name).open(mode="w") as fp: | ||
print(">", file=fp) | ||
print("1.0 2.0 3.0 TEXT1 TEXT23", file=fp) | ||
print("4.0 5.0 6.0 TEXT4 TEXT567", file=fp) | ||
print(">", file=fp) | ||
print("7.0 8.0 9.0 TEXT8 TEXT90", file=fp) | ||
print("10.0 11.0 12.0 TEXT123 TEXT456789", file=fp) | ||
|
||
df = dataframe_from_gmt(tmpfile.name) | ||
expected_df = dataframe_from_pandas(tmpfile.name, comment=">") | ||
pd.testing.assert_frame_equal(df, expected_df) | ||
|
||
|
||
def test_dataset_empty(): | ||
""" | ||
Make sure that an empty DataFrame is returned if a file has no data. | ||
|
||
""" | ||
with GMTTempFile(suffix=".txt") as tmpfile: | ||
with Path(tmpfile.name).open(mode="w") as fp: | ||
print("# This is a comment line.", file=fp) | ||
|
||
df = dataframe_from_gmt(tmpfile.name) | ||
assert df.empty # Empty DataFrame | ||
expected_df = dataframe_from_pandas(tmpfile.name) | ||
pd.testing.assert_frame_equal(df, expected_df) |
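The two pandas behaviors that `dataframe_from_pandas` relies on can be checked without GMT. The following is a minimal sketch, not the code from the diff: `read_table` is a hypothetical name, it reads from an in-memory string instead of a file, and it picks text columns by testing for non-numeric dtypes rather than `dtype="object"`, on the assumption that this is less sensitive to the installed pandas version.

```python
import io

import pandas as pd


def read_table(text, comment="#"):
    """Pandas-only reader loosely mirroring the helper in the diff (illustrative)."""
    try:
        df = pd.read_csv(io.StringIO(text), sep=r"\s+", comment=comment, header=None)
    except pd.errors.EmptyDataError:
        # Nothing is left to parse once comment lines are skipped.
        return pd.DataFrame()
    # Treat every non-numeric column as part of the trailing text.
    text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if len(text_cols) > 1:
        # Join the pieces back into one string column and drop the rest.
        df[text_cols[0]] = df[text_cols].astype(str).apply(" ".join, axis=1)
        df = df.drop(columns=text_cols[1:])
    return df


empty = read_table("# This is a comment line.\n")
table = read_table("1.0 2.0 3.0 TEXT1 TEXT23\n4.0 5.0 6.0 TEXT4 TEXT567\n")
print(empty.empty, list(table.columns))
print(table[3].tolist())
```

A comment-only input goes through the `EmptyDataError` branch, which is why the expected frame in `test_dataset_empty` has neither rows nor columns.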
An empty `pd.DataFrame()` won't have any columns. Should there still be columns returned (even if there are no rows)? How would this work with #3117, for example?
A DataFrame with columns but no rows is still empty. So I guess it's fine.
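To make the distinction concrete, here is a small pandas-only sketch. The column names are placeholders chosen for illustration. Both frames report `empty`, but `assert_frame_equal` also compares the shape and column labels, so a test comparing them would still fail.

```python
import pandas as pd

no_columns = pd.DataFrame()  # zero rows, zero columns
only_columns = pd.DataFrame(columns=["col1", "col2"])  # zero rows, two columns

print(no_columns.empty, no_columns.shape)
print(only_columns.empty, only_columns.shape)

# assert_frame_equal checks more than emptiness, so the frames do not match.
try:
    pd.testing.assert_frame_equal(no_columns, only_columns)
    frames_match = True
except AssertionError:
    frames_match = False
print(frames_match)
```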
Column names are set here like so:
pygmt/pygmt/clib/session.py, lines 1853 to 1861 in 1eb6dec
So we would do something like:
But setting column names to `["col1", "col2"]` errors with:
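The original traceback is not reproduced here. As a pandas-only illustration, and an assumption about the kind of failure meant, assigning two labels to a frame that has zero columns raises a `ValueError`, whereas `reindex(columns=...)` can attach labels to an empty frame.

```python
import pandas as pd

df = pd.DataFrame()  # zero rows and zero columns

# Direct assignment needs as many labels as there are existing columns.
try:
    df.columns = ["col1", "col2"]
    error = None
except ValueError as err:
    error = err
print(type(error).__name__)

# reindex creates the requested columns instead of relabelling existing ones.
with_names = df.reindex(columns=["col1", "col2"])
print(list(with_names.columns), with_names.empty)
```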
I think we need to refactor `GMT_DATASET.to_dataframe()` to accept more pandas-specific parameters (e.g., `column_names`, `index_col`). Then `virtualfile_to_dataset` will be called like:
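The call proposed in the comment is not preserved here. Purely to illustrate the refactoring idea, the sketch below uses a hypothetical stand-in function, `to_dataframe_sketch`, that takes plain Python lists in place of a GMT dataset. It is not PyGMT's `GMT_DATASET.to_dataframe()`, and the parameter handling is an assumption.

```python
import pandas as pd


def to_dataframe_sketch(vectors, column_names=None, index_col=None):
    """Hypothetical stand-in; not PyGMT's GMT_DATASET.to_dataframe()."""
    if not vectors:
        # No data: still honour the requested column names, if any.
        df = pd.DataFrame(columns=column_names)
    else:
        df = pd.concat([pd.Series(v) for v in vectors], axis=1)
        if column_names is not None:
            df.columns = column_names
    if index_col is not None and len(df.columns) > 0:
        # Use the column at position index_col as the index.
        df = df.set_index(df.columns[index_col])
    return df


empty = to_dataframe_sketch([], column_names=["col1", "col2"])
full = to_dataframe_sketch([[1.0, 2.0], [3.0, 4.0]], column_names=["x", "y"], index_col=0)
print(list(empty.columns), empty.empty)
print(full.index.name, list(full.columns))
```

Handling the names inside the conversion function means the empty case and the populated case go through the same code path.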
Right, so moving these lines that handle the column names from `virtualfile_to_dataset` to `to_dataframe`:
pygmt/pygmt/clib/session.py, lines 1861 to 1863 in 62eb5d6
Do you want to just do that in #3117? Or have a separate PR to handle this?
Better to do it in a separate PR so that #3117 can focus on parsing the column names from header.
Done in #3140, so this PR needs to be refactored after #3140 is merged.