Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent table-like output for PyGMT functions/methods #1318

Closed
9 of 13 tasks
maxrjones opened this issue Jun 4, 2021 · 24 comments
Closed
9 of 13 tasks

Consistent table-like output for PyGMT functions/methods #1318

maxrjones opened this issue Jun 4, 2021 · 24 comments
Assignees
Labels
enhancement Improving an existing feature
Milestone

Comments

@maxrjones
Copy link
Member

maxrjones commented Jun 4, 2021

Originally posted by @weiji14 in #1284 (comment):

Currently, the blockm*/grdtrack modules simply outputs pandas.DataFrame/filenames (see #1099) automatically depending on the input type (2 options). The implementation @willschlitzer made at bf58dc2 and 68fab7c returns either numpy/pandas/str depending on the user input (3 options). Best to be consistent across PyGMT on the types of outputs users can expect.

I am opening this issue to find out what the output format/options should be with the additional of x/y/z in blockmean, blockmedian, and grdtrack, as well as for new methods such as grd2xyz.


Edit by @seisman on Oct 9, 2023:

The PyGMT team has reach an agreement that PyGMT functions/methods should have consistent behavior for table-like outputs, i.e., the output depends on the output_type parameter.

Valid output_type values are:

  • file: output to a file directly (usually need to set outfile)
  • numpy: return a NumPy array (may not always possible because data must be in the same dtype in a 2-D numpy array)
  • pandas: return a pandas.DataFrame (sometimes need to set newcolname)

Here are a list of functions/methods that have table-like outputs:

We need to make sure they behave consistently.

@maxrjones maxrjones added the question Further information is requested label Jun 4, 2021
@maxrjones maxrjones changed the title What should the default table-like output be for PyGMT methods? What should the default table-like output be for PyGMT functions/methods? Jun 4, 2021
@weiji14
Copy link
Member

weiji14 commented Jun 6, 2021

Just wanted to say, this is a similar issue to that in the RAPIDS AI cuml library. See https://docs.rapids.ai/api/cuml/0.19/api.html#output-data-type-configuration. Their solution is to:

  1. By default: 'mirror' the input data format to be the same as the output format as much as possible, i.e. numpy in/numpy out, or pandas.DataFrame in/pandas.DataFrame out.
  2. Override the default behaviour above by having a using_output_type context manager. E.g. set using_output_type=numpy to return numpy regardless of input type.

As you can see, this involves more levels of I/O complexity going from phase 1 to phase 2. I would advocate to move to phase '1' for PyGMT v0.5.0 or v0.6.0, and keep PyGMT's default output type for table-like outputs as pandas.DataFrame (or None if outfile is set) for PyGMT v0.4.0 as per #1099 (i.e phase '0' if you will). Longer term, as PyGMT matures and integrates more with the PyData stack, we may want to consider mimicking cuml's configurable I/O behaviour.

That said, I'm open to other suggestions, if someone has a good idea on how to deal with I/O in PyGMT.

@maxrjones
Copy link
Member Author

Thanks for the helpful reference, @weiji14! I like the mirroring behavior and agree that it would be a good milestone for v0.5.0+.

In my opinion, a parameter to control the output type seems a bit simpler for users than a context manager (examples below). The options for pygmt.using_output_type may also depend on the specific function/method, which could become difficult to document. The context manager would be useful if users want to easily specify the output type for a block of code. Are there other benefits to implementing the configurable I/O behavior with a context manager rather than with function/method parameters?

Context manager example:

with pygmt.using_output_type("numpy"):
   output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30])

Parameter example:

output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30], output_format="numpy")

@weiji14
Copy link
Member

weiji14 commented Jun 8, 2021

You're right that controlling the output type via a parameter (similar to what @willschlitzer did in bf58dc2) would be simpler than a context manager. I don't have a preference for either style, but after doing a bit of digging, it seems like cuml had an output_type parameter at one point, but they deprecated it in favour of the context manager option in rapidsai/cuml#3040 for some reason (though it still seems to be mentioned in places like at https://github.com/rapidsai/cuml/blob/branch-21.08/wiki/python/ESTIMATOR_GUIDE.md#specifying-the-array-output-type).

Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it pandas.DataFrame or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2, but have it in a single standardized place.

@maxrjones
Copy link
Member Author

Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it pandas.DataFrame or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2, but have it in a single standardized place.

Yes, I agree that a helper function is a good idea for the implementation of an output format choice.

@willschlitzer
Copy link
Contributor

willschlitzer commented Jun 15, 2021

I think the helper function would be our best bet (although I'm not too strong with writing decorators). Borrowing/stealing entirely from my commits mentioned above, I think something like this would could be helpful:

def return_table(result, data_format, format_parameter):
    if data_format == "s":
        return result
    data_list = []
    for string_entry in result.strip().split("\n"):
        float_entry = []
        string_list = string_entry.strip().split()
        for i in string_list:
            float_entry.append(float(i))
        data_list.append(float_entry)
    data_array = np.array(data_list)
    if data_format == "a":
        result = data_array
    elif data_format == "d":
        result = pd.DataFrame(data_array)
    else:
        raise GMTInvalidInput(f"""Must specify {format_parameter} as either a, d, or s.""")
    return result

I know it's a little redundant accepting the result argument only to immediately return it if data_format is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the GMTInvalidInput error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.

@weiji14
Copy link
Member

weiji14 commented Jun 15, 2021

The return_table function looks to be on the right track. We can discuss this in the meeting tomorrow (https://forum.generic-mapping-tools.org/t/pygmt-community-zoom-meeting-tuesday-15-june-2021-utc-1700/1762) if there's time.

In terms of your PRs #1284 and #1299 though @willschlitzer, I'd suggest sticking to a str or pandas.DataFrame only output for PyGMT v0.4.0 if you want to get those PRs in on time (and not wait until v0.5.0). I.e. follow the code used by grdtrack and blockm*:

pygmt/pygmt/src/grdtrack.py

Lines 273 to 281 in a10af5d

# Read temporary csv output to a pandas table
if outfile == tmpfile.name: # if user did not set outfile, return pd.DataFrame
try:
column_names = points.columns.to_list() + [newcolname]
result = pd.read_csv(tmpfile.name, sep="\t", names=column_names)
except AttributeError: # 'str' object has no attribute 'columns'
result = pd.read_csv(tmpfile.name, sep="\t", header=None, comment=">")
elif outfile != tmpfile.name: # return None if outfile set, output in outfile
result = None

I know it's a little redundant accepting the result argument only to immediately return it if data_format is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the GMTInvalidInput error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.

You're welcome to start a proof of concept PR, but realistically speaking, I don't expect it to be ready for PyGMT v0.4.0. See this blog post by the RAPIDS AI team (https://medium.com/rapids-ai/making-rapids-cudf-io-readers-and-writers-more-reliable-with-fuzz-testing-5d595aa97389) which talks about the complexity of testing different I/O format combinations. Just want to manage expectations a bit 🙂

@willschlitzer
Copy link
Contributor

@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not getting the table output options for grd2xyz and grdvolume in line with grdtrack for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think grdvolume and grd2xyz are particularly in-demand modules, so I don't think holding off on them for v0.5.0 is especially problematic. My understanding is that the issue with the numpy array is that it cannot accept different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be able to be converted to floating numbers, and be used in a numpy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof of concept pull request as discussed above.

@maxrjones
Copy link
Member Author

@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not getting the table output options for grd2xyz and grdvolume in line with grdtrack for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think grdvolume and grd2xyz are particularly in-demand modules, so I don't think holding off on them for v0.5.0 is especially problematic. My understanding is that the issue with the numpy array is that it cannot accept different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be able to be converted to floating numbers, and be used in a numpy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof of concept pull request as discussed above.

Thanks for the detailed explanation and for submitting the proof-of-concept PR. I understand your points. I'll add some comments on #1336, but I agree with you that it's not likely we will reach consensus and update the other methods/functions with table output in the next few days. So, it seems #1318, #1336, #1299, and #1284 will be on track for v0.5.0.

@maxrjones maxrjones added enhancement Improving an existing feature and removed question Further information is requested labels Jun 25, 2021
@maxrjones maxrjones changed the title What should the default table-like output be for PyGMT functions/methods? Consistent table-like output for PyGMT functions/methods Jun 25, 2021
@maxrjones maxrjones mentioned this issue Jul 1, 2021
5 tasks
@maxrjones
Copy link
Member Author

I support Will wanting to merge in #1299 and #1284 before addressing this issue to keep things moving. At the same time, I would prefer that the argument format for the output_type parameter is final (i.e., will be used for all applicable functions) before merging these PRs to avoid deprecations. Here's the current section:

    output_type : str
        Determine the format the xyz data will be returned in:
            **a**: numpy array
            **d**: pandas DataFrame [Default option]
            **s**: string

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.

Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@willschlitzer
Copy link
Contributor

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.

Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!

@weiji14
Copy link
Member

weiji14 commented Aug 10, 2021

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.
Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!

Agree with having longer names. In this case, I would recommend re-using an already existing convention in the PyData ecosystem (as hinted at #1318 (comment)). Namely that from cuml at https://github.com/rapidsai/cuml/blob/e977f3e4194313a857da96fdd810ee5ef1d742b3/python/cuml/common/memory_utils.py#L346-L374:

Specifically. they have an output_type parameter (similar to the data_format used by @willschlitzer in #1336) that can be either ‘input’, ‘cudf’, ‘cupy’, ‘numpy’ (default = ‘input’). For PyGMT, I would recommend the following output_type names:

  • 'input' (default, which mirrors the input data type whenever possible)
  • 'numpy' (i.e. numpy.array)
  • 'pandas' (i.e. pandas.DataFrame)
  • 'file' (i.e. output to a text file, but we need to tie this with the 'outfile' parameter)

Happy to vote on whether short or long names is preferred. Let's go with 🎉 for short (a/d/s) and 🚀 for long (numpy/pandas/file).

@weiji14 weiji14 pinned this issue Aug 11, 2021
@willschlitzer
Copy link
Contributor

@weiji14 No fair, rockets are way cooler so people will vote that way by default!

But seriously, I think you and @meghanrjones are making better points than I am; I think long names are the way to go.

@willschlitzer
Copy link
Contributor

@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.

@maxrjones
Copy link
Member Author

@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.

Yes, I think we can settle it for long names.

@seisman seisman added this to the 0.5.0 milestone Sep 13, 2021
@seisman
Copy link
Member

seisman commented Sep 13, 2021

Yes, I also support the long names.

@seisman seisman added this to the 0.7.0 milestone Mar 18, 2022
@seisman seisman modified the milestones: 0.7.0, 0.8.0 Jun 23, 2022
@seisman seisman modified the milestones: 0.8.0, 0.9.0 Dec 11, 2022
@seisman seisman modified the milestones: 0.9.0, 0.10.0 Mar 16, 2023
@weiji14 weiji14 removed this from the 0.10.0 milestone Aug 24, 2023
@seisman
Copy link
Member

seisman commented Oct 9, 2023

I think we have reached an agreement (at least partially) about the expected behavior of table-like output. I've revised the top post to better track the progress.

@weiji14
Copy link
Member

weiji14 commented Oct 10, 2023

Just noting that Pandas 2.0 has not just a NumPy-backed arrays, but also PyArrow(C++)-backed arrays (see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes). Also according to PDEP10, Pandas 3.0 will use ArrowDtype instead of object dtype for string columns.

Haven't had time yet to work out what this would mean for our implementation of GMT_DATASET at #2729, but we might need some matrix tests to check that both NumPy and PyArrow-backed pandas.DataFrames work.

@seisman
Copy link
Member

seisman commented Apr 8, 2024

Now most modules already have consistent behaviors for table-like output. There are still four exceptions: x2sys_cross is complicated and need more work; for which/info/grdinfo, file/numpy/pandas output types may make no sense, so they need more discussions and will be tracked in separate issues.

Closing the issue. Great to have consistent table-like output behavior across the whole project!

@seisman seisman closed this as completed Apr 8, 2024
@seisman seisman unpinned this issue Apr 8, 2024
@yvonnefroehlich
Copy link
Member

Closing the issue. Great to have consistent table-like output behavior across the whole project!

Sounds great! Thanks @seisman for all your efforts on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature
Projects
None yet
Development

No branches or pull requests

5 participants