docs #191

Merged
merged 13 commits into from
Sep 5, 2023

Conversation

andrewgsavage
Collaborator

@andrewgsavage andrewgsavage commented Jun 27, 2023

  • Closes # (insert issue number)
  • Executed pre-commit run --all-files with no errors
  • The change is fully covered by automated unit tests
  • Documented in docs/ as appropriate
  • Added an entry to the CHANGES file

https://pint-pandas.readthedocs.io/en/docs/

@MichaelTiemannOSC
Collaborator

Would it be useful to document the general implications and idioms of using ExtensionArrays? I'm gradually learning that when converting an ndarray into a PandasArray, np.nan becomes NA (and vice-versa in the other direction). Helping people understand how NA and np.nan play inside of Quantities, and the most efficient idioms for dealing with them correctly (pd.isna vs. np.isnan) could be very helpful. I could help write it if you tell me where you think it belongs.
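A minimal sketch of the `pd.isna` versus `np.isnan` distinction raised above. It uses a plain object-dtype NumPy array as a stand-in for values held inside a Quantity; that stand-in, and the variable names, are assumptions for illustration rather than anything taken from pint-pandas itself.

```python
import numpy as np
import pandas as pd

# A plain float64 array: np.isnan and pd.isna agree.
floats = np.array([1.0, np.nan, 3.0])
float_mask_np = np.isnan(floats)
float_mask_pd = pd.isna(floats)

# An object array mixing np.nan and pd.NA, as can happen when values
# are held as Python objects rather than native floats.
mixed = np.array([1.0, np.nan, pd.NA], dtype=object)
mixed_mask = pd.isna(mixed)  # recognises both kinds of missing value

# np.isnan has no implementation for object arrays, so it raises.
try:
    np.isnan(mixed)
    isnan_supported = True
except TypeError:
    isnan_supported = False

print(mixed_mask, isnan_supported)
```

In this sketch `pd.isna` is the only call that works on both arrays, which is one argument for preferring it whenever the backing array type is not known in advance.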

@andrewgsavage
Collaborator Author

I don't see many issues relating to nans so I'm wondering if you're encountering this because you're doing less typical workflows. It would be worth making an issue with your findings to understand where they're coming from.

I expect it to be something to do with PintArrays either having PandasArrays or some form of np.array holding the values. I wonder if a better way for uncertainties would be to create an UncertaintyArray that the PintArray can use for values?

@MichaelTiemannOSC
Collaborator

You are one step ahead of me. Last night I put my finger on what seems to be the last problem in my own test cases (the pint_pandas test cases don't trip it). When pd.merge needs to fill unmatched values with NaNs, it was creating invalid ndarrays due to the NaN value I've created. I'll write up findings when I have more to report, but I think I have a handle on a way forward. Thanks!
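For readers who have not hit this, a small sketch of where the fill values come from. It uses plain float columns rather than quantities or uncertainties, and the frame and column names are made up for illustration; it only shows that a left merge fills unmatched rows with a missing value.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "length": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"key": ["a", "b"], "mass": [10.0, 20.0]})

# "c" has no match in `right`, so pandas must fill its "mass" value.
merged = pd.merge(left, right, on="key", how="left")

# pd.isna finds the filled-in entry regardless of the column's dtype.
missing = pd.isna(merged["mass"])
print(merged)
print(missing.tolist())
```

Whatever array type backs the merged column has to be able to hold that fill value, which is where a custom NaN representation can go wrong.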

@MichaelTiemannOSC
Collaborator

I've made a lot of progress working with pd.NA and reading through dtype("O") and validating values as UFloat. I think it might be more elegant to create and use an UncertaintyArray, but I want to try to finish what I've almost got working, then discuss how to possibly make it more elegant with an UncertaintyArray type.

The test cases I'm looking at right now are the complex128 test cases, which, because they are actually EAs, and not ComplexArray types, are tickling what I've done in unexpected ways. Which is a good way to ensure the robustness of what I'm doing, rather than hiding behind a fresh type (I think).

@andrewgsavage
Collaborator Author

has anyone had a chance to look at this?
you can view the docs here
https://pint-pandas.readthedocs.io/en/docs/

@MichaelTiemannOSC
Collaborator

I just submitted some fresh changes to enable testing of complex128 for Pandas 2.1.0rc0+96. Over the past few weeks the pandas team have progressively improved underlying code so that as of today, essentially no special adaptations are required.

I still need to see if I can similarly simplify uncertainties, but I think that when 2.1 comes out things are going to be a lot simpler (both to document and implement).

@MichaelTiemannOSC
Collaborator

Is there a convenient way I can leave comments inline? Like comments on a pull request?

@andrewgsavage
Collaborator Author

> Is there a convenient way I can leave comments inline? Like comments on a pull request?

in the .rst files that have been added

@MichaelTiemannOSC
Collaborator

OK, so I cloned the repo and made a change to getting started, which in my version reads:

The Pandas package provides powerful DataFrame and Series abstractions for dealing with numerical, temporal, categorical, string-based, and even user-defined data (using its ExtensionArray feature). The Pint package provides a rich and extensible vocabulary of units for constructing Quantities and an equally rich and extensible range of unit conversions to make it easy to perform unit-safe calculations using Quantities. Pint-pandas provides PintArray, a Pandas ExtensionArray that efficiently implements Pandas DataFrame and Series functionality as unit-aware operations where appropriate.

Those who have used Pint know well that good units discipline often catches not only simple mistkaes, but sometimes more fundamental errors as well. Pint-pandas can reveal similar errors when it comes to slicing and dicing Pandas data. A 1-dimensional Pandas Series can use a PintArray to hold its values. Columns in 2-dimensional Pandas DataFrame can contain PintArrays--with all the efficiency the ExtensionArray APIs provide, but rows are a special case. If all elements of the row have the same units, the row will be returned as Series backed by a PintArray with those units. But if the units are heterogeneous, the row will be returned as a Series consisting of discrete Quantities (or raw data if the column values don't have units). All Quantity data within such Series will follow Pint rules of unit conversions and will give error messages when units are not compatible, but some error messages may lose information as Pandas tries to align two incompatible Quantities to non-unitized magnitude values. To get the greatest benefit from Pint-pandas (and Pandas in general), make your columns from homogeneous data and let your rows contain the heterogeneous data when necessary.

The reason I'm telling you this in a comment and not with a PR is because I DON'T UNDERSTAND GITHUB!!! I really thought I did the right things in terms of cloning, forking, editing, etc., but GitHub insists on doing things most unintuitive to me. If I can get some help sorting out how to put my carefully placed andrewgsavage/pint-pandas repo into a properly described and defined git place that doesn't make it look like the twin of MichaelTiemannOSC/pint-pandas, I'd appreciate it. I do have hgrecco/pint and hgrecco/pint-pandas properly separated. I just somehow didn't say all the right magic when I tried to make a change relative to your repo as my upstream source.

@andrewgsavage
Collaborator Author

you can add comments inline by going to Files changed, clicking a file, then clicking the blue + that appears when you hover over a line
adding comments like that is fine

@MichaelTiemannOSC
Collaborator

That's a good solution...

@andrewgsavage
Collaborator Author

if you want to make changes in a PR, you'll make a branch under MichaelTiemannOSC/pint-pandas that tracks andrewgsavage/pint-pandas:docs, then make a PR to andrewgsavage/pint-pandas:docs (ie go to https://github.com/andrewgsavage/pint-pandas/pulls )

@andrewgsavage
Collaborator Author

I'll add this bit:

The Pandas package provides powerful DataFrame and Series abstractions for dealing with numerical, temporal, categorical, string-based, and even user-defined data (using its ExtensionArray feature). The Pint package provides a rich and extensible vocabulary of units for constructing Quantities and an equally rich and extensible range of unit conversions to make it easy to perform unit-safe calculations using Quantities. Pint-pandas provides PintArray, a Pandas ExtensionArray that efficiently implements Pandas DataFrame and Series functionality as unit-aware operations where appropriate.

Those who have used Pint know well that good units discipline often catches not only simple mistkaes, but sometimes more fundamental errors as well. Pint-pandas can reveal similar errors when it comes to slicing and dicing Pandas data.

I think this bit is too detailed for the getting started section, but could fit elsewhere:

A 1-dimensional Pandas Series can use a PintArray to hold its values. Columns in 2-dimensional Pandas DataFrame can contain PintArrays--with all the efficiency the ExtensionArray APIs provide, but rows are a special case. If all elements of the row have the same units, the row will be returned as Series backed by a PintArray with those units. But if the units are heterogeneous, the row will be returned as a Series consisting of discrete Quantities (or raw data if the column values don't have units). All Quantity data within such Series will follow Pint rules of unit conversions and will give error messages when units are not compatible, but some error messages may lose information as Pandas tries to align two incompatible Quantities to non-unitized magnitude values. To get the greatest benefit from Pint-pandas (and Pandas in general), make your columns from homogeneous data and let your rows contain the heterogeneous data when necessary.

@MichaelTiemannOSC
Collaborator

Please pass through a spell-check first. I notice I misspelled mistakes!

@andrewgsavage
Collaborator Author

I think this bit is too detailed for the getting started section, but could fit elsewhere
An example would make this clearer and could go under common issues?
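One possible shape for such an example, sketched with pandas' built-in extension dtypes (`Int64`, `string`) standing in for PintArrays with matching and non-matching units. That substitution is an assumption made so the snippet runs without pint-pandas; it illustrates the general ExtensionArray row-versus-column behaviour, not PintArray specifically.

```python
import pandas as pd

# Every column shares one extension dtype (analogous to identical units).
homogeneous = pd.DataFrame({
    "a": pd.array([1, 2], dtype="Int64"),
    "b": pd.array([3, 4], dtype="Int64"),
})

# Columns with different extension dtypes (analogous to differing units).
mixed = pd.DataFrame({
    "count": pd.array([1, 2], dtype="Int64"),
    "label": pd.array(["x", "y"], dtype="string"),
})

# A column always keeps its extension dtype.
column = mixed["count"]

# A row keeps the extension dtype only when all columns agree;
# otherwise pandas falls back to a Series of Python objects.
homogeneous_row = homogeneous.iloc[0]
mixed_row = mixed.iloc[0]

print(column.dtype, homogeneous_row.dtype, mixed_row.dtype)
```

The object-dtype row is the analogue of the "Series consisting of discrete Quantities" described in the text under discussion.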


- `pint <https://github.com/hgrecco/pint>`_ Base package
- `pint-pandas <https://github.com/hgrecco/pint-pandas>`_ Pandas integration
- `pint-xarray <https://github.com/xarray-contrib/pint-xarray>`_ Xarray integration
Collaborator

  • openscm-units <https://github.com/openscm/openscm-units> units related to simple climate modelling
  • iam_units <https://github.com/IAMconsortium/units> Pint-compatible definitions of energy, climate, and related units to supplement the SI and other units included in Pint's default_en.txt


The most common issue pint-pandas users encounter is that they have a DataFrame with columns that aren't PintArrays.
An obvious indicator is unit strings showing in cells when viewing the DataFrame.
Several pandas operations return numpy arrays of ``Quantity`` objects, which can cause this.
Collaborator

@MichaelTiemannOSC MichaelTiemannOSC Aug 28, 2023

Quantity objects within Pandas DataFrames (or Series) will behave like Quantities, meaning that they are subject to unit conversion rules and will raise errors when incompatible units are mixed. But these loose Quantities don't offer the elegance or performance optimizations that come from using PintArrays. And they may give strange error messages as Pandas tries to convert incompatible units to dimensionless magnitudes (which is often prohibited by Pint) rather than naming the incompatibility between the two Quantities in question.

Add:

Creating DataFrames from Series

The default operation of Pandas pd.concat function is to perform row-wise concatenation. When given a list of Series, each of which is backed by a PintArray, this will inefficiently convert all the PintArrays to arrays of object type, concatenate the several series into a DataFrame with that many rows, and then leave it up to you to convert that DataFrame back into column-wise PintArrays. A much more efficient approach is to concatenate Series in a column-wise fashion:

    df = pd.concat(list_of_series, axis=1)

This will preserve all the PintArrays in each of the Series.
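A minimal sketch of the column-wise idiom, using pandas' built-in nullable dtypes (`Int64`, `Float64`) as stand-ins for PintArray-backed Series, on the assumption that the same ExtensionArray mechanics apply. The Series names and values are made up for illustration.

```python
import pandas as pd

# Stand-ins for PintArray-backed Series: each uses a pandas extension dtype.
counts = pd.Series(pd.array([1, 2, 3], dtype="Int64"), name="count")
ratios = pd.Series(pd.array([0.5, 1.5, 2.5], dtype="Float64"), name="ratio")
list_of_series = [counts, ratios]

# Column-wise concatenation: one DataFrame column per input Series.
df = pd.concat(list_of_series, axis=1)

# Each column keeps the extension dtype of the Series it came from.
dtypes_preserved = all(df[s.name].dtype == s.dtype for s in list_of_series)

print(df.dtypes)
print(dtypes_preserved)
```

Checking `df.dtypes` after building a frame this way is a cheap way to confirm that no column has silently fallen back to `object`.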

Collaborator Author

Quantity objects within Pandas DataFrames (or Series) will behave like Quantities, meaning that they are subject to unit conversion rules and will raise errors when incompatible units are mixed. But these loose Quantities don't offer the elegance or performance optimizations that come from using PintArrays. And they may give strange error messages as Pandas tries to convert incompatible units to dimensionless magnitudes (which is often prohibited by Pint) rather than naming the incompatibility between the two Quantities in question.

It took me 2-3 reads to follow this. Referring to both "Quantity objects" and "loose Quantities" is ambiguous (you could argue a PintArray contains Quantity objects, since getitem returns them). Some code showing your points would be clearer. That can be done in another PR.

Collaborator

I think that the benchmark PR I made makes this point really clearly (up to 1000x performance difference). So if/when that lands, we can refer to coding (and performance examples) from that.

@MichaelTiemannOSC
Collaborator

Plot twist: the next version of Pandas (2.1.1? 2.2.0?) will allow EAs to support 2d values, which means that the one-dimensional explanations I've given above will no longer be quite correct. Of course pint-pandas could make the decision that PintArrays are only ever one-dimensional, and we can clean up the text to say that. But we could also allow for the possibility that a whole 2-dimensional DataFrame holds quantities, so that rows and columns can both be shown as quantified, and both can have values set within them via .loc and .iloc while retaining their EA nature.

@andrewgsavage
Collaborator Author

bors r+

@bors
Contributor

bors bot commented Sep 5, 2023

Build succeeded!

@bors bors bot merged commit 068ded0 into hgrecco:master Sep 5, 2023
28 checks passed