Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAType values avoid using pandas features #227

Closed
meteoDaniel opened this issue Nov 18, 2020 · 13 comments
Closed

NAType values avoid using pandas features #227

meteoDaniel opened this issue Nov 18, 2020 · 13 comments

Comments

@meteoDaniel
Copy link
Contributor

meteoDaniel commented Nov 18, 2020

If you collect data from opendata store like this:

climate_data = DWDObservationData(
    station_ids=[station_id],
    parameters=[DWDObservationParameterSet.PRECIPITATION],
    resolution=DWDObservationResolution.MINUTE_10,
    periods=[DWDObservationPeriod.HISTORICAL],
    humanize_column_names=True
    ).collect_safe()

And you try to work with the values:

climate_data.VALUE.astype(float) 
>>> TypeError: float() argument must be a string or a number, not 'NAType'

or

climate_data.pivot_table(values='VALUE', columns='ELEMENT', index='DATE')
>>>DataError: No numeric types to aggregate

Yield to a dataframe that contains NAType. Why do you do that ? Never seen something like that before. It makes it really unattractive to use the DataFrame. If you are introducing such a parameter I am asking myself why you provide data as a DataFrame? Because several benefits are gone if you are not able to access the data due to suspicious nan type in a value column.

Btw.: You have made several breaking changes in the last weeks but only set minor version plus 1. Versions are normally structured With Major.Minor.Patch . Major changes including breaking changes. Means the main functionality is deprecated due to some changes. .And keep in mind that providing breaking changes every one or two months will annoy users.

@kmuehlbauer
Copy link
Collaborator

related upstream issue

@kmuehlbauer
Copy link
Collaborator

kmuehlbauer commented Nov 18, 2020

Btw.: You have made several breaking changes in the last weeks but only set minor version plus 1. Versions are normally structured With Major.Minor.Patch . Major changes including breaking changes. Means the main functionality is deprecated due to some changes. .And keep in mind that providing breaking changes every one or two months will annoy users.

As long as Major is 0 API isn't supposed to be stable (https://semver.org/#spec-item-4). Just my 2c.

@amotl
Copy link
Member

amotl commented Nov 18, 2020

Dear Daniel,

thanks for reporting this.

NAType issue

@kmuehlbauer: Thanks for quickly referencing the upstream issue pandas-dev/pandas#37626, you are a treasure of gold.
@meteoDaniel: I hear you well this is annoying.

So, shall we also apply to .astype("category") workaround somewhere like suggested by @mlondschien?

In [4]: pd.Series(["1", "2", pd.NA], dtype="string").astype("category").astype("float")  # magic workaround
Out[4]: 
0    1.0
1    2.0
2    NaN
dtype: float64

Versioning and stability

As long as Major is 0 API isn't supposed to be stable.

That's true, we are still in beta mode here. For that reason, even Pandas switched to version 1.0.0 only recently, after being on 0.x.x for a very long time.

Saying that, we do not intend to break things on purpose, but going forward on different levels is still required to fulfill our agenda somehow. So, please bear with us.

And keep in mind that providing breaking changes every one or two months will annoy users.

The most important thing is to provide releases. In this manner, people will be able to nail their dependencies to e.g. wetterdienst==0.10.1.

With kind regards,
Andreas.

P.S.: Still, we might want to adjust our examples accordingly. Ping, @gutzbenj ;].

@meteoDaniel
Copy link
Contributor Author

Btw.: You have made several breaking changes in the last weeks but only set minor version plus 1. Versions are normally structured With Major.Minor.Patch . Major changes including breaking changes. Means the main functionality is deprecated due to some changes. .And keep in mind that providing breaking changes every one or two months will annoy users.

As long as Major is 0 API isn't supposed to be stable (https://semver.org/#spec-item-4). Just my 2c.

Thanks for introducing this :)

@meteoDaniel
Copy link
Contributor Author

@amotl I checked out the example and there .dropna is applied and afterwards all is fine. So NAType is detected as NaN and is dropped. np.nan in pandas is treated as float and that makes it really handsome to work with. I am pretty sure that there was a reason for using NAType so I would like to understand why?!

@kmuehlbauer
Copy link
Collaborator

P.S.: Still, we might want to adjust our examples accordingly.

@amotl So, does that mean, you do not test the examples in your GitHub Actions?

@gutzbenj
Copy link
Member

gutzbenj commented Nov 18, 2020

@kmuehlbauer the notebook ("climate_observations") is executed fully once per test run. But you are right, the regular ".py" examples should be executed once as well to secure that the API is still usable.

@kmuehlbauer
Copy link
Collaborator

@gutzbenj Yes, that would be worth to setup.

@gutzbenj
Copy link
Member

gutzbenj commented Nov 18, 2020

@amotl I checked out the example and there .dropna is applied and afterwards all is fine. So NAType is detected as NaN and is dropped. np.nan in pandas is treated as float and that makes it really handsome to work with. I am pretty sure that there was a reason for using NAType so I would like to understand why?!

We have certain situations where the actual value is a integer (or even string). Currently pandas does

  • allow type conversion to float with arrays that contain NaNs
  • not allow type conversion to integer with arrays that contain NaNs

For this purpose we have used pandas internal DTypes and that would be my guess for the error you are facing here. However your example is missing a station id. I'd like to understand the error you are facing but for that I need the FULL example including the station id.

@meteoDaniel
Copy link
Contributor Author

@gutzbenj I am sorry, I thought I added one: the 44 e.g. . What I am thinking about is to parse NaType to floating nan before return the dataframe?

Okay pandas says they are using nump.nan as a default for nan's but providing own NaN types? Sometimes I am asking myself why we all use pandas :D There is so many bad stuff in it.

@gutzbenj
Copy link
Member

I again had a look at this problem within #260 but had figured atm that there's at least atm no solution but to be honest I also don't see a problem, that we could change within our scope. Firstly, changing the columns dtype is not recommended, as we have already set it to a meaningful dtype within the type coercion. If you still want to do that, pd.to_numeric() should be a better option. Secondly the problems with .pivot_table() are seen elsewhere and do not seem to be related to the NAType.

@gutzbenj gutzbenj closed this as completed Dec 3, 2020
@simonsays1980
Copy link

This is still an issue on pandas==2.0.0 and pandas=1.3.5:

from wetterdienst.provider.dwd.observation import (
    DwdObservationRequest, 
    DwdObservationDataset, 
    DwdObservationPeriod, 
    DwdObservationResolution
)
from wetterdienst import Settings


settings = Settings(ts_shape="long", ts_humanize=True, ts_si_units=False)

parameters = [
    DwdObservationDataset.TEMPERATURE_AIR, # Dry_Bulb
    DwdObservationDataset.DEW_POINT,
    DwdObservationDataset.PRESSURE,
    DwdObservationDataset.PRECIPITATION,
    DwdObservationDataset.SOLAR,
    DwdObservationDataset.WIND,
    DwdObservationDataset.SUN,
    DwdObservationDataset.CLOUDINESS,
    DwdObservationDataset.WEATHER_PHENOMENA
]
request = DwdObservationRequest(
    parameter=parameters,
    resolution=DwdObservationResolution.MINUTE_10,
    start_date="2019-01-01",
    end_date="2020-01-01",
    settings=settings,
)

stations = request.filter_by_station_id(station_id=(2968))

weather_df = stations.values.all()

I am pulling now the data via simple Python requests.

@gutzbenj
Copy link
Member

gutzbenj commented Apr 7, 2023

@simonsays1980 what exactly is the issue you are facing there? I'm currently working on migrating wetterdienst to polars #904 which would bring relief in this missing values situation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants