Migrate to polars #904

Merged
merged 1 commit into from
Jun 4, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ Changelog
Development
***********

- Backend: Migrate from pandas to polars

0.56.2 (11.05.2023)
*******************

Expand Down
67 changes: 38 additions & 29 deletions README.rst
Expand Up @@ -90,15 +90,16 @@ Python feel like a warm summer breeze, similar to other projects like
rdwd_ for the R language, which originally drew our interest in this project.
Our long-term goal is to provide access to multiple weather services as well as other
related agencies such as river measurements. With ``wetterdienst`` we try to use modern
Python technologies all over the place. The library is based on pandas_ across the board,
uses Poetry_ for package administration and GitHub Actions for all things CI.
Python technologies all over the place. The library is based on polars_ (we <3 pandas_, it is still part of some
IO processes) across the board, uses Poetry_ for package administration and GitHub Actions for all things CI.
Our users are an important part of the development as we are not currently using the
data we are providing and only implement what we think would be the best. Therefore
contributions and feedback whether it be data related or library related are very
welcome! Just hand in a PR or Issue if you think we should include a new feature or data
source.

.. _rdwd: https://github.com/brry/rdwd
.. _polars: https://www.pola.rs/
.. _pandas: https://pandas.pydata.org/
.. _Poetry: https://python-poetry.org/

Expand Down Expand Up @@ -178,10 +179,11 @@ license those are published take a look at the data_ section.
Features
********

- API(s) for stations (metadata) and values
- Get station(s) nearby a selected location
- APIs for stations and values
- Get stations nearby a selected location
- Define your request by arguments such as `parameter`, `period`, `resolution`,
`start date`, `end date`
- Define general settings in Settings context
- Command line interface
- Web-API via FastAPI
- Run SQL queries on the results
Expand Down Expand Up @@ -318,8 +320,8 @@ Library

.. code-block:: python

>>> import pandas as pd
>>> pd.options.display.max_columns = 8
>>> import polars as pl
>>> _ = pl.Config.set_tbl_hide_dataframe_shape(True)
>>> from wetterdienst import Settings
>>> from wetterdienst.provider.dwd.observation import DwdObservationRequest
>>> settings = Settings( # default
Expand All @@ -334,29 +336,36 @@ Library
... end_date="2020-01-01", # if not given timezone defaulted to UTC
... settings=settings
... ).filter_by_station_id(station_id=(1048, 4411))
>>> request.df.head() # station list
station_id from_date to_date height \
... 01048 1934-01-01 00:00:00+00:00 ... 00:00:00+00:00 228.0
... 04411 1979-12-01 00:00:00+00:00 ... 00:00:00+00:00 155.0
<BLANKLINE>
latitude longitude name state
... 51.1278 13.7543 Dresden-Klotzsche Sachsen
... 49.9195 8.9671 Schaafheim-Schlierbach Hessen

>>> request.values.all().df.head() # values
station_id dataset parameter date value \
0 01048 climate_summary wind_gust_max 1990-01-01 00:00:00+00:00 NaN
1 01048 climate_summary wind_gust_max 1990-01-02 00:00:00+00:00 NaN
2 01048 climate_summary wind_gust_max 1990-01-03 00:00:00+00:00 5.0
3 01048 climate_summary wind_gust_max 1990-01-04 00:00:00+00:00 9.0
4 01048 climate_summary wind_gust_max 1990-01-05 00:00:00+00:00 7.0
<BLANKLINE>
quality
0 NaN
1 NaN
2 10.0
3 10.0
4 10.0
>>> stations = request.df
>>> stations.head()
┌────────────┬──────────────┬──────────────┬────────┬──────────┬───────────┬─────────────┬─────────┐
│ station_id ┆ from_date ┆ to_date ┆ height ┆ latitude ┆ longitude ┆ name ┆ state │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs, ┆ datetime[μs, ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str │
│ ┆ UTC] ┆ UTC] ┆ ┆ ┆ ┆ ┆ │
╞════════════╪══════════════╪══════════════╪════════╪══════════╪═══════════╪═════════════╪═════════╡
│ 01048 ┆ 1934-01-01 ┆ ... ┆ 228.0 ┆ 51.1278 ┆ 13.7543 ┆ Dresden-Klo ┆ Sachsen │
│ ┆ 00:00:00 UTC ┆ 00:00:00 UTC ┆ ┆ ┆ ┆ tzsche ┆ │
│ 04411 ┆ 1979-12-01 ┆ ... ┆ 155.0 ┆ 49.9195 ┆ 8.9671 ┆ Schaafheim- ┆ Hessen │
│ ┆ 00:00:00 UTC ┆ 00:00:00 UTC ┆ ┆ ┆ ┆ Schlierbach ┆ │
└────────────┴──────────────┴──────────────┴────────┴──────────┴───────────┴─────────────┴─────────┘
>>> values = request.values.all().df
>>> values.head()
┌────────────┬─────────────────┬───────────────┬─────────────────────────┬───────┬─────────┐
│ station_id ┆ dataset ┆ parameter ┆ date ┆ value ┆ quality │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ datetime[μs, UTC] ┆ f64 ┆ f64 │
╞════════════╪═════════════════╪═══════════════╪═════════════════════════╪═══════╪═════════╡
│ 01048 ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-01 00:00:00 UTC ┆ null ┆ null │
│ 01048 ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-02 00:00:00 UTC ┆ null ┆ null │
│ 01048 ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-03 00:00:00 UTC ┆ 5.0 ┆ 10.0 │
│ 01048 ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-04 00:00:00 UTC ┆ 9.0 ┆ 10.0 │
│ 01048 ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-05 00:00:00 UTC ┆ 7.0 ┆ 10.0 │
└────────────┴─────────────────┴───────────────┴─────────────────────────┴───────┴─────────┘

.. code-block:: python

values.to_pandas() # to get a pandas DataFrame and e.g. create some matplotlib plots

Client
======
Expand Down
19 changes: 9 additions & 10 deletions benchmarks/interpolation.py
Expand Up @@ -2,7 +2,7 @@
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import utm
from scipy import interpolate

Expand All @@ -13,8 +13,7 @@
DwdObservationResolution,
)

pd.set_option("display.width", 400)
pd.set_option("display.max_columns", None)
pl.Config.set_tbl_width_chars(400)

"""
example:
Expand Down Expand Up @@ -60,9 +59,9 @@ def request_weather_data(
# request the nearest weather stations
request = stations.filter_by_distance(latlon=(lat, lon), distance=distance)
print(request.df)
station_ids = request.df["station_id"].values.tolist()
latitudes = request.df["latitude"].values.tolist()
longitudes = request.df["longitude"].values.tolist()
station_ids = request.df.get_column("station_id")
latitudes = request.df.get_column("latitude")
longitudes = request.df.get_column("longitude")

utm_x = []
utm_y = []
Expand All @@ -72,16 +71,16 @@ def request_weather_data(
utm_y.append(y)

# request parameter from weather stations
df = request.values.all().df.dropna()
df = request.values.all().df.drop_nulls()

# filters by one exact time and saves the given parameter per station at this time
day_time = start_date + timedelta(days=1)
filtered_df = df[df["date"].astype(str).str[:] == day_time.strftime("%Y-%m-%d %H:%M:%S+00:00")]
filtered_df = df.filter(pl.col("date").eq(day_time))
print(filtered_df)
values = filtered_df["value"].values.tolist()
values = filtered_df.get_column("value").to_list()

return Data(
station_ids=station_ids,
station_ids=station_ids.to_list(),
utm_x=utm_x,
utm_y=utm_y,
values=values,
Expand Down
41 changes: 21 additions & 20 deletions benchmarks/interpolation_over_time.py
Expand Up @@ -2,7 +2,7 @@

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import polars as pl

from wetterdienst import Parameter
from wetterdienst.provider.dwd.observation import (
Expand All @@ -13,7 +13,7 @@
plt.style.use("seaborn")


def get_interpolated_df(parameter: str, start_date: datetime, end_date: datetime) -> pd.DataFrame:
def get_interpolated_df(parameter: str, start_date: datetime, end_date: datetime) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=parameter,
resolution=DwdObservationResolution.HOURLY,
Expand All @@ -23,42 +23,43 @@ def get_interpolated_df(parameter: str, start_date: datetime, end_date: datetime
return stations.interpolate(latlon=(50.0, 8.9)).df


def get_regular_df(parameter: str, start_date: datetime, end_date: datetime, exclude_stations: list) -> pd.DataFrame:
def get_regular_df(parameter: str, start_date: datetime, end_date: datetime, exclude_stations: list) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=parameter,
resolution=DwdObservationResolution.HOURLY,
start_date=start_date,
end_date=end_date,
)
request = stations.filter_by_distance(latlon=(50.0, 8.9), distance=30)
df = request.values.all().df.dropna()
station_ids = df.station_id.tolist()
df = request.values.all().df.drop_nulls()
station_ids = df.get_column("station_id")
first_station_id = set(station_ids).difference(set(exclude_stations)).pop()
return df[df["station_id"] == first_station_id]
return df.filter(pl.col("station_id").eq(first_station_id))


def get_rmse(regular_values: pd.Series, interpolated_values: pd.Series):
diff = (regular_values.reset_index(drop=True) - interpolated_values.reset_index(drop=True)).dropna()
n = diff.size
return ((diff**2).sum() / n) ** 0.5
def get_rmse(regular_values: pl.Series, interpolated_values: pl.Series):
n = regular_values.len()
return (((regular_values - interpolated_values).drop_nulls() ** 2).sum() / n) ** 0.5


def get_corr(regular_values: pd.Series, interpolated_values: pd.Series):
def get_corr(regular_values: pl.Series, interpolated_values: pl.Series):
return np.corrcoef(regular_values.to_list(), interpolated_values.to_list())[0][1].item()


def visualize(parameter: str, unit: str, regular_df: pd.DataFrame, interpolated_df: pd.DataFrame):
rmse = get_rmse(regular_df["value"], interpolated_df["value"])
corr = get_corr(regular_df["value"], interpolated_df["value"])
def visualize(parameter: str, unit: str, regular_df: pl.DataFrame, interpolated_df: pl.DataFrame):
rmse = get_rmse(regular_df.get_column("value"), interpolated_df.get_column("value"))
corr = get_corr(regular_df.get_column("value"), interpolated_df.get_column("value"))
factor = 0.5
plt.figure(figsize=(factor * 19.2, factor * 10.8))
plt.plot(regular_df["date"], regular_df["value"], color="red", label="regular")
plt.plot(interpolated_df["date"], interpolated_df["value"], color="black", label="interpolated")
plt.plot(regular_df.get_column("date"), regular_df.get_column("value"), color="red", label="regular")
plt.plot(
interpolated_df.get_column("date"), interpolated_df.get_column("value"), color="black", label="interpolated"
)
ylabel = f"{parameter.lower()} [{unit}]"
plt.ylabel(ylabel)
title = (
f"rmse: {np.round(rmse, 2)}, corr: {np.round(corr, 2)}\n"
f"station_ids: {interpolated_df['station_ids'].to_list()[0]}"
f"station_ids: {interpolated_df.get_column('station_ids').to_list()[0]}"
)
plt.title(title)
plt.legend()
Expand All @@ -69,10 +70,10 @@ def visualize(parameter: str, unit: str, regular_df: pd.DataFrame, interpolated_
def main():
parameter = Parameter.TEMPERATURE_AIR_MEAN_200.name
unit = "K"
start_date = datetime(2022, 1, 1)
end_date = datetime(2022, 2, 24)
start_date = datetime(2022, 3, 1)
end_date = datetime(2022, 3, 31)
interpolated_df = get_interpolated_df(parameter, start_date, end_date)
exclude_stations = interpolated_df.station_ids[0]
exclude_stations = interpolated_df.get_column("station_ids")[0]
regular_df = get_regular_df(parameter, start_date, end_date, exclude_stations)
visualize(parameter, unit, regular_df, interpolated_df)

Expand Down
22 changes: 11 additions & 11 deletions benchmarks/interpolation_precipitation_difference.py
@@ -1,6 +1,6 @@
from datetime import datetime

import pandas as pd
import polars as pl

from wetterdienst import Parameter
from wetterdienst.provider.dwd.observation import (
Expand All @@ -9,33 +9,33 @@
)


def get_interpolated_df(start_date: datetime, end_date: datetime) -> pd.DataFrame:
def get_interpolated_df(start_date: datetime, end_date: datetime) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.PRECIPITATION_HEIGHT.name,
parameter=Parameter.PRECIPITATION_HEIGHT,
resolution=DwdObservationResolution.DAILY,
start_date=start_date,
end_date=end_date,
)
return stations.interpolate(latlon=(50.0, 8.9)).df


def get_regular_df(start_date: datetime, end_date: datetime, exclude_stations: list) -> pd.DataFrame:
def get_regular_df(start_date: datetime, end_date: datetime, exclude_stations: list) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.PRECIPITATION_HEIGHT.name,
resolution=DwdObservationResolution.DAILY,
start_date=start_date,
end_date=end_date,
)
request = stations.filter_by_distance(latlon=(50.0, 8.9), distance=30)
df = request.values.all().df.dropna()
station_ids = df.station_id.tolist()
df = request.values.all().df.drop_nulls()
station_ids = df.get_column("station_id")
first_station_id = set(station_ids).difference(set(exclude_stations)).pop()
return df[df["station_id"] == first_station_id]
return df.filter(pl.col("station_id").eq(first_station_id))


def calculate_percentage_difference(df: pd.DataFrame, text: str = "") -> float:
total_amount = len(df["value"])
zero_amount = len(df[df["value"] == 0.0])
def calculate_percentage_difference(df: pl.DataFrame, text: str = "") -> float:
total_amount = df.get_column("value").len()
zero_amount = df.filter(pl.col("value").eq(0.0)).height
percentage = zero_amount / total_amount
print(f"{text}: {percentage*100:.2f}% = {zero_amount} of {total_amount} with zero value")
return percentage
Expand All @@ -46,7 +46,7 @@ def main():
end_date = datetime(2022, 1, 1)
interpolated_df = get_interpolated_df(start_date, end_date)
print(interpolated_df)
exclude_stations = interpolated_df.station_ids[0]
exclude_stations = interpolated_df.get_column("station_ids")[0]
regular_df = get_regular_df(start_date, end_date, exclude_stations)
calculate_percentage_difference(regular_df, "regular")
calculate_percentage_difference(interpolated_df, "interpolated")
Expand Down
8 changes: 4 additions & 4 deletions benchmarks/summary_over_time.py
@@ -1,7 +1,7 @@
from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd
import polars as pl

from wetterdienst import Parameter
from wetterdienst.provider.dwd.observation import (
Expand All @@ -12,17 +12,17 @@
plt.style.use("seaborn")


def get_summarized_df(start_date: datetime, end_date: datetime, lat, lon) -> pd.DataFrame:
def get_summarized_df(start_date: datetime, end_date: datetime, lat, lon) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.TEMPERATURE_AIR_MEAN_200,
resolution=DwdObservationResolution.DAILY,
start_date=start_date,
end_date=end_date,
)
return stations.summarize((lat, lon)).df
return stations.summarize(latlon=(lat, lon)).df


def get_regular_df(start_date: datetime, end_date: datetime, station_id) -> pd.DataFrame:
def get_regular_df(start_date: datetime, end_date: datetime, station_id) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.TEMPERATURE_AIR_MEAN_200,
resolution=DwdObservationResolution.DAILY,
Expand Down
7 changes: 7 additions & 0 deletions docs/data/coverage/eaufrance.rst
Expand Up @@ -4,6 +4,13 @@ Eaufrance
Overview
********

License
*******

Check out the `Terms and Conditions`_ of Hubeau for usage conditions.

.. _`Terms and Conditions`: https://hubeau.eaufrance.fr/page/conditions-generales

Products
********

Expand Down
6 changes: 0 additions & 6 deletions docs/data/license/eaufrance.rst

This file was deleted.
