Backend: Migrate from pandas to polars
gutzbenj committed Jun 4, 2023
1 parent d3bf4fe commit 0c9b5a2
Showing 84 changed files with 3,562 additions and 4,612 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.rst
@@ -4,6 +4,8 @@ Changelog
Development
***********

- Backend: Migrate from pandas to polars

0.56.2 (11.05.2023)
*******************

67 changes: 38 additions & 29 deletions README.rst
@@ -90,15 +90,16 @@ Python feel like a warm summer breeze, similar to other projects like
rdwd_ for the R language, which originally drew our interest in this project.
Our long-term goal is to provide access to multiple weather services as well as other
related agencies such as river measurements. With ``wetterdienst`` we try to use modern
Python technologies all over the place. The library is based on pandas_ across the board,
uses Poetry_ for package administration and GitHub Actions for all things CI.
Python technologies all over the place. The library is based on polars_ (we <3 pandas_, it is still part of some
IO processes) across the board, uses Poetry_ for package administration and GitHub Actions for all things CI.
Our users are an important part of the development as we are not currently using the
data we are providing and only implement what we think would be the best. Therefore
contributions and feedback whether it be data related or library related are very
welcome! Just hand in a PR or Issue if you think we should include a new feature or data
source.

.. _rdwd: https://github.com/brry/rdwd
.. _polars: https://www.pola.rs/
.. _pandas: https://pandas.pydata.org/
.. _Poetry: https://python-poetry.org/

@@ -178,10 +179,11 @@ license those are published take a look at the data_ section.
Features
********

- API(s) for stations (metadata) and values
- Get station(s) nearby a selected location
- APIs for stations and values
- Get stations nearby a selected location
- Define your request by arguments such as `parameter`, `period`, `resolution`,
`start date`, `end date`
- Define general settings in Settings context
- Command line interface
- Web-API via FastAPI
- Run SQL queries on the results
@@ -318,8 +320,8 @@ Library

.. code-block:: python
>>> import pandas as pd
>>> pd.options.display.max_columns = 8
>>> import polars as pl
>>> _ = pl.Config.set_tbl_hide_dataframe_shape(True)
>>> from wetterdienst import Settings
>>> from wetterdienst.provider.dwd.observation import DwdObservationRequest
>>> settings = Settings( # default
@@ -334,29 +336,36 @@ Library
... end_date="2020-01-01", # if not given timezone defaulted to UTC
... settings=settings
... ).filter_by_station_id(station_id=(1048, 4411))
>>> request.df.head() # station list
station_id from_date to_date height \
... 01048 1934-01-01 00:00:00+00:00 ... 00:00:00+00:00 228.0
... 04411 1979-12-01 00:00:00+00:00 ... 00:00:00+00:00 155.0
<BLANKLINE>
latitude longitude name state
... 51.1278 13.7543 Dresden-Klotzsche Sachsen
... 49.9195 8.9671 Schaafheim-Schlierbach Hessen
>>> request.values.all().df.head() # values
station_id dataset parameter date value \
0 01048 climate_summary wind_gust_max 1990-01-01 00:00:00+00:00 NaN
1 01048 climate_summary wind_gust_max 1990-01-02 00:00:00+00:00 NaN
2 01048 climate_summary wind_gust_max 1990-01-03 00:00:00+00:00 5.0
3 01048 climate_summary wind_gust_max 1990-01-04 00:00:00+00:00 9.0
4 01048 climate_summary wind_gust_max 1990-01-05 00:00:00+00:00 7.0
<BLANKLINE>
quality
0 NaN
1 NaN
2 10.0
3 10.0
4 10.0
>>> stations = request.df
>>> stations.head()
┌────────────┬──────────────┬──────────────┬────────┬──────────┬───────────┬─────────────┬─────────┐
│ station_id ┆ from_date    ┆ to_date      ┆ height ┆ latitude ┆ longitude ┆ name        ┆ state   │
│ ---        ┆ ---          ┆ ---          ┆ ---    ┆ ---      ┆ ---       ┆ ---         ┆ ---     │
│ str        ┆ datetime[μs, ┆ datetime[μs, ┆ f64    ┆ f64      ┆ f64       ┆ str         ┆ str     │
│            ┆ UTC]         ┆ UTC]         ┆        ┆          ┆           ┆             ┆         │
╞════════════╪══════════════╪══════════════╪════════╪══════════╪═══════════╪═════════════╪═════════╡
│ 01048      ┆ 1934-01-01   ┆ ...          ┆ 228.0  ┆ 51.1278  ┆ 13.7543   ┆ Dresden-Klo ┆ Sachsen │
│            ┆ 00:00:00 UTC ┆ 00:00:00 UTC ┆        ┆          ┆           ┆ tzsche      ┆         │
│ 04411      ┆ 1979-12-01   ┆ ...          ┆ 155.0  ┆ 49.9195  ┆ 8.9671    ┆ Schaafheim- ┆ Hessen  │
│            ┆ 00:00:00 UTC ┆ 00:00:00 UTC ┆        ┆          ┆           ┆ Schlierbach ┆         │
└────────────┴──────────────┴──────────────┴────────┴──────────┴───────────┴─────────────┴─────────┘
>>> values = request.values.all().df
>>> values.head()
┌────────────┬─────────────────┬───────────────┬─────────────────────────┬───────┬─────────┐
│ station_id ┆ dataset         ┆ parameter     ┆ date                    ┆ value ┆ quality │
│ ---        ┆ ---             ┆ ---           ┆ ---                     ┆ ---   ┆ ---     │
│ str        ┆ str             ┆ str           ┆ datetime[μs, UTC]       ┆ f64   ┆ f64     │
╞════════════╪═════════════════╪═══════════════╪═════════════════════════╪═══════╪═════════╡
│ 01048      ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-01 00:00:00 UTC ┆ null  ┆ null    │
│ 01048      ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-02 00:00:00 UTC ┆ null  ┆ null    │
│ 01048      ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-03 00:00:00 UTC ┆ 5.0   ┆ 10.0    │
│ 01048      ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-04 00:00:00 UTC ┆ 9.0   ┆ 10.0    │
│ 01048      ┆ climate_summary ┆ wind_gust_max ┆ 1990-01-05 00:00:00 UTC ┆ 7.0   ┆ 10.0    │
└────────────┴─────────────────┴───────────────┴─────────────────────────┴───────┴─────────┘
.. code-block:: python
values.to_pandas() # to get a pandas DataFrame and e.g. create some matplotlib plots
Client
======
19 changes: 9 additions & 10 deletions benchmarks/interpolation.py
@@ -2,7 +2,7 @@
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import utm
from scipy import interpolate

@@ -13,8 +13,7 @@
DwdObservationResolution,
)

pd.set_option("display.width", 400)
pd.set_option("display.max_columns", None)
pl.Config.set_tbl_width_chars(400)

"""
example:
@@ -60,9 +59,9 @@ def request_weather_data(
# request the nearest weather stations
request = stations.filter_by_distance(latlon=(lat, lon), distance=distance)
print(request.df)
station_ids = request.df["station_id"].values.tolist()
latitudes = request.df["latitude"].values.tolist()
longitudes = request.df["longitude"].values.tolist()
station_ids = request.df.get_column("station_id")
latitudes = request.df.get_column("latitude")
longitudes = request.df.get_column("longitude")

utm_x = []
utm_y = []
@@ -72,16 +71,16 @@ def request_weather_data(
utm_y.append(y)

# request parameter from weather stations
df = request.values.all().df.dropna()
df = request.values.all().df.drop_nulls()

# filters by one exact time and saves the given parameter per station at this time
day_time = start_date + timedelta(days=1)
filtered_df = df[df["date"].astype(str).str[:] == day_time.strftime("%Y-%m-%d %H:%M:%S+00:00")]
filtered_df = df.filter(pl.col("date").eq(day_time))
print(filtered_df)
values = filtered_df["value"].values.tolist()
values = filtered_df.get_column("value").to_list()

return Data(
station_ids=station_ids,
station_ids=station_ids.to_list(),
utm_x=utm_x,
utm_y=utm_y,
values=values,
41 changes: 21 additions & 20 deletions benchmarks/interpolation_over_time.py
@@ -2,7 +2,7 @@

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import polars as pl

from wetterdienst import Parameter
from wetterdienst.provider.dwd.observation import (
@@ -13,7 +13,7 @@
plt.style.use("seaborn")


def get_interpolated_df(parameter: str, start_date: datetime, end_date: datetime) -> pd.DataFrame:
def get_interpolated_df(parameter: str, start_date: datetime, end_date: datetime) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=parameter,
resolution=DwdObservationResolution.HOURLY,
@@ -23,42 +23,43 @@ def get_interpolated_df(parameter: str, start_date: datetime, end_date: datetime
return stations.interpolate(latlon=(50.0, 8.9)).df


def get_regular_df(parameter: str, start_date: datetime, end_date: datetime, exclude_stations: list) -> pd.DataFrame:
def get_regular_df(parameter: str, start_date: datetime, end_date: datetime, exclude_stations: list) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=parameter,
resolution=DwdObservationResolution.HOURLY,
start_date=start_date,
end_date=end_date,
)
request = stations.filter_by_distance(latlon=(50.0, 8.9), distance=30)
df = request.values.all().df.dropna()
station_ids = df.station_id.tolist()
df = request.values.all().df.drop_nulls()
station_ids = df.get_column("station_id")
first_station_id = set(station_ids).difference(set(exclude_stations)).pop()
return df[df["station_id"] == first_station_id]
return df.filter(pl.col("station_id").eq(first_station_id))


def get_rmse(regular_values: pd.Series, interpolated_values: pd.Series):
diff = (regular_values.reset_index(drop=True) - interpolated_values.reset_index(drop=True)).dropna()
n = diff.size
return ((diff**2).sum() / n) ** 0.5
def get_rmse(regular_values: pl.Series, interpolated_values: pl.Series):
n = regular_values.len()
return (((regular_values - interpolated_values).drop_nulls() ** 2).sum() / n) ** 0.5


def get_corr(regular_values: pd.Series, interpolated_values: pd.Series):
def get_corr(regular_values: pl.Series, interpolated_values: pl.Series):
return np.corrcoef(regular_values.to_list(), interpolated_values.to_list())[0][1].item()


def visualize(parameter: str, unit: str, regular_df: pd.DataFrame, interpolated_df: pd.DataFrame):
rmse = get_rmse(regular_df["value"], interpolated_df["value"])
corr = get_corr(regular_df["value"], interpolated_df["value"])
def visualize(parameter: str, unit: str, regular_df: pl.DataFrame, interpolated_df: pl.DataFrame):
rmse = get_rmse(regular_df.get_column("value"), interpolated_df.get_column("value"))
corr = get_corr(regular_df.get_column("value"), interpolated_df.get_column("value"))
factor = 0.5
plt.figure(figsize=(factor * 19.2, factor * 10.8))
plt.plot(regular_df["date"], regular_df["value"], color="red", label="regular")
plt.plot(interpolated_df["date"], interpolated_df["value"], color="black", label="interpolated")
plt.plot(regular_df.get_column("date"), regular_df.get_column("value"), color="red", label="regular")
plt.plot(
interpolated_df.get_column("date"), interpolated_df.get_column("value"), color="black", label="interpolated"
)
ylabel = f"{parameter.lower()} [{unit}]"
plt.ylabel(ylabel)
title = (
f"rmse: {np.round(rmse, 2)}, corr: {np.round(corr, 2)}\n"
f"station_ids: {interpolated_df['station_ids'].to_list()[0]}"
f"station_ids: {interpolated_df.get_column('station_ids').to_list()[0]}"
)
plt.title(title)
plt.legend()
@@ -69,10 +70,10 @@ def visualize(parameter: str, unit: str, regular_df: pd.DataFrame, interpolated_
def main():
parameter = Parameter.TEMPERATURE_AIR_MEAN_200.name
unit = "K"
start_date = datetime(2022, 1, 1)
end_date = datetime(2022, 2, 24)
start_date = datetime(2022, 3, 1)
end_date = datetime(2022, 3, 31)
interpolated_df = get_interpolated_df(parameter, start_date, end_date)
exclude_stations = interpolated_df.station_ids[0]
exclude_stations = interpolated_df.get_column("station_ids")[0]
regular_df = get_regular_df(parameter, start_date, end_date, exclude_stations)
visualize(parameter, unit, regular_df, interpolated_df)

22 changes: 11 additions & 11 deletions benchmarks/interpolation_precipitation_difference.py
@@ -1,6 +1,6 @@
from datetime import datetime

import pandas as pd
import polars as pl

from wetterdienst import Parameter
from wetterdienst.provider.dwd.observation import (
@@ -9,33 +9,33 @@
)


def get_interpolated_df(start_date: datetime, end_date: datetime) -> pd.DataFrame:
def get_interpolated_df(start_date: datetime, end_date: datetime) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.PRECIPITATION_HEIGHT.name,
parameter=Parameter.PRECIPITATION_HEIGHT,
resolution=DwdObservationResolution.DAILY,
start_date=start_date,
end_date=end_date,
)
return stations.interpolate(latlon=(50.0, 8.9)).df


def get_regular_df(start_date: datetime, end_date: datetime, exclude_stations: list) -> pd.DataFrame:
def get_regular_df(start_date: datetime, end_date: datetime, exclude_stations: list) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.PRECIPITATION_HEIGHT.name,
resolution=DwdObservationResolution.DAILY,
start_date=start_date,
end_date=end_date,
)
request = stations.filter_by_distance(latlon=(50.0, 8.9), distance=30)
df = request.values.all().df.dropna()
station_ids = df.station_id.tolist()
df = request.values.all().df.drop_nulls()
station_ids = df.get_column("station_id")
first_station_id = set(station_ids).difference(set(exclude_stations)).pop()
return df[df["station_id"] == first_station_id]
return df.filter(pl.col("station_id").eq(first_station_id))


def calculate_percentage_difference(df: pd.DataFrame, text: str = "") -> float:
total_amount = len(df["value"])
zero_amount = len(df[df["value"] == 0.0])
def calculate_percentage_difference(df: pl.DataFrame, text: str = "") -> float:
total_amount = df.get_column("value").len()
zero_amount = df.filter(pl.col("value").eq(0.0)).height
percentage = zero_amount / total_amount
print(f"{text}: {percentage*100:.2f}% = {zero_amount} of {total_amount} with zero value")
return percentage
@@ -46,7 +46,7 @@ def main():
end_date = datetime(2022, 1, 1)
interpolated_df = get_interpolated_df(start_date, end_date)
print(interpolated_df)
exclude_stations = interpolated_df.station_ids[0]
exclude_stations = interpolated_df.get_column("station_ids")[0]
regular_df = get_regular_df(start_date, end_date, exclude_stations)
calculate_percentage_difference(regular_df, "regular")
calculate_percentage_difference(interpolated_df, "interpolated")
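
Reviewer note: both `get_regular_df` variants pick the comparison station with `set(station_ids).difference(set(exclude_stations)).pop()`. `set.pop()` returns an arbitrary element and raises `KeyError` on an empty set, so the name `first_station_id` is only loosely accurate. If a reproducible choice is wanted, an order-preserving lookup is one option. The helper below is a hypothetical sketch, not part of the commit, and it assumes the station ids arrive in a meaningful order.

```python
from typing import Iterable, Optional


def first_station_not_excluded(station_ids: Iterable[str], exclude: Iterable[str]) -> Optional[str]:
    """Return the first station id, in input order, that is not excluded."""
    excluded = set(exclude)
    for station_id in station_ids:
        if station_id not in excluded:
            return station_id
    # Nothing left after exclusion; the caller decides how to handle that.
    return None


print(first_station_not_excluded(["01048", "01048", "04411", "02000"], ["01048"]))
```

Returning `None` instead of raising keeps the empty case explicit for the caller.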
8 changes: 4 additions & 4 deletions benchmarks/summary_over_time.py
@@ -1,7 +1,7 @@
from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd
import polars as pl

from wetterdienst import Parameter
from wetterdienst.provider.dwd.observation import (
@@ -12,17 +12,17 @@
plt.style.use("seaborn")


def get_summarized_df(start_date: datetime, end_date: datetime, lat, lon) -> pd.DataFrame:
def get_summarized_df(start_date: datetime, end_date: datetime, lat, lon) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.TEMPERATURE_AIR_MEAN_200,
resolution=DwdObservationResolution.DAILY,
start_date=start_date,
end_date=end_date,
)
return stations.summarize((lat, lon)).df
return stations.summarize(latlon=(lat, lon)).df


def get_regular_df(start_date: datetime, end_date: datetime, station_id) -> pd.DataFrame:
def get_regular_df(start_date: datetime, end_date: datetime, station_id) -> pl.DataFrame:
stations = DwdObservationRequest(
parameter=Parameter.TEMPERATURE_AIR_MEAN_200,
resolution=DwdObservationResolution.DAILY,