unexpected large data gaps #16

Open
veenstrajelmer opened this issue Apr 3, 2024 · 2 comments

veenstrajelmer commented Apr 3, 2024

My expectation is that some stations have data available from 1900 onwards without large gaps. However, for HOEKVHLD the dataset contains large gaps, in some cases spanning 17 or 20 years without data.

import pandas as pd
import datetime as dt
import ddlpy

# select the water level (WATHTE) timeseries relative to NAP for station HOEKVHLD
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['HOEKVHLD'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

# retrieve the number of measurements per year ("Jaar") and time the request
dtstart = dt.datetime.now()
amount = ddlpy.measurements_amount(selected.iloc[0], dt.datetime(1900,1,1), dt.datetime(2024,3,1), period="Jaar")
print(f'retrieving amount of measurements took: {(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
print(amount)

# difference in years between consecutive rows; values larger than 1 indicate a gap
gap_inyear = pd.to_datetime(amount["Groeperingsperiode"]).dt.year.diff()
print(gap_inyear.drop_duplicates())

This prints:

retrieving amount of measurements took: 36.82 sec

   Groeperingsperiode  AantalMetingen
0                1900           17487
1                1906            2916
2                1907              36
3                1910            2920
4                1911            2920
..                ...             ...
74               2019           52560
75               2020           52704
76               2021           52560
77               2022           52560
78               2023           52560
[79 rows x 2 columns]

0      NaN
1      6.0
2      1.0
3      3.0
6     20.0
7      4.0
8      2.0
41    17.0
48     0.0
Name: Groeperingsperiode, dtype: float64
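
The yearly counts above can also be turned into an explicit list of gaps. The following is a minimal sketch and not part of ddlpy: `list_year_gaps` is a hypothetical helper, and the example years are simply the first five rows of the output above. Duplicated years are dropped before the differences are computed.

```python
import pandas as pd

def list_year_gaps(years):
    """List the gaps between consecutive years that do have data.

    `years` is any iterable of years (strings or integers), for instance
    the "Groeperingsperiode" column. This helper is hypothetical.
    """
    # unique, sorted years as integers
    s = pd.Series(sorted({int(y) for y in years}))
    diff = s.diff()
    # a difference larger than 1 means at least one year is missing
    mask = diff > 1
    gaps = pd.DataFrame({
        "last_year_before": s.shift(1)[mask].astype(int),
        "first_year_after": s[mask],
        "missing_years": (diff[mask] - 1).astype(int),
    }).reset_index(drop=True)
    return gaps

# first five years from the printed output above
example_years = ["1900", "1906", "1907", "1910", "1911"]
gaps = list_year_gaps(example_years)
print(gaps)
```

For these five years this reports two gaps: 1901 to 1905 (5 missing years) and 1908 to 1909 (2 missing years).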

The gaps are also visible in the timeseries (for CADZD, data is only available from 1986 onwards):
image

KDoekes-RWS commented Jun 4, 2024

This is all well known. The expectation that time series of water levels, and especially equidistant time series for locations in the tidal reach, are available without large gaps since 1900 is completely mistaken. Before 1971, the water level data were read manually from the graphs, and as a rule only the times and heights of high and low water (HW/LW data) were processed. From the early 1930s onwards, water level data with a time step of 3 hours were also processed for some tide gauges.
I myself processed the water level data of Hoek van Holland for the year 1900 with a time step of 30 minutes (just because it was a round number) using a digitizer, in 1983. Later, in the 1990s and 2000s, the still-available tables of water level data with a time step of 3 hours were digitized.
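
The time steps mentioned here can be checked against the yearly counts in the issue description with some simple arithmetic. The sketch below is illustrative only: `expected_count` and `coverage_fraction` are hypothetical helpers, and the 10-minute time step for recent years is an assumption based on the counts and on the `WATHTE_10min` folder name elsewhere in this thread.

```python
import calendar

def expected_count(year, step_minutes):
    """Number of values in a complete equidistant series for one year."""
    days = 366 if calendar.isleap(year) else 365
    return days * 24 * 60 // step_minutes

def coverage_fraction(year, count, step_minutes):
    """Fraction of the expected values that is actually present."""
    return count / expected_count(year, step_minutes)

# 3-hourly data for 1910: the issue description reports 2920 measurements
count_1910 = expected_count(1910, 180)
# assumed 10-minute data: 52704 reported for 2020, 52560 for 2019
count_2020 = expected_count(2020, 10)
count_2019 = expected_count(2019, 10)
# 30-minute data for 1900: 17487 measurements reported
fraction_1900 = round(coverage_fraction(1900, 17487, 30), 3)

print(count_1910, count_2020, count_2019, fraction_1900)
```

The counts for 1910, 2019 and 2020 match complete years at a 3-hour and a 10-minute time step, and the 17487 values for 1900 are nearly a complete year at 30 minutes (1900 was not a leap year).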

veenstrajelmer commented Jun 4, 2024

Thanks for your response. I understand there can be (large) gaps in timeseries and some further investigation on my side indeed shows that a contiguous timeseries is not to be expected. However, we have a reference dataset that was exported from DONAR a few years ago for the KenmerkendeWaarden project. The code to reproduce the figure below:

import os
import pandas as pd
import hatyan
import matplotlib.pyplot as plt
plt.close("all")

dir_data = r"p:\archivedprojects\11208031-010-kenmerkende-waarden-k\work\data_vanRWS_20220805\wetransfer_waterstandsgegevens_2022-08-05_1306"
file_dia_cadz1 = os.path.join(dir_data, r"WATHTE_10min\CADZD_1.dia")
file_dia_cadz2 = os.path.join(dir_data, r"WATHTE_oud\CADZ_KW.dia")
file_dia_hoek1 = os.path.join(dir_data, r"WATHTE_10min\HOEKVHLD_1.dia")
file_dia_hoek2 = os.path.join(dir_data, r"WATHTE_oud\HOEK_KW.dia")

fig,ax = plt.subplots(figsize=(10,5))
ts_cadz = hatyan.read_dia([file_dia_cadz1,file_dia_cadz2], block_ids="allstation", station="CADZD", allow_duplicates=True)
ts_cadz["values"].plot(ax=ax, label="CADZD")
ts_hoek = hatyan.read_dia([file_dia_hoek1,file_dia_hoek2], block_ids="allstation", station="HOEKVHLD", allow_duplicates=True)
ts_hoek["values"].plot(ax=ax, label="HOEKVHLD")
ax.legend()
ax.grid()
ax.set_xlim(pd.Timestamp("1890-01-01"),pd.Timestamp("2024-01-01"))
fig.tight_layout()

This produces this figure:
image

This DONAR export has a larger data coverage than the DDL-based figure in the issue description, for instance for the period between 1970 and 1990. This period is missing from many of the datasets available on the DDL, as is also visible in the last figure in #39. Therefore, I suspect that the DDL is not in sync with DONAR when it comes to this data.
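
To make the difference in coverage quantitative, the number of values per year in both sources could be compared. The following is a minimal sketch under assumptions: `yearly_counts` and `compare_coverage` are hypothetical helpers, and the two series are synthetic 3-hourly stand-ins. In practice they would be replaced by two pandas Series with a DatetimeIndex, for instance `ts_hoek["values"]` from the code above and an equivalent series retrieved from the DDL.

```python
import pandas as pd

def yearly_counts(ts):
    """Count the non-NaN values per calendar year of a Series with a DatetimeIndex."""
    return ts.groupby(ts.index.year).count()

def compare_coverage(ts_a, ts_b, name_a="a", name_b="b"):
    """Put the yearly counts of two series side by side (hypothetical helper)."""
    counts = pd.concat(
        {name_a: yearly_counts(ts_a), name_b: yearly_counts(ts_b)}, axis=1
    )
    # years that are absent from one of the sources get a count of 0
    counts = counts.fillna(0).astype(int)
    counts["difference"] = counts[name_a] - counts[name_b]
    return counts

# synthetic stand-ins: one source covers 1970-1972, the other only 1972
step = pd.Timedelta(hours=3)
idx_a = pd.date_range("1970-01-01", "1972-12-31 21:00", freq=step)
idx_b = pd.date_range("1972-01-01", "1972-12-31 21:00", freq=step)
ts_a = pd.Series(0.0, index=idx_a)
ts_b = pd.Series(0.0, index=idx_b)

counts = compare_coverage(ts_a, ts_b, name_a="donar", name_b="ddl")
print(counts)
```

Years with a large positive difference are those where the first source has data that the second one lacks.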
