unexpected large data gaps #16

veenstrajelmer · 2024-04-03T17:56:29Z

My expectation is that some stations have data available from 1900 onwards without large gaps. However, for HOEKVHLD we see large gaps in the dataset, sometimes spanning 17 or 20 years without data.

import pandas as pd
import datetime as dt
import ddlpy

# select the water level (WATHTE) series relative to NAP for station HOEKVHLD
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['HOEKVHLD'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

# request the number of measurements per year ("Jaar") and time the call
dtstart = dt.datetime.now()
amount = ddlpy.measurements_amount(selected.iloc[0], dt.datetime(1900,1,1), dt.datetime(2024,3,1), period="Jaar")
print(f'retrieving amount of measurements took: {(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
print(amount)

# difference in years between consecutive rows; a value above 1 indicates a gap
gap_inyear = pd.to_datetime(amount["Groeperingsperiode"]).dt.year.diff()
print(gap_inyear.drop_duplicates())

This prints:

retrieving amount of measurements took: 36.82 sec

   Groeperingsperiode  AantalMetingen
0                1900           17487
1                1906            2916
2                1907              36
3                1910            2920
4                1911            2920
..                ...             ...
74               2019           52560
75               2020           52704
76               2021           52560
77               2022           52560
78               2023           52560
[79 rows x 2 columns]

0      NaN
1      6.0
2      1.0
3      3.0
6     20.0
7      4.0
8      2.0
41    17.0
48     0.0
Name: Groeperingsperiode, dtype: float64
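The `diff()` output above only shows the distinct gap lengths, not where the gaps are. A minimal sketch of how the gaps could be listed with their bounding years follows. The helper `find_year_gaps` and its `min_gap_years` parameter are hypothetical and not part of ddlpy, and the sketch assumes that `Groeperingsperiode` holds year strings such as "1900". The sample rows are the first five rows of the output above.

```python
import pandas as pd

def find_year_gaps(amount, min_gap_years=2):
    """List gaps between consecutive years that contain measurements."""
    # assumption: Groeperingsperiode contains year strings like "1900"
    years = pd.to_datetime(amount["Groeperingsperiode"], format="%Y").dt.year
    years = years.drop_duplicates().sort_values().reset_index(drop=True)
    step = years.diff()
    # a step of 1 means consecutive years, anything larger is a gap
    is_gap = step >= min_gap_years
    gaps = pd.DataFrame({
        "last_year_with_data": years.shift(1)[is_gap].astype(int),
        "next_year_with_data": years[is_gap].astype(int),
        "missing_years": (step[is_gap] - 1).astype(int),
    })
    return gaps.reset_index(drop=True)

# first five rows of the yearly counts printed above
amount_head = pd.DataFrame({
    "Groeperingsperiode": ["1900", "1906", "1907", "1910", "1911"],
    "AantalMetingen": [17487, 2916, 36, 2920, 2920],
})
gaps = find_year_gaps(amount_head)
print(gaps)
```

For these five rows the sketch reports two gaps: 5 missing years after 1900 and 2 missing years after 1907.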

The gaps are also visible in the timeseries (for CADZD, data is only available from 1986 onwards):

KDoekes-RWS · 2024-06-04T13:25:28Z

This is all well known. The expectation that water level time series, and especially equidistant time series for locations in the tidal reach, are available without large gaps since 1900 is completely mistaken. Before 1971 the water level data were read manually from the graphs, and as a rule only the times and heights of high and low water (HW/LW data) were processed. From the early 1930s, water level data with a time step of 3 hours were also processed for some tide gauges.
I myself processed the water level data of Hoek van Holland for the year 1900 (chosen just because it was a round number) with a time step of 30 minutes, using a digitizer, in 1983. Later on, in the '90s and '00s, the tables of water level data with a time step of 3 hours that were still available were digitized.
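The time steps mentioned here can be compared against the yearly counts in the issue description. The sketch below is plain arithmetic and does not use ddlpy. The helper `expected_count` is hypothetical, and the 10-minute step for recent years is an assumption taken from the `WATHTE_10min` folder name that appears later in this thread.

```python
import calendar

def expected_count(year, step_minutes):
    """Number of samples in a complete year for an equidistant time step."""
    days = 366 if calendar.isleap(year) else 365
    return days * 24 * 60 // step_minutes

count_1900_30min = expected_count(1900, 30)    # 17520, reported: 17487
count_1910_3hour = expected_count(1910, 180)   # 2920, reported: 2920
count_2019_10min = expected_count(2019, 10)    # 52560, reported: 52560
count_2020_10min = expected_count(2020, 10)    # 52704, reported: 52704 (leap year)

print(count_1900_30min, count_1910_3hour, count_2019_10min, count_2020_10min)
```

The reported count for 1900 is close to a complete 30-minute series, and the counts of 2920 for 1910 and 1911 are consistent with complete 3-hourly series. The count of 36 for 1907 is far below any of these.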

veenstrajelmer · 2024-06-04T14:27:26Z

Thanks for your response. I understand there can be (large) gaps in timeseries and some further investigation on my side indeed shows that a contiguous timeseries is not to be expected. However, we have a reference dataset that was exported from DONAR a few years ago for the KenmerkendeWaarden project. The code to reproduce the figure below:

import os
import pandas as pd
import hatyan
import matplotlib.pyplot as plt
plt.close("all")

dir_data = r"p:\archivedprojects\11208031-010-kenmerkende-waarden-k\work\data_vanRWS_20220805\wetransfer_waterstandsgegevens_2022-08-05_1306"
file_dia_cadz1 = os.path.join(dir_data, r"WATHTE_10min\CADZD_1.dia")
file_dia_cadz2 = os.path.join(dir_data, r"WATHTE_oud\CADZ_KW.dia")
file_dia_hoek1 = os.path.join(dir_data, r"WATHTE_10min\HOEKVHLD_1.dia")
file_dia_hoek2 = os.path.join(dir_data, r"WATHTE_oud\HOEK_KW.dia")

fig,ax = plt.subplots(figsize=(10,5))
ts_cadz = hatyan.read_dia([file_dia_cadz1,file_dia_cadz2], block_ids="allstation", station="CADZD", allow_duplicates=True)
ts_cadz["values"].plot(ax=ax, label="CADZD")
ts_hoek = hatyan.read_dia([file_dia_hoek1,file_dia_hoek2], block_ids="allstation", station="HOEKVHLD", allow_duplicates=True)
ts_hoek["values"].plot(ax=ax, label="HOEKVHLD")
ax.legend()
ax.grid()
ax.set_xlim(pd.Timestamp("1890-01-01"),pd.Timestamp("2024-01-01"))
fig.tight_layout()

This produces this figure:

This DONAR export has larger data coverage than the DDL-based figure in this issue description, for instance for the period between 1970 and 1990. This period is missing from many of the datasets available on the DDL, as is also visible in the last figure in #39. Therefore I expect that the DDL is not in sync with DONAR for this data.
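One way to make this comparison quantitative is to count the values per year in the reference timeseries and put them next to the yearly counts from the DDL. The sketch below assumes the reference timeseries is a DataFrame with a DatetimeIndex and a "values" column, as used in the plotting code above. The helpers `yearly_counts` and `compare_coverage` are hypothetical, and the data is synthetic and purely illustrative, not actual DONAR or DDL data.

```python
import pandas as pd

def yearly_counts(ts):
    """Count the non-missing values per calendar year."""
    values = ts["values"].dropna()
    return values.groupby(values.index.year).count()

def compare_coverage(counts_reference, counts_ddl):
    """Put both yearly counts side by side and flag years missing in the DDL."""
    table = pd.concat({"reference": counts_reference, "ddl": counts_ddl}, axis=1)
    table = table.fillna(0).astype(int).sort_index()
    table["missing_in_ddl"] = (table["reference"] > 0) & (table["ddl"] == 0)
    return table

# synthetic daily reference series covering 1980 to 1982 (illustrative only)
index_ref = pd.date_range("1980-01-01", "1982-12-31", freq="D")
ts_ref = pd.DataFrame({"values": 0.0}, index=index_ref)

# synthetic DDL counts in which 1981 is absent (illustrative only)
counts_ddl = pd.Series({1980: 366, 1982: 365})

table = compare_coverage(yearly_counts(ts_ref), counts_ddl)
print(table)
```

With real data, `counts_ddl` could be built from the `amount` dataframe in the issue description, indexed by year.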

veenstrajelmer changed the title ~~unexpected data gaps~~ unexpected large data gaps Apr 3, 2024

This was referenced Apr 3, 2024:

DDL improvements (data and waterwebservices) Deltares-research/kenmerkendewaarden#4 (Open)

Create wm-ws-dl issues from remaining comments Deltares-research/kenmerkendewaarden#10 (Closed)

TEXNZE has large gap in 2007 data #26 (Open)

TvLoon-RWS added the data label May 24, 2024

veenstrajelmer mentioned this issue Jul 1, 2024:

Mean waterlevels cannot be reproduced for old periods Deltares-research/kenmerkendewaarden#105 (Open)
