compute_expected_counts gives incorrect values for HANSWT ext #88

Closed
6 tasks done
veenstrajelmer opened this issue Jun 20, 2024 · 0 comments · Fixed by #146


veenstrajelmer commented Jun 20, 2024

The expected counts are higher than the actual counts for HANSWT extremes, even though there are no NaNs or missing timesteps. The cause is that compute_expected_counts derives the frequency from the median time difference instead of the mean. For HOEKVHLD this went unnoticed because the expected counts are lower than the actual counts, but for HANSWT the median-based frequency yields expected counts that exceed the actual number of values.

import os
import pandas as pd
import hatyan
import kenmerkendewaarden as kw
from kenmerkendewaarden.tidalindicators import compute_actual_counts, compute_expected_counts

# set logging level to INFO to get log messages
import logging
logging.getLogger("kenmerkendewaarden").setLevel(level="INFO")

tstop_dt = pd.Timestamp(2021,1,1, tz="UTC+01:00")

dir_base = r'p:\11210325-005-kenmerkende-waarden\work'
dir_meas = os.path.join(dir_base,'measurements_wl_18700101_20240101')
# dir_meas = r"c:\Users\veenstra\Downloads\measurements_wl_18700101_20240101"
current_station = 'HANSWT'

print("loading meas")
data_pd_HWLW_all = kw.read_measurements(dir_output=dir_meas, station=current_station, extremes=True)
data_pd_HWLW_all_12 = hatyan.calc_HWLW12345to12(data_pd_HWLW_all)  # convert 12345 to 12 by taking the minimum of 345 as 2 (lowest low water)

# computing counts
act_count_peryear = compute_actual_counts(data_pd_HWLW_all_12, freq="Y")
exp_count_peryear = compute_expected_counts(data_pd_HWLW_all_12, freq="Y")

# the max timediff is 9 hours and there are no nans, so there are no gaps
print("num nans:", data_pd_HWLW_all_12["values"].isnull().sum())
print("\nmax timediff:", (data_pd_HWLW_all_12.index[1:] - data_pd_HWLW_all_12.index[:-1]).max())

# however, the expected counts are higher because we compute the median frequency, not the mean
print("\nactual counts")
print(act_count_peryear)

print("\nexpected counts")
print(exp_count_peryear)

Gives:

num nans: 0

max timediff: 0 days 09:00:00

actual counts
time
1880     711
1881    1410
1882    1411
1883    1411
1884    1414

2019    1410
2020    1416
2021    1410
2022    1411
2023       3
Freq: A-DEC, Name: values, Length: 144, dtype: int64

expected counts
time
1880    1424.432432
1881    1420.540541
1882    1420.540541
1883    1420.540541
1884    1424.432432
    
2019    1420.540541
2020    1424.432432
2021    1420.540541
2022    1420.540541
2023    1401.600000
Freq: A-DEC, Length: 144, dtype: float64
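
The expected counts above can be reproduced with simple arithmetic. The sketch below assumes that the expected count is the length of the year divided by the median interval between successive extremes (this is inferred from the description in this issue, not taken from the kenmerkendewaarden source), and that this median interval is 6 hours 10 minutes for HANSWT (a value chosen because it matches the printed output). The variable names are illustrative only.

```python
from datetime import timedelta

# Assumed median interval between successive extremes (HW/LW) for HANSWT.
median_interval = timedelta(hours=6, minutes=10)

# Expected count = length of the period / median interval.
expected_normal = timedelta(days=365) / median_interval
expected_leap = timedelta(days=366) / median_interval
print(round(expected_normal, 6))  # 1420.540541, as printed for 1881
print(round(expected_leap, 6))    # 1424.432432, as printed for 1880 and 1884

# The actual count for 1881 is 1411, so the mean interval in that year
# is longer than the assumed median interval.
mean_interval_1881 = timedelta(days=365) / 1411
print(mean_interval_1881)
print(mean_interval_1881 > median_interval)
```

With a gap-free series, dividing the period by the mean interval returns the actual count by construction, so any estimator that lands below the mean interval will overestimate the expected count.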

Todo:

  • consider using scipy.stats.trim_mean, which excludes a given percentage of the data at both ends before taking the mean. This probably gives the desired behaviour for frequency estimation. >> This still gives values that are slightly too high, so it does not give the desired behaviour.
  • consider counting only HWs in case of extremes, since the tidal periods might be more constant than the duration of rise/fall. For extremes with aggers the current approach would not work either, but this alternative might. Unfortunately, there are still 12 years that result in higher expected counts than actual counts, even after using .floor() on the expected counts. Issue: we provide a series of values to the function, so the HWLW code is excluded.
  • apply compute_expected_counts() to HW only, separately, in calc_tidalindicators_HWLW(). Also do this in calc_havengetallen()
  • add a testcase for this edge case. Or add a testcase for HOEKVHLD extremes, since the expected counts are too low there (for the same reason, but too few did not cause a problem, so it was not noticed)
  • update docstrings: compute_expected_counts has an edge case for months/years with only a value on the first and last timestep: the derived frequency will be 15/183 days and this will result in 2 expected counts. This causes the mean to be seen as valid, while it is not.
  • also prevent duplicate test_calc_wltidalindicators
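
To illustrate the first todo item, here is a minimal sketch comparing median, mean and trimmed-mean frequency estimates on a small, right-skewed set of intervals. The interval values are hypothetical and not taken from the HANSWT data, and trimmed_mean is a hand-written stand-in that roughly mirrors what scipy.stats.trim_mean does (cutting a fraction from both tails), so the example runs with the standard library only.

```python
from statistics import mean, median

# Hypothetical intervals (minutes) between successive extremes, skewed to the right.
intervals = [365] * 6 + [385] * 4
span = sum(intervals)       # total gap-free duration covered by the intervals
n_actual = len(intervals)   # actual number of intervals in that duration


def trimmed_mean(values, proportiontocut):
    """Mean after cutting the given fraction of values from both tails."""
    ordered = sorted(values)
    ncut = int(proportiontocut * len(ordered))
    return mean(ordered[ncut:len(ordered) - ncut])


# Expected count = duration / estimated interval, for three estimators.
exp_median = span / median(intervals)
exp_mean = span / mean(intervals)
exp_trim = span / trimmed_mean(intervals, 0.1)

print("actual:      ", n_actual)
print("median-based:", exp_median)  # above the actual count
print("mean-based:  ", exp_mean)    # equal to the actual count
print("trimmed mean:", exp_trim)    # closer, but still above the actual count
```

In this toy case only the plain mean reproduces the actual count, which is in line with the observation above that the trimmed mean still gives values that are slightly too high.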