Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed forecast period generation function for multiseries #4320

Merged
merged 9 commits into from
Sep 29, 2023

Conversation

christopherbunn
Copy link
Contributor

@christopherbunn christopherbunn commented Sep 22, 2023

Resolves #4323

@christopherbunn christopherbunn changed the title Updated multiseries time series regression forecasting period function Fixed multiseries time series regression forecasting period generation Sep 22, 2023
@christopherbunn christopherbunn changed the title Fixed multiseries time series regression forecasting period generation Fixed forecast period generation function for multiseries Sep 22, 2023
@codecov
Copy link

codecov bot commented Sep 22, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (5c3e832) 99.7% compared to head (432d632) 99.7%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4320     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        357     357             
  Lines      39767   39867    +100     
=======================================
+ Hits       39647   39747    +100     
  Misses       120     120             
Files Coverage Δ
...valml/pipelines/multiseries_regression_pipeline.py 100.0% <100.0%> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <100.0%> (ø)
...line_tests/test_multiseries_regression_pipeline.py 100.0% <100.0%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions for simplification, let me know if they actually work though!

evalml/pipelines/multiseries_regression_pipeline.py Outdated Show resolved Hide resolved
evalml/pipelines/multiseries_regression_pipeline.py Outdated Show resolved Hide resolved
pred_intervals = self.estimator.get_prediction_intervals(
X=estimator_input,
y=y,
coverage=coverage,
)
trans_pred_intervals = {}
intervals_labels = list(list(pred_intervals.values())[0].keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to go into the debugger and play with the code myself to figure out what this line was doing 😅 A much simpler way:

intervals_labels = pred_intervals[0].keys()

That may need to be cast to a list for later on, I didn't test fully, but either way it's more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this doesn't work since pred_intervals is a dict and this would pull the value at key 0 rather than the first value of the dictionary. Is there a better way to pull the first value?

Copy link
Contributor

@eccabay eccabay Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah oops, I see what I missed. I don't know of a better way to manipulate the dictionaries, but we could also just do intervals_labels = pd.DataFrame(pred_intervals).index 😂

evalml/pipelines/time_series_regression_pipeline.py Outdated Show resolved Hide resolved
Comment on lines 252 to 263
for key, orig_pi_values in intervals.items():
series_id_target_name = (
self.input_target_name + "_" + str(series_id)
)
interval_series_pred_intervals[key][
series_id_target_name
] = pd.Series(
(orig_pi_values.values - residuals[series_id].values)
+ trend_pred_intervals[series_id_target_name][key].values
+ y[series_id_target_name].values,
index=orig_pi_values.index,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of repeated code with the other logical branch, which is going to make life very hard for us if we ever need to update this code. Could you abstract it out into a local helper function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

def _get_series_intervals(intervals, residuals, trend_pred_intervals, y):
    return_intervals = {}
    for key, orig_pi_values in intervals.items():
        return_intervals[key] = pd.Series(
            (orig_pi_values.values - residuals.values)
            + trend_pred_intervals[key].values
            + y.values,
            index=origin_pi_values.index
        )
    return return_intervals

if is_multiseries(problem_type):
    for series_id, series_intervals in pred_intervals.items():
        series_id_target_name = self.input_target_name + "_" + str(series_id)
        interval_series_pred_intervals[series_id_target_name] = _get_series_intervals(
            series_intervals,
            residuals[series_id],
            trend_pred_intervals[series_id_target_name],
            y[series_id_target_name]
        )
else:
    trans_pred_intervals = _get_series_intervals(pred_intervals, residuals, trend_pred_intervals, y)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code I suggested does make a change to the dictionary structure for the multiseries case, which you'll have to let me know if it works or not - I swapped the intervals with series ids, to give us {series_1: {0.75_lower: <>, 0.75_upper: <>, ...}, series_2: {...}...} instead of {0.75_lower: {series_1: <>, series_2: <>, ...}, ...}
Personally, I think this would make it easier to get per-series prediction intervals, but you'll have to let me know if it's too much effort to swap things around at this point. We could also completely overhaul the data structure for this to be something actually 2D like a dataframe instead of nested dictionaries, but that might just be tech debt for the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up using your implementation but I tweaked it slightly. I still kept the original dictionary structure since it makes stacking each prediction interval in the end slightly easier. Let me know what you think!

Comment on lines 254 to 255
trans_pred_intervals = {}
trend_pred_intervals = self.get_component(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got so confused here for a second, these names are so similar 😅 can one of them be renamed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trans_pred_intervals -> transformed_pred_intervals

Comment on lines 274 to 278
for interval, interval_data in series_id_interval_result.items():
interval_series_pred_intervals[interval][
series_id_target_name
] = interval_data
for interval in intervals_labels:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the word interval has no meaning to me any more 😂

Comment on lines 274 to 286
for interval, interval_data in series_id_interval_result.items():
interval_series_pred_intervals[interval][
series_id_target_name
] = interval_data
for interval in intervals_labels:
series_id_df = pd.DataFrame(
interval_series_pred_intervals[interval],
)
stacked_pred_interval = stack_data(
data=series_id_df,
series_id_name=self.series_id,
)
trans_pred_intervals[interval] = stacked_pred_interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read over this like 5 times in a row and I can't figure out what the point of all of it is. It seems to be a lot of rearranging the same data in different ways? I'd love if we can make this clearer, even if it's just through comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave variable renaming + additional comments adding a shot. Lmk if you think there's a way to additionally clarify it!

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - nothing to add on my end

@christopherbunn christopherbunn merged commit da17fae into main Sep 29, 2023
25 checks passed
@christopherbunn christopherbunn deleted the msts_get_forecast_period branch September 29, 2023 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiseries prediction intervals fails when using pipelines
3 participants