Rolling Pearson correlation counterintuitive #248
Comments
@SmokinCaterpillar Thanks so much for your feedback, this is great! The correlations functionality was one of the first pieces I added to this tool. It fit the type of data we were using (universes of securities over time => multiple values for each date) really well and we didn't stray from that formula, so I was a little near-sighted in how I put it together. I'm currently finishing up implementing treemaps and I'll be working on these changes next. The only thing about them that might be a little complex is specifying time intervals. So what I will do is, in addition to the current functionality of # of values, add the ability to specify a time interval string supported by pandas. Hope this is fine for you 🙏 thanks again
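As a rough illustration of the two window types mentioned above (a number of values versus a pandas time interval string), here is a minimal sketch. The data, the column names `a` and `b`, and the window sizes are made up for the example; this is not D-Tale's exported code.

```python
import pandas as pd

# Hypothetical, irregularly sampled data: note the gap of several weeks
# between the third and fourth rows.
idx = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01", "2020-02-02"]
)
df = pd.DataFrame(
    {"a": [1.0, 2.0, 4.0, 3.0, 5.0], "b": [2.0, 1.0, 3.0, 5.0, 4.0]},
    index=idx,
)

# Count-based window: always the last 3 rows, however far apart in time.
by_count = df["a"].rolling(3).corr(df["b"])

# Time-based window: only rows within the trailing 3 calendar days.
by_time = df["a"].rolling("3D").corr(df["b"])

print(pd.DataFrame({"by_count": by_count, "by_time": by_time}))
```

On 2020-02-01 the count-based window still mixes in the two January rows, while the time-based window contains a single row and so yields NaN.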
@SmokinCaterpillar quick question on making the "rolling" correlations available to datasets where there are multiple data points for each date. Just trying to figure out how one can show the output in a chart. For example I ran the following code:
But that produces the following output:
For a timeseries chart I need one data point per day. The only other option would be to have a timeseries chart with multiple lines, one for each point in the rolling window... Let me know if you have any thoughts, thanks.
I just realized that in this case you could just take some aggregation, e.g. the mean. So I would change the second part of my proposal to:
Awesome! I can definitely set that up. Glad I wasn’t going crazy with the groupby-rolling-corr issues 🙂
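The aggregation idea discussed above (take the mean when several rows share a date, then roll over the result) could look roughly like the sketch below. The data, the column names `date`, `a` and `b`, and the window settings are assumptions for illustration, not necessarily what D-Tale ended up doing.

```python
import pandas as pd

# Hypothetical data with several rows per date.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-01-01", "2020-01-01",
                "2020-01-02", "2020-01-02",
                "2020-01-03", "2020-01-03",
                "2020-01-04",
            ]
        ),
        "a": [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 4.0],
        "b": [2.0, 4.0, 8.0, 2.0, 1.0, 3.0, 6.0],
    }
)

# Collapse to one row per date by taking the mean of each column.
daily = df.groupby("date")[["a", "b"]].mean()

# Rolling correlation over the aggregated series: one value per date,
# which is what a single-line timeseries chart needs.
rolling_corr = daily["a"].rolling(3, min_periods=2).corr(daily["b"])

print(rolling_corr)
```

Any other aggregation (median, last value, and so on) would fit in the same place as `mean()`.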
@SmokinCaterpillar Sorry it's taken so long, I've been moving. Here's a demo of what I've got. Let me know what you think and I'll put together a release soon.
Wow, this is nice, great, thanks!
Added in v1.15.2
Hey, first and foremost, dtale is a great and super useful application. Thanks a lot for this nice tool!
I have two suggestions for improvements:
To my mind the temporal correlation plot behaves quite unintuitively and too much magic happens in the background.
The documentation says:
"When the data being viewed in D-Tale has date or timestamp columns but for each date/timestamp value there is only one row of data the behavior of the Correlations popup is a little different. Instead of a timeseries correlation chart the user is given a rolling correlation chart which can have the window (default: 10) altered."
To me having rolling windows is almost always the desired behavior. But you cannot have that if any two data points happen to share the same timestamp (my data sets have this quite often). So you cannot have any rolling analysis unless you filter duplicates before using the correlation tab.
Proposal: Make the rolling view always the default behavior and add a toggle to switch between rolling windows and the grouping by date behavior. Now dtale's behavior no longer depends implicitly on the dataframe data, but on the user's selection in the dtale frontend.
Moreover, the rolling behavior is hard to grasp as well. In case you export the code, you get:
The date column is set as the index, however the rolling function is not taken over a time interval, but simply over the last `n` data points. This produces weird behavior if the data is not sampled at regular intervals. In this case correlation over time is really misleading. Moreover, since no `min_periods` is set for pandas' `rolling` function, the whole correlation analysis breaks down once you have NaN values in your data and you increase the rolling window, as then the correlations become all NaN for increasing window sizes.
Proposal: Always take rolling windows over time intervals and not over `n` data points. Let the user choose the length `t` and the unit of the time interval (e.g. days, seconds, hours, etc.). Let the user specify the `min_periods` of the pandas rolling function.
Eager to hear your opinion about these two suggestions.
Best,
Robert
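The `min_periods` point raised in this issue can be reproduced with a small sketch. The data and column names `a` and `b` are invented for the example; it relies only on the pandas behavior that an integer window defaults `min_periods` to the window size.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data with one missing value in each column.
idx = pd.date_range("2020-01-01", periods=8, freq="D")
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0, 3.0, 6.0, 5.0, 8.0],
        "b": [2.0, 1.0, 3.0, 5.0, np.nan, 4.0, 7.0, 6.0],
    },
    index=idx,
)

# No min_periods given: every window of 4 rows must hold 4 complete pairs.
# Each window here contains at least one NaN, so the whole result is NaN.
strict = df["a"].rolling(4).corr(df["b"])

# With min_periods=2, windows holding at least two complete pairs
# still produce a correlation.
relaxed = df["a"].rolling(4, min_periods=2).corr(df["b"])

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```

The larger the window, the more likely it is to contain a NaN, which matches the observation that correlations become all NaN as the window grows.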