Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix+perf: Benford correlation and entropy features #58

Merged
merged 4 commits into from
Sep 30, 2023
Merged

Conversation

abstractqqq
Copy link
Collaborator

@abstractqqq abstractqqq commented Sep 30, 2023

  1. Do not approve for now.
  2. Fixed a problem in benford_correlation. (x != "") is erroneous because x is numeric. In eager mode, np.corrcoef returns a matrix so we need to extract the coefficient, not return the whole matrix. Also, moving sort after value_counts which gives us another 3-5% perf improvment.
  3. Significantly improved perf for permutation entropy by utilizing parallelism provided by pl.concat_list to a greater degree.
  4. I have an idea to improve perf of binned_entropy. It is not implemented yet. I will likely work on this this weekend. It is tested on integers and works as fast as Tsfresh. I mentioned in a previous issue that we were about significantly slower and this could bring the performance to the same level. So this would be a big win. The only problem now I am facing is a precision issue when the input column is float, which results in error of scale 1e-5, which can be big sometimes.

The key idea in (4) is to create the segments by floordiv, instead of using cuts, which does way more computation than needed.

  (x.floordiv((x.max() - x.min())/pl.lit(max_bin) + x.min() )).drop_nulls().value_counts().sort()
  .struct.field("counts")
  .entropy()

@vercel
Copy link

vercel bot commented Sep 30, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
functime-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 30, 2023 8:12pm

@abstractqqq
Copy link
Collaborator Author

abstractqqq commented Sep 30, 2023

@topher-lo @MathieuCayssol

  1. After some fiddling, binned_entropy's numerical issue seems to have gone away and we get really nice time results. It is incredible that we can get this fast using just expressions

image

  1. For expressions, I replaced x.len() with x.count() because x.len() internally called self.count(), which is an indirection. Likely just nanosec difference. But it is a free improvement so why not.

I think it's ready to merge if no other issue come up during your review.

@topher-lo topher-lo added bug Something isn't working refactor Code change that neither fixes a bug nor adds a feature perf Performance or optimization improvements labels Sep 30, 2023
@topher-lo topher-lo changed the title Fix/bug fixes fix+perf: Benford correlation and entropy features Sep 30, 2023
@topher-lo topher-lo self-requested a review September 30, 2023 21:47
@topher-lo topher-lo merged commit 4672cc9 into main Sep 30, 2023
5 of 9 checks passed
@topher-lo topher-lo deleted the fix/bug-fixes branch September 30, 2023 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working perf Performance or optimization improvements refactor Code change that neither fixes a bug nor adds a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants