Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

draft: Cythonize #33

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

draft: Cythonize #33

wants to merge 5 commits into from

Conversation

nucccc
Copy link
Contributor

@nucccc nucccc commented Feb 6, 2023

Good evening,
I provide in here a folder with an example of cython code and a benchmark, so that if you want to think about it you can find some material from which to start.

Inside the simylaritymeasures folder, there will be a cy_simylaritymeasures, in which i left only the cython code, scripts and quick documentation in readme.md to build it and run it.

Hoping it can turn out to be of any utility,
Nuc

@cjekel
Copy link
Owner

cjekel commented Feb 7, 2023

Very nice! It's going to be until the weekend until I can comment in detail.

I see this is a very minimal cython port. That's great!

I'm curious what happens if you drop

c = distance.cdist(exp_data, num_data, metric='minkowski', p=p)

and use cython to create this matrix? It might be slightly faster to calculate the distance when we need it, rather than creating the large matrix?

Also, we may want to show performance improvements with respect to both the number of data, and the number of dimensions in the data.

@nucccc
Copy link
Contributor Author

nucccc commented Feb 8, 2023

After your suggestion, I even modified the cython code for a hypothetized improvement by eliminating not the distances matrix
c = distance.cdist(exp_data, num_data, metric='minkowski', p=p)
but rather the ca matrix of the coefficients. That didn't look to me as a great speed improvement, but may remove from ram an additional matrix. All calculations are done inplace on the original distance matrix c, thus modifying it. That one can be seem as a procedural change, and I still to reflect on that, but may lead to a decrease in RAM occupation. I hope it to be possibly useful.

I forgot to mention that as a case I'm actually working just on the frechet distance, but of course the cythonization can be then extended on all the other functions.

Thank you for the attention and the work,
Nuc

@cjekel
Copy link
Owner

cjekel commented Feb 14, 2023

Just a bump. I'm still very interested in this comparison, I've just been swamped lately!

@cjekel
Copy link
Owner

cjekel commented Feb 22, 2023

Nice bump adding dtw! I need to test this out asap.

@nucccc
Copy link
Contributor Author

nucccc commented Feb 22, 2023

Ah yes, i just added a dtw comparison. Then i think the benchmark could be better organized, I would put some more work in it.

I'm swamped too and thank you for the attention!

Also if you ever need a quick voice discussion just let me know and we can keep in touch via google meet or something at a point, I'm based in Italy and right now I'm flexible with my schedule.

@cjekel
Copy link
Owner

cjekel commented Oct 7, 2023

I seem to be getting about a 40% improvement gain, that is not sensitive to the total amount of data. This is good! I think I should consider porting the entire library to cython. It would be nice to have both a cython and python version side by side, since the python version is slightly easier to read. Not sure what is the best way to have both...

1.0000510139013212
1.0000510139013212
True
[0.981130201, 1.0351228799999999, 1.0592268329999999, 1.0405540450000004, 1.066770408]
[0.594745455, 0.5694208449999998, 0.5682200880000003, 0.5850636710000003, 0.5942479000000009]
average execution time for non cythonized version: 1.036561
average execution time for cythonized version: 0.582340
improvement: 43 %
100.00171783203773
100.00171783203773
True
[0.9917487250000008, 1.0106956060000005, 0.984660250000001, 1.0517639240000012, 0.9957187320000003]
[0.6305159309999997, 0.6164782420000012, 0.6147694729999991, 0.6011433679999989, 0.6316231439999989]
average execution time for non cythonized version: 1.006917
average execution time for cythonized version: 0.618906
improvement: 38 %

@nucccc
Copy link
Contributor Author

nucccc commented Oct 16, 2023

Hi, in this period I could be a little bit busy, but I could find some time. Recently I red that Cython 3.0 was released, which uses heavily type hinting, and one could resort to type hints such as cython.int or cython.long to actually have types, I would give a look if one could could write pythonic code and then let Cython 3.0 transpile it with that performance gain.

@nucccc nucccc changed the title Cythonize draft: Cythonize Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants