-
-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add array_emd function #26
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Seems like there's pretty bad precision in certain cases, like array_emd([1], [2])
. This should be 1
, but it's 0.9666...
on my machine. This has to do with NumPy's heuristics for bin sizes, which result in many bins even in small cases like that. I'm not sure how important this is, but it may be worth thinking about.
pyemd/emd.pyx
Outdated
histogram. Defaults to the range of the union of `first_array` | ||
and `second_array`. Note: if the given range is not a superset | ||
of the default range, no warning will be given. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, trailing whitespace should be stripped. There's probably a plugin for your text editor that can do this automatically for you.
pyemd/emd.pyx
Outdated
@@ -45,6 +46,10 @@ def validate(first_histogram, second_histogram, distance_matrix): | |||
if (first_histogram.shape[0] != second_histogram.shape[0]): | |||
raise ValueError('Histogram lengths must be equal') | |||
|
|||
def euclidean_pairwise_distance(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be two blank lines before top-level function definitions. There are automated tools for checking these sorts of things, called “linters”, such as pylint
.
pyemd/emd.pyx
Outdated
return emd(first_histogram, | ||
second_histogram, | ||
distance_matrix, | ||
extra_mass_penalty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call should explicitly pass extra_mass_penalty
as a keyword argument:
emd(..., extra_mass_penalty=extra_mass_penalty)
README.rst
Outdated
>>> first_array = [1,2,3,4] | ||
>>> second_array = [2,3,4,5] | ||
>>> array_emd(first_array, second_array, bins=2) | ||
0.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a few unit tests added to test/test_pyemd.py
, in addition to this.
pyemd/emd.pyx
Outdated
@@ -91,6 +96,76 @@ def emd(np.ndarray[np.float64_t, ndim=1, mode="c"] first_histogram, | |||
distance_matrix, | |||
extra_mass_penalty) | |||
|
|||
def array_emd(first_array, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be called emd_samples
(if you can think of something better, I'm all ears), since the other emd
functions also take arrays, and they all begin with emd
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good—just a couple of style nitpicks.
Last thing is to write some tests, and then I think this is good to go.
pyemd/emd.pyx
Outdated
max(np.max(first_array), np.max(second_array))) | ||
|
||
if type(bins) == str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isinstance(bins, str)
should be preferred over explicitly checking type
equality.
pyemd/emd.pyx
Outdated
if type(bins) == str: | ||
hist, _ = np.histogram(np.concatenate([first_array, | ||
second_array]), | ||
range=range, bins=bins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: arguments should be aligned.
pyemd/__init__.py
Outdated
@@ -33,6 +33,14 @@ | |||
>>> emd_with_flow(first_signature, second_signature, distance_matrix) | |||
(3.5, [[0.0, 0.0], [0.0, 1.0]]) | |||
|
|||
You can also calculate the EMD directly from two arrays: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could elaborate and say “arrays of observations” or “samples” instead of “arrays” and mention that the histogram is computed automatically.
pyemd/emd.pyx
Outdated
if distance == 'euclidean': | ||
distance = euclidean_pairwise_distance | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: comments should be capitalized, to be consistent with the others.
Apologies, I didn’t intend for this to be done – I just need to push to go between my windows and Linux environments. Thanks for your comments – I’ll add a comment to the PR when I think it’s done.
|
Gotcha, no worries. I'll wait to review until you comment. |
Okay, I'm pretty happy with that now. Thanks for your patience! |
Great, looks good. Thank you! |
Release v0.5.0 - Add the `emd_samples()` function (PR #26). - Clarify docstrings. - Update documentation in README. - Refactor tests. - Explicitly support Python 3.4 and 3.5.
Released in v0.5.0. |
As discussed in #25