Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add array_emd function #26

Merged
merged 5 commits into from
Jan 22, 2018
Merged

Add array_emd function #26

merged 5 commits into from
Jan 22, 2018

Conversation

scottgigante
Copy link
Contributor

As discussed in #25

Copy link
Owner

@wmayner wmayner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Seems like there's pretty bad precision in certain cases, like array_emd([1], [2]). This should be 1, but it's 0.9666... on my machine. This has to do with NumPy's heuristics for bin sizes, which result in many bins even in small cases like that. I'm not sure how important this is, but it may be worth thinking about.

pyemd/emd.pyx Outdated
histogram. Defaults to the range of the union of `first_array`
and `second_array`. Note: if the given range is not a superset
of the default range, no warning will be given.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, trailing whitespace should be stripped. There's probably a plugin for your text editor that can do this automatically for you.

pyemd/emd.pyx Outdated
@@ -45,6 +46,10 @@ def validate(first_histogram, second_histogram, distance_matrix):
if (first_histogram.shape[0] != second_histogram.shape[0]):
raise ValueError('Histogram lengths must be equal')

def euclidean_pairwise_distance(x):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be two blank lines before top-level function definitions. There are automated tools for checking these sorts of things, called “linters”, such as pylint.

pyemd/emd.pyx Outdated
return emd(first_histogram,
second_histogram,
distance_matrix,
extra_mass_penalty)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call should explicitly pass extra_mass_penalty as a keyword argument:

emd(..., extra_mass_penalty=extra_mass_penalty)

README.rst Outdated
>>> first_array = [1,2,3,4]
>>> second_array = [2,3,4,5]
>>> array_emd(first_array, second_array, bins=2)
0.5
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a few unit tests added to test/test_pyemd.py, in addition to this.

pyemd/emd.pyx Outdated
@@ -91,6 +96,76 @@ def emd(np.ndarray[np.float64_t, ndim=1, mode="c"] first_histogram,
distance_matrix,
extra_mass_penalty)

def array_emd(first_array,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be called emd_samples (if you can think of something better, I'm all ears), since the other emd functions also take arrays, and they all begin with emd.

Copy link
Owner

@wmayner wmayner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good—just a couple of style nitpicks.

Last thing is to write some tests, and then I think this is good to go.

pyemd/emd.pyx Outdated
max(np.max(first_array), np.max(second_array)))

if type(bins) == str:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance(bins, str) should be preferred over explicitly checking type equality.

pyemd/emd.pyx Outdated
if type(bins) == str:
hist, _ = np.histogram(np.concatenate([first_array,
second_array]),
range=range, bins=bins)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: arguments should be aligned.

@@ -33,6 +33,14 @@
>>> emd_with_flow(first_signature, second_signature, distance_matrix)
(3.5, [[0.0, 0.0], [0.0, 1.0]])

You can also calculate the EMD directly from two arrays:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could elaborate and say “arrays of observations” or “samples” instead of “arrays” and mention that the histogram is computed automatically.

pyemd/emd.pyx Outdated
if distance == 'euclidean':
distance = euclidean_pairwise_distance

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: comments should be capitalized, to be consistent with the others.

@scottgigante
Copy link
Contributor Author

scottgigante commented Jan 19, 2018 via email

@wmayner
Copy link
Owner

wmayner commented Jan 19, 2018

Gotcha, no worries. I'll wait to review until you comment.

@scottgigante
Copy link
Contributor Author

Okay, I'm pretty happy with that now. Thanks for your patience!

@wmayner
Copy link
Owner

wmayner commented Jan 22, 2018

Great, looks good. Thank you!

@wmayner wmayner merged commit 56ea285 into wmayner:develop Jan 22, 2018
wmayner added a commit that referenced this pull request Jan 22, 2018
Release v0.5.0

- Add the `emd_samples()` function (PR #26).
- Clarify docstrings.
- Update documentation in README.
- Refactor tests.
- Explicitly support Python 3.4 and 3.5.
@wmayner
Copy link
Owner

wmayner commented Jan 22, 2018

Released in v0.5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants