-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Feat/scatter by size #17582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/scatter by size #17582
Conversation
pandas/plotting/_core.py
Outdated
@@ -815,11 +816,22 @@ def _post_plot_logic(self, ax, data): | |||
class ScatterPlot(PlanePlot): | |||
_kind = 'scatter' | |||
|
|||
def __init__(self, data, x, y, s=None, c=None, **kwargs): | |||
def __init__(self, data, x, y, s=None, s_grow=1, c=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wary of adding this in the middle of the signature. If people pass in arguments by position, this will cause their code to break because now s_grow
will be assigned the value instead of c
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I make s_grow the last positionnal argument? Or make it a keyword argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keyword, just after c=None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to ask if you wanted s_grow to be the last keyword argument, like so:
def __init__(self, data, x, y, s=None, c=None, s_grow=1, **kwargs):
# [...]
or if s_grow shoulb be a keyword-only argument, included in the **kwargs:
def __init__(self, data, x, y, s=None, c=None, **kwargs):
if 's_grow' in kwargs:
#[...]
I understand you're OK with the first one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, I would have to look more closely, but **kwargs
is usually passed through to the underlying matplotlib function.
The function signatures in .plot
and .plot.scatter
are a little messy, so you'll need to be careful to handle **kwargs
appropriately. Having s_grow=1
in the function signature is a good way to ensure that it isn't accidentally passed through to matplotlib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name seems awkward
does seaborn/mpl have a comment name for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very happy with the name s_grow either... Maybe size_factor is better?
I don't find anything similar in seaborn or matplotlib.
In ggplot2 there does not seem to be an option to pass a scaling factor to bubble plots, but you have the option to pass a size range instead. The argument name is scale_size (which I don't find to be a very good name either!).
Having the option to specify a size range is more flexible than only being able to specify a scaling factor, as it allows to visualize small variations of data which would be invisible in a bubble plot with bubble areas proportional to the data, but the down side is precisely that it breaks the proportionality between data and bubble areas, which can result in unintentionnaly misleading and untrustworthy visualizations where data points with small relative differences are represented by very different bubble sizes.
As a Pandas user I would prefer having the option to pass a scaling factor rather than a size range.
What are your thoughts?
- Scaling factor (like what I implemented) or size range (like in ggplot2) ?
- If we keep things as they are (scaling factor), should I replace s_grow by size_factor? Other name?
pandas/plotting/_core.py
Outdated
if s is None: | ||
# hide the matplotlib default for size, in case we want to change | ||
# the handling of this argument later | ||
# Set default size if no argument is given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add period at end.
pandas/plotting/_core.py
Outdated
c_is_column = is_hashable(c) and c in self.data.columns | ||
|
||
# plot a colorbar only if a colormap is provided or necessary | ||
cb = self.kwds.pop('colorbar', self.colormap or c_is_column) | ||
|
||
# Plot bubble size scale if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add period at end.
@@ -875,6 +887,60 @@ def _make_plot(self): | |||
ax.errorbar(data[x].values, data[y].values, | |||
linestyle='none', **err_kwds) | |||
|
|||
def _sci_notation(self, num): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brief doc-string here.
pandas/plotting/_core.py
Outdated
scientific_notation).groups()[0]) | ||
return coef, expnt | ||
|
||
def _legend_bubbles(self, s_data_max, s_grow, bubble_points): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brief doc-string here.
scatter_test.py
Outdated
s='popdensity', | ||
s_grow=0.2, | ||
title='Popuation vs area and density') | ||
plt.show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only create new test file if it is absolutely necessary. Surely this test has a home under an existing module in the pandas/tests
directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I imagine that this test will probably have to be rewritten slightly (correct me if I'm wrong @TomAugspurger ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this can go in plotting/test_frame.py
. Then you have all the plotting infrastructure in place.
If possible, avoid using a new dataset. We have some in pandas/tests/io/data
and some in doc/data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote the test to check that the sizes of the bubbles in the plot are related to the data in the size column as expected, and placed it in a new function in plotting/test_frame.py.
Let me know if it's OK now.
Hello @VincentAntoine! Thanks for updating the PR.
Comment last updated on March 24, 2018 at 22:50 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #17582 +/- ##
==========================================
- Coverage 91.22% 91.14% -0.08%
==========================================
Files 163 163
Lines 49625 49664 +39
==========================================
- Hits 45270 45267 -3
- Misses 4355 4397 +42
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17582 +/- ##
==========================================
- Coverage 91.22% 91.15% -0.08%
==========================================
Files 163 163
Lines 49625 49895 +270
==========================================
+ Hits 45270 45481 +211
- Misses 4355 4414 +59
Continue to review full report at Codecov.
|
s = 20 | ||
elif is_hashable(s) and s in data.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to check hasability, other ops will fail if this is the case
pandas/plotting/_core.py
Outdated
elif is_hashable(s) and s in data.columns: | ||
# Handle the case where s is a label of a column of the df. | ||
# The data is normalized to 200 * size_factor. | ||
size_data = data.loc[:, s].values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data[s]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need .values
pandas/plotting/_core.py
Outdated
self.bubble_points = 200 | ||
s = self.bubble_points * size_factor * size_data / self.s_data_max | ||
else: | ||
raise TypeError('s must be of numeric dtype') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better error message here (s is the name of a column)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "Bubbles sizes must be of numeric dtype" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'size_factor' must be numeric or categorical dtype.
''' | ||
Returns mantissa and exponent of the number passed in agument. | ||
Example: | ||
_sci_notation(89278.8924) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can just be a module level function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does matplotlib have anything that does this? (cc @tacaswell)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, typically the docstring formatting is like
>>> _sci_notation(89278.8924)
(8.9, 5.0)
identical to what you get from the regular python REPL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but buried in the formatter code. I would not try to re-use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we place this function as a module level function within plotting/_core as suggested by @jreback or is there another module that would be more appropriate ? Maybe plotting/_converter.py ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module level in plotting/_core.py
is fine I think.
Hi, can you tell me what's needed here? Shall I do something about the checks that fail? If yes what? |
@VincentAntoine could you add an empty commit to re-trigger the CI? Otherwise, I'm going through this again now, and will probably have some feedback, if you want to wait for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, things look pretty good here.
We'd like a release note (in doc/source/whatsnew/v0.21.0.txt`) and I think it'd be good, and not too much extra work, to support categoricals. Let me know.
self.size_factor = size_factor | ||
self.bubble_points = 200 | ||
s = self.bubble_points * size_factor * size_data / self.s_data_max | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to support Categorical data here as well. How much work would it be to do that? Perhaps
if is_categorical_dtype(size_data):
if size_data.ordered:
size_data = size_data.codes
# else raise with a nice error message
and then the if is_numeric_dtype(size_data)
? Does that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger I'm not sure I understand what it means to plot categorical data as sizes. Could you give me a use case example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VincentAntoine this basically, but with all the nice stuff from this PR
df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])
categories = list('abcd')
df['c'] = pd.Categorical(np.random.choice(categories, size=(100,)),
categories=categories, ordered=True)
df.head()
fig, ax = plt.subplots()
s = (10 + df.c.cat.codes * 10)
ax.scatter(x='a', y='b', data=df, s=s);
So the idea is to automatically know to use .categories.codes
for categorical dtype data, instead of just the categories (which may not be numeric). I think if you do the
- check for categorical dtype
- size_data =
df[s].cat.codes
before if is_numeric_dtype(size_data)
, then everything else should be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it works just as you said with no additionnal modification :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just need to do size_data = df[s].cat.codes + 1, as codes start at 0 and the resulting bubbles will have an area of 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger I added this in the last commit, and added a test for scatter plot with categorical data as well. I'll write the release note now. Let me know if the code needs any more changes.
pandas/plotting/_core.py
Outdated
self.bubble_points = 200 | ||
s = self.bubble_points * size_factor * size_data / self.s_data_max | ||
else: | ||
raise TypeError('s must be of numeric dtype') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'size_factor' must be numeric or categorical dtype.
''' | ||
Returns mantissa and exponent of the number passed in agument. | ||
Example: | ||
_sci_notation(89278.8924) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does matplotlib have anything that does this? (cc @tacaswell)
''' | ||
Returns mantissa and exponent of the number passed in agument. | ||
Example: | ||
_sci_notation(89278.8924) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, typically the docstring formatting is like
>>> _sci_notation(89278.8924)
(8.9, 5.0)
identical to what you get from the regular python REPL.
pandas/plotting/_core.py
Outdated
import matplotlib.legend as legend | ||
sizes, labels = self._legend_bubbles(s_data_max, | ||
size_factor, | ||
bubble_points) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use self.bubble_points
here instead of assigning it up above.
pandas/plotting/_core.py
Outdated
bubbles.append(ax.scatter([], | ||
[], | ||
s=size, | ||
color='white', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be color=rcParams['axes.facecolor']
and axes.edgecolor
for the next line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This color should not matter as there are no data points in it.
It is probably better to just create a Collection
here and not bother adding it to the Axes (to not clutter the draw tree with dummy artists).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tacaswell Thank you for your help, I had tried this but used Circle objects instead of CircleCollection objects and could not make it work. Now it works :) And it's cleaner indeed.
(0, 1.5): [1, 0.5, 0.25, 0.1] | ||
} | ||
for lower_bound, upper_bound in labels_catalog: | ||
if (coef >= lower_bound) & (coef < upper_bound): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lower_bound <= coef < upper_bound
is a bit more readable here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know that worked :)
pandas/plotting/_core.py
Outdated
s = 20 | ||
elif s in data.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need the if hashable
check here, else this could throw an exception.
''' | ||
Returns mantissa and exponent of the number passed in agument. | ||
Example: | ||
_sci_notation(89278.8924) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module level in plotting/_core.py
is fine I think.
pandas/tests/plotting/test_frame.py
Outdated
@@ -1006,6 +1006,40 @@ def test_scatter_colors(self): | |||
np.array([1, 1, 1, 1], dtype=np.float64)) | |||
|
|||
@pytest.mark.slow | |||
def test_plot_scatter_with_s(self): | |||
data = np.array([[3.1, 4.2, 1.9], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test with some more variety in the data? I'm thinking
- all negative
- some negative some positive
- constant values
- some missing values
- some very large values
You don't need to do the whole plot for these tests, just verify that the values you get for s
look correct. You may want to refactor you logic getting this to make it easier to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. For these cases (negatives, NaN etc) I guess we want to have the same behaviour as what we'd get by passing s=df[s] directly ? That is:
- make the bubble plots only with points that have positive values (discard NaN and negative data points)
- throw a warning in case of negative values
?
can you rebase / fixup |
closing as stale, but if you'd like to keep working, ping and we can reopen |
Hi! I could not get some time to work on this for some time, I could now start working on it again and finish it if that's OK for you. |
@VincentAntoine : Go for it! |
git diff upstream/master -u -- "*.py" | flake8 --diff
Let me know if any changes are needed.
Thanks!
Vincent