Feat/scatter by size #17582

Closed

VincentAntoine

Let me know if any changes are needed.
Thanks!
Vincent

@gfyoung gfyoung added the Visualization plotting label Sep 18, 2017
@@ -815,11 +816,22 @@ def _post_plot_logic(self, ax, data):
 class ScatterPlot(PlanePlot):
     _kind = 'scatter'

-    def __init__(self, data, x, y, s=None, c=None, **kwargs):
+    def __init__(self, data, x, y, s=None, s_grow=1, c=None, **kwargs):
Member

I'm wary of adding this in the middle of the signature. If people pass in arguments by position, this will cause their code to break because now s_grow will be assigned the value instead of c.
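A minimal sketch of the breakage described above. The two functions are hypothetical stand-ins for the old and new `__init__` signatures, reduced to plain functions so the argument binding is easy to see:

```python
# Hypothetical stand-ins for the signatures before and after the change.
def init_before(data, x, y, s=None, c=None, **kwargs):
    return {"s": s, "c": c}


def init_after(data, x, y, s=None, s_grow=1, c=None, **kwargs):
    return {"s": s, "s_grow": s_grow, "c": c}


# A caller that passes the colour positionally, as the fifth argument.
before = init_before("df", "a", "b", 20, "red")
after = init_after("df", "a", "b", 20, "red")

print(before)  # the colour is bound to c
print(after)   # the colour is now silently bound to s_grow, c stays None
```

No exception is raised in the second call, which is what makes this kind of change risky: existing code keeps running with a different meaning.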

Author

Should I make s_grow the last positional argument? Or make it a keyword argument?

Contributor

Keyword, just after c=None.

Author

I meant to ask if you wanted s_grow to be the last keyword argument, like so:

def __init__(self, data, x, y, s=None, c=None, s_grow=1, **kwargs):
    # [...]

or if s_grow should be a keyword-only argument, included in the **kwargs:

def __init__(self, data, x, y, s=None, c=None, **kwargs):
    if 's_grow' in kwargs:
        #[...]

I understand you're OK with the first one?

Contributor

Correct, I would have to look more closely, but **kwargs is usually passed through to the underlying matplotlib function.

The function signatures in .plot and .plot.scatter are a little messy, so you'll need to be careful to handle **kwargs appropriately. Having s_grow=1 in the function signature is a good way to ensure that it isn't accidentally passed through to matplotlib.
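The pass-through concern can be sketched as follows. `backend_scatter` is a hypothetical stand-in for an underlying plotting call that rejects unknown keywords; it is not matplotlib's actual signature, and both wrappers are illustrative only:

```python
def backend_scatter(x, y, s=None, c=None):
    """Hypothetical backend call that rejects unknown keywords."""
    return {"s": s, "c": c}


def scatter_named(x, y, s=None, c=None, s_grow=1, **kwargs):
    # s_grow is bound by name here, so it never ends up in kwargs.
    scaled = None if s is None else s * s_grow
    return backend_scatter(x, y, s=scaled, c=c, **kwargs)


def scatter_leaky(x, y, s=None, c=None, **kwargs):
    # s_grow, if given, stays inside kwargs and is forwarded by mistake.
    return backend_scatter(x, y, s=s, c=c, **kwargs)


ok = scatter_named([1], [2], s=10, s_grow=0.5)

try:
    scatter_leaky([1], [2], s=10, s_grow=0.5)
    leaked = False
except TypeError:
    # The backend complains about the unexpected keyword argument.
    leaked = True
```

Naming the parameter in the signature means the wrapper consumes it before anything is forwarded, so no explicit `kwargs.pop` is needed.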

Contributor

the name seems awkward

does seaborn/mpl have a common name for this?

Author

I'm not very happy with the name s_grow either... Maybe size_factor is better?

I couldn't find anything similar in seaborn or matplotlib.

In ggplot2 there does not seem to be an option to pass a scaling factor to bubble plots, but you have the option to pass a size range instead. The argument name is scale_size (which I don't find to be a very good name either!).

http://t-redactyl.io/blog/2016/02/creating-plots-in-r-using-ggplot2-part-6-weighted-scatterplots.html

Specifying a size range is more flexible than specifying only a scaling factor: it makes small variations in the data visible, which they would not be if bubble areas were proportional to the data. The downside is precisely that it breaks the proportionality between data and bubble areas. This can produce unintentionally misleading visualizations in which data points with small relative differences are drawn with very different bubble sizes.

As a Pandas user I would prefer having the option to pass a scaling factor rather than a size range.

What are your thoughts?

  • Scaling factor (like what I implemented) or size range (like in ggplot2)?
  • If we keep things as they are (scaling factor), should I replace s_grow with size_factor? Or another name?
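To make the trade-off concrete, here is a small sketch of both options in plain Python. The function names and the range bounds are assumptions for illustration; the base area of 200 mirrors the `bubble_points` value that appears in the diff later in this thread:

```python
def sizes_by_factor(values, size_factor=1.0, bubble_points=200):
    """Areas proportional to the data; the largest value maps to
    bubble_points * size_factor."""
    largest = max(values)
    return [bubble_points * size_factor * v / largest for v in values]


def sizes_by_range(values, smallest=20.0, largest=200.0):
    """Linear rescale of the data onto [smallest, largest]; the
    proportionality between data and area is lost."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [largest for _ in values]
    return [smallest + (largest - smallest) * (v - lo) / (hi - lo)
            for v in values]


values = [95.0, 100.0]  # two points that differ by only 5%
by_factor = sizes_by_factor(values)
by_range = sizes_by_range(values)

print(by_factor)  # areas keep the 95:100 ratio
print(by_range)   # areas span the whole range despite the small difference
```

With a scaling factor the two bubbles look nearly identical, which is faithful to the data; with a size range the same two points are drawn at the extremes of the range.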

if s is None:
# hide the matplotlib default for size, in case we want to change
# the handling of this argument later
# Set default size if no argument is given
Member

Nit: add period at end.

c_is_column = is_hashable(c) and c in self.data.columns

# plot a colorbar only if a colormap is provided or necessary
cb = self.kwds.pop('colorbar', self.colormap or c_is_column)

# Plot bubble size scale if needed
Member

Nit: add period at end.

@@ -875,6 +887,60 @@ def _make_plot(self):
ax.errorbar(data[x].values, data[y].values,
linestyle='none', **err_kwds)

def _sci_notation(self, num):
Member

Brief doc-string here.

scientific_notation).groups()[0])
return coef, expnt

def _legend_bubbles(self, s_data_max, s_grow, bubble_points):
Member

Brief doc-string here.

scatter_test.py Outdated
s='popdensity',
s_grow=0.2,
title='Popuation vs area and density')
plt.show()
Member

We only create a new test file if it is absolutely necessary. Surely this test has a home under an existing module in the pandas/tests directory.

Member

@gfyoung gfyoung Sep 18, 2017

Also, I imagine that this test will probably have to be rewritten slightly (correct me if I'm wrong @TomAugspurger ).

Contributor

Yep, this can go in plotting/test_frame.py. Then you have all the plotting infrastructure in place.

If possible, avoid using a new dataset. We have some in pandas/tests/io/data and some in doc/data.

Author

I rewrote the test to check that the sizes of the bubbles in the plot are related to the data in the size column as expected, and placed it in a new function in plotting/test_frame.py.
Let me know if it's OK now.

@pep8speaks

pep8speaks commented Sep 26, 2017

Hello @VincentAntoine! Thanks for updating the PR.

Line 832:80: E501 line too long (87 > 79 characters)
Line 838:80: E501 line too long (82 > 79 characters)
Line 840:80: E501 line too long (83 > 79 characters)
Line 944:13: E122 continuation line missing indentation or outdented
Line 944:20: E251 unexpected spaces around keyword / parameter equals
Line 944:22: E251 unexpected spaces around keyword / parameter equals
Line 945:13: E122 continuation line missing indentation or outdented
Line 946:54: E225 missing whitespace around operator
Line 947:49: E113 unexpected indentation
Line 947:54: E225 missing whitespace around operator
Line 948:58: E225 missing whitespace around operator
Line 949:13: E128 continuation line under-indented for visual indent
Line 949:26: E251 unexpected spaces around keyword / parameter equals
Line 949:28: E251 unexpected spaces around keyword / parameter equals
Line 953:13: E122 continuation line missing indentation or outdented
Line 954:13: E122 continuation line missing indentation or outdented
Line 955:9: E122 continuation line missing indentation or outdented

Line 1017:31: E251 unexpected spaces around keyword / parameter equals
Line 1017:33: E251 unexpected spaces around keyword / parameter equals
Line 1033:37: E251 unexpected spaces around keyword / parameter equals
Line 1033:39: E251 unexpected spaces around keyword / parameter equals

Comment last updated on March 24, 2018 at 22:50 Hours UTC

@codecov

codecov bot commented Sep 26, 2017

Codecov Report

Merging #17582 into master will decrease coverage by 0.07%.
The diff coverage is 36.58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17582      +/-   ##
==========================================
- Coverage   91.22%   91.14%   -0.08%     
==========================================
  Files         163      163              
  Lines       49625    49664      +39     
==========================================
- Hits        45270    45267       -3     
- Misses       4355     4397      +42
Flag Coverage Δ
#multiple 88.93% <36.58%> (-0.06%) ⬇️
#single 40.17% <12.19%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_core.py 80.85% <36.58%> (-1.88%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e85ca7...895afd8. Read the comment docs.

@codecov

codecov bot commented Sep 26, 2017

Codecov Report

Merging #17582 into master will decrease coverage by 0.07%.
The diff coverage is 47.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17582      +/-   ##
==========================================
- Coverage   91.22%   91.15%   -0.08%     
==========================================
  Files         163      163              
  Lines       49625    49895     +270     
==========================================
+ Hits        45270    45481     +211     
- Misses       4355     4414      +59
Flag Coverage Δ
#multiple 88.95% <47.61%> (-0.04%) ⬇️
#single 40.22% <14.28%> (-0.04%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_core.py 80.59% <47.61%> (-2.14%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/util/_tester.py 29.41% <0%> (-9.48%) ⬇️
pandas/core/accessor.py 93.75% <0%> (-6.25%) ⬇️
pandas/plotting/_tools.py 72.92% <0%> (-6.08%) ⬇️
pandas/core/tools/datetimes.py 82.97% <0%> (-2.2%) ⬇️
pandas/core/indexes/category.py 97.46% <0%> (-1.09%) ⬇️
pandas/core/indexes/interval.py 92.85% <0%> (-0.72%) ⬇️
pandas/core/common.py 91.42% <0%> (-0.56%) ⬇️
... and 49 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e85ca7...84de8ab. Read the comment docs.

s = 20
elif is_hashable(s) and s in data.columns:
Contributor

you don't need to check hashability, other ops will fail if this is the case

elif is_hashable(s) and s in data.columns:
# Handle the case where s is a label of a column of the df.
# The data is normalized to 200 * size_factor.
size_data = data.loc[:, s].values
Contributor

data[s]

Contributor

you don't need .values

self.bubble_points = 200
s = self.bubble_points * size_factor * size_data / self.s_data_max
else:
raise TypeError('s must be of numeric dtype')
Contributor

better error message here (s is the name of a column)

Author

How about "Bubbles sizes must be of numeric dtype" ?

Contributor

'size_factor' must be numeric or categorical dtype.

'''
Returns mantissa and exponent of the number passed in agument.
Example:
_sci_notation(89278.8924)
Contributor

this can just be a module level function

Contributor

Does matplotlib have anything that does this? (cc @tacaswell)

Contributor

Also, typically the docstring formatting is like

>>> _sci_notation(89278.8924)
(8.9, 5.0)

identical to what you get from the regular python REPL.
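For illustration, here is a sketch of a module-level helper with a REPL-style docstring example. The name `sci_notation_sketch` is hypothetical, and the log10-based approach is an assumption: the body of the PR's `_sci_notation`, including how it rounds the mantissa and what types it returns, is not shown in this thread, so this sketch makes no claim to reproduce its output.

```python
import math


def sci_notation_sketch(num):
    """Split a positive number into (mantissa, exponent), 1 <= mantissa < 10.

    >>> sci_notation_sketch(1500.0)
    (1.5, 3)
    """
    if num <= 0:
        raise ValueError("num must be positive")
    # Largest power of ten not exceeding num.
    exponent = math.floor(math.log10(num))
    mantissa = num / 10 ** exponent
    return mantissa, exponent


mantissa, exponent = sci_notation_sketch(89278.8924)
print(mantissa, exponent)
```

Writing the example with the `>>>` prompt lets `doctest` check it automatically, which is the practical benefit of the formatting requested above.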

Contributor

Yes, but buried in the formatter code. I would not try to re-use it.

Author

@VincentAntoine VincentAntoine Oct 3, 2017

Should we make this a module-level function within plotting/_core, as suggested by @jreback, or is there another module that would be more appropriate? Maybe plotting/_converter.py?

Contributor

module level in plotting/_core.py is fine I think.

@VincentAntoine
Author

Hi, can you tell me what's needed here? Shall I do something about the checks that fail? If yes what?
Guidance will be useful for me.
Vincent

@TomAugspurger
Contributor

TomAugspurger commented Oct 2, 2017

@VincentAntoine could you add an empty commit to re-trigger the CI? git commit --allow-empty -m 'trigger-ci' and then git push. That should hopefully resolve the CI failures.

Otherwise, I'm going through this again now, and will probably have some feedback, if you want to wait for that.

Contributor

@TomAugspurger TomAugspurger left a comment

Overall, things look pretty good here.

We'd like a release note (in doc/source/whatsnew/v0.21.0.txt) and I think it'd be good, and not too much extra work, to support categoricals. Let me know.

self.size_factor = size_factor
self.bubble_points = 200
s = self.bubble_points * size_factor * size_data / self.s_data_max
else:
Contributor

I'd like to support Categorical data here as well. How much work would it be to do that? Perhaps

if is_categorical_dtype(size_data):
    if size_data.ordered:
        size_data = size_data.codes
    # else raise with a nice error message

and then the if is_numeric_dtype(size_data)? Does that work?

Author

@TomAugspurger I'm not sure I understand what it means to plot categorical data as sizes. Could you give me a use case example?

Contributor

@VincentAntoine this basically, but with all the nice stuff from this PR

df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])

categories = list('abcd')
df['c'] = pd.Categorical(np.random.choice(categories, size=(100,)),
                         categories=categories, ordered=True)
df.head()

fig, ax = plt.subplots()
s = (10 + df.c.cat.codes * 10)

ax.scatter(x='a', y='b', data=df, s=s);

So the idea is to automatically use .cat.codes for categorical dtype data, instead of the categories themselves (which may not be numeric). I think if you do the

  • check for categorical dtype
  • size_data = df[s].cat.codes

before if is_numeric_dtype(size_data), then everything else should be the same

Author

Yes, it works just as you said with no additional modification :)

Author

We just need to do size_data = df[s].cat.codes + 1: codes start at 0, so without the offset the bubbles of the first category would have an area of 0.
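A small sketch of the categorical handling discussed here, assuming the normalization from the diff (`bubble_points * size_factor * size_data / s_data_max`). The DataFrame contents are made up for illustration, and the cast to float is an addition of this sketch to keep the arithmetic away from the small integer dtype of the codes:

```python
import pandas as pd

categories = list("abcd")
df = pd.DataFrame({
    "c": pd.Categorical(["a", "c", "d", "b"],
                        categories=categories, ordered=True),
})

# Integer codes follow the category order and start at 0.
codes = df["c"].cat.codes

# Shift by one so the first category does not get a zero-area bubble,
# then cast to float before scaling.
size_data = (codes + 1).astype(float)

bubble_points, size_factor = 200, 1
s = bubble_points * size_factor * size_data / size_data.max()
print(s.tolist())
```

One caveat worth keeping in mind: missing values in a categorical get the code -1, so after the shift they end up with a size of 0 rather than being flagged.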

Author

@TomAugspurger I added this in the last commit, and added a test for scatter plot with categorical data as well. I'll write the release note now. Let me know if the code needs any more changes.

import matplotlib.legend as legend
sizes, labels = self._legend_bubbles(s_data_max,
size_factor,
bubble_points)
Contributor

You can use self.bubble_points here instead of assigning it up above.

bubbles.append(ax.scatter([],
[],
s=size,
color='white',
Contributor

Should this be color=rcParams['axes.facecolor'] and axes.edgecolor for the next line?

Contributor

This color should not matter as there are no data points in it.

It is probably better to just create a Collection here and not bother adding it to the Axes (to not clutter the draw tree with dummy artists).

See https://matplotlib.org/users/legend_guide.html#creating-artists-specifically-for-adding-to-the-legend-aka-proxy-artists
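The proxy-artist approach can be sketched roughly as below. The sizes, labels, and colours are made up for illustration, and this is not the code from the PR; it only shows `CircleCollection` objects being handed to the legend without ever being added to the Axes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.collections import CircleCollection

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [3, 1, 2], s=[50, 100, 200])

# Hypothetical legend entries: marker areas and their text labels.
legend_sizes = [50, 100, 200]
legend_labels = ["25", "50", "100"]

# Proxy artists: built only for the legend, never added to the Axes.
handles = [
    CircleCollection([size], facecolors="white", edgecolors="grey")
    for size in legend_sizes
]
legend = ax.legend(handles, legend_labels, scatterpoints=1, title="size")
```

Because the handles are not children of the Axes, the only collection in the draw tree is the real scatter plot.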

Author

@tacaswell Thank you for your help, I had tried this but used Circle objects instead of CircleCollection objects and could not make it work. Now it works :) And it's cleaner indeed.

(0, 1.5): [1, 0.5, 0.25, 0.1]
}
for lower_bound, upper_bound in labels_catalog:
if (coef >= lower_bound) & (coef < upper_bound):
Contributor

lower_bound <= coef < upper_bound is a bit more readable here

Author

I did not know that worked :)
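For readers who, like the author, have not met chained comparisons before, a short sketch. The bounds reuse the `(0, 1.5)` key from the diff above; the value of `coef` is made up:

```python
coef, lower_bound, upper_bound = 1.2, 0, 1.5

# Bitwise & on two booleans works here, but & binds tighter than the
# comparisons, so both pairs of parentheses are mandatory.
with_ampersand = (coef >= lower_bound) & (coef < upper_bound)

# Chained form: equivalent to (lower_bound <= coef) and (coef < upper_bound),
# with coef evaluated only once.
chained = lower_bound <= coef < upper_bound

# The upper bound is exclusive in both spellings.
at_upper = lower_bound <= upper_bound < upper_bound
```

Note that the chained form relies on `and`, so it suits scalars like `coef`; it is not a drop-in replacement for elementwise `&` on arrays.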

s = 20
elif s in data.columns:
Contributor

I think we need the if hashable check here, else this could throw an exception.
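A short sketch of why the guard matters. The DataFrame and the list of sizes are made up; `is_hashable` is imported from `pandas.api.types` here, which may differ from the internal import used in `plotting/_core.py`:

```python
import pandas as pd
from pandas.api.types import is_hashable

data = pd.DataFrame({"x": [1, 2], "y": [3, 4], "w": [5, 6]})
s = [10, 40]  # explicit sizes passed as a list, not a column label

# Without the guard, the membership test itself raises for unhashable input.
try:
    s in data.columns
    raised = False
except TypeError:
    raised = True

# With the guard, the check short-circuits and s is treated as plain sizes.
is_column = is_hashable(s) and s in data.columns
```

Since `s` may legitimately be a list or array of sizes, the hashability check is what lets that case fall through to the next branch instead of failing on the column lookup.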

@@ -1006,6 +1006,40 @@ def test_scatter_colors(self):
np.array([1, 1, 1, 1], dtype=np.float64))

@pytest.mark.slow
def test_plot_scatter_with_s(self):
data = np.array([[3.1, 4.2, 1.9],
Contributor

Could you add a test with some more variety in the data? I'm thinking

  • all negative
  • some negative some positive
  • constant values
  • some missing values
  • some very large values

You don't need to do the whole plot for these tests, just verify that the values you get for s look correct. You may want to refactor your logic for computing s to make it easier to test.

Author

OK. For these cases (negatives, NaN, etc.) I guess we want the same behaviour as we'd get by passing s=df[s] directly? That is:

  • make the bubble plot only with points that have positive values (discard NaN and negative data points)
  • throw a warning in case of negative values
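One way the size computation could be pulled out of the plotting code so that these cases are testable without drawing anything. The helper name, the warning text, and the choice to return NaN for discarded points are all assumptions of this sketch, not behaviour confirmed anywhere in the thread:

```python
import math
import warnings


def bubble_sizes(values, size_factor=1.0, bubble_points=200):
    """Hypothetical helper: map raw values to marker areas.

    Points that are NaN or not positive get NaN, meaning "not drawn".
    """
    def usable(v):
        return not math.isnan(v) and v > 0

    if any(v < 0 for v in values if not math.isnan(v)):
        warnings.warn("negative sizes are not drawn", RuntimeWarning)

    kept = [v for v in values if usable(v)]
    if not kept:
        return [math.nan for _ in values]

    largest = max(kept)
    return [
        bubble_points * size_factor * v / largest if usable(v) else math.nan
        for v in values
    ]


print(bubble_sizes([1.0, float("nan"), 4.0]))
print(bubble_sizes([3.0, 3.0]))
```

A pure function like this can be exercised directly with all-negative, mixed, constant, missing, and very large inputs, which is the refactoring the review asks for.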

@jreback
Contributor

jreback commented Nov 12, 2017

can you rebase / fixup

@jreback
Contributor

jreback commented Dec 10, 2017

closing as stale, but if you'd like to keep working, ping and we can reopen

@jreback jreback closed this Dec 10, 2017
@VincentAntoine
Author

Hi! I could not find time to work on this for a while. I can now pick it up again and finish it, if that's OK with you.

@gfyoung gfyoung reopened this Mar 24, 2018
@gfyoung
Member

gfyoung commented Mar 24, 2018

@VincentAntoine : Go for it!

@VincentAntoine VincentAntoine deleted the feat/scatter_by_size branch April 1, 2018 16:42
@VincentAntoine VincentAntoine mentioned this pull request Apr 4, 2018
3 tasks
6 participants