Feat/scatter by size #17582

VincentAntoine · 2017-09-18T22:56:23Z

[y] partial solution to "Scatter plot with colour_by and size_by variables" #16827 (only size_by_variable is handled, not color_by_variable)
[y] tests added: scatter_test.py shows an example of use
[y] passes git diff upstream/master -u -- "*.py" | flake8 --diff
[n] whatsnew entry

Let me know if any changes are needed.
Thanks!
Vincent

gfyoung · 2017-09-18T23:11:31Z

pandas/plotting/_core.py

@@ -815,11 +816,22 @@ def _post_plot_logic(self, ax, data):
 class ScatterPlot(PlanePlot):
    _kind = 'scatter'

-    def __init__(self, data, x, y, s=None, c=None, **kwargs):
+    def __init__(self, data, x, y, s=None, s_grow=1, c=None, **kwargs):


I'm wary of adding this in the middle of the signature. If people pass in arguments by position, this will cause their code to break because now s_grow will be assigned the value instead of c.

Should I make s_grow the last positional argument? Or make it a keyword argument?

Keyword, just after c=None.

I meant to ask if you wanted s_grow to be the last keyword argument, like so:

def __init__(self, data, x, y, s=None, c=None, s_grow=1, **kwargs): # [...]

or if s_grow should be a keyword-only argument, included in the **kwargs:

def __init__(self, data, x, y, s=None, c=None, **kwargs):
    if 's_grow' in kwargs:
        # [...]

I understand you're OK with the first one?

Correct, I would have to look more closely, but **kwargs is usually passed through to the underlying matplotlib function.

The function signatures in .plot and .plot.scatter are a little messy, so you'll need to be careful to handle **kwargs appropriately. Having s_grow=1 in the function signature is a good way to ensure that it isn't accidentally passed through to matplotlib.
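To make the positional-argument concern concrete, here is a minimal sketch. The three functions are hypothetical stand-ins for the signatures discussed above, not the real ScatterPlot class; they only report how the arguments get bound.

```python
# Hypothetical stand-ins for the three signatures discussed in this thread.
def init_before(data, x, y, s=None, c=None, **kwargs):
    return {"s": s, "c": c, "kwargs": kwargs}


def init_middle(data, x, y, s=None, s_grow=1, c=None, **kwargs):
    # s_grow inserted in the middle: shifts the position of c.
    return {"s": s, "s_grow": s_grow, "c": c, "kwargs": kwargs}


def init_after_c(data, x, y, s=None, c=None, s_grow=1, **kwargs):
    # s_grow appended after c: existing positional calls keep working.
    return {"s": s, "s_grow": s_grow, "c": c, "kwargs": kwargs}


# An existing caller that passes s and c by position.
call_args = ("df", "a", "b", 20, "red")

before = init_before(*call_args)
middle = init_middle(*call_args)   # "red" silently lands in s_grow
after = init_after_c(*call_args)   # "red" still lands in c

# Because s_grow is a named parameter, it never ends up in **kwargs,
# so it cannot be forwarded to matplotlib by accident.
with_kwargs = init_after_c("df", "a", "b", s_grow=0.2, alpha=0.5)
```

With the third signature, only alpha remains in kwargs, which is the property the last comment relies on.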

the name seems awkward

does seaborn/mpl have a common name for this?

I'm not very happy with the name s_grow either... Maybe size_factor is better?

I don't find anything similar in seaborn or matplotlib.

In ggplot2 there does not seem to be an option to pass a scaling factor to bubble plots, but you have the option to pass a size range instead. The argument name is scale_size (which I don't find to be a very good name either!).

http://t-redactyl.io/blog/2016/02/creating-plots-in-r-using-ggplot2-part-6-weighted-scatterplots.html

Specifying a size range is more flexible than specifying only a scaling factor, because it makes small variations in the data visible that would be invisible in a bubble plot whose bubble areas are proportional to the data. The downside is precisely that it breaks the proportionality between data and bubble areas. This can result in unintentionally misleading and untrustworthy visualizations, where data points with small relative differences are represented by very different bubble sizes.

As a Pandas user I would prefer having the option to pass a scaling factor rather than a size range.

What are your thoughts?

Scaling factor (like what I implemented) or size range (like in ggplot2) ?

If we keep things as they are (scaling factor), should I replace s_grow by size_factor? Other name?
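The trade-off between the two options can be sketched in plain Python. Both helper names and the min_size/max_size defaults are assumptions made for illustration; only the 200-point base size and the proportional formula come from the diff in this PR.

```python
def sizes_by_factor(values, size_factor=1.0, bubble_points=200):
    """Scaling-factor approach: areas stay proportional to the data."""
    vmax = max(values)
    return [bubble_points * size_factor * v / vmax for v in values]


def sizes_by_range(values, min_size=20.0, max_size=200.0):
    """Size-range approach: data is stretched onto [min_size, max_size]."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        # Constant data: every bubble gets the same size.
        return [max_size for _ in values]
    return [min_size + (max_size - min_size) * (v - vmin) / span
            for v in values]


# Two data points that differ by only 5%.
values = [95.0, 100.0]

factor_sizes = sizes_by_factor(values)  # areas differ by 5% as well
range_sizes = sizes_by_range(values)    # areas differ by a factor of 10
```

With the scaling factor the two bubbles are nearly identical, which is honest but hard to read; with the size range the 5% difference is drawn as a tenfold difference in area.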

gfyoung · 2017-09-18T23:11:39Z

pandas/plotting/_core.py

        if s is None:
-            # hide the matplotlib default for size, in case we want to change
-            # the handling of this argument later
+            # Set default size if no argument is given


Nit: add period at end.

gfyoung · 2017-09-18T23:11:48Z

pandas/plotting/_core.py

        c_is_column = is_hashable(c) and c in self.data.columns

        # plot a colorbar only if a colormap is provided or necessary
        cb = self.kwds.pop('colorbar', self.colormap or c_is_column)

+        # Plot bubble size scale if needed


Nit: add period at end.

gfyoung · 2017-09-18T23:11:55Z

pandas/plotting/_core.py

@@ -875,6 +887,60 @@ def _make_plot(self):
            ax.errorbar(data[x].values, data[y].values,
                        linestyle='none', **err_kwds)

+    def _sci_notation(self, num):


Brief doc-string here.

gfyoung · 2017-09-18T23:11:58Z

pandas/plotting/_core.py

+                               scientific_notation).groups()[0])
+        return coef, expnt
+
+    def _legend_bubbles(self, s_data_max, s_grow, bubble_points):


Brief doc-string here.

gfyoung · 2017-09-18T23:12:54Z

scatter_test.py

+                     s='popdensity',
+                     s_grow=0.2,
+                     title='Popuation vs area and density')
+plt.show()


We only create a new test file if it is absolutely necessary. Surely this test has a home under an existing module in the pandas/tests directory.

Also, I imagine that this test will probably have to be rewritten slightly (correct me if I'm wrong @TomAugspurger ).

Yep, this can go in plotting/test_frame.py. Then you have all the plotting infrastructure in place.

If possible, avoid using a new dataset. We have some in pandas/tests/io/data and some in doc/data

I rewrote the test to check that the sizes of the bubbles in the plot are related to the data in the size column as expected, and placed it in a new function in plotting/test_frame.py.
Let me know if it's OK now.

pep8speaks · 2017-09-26T22:26:47Z

Hello @VincentAntoine! Thanks for updating the PR.

In the file pandas/plotting/_core.py, following are the PEP8 issues :

Line 832:80: E501 line too long (87 > 79 characters)
Line 838:80: E501 line too long (82 > 79 characters)
Line 840:80: E501 line too long (83 > 79 characters)
Line 944:13: E122 continuation line missing indentation or outdented
Line 944:20: E251 unexpected spaces around keyword / parameter equals
Line 944:22: E251 unexpected spaces around keyword / parameter equals
Line 945:13: E122 continuation line missing indentation or outdented
Line 946:54: E225 missing whitespace around operator
Line 947:49: E113 unexpected indentation
Line 947:54: E225 missing whitespace around operator
Line 948:58: E225 missing whitespace around operator
Line 949:13: E128 continuation line under-indented for visual indent
Line 949:26: E251 unexpected spaces around keyword / parameter equals
Line 949:28: E251 unexpected spaces around keyword / parameter equals
Line 953:13: E122 continuation line missing indentation or outdented
Line 954:13: E122 continuation line missing indentation or outdented
Line 955:9: E122 continuation line missing indentation or outdented

In the file pandas/tests/plotting/test_frame.py, following are the PEP8 issues :

Line 1017:31: E251 unexpected spaces around keyword / parameter equals
Line 1017:33: E251 unexpected spaces around keyword / parameter equals
Line 1033:37: E251 unexpected spaces around keyword / parameter equals
Line 1033:39: E251 unexpected spaces around keyword / parameter equals

Comment last updated on March 24, 2018 at 22:50 Hours UTC

codecov · 2017-09-26T22:27:07Z

Codecov Report

Merging #17582 into master will decrease coverage by 0.07%.
The diff coverage is 36.58%.

@@            Coverage Diff             @@
##           master   #17582      +/-   ##
==========================================
- Coverage   91.22%   91.14%   -0.08%     
==========================================
  Files         163      163              
  Lines       49625    49664      +39     
==========================================
- Hits        45270    45267       -3     
- Misses       4355     4397      +42

Flag	Coverage Δ
#multiple	`88.93% <36.58%> (-0.06%)`	⬇️
#single	`40.17% <12.19%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/plotting/_core.py	`80.85% <36.58%> (-1.88%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.77% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e85ca7...895afd8. Read the comment docs.

codecov · 2017-09-26T22:27:08Z

Codecov Report

Merging #17582 into master will decrease coverage by 0.07%.
The diff coverage is 47.61%.

@@            Coverage Diff             @@
##           master   #17582      +/-   ##
==========================================
- Coverage   91.22%   91.15%   -0.08%     
==========================================
  Files         163      163              
  Lines       49625    49895     +270     
==========================================
+ Hits        45270    45481     +211     
- Misses       4355     4414      +59

Flag	Coverage Δ
#multiple	`88.95% <47.61%> (-0.04%)`	⬇️
#single	`40.22% <14.28%> (-0.04%)`	⬇️

Impacted Files	Coverage Δ
pandas/plotting/_core.py	`80.59% <47.61%> (-2.14%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/util/_tester.py	`29.41% <0%> (-9.48%)`	⬇️
pandas/core/accessor.py	`93.75% <0%> (-6.25%)`	⬇️
pandas/plotting/_tools.py	`72.92% <0%> (-6.08%)`	⬇️
pandas/core/tools/datetimes.py	`82.97% <0%> (-2.2%)`	⬇️
pandas/core/indexes/category.py	`97.46% <0%> (-1.09%)`	⬇️
pandas/core/indexes/interval.py	`92.85% <0%> (-0.72%)`	⬇️
pandas/core/common.py	`91.42% <0%> (-0.56%)`	⬇️
... and 49 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e85ca7...84de8ab. Read the comment docs.

jreback · 2017-09-28T10:38:37Z

pandas/plotting/_core.py

            s = 20
+        elif is_hashable(s) and s in data.columns:


you don't need to check hashability, other ops will fail if this is the case

jreback · 2017-09-28T10:39:33Z

pandas/plotting/_core.py

+        elif is_hashable(s) and s in data.columns:
+            # Handle the case where s is a label of a column of the df.
+            # The data is normalized to 200 * size_factor.
+            size_data = data.loc[:, s].values


you don't need .values

jreback · 2017-09-28T10:40:09Z

pandas/plotting/_core.py

+                self.bubble_points = 200
+                s = self.bubble_points * size_factor * size_data / self.s_data_max
+            else:
+                raise TypeError('s must be of numeric dtype')


better error message here (s is the name of a column)

How about "Bubble sizes must be of numeric dtype"?

'size_factor' must be numeric or categorical dtype.

jreback · 2017-09-28T10:40:49Z

pandas/plotting/_core.py

+        '''
+        Returns mantissa and exponent of the number passed in agument.
+        Example:
+        _sci_notation(89278.8924)


this can just be a module level function

Does matplotlib have anything that does this? (cc @tacaswell)

Also, typically the docstring formatting is like

>>> _sci_notation(89278.8924)
(8.9, 5.0)

identical to what you get from the regular python REPL.

Yes, but buried in the formatter code. I would not try to re-use it.

Should we place this function as a module level function within plotting/_core as suggested by @jreback or is there another module that would be more appropriate ? Maybe plotting/_converter.py ?

module level in plotting/_core.py is fine I think.
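For reference, a module-level helper with a REPL-style docstring could look roughly like the sketch below. The name sci_notation_sketch and its implementation (string formatting rather than the regular expression used in the PR) are assumptions; unlike the example quoted above, it returns the exponent as an int.

```python
def sci_notation_sketch(num):
    """
    Return the mantissa and base-10 exponent of ``num``.

    Examples
    --------
    >>> sci_notation_sketch(1234.5)
    (1.2345, 3)
    """
    # '{:e}' formats 1234.5 as '1.234500e+03'; split at the 'e'.
    mantissa, exponent = "{:e}".format(num).split("e")
    return float(mantissa), int(exponent)
```

Because the example is written exactly as it would appear in the interpreter, it can be checked with python -m doctest on the module.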

VincentAntoine · 2017-10-02T21:15:42Z

Hi, can you tell me what's needed here? Shall I do something about the checks that fail? If yes what?
Guidance will be useful for me.
Vincent

TomAugspurger · 2017-10-02T21:37:31Z

@VincentAntoine could you add an empty commit to re-trigger the CI? git commit --allow-empty -m 'trigger-ci' and then git push. That should hopefully resolve the CI failures.

Otherwise, I'm going through this again now, and will probably have some feedback, if you want to wait for that.

TomAugspurger

Overall, things look pretty good here.

We'd like a release note (in doc/source/whatsnew/v0.21.0.txt) and I think it'd be good, and not too much extra work, to support categoricals. Let me know.

TomAugspurger · 2017-10-02T21:40:26Z

pandas/plotting/_core.py

+                self.size_factor = size_factor
+                self.bubble_points = 200
+                s = self.bubble_points * size_factor * size_data / self.s_data_max
+            else:


I'd like to support Categorical data here as well. How much work would it be to do that? Perhaps

if is_categorical_dtype(size_data):
    if size_data.ordered:
        size_data = size_data.codes
    # else raise with a nice error message

and then the if is_numeric_dtype(size_data)? Does that work?

@TomAugspurger I'm not sure I understand what it means to plot categorical data as sizes. Could you give me a use case example?

@VincentAntoine this basically, but with all the nice stuff from this PR

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])
categories = list('abcd')
df['c'] = pd.Categorical(np.random.choice(categories, size=(100,)),
                         categories=categories, ordered=True)
df.head()

fig, ax = plt.subplots()
s = (10 + df.c.cat.codes * 10)
ax.scatter(x='a', y='b', data=df, s=s);

So the idea is to automatically use .cat.codes for categorical dtype data, instead of the categories themselves (which may not be numeric). I think if you do

# check for categorical dtype
size_data = df[s].cat.codes

before the if is_numeric_dtype(size_data) check, then everything else should be the same.

Yes, it works just as you said with no additional modification :)

We just need to do size_data = df[s].cat.codes + 1, because codes start at 0 and the bubbles for the first category would otherwise have an area of 0.

@TomAugspurger I added this in the last commit, and added a test for scatter plot with categorical data as well. I'll write the release note now. Let me know if the code needs any more changes.
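Putting the pieces of this thread together, the size computation might look like the following sketch. The function name bubble_sizes_sketch and the error messages are assumptions; the 200-point base, the normalization by the maximum, the ordered check and the codes + 1 shift are taken from the comments above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def bubble_sizes_sketch(column, size_factor=1.0, bubble_points=200):
    """Map a numeric or ordered categorical Series to marker sizes."""
    if isinstance(column.dtype, pd.CategoricalDtype):
        if not column.cat.ordered:
            raise TypeError("categorical size data must be ordered")
        # Codes start at 0; shift by 1 so the first category is visible.
        size_data = column.cat.codes + 1
    elif is_numeric_dtype(column):
        size_data = column
    else:
        raise TypeError("size data must be numeric or categorical dtype")
    # Largest value maps to bubble_points * size_factor.
    return bubble_points * size_factor * size_data / size_data.max()


categories = list("abcd")
col = pd.Series(pd.Categorical(["a", "d", "b"],
                               categories=categories, ordered=True))
sizes = bubble_sizes_sketch(col)
```

Here the codes 0, 3, 1 become 1, 4, 2, so category 'a' still gets a quarter of the largest area instead of disappearing.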

TomAugspurger · 2017-10-02T21:43:09Z

pandas/plotting/_core.py

+                self.bubble_points = 200
+                s = self.bubble_points * size_factor * size_data / self.s_data_max
+            else:
+                raise TypeError('s must be of numeric dtype')


'size_factor' must be numeric or categorical dtype.

TomAugspurger · 2017-10-02T21:45:30Z

pandas/plotting/_core.py

+        '''
+        Returns mantissa and exponent of the number passed in agument.
+        Example:
+        _sci_notation(89278.8924)


Does matplotlib have anything that does this? (cc @tacaswell)

TomAugspurger · 2017-10-02T21:46:15Z

pandas/plotting/_core.py

+        '''
+        Returns mantissa and exponent of the number passed in agument.
+        Example:
+        _sci_notation(89278.8924)


Also, typically the docstring formatting is like

>>> _sci_notation(89278.8924)
(8.9, 5.0)

identical to what you get from the regular python REPL.

TomAugspurger · 2017-10-02T21:58:36Z

pandas/plotting/_core.py

+            import matplotlib.legend as legend
+            sizes, labels = self._legend_bubbles(s_data_max,
+                                                 size_factor,
+                                                 bubble_points)


You can use self.bubble_points here instead of assigning it up above.

TomAugspurger · 2017-10-02T21:59:37Z

pandas/plotting/_core.py

+                bubbles.append(ax.scatter([],
+                                          [],
+                                          s=size,
+                                          color='white',


Should this be color=rcParams['axes.facecolor'] and axes.edgecolor for the next line?

This color should not matter as there are no data points in it.

It is probably better to just create a Collection here and not bother adding it to the Axes (to not clutter the draw tree with dummy artists).

See https://matplotlib.org/users/legend_guide.html#creating-artists-specifically-for-adding-to-the-legend-aka-proxy-artists

@tacaswell Thank you for your help, I had tried this but used Circle objects instead of CircleCollection objects and could not make it work. Now it works :) And it's cleaner indeed.
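A rough sketch of the proxy-artist approach described above, assuming a headless Agg backend and made-up sizes and labels: the CircleCollection objects are created only for the legend and are never added to the Axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.collections import CircleCollection

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [3, 1, 2], s=[50, 100, 200])

# Illustrative legend entries; the PR derives these from the data.
legend_sizes = [50, 100, 200]
legend_labels = ["50", "100", "200"]

# Proxy artists: they exist only to describe the legend handles and are
# not attached to the Axes, so they do not clutter the draw tree.
proxies = [
    CircleCollection([size], facecolors="none", edgecolors="black")
    for size in legend_sizes
]

legend = ax.legend(proxies, legend_labels, scatterpoints=1, title="size")
```

Since the proxies are never drawn on the Axes themselves, their face colour only matters for how the legend handles look.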

tacaswell · 2017-10-03T02:55:55Z

pandas/plotting/_core.py

+            (0, 1.5): [1, 0.5, 0.25, 0.1]
+        }
+        for lower_bound, upper_bound in labels_catalog:
+            if (coef >= lower_bound) & (coef < upper_bound):


lower_bound <= coef < upper_bound is a bit more readable here

I did not know that worked :)
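For readers who have not seen it before, a short sketch of the chained comparison. Only the (0, 1.5) entry comes from the diff; the second entry and the function name pick_labels are hypothetical.

```python
labels_catalog = {
    (0, 1.5): [1, 0.5, 0.25, 0.1],   # entry shown in the diff
    (1.5, 10): [5, 2, 1, 0.5],       # hypothetical entry for illustration
}


def pick_labels(coef):
    """Return the label list whose half-open interval contains coef."""
    for (lower_bound, upper_bound), labels in labels_catalog.items():
        # Equivalent to (coef >= lower_bound) and (coef < upper_bound),
        # with coef evaluated only once.
        if lower_bound <= coef < upper_bound:
            return labels
    raise ValueError("coef is outside every interval in the catalog")


low = pick_labels(1.0)
boundary = pick_labels(1.5)  # 1.5 belongs to the second interval
```

The original form needed the parentheses because & binds more tightly than the comparison operators; the chained form avoids that pitfall.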

TomAugspurger · 2017-10-04T13:36:25Z

pandas/plotting/_core.py

            s = 20
+        elif s in data.columns:


I think we need the if hashable check here, else this could throw an exception.
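A minimal sketch of why the guard matters: s may also be a list or array of sizes, which is not hashable, and testing such an object for membership in the column index may raise a TypeError. The helper name is_column_label and the sample frame are assumptions; is_hashable is the public helper from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_hashable


def is_column_label(s, columns):
    """True only if s can be looked up and is one of the column labels."""
    # Short-circuit: the membership test runs only for hashable values.
    return is_hashable(s) and s in columns


df = pd.DataFrame({"a": [1, 2], "popdensity": [10.0, 20.0]})

label_case = is_column_label("popdensity", df.columns)  # column name
list_case = is_column_label([30, 60], df.columns)       # explicit sizes
missing_case = is_column_label("nope", df.columns)      # unknown label
```

With the guard in place, explicit sizes fall through to the existing code path instead of failing during argument parsing.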

TomAugspurger · 2017-10-04T20:43:24Z

pandas/plotting/_core.py

+        '''
+        Returns mantissa and exponent of the number passed in agument.
+        Example:
+        _sci_notation(89278.8924)


module level in plotting/_core.py is fine I think.

TomAugspurger · 2017-10-04T20:48:38Z

pandas/tests/plotting/test_frame.py

@@ -1006,6 +1006,40 @@ def test_scatter_colors(self):
                                    np.array([1, 1, 1, 1], dtype=np.float64))

    @pytest.mark.slow
+    def test_plot_scatter_with_s(self):
+        data = np.array([[3.1, 4.2, 1.9],


Could you add a test with some more variety in the data? I'm thinking

all negative

some negative some positive

constant values

some missing values

some very large values

You don't need to do the whole plot for these tests, just verify that the values you get for s look correct. You may want to refactor your logic for getting this to make it easier to test.

OK. For these cases (negatives, NaN etc) I guess we want to have the same behaviour as what we'd get by passing s=df[s] directly ? That is:

make the bubble plots only with points that have positive values (discard NaN and negative data points)

throw a warning in case of negative values
?
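One way to make the size logic testable without plotting is to pull it into a small pure function, as in the sketch below. The behaviour for NaN and non-positive values (mapped to NaN so that they are not drawn) is an assumption based on the question above, not something the thread settled; the function name is hypothetical too.

```python
import math


def normalized_sizes_sketch(values, size_factor=1.0, bubble_points=200):
    """Scale positive values so the largest maps to bubble_points * size_factor.

    Assumed behaviour: NaN and non-positive values become NaN.
    """
    def usable(v):
        return not math.isnan(v) and v > 0

    valid = [v for v in values if usable(v)]
    if not valid:
        return [math.nan] * len(values)
    vmax = max(valid)
    # Divide first: v / vmax is at most 1, so very large inputs do not
    # overflow when multiplied by the base size.
    return [bubble_points * size_factor * (v / vmax) if usable(v) else math.nan
            for v in values]


all_negative = normalized_sizes_sketch([-1.0, -2.0])
mixed = normalized_sizes_sketch([-1.0, 2.0, 4.0])
constant = normalized_sizes_sketch([5.0, 5.0, 5.0])
missing = normalized_sizes_sketch([math.nan, 1.0, 2.0])
large = normalized_sizes_sketch([1e308, 5e307])
```

Each of the cases listed in the review comment then becomes a one-line assertion on the returned list.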

jreback · 2017-11-12T19:10:22Z

can you rebase / fixup

jreback · 2017-12-10T23:32:44Z

closing as stale, but if you'd like to keep working, ping and we can reopen

VincentAntoine · 2018-03-24T22:48:52Z

Hi! I could not find time to work on this for a while. I can now start working on it again and finish it, if that's OK with you.

gfyoung · 2018-03-24T22:50:44Z

@VincentAntoine : Go for it!

VincentAntoine added 2 commits September 5, 2017 01:03

Grab and normalize bubble size data

55733a3

Add possibility to make scatter plot by size on DataFrame with s='col…

d2d42e5

…umn_name' argument

gfyoung requested a review from TomAugspurger September 18, 2017 23:10

gfyoung added the Visualization plotting label Sep 18, 2017

gfyoung reviewed Sep 18, 2017

Add test for scatter plot with s argument

895afd8

Change the order of arguments in scatter plot

9a86ce1

jreback reviewed Sep 28, 2017

Remove hashability check in argument parsing

bc5adb4

TomAugspurger reviewed Oct 2, 2017

tacaswell reviewed Oct 3, 2017

TomAugspurger reviewed Oct 4, 2017

Accept categorical data for s argument

84de8ab

TomAugspurger reviewed Oct 4, 2017

jreback closed this Dec 10, 2017

gfyoung reopened this Mar 24, 2018

VincentAntoine closed this Apr 1, 2018

VincentAntoine deleted the feat/scatter_by_size branch April 1, 2018 16:42

VincentAntoine mentioned this pull request Apr 4, 2018

Feat/scatter by size #20572

Closed

3 tasks
