From 9f1af70485be6a10330abe13a0d6c8e2f4bf045f Mon Sep 17 00:00:00 2001
From: Isaac Schwabacher
Date: Wed, 2 Dec 2015 17:28:16 -0600
Subject: [PATCH] Improve(?) explanation of SettingWithCopy warning

After playing with R a bunch, I started feeling like the explanation of
`SettingWithCopy` wasn't getting to the core of the matter, which is
actually an essential consequence of python slice assignment semantics.
Here's how python handles chained assignment:

```python
df['foo']['bar'] = quux
df.__getitem__('foo').__setitem__('bar', quux)
```

whereas in R, it's this:

```R
df["foo"]["bar"] <- quux
df["foo"] <- `[<-`(df["foo"], "bar", quux)
df <- `[<-`(df, "foo", `[<-`(`[`(df, "foo"), "bar", quux))
```

That last is a lot of line noise, though the R method names `` `[` `` and
`` `[<-` `` are more concise than `__getitem__` and `__setitem__`! But
imagine that you could call `__setitem__` with a kwarg `inplace=False`
that would cause it to return a modified copy instead of modifying the
original object. Then the R version would translate to this in python:

```python
df = df.__setitem__('foo',
                    df.__getitem__('foo')
                      .__setitem__('bar', quux, inplace=False),
                    inplace=False)
```

This is incredibly awkward, but it has the advantage of making
`SettingWithCopy` unnecessary— *everything* is a copy, and yet things
get set nonetheless.

So this commit is an attempt to explain this without requiring the reader
to know R.
---
 doc/source/indexing.rst | 53 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst
index 80dc1be8ee2ea..af5087689ca4d 100644
--- a/doc/source/indexing.rst
+++ b/doc/source/indexing.rst
@@ -1522,23 +1522,58 @@ Contrast this to ``df.loc[:,('one','second')]`` which passes a nested tuple of `
 ``__getitem__``. This allows pandas to deal with this as a single entity.
 Furthermore this order of operations *can* be significantly faster, and allows one to index *both* axes if so desired.
 
-Why does the assignment when using chained indexing fail!
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Why does assignment fail when using chained indexing?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-So, why does this show the ``SettingWithCopy`` warning / and possibly not work when you do chained indexing and assignment:
+The problem in the previous section is just a performance issue. What's up with
+the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when
+you do something that might cost a few extra milliseconds!
+
+But it turns out that assigning to the product of chained indexing has
+inherently unpredictable results. To see this, think about how the Python
+interpreter executes this code:
 
 .. code-block:: python
 
-   dfmi['one']['second'] = value
+   dfmi.loc[:,('one','second')] = value
+   # becomes
+   dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
 
-Since the chained indexing is 2 calls, it is possible that either call may return a **copy** of the data because of the way it is sliced.
-Thus when setting, you are actually setting a **copy**, and not the original frame data. It is impossible for pandas to figure this out because their are 2 separate python operations that are not connected.
+But this code is handled differently:
+
+.. code-block:: python
 
-The ``SettingWithCopy`` warning is a 'heuristic' to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is way complicated.
+   dfmi['one']['second'] = value
+   # becomes
+   dfmi.__getitem__('one').__setitem__('second', value)
+
+See that ``__getitem__`` in there? Outside of simple cases, it's very hard to
+predict whether it will return a view or a copy (it depends on the memory layout
+of the array, about which *pandas* makes no guarantees), and therefore whether
+the ``__setitem__`` will modify ``dfmi`` or a temporary object that gets thrown
+out immediately afterward. **That's** what ``SettingWithCopy`` is warning you
+about!
+
+.. note:: You may be wondering whether we should be concerned about the ``loc``
+   property in the first example. But ``dfmi.loc`` is guaranteed to be ``dfmi``
+   itself with modified indexing behavior, so ``dfmi.loc.__getitem__`` /
+   ``dfmi.loc.__setitem__`` operate on ``dfmi`` directly. Of course,
+   ``dfmi.loc.__getitem__(idx)`` may be a view or a copy of ``dfmi``.
+
+Sometimes a ``SettingWithCopy`` warning will arise at times when there's no
+obvious chained indexing going on. **These** are the bugs that
+``SettingWithCopy`` is designed to catch! Pandas is probably trying to warn you
+that you've done this:
+
+.. code-block:: python
 
-The ``.loc`` operation is a single python operation, and thus can select a slice (which still may be a copy), but allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would think.
+   def do_something(df):
+      foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
+      # ... many lines here ...
+      foo['quux'] = value       # We don't know whether this will modify df or not!
+      return foo
 
-The reason for having the ``SettingWithCopy`` warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say ``float`` and ``object`` data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array.
+Yikes!
 
 Evaluation order matters
 ~~~~~~~~~~~~~~~~~~~~~~~~