Improve(?) explanation of SettingWithCopy warning
After playing with R a bunch, I started feeling like the explanation of
`SettingWithCopy` wasn't getting to the core of the matter, which is actually
an essential consequence of Python slice assignment semantics. Here's how
Python handles chained assignment:

```python
df['foo']['bar'] = quux
df.__getitem__('foo').__setitem__('bar', quux)
```

whereas in R, it's this:

```R
df["foo"]["bar"] <- quux
df["foo"] <- `[<-`(df["foo"], "bar", quux)
df <- `[<-`(df, "foo", `[<-`(`[`(df, "foo"), "bar", quux))
```

That last is a lot of line noise, though the R method names `` `[` `` and
`` `[<-` `` are more concise than `__getitem__` and `__setitem__`! But imagine
that you could call `__setitem__` with a kwarg `inplace=False` that would cause
it to return a modified copy instead of modifying the original object. Then the
R version would translate to this in Python:

```python
df = df.__setitem__('foo',
                    df.__getitem__('foo')
                      .__setitem__('bar', quux, inplace=False),
                    inplace=False)
```
This is incredibly awkward, but it has the advantage of making
`SettingWithCopy` unnecessary: *everything* is a copy, and yet things get
set nonetheless.
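
To make the hypothetical concrete, here's a minimal sketch that uses plain nested dicts instead of pandas. The helper `setitem_copy` is made up for illustration; it stands in for the imagined `__setitem__(..., inplace=False)` and is not a real pandas or Python API.

```python
def setitem_copy(obj, key, value):
    """Hypothetical stand-in for __setitem__(key, value, inplace=False)."""
    new = dict(obj)   # shallow copy; the original object is left untouched
    new[key] = value
    return new


df = {'foo': {'bar': 1, 'baz': 2}}
original = df  # keep a second reference to the old object

# R-style chained assignment: rebuild the inner object, then the outer one,
# then rebind the name `df` to the result.
df = setitem_copy(df, 'foo', setitem_copy(df['foo'], 'bar', 99))

print(df)        # the rebound name sees the new value
print(original)  # the old object was never modified
```

Nothing is ever mutated here, yet `df` ends up with the new value because the name is rebound at the end. That rebinding step is the part Python's chained `__getitem__`/`__setitem__` never performs.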

So this commit is an attempt to explain this without requiring the reader to
know R.
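
For the Python half of the comparison, here's a toy sketch of why the result depends on what `__getitem__` hands back. It doesn't use pandas at all: `Table`, `Column`, and the `return_copy` switch are invented for illustration, and real pandas decides between view and copy by other means.

```python
class Column:
    """Toy column wrapping a dict of values."""

    def __init__(self, data):
        self.data = data

    def __setitem__(self, key, value):
        self.data[key] = value


class Table:
    """Toy container; `return_copy` is a made-up switch, not a pandas option."""

    def __init__(self, columns, return_copy):
        self.columns = columns
        self.return_copy = return_copy

    def __getitem__(self, name):
        data = self.columns[name]
        # A "copy" gets its own dict; a "view" shares the table's dict.
        return Column(dict(data) if self.return_copy else data)


view_table = Table({'foo': {'bar': 1}}, return_copy=False)
copy_table = Table({'foo': {'bar': 1}}, return_copy=True)

# Both statements call Table.__getitem__('foo') and then
# Column.__setitem__('bar', 99) on whatever object came back.
view_table['foo']['bar'] = 99
copy_table['foo']['bar'] = 99

print(view_table.columns)  # the write reached the table's own data
print(copy_table.columns)  # the write landed on a discarded temporary
```

The two assignment statements are identical, and the caller has no way to tell from the call site which behavior it will get.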
ischwabacher committed Dec 4, 2015
1 parent 8ec8487 commit 9f1af70
Showing 1 changed file with 44 additions and 9 deletions.
53 changes: 44 additions & 9 deletions doc/source/indexing.rst
@@ -1522,23 +1522,58 @@ Contrast this to ``df.loc[:,('one','second')]`` which passes a nested tuple of `
 ``__getitem__``. This allows pandas to deal with this as a single entity. Furthermore this order of operations *can* be significantly
 faster, and allows one to index *both* axes if so desired.

-Why does the assignment when using chained indexing fail!
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Why does assignment fail when using chained indexing?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-So, why does this show the ``SettingWithCopy`` warning / and possibly not work when you do chained indexing and assignment:
+The problem in the previous section is just a performance issue. What's up with
+the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when
+you do something that might cost a few extra milliseconds!

+But it turns out that assigning to the product of chained indexing has
+inherently unpredictable results. To see this, think about how the Python
+interpreter executes this code:

 .. code-block:: python

-   dfmi['one']['second'] = value
+   dfmi.loc[:,('one','second')] = value
+   # becomes
+   dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

-Since the chained indexing is 2 calls, it is possible that either call may return a **copy** of the data because of the way it is sliced.
-Thus when setting, you are actually setting a **copy**, and not the original frame data. It is impossible for pandas to figure this out because their are 2 separate python operations that are not connected.
+But this code is handled differently:

+.. code-block:: python

-The ``SettingWithCopy`` warning is a 'heuristic' to detect this (meaning it tends to catch most cases but is simply a lightweight check). Figuring this out for real is way complicated.
+   dfmi['one']['second'] = value
+   # becomes
+   dfmi.__getitem__('one').__setitem__('second', value)

+See that ``__getitem__`` in there? Outside of simple cases, it's very hard to
+predict whether it will return a view or a copy (it depends on the memory layout
+of the array, about which *pandas* makes no guarantees), and therefore whether
+the ``__setitem__`` will modify ``dfmi`` or a temporary object that gets thrown
+out immediately afterward. **That's** what ``SettingWithCopy`` is warning you
+about!

+.. note:: You may be wondering whether we should be concerned about the ``loc``
+   property in the first example. But ``dfmi.loc`` is guaranteed to be ``dfmi``
+   itself with modified indexing behavior, so ``dfmi.loc.__getitem__`` /
+   ``dfmi.loc.__setitem__`` operate on ``dfmi`` directly. Of course,
+   ``dfmi.loc.__getitem__(idx)`` may be a view or a copy of ``dfmi``.

+Sometimes a ``SettingWithCopy`` warning will arise at times when there's no
+obvious chained indexing going on. **These** are the bugs that
+``SettingWithCopy`` is designed to catch! Pandas is probably trying to warn you
+that you've done this:

+.. code-block:: python

-The ``.loc`` operation is a single python operation, and thus can select a slice (which still may be a copy), but allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would think.
+   def do_something(df):
+       foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
+       # ... many lines here ...
+       foo['quux'] = value  # We don't know whether this will modify df or not!
+       return foo

-The reason for having the ``SettingWithCopy`` warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say ``float`` and ``object`` data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array.
+Yikes!

 Evaluation order matters
 ~~~~~~~~~~~~~~~~~~~~~~~~