Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segmentation fault when JSON serializing a PeriodIndex #47431

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ Fixed regressions
- Fixed regression in :func:`assert_index_equal` when ``check_order=False`` and :class:`Index` has extension or object dtype (:issue:`47207`)
- Fixed regression in :func:`read_excel` returning ints as floats on certain input sheets (:issue:`46988`)
- Fixed regression in :meth:`DataFrame.shift` when ``axis`` is ``columns`` and ``fill_value`` is absent, ``freq`` is ignored (:issue:`47039`)
- Fixed regression in :meth:`DataFrame.to_json` causing a segmentation violation when :class:`DataFrame` is created with an ``index`` parameter of the type :class:`PeriodIndex` (:issue:`46683`)

.. ---------------------------------------------------------------------------

Expand Down
4 changes: 3 additions & 1 deletion pandas/_libs/src/ujson/python/objToJSON.c
Original file line number Diff line number Diff line change
Expand Up @@ -238,8 +238,10 @@ static PyObject *get_values(PyObject *obj) {
PyErr_Clear();
} else if (PyObject_HasAttrString(values, "__array__")) {
// We may have gotten a Categorical or Sparse array so call np.array
PyObject *array_values = PyObject_CallMethod(values, "__array__",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how the memory management work here? I'm not entirely sure that is safe to reassign to values after decrementing. It may be happen-stance that this improves the odds of delaying garbage collection. Maybe we can just return array_values here directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C there is no Python style reference counting. I must say that I assume that the Python object returned by PyObject_CallMethod has at least a reference count of 1, so that it does not get deallocated.
The next line decrements the refcount of the original values object, which in this case will reduce it to 0, causing it to be freed immediately. This was the problem causing the segfault.
This fix delays the destruction of the original values object and as the array_values object should have a refcount high enough not to be destroyed, this works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in case array_values would be equal to NULL, this if statement on line 252 (new situation) would not be executed if it would it would be returned directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must say that I assume that the Python object returned by PyObject_CallMethod has at least a reference count of 1, so that it does not get deallocated.

Yep typicaly callsPyObject_* functions will give you ownership of the reference to an object, which is pretty similar to +1 on a refcount.

The next line decrements the refcount of the original values object, which in this case will reduce it to 0, causing it to be freed immediately.

Could be wrong but I think it gets freed on the next garbage collector run, not necessarily immediately when the count reaches zero. Would definitely be safer here to return rather than re-using the variable if we can

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for this PR, but this part of the thread re-ups my desire to move this logic out of C

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python documentation on reference counting is definitely a good resource. Worth a read:

https://docs.python.org/3/c-api/intro.html#reference-count-details

I think this code would be very difficult to port to Cython maybe not even worth it, but of course anything possible with time and effort

Copy link
Contributor Author

@roberthdevries roberthdevries Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the reference count reaches zero, a python object gets freed (see Py_DECREF). Garbage collection is used to free up objects which are held in circular references that are no longer referenced from any other objects.
This was also the cause of the segmentation fault, as a freed object was being used (use after free type of bug).
There is nothing to be gained by returning immediately with respect to influencing the reference count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks for clarifying

NULL);
Py_DECREF(values);
values = PyObject_CallMethod(values, "__array__", NULL);
values = array_values;
} else if (!PyArray_CheckExact(values)) {
// Didn't get a numpy array, so keep trying
Py_DECREF(values);
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/io/json/test_ujson.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
DatetimeIndex,
Index,
NaT,
PeriodIndex,
Series,
Timedelta,
Timestamp,
Expand Down Expand Up @@ -1240,3 +1241,9 @@ def test_encode_timedelta_iso(self, td):
expected = f'"{td.isoformat()}"'

assert result == expected

def test_encode_periodindex(self):
# GH 46683
p = PeriodIndex(["2022-04-06", "2022-04-07"], freq="D")
df = DataFrame(index=p)
assert df.to_json() == "{}"