Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segmentation fault when JSON serializing a PeriodIndex #47431

Conversation

roberthdevries
Copy link
Contributor

@roberthdevries roberthdevries commented Jun 20, 2022

Fixes #46683

@phofl phofl added this to the 1.4.3 milestone Jun 20, 2022
@phofl phofl added IO JSON read_json, to_json, json_normalize Segfault Non-Recoverable Error labels Jun 20, 2022
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

@@ -30,6 +30,7 @@ Fixed regressions
- Fixed regression in :func:`assert_index_equal` when ``check_order=False`` and :class:`Index` has extension or object dtype (:issue:`47207`)
- Fixed regression in :func:`read_excel` returning ints as floats on certain input sheets (:issue:`46988`)
- Fixed regression in :meth:`DataFrame.shift` when ``axis`` is ``columns`` and ``fill_value`` is absent, ``freq`` is ignored (:issue:`47039`)
- Fixed regression in :meth:`DataFrame.to_json` when ``index`` is of the type ``PeriodIndex`` causing a segmentation violation (:issue:`46683`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:class:Index and class:PeriodIndex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index refers to the parameter in the constructor of DataFrame and not the class. Is there some kind of markup to indicate parameters?
The PeriodIndex has been marked as class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for parameter types then the double ticks is fine. However, the note refers to to_json so it is not obvious that the index is a parameter name. perhaps reword to avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the confusion. Is the new version clearer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think marking this as class too is clearer, since both are interchangable in this context. Otherwise, please add something like when argument ìndex is ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -238,8 +238,10 @@ static PyObject *get_values(PyObject *obj) {
PyErr_Clear();
} else if (PyObject_HasAttrString(values, "__array__")) {
// We may have gotten a Categorical or Sparse array so call np.array
PyObject *array_values = PyObject_CallMethod(values, "__array__",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how the memory management work here? I'm not entirely sure that is safe to reassign to values after decrementing. It may be happen-stance that this improves the odds of delaying garbage collection. Maybe we can just return array_values here directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C there is no Python style reference counting. I must say that I assume that the Python object returned by PyObject_CallMethod has at least a reference count of 1, so that it does not get deallocated.
The next line decrements the refcount of the original values object, which in this case will reduce it to 0, causing it to be freed immediately. This was the problem causing the segfault.
This fix delays the destruction of the original values object and as the array_values object should have a refcount high enough not to be destroyed, this works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in case array_values would be equal to NULL, this if statement on line 252 (new situation) would not be executed if it would it would be returned directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must say that I assume that the Python object returned by PyObject_CallMethod has at least a reference count of 1, so that it does not get deallocated.

Yep typicaly callsPyObject_* functions will give you ownership of the reference to an object, which is pretty similar to +1 on a refcount.

The next line decrements the refcount of the original values object, which in this case will reduce it to 0, causing it to be freed immediately.

Could be wrong but I think it gets freed on the next garbage collector run, not necessarily immediately when the count reaches zero. Would definitely be safer here to return rather than re-using the variable if we can

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for this PR, but this part of the thread re-ups my desire to move this logic out of C

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python documentation on reference counting is definitely a good resource. Worth a read:

https://docs.python.org/3/c-api/intro.html#reference-count-details

I think this code would be very difficult to port to Cython maybe not even worth it, but of course anything possible with time and effort

Copy link
Contributor Author

@roberthdevries roberthdevries Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the reference count reaches zero, a python object gets freed (see Py_DECREF). Garbage collection is used to free up objects which are held in circular references that are no longer referenced from any other objects.
This was also the cause of the segmentation fault, as a freed object was being used (use after free type of bug).
There is nothing to be gained by returning immediately with respect to influencing the reference count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks for clarifying

@simonjayhawkins
Copy link
Member

lgtm ex @WillAyd comment.

@roberthdevries roberthdevries force-pushed the 46683-fix-segfault-when-json-serializing-periodindex branch from 175d8e1 to 521dfeb Compare June 21, 2022 17:54
@simonjayhawkins simonjayhawkins merged commit faacb72 into pandas-dev:main Jun 22, 2022
@simonjayhawkins
Copy link
Member

Thanks @roberthdevries

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jun 22, 2022
@roberthdevries roberthdevries deleted the 46683-fix-segfault-when-json-serializing-periodindex branch June 22, 2022 10:16
simonjayhawkins pushed a commit that referenced this pull request Jun 22, 2022
…serializing a PeriodIndex) (#47457)

Backport PR #47431: Fix segmentation fault when JSON serializing a PeriodIndex

Co-authored-by: Robert de Vries <rhdv@xs4all.nl>
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize Segfault Non-Recoverable Error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Segmentation fault when JSON serializing a PeriodIndex
5 participants