[SPARK-28153][PYTHON][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) #25321

HyukjinKwon · 2019-08-01T00:19:12Z

What changes were proposed in this pull request?

This PR backports #24958 to branch-2.4.

This PR proposes to use AtomicReference so that parent and child threads can access the same file block holder.

Python UDF expressions are turned into a plan, which then launches a separate thread to consume the input iterator. In that separate child thread, the iterator calls InputFileBlockHolder.set before the parent does, and the parent thread is unable to read the value later.

1. In this separate child thread, if it happens to call InputFileBlockHolder.set first, before the parent's thread local has been initialized (which happens when ThreadLocal.get() is first called), the child thread seems to call its own initialValue to initialize it.
2. After that, the parent calls its own initialValue to initialize its copy at the first call of ThreadLocal.get().
3. The two threads now hold two different references, so an update in the child isn't reflected in the parent.

This PR fixes it by initializing the parent's thread local with an AtomicReference for the file status so that it can be used in each task, and a child thread's update is reflected in the parent.
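
To illustrate the difference, here is a rough sketch in plain Python. It is only an analogy and not the Spark code: the real change is on the JVM side, and Python's standard library has no AtomicReference, so a small lock-guarded holder stands in for it. The names FileBlockRef, run_thread_local_case, run_shared_reference_case, and the file name are all made up for this example.

```python
import threading

# Case 1: a thread-local slot. Each thread gets its own independent value,
# so a write in the child thread is invisible to the parent.
_local = threading.local()


def _child_sets_thread_local(path):
    # Hypothetical stand-in for the child calling InputFileBlockHolder.set.
    _local.file_name = path


def run_thread_local_case():
    child = threading.Thread(
        target=_child_sets_thread_local, args=("part-00000.parquet",)
    )
    child.start()
    child.join()
    # The parent reads its own slot, which was never written.
    return getattr(_local, "file_name", "")


# Case 2: one mutable reference object created by the parent and shared
# with the child. This plays the role of the AtomicReference in the analogy.
class FileBlockRef:
    def __init__(self, value=""):
        self._lock = threading.Lock()
        self._value = value

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def run_shared_reference_case():
    ref = FileBlockRef()  # initialized by the parent before the child starts
    child = threading.Thread(target=ref.set, args=("part-00000.parquet",))
    child.start()
    child.join()
    # Parent and child hold the same object, so the update is visible.
    return ref.get()


if __name__ == "__main__":
    print(repr(run_thread_local_case()))
    print(repr(run_shared_reference_case()))
```

In the first case the parent never sees the child's write. In the second, both threads hold the same object, so the update is visible once the parent has created the reference before the child starts.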

I also tried to explain this a bit more at #24958 (comment).

How was this patch tested?

Manually tested, and a unit test was added.

… support input_file_name with Python UDF) This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at apache#24958 (comment). Manually tested and unittest was added. Closes apache#24958 from HyukjinKwon/SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

python/pyspark/sql/tests.py

dongjoon-hyun · 2019-08-01T00:55:59Z

Thank you, @HyukjinKwon. I checked and added 2.4.3/2.3.3 to the affected versions in JIRA.

dongjoon-hyun

+1, LGTM (Pending Jenkins).

Since this is branch-2.4, this is tested locally with the following libraries.

numpy             1.16.4
pandas            0.19.2
pyarrow           0.8.0
scipy             1.2.2

$ python/run-tests.py --python-executables python --modules pyspark-sql
Running PySpark tests. Output is in /Users/dhyun/PRS/PR-25321/python/unit-tests.log
Will test against the following Python executables: ['python']
Will test the following Python modules: ['pyspark-sql']
Starting test(python): pyspark.sql.tests
Starting test(python): pyspark.sql.catalog
Starting test(python): pyspark.sql.column
Starting test(python): pyspark.sql.conf
Finished test(python): pyspark.sql.catalog (8s)
Starting test(python): pyspark.sql.context
Finished test(python): pyspark.sql.column (12s)
Starting test(python): pyspark.sql.dataframe
Finished test(python): pyspark.sql.conf (13s)
Starting test(python): pyspark.sql.functions
Finished test(python): pyspark.sql.context (11s)
Starting test(python): pyspark.sql.group
Finished test(python): pyspark.sql.group (28s)
Starting test(python): pyspark.sql.readwriter
Finished test(python): pyspark.sql.dataframe (38s)
Starting test(python): pyspark.sql.session
Finished test(python): pyspark.sql.functions (41s)
Starting test(python): pyspark.sql.streaming
Finished test(python): pyspark.sql.session (17s)
Starting test(python): pyspark.sql.types
Finished test(python): pyspark.sql.readwriter (21s)
Starting test(python): pyspark.sql.udf
Finished test(python): pyspark.sql.streaming (18s)
Starting test(python): pyspark.sql.window
Finished test(python): pyspark.sql.types (6s)
Finished test(python): pyspark.sql.window (3s)
Finished test(python): pyspark.sql.udf (11s)
Finished test(python): pyspark.sql.tests (242s) ... 4 tests were skipped
Tests passed in 242 seconds

Skipped tests in pyspark.sql.tests with python:
    test_unbounded_frames (pyspark.sql.tests.HiveContextSQLTests) ... skipped "Unittest < 3.3 doesn't support mocking"
    test_create_dataframe_required_pandas_not_found (pyspark.sql.tests.SQLTests) ... skipped 'Required Pandas was found.'
    test_to_pandas_required_pandas_not_found (pyspark.sql.tests.SQLTests) ... skipped 'Required Pandas was found.'
    test_type_annotation (pyspark.sql.tests.ScalarPandasUDFTests) ... skipped 'Type hints are supported from Python 3.5.'

SparkQA · 2019-08-01T04:09:04Z

Test build #108497 has finished for PR 25321 at commit 4275f82.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-08-01T04:17:00Z

retest this please

HyukjinKwon · 2019-08-01T04:48:13Z

I am testing the merge script's Python 3 compatibility. Please ignore the noise above.

SparkQA · 2019-08-01T09:04:40Z

Test build #108502 has finished for PR 25321 at commit 4275f82.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-08-01T09:18:34Z

Merged to branch-2.4.

…ckHolder (to support input_file_name with Python UDF) ## What changes were proposed in this pull request? This PR backports #24958 to branch-2.4. This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at #24958 (comment). ## How was this patch tested? Manually tested and unittest was added. Closes #25321 from HyukjinKwon/backport-SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

HyukjinKwon mentioned this pull request Aug 1, 2019

[SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) #24958

Closed

dongjoon-hyun added the PYSPARK label Aug 1, 2019

dongjoon-hyun approved these changes Aug 1, 2019

HyukjinKwon changed the title ~~[SPARK-28153][PYTHON][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)~~ [SPARK-28153][Python][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) Aug 1, 2019

HyukjinKwon changed the title ~~[SPARK-28153][Python][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)~~ [SPARK28153][Python][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) Aug 1, 2019

HyukjinKwon changed the title ~~[SPARK28153][Python][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)~~ [SPARK28153][Python][BRANCH-2.4] []$[abc]$ Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) Aug 1, 2019

dongjoon-hyun changed the title ~~[SPARK-28153][PYTHON][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)~~ [SPARK-28153][PYTHON][2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) Aug 1, 2019

HyukjinKwon mentioned this pull request Aug 1, 2019

[SPARK-28586][INFRA] Make merge-spark-pr script compatible with Python 3 #25322

Closed

HyukjinKwon changed the title ~~[SPARK-28153][PYTHON][2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)~~ [SPARK-28153][PYTHON][BRANCH-2.4] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) Aug 1, 2019

HyukjinKwon closed this Aug 1, 2019

HyukjinKwon deleted the backport-SPARK-28153 branch March 3, 2020 01:19
