[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create dataframe from ndarray by zhengruifeng · Pull Request #53280 · apache/spark

zhengruifeng · 2025-12-02T11:58:00Z

What changes were proposed in this pull request?

Avoid the unnecessary pandas conversion when creating a DataFrame from a NumPy ndarray.

Why are the changes needed?

before:
ndarray -> pandas dataframe -> arrow data

after:
ndarray -> arrow data

This also makes the behavior consistent with connect mode:

spark/python/pyspark/sql/connect/session.py

Lines 675 to 706 in 40ba971

    
    elif isinstance(data, np.ndarray):
        if _cols is None:
            if data.ndim == 1 or data.shape[1] == 1:
                _cols = ["value"]
            else:
                _cols = ["_%s" % i for i in range(1, data.shape[1] + 1)]

        if data.ndim == 1:
            if 1 != len(_cols):
                raise PySparkValueError(
                    errorClass="AXIS_LENGTH_MISMATCH",
                    messageParameters={
                        "expected_length": str(len(_cols)),
                        "actual_length": "1",
                    },
                )

            _table = pa.Table.from_arrays([pa.array(data)], _cols)
        else:
            if data.shape[1] != len(_cols):
                raise PySparkValueError(
                    errorClass="AXIS_LENGTH_MISMATCH",
                    messageParameters={
                        "expected_length": str(len(_cols)),
                        "actual_length": str(data.shape[1]),
                    },
                )

            _table = pa.Table.from_arrays(
                [pa.array(data[::, i]) for i in range(0, data.shape[1])], _cols
            )

Does this PR introduce any user-facing change?

no

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

test

zhengruifeng · 2025-12-03T09:09:11Z

thanks, merged to master

….test_with_none_and_nan`

### What changes were proposed in this pull request?
There was a bug in create dataframe from ndarray containing NaN values: NaN was incorrectly converted to Null when arrow-optimization is on, it happened to be resolved in #53280

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #53305 from zhengruifeng/reenable_test_with_none_and_nan.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

github-actions bot added SQL PYTHON labels Dec 2, 2025

zhengruifeng requested a review from HyukjinKwon December 2, 2025 11:58

HyukjinKwon approved these changes Dec 2, 2025


zhengruifeng added 3 commits December 3, 2025 14:58

test

052fc95

test

simplify

40fe3f6

lint

d0812c0

zhengruifeng force-pushed the test_np_arrow branch from 01bff1d to d0812c0 Compare December 3, 2025 06:58

zhengruifeng closed this in 34d4716 Dec 3, 2025

zhengruifeng deleted the test_np_arrow branch December 3, 2025 09:09

zhengruifeng mentioned this pull request Dec 3, 2025

[SPARK-54575][PYTHON][TESTS] Reenable test SparkConnectCreationTests.test_with_none_and_nan #53305

Closed

zhengruifeng wants to merge 3 commits into apache:master from zhengruifeng:test_np_arrow


