
Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

This PR refines the docstrings of `from_csv`/`schema_of_csv`/`to_csv` and adds some new examples.

Why are the changes needed?

To improve the PySpark documentation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions.

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang LuciferYang marked this pull request as draft January 9, 2024 11:49
@LuciferYang
Contributor Author

Error: Internal server error occurred while resolving "actions/cache@v3". Internal server error occurred while resolving "actions/checkout@v4". Internal server error occurred while resolving "actions/setup-java@v4". Internal server error occurred while resolving "actions/upload-artifact@v3"

It seems there are some issues with GitHub Actions; we need to wait until they are resolved to continue testing.

@github-actions github-actions bot removed the INFRA label Jan 10, 2024
return _invoke_function("schema_of_csv", col, _options_to_str(options))


# TODO(SPARK-46654) Re-enable the `Example 2` test after fixing the display
Contributor Author

Example 2 (converting a complex StructType to a CSV string) displays different results between regular Spark and Spark Connect, so this PR skips that test and adds a TODO(SPARK-46654):

**********************************************************************
File "/__w/spark/spark/python/pyspark/sql/connect/functions/builtin.py", line 2232, in pyspark.sql.connect.functions.builtin.to_csv
Failed example:
    df.select(sf.to_csv(df.value)).show(truncate=False)
Expected:
    +-----------------------+
    |to_csv(value)          |
    +-----------------------+
    |2,Alice,"[100,200,300]"|
    +-----------------------+
Got:
    +--------------------------------------------------------------------------+
    |to_csv(value)                                                             |
    +--------------------------------------------------------------------------+
    |2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@99c5e30f|
    +--------------------------------------------------------------------------+
    <BLANKLINE>
**********************************************************************
   1 of  18 in pyspark.sql.connect.functions.builtin.to_csv
***Test Failed*** 1 failures.
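In PySpark a doctest example like this is typically disabled with a `# doctest: +SKIP` directive until the underlying issue is fixed. The mechanism can be sketched with the stdlib `doctest` module alone; `to_csv_demo` below is a hypothetical stand-in, not Spark's `to_csv`:

```python
import doctest


def to_csv_demo(row):
    """Join a row of fields into a CSV-like string (illustrative helper).

    >>> to_csv_demo([2, "Alice"])
    '2,Alice'

    The next example is disabled, as the PR does for Example 2 of to_csv:

    >>> to_csv_demo([2, "Alice", [100, 200, 300]])  # doctest: +SKIP
    '2,Alice,"[100,200,300]"'
    """
    return ",".join(str(field) for field in row)


# Run only this function's doctests, passing globs explicitly so the
# examples can resolve the name regardless of how the script is executed.
runner = doctest.DocTestRunner(verbose=False)
for test in doctest.DocTestFinder().find(
    to_csv_demo, "to_csv_demo", globs={"to_csv_demo": to_csv_demo}
):
    runner.run(test)

# Only the first example runs; the skipped one is neither tried nor failed.
print(runner.tries, runner.failures)  # → 1 0
```

Because `+SKIP` is checked before an example is attempted, the skipped example does not count toward the `1 of 18` tally shown in the failure report above.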

Contributor Author

Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/10 13:56:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/10 13:56:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/01/10 13:56:18 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Python version 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022 15:24:06)
Spark context Web UI available at http://localhost:4042
Spark context available as 'sc' (master = local[*], app id = local-1704866178807).
SparkSession available as 'spark'.
>>> from pyspark.sql import Row, functions as sf
>>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(sf.to_csv(df.value)).show(truncate=False)
+-----------------------+                                                       
|to_csv(value)          |
+-----------------------+
|2,Alice,"[100,200,300]"|
+-----------------------+

Contributor Author

./bin/pyspark --remote "sc://localhost"

Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0.dev0
      /_/

Using Python version 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022 15:24:06)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> from pyspark.sql import Row, functions as sf
>>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(sf.to_csv(df.value)).show(truncate=False)
+--------------------------------------------------------------------------+
|to_csv(value)                                                             |
+--------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@99c5e30f|
+--------------------------------------------------------------------------+
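The expected output `2,Alice,"[100,200,300]"` follows standard minimal CSV quoting: a field whose stringified value contains the delimiter is wrapped in double quotes. The stdlib `csv` module shows the same behavior (this only illustrates the quoting rule, not Spark's actual serializer):

```python
import csv
import io

buf = io.StringIO()
# The default QUOTE_MINIMAL dialect quotes only fields containing
# the delimiter, a quote character, or a line terminator.
writer = csv.writer(buf)

# Once the nested scores array is rendered as a string, it contains
# commas, so the writer must quote it to keep the row at three fields.
writer.writerow([2, "Alice", "[100,200,300]"])

print(buf.getvalue().strip())  # → 2,Alice,"[100,200,300]"
```

This is why the regular-Spark result above is the correct one: Spark Connect's `org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@...` output is the unconverted internal array representation rather than a quoted CSV field.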

@LuciferYang LuciferYang marked this pull request as ready for review January 10, 2024 07:05
@LuciferYang
Contributor Author

Merged into master. Thanks @HyukjinKwon.
