Create abstraction to split into multiple columns easily #85

Open · MrPowers opened this issue Mar 18, 2023 · 3 comments

@MrPowers (Collaborator)

Suppose you have this DataFrame:

+------------+---------------+-------+
|student_name|graduation_year|  major|
+------------+---------------+-------+
| chrisXXborg|           2025|    bio|
|davidXXcross|           2026|physics|
|sophiaXXraul|           2022|    bio|
|    fredXXli|           2025|physics|
|someXXperson|           2023|   math|
|     liXXyao|           2025|physics|
+------------+---------------+-------+

Here is how to clean the DataFrame:

from pyspark.sql.functions import col, split

clean_df = (
    df.withColumn("student_first_name", split(col("student_name"), "XX").getItem(0))
    .withColumn("student_last_name", split(col("student_name"), "XX").getItem(1))
    .drop("student_name")
)

It'd be nice to have a function that would do this automatically:

quinn.split_col(df, col_name="student_name", delimiter="XX", new_col_names=["student_first_name", "student_last_name"])

The current syntax is tedious.

@SemyonSinchenko (Collaborator)

I can do it. Should we add an option for a default value? I mean, if a string does not fit the pattern, should we raise an IndexOutOfBound exception, or should we return a default value?

@MrPowers (Collaborator, Author)

@SemyonSinchenko - we should probably give the user both options. Perhaps we should add a mode parameter. When mode="strict", they'll get an IndexOutOfBound error. When mode="permissive", missing values are populated with null and extra values are ignored entirely. Thoughts?
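
For illustration, here is one way those two modes could behave (a sketch, not the eventual quinn API). Note that getItem returns null for an out-of-range index rather than raising, so a strict mode needs an explicit size check; raise_error requires Spark >= 3.1:

from pyspark.sql.functions import col, lit, raise_error, size, split, when

parts = split(col("student_name"), "XX")

# strict: fail loudly when the split does not yield two parts
strict_last = when(
    size(parts) < 2,
    raise_error(lit("student_name did not split into two parts")),
).otherwise(parts.getItem(1))

# permissive: leave the missing part as null
permissive_last = when(size(parts) > 1, parts.getItem(1))

df.withColumn("student_last_name", strict_last)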

@puneetsharma04 (Contributor)

@SemyonSinchenko & @MrPowers: This seems like an interesting feature and a commonly used transformation in ETL projects.
I would also like to share a piece of code here, if it looks fine to both of you.
Below is code that can handle the scenario:

from pyspark.sql.functions import col, size, split, when

def split_col(df, col_name, delimiter, new_col_names, mode="strict"):
    split_col_expr = split(col(col_name), delimiter)

    if mode == "strict":
        df = df.withColumn(new_col_names[0], split_col_expr.getItem(0))
        df = df.withColumn(new_col_names[1], split_col_expr.getItem(1))
    elif mode == "permissive":
        # Only populate the second column when the split actually produced a
        # second part; rows that never matched the delimiter are dropped.
        # Note: Column has no .size() method, so use functions.size() here.
        df = (
            df.withColumn(new_col_names[0], split_col_expr.getItem(0))
              .withColumn(new_col_names[1], when(size(split_col_expr) > 1, split_col_expr.getItem(1)))
              .filter(col(new_col_names[1]).isNotNull())
        )
    else:
        raise ValueError("Invalid mode: {}".format(mode))

    df = df.drop(col_name)
    return df
  

# Create a Spark DataFrame (assumes an active SparkSession bound to `spark`)
data = [
    ("chrisXXborg", 2025, "bio"),
    ("davidXXcross", 2026, "physics"),
    ("sophiaXXraul", 2022, "bio"),
    ("fredXXli", 2025, "physics"),
    ("someXXperson", 2023, "math"),
    ("liXXyao", 2025, "physics"),
]
df = spark.createDataFrame(data, ["student_name", "graduation_year", "major"])

# Call split_col() function to split "student_name" column
new_df = split_col(df, "student_name", "XX", ["student_first_name", "student_last_name"])

# Show the resulting DataFrame
new_df.show()
+---------------+-------+------------------+-----------------+
|graduation_year|  major|student_first_name|student_last_name|
+---------------+-------+------------------+-----------------+
|           2025|    bio|             chris|             borg|
|           2026|physics|             david|            cross|
|           2022|    bio|            sophia|             raul|
|           2025|physics|              fred|               li|
|           2023|   math|              some|           person|
|           2025|physics|                li|              yao|
+---------------+-------+------------------+-----------------+
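
And a quick sanity check of the permissive behavior against a row that lacks the delimiter (hypothetical data, reusing the split_col defined above):

# The row without the "XX" delimiter is filtered out in permissive mode
bad_data = [("chrisXXborg", 2025, "bio"), ("nodelimiter", 2024, "math")]
bad_df = spark.createDataFrame(bad_data, ["student_name", "graduation_year", "major"])

permissive_df = split_col(
    bad_df, "student_name", "XX",
    ["student_first_name", "student_last_name"], mode="permissive",
)
assert permissive_df.count() == 1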

MrPowers added a commit that referenced this issue Oct 7, 2023
* Added files for schema append functionality

* Update test_append_if_schema_identical.py

* Made the changes as per the review comments

* Made the changes as per the review comments & added comments for better readability.

* Made the changes as per the review comments & added comments for better readability.

* Added function to handle the splitting of column.

* Made changes to include split_col function.

* Made changes to default mode as 'strict'.

* Added test cases to test the functionality.

* Additional functionality as per review comments.

---------

Co-authored-by: Matthew Powers <matthewkevinpowers@gmail.com>