Create abstraction to split into multiple columns easily #85

Open · MrPowers opened this issue Mar 18, 2023 · 3 comments

@MrPowers (Collaborator)

Suppose you have this DataFrame:

+------------+---------------+-------+
|student_name|graduation_year|  major|
+------------+---------------+-------+
| chrisXXborg|           2025|    bio|
|davidXXcross|           2026|physics|
|sophiaXXraul|           2022|    bio|
|    fredXXli|           2025|physics|
|someXXperson|           2023|   math|
|     liXXyao|           2025|physics|
+------------+---------------+-------+

Here is how to clean the DataFrame:

from pyspark.sql.functions import col, split

clean_df = (
    df.withColumn("student_first_name", split(col("student_name"), "XX").getItem(0))
    .withColumn("student_last_name", split(col("student_name"), "XX").getItem(1))
    .drop("student_name")
)

It'd be nice to have a function that would do this automatically:

quinn.split_col(df, col_name="student_name", delimiter="XX", new_col_names=["student_first_name", "student_last_name"])

The current syntax is tedious.

@SemyonSinchenko (Collaborator)

I can do it. Should we add an option for a default value? I mean, if a string does not fit the pattern, should we raise an IndexOutOfBound exception, or should we return a default value?

@MrPowers (Collaborator, Author)

@SemyonSinchenko - we should probably give the user both options. Perhaps we should add a mode parameter. When mode="strict", they'll get an IndexOutOfBound error. When mode="permissive", missing values are populated with null and extra values are ignored entirely. Thoughts?
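
For illustration, here is one way those two modes could behave (a sketch, not the eventual quinn API). Note that getItem returns null for an out-of-range index rather than raising, so a strict mode needs an explicit size check; raise_error requires Spark >= 3.1:

from pyspark.sql.functions import col, lit, raise_error, size, split, when

parts = split(col("student_name"), "XX")

# strict: fail loudly when the split does not yield two parts
strict_last = when(
    size(parts) < 2,
    raise_error(lit("student_name did not split into two parts")),
).otherwise(parts.getItem(1))

# permissive: leave the missing part as null
permissive_last = when(size(parts) > 1, parts.getItem(1))

df.withColumn("student_last_name", strict_last)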

@puneetsharma04 (Contributor)

@SemyonSinchenko & @MrPowers: This seems like an interesting feature and a commonly used transformation in ETL projects.
I would also like to share a piece of code here, if it looks fine to both of you.
Below is code that can handle the scenario:

from pyspark.sql.functions import col, size, split, when

def split_col(df, col_name, delimiter, new_col_names, mode="strict"):
    split_col_expr = split(col(col_name), delimiter)

    if mode == "strict":
        df = df.withColumn(new_col_names[0], split_col_expr.getItem(0))
        df = df.withColumn(new_col_names[1], split_col_expr.getItem(1))
    elif mode == "permissive":
        # Only populate the second column when the split actually produced a
        # second part; rows that never matched the delimiter are dropped.
        # Note: Column has no .size() method, so use functions.size() here.
        df = (
            df.withColumn(new_col_names[0], split_col_expr.getItem(0))
              .withColumn(new_col_names[1], when(size(split_col_expr) > 1, split_col_expr.getItem(1)))
              .filter(col(new_col_names[1]).isNotNull())
        )
    else:
        raise ValueError("Invalid mode: {}".format(mode))

    df = df.drop(col_name)
    return df
  

# Create a Spark DataFrame (assumes an active SparkSession bound to `spark`)
data = [
    ("chrisXXborg", 2025, "bio"),
    ("davidXXcross", 2026, "physics"),
    ("sophiaXXraul", 2022, "bio"),
    ("fredXXli", 2025, "physics"),
    ("someXXperson", 2023, "math"),
    ("liXXyao", 2025, "physics"),
]
df = spark.createDataFrame(data, ["student_name", "graduation_year", "major"])

# Call split_col() function to split "student_name" column
new_df = split_col(df, "student_name", "XX", ["student_first_name", "student_last_name"])

# Show the resulting DataFrame
new_df.show()
+---------------+-------+------------------+-----------------+
|graduation_year|  major|student_first_name|student_last_name|
+---------------+-------+------------------+-----------------+
|           2025|    bio|             chris|             borg|
|           2026|physics|             david|            cross|
|           2022|    bio|            sophia|             raul|
|           2025|physics|              fred|               li|
|           2023|   math|              some|           person|
|           2025|physics|                li|              yao|
+---------------+-------+------------------+-----------------+
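
And a quick sanity check of the permissive behavior against a row that lacks the delimiter (hypothetical data, reusing the split_col defined above):

# The row without the "XX" delimiter is filtered out in permissive mode
bad_data = [("chrisXXborg", 2025, "bio"), ("nodelimiter", 2024, "math")]
bad_df = spark.createDataFrame(bad_data, ["student_name", "graduation_year", "major"])

permissive_df = split_col(
    bad_df, "student_name", "XX",
    ["student_first_name", "student_last_name"], mode="permissive",
)
assert permissive_df.count() == 1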

MrPowers added a commit that referenced this issue Oct 7, 2023
* Added files for schema append functionality

* Update test_append_if_schema_identical.py

* Made the changes as per the review comments

* Made the changes as per the review comments & added comments for better readability.

* Made the changes as per the review comments & added comments for better readability.

* Added function to handle the splitting of column.

* Made changes to include split_col function.

* Made changes to default mode as 'strict'.

* Added test cases to test the functionality.

* Additional functionality as per review comments.

---------

Co-authored-by: Matthew Powers <matthewkevinpowers@gmail.com>