Create an Exercise that covers remaining Spark Functions Topics #39

Open
13 tasks
kelseymok opened this issue Apr 6, 2023 · 0 comments

kelseymok commented Apr 6, 2023

With the new exercises, we're not covering some of the more interesting Spark functions.

We'll create a new exercise for "Additional Spark Functions" (in the small-exercises repo) to cover the following:

DataFrame Cleaning

  • na.drop
  • na.fill
  • replace
  • coalesce

DataFrame Queries

  • select + array_contains

Aggregations

  • stddev
  • variance
  • mean

String Operations

  • regexp_replace
  • regexp_extract

For everything else

  • UDF

CFRs

  • All functions should have a solution (in a separate Solutions notebook)
  • All functions should have a link to the documentation for PySpark

Notes

We might be able to reuse some of the examples from the Wrangling with Spark exercise, but improve on them. If there's an opportunity to use our domain data, that would be best, but we might need to dirty up some data and save it as a CSV (or similar) in the repo so that the exercise can pull it in.

Open Questions

Are these valuable?

  • cache
  • unpersist
  • createOrReplaceGlobalTempView
  • createOrReplaceTempView

Should all functions have a test? Perhaps we can do it later?

syed-tw self-assigned this Sep 16, 2023