The ETL (Extract-Transform-Load) process is a key component of many data management operations, including moving data and transforming it from one format to another. spark-etl provides a distributed solution to support these operations effectively.
spark-etl is a Scala-based project built on Apache Spark, so it is scalable and distributed. It can process data from N sources to N targets. The project structure is outlined below; a minimal end-to-end sketch follows the lists.
Extract
- Files (JSON, CSV)
- SQL databases
- NoSQL databases
- Key-value stores
- APIs
- Streams
Transform
- Row to JSON
- Row to CSV
- JSON to Row
- Change records
- Merge records
Load
- Files (JSON, CSV)
- SQL databases
- NoSQL databases
- Key-value stores
- APIs
- Streams
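As a rough illustration of how these three stages fit together, the sketch below uses plain Spark APIs rather than spark-etl's own abstractions (which may differ): it extracts rows from a CSV file, transforms each Row to JSON, and loads the result as text files. The file paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-etl-sketch").getOrCreate()

    // Extract: read rows from a CSV source (path is a placeholder)
    val extracted = spark.read.option("header", "true").csv("/data/input/events.csv")

    // Transform: "Row to JSON" - serialize every column of each Row into a JSON string
    val transformed = extracted
      .select(to_json(struct(extracted.columns.map(col): _*)).as("value"))

    // Load: write the JSON records out as text files (one JSON document per line)
    transformed.write.mode("overwrite").text("/data/output/events-json")

    spark.stop()
  }
}
```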
Pros
- Parallel ETL at the cluster level
- Data synchronisation
- Open source
Suppose we want to extract data from multiple sources such as MySQL and CSV files. While extracting, we also want to filter records and merge certain fields/tables. In the transform layer, we want to run an SQL statement, and then write the transformed data to multiple targets such as S3 and Redshift.
spark-etl is the easiest way to handle this scenario!
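Below is a minimal sketch of this scenario, again written against plain Spark APIs rather than spark-etl's own configuration. The connection details, table and column names, and bucket paths are placeholders, and writing to Redshift over JDBC assumes the Redshift JDBC driver is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object ScenarioSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mysql-csv-to-s3-redshift").getOrCreate()

    // Extract: orders from MySQL via JDBC (connection details are placeholders)
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/shop")
      .option("dbtable", "orders")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("MYSQL_PASSWORD", ""))
      .load()

    // Extract: customers from CSV files
    val customers = spark.read.option("header", "true").csv("/data/customers/*.csv")

    // Transform: filter, merge the two sources, and run an SQL statement
    orders.filter("status = 'COMPLETED'").createOrReplaceTempView("orders")
    customers.createOrReplaceTempView("customers")
    val result = spark.sql(
      """SELECT c.customer_id, c.country, SUM(o.amount) AS total_amount
        |FROM orders o JOIN customers c ON o.customer_id = c.customer_id
        |GROUP BY c.customer_id, c.country""".stripMargin)

    // Load: write to S3 as Parquet and to Redshift via JDBC
    result.write.mode("overwrite").parquet("s3a://my-bucket/reports/order-totals/")
    result.write.format("jdbc")
      .option("url", "jdbc:redshift://redshift-host:5439/warehouse")
      .option("dbtable", "order_totals")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("REDSHIFT_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()
  }
}
```

Submitting such a job with spark-submit distributes the extract, transform, and load work across the cluster, which is where the cluster-level parallelism listed under Pros comes from.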
- Scala - Functional Programming Language
- ScalaTest - ScalaTest is a testing tool in the Scala ecosystem.
- wartremover - WartRemover is a flexible Scala code linting tool.
- scalastyle - Scalastyle examines Scala code and indicates potential problems with it.
- scoverage - Scoverage is a code coverage tool for Scala that offers statement and branch coverage.
- Apache Spark - Apache Spark is a fast and general engine for large-scale data processing.
- travis-ci - Travis CI is a hosted, distributed continuous integration service used to build and test software projects.
- coveralls - Coveralls is a web service to help you track your code coverage over time and ensure that all your new code is fully covered.
Building spark-etl requires sbt. Assemble the project with:
- sbt clean assembly
Want to contribute? Great! Come say "Hello" on Gitter.
- Scalafmt integration
- ETL Design