Data science is more than just modeling. The complete data science lifecycle also includes data engineering and model deployment. This project offers a simplified yet credible example of all three elements, as implemented using Apache Spark, the Cloudera Data Science Workbench, and JPMML / OpenScoring.
In this project, the ACME corporation is productionizing a connected-house platform. Part of this service requires predicting the occupancy of a room given sensor readings.
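To make the prediction task concrete, here is a minimal sketch of how one sensor reading might be packaged as a scoring request for a REST model server. Everything specific in it is an assumption for illustration: the endpoint URL, the model id `occupancy`, the record id, and the sensor field names are hypothetical, and the `id`/`arguments` payload shape is assumed to match an Openscoring-style evaluation request. Check the model serving module for the actual contract.

```python
import json
import urllib.request

# Hypothetical values: the endpoint, model id, and field names below are
# illustrative assumptions, not taken from this project.
SCORING_URL = "http://localhost:8080/openscoring/model/occupancy"


def build_scoring_request(reading, record_id="reading-001", url=SCORING_URL):
    """Wrap one sensor reading in a JSON POST request for a scoring service."""
    payload = {"id": record_id, "arguments": reading}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One reading from a room's sensors (field names are assumed).
reading = {"Temperature": 23.1, "Humidity": 27.2, "Light": 426.0, "CO2": 704.5}
request = build_scoring_request(reading)

# Sending the request needs a running model server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is built separately from being sent so that the payload can be inspected or tested without a running server.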
This example project includes simplified examples of:
- Data Engineering
  - Ingest
  - Cleaning
- Data Science
  - Modeling
  - Tuning and evaluation
- Model Serving
  - Model management
  - Testing
  - REST API

To run the project, you need:
- Cloudera Data Science Workbench 1.0
- CDH 5.10+ cluster
- Spark 2.1 CSD for CDH
- Apache Maven 3.2+

To continue, review the documentation for each of the three modules, which explains what the module shows and how to run it.