This repository demonstrates the CI/CD pipeline process for a team of data engineers collaboratively working on Python notebooks. The pipeline facilitates the seamless transition of notebooks from a development environment, through testing, and into the production Databricks workspace. It automates deployment, ensures code quality, and integrates testing, allowing multiple engineers to push their changes with confidence.
The pipeline is implemented with Azure DevOps and PowerShell and automates three stages:
- Development: Python notebooks are pushed by multiple engineers to the development workspace in Databricks.
- Testing: The notebooks are validated in the dev environment to ensure correctness and functionality.
- Production: Once testing passes, the notebooks are automatically deployed to the production Databricks workspace.
The demo includes:
- YAML pipelines for managing deployments.
- A PowerShell script for securely handling Databricks tokens and deploying notebooks.
- Error handling and logging to ensure transparency and debuggability.
The `cicd-pipelines.yml` file defines the overall CI/CD pipeline, with stages for:
- Deploying to the Development Environment: Deploys Python notebooks from the repository to the Databricks development workspace.
- Running Tests: Ensures the notebooks work as expected in the development environment.
- Deploying to the Production Environment: Deploys the tested notebooks to the production Databricks workspace.
Key Features:
- Conditional triggers to ensure stages are only run when needed.
- Artifact handling for better traceability of deployments.
- Error handling and retry logic to ensure resilience.
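For orientation, here is a minimal sketch of what such a multi-stage definition can look like; the stage names, pool image, and template parameters are illustrative assumptions, not the exact contents of `cicd-pipelines.yml`:

```yaml
# Illustrative skeleton of the three-stage pipeline (names are assumptions).
trigger:
  branches:
    include:
      - main

stages:
  - stage: Deploy_Dev
    displayName: Deploy to Development
    jobs:
      - job: Deploy
        pool:
          vmImage: ubuntu-latest
        steps:
          - template: deploy-notebooks.yml
            parameters:
              environment: dev

  - stage: Run_Tests
    displayName: Run Tests
    dependsOn: Deploy_Dev
    condition: succeeded()
    jobs:
      - job: Test
        pool:
          vmImage: ubuntu-latest
        steps:
          - pwsh: ./run-tests.ps1   # hypothetical test entry point
            displayName: Validate notebooks in dev

  - stage: Deploy_Prod
    displayName: Deploy to Production
    dependsOn: Run_Tests
    condition: succeeded()
    jobs:
      - job: Deploy
        pool:
          vmImage: ubuntu-latest
        steps:
          - template: deploy-notebooks.yml
            parameters:
              environment: prod
```

The `dependsOn`/`condition: succeeded()` pairing is what enforces the dev → test → prod gating described below.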
The `deploy-notebooks.yml` file handles the deployment of notebooks to the Databricks workspace via Azure CLI and PowerShell scripts. It:
- Retrieves Databricks workspace information and tokens securely.
- Deploys notebooks from a specified directory to the target workspace.
- Supports both development and production environments.
Key Features:
- Secure token retrieval from Azure Key Vault.
- Deployment logs and error tracking for troubleshooting.
- Modular and reusable structure for handling different environments.
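As a rough illustration of the core of that step, the sketch below imports notebooks through the Databricks Workspace API (`/api/2.0/workspace/import`); the parameter names and paths are placeholders rather than the actual contents of the repo's scripts:

```powershell
# Sketch of a notebook deployment step (parameter names and paths are
# placeholders, not the exact contents of deploy-notebooks.yml).
param(
    [Parameter(Mandatory)] [string]$WorkspaceUrl,   # e.g. https://adb-1234.5.azuredatabricks.net
    [Parameter(Mandatory)] [string]$BearerToken,
    [Parameter(Mandatory)] [string]$NotebooksPath,  # local folder of .py notebooks
    [Parameter(Mandatory)] [string]$TargetDir       # e.g. /Shared/my-project
)

$headers = @{ Authorization = "Bearer $BearerToken" }

foreach ($file in Get-ChildItem -Path $NotebooksPath -Filter *.py) {
    # The Workspace API expects notebook source as base64-encoded content.
    $content = [Convert]::ToBase64String([IO.File]::ReadAllBytes($file.FullName))
    $body = @{
        path      = "$TargetDir/$($file.BaseName)"
        format    = "SOURCE"
        language  = "PYTHON"
        content   = $content
        overwrite = $true
    } | ConvertTo-Json

    try {
        Invoke-RestMethod -Method Post -Uri "$WorkspaceUrl/api/2.0/workspace/import" `
            -Headers $headers -Body $body -ContentType "application/json"
        Write-Host "Deployed $($file.Name) to $TargetDir"
    }
    catch {
        Write-Error "Failed to deploy $($file.Name): $_"
        throw   # fail the pipeline step so the error surfaces in the run
    }
}
```

Failing fast on a bad import is what lets the stage conditions stop a broken deployment from reaching production.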
The `DatabricksToken.ps1` PowerShell script securely retrieves the Databricks bearer token from Azure Key Vault and handles the deployment of notebooks into the Databricks environment.
Key Features:
- Secure token retrieval, eliminating the need for hardcoded credentials.
- Modularized code for easy maintenance.
- Error handling and logging to capture any issues during deployment.
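A minimal sketch of that retrieval pattern, assuming an authenticated Azure CLI session; the vault and secret names are placeholders, not what `DatabricksToken.ps1` actually uses:

```powershell
# Sketch of secure token retrieval (vault and secret names are placeholders).
param(
    [string]$VaultName  = "my-keyvault",        # placeholder
    [string]$SecretName = "databricks-token"    # placeholder
)

try {
    # Fetch the secret value; requires an authenticated Azure CLI session.
    $token = az keyvault secret show `
        --vault-name $VaultName `
        --name $SecretName `
        --query value -o tsv

    if (-not $token) { throw "Secret '$SecretName' came back empty." }

    # Mark the value as a secret so Azure DevOps masks it in logs,
    # and expose it to later steps as DATABRICKS_TOKEN.
    Write-Host "##vso[task.setvariable variable=DATABRICKS_TOKEN;issecret=true]$token"
    Write-Host "Token retrieved from Key Vault '$VaultName'."
}
catch {
    Write-Error "Token retrieval failed: $_"
    exit 1
}
```

Setting the variable with `issecret=true` keeps Azure DevOps from echoing the token in pipeline logs.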
The pipeline is triggered whenever changes are made to the `main` branch.

- Deploy to Development:
  - The notebooks are deployed to the Databricks development workspace.
  - The deployment is logged, and errors are caught and displayed.
- Run Tests:
  - After deployment, the notebooks are validated with tests to ensure they function correctly.
  - The pipeline proceeds to the next stage only if the tests succeed.
- Deploy to Production:
  - The notebooks are deployed to the Databricks production workspace.
  - This stage runs only if the previous stages succeed.

Logs and artifacts (such as notebooks) are published at the end of the pipeline for debugging and verification.
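That final publishing step might look like the following sketch; the paths and artifact names are assumptions, not the repo's exact configuration:

```yaml
# Illustrative final steps: publish logs and notebooks as build artifacts
# so each run can be inspected later (paths are placeholders).
- task: PublishBuildArtifacts@1
  displayName: Publish deployment logs
  condition: always()          # publish even when an earlier step failed
  inputs:
    PathtoPublish: $(Build.ArtifactStagingDirectory)/logs
    ArtifactName: deployment-logs

- task: PublishBuildArtifacts@1
  displayName: Publish deployed notebooks
  inputs:
    PathtoPublish: $(Build.SourcesDirectory)/notebooks
    ArtifactName: notebooks
```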
Before running the pipeline, ensure the following prerequisites are met:
- Azure DevOps: Set up with access to the required environments (Dev and Prod).
- Azure Databricks: Both development and production workspaces configured.
- Azure Key Vault: Holds the Databricks token securely for authentication.
- PowerShell: Available for executing scripts within the pipeline.
- Azure CLI: Installed and configured for managing Azure resources.
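A few quick commands to sanity-check the tooling on the agent or your local machine:

```powershell
az --version                 # Azure CLI installed?
az account show              # logged in to the right subscription?
$PSVersionTable.PSVersion    # PowerShell version in use
```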
- Add the YAML files (`cicd-pipelines.yml` and `deploy-notebooks.yml`) to your Azure DevOps pipeline.
- Ensure that variable groups for both the development and production environments are properly configured (see the sketch after this list).
- Modify the resource group, environment names, and paths as needed.
- Store the Databricks bearer token in an Azure Key Vault.
- Update the `DatabricksToken.ps1` script to reference your Key Vault and token name.
- Push changes to the `main` branch to trigger the pipeline.
- Monitor the pipeline stages to verify successful deployment and testing.
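The variable groups referenced above are typically wired in near the top of the pipeline definition; the group names here are placeholders:

```yaml
# Illustrative variable-group wiring (group names are placeholders).
variables:
  - group: databricks-dev    # dev workspace URL, resource group, etc.
  - group: databricks-prod   # production equivalents
```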
- Notebook Paths: Modify the `notebooksPath` in the YAML files to change the location of the notebooks.
- Testing: Add custom testing scripts in the `Run_Tests` stage to validate the notebooks for your specific use case (a sketch follows this list).
- Error Handling: Customize the error-handling logic in the YAML and PowerShell scripts as needed for your environment.
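As a starting point for the Testing item, here is a deliberately simple placeholder check for the `Run_Tests` stage; the validation logic is an assumption to be replaced with real functional tests:

```yaml
# Placeholder step for the Run_Tests stage: fail the run if any notebook
# under $(notebooksPath) is empty. Swap in real functional tests here.
- pwsh: |
    $empty = Get-ChildItem "$(notebooksPath)" -Filter *.py |
             Where-Object { $_.Length -eq 0 }
    if ($empty) { throw "Empty notebooks found: $($empty.Name -join ', ')" }
  displayName: Validate notebooks
```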
- Version Control: Ensure that all notebook changes are versioned using Git to avoid conflicts and manage multiple engineers’ contributions.
- Environment Separation: Keep development and production environments isolated to avoid unintended impacts.
- Secure Token Management: Use Azure Key Vault or another secure secret management solution for all sensitive credentials.
- Parallelization: Future versions could parallelize notebook deployment and testing to improve efficiency.
- Enhanced Logging: Integrate more detailed logging and monitoring with Azure Monitor or another logging solution.
- Notebook Linting: Add automated notebook linting as part of the testing phase for improved code quality.
This demo provides a scalable and secure CI/CD pipeline for Python notebooks in Azure Databricks, ensuring quality and consistency across multiple engineers’ contributions. The modular design and error-handling capabilities make it adaptable to a variety of team structures and project requirements.