This project builds two separate data pipelines: one using Apache Airflow, which loads data into a PostgreSQL database, and one using Apache Kafka, which loads data into a SQL Server database.
Steps to run Apache Airflow code:
- Install Apache Airflow
- Copy the .py file in the Airflow folder, which contains all the required definitions of DAGs (Directed Acyclic Graphs) and tasks, to the ~/airflow/dags folder (Airflow's default DAGs directory under its home folder). For easy access, add the path to a variable like AIRFLOW_DAGS with the command export AIRFLOW_DAGS=~/airflow/dags (note: no $ prefix when assigning a shell variable). A minimal sketch of such a DAG file follows this list.
- That's it: since the Python file is in the dags folder, Airflow will recognize it and display it in the Airflow UI.
- Before starting the UI, initialize Airflow's metadata database (where it tracks DAGs and their runs) with the command airflow db init, then start the scheduler with the command airflow scheduler.
- Once the scheduler is running, start the web server that serves the UI with the command airflow webserver --port 8080.
- You can then access the UI at http://localhost:8080
- You can start, stop, and manage the required DAGs in the UI directly :)
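
For reference, here is a minimal sketch of what such a DAG file might look like. The DAG id, schedule, and task callables below are hypothetical placeholders, not the project's actual pipeline definitions:

```python
# dags/example_pipeline.py -- a minimal sketch, not this project's actual DAG.
# The DAG id, schedule, and task logic are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: fetch the source data here.
    print("extracting data")


def load():
    # Placeholder: write the data to PostgreSQL here.
    print("loading data into PostgreSQL")


with DAG(
    dag_id="example_pipeline",      # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task       # run extract before load
```

The >> operator declares the task ordering, which is what makes the graph directed and acyclic.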
Steps to run Apache Kafka code:
- Download Apache Kafka with the command wget https://archive.apache.org/dist/kafka/2.8.0/kafka_2.12-2.8.0.tgz and extract the archive with tar -xzf kafka_2.12-2.8.0.tgz.
- Start ZooKeeper (a distributed coordination service that manages the brokers in the Kafka architecture) and then Kafka, from inside the kafka_2.12-2.8.0 directory, in separate terminals:
  a) bin/zookeeper-server-start.sh config/zookeeper.properties
  b) bin/kafka-server-start.sh config/server.properties
- Create a topic to which messages will be sent, since the brokers organize data into topic partitions. A topic can be created from the kafka_2.12-2.8.0 directory with the command bin/kafka-topics.sh --create --topic example_topic_name --bootstrap-server localhost:9092
- Run the producer.py and consumer.py files in the Kafka folder in separate terminal tabs; minimal sketches of both are shown after this list.
- Once both are running, the streaming data will be loaded into the SQL Server database, where it can be analyzed in SQL Server Management Studio.
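
For reference, a minimal sketch of a producer in the spirit of producer.py, assuming the kafka-python library; the topic name and message payload are hypothetical placeholders:

```python
# A minimal sketch of a Kafka producer, assuming the kafka-python library;
# the topic name and message payload are hypothetical placeholders.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send a stream of example messages to the topic created earlier.
for i in range(100):
    message = {"event_id": i, "timestamp": time.time()}  # placeholder payload
    producer.send("example_topic_name", value=message)
    time.sleep(1)

producer.flush()  # make sure all buffered messages reach the broker
```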
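And a corresponding sketch of a consumer in the spirit of consumer.py, which reads from the topic and inserts each message into SQL Server via pyodbc; the connection string, table, and column names are hypothetical placeholders:

```python
# A minimal sketch of a Kafka consumer that loads messages into SQL Server,
# assuming the kafka-python and pyodbc libraries; the connection string,
# table name, and columns are hypothetical placeholders.
import json

import pyodbc
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "example_topic_name",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Hypothetical SQL Server connection; adjust server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=streaming_db;UID=user;PWD=password"
)
cursor = conn.cursor()

for record in consumer:
    event = record.value
    # Insert each streamed message into a hypothetical events table.
    cursor.execute(
        "INSERT INTO events (event_id, event_timestamp) VALUES (?, ?)",
        event["event_id"],
        event["timestamp"],
    )
    conn.commit()
```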
The SQL script file contains the basic analysis performed on the streaming data loaded into the database.
Happy Coding :)