This sub-project provides the boilerplate and tooling to help in developing AWS Glue ETL Jobs.
Make sure you have the following available from your shell (terminal):
poetry
python3.10
If they are missing, you can brew install
them.
Once the requirements above are ready, you create the poetry environment and install the project dependencies. To do that,
make install
We use the container to emulate Glue's runtime environment.
To start the container,
make start
This will start the container and run JupyterLab. If the container is not built yet, it will build first before starting.
Be patient. It will take around 15 minutes to build it for the first time. But succeeding builds will take a lot less.
To start a bash shell from inside this running container, you can do
make shell
Once the container has successfully started, point your browser to http://127.0.0.1:8888/lab.
To run a PySpark shell,
make pyspark
Using the JupyterLab, you can create a PySpark notebook which imports your Glue/PySpark code and run things there.
Please check Python modules already provided in AWS Glue to see if the library that you need is already included in the AWS Glue runtime.
If the desired version of the library is already included, we only need it to our local environment.
poetry add --group=runtime {{library name}}=={{exact version}}
If you need to add a new runtime library, you need to add it in our local environment,
poetry add --group=runtime {{library name}}
and make sure to use the --additional-python-modules
option when
defining your Glue script.
$ poetry add {{library name}}
# To update the docker/requirements.txt file
$ make docker/requirements.txt
# To rebuild the docker container for local testing purposes
$ make start
This project includes git hooks that will execute several checks when making commits. This gives us feedback during development on checks that would otherwise fail in CI.
To install them,
make githooks
After installation, it will do an initial run to check all files.
=== Glue Dev Tools ===
Available commands:
all Show help (default)
install Create virtualenv and install dependencies
outdated Check for outdated dependencies
start Rebuild and start the development container
format Format project source code
lint Check source code for common errors
typecheck Check type annotations
test Run automated tests
coverage Generate test coverage HTML report
githooks Install/update project git hooks
synth Synthesizes and prints the CloudFormation template
diff Compares the specified stack with the deployed stack
deploy Deploy the application including the necessary infrastructure
shell Start a bash shell session inside the container
clean-notebooks Removes output cells from Jupyter notebooks
pyspark Start a Spark shell session
clean Delete generated artifacts