
Capstone Project mlzoomcamp Image Classification


Repo contains the following:

  • Extra folder with the project images
  • Extra_models folder with the best models in .h5 and .tflite format (it is advised that you train the model from scratch rather than copy these)
  • Dockerfile for building the docker image
  • Documentation with code description
  • README.md with
    • Description of the problem
    • Instructions on how to run the project
  • create_directories.py, which splits the data into train, val, and test folders
  • Dependencies
  • Script lambda-function.py for predictions, formatted for deployment on AWS Lambda
  • notebook.ipynb, a Jupyter Notebook with the data analysis and models
  • Script test.py for testing
  • test.json, where you can copy any JSON event to test
  • Script train.py
    • Training the final model
  • Instructions for Production deployment
    • Video or image of how you interact with the deployed service

The dataset is from Kaggle; instructions on how to download it are given below.

Description of the problem

Written with the help of ChatGPT

Have you ever been to the beach and found yourself wanting to collect either shells or pebbles, but not sure which was which? Or maybe you're in the oil and gas industry and need a quick and accurate way to classify different geological materials? Well, I have the solution for you!

Introducing the Shells or Pebbles dataset – a collection of images specifically designed for binary classification tasks. With this dataset, you'll be able to easily determine whether a certain image is a shell or a pebble.

But the usefulness of this dataset doesn't stop there. In the oil and gas industry, accurately identifying and classifying different materials, including rocks and shells, is crucial for exploration and production activities. By understanding the composition and structure of the earth's layers, geologists can make informed decisions about where to drill for oil and gas.

And for those concerned about the environment, this dataset can also be used to study the impacts of climate change on coastal ecosystems. By analyzing the distribution and abundance of shells and pebbles on beaches, scientists can gain valuable insights into the health of marine life and the effects of human activities.

So whether you're an artist looking to create a beach-themed project or a scientist studying the earth's geological makeup, the Shells or Pebbles dataset has something to offer. With its reliable and accurate classification capabilities, this dataset can help you make better informed decisions and better understand the world around you.

Project Objectives

Potential objectives for this project include:

  • Develop a model that performs well on a binary classification problem.
  • Tune the model's hyperparameters to get the best possible accuracy.
    • Learning rate and droprate were used as the main hyperparameters. Data augmentation was also added but, due to lack of time and compute resources, it was not tuned further. The size of the inner layers, the image size, and other parameters can also be changed by the user.
  • Use callbacks to save the best model weights and to end training if the validation accuracy does not improve after a certain number of epochs.
  • Utilize TensorBoard to visualize the training process and find trends or patterns in the data (I didn't make use of this in the end).
  • Use the trained model to accurately categorize new photos as Shells or Pebbles.
  • Deploy the trained model in a production environment.
  • Create comprehensive Documentation for the project, including a detailed description of the model architecture, training procedure and deployment.
  • Display the project's outcomes in a more professional way.

I selected the parameters and architecture that seemed most likely to achieve good accuracy. It is possible the architecture is not well suited, or that other parameters fit this problem better; that would require more investigation of the dataset and of the model design. The sketch below illustrates the kind of hyperparameter sweep described above.
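
As a rough illustration, here is a minimal sketch of a learning-rate/droprate sweep under the assumptions above. The model builder, layer sizes, image size, and grids are illustrative assumptions, not the exact notebook code:

import tensorflow as tf

def make_model(learning_rate, droprate, img_size=150):
    # Hypothetical CNN builder; the layer sizes and img_size are assumptions
    model = tf.keras.Sequential([
        tf.keras.layers.Input((img_size, img_size, 3)),
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(droprate),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Sweep the two main hyperparameters; keep whichever scores best on validation
for lr in (0.0001, 0.001, 0.01):
    for droprate in (0.2, 0.5, 0.8):
        model = make_model(lr, droprate)
        # model.fit(train_ds, validation_data=val_ds, epochs=10)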

Local deployment

All development was done on Windows with conda.

You can create an environment

conda env create -f env_project.yml
conda activate capstone

Download repo

git clone https://github.com/dimzachar/capstone_mlzoomcamp.git

Notes:

  • You can git clone the repo in Saturn Cloud instead of running it on your own PC.
  • Just make sure you have set it up, see here. Create secrets for Kaggle in order to download the data.
  • You don't need pipenv if you use Saturn Cloud.
  • See instructions below for more.
  • You can access the environment here: Run in Saturn Cloud

For the virtual environment, I utilized pipenv.

If you want to use the same venv as me, navigate to the folder with the given files, then install pipenv and the dependencies:

cd capstone_mlzoomcamp
pip install pipenv
pipenv shell
pipenv install numpy pandas seaborn jupyter plotly scipy tensorflow==2.9.1 scikit-learn==1.1.3 tensorflow-gpu

Before you begin, you need to download the data. You can either download it manually from Kaggle or use the Kaggle CLI with your API keys (you need to download kaggle.json from your profile and place it in the .kaggle folder inside your home directory) and extract the files

kaggle config set -n api.username -v YOUR_USERNAME
kaggle config set -n api.key -v YOUR_API_KEY

kaggle datasets download -d vencerlanz09/shells-or-pebbles-an-image-classification-dataset -p Images


If you run it on Saturn Cloud make sure you are inside /tensorflow/capstone_mlzoomcamp.

This will download the zip file into the folder named Images. Then unzip it inside this folder manually or using Git Bash, and delete the zip file. Since you are inside the capstone_mlzoomcamp folder, run

unzip -q Images/shells-or-pebbles-an-image-classification-dataset.zip -d Images
rm Images/shells-or-pebbles-an-image-classification-dataset.zip

Folder structure should now look like this

Images
├───Pebbles
└───Shells

Now run the create_directories.py script, which will split the images into train, val, and test folders (60%/20%/20%) with labels

pipenv run python create_directories.py
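
For reference, here is a minimal sketch of how such a 60/20/20 split can be implemented; the actual create_directories.py may differ in details such as the random seed and whether the original folders are removed:

import os
import random
import shutil

random.seed(42)  # assumed seed, for a reproducible split

for label in ("Pebbles", "Shells"):
    src_dir = os.path.join("Images", label)
    files = os.listdir(src_dir)
    random.shuffle(files)
    n = len(files)
    splits = {
        "train": files[:int(0.6 * n)],
        "val": files[int(0.6 * n):int(0.8 * n)],
        "test": files[int(0.8 * n):],
    }
    # Move each slice into Images/<split>/<label>
    for split, names in splits.items():
        dest = os.path.join("Images", split, label)
        os.makedirs(dest, exist_ok=True)
        for name in names:
            shutil.move(os.path.join(src_dir, name), dest)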

The final structure before you train the model should look like this

Images
├───test
│   ├───Pebbles
│   └───Shells
├───train
│   ├───Pebbles
│   └───Shells
└───val
    ├───Pebbles
    └───Shells

To open notebook.ipynb and see what is inside (optional: running the whole thing would probably take at least 2 hours), run Jupyter

pipenv run jupyter notebook

For the evaluation you need to run train.py. This will run the train function and construct an ML model with the best parameters, which will be saved in the checkpoints folder (created automatically). The model with the highest validation accuracy will then be loaded, evaluated (it will return some metrics), and converted to a TensorFlow Lite model in order to deploy it in the cloud later. Note: if you run it on a CPU it will take some time (at least 20 minutes); it is a good idea to use a GPU to speed up the training process.

pipenv run python train.py
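
For reference, here is a sketch of the checkpoint/early-stopping callbacks and the TensorFlow Lite conversion described above; the file names, patience, and epoch count are assumptions, not the exact train.py:

import tensorflow as tf

callbacks = [
    # Save the model with the highest validation accuracy into checkpoints/
    tf.keras.callbacks.ModelCheckpoint(
        "checkpoints/model_{epoch:02d}_{val_accuracy:.3f}.h5",
        monitor="val_accuracy",
        save_best_only=True,
    ),
    # Stop training when validation accuracy stops improving
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5),
]
# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=callbacks)

# Load the best checkpoint and convert it to TensorFlow Lite
model = tf.keras.models.load_model("checkpoints/best_model.h5")  # assumed path
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("model.tflite", "wb") as f:
    f.write(converter.convert())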

Note:

  • Ignore any warnings and wait until you see the message Finished. At the end you will have a model.tflite file in the directory. You can also find the best model in .h5 format inside the checkpoints folder.
  • If you don't want to run train.py (even though you should), there are pretrained files in the Extra_models folder in .h5 and .tflite format. They are provided as-is; I take no responsibility for whether they work (I guess they do).

Production deployment

Docker container

To deploy the model locally, follow these steps:

  • Install Docker on your system. Instructions can be found here.
  • Build the Docker image for the model and run the container using the following commands:
docker build -t model .
docker run -it --rm -p 8080:8080 model:latest

then run

pipenv run python test.py

to test it locally using a URL.
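
As a rough sketch, such a test script can be as small as the following; the endpoint is the standard local invocation path exposed by Lambda container images, and the actual test.py may differ:

import requests

# Local endpoint exposed by the Lambda runtime interface emulator in the container
url = "http://localhost:8080/2015-03-31/functions/function/invocations"
event = {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/Pebbleswithquarzite.jpg/1280px-Pebbleswithquarzite.jpg"}
print(requests.post(url, json=event).json())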

The function returns a dictionary with a single key-value pair, where the key is the class label and the value is the confidence of that label. The label is "Shells" if the raw prediction is greater than or equal to 0.5 and "Pebbles" otherwise; since the reported value is the confidence of the predicted class, it is always greater than or equal to 0.5.

For example, if the raw prediction is 0.7, the result is {"Shells": 0.7}; if the raw prediction is 0.3, the result is {"Pebbles": 0.7}.
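
For reference, here is a minimal sketch of how lambda-function.py might implement this logic; the input size, preprocessing, and model file name are assumptions:

import numpy as np
import tflite_runtime.interpreter as tflite
from io import BytesIO
from urllib.request import urlopen
from PIL import Image

interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

def lambda_handler(event, context):
    # Download and preprocess the image referenced in the event
    with urlopen(event["url"]) as resp:
        img = Image.open(BytesIO(resp.read())).convert("RGB").resize((150, 150))  # assumed size
    x = np.array(img, dtype=np.float32)[None] / 255.0  # assumed rescaling, as in training

    interpreter.set_tensor(input_index, x)
    interpreter.invoke()
    pred = float(interpreter.get_tensor(output_index)[0, 0])

    # Report the confidence of the predicted class, so the value is always >= 0.5
    return {"Shells": pred} if pred >= 0.5 else {"Pebbles": 1 - pred}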


Cloud deployment

In order to deploy it to AWS, we push the Docker image. Make sure you have an account and install the AWS CLI. Instructions can be found here.

First, create a repository on Amazon Elastic Container Registry (ECR) with an appropriate name.

You will find the push commands there to tag and push the latest Docker image,

which you find on your system with

docker images

Next, we publish to AWS Lambda.

Go to AWS Lambda, create a function, select Container image and add a name. Then browse for your image and finally hit Create function.

Go to Configuration, change the timeout to 30 seconds and increase the memory (e.g. 1024 MB).

Test the function by changing the event JSON.

Expose the Lambda function using API Gateway. Go to API Gateway, select REST API and build a new API.

Create a new API and give it a name.

Create a new resource and name it predict.

Create a new method, select POST and confirm. Choose Lambda function as the integration type, enter the name of the function you created, and hit Save.

Hit Test and add a JSON document as the request body

 {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/Pebbleswithquarzite.jpg/1280px-Pebbleswithquarzite.jpg" }

or another image.

Under Actions hit Deploy API, select New Stage and give it a name.


Copy the invoke URL, put it in your test.py file and run it.

Make sure you remove or delete all the AWS resources after testing, if necessary.

Video of cloud deployment

shells.mp4

That's a wrap!

What else can I do?

  • Send a pull request.
  • If you liked this project, give a ⭐.

Connect with me:
