Repo contains the following:
- `Extra` folder with the project images
- `Extra_models` folder with the best models in `.h5` and `.tflite` format (it is advised that you train the model from scratch and not copy these)
- `Dockerfile` for building the Docker image
- Documentation with code description
- `README.md` with:
  - Description of the problem
  - Instructions on how to run the project
- `create_directories.py` script that splits the data into train, val and test folders
- Dependencies
- Script `lambda-function.py` for predictions, formatted for deployment on Amazon Web Services' Lambda
- `notebook.ipynb`, a Jupyter Notebook with the data analysis and models
- Script `test.py` for testing
- `test.json`, where you can copy any JSON event to test
- Script `train.py` for training the final model
- Instructions for production deployment
- Video or image of how you interact with the deployed service
The dataset is from Kaggle; instructions on how to download it are given below.
Written with help of #ChatGPT
Have you ever been to the beach and found yourself wanting to collect either shells or pebbles, but not sure which was which? Or maybe you're in the oil and gas industry and need a quick and accurate way to classify different geological materials? Well, I have the solution for you!
Introducing the Shells or Pebbles dataset – a collection of images specifically designed for binary classification tasks. With this dataset, you'll be able to easily determine whether a certain image is a shell or a pebble.
But the usefulness of this dataset doesn't stop there. In the oil and gas industry, accurately identifying and classifying different materials, including rocks and shells, is crucial for exploration and production activities. By understanding the composition and structure of the earth's layers, geologists can make informed decisions about where to drill for oil and gas.
And for those concerned about the environment, this dataset can also be used to study the impacts of climate change on coastal ecosystems. By analyzing the distribution and abundance of shells and pebbles on beaches, scientists can gain valuable insights into the health of marine life and the effects of human activities.
So whether you're an artist looking to create a beach-themed project or a scientist studying the earth's geological makeup, the Shells or Pebbles dataset has something to offer. With its reliable and accurate classification capabilities, this dataset can help you make better informed decisions and better understand the world around you.
Potential objectives for this project include:
- Develop a model that performs well on a binary classification problem.
- Tune the model's hyperparameters to get the best possible accuracy.
- I used the learning rate and dropout rate as the main hyperparameters. I also added data augmentation but, due to limited time and compute resources, did not tune it further. The size of the inner layers, the image size and other parameters can also be changed by the user.
- Use callbacks to save the best model weights and to stop training if the validation accuracy does not improve after a certain number of epochs.
- Utilize TensorBoard to visualize the training process and find trends or patterns in the data (I didn't make use of this in the end).
- Use the trained model to accurately categorize new photos as Shells or Pebbles.
- Deploy the trained model in a production environment.
- Create comprehensive Documentation for the project, including a detailed description of the model architecture, training procedure and deployment.
- Display the project's outcomes in a more professional way.
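As an illustration of the callbacks objective above, this is roughly how checkpointing and early stopping can be wired up in Keras. It is a sketch, not the repo's exact code: the checkpoint path and the patience value are assumptions.

```python
import tensorflow as tf

# Save the weights with the best validation accuracy seen so far.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "checkpoints/best_model.h5",  # path is an assumption
    monitor="val_accuracy",
    save_best_only=True,
)

# Stop training if validation accuracy does not improve for `patience` epochs.
early_stop_cb = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=10,  # illustrative value
    restore_best_weights=True,
)

# model.fit(train_ds, validation_data=val_ds, epochs=50,
#           callbacks=[checkpoint_cb, early_stop_cb])
```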
I selected the parameters and architecture that gave the best accuracy in my experiments. It is possible that this architecture is not the most suitable one, or that other parameters fit this problem better; answering that would require further investigation of the dataset and of the model design.
All development was done on Windows with conda.
You can create an environment:

```
conda env create -f env_project.yml
conda activate capstone
```

Clone the repo:

```
git clone https://github.com/dimzachar/capstone_mlzoomcamp.git
```
Notes:
- You can git clone the repo in Saturn Cloud instead of running it on your own PC.
- Just make sure you have set it up, see here. Create secrets for Kaggle in order to download the data.
- You don't need pipenv if you use Saturn Cloud.
- See instructions below for more.
- You can access the environment here
For the virtual environment, I utilized pipenv.
If you want to use the same virtual environment as me, navigate to the folder with the given files, then install pipenv and the dependencies:

```
cd capstone_mlzoomcamp
pip install pipenv
pipenv shell
pipenv install numpy pandas seaborn jupyter plotly scipy tensorflow==2.9.1 scikit-learn==1.1.3 tensorflow-gpu
```
Before you begin, you need to download the data. You can either download it manually from Kaggle or use the Kaggle CLI with your API keys (you need to download `kaggle.json` from your profile and paste it in `PATH/.kaggle`) and extract the files:

```
kaggle config set -n api.username -v YOUR_USERNAME
kaggle config set -n api.key -v YOUR_API_KEY
kaggle datasets download -d vencerlanz09/shells-or-pebbles-an-image-classification-dataset -p Images
```
If you run it on Saturn Cloud, make sure you are inside `/tensorflow/capstone_mlzoomcamp`.
This will download the zip file into a folder named `Images`. Then unzip it inside this folder (manually or using Git Bash) and delete the zip file. Since you are inside the `capstone_mlzoomcamp` folder, run:

```
unzip -q Images/shells-or-pebbles-an-image-classification-dataset.zip -d Images
rm Images/shells-or-pebbles-an-image-classification-dataset.zip
```
The folder structure should now look like this:

```
Images
├───Pebbles
└───Shells
```
Now run the `create_directories.py` script, which will split the images into train, val and test folders (60%/20%/20%) with labels:

```
pipenv run python create_directories.py
```
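For reference, the split that `create_directories.py` performs can be sketched with the standard library alone. This is an illustration, not the script's actual code; the function name and the fixed seed are assumptions.

```python
import os
import random
import shutil

def split_class(src_dir, dest_root, class_name, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle the images of one class and copy them into train/val/test folders."""
    files = sorted(os.listdir(src_dir))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    parts = {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],  # remainder goes to test
    }
    for split, names in parts.items():
        out = os.path.join(dest_root, split, class_name)
        os.makedirs(out, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(src_dir, name), os.path.join(out, name))
    return {split: len(names) for split, names in parts.items()}
```

Running this for both `Shells` and `Pebbles` produces the train/val/test tree shown below.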
The final structure before you train the model should look like this:

```
Images
├───test
│   ├───Pebbles
│   └───Shells
├───train
│   ├───Pebbles
│   └───Shells
└───val
    ├───Pebbles
    └───Shells
```
To open `notebook.ipynb` and see what is inside (optional: running the whole thing would probably take at least 2 hours), run Jupyter:

```
pipenv run jupyter notebook
```
For the evaluation you need to run `train.py`. This will run the train function and construct an ML model with the best parameters, which will be saved in the `checkpoints` folder (created automatically). The model with the highest validation accuracy will then be loaded, evaluated (returning some metrics) and converted to a TensorFlow Lite model in order to deploy it to the cloud later.

Note: if you run it on a CPU it will take some time (at least 20 minutes). It is a good idea to use a GPU to speed up the training process.

```
pipenv run python train.py
```
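The `.h5` to TFLite conversion at the end of `train.py` works roughly like this. This is a sketch using a tiny stand-in model; in the real script the best checkpoint is loaded (e.g. with `tf.keras.models.load_model`) instead of being built inline.

```python
import tensorflow as tf

# Tiny stand-in model; train.py uses the best checkpoint from the `checkpoints`
# folder instead. The input size here is an illustrative assumption.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(150, 150, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Convert the Keras model to a TFLite flatbuffer and write it to disk.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)
```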
Notes:
- Ignore any warnings and wait till you see the message `Finished`. In the end you will have a `model.tflite` file in the directory. You can also find the best model in `.h5` format inside the `checkpoints` folder.
- If you don't want to run `train.py` (even though you should), there are models in the `Extra_models` folder in `.h5` and `.tflite` format. I take no responsibility for whether they work (I assume they do).
To deploy the model locally, follow these steps:
- Install Docker on your system. Instructions can be found here.
- Build the Docker image for the model and run the container using the following commands:

```
docker build -t model .
docker run -it --rm -p 8080:8080 model:latest
```

Then run

```
pipenv run python test.py
```

to test it locally using a URL.
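`test.py` essentially does something like the following (a minimal sketch; the endpoint path is the standard one exposed by the AWS Lambda runtime interface emulator inside the container):

```python
import requests

# Default invocation URL of the Lambda runtime interface emulator in the container.
LOCAL_URL = "http://localhost:8080/2015-03-31/functions/function/invocations"

def predict(event: dict, url: str = LOCAL_URL) -> dict:
    """Send the JSON event to the deployed function and return its JSON response."""
    response = requests.post(url, json=event, timeout=30)
    response.raise_for_status()
    return response.json()

# With the container running:
# predict({"url": "https://.../some_image.jpg"})
```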
The function returns a dictionary with a single key-value pair, where the key is the class label and the value is the prediction confidence. The class label is "Shells" if the model's output is greater than or equal to 0.5 and "Pebbles" if it is less than 0.5. The reported value is the confidence of the predicted class, so it is always greater than or equal to 0.5.
For example, if the model's output is 0.7, the class label will be "Shells" with value 0.7. If the output is 0.3, the class label will be "Pebbles" with value 0.7 (since 1 - 0.3 = 0.7).
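In code, the response shaping described above can be sketched as follows (the key names are an assumption; see `lambda-function.py` for the actual ones):

```python
def format_prediction(pred: float) -> dict:
    """Map the model's sigmoid output to a {label: confidence} dictionary."""
    if pred >= 0.5:
        return {"Shells": pred}
    # For outputs below 0.5, report the confidence of the predicted class.
    return {"Pebbles": 1 - pred}
```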
In order to deploy it to AWS, we push the Docker image. Make sure you have an account and have installed the AWS CLI. Instructions can be found here.
First, create a repository on Amazon Elastic Container Registry (ECR) with an appropriate name.
You will find the push commands there to tag and push the latest Docker image, which you can find on your system with:

```
pipenv run docker images
```
Next, we publish to AWS Lambda.
Go to AWS Lambda, create a function, select Container image and add a name. Then browse for your image and finally hit Create function.
Go to Configuration, change the timeout to 30 seconds and increase the memory (e.g. to 1024 MB).
Test the function by changing the event JSON.
Expose the Lambda function using API Gateway. Go to API Gateway, select REST API and build a new API.
Create a new resource and name it `predict`.
Create a new method, select POST and confirm. Choose Lambda Function as the integration type, enter the name of the function you created and hit Save.
Hit Test and add a JSON document to the request body, e.g.

```
{"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/Pebbleswithquarzite.jpg/1280px-Pebbleswithquarzite.jpg"}
```

or another image.
Hit Deploy under Actions, select New Stage and give it a name.
Copy the invoke URL, put it in your `test.py` file and run it.
Make sure you remove/delete everything after testing if necessary.
Video of cloud deployment
shells.mp4
That's a wrap!
- Send a pull request.
- If you liked this project, give a ⭐.
Connect with me: