Update Part_1_Data_Preperation.ipynb
init27 committed Nov 20, 2024
1 parent 78bdd63 commit 0284c66
Showing 1 changed file with 10 additions and 20 deletions.
@@ -27,11 +27,7 @@
"source": [
"### Removing Corrupt Images\n",
"\n",
"Like everything in the real world, all datasets come with some corruption. Thankfully here we can fix it ourselves. \n",
"\n",
"Obviously this is a great first step to check. Unfortunately, the author of the notebook realised there are corrupt images after crashing a GPU script 8 hours later. \n",
"\n",
"We can now use this hindsight to act like smart data scientists and fix this first. "
"We will start by cleaning up the dataset first and checking for any corrupt images. "
]
},
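The corrupt-image check described in the cell above might look roughly like this. This is a hedged sketch, not the notebook's actual code: the `find_corrupt_images` helper name and the `*.jpg` glob are assumptions, and the real dataset layout may differ. PIL's `verify()` flags files that will not decode without fully loading them:

```python
from pathlib import Path

from PIL import Image


def find_corrupt_images(folder):
    """Return paths of images that PIL cannot open or verify.

    Hypothetical helper: folder layout and the .jpg extension are assumed.
    """
    corrupt = []
    for path in sorted(Path(folder).glob("*.jpg")):
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check, no full decode
        except Exception:
            corrupt.append(path)
    return corrupt
```

Corrupt paths can then be deleted or quarantined before any GPU job touches them.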
{
@@ -72,7 +68,7 @@
"\n",
"- PIL: For handling images to be passed to our Llama model\n",
"- Huggingface Tranformers: For running the model\n",
"- Concurrent Library: To look smart when doing OS stuff by using concurrency :) "
"- Concurrent Library: To clean up faster"
]
},
{
@@ -107,9 +103,7 @@
"source": [
"#### Clean Corrupt Images\n",
"\n",
"Cleaning corruption is a task for AGI but we can handle the corrupt images in our dataset for now with some concurrency for fast checking. \n",
"\n",
"This takes a few moments so it might be a good idea to take a small break and socialise for a good change. "
"This might take a few moments since we have 5000 images in our dataset."
]
},
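Since the cell above mentions checking a few thousand images concurrently, here is one hedged sketch of how `concurrent.futures` could parallelise the scan (the helper names are illustrative, not the notebook's own):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from PIL import Image


def is_corrupt(path):
    """Return the path if the image fails PIL verification, else None."""
    try:
        with Image.open(path) as img:
            img.verify()
        return None
    except Exception:
        return path


def find_corrupt_concurrent(folder, workers=8):
    """Scan a folder of .jpg files with a thread pool (I/O-bound work)."""
    paths = sorted(Path(folder).glob("*.jpg"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(is_corrupt, paths))
    return [p for p in results if p is not None]
```

Threads (rather than processes) are a reasonable choice here because the work is dominated by file I/O, not CPU.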
{
@@ -366,8 +360,6 @@
"source": [
"## EDA\n",
"\n",
"Now that we got rid of corruption we can proceed to building a great society with checking our dataset :) \n",
"\n",
"Let's start by double-checking any empty values"
]
},
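The empty-value check mentioned above could be a one-liner over the metadata DataFrame; as a sketch (the `summarize_missing` name and sample columns are assumptions, not the dataset's real schema):

```python
import pandas as pd


def summarize_missing(df):
    """Count missing values per column, most-affected columns first.

    Hypothetical helper; the notebook's metadata columns may differ.
    """
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

An empty result means the dataset has no missing values to worry about.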
@@ -612,7 +604,7 @@
"id": "1cc50c67",
"metadata": {},
"source": [
"If we had more ~~prompts~~ time, this would be a fancier plot but for now let's take a look at the distribution skew to understand what's in our dataset:"
"Let's take a look at the distribution skew to understand what's in our dataset:"
]
},
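The distribution-skew check above boils down to normalised value counts. A minimal sketch, assuming a categorical column name like `"masterCategory"` (an assumption about the schema, not a confirmed field):

```python
import pandas as pd


def category_distribution(df, column):
    """Relative frequency of each category, most common first."""
    counts = df[column].value_counts()  # sorted descending by default
    return counts / counts.sum()
```

Plotting this series (e.g. with `dist.plot.bar()`) makes any long-tail skew obvious at a glance.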
{
@@ -813,7 +805,7 @@
"id": "c2793936",
"metadata": {},
"source": [
"For once, lack of fashion knowledge is useful-we can reduce our work by creating less categories. Nicely organised just like an coder's wardrobe"
"For once, lack of fashion knowledge is useful-we can reduce our work by creating less categories."
]
},
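Collapsing fine-grained labels into fewer categories, as described above, is typically a dictionary lookup with a fallback bucket. A hedged sketch (the mapping entries and the `"Other"` default are illustrative, not the notebook's actual taxonomy):

```python
def merge_categories(labels, mapping, default="Other"):
    """Collapse fine-grained labels into coarse buckets.

    Any label missing from the mapping falls into the default bucket.
    """
    return [mapping.get(label, default) for label in labels]
```

With pandas, the same idea is `df["label"].map(mapping).fillna(default)`.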
{
@@ -1013,7 +1005,7 @@
"\n",
"At this stage, we are ready to start labelling the images using Llama-3.2 models. We will use 11B here for testing. \n",
"\n",
"For our rich readers, we suggest testing 90B as an assignment. Although you will find that 11B is a great candidate for this model. \n",
"We suggest testing 90B as an assignment. Although you will find that 11B is a great candidate for this model. \n",
"\n",
"Read more about the model capabilites [here](https://www.llama.com/docs/how-to-guides/vision-capabilities/)"
]
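For the labelling step above, the transformers chat format for Llama-3.2 vision models pairs one image placeholder with a text prompt per user turn. The sketch below builds that message structure; the commented model-loading lines are a hedged outline only (the `meta-llama/Llama-3.2-11B-Vision-Instruct` checkpoint id is our assumption of the 11B model being referenced, and running it needs substantial GPU memory):

```python
def build_label_messages(prompt):
    """Chat-template messages for a Llama-3.2 vision model.

    These models accept one image per user turn; the PIL image itself
    is passed to the processor separately, alongside the rendered text.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "image"},               # placeholder the processor pairs with the image
            {"type": "text", "text": prompt},
        ],
    }]


# Rough outline of the model call (commented out: downloads large weights,
# needs a GPU; treat as a sketch, not the notebook's exact code):
#
# from transformers import MllamaForConditionalGeneration, AutoProcessor
# model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# processor = AutoProcessor.from_pretrained(model_id)
# model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
# text = processor.apply_chat_template(build_label_messages("Label this item."),
#                                      add_generation_prompt=True)
# inputs = processor(image, text, return_tensors="pt").to(model.device)
# out = model.generate(**inputs, max_new_tokens=128)
```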
@@ -1123,9 +1115,7 @@
"source": [
"#### Labelling Prompt\n",
"\n",
"For anyone who feels strongly about Prompt Engineering-this section is for you. This requires exercising more prompt engineering techniques as suggested below.\n",
"\n",
"Suggested approach:\n",
"We did a few sample runs to arrive on the prompt below: \n",
"\n",
"- Run a simple prompt on an image\n",
"- See output and iterate\n",
@@ -1220,9 +1210,9 @@
"source": [
"### Testing Labelling Script\n",
"\n",
"The results from labelling above look promising, we can now start building a script skeleton in the notebook (don't worry script kiddies we will leave the notebook soon). \n",
"The results from labelling above look promising, we can now start building a script skeleton in the notebook to test our label logic.\n",
"\n",
"Let's test our approach for first 50 images after which we can let this run on multi-GPUs in a script:"
"Let's test our approach for first 50 images after which we can let this run on multi-GPUs in a script. Remember, Llama-3.2 models can only look at one image at once."
]
},
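The 50-image dry run above can be sketched as a plain loop that calls the labelling function once per image, since these models take a single image per call (the `label_first_n` helper and its signature are assumptions for illustration):

```python
def label_first_n(image_paths, label_fn, n=50):
    """Run a labelling function on the first n images, one call per image.

    label_fn is any callable mapping one image path to a label string,
    e.g. a wrapper around a Llama-3.2 vision generate() call.
    """
    results = {}
    for path in image_paths[:n]:
        results[path] = label_fn(path)  # one image per model call
    return results
```

Once the outputs look sane on 50 images, the same loop body can be moved into a standalone script and sharded across GPUs.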
{
@@ -1326,7 +1316,7 @@
"id": "fcf2a894",
"metadata": {},
"source": [
"It's always a great idea to ~~have trust issues~~ validate LLM outputs, we can check our labels here:"
"It's always a great idea to validate LLM outputs, we can check our labels here:"
]
},
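One simple way to validate the labels, as suggested above, is to check each model output against the set of allowed categories and route anything unexpected to manual review. A hedged sketch (helper name and return shape are our own, not the notebook's):

```python
def validate_labels(labels, allowed):
    """Split model outputs into valid labels and ones needing review.

    labels: dict of image id -> raw model output string.
    allowed: set of category names the prompt asked for.
    """
    valid, invalid = {}, {}
    for image_id, label in labels.items():
        cleaned = label.strip()
        target = valid if cleaned in allowed else invalid
        target[image_id] = cleaned
    return valid, invalid
```

Spot-checking a random sample of the "valid" bucket by eye is still worthwhile, since a label can be in-vocabulary yet wrong for the image.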
{
