
Commit

Updated the docker file and train python files to execute in the docker image and updated README
mHemaAP committed Sep 14, 2024
1 parent a9e6e6c commit ea19c1d
Showing 4 changed files with 206 additions and 13 deletions.
4 changes: 4 additions & 0 deletions .dockerignore
@@ -0,0 +1,4 @@
.git
*.pyc
__pycache__
tests/
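# entries above are excluded from the Docker build context (smaller, faster builds)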
8 changes: 7 additions & 1 deletion Dockerfile
@@ -1,6 +1,12 @@
-FROM ubuntu:latest
+FROM python:3.9-slim

WORKDIR /workspace

# Install Python packages
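# Pinning CPU-only wheels (rather than the default CUDA builds) keeps the image within the 1 GB limit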
RUN pip install --no-cache-dir numpy==1.23.4 \
&& pip install --no-cache-dir torch==1.12.1+cpu torchvision==0.13.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

# COPY . .
COPY train.py /workspace/

CMD ["python", "train.py"]
79 changes: 75 additions & 4 deletions README.md
@@ -4,11 +4,11 @@

# PyTorch Docker Assignment

-Welcome to the PyTorch Docker Assignment. This assignment is designed to help you understand and work with Docker and PyTorch.
+Welcome to the PyTorch Docker Assignment. This assignment is designed to help readers understand and work with Docker and PyTorch.

## Assignment Overview

-In this assignment, you will:
+This project trains a neural network on the MNIST dataset using PyTorch. The project is containerized with Docker, making the environment easy to reproduce. This assignment contains:

1. Create a Dockerfile for a PyTorch (CPU version) environment.
2. Keep the size of your Docker image under 1GB (uncompressed).
@@ -18,8 +18,79 @@ In this assignment, you will:

## Starter Code

-The provided starter code in train.py provides a basic structure for loading data, defining a model, and running training and testing loops. You will need to complete the code at locations marked by TODO: comments.
+The starter code in train.py provides a basic structure for loading data, defining a model, and running the training and testing loops. With this submission, the code at the locations marked by TODO: comments is complete.

## How to Run the Code Using Docker
Below are the instructions to build and run the code using Docker.

### Requirements
- Docker installed on your machine.

#### Dockerfile Overview
The provided `Dockerfile` does the following:

1. **Base Image:** Uses `python:3.9-slim` as the base image.
2. **Working Directory:** Sets `/workspace` as the working directory inside the container.
3. **Package Installation:** Installs specific versions of `numpy`, `torch`, and `torchvision` using `pip`.
4. **Copy Files:** Copies `train.py` to the working directory.
5. **Command to Execute:** The default command runs the training script via `python train.py` (a quick verification snippet follows below).
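
As a quick sanity check (illustrative only, not part of this commit), a snippet like the following can be run inside the built image to confirm that the pinned CPU-only wheels are in place:


```
# sanity_check.py -- illustrative helper, not shipped in the image
import numpy
import torch
import torchvision

print("numpy:", numpy.__version__)                   # expect 1.23.4
print("torch:", torch.__version__)                   # expect 1.12.1+cpu
print("torchvision:", torchvision.__version__)       # expect 0.13.1+cpu
print("CUDA available:", torch.cuda.is_available())  # expect False with CPU-only wheels
```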

#### How to Build and Run the Docker Container
##### Step 1: Build the Docker Image
Navigate to the directory containing the `Dockerfile` and run the following command to build the Docker image:


```
docker build -t mnist-trainer:latest .
```
This command builds the Docker image and tags it as `mnist-trainer:latest`.

##### Step 2: Run the Docker Container
Once the image is built, you can run the container using the following command:


```
docker run --rm -it -v $(pwd)/data:/workspace/data mnist-trainer:latest
```
Explanation:

- `--rm`: Automatically removes the container once it exits.
- `-it`: Runs the container interactively, allowing you to see the training output in real time.
- `-v $(pwd)/data:/workspace/data`: Mounts the `data` directory from your host system into the container at `/workspace/data`, allowing MNIST data and model checkpoints to persist between runs.
- `mnist-trainer:latest`: Specifies the Docker image to run.
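
The data mount works because train.py resolves the dataset path relative to the container's working directory. The relevant call, essentially as it appears in this commit's train.py:


```
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
# './data' resolves to /workspace/data inside the container (the host
# directory mounted via -v $(pwd)/data:/workspace/data), so the MNIST
# download is reused across runs instead of being fetched each time.
dataset1 = datasets.MNIST('./data', train=True, download=True, transform=transform)
```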

##### Step 3: Running with Checkpoint Resume
To resume training from a checkpoint, first make sure a model checkpoint exists at `./model_checkpoint.pth`. Because the image defines the training command via `CMD`, any arguments appended to `docker run` replace the whole command rather than extending it, so spell the command out in full and mount the checkpoint so it is visible inside the fresh container:


```
docker run --rm -it -v $(pwd)/data:/workspace/data -v $(pwd)/model_checkpoint.pth:/workspace/model_checkpoint.pth mnist-trainer:latest python train.py --resume
```
This will load the existing checkpoint and continue training.
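
Under the hood, `--resume` performs a plain `load_state_dict` on the saved weights. A minimal sketch of equivalent logic (illustrative; the explicit `map_location` is an added defensive assumption, not in the commit):


```
import os
import torch

def maybe_resume(model, ckpt_path='./model_checkpoint.pth'):
    # train.py saves a bare state_dict, so a bare load_state_dict restores it.
    if os.path.isfile(ckpt_path):
        state = torch.load(ckpt_path, map_location='cpu')  # keep the load device-agnostic
        model.load_state_dict(state)
        print(f"=> Loaded checkpoint '{ckpt_path}'")
    else:
        print(f"=> No checkpoint found at '{ckpt_path}'")
```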

##### Additional Docker Commands
- **To view the logs:** Use the following command to check the logs of the running container:


```
docker logs <container-id>
```
- **To save the model:** After training, the model checkpoint is written to `./model_checkpoint.pth` inside the container's working directory (`/workspace`). Mount that path (as in the resume command above) if you want the checkpoint to persist on your local machine.

##### Notes
- The model architecture and training script can be modified in `train.py`.
- The container will automatically download the MNIST dataset during the training process if not already present.

## Test Results

All the tests run by the script `tests/grading.sh` completed successfully on Gitpod.

## Submission

-When you have completed the assignment, push your code to your Github repository. The Github Actions workflow will automatically build your Docker image, run your training script, and check if the assignment requirements have been met. Check the Github Actions tab for the results of these checks. Make sure that all checks are passing before you submit the assignment.
+After completing the assignment, push the code to the GitHub repository. The GitHub Actions workflow will automatically build the Docker image, run the training script, and check whether the assignment requirements have been met. Check the GitHub Actions tab for the results of these checks. All checks were verified to pass before this submission.
128 changes: 120 additions & 8 deletions train.py
@@ -1,47 +1,159 @@
import os
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
-import argparse
from torch.optim.lr_scheduler import StepLR
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # TODO: Define your model architecture here
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
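        # Shape check: 28x28 input -> conv1 -> 26x26 -> conv2 -> 24x24 ->
        # max_pool2d(2) -> 12x12 over 64 channels; 12 * 12 * 64 = 9216 = fc1's in_features.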

    def forward(self, x):
        # TODO: Define the forward pass
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
-        pass

def train_epoch(epoch, args, model, device, data_loader, optimizer):
    # TODO: Implement the training loop here
-    pass
    model.train()
    for batch_idx, (data, target) in enumerate(data_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(data_loader.dataset),
                100. * batch_idx / len(data_loader), loss.item()))
            if args.dry_run:
                break
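                # --dry-run stops after the first logged batch: a quick smoke
                # test that data loading, forward, backward and step all run.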

def test_epoch(model, device, data_loader):
    # TODO: Implement the testing loop here
-    pass
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in data_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(data_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(data_loader.dataset),
        100. * correct / len(data_loader.dataset)))

def main():
    # Parser to get command line arguments
    parser = argparse.ArgumentParser(description='MNIST Training Script')
    # TODO: Define your command line arguments here

    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--no-mps', action='store_true', default=False,
                        help='disables macOS GPU training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=True,
                        help='For Saving the current Model')
    parser.add_argument('--resume', action='store_true', default=False,
                        help='Resume training from a checkpoint')  # New argument for resuming
    args = parser.parse_args()
-    use_cuda = torch.cuda.is_available()
    #use_cuda = torch.cuda.is_available()
    torch.manual_seed(args.seed)
-    device = torch.device("cuda" if use_cuda else "cpu")
    #device = torch.device("cuda" if use_cuda else "cpu")
    device = torch.device("cpu")
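    # Training is pinned to the CPU to match the CPU-only torch wheels installed in the Docker image.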

    # TODO: Load the MNIST dataset for training and testing
    train_kwargs = {'batch_size': args.batch_size}
    test_kwargs = {'batch_size': args.test_batch_size}
    # if use_cuda:
    #     cuda_kwargs = {'num_workers': 1,
    #                    'pin_memory': True,
    #                    'shuffle': True}
    #     train_kwargs.update(cuda_kwargs)
    #     test_kwargs.update(cuda_kwargs)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset1 = datasets.MNIST('./data', train=True, download=True,
                              transform=transform)
    dataset2 = datasets.MNIST('./data', train=False,
                              transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net().to(device)
    # TODO: Add a way to load the model checkpoint if 'resume' argument is True
    # Add checkpoint loading functionality
    if args.resume:
        if os.path.isfile('./model_checkpoint.pth'):
            print("=> Loading checkpoint 'model_checkpoint.pth'")
            model.load_state_dict(torch.load('./model_checkpoint.pth'))
        else:
            print("=> No checkpoint found at 'model_checkpoint.pth'")

    # TODO: Choose and define the optimizer here
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
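    # With step_size=1, the learning rate is multiplied by gamma (default 0.7)
    # after every epoch, since scheduler.step() is called once per epoch below.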

    # TODO: Implement the training and testing cycles
    # Hint: Save the model after each epoch
    for epoch in range(1, args.epochs + 1):
        train_epoch(epoch, args, model, device, train_loader, optimizer)
        test_epoch(model, device, test_loader)
        scheduler.step()
    print("Model training was completed!")
    # Hint: Save the model after end of all epochs
    if args.save_model:
        print("Saving the checkpoint")
        torch.save(model.state_dict(), "./model_checkpoint.pth")
        print(f"Saved the checkpoint {os.getcwd()}/model_checkpoint.pth")

if __name__ == "__main__":
    main()
