
Commit

Updated the docker file and train python files to execute in the docker image and updated README
mHemaAP committed Sep 14, 2024
1 parent a9e6e6c commit ea19c1d
Showing 4 changed files with 206 additions and 13 deletions.
4 changes: 4 additions & 0 deletions .dockerignore
@@ -0,0 +1,4 @@
.git
*.pyc
__pycache__
tests/
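# entries above are excluded from the Docker build context (smaller, faster builds)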
8 changes: 7 additions & 1 deletion Dockerfile
@@ -1,6 +1,12 @@
-FROM ubuntu:latest
+FROM python:3.9-slim

WORKDIR /workspace

# Install Python packages
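# Pinning CPU-only wheels (rather than the default CUDA builds) keeps the image within the 1 GB limit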
RUN pip install --no-cache-dir numpy==1.23.4 \
&& pip install --no-cache-dir torch==1.12.1+cpu torchvision==0.13.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

# COPY . .
COPY train.py /workspace/

CMD ["python", "train.py"]
79 changes: 75 additions & 4 deletions README.md
@@ -4,11 +4,11 @@

# PyTorch Docker Assignment

-Welcome to the PyTorch Docker Assignment. This assignment is designed to help you understand and work with Docker and PyTorch.
+Welcome to the PyTorch Docker Assignment. This assignment is designed to help readers understand and work with Docker and PyTorch.

## Assignment Overview

-In this assignment, you will:
+This project trains a neural network on the MNIST dataset using PyTorch. The project is containerized with Docker, making the environment easy to reproduce. This assignment contains:

1. Create a Dockerfile for a PyTorch (CPU version) environment.
2. Keep the size of your Docker image under 1GB (uncompressed).
@@ -18,8 +18,79 @@ In this assignment, you will:

## Starter Code

-The provided starter code in train.py provides a basic structure for loading data, defining a model, and running training and testing loops. You will need to complete the code at locations marked by TODO: comments.
+The starter code in train.py provides a basic structure for loading data, defining a model, and running the training and testing loops. With this submission, the code at the locations marked by TODO: comments is complete.

## How to Run the Code Using Docker
Below are the instructions to build and run the code using Docker.

### Requirements
- Docker installed on your machine.

#### Dockerfile Overview
The provided `Dockerfile` does the following:

1. **Base Image:** Uses `python:3.9-slim` as the base image.
2. **Working Directory:** Sets `/workspace` as the working directory inside the container.
3. **Package Installation:** Installs specific versions of `numpy`, `torch`, and `torchvision` using `pip`.
4. **Copy Files:** Copies `train.py` to the working directory.
5. **Command to Execute:** The default command runs the training script via `python train.py` (a quick verification snippet follows below).
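
As a quick sanity check (illustrative only, not part of this commit), a snippet like the following can be run inside the built image to confirm that the pinned CPU-only wheels are in place:


```
# sanity_check.py -- illustrative helper, not shipped in the image
import numpy
import torch
import torchvision

print("numpy:", numpy.__version__)                   # expect 1.23.4
print("torch:", torch.__version__)                   # expect 1.12.1+cpu
print("torchvision:", torchvision.__version__)       # expect 0.13.1+cpu
print("CUDA available:", torch.cuda.is_available())  # expect False with CPU-only wheels
```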

#### How to Build and Run the Docker Container
##### Step 1: Build the Docker Image
Navigate to the directory containing the `Dockerfile` and run the following command to build the Docker image:


```
docker build -t mnist-trainer:latest .
```
This command builds the Docker image and tags it as `mnist-trainer:latest`.

##### Step 2: Run the Docker Container
Once the image is built, you can run the container using the following command:


```
docker run --rm -it -v $(pwd)/data:/workspace/data mnist-trainer:latest
```
Explanation:

- `--rm`: Automatically removes the container once it exits.
- `-it`: Runs the container interactively, allowing you to see the training output in real time.
- `-v $(pwd)/data:/workspace/data`: Mounts the `data` directory from your host system into the container at `/workspace/data`, allowing MNIST data and model checkpoints to persist between runs.
- `mnist-trainer:latest`: Specifies the Docker image to run.
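
The data mount works because train.py resolves the dataset path relative to the container's working directory. The relevant call, essentially as it appears in this commit's train.py:


```
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
# './data' resolves to /workspace/data inside the container (the host
# directory mounted via -v $(pwd)/data:/workspace/data), so the MNIST
# download is reused across runs instead of being fetched each time.
dataset1 = datasets.MNIST('./data', train=True, download=True, transform=transform)
```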

##### Step 3: Running with Checkpoint Resume
To resume training from a checkpoint, first make sure a model checkpoint exists at `./model_checkpoint.pth`. Because the image defines the training command via `CMD`, any arguments appended to `docker run` replace the whole command rather than extending it, so spell the command out in full and mount the checkpoint so it is visible inside the fresh container:


```
docker run --rm -it -v $(pwd)/data:/workspace/data -v $(pwd)/model_checkpoint.pth:/workspace/model_checkpoint.pth mnist-trainer:latest python train.py --resume
```
This will load the existing checkpoint and continue training.
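
Under the hood, `--resume` performs a plain `load_state_dict` on the saved weights. A minimal sketch of equivalent logic (illustrative; the explicit `map_location` is an added defensive assumption, not in the commit):


```
import os
import torch

def maybe_resume(model, ckpt_path='./model_checkpoint.pth'):
    # train.py saves a bare state_dict, so a bare load_state_dict restores it.
    if os.path.isfile(ckpt_path):
        state = torch.load(ckpt_path, map_location='cpu')  # keep the load device-agnostic
        model.load_state_dict(state)
        print(f"=> Loaded checkpoint '{ckpt_path}'")
    else:
        print(f"=> No checkpoint found at '{ckpt_path}'")
```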

##### Additional Docker Commands
- **To view the logs:** Use the following command to check the logs of the running container:


```
docker logs <container-id>
```
- **To save the model:** After training, the model checkpoint is written to `./model_checkpoint.pth` inside the container's working directory (`/workspace`). Mount that path (as in the resume command above) if you want the checkpoint to persist on your local machine.

##### Notes
- The model architecture and training script can be modified in `train.py`.
- The container will automatically download the MNIST dataset during the training process if not already present.

## Test Results

All the tests run by the script `tests/grading.sh` completed successfully on Gitpod.

## Submission

-When you have completed the assignment, push your code to your Github repository. The Github Actions workflow will automatically build your Docker image, run your training script, and check if the assignment requirements have been met. Check the Github Actions tab for the results of these checks. Make sure that all checks are passing before you submit the assignment.
+After completing the assignment, push the code to the GitHub repository. The GitHub Actions workflow will automatically build the Docker image, run the training script, and check whether the assignment requirements have been met. Check the GitHub Actions tab for the results of these checks. All checks were verified to pass before this submission.
128 changes: 120 additions & 8 deletions train.py
@@ -1,47 +1,159 @@
import os
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
-import argparse
from torch.optim.lr_scheduler import StepLR
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # TODO: Define your model architecture here
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
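        # Shape check: 28x28 input -> conv1 -> 26x26 -> conv2 -> 24x24 ->
        # max_pool2d(2) -> 12x12 over 64 channels; 12 * 12 * 64 = 9216 = fc1's in_features.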

    def forward(self, x):
        # TODO: Define the forward pass
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
-        pass

def train_epoch(epoch, args, model, device, data_loader, optimizer):
    # TODO: Implement the training loop here
-    pass
    model.train()
    for batch_idx, (data, target) in enumerate(data_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(data_loader.dataset),
                100. * batch_idx / len(data_loader), loss.item()))
            if args.dry_run:
                break
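                # --dry-run stops after the first logged batch: a quick smoke
                # test that data loading, forward, backward and step all run.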

def test_epoch(model, device, data_loader):
    # TODO: Implement the testing loop here
-    pass
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in data_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(data_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(data_loader.dataset),
        100. * correct / len(data_loader.dataset)))

def main():
    # Parser to get command line arguments
    parser = argparse.ArgumentParser(description='MNIST Training Script')
    # TODO: Define your command line arguments here

    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--no-mps', action='store_true', default=False,
                        help='disables macOS GPU training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=True,
                        help='For Saving the current Model')
    parser.add_argument('--resume', action='store_true', default=False,
                        help='Resume training from a checkpoint')  # New argument for resuming
    args = parser.parse_args()
-    use_cuda = torch.cuda.is_available()
    #use_cuda = torch.cuda.is_available()
    torch.manual_seed(args.seed)
-    device = torch.device("cuda" if use_cuda else "cpu")
    #device = torch.device("cuda" if use_cuda else "cpu")
    device = torch.device("cpu")
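    # Training is pinned to the CPU to match the CPU-only torch wheels installed in the Docker image.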

    # TODO: Load the MNIST dataset for training and testing
    train_kwargs = {'batch_size': args.batch_size}
    test_kwargs = {'batch_size': args.test_batch_size}
    # if use_cuda:
    #     cuda_kwargs = {'num_workers': 1,
    #                    'pin_memory': True,
    #                    'shuffle': True}
    #     train_kwargs.update(cuda_kwargs)
    #     test_kwargs.update(cuda_kwargs)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset1 = datasets.MNIST('./data', train=True, download=True,
                              transform=transform)
    dataset2 = datasets.MNIST('./data', train=False,
                              transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net().to(device)
    # TODO: Add a way to load the model checkpoint if 'resume' argument is True
    # Add checkpoint loading functionality
    if args.resume:
        if os.path.isfile('./model_checkpoint.pth'):
            print("=> Loading checkpoint 'model_checkpoint.pth'")
            model.load_state_dict(torch.load('./model_checkpoint.pth'))
        else:
            print("=> No checkpoint found at 'model_checkpoint.pth'")

    # TODO: Choose and define the optimizer here
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
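    # With step_size=1, the learning rate is multiplied by gamma (default 0.7)
    # after every epoch, since scheduler.step() is called once per epoch below.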

    # TODO: Implement the training and testing cycles
    # Hint: Save the model after each epoch
    for epoch in range(1, args.epochs + 1):
        train_epoch(epoch, args, model, device, train_loader, optimizer)
        test_epoch(model, device, test_loader)
        scheduler.step()
    print("Model training was completed!")
    # Hint: Save the model after end of all epochs
    if args.save_model:
        print("Saving the checkpoint")
        torch.save(model.state_dict(), "./model_checkpoint.pth")
        print(f"Saved the checkpoint {os.getcwd()}/model_checkpoint.pth")

if __name__ == "__main__":
    main()
