A real-time Indian Sign Language (ISL) to Speech system that combines YOLO, MediaPipe, and a custom CNN+LSTM model to translate sign language videos into spoken words.
This project enables real-time sign language recognition from webcam video and converts the recognized signs into spoken words in the browser.
Key Highlights:
- Real-time webcam capture in the browser.
- Landmark extraction with MediaPipe Holistic.
- YOLOv11 for robust person detection and frame cropping during training.
- Sequence modeling with a hybrid CNN + LSTM PyTorch model.
- Interactive web interface with WebSocket-based streaming.
- Automatic speech output using browser Text-to-Speech (TTS).
How the model was built
- Collect Videos:
  - Raw videos of sign language gestures.
- Extract Frames:
  - Videos are split into frame sequences for processing.
- YOLOv11 Detection:
  - Each frame is passed through YOLOv11 to detect and crop the region containing the signer.
  - This improves landmark extraction accuracy by focusing only on the signer.
- Extract Landmarks:
  - MediaPipe Holistic is used to extract:
    - Pose landmarks
    - Face landmarks
    - Left & right hand landmarks
- Masking:
  - Binary masks track which landmarks are present or missing in each frame.
  - This helps the model learn variable-length, partially visible features robustly (a preprocessing sketch covering the detection, landmark, and masking steps follows this list).
- CNN + LSTM Model:
  - CNN layers learn spatial features from the landmark sequences.
  - LSTM layers capture temporal dependencies across frames.
  - The final model classifies the sign gesture into one of the predefined sign classes (see the model sketch after this list).
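
The detection, landmark-extraction, and masking steps above could look roughly like the sketch below. It assumes the Ultralytics YOLO API and MediaPipe Holistic's Python solution; the crop heuristic, landmark groups, and feature layout actually used in training may differ.

```python
# Sketch only: crop the signer with YOLO, then extract Holistic landmarks plus a
# per-group presence mask. Feature layout and crop heuristic are assumptions.
import numpy as np
import mediapipe as mp
from ultralytics import YOLO

yolo = YOLO("models/yolo11n.pt")
holistic = mp.solutions.holistic.Holistic(static_image_mode=True)

def crop_signer(frame_bgr):
    """Return the largest detected person region, or the full frame if none is found."""
    result = yolo(frame_bgr, classes=[0], verbose=False)[0]  # class 0 = person (COCO)
    if len(result.boxes) == 0:
        return frame_bgr
    xyxy = result.boxes.xyxy.cpu().numpy()
    areas = (xyxy[:, 2] - xyxy[:, 0]) * (xyxy[:, 3] - xyxy[:, 1])
    x1, y1, x2, y2 = xyxy[areas.argmax()].astype(int)
    return frame_bgr[y1:y2, x1:x2]

def extract_landmarks(frame_rgb):
    """Flatten pose/face/hand landmarks into one vector plus a binary presence mask."""
    res = holistic.process(frame_rgb)
    groups = [
        (res.pose_landmarks, 33),        # pose: 33 landmarks
        (res.face_landmarks, 468),       # face mesh: 468 landmarks
        (res.left_hand_landmarks, 21),   # left hand: 21 landmarks
        (res.right_hand_landmarks, 21),  # right hand: 21 landmarks
    ]
    features, mask = [], []
    for landmarks, count in groups:
        if landmarks is not None:
            features.append(np.array([[p.x, p.y, p.z] for p in landmarks.landmark]).flatten())
            mask.append(1.0)
        else:
            features.append(np.zeros(count * 3))  # missing group -> zero-filled features
            mask.append(0.0)
    return np.concatenate(features), np.array(mask)
```

With this layout, each frame yields a 1629-dimensional feature vector ((33 + 468 + 21 + 21) landmarks × x, y, z) plus a 4-element presence mask.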
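
The CNN + LSTM classifier might then be structured along these lines; the layer sizes, the 1629-feature input (matching the sketch above), and the number of classes are illustrative assumptions rather than the values in `model_def.py`.

```python
# Illustrative CNN + LSTM sequence classifier; layer sizes and num_classes are
# assumptions, not the values used in model_def.py.
import torch
import torch.nn as nn

class SignCNNLSTM(nn.Module):
    def __init__(self, feature_dim=1629, num_classes=20, hidden_size=128):
        super().__init__()
        # Conv1d treats the landmark features as channels and mixes them
        # over a short temporal window of frames.
        self.cnn = nn.Sequential(
            nn.Conv1d(feature_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # LSTM models how the per-frame features evolve across the sequence.
        self.lstm = nn.LSTM(128, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, feature_dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # -> (batch, time, 128)
        _, (h_n, _) = self.lstm(x)                       # h_n: (1, batch, hidden_size)
        return self.fc(h_n[-1])                          # logits: (batch, num_classes)

# Example: a batch of 2 sequences of 30 frames each.
logits = SignCNNLSTM()(torch.randn(2, 30, 1629))
print(logits.shape)  # torch.Size([2, 20])
```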
How real-time recognition works
1. The user starts the camera from the browser.
2. Frames are processed client-side with MediaPipe Holistic to draw pose, face, and hand landmarks for live feedback.
3. The raw frame is also sent over WebSocket to the FastAPI backend.
4. The server:
   - Optionally runs a YOLO crop (this step can be skipped).
   - Runs MediaPipe Holistic again on the cropped/received frame.
   - Maintains a rolling buffer (`deque`) of landmark sequences (a simplified server sketch follows this list).
5. When the user presses `s`:
   - The buffer is reset and inference is started.
   - When enough frames are collected, the server feeds the landmark sequence through the CNN + LSTM model.
6. The predicted sign word is sent back to the browser in real time.
7. The browser displays the word and can speak it aloud using the Web Speech API.
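
A much-simplified sketch of that server loop is shown below, reusing the hypothetical `crop_signer`, `extract_landmarks`, and `SignCNNLSTM` sketches above; the WebSocket path, message format, window length, and class list are assumptions, and the real `app_ws.py` will differ in detail.

```python
# Simplified sketch of the WebSocket loop: keep a rolling deque of landmark vectors
# and run the CNN + LSTM model once enough frames are buffered. The "/ws" path,
# message format, SEQUENCE_LEN, and CLASS_NAMES are assumptions.
import base64
from collections import deque

import cv2
import numpy as np
import torch
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
SEQUENCE_LEN = 30                                # assumed frames per inference window
CLASS_NAMES = [f"sign_{i}" for i in range(20)]   # placeholder class labels
model = SignCNNLSTM()                            # placeholder; load best_web_model.pth in practice
model.eval()

@app.websocket("/ws")
async def recognize(ws: WebSocket):
    await ws.accept()
    buffer = deque(maxlen=SEQUENCE_LEN)
    running = False
    try:
        while True:
            msg = await ws.receive_json()
            if msg["type"] == "start":      # browser `s` key: reset buffer and start inference
                buffer.clear()
                running = True
            elif msg["type"] == "reset":    # browser `r` key: reset buffer only
                buffer.clear()
                running = False
            elif msg["type"] == "frame":
                # Decode the JPEG frame, crop, and extract landmarks (helpers sketched above).
                jpg = base64.b64decode(msg["frame"])
                frame = cv2.imdecode(np.frombuffer(jpg, np.uint8), cv2.IMREAD_COLOR)
                features, _ = extract_landmarks(cv2.cvtColor(crop_signer(frame), cv2.COLOR_BGR2RGB))
                buffer.append(features)
                if running and len(buffer) == SEQUENCE_LEN:
                    seq = torch.tensor(np.stack(buffer), dtype=torch.float32).unsqueeze(0)
                    with torch.no_grad():
                        word = CLASS_NAMES[model(seq).argmax(dim=1).item()]
                    await ws.send_json({"word": word})  # browser displays and speaks the word
    except WebSocketDisconnect:
        pass
```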
- Install Python dependencies

  ```bash
  conda create -n isl-speech python=3.9
  conda activate isl-speech
  pip install -r requirements.txt
  ```

- Place models
  - YOLO weights (`yolo11n.pt`) in `models/`
  - Trained PyTorch model (`best_web_model.pth`) in `models/`
- Start the FastAPI server

  ```bash
  uvicorn web_app.app_ws:app --reload
  ```

- Open the Web App
  - Navigate to `http://localhost:8000`
  - Click Start Camera
  - Use `s` to start inference, `r` to reset the buffer
| Key | Action |
|---|---|
| `s` | Reset buffer and start inference |
| `r` | Reset buffer only |
| Play Translation | Click the Speak Translation button to hear the translated sign |
```
web_app/
├── static/
│   └── index.html       # Frontend HTML
├── app_ws.py            # FastAPI server with WebSocket
├── model_def.py         # PyTorch CNN + LSTM model definition
└── models/
    ├── yolo11n.pt
    └── best_web_model.pth
```
- Sainava Modak
- Kartik Rajput