A beginner's attempt at a lightweight, real-time Visual Odometry and Visual SLAM system in Python. Contains both monocular and stereo implementations. Continued from my abandoned vSLAM implementation in C++, which was too tedious to write and debug as a one-person toy project.
Uses the following libraries (installed through `pip`):
- OpenCV (for feature detection / matching)
- g2o-python (for graph optimization)
- Pangolin (for real-time visualization)
- Note: Pangolin is not available on PyPI, though I'm pushing for it
Sources of guidance / inspiration:
- slambook-en (BEST textbook to learn visual SLAM)
- Lecture: Photogrammetry I & II (2021, Uni Bonn, Cyrill Stachniss) (BEST lectures)
- pyslam
- twitchslam
- ORB-SLAM1, ORB-SLAM2, and DSO papers
- https://github.com/niconielsen32/ComputerVision
I also have notes written up for Visual SLAM hosted on my website, which has many more useful resources.
Currently sitting at ~0.14 m translation error (RMSE) on the TUM-RGBD dataset, running pure Visual Odometry (more info below). I'm still having trouble getting the bundle adjustment in the stereo SLAM to work well.
ORB Feature matching with outlier filtering
orb_tum.mp4
Estimated pose (green is ground truth pose, red is estimated pose)
path_tum.mp4
Getting started is super straightforward.
(Recommended, but not required) Create a virtual environment with Conda to not pollute your Python workspace:
conda create -n vslam-py python=3.8
Then, install the Python libraries:
pip install -r requirements.txt
Note that for visualization, you need the Python bindings for Pangolin installed. This isn't mandatory, but it's useful for visualizing the camera poses in real time; Matplotlib isn't well-suited for that.
Run monocular visual odometry, which uses epipolar geometry between consecutive images to estimate the camera pose (not the most accurate):
python3 main_mono_vo.py
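Under the hood, this pose update boils down to the standard essential-matrix pipeline. A minimal sketch using OpenCV (variable names are illustrative, not the actual code in this repo):

```python
import cv2
import numpy as np

def estimate_relative_pose(pts_prev, pts_curr, K):
    """pts_prev / pts_curr: matched pixel coordinates (Nx2 float arrays)
    from two consecutive frames; K: 3x3 camera intrinsic matrix."""
    # Essential matrix from the epipolar constraint, with RANSAC to
    # reject outlier correspondences.
    E, mask = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                   method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose E and pick the physically valid (R, t) via the cheirality
    # check (triangulated points must lie in front of both cameras).
    _, R, t, mask = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=mask)
    return R, t  # note: t is only recovered up to scale in monocular VO
```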
Run stereo SLAM, which uses keyframes, builds a local map, minimizes reprojection error and uses bundle adjustment to estimate camera pose:
python3 main_stereo_slam.py
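For a feel of what the bundle adjustment step involves, here is a minimal sketch of a two-keyframe BA problem in the g2o Python bindings. Class names follow the g2opy-style API and may differ slightly between wrapper versions; the intrinsics and geometry are made-up example values, not the project's actual backend:

```python
import numpy as np
import g2o

optimizer = g2o.SparseOptimizer()
solver = g2o.BlockSolverSE3(g2o.LinearSolverDenseSE3())
optimizer.set_algorithm(g2o.OptimizationAlgorithmLevenberg(solver))

# Pinhole camera shared by all reprojection edges: (focal, (cx, cy), baseline).
fx, cx, cy = 525.0, 319.5, 239.5  # example values
cam = g2o.CameraParameters(fx, (cx, cy), 0.0)
cam.set_id(0)
optimizer.add_parameter(cam)

# Two keyframe poses as SE(3) vertices; the first is fixed to pin the gauge.
poses = [g2o.SE3Quat(np.identity(3), np.zeros(3)),
         g2o.SE3Quat(np.identity(3), np.array([0.1, 0.0, 0.0]))]
for i, pose in enumerate(poses):
    v = g2o.VertexSE3Expmap()
    v.set_id(i)
    v.set_estimate(pose)
    v.set_fixed(i == 0)
    optimizer.add_vertex(v)

# One landmark observed in both keyframes: a 3D point vertex plus one
# reprojection-error edge per observation.
point = g2o.VertexSBAPointXYZ()
point.set_id(len(poses))
point.set_estimate(np.array([0.0, 0.0, 2.0]))
point.set_marginalized(True)  # enables the Schur-complement trick
optimizer.add_vertex(point)

for frame_id, uv in [(0, np.array([319.5, 239.5])),
                     (1, np.array([293.0, 239.5]))]:
    edge = g2o.EdgeProjectXYZ2UV()
    edge.set_vertex(0, point)
    edge.set_vertex(1, optimizer.vertex(frame_id))
    edge.set_measurement(uv)          # observed pixel location
    edge.set_information(np.identity(2))
    edge.set_robust_kernel(g2o.RobustKernelHuber())
    edge.set_parameter_id(0, 0)       # bind to the camera parameters above
    optimizer.add_edge(edge)

optimizer.initialize_optimization()
optimizer.optimize(10)  # Levenberg-Marquardt iterations
```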
I use the KITTI dataset (a small subsample) and the TUM-RGBD dataset to benchmark performance, as these datasets provide ground-truth pose data.
TUM Dataset (RGB-D)
python3 main_tum.py
KITTI Dataset (monocular)
python3 main_kitti.py
- Timestamps should be filled in correctly (the default increments by 0.1 s). This is important for the optimization to work correctly.
- For depth cameras, unknown depth values are expected to be populated with `np.nan` (not 0). This is important for the optimization to work correctly (see the snippet after these notes).
- Feature: A point in the image that can be tracked over time
- Landmarks / Map points: 3D points in the world, corresponding to features that are detected and tracked by the camera
- Pangolin and OpenCV windows don't play well together. I generally run them in separate processes so that they don't interfere with each other
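Regarding the depth note above, a minimal sketch of how a TUM-RGBD depth image can be sanitized before optimization (the file name is illustrative):

```python
import cv2
import numpy as np

# TUM-RGBD stores depth as 16-bit PNGs scaled by 5000 (5000 units = 1 m);
# a raw value of 0 means the sensor had no depth estimate for that pixel.
depth_raw = cv2.imread("depth/1305031102.160407.png", cv2.IMREAD_UNCHANGED)
depth = depth_raw.astype(np.float32) / 5000.0
depth[depth_raw == 0] = np.nan  # unknown depth becomes NaN, not 0
```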
I've mostly been developing with the TUM-RGBD dataset and the ZED stereo camera. This way, I directly obtain rectified images as well as the camera intrinsics. For your own camera, you'll generally need to do the calibration yourself to figure out these values (`fx`, `fy`, `cx`, `cy`, `baseline`).
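If you do need to calibrate a camera yourself, OpenCV's checkerboard calibration is the usual route. A minimal sketch (the board dimensions, square size, and image paths are illustrative):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)   # inner corners of the printed checkerboard
square = 0.025     # square size in meters

# 3D positions of the board corners in the board's own frame (z = 0).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K packs fx, fy, cx, cy; dist holds the lens distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("fx, fy, cx, cy =", K[0, 0], K[1, 1], K[0, 2], K[1, 2])
```

For a stereo pair, `cv2.stereoCalibrate` additionally recovers the transform between the two cameras, from which the baseline follows.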
Python is much more flexible than C++, and easier to set up. Trying this stuff should be as easy as running 2-3 commands in the terminal. If it were written in C++, you'd need to install dependencies, compile the code, and then run it. In Python, you simply `pip install` the requirements and you're good to go. I also find it a lot easier to debug a Python system (at this small codebase scale); in C++, I'd need something like `gdb`.
Of course, Python will never beat C++ performance due to its interpreted nature. However, all libraries I use are written in C++ under the hood (OpenCV, g2o, Pangolin), so the performance hit is minimal. Python is merely a high-level interface to these libraries.
Overall, it would have taken me a lot longer to write this in C++, and far fewer people would want to try it out.
Monocular and stereo visual odometry share many of the same techniques. The main difference is that with stereo, metric depth can be obtained directly from the known baseline, whereas with mono it can't (unless you run deep monocular depth estimation or use an RGB-D camera).
However, monocular SLAM is more accessible: everyone can try it on their computer using a webcam, whereas stereo cameras aren't as standardized.
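For the stereo case, depth follows from disparity as Z = fx * baseline / d. A minimal sketch using OpenCV's semi-global matcher (parameter values are illustrative, and the input images are assumed rectified):

```python
import cv2
import numpy as np

fx, baseline = 700.0, 0.12  # example intrinsics (ZED-like values)

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; disparities come back as fixed-point * 16.
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96, blockSize=7)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

with np.errstate(divide="ignore"):
    depth = fx * baseline / disparity
depth[disparity <= 0] = np.nan  # invalid matches -> unknown depth
```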
03-20-2024: Initial commit for Python. Project started in C++, but I transitioned to Python for ease of development.
04-29-2024: Add working frontend for visual odometry.
05-12-2024: Add bundle adjustment.
05-21-2024: Add benchmarking with TUM-RGBD dataset to track improvements in data (finally...).
The goal is to incrementally improve this SLAM system, while minimizing complexity, and guaranteeing real-time performance.
I will be benchmarking against the fr1/desk dataset.
2024-05-24 to 2024-05-28: Trying to use keyframes and local mapping, based on ideas from PTAM and ORB-SLAM. God this isn't that hard theoretically, but painfully tedious to implement.
Use dbow, a bag-of-words library, with a vocabulary trained on images from the TUM dataset. This is used to match the current frame against existing keyframes; if the current frame doesn't match any keyframe, it is added as a new keyframe (rough sketch below).
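Roughly the gating logic I have in mind, as a hypothetical sketch: `bow_db` is assumed to expose `add`/`query` in the style of the dbow package's Database class (treat the exact names and signatures as assumptions, not the library's confirmed API):

```python
SIM_THRESHOLD = 0.6  # illustrative tuning knob, not a tested value

def maybe_add_keyframe(bow_db, keyframes, frame, descriptors):
    """Add `frame` as a keyframe if no existing keyframe is similar enough."""
    scores = bow_db.query(descriptors)  # one similarity score per keyframe
    if len(scores) == 0 or max(scores) < SIM_THRESHOLD:
        bow_db.add(descriptors)         # register as a new keyframe
        keyframes.append(frame)
        return True
    return False
```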
This doesn't work well... sunk cost fallacy. I think I should stop here; I should have stopped at 14 cm RMSE.
2024-05-23: I was not strict with filtering bad matches between frames. Tried a few configurations, using `cv2.findHomography` to filter out outliers (thanks Kj!). Found the best one to be a 0.7 Lowe's ratio threshold combined with `cv2.findHomography` outlier rejection.
Improvement to ~0.14 m RMSE!
Filtering the matches down to only the top candidates also helps a lot with computation time. I am taking no top-N filtering, a 0.7 Lowe's ratio threshold, and `cv2.findHomography` outlier rejection as the best configuration.
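Concretely, the match filtering described above looks roughly like this (a sketch, not the exact repo code; `kp_prev`/`des_prev` and `kp_curr`/`des_curr` are assumed to be ORB keypoints and descriptors from the two frames):

```python
import cv2
import numpy as np

matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
knn = matcher.knnMatch(des_prev, des_curr, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the
# second-best candidate for the same query descriptor.
good = []
for pair in knn:
    if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
        good.append(pair[0])

src = np.float32([kp_prev[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_curr[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC homography fit: the returned mask flags matches consistent with
# the dominant image-to-image motion; everything else is rejected.
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
```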
Some results below (note: measurements are flaky; sometimes I get better results with more matches, sometimes with fewer):
No filtering, no `cv2.findHomography`:
- 0.8 Lowe's ratio threshold - 0.918237 m RMSE
- 0.7 Lowe's ratio threshold - 0.161828 m RMSE
- 0.5 Lowe's ratio threshold - 0.160740 m RMSE
We can see that being strict with bad matches really helps. Huge jump from 0.8 to 0.7 threshold.
0.7 Lowe's ratio threshold, with `cv2.findHomography` outlier rejection:
- top 100 matches - 0.290785 m RMSE
- top 200 matches - 0.222357 m RMSE
- top 400 matches - 0.172895 m RMSE
- top 800 matches - 0.149658 m RMSE
- top 1000 matches - 0.143917 m RMSE
- top 1600 matches - 0.137692 m RMSE
- No top-match filtering - 0.139491 m RMSE
In general, the less we filter by match count, the better the results. This seems consistent with the paper "Real-time monocular SLAM: Why filter?".
compared_pose_pairs 570 pairs
absolute_translational_error.rmse 0.142910 m
absolute_translational_error.mean 0.139057 m
absolute_translational_error.median 0.137050 m
absolute_translational_error.std 0.032960 m
absolute_translational_error.min 0.067943 m
absolute_translational_error.max 0.214920 m
2024-05-22: Realized that I had my transforms inverted... slightly better results!
Before
self.relative_poses.append(T)
self.poses.append(self.relative_poses[-1] @ self.poses[-1])
After
# T from pose recovery maps points from the previous frame into the current
# one; inverting it gives the camera's own motion between the frames.
T = np.linalg.inv(T)
self.relative_poses.append(T)
# Chain camera-to-world poses by composing the relative motion on the right.
self.poses.append(self.poses[-1] @ self.relative_poses[-1])
compared_pose_pairs 570 pairs
absolute_translational_error.rmse 0.918237 m
absolute_translational_error.mean 0.829502 m
absolute_translational_error.median 0.752555 m
absolute_translational_error.std 0.393810 m
absolute_translational_error.min 0.217567 m
absolute_translational_error.max 1.714041 m
2024-05-21: Added the dataset. Initial results. Ew...
compared_pose_pairs 570 pairs
absolute_translational_error.rmse 0.953799 m
absolute_translational_error.mean 0.892556 m
absolute_translational_error.median 0.849063 m
absolute_translational_error.std 0.336267 m
absolute_translational_error.min 0.216631 m
absolute_translational_error.max 1.840347 m