The videos below show RidgeSfM reconstructions for KITTI VO scene 7 (using every 3rd frame).
We trained a depth prediction network on the KITTI depth prediction training set. We then processed scene 7 from the KITTI Visual Odometry dataset, using the 'camera 2' image sequences and cropping the input RGB images to 1216×320. We used R2D2 as the keypoint detector.
For each scene, we use the reconstructed depth and camera parameters to reproject the pixels to form a point cloud. Each point in the cloud has the form (x, y, z, r, g, b) ∈ ℝ⁶. To simplify the point cloud, we use K-Means to extract 1,000,000 centroids.
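
As a rough illustration of the point-cloud step, the sketch below lifts each pixel of a depth map to a coloured 3D point and then reduces the cloud with a plain Lloyd-style K-Means. It is not the RidgeSfM code. The pinhole intrinsics `K`, the 4×4 camera-to-world matrix `cam_to_world`, the function names, and the scaling of colours to [0, 1] are all assumptions made for the example, and the toy image is far smaller than a 1216×320 KITTI frame.

```python
import numpy as np


def backproject(depth, rgb, K, cam_to_world):
    """Lift every pixel to a coloured 3D point (x, y, z, r, g, b).

    depth: (H, W) float array of depths along the camera z-axis.
    rgb: (H, W, 3) uint8 image.
    K: 3x3 pinhole intrinsics (assumed model).
    cam_to_world: 4x4 rigid transform (assumed convention).
    """
    h, w = depth.shape
    # Pixel grid: u runs along columns, v along rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    # Row-vector form of p_world = cam_to_world @ p_cam.
    pts_world = (pts_cam @ cam_to_world.T)[:, :3]
    colours = rgb.reshape(-1, 3).astype(np.float64) / 255.0
    return np.hstack([pts_world, colours])


def kmeans_centroids(points, k, iters=10, seed=0):
    """Plain Lloyd iterations; returns k centroids in the same 6D space."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid, shape (n, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids


# Toy example: a flat 4x6 depth map, 2 units from the camera.
h, w = 4, 6
depth = np.full((h, w), 2.0)
rgb = np.zeros((h, w, 3), dtype=np.uint8)
K = np.array([[100.0, 0.0, 3.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)

cloud = backproject(depth, rgb, K, cam_to_world)
centroids = kmeans_centroids(cloud, k=5)
```

The dense distance matrix in `kmeans_centroids` is only practical for small inputs. Extracting 1,000,000 centroids from a full sequence would need a mini-batch or GPU K-Means implementation. Clustering jointly over position and colour also makes the result depend on how the two are scaled relative to each other, which the text above does not specify.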