Working with Multiview Data #9
Comments
This looks typical of using the wrong coordinate system conventions. Towards the end of the readme, there's an explanation of what the coordinate system should look like. In addition, the colmap wrappers post-process the raw colmap results before they return the poses to the dataset loading functions, so make sure that you take the extrinsics only after they are fully processed (which requires writing some code somewhere). You can also take a look at logs/cameras.obj to help with debugging: it gives you an idea of whether your cameras are remotely reasonable (from what it looks like, they aren't). The depth makes it clear that nothing remotely correct is learned in 3D. The model completely overfits to each input camera, with artifacts, because the camera images are inconsistent with each other.
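As a quick standalone sanity check on the extrinsics (separate from the repo's own logs/cameras.obj), here is a minimal sketch that converts poses to camera positions and formats them as OBJ vertices you can open in a mesh viewer. It assumes each pose is a world-to-camera rotation R and translation t, i.e. x_cam = R x_world + t, which may not match the convention the readme asks for; the function names are hypothetical and not part of the repo.

```python
def camera_center(R, t):
    """Camera position in world coordinates, C = -R^T t.

    Assumes a world-to-camera pose: x_cam = R @ x_world + t.
    R is a 3x3 nested list, t is a length-3 list.
    """
    return [0.0 - sum(R[i][j] * t[i] for i in range(3)) for j in range(3)]


def centers_to_obj(extrinsics):
    """Format one OBJ vertex line ("v x y z") per camera center."""
    lines = []
    for R, t in extrinsics:
        x, y, z = camera_center(R, t)
        lines.append(f"v {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


# Toy poses (made up for illustration): identity rotation, and a
# 90-degree rotation about the z axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
example_extrinsics = [(identity, [0.0, 0.0, -2.0]), (rot_z_90, [1.0, 2.0, 3.0])]
obj_text = centers_to_obj(example_extrinsics)
```

For a rig of static cameras spread around a room, the resulting points should roughly surround the scene. If they collapse to a line or sit far from where the cameras physically were, the pose convention is probably wrong.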
Just to add, although I don't believe it's the immediate cause of these results: there is no assumption about the order of frames. For multi-view data, you need to provide an image_to_camera_id_and_timestep.json that says which timestep each image belongs to. Even if the cameras weren't perfectly synchronized, the result would look somewhat blurry but the 3D would still be there. Currently, 3D is completely broken. What I would do to make sure that the extrinsics are in the right format:
Thank you for the fast replies and suggestions! I'll make sure that my coordinate system follows the same conventions. When populating the image_to_camera_id_and_timestep.json, how should time be represented? (0.0-1.0 relative to the clip, frame number, etc.)
Frame number, as in time index. So all images taken at the same timestep should have the same integer assigned to them. The timesteps should start at 0 and ideally not leave any holes (although I think it's not a problem if that happens). So the first 20 frames, all taken at the same time by 20 cameras, would have timestep=0, the next 20 frames, taken by the 20 cameras at the same time, would have timestep=1, etc.
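To make the numbering concrete, here is a minimal sketch that builds such a mapping for images stored in interleaved order (all cameras at timestep 0, then all cameras at timestep 1, and so on). The JSON layout used here, image file name mapped to [camera_id, timestep], and the file naming pattern are assumptions for illustration; check the readme and the data loading code for the layout the repo actually expects.

```python
import json


def build_timestep_mapping(num_cameras, num_timesteps, name_fmt="{index:05d}.png"):
    """Map interleaved image names to [camera_id, timestep].

    Assumes image `index` was taken by camera `index % num_cameras`
    at integer timestep `index // num_cameras`, so timesteps start
    at 0 and have no holes.
    """
    mapping = {}
    for index in range(num_cameras * num_timesteps):
        camera_id = index % num_cameras
        timestep = index // num_cameras
        mapping[name_fmt.format(index=index)] = [camera_id, timestep]
    return mapping


# 20 cameras, 3 timesteps: images 0-19 get timestep 0, 20-39 get 1, 40-59 get 2.
mapping = build_timestep_mapping(num_cameras=20, num_timesteps=3)

# Serialized form; write this string to image_to_camera_id_and_timestep.json.
json_text = json.dumps(mapping, indent=2)
```

If colmap fails to register some images, build the mapping from the images that remain rather than from a fixed camera count, so that every image still carries the timestep at which it was actually captured.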
Hello,
Thank you for the great repo.
I've been trying to use this on a multi-view dataset and I'm having some trouble getting the network to converge on good results.
The data I'm training on is taken from ~20-30 synced cameras (depending on how many colmap finds in the SfM) set up semi-evenly in a room. The cameras are static, but the scene is dynamic, albeit slow moving. I modified the data loading to take a json that contains frames from each camera. When building the training set, I assumed that the order in which images are loaded for training is the order in which the model expects frames to occur in time. Frames are picked sequentially from each camera, e.g. if there are 30 cameras and 150 frames, camera 1 will contribute frames 1, 31, 61, 91, etc.
I've gotten the network to run and train on the dataset, and the outputs are recognizable, but there are a lot of artifacts. Any help building intuition or advice on how to improve the quality of the outputs would be much appreciated.
Original image:
Outputs after 250k iterations: