
RT export questions #30

Closed

sungh66 opened this issue Dec 19, 2023 · 16 comments

sungh66 commented Dec 19, 2023

I saved R and T locally from pred_cameras.R and pred_cameras.T and visualized them with visdom, but the camera directions differ substantially from the ones visualized directly from pred_cameras. Why is this? Are the saved R and T in NDC format? What should I do if I want a camera-to-world (c2w) RT?


jytime commented Dec 19, 2023

pred_cameras.R and pred_cameras.T are in NDC space. If you use similar code to the visualization here, the results should be exactly the same.

You can use the snippet below to construct new cameras from the saved R and T. The focal length can be omitted for visualization.

from pytorch3d.renderer import PerspectiveCameras

# Rebuild cameras from the saved (NDC-space) focal length, rotation, and translation
gt_cameras = PerspectiveCameras(
    focal_length=gt_cameras_dict["gtFL"], R=gt_cameras_dict["gtR"], T=gt_cameras_dict["gtT"], device=device
)
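
For a quick sanity check of the reconstructed cameras, one option is PyTorch3D's plotly helper; this is a minimal sketch of an assumed workflow, not the repository's own visualization code:

from pytorch3d.vis.plotly_vis import plot_scene

# Render an interactive 3D view of the camera frusta (requires plotly)
fig = plot_scene({"reconstruction": {"cameras": gt_cameras}})
fig.show()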


sungh66 commented Dec 19, 2023

Yeah, I know. What I want to do is extract the pose information (R and T) for use in other back ends such as instant-ngp, but this RT is obviously in NDC format. After saving the RT locally, the second visualization differed from the original output, which confused me a little. Is there a way to get RT in a non-NDC format?


jytime commented Dec 19, 2023

Please refer to issue #9, where I provided an example of how to convert NDC RT to COLMAP RT.
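
As a rough illustration of that conversion, here is a sketch assuming PyTorch3D's built-in helper (not necessarily the exact code from issue #9; pred_cameras and device come from the demo context, and the 1024x1024 image size is a placeholder):

import torch
from pytorch3d.utils import opencv_from_cameras_projection

# Convert PyTorch3D NDC cameras to the OpenCV/COLMAP convention
image_size = torch.tensor([[1024, 1024]], device=device).repeat(len(pred_cameras), 1)
R_w2c, t_w2c, K = opencv_from_cameras_projection(pred_cameras, image_size)

# COLMAP poses are world-to-camera; invert them to get camera-to-world (c2w)
# matrices, e.g. for instant-ngp
R_c2w = R_w2c.transpose(1, 2)
t_c2w = -torch.bmm(R_c2w, t_w2c.unsqueeze(-1)).squeeze(-1)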

jytime closed this as completed Dec 22, 2023

sungh66 commented Jan 9, 2024

Thank you for your reply. I have solved the pose extraction problem. I am now trying to train on the CO3D data myself to reproduce the results. Because of the large data volume, I tried training on a single category (109 GB), but the result was very poor and seems to have overfitted. Could you briefly share some training strategies, such as the composition of the training data? I would like to get the best result I can given the time cost. Also, which metrics matter most? I mainly look at the AUC metric.


jytime commented Jan 9, 2024

Well, in my experience, if you want to train on one category (such as teddybear), one GPU is enough. You may need to change the learning rate correspondingly, while the other hyperparameters should stay almost the same. From my observation, a model trained on a single category performs better on the corresponding test set than the multi-category model (i.e., trained on teddybear and tested on teddybear, although the test scenes are never seen during training).

My suggestions would be: (1) start with teddybear, because I have tried it before; (2) you can even begin by forcing the model to overfit on a single scene; (3) in most cases, the learning rate is the most important hyperparameter. Racc@15 and Tacc@15 are the metrics I care about most.
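
For reference, this is a rough sketch of what an Racc@15-style metric measures (an assumed implementation for illustration, not the repository's evaluation code): the fraction of predicted rotations whose geodesic error against ground truth is below 15 degrees.

import torch

def rotation_angle_deg(R_pred, R_gt):
    # Geodesic distance between batched rotation matrices (N, 3, 3), in degrees
    R_rel = torch.bmm(R_pred.transpose(1, 2), R_gt)
    cos = (torch.diagonal(R_rel, dim1=1, dim2=2).sum(-1) - 1.0) / 2.0
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))

def racc_at(R_pred, R_gt, threshold_deg=15.0):
    # Fraction of rotations within the angular threshold
    return (rotation_angle_deg(R_pred, R_gt) < threshold_deg).float().mean()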


sungh66 commented Jan 10, 2024

Thanks for your reply! The default stored weight file is model.safetensors. How can I convert it into a .pth file that I can load directly with torch? Is a loss of 0.03, with racc_15 and tacc_15 at 0.8 and 0.7 respectively, a good result (for a single category)?


jytime commented Jan 10, 2024

I am a bit confused by "The default stored weight file is model.safetensors". In our code, the trained checkpoint should be saved as below:

accelerator.save_state(output_dir=os.path.join(cfg.exp_dir, f"ckpt_{epoch:06}"))

You should be able to find the .pth files in the corresponding path and reload them with accelerate or torch itself. Please refer to the accelerate documentation for details.

racc_15 around 80% is not bad; at least it means something has been learned correctly.


jytime commented Jan 10, 2024

@sungh66 By the way, if you mean racc_15 on the training data, it should usually be above 90% or even 95% for single-category training.


sungh66 commented Jan 11, 2024

The default default_train.yaml does not have an exp_dir keyword, so I added it myself. The trained files are (ckpt_000055$ ls):
model.safetensors optimizer.bin random_states_0.pkl random_states_1.pkl scheduler.bin
However, resume_ckpt cannot directly load the model.safetensors path.


jytime commented Jan 11, 2024

I am not sure why it is saved as model.safetensors (probably a version difference); the relevant operations can be checked here: https://huggingface.co/docs/safetensors/index. Or I think you can directly use the built-in function accelerator.load_state (https://huggingface.co/docs/accelerate/v0.26.0/en/usage_guides/checkpoint#checkpointing):

from accelerate import Accelerator
import torch

accelerator = Accelerator(project_dir="my/save/path")

my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)

# Register the LR scheduler
accelerator.register_for_checkpointing(my_scheduler)

# Save the starting state
accelerator.save_state()

device = accelerator.device
my_model.to(device)

# Perform training
for epoch in range(num_epochs):
    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        accelerator.backward(loss)
        my_optimizer.step()
    my_scheduler.step()

# Restore the previous state
accelerator.load_state("my/save/path/checkpointing/checkpoint_0")
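
For the first option, reading model.safetensors directly, a minimal sketch (assuming the safetensors package is installed; the path matches the ckpt_000055 directory listed above, and model stands for the instantiated network):

import torch
from safetensors.torch import load_file

# Load the safetensors weights into a plain state dict
state_dict = load_file("ckpt_000055/model.safetensors")
model.load_state_dict(state_dict)

# Optionally re-save as a .pth file that torch.load can read directly
torch.save(state_dict, "ckpt_000055/model.pth")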


sungh66 commented Jan 15, 2024

Thanks for your reply! demo.py did not seem to be able to directly load the weights from my multi-GPU training, so I modified demo.py according to the Accelerator documentation and successfully loaded the weights, but I do not have gt-cameras.npz. The GT information here doesn't affect the RT result generation, right? The log is as follows:
t=00 | sampson=9.987731
t=01 | sampson=9.963307
Drop this pair because of insufficient valid matches
t=00 | sampson=9.987731
Drop this pair because of insufficient valid matches
t=01 | sampson=9.963307
Drop this pair because of insufficient valid matches
t=00 | sampson=9.987731
Drop this pair because of insufficient valid matches
t=01 | sampson=9.963307
Drop this pair because of insufficient valid matches
t=00 | sampson=9.987731
Drop this pair because of insufficient valid matches
t=01 | sampson=9.963307
Time taken: 8.7666 seconds
[2024-01-15 04:15:12,201][visdom][WARNING] - Setting up a new session...
[2024-01-15 04:15:12,211][websocket][INFO] - Websocket connected
[2024-01-15 04:15:12,212][visdom][INFO] - Visdom successfully connected to server
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Time taken: 8.6465 seconds


jytime commented Jan 15, 2024

Hey, great to hear that you have resolved it. Yes, the GT information here does not affect the RT result generation; please just skip the corresponding code.

However, the GGS log does not look right here. The Sampson error shown is around 10, but ideally it should be reduced to something close to 1 or 2. It seems you may have changed the GGS settings, for example applying GGS starting from t=1 instead of t=10, or some other change. Please note that a Sampson error of 10 is very unlikely to lead to a good camera pose.
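
For context, the logged number is the Sampson epipolar error; a generic formulation looks roughly like the sketch below (an assumed stand-alone implementation, not the repository's exact code), where x1 and x2 are homogeneous image points of shape (N, 3) and F is the fundamental matrix.

import torch

def sampson_error(x1, x2, F):
    # First-order approximation of the reprojection error w.r.t. the epipolar constraint
    Fx1 = x1 @ F.T      # (N, 3), rows are F @ x1_i
    Ftx2 = x2 @ F       # (N, 3), rows are F^T @ x2_i
    num = (x2 * Fx1).sum(-1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den.clamp(min=1e-8)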


sungh66 commented Jan 18, 2024

Thank you for the reminder. The problem was that there were still some errors in multi-GPU prediction; I switched to loading the multi-GPU weights on a single card and made some key modifications, and now the Sampson results are completely normal. Could you please tell me how many epochs the full CO3D (5.5 TB) training took? Is len_train 16384? I would like to roughly estimate the GPU resources and time needed before purchasing equipment and reproducing it. It would be great if you could tell me!


jytime commented Jan 18, 2024

Hey, I do not remember exactly how many epochs it was trained for, but it took around 2-3 days on 8 A100 GPUs.


jytime commented Jan 31, 2024

To avoid safetensors, you can set:

accelerator.save_state(output_dir=ckpt_path, safe_serialization=False)
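
With safe_serialization=False, the model weights are typically written as pytorch_model.bin inside the checkpoint directory (an assumption about accelerate's default file naming), which plain torch can read; a minimal sketch, with model standing for the instantiated network:

import torch

# Load the non-safetensors checkpoint saved by accelerator.save_state
state_dict = torch.load("ckpt_000055/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)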


jytime commented Mar 9, 2024

@sungh66 If you run into a training speed problem, please check #33; it is related to data loading.
