
RT export questions #30

Closed

sungh66 opened this issue Dec 19, 2023 · 16 comments

sungh66 commented Dec 19, 2023

I saved R and T locally from pred_cameras.R and pred_cameras.T and visualized them with visdom, but the camera directions differ substantially from the ones visualized directly from pred_cameras. Why is this? Are the saved R and T in NDC format? What should I do if I want a camera-to-world (c2w) RT?


jytime commented Dec 19, 2023

pred_cameras.R and pred_cameras.T are in NDC space. If you use similar code to the visualization here, the results should be exactly the same.

You can use the snippet below to construct new cameras from the saved R and T. The focal length can be omitted for visualization.

from pytorch3d.renderer import PerspectiveCameras

# Rebuild cameras from the saved (NDC-space) focal length, rotation, and translation
gt_cameras = PerspectiveCameras(
    focal_length=gt_cameras_dict["gtFL"], R=gt_cameras_dict["gtR"], T=gt_cameras_dict["gtT"], device=device
)
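
For a quick sanity check of the reconstructed cameras, one option is PyTorch3D's plotly helper; this is a minimal sketch of an assumed workflow, not the repository's own visualization code:

from pytorch3d.vis.plotly_vis import plot_scene

# Render an interactive 3D view of the camera frusta (requires plotly)
fig = plot_scene({"reconstruction": {"cameras": gt_cameras}})
fig.show()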


sungh66 commented Dec 19, 2023

Yeah, I know. What I want to do is extract the pose information (R and T) for use in other back ends such as instant-ngp, but this RT is obviously in NDC format. After saving the RT locally, the second visualization differed from the original output, which confused me a little. Is there a way to get RT in a non-NDC format?


jytime commented Dec 19, 2023

Please refer to issue #9, where I provided an example of how to convert NDC RT to COLMAP RT.
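
As a rough illustration of that conversion, here is a sketch assuming PyTorch3D's built-in helper (not necessarily the exact code from issue #9; pred_cameras and device come from the demo context, and the 1024x1024 image size is a placeholder):

import torch
from pytorch3d.utils import opencv_from_cameras_projection

# Convert PyTorch3D NDC cameras to the OpenCV/COLMAP convention
image_size = torch.tensor([[1024, 1024]], device=device).repeat(len(pred_cameras), 1)
R_w2c, t_w2c, K = opencv_from_cameras_projection(pred_cameras, image_size)

# COLMAP poses are world-to-camera; invert them to get camera-to-world (c2w)
# matrices, e.g. for instant-ngp
R_c2w = R_w2c.transpose(1, 2)
t_c2w = -torch.bmm(R_c2w, t_w2c.unsqueeze(-1)).squeeze(-1)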

jytime closed this as completed Dec 22, 2023

sungh66 commented Jan 9, 2024

Thank you for your reply. I have solved the pose extraction problem. I am now trying to train on the CO3D data myself to reproduce the results. Because of the large data volume, I tried training on a single category (109 GB), but the result was very poor and seems to have overfitted. Could you briefly share some training strategies, such as the composition of the training data? I would like to get the best result I can given the time cost. Also, which metrics matter most? I mainly look at the AUC metric.


jytime commented Jan 9, 2024

Well, in my experience, if you want to train on one category (such as teddybear), one GPU is enough. You may need to change the learning rate correspondingly, while the other hyperparameters should stay almost the same. From my observation, a model trained on a single category performs better on the corresponding test set than the multi-category model (i.e., trained on teddybear and tested on teddybear, although the test scenes are never seen during training).

My suggestions would be: (1) start with teddybear, because I have tried it before; (2) you can even begin by forcing the model to overfit on a single scene; (3) in most cases, the learning rate is the most important hyperparameter. Racc@15 and Tacc@15 are the metrics I care about most.
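
For reference, this is a rough sketch of what an Racc@15-style metric measures (an assumed implementation for illustration, not the repository's evaluation code): the fraction of predicted rotations whose geodesic error against ground truth is below 15 degrees.

import torch

def rotation_angle_deg(R_pred, R_gt):
    # Geodesic distance between batched rotation matrices (N, 3, 3), in degrees
    R_rel = torch.bmm(R_pred.transpose(1, 2), R_gt)
    cos = (torch.diagonal(R_rel, dim1=1, dim2=2).sum(-1) - 1.0) / 2.0
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))

def racc_at(R_pred, R_gt, threshold_deg=15.0):
    # Fraction of rotations within the angular threshold
    return (rotation_angle_deg(R_pred, R_gt) < threshold_deg).float().mean()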


sungh66 commented Jan 10, 2024

Thanks for your reply! The default stored weight file is model.safetensors. How can I convert it into a .pth file that I can load directly with torch? Is a loss of 0.03, with racc_15 and tacc_15 at 0.8 and 0.7 respectively, a good result (for a single category)?


jytime commented Jan 10, 2024

I am a bit confused by "The default stored weight file is model.safetensors". In our code, the trained checkpoint should be saved as below:

accelerator.save_state(output_dir=os.path.join(cfg.exp_dir, f"ckpt_{epoch:06}"))

You should be able to find the .pth files in the corresponding path and reload them with accelerate or torch itself. Please refer to the accelerate documentation for details.

racc_15 around 80% is not bad; at least it means something has been learned correctly.


jytime commented Jan 10, 2024

@sungh66 By the way, if you mean racc_15 on the training data, it should usually be above 90% or even 95% for single-category training.


sungh66 commented Jan 11, 2024

The default default_train.yaml does not have an exp_dir keyword, so I added it myself. The trained files are (ckpt_000055$ ls):
model.safetensors optimizer.bin random_states_0.pkl random_states_1.pkl scheduler.bin
However, resume_ckpt cannot directly load the model.safetensors path.


jytime commented Jan 11, 2024

I am not sure why it is saved as model.safetensors (probably a version difference); the relevant operations can be checked here: https://huggingface.co/docs/safetensors/index. Or I think you can directly use the built-in function accelerator.load_state (https://huggingface.co/docs/accelerate/v0.26.0/en/usage_guides/checkpoint#checkpointing):

from accelerate import Accelerator
import torch

accelerator = Accelerator(project_dir="my/save/path")

my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)

# Register the LR scheduler
accelerator.register_for_checkpointing(my_scheduler)

# Save the starting state
accelerator.save_state()

device = accelerator.device
my_model.to(device)

# Perform training
for epoch in range(num_epochs):
    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        accelerator.backward(loss)
        my_optimizer.step()
    my_scheduler.step()

# Restore the previous state
accelerator.load_state("my/save/path/checkpointing/checkpoint_0")
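
For the first option, reading model.safetensors directly, a minimal sketch (assuming the safetensors package is installed; the path matches the ckpt_000055 directory listed above, and model stands for the instantiated network):

import torch
from safetensors.torch import load_file

# Load the safetensors weights into a plain state dict
state_dict = load_file("ckpt_000055/model.safetensors")
model.load_state_dict(state_dict)

# Optionally re-save as a .pth file that torch.load can read directly
torch.save(state_dict, "ckpt_000055/model.pth")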


sungh66 commented Jan 15, 2024

Thanks for your reply! demo.py did not seem to be able to directly load the weights from my multi-GPU training, so I modified demo.py according to the Accelerator documentation and successfully loaded the weights, but I do not have gt-cameras.npz. The GT information here doesn't affect the RT result generation, right? The log is as follows:
t=00 | sampson=9.987731
t=01 | sampson=9.963307
Drop this pair because of insufficient valid matches
t=00 | sampson=9.987731
Drop this pair because of insufficient valid matches
t=01 | sampson=9.963307
Drop this pair because of insufficient valid matches
t=00 | sampson=9.987731
Drop this pair because of insufficient valid matches
t=01 | sampson=9.963307
Drop this pair because of insufficient valid matches
t=00 | sampson=9.987731
Drop this pair because of insufficient valid matches
t=01 | sampson=9.963307
Time taken: 8.7666 seconds
[2024-01-15 04:15:12,201][visdom][WARNING] - Setting up a new session...
[2024-01-15 04:15:12,211][websocket][INFO] - Websocket connected
[2024-01-15 04:15:12,212][visdom][INFO] - Visdom successfully connected to server
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Drop this pair because of insufficient valid matches
t=00 | sampson=9.962898
Time taken: 8.6465 seconds


jytime commented Jan 15, 2024

Hey, great to hear that you have resolved it. Yes, the GT information here does not affect the RT result generation; please just skip the corresponding code.

However, the GGS log does not look right here. The Sampson error shown is around 10, but ideally it should be reduced to something close to 1 or 2. It seems you may have changed the GGS settings, for example applying GGS starting from t=1 instead of t=10, or some other change. Please note that a Sampson error of 10 is very unlikely to lead to a good camera pose.
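
For context, the logged number is the Sampson epipolar error; a generic formulation looks roughly like the sketch below (an assumed stand-alone implementation, not the repository's exact code), where x1 and x2 are homogeneous image points of shape (N, 3) and F is the fundamental matrix.

import torch

def sampson_error(x1, x2, F):
    # First-order approximation of the reprojection error w.r.t. the epipolar constraint
    Fx1 = x1 @ F.T      # (N, 3), rows are F @ x1_i
    Ftx2 = x2 @ F       # (N, 3), rows are F^T @ x2_i
    num = (x2 * Fx1).sum(-1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den.clamp(min=1e-8)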


sungh66 commented Jan 18, 2024

Thank you for the reminder. The problem was that there were still some errors in multi-GPU prediction; I switched to loading the multi-GPU weights on a single card and made some key modifications, and now the Sampson results are completely normal. Could you please tell me how many epochs the full CO3D (5.5 TB) training took? Is len_train 16384? I would like to roughly estimate the GPU resources and time needed before purchasing equipment and reproducing it. It would be great if you could tell me!


jytime commented Jan 18, 2024

Hey, I do not remember exactly how many epochs it was trained for, but it took around 2-3 days on 8 A100 GPUs.


jytime commented Jan 31, 2024

To avoid safetensors, you can set:

accelerator.save_state(output_dir=ckpt_path, safe_serialization=False)
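
With safe_serialization=False, the model weights are typically written as pytorch_model.bin inside the checkpoint directory (an assumption about accelerate's default file naming), which plain torch can read; a minimal sketch, with model standing for the instantiated network:

import torch

# Load the non-safetensors checkpoint saved by accelerator.save_state
state_dict = torch.load("ckpt_000055/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)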


jytime commented Mar 9, 2024

@sungh66 If you run into a training speed problem, please check #33; it is related to data loading.
