Why use the features of the last 4 layers? #81
We note that you use the features of the last 4 layers from the encoder, instead of intermediate layers (e.g. [5, 12, 18, 24] for vitl) as in some other works such as DINOv2. What's the reason for that, and is there any remarkable difference between these two strategies?
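For concreteness, a minimal sketch of the two strategies using a DINOv2-style `get_intermediate_layers` call (the `encoder`/`x` names are placeholders, and the spread indices assume the quoted [5, 12, 18, 24] are 1-based):

```python
# Depth-Anything style: take the last 4 blocks of a 24-block vit-large (20-23, 0-based)
last_4_feats = encoder.get_intermediate_layers(x, n=4)

# DINOv2-style spread: explicit intermediate block indices ([5, 12, 18, 24] converted to 0-based)
spread_feats = encoder.get_intermediate_layers(x, n=[4, 11, 17, 23])
```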
Honestly, it is not an intentional practice. And we appreciate your reminder. Thank you.
I've played around with the 4 image encoder outputs and found that the results are not especially sensitive to throwing away some of the outputs. For example, for vit-large, if you repeat block 20 for all 4 outputs (instead of using 20, 21, 22, 23 as normal), the result tends to look qualitatively similar. The same is true for blocks 21 and 22, though block 23 gives a distorted result. A similar pattern holds for vit-b and vit-s, except block 23 doesn't give distorted results with vit-s, weirdly.

Here's a comparison between different outputs for vit-large, using an increased processing resolution. The top-right is the 'normal' output, while the bottom-left is the result from repeating block 20 for all 4 outputs.

Out of curiosity, I also tried skipping all but the last fusion step (the one which takes the block 20 result as an input), in which case you get the result in the bottom-right. It seems to have more details than normal, though it's incorrect as a depth map. It may just be missing the low-frequency ramp that would normally make it hard to see the details, which suggests the fusion steps may deal with 'frequency' information? I'm not sure what to make of it, but it's an interesting result!

At the very least, it feels like the model might perform just as well (with a bit of fine-tuning) with only 1 or 2 of the outputs instead of all 4, which might speed up inference a bit.
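For anyone wanting to try the 'repeat block 20' experiment, here's a minimal sketch of the idea, assuming a DINOv2-style encoder (`encoder` and `x` are placeholders; the actual Depth-Anything code collects these features inside the model's forward rather than like this):

```python
# Grab block 20's output once (0-based index), then feed copies to all 4 fusion taps,
# instead of the usual last-4 selection: encoder.get_intermediate_layers(x, n=4, ...)
(block_20_feat,) = encoder.get_intermediate_layers(x, n=[20], return_class_token=True)
out_features = [block_20_feat] * 4  # same tensors repeated for each reassembly/fusion stage
```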
Hi @heyoeyo, thank you for sharing such an interesting observation!
I've done a few more experiments with this and found that the Depth-Anything vit-large model can consistently generate these hi-detail outputs by scaling the fusion steps. For example, for vit-l, you can try adding scaling factors on the last two steps, which seem to have the biggest impact:

```python
path_2 = self.scratch.refinenet2(path_3 * 0.15, layer_2_rn, size=layer_1_rn.shape[2:])
path_1 = self.scratch.refinenet1(path_2 * 0.7, layer_1_rn)
```

I've set up an interactive demo for this, in case anyone wants to play around with it to see what the fusion layers do (fusion 2 on vit-l has a blurring effect when scaled >1, for example).

For anyone interested, I have a repo, MuggledDPT, that includes other scripts for interacting with the Depth-Anything (and other DPT) outputs, including taking a webcam input. Not sure if you're still taking community repos @LiheYoung, but you're welcome to add this to the listing if you like; it's mostly meant to be an educational/explainer repo.
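For reference, here's roughly where those lines sit in the full fusion sequence, written as a sketch with the scale factors pulled out as arguments (`scale_2`/`scale_1` are made-up names; the refinenet calls mirror the original DPT head):

```python
def run_fusion(self, layer_1_rn, layer_2_rn, layer_3_rn, layer_4_rn, scale_2=0.15, scale_1=0.7):
    # Standard top-down fusion, coarsest features first
    path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
    path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
    # Scaling the incoming paths on the last two steps changes the detail/ramp balance
    path_2 = self.scratch.refinenet2(path_3 * scale_2, layer_2_rn, size=layer_1_rn.shape[2:])
    path_1 = self.scratch.refinenet1(path_2 * scale_1, layer_1_rn)
    return path_1
```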
Thank you for providing these surprising and interesting observations! I tested more images and had similar observations to yours.
Btw, when only using the final fusion block, did you use it as:

```python
path_1 = self.scratch.refinenet1(layer_1_rn, layer_1_rn)  # replace the original "path_2" variable with "layer_1_rn"
```

Thank you again. I will definitely add your repo MuggledDPT to our repo in our next update.
I can't remember exactly, but I think I did something equivalent to:

```python
layer_1_rn = self.scratch.layer1_rn(layer_1)
# layer_2_rn = self.scratch.layer2_rn(layer_2)
# layer_3_rn = self.scratch.layer3_rn(layer_3)
# layer_4_rn = self.scratch.layer4_rn(layer_4)

# path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
# path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
# path_2 = self.scratch.refinenet2(path_3, layer_2_rn, size=layer_1_rn.shape[2:])
path_1 = self.scratch.refinenet1(torch.zeros_like(layer_1_rn), layer_1_rn)
```

(It only 'works' with vit-l; the base and especially the small models are distorted by this.)

The other example, 'using block 20 only', was done with a modification equivalent to changing the loop over the image encoder features to something like:

```python
for i, x in enumerate(out_features[0] for _ in range(4)):
```

This also has odd behavior. Repeating index [0], [1], or [2] gives nearly identical (good) results, but repeating index [3] gives a distorted output, at least for vit-l. It seems to suggest that there is something wrong/different with the last layer's output.

In case you hadn't seen it, there's a paper, "Vision Transformers Need Registers", that mentions artifacts in the later layers of the dinov2 encoder, which are also evident in the depth-anything models. That might have something to do with these odd behaviors, though vit-l has artifacts starting on blocks 15-17, so I'm not really sure.
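If anyone wants to look for those artifacts directly, here's a rough sketch (not from the thread) that prints per-block token norms; high-norm outlier tokens in the later blocks are the 'register' artifacts the paper describes. `encoder` and `img` are placeholders for a DINOv2-style backbone and a preprocessed image batch:

```python
import torch

with torch.no_grad():
    # Request every block's (normalized) patch tokens; vit-l has 24 blocks
    all_block_feats = encoder.get_intermediate_layers(img, n=list(range(24)))
    for idx, tokens in enumerate(all_block_feats):
        norms = tokens.norm(dim=-1)  # per-token L2 norms, shape (batch, num_patch_tokens)
        print(f"block {idx:2d}: median norm {norms.median().item():6.1f}, max norm {norms.max().item():6.1f}")
```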
Thanks!