Why use the features of the last 4 layers? #81
We note that you use the features of the last 4 layers from the encoder, instead of intermediate layers (e.g. [5, 12, 18, 24] for vitl) as in some other works such as DINOv2. What's the reason for that, and is there any remarkable difference between these two strategies?
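For concreteness, a minimal sketch of the two strategies using a DINOv2-style `get_intermediate_layers` call (the `encoder`/`x` names are placeholders, and the spread indices assume the quoted [5, 12, 18, 24] are 1-based):

```python
# Depth-Anything style: take the last 4 blocks of a 24-block vit-large (20-23, 0-based)
last_4_feats = encoder.get_intermediate_layers(x, n=4)

# DINOv2-style spread: explicit intermediate block indices ([5, 12, 18, 24] converted to 0-based)
spread_feats = encoder.get_intermediate_layers(x, n=[4, 11, 17, 23])
```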
Honestly, it is not an intentional practice. And we appreciate your reminder. Thank you.
I've played around with the 4 image encoder outputs and found that the results are not especially sensitive to throwing away some of the outputs. For example, for vit-large, if you repeat block 20 for all 4 outputs (instead of using 20, 21, 22, 23 as normal), the result tends to look qualitatively similar. The same is true for blocks 21 and 22, though block 23 gives a distorted result. A similar pattern holds for vit-b and vit-s, except block 23 doesn't give distorted results with vit-s, weirdly.

Here's a comparison between different outputs for vit-large, using an increased processing resolution. The top-right is the 'normal' output, while the bottom-left is the result from repeating block 20 for all 4 outputs.

Out of curiosity, I also tried skipping all but the last fusion step (the one which takes the block 20 result as an input), in which case you get the result in the bottom-right. It seems to have more details than normal, though it's incorrect as a depth map. It may just be missing the low-frequency ramp that would normally make it hard to see the details, which suggests the fusion steps may deal with 'frequency' information? I'm not sure what to make of it, but it's an interesting result!

At the very least, it feels like the model might perform just as well (with a bit of fine-tuning) with only 1 or 2 of the outputs instead of all 4, which might speed up inference a bit.
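For anyone wanting to try the 'repeat block 20' experiment, here's a minimal sketch of the idea, assuming a DINOv2-style encoder (`encoder` and `x` are placeholders; the actual Depth-Anything code collects these features inside the model's forward rather than like this):

```python
# Grab block 20's output once (0-based index), then feed copies to all 4 fusion taps,
# instead of the usual last-4 selection: encoder.get_intermediate_layers(x, n=4, ...)
(block_20_feat,) = encoder.get_intermediate_layers(x, n=[20], return_class_token=True)
out_features = [block_20_feat] * 4  # same tensors repeated for each reassembly/fusion stage
```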
Hi @heyoeyo, thank you for sharing such an interesting observation!
I've done a few more experiments with this and found that the Depth-Anything vit-large model can consistently generate these hi-detail outputs by scaling the fusion steps. For example, for vit-l, you can try adding scaling factors on the last two steps, which seem to have the biggest impact:

```python
path_2 = self.scratch.refinenet2(path_3 * 0.15, layer_2_rn, size=layer_1_rn.shape[2:])
path_1 = self.scratch.refinenet1(path_2 * 0.7, layer_1_rn)
```

I've set up an interactive demo for this, in case anyone wants to play around with it to see what the fusion layers do (fusion 2 on vit-l has a blurring effect when scaled >1, for example).

For anyone interested, I have a repo, MuggledDPT, that includes other scripts for interacting with the Depth-Anything (and other DPT) outputs, including taking a webcam input. Not sure if you're still taking community repos @LiheYoung, but you're welcome to add this to the listing if you like; it's mostly meant to be an educational/explainer repo.
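For reference, here's roughly where those lines sit in the full fusion sequence, written as a sketch with the scale factors pulled out as arguments (`scale_2`/`scale_1` are made-up names; the refinenet calls mirror the original DPT head):

```python
def run_fusion(self, layer_1_rn, layer_2_rn, layer_3_rn, layer_4_rn, scale_2=0.15, scale_1=0.7):
    # Standard top-down fusion, coarsest features first
    path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
    path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
    # Scaling the incoming paths on the last two steps changes the detail/ramp balance
    path_2 = self.scratch.refinenet2(path_3 * scale_2, layer_2_rn, size=layer_1_rn.shape[2:])
    path_1 = self.scratch.refinenet1(path_2 * scale_1, layer_1_rn)
    return path_1
```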
Thank you for providing these surprising and interesting observations! I tested more images and had similar observations to yours.
Btw, when only using the final fusion block, did you use it as:

```python
path_1 = self.scratch.refinenet1(layer_1_rn, layer_1_rn)  # replace the original "path_2" variable with "layer_1_rn"
```

Thank you again. I will definitely add your repo MuggledDPT to our repo in our next update.
I can't remember exactly, but I think I did something equivalent to:

```python
layer_1_rn = self.scratch.layer1_rn(layer_1)
# layer_2_rn = self.scratch.layer2_rn(layer_2)
# layer_3_rn = self.scratch.layer3_rn(layer_3)
# layer_4_rn = self.scratch.layer4_rn(layer_4)

# path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
# path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
# path_2 = self.scratch.refinenet2(path_3, layer_2_rn, size=layer_1_rn.shape[2:])
path_1 = self.scratch.refinenet1(torch.zeros_like(layer_1_rn), layer_1_rn)
```

(It only 'works' with vit-l; the base and especially the small models are distorted by this.)

The other example, 'using block 20 only', was done with a modification equivalent to changing the loop over the image encoder features to something like:

```python
for i, x in enumerate(out_features[0] for _ in range(4)):
```

This also has odd behavior. Repeating index [0], [1], or [2] gives nearly identical (good) results, but repeating index [3] gives a distorted output, at least for vit-l. It seems to suggest that there is something wrong/different with the last layer's output.

In case you hadn't seen it, there's a paper, "Vision Transformers Need Registers", that mentions artifacts in the later layers of the dinov2 encoder, which are also evident in the depth-anything models. That might have something to do with these odd behaviors, though vit-l has artifacts starting on blocks 15-17, so I'm not really sure.
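If anyone wants to look for those artifacts directly, here's a rough sketch (not from the thread) that prints per-block token norms; high-norm outlier tokens in the later blocks are the 'register' artifacts the paper describes. `encoder` and `img` are placeholders for a DINOv2-style backbone and a preprocessed image batch:

```python
import torch

with torch.no_grad():
    # Request every block's (normalized) patch tokens; vit-l has 24 blocks
    all_block_feats = encoder.get_intermediate_layers(img, n=list(range(24)))
    for idx, tokens in enumerate(all_block_feats):
        norms = tokens.norm(dim=-1)  # per-token L2 norms, shape (batch, num_patch_tokens)
        print(f"block {idx:2d}: median norm {norms.median().item():6.1f}, max norm {norms.max().item():6.1f}")
```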
Thanks!