SPACE-CLIP

Abstract

Contrastive Language-Image Pre-training (CLIP) has achieved extraordinary success in semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper introduces a fundamentally different approach built on a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, completely bypassing the text encoder and its associated textual prompts. A semantic pathway interprets high-level features, dynamically conditioned on global context via feature-wise linear modulation (FiLM), while a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are hierarchically fused, enabling a robust synthesis of semantic context and precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP dramatically outperforms previous CLIP-based methods, and our ablation studies validate that the synergistic fusion of the dual pathways is critical to this success. SPACE-CLIP offers a new, efficient, and architecturally elegant blueprint for repurposing large-scale vision models: not just a standalone depth estimator, but a readily integrable spatial perception module for the next generation of embodied AI systems, such as vision-language-action (VLA) models.
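
The core idea of bypassing the text encoder can be sketched as follows: extract hierarchical features from a frozen CLIP vision encoder and hand them to a decoder, with no prompts involved. The snippet below is a minimal illustration using the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch16 checkpoint as an assumed stand-in; the backbone and the specific layer indices used by SPACE-CLIP may differ.

import torch
from transformers import CLIPVisionModel

# Load and freeze the vision encoder only; the text encoder is never instantiated.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

pixel_values = torch.randn(1, 3, 224, 224)  # dummy preprocessed image batch

with torch.no_grad():
    out = encoder(pixel_values=pixel_values, output_hidden_states=True)

# hidden_states is a tuple: the patch embeddings plus one tensor per transformer block,
# each of shape (B, 1 + num_patches, hidden_dim). Early layers would feed a structural
# pathway, late layers a semantic pathway (the indices here are illustrative only).
low_level = out.hidden_states[3]
high_level = out.hidden_states[-1]
global_context = out.pooler_output  # (B, hidden_dim), a candidate conditioning signal
print(low_level.shape, high_level.shape, global_context.shape)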

Overview

The VLA Integration Challenge and the SPACE-CLIP Solution. This figure illustrates the architectural paradigm shift proposed by our work. (a) The Inefficient Path: Conventional CLIP-based methods integrate depth estimation as an external module, creating two major bottlenecks for a Vision-Language-Action (VLA) model. (1) Architectural Conflict arises from needing a separate, inefficient data path from the VLA's primary vision encoder. (2) Textual Interference occurs as the VLA's reasoning engine must generate a textual query, conflicting with its main language tasks. (b) SPACE-CLIP's Seamless Integration: Our model acts as a lightweight, decoder-only module that attaches directly to the VLA's existing, frozen vision encoder. By directly interpreting hierarchical features, it provides spatial awareness without architectural redundancy or textual interference, enabling a truly integrated and efficient agent.

Architecture

The overall architecture of SPACE-CLIP. Our model leverages a frozen CLIP vision encoder to extract multi-level features, which are processed through a novel Dual Pathway architecture within the Dense Predictor. (1) The Semantic Decoder (top path) processes high-level features, dynamically modulated by the global context via a FiLM layer. (2) The Structural Decoder (bottom path) processes low-level features to preserve fine-grained details. The outputs of these two pathways are hierarchically fused at each upsampling stage. This fusion mechanism allows SPACE-CLIP to generate semantically coherent and geometrically precise depth maps.
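
A minimal PyTorch sketch of the two mechanisms named above: a FiLM layer that modulates semantic features with a global context vector, and a single hierarchical fusion stage that merges the semantic and structural streams while upsampling. Channel sizes, the fusion operator (concatenation followed by a convolution), and the example shapes are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLM(nn.Module):
    """Predict per-channel scale (gamma) and shift (beta) from a global context vector."""
    def __init__(self, context_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(context_dim, 2 * num_channels)

    def forward(self, feat, context):
        # feat: (B, C, H, W), context: (B, context_dim)
        gamma, beta = self.to_gamma_beta(context).chunk(2, dim=1)
        return gamma[:, :, None, None] * feat + beta[:, :, None, None]

class FusionStage(nn.Module):
    """One upsampling stage: FiLM-modulate semantic features, then fuse with structural features."""
    def __init__(self, sem_ch, struct_ch, out_ch, context_dim):
        super().__init__()
        self.film = FiLM(context_dim, sem_ch)
        self.fuse = nn.Sequential(
            nn.Conv2d(sem_ch + struct_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, sem, struct, context):
        sem = self.film(sem, context)
        sem = F.interpolate(sem, size=struct.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([sem, struct], dim=1))

# Hypothetical shapes: semantic 14x14, structural 28x28, 768-d global context.
stage = FusionStage(sem_ch=256, struct_ch=128, out_ch=128, context_dim=768)
sem = torch.randn(1, 256, 14, 14)
struct = torch.randn(1, 128, 28, 28)
context = torch.randn(1, 768)
print(stage(sem, struct, context).shape)  # torch.Size([1, 128, 28, 28])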

Experiment

Qualitative comparison on the KITTI dataset. From left to right: Input Image, Ground Truth, and our SPACE-CLIP's prediction. Our model generates dense and detailed depth maps that accurately capture complex structures such as thin poles, distant vehicles, and foliage, demonstrating its superior geometric understanding.

Conclusion

SPACE-CLIP offers more than just a performance improvement; it presents a new, efficient, and architecturally elegant paradigm for repurposing foundation models. By keeping the massive pre-trained encoder untouched, our approach promotes a more sustainable and accessible methodology for developing advanced AI systems. Looking forward, we believe this direct interpretation strategy holds immense potential for enriching the next generation of embodied AI, such as VLA models, with the critical spatial awareness needed to intelligently interact with the physical world.

Setup

conda create -n space-clip python=3.10
conda activate space-clip
pip install -r requirements.txt
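
After installation, a quick import check confirms the environment is usable. This assumes requirements.txt pulls in PyTorch; adjust to whatever the file actually lists.

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())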

Citation

If you find this repository helpful, please consider citing:

@misc{cho2026spaceclipspatialperceptionadaptive,
      title={SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation}, 
      author={Taewan Cho and Taeryang Kim and Andrew Jaeyong Choi},
      year={2026},
      eprint={2601.17657},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.17657}, 
}
