We introduce a benchmark designed to advance general-purpose, large-scale 3D vision models for remote sensing imagery. The dataset comprises 54,951 pairs of remote sensing images and pixel-aligned depth maps, each accompanied by a textual description, and covers a broad range of geographical contexts. It supports both training and evaluation of 3D visual perception models on spatial understanding tasks for remote sensing images. We also introduce a depth estimation model for remote sensing imagery built on Stable Diffusion, which exploits its multimodal fusion capabilities and achieves state-of-the-art performance on our dataset. We aim to contribute to the development of 3D visual perception models and to the advancement of geographic artificial intelligence in the remote sensing domain.
The RS3DBench data collection pipeline: (1) data crawling, (2) alignment, (3) annotation, (4) post-processing.
Figure 1 shows the percentage distribution of terrain types across altitudes, excluding the ocean (whose altitude is predominantly 0 m). Altitudes in the dataset range from -149 m to 4813 m. This range matters for large-scale remote sensing terrain modeling, because it compensates for the limited elevation variation and terrain diversity of existing depth estimation datasets. Figure 2 presents the overall proportions of the six major terrain categories: ocean 3.3%, low undulating mountains 9.1%, hills 6.4%, plains 38.0%, highland 13.5%, and high undulating mountains 29.8%. This coverage supports model training in complex geographical scenarios. Figure 3 depicts the terrain distribution at four resolutions (30 m, 5 m, 2 m, 0.5 m), with the cube root of the number of pixels on the vertical axis. The results indicate that the low-resolution data (30 m, 5 m) contain a higher proportion of plains and highland, while the high-resolution data (2 m, 0.5 m) consist primarily of mountainous terrain.
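The two quantities described above can be illustrated with a minimal Python sketch. The terrain shares are copied from the Figure 2 description; the function names `cube_root_scale` and `total_share` are hypothetical and not part of any released code. The pixel counts in the usage note are illustrative placeholders, not values from the dataset.

```python
# Terrain shares (percent) as stated in the text for Figure 2.
TERRAIN_SHARE = {
    "ocean": 3.3,
    "low undulating mountains": 9.1,
    "hills": 6.4,
    "plains": 38.0,
    "highland": 13.5,
    "high undulating mountains": 29.8,
}


def total_share(shares: dict) -> float:
    """Sum the reported percentages as a quick consistency check."""
    return sum(shares.values())


def cube_root_scale(pixel_count: float) -> float:
    """Compress a non-negative pixel count with a cube root.

    This mirrors the vertical-axis transform described for Figure 3;
    it is an assumption that the figure applies exactly this mapping.
    """
    if pixel_count < 0:
        raise ValueError("pixel_count must be non-negative")
    return pixel_count ** (1.0 / 3.0)


if __name__ == "__main__":
    # The reported shares add up to about 100.1, presumably due to rounding.
    print(f"total share: {total_share(TERRAIN_SHARE):.1f}%")

    # Illustrative placeholder counts (NOT dataset values): a cube root
    # maps counts that differ by six orders of magnitude onto a range
    # that differs by only two, so small classes stay visible in a plot.
    for count in (1e3, 1e6, 1e9):
        print(f"{count:.0e} pixels -> {cube_root_scale(count):.1f}")
```

A cube-root axis is a reasonable choice when per-class pixel counts span several orders of magnitude: unlike a logarithmic axis, it is defined at zero, so empty terrain classes at a given resolution can still be drawn.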