Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing benchmarks remain limited in scale, model coverage, and evaluation granularity, which restricts the development of reliable T23D quality assessment (T23DQA). Moreover, current objective evaluators often fail to capture fine-grained compositional and structural properties. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective two-stage rank-learning evaluator for T23DQA. In the first stage, Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning; in the second stage, it refines predictions against mean opinion scores to align more closely with human judgments. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models.
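To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the general idea. It is illustrative only, not the authors' implementation: QualityScorer, the margin ranking loss (standing in for the paper's supervised contrastive regression), and the MOS-gap ordering used as a curriculum proxy are all assumptions introduced for this example.

import torch
import torch.nn as nn

class QualityScorer(nn.Module):
    """Toy scorer: maps a per-asset feature vector to a scalar quality score."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def margin_rank_loss(s_pref, s_other, margin=0.5):
    # Pairwise ranking objective: the human-preferred asset should score
    # higher by at least `margin` (a simplification of the paper's
    # supervised contrastive regression, which is not reproduced here).
    return torch.relu(margin - (s_pref - s_other)).mean()

torch.manual_seed(0)
model = QualityScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic stand-ins for features of rated 3D assets and their MOS labels.
feats = torch.randn(200, 128)
mos = torch.rand(200) * 4 + 1  # mean opinion scores in [1, 5]

# Stage 1: pairwise rank learning with a simple curriculum.
# Curriculum proxy: train on "easy" pairs (large MOS gap) before "hard" ones.
idx = torch.randint(0, 200, (500, 2))
idx = idx[mos[idx[:, 0]] != mos[idx[:, 1]]]  # drop tied pairs
gaps = (mos[idx[:, 0]] - mos[idx[:, 1]]).abs()
idx = idx[gaps.argsort(descending=True)]     # order pairs easy -> hard
for a, b in idx:
    pref, other = (a, b) if mos[a] > mos[b] else (b, a)
    loss = margin_rank_loss(model(feats[pref]), model(feats[other]))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: regress to MOS so scores align with absolute human ratings.
mse = nn.MSELoss()
for _ in range(50):
    loss = mse(model(feats), mos)
    opt.zero_grad()
    loss.backward()
    opt.step()

The key design point the sketch preserves is the division of labor: stage 1 only constrains relative ordering, which is robust to rater scale differences, while stage 2 calibrates the scores against absolute MOS values so the evaluator's outputs are directly comparable to human ratings.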
@article{cui2025towards,
  title={Towards Fine-Grained Text-to-3D Quality Assessment: A Database and A Two-Stage Rank-Learning Metric},
  author={Bingyang Cui and Yujie Zhang and Qi Yang and Zhu Li and Yiling Xu},
  journal={arXiv preprint arXiv:2509.23841},
  year={2025}
}