Computer vision, the science of teaching machines to understand the visual world, has been transformed over the past decade by the paradigm shift from hand-crafted methods to deep neural networks, known as deep learning, which has led to breakthroughs across a wide range of vision problems. Recently, a new trend has sparked fresh interest in the community and may greatly impact the field in the long run: the scaling of vision models.
Specifically, the size of vision models has grown exponentially, from tens of millions of parameters to hundreds of millions or even billions, particularly after the emergence of Vision Transformers. Moreover, the scale and diversity of training data have also increased dramatically to match the growth in model capacity, not only in quantity (e.g., billions of web examples) but also in modality, such as combining images and language. For brevity, we refer to these models as Large Vision Models (LVMs), a term that covers both unimodal and multimodal vision models (e.g., visual language models).
On one hand, LVMs trained on broad data at scale have demonstrated strong generalization capability: they can cope with a wide range of domains and scenarios, and can be adapted, with minimal modification, to handle multiple visual tasks, such as image classification/captioning/segmentation, object/keypoint detection, and depth/surface normal estimation. Furthermore, multimodal LVMs have opened up numerous downstream zero-shot inference applications, such as open-vocabulary classification/detection/segmentation and image editing/generation.
On the other hand, LVMs come with challenges and risks that the community needs to address: training is costly and has a negative environmental impact; LVMs are too big to fine-tune on downstream datasets; the uneven distribution of web data may cause social biases (e.g., with respect to gender and race) and inequalities; the commonsense reasoning ability of LVMs still lags behind; and so on.
This special issue seeks original contributions towards advancing LVMs—in terms of development, evaluation, adaptation, applications, understanding, and so on—and addressing the potential negative aspects brought by LVMs.
The submission system is ready. Authors should select the "S.I.: The Promises and Dangers of Large Vision Models" article type when submitting their manuscripts to the journal.
Submitted papers should present original, unpublished work, relevant to one of the topics of the Special Issue. All submitted papers will be evaluated on the basis of relevance, significance of contribution, technical quality, scholarship, and quality of presentation, by at least three independent reviewers. It is the policy of the journal that no submission, or substantially overlapping submission, be published or be under review at another journal or conference at any time during the review process.
Please refer to the official CfP on the International Journal of Computer Vision (IJCV) website for more information.