
Abstract Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with …
Abstract Contrastive Language-Image Pre-training (CLIP) [37] has emerged as a pivotal model in computer vision and multi-modal learning, achieving state-of-the-art performance at aligning …
Abstract This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have …
Figure 1. Modular Customization of Diffusion Models. Given a large set of individual concepts (left), the goal of Modular Customization is to enable independent customization (fine-tuning) …
Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners Zitian Chen1, Yikang Shen2, Mingyu Ding3, Zhenfang Chen2, Hengshuang Zhao3, Erik Learned-Miller1, Chuang …
Abstract We propose a simple but effective modular approach MOPA (Modular ObjectNav with PointGoal agents) to systematically investigate the inherent modularity of the object …
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about …
Modular Blind Video Quality Assessment Wen Wen1, Mu Li2, Yabin Zhang3, Yiting Liao3, Junlin Li3, Li Zhang3, and Kede Ma1*
Figure 1. Our novel modular transfer learning approach for semantic visual navigation learns a general purpose semantic search policy by finding image views sampled randomly in the …
Abstract One of the hallmarks of human intelligence is the ability to compose learned knowledge into novel concepts which can be recognized without a single training example. In contrast, …