Towards artificial general intelligence via a multimodal foundation model

Computer Science

N. Fei, Z. Lu, et al.

This study introduces BriVL, a large-scale multimodal foundation model that demonstrates both understanding and imagination abilities across a range of cognitive tasks. Conducted by Nanyi Fei and colleagues, the work is framed as a step towards Artificial General Intelligence (AGI).

Introduction
The study targets a key AGI objective: building systems that generalize across diverse cognitive tasks, handle unseen problems, and transfer knowledge across contexts. Existing AI advances typically excel at single abilities (vision, language, reasoning) and often rely on strongly correlated image-caption data and computationally heavy architectures (detectors, single-tower models), limiting generalization and real-world applicability. The authors hypothesize that a large-scale multimodal foundation model, pre-trained on weakly correlated image–text pairs reflecting real-world distributions, can learn more general, transferable, and cognitive representations, including imagination-like capabilities. They propose BriVL, a two-tower vision–language model trained via cross-modal contrastive learning with momentum queues, aiming for efficient large-scale training and fast inference while capturing global, context-level semantics beyond fine-grained region–word matches.
Literature Review
Prior multimodal foundation models (e.g., UNITER, OSCAR, VisualBERT, VL-BERT, M6) often assume strong semantic alignment between image–text pairs and frequently depend on object detectors and single-tower fusion, which is computationally intensive and constrains scalability and retrieval latency. Two-tower approaches like CLIP and ALIGN showed strong transfer by learning global embeddings but rely on within-batch negatives and curated datasets with filtering rules that shift data distributions away from the web’s natural long tail. Contrastive learning (e.g., CPC, SimCLR, MoCo) established powerful self-supervised paradigms; MoCo introduced momentum encoders and negative queues to decouple negative sample size from batch size. The authors position BriVL as combining two-tower efficiency with MoCo-style momentum queues and a large, minimally filtered, weakly correlated web-scale dataset to better reflect real-world multimodal semantics and enable cognitive abilities such as imagination.
Methodology
Data: A weak semantic correlation dataset (WSCD) of ~650M Chinese image–text pairs crawled from news, encyclopedia, and social media, minimally filtered only for pornographic/sensitive content, preserving long-tail and abstract concepts. English tasks were handled by translating to Chinese; an English-pretrained variant is referenced in the supplementary materials.

Architecture: Two-tower encoders with additional momentum encoders (four towers during pre-training). The image encoder uses an EfficientNet-B7 backbone with Multi-Scale Patch Pooling (MSPP) to extract 37 patch features (scales 1×1 and 6×6), followed by a Transformer-style self-attention block and a two-layer MLP projection into a 2,560-d joint space. The text encoder uses a Transformer (RoBERTa-Large) with self-attention and a two-layer MLP projection into the same space.

Learning objective: Cross-modal contrastive learning with an InfoNCE loss in both directions (image-to-text and text-to-image). MoCo-style momentum encoders (momentum m=0.99) maintain large negative queues (N_i=N_t=13,440), decoupling the number of negatives from the batch size. Temperature τ=0.07. Similarity is the dot product of L2-normalized embeddings. (A minimal sketch of this objective appears at the end of this section.)

Training details: Images resized to 600×600 with random graying and color jitter. Optimizer: Adam (lr=1e-4, weight decay 1e-5). Distributed training with DeepSpeed across 14 machines (8×NVIDIA A100 each), mini-batch size 192 per machine (2,688 in total), over 112 GPUs for ~10 days.

Inference: The two-tower design enables pre-computation and indexing of candidate embeddings for low-latency retrieval.

Visualization and generation: Neural network feature visualization directly optimizes a random image to match a given text embedding using frozen BriVL encoders (cosine-similarity objective); see the second sketch at the end of this section. Text-to-image generation uses a frozen VQGAN (pre-trained on ImageNet) together with frozen BriVL to optimize VQGAN token sequences so that image and text embeddings align, with iterative quantization to the VQGAN codebook.

Baselines and tasks: Compared against CLIP variants (ResNet-50/101/50×4) and a zero-shot remote sensing classifier (ZSSC). Evaluated on zero-shot remote sensing classification (UCM, AID), zero-shot news classification (Toutiao News, THUCNews), cross-modal retrieval (AIC-ICC), and VQA (Visual7W, 'Telling' subset), with Chinese translations as needed.
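As a rough illustration of the learning objective described above, the sketch below shows one bidirectional InfoNCE step with MoCo-style momentum encoders and negative queues. It is a minimal PyTorch-style sketch, not the released BriVL code; the encoder interfaces, tensor shapes, and queue handling are assumptions made for readability.

```python
import torch
import torch.nn.functional as F

def contrastive_step(img_enc, txt_enc, img_enc_m, txt_enc_m,
                     img_queue, txt_queue, images, texts,
                     tau=0.07, momentum=0.99):
    """One bidirectional InfoNCE step with MoCo-style momentum encoders
    and negative queues (illustrative sketch, not the released code).
    img_queue / txt_queue hold L2-normalized embeddings of past batches,
    shape (queue_size, dim); encoders are assumed to return (B, D)."""
    # Momentum update of the key (momentum) encoders.
    with torch.no_grad():
        for q, k in [(img_enc, img_enc_m), (txt_enc, txt_enc_m)]:
            for p_q, p_k in zip(q.parameters(), k.parameters()):
                p_k.data.mul_(momentum).add_(p_q.data, alpha=1.0 - momentum)

    # Query embeddings from the online encoders (gradients flow here).
    z_i = F.normalize(img_enc(images), dim=-1)            # (B, D)
    z_t = F.normalize(txt_enc(texts), dim=-1)             # (B, D)

    # Key embeddings from the momentum encoders (no gradients).
    with torch.no_grad():
        k_i = F.normalize(img_enc_m(images), dim=-1)      # (B, D)
        k_t = F.normalize(txt_enc_m(texts), dim=-1)       # (B, D)

    def info_nce(query, pos_key, neg_queue):
        # Positive is the paired key; negatives come from the queue.
        l_pos = (query * pos_key).sum(dim=-1, keepdim=True)   # (B, 1)
        l_neg = query @ neg_queue.t()                          # (B, N)
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(query.size(0), dtype=torch.long,
                             device=query.device)              # positives at index 0
        return F.cross_entropy(logits, labels)

    # Image-to-text and text-to-image directions.
    loss = info_nce(z_i, k_t, txt_queue) + info_nce(z_t, k_i, img_queue)

    # Enqueue the new keys and drop the oldest entries (FIFO).
    img_queue = torch.cat([k_i, img_queue], dim=0)[: img_queue.size(0)]
    txt_queue = torch.cat([k_t, txt_queue], dim=0)[: txt_queue.size(0)]
    return loss, img_queue, txt_queue
```

Because negatives come from the queues rather than the current batch, the number of negatives (13,440 in the paper) is independent of the per-GPU batch size, which is the main resource saving over in-batch contrastive training.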
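The feature-visualization procedure can be sketched in a few lines: a random image is optimized so that its embedding from the frozen image tower matches a given text embedding under a cosine-similarity objective. The callables `brivl_image_encoder` / `brivl_text_encoder`, the optimizer choice, and the hyperparameters below are assumptions for illustration; text-to-image generation follows the same idea but optimizes VQGAN token embeddings instead of raw pixels.

```python
import torch
import torch.nn.functional as F

def visualize_text(brivl_image_encoder, brivl_text_encoder, text,
                   steps=1000, lr=0.05, size=600):
    """Optimize a random image so its embedding matches the text embedding
    under frozen encoders (sketch of the feature-visualization setup;
    encoders are assumed frozen, hyperparameters are illustrative)."""
    with torch.no_grad():
        z_text = F.normalize(brivl_text_encoder(text), dim=-1)

    image = torch.rand(1, 3, size, size, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        z_img = F.normalize(brivl_image_encoder(image), dim=-1)
        # Maximize cosine similarity between image and text embeddings.
        loss = 1.0 - (z_img * z_text).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        image.data.clamp_(0.0, 1.0)   # keep pixels in a valid range

    return image.detach()
```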
Key Findings
- Visualization and imagination: BriVL's feature visualizations for abstract concepts (e.g., "nature", "time", "science", "dream") and full sentences (including classical Chinese poetry) yield coherent, context-rich images, suggesting learned common-sense associations. Compared to CLIP-guided VQGAN generations, BriVL-guided generations are more realistic and globally coherent, likely because the weakly correlated training data emphasizes holistic scene understanding.
- Zero-shot remote sensing scene classification (predictions computed by label-embedding similarity, as in the sketch after this list):
  • UCM (21 classes): BriVL achieves 58.43% on the full 21/0 (unseen/seen) split and outperforms CLIP variants across splits: 82.41% (5/16), 72.91% (8/13), 69.16% (11/10), 65.06% (14/7). CLIP R50: 50.19/71.98/64.66/59.87/57.14; CLIP R101: 54.81/76.84/70.52/63.61/62.01; CLIP R50x4: 56.67/76.02/71.53/64.44/63.77 (same split order: 21/0, 5/16, 8/13, 11/10, 14/7). ZSSC: 58.7 (5/16), 35.4 (8/13), 19.6 (11/10), 15.1 (14/7).
  • AID (30 classes): BriVL achieves 58.12% (30/0) and leads across splits: 76.73% (8/22), 71.25% (12/18), 67.52% (16/14), 64.19% (20/10). CLIP R50: 46.01/65.99/59.15/54.44/51.72; CLIP R101: 48.05/68.71/64.39/57.75/54.54; CLIP R50x4: 50.96/69.32/64.30/59.53/56.35 (same split order).
  • Visualization of "baseball field viewed from above" reflects aerial-perspective components (e.g., a sector-shaped area), despite limited remote sensing data in WSCD, indicating perspective generalization.
- Zero-shot news classification:
  • Toutiao News / THUCNews accuracy (%): RoBERTa-base 36.51/26.61; RoBERTa-base (finetuned on 22M texts) 38.14/28.69; BriVL w/ RoBERTa-base 52.01/47.53; RoBERTa-large 42.55/49.62; BriVL w/ RoBERTa-large 62.38/58.83. Cross-modal pre-training substantially boosts zero-shot textual classification.
  • Category-wise analysis (Toutiao): BriVL w/ RoBERTa-large improves 10 of 15 categories (e.g., sports +88.22 pp, entertainment +60.36 pp) with a few declines (e.g., automobile −10.58 pp), reflecting broader association coverage.
- Cross-modal retrieval (AIC-ICC, Chinese):
  • Direct training baseline: Image→Text R@1/5/10 = 36.03/59.48/69.71; Text→Image 28.66/54.33/65.26; Recall@SUM 317.47.
  • Best pre-train & finetune setting (fixed BN, 4 unfixed blocks): Image→Text 45.61/68.01/76.31; Text→Image 34.06/58.86/69.09; Recall@SUM 355.94. Pre-training yields large gains; image→text is generally easier than text→image.
- VQA (Visual7W, 'Telling' subset, Chinese translation):
  • Direct training overall 72.16%; best pre-train & finetune overall 80.67% (no BN fix, 2 unfixed blocks), with per-type improvements (What 79.89, Where 81.71, When 87.78, Who 84.48, Why 82.66, How 76.31). Finetuning strategy affects outcomes; optimal settings differ by task.

Overall, the results support the claim that weakly correlated web-scale multimodal pre-training with MoCo-style negative queues enables strong generalization, cross-domain transfer, and emergent imagination/common-sense-like behavior.
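For reference, the zero-shot classification results above follow the standard two-tower recipe: encode each class name (or prompt) once with the text tower, encode each test image with the image tower, and predict the class whose embedding has the largest dot product with the L2-normalized image embedding. The sketch below assumes generic `image_encoder`/`text_encoder` callables and is not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, images, class_prompts):
    """Zero-shot classification with a two-tower model (illustrative sketch).
    class_prompts is a list of label strings (e.g., Chinese prompts for the
    UCM/AID or news categories); text_encoder is assumed to return one
    embedding per prompt and image_encoder a (B, D) batch of embeddings."""
    # Class embeddings can be pre-computed and cached, which is what makes
    # two-tower inference fast.
    z_cls = F.normalize(torch.stack([text_encoder(p) for p in class_prompts]), dim=-1)  # (C, D)
    z_img = F.normalize(image_encoder(images), dim=-1)                                   # (B, D)
    sims = z_img @ z_cls.t()            # (B, C) dot products of normalized embeddings
    return sims.argmax(dim=-1)          # predicted class index per image
```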
Discussion
The findings validate the central hypothesis: training on large-scale, weakly correlated image–text pairs and aligning modalities via contrastive learning can produce a foundation model with broad transfer and emergent cognitive-like capabilities. BriVL’s superior zero-shot performance on remote sensing and news classification, its gains on cross-modal retrieval and VQA, and its coherent visualizations/generations suggest it captures global semantics and context beyond literal region–word matches. The two-tower architecture enables efficient retrieval, and momentum queues provide abundant negatives without massive batch sizes, making large-scale training more resource-efficient. The observed imagination (e.g., abstract concepts, poetry, and perspective transformations) indicates that exposure to loosely aligned multimodal content fosters richer semantic associations, addressing key AGI desiderata such as generalization and handling unanticipated tasks. These advances have implications for adaptable AI systems across domains (e.g., healthcare, neuroscience), while highlighting the need for interpretability, bias mitigation, and responsible deployment.
Conclusion
This work introduces BriVL, a large-scale multimodal foundation model trained on a 650M weakly correlated image–text dataset using a two-tower contrastive framework with momentum queues and MSPP for efficient and effective representation learning. BriVL demonstrates strong zero-shot and finetuned performance across domains and tasks, along with emergent imagination and common-sense-like capabilities revealed through visualization and text-to-image generation. The study suggests weakly correlated multimodal pre-training as a promising path toward more general AI. Future directions include: (1) expanding modalities (e.g., video, audio) and scaling model capacity; (2) building larger, multilingual multimodal datasets to enable cross-lingual transfer; (3) advancing interpretability methods for foundation models; (4) developing more effective and efficient adaptation/finetuning strategies; and (5) addressing ethical, bias, and misuse concerns in both pre-training data and downstream applications.
Limitations
- Data distribution and language: The main pre-training data are Chinese web sources; English tasks required translation, possibly introducing noise and cultural mismatches. The dataset lacks certain domains (e.g., remote sensing), which may limit or bias generalization despite the positive results.
- Evaluation scope: VQA was limited to 'Telling' questions (no bounding-box-dependent 'Pointing'), and some claims (imagination/common sense) are supported qualitatively by visualizations, which may be subjective.
- Pre-training signals: Only image–text matching (contrastive) was used; additional objectives (e.g., generative, masked modeling) might further enhance capabilities.
- Computational resources: Despite the resource-saving design (negative queues), training still required extensive GPU resources (112 A100s for ~10 days), limiting accessibility.
- Societal risks: Potential to encode biases/stereotypes from web data and risks of misuse (e.g., generating/manipulating content). Continuous monitoring, data curation, and safeguards are necessary.