Towards artificial general intelligence via a multimodal foundation model

Computer Science

N. Fei, Z. Lu, et al.

This research introduces BriVL, a multimodal foundation model that demonstrates cross-modal understanding and imagination abilities across a range of cognitive tasks. Conducted by Nanyi Fei and colleagues, the study is presented as a step towards Artificial General Intelligence (AGI).

Introduction
The ultimate aim of Artificial Intelligence (AI) is to replicate human cognitive functions. While deep learning has yielded remarkable achievements in specific areas such as computer vision (CV) and natural language processing (NLP), most current AI systems excel only at narrow tasks tied to a single cognitive ability. This limitation motivates the pursuit of Artificial General Intelligence (AGI), characterized by human-level performance across a wide range of cognitive tasks, adaptability to unforeseen problems, and knowledge transfer between diverse contexts. An AGI system would not only accelerate AI research but also advance numerous fields, including neuroscience, healthcare, and biomedicine. Recent advances in deep learning have produced models that surpass human performance on specific tasks: ResNets have exceeded human accuracy in image classification, and RoBERTa has outperformed humans on several natural language understanding benchmarks. However, these successes remain confined to individual cognitive abilities. To overcome this constraint and progress towards AGI, the authors propose a foundation model pre-trained on a vast amount of multimodal (visual and textual) data, enabling rapid adaptation to diverse downstream tasks. The rationale is twofold. First, foundation models (also known as pre-trained models) are designed to be adapted to various tasks via fine-tuning after pre-training on extensive data. Second, the use of multimodal data mirrors human intelligence, which often draws on visual and textual information concurrently: human brains process multimodal inputs and encode concepts into representations that are invariant across sensory modalities. Pre-training a large-scale multimodal foundation model is therefore considered a plausible pathway towards AGI. Existing multimodal models, however, typically rely on the assumption of strong semantic correlation between image-text pairs, which limits their generalization ability; this research aims to address that limitation.
Literature Review
Existing multimodal foundation models, while demonstrating promise in fast learning and cross-modal understanding, often rely on the assumption of strong semantic correlation between input image-text pairs (e.g., image-caption pairs). This assumption, while simplifying the training process, severely restricts generalization capabilities because real-world data rarely exhibits such perfect alignment. Many state-of-the-art models further employ computationally expensive object detectors and single-tower network architectures, limiting their scalability and real-world applicability. The single-tower architecture, in particular, leads to high latency during inference due to the need for pairwise comparisons of all query-candidate pairs. Models like OpenAI CLIP and Google ALIGN are closely related to BriVL but differ in their data assumptions and training strategies. CLIP discards image-text pairs with low word frequency, while ALIGN employs various filtering rules. BriVL's dataset is significantly larger and preserves a more natural data distribution, reflecting real-world scenarios.
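To illustrate why the single-tower design incurs high inference latency, the sketch below contrasts it with a two-tower setup in which candidate embeddings are pre-computed and indexed offline. The encoder and scoring functions here are hypothetical placeholders used only for illustration, not any model's actual API.

```python
# Illustrative comparison of retrieval cost for two-tower vs. single-tower
# designs; `encode_text` and `joint_score` are hypothetical placeholders.
import numpy as np

def two_tower_retrieval(query_text, candidate_embs, encode_text):
    """Two-tower: candidate (image) embeddings are computed once offline,
    so each query costs one encoder pass plus a matrix-vector product."""
    q = encode_text(query_text)              # (D,) query embedding
    q = q / np.linalg.norm(q)
    scores = candidate_embs @ q              # (N,) similarities (assuming
                                             # candidate_embs are L2-normalized)
    return np.argsort(-scores)               # candidate indices, best first

def single_tower_retrieval(query_text, candidate_images, joint_score):
    """Single-tower: the fused network must be re-run on every
    query-candidate pair, so latency grows with the gallery size."""
    scores = np.array([joint_score(query_text, img) for img in candidate_images])
    return np.argsort(-scores)
```

The two-tower path keeps the heavy per-image computation offline, which is the property that makes large-scale retrieval practical.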
Methodology
To overcome the limitations of existing models, the authors develop BriVL (Bridging-Vision-and-Language), a large-scale multimodal foundation model trained with self-supervised learning on weakly correlated image-text data. The pre-training dataset, WSCD (Weak Semantic Correlation Dataset), comprises 650 million image-text pairs collected from the web without human annotation; because the texts are only loosely related to the images, they convey complex human emotions and thoughts rather than literal descriptions. This contrasts with previous methods that focus on strongly correlated data. BriVL employs a two-tower architecture with separate encoders for image and text inputs, avoiding the computational overhead of object detectors and enabling efficient inference: embeddings can be pre-computed and indexed, which greatly reduces retrieval latency. A cross-modal contrastive learning (CL) algorithm models the weak image-text correlation and learns a unified semantic space. The CL algorithm adopts a momentum mechanism (inspired by MoCo) to maintain large queues of negative samples without requiring a large batch size, making training computationally efficient. BriVL thus differs from CLIP and ALIGN in using weakly correlated data and momentum-maintained negative queues, which makes it more resource-efficient than those models, both of which rely on very large batch sizes.
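To make the training objective concrete, below is a minimal sketch of MoCo-style cross-modal contrastive learning with a momentum-updated encoder and a queue of negative text embeddings. The function names, tensor shapes, and hyperparameter values are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of MoCo-style cross-modal contrastive learning with a
# momentum (key) encoder and a queue of negatives. All names, shapes and
# hyperparameters are illustrative assumptions, not BriVL's released code.
import torch
import torch.nn.functional as F

def image_to_text_infonce(img_emb, txt_emb, txt_queue, temperature=0.07):
    """InfoNCE loss matching each image to its paired text.

    img_emb:   (B, D) image embeddings from the query encoder
    txt_emb:   (B, D) paired text embeddings from the momentum encoder
    txt_queue: (K, D) previously enqueued text embeddings used as negatives
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    txt_queue = F.normalize(txt_queue, dim=1)

    l_pos = (img_emb * txt_emb).sum(dim=1, keepdim=True)   # (B, 1) positives
    l_neg = img_emb @ txt_queue.t()                        # (B, K) negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)             # positive is index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.99):
    """Update the key (momentum) encoder as an exponential moving average
    of the query encoder, as in MoCo."""
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)
```

A symmetric text-to-image term and periodic queue maintenance (enqueue the newest momentum embeddings, dequeue the oldest) would complete a full training step of this kind.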
Key Findings
BriVL demonstrates promising results across various downstream tasks, including remote sensing scene classification, news classification, cross-modal retrieval, and visual question answering (VQA). The model's performance on these diverse tasks highlights its strong generalization and cross-domain transfer learning abilities.

**Neural Network Visualization:** A novel visualization technique reveals BriVL's imagination capabilities. The model generates visual representations of abstract concepts (e.g., "nature," "time," "science") and complex sentences, demonstrating an ability to link abstract concepts to concrete objects and capture implicit meanings. This suggests the model has learned common sense and demonstrates the effectiveness of multimodal pre-training with weakly correlated data.

**Text-to-Image Generation:** Using VQGAN, the authors demonstrate that BriVL can generate more realistic and coherent images than CLIP, further showcasing its strong imagination and understanding of text. The model can generate images corresponding to scenes rarely seen or even nonexistent in reality, highlighting its generalization capabilities and robustness.

**Remote Sensing Scene Classification:** Zero-shot experiments on the UCM and AID datasets show that BriVL significantly outperforms existing methods, including ZSSC and CLIP variants, indicating strong cross-domain knowledge transfer abilities. Visualizations of the model's response to a remote sensing concept ("baseball field viewed from above") further confirm its capacity to handle perspectives different from those in its training data.

**News Classification:** Zero-shot experiments on Toutiao News and THUCNews demonstrate that BriVL enhances the performance of RoBERTa, both in overall accuracy and in specific categories, particularly for more diverse topics. Phrase retrieval analysis shows BriVL's ability to capture a wider range of semantic associations for some categories.

**Cross-Modal Retrieval:** BriVL achieves superior performance in cross-modal retrieval compared to a model trained directly on the AIC-ICC dataset, showing the benefit of pre-training. Different fine-tuning strategies influence performance on this task.

**Visual Question Answering (VQA):** BriVL excels in VQA on the Visual7W dataset, with substantial improvements over directly trained models, illustrating its strong multimodal understanding capabilities. VQA examples showcase its ability to answer complex questions that demand common-sense reasoning.
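The zero-shot classification results above (e.g., remote sensing scene classification) rely on comparing an image embedding with text embeddings of class prompts in the shared semantic space. Below is a hedged sketch of that general procedure; the prompt template and encoder functions are assumptions for illustration, not the authors' exact protocol.

```python
# Sketch of zero-shot scene classification in a shared image-text embedding
# space; the prompt template and `encode_text` are illustrative assumptions.
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """Pick the class whose text-prompt embedding has the highest
    cosine similarity with the image embedding."""
    prompts = [f"an aerial photo of a {name}" for name in class_names]
    text_embs = np.stack([encode_text(p) for p in prompts])   # (C, D)
    text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = text_embs @ image_emb                             # (C,) similarities
    return class_names[int(np.argmax(scores))]
```

No task-specific training is involved: the class labels are simply rendered as text and scored against the image in the pre-trained embedding space.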
Discussion
BriVL's success across diverse downstream tasks demonstrates the potential of large-scale multimodal pre-training on weakly correlated data for building AI models with more general, cognition-like abilities. The model's strong imagination and reasoning abilities, as evidenced by the visualizations and text-to-image generation, are primarily attributed to the characteristics of its training data: the weakly correlated image-text pairs expose the model to complex human emotions and thoughts, leading to improved generalization and knowledge transfer. This approach contrasts with models trained on strongly correlated data, which may overfit and struggle to handle diverse and unexpected scenarios. The findings suggest that pre-training with weakly correlated multimodal data is a promising avenue for approaching AGI.
Conclusion
This research presents BriVL, a multimodal foundation model pre-trained on a large-scale weakly correlated dataset that exhibits strong cross-modal understanding, generalization, and imagination capabilities. The results demonstrate the effectiveness of this approach across various downstream tasks and highlight its potential for advancing towards AGI. Future research could extend BriVL by incorporating additional modalities such as video and audio, and by adding multilingual data to support language translation tasks. Furthermore, developing deeper model-interpretability tools is crucial for understanding the model's decision-making process and mitigating potential risks associated with bias and misuse.
Limitations
While BriVL demonstrates significant advances, limitations remain. The model's performance depends heavily on the quality and representativeness of the pre-training data, and biases present in that data could be reflected in the model's outputs, requiring careful monitoring and mitigation. Further research is needed to develop robust methods for detecting and addressing such biases. In addition, the computational requirements for pre-training are substantial, which restricts accessibility for researchers with fewer resources.