Human 0, MLLM 1: Unlocking New Layers of Automation in Language-Conditioned Robotics with Multimodal LLMs

Engineering and Technology

R. ElMallah, N. Zamani, et al.

Discover the research by Ramy ElMallah, Nima Zamani, and Chi-Guhn Lee, which explores automating human functions in language-conditioned robotics using Multimodal Large Language Models in the loop (MLLM-IL). Their experiments with OpenAI's GPT-4 and Google Gemini achieved over 90% accuracy in task-feasibility analysis, suggesting MLLM-IL could substantially reduce the need for human supervision in the field.
Abstract
Language-conditioned robotics has seen tremendous growth in frameworks that aim to improve the success rates of robots acting upon the environment according to free-form language instructions. However, most existing frameworks leverage a human in the loop to assist with critical functions. Humans are mainly involved in ensuring that a human-requested task is feasible, resetting the robot when it diverges from achieving the requested goal, and deciding if it has completed the task. As human involvement limits the scalability of language-conditioned robotics, we propose automating these human functions through Multimodal Large Language Models in the Loop (MLLM-IL). We conduct experiments leveraging multimodal large language models, specifically OpenAI's GPT-4 and Google Gemini, to evaluate their potential in automating crucial functions. The newly introduced layers of automation include analyzing task feasibility, assessing task progress, and detecting task success. We investigate how different factors, including the choice of LLM, the resolution of the input images, and the structure of the prompt, affect the performance of the LLMs in achieving the target functions. Results show significant zero-shot success, with feasibility analysis accuracies exceeding 90%. Our work demonstrates the immense potential of utilizing MLLM-IL to complement existing frameworks in language-conditioned robotics, opening the space for a wealth of new applications.
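The MLLM-IL functions described above amount to sending the robot's camera view plus a natural-language question to a multimodal LLM and mapping its free-text answer to a decision. A minimal sketch of the feasibility-analysis step is shown below, using an OpenAI-style multimodal chat payload; the prompt wording and helper names are illustrative assumptions, not the prompts used in the paper.

```python
import base64


def build_feasibility_messages(instruction: str, image_b64: str) -> list:
    """Build an OpenAI-style multimodal chat payload asking whether a
    free-form instruction is feasible in the pictured scene.
    (Hypothetical prompt wording; the paper's exact prompts differ.)"""
    question = (
        "You are assisting a robot. Given the camera image of the scene, "
        f"is the task '{instruction}' feasible? Answer 'yes' or 'no' and "
        "briefly explain."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # Image is passed inline as a base64 data URL.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]


def parse_feasibility(reply: str) -> bool:
    """Map the model's free-text answer to a boolean feasibility verdict."""
    return reply.strip().lower().startswith("yes")


# Example: encode a scene image and build the request payload.
image_b64 = base64.b64encode(b"<png bytes here>").decode()
messages = build_feasibility_messages("pick up the red block", image_b64)
```

With a real client, `messages` would be passed to a chat-completions call (e.g. `client.chat.completions.create(model=..., messages=messages)`), and the reply routed through `parse_feasibility`; the progress-assessment and success-detection functions follow the same pattern with different questions.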
Publisher
IEEE Robotics and Automation Letters
Authors
Ramy ElMallah, Nima Zamani, Chi-Guhn Lee
Tags
language-conditioned robotics
automation
Multimodal Large Language Models
GPT-4
Google Gemini
feasibility analysis
scalability