Engineering and Technology

IEEE Robotics and Automation Letters

Human 0, MLLM 1: Unlocking New Layers of Automation in Language-Conditioned Robotics with Multimodal LLMs

R. Elmallah, N. Zamani, et al.

Ramy ElMallah, Nima Zamani, and Chi-Guhn Lee explore automating the human functions in language-conditioned robotics using Multimodal Large Language Models in the Loop (MLLM-IL). In their experiments with GPT-4 and Google Gemini, zero-shot feasibility analysis reached accuracies exceeding 90%, suggesting that MLLM-IL can complement existing frameworks in the field.

Abstract

Language-conditioned robotics has seen tremendous growth in frameworks that aim to improve the success rates of robots acting upon the environment according to free-form language instructions. However, most existing frameworks rely on a human in the loop to assist with critical functions. Humans are mainly involved in ensuring that a requested task is feasible, resetting the robot when it diverges from the requested goal, and deciding whether it has completed the task. As human involvement limits the scalability of language-conditioned robotics, we propose automating these human functions through Multimodal Large Language Models in the Loop (MLLM-IL). We conduct experiments with multimodal large language models, specifically OpenAI's GPT-4 and Google Gemini, to evaluate their potential in automating these crucial functions. The new layers of automation include analyzing task feasibility, assessing task progress, and detecting task success. We investigate how different factors, including the choice of LLM, the resolution of the input images, and the structure of the prompt, affect the performance of the LLMs on the target functions. Results show significant zero-shot success, with feasibility analysis accuracies exceeding 90%. Our work demonstrates the immense potential of MLLM-IL to complement existing frameworks in language-conditioned robotics, opening the space for a wealth of new applications.
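The three automated functions named in the abstract (feasibility analysis, progress assessment, success detection) can be pictured as yes/no questions put to a multimodal model alongside a camera image. The following is a minimal sketch of that idea only, not the authors' implementation: the `MLLM` callable type, the prompt wording in `PROMPTS`, and the helpers `parse_yes_no` and `ask` are all hypothetical names introduced for illustration, and no real GPT-4 or Gemini API call is shown.

```python
from typing import Callable, Optional

# Hypothetical interface (an assumption, not from the paper): any function that
# takes a text prompt plus raw image bytes and returns the model's text reply.
MLLM = Callable[[str, bytes], str]

# Illustrative prompt templates, one per automated function. The wording is
# invented here; the paper studies prompt structure but it is not reproduced.
PROMPTS = {
    "feasibility": (
        "Given the scene in the image, can the robot carry out this "
        "instruction? Instruction: {task}. Answer YES or NO."
    ),
    "progress": (
        "Given the image, is the robot still progressing toward this "
        "instruction? Instruction: {task}. Answer YES or NO."
    ),
    "success": (
        "Given the image, has this instruction been completed? "
        "Instruction: {task}. Answer YES or NO."
    ),
}


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a free-form reply to True/False, or None if it starts with neither."""
    words = reply.strip().upper().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    if first == "YES":
        return True
    if first == "NO":
        return False
    return None


def ask(model: MLLM, function: str, task: str, image: bytes) -> Optional[bool]:
    """Pose one of the three questions to the model and parse its answer."""
    prompt = PROMPTS[function].format(task=task)
    return parse_yes_no(model(prompt, image))
```

Keeping the model behind a plain callable makes it straightforward to swap one model for another or to vary image resolution and prompt structure, which are the factors the abstract says were investigated. A `None` result marks an unparseable reply that a caller would need to handle, for example by retrying.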

Publisher

IEEE Robotics and Automation Letters

Published On

Authors

Ramy ElMallah, Nima Zamani, Chi-Guhn Lee

DOI

https://doi.org/10.1109/ME61309.2024.10789747

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Health and Fitness

The effect of daily intake of vitamin D-fortified yogurt drink, with and without added calcium, on serum adiponectin and sirtuins 1 and 6 in adult subjects with type 2 diabetes

B. Nikooyeh, B. W. Hollis, et al.

Medicine and Health

Identification of a new cannabidiol n-hexyl homolog in a medicinal cannabis variety with an antinociceptive activity in mice: cannabidihexol

P. Linciano, C. Citti, et al.

Medicine and Health

Neuronal responses in the human primary motor cortex coincide with the subjective onset of movement intention in brain-machine interface-mediated actions

J. Noel, M. Bockbrader, et al.
