Human 0, MLLM 1: Unlocking New Layers of Automation in Language-Conditioned Robotics with Multimodal LLMs

Engineering and Technology

R. ElMallah, N. Zamani, et al.

Discover the research by Ramy ElMallah, Nima Zamani, and Chi-Guhn Lee, which explores automating human supervisory functions in language-conditioned robotics using multimodal large language models. Their experiments with GPT-4 and Google Gemini achieved over 90% accuracy in feasibility analysis, and the study shows how MLLM-IL has the potential to reshape the field.
Introduction

The study addresses the reliance of language-conditioned robotic systems on human-in-the-loop (HITL) supervision for task feasibility assessment, intervention during trajectory deviations, and success verification. This reliance limits scalability and operational efficiency. With the emergence of multimodal large language models (MLLMs) such as GPT-4 and Gemini that process both text and images, the paper investigates whether these models can automate key human functions. The authors propose Multimodal Large Language Models in the Loop (MLLM-IL) to automate task feasibility analysis, progress assessment, and success detection using natural language instructions and visual observations. Using tasks from the CALVIN framework, they evaluate GPT-4 and Gemini under various conditions (image resolutions, prompt structures, and input frame structures). Contributions include: introducing the MLLM-IL framework; comprehensive evaluation of GPT-4 and Gemini for automating human functions; analysis of image resolution, prompt structure, and frame input structure; and releasing a new dataset based on CALVIN to spur further research.

Literature Review

The related work spans three areas. A) Language-Conditioned Robotics: Prior frameworks enable robots to interpret and execute natural language instructions, frequently retaining human oversight for feasibility checks and success detection. The CALVIN benchmark offers long-horizon, language-conditioned tasks with diverse sensory inputs and continuous actions and has been widely adopted by recent works for benchmarking and training. Many real-world studies still employ HITL approaches for feasibility and success verification, highlighting scalability challenges. B) MLLMs in Robotics: Foundation models, including LLMs and VLMs, have been used to replace or augment components of robotic systems for better flexibility and practicality. Prior works use VLMs for success detection via VQA formats and zero-shot reward modeling via CLIP-based approaches. The present work emphasizes zero-shot use of MLLMs for success detection and other supervisory roles via prompt engineering without domain-specific fine-tuning. C) Vision Capabilities of MLLMs: Benchmarks assess MLLM robustness to distribution shifts, multimodal comprehension, and structured reasoning. The paper focuses on evaluating aspects that affect MLLM-IL functionality for robotics, including prompt design and input structure.

Methodology

The MLLM-IL framework inserts a multimodal LLM into the robot control loop to automate human supervisory roles.

Problem setup: each task is defined by a natural language instruction I. The robot perceives via a static monocular RGB camera, providing a limited history of image observations o = {o_{t-n}, …, o_t}. A language-conditioned agent A maps I and o to actions. A controller c can start, stop, and reset the agent.

MLLM in the loop: an MLLM M receives I and one or more images from o to perform (1) Feasibility Analysis—judge whether the task is achievable from the initial observation(s) before execution; (2) Progress Assessment—monitor a subsampled sequence {o_t, o_{t-k}, o_{t-2k}, …} to determine whether the agent is making appropriate progress and trigger feedback or resets; and (3) Success Detection—compare initial and final (and possibly intermediate) frames with the expected outcome to judge task success. The MLLM's decisions guide the controller c to continue, adjust, or halt.

Experimental setup: the authors curate a dataset from CALVIN’s language-labeled expert trajectories to evaluate feasibility analysis, progress assessment, and success detection, including studies of image resolution, prompt structure, and frame input structure. All CALVIN tasks are included except those involving the LED, lightbulb, and slider, which were judged non-intuitive. Quantitative and qualitative analyses report metrics such as accuracy, recall (where relevant), and “Unsure” rates. Each experiment uses 200 samples per run; most are repeated three times (twice for GPT-4-Turbo feasibility), with averages and standard deviations reported. Prompts follow prompt-engineering best practices, and a prompt-sensitivity study is conducted.
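The supervisory loop described above can be sketched in Python. This is a hypothetical illustration under stated assumptions, not the authors' implementation: `query_mllm`, `get_observation`, and `step_agent` are placeholder callables standing in for the MLLM API, the camera, and the language-conditioned agent A, and `check_every` plays the role of the subsampling step k.

```python
from typing import Callable, List, Sequence

def mllm_in_the_loop(
    instruction: str,
    get_observation: Callable[[], object],          # camera: returns one image frame
    step_agent: Callable[[str, object], None],      # agent A: one action step
    query_mllm: Callable[[str, Sequence[object]], str],  # MLLM M: returns "yes"/"no"/"unsure"
    max_steps: int = 100,
    check_every: int = 10,                          # progress-assessment period (the paper's k)
) -> str:
    """Sketch of the MLLM-IL loop: (1) feasibility analysis before execution,
    (2) periodic progress assessment on subsampled frames, (3) success detection."""
    history: List[object] = [get_observation()]

    # (1) Feasibility analysis on the initial observation, before execution.
    if query_mllm(f"Is this task feasible? Task: {instruction}", history[:1]) != "yes":
        return "rejected"

    for t in range(1, max_steps + 1):
        step_agent(instruction, history[-1])
        history.append(get_observation())

        # (2) Progress assessment on a subsampled frame sequence {o_t, o_{t-k}, ...}.
        if t % check_every == 0:
            frames = history[::check_every]
            if query_mllm(f"Is the agent making progress? Task: {instruction}", frames) == "no":
                return "reset"  # controller c resets the agent

    # (3) Success detection from the initial and final frames.
    verdict = query_mllm(f"Was the task completed? Task: {instruction}",
                         [history[0], history[-1]])
    return "success" if verdict == "yes" else "failure"
```

In practice `query_mllm` would wrap a GPT-4 or Gemini API call whose prompt embeds the instruction and encodes the selected frames; the string-valued verdicts mirror the paper's "Unsure" labels.
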

Key Findings

Overall, MLLMs can automate feasibility analysis, progress assessment, and success detection with notable zero-shot performance, though with model- and setup-specific trade-offs.

  • Feasibility analysis:
    • Image resolution: Higher resolutions did not consistently improve performance; 200 px performed relatively worse than 500–768 px. GPT-4o and Gemini 1.5 Pro achieved average accuracies exceeding 90% at 768×768. GPT-4-Turbo performed poorly at detecting infeasible tasks, with average accuracy for negative cases below 50%.
    • Common error types (qualitative): (i) unexpected assumptions (e.g., assuming color/object equivalence), (ii) scene misunderstanding, and (iii) disregard for relevant scene details (e.g., missing that a drawer is already closed or no object is grasped).
    • Prompt sensitivity: GPT-4o was robust to prompt changes, whereas Gemini 1.5 Pro suffered a large drop with an alternative prompt (balanced accuracy fell from about 91.1 to about 79.0).
    • Unsure rates (Table IV): GPT-4o produced lower “Unsure” rates than Gemini across tasks—Feasibility: 0.03 (GPT-4o) vs 0.17 (Gemini).
  • Progress assessment:
    • Prompting: Chain-of-Thought (CoT) prompts generally improved accuracy and reliability for GPT-4o versus final-decision-only prompts; CoT&CI further increased scrutiny and a tendency to classify trajectories as incorrect. GPT-4o outperformed Gemini across prompting schemas.
    • Unsure rates (Table IV): GPT-4o 0.03 vs Gemini 0.39, indicating substantially more indecision from Gemini.
  • Success detection:
    • Input structure: Using multiple sequential images led to higher balanced accuracy for GPT-4o than a single grid image. For GPT-4o, sequential input achieved about 85.0% success accuracy, 52.1% failure accuracy, and 68.6% BA; the grid achieved about 56.1% success accuracy, 71.3% failure accuracy, and 63.7% BA. Gemini achieved lower BAs (e.g., sequential BA ≈ 62.8%, grid BA ≈ 57.9%). The approach can reach high recall (≈85%).
    • Skip-frames (frame frequency): For GPT-4o, providing only the first and last frames yielded the highest BA (≈72.2%) compared with skipping 3 frames (≈67.4% BA) or 6 frames (≈68.6% BA). Gemini showed higher BA in this study (e.g., ≈78.0% with first+last) but at the cost of many “Unsure” labels (not counted in accuracy).
  • Cost/context trade-offs: Sequential images improve accuracy but increase token usage. Longer contexts may degrade LLM performance; concise, informative frame selection can be preferable.
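The balanced-accuracy (BA) figures quoted above are consistent with the standard definition: the unweighted mean of the per-class accuracies (here, accuracy on successful and on failed trajectories). A minimal sketch of that check:

```python
def balanced_accuracy(success_acc: float, failure_acc: float) -> float:
    """Unweighted mean of the per-class accuracies, so a model cannot
    score well by always predicting the majority class."""
    return (success_acc + failure_acc) / 2.0

# GPT-4o success-detection numbers reported in the study (percentages):
seq_ba = balanced_accuracy(85.0, 52.1)   # (85.0 + 52.1) / 2 = 68.55, reported as ~68.6
grid_ba = balanced_accuracy(56.1, 71.3)  # (56.1 + 71.3) / 2 = 63.70, reported as ~63.7
```

By this measure, the sequential input's BA advantage for GPT-4o comes from its much higher accuracy on successful trajectories, despite weaker failure detection.
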

Discussion

The findings support the central hypothesis that MLLMs can automate key human supervisory roles in language-conditioned robotics. Feasibility analysis showed high zero-shot accuracy with GPT-4o and Gemini at practical image resolutions, meaning task requests can be filtered pre-execution without human screening. Progress assessment benefits from structured reasoning (CoT), enabling better real-time monitoring and timely resets or corrections. Success detection works reliably when temporal context is represented appropriately, with sequential images outperforming grid layouts for GPT-4o and demonstrating strong recall, which is crucial for closing the loop on autonomous execution. However, model behavior differences matter: GPT-4o was more decisive and robust to prompt changes, while Gemini had higher “Unsure” rates and greater prompt sensitivity. There are trade-offs in token costs and context length; fewer, well-chosen frames (e.g., first and last) can sometimes outperform denser temporal sampling due to long-context limitations, though certain tasks still require intermediate frames to detect state changes. Overall, MLLM-IL can reduce HITL dependence by automating feasibility checks, monitoring, and outcome verification, improving scalability and enabling broader deployment of language-conditioned robots.

Conclusion

The paper introduces MLLM-IL to automate feasibility analysis, progress assessment, and success detection in language-conditioned robotics using GPT-4 and Gemini. Experiments on a curated dataset based on CALVIN show strong zero-shot performance, particularly for GPT-4o, with high feasibility accuracy, improved progress assessment via CoT prompting, and reliable success detection using sequential image inputs. The approach reduces the need for human supervision and can improve scalability and efficiency. Future work includes integrating additional sensory modalities (e.g., depth, force/torque), refining prompt engineering, reducing context length while preserving temporal information, exploring few-shot or tool-augmented strategies, and expanding evaluations across more task types and environments to stress-test robustness.

Limitations

  • Scope of tasks: Excludes CALVIN tasks involving the LED, lightbulb, and slider due to non-intuitiveness, limiting generality across all possible actions.
  • Sensing: Uses a static monocular RGB camera; lack of depth or multimodal sensing may constrain perception-dependent judgments.
  • Dataset/domain: Evaluations are based on the CALVIN framework and a curated dataset; transferability to different environments or real-world deployments may vary.
  • Prompt sensitivity and model variance: Gemini showed notable sensitivity to prompt changes and higher rates of "Unsure" decisions; all LLMs exhibit stochasticity, mitigated by multiple runs.
  • Long-context limitations and token costs: Performance can degrade with long sequences; sequential inputs improve accuracy but increase token/cost budgets.
  • No domain-specific fine-tuning: Models are used zero-shot; specialized fine-tuning or additional supervision could change outcomes.
  • Model choice: reported tables indicate poor detection of infeasible tasks for GPT-4-Turbo, suggesting that model selection is critical.