AutoMA: Automated Modular Attention enables Context-Rich Imitation Learning using Foundation Models

Given the task "stack the blue block on the red block", the LLM designs hierarchical modules by decomposing the task into several sub-task modules, while the VLM supervises each module. The policy network, enriched with this context, can then effectively execute the task.

Abstract

Although imitation learning offers an appealing framework for policy learning, scaling it is challenging due to the lack of high-quality expert data. Previous works have addressed this by modifying model designs and increasing state diversity to improve data quality, but these approaches often overlook the rich contextual information inherent in the data itself. In this paper, we propose an automated learning framework, denoted AutoMA, that leverages large language models (LLMs) and vision-language models (VLMs) to provide context-rich supervision for Modular Attention based imitation learning policies. Specifically, LLMs are used for hierarchical modular design, and VLMs provide supervision signals for the different modules. AutoMA thus leverages the capabilities of foundation models to inform the policy with rich context from the data. Experimentally, we demonstrate that AutoMA is scalable across a broad spectrum of tasks and significantly outperforms baseline models on six simulated and real-world manipulation tasks, achieving success rate improvements of up to 56%. Furthermore, we illustrate how AutoMA facilitates reusing modules when transferring the policy to different robots with only 20% of the original data scale, significantly improving data efficiency.

AutoMA

Overview

The AutoMA framework bridges the gap between modular and end-to-end learning, enabling the reuse of functional building blocks. In summary, our contributions are as follows: (1) Leveraging Rich Context: our approach decomposes tasks into hierarchical modules and semantically extracts the underlying context for each module; (2) Improving Data Efficiency: AutoMA enables the reuse of functional building blocks when transferring tasks to different embodiments; (3) Empirical Evaluations: AutoMA demonstrates significant improvements, achieving success rate improvements of up to 56% on complex manipulation tasks with distractors compared to baselines.

LLMs decompose a high-level task, such as "stack the red block on the brown block", into modules and construct a hierarchy among them. AutoMA integrates this hierarchy into the policy using slot attention mechanisms. For instance, the Task and Object modules attend only to Lang tokens in the first attention layer, the ObjLoc module attends to Image tokens to localize objects, and the CTRL module attends to EE, Task, and ObjLoc to generate the final robot actions.

Modularity

For instance, in a stacking task, the high-level task is decomposed into several sub-tasks: (1) understanding the required action type, (2) identifying which object to manipulate, (3) locating the object in the image, and (4) generating the end-effector controls. As shown in the figure, these sub-tasks form a hierarchical structure, where Lang, Image, and EE are input modules, and Task, Object, ObjLoc, and CTRL are sub-task modules.
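
As an illustration, the module hierarchy designed by the LLM can be thought of as a small dependency graph. The sketch below (illustrative only, not the released implementation) encodes the routing described above as a Python dictionary and orders the sub-task modules so that each module's dependencies are resolved first.

    # Illustrative sketch (not the released implementation): the hierarchy
    # designed by the LLM maps each sub-task module to the modules or inputs
    # it is allowed to attend to, following the routing described above.
    MODULE_HIERARCHY = {
        "Task":   ["Lang"],                  # required action type, from language only
        "Object": ["Lang"],                  # which object to manipulate, from language only
        "ObjLoc": ["Image"],                 # locate the object in the image
        "CTRL":   ["EE", "Task", "ObjLoc"],  # generate end-effector controls
    }
    INPUT_MODULES = ["Lang", "Image", "EE"]

    def topological_order(hierarchy, inputs):
        """Return sub-task modules so that each module's dependencies come first."""
        order, seen = [], set(inputs)
        def visit(module):
            if module in seen:
                return
            for dep in hierarchy.get(module, []):
                visit(dep)
            seen.add(module)
            order.append(module)
        for module in hierarchy:
            visit(module)
        return order

    print(topological_order(MODULE_HIERARCHY, INPUT_MODULES))
    # -> ['Task', 'Object', 'ObjLoc', 'CTRL']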

Tasks

We evaluate the effectiveness of AutoMA across multiple manipulation tasks, each with a distinct setup:

  • Stacking: tasks (a)-(c), block stacking tasks
  • Sorting: tasks (d)-(e), object sorting tasks with and without distractors
  • TableTop: task (f), real-world manipulation with distractors

Evaluations

  • Image-BC: This baseline adopts an image-to-action agent framework similar to BC-Z; it is built on a ResNet-18 backbone and employs FiLM conditioning with CLIP language features (a minimal FiLM sketch is shown after this list).
  • Diffusion Policy: This baseline is a standard diffusion-based policy; we adopt 1D temporal convolutional networks to construct the U-Net backbone.
  • ModAttn: This baseline is a transformer-based network that shares the same architecture as our proposed AutoMA but is trained end-to-end without enforcing rich context.
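
For reference, FiLM conditioning as used in the Image-BC baseline modulates visual feature maps with a per-channel scale and shift predicted from the language embedding. Below is a minimal sketch in PyTorch; the layer sizes, feature shapes, and placeholder tensors are illustrative assumptions, not the baseline's actual configuration.

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        """Minimal FiLM layer: a language embedding predicts per-channel
        scale (gamma) and shift (beta) applied to visual feature maps."""
        def __init__(self, lang_dim: int, num_channels: int):
            super().__init__()
            self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

        def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
            # feat: (B, C, H, W) visual features; lang: (B, lang_dim) CLIP text embedding
            gamma, beta = self.to_gamma_beta(lang).chunk(2, dim=-1)
            gamma = gamma[:, :, None, None]
            beta = beta[:, :, None, None]
            return gamma * feat + beta

    # Usage sketch: condition intermediate ResNet-18 features on a CLIP embedding.
    film = FiLM(lang_dim=512, num_channels=256)
    feats = torch.randn(4, 256, 14, 14)   # placeholder feature map
    lang = torch.randn(4, 512)            # placeholder CLIP text features
    out = film(feats, lang)               # (4, 256, 14, 14)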

Sort Object

"Put the coke can in the bin"

Sort Object

"Sort the coke can"

Sort Object

"Put the coke into the bin"

Sort Object

"Put the bread in the bin"

Sort Object

"Sort the coke can"

Sort Object

"Pick the milk and put it in the bin"

Stack Blocks

"Stack the red block on the green block"

Stack Blocks

"Put the red block on the green block"

Stack Blocks

"Place red block over green block"

Stack Blocks

"Put the red block on the green block"

Stack Blocks

"Stack blue blcok on the green block"

Stack Blocks

"Place the brown block on the red block"

Model Details

AutoMA embeds modular designs through an attention mechanism, which operates using three components: queries (Q), keys (K), and values (V). The query (Q) identifies the most relevant keys (K), producing scores that reflect their alignment. These scores are normalized and used to weight the corresponding values (V), thereby aggregating the most relevant information.
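
The sketch below illustrates this computation, with an additional boolean mask that restricts which keys each module's queries may attend to, in the spirit of the modular attention described above. All names, shapes, and the masking scheme are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn.functional as F

    def modular_attention(q, k, v, allow_mask):
        """Scaled dot-product attention with a per-module visibility mask.
        q: (B, Nq, D) queries from one module
        k, v: (B, Nk, D) keys/values from all token groups
        allow_mask: (Nq, Nk) boolean; True where a query may attend to a key."""
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # alignment scores
        scores = scores.masked_fill(~allow_mask, float("-inf"))   # hide disallowed keys
        weights = F.softmax(scores, dim=-1)                       # normalized weights
        return weights @ v                                        # aggregate relevant values

    # Usage sketch: the CTRL module's queries only see EE, Task, and ObjLoc tokens.
    B, Nq, Nk, D = 2, 4, 12, 64
    q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nk, D), torch.randn(B, Nk, D)
    allow = torch.zeros(Nq, Nk, dtype=torch.bool)
    allow[:, :6] = True                      # pretend the first 6 tokens are EE/Task/ObjLoc
    out = modular_attention(q, k, v, allow)  # (B, Nq, D)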

Prompt Template

The task prompt given to the reflection model


        **Task:**
        You are a robot reflection model. Your task is to look at an image and
        two state descriptions of the scene, and return the more accurate state.
        For example, your command is given as 'stack the 1st block
        on the 2nd block and stack the 3rd block on the 1st block.'.
        Your input is an observation image of the scene and the two state descriptions of the scene;
        the output should be the state that describes the scene more accurately.
    

The example response


        **Example Responses (you should follow the same JSON output format):**
        Example 1.
        "image_1"
        {
        "task": "stack the red block on the green block and stack the blue block on the red block.",
        "input_state_1": ["red block on the green block", "green block on the table", "blue block on the table", "brown block on the table"],
        "input_state_2": ["red block on the table", "green block on the table", "blue block on the table", "brown block on the table"],
        "output_state": ["red block on the table", "blue block on the table", "green block on the table", "brown block on the table"],
        "reason": "In the above image, all for blocks are not yet stacked and they are all on the table, so the input_state_2 is more accurate."
        }
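
For completeness, the sketch below shows one way the reflection model's JSON reply could be parsed and sanity-checked before being used as a supervision signal. The function name, required keys, and checks are assumptions for illustration, not the actual pipeline.

    import json

    REQUIRED_KEYS = {"task", "input_state_1", "input_state_2", "output_state", "reason"}

    def parse_reflection(response_text: str) -> dict:
        """Parse the reflection model's JSON reply and sanity-check its format.
        Illustrative sketch; the actual pipeline may handle errors differently."""
        reply = json.loads(response_text)
        missing = REQUIRED_KEYS - reply.keys()
        if missing:
            raise ValueError(f"reflection reply is missing keys: {missing}")
        # The selected state should match one of the two candidate descriptions
        # (compared as sets, since the ordering of predicates is not meaningful).
        candidates = (set(reply["input_state_1"]), set(reply["input_state_2"]))
        if set(reply["output_state"]) not in candidates:
            raise ValueError("output_state matches neither candidate state")
        return reply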