MemGUI-Agent: End-to-End Long-Horizon Mobile GUI Agent With Proactive Context Management

Mobile GUI agents powered by multimodal large language models (MLLMs) have made impressive strides on short-horizon tasks. But when a task spans dozens of steps across multiple apps, current approaches crumble under their own context. MemGUI-Agent, published on arXiv (2606.19926) by researchers from Zhejiang University and Kuaishou Technology, tackles this head-on with a novel framework called ConAct that turns context management from a passive burden into a first-class action.

The project is fully open-source under Apache 2.0, with code on GitHub (71 stars), a live project page, a curated dataset (MemGUI-3K), and a fine-tuned 8B-parameter model (MemGUI-8B-SFT) ready for deployment.

The Problem with ReAct-Style Agents

Most GUI agents today follow the ReAct (Reason + Act) paradigm: the model reasons about the current screen, picks an action, observes the result, and repeats. This works beautifully for three-step tasks. For thirty-step tasks that span multiple apps, it breaks down in predictable ways.

Prompt Explosion

Every action and observation gets appended to the context window. After 20 steps, the prompt can balloon to tens of thousands of tokens. Models hit their context ceiling and start truncating earlier steps.
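To make the growth concrete, here is a minimal sketch of append-only prompt accounting. The per-step token costs and the function name are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
# Hypothetical per-step token costs; real values depend on the model,
# the screen representation, and how verbose the reasoning is.
TOKENS_PER_THOUGHT = 150
TOKENS_PER_ACTION = 30
TOKENS_PER_OBSERVATION = 600


def react_prompt_tokens(steps: int, system_tokens: int = 500) -> int:
    """Prompt size after `steps` steps when every step is appended verbatim."""
    per_step = TOKENS_PER_THOUGHT + TOKENS_PER_ACTION + TOKENS_PER_OBSERVATION
    return system_tokens + steps * per_step


for n in (3, 20, 30):
    # Growth is linear in the number of steps: nothing is ever dropped.
    print(n, react_prompt_tokens(n))
```

Under these assumed costs a three-step task stays small, while a thirty-step task is already past 20,000 tokens before any truncation kicks in.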

Fact Dilution

Important facts get buried. A price noticed on step 3, a username entered on step 7, and a confirmation code from step 12 all float in a sea of interleaved thoughts and actions, where the model has to rediscover them every time it needs them.

Lost Context at App Transitions

When a task crosses app boundaries, critical handoff information is often the first thing to get compressed away. The combined effect of these three failure modes is severe: base models like Qwen3-VL-8B-Instruct achieve near-zero success on long-horizon tasks.

ConAct: Context-as-Action

The core insight: context management should be an action, not a side effect. ConAct maintains three structured fields:

1. Folded Action History

Compresses completed interaction spans into concise summaries. A sequence of 8 taps becomes a single folded entry.

2. Folded UI State

Persistent facts extracted from the screen. Unlike observations that scroll away, UI state persists across the entire task.

3. Recent Step Record

The most recent 2-3 steps remain in full detail, providing local interaction continuity.

The context size stays roughly constant regardless of task length. For a 30-step task, ReAct context might be 20,000+ tokens, while ConAct stays under 5,000.
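The three fields above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, and the summary string passed to fold() stands in for text that the model itself would emit as part of its action.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    thought: str
    action: str


@dataclass
class ConActContext:
    """Illustrative container for the three ConAct fields (hypothetical names)."""
    folded_history: list[str] = field(default_factory=list)        # one line per completed span
    folded_ui_state: dict[str, str] = field(default_factory=dict)  # persistent screen facts
    recent_steps: list[Step] = field(default_factory=list)         # kept in full detail
    keep_recent: int = 3

    def record(self, step: Step) -> None:
        self.recent_steps.append(step)

    def remember(self, key: str, value: str) -> None:
        # Facts stored here survive every later fold.
        self.folded_ui_state[key] = value

    def fold(self, summary: str) -> None:
        """Collapse all but the newest `keep_recent` steps into one summary line."""
        overflow = len(self.recent_steps) - self.keep_recent
        if overflow > 0:
            self.folded_history.append(summary)
            self.recent_steps = self.recent_steps[overflow:]


ctx = ConActContext()
for i in range(8):
    ctx.record(Step(thought=f"select item {i}", action=f"tap({i})"))
ctx.remember("price", "$19.99")
ctx.fold("Opened the shop app and navigated to the product page")
print(len(ctx.folded_history), len(ctx.recent_steps), ctx.folded_ui_state)
```

Because each fold replaces a span of steps with a single line, the rendered context grows with the number of folded spans and stored facts rather than with the raw step count.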

MemGUI-3K Dataset

2,956 trajectories across diverse mobile tasks
82,103 total interaction steps recorded
64,430 evaluator-approved reasonable steps (78.5%)
Full ConAct annotations for supervised training
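The headline numbers can be cross-checked with a few lines of arithmetic. The average trajectory length below is derived from the reported totals and is not a figure stated in the source.

```python
trajectories = 2956
total_steps = 82103
approved_steps = 64430

# Share of recorded steps that the evaluator approved.
approval_rate = approved_steps / total_steps
# Mean trajectory length, derived from the totals above.
mean_steps = total_steps / trajectories

print(f"approval rate: {approval_rate:.1%}")
print(f"mean steps per trajectory: {mean_steps:.1f}")
```

The derived mean of roughly 28 steps per trajectory is consistent with the long-horizon focus of the dataset.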

Training MemGUI-8B-SFT

Base Model:       Qwen3-VL-8B-Instruct
Method:           LoRA SFT
Epochs:           1
Learning Rate:    1e-4
LoRA Rank/Alpha:  8 / 32
Target Modules:   all-linear
Max Sequence Len: 32,768 tokens
Batch Size:       2 per device
Gradient Accum:   8 steps
GPUs:             8x
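Two derived quantities follow from these settings. This sketch assumes that all 8 GPUs are used for data parallelism and that standard LoRA scaling (alpha divided by rank) applies; neither is stated explicitly in the source, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftSettings:
    """Hyperparameters copied from the table above (hypothetical class name)."""
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    num_gpus: int = 8
    lora_rank: int = 8
    lora_alpha: int = 32

    @property
    def effective_batch_size(self) -> int:
        # Assumes pure data parallelism across all GPUs.
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * self.num_gpus)

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


settings = SftSettings()
print(settings.effective_batch_size, settings.lora_scaling)
```

Under those assumptions each optimizer update sees 128 sequences, and the LoRA update is scaled by a factor of 4.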

Benchmark Results

The flagship MemGUI-Agent-235B achieves 62.5% Pass@3 on MemGUI-Bench. The practical MemGUI-8B-SFT reaches 23.4% Pass@1.

On MobileWorld, used here as an out-of-distribution (OOD) evaluation, MemGUI-Agent-235B achieves 29.1% and MemGUI-8B-SFT reaches 17.9%.
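For readers unfamiliar with the metric, here is a minimal sketch of Pass@k under a common convention: a task counts as passed if any of its first k attempts succeeds. This definition is an assumption; the benchmark's exact protocol may differ, and the sample outcomes are made up for illustration.

```python
def pass_at_k(attempts_per_task: list[list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_per_task:
        raise ValueError("need at least one task")
    passed = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return passed / len(attempts_per_task)


# Made-up outcomes for four tasks, three attempts each.
outcomes = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [False, True, False],
]
print(pass_at_k(outcomes, 1), pass_at_k(outcomes, 3))
```

Under this convention Pass@3 can never be lower than Pass@1 on the same runs, so the two headline numbers are not directly comparable across models.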

Case Studies

MemGUI-Agent completes multi-app workflows: searching for products, comparing prices, and completing purchases while preserving critical information through the ConAct framework.

Quick Start

git clone https://github.com/kwai/MemGUI-Agent.git
cd MemGUI-Agent
pip install -r requirements.txt

Conclusion

MemGUI-Agent reframes context management for GUI agents. By elevating context handling from a passive side effect to an explicit, trainable action, the ConAct framework keeps context size roughly constant as tasks grow longer, which directly targets the scaling problem that limits ReAct-style agents on long-horizon mobile automation.

Resources: Paper | Code | Project Page | License: Apache 2.0