MemGUI-Agent: End-to-End Long-Horizon Mobile GUI Automation with Proactive Context Management

Mobile GUI agents powered by multimodal large language models (MLLMs) have made impressive strides on short-horizon tasks. But when a task spans dozens of steps across multiple apps, current approaches crumble under their own context. MemGUI-Agent, published on arXiv (2606.19926) by researchers from Zhejiang University and Kuaishou Technology, tackles this head-on with a novel framework called ConAct that turns context management from a passive burden into a first-class action.

The project is fully open-source under Apache 2.0, with code on GitHub (71 stars), a live project page, a curated dataset (MemGUI-3K), and a fine-tuned 8B-parameter model (MemGUI-8B-SFT) ready for deployment.

The Problem with ReAct-Style Agents

Most GUI agents today follow the ReAct (Reason + Act) paradigm: the model reasons about the current screen, picks an action, observes the result, and repeats. This works beautifully for three-step tasks. For thirty-step tasks that span multiple apps, it breaks down in predictable ways.

Prompt Explosion

Every action and observation gets appended to the context window. After 20 steps, the prompt can balloon to tens of thousands of tokens. Models hit their context ceiling and start truncating earlier steps.
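The truncation behavior can be illustrated with a small sketch. It is not taken from any particular agent: the function name `build_react_prompt` is hypothetical, and a whitespace word count stands in for a real tokenizer. It shows how a fixed token budget silently drops the earliest steps once the history grows.

```python
from collections import deque


def build_react_prompt(history, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the newest history entries that fit in max_tokens; drop the oldest.

    Assumption: token cost is approximated by a whitespace word count.
    """
    kept = deque()
    used = 0
    for entry in reversed(history):      # walk from the newest step backwards
        cost = count_tokens(entry)
        if used + cost > max_tokens:     # budget exhausted: older steps are lost
            break
        kept.appendleft(entry)
        used += cost
    return list(kept)


# Twenty illustrative steps, five "tokens" each.
history = [f"step {i} thought action observation" for i in range(1, 21)]
prompt = build_react_prompt(history, max_tokens=25)
print(prompt[0])
```

With this budget only the last five steps survive, so anything the agent saw on step 3 is no longer in the prompt at all.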

Fact Dilution

Important facts get buried. A price noticed on step 3, a username entered on step 7, and a confirmation code from step 12 all end up floating in a sea of interleaved thoughts and actions.

Lost Context at App Transitions

When a task crosses app boundaries, critical handoff information is often the first thing to get compressed away. Base models like Qwen3-VL-8B-Instruct achieve near-zero success on long-horizon tasks.

ConAct: Context-as-Action

The core insight: context management should be an action, not a side effect. ConAct maintains three structured fields:

[Figure: ConAct framework architecture]

1. Folded Action History

Compresses completed interaction spans into concise summaries. A sequence of 8 taps becomes a single folded entry.

2. Folded UI State

Persistent facts extracted from the screen. Unlike observations that scroll away, UI state persists across the entire task.

3. Recent Step Record

The most recent 2-3 steps remain in full detail, providing local interaction continuity.

The context size stays roughly constant regardless of task length. For a 30-step task, ReAct context might be 20,000+ tokens, while ConAct stays under 5,000.
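A minimal sketch of the three fields as a data structure follows. It is an illustration under stated assumptions, not the authors' implementation: the class `ConActContext`, its method names, and the window size of 3 are all hypothetical, and in the real system the model itself decides when and what to fold.

```python
from dataclasses import dataclass, field


@dataclass
class ConActContext:
    """Hypothetical container for the three ConAct fields described above."""
    folded_action_history: list = field(default_factory=list)  # summaries of finished spans
    folded_ui_state: dict = field(default_factory=dict)        # persistent screen facts
    recent_steps: list = field(default_factory=list)           # latest steps, verbatim
    max_recent: int = 3                                        # assumed window size

    def record_step(self, step: str) -> None:
        self.recent_steps.append(step)

    def remember(self, key: str, value: str) -> None:
        """Store a fact read from the screen so it outlives the observation."""
        self.folded_ui_state[key] = value

    def fold(self, summary: str) -> None:
        """Replace all steps older than the recent window with one summary."""
        if len(self.recent_steps) > self.max_recent:
            self.folded_action_history.append(summary)
            self.recent_steps = self.recent_steps[-self.max_recent:]

    def render(self) -> str:
        facts = "; ".join(f"{k}={v}" for k, v in self.folded_ui_state.items())
        return "\n".join([
            "History: " + " | ".join(self.folded_action_history),
            "UI state: " + facts,
            "Recent: " + " | ".join(self.recent_steps),
        ])


ctx = ConActContext()
for i in range(1, 9):
    ctx.record_step(f"tap {i}")
ctx.remember("price", "$19.99")
ctx.fold("Opened the shop app and searched for the product")
print(ctx.render())
```

After the fold, eight taps are represented by one summary line plus the three most recent steps, and the price stays available no matter how many further steps follow.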

MemGUI-3K Dataset

MemGUI-3K dataset statistics
  • 2,956 trajectories across diverse mobile tasks
  • 82,103 total interaction steps recorded
  • 64,430 evaluator-approved reasonable steps (78.5%)
  • Full ConAct annotations for supervised training
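As a quick sanity check on these figures, the short calculation below derives the approval rate and the average trajectory length from the counts listed above. The variable names are illustrative only.

```python
# Counts as reported for MemGUI-3K.
trajectories = 2956
total_steps = 82103
approved_steps = 64430

# Share of steps the evaluator accepted, and mean steps per trajectory.
approved_ratio = approved_steps / total_steps
avg_steps_per_trajectory = total_steps / trajectories

print(f"approved: {approved_ratio:.1%}")
print(f"average length: {avg_steps_per_trajectory:.1f} steps")
```

The approval rate matches the reported 78.5%, and the average of roughly 28 steps per trajectory confirms that the dataset is dominated by long-horizon tasks.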

Training MemGUI-8B-SFT

  • Base Model: Qwen3-VL-8B-Instruct
  • Method: LoRA SFT
  • Epochs: 1
  • Learning Rate: 1e-4
  • LoRA Rank/Alpha: 8 / 32
  • Target Modules: all-linear
  • Max Sequence Len: 32,768 tokens
  • Batch Size: 2 per device
  • Gradient Accum: 8 steps
  • GPUs: 8x
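Two derived quantities follow from these settings. The sketch below assumes that "8x" means eight data-parallel devices and that the standard LoRA scaling factor (alpha divided by rank) is used; neither is confirmed by the source.

```python
per_device_batch = 2
grad_accum_steps = 8
num_gpus = 8  # assumption: "8x" means 8 data-parallel devices

# Sequences contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_gpus

lora_rank, lora_alpha = 8, 32
# Standard LoRA scales the low-rank update by alpha / rank (assumed here).
lora_scale = lora_alpha / lora_rank

print(effective_batch, lora_scale)
```

Under these assumptions each optimizer step sees 128 sequences, and the adapter update is scaled by a factor of 4.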

Benchmark Results

[Figure: Main performance comparison]

The flagship MemGUI-Agent-235B achieves 62.5% Pass@3 on MemGUI-Bench. The practical MemGUI-8B-SFT reaches 23.4% Pass@1.
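Note that the two numbers use different metrics. The sketch below assumes Pass@k counts a task as solved if any of its first k attempts succeeds; the exact definition used by MemGUI-Bench may differ, and the function name and sample data are illustrative only.

```python
def pass_at_k(attempts_per_task, k):
    """Fraction of tasks with at least one success among the first k attempts.

    Assumption: each task is a list of booleans, one per attempt, in order.
    """
    if not attempts_per_task:
        return 0.0
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return solved / len(attempts_per_task)


# Made-up outcomes for four tasks, three attempts each.
results = [
    [False, True, False],
    [True, True, True],
    [False, False, False],
    [False, False, True],
]
print(pass_at_k(results, 1), pass_at_k(results, 3))
```

Because Pass@3 can only be equal to or higher than Pass@1 on the same runs, the 235B and 8B scores above should not be read as a like-for-like comparison.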

[Figure: MemGUI-Bench leaderboard]

On MobileWorld, an out-of-distribution (OOD) benchmark, MemGUI-Agent-235B achieves 29.1% and MemGUI-8B-SFT reaches 17.9%.

[Figure: MobileWorld leaderboard]

Case Studies

[Figure: Case study demonstration]

MemGUI-Agent completes multi-app workflows: searching for products, comparing prices, and completing purchases while preserving critical information through the ConAct framework.

Quick Start

git clone https://github.com/kwai/MemGUI-Agent.git
cd MemGUI-Agent
pip install -r requirements.txt

Conclusion

MemGUI-Agent reframes context management for GUI agents. By elevating context handling from a passive side effect to an explicit, trainable action, the ConAct framework keeps context size roughly constant as tasks grow longer, directly addressing the scaling problem that limits long-horizon mobile automation.

Resources: Paper | Code | Project Page | License: Apache 2.0
