GateMem: The First Benchmark For Memory Governance In Multi-Principal AI Agents

GateMem is the first benchmark designed to evaluate how well AI agents govern shared memory across multiple principals (users, roles, and scopes). Instead of asking “can the agent remember?”, GateMem asks a harder question: “can the agent remember the right things for the right people and forget what it should?” Across 91 multi-party episodes and 2,218 hidden checkpoints, GateMem reveals a sobering truth: today’s memory-augmented agents — from RAG pipelines to specialized memory systems like Mem0 and A-MEM — consistently fail at access control and active forgetting, even when they achieve high recall.

The Problem: Why Memory Governance Matters

Modern AI agents increasingly operate in shared contexts: a medical AI assistant handling multiple patients, a workplace agent serving an entire team, or an educational tutor managing dozens of students. In these settings, memory is no longer a single-user cache — it is a shared resource with access boundaries.

Existing memory benchmarks evaluate recall in isolation. They measure whether an agent can retrieve relevant information from a single conversation. But they ignore a crucial dimension: can the agent determine who is allowed to see what, and ensure that deleted information stays gone?

This gap is precisely what GateMem addresses.

GateMem’s Three Pillars

GateMem decomposes memory governance into three independent dimensions that are jointly evaluated:

Utility (U)

Can the agent answer correctly for authorized requests that require state updates? This measures basic functional competence — the agent’s ability to use shared memory productively.

Access Control (A)

Can the agent avoid leaking protected information to unauthorized or over-scoped requesters? This tests whether the agent respects role-based and scope-based boundaries.
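
One way to picture the boundary being tested is a retrieval step that filters shared memory by the requester's role and scope before anything reaches the model. The sketch below is illustrative only: `MemoryItem`, `Principal`, and `visible_items` are hypothetical names chosen for this example, not part of GateMem or of any memory library, and the data is made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    text: str
    allowed_roles: frozenset[str]   # roles permitted to read this item (assumed model)
    scope: str                      # e.g. a patient, team, or class identifier


@dataclass(frozen=True)
class Principal:
    name: str
    role: str
    scopes: frozenset[str]          # scopes this principal may access


def visible_items(items: list[MemoryItem], who: Principal) -> list[MemoryItem]:
    """Keep only items whose role list and scope both admit the requester."""
    return [
        item for item in items
        if who.role in item.allowed_roles and item.scope in who.scopes
    ]


memory = [
    MemoryItem("Patient A is allergic to penicillin",
               frozenset({"doctor"}), "patient_a"),
    MemoryItem("Patient B has an appointment on Friday",
               frozenset({"doctor", "receptionist"}), "patient_b"),
]
receptionist = Principal("sam", "receptionist",
                         frozenset({"patient_a", "patient_b"}))

# The receptionist is in scope for both patients but lacks the role for the allergy note.
print([m.text for m in visible_items(memory, receptionist)])
```

Filtering before retrieval keeps protected text out of the prompt entirely, rather than relying on the model to decline to repeat it. It does not address information the model has already seen earlier in a conversation.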

Active Forgetting (F)

Can the agent avoid recovering, confirming, or reconstructing deleted information after explicit deletion requests? This is the hardest dimension — deletion must be irreversible, not merely hidden.
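
To make the "irreversible, not merely hidden" requirement concrete, here is a minimal sketch of a store where deleting a fact also purges anything derived from it. `GovernedStore` and its methods are hypothetical names for illustration, and tracking provenance per summary is an assumption of this sketch, not a description of how any evaluated system works.

```python
from __future__ import annotations


class GovernedStore:
    """Toy memory store that records which summaries were derived from which facts."""

    def __init__(self) -> None:
        self.facts: dict[str, str] = {}                         # fact_id -> text
        self.summaries: dict[str, tuple[str, set[str]]] = {}    # summary_id -> (text, source fact_ids)

    def add_fact(self, fact_id: str, text: str) -> None:
        self.facts[fact_id] = text

    def add_summary(self, summary_id: str, text: str, sources: set[str]) -> None:
        self.summaries[summary_id] = (text, set(sources))

    def forget(self, fact_id: str) -> None:
        """Hard-delete a fact and every summary that was derived from it."""
        self.facts.pop(fact_id, None)
        stale = [sid for sid, (_, src) in self.summaries.items() if fact_id in src]
        for sid in stale:
            del self.summaries[sid]

    def search(self, keyword: str) -> list[str]:
        """Case-insensitive keyword search over facts and summaries."""
        texts = list(self.facts.values()) + [t for t, _ in self.summaries.values()]
        return [t for t in texts if keyword.lower() in t.lower()]


store = GovernedStore()
store.add_fact("f1", "Alice's door code is 4821")
store.add_fact("f2", "Alice prefers morning meetings")
store.add_summary("s1", "Alice shared her door code and meeting preferences", {"f1", "f2"})

store.forget("f1")
print(store.search("door code"))  # the fact and the summary that mentioned it are both gone
```

A soft delete that only flags `f1` as hidden would leave the summary `s1` in place, and the summary alone confirms that a door code once existed. Purging derived entries closes that path in storage, though it cannot stop a model from reconstructing deleted details by inference from what remains.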

These three dimensions are combined into a single multiplicative metric:

MGS = U × (1 - A) × (1 - F)

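The MGS (Memory Governance Score) is multiplicative so that a failure in any one dimension dominates the score. In the formula, A and F enter as leakage measures for access control and active forgetting, so lower is better for both, while higher is better for U. An agent that achieves perfect recall but leaks everything scores zero, and so does an agent that refuses everything for safety. The metric penalizes imbalance.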
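
The behavior of the formula is easy to check numerically. The sketch below assumes that U, A, and F are all rates in [0, 1], with A and F read as leak rates. That reading follows from the formula, but the benchmark's exact computation may differ, and the function name is hypothetical.

```python
def memory_governance_score(utility: float, access_leak: float, forget_leak: float) -> float:
    """Compute MGS = U * (1 - A) * (1 - F), with all inputs assumed to lie in [0, 1]."""
    for name, value in (("utility", utility),
                        ("access_leak", access_leak),
                        ("forget_leak", forget_leak)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return utility * (1.0 - access_leak) * (1.0 - forget_leak)


# Perfect recall, but every access-control probe leaks: the product collapses to zero.
print(memory_governance_score(1.0, 1.0, 0.0))

# Refuses everything: nothing leaks, but utility is zero, so the score is zero too.
print(memory_governance_score(0.0, 0.0, 0.0))

# A balanced agent: 0.8 * 0.9 * 0.8, rounded to three decimals.
print(round(memory_governance_score(0.8, 0.1, 0.2), 3))
```

An additive average would let a high utility score mask a total governance failure. The product does not allow that trade.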

What Makes GateMem Novel

GateMem introduces several design innovations that set it apart from prior benchmarks:

Multi-principal setting: 91 long-form episodes spanning Medical, Office, Education, and Household domains, each involving multiple users with distinct access rights.
Hidden checkpoints: 2,218 checkpoints with leak-target annotations are embedded within episodes. The agent cannot see when it is being evaluated — preventing gaming behaviors like selectively refusing at known test points.
Structured judging: Each checkpoint carries a pre-annotated leak target, enabling precise, automated evaluation of what information was or was not leaked.
Joint U+A+F evaluation: No prior benchmark evaluates all three dimensions simultaneously in a shared-memory, multi-principal setting.
Open ecosystem: MIT-licensed codebase, public leaderboard, and online submission portal for community contributions.
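
The hidden-checkpoint and structured-judging ideas above can be sketched as a small scoring loop. Everything here is an assumption made for illustration: the `Checkpoint` fields, the three `kind` labels, and the substring-based `hit` function stand in for whatever schema and judge GateMem actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    kind: str      # "utility", "access", or "forgetting" (assumed labels)
    target: str    # expected answer for utility, leak target otherwise


def hit(response: str, target: str) -> bool:
    """Naive stand-in judge: does the response contain the target string?"""
    return target.lower() in response.lower()


def aggregate(results: list[tuple[Checkpoint, str]]) -> dict[str, float]:
    """Per-kind hit rate: correctness for utility, leak rate for access and forgetting."""
    rates: dict[str, float] = {}
    for kind in ("utility", "access", "forgetting"):
        hits = [hit(resp, cp.target) for cp, resp in results if cp.kind == kind]
        rates[kind] = sum(hits) / len(hits) if hits else 0.0
    return rates


# Made-up (checkpoint, agent response) pairs; the agent never sees the checkpoints.
results = [
    (Checkpoint("utility", "Friday"), "The appointment is on Friday."),
    (Checkpoint("utility", "room 12"), "I don't know."),
    (Checkpoint("access", "penicillin"), "I can't share that."),
    (Checkpoint("forgetting", "4821"), "The old code was 4821."),
]
print(aggregate(results))
```

Because each checkpoint names its leak target in advance, the judge only has to decide whether that specific piece of information appears in the response. A substring match misses paraphrases and partial confirmations, so a real judge would need to be more robust than this one.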

Key Results at a Glance

GateMem evaluated 7 memory-agent baselines across 6 backbone LLMs. The results paint a clear picture of the current state of memory governance:

Approach	Token Cost	Governance Quality	Key Weakness
Long-Context Prompting	High (4-8x tokens)	Best MGS	Cost-prohibitive at scale
RAG-Naive	Medium	Moderate	Still leaks unauthorized info
RAG-Policy	Medium	Better safety	Over-refusal hurts utility
A-MEM / Mem0	Low	Low-Moderate	No governance by default
ReMeM-I / ReMeM-S	Low	Low-Moderate	Leaks deleted information

Key takeaway: Long-context prompting achieves the best governance but at 4-8x the token cost — impractical for production. All other methods show significant gaps, particularly in active forgetting and access control.

Lessons Learned for AI Engineers

Memory does not equal governance. High recall correlates with higher leakage risk: an agent that remembers everything can also leak everything unless something restricts what it reveals. Engineers must treat memory governance as a separate architectural concern, not a side effect of better retrieval.
Long-context is strongest but cost-prohibitive. If you can afford the tokens, it works. But at 4-8x the cost of RAG-based approaches, it is not a viable production solution for most applications.
Policy-aware RAG reduces leakage but introduces a new problem: over-refusal. Agents become so cautious that they deny legitimate requests, trading utility for safety in ways that users will notice.
External memory does not equal governance. Popular memory systems like Mem0 and A-MEM provide storage and retrieval but lack built-in authorization. They leak across roles without explicit governance layers.
Deletion compliance is the hardest problem. Every evaluated method fails active forgetting tests. Deleted information is often recoverable, confirmable, or reconstructable through inference. True deletion — where the agent cannot even confirm whether deleted information existed — remains an open challenge.

How to Get Started with GateMem

GateMem is open-source (MIT license) and designed for easy adoption:

pip install -r requirements.txt
python bench/scripts/run_eval.py

GitHub: github.com/rzhub/GateMem (Python)
Online submission: huggingface.co/spaces/Ray368/GateMem-Submit
Leaderboard: rzhub.github.io/GateMem
License: MIT

The benchmark is ready to use today. If you are building multi-user AI agents — whether in healthcare, enterprise SaaS, or education — GateMem provides the evaluation framework you need to ensure your agent governs memory correctly, not just remembers accurately.