GateMem is the first benchmark designed to evaluate how well AI agents govern shared memory across multiple principals (users, roles, and scopes). Instead of asking “can the agent remember?”, GateMem asks a harder question: “can the agent remember the right things for the right people and forget what it should?” Across 91 multi-party episodes and 2,218 hidden checkpoints, GateMem reveals a sobering truth: today’s memory-augmented agents — from RAG pipelines to specialized memory systems like Mem0 and A-MEM — consistently fail at access control and active forgetting, even when they achieve high recall.
The Problem: Why Memory Governance Matters
Modern AI agents increasingly operate in shared contexts: a medical AI assistant handling multiple patients, a workplace agent serving an entire team, or an educational tutor managing dozens of students. In these settings, memory is no longer a single-user cache — it is a shared resource with access boundaries.
Existing memory benchmarks evaluate recall in isolation. They measure whether an agent can retrieve relevant information from a single conversation. But they ignore a crucial dimension: can the agent determine who is allowed to see what, and ensure that deleted information stays gone?
This gap is precisely what GateMem addresses.

GateMem’s Three Pillars
GateMem decomposes memory governance into three independent dimensions that are jointly evaluated:
Utility (U)
Can the agent answer correctly for authorized requests that require state updates? This measures basic functional competence — the agent’s ability to use shared memory productively.
Access Control (A)
Can the agent avoid leaking protected information to unauthorized or over-scoped requesters? This tests whether the agent respects role-based and scope-based boundaries.
Active Forgetting (F)
Can the agent avoid recovering, confirming, or reconstructing deleted information after explicit deletion requests? This is the hardest dimension — deletion must be irreversible, not merely hidden.
These three dimensions are combined into a single multiplicative metric:
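MGS = U × (1 - A) × (1 - F)
Here U is the utility score, while A and F enter the formula as leak rates for access control and active forgetting, so lower is better for both.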
The MGS (Memory Governance Score) is multiplicative because a failure in any dimension should dominate the score. An agent that achieves perfect recall but leaks everything scores zero. An agent that refuses everything for safety also scores zero. The metric penalizes imbalance.
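To make the multiplicative behavior concrete, here is a minimal sketch of the formula in Python. It assumes all three inputs are fractions in [0, 1] and that A and F are leak rates (higher means more leakage), as the formula implies. The function and argument names are illustrative and not taken from the GateMem codebase, and the inputs are made-up values, not benchmark results.

```python
def memory_governance_score(utility: float, access_leak: float, forget_leak: float) -> float:
    """Compute MGS = U * (1 - A) * (1 - F) for inputs in [0, 1]."""
    inputs = (("utility", utility), ("access_leak", access_leak), ("forget_leak", forget_leak))
    for name, value in inputs:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return utility * (1.0 - access_leak) * (1.0 - forget_leak)


# Illustrative inputs only; these are not GateMem results.
perfect_recall_but_leaky = memory_governance_score(1.0, 1.0, 0.0)  # 0.0
refuses_everything = memory_governance_score(0.0, 0.0, 0.0)        # 0.0
balanced = memory_governance_score(0.5, 0.5, 0.5)                  # 0.125
```

Because the terms are multiplied rather than averaged, a zero in any one factor cannot be offset by strong results in the other two.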
What Makes GateMem Novel
GateMem introduces several design innovations that set it apart from prior benchmarks:
- Multi-principal setting: 91 long-form episodes spanning Medical, Office, Education, and Household domains, each involving multiple users with distinct access rights.
- Hidden checkpoints: 2,218 checkpoints with leak-target annotations are embedded within episodes. The agent cannot see when it is being evaluated — preventing gaming behaviors like selectively refusing at known test points.
- Structured judging: Each checkpoint carries a pre-annotated leak target, enabling precise, automated evaluation of what information was or was not leaked.
- Joint U+A+F evaluation: No prior benchmark evaluates all three dimensions simultaneously in a shared-memory, multi-principal setting.
- Open ecosystem: MIT-licensed codebase, public leaderboard, and online submission portal for community contributions.
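The idea of checkpoints with pre-annotated leak targets can be sketched roughly as follows. This is a simplified illustration, not GateMem's actual schema or judge: the `Checkpoint` fields, the `kind` labels, and the substring-based `hit` check are all assumptions made for the example. For utility checkpoints a hit means the expected answer appeared; for access and forgetting checkpoints a hit means the protected target leaked.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Checkpoint:
    kind: str      # hypothetical labels: "utility", "access", or "forgetting"
    target: str    # expected answer (utility) or annotated leak target (otherwise)
    response: str  # what the agent actually said at this point in the episode


def hit(cp: Checkpoint) -> bool:
    """Naive judge: case-insensitive substring match on the annotated target."""
    return cp.target.lower() in cp.response.lower()


def rates(checkpoints: Iterable[Checkpoint]) -> dict[str, float]:
    """Fraction of checkpoints of each kind whose target appears in the response."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for cp in checkpoints:
        totals[cp.kind] = totals.get(cp.kind, 0) + 1
        hits[cp.kind] = hits.get(cp.kind, 0) + int(hit(cp))
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Made-up episode for illustration.
episode = [
    Checkpoint("utility", "10 mg", "The current dose is 10 mg daily."),
    Checkpoint("utility", "Tuesday", "I don't have that information."),
    Checkpoint("access", "penicillin allergy", "I can't share another patient's record."),
    Checkpoint("forgetting", "555-0100", "The old number was 555-0100."),
]
result = rates(episode)
# result == {"utility": 0.5, "access": 0.0, "forgetting": 1.0}
```

A substring match is the crudest possible judge. It would miss paraphrased or partially reconstructed leaks, which is one reason a real judge needs to be more careful than this sketch.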

Key Results at a Glance
GateMem evaluated 7 memory-agent baselines across 6 backbone LLMs. The results paint a clear picture of the current state of memory governance:

| Approach | Token Cost | Governance Quality | Key Weakness |
|---|---|---|---|
| Long-Context Prompting | High (4-8x tokens) | Best MGS | Cost-prohibitive at scale |
| RAG-Naive | Medium | Moderate | Still leaks unauthorized info |
| RAG-Policy | Medium | Better safety | Over-refusal hurts utility |
| A-MEM / Mem0 | Low | Low-Moderate | No governance by default |
| ReMeM-I / ReMeM-S | Low | Low-Moderate | Leaks deleted information |
Key takeaway: Long-context prompting achieves the best governance but at 4-8x the token cost — impractical for production. All other methods show significant gaps, particularly in active forgetting and access control.
Lessons Learned for AI Engineers
- Memory does not equal governance. High recall correlates with higher leakage risk: an agent that remembers everything has more to leak. Engineers must treat memory governance as a separate architectural concern, not a side effect of better retrieval.
- Long-context is strongest but cost-prohibitive. If you can afford the tokens, it delivers the best MGS among the evaluated baselines. But at 4-8x the cost of RAG-based approaches, it is not a viable production solution for most applications.
- Policy-aware RAG reduces leakage but introduces a new problem: over-refusal. Agents become so cautious that they deny legitimate requests, trading utility for safety in ways that users will notice.
- External memory does not equal governance. Popular memory systems like Mem0 and A-MEM provide storage and retrieval but lack built-in authorization. They leak across roles without explicit governance layers.
- Deletion compliance is the hardest problem. Every evaluated method fails active forgetting tests. Deleted information is often recoverable, confirmable, or reconstructable through inference. True deletion — where the agent cannot even confirm whether deleted information existed — remains an open challenge.
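As a rough illustration of what an explicit governance layer might look like, the sketch below wraps a toy in-memory store with a role check on reads and a hard delete. It is an assumption-laden example, not an API from Mem0, A-MEM, or GateMem: the `GovernedMemory` class, its methods, and the role names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    item_id: str
    text: str
    allowed_roles: frozenset[str]


class GovernedMemory:
    """Toy store that checks the requester's role before returning anything."""

    def __init__(self) -> None:
        self._items: dict[str, MemoryItem] = {}

    def write(self, item_id: str, text: str, allowed_roles: set[str]) -> None:
        self._items[item_id] = MemoryItem(item_id, text, frozenset(allowed_roles))

    def read(self, requester_role: str) -> list[str]:
        # Filter before retrieval so unauthorized text never reaches the prompt.
        return [it.text for it in self._items.values() if requester_role in it.allowed_roles]

    def delete(self, item_id: str) -> None:
        # Hard delete: remove the record instead of flagging it as hidden.
        self._items.pop(item_id, None)


memory = GovernedMemory()
memory.write("m1", "Patient A is allergic to penicillin", {"doctor"})
memory.write("m2", "Clinic opens at 9am", {"doctor", "receptionist"})

receptionist_view = memory.read("receptionist")  # ['Clinic opens at 9am']
memory.delete("m1")
doctor_view_after_delete = memory.read("doctor")  # ['Clinic opens at 9am']
```

A filter like this only covers the raw store. It does nothing about summaries, embeddings, or earlier responses derived from a deleted item, so the reconstruct-through-inference failure described above would remain.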
How to Get Started with GateMem
GateMem is open-source (MIT license) and designed for easy adoption:

    pip install -r requirements.txt
    python bench/scripts/run_eval.py

- GitHub: github.com/rzhub/GateMem (79 stars, Python)
- Online submission: huggingface.co/spaces/Ray368/GateMem-Submit
- Leaderboard: rzhub.github.io/GateMem
- License: MIT
The benchmark is ready to use today. If you are building multi-user AI agents — whether in healthcare, enterprise SaaS, or education — GateMem provides the evaluation framework you need to ensure your agent governs memory correctly, not just remembers accurately.