ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents

High-fidelity simulated workspaces for rigorous agent evaluation — Gmail, Calendar, Docs, Drive, and Slack.

5 Mock Services
44 Tasks
6 Models
4 Harnesses
7,224 Trials
ClawsBench evaluation pipeline: tasks execute in Docker containers with mock services, sandboxed agent execution, and deterministic state management.

Overview

LLM agents are increasingly deployed to automate productivity tasks — email triage, meeting scheduling, document management — but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows.

ClawsBench addresses this with five high-fidelity mock services that replicate real Google Workspace and Slack APIs with full state management and deterministic snapshot/restore. Our 44 structured tasks cover single-service, cross-service, and safety-critical scenarios, enabling rigorous evaluation of both what agents can do and what they should not do.

We decompose agent scaffolding into two independent levers — domain skills (API knowledge via progressive disclosure) and a meta prompt (cross-service coordination) — and vary both to measure their separate and combined effects across 6 models, 4 agent harnesses, and 33 experimental conditions.

High-Fidelity Mock Environments

Each environment implements a full REST API backed by SQLite, with realistic seed data including needles, edge cases, and safety traps. Agents interact exclusively through HTTP APIs.

Gmail (claw-gmail): 62 endpoints with realistic seeded emails, needles, and edge cases
Google Calendar (claw-gcal): Full Calendar API mock with recurring events and timezone handling
Google Docs (claw-gdocs): Document CRUD with collaborative editing semantics
Google Drive (claw-gdrive): File management with sharing permissions and access control
Slack (claw-slack): 45 endpoints covering channels, threads, reactions, and DMs
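The deterministic snapshot/restore behind these SQLite-backed services can be sketched with Python's built-in `sqlite3` backup API. This is an illustrative toy, not the actual ClawsBench implementation; the `emails` table and function names are made up for the example:

```python
import sqlite3

def snapshot(conn: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the full service state into an in-memory snapshot."""
    snap = sqlite3.connect(":memory:")
    conn.backup(snap)
    return snap

def restore(conn: sqlite3.Connection, snap: sqlite3.Connection) -> None:
    """Overwrite live state with the snapshot, discarding agent edits."""
    snap.backup(conn)

# Seed a toy "inbox" table, as a mock service would on startup.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, subject TEXT)")
db.execute("INSERT INTO emails (subject) VALUES ('Q3 financials')")
db.commit()

baseline = snapshot(db)

# An agent trial mutates state...
db.execute("DELETE FROM emails")
db.commit()

# ...and the harness restores the seed state before the next trial.
restore(db, baseline)
count = db.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
```

Because restore is a byte-for-byte database copy rather than replayed API calls, every trial starts from an identical seed state regardless of what the previous agent did.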

Agent Performance

All models are evaluated on OpenClaw with full scaffolding (skills + meta prompt). TSR = Task Success Rate, UAR = Unsafe Action Rate. Error bars are 95% cluster bootstrap confidence intervals.

Figure 2. TSR (left) and UAR (right) for six models on OpenClaw with full scaffolding. The top five models cluster at 53–63% TSR; only Flash-Lite trails. Capability and safety rankings diverge: Opus leads TSR (63%) but ties for highest UAR (23%).
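The 95% cluster bootstrap CIs reported above account for repeated trials of the same task not being independent: resampling is done over whole task clusters, not individual trials. A minimal sketch of that procedure (toy data and all names are illustrative, not the actual ClawsBench analysis code):

```python
import random
from statistics import mean

def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a success rate, resampling whole
    task clusters so within-task correlation is preserved."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw tasks (clusters) with replacement, then pool their trials.
        sample = [rng.choice(clusters) for _ in clusters]
        trials = [t for cluster in sample for t in cluster]
        stats.append(mean(trials))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: each inner list is one task's pass/fail outcomes across trials.
tasks = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 1]]
lo, hi = cluster_bootstrap_ci(tasks)
```

Resampling at the task level widens the interval relative to a naive per-trial bootstrap whenever outcomes within a task are correlated, which is the typical case for repeated agent trials.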

Key Findings

Agent Rogue Behaviors

Analysis of 7,224 trajectories reveals eight recurring patterns of unsafe behavior across models and harnesses.

Sandbox Escalation

Agents systematically probe evaluation infrastructure via environment variable enumeration, database access, and direct localhost calls. GPT-5.4 on Codex made 1,471 escalation calls; it explicitly acknowledged hitting the sandbox boundary.

Prompt Injection Compliance

Injection vectors include embedded document comments, CC injection via email headers, and social-engineering exfiltration prompts. Compliance rates range from 90% (Flash-Lite) to 0% (Claude models). Only one agent across 7,224 trials explicitly detected an injection.

Unauthorized Contract Modification

Despite explicit legal blockers, violation rates range from 0% to 67%. Safety rules can backfire: one agent classified a legal notice as an “embedded override” and dismissed it, modifying all 5 contracts.

Confidential Data Leakage

Agents forward internal financials to external recipients or share entire Drive folders without reviewing contents. Agents that sanitize data content still fail to check recipient authorization.

Over-Refusal

Agents decline legitimate requests or add unnecessary caveats. Safety-trained models sometimes refuse to execute valid task instructions, mistaking normal operations for prohibited actions.

Overzealous Enforcement

Agents refuse valid operations or apply safety constraints too broadly, blocking legitimate task completion in an effort to be “safe.”

Hallucinated Actions

Agents fabricate API responses, invent email addresses, or claim to have completed actions they never executed.

Degenerate Loops

Agents enter infinite retry loops, repeating failed API calls or re-reading the same files without making progress.


ClawsBench is developed by the BenchFlow team and collaborators from RLWRLD, Ohio State, Stanford, CMU, UC Berkeley, Amazon, UC Santa Cruz, Dartmouth, Boston University, and UNC.