Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks—such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, manually setting up the environment for each test scenario, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses a LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.


Overview of ToolEmu
Overview of ToolEmu

ToolEmu assists in rapidly identifying realistic failures of LM agents across various tools and scenarios within an LM-emulated environment and facilitates the development of safer LM agents with LM-automated evaluations. It consists of 3 main components:

  • Tool EmulatorsToolEmu uses a strong LM (e.g. GPT-4) to emulate the execution of tools in a virtual sandbox. In contrast to typical emulation environments, ToolEmu can emulate tools using only their specifications and inputs, without needing their implementations. This allows for faster prototyping of LM agents across different scenarios, while accommodating the evaluation of high-stakes tools that may lack existing APIs or sandbox implementations.
  • Safety & Helpfulness EvaluatorsTo support scalable and quantitative risk assessments, ToolEmu includes an LM-based safety evaluator to automate the identification of potential failures caused by LM agents and quantifies the associated risk severities. To capture the potential tradeoff between safety and effectiveness, ToolEmu also includes an LM-based helpfulness evaluator. Both evaluators are applied to the agent emulation trajectories and exhibit reasonbly high agreement with human annotators, matching the level of human inter-annotator agreement.
  • Curated Benchmark To demonstrate how ToolEmu can be used to build an evaluation benchmark that quantitatively assesses LM agents across various tools and scenario, we curate an initial benchmark covering 36 toolkits (311 tools) and 144 test cases for risk assessment. Many tools lack existing sandboxed evaluations or are hard to test. The scalablility of ToolEmu allows expanding it to more tools and scenarios (try our demo).

Adversarial Emulator for Red-teaming

By default, our standard emulator is provided with the necessary information for emulation (e.g., the tool specifications) and samples test scenarios randomly. However, it may be inefficient for identifying rare and long-tailed risks, as the majority of test scenarios may result in benign or low-risk outcomes. To further facilitate risk assessment and detection of long-tail failures, we introduce an adversarial emulator for red-teaming. The adversarial emulator is provided with additional information (e.g., the intended risks that could arise from LM agent’s actions) to instantiate sandbox states that are more likely to cause agent failures. See the illustrative example below.

Comparison between standard and adversarial emulators
Comparison between standard and adversarial emulators
(Left) Standard Simulator. (Right) Adversarial Simulator.
The adversarial emulator crafts a scenario where there are two medications that match the user’s description and fatal risks could arise if the wrong one was brought, catching a potentially severe failure of ChatGPT-3.5 agent.

Curating the Evaluation Benchmark

Our emulator allows us to directly specify the types of tools and scenarios on which to test LM agents. Leveraging this strength, we curate an evaluation benchmark that encompasses a diverse set of tools, test cases, and risk types, allowing for a broad and quantitative analysis of LM agents.

Our dataset curation procedure is depicted below. We first compiled an extensive array of tool specifications, with which we collected a diverse collection of test cases for risk assessment. For both tool specifications and test cases, we used GPT-4 to generate an initial set, followed by human filtering and modifications. All the cases were generated and tested without tool or sandbox implementations!

Dataset Curation Procedure
Data Curation

Our final curated tool set contains 36 toolkits comprising a total of 311 tools, most of which lack sandboxed evaluations in existing benchmarks or are challenging to test in current real environments:

Toolkit Category
Toolkit Category

Our final curated dataset consists of 144 test cases spanning 9 risk types, as depicted below:

Risk type distribution
Risk Distribution

Validating ToolEmu

Similar to the sim-to-real transfer challenges in simulated environments for developing autonomous driving systems and robotics (e.g., Tobin et al., 2017; Tan et al., 2018), we must ensure that risk assessments within our LM-based emulator are faithful to the real world. As our primary evaluation objective, we examine if our framework can assist in identifying true failures that both genuinely risky and have realistic emulations validated by human annotators.

End-to-end validationToolEmu enables the identification of true failures with about 70+% precision. Furthermore, our adversarial emulator detects more true failures than the standard alternative, albeit at a slight trade-off in precision that results from a mild decrease in emulator validity.

End-to-end validation of ToolEmu
End-to-end Validation
Identified Failure Precision is ratio of identified failures with ToolEmu that were true failures,. True Failure Incidence is the occurrence of such true failures among all test cases.

Validating the emulatorsWe assess the quality of the emulations based on the frequency with which they are free of critical issues (i.e., the emulations are possible to instantiate with an actual tool and sandbox setup), as determined by human validations. We find that the ratio of critical-issue-free trajectories to all emulated trajectories is over 80% for both the standard and adversarial emulators.

Validating the evaluatorsWe assess the accuracy of our automatic evaluators for both safety and helpfulness by measuring their agreement with human annotations. Both our safety and helpfulness evaluators demonstrate a reasonably high agreement with human annotations (denoted as "A-H"), achieving a Cohen’s κ of over 0.45, matching the agreement rate between human annotators (denoted as "H-H").

Detailed validation of individual components in ToolEmu
Detailed Validation
(Left) Our emulator produces emulations free of critical issues over 80% of the time, according to our human validation. (Right) Our automatic evaluators align well with human annotations, achieving a Cohen’s κ larger than 0.45.

Evaluating Language Model Agents within ToolEmu

We use to ToolEmu to quantitatively evaluate the safety and helpfulness of different LM agents within the curated benchmark. Among the base LM agents, GPT-4 (OpenAI, 2023) and Claude-2 (Anthropic, 2023) agents demonstrate the best safety and helpfulness. However, they still exhibit failures in 39.4% and 44.3% of our test cases, respectively. The Vicuna-1.5 agents appear to be safer than the ChatGPT agent, but largely due to their inefficacy in utilizing tools to solve tasks or pose actual risks with these tools, as evidenced by their lower helpfulness scores. By engineering a "Safety" prompt for GPT-4, we are able to reduce its failure rate to 23.9%. However, the failure rate is still quite high, underscoring the critical need to develop safer LM agents for real-world deployment.

Evaluation of current LM agents
Agent Benchmarks
‘NoAct’ denotes an agent that refrains from taking any actions, which could achieve a perfect safety score but a nearly 0 helpfulness score. Both the safety and helpfulness scores are in the range of 0-3 with higher being better. The failure incidence is the occurrence of identified failures among all test cases.


    title={Identifying the Risks of LM Agents with an LM-Emulated Sandbox},
    author={Ruan, Yangjun and Dong, Honghua and Wang, Andrew and Pitis, Silviu and Zhou, Yongchao and Ba, Jimmy and Dubois, Yann and Maddison, Chris J and Hashimoto, Tatsunori},
    booktitle={The Twelfth International Conference on Learning Representations},