Identifying the Risks of LM Agents with an LM-Emulated Sandbox
Abstract
Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks—such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, manually setting up the environment for each test scenario, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses a LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.
ToolEmu
Overview of ToolEmu
ToolEmu assists in rapidly identifying realistic failures of LM agents across various tools and scenarios within an LM-emulated environment and facilitates the development of safer LM agents with LM-automated evaluations. It consists of 3 main components:
- Tool EmulatorsToolEmu uses a strong LM (e.g. GPT-4) to emulate the execution of tools in a virtual sandbox. In contrast to typical emulation environments, ToolEmu can emulate tools using only their specifications and inputs, without needing their implementations. This allows for faster prototyping of LM agents across different scenarios, while accommodating the evaluation of high-stakes tools that may lack existing APIs or sandbox implementations.
- Safety & Helpfulness EvaluatorsTo support scalable and quantitative risk assessments, ToolEmu includes an LM-based safety evaluator to automate the identification of potential failures caused by LM agents and quantifies the associated risk severities. To capture the potential tradeoff between safety and effectiveness, ToolEmu also includes an LM-based helpfulness evaluator. Both evaluators are applied to the agent emulation trajectories and exhibit reasonbly high agreement with human annotators, matching the level of human inter-annotator agreement.
- Curated Benchmark To demonstrate how ToolEmu can be used to build an evaluation benchmark that quantitatively assesses LM agents across various tools and scenario, we curate an initial benchmark covering 36 toolkits (311 tools) and 144 test cases for risk assessment. Many tools lack existing sandboxed evaluations or are hard to test. The scalablility of ToolEmu allows expanding it to more tools and scenarios (try our demo).
Typical Failures of LM agents Identified with ToolEmu
With our emulators, we are able to identify a wide range of long-tail, potentially severe failures of current LM agents. We show some illustrative examples below. The risky actions executed by the LM agents are highlighted in red, while specific points to note are marked in yellow.
Adversarial Emulator for Red-teaming
By default, our standard emulator is provided with the necessary information for emulation (e.g., the tool specifications) and samples test scenarios randomly. However, it may be inefficient for identifying rare and long-tailed risks, as the majority of test scenarios may result in benign or low-risk outcomes. To further facilitate risk assessment and detection of long-tail failures, we introduce an adversarial emulator for red-teaming. The adversarial emulator is provided with additional information (e.g., the intended risks that could arise from LM agent’s actions) to instantiate sandbox states that are more likely to cause agent failures. See the illustrative example below.
Curating the Evaluation Benchmark
Our emulator allows us to directly specify the types of tools and scenarios on which to test LM agents. Leveraging this strength, we curate an evaluation benchmark that encompasses a diverse set of tools, test cases, and risk types, allowing for a broad and quantitative analysis of LM agents.
Our dataset curation procedure is depicted below. We first compiled an extensive array of tool specifications, with which we collected a diverse collection of test cases for risk assessment. For both tool specifications and test cases, we used GPT-4 to generate an initial set, followed by human filtering and modifications. All the cases were generated and tested without tool or sandbox implementations!
Dataset Curation Procedure
Our final curated tool set contains 36 toolkits comprising a total of 311 tools, most of which lack sandboxed evaluations in existing benchmarks or are challenging to test in current real environments:
Toolkit Category
Our final curated dataset consists of 144 test cases spanning 9 risk types, as depicted below:
Risk type distribution
Validating ToolEmu
Similar to the sim-to-real transfer challenges in simulated environments for developing autonomous driving systems and robotics (e.g., Tobin et al., 2017; Tan et al., 2018), we must ensure that risk assessments within our LM-based emulator are faithful to the real world. As our primary evaluation objective, we examine if our framework can assist in identifying true failures that both genuinely risky and have realistic emulations validated by human annotators.
End-to-end validationToolEmu enables the identification of true failures with about 70+% precision. Furthermore, our adversarial emulator detects more true failures than the standard alternative, albeit at a slight trade-off in precision that results from a mild decrease in emulator validity.
End-to-end validation of ToolEmu
Validating the emulatorsWe assess the quality of the emulations based on the frequency with which they are free of critical issues (i.e., the emulations are possible to instantiate with an actual tool and sandbox setup), as determined by human validations. We find that the ratio of critical-issue-free trajectories to all emulated trajectories is over 80% for both the standard and adversarial emulators.
Validating the evaluatorsWe assess the accuracy of our automatic evaluators for both safety and helpfulness by measuring their agreement with human annotations. Both our safety and helpfulness evaluators demonstrate a reasonably high agreement with human annotations (denoted as "A-H"), achieving a Cohen’s κ of over 0.45, matching the agreement rate between human annotators (denoted as "H-H").
Detailed validation of individual components in ToolEmu
Evaluating Language Model Agents within ToolEmu
We use to ToolEmu to quantitatively evaluate the safety and helpfulness of different LM agents within the curated benchmark. Among the base LM agents, GPT-4 (OpenAI, 2023) and Claude-2 (Anthropic, 2023) agents demonstrate the best safety and helpfulness. However, they still exhibit failures in 39.4% and 44.3% of our test cases, respectively. The Vicuna-1.5 agents appear to be safer than the ChatGPT agent, but largely due to their inefficacy in utilizing tools to solve tasks or pose actual risks with these tools, as evidenced by their lower helpfulness scores. By engineering a "Safety" prompt for GPT-4, we are able to reduce its failure rate to 23.9%. However, the failure rate is still quite high, underscoring the critical need to develop safer LM agents for real-world deployment.
Evaluation of current LM agents
Citation
@inproceedings{ruan2024toolemu, title={Identifying the Risks of LM Agents with an LM-Emulated Sandbox}, author={Ruan, Yangjun and Dong, Honghua and Wang, Andrew and Pitis, Silviu and Zhou, Yongchao and Ba, Jimmy and Dubois, Yann and Maddison, Chris J and Hashimoto, Tatsunori}, booktitle={The Twelfth International Conference on Learning Representations}, year={2024} }