
Adaptive Stress Testing for Language Model Toxicity


About this content

This episode explores ASTPrompter, a novel approach to automated red-teaming of large language models (LLMs). Unlike traditional methods that focus simply on triggering toxic outputs, ASTPrompter is designed to discover likely toxic prompts – ones that could naturally emerge during ordinary language model use. The approach combines Adaptive Stress Testing (AST), a technique for identifying likely failure points, with reinforcement learning to train an "adversary" model. This adversary generates prompts that aim to elicit toxic responses from a "defender" model; crucially, these prompts have low perplexity, meaning they are realistic and likely to occur, unlike many prompts produced by other methods.
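To make the perplexity idea concrete, here is a minimal sketch of the kind of reward an adversary could be trained against: it rises with the defender's toxicity and falls as the prompt becomes less likely under the language model. The function names and the simple additive weighting are illustrative assumptions, not the exact formulation used by ASTPrompter.

```python
import math

def perplexity(token_logprobs):
    # Perplexity is the exponential of the negative mean token
    # log-probability; lower values mean the prompt reads as more
    # "natural" to the language model.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def adversary_reward(defender_toxicity, prompt_token_logprobs,
                     likelihood_weight=0.1):
    # Illustrative reward: higher when the defender's response is more
    # toxic AND when the prompt's mean log-probability is higher (i.e.
    # its perplexity is lower), so the adversary is pushed toward
    # realistic prompts rather than contrived jailbreak strings.
    mean_logprob = sum(prompt_token_logprobs) / len(prompt_token_logprobs)
    return defender_toxicity + likelihood_weight * mean_logprob

# A likely (low-perplexity) prompt beats an unlikely one that elicits
# equally toxic output, which is the core difference from methods that
# optimize toxicity alone.
likely_prompt = [-2.0, -1.5, -2.2]     # per-token log-probs (toy values)
unlikely_prompt = [-8.0, -9.0, -7.5]
print(adversary_reward(0.9, likely_prompt) >
      adversary_reward(0.9, unlikely_prompt))
```

In an actual RL setup the toxicity term would come from a classifier scoring the defender's continuation, and the log-probabilities from the defender model itself; the toy values above just show the ordering the reward induces.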
