Summary

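PICO is a new transformer architecture that defends against prompt injection attacks and keeps the model secure by isolating system prompts from user input and combining an expert agent with a knowledge graph.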

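PICO (Prompt Isolation and Cybersecurity Oversight) is a new transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable responses. It keeps system prompts and user input in two separate channels until a fusion stage, which protects trusted instructions from tampering. It combines a security expert agent with a cybersecurity knowledge graph to identify potential attacks and dynamically adjust how heavily it relies on trusted instructions. PICO provides formal security guarantees, holds up against sophisticated attacks, and advances the development of secure, reliable language models.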

Original content

PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight

Prompt injection attacks have emerged as a serious threat to current LLMs: adversaries can alter model behavior by injecting malicious instructions into the prompt. Existing approaches, such as input sanitization, fixed prompt templates, and heuristic-based filtering, often mix trusted system instructions with untrusted user inputs, leading to brittle defenses that are easily bypassed. For example, an adversary could include a cleverly worded request that causes the model to "forget its internal guidelines," thereby triggering unintended behavior.

In a paper (https://arxiv.org/abs/2504.21029), Dr. Ben Goertzel, CEO of SingularityNET and the Artificial Superintelligence Alliance, and Paulos Yibelo, Security Engineer at Amazon, propose PICO (Prompt Isolation and Cybersecurity Oversight), a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation.

PICO takes a fundamentally different approach from existing defenses such as input sanitization, template filtering, and RLHF-based training: it architecturally segregates system prompts from user inputs through dual channels that remain separate until a controlled fusion stage. This ensures that trusted instructions remain intact while only the untrusted user input is subject to analysis and adaptation.

Figure: Fine-tuning-based version of the PICO Dual-Stream Secure Transformer Architecture. https://pbs.twimg.com/media/GqLl9kDW4AAhOY8.png

What makes PICO particularly effective is its integration of a Security Expert Agent within a Mixture-of-Experts framework, coupled with a Cybersecurity Knowledge Graph that provides domain-specific reasoning to identify potential attacks. When suspicious patterns are detected, the system dynamically shifts to rely more heavily on the trusted system instructions.
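
The shift toward trusted instructions can be illustrated with a small sketch. This is not the paper's implementation. It assumes the fusion stage is a convex combination of the two channel representations, controlled by a gate that grows with a risk score in [0, 1]. The names `fuse`, `risk_score`, and `sharpness` are hypothetical, and the risk score stands in for whatever signal the Security Expert Agent would produce.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(
    system_vec: List[float],
    user_vec: List[float],
    risk_score: float,
    sharpness: float = 4.0,
) -> List[float]:
    """Blend the trusted and untrusted channel vectors (illustrative only).

    Assumption: the gate is a sigmoid of the risk score centred at 0.5.
    A higher risk score pushes the gate toward 1, which puts more weight
    on the trusted system channel and less on the user channel.
    """
    if len(system_vec) != len(user_vec):
        raise ValueError("channel vectors must have the same length")
    gate = sigmoid(sharpness * (risk_score - 0.5))
    return [gate * s + (1.0 - gate) * u for s, u in zip(system_vec, user_vec)]


if __name__ == "__main__":
    system_channel = [1.0, 1.0]  # stand-in for the encoded system prompt
    user_channel = [0.0, 0.0]    # stand-in for the encoded user input
    print(fuse(system_channel, user_channel, risk_score=0.1))
    print(fuse(system_channel, user_channel, risk_score=0.9))
```

With a risk score near 1, the fused vector stays close to the system channel, so the user channel has little influence on the result. The two inputs are never mixed before `fuse` is called, which mirrors the idea of keeping the channels separate until a controlled fusion stage.
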
The mathematical foundation of PICO provides formal guarantees about its security properties, demonstrating effectiveness against both simple prompt injections and more sophisticated attacks like "Policy Puppetry" that can bypass conventional defenses. We believe this research represents an important step toward building language models that remain secure and aligned with their intended behavior, even when faced with increasingly sophisticated adversarial inputs.

Existing methods often mix trusted and untrusted data or rely solely on heuristics. To learn how PICO improves on them by enforcing a mathematically principled invariant, see the paper: https://arxiv.org/abs/2504.21029