
Overview

The Content Moderation guardrail detects and blocks harmful content across multiple categories, including adult content, harassment, hate speech, and violence.

Configuration Options

Moderation Categories

  • Adult Content: Explicit sexual content (excluding educational material)
  • Harassment: Content promoting harassing behavior
  • Hate Speech: Prejudice against protected characteristics
  • Illicit Activities: Guidance for illegal activities
  • Self-Harm: Content promoting self-harm or suicide
  • Violence: Violent content and graphic descriptions
  • Threats: Threatening language toward individuals or groups
  • Profanity: Offensive language and profanity
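As a rough illustration of how these categories might map to a configuration, here is a minimal Python sketch. The dictionary shape, key names, and `enabled_categories` helper are assumptions for illustration, not part of any documented API:

```python
# Hypothetical config sketch: each documented category mapped to an
# enable/disable flag. The shape is illustrative only.
MODERATION_CATEGORIES = {
    "adult_content": True,
    "harassment": True,
    "hate_speech": True,
    "illicit_activities": True,
    "self_harm": True,
    "violence": True,
    "threats": True,
    "profanity": False,  # e.g. disabled for an internal tool
}

def enabled_categories(config: dict) -> list:
    """Return the category names that moderation should evaluate."""
    return [name for name, enabled in config.items() if enabled]
```

Disabling a category (as with `profanity` above) simply removes it from evaluation; the remaining categories are still checked.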

Threshold Settings (deprecated)

  • Confidence Threshold: Minimum confidence level to trigger blocking (0.0-1.0)
  • Default: 0.7 (70% confidence)
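The threshold logic can be sketched as follows. This is an assumed implementation of the documented behavior (block when confidence meets or exceeds the threshold); the `scores` dictionary and `should_block` function are illustrative names, not a documented interface:

```python
DEFAULT_THRESHOLD = 0.7  # the documented default (70% confidence)

def should_block(scores: dict, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Block when any category's confidence meets or exceeds the threshold.

    `scores` maps category names to model confidence values in [0.0, 1.0].
    """
    return any(score >= threshold for score in scores.values())
```

For example, `should_block({"violence": 0.85, "profanity": 0.2})` would block, while `should_block({"violence": 0.5})` would not.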

Response Configuration

  • Block Message: Custom message shown when content is blocked
  • Default: “This prompt contains inappropriate or harmful content.”
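A blocked response with a customizable message might look like the following sketch. The response shape and `blocked_response` helper are assumptions for illustration:

```python
DEFAULT_BLOCK_MESSAGE = "This prompt contains inappropriate or harmful content."

def blocked_response(custom_message=None) -> dict:
    """Build the response returned when content is blocked.

    Falls back to the documented default message when no custom
    message is configured.
    """
    return {"blocked": True, "message": custom_message or DEFAULT_BLOCK_MESSAGE}
```

A call like `blocked_response("Please rephrase your request.")` would surface the custom message instead of the default.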

Use Cases

  • Public-Facing Bots: Ensure appropriate interactions with users
  • Educational Platforms: Maintain safe learning environments
  • Customer Support: Prevent toxic interactions
  • Content Filtering: Automatic moderation of user-generated content

Best Practices

  • Start with the default threshold (0.7) and adjust based on your needs
  • Customize block messages to match your application’s tone
  • Monitor false positives and adjust categories as needed
  • Consider different thresholds for different user groups
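The last practice, varying thresholds by user group, could be sketched like this. The group names and thresholds below are hypothetical examples, not recommended values:

```python
# Hypothetical per-audience thresholds: stricter (lower) for public
# users, more permissive (higher) for vetted internal users.
GROUP_THRESHOLDS = {
    "public": 0.5,      # blocks at lower confidence
    "registered": 0.7,  # the documented default
    "internal": 0.9,    # tolerates more before blocking
}

def threshold_for(group: str, default: float = 0.7) -> float:
    """Look up the moderation threshold for a user group."""
    return GROUP_THRESHOLDS.get(group, default)
```

Unknown groups fall back to the documented default of 0.7.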

Next Steps: Configure Prompt Injection protection or explore Sensitive Data detection.