Overview
The Content Moderation guardrail detects and blocks harmful content across multiple categories, including adult content, harassment, hate speech, and violence.
Configuration Options
Moderation Categories
- Adult Content: Explicit sexual content (excluding educational material)
- Harassment: Content promoting harassing behavior
- Hate Speech: Prejudice against protected characteristics
- Illicit Activities: Guidance for illegal activities
- Self-Harm: Content promoting self-harm or suicide
- Violence: Violent content and graphic descriptions
- Threats: Threatening language toward individuals or groups
- Profanity: Offensive language and profanity
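Each category can typically be toggled on or off independently. The sketch below is a minimal, hypothetical configuration shape (the key names and the `enabled_categories` helper are illustrative assumptions, not an official API):

```python
# Hypothetical category toggles; names are illustrative, not an official API.
MODERATION_CATEGORIES = {
    "adult_content": True,
    "harassment": True,
    "hate_speech": True,
    "illicit_activities": True,
    "self_harm": True,
    "violence": True,
    "threats": True,
    "profanity": False,  # e.g. relaxed for a casual-chat product
}

def enabled_categories(config):
    """Return the names of the categories that are switched on."""
    return sorted(name for name, on in config.items() if on)
```

For example, `enabled_categories(MODERATION_CATEGORES)` with the config above returns every category except `profanity`.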
Threshold Settings (deprecated)
- Confidence Threshold: Minimum confidence level to trigger blocking (0.0-1.0)
- Default: 0.7 (70% confidence)
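Although this setting is deprecated, the blocking rule it describes is simple: content is blocked when any category's confidence score meets or exceeds the threshold. A minimal sketch, assuming per-category scores in the 0.0-1.0 range (the function and parameter names are hypothetical):

```python
DEFAULT_THRESHOLD = 0.7  # the documented default (70% confidence)

def should_block(category_scores, threshold=DEFAULT_THRESHOLD):
    """Block when any category's confidence meets or exceeds the threshold.

    category_scores maps category name -> confidence in [0.0, 1.0].
    """
    return any(score >= threshold for score in category_scores.values())
```

Lowering the threshold blocks more content (stricter moderation); raising it blocks less.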
Response Configuration
- Block Message: Custom message shown when content is blocked
- Default: “This prompt contains inappropriate or harmful content.”
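When content is blocked, the configured block message is returned to the user instead of a model response. A sketch of that behavior, assuming a simple blocked/allowed flag (`moderation_response` is a hypothetical helper, not a documented function):

```python
DEFAULT_BLOCK_MESSAGE = "This prompt contains inappropriate or harmful content."

def moderation_response(blocked, block_message=DEFAULT_BLOCK_MESSAGE):
    """Return the block message when content is blocked,
    or None so the request can proceed normally."""
    return block_message if blocked else None
```

Passing a custom `block_message` lets the refusal match your application's tone, e.g. `moderation_response(True, "Let's keep things friendly!")`.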
Use Cases
- Public-Facing Bots: Ensure appropriate interactions with users
- Educational Platforms: Maintain safe learning environments
- Customer Support: Prevent toxic interactions
- Content Filtering: Automatic moderation of user-generated content
Best Practices
- Start with the default threshold (0.7) and adjust it based on the false positives and false negatives you observe
- Customize block messages to match your application’s tone
- Monitor false positives and adjust categories as needed
- Consider different thresholds for different user groups
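The last practice above can be sketched as a per-group threshold lookup. The group names and values here are illustrative assumptions; only the 0.7 default comes from this document:

```python
# Illustrative per-audience thresholds; lower values block more content.
GROUP_THRESHOLDS = {
    "children": 0.5,   # stricter for young audiences
    "general": 0.7,    # the documented default
    "internal": 0.9,   # more permissive for trusted internal users
}

def threshold_for(group):
    """Look up a group's threshold, falling back to the 0.7 default."""
    return GROUP_THRESHOLDS.get(group, 0.7)
```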
Next Steps: Configure Prompt Injection protection or explore Sensitive Data detection.