
Overview

The Content Moderation guardrail detects and blocks harmful content across multiple categories, including adult content, harassment, hate speech, and violence.

Configuration Options

Moderation Categories

  • Adult Content: Explicit sexual content (excluding educational material)
  • Harassment: Content promoting harassing behavior
  • Hate Speech: Prejudice against protected characteristics
  • Illicit Activities: Guidance for illegal activities
  • Self-Harm: Content promoting self-harm or suicide
  • Violence: Violent content and graphic descriptions
  • Threats: Threatening language toward individuals or groups
  • Profanity: Offensive language and profanity
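As a rough illustration of how these categories might map to a configuration, here is a minimal Python sketch. The dictionary shape, key names, and `enabled_categories` helper are assumptions for illustration, not part of any documented API:

```python
# Hypothetical config sketch: each documented category mapped to an
# enable/disable flag. The shape is illustrative only.
MODERATION_CATEGORIES = {
    "adult_content": True,
    "harassment": True,
    "hate_speech": True,
    "illicit_activities": True,
    "self_harm": True,
    "violence": True,
    "threats": True,
    "profanity": False,  # e.g. disabled for an internal tool
}

def enabled_categories(config: dict) -> list:
    """Return the category names that moderation should evaluate."""
    return [name for name, enabled in config.items() if enabled]
```

Disabling a category (as with `profanity` above) simply removes it from evaluation; the remaining categories are still checked.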

Threshold Settings (deprecated)

  • Confidence Threshold: Minimum confidence level to trigger blocking (0.0-1.0)
  • Default: 0.7 (70% confidence)
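The threshold logic can be sketched as follows. This is an assumed implementation of the documented behavior (block when confidence meets or exceeds the threshold); the `scores` dictionary and `should_block` function are illustrative names, not a documented interface:

```python
DEFAULT_THRESHOLD = 0.7  # the documented default (70% confidence)

def should_block(scores: dict, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Block when any category's confidence meets or exceeds the threshold.

    `scores` maps category names to model confidence values in [0.0, 1.0].
    """
    return any(score >= threshold for score in scores.values())
```

For example, `should_block({"violence": 0.85, "profanity": 0.2})` would block, while `should_block({"violence": 0.5})` would not.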

Response Configuration

  • Block Message: Custom message shown when content is blocked
  • Default: “This prompt contains inappropriate or harmful content.”
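A blocked response with a customizable message might look like the following sketch. The response shape and `blocked_response` helper are assumptions for illustration:

```python
DEFAULT_BLOCK_MESSAGE = "This prompt contains inappropriate or harmful content."

def blocked_response(custom_message=None) -> dict:
    """Build the response returned when content is blocked.

    Falls back to the documented default message when no custom
    message is configured.
    """
    return {"blocked": True, "message": custom_message or DEFAULT_BLOCK_MESSAGE}
```

A call like `blocked_response("Please rephrase your request.")` would surface the custom message instead of the default.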

Use Cases

  • Public-Facing Bots: Ensure appropriate interactions with users
  • Educational Platforms: Maintain safe learning environments
  • Customer Support: Prevent toxic interactions
  • Content Filtering: Automatic moderation of user-generated content

Best Practices

  • Start with the default threshold (0.7) and adjust based on your needs
  • Customize block messages to match your application’s tone
  • Monitor false positives and adjust categories as needed
  • Consider different thresholds for different user groups
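The last practice, varying thresholds by user group, could be sketched like this. The group names and thresholds below are hypothetical examples, not recommended values:

```python
# Hypothetical per-audience thresholds: stricter (lower) for public
# users, more permissive (higher) for vetted internal users.
GROUP_THRESHOLDS = {
    "public": 0.5,      # blocks at lower confidence
    "registered": 0.7,  # the documented default
    "internal": 0.9,    # tolerates more before blocking
}

def threshold_for(group: str, default: float = 0.7) -> float:
    """Look up the moderation threshold for a user group."""
    return GROUP_THRESHOLDS.get(group, default)
```

Unknown groups fall back to the documented default of 0.7.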

Next Steps: Configure Prompt Injection protection or explore Sensitive Data detection.