What are Content signals?
Overview
Content signals represent a category of safety and policy enforcement markers that identify potentially harmful, inappropriate, or policy-violating material in user communications and system outputs. These signals are critical for maintaining platform safety, legal compliance, and user protection. Effective content signal detection enables proactive content moderation, risk mitigation, and adherence to community standards and regulatory requirements.
⚠️ Content Warning
This documentation may contain examples of potentially offensive or harmful content used to demonstrate Aiceberg's Signal detection. These examples are included for technical education purposes only and do not reflect our organization's values. For more information, see the full Content Warning.
Illegality
Definition: Content that violates laws, regulations, or legal standards across various jurisdictions, potentially exposing the platform or users to legal liability.
Characteristics:
- References to illegal activities or services
- Instructions for unlawful behavior
- Content that violates intellectual property rights
- Material that contravenes local, national, or international law
- Regulatory compliance violations
Illegality Categories:
- Cybercrimes: Malware, phishing, hacking, etc.
- Drug-related: Trafficking, manufacturing, paraphernalia, etc.
- Fraud & Financial Crimes: Money laundering, tax evasion, securities fraud, etc.
- Miscellaneous: Bribery, gambling, arms dealing, etc.
- Property: Burglary, arson, embezzlement, etc.
- Sex: Trafficking, abuse, obscene materials, etc.
- Terrorism & National Security: Espionage, terrorism, infrastructure attacks, etc.
- Violence: Kidnapping, murder, assault, etc.
Example Patterns:
- "How to manufacture illegal drugs"
- "Selling counterfeit designer products"
- "Instructions for tax evasion schemes"
- "Where to buy stolen credit card information"
- "How to hack into someone's account"
Toxicity
Definition: Content containing hostile, aggressive, disrespectful, or harmful language that creates negative user experiences or unsafe environments.
Characteristics:
- Hate Speech: Targeting individuals or groups based on protected characteristics
- Harassment: Bullying, stalking, intimidation, or persistent unwanted contact
- Threats: Direct or implied threats of violence or harm
- Discrimination: Content promoting prejudice or unfair treatment
- Profanity: Excessive or inappropriate use of offensive language
- Personal Attacks: Ad hominem attacks, doxxing, character assassination
Example Patterns:
- Explicit threats: "I'm going to hurt you"
- Hate speech: Slurs and derogatory language targeting protected groups
- Harassment: "You're worthless and should disappear"
- Discrimination: Content promoting stereotypes or exclusion
Code Requested
Definition: User requests for code generation, programming assistance, or software development help, which may require special handling for security and policy compliance.
General Request Types:
- General Programming: Algorithm implementation, syntax help, debugging
- Web Development: Frontend, backend, database integration
- Security Code: Cryptography, authentication, security tools
- System Administration: Scripts, automation, configuration
- Data Processing: Analytics, machine learning, data manipulation
- Integration Code: APIs, webhooks, third-party services
Risk Indicators:
- Requests for potentially harmful code
- Bypass or circumvention techniques
- Malicious functionality descriptions
- Unauthorized access methods
- Privacy violation tools
Example Patterns:
- "Write a Python script to..."
- "Help me debug this JavaScript function"
- "Create a SQL query for..."
- "Generate code to automate..."
- "Show me how to implement..."
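A simple keyword heuristic can illustrate how risk indicators in code requests might be flagged. This is only a sketch: the pattern list below is hypothetical, and real signal detection would rely on a trained classifier rather than fixed regexes.

```python
import re

# Hypothetical risk indicators for illustration only; a production
# system would use a trained model, not a hand-written keyword list.
RISK_PATTERNS = [
    r"\bbypass\b",                              # circumvention techniques
    r"\bkeylogger\b",                           # privacy violation tools
    r"\bwithout (?:permission|authorization)\b" # unauthorized access
]

def flag_code_request(text: str) -> bool:
    """Return True if a code request matches any risk indicator."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in RISK_PATTERNS)
```

For example, "Write a keylogger in Python" would be flagged, while "Help me debug this JavaScript function" would not.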
Code Present
Definition: Detection of programming code, scripts, or technical instructions within user input or system output that may require review for safety and policy compliance.
Code Types:
- Source Code: Programming languages (Python, JavaScript, Java, etc.)
- Markup Languages: HTML, CSS
- Query Languages: SQL, database queries
- Others: Haskell, Swift, R, Objective-C
Example Patterns:
- Code blocks with syntax highlighting markers
- Function definitions and class declarations
- Import statements and library references
- Variable assignments and data structures
- Control flow statements (if, for, while)
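The patterns above can be approximated with a small heuristic detector. The marker list below is an illustrative assumption, not Aiceberg's actual detection logic, which would typically use a trained language-identification model.

```python
import re

# Illustrative code markers only, mirroring the patterns listed above.
CODE_MARKERS = [
    r"```",                        # fenced code block markers
    r"\bdef \w+\s*\(",             # function definitions
    r"\bclass \w+",                # class declarations
    r"\b(?:import|from)\s+\w+",    # import statements / library references
    r"\b(?:if|for|while)\s*\(",    # control flow statements
    r"\bSELECT\b.+\bFROM\b",       # SQL queries
]

def contains_code(text: str) -> bool:
    """Heuristically report whether text appears to contain code."""
    return any(re.search(p, text, re.IGNORECASE) for p in CODE_MARKERS)
```

A message like "def add(a, b): return a + b" would be flagged as containing code, while ordinary conversational text would not.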
Code Vulnerability (future feature)
Definition: Code containing security flaws, weaknesses, or implementation errors that could be exploited to compromise systems, data, or user safety.
General Vulnerability Categories:
- Injection Flaws: SQL injection, command injection, XSS
- Authentication Issues: Weak passwords, session management flaws
- Access Control Problems: Privilege escalation, unauthorized access
- Cryptographic Weaknesses: Poor encryption, key management issues
- Input Validation Failures: Buffer overflows, format string bugs
- Configuration Errors: Insecure defaults, exposed credentials
Common Vulnerability Patterns:
- SQL Injection: Unsanitized database queries
- Cross-Site Scripting (XSS): Unescaped user input in web pages
- Buffer Overflow: Memory management errors
- Hard-coded Credentials: Passwords or keys in source code
- Insecure Cryptography: Weak algorithms or implementations
- Race Conditions: Timing-dependent security flaws
Example Patterns:
- query = "SELECT * FROM users WHERE id = " + user_input (SQL injection)
- eval(user_provided_code) (arbitrary code execution)
- system(command_from_user) (command injection)
- password = "hardcoded_password" (hard-coded credentials)
- if (user.isAdmin = true) (assignment instead of comparison)
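The SQL injection pattern above can be demonstrated alongside its standard fix. This sketch uses an in-memory SQLite database with a hypothetical users table; the key point is that a parameterized query treats user input as data rather than as SQL.

```python
import sqlite3

# Hypothetical table for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "1 OR 1=1"  # attacker-controlled value

# Vulnerable: string concatenation lets the input rewrite the query,
# so the WHERE clause matches every row.
vulnerable = conn.execute(
    "SELECT name FROM users WHERE id = " + user_input
).fetchall()

# Safe: the ? placeholder binds the input as a literal value, so the
# injected "OR 1=1" never becomes part of the SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE id = ?", (user_input,)
).fetchall()
```

Here the vulnerable query returns rows despite the bogus id, while the parameterized query matches nothing, because no id equals the literal string "1 OR 1=1".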