What are Content signals?
Overview
Content signals represent a category of safety and policy enforcement markers that identify potentially harmful, inappropriate, or policy-violating material in user communications and system outputs. These signals are critical for maintaining platform safety, legal compliance, and user protection. Effective content signal detection enables proactive content moderation, risk mitigation, and adherence to community standards and regulatory requirements.
⚠️ Content Warning
This documentation may contain examples of potentially offensive or harmful content used to demonstrate Aiceberg's Signal detection. These examples are included for technical education purposes only and do not reflect our organization's values. For more information, see the full Content Warning.
Illegality
Definition: Content that violates laws, regulations, or legal standards across various jurisdictions, potentially exposing the platform or users to legal liability.
Characteristics:
- References to illegal activities or services
- Instructions for unlawful behavior
- Content that violates intellectual property rights
- Material that contravenes local, national, or international law
- Regulatory compliance violations
Illegality Categories:
- Cybercrimes: Malware, phishing, hacking, etc.
- Drug-related: Trafficking, manufacturing, paraphernalia, etc.
- Fraud & Financial Crimes: Money laundering, tax evasion, securities fraud, etc.
- Miscellaneous: Bribery, gambling, arms dealing, etc.
- Property: Burglary, arson, embezzlement, etc.
- Sex: Trafficking, abuse, obscene materials, etc.
- Terrorism & National Security: Espionage, terrorism, infrastructure attacks, etc.
- Violence: Kidnapping, murder, assault, etc.
Example Patterns:
- "How to manufacture illegal drugs"
- "Selling counterfeit designer products"
- "Instructions for tax evasion schemes"
- "Where to buy stolen credit card information"
- "How to hack into someone's account"
Toxicity
Definition: Content containing hostile, aggressive, disrespectful, or harmful language that creates negative user experiences or unsafe environments.
Characteristics:
- Hate Speech: Targeting individuals or groups based on protected characteristics
- Harassment: Bullying, stalking, intimidation, or persistent unwanted contact
- Threats: Direct or implied threats of violence or harm
- Discrimination: Content promoting prejudice or unfair treatment
- Profanity: Excessive or inappropriate use of offensive language
- Personal Attacks: Ad hominem attacks, doxxing, character assassination
Example Patterns:
- Explicit threats: "I'm going to hurt you"
- Hate speech: Slurs and derogatory language targeting protected groups
- Harassment: "You're worthless and should disappear"
- Discrimination: Content promoting stereotypes or exclusion
Code Requested
Definition: User requests for code generation, programming assistance, or software development help, which may require special handling for security and policy compliance.
General Request Types:
- General Programming: Algorithm implementation, syntax help, debugging
- Web Development: Frontend, backend, database integration
- Security Code: Cryptography, authentication, security tools
- System Administration: Scripts, automation, configuration
- Data Processing: Analytics, machine learning, data manipulation
- Integration Code: APIs, webhooks, third-party services
Risk Indicators:
- Requests for potentially harmful code
- Bypass or circumvention techniques
- Malicious functionality descriptions
- Unauthorized access methods
- Privacy violation tools
Example Patterns:
- "Write a Python script to..."
- "Help me debug this JavaScript function"
- "Create a SQL query for..."
- "Generate code to automate..."
- "Show me how to implement..."
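A simple keyword heuristic can illustrate how risk indicators in code requests might be flagged. This is only a sketch: the pattern list below is hypothetical, and real signal detection would rely on a trained classifier rather than fixed regexes.

```python
import re

# Hypothetical risk indicators for illustration only; a production
# system would use a trained model, not a hand-written keyword list.
RISK_PATTERNS = [
    r"\bbypass\b",                              # circumvention techniques
    r"\bkeylogger\b",                           # privacy violation tools
    r"\bwithout (?:permission|authorization)\b" # unauthorized access
]

def flag_code_request(text: str) -> bool:
    """Return True if a code request matches any risk indicator."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in RISK_PATTERNS)
```

For example, "Write a keylogger in Python" would be flagged, while "Help me debug this JavaScript function" would not.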
Code Present
Definition: Detection of programming code, scripts, or technical instructions within user input or system output that may require review for safety and policy compliance.
Code Types:
- Source Code: Programming languages (Python, JavaScript, Java, etc.)
- Markup Languages: HTML, CSS
- Query Languages: SQL, database queries
- Others: Haskell, Swift, R, Objective-C
Example Patterns:
- Code blocks with syntax highlighting markers
- Function definitions and class declarations
- Import statements and library references
- Variable assignments and data structures
- Control flow statements (if, for, while)
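The patterns above can be approximated with a small heuristic detector. The marker list below is an illustrative assumption, not Aiceberg's actual detection logic, which would typically use a trained language-identification model.

```python
import re

# Illustrative code markers only, mirroring the patterns listed above.
CODE_MARKERS = [
    r"```",                        # fenced code block markers
    r"\bdef \w+\s*\(",             # function definitions
    r"\bclass \w+",                # class declarations
    r"\b(?:import|from)\s+\w+",    # import statements / library references
    r"\b(?:if|for|while)\s*\(",    # control flow statements
    r"\bSELECT\b.+\bFROM\b",       # SQL queries
]

def contains_code(text: str) -> bool:
    """Heuristically report whether text appears to contain code."""
    return any(re.search(p, text, re.IGNORECASE) for p in CODE_MARKERS)
```

A message like "def add(a, b): return a + b" would be flagged as containing code, while ordinary conversational text would not.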
Code Vulnerability (future feature)
Definition: Code containing security flaws, weaknesses, or implementation errors that could be exploited to compromise systems, data, or user safety.
General Vulnerability Categories:
- Injection Flaws: SQL injection, command injection, XSS
- Authentication Issues: Weak passwords, session management flaws
- Access Control Problems: Privilege escalation, unauthorized access
- Cryptographic Weaknesses: Poor encryption, key management issues
- Input Validation Failures: Buffer overflows, format string bugs
- Configuration Errors: Insecure defaults, exposed credentials
Common Vulnerability Patterns:
- SQL Injection: Unsanitized database queries
- Cross-Site Scripting (XSS): Unescaped user input in web pages
- Buffer Overflow: Memory management errors
- Hard-coded Credentials: Passwords or keys in source code
- Insecure Cryptography: Weak algorithms or implementations
- Race Conditions: Timing-dependent security flaws
Example Patterns:
- query = "SELECT * FROM users WHERE id = " + user_input (SQL injection)
- eval(user_provided_code) (arbitrary code execution)
- system(command_from_user) (command injection)
- password = "hardcoded_password" (hard-coded credentials)
- if (user.isAdmin = true) (assignment instead of comparison)
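The SQL injection pattern above can be demonstrated alongside its standard fix. This sketch uses an in-memory SQLite database with a hypothetical users table; the key point is that a parameterized query treats user input as data rather than as SQL.

```python
import sqlite3

# Hypothetical table for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "1 OR 1=1"  # attacker-controlled value

# Vulnerable: string concatenation lets the input rewrite the query,
# so the WHERE clause matches every row.
vulnerable = conn.execute(
    "SELECT name FROM users WHERE id = " + user_input
).fetchall()

# Safe: the ? placeholder binds the input as a literal value, so the
# injected "OR 1=1" never becomes part of the SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE id = ?", (user_input,)
).fetchall()
```

Here the vulnerable query returns rows despite the bogus id, while the parameterized query matches nothing, because no id equals the literal string "1 OR 1=1".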