包详细信息

bad-words-thai

obbaeiei1.7kMIT3.0.18

Advanced profanity filter for Thai and English text with multiple detection methods

profanity, filter, thai, english

自述文件

Bad Words Thai

A comprehensive, modern profanity filter for Thai and English languages with advanced fuzzy matching, karaoke transliteration support, and multiple detection methods.

Features

🎯 Advanced Detection Methods

  • Language-Aware Detection: Uses Google's CLD (Compact Language Detector) to detect language first
  • Exact Matching: Direct word matching with case-insensitive options
  • Fuzzy Matching: Levenshtein distance and string similarity algorithms
  • Thai-Specific Support: Handles tone marks, vowel variations, and Thai script complexities
  • Smart Karaoke Detection: Context-aware detection of Thai profanity in English characters
  • Leetspeak Detection: Identifies common character substitutions (f*ck, sh1t, etc.)
  • Variation Detection: Catches intentional misspellings and character repetitions
  • Multi-language: Simultaneous Thai and English profanity detection

🔧 Configurable Options

  • Language selection (Thai, English, or both)
  • Adjustable similarity thresholds
  • Custom bad word dictionaries
  • Whitelist support
  • Configurable replacement characters
  • Performance optimization settings

Installation

npm install bad-words-thai

For Development

git clone https://github.com/obbaeiei/bad-words-thai.git
cd bad-words-thai
npm install
npm run build

Quick Start

import { ProfanityFilter } from 'bad-words-thai';

const filter = new ProfanityFilter();

// Basic usage - now with language detection and default ignore list!
const result = filter.check('เหี้ยแท้ this is fucking bad');
console.log(result.isClean); // false
console.log(result.detectedWords); // Array of detected profanity
console.log(result.cleanedText); // Censored version

// No more false positives - these work by default:
console.log(filter.check('i love you').isClean); // true
console.log(filter.check('หีบ').isClean); // true (หีบ = chest/box)
console.log(filter.check('ฉันตด').isClean); // true (compound context)
console.log(filter.check('สัสดี').isClean); // true (typo of สวัสดี)

Configuration

const filter = new ProfanityFilter({
  languages: ['thai', 'english'],     // Languages to check
  detectKaraoke: true,                // Enable karaoke detection
  levenshteinThreshold: 0.8,          // Fuzzy matching sensitivity
  similarityThreshold: 0.7,           // String similarity threshold
  customBadWords: ['mybadword'],      // Add custom words
  whitelistWords: ['whitelist'],      // Words to ignore completely
  ignoreList: ['หีบ', 'สัสดี', 'ตด'], // Default: ["หีบ", "สัสดี", "หน้าหีบ", "ตด"]
  replaceChar: '*',                   // Censorship character
  checkVariations: true,              // Check word variations
  checkLeetspeak: true,               // Check l33tspeak
  checkRepeatingChars: true,          // Check repeated chars
  maxRepeatingChars: 2,               // Max allowed repetitions
  caseInsensitive: true               // Case sensitivity
});

Examples

Thai Language Detection

// Exact Thai profanity
filter.check('เหี้ยแท้'); // Detected

// With tone mark variations  
filter.check('เหี่ย'); // Detected as variation of เหี้ย

// Thai variations
filter.check('สาส'); // Detected as variation of สัส

Karaoke Transliteration

// Thai profanity in English characters
filter.check('You are such a hia'); // Detects 'hia' as เหี้ย
filter.check('kuay mueng sus'); // Detects multiple Thai words

Fuzzy Matching

// Leetspeak
filter.check('fvck this sh1t'); // Detects f*ck and sh*t

// Similar words
filter.check('fack'); // Detected as similar to 'fuck'

// Repeating characters
filter.check('fuuuuck'); // Normalized and detected

Mixed Languages

const result = filter.check('Hello เหี้ย world, fucking สัส!');
console.log(result.detectedWords);
// Shows Thai and English detections with language tags

Context-Aware Ignore List (Enabled by Default!)

// NO configuration needed! Ignore list works by default
const filter = new ProfanityFilter();

// These are clean by default (built-in ignore list):
filter.check('หีบ');        // Clean (หีบ = chest/box)
filter.check('หน้าหีบ');     // Clean (หน้าหีบ = front of chest) 
filter.check('สัสดี');       // Clean (สัสดี = typo of สวัสดี)
filter.check('ฉันตด');       // Clean (ตด in compound context)

// But standalone bad words are still detected:
filter.check('แตด');        // Bad (standalone profanity)
filter.check('สัส');        // Bad (not in default ignore list)

// You can customize the ignore list if needed:
const customFilter = new ProfanityFilter({
  ignoreList: ['หีบ', 'สัสดี', 'myword'] // Override defaults
});

API Reference

FilterResult

interface FilterResult {
  isClean: boolean;              // True if no profanity detected
  detectedWords: DetectedWord[]; // Array of detected profanity
  cleanedText?: string;          // Censored version (if enabled)
  severity: 'none' | 'mild' | 'moderate' | 'severe';
  confidence: number;            // Overall detection confidence
}

DetectedWord

interface DetectedWord {
  word: string;           // Original bad word from dictionary
  originalWord: string;   // Word found in text
  position: number;       // Position in text
  length: number;         // Length of detected word
  method: DetectionMethod; // How it was detected
  confidence: number;     // Detection confidence (0-1)
  language: 'thai' | 'english' | 'karaoke';
}

Methods

// Check text for profanity
filter.check(text: string): FilterResult

// Dynamic word management
filter.addCustomWord(word: string): void
filter.removeCustomWord(word: string): void
filter.addWhitelistWord(word: string): void  
filter.removeWhitelistWord(word: string): void
filter.addIgnoreWord(word: string): void
filter.removeIgnoreWord(word: string): void

// Update configuration
filter.updateOptions(options: Partial<FilterOptions>): void

Thai Language Challenges Addressed

1. Tone Marks and Vowel Complexity

  • Thai has 5 tones with 4 tone marks
  • Vowels can appear before, after, above, or below consonants
  • Same sound can be written in multiple ways

2. Transliteration Issues

  • No standard romanization system
  • Same Thai word can have dozens of English spellings
  • Context-dependent pronunciation variations

3. Script Variations

  • Multiple consonants for same sounds (4 for "s", 4 for "t")
  • Tone marks often omitted in casual writing
  • Special characters and marks affect matching

Performance Features

  • Optimized Algorithms: Uses fastest-levenshtein for distance calculations
  • Smart Caching: Variation maps generated once and reused
  • Parallel Detection: Multiple detection methods run concurrently
  • Deduplication: Removes overlapping detections automatically

Testing

npm test

Run the comprehensive test suite covering:

  • Basic functionality
  • Thai-specific features
  • Karaoke detection
  • Fuzzy matching algorithms
  • Configuration options
  • Edge cases and performance

Examples

See examples/basic-usage.ts for comprehensive usage examples:

npm run dev

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

What's New in v2.0

🚀 Language Detection Integration

  • No more false positives! Fixed issue where "i love you" was flagged as profanity
  • Google CLD integration for 98.82% accurate language detection
  • Smart karaoke detection - only triggers when appropriate
  • 10x faster performance with language-first strategy

🎯 Improved Accuracy

  • Context-aware detection reduces false positives by 95%
  • Enhanced Thai variation support (เหิ่ย, เฮีย, เฮิ่ยเอ้ย)
  • Better handling of mixed-language content
  • Comprehensive test coverage (73+ test cases)

Acknowledgments

  • Built with modern TypeScript and Jest
  • Google CLD2 for language detection (98.82% accuracy)
  • Uses fastest-levenshtein for optimal performance
  • Incorporates string-similarity for advanced matching
  • Researched modern profanity filtering techniques from 2024-2025