Text normalization is the process of transforming input text into a standard, consistent format before further processing (such as tokenization, search, or machine learning).
Some important aspects when normalizing text:
- Punctuation Handling
- Detect sentence boundaries (. ? ! …) vs. other uses of punctuation.
- Example:
- “I live on Alley St.” → St. = Street (not sentence end).
- “Dr. Smith said hi.” → Dr. = Doctor, not end of sentence.
- Numbers, Dates, and Quantities
- Convert digits to words (rule-based or ML-based).
- Example:
- 1975 → “nineteen seventy-five” (if a year) or “one thousand nine hundred seventy-five” (if a number).
- 25 → “twenty-five”.
- 25th → “twenty-fifth”.
- Abbreviations and Acronyms
- Expand shortened forms to full words.
- Example:
- “Dr.” → “Doctor”
- “USA” → “United States of America”
- Homographs (same spelling, different pronunciation/meaning)
- Need contextual disambiguation.
- Example:
- “I live in the USA” → pronounced /lɪv/
- “Fresh live animals” → pronounced /laɪv/
- Case Normalization
- Lowercase text for most NLP tasks; for TTS, keep the original casing, since it may signal emphasis or an acronym.
- Example: “NASA” should stay uppercase because it is an acronym.
- Special Symbols & Units
- $50 → “fifty dollars”
- 70 km/h → “seventy kilometers per hour”
- Handling Non-standard Words (NSWs)
- URLs: www.google.com → “double u double u double u dot google dot com”.
- Emojis: 🙂 → “smiley face”.
- Hashtags: #Google, #DeepLearning, #MachineLearning
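The year/number/ordinal examples under "Numbers, Dates, and Quantities" can be sketched as a small rule-based verbalizer. This is only a minimal illustration: the function names (`cardinal`, `year`, `ordinal`, `verbalize`) and the `as_year` flag are hypothetical, the range is capped at 999,999, and deciding whether a token is a year is left to the caller, since that decision needs context.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def cardinal(n: int) -> str:
    """Spell out 0 <= n < 1,000,000 (American style, no 'and')."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")


def year(n: int) -> str:
    """Read a four-digit year as two pairs: 1975 -> 'nineteen seventy-five'."""
    high, low = divmod(n, 100)
    if not 1000 <= n <= 9999 or high % 10 == 0:
        return cardinal(n)  # e.g. 2005 -> 'two thousand five'
    if low == 0:
        return cardinal(high) + " hundred"
    if low < 10:
        return cardinal(high) + " oh " + ONES[low]
    return cardinal(high) + " " + cardinal(low)


def ordinal(n: int) -> str:
    """Turn the last word of the cardinal into its ordinal form."""
    words = cardinal(n)
    match = re.search(r"[a-z]+$", words)
    prefix, last = words[:match.start()], match.group()
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"  # twenty -> twentieth
    else:
        last += "th"
    return prefix + last


def verbalize(token: str, as_year: bool = False) -> str:
    """Verbalize a digit token; non-numeric tokens are returned unchanged."""
    match = re.fullmatch(r"(\d+)(st|nd|rd|th)?", token)
    if not match:
        return token
    n = int(match.group(1))
    if match.group(2):
        return ordinal(n)
    return year(n) if as_year else cardinal(n)
```

For instance, `verbalize("1975", as_year=True)` and `verbalize("1975")` produce the two readings listed above, and `verbalize("25th")` handles the ordinal case.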
Algorithms/Methods Used
- Rule-based systems: regex, grammar rules, lookup tables.
- Finite State Transducers (FSTs): efficient for large-scale normalization (used in Google TTS).
- ML/Deep Learning models: seq2seq or transformers for contextual disambiguation (e.g., telling “live” /lɪv/ from “live” /laɪv/).
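As an illustration of the rule-based approach (regex plus lookup tables), the sketch below expands a few abbreviations, currency amounts, and units. The table contents and the `expand` function are assumptions made for this example, not a standard resource. Digits are deliberately left as digits so that a separate number pass can verbalize them, and singular/plural agreement ("1 dollars") is not handled.

```python
import re

# Illustrative lookup tables; a real system would use much larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "USA": "United States of America"}
UNITS = {"km/h": "kilometers per hour", "km": "kilometers", "kg": "kilograms"}


def _alternation(keys):
    """Build a regex alternation, longest key first so 'km/h' beats 'km'."""
    return "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))


_CURRENCY_RE = re.compile(r"\$(\d+(?:\.\d+)?)")
_UNIT_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(" + _alternation(UNITS) + r")(?![A-Za-z/])"
)
_ABBR_RE = re.compile(r"(?<!\w)(" + _alternation(ABBREVIATIONS) + r")(?!\w)")


def expand(text: str) -> str:
    """Apply currency, unit, and abbreviation rules in that order."""
    # "$50" -> "50 dollars": the symbol moves after the amount.
    text = _CURRENCY_RE.sub(lambda m: f"{m.group(1)} dollars", text)
    # "70 km/h" -> "70 kilometers per hour"
    text = _UNIT_RE.sub(lambda m: f"{m.group(1)} {UNITS[m.group(2)]}", text)
    # "Dr." -> "Doctor", "USA" -> "United States of America"
    text = _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)
    return text


print(expand("Dr. Smith paid $50 and drove at 70 km/h in the USA."))
```

A lookup table cannot resolve ambiguous forms on its own, which is where the FST and ML approaches listed above come in.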
Example: “Hello!! How are you today, Dr. Manner from the 123 Hospital?”
- Punctuation
- “Hello!!” → The exclamation marks indicate emphasis / strong emotion.
- Sentence Segmentation
- “How are you today, Dr. Manner” → Split into “How are you today” + “Doctor Manner…”; the comma indicates a pause in speech.
- Abbreviation
- “Dr.” → Abbreviation for “Doctor”
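The sentence-level part of this example can be sketched with an abbreviation-aware splitter. This is a minimal sketch under stated assumptions: the abbreviation set and the `split_sentences` function are hypothetical, and only sentence boundaries are handled here; comma pauses are left for a later prosody step.

```python
import re

# Assumed, illustrative list of abbreviations that end in a period.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st."}

# Candidate boundary: a run of . ? ! … followed by whitespace.
_BOUNDARY_RE = re.compile(r"[.?!…]+(?=\s)")


def split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, start = [], 0
    for match in _BOUNDARY_RE.finditer(text):
        # Last whitespace-delimited token, including its punctuation.
        token = text[start:match.end()].split()[-1].lower()
        if token in ABBREVIATIONS:
            continue  # "Dr." is not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever remains is the final sentence
    return sentences


example = "Hello!! How are you today, Dr. Manner from the 123 Hospital?"
for sentence in split_sentences(example):
    print(sentence)
```

The period after “Dr.” is skipped, so the example yields two sentences: “Hello!!” and the question. A fixed list like this still fails when an abbreviation really does end a sentence (for example “St.” followed by a new sentence), which is the ambiguity that contextual models are meant to resolve.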