Word Tokenization is the step that comes right after (or, in some pipelines, right before) Text Normalization. After the text has been normalized into a standard format (abbreviations converted, numbers expanded, text lowercased, and punctuation, dates, money, and hashtags handled), the next step is to split the text into individual tokens, typically words or meaningful subunits. This process is known as tokenization.
Word Tokenization with Python:
import nltk
from nltk import word_tokenize

nltk.download('punkt')  # tokenizer models; only needed on the first run

text = "Hello!! How are you today, Dr. Manner from the 123 Hospital? I live on Alley St. since 1975!"
tokens = word_tokenize(text)
print("Tokens:", tokens)
Output:
Tokens: ['Hello', '!', '!', 'How', 'are', 'you', 'today', ',', 'Dr', '.', 'Manner', 'from', 'the', '123', 'Hospital', '?', 'I', 'live', 'on', 'Alley', 'St.', 'since', '1975', '!']
Python's built-in split() method can also be used to break a string on whitespace, but it does not separate the punctuation from the words:
text = "Hello!! How are you today, Dr. Manner from the 123 Hospital? I live on Alley St. since 1975!"
words = text.split()
print(words)
Output:
['Hello!!', 'How', 'are', 'you', 'today,', 'Dr.', 'Manner',
'from', 'the', '123', 'Hospital?', 'I', 'live', 'on', 'Alley',
'St.', 'since', '1975!']
Some useful libraries to help with text normalization are contractions, word2number, and num2words:
contractions: To expand short forms (e.g., “I’m” → “I am”).
word2number: To convert numbers written in words to their numeric representation.
num2words: To convert numbers (e.g., 123) to words (e.g., "one hundred and twenty-three").
Install all these libraries in one go:
pip install nltk contractions word2number num2words
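A minimal sketch of how these three libraries are typically used (the example strings are illustrative, and the exact wording returned by num2words can vary slightly by version):
import contractions
from word2number import w2n
from num2words import num2words

# Expand contractions: "I'm" -> "I am"
print(contractions.fix("I'm at the hospital"))
# Output: I am at the hospital

# Convert number words to digits
print(w2n.word_to_num("one hundred twenty three"))
# Output: 123

# Convert digits to number words
print(num2words(123))
# Output: one hundred and twenty-three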
Code 1: