Word tokenization comes right after (or, in some pipelines, before) text normalization. Once the text has been normalized into a standard format (abbreviations converted, numbers expanded, case lowered, punctuation, dates, money amounts, and hashtags handled), the next step is to split it into individual tokens, typically words or meaningful subunits. This splitting step is what we call tokenization.

Word Tokenization with Python:

import nltk
from nltk import word_tokenize

nltk.download("punkt")  # word_tokenize needs the Punkt tokenizer data; run once if not already installed

text = "Hello!! How are you today, Dr. Manner from the 123 Hospital? I live on Alley St. since 1975!"
tokens = word_tokenize(text)  # split the sentence into word and punctuation tokens
print("Tokens:", tokens)

Output:

Tokens: ['Hello', '!', '!', 'How', 'are', 'you', 'today', ',', 
'Dr', '.', 'Manner', 'from', 'the', '123', 'Hospital', '?', 'I', 'live', 'on', 'Alley', 
'St.', 'since', '1975', '!']

For comparison, here is what naive whitespace splitting with Python's built-in str.split() produces. Notice that punctuation stays glued to the words:

text = "Hello!! How are you today, Dr. Manner from the 123 Hospital? I live on Alley St. since 1975!"
words = text.split()
print(words)

Output:

['Hello!!', 'How', 'are', 'you', 'today,', 'Dr.', 'Manner', 
'from', 'the', '123', 'Hospital?', 'I', 'live', 'on', 'Alley', 
'St.', 'since', '1975!']
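
If you need something closer to word_tokenize without pulling in NLTK, a small regular-expression tokenizer can get you part of the way there. The pattern below is only a rough sketch I am assuming for illustration (keeping abbreviations such as "Dr." and "St." as single tokens is a deliberate simplification, and the TOKEN_PATTERN name is mine, not from any library):

import re

# Rough sketch of a regex tokenizer:
#  - a word immediately followed by a period (e.g. "Dr.", "St.") stays one token
#    (note: this also glues sentence-final periods to the last word)
#  - other runs of word characters or digits become tokens
#  - remaining punctuation marks become single-character tokens
TOKEN_PATTERN = re.compile(r"[A-Za-z]+\.(?=\s|$)|\w+|[^\w\s]")

text = "Hello!! How are you today, Dr. Manner from the 123 Hospital? I live on Alley St. since 1975!"
print(TOKEN_PATTERN.findall(text))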

Some useful libraries for text normalization and tokenization (nltk, contractions, word2number, num2words) can be installed in one go:

pip install nltk contractions word2number num2words
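
To tie these libraries back to the normalization step, here is a minimal sketch of how they might be combined before tokenizing. The normalize() helper and the example sentence are my own illustration, not code from this article:

import contractions
from num2words import num2words
from word2number import w2n
from nltk import word_tokenize

def normalize(text):
    # Rough normalization sketch: expand contractions, spell out digits, lowercase.
    text = contractions.fix(text)               # "I'm" -> "I am"
    words = []
    for word in text.split():
        if word.isdigit():
            words.append(num2words(int(word)))  # digit strings -> spelled-out words
        else:
            words.append(word)
    return " ".join(words).lower()

sentence = "I'm visiting the 123 Hospital since 1975!"
print(word_tokenize(normalize(sentence)))

# word2number goes the other way, turning spelled-out numbers back into digits:
print(w2n.word_to_num("one hundred twenty three"))  # -> 123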

Code 1: