There are different types of Natural Language Processing (NLP) tasks, each focusing on a specific aspect of language understanding and text processing. Some of the major types include Text Classification and Sentiment Analysis, Named Entity Recognition (NER), Machine Translation, and Speech Recognition, among others.
In this post, I’ll describe the following topics: (1) Regular Expressions & Word Tokenization, (2) Bag-of-Words, (3) Simple Text Preprocessing, (4) Gensim, (5) Named Entity Recognition, and (6) spaCy.
1.0 Regular Expressions & Word Tokenization
1.1 Regular Expressions
Regular expressions, often referred to as regex or regexp, are patterns that define search criteria for text. In the Python examples below, I illustrate a few ways to use regex with Python’s “re” library.
import re
################################################
# Example 1: Matching a Date (MM/DD/YYYY)
################################################
date_pattern = r'\d{2}/\d{2}/\d{4}'
text = "Today's date is 09/27/2023. Please remember to submit your report by 10/15/2023."
dates = re.findall(date_pattern, text)
print("Dates:", dates)
----------------- OUTPUT -----------------------
Dates: ['09/27/2023', '10/15/2023']
------------------------------------------------
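If you also need the month, day, and year as separate fields, capture groups handle that. The short sketch below reuses the same text variable; the grouped_pattern name is just an illustrative choice.
# Capture month, day, and year as separate groups
grouped_pattern = r'(\d{2})/(\d{2})/(\d{4})'
match = re.search(grouped_pattern, text)
if match:
    print("Month:", match.group(1), "Day:", match.group(2), "Year:", match.group(3))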
################################################
# Example 2: Matching Email Addresses
################################################
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
# Note: the addresses below are placeholder examples
text = "Contact us at support@example.com or sales@example.org"
emails = re.findall(email_pattern, text, re.IGNORECASE)
print("Emails:", emails)
----------------- OUTPUT -----------------------
Emails: ['support@example.com', 'sales@example.org']
------------------------------------------------
################################################
# Example 3: Matching Phone Numbers (XXX-XXX-XXXX)
################################################
phone_pattern = r'\d{3}-\d{3}-\d{4}'
text = "For customer service, call 555-123-4567 or 888-555-7890."
phone_numbers = re.findall(phone_pattern, text)
print("Phone Numbers:", phone_numbers)
----------------- OUTPUT -----------------------
Phone Numbers: ['555-123-4567', '888-555-7890']
------------------------------------------------
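Besides re.findall, the re module also provides re.split, which is handy for quick, regex-based tokenization and leads naturally into the next subsection. A minimal sketch (the sentence is made up for illustration):
# Split a sentence on one or more whitespace characters (a crude word tokenizer)
sentence = "Regular expressions can split   text into rough tokens."
print(re.split(r'\s+', sentence))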
1.2 Tokenization
Tokenization is a fundamental process in Natural Language Processing (NLP) that involves breaking down a text into individual units or “tokens.” These tokens are typically words, phrases, or symbols, and tokenization is an essential step for various NLP tasks.
The Python example below shows how to tokenize text into words (example #1) and into sentences (example #2).
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
################################################
# Example 1: Tokenize into words
################################################
text = "Tokenization is an important NLP task. It helps break text into tokens."
words = word_tokenize(text)
print("Words:", words)
----------------- OUTPUT -----------------------
Words: ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.', 'It', 'helps', 'break', 'text', 'into', 'tokens', '.']
------------------------------------------------
################################################
# Example 2: Tokenize into sentences
################################################
text = "Tokenization is an important NLP task. It helps break text into tokens."
sentences = sent_tokenize(text)
print("Sentences:", sentences)
----------------- OUTPUT -----------------------
Sentences: ['Tokenization is an important NLP task.', 'It helps break text into tokens.']
------------------------------------------------
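NLTK also lets you combine the two ideas by tokenizing with a custom regular expression via regexp_tokenize. The pattern below, which keeps hashtags intact as single tokens, is just an illustrative choice:
from nltk.tokenize import regexp_tokenize
# Tokenize with a custom pattern so '#NLTK' and '#NLP' stay single tokens
tweet = "Tokenization with #NLTK is flexible #NLP"
print(regexp_tokenize(tweet, r'#\w+|\w+'))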
2.0 Bag-of-Words
In natural language processing (NLP), the “bag-of-words” (BoW) model is a simple method for finding topics in text. The name “bag-of-words” comes from the fact that it doesn’t consider the order of words in a document; instead, it treats text as an unordered “bag” of words. Bag-of-words can be useful for determining the most significant words in a text, based on how often they are used.
The Python code below shows how to extract the number of times a word appears in the text, as well as the top three words based on the number of appearances.
from nltk.tokenize import word_tokenize
from collections import Counter
################################################
# Example 1: Bag-of-word
################################################
# Create text
text = """In natural language processing (NLP), the "bag-of-words" (BoW) model is a simple method for
finding topics in text. The name "bag-of-words" comes from the fact that it doesn't consider
the order of words in a document; instead, it treats text as an unordered "bag" of words"""
# Count number of times a word appears
word_counts = Counter(word_tokenize(text))
print(word_counts)
----------------- OUTPUT -----------------------
Counter({'the': 3, '``': 3, "''": 3, '(': 2, ') ...
------------------------------------------------
# Print top 3 words
word_counts.most_common(3)
----------------- OUTPUT -----------------------
[('the', 3), ('``', 3), ("''", 3)]
------------------------------------------------
As seen in the example above, the top three tokens are “the” and the two quote tokens (`` and '') that NLTK’s tokenizer produces for opening and closing double quotes. It is important to note that, by default, bag-of-words treats words in a case-sensitive manner, which means it distinguishes between uppercase and lowercase letters. If the same word appears in both uppercase and lowercase forms in the text, the two forms are counted as separate words.
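If case-insensitive counts are what you want, a simple workaround is to lowercase the text before tokenizing. A minimal sketch, reusing the text variable and imports from the snippet above:
# Lowercase first so "The" and "the" are counted as the same token
word_counts_lower = Counter(word_tokenize(text.lower()))
print(word_counts_lower.most_common(3))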
3.0 Simple Text Preprocessing
Simple text preprocessing in natural language processing (NLP) involves a series of basic and essential steps to clean and prepare text data for analysis or machine learning tasks. These preprocessing steps help reduce noise, improve the quality of the data, and make it more amenable to downstream NLP tasks. Common steps include lowercasing, tokenization, removing punctuation, removing special characters and numbers, removing stopwords, and more.
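As a quick illustration of the “removing special characters and numbers” step (lowercasing and stopword removal are covered next), here is a minimal regex-based sketch on a made-up string:
import re
# Strip everything that is not a letter or whitespace
raw = "Text with numbers (42), symbols #$%, and punctuation!"
print(re.sub(r'[^A-Za-z\s]', '', raw))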
3.1 Lowercasing & Remove Stopwords
The Python code below shows how to lowercase text and remove stopwords such as “a”, “the”, “in”, “on”, and “at”.
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
# Create text
text = """In natural language processing (NLP), the "bag-of-words" (BoW) model is a simple method for
finding topics in text. The name "bag-of-words" comes from the fact that it doesn't consider
the order of words in a document; instead, it treats text as an unordered "bag" of words"""
# Lowercase, tokenize, and keep only alphabetic tokens (drops punctuation and numbers)
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Remove stop words
no_stops = [t for t in tokens if t not in stopwords.words('english')]
# Print top 3 words after stopword removal
print(Counter(no_stops).most_common(3))
----------------- OUTPUT -----------------------
[('text', 2), ('words', 2), ('natural', 1)]
------------------------------------------------
Under the Python code for bag-of-words (section 2.0), we saw that the top three tokens were “the” and the two quote tokens, each appearing three times. The Python code above removes the stopwords and punctuation, and the top three words are now [('text', 2), ('words', 2), ('natural', 1)].
4.0 Gensim
Gensim is an open-source Python library for natural language processing (NLP) and machine learning. It is particularly known for its efficient and scalable implementations of various algorithms related to text analysis, with a focus on topics like topic modeling, document similarity analysis, and word embeddings.
4.1 Creating a Gensim Corpus
A Gensim corpus is built from two main components:
- Dictionary: This is an object that maps words to unique integer IDs. Each word in the corpus is assigned a unique ID.
- Corpus: The corpus itself is a collection of documents where each document is represented as a list of (word ID, word frequency) tuples.
The Python code below shows the unique ID assigned to every word, as well as the (word ID, word frequency) tuples for each document.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.corpora.dictionary import Dictionary
# Create text
text = """In natural language processing (NLP), the "bag-of-words" (BoW) model is a simple method for
finding topics in text. The name "bag-of-words" comes from the fact that it doesn't consider
the order of words in a document; instead, it treats text as an unordered "bag" of words"""
# Split into sentences, then lowercase and tokenize each one
tokens = [word_tokenize(sentence.lower()) for sentence in text.split('.')]
# Remove stopwords from each token list within tokens
filtered_tokens = [[word for word in token_list if word not in stopwords.words('english')] for token_list in tokens]
# Create a Gensim Dictionary
dictionary = Dictionary(filtered_tokens)
# Create ID for every word
dictionary.token2id
----------------- OUTPUT -----------------------
{"''": 0,
'(': 1,
')': 2,
',': 3,
'``': 4,
'bag-of-words': 5,
'bow': 6,
'finding': 7,
'language': 8,
'method': 9,
'model': 10,
'natural': 11,
'nlp': 12,
...
'unordered': 28,
'words': 29}
------------------------------------------------
# Create Gensim Corpus using doc2bow
corpus = [dictionary.doc2bow(doc) for doc in filtered_tokens]
# Print Output
corpus
----------------- OUTPUT -----------------------
[[(0, 1), (1, 2), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1)], [(0, 2), (3, 1), (4, 2), (5, 1), (15, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2)]]
------------------------------------------------
As seen under the first output, all words are assigned a unique ID. For instance, “words” is assigned to ID #29.
The second output shows a list of documents (sentences in this case), where each document (sentence) is a list of (word ID, word frequency) tuples. You will see that the last tuple is (29,2), where 29 is the unique ID of “words”, and 2 is the frequency of “words”.
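To double-check the mapping in both directions, you can look tokens and IDs up directly on the dictionary object created above; a quick sanity check:
# Look up the ID for a token, and the token for an ID
print(dictionary.token2id['words'])  # 29
print(dictionary[29])                # 'words'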
4.2 TF-IDF within Gensim
In Gensim, TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used to determine the most important words in each document in the corpus. The idea behind TF-IDF is to down-weight words that are common across the whole corpus so they don’t dominate the keywords, as explained below.
- Term Frequency (TF): TF measures the frequency of a term (word) within a document.
- Inverse Document Frequency (IDF): IDF measures the importance of a term across a collection of documents. It assigns a weight to the term based on how often it appears in the entire corpus. Terms that appear in many documents receive a lower IDF score, while terms that are rare receive a higher IDF score.
- Term Frequency-Inverse Document Frequency (TF-IDF): Combines both TF and IDF to assign a weight to each term within a document relative to its importance in the entire corpus.
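Putting these together, the standard formulation is tfidf(t, d) = tf(t, d) × idf(t), with idf(t) = log(N / df(t)), where tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents. As far as I can tell, Gensim’s TfidfModel uses a base-2 logarithm and normalizes each document’s weight vector to unit length by default, which is worth keeping in mind when reading the output below.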
The Python code below shows how to compute TF-IDF weights, using the text above, for the second sentence:
from gensim.models.tfidfmodel import TfidfModel
# Create a TF-IDF model from a given corpus
tfidf = TfidfModel(corpus)
# Show TF-IDF for second sentence
tfidf[corpus[1]]
----------------- OUTPUT -----------------------
[(17, 0.25),
(18, 0.25),
(19, 0.25),
(20, 0.25),
(21, 0.25),
(22, 0.25),
(23, 0.25),
(24, 0.25),
(25, 0.25),
(26, 0.25),
(27, 0.25),
(28, 0.25),
(29, 0.5)]
------------------------------------------------
The output above suggests that term 29 (“words”) has a higher importance or uniqueness in the context of the document compared to the other terms, as it has a TF-IDF weight of 0.5. The other terms have lower TF-IDF weights, indicating they are less significant in this document.
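To see where these numbers come from (assuming Gensim’s defaults of a base-2 logarithm and unit-length normalization, as noted above): terms that occur in both sentences, such as “bag-of-words” and the quote tokens, get idf = log2(2/2) = 0 and therefore disappear from the output. The thirteen terms that occur only in the second sentence get idf = log2(2/1) = 1, so their raw weights equal their frequencies: 2 for “words” and 1 for each of the other twelve. The length of that vector is sqrt(12 × 1² + 2²) = 4, and dividing by it gives 1/4 = 0.25 for the single-occurrence terms and 2/4 = 0.5 for “words”.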
5.0 Named Entity Recognition
Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that focuses on identifying and classifying named entities (specific words or phrases that represent entities with proper names) within a text. Named entities are typically real-world objects such as names of people, organizations, locations, dates, numerical values, and more. NER is important in NLP because it helps in extracting structured information from unstructured text.
The Python code below shows how the “nltk” library can be used to extract specific words that represent entities.
import nltk
# Download the NLTK resources needed for tokenizing, tagging, and chunking (if not already present)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Create Sentence
sentence = """In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl"""
# Tokenization: Split the sentence into individual words or tokens
tokenized_sent = nltk.word_tokenize(sentence)
# Part-of-Speech Tagging: Assign grammatical tags to each word
tagged_sent = nltk.pos_tag(tokenized_sent)
# Named Entity Recognition (NER): Identify and classify named entities
ner_tree = nltk.ne_chunk(tagged_sent)
# Print NER Results
print(ner_tree)
----------------- OUTPUT -----------------------
(S
In/IN
(GPE New/NNP York/NNP)
,/,
I/PRP
like/VBP
to/TO
ride/VB
the/DT
(ORGANIZATION Metro/NNP)
to/TO
visit/VB
(ORGANIZATION MOMA/NNP)
and/CC
some/DT
restaurants/NNS
rated/VBN
well/RB
by/IN
(PERSON Ruth/NNP Reichl/NNP))
------------------------------------------------
As seen in the output above, the code detected New York as a GPE (geopolitical entity), Metro and MOMA as organizations, and Ruth Reichl as a person. It does this without a knowledge base such as Wikipedia; instead, it relies on a trained statistical model and the part-of-speech tags assigned in the previous step.
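If you only need the entities themselves rather than the full parse tree, a common pattern is to walk the tree and keep the chunks that carry an entity label. A minimal sketch, reusing ner_tree from above:
# Collect (label, text) pairs for each labeled chunk in the NER tree
entities = []
for chunk in ner_tree:
    if hasattr(chunk, 'label'):
        entities.append((chunk.label(), ' '.join(token for token, tag in chunk.leaves())))
print(entities)
For the sentence above, this should yield the same four entities shown in the tree: New York, Metro, MOMA, and Ruth Reichl.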
6.0 spaCy
spaCy is an NLP library similar to Gensim but with different implementations, including a particular focus on creating NLP pipelines to generate models and corpora. spaCy is open source and ships with several extras, including displaCy, a visualization tool that renders parse trees and entity annotations as interactive text.
The Python code below shows how spaCy extracts named entities from a sentence: it identifies Denmark as a geopolitical entity and Mette Frederiksen as a person, while labeling Copenhagen as an organization.
import spacy
# Load a small English pipeline (assumes the en_core_web_sm model has been downloaded)
nlp = spacy.load("en_core_web_sm")
# Create Document
doc = nlp("""Copenhagen is the capital of Denmark; and the Prime Minister is Mette Frederiksen""")
# Print Entities from Document
for ent in doc.ents:
    print(ent.label_, ent.text)
----------------- OUTPUT -----------------------
ORG Copenhagen
GPE Denmark
PERSON Mette Frederiksen
------------------------------------------------
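Since displaCy was mentioned above, here is a minimal sketch of how to visualize the entities found in doc (in a Jupyter notebook; outside a notebook, displacy.serve starts a small local web server instead):
from spacy import displacy
# Highlight the named entities in the document
displacy.render(doc, style="ent")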