Basics of Natural Language Processing (NLP)

    There are different types of Natural Language Processing (NLP), each focusing on specific aspects of language understanding and text processing. Some of the major types include Text Classification and Sentiment Analysis, Named Entity Recognition (NER), Machine Translation, and Speech Recognition, among others.

    In this post, I’ll describe the following topics: (1) Regular Expressions & Word Tokenization, (2) Bag-of-Words, (3) Simple Text Preprocessing, (4) Gensim, (5) Named Entity Recognition, and (6) spaCy.

    1.0 Regular Expressions & Word Tokenization

    1.1 Regular Expressions

    Regular expressions, often referred to as regex or regexp, are patterns that define search criteria for text. In the Python example below, I illustrate a few ways to use regex with the “re” library.
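    A minimal sketch of such usage (the sample string and patterns below are my own, not the post’s original example):

```python
import re

text = "NLP was coined around 1950. Contact: nlp_fan@example.com"

# findall returns every non-overlapping match of the pattern
print(re.findall(r"\w+", text))      # all word-character sequences (simple "words")
print(re.findall(r"\d+", text))      # all digit sequences, e.g. ['1950']

# search returns the first match as a match object (or None if nothing matches)
match = re.search(r"[\w.]+@[\w.]+", text)
if match:
    print(match.group())             # the email-like substring

# split breaks the string on every match of the pattern
print(re.split(r"\s+", text))        # split on whitespace
```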

    1.2 Tokenization

    Tokenization is a fundamental process in Natural Language Processing (NLP) that involves breaking down a text into individual units or “tokens.” These tokens are typically words, phrases, or symbols, and tokenization is an essential step for various NLP tasks.

    The Python example below shows how to tokenize into words (example #1), and how to tokenize into sentences (example #2).
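    A minimal sketch with NLTK’s tokenizers (the sample text is a stand-in for the post’s original example):

```python
from nltk.tokenize import word_tokenize, sent_tokenize
# import nltk; nltk.download("punkt")  # one-time download of the tokenizer models

text = "Tokenization splits text into units. It is an essential NLP step!"

# Example 1: tokenize into words (punctuation becomes its own token)
print(word_tokenize(text))

# Example 2: tokenize into sentences
print(sent_tokenize(text))
```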

    2.0 Bag-of-Words

    In natural language processing (NLP), the “bag-of-words” (BoW) model is a simple method for finding topics in text. The name “bag-of-words” comes from the fact that it doesn’t consider the order of words in a document; instead, it treats text as an unordered “bag” of words. Bag-of-words can be useful for determining the most significant words in a text, based on the number of times they are used.

    The Python code below shows how to extract the number of times a word appears in the text, as well as the top three words based on the number of appearances.
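    A minimal sketch of this idea using Counter from the standard library; the sample text here is a short stand-in, so the exact counts may differ from the output discussed below:

```python
from collections import Counter
from nltk.tokenize import word_tokenize

text = ("“Natural language processing,” or NLP, analyzes the words in a text. "
        "“The most frequent words in the text are often stopwords.” "
        "“Counting them is the first step.”")

# Tokenize and count every token (punctuation and quote marks count as tokens too)
tokens = word_tokenize(text)
bag_of_words = Counter(tokens)

# The number of times each token appears
print(bag_of_words)

# The three most frequent tokens
print(bag_of_words.most_common(3))
```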

    As seen in the example above, the top three tokens are “the” and the opening and closing curly quotation marks. It is important to note that, by default, bag-of-words is case-sensitive: it distinguishes between uppercase and lowercase letters, so if the same word appears in both uppercase and lowercase forms in the text, the two forms are counted as separate words.

    3.0 Simple Text Preprocessing

    Simple text preprocessing in natural language processing (NLP) involves a series of basic and essential steps to clean and prepare text data for analysis or machine learning tasks. These preprocessing steps help reduce noise, improve the quality of the data, and make it more amenable to downstream NLP tasks. Common steps in simple text preprocessing include lowercasing, tokenization, removing punctuation, removing special characters and numbers, and removing stopwords, among others.
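    As a minimal sketch of a few of these steps (lowercasing, removing punctuation, and removing numbers) on a made-up string:

```python
import re
import string

text = "Text Preprocessing in 2024: removing numbers (like 42), punctuation & noise!!!"

clean = text.lower()                                                 # lowercasing
clean = clean.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
clean = re.sub(r"\d+", "", clean)                                    # remove numbers
clean = re.sub(r"\s+", " ", clean).strip()                           # collapse extra whitespace

print(clean)  # 'text preprocessing in removing numbers like punctuation noise'
```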

    3.1 Lowercasing & Removing Stopwords

    The Python code below shows how to lowercase sentences and remove stopwords such as “a”, “the”, “in”, “on”, “at”, and so on.
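    A minimal sketch of these two steps, reusing the stand-in text from Section 2.0 (so the exact counts may differ slightly from the output quoted below):

```python
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# import nltk; nltk.download("stopwords")  # one-time download of the stopword list

text = ("“Natural language processing,” or NLP, analyzes the words in a text. "
        "“The most frequent words in the text are often stopwords.” "
        "“Counting them is the first step.”")

# Lowercase and tokenize
tokens = word_tokenize(text.lower())

# Keep only alphabetic tokens that are not English stopwords
english_stops = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in english_stops]

# Top three words once stopwords and punctuation are gone
print(Counter(filtered).most_common(3))
```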

    Under the Python code for Bag-of-Words (Section 2.0), we saw that the top three tokens were [(‘the’, 3), (‘“’, 3), (‘”’, 3)], while the Python code above removes the stopwords, so the top three words are now [(‘text’, 2), (‘words’, 2), (‘natural’, 1)].

    4.0 Gensim

    Gensim is an open-source Python library for natural language processing (NLP) and machine learning. It is particularly known for its efficient and scalable implementations of various algorithms related to text analysis, with a focus on topics like topic modeling, document similarity analysis, and word embeddings.

    4.1 Creating a Gensim Corpus

    A Gensim corpus consists of two main components:

    1. Dictionary: This is an object that maps words to unique integer IDs. Each word in the corpus is assigned a unique ID.
    2. Corpus: The corpus itself is a collection of documents where each document is represented as a list of (word ID, word frequency) tuples.

    The Python code below shows the unique ID assigned to every word, as well as the (word ID, word frequency) tuples.
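    A minimal sketch of both pieces (the sentences here are stand-ins, so the actual IDs, such as ID #29 for “words” in the output discussed below, will differ):

```python
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

documents = [
    "Natural language processing analyzes the words in a text.",
    "The most frequent words in a text are often stopwords, not content words.",
]

# Tokenize and lowercase each document (here, each sentence)
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

# 1. Dictionary: maps each unique word to an integer ID
dictionary = Dictionary(tokenized_docs)
print(dictionary.token2id)

# 2. Corpus: each document as a list of (word ID, word frequency) tuples
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(corpus)
```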

    As seen under the first output, all words are assigned a unique ID. For instance, “words” is assigned to ID #29.

    The second output shows a list of documents (sentences in this case), where each document (sentence) is a list of (word ID, word frequency) tuples. You will see that the last tuple is (29,2), where 29 is the unique ID of “words”, and 2 is the frequency of “words”.

    4.2 TF-IDF within Gensim

    In Gensim, TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used to determine the most important words in each document in the corpus. The idea behind TF-IDF is to prevent words that are common across the whole corpus from showing up as keywords, as explained below.

    1. Term Frequency (TF): TF measures the frequency of a term (word) within a document.
    2. Inverse Document Frequency (IDF): IDF measures the importance of a term across a collection of documents. It assigns a weight to the term based on how often it appears in the entire corpus. Terms that appear in many documents receive a lower IDF score, while terms that are rare receive a higher IDF score.
    3. Term Frequency-Inverse Document Frequency (TF-IDF): Combines both TF and IDF to assign a weight to each term within a document relative to its importance in the entire corpus.

    The Python code below shows how to compute TF-IDF for the second sentence of the text above:
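    A minimal sketch with Gensim’s TfidfModel, continuing from the stand-in corpus above (IDs and weights will therefore differ from the exact values quoted below):

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from nltk.tokenize import word_tokenize

documents = [
    "Natural language processing analyzes the words in a text.",
    "The most frequent words in a text are often stopwords, not content words.",
]

tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Fit a TF-IDF model on the whole corpus
tfidf = TfidfModel(corpus)

# TF-IDF weights for the second sentence, sorted from most to least important
tfidf_weights = tfidf[corpus[1]]
print(sorted(tfidf_weights, key=lambda weight: weight[1], reverse=True))
```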

    The output above suggests that term 29 (“words”) has a higher importance or uniqueness in the context of the document compared to the other terms, as it has a TF-IDF weight of 0.5. The other terms have lower TF-IDF weights, indicating they are less significant in this document.

    5.0 Named Entity Recognition

    Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that focuses on identifying and classifying named entities (specific words or phrases that represent entities with proper names) within a text. Named entities are typically real-world objects such as names of people, organizations, locations, dates, numerical values, and more. NER is important in NLP because it helps in extracting structured information from unstructured text.

    The Python code below shows how the “nltk” library can be used to extract specific words that represent entities.
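    A minimal sketch of this pipeline; the sentence below is constructed around the entities discussed next, since the post’s original text isn’t reproduced here:

```python
import nltk
# One-time downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = ("In New York, I like to ride the Metro to visit MOMA "
            "and some restaurants rated well by Ruth Reichl.")

# Tokenize, part-of-speech tag, then chunk named entities
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Print only the chunks that were labelled as named entities
for chunk in tree:
    if hasattr(chunk, "label"):
        print(chunk.label(), " ".join(token for token, pos in chunk))
```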

    As seen in the Python code above, the code was able to detect “New York” as a GPE (geopolitical entity), “Metro” and “MOMA” as organizations, and “Ruth Reichl” as a person. It does this without a knowledge base such as Wikipedia; instead, it relies on a trained statistical and grammatical parser.

    6.0 spaCy

    spaCy is an NLP library similar to Gensim but with different implementations, including a particular focus on creating NLP pipelines to generate models and corpora. spaCy is open-source and comes with several extra libraries, including “displaCy”, a visualization tool for viewing parse trees that uses Node.js to create interactive text.

    The Python code below shows how spaCy correctly identifies Copenhagen and Denmark as geopolitical entities, and Mette Frederiksen as a person.
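    A minimal sketch with spaCy’s small English model; the sentence is constructed around the entities mentioned above:

```python
import spacy
from spacy import displacy

# Requires the model once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Mette Frederiksen gave a speech in Copenhagen, the capital of Denmark.")

# Print each named entity together with its predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)

# displacy.render(doc, style="ent")  # optional: visualize the entities inline
```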