
Natural Language Processing:

Natural Language Processing is the technology used to help computers understand humans' natural language.

Natural Language Processing, usually shortened to NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.

The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.

 

Most NLP techniques rely on machine learning to derive meaning from human languages.

Introduction to the NLTK library for Python

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.

 

It provides easy-to-use interfaces to many corpora and lexical resources. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.

 

We’ll use this toolkit to show some basics of the natural language processing field. For the examples below, I’ll assume that we have imported the NLTK toolkit with import nltk.

Steps for NLP: 

1] Sentence Tokenization:

Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences.
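As a minimal sketch (the sample text here is my own), NLTK's sent_tokenize splits a paragraph into sentences. It relies on the pre-trained Punkt model, which has to be downloaded once:

import nltk
nltk.download('punkt')  # pre-trained Punkt sentence tokenizer

text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)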

2] Word Tokenization:

Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages using some form of Latin alphabet, space is a good approximation of a word divider.
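In the same spirit, word_tokenize splits a sentence into word and punctuation tokens (again, the sample sentence is just an illustration):

import nltk
nltk.download('punkt')

sentence = "NLTK is a leading platform for building Python programs."
words = nltk.word_tokenize(sentence)
print(words)  # a list of word tokens, with the final '.' as its own token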

3] Text Lemmatization and Stemming:

For grammatical reasons, documents can contain different forms of a word such as drive, drives, driving. Also, sometimes we have related words with a similar meaning, such as nation, national, nationality.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
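Here is a small sketch of both, using NLTK's PorterStemmer and WordNetLemmatizer; the example words are my own, and the WordNet data has to be downloaded once (on newer NLTK versions you may also need nltk.download('omw-1.4')):

import nltk
nltk.download('wordnet')  # lexical database used by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('drives'))                  # 'drive'
print(stemmer.stem('driving'))                 # 'drive'
print(lemmatizer.lemmatize('drove', pos='v'))  # 'drive' (needs the verb POS tag)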

4] Stop words:

Stop words are words which are filtered out before or after processing text. When applying machine learning to text, these words can add a lot of noise, so we usually want to remove them.

Stop words usually refer to the most common words in a language, such as “and”, “the”, and “a”, but there is no single universal list of stop words. The list can change depending on your application.
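A minimal sketch of stop word removal with NLTK's English stop word list (the sample sentence is illustrative):

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
sentence = "This is a simple example showing the removal of stop words."
tokens = word_tokenize(sentence)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # the common words 'this', 'is', 'a', 'the', 'of' are gone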

5] Regex:

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Let’s see some basics.

  • . - match any character except newline

  • \w - match a word character (letter, digit, or underscore)

  • \d - match a digit

  • \s - match a whitespace character

  • \W - match a non-word character

  • \D - match a non-digit character

  • \S - match a non-whitespace character

  • [abc] - match any of a, b, or c

  • [^abc] - match any character except a, b, or c

  • [a-g] - match a character in the range a to g

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.

 

This collides with Python’s usage of the same character for the same purpose in string literals;

for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
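A short sketch of these patterns using Python's re module (the sample strings are my own); raw string literals (r'...') avoid most of the backslash doubling:

import re

text = "The restaurant opened in 2015 and has 3 branches."
print(re.findall(r'\d+', text))        # ['2015', '3'] - one or more digits

# To match a literal backslash, the regex is \\, which can be written as the
# raw string r'\\' instead of the plain string '\\\\'.
print(re.findall(r'\\', 'C:\\Users'))  # finds the single backslash in C:\Users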

6] Bag-of-words:

Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called feature extraction.

The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

  1. Design a vocabulary of known words (also called tokens)

  2. Choose a measure of the presence of known words
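Here is a minimal sketch using scikit-learn's CountVectorizer, the same class the project below uses (the three toy documents are my own; get_feature_names_out is the name used in scikit-learn 1.0 and later):

from sklearn.feature_extraction.text import CountVectorizer

documents = ["I like this movie, it is funny.",
             "I hate this movie.",
             "This movie was awesome, I like it."]

cv = CountVectorizer()
bag = cv.fit_transform(documents)   # build the vocabulary and count the words
print(cv.get_feature_names_out())   # the vocabulary of known words (tokens)
print(bag.toarray())                # one row per document, one column per token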

7] TF-IDF

One problem with scoring word frequency is that the most frequent words in the document start to have the highest scores.

 

These frequent words may not carry as much information for the model as rarer, more domain-specific words.

 

One approach to fix that problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF (Term Frequency-Inverse Document Frequency).
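A short sketch with scikit-learn's TfidfVectorizer (the toy documents are my own): words that occur in many of the documents receive a lower weight than rarer, more distinctive ones.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the food was great",
             "the service was slow",
             "great food and friendly service"]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(documents)
print(tfidf.get_feature_names_out())   # the vocabulary
print(scores.toarray().round(2))       # TF-IDF weight of each word per document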

Project | 01  Natural Language Processing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)  # quoting = 3 (QUOTE_NONE) ignores double quotes in the reviews

 

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))  # build the stop word set once
corpus = []
for i in range(0, 1000):  # the dataset contains 1000 reviews
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    corpus.append(review)

 

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

 

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

 

# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

 

# Predicting the Test set results
y_pred = classifier.predict(X_test)

 

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
