NLTK (Natural Language Toolkit) – Study Notes

| Module | Use Case | Key Classes/Functions |
| --- | --- | --- |
| tokenize | Word/sentence splitting | word_tokenize, RegexpTokenizer |
| corpus | Built-in datasets | stopwords, wordnet, movie_reviews |
| stem | Stemming (word reduction) | PorterStemmer, SnowballStemmer |
| stem | Lemmatization | WordNetLemmatizer |
| tag | POS tagging | pos_tag, PerceptronTagger |
| chunk | Chunking grammar phrases | RegexpParser |
| parse | Parse tree generation | ChartParser, RecursiveDescentParser |
| probability | Word frequency/probability analysis | FreqDist, ConditionalFreqDist |
| classify | Text classification | NaiveBayesClassifier, DecisionTreeClassifier |
- Installation
- Install the NLTK library:

```bash
pip install nltk
```

- First-time setup (download required resources):

```python
import nltk
nltk.download()                              # Opens download GUI
nltk.download('punkt')                       # For tokenizers
nltk.download('stopwords')                   # For stop word removal
nltk.download('wordnet')                     # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
```
- Tokenization (nltk.tokenize)

| Method | Description | Example |
| --- | --- | --- |
| word_tokenize(text) | Tokenizes text into words | word_tokenize("Hello world!") |
| sent_tokenize(text) | Tokenizes text into sentences | sent_tokenize("Hi. How are you?") |
| RegexpTokenizer(r'\w+') | Tokenizes using a regex pattern | RegexpTokenizer(r'\w+').tokenize("Text! #example") |
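A quick runnable version of the table's examples (requires the punkt resource from the setup step):

```python
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer

print(word_tokenize("Hello world!"))
# ['Hello', 'world', '!']
print(sent_tokenize("Hi. How are you?"))
# ['Hi.', 'How are you?']
print(RegexpTokenizer(r'\w+').tokenize("Text! #example"))
# ['Text', 'example']  (punctuation and '#' are dropped by the regex)
```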
- Corpus Access (nltk.corpus)

| Corpus | Use | Example |
| --- | --- | --- |
| stopwords | Provides stop words | stopwords.words('english') |
| wordnet | Lexical database for English | wordnet.synsets('car') |
| gutenberg | Access to classic books | gutenberg.words('austen-emma.txt') |
| movie_reviews | Sentiment analysis dataset | movie_reviews.words() |
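For instance, stop word removal combines the tokenizer with the stopwords corpus:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a simple example sentence")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['simple', 'example', 'sentence']
```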
- Stemming (nltk.stem)

| Stemmer | Description | Example |
| --- | --- | --- |
| PorterStemmer() | Common and less aggressive | stemmer.stem("running") → "run" |
| LancasterStemmer() | More aggressive | lancaster.stem("running") → "run" |
| SnowballStemmer(language) | Multi-language support | SnowballStemmer("english").stem("running") |
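The stemmers agree on "running" but can differ on other words, which shows their relative aggressiveness:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

for w in ["running", "fairly"]:
    print(w,
          PorterStemmer().stem(w),
          LancasterStemmer().stem(w),
          SnowballStemmer("english").stem(w))
# e.g. "fairly" stems to "fairli" with Porter but "fair" with Snowball
```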
- Lemmatization (nltk.stem.WordNetLemmatizer)
- Reduces a word to its base form (lemma) using vocabulary and grammar.
- A part-of-speech (POS) tag improves accuracy; without one, lemmatize treats the word as a noun.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos='v')  # Output: "run"
lemmatizer.lemmatize("running")           # Output: "running" (default POS is noun)
```
- Part-of-Speech Tagging (nltk.tag)

| Method | Description | Example |
| --- | --- | --- |
| pos_tag(tokens) | Tags each token with a POS tag | pos_tag(['This', 'is', 'fun']) |
| PerceptronTagger | Model used internally for POS tagging | Automatic, no manual call needed |
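Running the table's example (requires the averaged_perceptron_tagger resource):

```python
from nltk import pos_tag

print(pos_tag(['This', 'is', 'fun']))
# e.g. [('This', 'DT'), ('is', 'VBZ'), ('fun', 'NN')]
```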
- Chunking (nltk.chunk)
- Extracts phrases or structures using grammar rules.

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"  # Noun phrase: optional determiner, adjectives, then a noun
parser = nltk.RegexpParser(grammar)
tree = parser.parse([('the', 'DT'), ('big', 'JJ'), ('dog', 'NN')])
print(tree)  # (S (NP the/DT big/JJ dog/NN))
```
- Parsing (nltk.parse)

| Parser | Description |
| --- | --- |
| RecursiveDescentParser | Simple and interpretable |
| ChartParser | More efficient, handles ambiguities better |
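A minimal sketch with a toy context-free grammar (the grammar here is illustrative, not from the notes):

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VB NP
DT -> 'the'
NN -> 'dog' | 'ball'
VB -> 'sees'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'dog', 'sees', 'the', 'ball']):
    print(tree)
# (S (NP (DT the) (NN dog)) (VP (VB sees) (NP (DT the) (NN ball))))
```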
- Frequency Distribution (nltk.probability)
- Count the frequency of items like words or characters.

```python
from nltk import FreqDist

fdist = FreqDist(['a', 'b', 'a'])
fdist.most_common(1)  # Output: [('a', 2)]
```
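ConditionalFreqDist (also listed in the module table) keeps a separate FreqDist per condition; a minimal sketch:

```python
from nltk import ConditionalFreqDist

# (condition, sample) pairs, e.g. (genre, word)
pairs = [('news', 'the'), ('news', 'a'), ('news', 'the'), ('fiction', 'the')]
cfd = ConditionalFreqDist(pairs)
print(cfd['news']['the'])  # 2
print(cfd.conditions())    # ['news', 'fiction'] (order may vary)
```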
- Classification (nltk.classify)

| Classifier | Purpose |
| --- | --- |
| NaiveBayesClassifier | Probabilistic classifier for text |
| DecisionTreeClassifier | Rule-based classification model |
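NLTK classifiers train on (feature-dict, label) pairs; a minimal sketch with hand-made features (the tiny dataset is invented for illustration):

```python
from nltk import NaiveBayesClassifier

train = [
    ({'contains_great': True,  'contains_awful': False}, 'pos'),
    ({'contains_great': False, 'contains_awful': False}, 'neg'),
    ({'contains_great': False, 'contains_awful': True},  'neg'),
]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'contains_great': True, 'contains_awful': False}))  # 'pos'
```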
- Semantic and Logic Modules (nltk.sem, nltk.inference)
- Used for tasks involving logic, inference, and semantics.
- Includes tools for working with propositional and predicate logic.
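A small sketch of first-order logic with one of NLTK's built-in provers (the Socrates example is a standard illustration, not from the notes):

```python
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read_expr = Expression.fromstring
premises = [read_expr('all x.(man(x) -> mortal(x))'),
            read_expr('man(socrates)')]
goal = read_expr('mortal(socrates)')
print(ResolutionProver().prove(goal, premises))  # True
```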
- Summary Table of Key Modules and Functions

| Category | Function/Class | Purpose |
| --- | --- | --- |
| Tokenization | word_tokenize, sent_tokenize, RegexpTokenizer | Split text |
| Corpus | stopwords, wordnet, gutenberg, movie_reviews | Dataset access |
| Stemming | PorterStemmer, LancasterStemmer, SnowballStemmer | Word reduction |
| Lemmatization | WordNetLemmatizer | Accurate base forms |
| POS Tagging | pos_tag | Tagging word roles |
| Chunking | RegexpParser | Extracting phrases |
| Parsing | RecursiveDescentParser, ChartParser | Grammar parsing |
| Frequency | FreqDist | Word frequency analysis |
| Classification | NaiveBayesClassifier, DecisionTreeClassifier | Text classification |
| Semantics | nltk.sem, nltk.inference | Logical reasoning |
NLP Basics – Study Notes
- Tokenization
- Stemming and Lemmatization
- Stop Word Removal
- N-grams
What is a Vector in NLP?
- A vector is a numerical representation of text.
- In natural language processing, words, sentences, or documents are converted into vectors so that they can be processed by machine learning models.
- Vectors capture properties of the text such as frequency, position, or semantic meaning.
- Bag of Words (BoW)
- The Bag of Words model is a basic method for text vectorization.
- It represents a document as a collection (bag) of its words, ignoring grammar and word order.
- Only the frequency (count) of each word is considered.
Example:
Suppose you have two sentences:
- Sentence 1: "I love NLP"
- Sentence 2: "NLP is fun"
Vocabulary = ['I', 'love', 'NLP', 'is', 'fun']
BoW Vectors:
- Sentence 1 → [1, 1, 1, 0, 0]
- Sentence 2 → [0, 0, 1, 1, 1]
Each vector position corresponds to the count of a vocabulary word.
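A minimal pure-Python sketch of this counting (the helper below is illustrative; scikit-learn's CountVectorizer, shown later, does the same job):

```python
vocabulary = ['I', 'love', 'NLP', 'is', 'fun']

def bow_vector(sentence, vocab=vocabulary):
    # Count how often each vocabulary word appears in the sentence
    tokens = sentence.split()
    return [tokens.count(word) for word in vocab]

print(bow_vector("I love NLP"))  # [1, 1, 1, 0, 0]
print(bow_vector("NLP is fun"))  # [0, 0, 1, 1, 1]
```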
Types of Text Vectors
- Count Vector (Bag of Words)
    - Counts the number of times each word appears.
    - High-dimensional and sparse.
- TF-IDF Vector (Term Frequency – Inverse Document Frequency)
    - Weights word frequency by how unique the word is across all documents.
    - Reduces the influence of common words.
    - Formula:
        - Term Frequency (TF): how often a word appears in a document.
        - Inverse Document Frequency (IDF): how rare the word is across all documents; a common form is IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
        - TF-IDF(t, d) = TF(t, d) × IDF(t)
- One-Hot Encoding
    - Each word is represented by a vector with a 1 at the index of the word in the vocabulary, and 0 elsewhere.
    - Doesn't capture similarity between words.
- Word Embeddings (e.g., Word2Vec, GloVe, FastText)
    - Words are represented as dense vectors that capture semantic meaning.
    - Similar words have similar vectors.
    - Lower-dimensional and more expressive than BoW or TF-IDF.
- Sentence Embeddings
    - Represent whole sentences or paragraphs in a vector space.
    - Capture context and semantics beyond individual words.
    - Examples: Universal Sentence Encoder, BERT embeddings.
- Summary of Vector Types

| Vector Type | Description | Captures Meaning | Handles Context |
| --- | --- | --- | --- |
| Count Vector | Word count per document | No | No |
| TF-IDF | Weighted word frequency | Partially | No |
| One-Hot Encoding | Binary position indicator | No | No |
| Word Embeddings | Dense vectors from large corpora | Yes | Limited |
| Sentence Embeddings | Context-aware sentence representation | Yes | Yes |
Study Notes – Python Examples of Text Vectorization Methods
- Count Vector (Bag of Words)

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0 0 1 1]
#  [1 1 0 1]]
```
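Note: 'i' is missing from the vocabulary because CountVectorizer's default token_pattern ignores single-character tokens; pass token_pattern=r'(?u)\b\w+\b' to keep them.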
- TF-IDF Vector

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love NLP", "NLP is fun"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0.     0.     0.8148 0.5797]
#  [0.6317 0.6317 0.     0.4494]]  (values rounded)
```

Notice that 'nlp', which appears in every document, gets a lower weight than the words unique to one document.
- One-Hot Encoding

```python
from sklearn.preprocessing import LabelBinarizer

words = ['NLP', 'fun', 'love']
encoder = LabelBinarizer()
one_hot = encoder.fit_transform(words)
print(encoder.classes_)
# ['NLP' 'fun' 'love']  (sorted; uppercase letters sort before lowercase)
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]
```
Note: This is word-level one-hot. For sentence-level, use a tokenizer + BoW with binary=True.
- Word Embeddings (using gensim Word2Vec)

```python
from gensim.models import Word2Vec

sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=1, seed=1)
word_vec = model.wv['NLP']
print(word_vec)
# Output: 10-dimensional dense vector for 'NLP'
```
Install Gensim if needed: pip install gensim
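Once trained, the model can also return nearest neighbours in the vector space (on a toy corpus like this, the scores are not meaningful):

```python
# Continuing from the Word2Vec example above
print(model.wv.most_similar('NLP', topn=2))
# e.g. [('fun', 0.32), ('is', -0.04)] — similarity scores are arbitrary on a toy corpus
```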
Pretrained Word2Vec models such as GoogleNews-vectors-negative300.bin, released by Tomas Mikolov and colleagues, use this same architecture and can be loaded instead of training from scratch.
- Sentence Embeddings (using sentence-transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["I love NLP", "NLP is fun"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 384) → each sentence is a 384-dimensional vector
```
Install with: pip install sentence-transformers
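These vectors can be compared directly; a short sketch using scikit-learn's cosine similarity (the exact score depends on the model version):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Continuing from the sentence-embeddings example above
score = cosine_similarity(embeddings[0:1], embeddings[1:2])[0][0]
print(score)  # A value near 1 means the sentences are semantically similar
```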