NLTK (Natural Language Toolkit)
Module      | Use Case                             | Key Classes/Functions
tokenize    | Word/sentence splitting              | word_tokenize, RegexpTokenizer
corpus      | Built-in datasets                    | stopwords, wordnet, movie_reviews
stem        | Stemming (word reduction)            | PorterStemmer, SnowballStemmer
stem        | Lemmatization (lemma reduction)      | WordNetLemmatizer
tag         | POS tagging                          | pos_tag, PerceptronTagger
chunk       | Chunking grammar phrases             | RegexpParser
parse       | Parse tree generation                | ChartParser, RecursiveDescentParser
probability | Word frequency/probability analysis  | FreqDist, ConditionalFreqDist
classify    | Text classification                  | NaiveBayesClassifier, DecisionTreeClassifier

NLTK (Natural Language Toolkit) – Study Notes

  1. Installation
  • Install the NLTK library:

    pip install nltk

  • First-time setup (download the required resources):

    import nltk
    nltk.download()  # Opens the download GUI
    nltk.download('punkt')                       # For tokenizers
    nltk.download('stopwords')                   # For stop word removal
    nltk.download('wordnet')                     # For lemmatization
    nltk.download('averaged_perceptron_tagger')  # For POS tagging
    
  2. Tokenization (nltk.tokenize)

Method                  | Description                     | Example
word_tokenize(text)     | Tokenizes text into words       | word_tokenize("Hello world!")
sent_tokenize(text)     | Tokenizes text into sentences   | sent_tokenize("Hi. How are you?")
RegexpTokenizer(r'\w+') | Tokenizes using a regex pattern | RegexpTokenizer(r'\w+').tokenize("Text! #example")
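
A minimal sketch of the three tokenizers above (outputs assume the 'punkt' resource has been downloaded):

from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer

sent_tokenize("Hi. How are you?")                   # ['Hi.', 'How are you?']
word_tokenize("Hello world!")                       # ['Hello', 'world', '!']
RegexpTokenizer(r'\w+').tokenize("Text! #example")  # ['Text', 'example'] — punctuation dropped
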
  3. Corpus Access (nltk.corpus)

Corpus        | Use                          | Example
stopwords     | Provides stop words          | stopwords.words('english')
wordnet       | Lexical database for English | wordnet.synsets('car')
gutenberg     | Access to classic books      | gutenberg.words('austen-emma.txt')
movie_reviews | Sentiment analysis dataset   | movie_reviews.words()
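
A small sketch of corpus access, assuming the 'stopwords' corpus has been downloaded; the sample words are made up:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
words = ["this", "is", "a", "short", "example"]
[w for w in words if w not in stops]  # ['short', 'example'] — stop words removed
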
  4. Stemming (nltk.stem)

Stemmer                   | Description                | Example
PorterStemmer()           | Common and less aggressive | stemmer.stem("running") → "run"
LancasterStemmer()        | More aggressive            | lancaster.stem("running") → "run"
SnowballStemmer(language) | Multi-language support     | SnowballStemmer("english").stem("running")
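
A quick comparison sketch of the three stemmers; "maximum" illustrates how Lancaster is more aggressive than Porter:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

porter.stem("maximum")     # 'maximum' — left unchanged
lancaster.stem("maximum")  # 'maxim'   — clipped more aggressively
snowball.stem("running")   # 'run'
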
  5. Lemmatization (nltk.stem.WordNetLemmatizer)
  • Reduces a word to its base form (lemma) using vocabulary and grammar.
  • Passing the correct part-of-speech (POS) tag improves accuracy.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos='v')  # Output: "run" (verb lemma)
lemmatizer.lemmatize("running")           # Output: "running" (default pos='n')

  6. Part-of-Speech Tagging (nltk.tag)

Method           | Description                           | Example
pos_tag(tokens)  | Tags each token with a POS tag        | pos_tag(['This', 'is', 'fun'])
PerceptronTagger | Model used internally for POS tagging | Automatic, no manual call needed
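
A minimal sketch (requires the 'averaged_perceptron_tagger' resource); the comment shows the expected Penn Treebank tags:

from nltk import pos_tag

pos_tag(['This', 'is', 'fun'])
# Expected: [('This', 'DT'), ('is', 'VBZ'), ('fun', 'NN')]
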
  7. Chunking (nltk.chunk)
  • Extracts phrases or structures from POS-tagged text using grammar rules.

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"  # Noun phrase: optional determiner, adjectives, a noun
parser = nltk.RegexpParser(grammar)
tree = parser.parse([("the", "DT"), ("quick", "JJ"), ("fox", "NN")])
print(tree)  # (S (NP the/DT quick/JJ fox/NN))

  8. Parsing (nltk.parse)

Parser                 | Description
RecursiveDescentParser | Simple and interpretable
ChartParser            | More efficient, handles ambiguity better
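
A minimal ChartParser sketch with a toy grammar defined here purely for illustration:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | 'NLP'
VP -> V NP
V -> 'love'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['I', 'love', 'NLP']):
    print(tree)  # (S (NP I) (VP (V love) (NP NLP)))
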
  9. Frequency Distribution (nltk.probability)
  • Counts the frequency of items such as words or characters.

from nltk import FreqDist

fdist = FreqDist(['a', 'b', 'a'])
fdist.most_common(1)  # Output: [('a', 2)]
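
ConditionalFreqDist (also listed in the module table) groups counts by a condition; a small sketch using word length as the condition:

from nltk import ConditionalFreqDist

cfd = ConditionalFreqDist((len(w), w) for w in ['a', 'an', 'the', 'of'])
cfd[2].most_common()  # [('an', 1), ('of', 1)] — frequencies of the 2-letter words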

  10. Classification (nltk.classify)

Classifier             | Purpose
NaiveBayesClassifier   | Probabilistic classifier for text
DecisionTreeClassifier | Rule-based classification model
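
NLTK classifiers train on (feature dict, label) pairs. A toy sketch with a single made-up feature:

from nltk.classify import NaiveBayesClassifier

train = [({'contains_fun': True}, 'pos'),
         ({'contains_fun': True}, 'pos'),
         ({'contains_fun': False}, 'neg')]
classifier = NaiveBayesClassifier.train(train)
classifier.classify({'contains_fun': True})  # 'pos'
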
  11. Semantic and Logic Modules (nltk.sem, nltk.inference)
  • Used for tasks involving logic, inference, and semantics.
  • Includes tools for working with propositional and predicate logic.
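
A small sketch of these tools: parsing formulas with nltk.sem and checking a modus ponens inference with a resolution prover (module layout as in recent NLTK versions; treat the import paths as an assumption):

from nltk.sem import Expression
from nltk.inference import ResolutionProver

read = Expression.fromstring
assumptions = [read('P'), read('P -> Q')]
goal = read('Q')
ResolutionProver().prove(goal, assumptions)  # True — Q follows by modus ponens
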
  12. Summary Table of Key Modules and Functions

Category       | Function/Class                                   | Purpose
Tokenization   | word_tokenize, sent_tokenize, RegexpTokenizer    | Split text
Corpus         | stopwords, wordnet, gutenberg, movie_reviews     | Dataset access
Stemming       | PorterStemmer, LancasterStemmer, SnowballStemmer | Word reduction
Lemmatization  | WordNetLemmatizer                                | Accurate base forms
POS Tagging    | pos_tag                                          | Tagging word roles
Chunking       | RegexpParser                                     | Extracting phrases
Parsing        | RecursiveDescentParser, ChartParser              | Grammar parsing
Frequency      | FreqDist                                         | Word frequency analysis
Classification | NaiveBayesClassifier, DecisionTreeClassifier     | Text classification
Semantics      | nltk.sem, nltk.inference                         | Logical reasoning

NLP Basics – Study Notes

Topics: Tokenization, Stemming and Lemmatization, Stop Word Removal, N-grams

What Is a Vector in NLP?

  • A vector is a numerical representation of text.
  • In natural language processing, words, sentences, or documents are converted into vectors so that they can be processed by machine learning models.
  • Vectors capture properties of the text such as frequency, position, or semantic meaning.
  1. Bag of Words (BoW)
  • The Bag of Words model is a basic method for text vectorization.
  • It represents a document as a collection (bag) of its words, ignoring grammar and word order.
  • Only the frequency (count) of each word is considered.

Example:

Suppose you have two sentences:

  • Sentence 1: "I love NLP"
  • Sentence 2: "NLP is fun"

Vocabulary = ['I', 'love', 'NLP', 'is', 'fun']

BoW Vectors:

  • Sentence 1 → [1, 1, 1, 0, 0]
  • Sentence 2 → [0, 0, 1, 1, 1]

Each vector position corresponds to the count of a vocabulary word.

Types of Text Vectors

  1. Count Vector (Bag of Words)
    • Counts the number of times each word appears.
    • High-dimensional and sparse.
  2. TF-IDF Vector (Term Frequency – Inverse Document Frequency)
    • Weights word frequency by how unique the word is across all documents.
    • Reduces the influence of common words.
    • Formula (a worked sketch follows this list):
      • Term Frequency (TF): how often a word appears in a document.
      • Inverse Document Frequency (IDF): how rare the word is across all documents.
      • TF-IDF = TF * IDF
  3. One-Hot Encoding
    • Each word is represented by a vector with a 1 at the index of the word in the vocabulary, and 0 elsewhere.
    • Doesn't capture similarity between words.
  4. Word Embeddings (e.g., Word2Vec, GloVe, FastText)
    • Words are represented in dense vectors that capture semantic meaning.
    • Similar words have similar vectors.
    • Lower-dimensional and more expressive than BoW or TF-IDF.
  5. Sentence Embeddings
    • Represent whole sentences or paragraphs in a vector space.
    • Captures context and semantics beyond individual words.
    • Examples: Universal Sentence Encoder, BERT embeddings.
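
A worked sketch of the classic TF-IDF formula on the two BoW sentences above (scikit-learn uses a smoothed, length-normalized variant, so its numbers differ):

import math

docs = [["i", "love", "nlp"], ["nlp", "is", "fun"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)    # term frequency within this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log(len(docs) / df)     # rarer terms get higher weight
    return tf * idf

tf_idf("love", docs[0], docs)  # ≈ 0.231 — "love" is unique to document 0
tf_idf("nlp", docs[0], docs)   # 0.0    — "nlp" appears in every document
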
  2. Summary of Vector Types

Vector Type         | Description                            | Captures Meaning | Handles Context
Count Vector        | Word count per document                | No               | No
TF-IDF              | Weighted word frequency                | Partially        | No
One-Hot Encoding    | Binary position indicator              | No               | No
Word Embeddings     | Dense vectors from large corpora       | Yes              | Limited
Sentence Embeddings | Context-aware sentence representation  | Yes              | Yes

Study Notes – Python Examples of Text Vectorization Methods

  1. Count Vector (Bag of Words)

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']  — the default token pattern drops 1-char tokens like "I"

print(X.toarray())
# [[0 0 1 1]
#  [1 1 0 1]]

  2. TF-IDF Vector

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love NLP", "NLP is fun"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']

print(X.toarray())
# [[0.     0.     0.815  0.58 ]
#  [0.632  0.632  0.     0.449]]
# "nlp" gets a lower weight because it appears in both documents.

  3. One-Hot Encoding

from sklearn.preprocessing import LabelBinarizer

words = ['NLP', 'fun', 'love']
encoder = LabelBinarizer()
one_hot = encoder.fit_transform(words)

print(encoder.classes_)
# ['NLP' 'fun' 'love']  — classes are sorted (uppercase sorts before lowercase)

print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]

Note: This is word-level one-hot. For sentence-level, use a tokenizer + BoW with binary=True.
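
A sketch of the sentence-level variant mentioned in the note, using CountVectorizer(binary=True):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(["NLP is fun fun"])

print(vectorizer.get_feature_names_out())  # ['fun' 'is' 'nlp']
print(X.toarray())                         # [[1 1 1]] — presence only, repeats ignored
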

  4. Word Embeddings (using gensim Word2Vec)

from gensim.models import Word2Vec

sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=1, seed=1)

word_vec = model.wv['NLP']
print(word_vec)
# Output: 10-dimensional dense vector for 'NLP'

Install Gensim if needed: pip install gensim

Word2Vec was introduced by Tomas Mikolov and colleagues at Google. A widely used pretrained model is GoogleNews-vectors-negative300.bin, which provides 300-dimensional vectors trained on the Google News corpus.
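
A sketch of loading those pretrained vectors with gensim; the path is hypothetical and the multi-gigabyte file must be downloaded separately:

from gensim.models import KeyedVectors

# Hypothetical local path to the downloaded file
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv.most_similar('king', topn=3)  # Nearest neighbours in the 300-dim embedding space
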
  5. Sentence Embeddings (using sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["I love NLP", "NLP is fun"]
embeddings = model.encode(sentences)

print(embeddings.shape)
# (2, 384) → Each sentence is a 384-dimensional vector

Install with: pip install sentence-transformers
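
Sentence embeddings are typically compared with cosine similarity; a sketch using the util helper from the same library:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["I love NLP", "NLP is fun"])
util.cos_sim(embeddings[0], embeddings[1])  # Similarity score in [-1, 1]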