NLTK (Natural Language Toolkit) – Study Notes

| Module | Use Case | Key Classes/Functions |
| --- | --- | --- |
| tokenize | Word/sentence splitting | word_tokenize, RegexpTokenizer |
| corpus | Built-in datasets | stopwords, wordnet, movie_reviews |
| stem | Stemming (word reduction) | PorterStemmer, SnowballStemmer |
| stem | Lemmatization | WordNetLemmatizer |
| tag | POS tagging | pos_tag, PerceptronTagger |
| chunk | Chunking grammar phrases | RegexpParser |
| parse | Parse tree generation | ChartParser, RecursiveDescentParser |
| probability | Word frequency/probability analysis | FreqDist, ConditionalFreqDist |
| classify | Text classification | NaiveBayesClassifier, DecisionTreeClassifier |
- Installation
- Install the NLTK library:

```bash
pip install nltk
```

- First-time setup (download required resources):

```python
import nltk
nltk.download()                              # Opens download GUI
nltk.download('punkt')                       # For tokenizers
nltk.download('stopwords')                   # For stop word removal
nltk.download('wordnet')                     # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
```
- Tokenization (nltk.tokenize)

| Method | Description | Example |
| --- | --- | --- |
| word_tokenize(text) | Tokenizes text into words | word_tokenize("Hello world!") |
| sent_tokenize(text) | Tokenizes text into sentences | sent_tokenize("Hi. How are you?") |
| RegexpTokenizer(r'\w+') | Tokenizes using a regex pattern | RegexpTokenizer(r'\w+').tokenize("Text! #example") |
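A quick runnable version of the table's examples (requires the punkt resource from the setup step):

```python
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer

print(word_tokenize("Hello world!"))
# ['Hello', 'world', '!']
print(sent_tokenize("Hi. How are you?"))
# ['Hi.', 'How are you?']
print(RegexpTokenizer(r'\w+').tokenize("Text! #example"))
# ['Text', 'example']  (punctuation and '#' are dropped by the regex)
```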
- Corpus Access (nltk.corpus)

| Corpus | Use | Example |
| --- | --- | --- |
| stopwords | Provides stop words | stopwords.words('english') |
| wordnet | Lexical database for English | wordnet.synsets('car') |
| gutenberg | Access to classic books | gutenberg.words('austen-emma.txt') |
| movie_reviews | Sentiment analysis dataset | movie_reviews.words() |
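For instance, stop word removal combines the tokenizer with the stopwords corpus:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a simple example sentence")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['simple', 'example', 'sentence']
```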
- Stemming (nltk.stem)

| Stemmer | Description | Example |
| --- | --- | --- |
| PorterStemmer() | Common and less aggressive | stemmer.stem("running") → "run" |
| LancasterStemmer() | More aggressive | lancaster.stem("running") → "run" |
| SnowballStemmer(language) | Multi-language support | SnowballStemmer("english").stem("running") |
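The stemmers agree on "running" but can differ on other words, which shows their relative aggressiveness:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

for w in ["running", "fairly"]:
    print(w,
          PorterStemmer().stem(w),
          LancasterStemmer().stem(w),
          SnowballStemmer("english").stem(w))
# e.g. "fairly" stems to "fairli" with Porter but "fair" with Snowball
```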
- Lemmatization (nltk.stem.WordNetLemmatizer)
- Reduces a word to its base form (lemma) using vocabulary and grammar.
- A part-of-speech (POS) tag improves accuracy; without one, lemmatize treats the word as a noun.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos='v')  # Output: "run"
lemmatizer.lemmatize("running")           # Output: "running" (default POS is noun)
```
- Part-of-Speech Tagging (nltk.tag)

| Method | Description | Example |
| --- | --- | --- |
| pos_tag(tokens) | Tags each token with a POS tag | pos_tag(['This', 'is', 'fun']) |
| PerceptronTagger | Model used internally for POS tagging | Automatic, no manual call needed |
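Running the table's example (requires the averaged_perceptron_tagger resource):

```python
from nltk import pos_tag

print(pos_tag(['This', 'is', 'fun']))
# e.g. [('This', 'DT'), ('is', 'VBZ'), ('fun', 'NN')]
```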
- Chunking (nltk.chunk)
- Extracts phrases or structures using grammar rules.

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"  # Noun phrase: optional determiner, adjectives, then a noun
parser = nltk.RegexpParser(grammar)
tree = parser.parse([('the', 'DT'), ('big', 'JJ'), ('dog', 'NN')])
print(tree)  # (S (NP the/DT big/JJ dog/NN))
```
- Parsing (nltk.parse)

| Parser | Description |
| --- | --- |
| RecursiveDescentParser | Simple and interpretable |
| ChartParser | More efficient, handles ambiguities better |
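A minimal sketch with a toy context-free grammar (the grammar here is illustrative, not from the notes):

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VB NP
DT -> 'the'
NN -> 'dog' | 'ball'
VB -> 'sees'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'dog', 'sees', 'the', 'ball']):
    print(tree)
# (S (NP (DT the) (NN dog)) (VP (VB sees) (NP (DT the) (NN ball))))
```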
- Frequency Distribution (nltk.probability)
- Count the frequency of items like words or characters.

```python
from nltk import FreqDist

fdist = FreqDist(['a', 'b', 'a'])
fdist.most_common(1)  # Output: [('a', 2)]
```
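ConditionalFreqDist (also listed in the module table) keeps a separate FreqDist per condition; a minimal sketch:

```python
from nltk import ConditionalFreqDist

# (condition, sample) pairs, e.g. (genre, word)
pairs = [('news', 'the'), ('news', 'a'), ('news', 'the'), ('fiction', 'the')]
cfd = ConditionalFreqDist(pairs)
print(cfd['news']['the'])  # 2
print(cfd.conditions())    # ['news', 'fiction'] (order may vary)
```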
- Classification (nltk.classify)

| Classifier | Purpose |
| --- | --- |
| NaiveBayesClassifier | Probabilistic classifier for text |
| DecisionTreeClassifier | Rule-based classification model |
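NLTK classifiers train on (feature-dict, label) pairs; a minimal sketch with hand-made features (the tiny dataset is invented for illustration):

```python
from nltk import NaiveBayesClassifier

train = [
    ({'contains_great': True,  'contains_awful': False}, 'pos'),
    ({'contains_great': False, 'contains_awful': False}, 'neg'),
    ({'contains_great': False, 'contains_awful': True},  'neg'),
]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'contains_great': True, 'contains_awful': False}))  # 'pos'
```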
- Semantic and Logic Modules (nltk.sem, nltk.inference)
- Used for tasks involving logic, inference, and semantics.
- Includes tools for working with propositional and predicate logic.
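A small sketch of first-order logic with one of NLTK's built-in provers (the Socrates example is a standard illustration, not from the notes):

```python
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read_expr = Expression.fromstring
premises = [read_expr('all x.(man(x) -> mortal(x))'),
            read_expr('man(socrates)')]
goal = read_expr('mortal(socrates)')
print(ResolutionProver().prove(goal, premises))  # True
```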
- Summary Table of Key Modules and Functions

| Category | Function/Class | Purpose |
| --- | --- | --- |
| Tokenization | word_tokenize, sent_tokenize, RegexpTokenizer | Split text |
| Corpus | stopwords, wordnet, gutenberg, movie_reviews | Dataset access |
| Stemming | PorterStemmer, LancasterStemmer, SnowballStemmer | Word reduction |
| Lemmatization | WordNetLemmatizer | Accurate base forms |
| POS Tagging | pos_tag | Tagging word roles |
| Chunking | RegexpParser | Extracting phrases |
| Parsing | RecursiveDescentParser, ChartParser | Grammar parsing |
| Frequency | FreqDist | Word frequency analysis |
| Classification | NaiveBayesClassifier, DecisionTreeClassifier | Text classification |
| Semantics | nltk.sem, nltk.inference | Logical reasoning |
NLP Basics – Study Notes
- Tokenization
- Stemming and Lemmatization
- Stop Word Removal
- N-grams
What is a Vector in NLP?
- A vector is a numerical representation of text.
- In natural language processing, words, sentences, or documents are converted into vectors so that they can be processed by machine learning models.
- Vectors capture properties of the text such as frequency, position, or semantic meaning.
- Bag of Words (BoW)
- The Bag of Words model is a basic method for text vectorization.
- It represents a document as a collection (bag) of its words, ignoring grammar and word order.
- Only the frequency (count) of each word is considered.
Example:
Suppose you have two sentences:
- Sentence 1: "I love NLP"
- Sentence 2: "NLP is fun"
Vocabulary = ['I', 'love', 'NLP', 'is', 'fun']
BoW Vectors:
- Sentence 1 → [1, 1, 1, 0, 0]
- Sentence 2 → [0, 0, 1, 1, 1]
Each vector position corresponds to the count of a vocabulary word.
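A minimal pure-Python sketch of this counting (the helper below is illustrative; scikit-learn's CountVectorizer, shown later, does the same job):

```python
vocabulary = ['I', 'love', 'NLP', 'is', 'fun']

def bow_vector(sentence, vocab=vocabulary):
    # Count how often each vocabulary word appears in the sentence
    tokens = sentence.split()
    return [tokens.count(word) for word in vocab]

print(bow_vector("I love NLP"))  # [1, 1, 1, 0, 0]
print(bow_vector("NLP is fun"))  # [0, 0, 1, 1, 1]
```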
Types of Text Vectors
- Count Vector (Bag of Words)
    - Counts the number of times each word appears.
    - High-dimensional and sparse.
- TF-IDF Vector (Term Frequency – Inverse Document Frequency)
    - Weights word frequency by how unique the word is across all documents.
    - Reduces the influence of common words.
    - Formula:
        - Term Frequency (TF): how often a word appears in a document.
        - Inverse Document Frequency (IDF): how rare the word is across all documents; a common form is IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
        - TF-IDF(t, d) = TF(t, d) × IDF(t)
- One-Hot Encoding
    - Each word is represented by a vector with a 1 at the index of the word in the vocabulary, and 0 elsewhere.
    - Doesn't capture similarity between words.
- Word Embeddings (e.g., Word2Vec, GloVe, FastText)
    - Words are represented as dense vectors that capture semantic meaning.
    - Similar words have similar vectors.
    - Lower-dimensional and more expressive than BoW or TF-IDF.
- Sentence Embeddings
    - Represent whole sentences or paragraphs in a vector space.
    - Capture context and semantics beyond individual words.
    - Examples: Universal Sentence Encoder, BERT embeddings.
- Summary of Vector Types

| Vector Type | Description | Captures Meaning | Handles Context |
| --- | --- | --- | --- |
| Count Vector | Word count per document | No | No |
| TF-IDF | Weighted word frequency | Partially | No |
| One-Hot Encoding | Binary position indicator | No | No |
| Word Embeddings | Dense vectors from large corpora | Yes | Limited |
| Sentence Embeddings | Context-aware sentence representation | Yes | Yes |
Study Notes – Python Examples of Text Vectorization Methods
- Count Vector (Bag of Words)

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0 0 1 1]
#  [1 1 0 1]]
```
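Note: 'i' is missing from the vocabulary because CountVectorizer's default token_pattern ignores single-character tokens; pass token_pattern=r'(?u)\b\w+\b' to keep them.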
- TF-IDF Vector

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love NLP", "NLP is fun"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0.     0.     0.8148 0.5797]
#  [0.6317 0.6317 0.     0.4494]]  (values rounded)
```

Notice that 'nlp', which appears in every document, gets a lower weight than the words unique to one document.
- One-Hot Encoding

```python
from sklearn.preprocessing import LabelBinarizer

words = ['NLP', 'fun', 'love']
encoder = LabelBinarizer()
one_hot = encoder.fit_transform(words)
print(encoder.classes_)
# ['NLP' 'fun' 'love']  (sorted; uppercase letters sort before lowercase)
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]
```
Note: This is word-level one-hot. For sentence-level, use a tokenizer + BoW with binary=True.
- Word Embeddings (using gensim Word2Vec)

```python
from gensim.models import Word2Vec

sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=1, seed=1)
word_vec = model.wv['NLP']
print(word_vec)
# Output: 10-dimensional dense vector for 'NLP'
```
Install Gensim if needed: pip install gensim
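Once trained, the model can also return nearest neighbours in the vector space (on a toy corpus like this, the scores are not meaningful):

```python
# Continuing from the Word2Vec example above
print(model.wv.most_similar('NLP', topn=2))
# e.g. [('fun', 0.32), ('is', -0.04)] — similarity scores are arbitrary on a toy corpus
```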
Pretrained Word2Vec models such as GoogleNews-vectors-negative300.bin, released by Tomas Mikolov and colleagues, use this same architecture and can be loaded instead of training from scratch.
- Sentence Embeddings (using sentence-transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["I love NLP", "NLP is fun"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 384) → each sentence is a 384-dimensional vector
```
Install with: pip install sentence-transformers
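These vectors can be compared directly; a short sketch using scikit-learn's cosine similarity (the exact score depends on the model version):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Continuing from the sentence-embeddings example above
score = cosine_similarity(embeddings[0:1], embeddings[1:2])[0][0]
print(score)  # A value near 1 means the sentences are semantically similar
```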