NLTK (Natural Language Toolkit) – Study Notes

| Module | Use Case | Key Classes/Functions |
| --- | --- | --- |
| tokenize | Word/sentence splitting | word_tokenize, RegexpTokenizer |
| corpus | Built-in datasets | stopwords, wordnet, movie_reviews |
| stem | Stemming (word reduction) | PorterStemmer, SnowballStemmer |
| stem | Lemmatization | WordNetLemmatizer |
| tag | POS tagging | pos_tag, PerceptronTagger |
| chunk | Chunking grammar phrases | RegexpParser |
| parse | Parse tree generation | ChartParser, RecursiveDescentParser |
| probability | Word frequency/probability analysis | FreqDist, ConditionalFreqDist |
| classify | Text classification | NaiveBayesClassifier, DecisionTreeClassifier |
- Installation

Install the NLTK library:

```bash
pip install nltk
```

First-time setup (download required resources):

```python
import nltk
nltk.download()  # Opens the download GUI
nltk.download('punkt')  # For tokenizers
nltk.download('stopwords')  # For stop word removal
nltk.download('wordnet')  # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
```
- Tokenization (nltk.tokenize)

| Method | Description | Example |
| --- | --- | --- |
| word_tokenize(text) | Tokenizes text into words | word_tokenize("Hello world!") |
| sent_tokenize(text) | Tokenizes text into sentences | sent_tokenize("Hi. How are you?") |
| RegexpTokenizer(r'\w+') | Tokenizes using a regex pattern | RegexpTokenizer(r'\w+').tokenize("Text! #example") |
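A minimal runnable sketch of the tokenizers above (the sample sentence is our own):

```python
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer

text = "Hello world! How are you?"
print(sent_tokenize(text))  # ['Hello world!', 'How are you?']
print(word_tokenize(text))  # ['Hello', 'world', '!', 'How', 'are', 'you', '?']
print(RegexpTokenizer(r'\w+').tokenize(text))  # ['Hello', 'world', 'How', 'are', 'you']
```

Note that word_tokenize keeps punctuation as separate tokens, while the regex tokenizer drops it.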
- Corpus Access (nltk.corpus)

| Corpus | Use | Example |
| --- | --- | --- |
| stopwords | Provides stop words | stopwords.words('english') |
| wordnet | Lexical database for English | wordnet.synsets('car') |
| gutenberg | Access to classic books | gutenberg.words('austen-emma.txt') |
| movie_reviews | Sentiment analysis dataset | movie_reviews.words() |
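As a minimal sketch, the stopwords corpus is typically used to filter tokens (the sentence is our own):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a simple example of stop word removal")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['simple', 'example', 'stop', 'word', 'removal']
```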
- Stemming (nltk.stem)

| Stemmer | Description | Example |
| --- | --- | --- |
| PorterStemmer() | Common and less aggressive | stemmer.stem("running") → "run" |
| LancasterStemmer() | More aggressive | lancaster.stem("running") → "run" |
| SnowballStemmer(language) | Multi-language support | SnowballStemmer("english").stem("running") |
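A quick side-by-side comparison of the three stemmers (the word list is our own; outputs differ per word, with Lancaster typically the most aggressive):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

for word in ["running", "flies", "studies"]:
    print(word,
          PorterStemmer().stem(word),
          LancasterStemmer().stem(word),
          SnowballStemmer("english").stem(word))
```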
- Lemmatization (nltk.stem.WordNetLemmatizer)

- Reduces a word to its base form (lemma) using vocabulary and grammar.
- Supplying a part-of-speech (POS) tag improves accuracy (see the sketch after the block below).

```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos='v')  # Output: "run"
```
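Since pos_tag returns Penn Treebank tags while the lemmatizer expects WordNet POS codes, a small mapping helper is commonly used. A minimal sketch (the helper name to_wordnet_pos and the sample sentence are our own):

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(tag):
    # Map Penn Treebank tag prefixes to WordNet POS codes; default to noun.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in pos_tag(word_tokenize("The cats were running quickly")):
    print(lemmatizer.lemmatize(word.lower(), to_wordnet_pos(tag)))
# e.g. the, cat, be, run, quickly
```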
- Part-of-Speech Tagging (nltk.tag)

| Method | Description | Example |
| --- | --- | --- |
| pos_tag(tokens) | Tags each token with a POS tag | pos_tag(['This', 'is', 'fun']) |
| PerceptronTagger | Model used internally for POS tagging | Automatic, no manual call needed |
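A runnable version of the table's example:

```python
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("This is fun")))
# e.g. [('This', 'DT'), ('is', 'VBZ'), ('fun', 'NN')]
```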
- Chunking (nltk.chunk)

- Extracts phrases or structures using grammar rules.

```python
import nltk
grammar = "NP: {<DT>?<JJ>*<NN>}"  # Noun phrase: optional determiner, adjectives, then a noun
parser = nltk.RegexpParser(grammar)
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(parser.parse(tagged))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```
- Parsing (nltk.parse)

| Parser | Description |
| --- | --- |
| RecursiveDescentParser | Simple and interpretable |
| ChartParser | More efficient, handles ambiguity better |
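A minimal ChartParser sketch; the toy grammar below is our own illustration, not from the notes:

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> DT NN
    VP -> VB NP
    DT -> 'the'
    NN -> 'dog' | 'ball'
    VB -> 'chased'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)
# (S (NP (DT the) (NN dog)) (VP (VB chased) (NP (DT the) (NN ball))))
```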
- Frequency Distribution (nltk.probability)

- Counts the frequency of items such as words or characters.

```python
from nltk import FreqDist
fdist = FreqDist(['a', 'b', 'a'])
fdist.most_common(1)  # Output: [('a', 2)]
```
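ConditionalFreqDist (also listed in the module table) counts items per condition; a minimal sketch with our own toy data:

```python
from nltk.probability import ConditionalFreqDist

# (condition, item) pairs
pairs = [('news', 'the'), ('news', 'the'), ('news', 'a'), ('fiction', 'the')]
cfd = ConditionalFreqDist(pairs)
print(cfd.conditions())    # e.g. ['news', 'fiction']
print(cfd['news']['the'])  # 2
```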
- Classification (nltk.classify)

| Classifier | Purpose |
| --- | --- |
| NaiveBayesClassifier | Probabilistic classifier for text |
| DecisionTreeClassifier | Rule-based classification model |
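A minimal NaiveBayesClassifier sketch; NLTK classifiers train on (feature-dict, label) pairs, and this tiny training set is our own:

```python
from nltk.classify import NaiveBayesClassifier

train = [({'contains_great': True}, 'pos'),
         ({'contains_great': False}, 'neg'),
         ({'contains_awful': True}, 'neg')]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'contains_great': True}))  # 'pos'
```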
- Semantic and Logic Modules (nltk.sem, nltk.inference)

- Used for tasks involving logic, inference, and semantics.
- Includes tools for working with propositional and predicate logic; a small proof sketch follows.
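A minimal sketch using NLTK's pure-Python ResolutionProver (the Socrates syllogism is our own example):

```python
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read = Expression.fromstring
premises = [read('all x.(man(x) -> mortal(x))'), read('man(socrates)')]
goal = read('mortal(socrates)')
print(ResolutionProver().prove(goal, premises))  # True
```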
 
- Summary Table of Key Modules and Functions

| Category | Function/Class | Purpose |
| --- | --- | --- |
| Tokenization | word_tokenize, sent_tokenize, RegexpTokenizer | Split text |
| Corpus | stopwords, wordnet, gutenberg, movie_reviews | Dataset access |
| Stemming | PorterStemmer, LancasterStemmer, SnowballStemmer | Word reduction |
| Lemmatization | WordNetLemmatizer | Accurate base forms |
| POS Tagging | pos_tag | Tagging word roles |
| Chunking | RegexpParser | Extracting phrases |
| Parsing | RecursiveDescentParser, ChartParser | Grammar parsing |
| Frequency | FreqDist | Word frequency analysis |
| Classification | NaiveBayesClassifier, DecisionTreeClassifier | Text classification |
| Semantics | nltk.sem, nltk.inference | Logical reasoning |
NLP Basics – Study Notes

Topics:
- Tokenization
- Stemming and Lemmatization
- Stop Word Removal
- N-grams (sketch below)
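The first three topics are covered in the NLTK notes above. As a minimal sketch of n-grams (the sample sentence is our own):

```python
from nltk import ngrams, word_tokenize

tokens = word_tokenize("I love natural language processing")
print(list(ngrams(tokens, 2)))  # bigrams
# [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
```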
What Is a Vector in NLP?
- A vector is a numerical representation of text.
 - In natural language processing, words, sentences, or documents are converted into vectors so that they can be processed by machine learning models.
 - Vectors capture properties of the text such as frequency, position, or semantic meaning.
 
- Bag of Words (BoW)
 
- The Bag of Words model is a basic method for text vectorization.
 - It represents a document as a collection (bag) of its words, ignoring grammar and word order.
 - Only the frequency (count) of each word is considered.
 
Example:
Suppose you have two sentences:
- Sentence 1: "I love NLP"
 - Sentence 2: "NLP is fun"
 
Vocabulary = ['I', 'love', 'NLP', 'is', 'fun']
BoW Vectors:
- Sentence 1 → [1, 1, 1, 0, 0]
 - Sentence 2 → [0, 0, 1, 1, 1]
 
Each vector position corresponds to the count of a vocabulary word.
Types of Text Vectors
- Count Vector (Bag of Words)
  - Counts the number of times each word appears.
  - High-dimensional and sparse.
- TF-IDF Vector (Term Frequency – Inverse Document Frequency)
  - Weights word frequency by how unique the word is across all documents.
  - Reduces the influence of common words.
  - Formula: TF-IDF(t, d) = TF(t, d) × IDF(t)
  - Term Frequency TF(t, d): how often word t appears in document d.
  - Inverse Document Frequency IDF(t): how rare t is across all documents; commonly IDF(t) = log(N / df(t)) for N documents, of which df(t) contain t.
- One-Hot Encoding
  - Each word is represented by a vector with a 1 at the index of the word in the vocabulary, and 0 elsewhere.
  - Doesn't capture similarity between words.
- Word Embeddings (e.g., Word2Vec, GloVe, FastText)
  - Words are represented as dense vectors that capture semantic meaning.
  - Similar words have similar vectors.
  - Lower-dimensional and more expressive than BoW or TF-IDF.
- Sentence Embeddings
  - Represent whole sentences or paragraphs in a vector space.
  - Capture context and semantics beyond individual words.
  - Examples: Universal Sentence Encoder, BERT embeddings.
- Summary of Vector Types

| Vector Type | Description | Captures Meaning | Handles Context |
| --- | --- | --- | --- |
| Count Vector | Word count per document | No | No |
| TF-IDF | Weighted word frequency | Partially | No |
| One-Hot Encoding | Binary position indicator | No | No |
| Word Embeddings | Dense vectors from large corpora | Yes | Limited |
| Sentence Embeddings | Context-aware sentence representation | Yes | Yes |
Study Notes – Python Examples of Text Vectorization Methods
- Count Vector (Bag of Words)

```python
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']  (single-character tokens like "I" are dropped by default)
print(X.toarray())
# [[0 0 1 1]
#  [1 1 0 1]]
```
- TF-IDF Vector

```python
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["I love NLP", "NLP is fun"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0.    0.    0.815 0.58 ]
#  [0.632 0.632 0.    0.449]]
# 'nlp' appears in both documents, so it gets a lower weight than the rarer words.
```
- One-Hot Encoding

```python
from sklearn.preprocessing import LabelBinarizer
words = ['NLP', 'fun', 'love']
encoder = LabelBinarizer()
one_hot = encoder.fit_transform(words)
print(encoder.classes_)
# ['NLP' 'fun' 'love']  (classes are sorted; uppercase sorts before lowercase)
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]
```

Note: This is word-level one-hot. For sentence-level, use a tokenizer + BoW with binary=True.
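A minimal sketch of that sentence-level variant, reusing the two example sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer(binary=True)  # 1 if the word occurs, regardless of count
X = vectorizer.fit_transform(texts)
print(X.toarray())
# [[0 0 1 1]
#  [1 1 0 1]]
```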
- Word Embeddings (using gensim Word2Vec)

```python
from gensim.models import Word2Vec
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=1, seed=1)
word_vec = model.wv['NLP']
print(word_vec)
# Output: 10-dimensional dense vector for 'NLP'
```
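Continuing from the block above, nearest neighbors can be queried from a trained model (on this two-sentence toy corpus the neighbors are not meaningful):

```python
print(model.wv.most_similar('NLP', topn=2))
# list of (word, cosine similarity) pairs
```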
Install Gensim if needed: pip install gensim
The Word2Vec architecture was introduced by Tomas Mikolov and colleagues; pretrained vectors such as GoogleNews-vectors-negative300.bin (300-dimensional vectors trained on Google News) can be loaded instead of training from scratch.
- Sentence Embeddings (using sentence-transformers)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["I love NLP", "NLP is fun"]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (2, 384) → each sentence is a 384-dimensional vector
```
Install with: pip install sentence-transformers