NLP Module 1 — Coding Topics

Generic NLP Pipeline

NLP Pipeline

⬛⬜⬜⬜ Beginner

Build an end-to-end text-processing pipeline (tokenize → normalize → tag → parse → output) and show the result of every stage in sequence.

NLTK spaCy Python

Implement each stage as a function/module
Pass raw text through each stage and inspect the output
Visualize stage-by-stage transformation
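
The three tasks above can be sketched without any external library. This is only a minimal illustration: the lexicon `TOY_TAGS`, the stage functions, and the shallow noun-phrase chunker are all hypothetical stand-ins for what NLTK or spaCy would provide, and the "parse" stage is a simple grouping step, not a real parser.

```python
import re

# Toy lexicon for the tagging stage (an assumption for illustration);
# a real pipeline would use a trained tagger from NLTK or spaCy.
TOY_TAGS = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
            "mat": "NOUN", "sat": "VERB", "chased": "VERB", "on": "ADP"}

def tokenize(text):
    """Split raw text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    """Lowercase tokens and drop anything that is not alphanumeric."""
    return [t.lower() for t in tokens if t.isalnum()]

def tag(tokens):
    """Look each token up in the toy lexicon; unknown words get 'X'."""
    return [(t, TOY_TAGS.get(t, "X")) for t in tokens]

def chunk(tagged):
    """Very shallow 'parse': group runs of DET/NOUN tokens into NP chunks."""
    chunks, current = [], []
    for word, pos in tagged:
        if pos in ("DET", "NOUN"):
            current.append(word)
            continue
        if current:
            chunks.append(("NP", current))
            current = []
        chunks.append((pos, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks

def run_pipeline(text):
    """Run every stage in order and keep each intermediate result."""
    stages = {}
    stages["tokens"] = tokenize(text)
    stages["normalized"] = normalize(stages["tokens"])
    stages["tagged"] = tag(stages["normalized"])
    stages["chunks"] = chunk(stages["tagged"])
    return stages

if __name__ == "__main__":
    # Print the stage-by-stage transformation of one sentence.
    for name, output in run_pipeline("The cat sat on the mat.").items():
        print(f"{name:>10}: {output}")
```

Keeping every intermediate result in one dictionary makes it easy to inspect or plot how the text changes from stage to stage.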

Ambiguity Detection

Text Processing

⬛⬛⬜⬜ Intermediate

Write code to identify and illustrate lexical, syntactic, and semantic ambiguity in example sentences using parsing and word sense tools.

spaCy NLTK WordNet

Detect words with multiple senses (lexical ambiguity)
Parse sentences and show each alternative parse tree (syntactic ambiguity)
Log an ambiguity score per sentence (for example, the number of senses or parses found)
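
A dependency-free sketch of the lexical and syntactic parts of this exercise follows. Everything in it is an assumption made for illustration: `TOY_SENSES` stands in for a WordNet lookup (with NLTK, `wordnet.synsets(word)` could supply the senses), and `RULES` and `LEXICON` form a tiny hand-written grammar in Chomsky normal form. Parses are counted with a CKY-style chart rather than drawn as trees, and semantic ambiguity is not covered.

```python
from collections import defaultdict

# Hypothetical sense inventory standing in for WordNet.
TOY_SENSES = {
    "saw": ["past tense of see", "cutting tool"],
    "bank": ["financial institution", "river edge"],
    "man": ["adult male person"],
}

# Toy grammar in Chomsky normal form: (left child, right child) -> parent.
# Each right-hand side maps to a single parent in this small grammar.
RULES = {("NP", "VP"): "S", ("V", "NP"): "VP", ("VP", "PP"): "VP",
         ("NP", "PP"): "NP", ("Det", "N"): "NP", ("P", "NP"): "PP"}
LEXICON = {"i": ["NP"], "saw": ["V"], "the": ["Det"], "man": ["N"],
           "with": ["P"], "telescope": ["N"]}

def lexical_ambiguity(words):
    """Return {word: senses} for words with more than one listed sense."""
    return {w: TOY_SENSES[w] for w in words if len(TOY_SENSES.get(w, [])) > 1}

def count_parses(words, start="S"):
    """Count distinct parse trees with a CKY chart of (span -> symbol -> count)."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1)][symbol] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # every split point of the span
                for (left, right), parent in RULES.items():
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]

def ambiguity_report(sentence):
    """Log the ambiguous words and the number of parses for one sentence."""
    words = sentence.lower().rstrip(".").split()
    return {"ambiguous_words": sorted(lexical_ambiguity(words)),
            "parses": count_parses(words)}

if __name__ == "__main__":
    print(ambiguity_report("I saw the man with the telescope."))
```

The example sentence has two parses under this grammar because the prepositional phrase can attach either to the verb phrase or to the noun phrase.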

ML vs Rule-Based Comparison

ML Role

⬛⬛⬜⬜ Intermediate

Implement a simple task (e.g., spam detection or POS tagging) using both hand-written rules AND an ML model. Compare accuracy side by side.

scikit-learn NLTK pandas

Write rule-based classifier with regex/keyword lists
Train Naive Bayes or Logistic Regression model
Compare accuracy, precision, and recall on the same test set
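
The comparison can be sketched in pure Python as below. This is a toy under stated assumptions: the keyword list, the six training messages, and the four test messages are invented for illustration, and `TinyNaiveBayes` is a hand-rolled multinomial Naive Bayes with add-one smoothing standing in for a scikit-learn model. Scores on four examples say nothing general about either approach.

```python
import math
from collections import Counter

SPAM_KEYWORDS = {"free", "winner", "prize", "click"}  # hypothetical rule list

# Invented toy data: (text, label) with 1 = spam, 0 = not spam.
TRAIN = [("win a free prize now", 1), ("click here for free money", 1),
         ("cheap loans click now", 1), ("meeting at noon tomorrow", 0),
         ("lunch with the team tomorrow", 0), ("project report due friday", 0)]
TEST = [("free prize inside click now", 1), ("cheap money loans", 1),
        ("team meeting friday", 0), ("feel free to call tomorrow", 0)]

def rule_based_predict(text):
    """Flag a message as spam if it contains any keyword."""
    return int(any(w in SPAM_KEYWORDS for w in text.lower().split()))

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)
            for w in text.lower().split():
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[c][w] + 1)
                                      / (total_words + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

def evaluate(y_true, y_pred):
    """Accuracy, precision, and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {"accuracy": correct / len(pairs),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0}

def compare(train, test):
    """Score both classifiers on the same held-out test set."""
    texts, labels = zip(*train)
    model = TinyNaiveBayes().fit(texts, labels)
    y_true = [label for _, label in test]
    return {"rules": evaluate(y_true, [rule_based_predict(t) for t, _ in test]),
            "naive_bayes": evaluate(y_true, [model.predict(t) for t, _ in test])}

if __name__ == "__main__":
    for name, scores in compare(TRAIN, TEST).items():
        print(name, scores)
```

The test set deliberately includes a spam message with no keyword and a harmless message containing "free", the two situations where a fixed keyword list tends to fail.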

Machine Translation (Demo)

Application

⬛⬛⬜⬜ Intermediate

Implement a simple word-level translator using a bilingual dictionary, then contrast it with a pretrained neural MT model (Helsinki-NLP).

transformers HuggingFace Python dict

Build dictionary-lookup translator
Use a pretrained MarianMT model for comparison
Observe failure modes of both approaches
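
The dictionary-lookup half of this exercise might look like the sketch below. The English-to-German dictionary `EN_DE` is a hypothetical five-word toy, and the function name and the `<word>` marker for unknown words are illustrative choices. The neural comparison, which would load a pretrained Helsinki-NLP checkpoint through the transformers library, is not shown here.

```python
# Hypothetical toy dictionary (English -> German), for illustration only.
EN_DE = {"i": "ich", "have": "habe", "seen": "gesehen",
         "the": "der", "dog": "Hund"}

def translate_word_by_word(sentence, dictionary):
    """Translate each word independently; return the output and unknown words."""
    translated, unknown = [], []
    for word in sentence.lower().split():
        if word in dictionary:
            translated.append(dictionary[word])
        else:
            translated.append(f"<{word}>")  # mark out-of-vocabulary words
            unknown.append(word)
    return " ".join(translated), unknown

if __name__ == "__main__":
    # Word order and case are wrong: fluent German would be
    # "Ich habe den Hund gesehen".
    print(translate_word_by_word("I have seen the dog", EN_DE))
    # Out-of-vocabulary words cannot be translated at all.
    print(translate_word_by_word("I have seen the cat", EN_DE))
```

Both calls expose typical failure modes of word-level lookup: the first keeps English word order and ignores grammatical case, and the second has no entry for "cat".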

Question Answering System

Application

⬛⬛⬛⬜ Advanced

Build a simple extractive QA system: given a paragraph and a question, extract the answer span. Use a retrieval-based pipeline or a pretrained model.

transformers BERT/DistilBERT HuggingFace

Set up a QA pipeline with HuggingFace
Feed custom context paragraphs
Evaluate span extraction accuracy (Exact Match, F1)
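
The evaluation step can be sketched independently of the model. The functions below follow the commonly used SQuAD-style recipe (lowercase, strip punctuation and English articles, then compare), but the function names and the example strings are assumptions for illustration; the predicted spans would come from whatever QA pipeline is being evaluated.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("in Paris, France", "Paris"))
```

Exact Match rewards only a perfect span, while token F1 gives partial credit when the predicted span is longer or shorter than the reference.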

Information Retrieval (TF-IDF)

Application

⬛⬛⬜⬜ Intermediate

Build a document retrieval engine: index a set of documents, accept a query, and return ranked results using TF-IDF similarity.

scikit-learn NumPy NLTK

Preprocess and vectorize documents with TF-IDF
Compute cosine similarity for query matching
Return ranked document list
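
The three steps above can be sketched from scratch as follows. This is a minimal illustration rather than the scikit-learn implementation: the whitespace tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, the sparse-dictionary vectors, and the three sample documents are all choices made for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; terms outside the index are dropped."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}

def build_index(docs):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [tfidf_vector(doc, idf) for doc in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, docs, idf, vectors):
    """Return (document, score) pairs with non-zero similarity, best first."""
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored if score > 0]

if __name__ == "__main__":
    docs = ["the cat sat on the mat",
            "dogs chase cats in the park",
            "stock markets fell sharply today"]
    idf, vectors = build_index(docs)
    for doc, score in search("cat on the mat", docs, idf, vectors):
        print(f"{score:.3f}  {doc}")
```

Because no stemming is applied, "cat" and "cats" are different terms here; the second document matches the query only through the shared word "the".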

Text Categorization

Application

⬛⬛⬜⬜ Intermediate

Train a text classifier on a labeled dataset (e.g., 20 Newsgroups or AG News). Implement preprocessing, feature extraction, and evaluation.

scikit-learn NLTK datasets (HF)

Preprocess text: lowercase, stop-word removal, stemming
Extract Bag-of-Words or TF-IDF features
Train and evaluate Naive Bayes / SVM classifier
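
The preprocessing and Bag-of-Words steps can be sketched as below. The stop-word list is a tiny illustrative sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as NLTK's Porter stemmer; neither should be mistaken for the library versions. Classifier training is left out of this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to"}

def crude_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    processed = [Counter(preprocess(d)) for d in docs]
    vocab = sorted({term for counts in processed for term in counts})
    matrix = [[counts[term] for term in vocab] for counts in processed]
    return vocab, matrix

if __name__ == "__main__":
    print(preprocess("The players are scoring goals in the matches"))
    vocab, matrix = bag_of_words(["Goals and more goals", "The match ended"])
    print(vocab)
    print(matrix)
```

The count vectors in `matrix` share one column order given by `vocab`, which is the shape a Naive Bayes or SVM classifier would expect as input.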

Text Summarization

Application

⬛⬛⬜⬜ Intermediate

Implement extractive summarization (sentence scoring with TF-IDF) and contrast with abstractive summarization using a pretrained T5/BART model.

NLTK transformers rouge-score

Score sentences by word importance (extractive)
Use HuggingFace pipeline for abstractive summary
Evaluate with ROUGE score
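
The extractive step can be sketched as follows. Note the simplification: word importance here is plain document-level frequency after stop-word removal, not full TF-IDF, and the stop-word list, the regex sentence splitter, and the sample text are assumptions for illustration. The abstractive model and ROUGE evaluation are not part of this sketch.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "into", "on",
              "and", "to", "my"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        # Average frequency of the sentence's content words.
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

if __name__ == "__main__":
    text = ("Solar power is growing fast. "
            "Solar panels convert sunlight into power. "
            "My neighbor owns a cat.")
    print(summarize(text, n_sentences=1))
```

Averaging the word scores instead of summing them keeps long sentences from winning on length alone.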

Sentiment Analysis

Application

⬛⬜⬜⬜ Beginner

Classify text as positive/negative/neutral. Start with a lexicon-based approach (VADER), then move to a trained ML model.

NLTK VADER scikit-learn transformers

Apply VADER to sample reviews, inspect compound score
Train a logistic regression classifier on the IMDB dataset
Compare lexicon-based vs. ML accuracy
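
To show the idea behind a lexicon-based scorer, here is a toy sketch. It is not VADER: the lexicon values, the negation rule (a negation word flips the next sentiment word), the squashing constant `ALPHA`, and the ±0.05 cut-offs are all assumptions chosen for illustration.

```python
import math

# Hypothetical valence lexicon; real lexicons are much larger.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "love": 3.0,
               "bad": -2.0, "terrible": -3.0, "boring": -1.5}
NEGATIONS = {"not", "never", "no"}
ALPHA = 15  # assumed constant controlling how fast the score saturates

def compound_score(text):
    """Sum word valences (with simple negation) and squash into (-1, 1)."""
    total, negate = 0.0, False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        valence = TOY_LEXICON.get(word, 0.0)
        if valence:
            total += -valence if negate else valence
            negate = False  # a negation applies to one sentiment word only
    return total / math.sqrt(total * total + ALPHA)

def classify(text, threshold=0.05):
    """Map the compound score to positive / negative / neutral."""
    score = compound_score(text)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

if __name__ == "__main__":
    for review in ["I love this great movie!", "Not good.",
                   "The film is about trains."]:
        print(f"{compound_score(review):+.3f}  {classify(review)}  {review}")
```

A lexicon scorer like this needs no training data, which is what makes it a useful baseline before training a classifier.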

NLP Evaluation Metrics

Evaluation

⬛⬜⬜⬜ Beginner

Implement and understand the core NLP evaluation metrics: accuracy, precision, recall, F1, BLEU, and ROUGE. These metrics are used across all Module 1 applications.

scikit-learn nltk.translate rouge-score

Compute confusion matrix, precision, recall, F1
Calculate BLEU score for MT output
Calculate ROUGE for summarization output
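
The metrics listed above can be written out by hand to see what the library functions compute. The sketch below is a simplified, single-reference version under stated assumptions: `bleu` uses uniform weights, whitespace tokens, no smoothing, and defaults to bigrams, and `rouge1_recall` covers only unigram recall. In practice scikit-learn, `nltk.translate`, and rouge-score would be used instead.

```python
import math
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, read off the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    tp = cm[(positive, positive)]
    fp = sum(c for (t, p), c in cm.items() if p == positive and t != positive)
    fn = sum(c for (t, p), c in cm.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Brevity penalty times the geometric mean of clipped n-gram precisions."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        matches = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        total = sum(cand_ngrams.values())
        if matches == 0 or total == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_sum += math.log(matches / total) / max_n
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_sum)

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

if __name__ == "__main__":
    y_true = ["spam", "spam", "ham", "ham", "spam"]
    y_pred = ["spam", "ham", "ham", "spam", "spam"]
    print(precision_recall_f1(y_true, y_pred, positive="spam"))
    print(bleu("the cat is on the mat", "the cat sat on the mat"))
    print(rouge1_recall("the cat is on the mat", "the cat sat on the mat"))
```

BLEU is precision-oriented (how much of the candidate is supported by the reference), while ROUGE recall asks how much of the reference the candidate recovers, which is why the two are paired with translation and summarization respectively.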