⬛⬜⬜⬜ Beginner
Build an end-to-end text processing pipeline: tokenize → normalize → tag → parse → output. Shows all stages in sequence.
NLTK
spaCy
Python
- Implement each stage as a function/module
- Pass raw text through each stage and inspect the output
- Visualize stage-by-stage transformation
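The stages above can be sketched with the standard library alone, before swapping in NLTK or spaCy. This is a minimal illustration, not a real tagger or parser: `TOY_LEXICON`, the `X` fallback tag, and the DET+NOUN chunking rule are all assumptions made up for this example.

```python
import re

def tokenize(text):
    # Split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # Lowercase every token; punctuation passes through unchanged.
    return [t.lower() for t in tokens]

# Hypothetical toy lexicon; a real tagger would be trained or come from a library.
TOY_LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
               "chased": "VERB", "sees": "VERB", ".": "PUNCT"}

def tag(tokens):
    # Look each token up in the toy lexicon; unknown tokens get the tag "X".
    return [(t, TOY_LEXICON.get(t, "X")) for t in tokens]

def parse(tagged):
    # Very shallow "parse": group adjacent DET+NOUN pairs into NP chunks.
    chunks, i = [], 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "DET" and tagged[i + 1][1] == "NOUN":
            chunks.append(("NP", [tagged[i][0], tagged[i + 1][0]]))
            i += 2
        else:
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

def run_pipeline(text):
    # Run every stage in order and keep each intermediate output for inspection.
    stages = {}
    stages["tokenize"] = tokenize(text)
    stages["normalize"] = normalize(stages["tokenize"])
    stages["tag"] = tag(stages["normalize"])
    stages["parse"] = parse(stages["tag"])
    return stages

for name, output in run_pipeline("The cat chased a dog.").items():
    print(f"{name:>9}: {output}")
```

Keeping every intermediate result in one dictionary makes the stage-by-stage transformation easy to print or plot; each function can later be replaced by its library counterpart without changing `run_pipeline`.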
⬛⬛⬜⬜ Intermediate
Write code to identify and illustrate lexical, syntactic, and semantic ambiguity in example sentences using parsing and word-sense tools.
spaCy
NLTK WordNet
- Detect words with multiple senses (lexical ambiguity)
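- Parse sentences and show multiple parse trees (spaCy returns a single parse per sentence, so this needs a parser that enumerates alternatives, such as an NLTK chart parser over a small ambiguous grammar)
- Log an ambiguity score per sentence (e.g., the number of senses or parses found)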
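A sketch of the lexical part, under stated assumptions: `TOY_SENSES` is a hypothetical hand-made sense inventory standing in for a WordNet lookup (with NLTK, `wordnet.synsets(word)` would supply the senses), and the score, the fraction of tokens with more than one sense, is just one possible definition.

```python
import re

# Hypothetical sense inventory, invented for illustration; it stands in for WordNet.
TOY_SENSES = {
    "bank": ["financial institution", "side of a river"],
    "bat": ["flying mammal", "club used in sports"],
    "saw": ["past tense of see", "cutting tool", "to cut with a saw"],
    "river": ["large natural stream of water"],
}

def words_of(sentence):
    # Lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", sentence.lower())

def ambiguous_words(sentence):
    # Map each word that has more than one listed sense to its senses.
    return {w: TOY_SENSES[w] for w in words_of(sentence)
            if len(TOY_SENSES.get(w, [])) > 1}

def ambiguity_score(sentence):
    # Assumed score: share of tokens that have more than one sense.
    words = words_of(sentence)
    if not words:
        return 0.0
    ambiguous = [w for w in words if len(TOY_SENSES.get(w, [])) > 1]
    return len(ambiguous) / len(words)

sentence = "I saw a bat near the bank."
for word, senses in ambiguous_words(sentence).items():
    print(word, "->", senses)
print("score:", round(ambiguity_score(sentence), 3))
```

Syntactic ambiguity is not covered here; it needs a parser that can return more than one tree for the same sentence.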
⬛⬛⬜⬜ Intermediate
Implement a simple task (e.g., spam detection or POS tagging) using both hand-written rules AND an ML model. Compare accuracy side by side.
scikit-learn
NLTK
pandas
- Write rule-based classifier with regex/keyword lists
- Train Naive Bayes or Logistic Regression model
- Compare accuracy, precision, recall on same test set
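A minimal side-by-side sketch, assuming a spam task. The training and test messages, the keyword pattern, and the from-scratch multinomial Naive Bayes are all invented for illustration; in the lab itself the model would more likely come from scikit-learn.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy data, for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click here", "spam"),
    ("claim your free reward", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("see you at lunch", "ham"),
    ("project report attached", "ham"),
]
TEST = [
    ("free prize inside", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("click to claim money", "spam"),
    ("report for the project", "ham"),
]

# Hand-written rule: any trigger keyword marks the message as spam.
SPAM_PATTERN = re.compile(r"\b(free|prize|winner)\b")

def rule_classify(text):
    return "spam" if SPAM_PATTERN.search(text.lower()) else "ham"

def train_nb(examples):
    # Count words per class and documents per class.
    word_counts, class_counts = defaultdict(Counter), Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def nb_classify(text, model):
    # Multinomial Naive Bayes with add-one smoothing; out-of-vocabulary words are skipped.
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(predict, data):
    return sum(predict(text) == label for text, label in data) / len(data)

model = train_nb(TRAIN)
rule_acc = accuracy(rule_classify, TEST)
nb_acc = accuracy(lambda text: nb_classify(text, model), TEST)
print(f"rules: {rule_acc:.2f}  naive bayes: {nb_acc:.2f}")
```

On this invented data the keyword rule misses "click to claim money" because it contains none of the trigger words, while the learned model picks it up from other spam vocabulary. Four test messages say nothing about real-world performance; the point is only that both classifiers are scored on the same test set.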
⬛⬛⬜⬜ Intermediate
Implement simple word-level translation using a bilingual dictionary, then contrast it with a pretrained neural MT model (e.g., a Helsinki-NLP MarianMT checkpoint).
transformers
HuggingFace
Python dict
- Build dictionary-lookup translator
- Use a pretrained MarianMT model for comparison
- Observe failure modes of both approaches
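The dictionary-lookup half can be sketched as follows. `EN_ES` is a hypothetical five-word English-to-Spanish dictionary made up for this example, and the angle-bracket marker for unknown words is an assumed convention. The neural half is not shown because it requires downloading a model.

```python
import re

# Hypothetical miniature English-to-Spanish dictionary, invented for illustration.
EN_ES = {"the": "el", "cat": "gato", "black": "negro", "eats": "come", "fish": "pescado"}

def translate_word_by_word(sentence, dictionary):
    # Translate each token independently; unknown words are kept and flagged.
    translated = []
    for token in re.findall(r"\w+", sentence.lower()):
        translated.append(dictionary.get(token, f"<{token}>"))
    return " ".join(translated)

print(translate_word_by_word("The black cat eats fish", EN_ES))
print(translate_word_by_word("The dog eats", EN_ES))
```

Two failure modes show up immediately: the first output keeps English word order ("el negro gato" rather than "el gato negro"), and the second leaves "dog" untranslated. Lookup also ignores agreement, inflection, and words with several translations.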
⬛⬛⬛⬜ Advanced
Build a simple extractive QA system: given a paragraph and a question, extract the answer span. Use either a retrieval-based pipeline or a pretrained model.
transformers
BERT/DistilBERT
HuggingFace
- Set up a QA pipeline with HuggingFace
- Feed custom context paragraphs
- Evaluate span extraction accuracy (Exact Match, F1)
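The evaluation step can be sketched independently of the model. The functions below follow the SQuAD-style convention of normalizing both strings before comparison; the function names are chosen for this example, and the handling of empty answers (scored 0.0 here) is an assumption.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    # 1 if the normalized strings are identical, else 0.
    return int(normalize_answer(prediction) == normalize_answer(truth))

def token_f1(prediction, truth):
    # Harmonic mean of token-level precision and recall after normalization.
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower."))  # 1
print(token_f1("in Paris, France", "Paris"))             # 0.5
```

Exact Match is all-or-nothing, while F1 gives partial credit when the predicted span is too long or too short, which is why the two are usually reported together.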
⬛⬛⬜⬜ Intermediate
Build a document retrieval engine: index a set of documents, accept a query, and return ranked results using TF-IDF similarity.
scikit-learn
NumPy
NLTK
- Preprocess and vectorize documents with TF-IDF
- Compute cosine similarity for query matching
- Return ranked document list
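A from-scratch sketch of the three steps, assuming a tiny in-memory corpus. `DOCS` is invented for illustration, and the smoothed IDF, ln((1 + n) / (1 + df)) + 1, is one common variant; scikit-learn's `TfidfVectorizer` would replace most of this code in the lab.

```python
import math
import re
from collections import Counter

# Invented three-document corpus, for illustration only.
DOCS = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Return (idf, vectors); each vector maps term -> term frequency * idf.
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, docs, idf, vectors):
    # Vectorize the query with the corpus idf, then rank documents by cosine similarity.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [(docs[i], round(score, 3)) for score, i in ranked if score > 0]

idf, vectors = build_index(DOCS)
for doc, score in search("the park", DOCS, idf, vectors):
    print(score, doc)
```

Query terms that never occur in the corpus are dropped, and documents with zero similarity are left out of the ranking. Note that "cat" and "cats" are different terms here; stemming or lemmatization in the preprocessing step would merge them.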
⬛⬛⬜⬜ Intermediate
Train a text classifier on a labeled dataset (e.g., 20 Newsgroups or AG News). Implement preprocessing, feature extraction, and evaluation.
scikit-learn
NLTK
datasets (HF)
- Preprocess text: lowercase, stop-word removal, stemming
- Extract Bag-of-Words or TF-IDF features
- Train and evaluate Naive Bayes / SVM classifier
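The preprocessing and feature-extraction steps can be sketched without any library. Everything here is a simplification: `STOP_WORDS` is a short illustrative list rather than NLTK's, and `crude_stem` is naive suffix stripping, not the Porter algorithm.

```python
import re
from collections import Counter

# Small illustrative stop-word list; NLTK ships a much longer one.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "of", "and", "to"}

def crude_stem(word):
    # Naive suffix stripping, NOT the Porter algorithm; keeps at least 3 characters.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, keep alphabetic tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def bag_of_words(docs):
    # Build a sorted vocabulary and one count vector per document.
    processed = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in processed for t in tokens})
    vectors = []
    for tokens in processed:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["The cat sat", "The cats sat and sat"])
print(vocab)    # ['cat', 'sat']
print(vectors)  # [[1, 1], [1, 2]]
```

The resulting count vectors are what a Naive Bayes or SVM classifier would be trained on; replacing raw counts with TF-IDF weights changes only the last step.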
⬛⬛⬜⬜ Intermediate
Implement extractive summarization (sentence scoring with TF-IDF) and contrast with abstractive summarization using a pretrained T5/BART model.
NLTK
transformers
rouge-score
- Score sentences by word importance (extractive)
- Use HuggingFace pipeline for abstractive summary
- Evaluate with ROUGE score
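One way to sketch the extractive half is to treat each sentence as its own document and score it by the mean TF-IDF weight of its tokens. The sentence splitter, the unsmoothed IDF, and the use of the mean rather than the sum are all assumptions of this example, not a fixed recipe.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, k=1):
    # Return the k highest-scoring sentences in their original order.
    sentences = split_sentences(text)
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        # Mean TF-IDF weight over the sentence's tokens.
        if not tokens:
            return 0.0
        tf = Counter(tokens)
        return sum(count * math.log(n / df[term]) for term, count in tf.items()) / len(tokens)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return [sentences[i] for i in chosen]

text = "Cats sleep. Cats sleep and cats purr loudly. Dogs bark."
print(summarize(text, k=2))
```

With the mean, sentences made of distinctive words are favored, which can promote an off-topic sentence; summing instead favors long sentences. Comparing both choices against the abstractive model's output is a useful part of the exercise.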
⬛⬜⬜⬜ Beginner
Classify text as positive/negative/neutral. Start with a lexicon-based approach (VADER), then move to a trained ML model.
NLTK VADER
scikit-learn
transformers
- Apply VADER to sample reviews, inspect compound score
- Train a logistic regression classifier on the IMDB dataset (its labels are binary, positive/negative, so the neutral class is not covered there)
- Compare lexicon vs ML accuracy
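To make the lexicon-based idea concrete without installing VADER, here is a toy scorer. `TOY_LEXICON` and its valence values are invented for illustration. The squashing function x / sqrt(x² + α) with α = 15 and the ±0.05 cutoffs are modeled on what VADER is commonly described as using; treat both as assumptions and check them against the library before relying on them.

```python
import math
import re

# Invented toy valence lexicon; VADER's real lexicon is far larger and human-rated.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "love": 3.2,
               "bad": -2.5, "terrible": -2.1, "boring": -1.3}

def compound_score(text, alpha=15):
    # Sum word valences, then squash the total into (-1, 1).
    total = sum(TOY_LEXICON.get(token, 0.0)
                for token in re.findall(r"[a-z]+", text.lower()))
    return total / math.sqrt(total * total + alpha)

def label(score, threshold=0.05):
    # Map a compound score to one of three classes.
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

for review in ["I love this great movie", "terrible and boring", "the movie starts at nine"]:
    score = compound_score(review)
    print(f"{score:+.3f}  {label(score)}  {review}")
```

This sketch ignores negation, intensifiers, and punctuation emphasis, all of which a full lexicon-based tool handles with extra rules. Those gaps are a good starting point for comparing it with a trained model.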
⬛⬜⬜⬜ Beginner
Implement and understand the core NLP evaluation metrics (accuracy, precision, recall, F1, BLEU, ROUGE), which are used across all Module 1 applications.
scikit-learn
nltk.translate
rouge-score
- Compute confusion matrix, precision, recall, F1
- Calculate BLEU score for MT output
- Calculate ROUGE for summarization output
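The metrics above reduce to a few counting operations, sketched here from scratch. Function names are chosen for this example. `clipped_precision` is only the modified n-gram precision that BLEU is built from (full BLEU also combines several n-gram orders and a brevity penalty), and `rouge_n_recall` is the recall form of ROUGE-N.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive):
    # Return (tp, fp, fn, tn) for one positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    # Share of candidate n-grams found in the reference, each counted at most
    # as often as it occurs in the reference.
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0

def rouge_n_recall(candidate, reference, n=1):
    # Share of reference n-grams that also appear in the candidate.
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], positive=1))  # (2, 1, 1, 1)
print(clipped_precision("the the the the".split(), "the cat is on the mat".split()))  # 0.5
print(rouge_n_recall("the cat sat".split(), "the cat sat on the mat".split()))  # 0.5
```

The clipping matters: without it, the candidate "the the the the" would score a perfect unigram precision against any reference containing "the". Results from these functions can be cross-checked against scikit-learn, `nltk.translate`, and rouge-score.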