TOPIC 01
Inflectional & Derivational Morphology
Morphology
⬛⬜⬜⬜ Beginner
Analyse English word forms — identify stems, affixes, inflections (plays→play+s) and derivations (happy→happiness). Use NLTK WordNet + MorphyFeedback.
NLTK
WordNet
spaCy
- Classify inflectional vs derivational word pairs
- Extract morpheme boundaries using NLTK morphy()
- Build a morphological analysis table for a word list
TOPIC 02
Regular Expressions for NLP
Regex
⬛⬜⬜⬜ Beginner
Write Python re patterns for real NLP tasks: tokenization, date/email extraction, phone number normalization, and pattern-based morpheme detection.
Python re
NLTK RegexpTokenizer
- Extract emails, dates, phone numbers from raw text
- Build a regex-based word tokenizer (handle contractions)
- Detect inflectional suffixes (-ing, -ed, -s, -er) via patterns
- Implement regex-based sentence boundary detection
TOPIC 03
Finite Automata (DFA & NFA)
Automata
⬛⬛⬜⬜ Intermediate
Implement a Deterministic Finite Automaton (DFA) and Non-deterministic FA (NFA) in Python. Use them to recognise morphological patterns (e.g., valid English plurals).
Python (pure)
networkx
matplotlib
- Build a DFA class with states, transitions, accept states
- Recognise simple patterns: numbers, dates, plural -s/-es/-ies
- Simulate NFA with epsilon transitions
- Visualize FA state diagram using networkx
TOPIC 04
Finite State Transducers (FST)
FST
⬛⬛⬛⬜ Advanced
Implement an FST that maps between surface forms and underlying morphological forms. E.g., "foxes" ↔ "fox+N+PL". Understand input/output tape duality.
Python (pure)
graphviz
fst library (optional)
- Implement FST as a state-transition table with I/O pairs
- Transduce: "fox" → "foxes", "play" → "played"
- Run transducer in both directions (analysis & generation)
- Visualize FST as a labeled directed graph
TOPIC 05
Morphological Parsing with FST
FST
⬛⬛⬛⬜ Advanced
Use the FST framework to parse words into morpheme sequences. Input: surface word → Output: lemma + morphological tags (POS, tense, number, person).
Python (pure)
NLTK
spaCy morphology
- Parse "walked" → walk + VERB + PAST
- Parse "happiest" → happy + ADJ + SUPERLATIVE
- Handle orthographic alternations (y→i: "happy→happier")
- Compare FST output with spaCy's built-in morphologizer
TOPIC 06
Lexicon-Free FST
FST
⬛⬛⬛⬜ Advanced
Build a morphological analyser that works without a pre-defined word list — using only phonological/orthographic rules encoded in the transducer itself.
Python (pure)
re
- Implement suffix-stripping rules as FST transitions
- Handle spelling changes: doubling (run→running), e-drop (make→making)
- Apply rules in cascade: surface → intermediate → underlying
- Test on unseen words not in any dictionary
TOPIC 07
Porter Stemmer
Morphology
⬛⬜⬜⬜ Beginner
Implement the Porter Stemmer algorithm from scratch (5 rule phases). Compare with NLTK's built-in implementation and with Snowball/Lancaster stemmers.
Python (pure)
NLTK
re
- Implement all 5 Porter phases (Step 1a, 1b, 1c, 2, 3, 4, 5)
- Verify with test cases from Porter's original paper
- Compare Porter vs Snowball vs Lancaster on word lists
- Measure over-stemming and under-stemming errors
TOPIC 08
N-Gram Language Model
N-Grams
⬛⬛⬜⬜ Intermediate
Build Unigram, Bigram, and Trigram language models from a corpus. Compute MLE probabilities, apply Laplace smoothing, and calculate sentence perplexity.
Python (pure)
NLTK
collections
numpy
- Count n-grams from tokenized corpus
- Compute P(word | context) with MLE + Laplace smoothing
- Generate text by sampling from the model
- Calculate perplexity on held-out test sentences
TOPIC 09
N-Gram Spelling Correction
N-Grams
⬛⬛⬜⬜ Intermediate
Implement a noisy channel spelling corrector combining an edit-distance error model with an n-gram language model. Rank candidates by P(word) × P(error|word).
Python (pure)
NLTK
difflib
numpy
- Generate candidate corrections via edit distance (insert/delete/sub)
- Score candidates with unigram language model probability
- Re-rank using bigram context (noisy channel model)
- Evaluate on a misspelled word test set