NLP Module 2 — Coding Topics Map

TOPIC 01

Inflectional & Derivational Morphology

Morphology

⬛⬜⬜⬜ Beginner

Analyse English word forms — identify stems, affixes, inflections (plays→play+s) and derivations (happy→happiness). Use NLTK WordNet + MorphyFeedback.

NLTK WordNet spaCy

Classify inflectional vs derivational word pairs
Extract morpheme boundaries using NLTK morphy()
Build a morphological analysis table for a word list

TOPIC 02

Regular Expressions for NLP

Regex

⬛⬜⬜⬜ Beginner

Write Python re patterns for real NLP tasks: tokenization, date/email extraction, phone number normalization, and pattern-based morpheme detection.

Python re NLTK RegexpTokenizer

Extract emails, dates, phone numbers from raw text
Build a regex-based word tokenizer (handle contractions)
Detect inflectional suffixes (-ing, -ed, -s, -er) via patterns
Implement regex-based sentence boundary detection

TOPIC 03

Finite Automata (DFA & NFA)

Automata

⬛⬛⬜⬜ Intermediate

Implement a Deterministic Finite Automaton (DFA) and Non-deterministic FA (NFA) in Python. Use them to recognise morphological patterns (e.g., valid English plurals).

Python (pure) networkx matplotlib

Build a DFA class with states, transitions, accept states
Recognise simple patterns: numbers, dates, plural -s/-es/-ies
Simulate NFA with epsilon transitions
Visualize FA state diagram using networkx

TOPIC 04

Finite State Transducers (FST)

FST

⬛⬛⬛⬜ Advanced

Implement an FST that maps between surface forms and underlying morphological forms. E.g., "foxes" ↔ "fox+N+PL". Understand input/output tape duality.

Python (pure) graphviz fst library (optional)

Implement FST as a state-transition table with I/O pairs
Transduce: "fox" → "foxes", "play" → "played"
Run transducer in both directions (analysis & generation)
Visualize FST as a labeled directed graph

TOPIC 05

Morphological Parsing with FST

FST

⬛⬛⬛⬜ Advanced

Use the FST framework to parse words into morpheme sequences. Input: surface word → Output: lemma + morphological tags (POS, tense, number, person).

Python (pure) NLTK spaCy morphology

Parse "walked" → walk + VERB + PAST
Parse "happiest" → happy + ADJ + SUPERLATIVE
Handle orthographic alternations (y→i: "happy→happier")
Compare FST output with spaCy's built-in morphologizer

TOPIC 06

Lexicon-Free FST

FST

⬛⬛⬛⬜ Advanced

Build a morphological analyser that works without a pre-defined word list — using only phonological/orthographic rules encoded in the transducer itself.

Python (pure) re

Implement suffix-stripping rules as FST transitions
Handle spelling changes: doubling (run→running), e-drop (make→making)
Apply rules in cascade: surface → intermediate → underlying
Test on unseen words not in any dictionary

TOPIC 07

Porter Stemmer

Morphology

⬛⬜⬜⬜ Beginner

Implement the Porter Stemmer algorithm from scratch (5 rule phases). Compare with NLTK's built-in implementation and with Snowball/Lancaster stemmers.

Python (pure) NLTK re

Implement all 5 Porter phases (Step 1a, 1b, 1c, 2, 3, 4, 5)
Verify with test cases from Porter's original paper
Compare Porter vs Snowball vs Lancaster on word lists
Measure over-stemming and under-stemming errors

TOPIC 08

N-Gram Language Model

N-Grams

⬛⬛⬜⬜ Intermediate

Build Unigram, Bigram, and Trigram language models from a corpus. Compute MLE probabilities, apply Laplace smoothing, and calculate sentence perplexity.

Python (pure) NLTK collections numpy

Count n-grams from tokenized corpus
Compute P(word | context) with MLE + Laplace smoothing
Generate text by sampling from the model
Calculate perplexity on held-out test sentences

TOPIC 09

N-Gram Spelling Correction

N-Grams

⬛⬛⬜⬜ Intermediate

Implement a noisy channel spelling corrector combining an edit-distance error model with an n-gram language model. Rank candidates by P(word) × P(error|word).

Python (pure) NLTK difflib numpy

Generate candidate corrections via edit distance (insert/delete/sub)
Score candidates with unigram language model probability
Re-rank using bigram context (noisy channel model)
Evaluate on a misspelled word test set

Recommended Teaching Sequence

Topic 02
Regex

start here

→

Topic 01
Morphology

concepts

→

Topic 07
Porter Stemmer

apply it

→

Topic 03
Finite Automata

theory

→

Topic 04
FST

extend FA

→

Topic 05
Morph. Parsing

apply FST

→

Topic 06
Lexicon-Free FST

extend

→

Topic 08
N-Gram LM

statistics

→

Topic 09
Spell Correct

apply LM

TOPIC	CATEGORY	PRIMARY LIBRARY	DIFFICULTY	BUILDS ON
01 · Morphology Analysis	Morphology	NLTK, WordNet	⬛⬜⬜⬜	Module 1 Pipeline
02 · Regular Expressions	Regex	Python re	⬛⬜⬜⬜	Python basics
03 · Finite Automata	Automata	Pure Python	⬛⬛⬜⬜	Topic 02 (Regex)
04 · Finite State Transducers	FST	Pure Python	⬛⬛⬛⬜	Topic 03 (FA)
05 · Morphological Parsing	FST	NLTK + spaCy	⬛⬛⬛⬜	Topics 01 + 04
06 · Lexicon-Free FST	FST	Pure Python	⬛⬛⬛⬜	Topic 05
07 · Porter Stemmer	Morphology	NLTK	⬛⬜⬜⬜	Topics 01 + 02
08 · N-Gram Language Model	N-Grams	NLTK + NumPy	⬛⬛⬜⬜	Module 1 basics
09 · N-Gram Spell Correction	N-Grams	NLTK + difflib	⬛⬛⬜⬜	Topic 08