This project extends an address dictionary from a simple linked list into a fully fledged Patricia trie spellchecker. It was developed for the University of Melbourne subject COMP20003 – Algorithms and Data Structures, Assignment 2 Stages 2 & 3. The repository now contains a polished, high-distinction submission: production-ready C code, reproducible experimentation scripts, and the final complexity report.
- Patricia trie core (
dict2) – radix-2 compressed trie storing bit-level stems, stable duplicate ordering, exact match or closest recommendation via edit distance. - Spellchecker semantics – when a query misses, the trie gathers the mismatch subtree, evaluates Levenshtein distances, and returns the closest key (alphabetical tie-breaks), matching the assignment spec.
- Operation counting – instrumentation reports bit comparisons (
b), node accesses (n), and string/edit comparisons (s) on every query. This powers the Stage 3 analysis without timing noise. - Linked-list baseline (
dict1) – Stage 1 implementation retained for direct comparison and experimentation. - Stage 3 analytics tooling – scripts generate aggregated metrics, prefix sensitivity studies, and publication-ready charts used in the submitted report.
- Memory safety – all allocations checked, the trie owns its records, and valgrind/AddressSanitizer builds are integrated into the Makefile.
- Algorithm done at the bit level with MSB‑first semantics, including the final
\0byte in every key. This makes prefix reasoning exact and eliminates string‑boundary edge cases that often break split logic in Patricia tries. - Split‑safe insertion that carefully reattaches existing children and data to the “old suffix” node. This avoids duplicated data, preserves invariants, and guarantees stable duplicate ordering.
- A research‑grade, machine‑independent counting policy (b/n/s) aligned with the reference solution. Counts are produced by the same code that performs the work, not by ad‑hoc log statements.
- Single‑pass CSV ingestion with precise output formatting to match golden files byte‑for‑byte, including
x/ynumeric rounding rules. - Clean ownership model: the trie owns records, query results borrow; teardown does a deep free. This is why valgrind shows 0 leaks on the largest suite.
-
Bit‑accurate keys and stems
- Challenge: Patricia tries branch on bits, not characters. Getting bit indices wrong by even 1 bit corrupts structure and search.
- Solution:
src2/bit.cimplementsgetBit()with MSB‑first indexing;src2/stem.cbuilds exact‑length stems with zero‑padded tails. Keys are measured as(strlen(key)+1)*8bits to include the trailing\0sentinel.
-
Node splitting without breaking invariants
- Challenge: When two keys diverge inside a node’s stem, we must create a
commonnode, move the old node’s children/data under anoldnode, and attach anewnodefor the inserted key. It’s easy to lose children or duplicate data. - Solution:
src2/ptree.c:ptree_insert()computes the LCP in bits, allocates three stems, moves children/data intooldnode, appends the new record tonewnode, and wires them undercommonaccording to their first bit. The parent pointer (or root) is updated atomically before the old shell is freed.
- Challenge: When two keys diverge inside a node’s stem, we must create a
-
Exact vs closest‑match queries
- Challenge: A miss should return the best candidate from the mismatch subtree only, not from the whole dictionary; ties must choose the alphabetically earliest key and then return all of its records in stable order.
- Solution:
src2/ptree.c:ptree_query()walks the search path counting work. On mismatch it collects candidate leaves with a small DFS (collect_candidates()), also considering the mismatch node itself when it stores data. It computeseditDistance()per candidate, incrementssfor each call, and usesstrcmpfor tie‑breaks.
-
Output fidelity
- Challenge: The golden files require exact byte‑for‑byte matches, especially for
x/yfields. - Solution:
src2/printrec.ccentralizes formatting;x/yare printed as long doubles with 5 decimals when present.
- Challenge: The golden files require exact byte‑for‑byte matches, especially for
-
Memory correctness under scale
- Challenge: A deep trie with many duplicates stresses ownership and free order.
- Solution: Records are owned only at leaf/exact nodes via
rlist_t;matches_tholds borrowed pointers.ptree_free()recursively frees children, stem, data list (deep‑freeing records), then the node. ASan and valgrind targets are in the Makefile.
b(bit comparisons): count only bits compared during LCP checks along the search path; do not count the single bit used to choose an outgoing edge.n(node accesses): count nodes visited on the search path only; the post‑mismatch DFS to enumerate candidates is excluded by design to keep counts operation‑based and comparable.s(string‑level comparisons): count eacheditDistance()call; add one for exact matches for parity with reference numbers; add one forstrcmptie‑breaks.
All increments happen in the same code that performs the work: lcp_mem_vs_key() and ptree_query() in src2/ptree.c. The linked‑list baseline mirrors the policy (bitwise comparator per node, one string comparison per node).
- Every internal node has a non‑zero stem; siblings start with different first bits.
- If
dhead != NULL, the node’s stem equals a complete key (includes\0). - Duplicates are appended to the same node’s record list and preserve CSV order.
- File diffs:
make testcompares outputs withmatching_results/and passes across the 1/2/22/1067 suites. - Memory:
make asan-testandmake valgrindreport no leaks or invalid accesses on large datasets. - Metrics:
report/metrics_both.csvshows the expected separation between LL and PT;scripts/prefix_study.pyconfirms that longer query prefixes shrink candidate subtrees and drives → 1.
- Full report: COMP20003 — Assignment 2 Stage 3 Report.md (source) and COMP20003 — Assignment 2 Stage 3 Report.pdf (print) at the repo root. See also the generated data under
report/. - Methodology: operation‑based counts (
bbits,nnodes,sstrings) on the official suites 1/2/22/1067; plus a prefix‑length study (2,5,8,12,16,20 chars). Counts are platform‑independent and come from the instrumented code path. - Key results (averages per query):
- dataset_1067 — PT:
b≈297,n≈16.75,s=1.00; LL:b≈8481,n=1067,s=1067(report/metrics_both.csv:5). - dataset_22 — PT:
b≈274.7,n≈2.67,s=1.00; LL:b≈2092.3,n=22,s=22(report/metrics_both.csv:3).
- dataset_1067 — PT:
- Prefix study (report/prefix/prefix_metrics.csv:1): with only 2‑char prefixes PT temporarily evaluates multiple candidates (
s≈35), but by 5+ charss→1while LL remains ats=1067regardless of prefix. - Figures: triptych charts under
report/figs/andreport/prefix/(log‑y variants included) visually demonstrate the asymptotic gap: PT scales with path depth (L') while LL scales withN.
make report
# metrics: report/metrics_both.csv
# plots: report/figs/*.png and report/prefix/*.pngThe command runs dict2/dict1 over all suites, aggregates b/n/s, and draws the bars+lines figures used in the write‑up.
- Built a bit‑level Patricia trie with precise split logic and a principled counting framework to evaluate data‑structure efficiency independent of hardware.
- Engineered a reproducible experimentation stack (shell + Python) to generate metrics and publication‑ready plots; wrote the complexity report tying theory to empirical results.
- Achieved perfect output fidelity and zero‑leak memory safety on thousand‑record datasets.
- Swap‑in a vtable‐style
dict_apito run LL/PT under a unified CLI. - Persist the trie or serialize stems for faster warm starts.
- Parallelize candidate edit‑distance on large mismatch subtrees (kept single‑threaded here to match marking environment).
├── src2/ # Patricia trie implementation and shared modules
│ ├── main2.c # CLI wrapper for dict2
│ ├── ptree.[ch] # trie insert/query/free + edit-distance matching
│ ├── csv2.[ch] # single-pass CSV loader (headers + records)
│ ├── record.[ch] # record_t model, matches_t, stats_t
│ ├── bit.[ch], stem.[ch], editdist.[ch], printrec.[ch]
├── src1/ # Stage 1 linked-list dictionary + bit comparator
├── scripts/ # Automation (tests, metrics, plots, prefix study)
├── tests/ # Author-provided and custom datasets/query suites
├── matching_results/# Golden outputs from the course staff
├── report/ # Generated metrics, plots, LaTeX/Word templates
│ └── ... + final Stage 3 write-up (PDF & Markdown)
├── Makefile
└── dict2 / *.o / REPORT.pdf ... (built or generated artefacts)
Tip – run
make cleanbefore committing or packaging to removedict2, object files, and regenerated CSV/plot artefacts.
# Patricia trie program (Stage 2)
make dict2
./dict2 2 tests/dataset_22.csv output.txt < tests/test22.in > output.stdout.out
# Linked-list baseline (Stage 1) for comparisons
make dict1
./dict1 1 tests/dataset_22.csv output1.txt < tests/test22.in > output1.stdout.out| Command | Purpose |
|---|---|
make test |
Runs scripts/run_tests.sh (diffs outputs vs golden files). |
make report |
Rebuilds metrics, plots, and prefix-study data under report/. |
make asan-test |
AddressSanitizer build + smoke test (useful off Ed). |
make valgrind |
Full leak check on dataset_1067 (requires local valgrind). |
- Official suites live under
tests/with matching golden outputs inmatching_results/. scripts/run_tests.shexecutes the regression pipeline, showing file diffs and informative stdout comparisons.scripts/review_tests.shbuilds bespoke micro-cases (duplicates, tie-breaks, NOTFOUND behaviour, etc.) to sanity-check tricky edge conditions.- Custom inputs and expected outputs used during development are in
tests/custom/.
- Run
make reportto populate:report/metrics_both.csv– aggregated averages for linked list vs trie.report/figs/*.png– grouped bars + lines (log-linear variants included).report/prefix/– prefix-length study data and charts.
- The final write-up is available in both English and Chinese:
COMP20003 — Assignment 2 Stage 3 Report.md(source)COMP20003 — Assignment 2 Stage 3 Report.pdf(submitted version)REPORT_CN.md(optional Mandarin summary)
- LaTeX/Word templates (
report/latex,report/word) document how the report was formatted for submission.
- Keys are stored as byte arrays including the trailing
\0. Bit access is MSB-first, ensuring consistent stems and comparisons. - Duplicate addresses stay in CSV order thanks to per-node record lists.
- The trie counts only search-path node visits and prefix bit comparisons; the mismatch candidate DFS is excluded by design to mirror the reference policy.
- Edit distance uses a rolling-row Levenshtein implementation, ensuring tight memory bounds.
- Built a bit-level Patricia trie spellchecker that delivers exact or closest-match results while tracking analytic counters for each query.
- Automated a metrics pipeline (bash + Python/matplotlib) to compare linked-list vs trie behaviour across real datasets and generated prefix workloads, feeding directly into the Stage 3 complexity report.
- Ensured memory correctness on large datasets via valgrind/ASan and structured ownership of 35-field records.
- Binary artefacts in the repo (
dict2,*.o,REPORT.pdf,home.zip, plot PNGs) are retained for completeness; regenerate them with the provided Makefile and scripts if required. - The project targets POSIX/C99 environments. Tested locally with
ccand on the university’s Ed platform.
Feel free to reach out if you need clarification on any module or would like an abridged excerpt for a portfolio. Enjoy exploring the project!