PDGGK/ads-project2
Patricia Tree Spellchecker (COMP20003 Assignment 2)

This project extends an address dictionary from a simple linked list into a fully-fledged Patricia trie spellchecker. It was developed for the University of Melbourne subject COMP20003 – Algorithms and Data Structures, Assignment 2 Stages 2 & 3. The repository now contains a polished, high-distinction submission: production-ready C code, reproducible experimentation scripts, and the final complexity report.

Highlights

  • Patricia trie core (dict2) – radix-2 compressed trie storing bit-level stems, stable duplicate ordering, exact match or closest recommendation via edit distance.
  • Spellchecker semantics – when a query misses, the trie gathers the mismatch subtree, evaluates Levenshtein distances, and returns the closest key (alphabetical tie-breaks), matching the assignment spec.
  • Operation counting – instrumentation reports bit comparisons (b), node accesses (n), and string/edit comparisons (s) on every query. This powers the Stage 3 analysis without timing noise.
  • Linked-list baseline (dict1) – Stage 1 implementation retained for direct comparison and experimentation.
  • Stage 3 analytics tooling – scripts generate aggregated metrics, prefix sensitivity studies, and publication-ready charts used in the submitted report.
  • Memory safety – all allocations checked, the trie owns its records, and valgrind/AddressSanitizer builds are integrated into the Makefile.

Why This Project Stands Out

  • The algorithm works at the bit level with MSB‑first semantics, including the final \0 byte of every key. This makes prefix reasoning exact and eliminates the string‑boundary edge cases that often break split logic in Patricia tries.
  • Split‑safe insertion that carefully reattaches existing children and data to the “old suffix” node. This avoids duplicated data, preserves invariants, and guarantees stable duplicate ordering.
  • A research‑grade, machine‑independent counting policy (b/n/s) aligned with the reference solution. Counts are produced by the same code that performs the work, not by ad‑hoc log statements.
  • Single‑pass CSV ingestion with precise output formatting to match golden files byte‑for‑byte, including x/y numeric rounding rules.
  • Clean ownership model: the trie owns records, query results borrow; teardown does a deep free. This is why valgrind shows 0 leaks on the largest suite.
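
The x/y rounding rule mentioned above can be sketched as follows; the function name and buffer handling are illustrative, not the actual src2/printrec.c API:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: x/y are long doubles printed with exactly 5
 * decimal places to match the golden files byte-for-byte. The real
 * formatting lives in src2/printrec.c and its field layout may differ. */
static void formatCoords(char *buf, size_t n, long double x, long double y) {
    snprintf(buf, n, "%.5Lf %.5Lf", x, y);
}
```

Note that `%.5Lf` (capital L) selects the long double conversion; passing a long double to plain `%f` is undefined behaviour in a variadic call.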

Core Challenges & Solutions

  • Bit‑accurate keys and stems

    • Challenge: Patricia tries branch on bits, not characters. Getting bit indices wrong by even 1 bit corrupts structure and search.
    • Solution: src2/bit.c implements getBit() with MSB‑first indexing; src2/stem.c builds exact‑length stems with zero‑padded tails. Keys are measured as (strlen(key)+1)*8 bits to include the trailing \0 sentinel.
  • Node splitting without breaking invariants

    • Challenge: When two keys diverge inside a node’s stem, we must create a common node, move the old node’s children/data under an oldnode, and attach a newnode for the inserted key. It’s easy to lose children or duplicate data.
    • Solution: src2/ptree.c:ptree_insert() computes the LCP in bits, allocates three stems, moves children/data into oldnode, appends the new record to newnode, and wires them under common according to their first bit. The parent pointer (or root) is updated atomically before the old shell is freed.
  • Exact vs closest‑match queries

    • Challenge: A miss should return the best candidate from the mismatch subtree only, not from the whole dictionary; ties must choose the alphabetically earliest key and then return all of its records in stable order.
    • Solution: src2/ptree.c:ptree_query() walks the search path counting work. On mismatch it collects candidate leaves with a small DFS (collect_candidates()), also considering the mismatch node itself when it stores data. It computes editDistance() per candidate, increments s for each call, and uses strcmp for tie‑breaks.
  • Output fidelity

    • Challenge: The golden files require exact byte‑for‑byte matches, especially for x/y fields.
    • Solution: src2/printrec.c centralizes formatting; x/y are printed as long doubles with 5 decimals when present.
  • Memory correctness under scale

    • Challenge: A deep trie with many duplicates stresses ownership and free order.
    • Solution: Records are owned only at leaf/exact nodes via rlist_t; matches_t holds borrowed pointers. ptree_free() recursively frees children, stem, data list (deep‑freeing records), then the node. ASan and valgrind targets are in the Makefile.
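
A minimal sketch of the bit-accurate machinery described above (MSB-first access plus a bitwise LCP), assuming signatures close to, but not necessarily identical to, those in src2/bit.c and src2/ptree.c:

```c
#include <string.h>

/* MSB-first bit access: bit 0 is the most significant bit of byte 0. */
static int getBit(const char *key, unsigned bitIndex) {
    return ((unsigned char)key[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;
}

/* A key spans (strlen(key)+1)*8 bits so the '\0' sentinel is included. */
static unsigned keyBits(const char *key) {
    return (unsigned)(strlen(key) + 1) * 8;
}

/* Longest common prefix, in bits, between two bit strings; this is
 * where a split point inside a node's stem would be located. */
static unsigned lcpBits(const char *a, const char *b, unsigned maxBits) {
    unsigned i = 0;
    while (i < maxBits && getBit(a, i) == getBit(b, i))
        i++;
    return i;
}
```

Including the '\0' sentinel in the bit count guarantees that no key can be a strict bit-prefix of another, so two distinct keys always diverge at some bit.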

Counting Policy (b/n/s)

  • b (bit comparisons): count only bits compared during LCP checks along the search path; do not count the single bit used to choose an outgoing edge.
  • n (node accesses): count nodes visited on the search path only; the post‑mismatch DFS to enumerate candidates is excluded by design to keep counts operation‑based and comparable.
  • s (string‑level comparisons): count each editDistance() call; add one for exact matches for parity with reference numbers; add one for strcmp tie‑breaks.

All increments happen in the same code that performs the work: lcp_mem_vs_key() and ptree_query() in src2/ptree.c. The linked‑list baseline mirrors the policy (bitwise comparator per node, one string comparison per node).
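
In sketch form, the per-candidate selection that drives the s counter might look like this; pickClosest and the exact shape of stats_t are assumptions, with the real logic living in ptree_query():

```c
#include <string.h>

/* Illustrative stand-in for the project's counters struct. */
typedef struct { int b, n, s; } stats_t;

/* Pick the closest candidate: minimum edit distance, ties broken by
 * the alphabetically earliest key. dist[] would come from one
 * editDistance() call per candidate, each counted in stats->s;
 * each strcmp tie-break adds one more, per the policy above. */
static const char *pickClosest(const char **cands, const int *dist,
                               int n, stats_t *stats) {
    const char *best = cands[0];
    int bestDist = dist[0];
    stats->s += n;                      /* one editDistance() per candidate */
    for (int i = 1; i < n; i++) {
        if (dist[i] < bestDist) {
            bestDist = dist[i];
            best = cands[i];
        } else if (dist[i] == bestDist) {
            stats->s++;                 /* strcmp tie-break counted */
            if (strcmp(cands[i], best) < 0)
                best = cands[i];
        }
    }
    return best;
}
```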

Correctness Invariants (Patricia Trie)

  • Every internal node has a non‑zero stem; siblings start with different first bits.
  • If dhead != NULL, the node’s stem equals a complete key (includes \0).
  • Duplicates are appended to the same node’s record list and preserve CSV order.

Evidence & Reproducibility

  • File diffs: make test compares outputs with matching_results/ and passes across the 1/2/22/1067 suites.
  • Memory: make asan-test and make valgrind report no leaks or invalid accesses on large datasets.
  • Metrics: report/metrics_both.csv shows the expected separation between LL and PT; scripts/prefix_study.py confirms that longer query prefixes shrink candidate subtrees and drive s → 1.

Stage 3 Report (Summary)

  • Full report: COMP20003 — Assignment 2 Stage 3 Report.md (source) and COMP20003 — Assignment 2 Stage 3 Report.pdf (print) at the repo root. See also the generated data under report/.
  • Methodology: operation‑based counts (b bits, n nodes, s strings) on the official suites 1/2/22/1067; plus a prefix‑length study (2,5,8,12,16,20 chars). Counts are platform‑independent and come from the instrumented code path.
  • Key results (averages per query):
    • dataset_1067 — PT: b≈297, n≈16.75, s=1.00; LL: b≈8481, n=1067, s=1067 (report/metrics_both.csv:5).
    • dataset_22 — PT: b≈274.7, n≈2.67, s=1.00; LL: b≈2092.3, n=22, s=22 (report/metrics_both.csv:3).
  • Prefix study (report/prefix/prefix_metrics.csv:1): with only 2‑char prefixes PT temporarily evaluates multiple candidates (s≈35), but by 5+ chars s→1 while LL remains at s=1067 regardless of prefix.
  • Figures: triptych charts under report/figs/ and report/prefix/ (log‑y variants included) visually demonstrate the asymptotic gap: PT scales with path depth (L') while LL scales with N.

Reproducing the Report Data

make report
# metrics: report/metrics_both.csv
# plots:   report/figs/*.png  and  report/prefix/*.png

The command runs dict2/dict1 over all suites, aggregates b/n/s, and draws the bars+lines figures used in the write‑up.

What I’d Call Out On A Resume

  • Built a bit‑level Patricia trie with precise split logic and a principled counting framework to evaluate data‑structure efficiency independent of hardware.
  • Engineered a reproducible experimentation stack (shell + Python) to generate metrics and publication‑ready plots; wrote the complexity report tying theory to empirical results.
  • Achieved perfect output fidelity and zero‑leak memory safety on thousand‑record datasets.

Potential Extensions

  • Swap in a vtable-style dict_api to run LL/PT under a unified CLI.
  • Persist the trie or serialize stems for faster warm starts.
  • Parallelize candidate edit‑distance on large mismatch subtrees (kept single‑threaded here to match marking environment).

Repository Layout (key files)

├── src2/            # Patricia trie implementation and shared modules
│   ├── main2.c      # CLI wrapper for dict2
│   ├── ptree.[ch]   # trie insert/query/free + edit-distance matching
│   ├── csv2.[ch]    # single-pass CSV loader (headers + records)
│   ├── record.[ch]  # record_t model, matches_t, stats_t
│   ├── bit.[ch], stem.[ch], editdist.[ch], printrec.[ch]
├── src1/            # Stage 1 linked-list dictionary + bit comparator
├── scripts/         # Automation (tests, metrics, plots, prefix study)
├── tests/           # Author-provided and custom datasets/query suites
├── matching_results/# Golden outputs from the course staff
├── report/          # Generated metrics, plots, LaTeX/Word templates
│   └── ... + final Stage 3 write-up (PDF & Markdown)
├── Makefile
└── dict2 / *.o / REPORT.pdf ... (built or generated artefacts)

Tip – run make clean before committing or packaging to remove dict2, object files, and regenerated CSV/plot artefacts.

Build & Run

# Patricia trie program (Stage 2)
make dict2
./dict2 2 tests/dataset_22.csv output.txt < tests/test22.in > output.stdout.out

# Linked-list baseline (Stage 1) for comparisons
make dict1
./dict1 1 tests/dataset_22.csv output1.txt < tests/test22.in > output1.stdout.out

Convenience targets

  • make test – runs scripts/run_tests.sh (diffs outputs vs golden files).
  • make report – rebuilds metrics, plots, and prefix-study data under report/.
  • make asan-test – AddressSanitizer build + smoke test (useful off Ed).
  • make valgrind – full leak check on dataset_1067 (requires local valgrind).

Datasets & Testing

  • Official suites live under tests/ with matching golden outputs in matching_results/.
  • scripts/run_tests.sh executes the regression pipeline, showing file diffs and informative stdout comparisons.
  • scripts/review_tests.sh builds bespoke micro-cases (duplicates, tie-breaks, NOTFOUND behaviour, etc.) to sanity-check tricky edge conditions.
  • Custom inputs and expected outputs used during development are in tests/custom/.

Metrics, Plots & Reports (Stage 3)

  • Run make report to populate:
    • report/metrics_both.csv – aggregated averages for linked list vs trie.
    • report/figs/*.png – grouped bars + lines (log-linear variants included).
    • report/prefix/ – prefix-length study data and charts.
  • The final write-up is available in both English and Chinese:
    • COMP20003 — Assignment 2 Stage 3 Report.md (source)
    • COMP20003 — Assignment 2 Stage 3 Report.pdf (submitted version)
    • REPORT_CN.md (optional Mandarin summary)
  • LaTeX/Word templates (report/latex, report/word) document how the report was formatted for submission.

Implementation Notes

  • Keys are stored as byte arrays including the trailing \0. Bit access is MSB-first, ensuring consistent stems and comparisons.
  • Duplicate addresses stay in CSV order thanks to per-node record lists.
  • The trie counts only search-path node visits and prefix bit comparisons; the mismatch candidate DFS is excluded by design to mirror the reference policy.
  • Edit distance uses a rolling-row Levenshtein implementation, ensuring tight memory bounds.
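
A rolling-row Levenshtein of the kind described can be sketched as follows; this is the generic single-row formulation, not a verbatim copy of src2/editdist.c:

```c
#include <string.h>
#include <stdlib.h>

/* Single-row Levenshtein: only one row plus a diagonal scalar is
 * kept, so memory is O(|b|) rather than O(|a|*|b|). */
static int editDistance(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    int *row = malloc((lb + 1) * sizeof *row);
    if (!row)
        return -1;                       /* allocation failure */
    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;                 /* distance from "" to b[0..j) */
    for (size_t i = 1; i <= la; i++) {
        int diag = row[0];               /* value of cell [i-1][j-1] */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int up = row[j], left = row[j - 1];
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int best = diag + cost;      /* substitution (or match) */
            if (up + 1 < best) best = up + 1;      /* deletion  */
            if (left + 1 < best) best = left + 1;  /* insertion */
            diag = up;
            row[j] = best;
        }
    }
    int d = row[lb];
    free(row);
    return d;
}
```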

Resume-Friendly Talking Points

  • Built a bit-level Patricia trie spellchecker that delivers exact or closest-match results while tracking analytic counters for each query.
  • Automated a metrics pipeline (bash + Python/matplotlib) to compare linked-list vs trie behaviour across real datasets and generated prefix workloads, feeding directly into the Stage 3 complexity report.
  • Ensured memory correctness on large datasets via valgrind/ASan and structured ownership of 35-field records.

Acknowledgements & Notes

  • Binary artefacts in the repo (dict2, *.o, REPORT.pdf, home.zip, plot PNGs) are retained for completeness; regenerate them with the provided Makefile and scripts if required.
  • The project targets POSIX/C99 environments. Tested locally with cc and on the university’s Ed platform.

Feel free to reach out if you need clarification on any module or would like an abridged excerpt for a portfolio. Enjoy exploring the project!
