Java port of wolfgarbe/PruningRadixTrie. See the original repository for the background of the project. All credit to Wolf Garbe.
No features have been added or removed except where necessary to match Java conventions, so functionality and performance should be unchanged apart from these exceptions:
- Removed parameter out long termFrequencyCountPrefix from method getTopkTermsForPrefix, because it is awkward to achieve this in Java without changing the method's return type significantly. Specifically, the functionality lost is getting the number of terms in the dictionary which begin with the given prefix (not an essential feature).
- readTermsFromFile now requires that the frequency values in the file have a max value of 2^63 - 1, whereas before they could have a max value of 2^64 - 1 (because the frequency values are interpreted as signed longs rather than unsigned longs).
- Introduced parameter String delimiter to method readTermsFromFile so that text files with delimiters other than \t can be read. See Usage below.
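The 2^63 - 1 limit in the second exception can be checked with plain JDK calls. The sketch below is not part of the port: FrequencyRangeDemo and fitsInSignedLong are hypothetical names, and it assumes the port parses frequencies with something equivalent to Long.parseLong, which this README does not confirm.

```java
import java.math.BigInteger;

final class FrequencyRangeDemo {

    /** Returns true if the text parses as a signed 64-bit long (Long.parseLong accepts it). */
    static boolean fitsInSignedLong(String text) {
        try {
            Long.parseLong(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String maxSigned = String.valueOf(Long.MAX_VALUE);          // 2^63 - 1
        String justAbove = BigInteger.ONE.shiftLeft(63).toString(); // 2^63
        String maxUnsigned = Long.toUnsignedString(-1L);            // 2^64 - 1

        System.out.println(maxSigned + " fits: " + fitsInSignedLong(maxSigned));
        System.out.println(justAbove + " fits: " + fitsInSignedLong(justAbove));
        System.out.println(maxUnsigned + " fits: " + fitsInSignedLong(maxUnsigned));
    }
}
```

If a terms file contains frequencies above Long.MAX_VALUE, one option is to clamp or rescale them before loading.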
Copy the source code into your project! There are only 4 classes to copy, and no external dependencies. I'm afraid I do not have plans to upload this project to Maven Central or the like.
Create Object
PruningRadixTrie pruningRadixTrie = new PruningRadixTrie();
addTerm: insert a term and its frequency count into the Pruning Radix Trie. Frequency counts for the same term are summed.
pruningRadixTrie.addTerm("microsoft", 1000);
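To illustrate what "summed" means here, the following stand-in uses a HashMap instead of the trie. SummedFrequencyDemo and frequencyOf are hypothetical names for illustration only; the real PruningRadixTrie stores terms in trie nodes, not in a map.

```java
import java.util.HashMap;
import java.util.Map;

final class SummedFrequencyDemo {
    private final Map<String, Long> frequencies = new HashMap<>();

    /** Adds the count to any count already stored for the term. */
    void addTerm(String term, long termFrequencyCount) {
        frequencies.merge(term, termFrequencyCount, Long::sum);
    }

    /** Returns the accumulated count, or 0 if the term was never added. */
    long frequencyOf(String term) {
        return frequencies.getOrDefault(term, 0L);
    }

    public static void main(String[] args) {
        SummedFrequencyDemo demo = new SummedFrequencyDemo();
        demo.addTerm("microsoft", 1000);
        demo.addTerm("microsoft", 250);
        System.out.println(demo.frequencyOf("microsoft")); // prints 1250
    }
}
```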
getTopkTermsForPrefix: retrieve the top-k most relevant terms for a given prefix from the Pruning Radix Trie.
String prefix = "micro";
int topK = 10;
List<TermAndFrequency> results = pruningRadixTrie.getTopkTermsForPrefix(prefix, topK);
for (TermAndFrequency result : results) {
    System.out.println(result.getTerm() + " " + result.getTermFrequencyCount());
}
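For checking results on a small dictionary, a brute-force baseline can be handy. The sketch below is not the trie's algorithm: it scans every term, assumes "most relevant" means highest frequency count, and uses the hypothetical names NaiveTopK and topKForPrefix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NaiveTopK {

    /** Returns up to topK terms starting with prefix, highest frequency first. */
    static List<String> topKForPrefix(Map<String, Long> frequencies, String prefix, int topK) {
        List<Map.Entry<String, Long>> matches = new ArrayList<>();
        for (Map.Entry<String, Long> entry : frequencies.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                matches.add(entry);
            }
        }
        // Sort by frequency, descending; the sort is stable, so ties keep map iteration order.
        matches.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, matches.size()); i++) {
            result.add(matches.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> frequencies = new LinkedHashMap<>();
        frequencies.put("microsoft", 1000L);
        frequencies.put("microscope", 400L);
        frequencies.put("micro", 700L);
        frequencies.put("apple", 9000L);

        System.out.println(topKForPrefix(frequencies, "micro", 2)); // prints [microsoft, micro]
    }
}
```

This scan costs time proportional to the whole dictionary on every query, so it is only suitable as a correctness check, not as a replacement for the trie.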
readTermsFromFile: Deserialise a persisted Pruning Radix Trie from disk. Note that terms.txt here contains tab-delimited lines, i.e. in the format term\tfrequency. If the format is term,frequency, pass "," as the second argument instead, and so on.
pruningRadixTrie.readTermsFromFile("terms.txt", "\t");
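The expected line format can be sketched with a small standalone parser. This is an illustration rather than the port's actual parsing code: TermLineParser, ParsedTerm, and parseLine are hypothetical names, and treating the delimiter as a literal string (via Pattern.quote) is an assumption about the intended behaviour.

```java
import java.util.regex.Pattern;

final class TermLineParser {

    /** Simple holder for one parsed line. */
    static final class ParsedTerm {
        final String term;
        final long frequency;

        ParsedTerm(String term, long frequency) {
            this.term = term;
            this.frequency = frequency;
        }
    }

    /** Splits "term<delimiter>frequency" and parses the frequency as a signed long. */
    static ParsedTerm parseLine(String line, String delimiter) {
        // Pattern.quote makes delimiters such as "|" or "." literal instead of regex syntax.
        String[] parts = line.split(Pattern.quote(delimiter), 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected term" + delimiter + "frequency but got: " + line);
        }
        // Long.parseLong throws NumberFormatException for non-numeric or out-of-range values.
        return new ParsedTerm(parts[0], Long.parseLong(parts[1].trim()));
    }

    public static void main(String[] args) {
        ParsedTerm tabbed = parseLine("microsoft\t1000", "\t");
        ParsedTerm comma = parseLine("microscope,400", ",");
        System.out.println(tabbed.term + " " + tabbed.frequency);
        System.out.println(comma.term + " " + comma.frequency);
    }
}
```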
writeTermsToFile: Serialise the Pruning Radix Trie to disk for persistence.
pruningRadixTrie.writeTermsToFile("terms.txt");