ngram.analysis

The ngram.analysis program contains various functions for analysing the output of the ngram.counting program. It has the following syntax:

ngram.analysis processing_folder COMMAND OPTIONS

where COMMAND is one of: invert_index, search, get_top, make_wordlength_stats, view_wordlength_stats.

invert_index

invert_index creates an index, sorted by descending frequency, on the n-gram index specified by its first option. For example, to create a frequency-sorted index on the 4-gram index in which the words are ordered as they were in the text, run invert_index 0_1_2_3. As another example, to create a frequency-sorted index on the 3-gram index whose word order puts the third word first, followed by the first and then the second word, run invert_index 2_0_1.

invert_index has a complexity of O(N), where N is the size of the n-gram index to be created.

get_top

Invocation: get_top n m

get_top is the counterpart to invert_index. It displays the m most frequent n-grams. For example, to get the 10 most frequent 2-grams, run get_top 2 10 (after having run invert_index 0_1 at some point previously).

get_top has a complexity of O(1).

search

search looks up n-grams. It takes a single argument, the search string, in which "*" stands for any word. For example, run search "this is * test" to find all 4-grams where the first word is "this", the second word is "is" and the last word is "test". The search would find 4-grams like "this is a test" or "this is the test".

search has a worst-case complexity of O(log(N)), where N is the size of the n-gram index.

make_wordlength_stats

make_wordlength_stats takes a single argument, the number n. It goes through the file by_0_1_2_..._n and creates a table that maps each n-gram length (in UTF-8 code units) to the frequency of n-grams of that length occurring in the corpus.

The complexity of this function is O(1).
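As a rough illustration of the kind of table make_wordlength_stats produces, here is a minimal Python sketch. It is not the program's implementation: the function name wordlength_stats, the in-memory input of (n-gram, frequency) pairs, the decision to count the separating spaces towards the length, and the decision to weight each n-gram by its frequency are all assumptions made for this example. The real command reads the index file, whose format is not described here.

```python
from collections import Counter


def wordlength_stats(ngrams):
    """Map n-gram length in UTF-8 code units to total frequency.

    `ngrams` is an iterable of (ngram_text, frequency) pairs. This input
    shape is hypothetical; it is not the program's index format.
    """
    table = Counter()
    for text, freq in ngrams:
        # Length in UTF-8 code units (bytes), not in characters.
        length = len(text.encode("utf-8"))
        # Assumption: each n-gram contributes its corpus frequency.
        table[length] += freq
    return dict(table)


if __name__ == "__main__":
    sample = [("this is", 3), ("it is", 2), ("so be", 1), ("\u00e9 is", 4)]
    stats = wordlength_stats(sample)
    for length in sorted(stats):
        print(length, stats[length])
```

Note that "\u00e9 is" is four characters long but five UTF-8 code units, so it lands in the same row as "it is" and "so be".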
view_wordlength_stats

view_wordlength_stats shows the table generated by make_wordlength_stats. Its argument is the number of words (i.e. the n) for which the table should be shown. make_wordlength_stats must be called before view_wordlength_stats.

entropy_of

entropy_of is similar to search and is essentially a series of transformations applied to the search output. It returns the entropy (in bits) of the distribution of n-grams that match the search string. For example, if the output of the search "a *" on a particular corpus is

a bit 1
a lot 1

then the entropy returned is 1, since the word following "a" carries as much entropy as a single fair coin toss.

entropy_index

entropy_index creates an index sorted by entropy. Its only parameter is a pattern describing a search string, such as "= * =", where each '=' represents a known word and each '*' represents a wildcard whose entropy is used as the key of the index. The index can then be queried using the entropy_index_get_top command (see below).

entropy_index_get_top

entropy_index_get_top takes as its first argument a pattern to match a search string (see the entropy_index command). The second argument is the number of results to display (k). The optional third argument gives a threshold for the minimum frequency that an n-gram with entropy 0 must have in order to be shown. This value is 1 by default (i.e. all n-grams with entropy 0 are shown). The command displays the k search strings that match the pattern of the first argument and have the lowest entropy.
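The entropy value that entropy_of reports, and that the entropy index is keyed on, can be sketched in Python as follows. This is an illustrative sketch only: the function name entropy_bits and the dictionary of search results are hypothetical, and it assumes the entropy is the Shannon entropy of the matching n-grams' relative frequencies, which is consistent with the "a *" example above.

```python
import math


def entropy_bits(frequencies):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    counts = [c for c in frequencies if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        p = count / total  # relative frequency of one matching n-gram
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    # Hypothetical search output for "a *": n-gram -> frequency.
    matches = {"a bit": 1, "a lot": 1}
    print(entropy_bits(matches.values()))  # 1.0
```

A search string with a single matching n-gram has entropy 0, which is why entropy_index_get_top offers a frequency threshold for such entries.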