ngram.analysis

	The ngram.analysis program contains various functions for analysing the output
	from the ngram.counting program.
	
	It has the following syntax:
	
	ngram.analysis processing_folder COMMAND OPTIONS
	
		where COMMAND is one of:
				invert_index
				search
				get_top
				make_wordlength_stats
				view_wordlength_stats
				entropy_of
				entropy_index
				entropy_index_get_top

invert_index
	invert_index is used for creating an index sorted by descending frequency
	on the n-gram index specified by the first option.

	For example, to create an index sorted by frequency on the 4-gram index whose
	words are in the same order as in the text, run invert_index 0_1_2_3.

	As another example, to create an index sorted by frequency on the 3-gram index
	whose word order is third word, then first word, then second word, run
	invert_index 2_0_1.

	invert_index has a complexity of O(N), where N is the size of the n-gram index
	to be created.
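
	The word-order specification and the frequency ordering can be illustrated with a
	minimal Python sketch. It is not the program's actual implementation: it assumes the
	n-gram counts fit in an in-memory dictionary, and the helper names parse_order,
	reorder and invert_index_sketch are hypothetical. It also uses a comparison sort,
	so it does not reproduce the O(N) bound stated above.

```python
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def parse_order(spec: str) -> List[int]:
    """Turn an order specification such as "2_0_1" into [2, 0, 1]."""
    return [int(part) for part in spec.split("_")]


def reorder(ngram: NGram, order: List[int]) -> NGram:
    """Rearrange the words of an n-gram according to the given word order."""
    return tuple(ngram[i] for i in order)


def invert_index_sketch(counts: Dict[NGram, int], spec: str) -> List[Tuple[int, NGram]]:
    """Return (frequency, reordered n-gram) pairs, most frequent first."""
    order = parse_order(spec)
    entries = [(freq, reorder(ngram, order)) for ngram, freq in counts.items()]
    # Sort by descending frequency; ties are broken by the reordered words.
    entries.sort(key=lambda entry: (-entry[0], entry[1]))
    return entries


# Hypothetical toy counts, keyed by words in text order (0_1_2).
counts = {
    ("this", "is", "a"): 5,
    ("is", "a", "test"): 3,
    ("a", "test", "case"): 7,
}
inverted = invert_index_sketch(counts, "2_0_1")
```

	Once such a list exists, taking its first m entries corresponds to what get_top
	reports, which is why the sorted index has to be built first.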

get_top
	Invocation: get_top n m
	get_top is the counterpart to invert_index. It displays the m most frequent n-grams.

	For example, to get the 10 most frequent 2-grams, run get_top 2 10
	(invert_index 0_1 must have been run at some point beforehand).

	get_top has a complexity of O(1).

search
	search is used for searching for n-grams. It takes a single argument, the
	search string, in which "*" acts as a wildcard for a word.

	For example, run search "this is * test" to search for all 4-grams where the first
	word is "this", the second word is "is" and the last word is "test". The search would
	find 4-grams like "this is a test" or "this is the test".

	search has a worst case complexity of O(log(N)), where N is the size of the n-gram index.
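
	A wildcard search of this kind can be sketched in Python as follows. This is an
	illustration under assumptions, not the program's code: it assumes a sorted in-memory
	list of n-grams plus a dictionary of counts, and the name search_sketch is hypothetical.
	Binary search locates the first n-gram that starts with the known leading words; the
	candidates sharing that prefix are then checked against the rest of the pattern.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def search_sketch(sorted_ngrams: List[NGram], counts: Dict[NGram, int],
                  query: str) -> List[Tuple[NGram, int]]:
    """Return (n-gram, count) pairs matching a query such as 'this is * test'."""
    pattern = tuple(query.split())
    # The known words before the first wildcard form a searchable prefix.
    prefix_len = 0
    while prefix_len < len(pattern) and pattern[prefix_len] != "*":
        prefix_len += 1
    prefix = pattern[:prefix_len]

    results = []
    # Binary search for the first n-gram that is >= the prefix.
    i = bisect_left(sorted_ngrams, prefix)
    while i < len(sorted_ngrams) and sorted_ngrams[i][:prefix_len] == prefix:
        ngram = sorted_ngrams[i]
        same_length = len(ngram) == len(pattern)
        if same_length and all(p == "*" or p == w for p, w in zip(pattern, ngram)):
            results.append((ngram, counts[ngram]))
        i += 1
    return results


# Hypothetical toy index.
counts = {
    ("that", "is", "a", "test"): 2,
    ("this", "is", "a", "dog"): 1,
    ("this", "is", "a", "test"): 4,
    ("this", "is", "the", "test"): 3,
}
sorted_ngrams = sorted(counts)
hits = search_sketch(sorted_ngrams, counts, "this is * test")
```

	In this sketch the scan after the binary search is linear in the number of n-grams
	that share the prefix, so it only illustrates the lookup idea and makes no claim about
	how the program reaches its stated bound.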

make_wordlength_stats
	This function takes a single argument, the number n.
	It goes through the file by_0_1_2_ ..._n and creates a table that maps each n-gram length
	(in UTF-8 code units) to the frequency of n-grams of that length occurring in the corpus.
	The complexity of this function is O(1).
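
	The kind of table described above can be sketched in Python. Two details are
	assumptions made only for this illustration, since the text does not specify them:
	the length of an n-gram is taken to be the UTF-8 byte length of its words joined by
	single spaces, and each n-gram contributes its occurrence count to the frequency.
	The name wordlength_stats_sketch is hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def wordlength_stats_sketch(counts: Dict[NGram, int]) -> Dict[int, int]:
    """Map n-gram length in UTF-8 code units to total frequency."""
    table: Counter = Counter()
    for ngram, freq in counts.items():
        # Assumption: words are joined by one space before measuring.
        length = len(" ".join(ngram).encode("utf-8"))
        table[length] += freq
    return dict(table)


# Hypothetical toy counts; "ü" occupies two UTF-8 code units.
counts = {
    ("a", "bit"): 1,
    ("a", "lot"): 1,
    ("über", "uns"): 2,
}
stats = wordlength_stats_sketch(counts)
```

	Here "a bit" and "a lot" are 5 code units long, while "über uns" is 9 code units
	long although it has only 8 characters.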

view_wordlength_stats
	This function shows the table generated by make_wordlength_stats. The argument is the number of
	words (i.e. the n) for which the table should be shown. make_wordlength_stats needs to be called
	before calling view_wordlength_stats.

entropy_of
	This function is similar to the search function: it applies a series of transformations
	to the search output. It returns the entropy (in bits) of the search result, i.e. of the
	distribution of n-grams given that an n-gram matches the search string.
	For example, if the output of the search "a *" on a particular corpus is
		a bit	1
		a lot	1
	then the entropy returned is 1, as the word following "a" has as much entropy as a single coin toss.
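
	The coin-toss example can be checked with a short Python sketch of the Shannon
	entropy H = -sum(p_i * log2(p_i)), where p_i is the relative frequency of the i-th
	matching n-gram. The function name entropy_bits is hypothetical and the code is an
	illustration of the formula, not the program's implementation.

```python
import math
from typing import Iterable


def entropy_bits(frequencies: Iterable[int]) -> float:
    """Shannon entropy, in bits, of a distribution given as raw frequencies."""
    freqs = [f for f in frequencies if f > 0]
    total = sum(freqs)
    if total == 0:
        return 0.0
    # p * log2(1 / p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((f / total) * math.log2(total / f) for f in freqs)


# Frequencies of "a bit" and "a lot" from the example above.
coin_toss = entropy_bits([1, 1])
```

	With the frequencies 1 and 1 from the example, each n-gram has probability 0.5 and
	the result is 1 bit. A search that matches only a single n-gram yields an entropy of 0.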

entropy_index
	This function creates an index by entropy.
	It takes as its only parameter a pattern describing a search string, such as "= * =",
		where each '=' represents a known word, and
		each '*' represents the wildcard(s) whose entropy is used as the key of the index.
	This index can then be queried using the entropy_index_get_top command (see below).

entropy_index_get_top
	This function takes as the first argument a pattern to match a search string (see the entropy_index command).
	The second argument is the number of results to display (k).
	The optional third argument gives a threshold for the minimum frequency that an n-gram with entropy 0 must have to be shown.
		This value is 1 by default (i.e. show all n-grams with entropy 0).
	This function displays the k search strings (that match the pattern given as the first argument) with the lowest entropy.
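
	The ranking can be sketched in Python as follows. This is a sketch under assumptions,
	not the program's implementation: n-gram counts are assumed to be held in an in-memory
	dictionary, the names entropy_bits and lowest_entropy_sketch are hypothetical, and the
	threshold is interpreted as applying to the total frequency of a search string whose
	entropy is 0.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def entropy_bits(freqs: List[int]) -> float:
    """Shannon entropy, in bits, of a distribution given as raw frequencies."""
    total = sum(freqs)
    return sum((f / total) * math.log2(total / f) for f in freqs if f > 0)


def lowest_entropy_sketch(counts: Dict[NGram, int], pattern: str, k: int,
                          min_freq: int = 1) -> List[Tuple[float, str]]:
    """Return the k search strings matching the pattern with the lowest entropy."""
    slots = pattern.split()
    groups: Dict[NGram, List[int]] = defaultdict(list)
    for ngram, freq in counts.items():
        if len(ngram) != len(slots):
            continue
        # Keep the known words ('=') and replace the wildcard positions by '*'.
        key = tuple(w if s == "=" else "*" for w, s in zip(ngram, slots))
        groups[key].append(freq)

    ranked = []
    for key, freqs in groups.items():
        h = entropy_bits(freqs)
        # Entropy-0 entries are only shown if they are frequent enough.
        if h == 0 and sum(freqs) < min_freq:
            continue
        ranked.append((h, " ".join(key)))
    ranked.sort()
    return ranked[:k]


# Hypothetical toy counts.
counts = {
    ("a", "bit"): 1,
    ("a", "lot"): 1,
    ("the", "end"): 3,
    ("of", "course"): 1,
}
top_default = lowest_entropy_sketch(counts, "= *", 2)
top_filtered = lowest_entropy_sketch(counts, "= *", 2, min_freq=2)
```

	With the default threshold both entropy-0 search strings are listed; raising the
	threshold to 2 hides "of *", which occurs only once, so "a *" with an entropy of 1 bit
	moves into the result.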