RIPS 2015 - Symantec Group
Student Participants: Ashwin Narayan (ashwin.narayan@williams.edu), Anna King, Natalia Postawa, Vuk Markovic
Academic Mentor: Alejandro Morales
====================================== CODE DOCUMENTATION =============================================
We list the user-facing methods in each file of code; internal helper methods are omitted. Unfortunately, we were not able to make the databases available online.
-
demo.py
This file can be used to run any aspect of the program. Specific instructions are contained in the comments in the file itself. Out of the box, running
    python demo.py
will run both the non-real-time and real-time algorithms with default parameters.
Outward-facing methods. The following methods are pre-written and will run the entire algorithm from start to finish with fully customizable parameters.
-
full_clustering_procedure.py
full_clustering_procedure(ndata=100000, N=50000, nmachines=50, min_q_len=6, max_q_len=15, number_of_clusterings=1, queryfile=None, gcpa_type='better')
This method runs the non-real-time clustering procedure with the default parameters given above. Returns a GCPA object from which clusters and covers can be extracted.
Parameters:
    ndata: number of data elements to consider
    N: number of queries
    nmachines: number of machines
    min_q_len, max_q_len: minimum and maximum length of a query
    number_of_clusterings: number of times the algorithm should run
    queryfile: if None, queries are generated; to use already generated queries, give the name of the CSV file (WITHOUT THE EXTENSION) where the queries are stored
    gcpa_type: set to 'better' to use BetterGreedy for processing, or 'linear' to use the normal greedy algorithm
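For illustration, a minimal usage sketch (small parameter values chosen for speed; we assume the function is importable from the file above):

    from full_clustering_procedure import full_clustering_procedure

    # Run a small non-real-time clustering with BetterGreedy processing.
    gcpa = full_clustering_procedure(ndata=1000, N=500, nmachines=10,
                                     min_q_len=6, max_q_len=15,
                                     number_of_clusterings=1,
                                     queryfile=None, gcpa_type='better')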
-
full_clustering_procedure_comparisons.py
full_clustering_procedure_comparisons(ndata=100000, N=50000, nmachines=50, min_q_len=6, max_q_len=15, number_of_clusterings=1, queryfile=None)
This method runs the non-real-time clustering procedure, and then also runs N-greedy and the two baseline algorithms for comparison. It generates an output CSV file comparing the optimality of each method, and also prints the average time per query each method took.
Parameters: same as above.
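A minimal sketch of a comparison run (assuming the function name matches the file, as above):

    from full_clustering_procedure_comparisons import full_clustering_procedure_comparisons

    # Runs the clustering procedure plus N-greedy and the two baselines;
    # writes an optimality-comparison CSV and prints average per-query times.
    full_clustering_procedure_comparisons(ndata=1000, N=500, nmachines=10)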
-
full_realtime.py
full_realtime(precompute_fraction=.4, nqueries=50000, ndataunits=100000, nmachines=50, r=3, np=.995, min_q_len=6, max_q_len=15, ctype='fast', gcpatype='better', queryfile=None)
This method runs the real-time procedure, using the specified fraction of queries for precomputation. The method returns a GCPA object (which contains information about the precomputed queries), the real-time queries, and a list of covers for the real-time queries.
Parameters:
    precompute_fraction: the fraction of queries to be used for precomputing
    nqueries: number of queries we want
    ndataunits: number of data elements we want
    nmachines: number of machines
    r: repetition factor
    np: random graph property
    min_q_len, max_q_len: minimum and maximum length of a query
    ctype: 'fast' to use the fast clustering procedure, 'full' to use the full clustering procedure
    gcpatype: 'better' to use BetterGreedy, 'linear' to use the normal greedy algorithm
    queryfile: if None, queries are generated; to use already generated queries, give the name of the CSV file (WITHOUT THE EXTENSION) where the queries are stored
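A minimal sketch of a real-time run, unpacking the three return values described above (small parameter values for illustration):

    from full_realtime import full_realtime

    # Precompute covers for 40% of the queries, then serve the rest in real time.
    gcpa, rt_queries, rt_covers = full_realtime(precompute_fraction=.4,
                                                nqueries=500, ndataunits=1000,
                                                nmachines=10, ctype='fast',
                                                gcpatype='better')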
-
full_realtime_comparisons.py
full_realtime_comparisons(precompute_fraction=.4, nqueries=50000, ndataunits=100000, nmachines=50, r=3, np=.995, min_q_len=6, max_q_len=15, ctype='both', gcpatype='both', queryfile=None)
This runs the real-time procedure as above, but also compares it against the reference algorithms. It generates output CSV files for time and optimality comparisons (of the real-time queries only).
Parameters: all are the same as above, except:
    ctype, gcpatype: set to 'both' to have both types of clustering and processing analyzed
Classes. We have four main classes that store the vital information of the program.
-
cluster.py
class Cluster
Contains the data about a cluster.
Instance variables:
    queries: the list of queries in the cluster
    span: the number of unique data elements covered by the queries in the cluster
    min_query_len: length of the smallest query
Methods:
    __init__(self, queries=[]): creates a Cluster object with the given list of queries
    __or__(self, other): returns a new cluster with the union of the sets of queries in self and other (use self | other)
    __ior__(self, other): adds the queries of another cluster to this cluster (use self |= other)
    __getitem__(self, index): allows us to use [] to index the cluster's queries
    __len__(self): returns the number of queries in the cluster
    aligned_output(self): prints a neatly formatted representation of the cluster
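A short sketch of the operator overloads (we assume queries are represented as sets of data-element IDs, as elsewhere in this documentation):

    from cluster import Cluster

    c1 = Cluster(queries=[{1, 2, 3}, {2, 4}])
    c2 = Cluster(queries=[{5, 6}])

    merged = c1 | c2       # __or__: new Cluster with the union of the query sets
    c1 |= c2               # __ior__: add c2's queries into c1 in place
    print(len(c1))         # __len__: number of queries now in c1
    print(c1[0])           # __getitem__: index into the cluster's queries
    print(merged.span)     # unique data elements covered by the merged queries
    c1.aligned_output()    # neatly formatted printout of the cluster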
-
clustering.py
class Clustering
Contains the data about a clustering. To cluster a list of queries, simply create a Clustering object with the list of queries to be clustered.
Instance variables:
    theta_1, theta_2: thresholds described in the paper
    queries_processed: number of queries processed so far
    clusters: list of clusters in the clustering
    cl_entropies: a list containing the entropy information of each cluster; format = [number of queries, {element: frequency}, entropy]
    cluster_count: number of clusters
    data_hash: contains the potential clusters each data element is in
    prev_exp_entropy: the expected entropy of a cluster before a query is added
Methods:
    __init__(self, queries=[], theta_1=.5, theta_2=.5): clusters the list of queries with the given thresholds
    insert_rt_fast(self, query): inserts a query using the fast method
    insert_rt_noupdate(self, query): inserts a query into the clustering using the full method
    insert_rt_update(self, query): inserts the query and then updates the cluster's entropy and frequency
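A minimal sketch of batch clustering followed by a real-time insertion (query representation assumed to be a set of data-element IDs):

    from clustering import Clustering

    # Cluster an initial batch of queries with the default thresholds.
    clustering = Clustering(queries=[{1, 2, 3}, {2, 3, 4}, {7, 8, 9}],
                            theta_1=.5, theta_2=.5)

    # Insert a new query in real time using the fast insertion method.
    clustering.insert_rt_fast({2, 3, 5})
    print(clustering.cluster_count)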
-
gcpa_fast.py
class GCPA
Used for processing a clustering with the normal greedy algorithm. To preprocess a clustering, create a GCPA object with the clustering and the total number of data units as inputs; then call process with the machines and the data_unit_in_machine hash table as inputs.
Instance variables:
    queues: for each cluster, the list of data parts that need to be processed, in order
    partindex_by_cluster: for each cluster, gives the parts that cover each data element in that cluster
    partcover_by_cluster: given the index of a part, gives the cover of that part
-
gcpa_better.py
class GCPA_better
The same as above, except that BetterGreedy is used instead of the normal greedy algorithm for all computations.
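A sketch of the preprocessing flow, assuming the constructor and process signatures follow the description above (generate, generate_hash, and Clustering are documented elsewhere in this file):

    from clustering import Clustering
    from gcpa_fast import GCPA
    from generate_machines import generate
    from linear_greedy import generate_hash

    ndata = 1000
    machines = generate(list(range(1, ndata + 1)), 10, rep=3)
    dataunit_in_machine = generate_hash(machines, ndata)

    clustering = Clustering(queries=[{1, 2, 3}, {2, 3, 4}])
    gcpa = GCPA(clustering, ndata)               # preprocess the clustering
    gcpa.process(machines, dataunit_in_machine)  # compute covers per cluster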
Various other important methods
-
linear_greedy.py
linear_greedy(query, machines, dataunit_in_machine)
Returns the cover of query.
Parameters:
    query: the query to be covered (as a set)
    machines: the list of machines
    dataunit_in_machine: hash table whose keys are data elements and whose values are the machines containing them

generate_hash(machines, data_size)
Given the machines and the number of data elements, creates the dataunit_in_machine table.
Parameters:
    machines: the set of machines
    data_size: the number of data units (we assume data units are labeled 1, 2, ..., data_size)
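A short sketch covering a single query with the greedy routine (machine generation is documented under generate_machines.py below):

    from generate_machines import generate
    from linear_greedy import generate_hash, linear_greedy

    data_size = 1000
    machines = generate(list(range(1, data_size + 1)), 10, rep=3)
    dataunit_in_machine = generate_hash(machines, data_size)

    cover = linear_greedy({1, 5, 42, 900}, machines, dataunit_in_machine)
    print(cover)  # the machines chosen to cover the query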
-
graph_query_gen.py
iterative_dfs(graph, start, path=[], min_len=6, max_len=15)
Given a graph and a query size range, returns a query using the query generation algorithm.
Parameters:
    graph: an Erdos-Renyi graph
    start: the node to use as a starting point
    path: use if you want to use a prefix for the query
    min_len, max_len: the range of lengths that the query should be
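The documentation does not specify the graph representation, so the sketch below assumes a networkx Erdos-Renyi graph (np=.995 is the "random graph property" used by full_realtime above):

    import random
    import networkx as nx  # assumption: iterative_dfs accepts a networkx graph

    from graph_query_gen import iterative_dfs

    graph = nx.erdos_renyi_graph(n=1000, p=.995)
    start = random.choice(list(graph.nodes))

    query = iterative_dfs(graph, start, min_len=6, max_len=15)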
-
generate_machines.py
generate(data, m, rep=3)
Randomly distributes the data among machines with the given replication factor. Returns a list of sets.
Parameters:
    data: the list of all possible data points
    m: the number of machines
    rep: the replication factor
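A quick sketch of machine generation (the per-machine sizes are approximate, since the assignment is random):

    from generate_machines import generate

    # Distribute data units 1..1000 across 10 machines, replicating each 3 times.
    machines = generate(list(range(1, 1001)), 10, rep=3)
    print(len(machines))     # 10 machines
    print(len(machines[0]))  # roughly 1000 * 3 / 10 = 300 data units per machine
-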
baseline.py
baseline(query, machines, dataunit_in_machine)
Covers the query using the baseline method (randomly choose machines that intersect the query).
Parameters: same as linear_greedy above.

better_baseline(query, machines, dataunit_in_machine)
Covers the query using a "smarter" baseline method: choose an uncovered data unit, then randomly choose a machine that covers that unit.
Parameters: same as linear_greedy above.
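A sketch comparing the cover sizes of the two baselines against the greedy cover (setup as in the linear_greedy sketch above; we assume each returned cover supports len()):

    from baseline import baseline, better_baseline
    from generate_machines import generate
    from linear_greedy import generate_hash, linear_greedy

    data_size = 1000
    machines = generate(list(range(1, data_size + 1)), 10, rep=3)
    dataunit_in_machine = generate_hash(machines, data_size)

    query = {1, 5, 42, 900}
    for cover_fn in (baseline, better_baseline, linear_greedy):
        cover = cover_fn(query, machines, dataunit_in_machine)
        print(cover_fn.__name__, len(cover))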
Housekeeping methods for accessing and/or writing to files
-
extract_clusters.py
extract_clusters(filename)
Returns a list of clusters (each a list of queries) contained in the outfile generated by the clustering algorithm.
Parameters:
    filename: the name of the file (WITH EXTENSION) containing the clusters
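A one-line usage sketch (the filename below is hypothetical; pass whichever outfile the clustering run actually produced, WITH its extension):

    from extract_clusters import extract_clusters

    # 'clusters_out.csv' is a placeholder name for the clustering outfile.
    clusters = extract_clusters('clusters_out.csv')
    print(len(clusters), 'clusters;', len(clusters[0]), 'queries in the first')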
===========================================================================================================