Corpus × corpus distances #47
Have you considered the speed improvements that might result from computing distances between two corpora (the queries and the document collection) at once? With cosine similarity, this is simply a dot product between two term-document matrices. With network flows, perhaps a large network could be constructed, where the same words in different documents would be distinct nodes? Running a for loop over `nearest_neighbors` is not very fast even with the heuristics you implemented.
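A minimal sketch of the cosine case, with Q and C as placeholder names for the query and collection term-document matrices (terms as rows, documents as columns; dense matrices assumed only for brevity):

```python
import numpy as np

def all_cosine_similarities(Q, C):
    """All query × collection cosine similarities in a single matrix product.
    Q: |V| × |queries|, C: |V| × |collection| term-document matrices."""
    Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)  # L2-normalize each document column
    Cn = C / np.linalg.norm(C, axis=0, keepdims=True)
    return Qn.T @ Cn                                   # |queries| × |collection| matrix
```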
Thanks for an interesting idea! There can be a problem with building a big network, though: EMD calculation chokes at 1,000 and becomes impractically slow at 2,000 - the cubic complexity is to blame. On the other hand, there are tricks to reduce the problem size, and we can also subsample by significance. Are you aware of any papers about this?
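For illustration only, "subsample by significance" could mean keeping just the k heaviest words of a document's normalized BOW before building the flow network; the function and the cutoff k are assumptions, not anything implemented here:

```python
import numpy as np

def subsample_bow(weights, k=1000):
    """Keep the k largest entries of a BOW weight vector and renormalize,
    so the EMD network has at most k nodes per document."""
    if np.count_nonzero(weights) <= k:
        return weights
    top = np.argpartition(weights, -k)[-k:]   # indices of the k heaviest words
    out = np.zeros_like(weights)
    out[top] = weights[top]
    return out / out.sum()
```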
I was thinking of computing the subset graph of all document sets (i.e. BOWs with ones and zeros) in O(N² / log N) time (see Yellin and Jutla, 1993), where N is the sum of the set cardinalities, i.e. N = |D| · |V|, where |D| is the number of documents and |V| is the size of the dictionary. By Heaps' law, |V| ≈ √|D|, i.e. the time complexity is O(|D|³ / log |D|). However, this is asymptotically slower than the O(|D|² |V| log |V|) ≈ O(|D|^2.5 log |D|) time currently required to compute all pairwise distances (not to mention that the number of words two documents have in common will be less than |V|), so this still needs more thinking through. If we had the subset graph, then for every …
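A quick numeric sanity check of this comparison under the Heaps' law approximation (constants dropped, so only the trend matters):

```python
import math

# Subset graph: O(|D|³ / log |D|); pairwise distances: O(|D|² |V| log |V|)
# with |V| ≈ √|D|. The growing ratio shows the former is asymptotically slower.
for D in (10**3, 10**4, 10**5, 10**6):
    V = math.sqrt(D)
    subset_graph = D**3 / math.log(D)
    pairwise = D**2 * V * math.log(V)
    print(f"|D| = {D:>9,}: subset graph / pairwise ≈ {subset_graph / pairwise:.1f}")
```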
Although an explicit subset graph construction seems infeasible, there is a linear algebra procedure which allows us to filter out strict subsets of strict subsets of the query (see paragraph 2 of the above post): Take sign(A)ᵀ · sign(B), where A and B are term-document matrices; divide each column of the result by the transposed vector of distinct-term counts of A's documents, (∑ᵢ sign(Aᵢⱼ))ⱼᵀ, assign 1 to cells where a 0 / 0 division takes place, floor the resulting matrix, and let the result be M. Take sign(A)ᵀ · sign(B) again, divide each row of the result by the vector of distinct-term counts of B's documents, (∑ᵢ sign(Bᵢⱼ))ⱼ, assign 1 to cells where a 0 / 0 division takes place, floor the resulting matrix, and let the result be N. Then document j of B is a subset of document i of A iff Nᵢⱼ = 1, and a strict subset iff Mᵢⱼ = 0 and Nᵢⱼ = 1. Let Q and C be the term-document matrices for the queries and for the collection documents, respectively. Computing M and N twice, once for A=Q and B=C and again for A=C and B=C, gives the worst-case time complexity O(|D|²|V|) ≈ O(|D|^2.5). In theory, this is only a logarithmic factor better than the O(|D|² |V| log |V|) ≈ O(|D|^2.5 log |D|) needed for all pairwise distances, but an optimized implementation of the matrix dot product can well make it worthwhile in practice. I would like to benchmark this in the future.
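A sketch of this procedure in NumPy; the function name is mine, the matrices are assumed dense with terms as rows and documents as columns, and a practical implementation would likely use scipy.sparse instead:

```python
import numpy as np

def subset_matrices(A, B):
    """Compute M and N as described above for term-document matrices
    A (|V| × |D_A|) and B (|V| × |D_B|): document j of B uses a subset
    of the terms of document i of A iff N[i, j] == 1, and a strict
    subset iff additionally M[i, j] == 0."""
    S = np.sign(A).T @ np.sign(B)              # shared-term counts, |D_A| × |D_B|
    terms_A = np.sign(A).sum(axis=0)           # distinct terms per document of A
    terms_B = np.sign(B).sum(axis=0)           # distinct terms per document of B
    with np.errstate(invalid="ignore", divide="ignore"):
        M = np.floor(S / terms_A[:, None])     # divide each column by |terms(A_i)|
        N = np.floor(S / terms_B[None, :])     # divide each row by |terms(B_j)|
    M[np.isnan(M)] = 1                         # 0 / 0 cells (empty documents)
    N[np.isnan(N)] = 1
    return M, N

# Toy check: A holds documents {0, 1} and {1, 2}; B holds {1} and {0, 1}.
A = np.array([[1, 0], [2, 1], [0, 3]])
B = np.array([[0, 1], [1, 1], [0, 0]])
M, N = subset_matrices(A, B)
strict = (N == 1) & (M == 0)   # B_0 ⊂ A_0 and B_0 ⊂ A_1; B_1 equals A_0, so not strict
```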