Corpus × corpus distances #47
Have you considered the speed improvements that might result from computing distances between two corpora (the queries and the document collection) at once? With cosine similarity, this is simply a dot product between two term-document matrices. With network flows, perhaps a large network could be constructed, where the same words in different documents would be distinct nodes? Running a for loop over `nearest_neighbors` is not very fast even with the heuristics you implemented.
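A minimal sketch of the cosine case, with Q and C as placeholder names for the query and collection term-document matrices (terms as rows, documents as columns; dense matrices assumed only for brevity):

```python
import numpy as np

def all_cosine_similarities(Q, C):
    """All query × collection cosine similarities in a single matrix product.
    Q: |V| × |queries|, C: |V| × |collection| term-document matrices."""
    Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)  # L2-normalize each document column
    Cn = C / np.linalg.norm(C, axis=0, keepdims=True)
    return Qn.T @ Cn                                   # |queries| × |collection| matrix
```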
Thanks for an interesting idea! There can be a problem with building a big network, though: EMD calculation chokes at 1,000 and becomes impractically slow at 2,000 - the cubic complexity is to blame. On the other hand, there are tricks to reduce the problem size, and we can also subsample by significance. Are you aware of any papers about this?
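For illustration only, "subsample by significance" could mean keeping just the k heaviest words of a document's normalized BOW before building the flow network; the function and the cutoff k are assumptions, not anything implemented here:

```python
import numpy as np

def subsample_bow(weights, k=1000):
    """Keep the k largest entries of a BOW weight vector and renormalize,
    so the EMD network has at most k nodes per document."""
    if np.count_nonzero(weights) <= k:
        return weights
    top = np.argpartition(weights, -k)[-k:]   # indices of the k heaviest words
    out = np.zeros_like(weights)
    out[top] = weights[top]
    return out / out.sum()
```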
I was thinking of computing the subset graph of all document sets (i.e. BOWs with ones and zeros) in O(N² / log N) time (see Yellin and Jutla, 1993), where N is the sum of the set cardinalities, i.e. N = |D| · |V|, where |D| is the number of documents and |V| is the size of the dictionary. By Heaps' law, |V| ≈ √|D|, i.e. the time complexity is O(|D|³ / log |D|). However, this is asymptotically slower than the O(|D|² |V| log |V|) ≈ O(|D|^2.5 log |D|) time currently required to compute all pairwise distances (not to mention that the number of words two documents have in common will be less than |V|), so this still needs more thinking through. If we had the subset graph, then for every …
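A quick numeric sanity check of this comparison under the Heaps' law approximation (constants dropped, so only the trend matters):

```python
import math

# Subset graph: O(|D|³ / log |D|); pairwise distances: O(|D|² |V| log |V|)
# with |V| ≈ √|D|. The growing ratio shows the former is asymptotically slower.
for D in (10**3, 10**4, 10**5, 10**6):
    V = math.sqrt(D)
    subset_graph = D**3 / math.log(D)
    pairwise = D**2 * V * math.log(V)
    print(f"|D| = {D:>9,}: subset graph / pairwise ≈ {subset_graph / pairwise:.1f}")
```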
Although an explicit subset graph construction seems infeasible, there is a linear algebra procedure which allows us to filter out strict subsets of strict subsets of the query (see paragraph 2 of the above post): Take sign(A)ᵀ · sign(B), where A and B are term-document matrices; divide each column of the result by the transposed vector of distinct-term counts of A's documents, (∑ᵢ sign(Aᵢⱼ))ⱼᵀ, assign 1 to cells where a 0 / 0 division takes place, floor the resulting matrix, and let the result be M. Take sign(A)ᵀ · sign(B) again, divide each row of the result by the vector of distinct-term counts of B's documents, (∑ᵢ sign(Bᵢⱼ))ⱼ, assign 1 to cells where a 0 / 0 division takes place, floor the resulting matrix, and let the result be N. Then document j of B is a subset of document i of A iff Nᵢⱼ = 1, and a strict subset iff Mᵢⱼ = 0 and Nᵢⱼ = 1. Let Q and C be the term-document matrices for the queries and for the collection documents, respectively. Computing M and N twice, once for A=Q and B=C and again for A=C and B=C, gives the worst-case time complexity O(|D|²|V|) ≈ O(|D|^2.5). In theory, this is only a logarithmic factor better than the O(|D|² |V| log |V|) ≈ O(|D|^2.5 log |D|) needed for all pairwise distances, but an optimized implementation of the matrix dot product can well make it worthwhile in practice. I would like to benchmark this in the future.
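A sketch of this procedure in NumPy; the function name is mine, the matrices are assumed dense with terms as rows and documents as columns, and a practical implementation would likely use scipy.sparse instead:

```python
import numpy as np

def subset_matrices(A, B):
    """Compute M and N as described above for term-document matrices
    A (|V| × |D_A|) and B (|V| × |D_B|): document j of B uses a subset
    of the terms of document i of A iff N[i, j] == 1, and a strict
    subset iff additionally M[i, j] == 0."""
    S = np.sign(A).T @ np.sign(B)              # shared-term counts, |D_A| × |D_B|
    terms_A = np.sign(A).sum(axis=0)           # distinct terms per document of A
    terms_B = np.sign(B).sum(axis=0)           # distinct terms per document of B
    with np.errstate(invalid="ignore", divide="ignore"):
        M = np.floor(S / terms_A[:, None])     # divide each column by |terms(A_i)|
        N = np.floor(S / terms_B[None, :])     # divide each row by |terms(B_j)|
    M[np.isnan(M)] = 1                         # 0 / 0 cells (empty documents)
    N[np.isnan(N)] = 1
    return M, N

# Toy check: A holds documents {0, 1} and {1, 2}; B holds {1} and {0, 1}.
A = np.array([[1, 0], [2, 1], [0, 3]])
B = np.array([[0, 1], [1, 1], [0, 0]])
M, N = subset_matrices(A, B)
strict = (N == 1) & (M == 0)   # B_0 ⊂ A_0 and B_0 ⊂ A_1; B_1 equals A_0, so not strict
```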