Merge branch 'latest' (Sparse random projections, bugfixes)

zgornel · Jan 10, 2019 · 654ec36
2 parents c910a7c + 9636245
Showing 10 changed files with 481 additions and 45 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -1,5 +1,10 @@
 ## StringAnalysis Release Notes
 
+v0.2.4
+------
+ - Added sparse random projections
+ - Bugfixes
+
 v0.2.3
 ------
  - Preprocessing improvements

diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "StringAnalysis"
 uuid = "f2e5b97c-ed87-11e8-3fa8-a300efcc83e4"
 authors = ["Corneliu Cofaru <cornel@oxoaresearch.com>"]
-version = "0.2.1"
+version = "0.2.4"
 
 [deps]
 BinaryProvider = "b99e7846-7c00-51b0-8f62-c81ae34c0232"

diff --git a/docs/src/examples.md b/docs/src/examples.md
@@ -95,6 +95,27 @@ update_inverse_index!(crps)
 crps.inverse_index
 ```
 
+## Preprocessing
+The text preprocessing mainly consists of the `prepare` and `prepare!` functions and preprocessing flags which start mostly with `strip_` except for `stem_words`. The preprocessing function `prepare` works on `AbstractDocument`, `Corpus` and `AbstractString` types, returning new objects; `prepare!` works only on `AbstractDocument`s and `Corpus` as the strings are immutable.
+```@repl index
+str="This is a text containing words, some more words, a bit of punctuation and 1 number...";
+sd = StringDocument(str);
+flags = strip_punctuation|strip_articles|strip_whitespace
+prepare(str, flags)
+prepare!(sd, flags);
+text(sd)
+```
+More extensive preprocessing examples can be viewed in `test/preprocessing.jl`.
+
+One can also strip parts of speech, e.g. prepositions and articles, in languages other than English (support provided by [Languages.jl](https://github.com/JuliaText/Languages.jl)):
+```@repl index
+using Languages
+it = StringDocument("Quest'e un piccolo esempio di come si puo fare l'analisi");
+StringAnalysis.language!(it, Languages.Italian());
+prepare!(it, strip_articles|strip_prepositions|strip_whitespace);
+it.text
+```
+
 ## Features
 
 ### Document Term Matrix (DTM)
@@ -167,17 +188,35 @@ tf!(M.dtm, tfm);
 Matrix(tfm)
 ```
 
-## Pre-processing
-The text preprocessing mainly consists of the `prepare` and `prepare!` functions and preprocessing flags which start mostly with `strip_` except for `stem_words`. The preprocessing function `prepare` works on `AbstractDocument`, `Corpus` and `AbstractString` types, returning new objects; `prepare!` works only on `AbstractDocument`s and `Corpus` as the strings are immutable.
+## Dimensionality reduction
+
+### Random projections
+In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points that lie in Euclidean space. Random projection methods are known for their simplicity and for producing less erroneous output than other methods. According to experimental results, random projections preserve distances well, although empirical results are sparse. They have been applied to many natural language tasks under the name of _random indexing_. The core idea behind random projection is given in the [Johnson-Lindenstrauss lemma](https://cseweb.ucsd.edu/~dasgupta/papers/jl.pdf), which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves the distances between the points [(Wikipedia)](https://en.wikipedia.org/wiki/Random_projection).
+
+The implementation here relies on the generalized sparse random projection matrix to generate a random projection model. For more details see the API documentation for `RPModel` and `random_projection_matrix`.
+To construct a random projection matrix that maps `m` dimensions to `k`, one can do
 ```@repl index
-str="This is a text containing words and a bit of punctuation...";
-flags = strip_punctuation|strip_articles|strip_punctuation|strip_whitespace
-prepare(str, flags)
-sd = StringDocument(str);
-prepare!(sd, flags);
-text(sd)
+m = 10; k = 2; T = Float32;
+density = 0.2;  # fraction of non-zero elements
+R = StringAnalysis.random_projection_matrix(m, k, T, density)
+```
+Building a random projection model from a `DocumentTermMatrix` or `Corpus` is straightforward
+```@repl index
+M = DocumentTermMatrix{Float32}(crps)
+model = RPModel(M, k=2, density=0.5, stats=:tf)
+model2 = rp(crps, T, k=17, density=0.1, stats=:tfidf)
+```
+Once the model is created, one can reduce document term vector dimensionality. First, the document term vector is constructed using the `stats` keyword argument and subsequently, the vector is projected into the random sub-space:
+```@repl index
+doc = StringDocument("this is a new document")
+embed_document(model, doc)
+embed_document(model2, doc)
+```
+To embed a document term matrix, one only has to do
+```@repl index
+Matrix(M.dtm * model.R')  # 3 documents x 2 sub-space dimensions
+Matrix(M.dtm * model2.R')  # 3 documents x 17 sub-space dimensions
 ```
-More extensive preprocessing examples can be viewed in `test/preprocessing.jl`.
 
 ## Semantic Analysis
 
@@ -203,34 +242,34 @@ M = DocumentTermMatrix{Float32}(crps, sort(collect(keys(crps.lexicon))));
 ```
 building an LSA model is straightforward:
 ```@repl index
-lsa_model = LSAModel(M, k=3, stats=:tf)
+lm = LSAModel(M, k=3, stats=:tf)
 ```
 Once the model is created, it can be used to either embed documents
 ```@repl index
 query = StringDocument("Apples and an exotic fruit.");
-embed_document(lsa_model, query)
+embed_document(lm, query)
 ```
 search for matching documents
 ```@repl index
-idxs, corrs = cosine(lsa_model, query);
+idxs, corrs = cosine(lm, query);
 
 for (idx, corr) in zip(idxs, corrs)
     println("$corr -> \"$(crps[idx].text)\"");
 end
 ```
 or check for structure within the data
 ```@repl index
-U, V = lsa_model.U, lsa_model.Vᵀ';
+U, V = lm.U, lm.Vᵀ';
 Matrix(U*U')  # document to document similarity
 Matrix(V*V')  # term to term similarity
 ```
 LSA models can be saved to and retrieved from an easy to read and parse text format.
 ```@repl index
 file = "model.txt"
-lsa_model
-save(lsa_model, file)  # model saved
+lm
+save_lsa_model(lm, file)  # model saved
 print(join(readlines(file)[1:5], "\n"))  # first five lines
-new_model = load(file, Float64)  # change element type
+new_model = load_lsa_model(file, Float64)  # change element type
 rm(file)
 ```
 
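Note on the random projections section documented in docs/src/examples.md: the idea of a sparse random projection can be illustrated with a minimal sketch that uses only the Julia standard library. It assumes the common construction in which each entry of a `k`×`m` matrix is non-zero with probability `density` and takes the value ±sqrt(1 / (density * k)) with equal probability. `sparse_rp_matrix` is a hypothetical helper written for this illustration; it is not the package's `random_projection_matrix` and may differ from it in argument order, scaling and element type.

```julia
using SparseArrays, LinearAlgebra, Random

# Hypothetical helper (NOT part of StringAnalysis): builds a k×m sparse random
# projection matrix. Each entry is non-zero with probability `density` and
# equals ±sqrt(1 / (density * k)), so squared distances are preserved in expectation.
function sparse_rp_matrix(rng::AbstractRNG, k::Int, m::Int, density::Float64)
    0 < density <= 1 || throw(ArgumentError("density must be in (0, 1]"))
    scale = sqrt(1 / (density * k))
    mask = rand(rng, k, m) .< density        # which entries are non-zero
    signs = rand(rng, [-1.0, 1.0], k, m)     # random sign for every entry
    # Built densely for clarity; fine for a sketch, wasteful for large matrices
    return sparse(scale .* signs .* mask)
end

rng = MersenneTwister(42)
m, k = 2000, 400
X = rand(rng, 5, m)                    # 5 points (rows) in m dimensions
R = sparse_rp_matrix(rng, k, m, 0.2)   # maps m dimensions to k
Y = X * R'                             # 5 points in k dimensions

# Ratio of projected to original distance between the first two points;
# it should be close to 1 when k is large enough
ratio = norm(Y[1, :] - Y[2, :]) / norm(X[1, :] - X[2, :])
println("distance ratio after projection: ", ratio)
```

The product `X * R'` follows the same orientation as `M.dtm * model.R'` in the documentation, with points stored as rows. Lowering `density` makes the matrix cheaper to store and apply, at the cost of a noisier distance estimate for a fixed `k`.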

diff --git a/src/StringAnalysis.jl b/src/StringAnalysis.jl
@@ -53,8 +53,10 @@ module StringAnalysis
     export Stemmer, stem!, stem, stemmer_types
     export tokenize, tokenize_fast, tokenize_slow, sentence_tokenize
     export tf!, tf, tf_idf!, tf_idf, bm_25!, bm_25
-    export LSAModel, lsa, embed_document, embed_word, get_vector, index,
-           cosine, similarity, load, save
+    export LSAModel, lsa, save_lsa_model, load_lsa_model
+    export RPModel, rp, save_rp_model, load_rp_model
+    export embed_document, similarity, cosine,
+           vocabulary, in_vocabulary, index, get_vector
     export lda
     export frequent_terms, sparse_terms,
            prepare!, prepare,
@@ -80,6 +82,7 @@ module StringAnalysis
     include("dtm.jl")
     include("stats.jl")
     include("lsa.jl")
+    include("rp.jl")
     include("lda.jl")
     include("preprocessing.jl")
     include("show.jl")

diff --git a/src/lsa.jl b/src/lsa.jl
@@ -4,8 +4,9 @@
 LSA (latent semantic analysis) model. It constructs from a document term matrix (dtm)
 a model that can be used to embed documents in a latent semantic space pertaining to
 the data. The model requires that the document term matrix be a
-`DocumentTermMatrix{T<:AbstractFloat}` because the matrices resulted from the SVD operation
-will be forced to contain elements of type `T`.
+`DocumentTermMatrix{T<:AbstractFloat}` because the elements of the matrices resulting
+from the SVD operation are floating point numbers and these have to match or be
+convertible to type `T`.
 
 # Fields
   * `vocab::Vector{S}` a vector with all the words in the corpus
@@ -14,10 +15,8 @@ will be forced to contain elements of type `T`.
   * `Σinv::A` inverse of the singular value matrix
   * `Vᵀ::A` transpose of the word embedding matrix
   * `stats::Symbol` the statistical measure to use for word importances in documents
-                    available values are:
-                    `:tf` (term frequency)
-                    `:tfidf` (default, term frequency - inverse document frequency)
-                    `:bm25` (Okapi BM25)
+Available values are: `:tf` (term frequency), `:tfidf` (default, term frequency -
+inverse document frequency) and `:bm25` (Okapi BM25)
   * `idf::Vector{T}` inverse document frequencies for the words in the vocabulary
   * `nwords::T` average number of words in a document
   * `κ::Int` the `κ` parameter of the BM25 statistic
@@ -148,14 +147,14 @@ function Base.show(io::IO, lm::LSAModel{S,T,A,H}) where {S,T,A,H}
     num_docs, len_vecs = size(lm.U)
     num_terms = length(lm.vocab)
     print(io, "LSA Model ($(lm.stats)) $(num_docs) documents, " *
-          "$(num_terms) terms, $(len_vecs)-element $(T) vectors")
+          "$(num_terms) terms, dimensionality $(len_vecs), $(T) vectors")
 end
 
 
 """
-    lsa(X [;k=3, stats=:tfidf, κ=2, β=0.75, tol=1e-15])
+    lsa(X [;k=<num documents>, stats=:tfidf, κ=2, β=0.75, tol=1e-15])
 
-Constructs an LSA model. The input `X` can be a `Corpus` or a `DocumentTermMatrix`.
+Constructs a LSA model. The input `X` can be a `Corpus` or a `DocumentTermMatrix`.
 Use `?LSAModel` for more details. Vector components smaller than `tol` will be
 zeroed out.
 """
@@ -169,7 +168,7 @@ function lsa(dtm::DocumentTermMatrix{T};
 end
 
 function lsa(crps::Corpus,
-             ::Type{T} = Float32;
+             ::Type{T} = DEFAULT_FLOAT_TYPE;
              k::Int=length(crps),
              stats::Symbol=:tfidf,
              tol::T=T(1e-15),
@@ -235,7 +234,7 @@ end
 """
     embed_document(lm, doc)
 
-Return the vector representation of a document `doc` using the LSA model `lm`.
+Return the vector representation of a document `doc`, obtained using the LSA model `lm`.
 """
 embed_document(lm::LSAModel{S,T,A,H}, doc::AbstractDocument) where {S,T,A,H} =
     # Hijack vocabulary hash to use as lexicon (only the keys needed)
@@ -302,7 +301,8 @@ end
 """
     similarity(lm, doc1, doc2)
 
-Return the cosine similarity value between two documents `doc1` and `doc2`.
+Return the cosine similarity value between two documents `doc1` and `doc2`
+whose vector representations have been obtained using the LSA model `lm`.
 """
 function similarity(lm::LSAModel, doc1, doc2)
     return embed_document(lm, doc1)' * embed_document(lm, doc2)
@@ -314,7 +314,7 @@ end
 
 Saves an LSA model `lm` to disc in file `filename`.
 """
-function save(lm::LSAModel{S,T,A,H}, filename::AbstractString) where {S,T,A,H}
+function save_lsa_model(lm::LSAModel{S,T,A,H}, filename::AbstractString) where {S,T,A,H}
     ndocs = size(lm.U, 1)
     nwords = size(lm.Vᵀ, 2)
     k = size(lm.U, 2)
@@ -339,14 +339,14 @@ end
 
 
 """
-    load(filename, type; [sparse=true])
+    load_lsa_model(filename, eltype; [sparse=true])
 
 Loads an LSA model from `filename` into an LSA model object. The embeddings matrix
-element type is specified by `type` (default `Float32`) while the keyword argument
+element type is specified by `eltype` (default `DEFAULT_FLOAT_TYPE`) while the keyword argument
 `sparse` specifies whether the matrix should be sparse or not.
 """
-function load(filename::AbstractString, ::Type{T}=Float32;
-              sparse::Bool=true) where T<: AbstractFloat
+function load_lsa_model(filename::AbstractString, ::Type{T}=DEFAULT_FLOAT_TYPE;
+                        sparse::Bool=true) where T<: AbstractFloat
     # Matrix type for LSA model
     A = ifelse(sparse, SparseMatrixCSC{T, Int}, Matrix{T})
     # Define parsed variables local to outer scope of do statement