Merge branch 'latest' (Sparse random projections, bugfixes)
zgornel committed Jan 10, 2019
2 parents c910a7c + 9636245 commit 654ec36
Showing 10 changed files with 481 additions and 45 deletions.
5 changes: 5 additions & 0 deletions NEWS.md
@@ -1,5 +1,10 @@
## StringAnalysis Release Notes

v0.2.4
------
- Added sparse random projections
- Bugfixes

v0.2.3
------
- Preprocessing improvements
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "StringAnalysis"
uuid = "f2e5b97c-ed87-11e8-3fa8-a300efcc83e4"
authors = ["Corneliu Cofaru <cornel@oxoaresearch.com>"]
version = "0.2.1"
version = "0.2.4"

[deps]
BinaryProvider = "b99e7846-7c00-51b0-8f62-c81ae34c0232"
71 changes: 55 additions & 16 deletions docs/src/examples.md
@@ -95,6 +95,27 @@ update_inverse_index!(crps)
crps.inverse_index
```

## Preprocessing
Text preprocessing is done mainly through the `prepare` and `prepare!` functions together with preprocessing flags, most of which start with `strip_` (`stem_words` is the exception). The `prepare` function works on `AbstractDocument`, `Corpus` and `AbstractString` types and returns new objects; `prepare!` works only on `AbstractDocument`s and `Corpus` objects, since strings are immutable.
```@repl index
str="This is a text containing words, some more words, a bit of punctuation and 1 number...";
sd = StringDocument(str);
flags = strip_punctuation|strip_articles|strip_punctuation|strip_whitespace
prepare(str, flags)
prepare!(sd, flags);
text(sd)
```
More extensive preprocessing examples can be viewed in `test/preprocessing.jl`.
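
The use of `|` to combine flags suggests that they are integer bit masks. The sketch below illustrates that idea with hypothetical constants (the `DEMO_` names, their values and `has_flag` are assumptions made for illustration and are not part of the package):

```julia
# Hypothetical flag values for illustration only; the actual values used by
# StringAnalysis may differ.
DEMO_STRIP_WHITESPACE  = UInt32(0x1) << 0
DEMO_STRIP_PUNCTUATION = UInt32(0x1) << 1
DEMO_STRIP_ARTICLES    = UInt32(0x1) << 2

# Combine flags with bitwise OR; repeating a flag has no effect
demo_flags = DEMO_STRIP_PUNCTUATION | DEMO_STRIP_ARTICLES | DEMO_STRIP_PUNCTUATION

# Test whether a given flag is set in a combination of flags
has_flag(flags, flag) = (flags & flag) != 0

has_flag(demo_flags, DEMO_STRIP_ARTICLES)    # true
has_flag(demo_flags, DEMO_STRIP_WHITESPACE)  # false
```

Under this assumption, listing the same flag twice in a combination is harmless, since OR-ing a bit with itself leaves it unchanged.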

One can also strip parts of speech, e.g. prepositions and articles, in languages other than English (support is provided by [Languages.jl](https://github.com/JuliaText/Languages.jl)):
```@repl index
using Languages
it = StringDocument("Quest'e un piccolo esempio di come si puo fare l'analisi");
StringAnalysis.language!(it, Languages.Italian());
prepare!(it, strip_articles|strip_prepositions|strip_whitespace);
it.text
```

## Features

### Document Term Matrix (DTM)
@@ -167,17 +188,35 @@ tf!(M.dtm, tfm);
Matrix(tfm)
```

## Pre-processing
The text preprocessing mainly consists of the `prepare` and `prepare!` functions and preprocessing flags which start mostly with `strip_` except for `stem_words`. The preprocessing function `prepare` works on `AbstractDocument`, `Corpus` and `AbstractString` types, returning new objects; `prepare!` works only on `AbstractDocument`s and `Corpus` as the strings are immutable.
## Dimensionality reduction

### Random projections
In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points that lie in Euclidean space. Random projection methods are known for their simplicity and for producing less erroneous output than other methods. Experimental results indicate that random projections preserve distances well, although empirical results are sparse. They have been applied to many natural language tasks under the name of _random indexing_. The core idea behind random projection is given by the [Johnson-Lindenstrauss lemma](https://cseweb.ucsd.edu/~dasgupta/papers/jl.pdf), which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way that approximately preserves the distances between the points [(Wikipedia)](https://en.wikipedia.org/wiki/Random_projection).
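
A small numerical check can make the distance-preservation claim concrete. The sketch below uses only Julia's standard library and a dense Gaussian projection matrix rather than the sparse matrices used by this package; the sizes, the seed and all variable names are illustrative assumptions.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(0)
n, d, k = 20, 2000, 500            # number of points, original and reduced dimensions
X = randn(rng, n, d)               # n points in d-dimensional space (one per row)
R = randn(rng, k, d) ./ sqrt(k)    # dense Gaussian projection, scaled by 1/sqrt(k)
Y = X * R'                         # projected points, n × k

# Ratio of projected distance to original distance for every pair of points
ratios = [norm(Y[i, :] - Y[j, :]) / norm(X[i, :] - X[j, :])
          for i in 1:n for j in (i + 1):n]

extrema(ratios)                    # both values are expected to lie close to 1
```

With these sizes the pairwise distances shrink or grow only slightly, even though the dimensionality drops from 2000 to 500.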

The implementation here relies on the generalized sparse random projection matrix to generate a random projection model. For more details see the API documentation for `RPModel` and `random_projection_matrix`.
To construct a random projection matrix that maps `m` dimensions to `k`, one can do
```@repl index
str="This is a text containing words and a bit of punctuation...";
flags = strip_punctuation|strip_articles|strip_punctuation|strip_whitespace
prepare(str, flags)
sd = StringDocument(str);
prepare!(sd, flags);
text(sd)
m = 10; k = 2; T = Float32;
density = 0.2; # fraction of non-zero elements
R = StringAnalysis.random_projection_matrix(m, k, T, density)
```
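As a rough illustration of what a sparse random projection matrix can look like, the sketch below follows a common construction in which each entry is non-zero with probability `density` and takes the value `±1/sqrt(density * k)` with equal probability. The helper `sparse_rp_matrix` is hypothetical and is not the package's `random_projection_matrix`; the `k × m` shape is an assumption chosen to be consistent with the `M.dtm * model.R'` usage in this section.

```julia
using Random, SparseArrays

# Hypothetical helper for illustration; not part of StringAnalysis.
function sparse_rp_matrix(rng::AbstractRNG, m::Int, k::Int, ::Type{T},
                          density::Real) where {T<:AbstractFloat}
    0 < density <= 1 || throw(ArgumentError("density must be in (0, 1]"))
    # Draw a random sparsity pattern for a k × m matrix
    I, J, _ = findnz(sprand(rng, k, m, Float64(density)))
    # Every stored entry is +v or -v with equal probability
    v = T(1 / sqrt(density * k))
    vals = T[rand(rng, Bool) ? v : -v for _ in eachindex(I)]
    return sparse(I, J, vals, k, m)
end

R_demo = sparse_rp_matrix(MersenneTwister(42), 10, 2, Float32, 0.2)
```

The scaling keeps squared vector lengths unchanged in expectation, and a lower `density` gives a sparser matrix and therefore a cheaper projection.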
Building a random projection model from a `DocumentTermMatrix` or `Corpus` is straightforward
```@repl index
M = DocumentTermMatrix{Float32}(crps)
model = RPModel(M, k=2, density=0.5, stats=:tf)
model2 = rp(crps, T, k=17, density=0.1, stats=:tfidf)
```
Once the model is created, one can reduce document term vector dimensionality. First, the document term vector is constructed using the `stats` keyword argument and subsequently, the vector is projected into the random sub-space:
```@repl index
doc = StringDocument("this is a new document")
embed_document(model, doc)
embed_document(model2, doc)
```
To embed a document term matrix, one only has to do
```@repl index
Matrix(M.dtm * model.R') # 3 documents x 2 sub-space dimensions
Matrix(M.dtm * model2.R') # 3 documents x 17 sub-space dimensions
```
More extensive preprocessing examples can be viewed in `test/preprocessing.jl`.

## Semantic Analysis

@@ -203,34 +242,34 @@ M = DocumentTermMatrix{Float32}(crps, sort(collect(keys(crps.lexicon))));
```
building an LSA model is straightforward:
```@repl index
lsa_model = LSAModel(M, k=3, stats=:tf)
lm = LSAModel(M, k=3, stats=:tf)
```
Once the model is created, it can be used to either embed documents
```@repl index
query = StringDocument("Apples and an exotic fruit.");
embed_document(lsa_model, query)
embed_document(lm, query)
```
search for matching documents
```@repl index
idxs, corrs = cosine(lsa_model, query);
idxs, corrs = cosine(lm, query);
for (idx, corr) in zip(idxs, corrs)
println("$corr -> \"$(crps[idx].text)\"");
end
```
or check for structure within the data
```@repl index
U, V = lsa_model.U, lsa_model.Vᵀ';
U, V = lm.U, lm.Vᵀ';
Matrix(U*U') # document to document similarity
Matrix(V*V') # term to term similarity
```
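To make the roles of `U`, `Σinv` and `Vᵀ` more concrete, the sketch below runs a truncated SVD on a toy document term matrix using only the `LinearAlgebra` standard library. The data, the variable names and the `fold_in` helper are illustrative assumptions; it shows the textbook LSA folding-in step, which is not necessarily identical to what `embed_document` does internally.

```julia
using LinearAlgebra

# Toy document term matrix: 3 documents × 4 terms (made-up counts)
X = Float32[1 1 0 0;
            0 1 1 0;
            0 0 1 1]
k = 2                               # number of latent dimensions to keep

F = svd(X)                          # X ≈ F.U * Diagonal(F.S) * F.Vt
U    = F.U[:, 1:k]                  # document embeddings, one row per document
Σinv = Diagonal(1 ./ F.S[1:k])      # inverse of the k largest singular values
Vt   = F.Vt[1:k, :]                 # transposed term embeddings

# Fold a term vector into the latent space (hypothetical helper)
fold_in(d) = Σinv * (Vt * d)

dhat = fold_in(Float32[1, 0, 1, 0]) # embedding of a new document
```

Folding in a row of `X` recovers the corresponding row of `U`, which is a quick way to check that the three factors fit together.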
LSA models can be saved to and retrieved from an easy to read and parse text format.
```@repl index
file = "model.txt"
lsa_model
save(lsa_model, file) # model saved
lm
save_lsa_model(lm, file) # model saved
print(join(readlines(file)[1:5], "\n")) # first five lines
new_model = load(file, Float64) # change element type
new_model = load_lsa_model(file, Float64) # change element type
rm(file)
```

7 changes: 5 additions & 2 deletions src/StringAnalysis.jl
@@ -53,8 +53,10 @@ module StringAnalysis
export Stemmer, stem!, stem, stemmer_types
export tokenize, tokenize_fast, tokenize_slow, sentence_tokenize
export tf!, tf, tf_idf!, tf_idf, bm_25!, bm_25
export LSAModel, lsa, embed_document, embed_word, get_vector, index,
cosine, similarity, load, save
export LSAModel, lsa, save_lsa_model, load_lsa_model
export RPModel, rp, save_rp_model, load_rp_model
export embed_document, similarity, cosine,
vocabulary, in_vocabulary, index, get_vector
export lda
export frequent_terms, sparse_terms,
prepare!, prepare,
@@ -80,6 +82,7 @@ module StringAnalysis
include("dtm.jl")
include("stats.jl")
include("lsa.jl")
include("rp.jl")
include("lda.jl")
include("preprocessing.jl")
include("show.jl")
34 changes: 17 additions & 17 deletions src/lsa.jl
@@ -4,8 +4,9 @@
LSA (latent semantic analysis) model. It constructs from a document term matrix (dtm)
a model that can be used to embed documents in a latent semantic space pertaining to
the data. The model requires that the document term matrix be a
`DocumentTermMatrix{T<:AbstractFloat}` because the matrices resulted from the SVD operation
will be forced to contain elements of type `T`.
`DocumentTermMatrix{T<:AbstractFloat}` because the elements of the matrices resulting
from the SVD operation are floating point numbers and these have to match or be
convertible to type `T`.
# Fields
* `vocab::Vector{S}` a vector with all the words in the corpus
@@ -14,10 +15,8 @@ will be forced to contain elements of type `T`.
* `Σinv::A` inverse of the singular value matrix
* `Vᵀ::A` transpose of the word embedding matrix
* `stats::Symbol` the statistical measure to use for word importances in documents
available values are:
`:tf` (term frequency)
`:tfidf` (default, term frequency - inverse document frequency)
`:bm25` (Okapi BM25)
Available values are: `:tf` (term frequency), `:tfidf` (default, term frequency -
inverse document frequency) and `:bm25` (Okapi BM25)
* `idf::Vector{T}` inverse document frequencies for the words in the vocabulary
* `nwords::T` average number of words in a document
* `κ::Int` the `κ` parameter of the BM25 statistic
@@ -148,14 +147,14 @@ function Base.show(io::IO, lm::LSAModel{S,T,A,H}) where {S,T,A,H}
num_docs, len_vecs = size(lm.U)
num_terms = length(lm.vocab)
print(io, "LSA Model ($(lm.stats)) $(num_docs) documents, " *
"$(num_terms) terms, $(len_vecs)-element $(T) vectors")
"$(num_terms) terms, dimensionality $(len_vecs), $(T) vectors")
end


"""
lsa(X [;k=3, stats=:tfidf, κ=2, β=0.75, tol=1e-15])
lsa(X [;k=<num documents>, stats=:tfidf, κ=2, β=0.75, tol=1e-15])
Constructs an LSA model. The input `X` can be a `Corpus` or a `DocumentTermMatrix`.
Constructs a LSA model. The input `X` can be a `Corpus` or a `DocumentTermMatrix`.
Use `?LSAModel` for more details. Vector components smaller than `tol` will be
zeroed out.
"""
@@ -169,7 +168,7 @@ function lsa(dtm::DocumentTermMatrix{T};
end

function lsa(crps::Corpus,
::Type{T} = Float32;
::Type{T} = DEFAULT_FLOAT_TYPE;
k::Int=length(crps),
stats::Symbol=:tfidf,
tol::T=T(1e-15),
@@ -235,7 +234,7 @@ end
"""
embed_document(lm, doc)
Return the vector representation of a document `doc` using the LSA model `lm`.
Return the vector representation of a document `doc`, obtained using the LSA model `lm`.
"""
embed_document(lm::LSAModel{S,T,A,H}, doc::AbstractDocument) where {S,T,A,H} =
# Hijack vocabulary hash to use as lexicon (only the keys needed)
@@ -302,7 +301,8 @@ end
"""
similarity(lm, doc1, doc2)
Return the cosine similarity value between two documents `doc1` and `doc2`.
Return the cosine similarity value between two documents `doc1` and `doc2`
whose vector representations have been obtained using the LSA model `lm`.
"""
function similarity(lm::LSAModel, doc1, doc2)
return embed_document(lm, doc1)' * embed_document(lm, doc2)
@@ -314,7 +314,7 @@ end
Saves an LSA model `lm` to disc in file `filename`.
"""
function save(lm::LSAModel{S,T,A,H}, filename::AbstractString) where {S,T,A,H}
function save_lsa_model(lm::LSAModel{S,T,A,H}, filename::AbstractString) where {S,T,A,H}
ndocs = size(lm.U, 1)
nwords = size(lm.Vᵀ, 2)
k = size(lm.U, 2)
@@ -339,14 +339,14 @@ end


"""
load(filename, type; [sparse=true])
load_lsa_model(filename, eltype; [sparse=true])
Loads an LSA model from `filename` into an LSA model object. The embeddings matrix
element type is specified by `type` (default `Float32`) while the keyword argument
element type is specified by `eltype` (default `DEFAULT_FLOAT_TYPE`) while the keyword argument
`sparse` specifies whether the matrix should be sparse or not.
"""
function load(filename::AbstractString, ::Type{T}=Float32;
sparse::Bool=true) where T<: AbstractFloat
function load_lsa_model(filename::AbstractString, ::Type{T}=DEFAULT_FLOAT_TYPE;
sparse::Bool=true) where T<: AbstractFloat
# Matrix type for LSA model
A = ifelse(sparse, SparseMatrixCSC{T, Int}, Matrix{T})
# Define parsed variables local to outer scope of do statement