Problem inferring topics on new docs using a saved model #8

Open
mjockers opened this issue Dec 4, 2017 · 0 comments

Here is a dummied-up script that reproduces what seems to be a bug with inference in dfrtopics.

options(java.parameters="-Xmx6g")
library(dfrtopics)
library(dplyr)

#First create some dummy data for repeatability. Read in Moby Dick from Project Gutenberg. Since readLines splits on newline characters, we'll treat each line as a separate "text".

texts <- readLines("http://www.gutenberg.org/files/2701/2701-0.txt")

#Now remove those pesky blanks

texts <- texts[texts != ""]
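
#Side note on the blank-line filter: negative indexing with which() returns an empty vector when nothing matches, whereas logical indexing leaves the vector intact. A minimal sketch on a made-up vector (not the Gutenberg text):

```r
# Toy data, invented for illustration
with_blanks <- c("Call me Ishmael.", "", "Some years ago")
no_blanks   <- c("Call me Ishmael.", "Some years ago")

# Both idioms agree when at least one blank is present
drop_which   <- with_blanks[-which(with_blanks == "")]
drop_logical <- with_blanks[with_blanks != ""]

# With no blanks, which() returns integer(0), and x[-integer(0)] selects nothing
lost <- no_blanks[-which(no_blanks == "")]
kept <- no_blanks[no_blanks != ""]
```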

#Grab 2000 random items for training and put them into a data frame with the proper column names and some dummy id labels (tibble() replaces the deprecated data_frame())

training_docs <- tibble(id = paste("Train", 1:2000, sep="_"), text = sample(texts, 2000))

#Now grab another 100 that we'll pretend are new documents for inference later on

inference_docs <- tibble(id = paste("Test", 1:100, sep="_"), text = sample(texts, 100))
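
#Note that the two sample() calls above are independent, so some "test" lines may also appear in the training set, and without a seed the draws differ from run to run. A sketch of a seeded, non-overlapping split, assuming a stand-in vector in place of the real lines (the seed value and the toy_ names are arbitrary):

```r
set.seed(1966)                      # arbitrary seed, chosen for illustration
toy_texts <- paste("line", 1:50)    # stand-in for the real lines

# Shuffle the indices once, then slice disjoint ranges
idx       <- sample(length(toy_texts))
train_idx <- idx[1:40]
test_idx  <- idx[41:50]

toy_training <- data.frame(
  id = paste("Train", seq_along(train_idx), sep = "_"),
  text = toy_texts[train_idx],
  stringsAsFactors = FALSE
)
toy_inference <- data.frame(
  id = paste("Test", seq_along(test_idx), sep = "_"),
  text = toy_texts[test_idx],
  stringsAsFactors = FALSE
)
```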

#Make an instance list for the training docs (for the sake of this demo, no stoplist)

training_ilist <- make_instances(training_docs)

#Train a topic model

m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)

#Now write the model to disk so we can load it later. Also write out the instance list; we're going to need it.

write_mallet_model(m, "DEMO_MODEL", save_instances = TRUE)

#Before we can infer the topical makeup of new files, we need a compatible instance list (aka --use-pipe-from in MALLET).

#For some reason, load_mallet_model_directory does not load the instance file that we saved above as part of the write_mallet_model call. I'm not sure why.
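
#One quick check is whether the instance file actually landed on disk. A base-R sketch; has_saved_instances is a made-up helper, and the file name instances.mallet is taken from the read_instances call later in this script:

```r
# Hypothetical helper, not part of dfrtopics
has_saved_instances <- function(model_dir, file_name = "instances.mallet") {
  file.exists(file.path(model_dir, file_name))
}

# List whatever was written, then test for the instance file
list.files("DEMO_MODEL")
has_saved_instances("DEMO_MODEL")
```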

#Interestingly, we can build an inferencer from the model before reloading it with load_mallet_model_directory, but not after. In other words, this works correctly:

inf <- inferencer(m)
inf

#But once we reload the model from file, like this

m <- load_mallet_model_directory("DEMO_MODEL") #DEMO_MODEL = local path

#We can't create an inferencer
inf <- inferencer(m)
inf # returns NULL

#Hmm, that's weird. Imagine that we quit R and want to come back another day and load the model and do some inference on some new files. It looks like we cannot do that.

#But maybe there is another route. I saved the instance list, so perhaps I can read it in and use it with the compatible_instances(docs, instances) function.

ilist <- read_instances("DEMO_MODEL/instances.mallet")
inference_ilist <- compatible_instances(inference_docs, ilist)

#OK, so now we've got a model loaded from disk and a compatible instance list. I should be able to infer topics on new docs . . .

inferred_m <- infer_topics(m, inference_ilist) # Tada!

#But no. . . .

#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#RcallMethod: invalid object parameter

#According to the help file, m can be either a topic inferencer object (from read_inferencer or inferencer) or a mallet_model object. m is of the latter type:

class(m)
#[1] "mallet_model"

#So why the error?
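
#One guess, not confirmed: as far as I can tell, rJava raises "invalid object parameter" when the object handed to .jcall is not a Java reference, which is what would happen if the inferencer is NULL, as it is after reloading above. A sketch of a guard that fails earlier with a clearer message; check_inferencer is a made-up helper, not a dfrtopics or rJava function:

```r
# Hypothetical helper, not part of dfrtopics or rJava
check_inferencer <- function(inf) {
  if (is.null(inf)) {
    stop("inferencer is NULL: the model has no live Java object behind it",
         call. = FALSE)
  }
  invisible(inf)
}

# Demonstration, with NULL standing in for what inferencer(m) returns after reloading
msg <- tryCatch(check_inferencer(NULL), error = function(e) conditionMessage(e))
msg
```

#In the script this would be check_inferencer(inferencer(m)) just before the infer_topics call.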

#Let's try another route. Rebuild the same model

m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
m_inferencer <- inferencer(m)

#Save it to disk

write_inferencer(m_inferencer, "DEMO_MODEL/m_inferencer.mallet")

#Read the inferencer from the file

inf <- read_inferencer("DEMO_MODEL/m_inferencer.mallet")
test <- infer_topics(inf, inference_ilist)

#Ugh. Same error again . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#RcallMethod: invalid object parameter

#What now?
