While trying to fix an unrelated issue, I experimented with the code from #5, but using ZombieWriter::MachineLearning rather than ZombieWriter::Randomization.
zombie = ZombieWriter::MachineLearning.new
zombie.add_string(content: "This is filler text that I invented.This is also a paragraph that could be used")
zombie.add_string(content: "This post is amazing. Please take a look")
zombie.add_string(content: "For all sports fan, you must watch this video. Hey you have to check this out.")
array = zombie.generate_articles
p array
# /Users/tariqali/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/kmeans-clusterer-0.11.4/lib/kmeans-clusterer.rb:237:in `sort_by': comparison of Float with NaN failed (ArgumentError)
The culprit is the third string. Classifier-Reborn computed its lsi_norm as a vector of NaNs:
"For all sports fan, you must watch this video. Hey you have to check this out.\n"=>
 #<ClassifierReborn::ContentNode:0x007fdec4b25ae8
 @categories=[],
 @lsi_norm=GSL::Vector
 [ nan nan nan nan nan nan nan ... ],
 @lsi_vector=GSL::Vector
 [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @raw_norm=GSL::Vector
 [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @raw_vector=GSL::Vector
 [ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={:for=>1, :sport=>1, :fan=>1, :must=>1, :watch=>1, :video=>1, :hei=>1, :check=>1, :out=>1}>}
Changing the third string slightly resolves the issue.
zombie = ZombieWriter::MachineLearning.new
zombie.add_string(content: "This is filler text that I invented.This is also a paragraph that could be used")
zombie.add_string(content: "This post is amazing. Please take a look")
zombie.add_string(content: "For all sports fan, you must watch this video. Hey you have to check this out. Filler, filler, filler.")
array = zombie.generate_articles
p array
"For all sports fan, you must watch this video. Hey you have to check this out. Filler, filler, filler.\n"=>
#<ClassifierReborn::ContentNode:0x007fd931432fd0
@categories=[],
@lsi_norm=GSL::Vector
[ 6.205e-01 1.432e-01 1.432e-01 1.432e-01 1.432e-01 1.432e-01 0.000e+00 ... ],
@lsi_vector=GSL::Vector
[ 6.593e-01 1.522e-01 1.522e-01 1.522e-01 1.522e-01 1.522e-01 0.000e+00 ... ],
@raw_norm=GSL::Vector
[ 5.547e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@raw_vector=GSL::Vector
[ 6.272e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
@word_hash={:for=>1, :sport=>1, :fan=>1, :must=>1, :watch=>1, :video=>1, :hei=>1, :check=>1, :out=>1, :filler=>3}>}
But why? Both scenarios produced a populated @word_hash, so it isn't clear why one string ended up with a vector of NaNs and the other didn't. Is it because, in the second scenario, the third string shares a word ("filler") with the first string? I will have to research this issue more carefully and decide how to gracefully handle this potential error.
This problem is unlikely to happen in the real world: if you add long passages to ZombieWriter, there are bound to be a few overlapping words that classifier-reborn can detect. But it could happen, which is why I need to figure out how to fix it.
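One possible way to handle this gracefully, sketched in plain Ruby: before clustering, set aside any document whose vector contains NaN, so the clusterer only sees usable vectors. This is only an illustration of the idea, not what ZombieWriter does today. `partition_by_valid_vector` is a hypothetical helper, and the vectors are plain arrays standing in for the GSL vectors above.

```ruby
# Hypothetical guard: split a { text => vector } hash into entries with a
# usable vector and entries whose vector contains at least one NaN.
def partition_by_valid_vector(vectors_by_text)
  vectors_by_text.partition do |_text, vector|
    vector.none? { |x| x.respond_to?(:nan?) && x.nan? }
  end.map(&:to_h)
end

# Stand-in data; the numbers are illustrative only.
vectors = {
  "first"  => [0.62, 0.14, 0.0],
  "second" => [Float::NAN, Float::NAN, Float::NAN]
}

usable, skipped = partition_by_valid_vector(vectors)
p usable.keys  # ["first"]
p skipped.keys # ["second"]
```

The skipped entries would still need a home, for example a leftover cluster of their own, so that no text is silently dropped from the generated articles.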
Hi @mahaina. I'll see if I can work on this issue, probably in the next two weeks. If you have a sample corpus where this error can occur reliably, please send that over to me so that I can use it as 'test' material (though it's not necessary and I can work with the existing corpus within the OP). Right now though, I'm using those three sentences I mentioned in the OP, which allows me to reliably reproduce the error, but it's possible that your corpus might have some unique characteristics as well.