Program crashes for larger quantities of articles #9
Sorry for not seeing your comment earlier. Basically, …
---

This sounds super-cool. Keep me posted to see what will happen next. I'm curious whether the output of ZombieWriter improves significantly when given enough paragraphs.

EDIT: That being said, the stack trace also seems to mention …
---

Er, looking at the error trace again, I don't think the problem is with matrix multiplication (since that's the buggy/slow part of …). In case I can't handle it, I could either try to make 'titles' for article-clusters optional, or fall back to an empty title if one can't be generated.
---

That would be great! Just a warning: the content in these has some explicit words. Currently, my use case is to have the program write its own reviews for video games, so for this data set I grabbed about 300 lines of text from real user reviews of the Call of Duty video game, which happen to be mostly negative. It shouldn't be relevant to the code, but I wanted to explain why the content itself might seem so weird or poorly written :) Let me know if there's any other info that would help you!
---

Working on the issue right now. As a side note, I am using a new version of Ruby and forgot to install rb-gsl, so the program thought I didn't have GSL installed, and I got a separate error with your dataset.
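For anyone else who hits that: the fix is simply installing the GSL bindings. A minimal Gemfile sketch, assuming Bundler is in use and the GSL C library itself is already installed on the system:

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'classifier-reborn' # LSI clustering and summarization
gem 'rb-gsl'            # native GSL bindings; without this gem,
                        # Classifier-Reborn falls back to its slow pure-Ruby matrices
```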
Installing rb-gsl resolved this and allowed me to reproduce the issue you're facing (though it also suggests to me that I really do need to make using …).

After reading the docs of ClassifierReborn::Summarizer, I think I know what might be causing summarization to fail even for articles that appear to have two or more sentences. I'll demonstrate with the following article:

```
FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered


I am not a fan of the Battlefield series but it blows this COD away.
```

First, Classifier-Reborn takes that article and splits it up into sentences using a regex. The split produces this chunk array:

```ruby
["FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n\n\nI am not a fan of the Battlefield series but it blows this COD away", ".", "\n"]
```

Classifier-Reborn then creates a new LSI dedicated only to summarization and feeds it the chunks:

```ruby
chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
```

Essentially, we throw away any chunk that is a one-word "sentence" (such as ".") or that is empty once we strip away whitespace. So the only chunk that we add to the new LSI is the entire article body, i.e., a single sentence according to LSI.
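To make that filter concrete, here is a minimal sketch that replays it over the chunk array quoted above (the splitting regex itself isn't reproduced in this thread, so we start from its output):

```ruby
# The chunks Classifier-Reborn produced from the example article.
chunks = [
  "FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n\n\nI am not a fan of the Battlefield series but it blows this COD away",
  ".",
  "\n"
]

# Drop chunks that are empty after stripping, or that contain only one word.
kept = chunks.reject { |chunk| chunk.strip.empty? || chunk.strip.split.size == 1 }

puts kept.size # => 1: the whole article survives as a single "sentence"
```

With only one chunk in the summarization index, there is nothing for the LSI to rank it against.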
After Classifier-Reborn runs, the LSI looks like this:

```ruby
#<ClassifierReborn::LSI:0x007fed63338760
 @auto_rebuild=false,
 @built_at_version=-1,
 @cache_node_vectors=nil,
 @items=
  {"FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n" +
   "\n" +
   "\n" +
   "I am not a fan of the Battlefield series but it blows this COD away"=>
    #<ClassifierReborn::ContentNode:0x007fed630b16e0
     @categories=[],
     @lsi_norm=nil,
     @lsi_vector=nil,
     @word_hash=
      {:fwiw=>1,
       :top=>1,
       :cod=>2,
       :multiplay=>1,
       :mw2=>1,
       :ghost=>1,
       :hate=>1,
       :ruin=>1,
       :seri=>2,
       :imo=>1,
       :mw3=>1,
       :remast=>1,
       :fan=>1,
       :battlefield=>1,
       :blow=>1,
       :awai=>1}>},
 @language="en",
 @version=1,
 @word_list=#<ClassifierReborn::WordList:0x007fed63338738 @location_table={}>>
```

Since there is only one sentence, it is impossible for the LSI to summarize the content, so an error is thrown.

This, by the way, is why you were unable to prevent the error by adding the sentence "Test." to the end of every comment: Classifier-Reborn is programmed to reject single-word sentences when generating summaries, so it threw that sentence away. When I replaced every instance of "Test." with "Test Sentence." in the second CSV file, I was able to generate articles and summaries without issues (though a lot of the summaries were simply "Test Sentence").

(Note: when I began removing problematic clusters from your original CSV, I still sometimes saw articles with only one legitimate sentence (such as "Multiplayer/pvp hmmm.\n"), which of course causes summarization to fail. So merely handling the case where summarization fails despite two or more sentences won't really help in the long term; we need a more general solution than appending "Test Sentence.".)

Since the root cause of this issue is headline generation, I'm tempted to either find a better way of summarizing articles or just downplay/remove that feature, since I'm not sure the headlines actually add anything to the articles. For now, though, a good general hotfix would be to generate an empty title whenever ClassifierReborn throws an error. I'll work on doing that right now.
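A minimal sketch of what such a fallback could look like; the safe_header name and the rescue granularity here are illustrative, not the actual zombie_writer source (the stack trace above shows the real entry point is ClassifierReborn::Summarizer's summary):

```ruby
require 'classifier-reborn'

# Hypothetical fallback: return an empty title if Classifier-Reborn
# raises while summarizing, e.g. because the article boils down to a
# single usable sentence.
def safe_header(text)
  ClassifierReborn::Summarizer.summary(text, 1)
rescue StandardError
  '' # empty title instead of crashing article generation
end
```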
---

Just pushed up ZombieWriter version 0.3.0 to rubygems.org. Now, if we encounter an error from Classifier-Reborn when generating titles, we generate an empty title instead. Let me know if this fixes the problem.
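To pick up the fix with Bundler, pinning the new version should work; a one-line Gemfile sketch:

```ruby
gem 'zombie_writer', '~> 0.3.0' # includes the empty-title fallback
```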
---

Just tried the updated version, and it all works now, including the one-sentence versions. I've been able to get some awesome output as a result and am excited to try it with other data sets. Definitely appreciate the fast response and update!
---

I've been using ZombieWriter and found that it hits the same crash in Classifier-Reborn when I have a larger quantity of rows in the CSV file:
```
Jacks-MacBook-Pro:Projects johncambou$ ruby review-generator.rb
/Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi/content_node.rb:30:in `transposed_search_vector': undefined method `col' for nil:NilClass (NoMethodError)
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:190:in `block in proximity_array_for_content'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:188:in `collect'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:188:in `proximity_array_for_content'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:166:in `block in highest_relative_content'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:166:in `each_key'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:166:in `highest_relative_content'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi/summarizer.rb:29:in `perform_lsi'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi/summarizer.rb:10:in `summary'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:21:in `header'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:69:in `block in generate_articles'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:57:in `map'
        from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:57:in `generate_articles'
        from review-generator.rb:12:in
```
What's really strange to me is that this only happens for larger quantities of articles. When I have ~40 or fewer rows in the CSV, it runs fine, but once I get to ~50+, the program will always hit the crash.
What's even stranger is that this doesn't seem to be consistent: sometimes it will crash at only 35 CSV lines, and sometimes it runs successfully at 56. Sometimes it will crash on the exact same CSV file that it was processing correctly earlier.
I've tested very meticulously whether this is caused by the specific content of my articles, but the program runs fine for any subset of them; it only crashes once I get above a certain rough limit in quantity.
At this point I have tried: …
I'm completely lost. Ideally I'd like to run the program with 300+ paragraphs so that I can really get crazy with the output, and it's disappointing to be capped at so few. If you have any suggestions on how to fix this, it'd be greatly appreciated.
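If it helps with reproducing, a hypothetical bisection harness along these lines could narrow down where the crash starts. The MagicRecall, add_string, and generate_articles names follow ZombieWriter's README and should be treated as assumptions here, as should the CSV filename and column:

```ruby
require 'csv'
require 'zombie_writer'

# Feed growing prefixes of the CSV to ZombieWriter and report the
# first size at which article generation crashes.
rows = CSV.read('reviews.csv', headers: true)

(10..rows.size).step(5) do |n|
  zombie = ZombieWriter::MagicRecall.new
  rows.first(n).each { |row| zombie.add_string(content: row['content']) }
  begin
    zombie.generate_articles
    puts "#{n} rows: ok"
  rescue NoMethodError => e
    puts "#{n} rows: crashed (#{e.message})"
    break
  end
end
```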