Skip to content

gitpan/AI-Categorize

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NAME
    AI::Categorize - Automatically categorize documents based on content

SYNOPSIS
      ### This is one of the categorizers available (see below for more)
      use AI::Categorize::NaiveBayes;
      my $c = new AI::Categorize::NaiveBayes();
  
      ### Supply some training documents so it can learn how to categorize
      $c->stopwords('the','a','and','but','I');  # Ignore these words
      $c->add_document($name, \@categories, $content);
      ... repeat for many documents, then:
      $c->crunch();
  
      $c->save_state('filename'); # Save machine for later use
  
      ### Categorize a new unknown document
      my $c = new AI::Categorize::NaiveBayes();
      $c->restore_state('filename');
      my $results = $c->categorize($content);
      if ($results->in_category('sports')) { ... }
      my @cats = $results->categories;
      my @scores = $results->scores(@cats);

DESCRIPTION
    This module implements several algorithms for automatically guessing
    category information of documents based on the category information of
    existing documents. For example, one might categorize incoming email
    messages in order to place them into existing mailboxes, or one might
    categorize newspaper articles by general topic (business, sports, etc.). All
    of the categorizers learn their categorization rules from a body of existing
    pre-categorized documents.

    Disclaimer: the results of any of these algorithms are far from infallible
    (close to fallible?). Categorization of documents is often a difficult task
    even for humans well-trained in the particular domain of knowledge, and
    there are many things a human would consider that none of these algorithms
    consider. These are only statistical tests - at best they are neat tricks or
    helpful assistants, and at worst they are totally unreliable. If you plan to
    use this module for anything important, human supervision is essential.

    But this voodoo can be quite fun. =)

ALGORITHMS
    Currently two different algorithms are implemented in this bundle:

      AI::Categorize::NaiveBayes
      AI::Categorize::kNN

    These are all subclasses of "AI::Categorize". Please see the documentation
    of these individual modules for more details on their guts and quirks. The
    common interface for all the algorithms is described here.

    All these classes are designed to be subclassible so you can modify their
    behavior to suit your needs.

AI::Categorize Methods
    * new()
        Creates a new categorizer object (hereafter referred to as "$c"). The
        arguments to "new()" will depend on which subclass of "AI::Categorize"
        you happen to be using. See the subclasses' individual documentation for
        more info.

    * $c->stopwords()
    * $c->stopwords(@words)
        Gets (and optionally sets) the list of stopwords. Stopwords are words
        that should be ignored by the categorizer, and typically they are the
        most common non-informative words in the documents. The most common
        reason to use stopwords is to reduce processing time.

        The stoplist should be set before processing any documents.

    * $c->stopwords_hash()
        Returns the stopwords as the keys of a hash reference. The corresponding
        values are all 1. Can be useful for quick checking of whether a word is
        a stopword.

    * $c->add_stopword($word)
        Adds a single entry to the stopword list.

    * $c->add_document($name, $categories, $content)
        Adds a new training document to the database. "$name" should be a unique
        string identifying this document. "$categories" may be either the name
        of a single category to which this document belongs, or a reference to
        an array containing the names of several categories. "$content" is the
        text content of the document.

        To ease syntax, in the future "$content" may be allowed to be given as a
        path to the document, which will be opened and parsed.

    * $c->crunch()
        After all documents have been added, call "crunch()" so that the
        categorizer can compute some statistics on the training data and get
        ready to categorize new documents.

    * $c->categorize($content)
        Processes the text in "$content" and returns an object blessed into the
        "AI::Categorize::Result" class.

        To ease memory requirements, in the future "$content" may be allowed to
        be passed as a filehandle.

    * $c->cat_map()
        Returns the 'category map' object, which is an object of the class
        "AI::Categorize::Map". See the documentation for this class below.

    * $c->save_state($filename)
        At any time you may save the state of the categorizer to a file, so that
        you can reload it later using the "restore_state()" method.

    * $c->restore_state($filename)
        Reads in the categorizer data from $filename, which should have
        previously been saved using the "save_state()" method.

    * $c->error(\@assigned_categories, \@correct_categories)
        Returns the fraction of the binary (yes/no) categorization decisions
        that were incorrect. For instance, if there are 4 entries in
        "@assigned_categories", 5 entries in "@correct_categories", 2 of those
        catagories overlap, and there are a total of 20 categories to choose
        from, then the error is "(2+3)/20 = 5/20 = 0.25". In this case, 2 errors
        of false assignment were made, and 3 errors of omission were made.

        The accuracy and error will always have a sum of 1.

    * $c->accuracy(\@assigned_categories, \@correct_categories)
        Returns the fraction of the binary (yes/no) categorization decisions
        that were correct. For instance, if there are 4 entries in
        "@assigned_categories", 5 entries in "@correct_categories", 2 of those
        catagories overlap, and there are a total of 20 categories to choose
        from, then the accuracy is "(2+13)/20 = 15/20 = 0.75". In this case, 2
        correct assignments were made, and 13 categories were correctly omitted.

        The accuracy and error will always have a sum of 1.

    * $c->F1(\@assigned_categories, \@correct_categories)
        This method computes the F1 measure, which is helpful for evaluating how
        well the categorizer did when it assigned categories. The F1 measure is
        defined to be 2 times the number of correctly assigned categories
        divided by the sum of the number of assigned categories and correct
        categories.

        In other words, if A is the set of categories that were assigned by the
        system, C is the set of categories that should have been assigned by the
        system, and I is the intersection of A and C, then

                   2*I
            F1 = -------
                  A + C

        (Other sources may define F1 as "2*recall*precision/(recall+precision)",
        which is equivalent to the above formula but forces division by zero if
        either A or C is empty.)

        A perfect job categorizing (all correct categories were assigned and no
        extras were assigned) will have an F1 score of 1. A terrible job
        categorizing (no overlap between correct & assigned categories) will
        have an F1 score of 0. Medium jobs will be somewhere in between.

    * $c->precision(\@assigned_categories, \@correct_categories)
        Returns the 'precision' measure, which is defined as "I/A", where "A" is
        the number of elements in "@assigned_categories", and "I" is the number
        of elements in the intersection of "@assigned_categories" and
        "@correct_categories".

        If your categorizer is being too strict, i.e. assigning fewer categories
        than it should be, then the precision will be significantly higher than
        the recall.

    * $c->recall(\@assigned_categories, \@correct_categories)
        Returns the 'recall' measure, which is defined as "I/C", where "C" is
        the number of elements in "@correct_categories", and "I" is the number
        of elements in the intersection of "@assigned_categories" and
        "@correct_categories".

        If your categorizer is being too lenient, i.e. assigning more categories
        than it should be, then the recall will be significantly higher than the
        precision.

    * $c->extract_words($text)
        Returns a reference to a hash whose keys are the words contained in
        "$text" and whose values are the number of times each word appears.
        Stopwords are omitted and words are put into canonical form
        (lower-cased, leading & trailing non-word characters stripped).

        Don't call this method directly, as it is used internally by the various
        categorization modules. However, you may be interested in subclassing
        one of the modules and overriding "extract_words()" to behave
        differently. For instance, you may want to "lemmatize" your words to
        remove affixes so that "abominable", "abominableness", "abominably",
        "abominate", "abomination", and "abominator" all share a single entry in
        the categorizer.

"AI::Categorize::Result" Methods
    An "AI::Categorize::Result" object (hereafter abbreviated as "$r") is
    returned by the "$c->categorize" method, described above.

    * $r->in_category($category)
        Returns true or false depending on whether the document was placed in
        the given category.

    * $r->categories()
        Returns an ordered list of the categories the document was placed in,
        with best matches first.

    * $r->scores(@categories)
        Returns a list of result scores for the given categories. Since the
        interface is still changing, not very much can officially be said about
        the scores, except that a good score is higher than a bad score. This
        may change to something like a probability scale, with all numbers
        between 0 and 1, and a threshold for membership somewhere in between.

        Please consider the scoring feature somewhat unstable for now.

"AI::Categorize::Map" Methods
    The "AI::Categorize::Map" class manages the relationships between documents
    and categories. It is designed to support fast lookups either by category or
    by document. An "AI::Categorize::Map" object is returned by the
    "$c->cat_map" method (described above) and will be called "$m" for
    convenience in this documentation.

    In general, feel free to make any queries of the map, but don't make any
    changes to its data. Changes should usually only be made through the
    categorizer object.

    * $m->add_document($doc_name, \@cat_names)
        Adds the given document to the map with membership in the given
        categories. The "AI::Categorize" "add_document()" method calls this
        internally.

    * $m->categories()
        Returns a list of all known categories in a list context, or the number
        of known categories in a scalar context.

    * $m->documents()
        Returns a list of all known categories in a list context, or the number
        of known categories in a scalar context.

    * $m->categories_of($doc_name)
        Returns a list of categories that "$doc_name" belongs to in a list
        context, or the number of such categories in a scalar context.

    * $m->documents_of($cat_name)
        Returns a list of documents that "$cat_name" contains in a list context,
        or the number of such documents in a scalar context.

    * $m->is_in_category($doc_name, $cat_name)
        Returns true if "$doc_name" belongs to the "$cat_name" category, or
        false otherwise.

    * $m->contains_document($cat_name, $doc_name)
        Returns true if "$cat_name" contains the document "$doc_name", or false
        otherwise. Note that this is just a synonym for the "is_in_category()"
        method with the names turned around.

CAVEATS
    Don't depend on the specific scores given by "$r->scores". They may change
    in future releases.

    The entire categorizer is currently created in memory, which can get pretty
    demanding if you have a lot of data. If this turns out to be a problem,
    future versions may try to cache large chunks on disk. This would come with
    a speed penalty.

    Finally, I am not an expert in document categorization. I have thought about
    it some, and I have written these modules largely as a way to concretize my
    thinking and learn more about the processes. If you know of ways to improve
    accuracy, please let me know.

TO DO
    Idea from obvy: try tying into infobot (purl) to identify IRC moods: @moods
    = qw(indifferent flame_mode pissed inebriated happy sad)

AUTHOR
    Ken Williams, ken@forum.swarthmore.edu

COPYRIGHT
    Copyright 2000-2001 Ken Williams. All rights reserved.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

SEE ALSO
    perl(1), DBI(3).

    "A re-examination of text categorization methods" by Yiming Yang the section
    on "http://www.cs.cmu.edu/~yiming/publications.html"

    Other links from Na'im Tyson:

    www.ruf.rice.edu/~barlow/corpus.html (corp. lx.) ciir.cs.umass.edu (info.
    ret) www.georgetown.edu/wilson/IR/IR.html (class in IR @ Georgetown
    University) www.research.att.com/~lewis (professional homepage of David
    Lews, one of the leaders in document categorization. you may want to visit
    his site sooner than the others since he has left AT&T research.)

About

Read-only release history for AI-Categorize

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages