yooper edited this page Aug 21, 2016 · 14 revisions

PHP Text Analysis

Want to process text using PHP? Well, you picked the right library for the task.

PHP Text Analysis provides a variety of tools for:

  • Analysis
  • Collections - data structures for managing documents during analysis
  • Collocation - helps you find terms that co-occur more often than would be expected by chance.
  • Console - a command line interface for performing base indexing and text mining analysis with PHP
  • Entity Extraction - helps you find entities such as people, places and dates
  • Downloaders - Downloads 3rd party data files from the web
  • Filters - A set of tools for normalizing the terms and tokens before data analysis begins
  • Phonetics - Phonetic algorithms for fixing data. Helpful when you need to perform record linkage tasks with PHP
  • Ngrams - PHP code for generating NGrams from a given set of tokens or terms
  • Stemmers - Several stemmers are available for normalizing the data sets prior to further analysis
  • Tokenizers - A common set of tokenizers is available for breaking up the corpus into tokens or sentences
  • Utilities - helper utilities for manipulating text data
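
To make the Tokenizers and Ngrams items above more concrete, here is a minimal sketch in plain PHP of what generating n-grams from a list of tokens involves. It does not use the library's classes; the function name `ngrams` and its parameters are hypothetical and chosen only for illustration.

```php
<?php
// Illustrative only: a plain-PHP sketch, not the library's API.
// ngrams() is a hypothetical helper that joins each run of $n
// consecutive tokens into one string.
function ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    if ($n < 1) {
        return [];
    }

    $result = [];
    $count = count($tokens);

    // Slide a window of size $n across the token list.
    for ($i = 0; $i + $n <= $count; $i++) {
        $result[] = implode($separator, array_slice($tokens, $i, $n));
    }

    return $result;
}

// Split on runs of whitespace, dropping empty pieces.
$tokens = preg_split('/\s+/', 'the quick brown fox', -1, PREG_SPLIT_NO_EMPTY);

print_r(ngrams($tokens, 2));
// Bigrams: "the quick", "quick brown", "brown fox"
```

Passing `3` as the second argument would produce trigrams in the same way.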

Beyond Analysis

PHP Text Analysis is a lightweight Information Retrieval and NLP library built using PHP. In addition to analysis tools, it can be used to create a search engine that supports simple and advanced query types. This is especially useful when your data models contain raw text that must be searchable.

  • Adapters
  • Engines
  • Indexes
  • Query

Suggestions on Performance

Performance tuning is often challenging. Here are a couple of suggestions for improving the speed of your code.

  • Use the whitespace tokenizer; it is faster than the general tokenizer
  • Apply the filter classes to the whole text/corpus and avoid applyTranformation method calls within the TokenDoc class. Those calls are useful when each token must be validated or transformed individually. Many of the filter classes have been rewritten to better support the whole-text approach
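
The idea behind both suggestions can be sketched in plain PHP: normalize the whole text in a single call, then split on whitespace, rather than tokenizing first and transforming every token separately. The example below deliberately avoids the library's classes; `normalizePerToken` and `normalizeWholeText` are hypothetical helper names, and lowercasing stands in for whatever filter you would actually apply.

```php
<?php
// Illustrative only: plain PHP, not the library's API.

// Per-token approach: tokenize, then call the transformation once per token.
function normalizePerToken(string $text): array
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('strtolower', $tokens);
}

// Whole-text approach: transform the text once, then tokenize on whitespace.
function normalizeWholeText(string $text): array
{
    return preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$text = "The Quick  Brown Fox";

// Both approaches yield the same tokens; the second makes a single
// transformation call regardless of how many tokens the text contains.
var_dump(normalizePerToken($text) === normalizeWholeText($text)); // bool(true)
```

The two approaches are interchangeable only when the transformation does not depend on token boundaries. A filter that must inspect or validate individual tokens still needs the per-token path.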