Home

PHP Text Analysis

Want to process text using PHP? Well, you picked the right library for the task.

PHP Text Analysis provides a variety of tools for:

Analysis
- Date Analysis - extracts dates from a given corpus
- Frequency Distribution - provides the basic tools for simple analysis and serves as a base for many other algorithms
- Rapid Automatic Keyword Extraction (RAKE) - uses the RAKE algorithm to extract keywords automatically and quickly
Collections - data structures for managing documents during analysis
Collocation - helps you find terms that co-occur more often than would be expected by chance.
Console - a command line interface for performing base indexing and text mining analysis with PHP
Entity Extraction - helps you find entities such as people, places and dates
Downloaders - download third-party data files from the web
Filters - A set of tools for normalizing the terms and tokens before data analysis begins
Phonetics - Phonetic algorithms for fixing data. Helpful when you need to perform record linkage tasks with PHP
Ngrams - PHP code for generating NGrams from a given set of tokens or terms
Stemmers - Several stemmers are available for normalizing the data sets prior to further analysis
Tokenizers - A common set of tokenizers is available for breaking up the corpus into tokens or sentences
Utilities - helper utilities for manipulating text data
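
To make two of the tools above more concrete, here is a minimal sketch of a frequency distribution and n-gram generation written with PHP built-ins only. It does not use the library's own classes, and the function names `sketchFrequencies` and `sketchNgrams` are hypothetical helpers chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Count how often each token occurs, most frequent first.
 * Hypothetical helper, not part of PHP Text Analysis.
 *
 * @param string[] $tokens
 * @return array<string,int>
 */
function sketchFrequencies(array $tokens): array
{
    $counts = array_count_values($tokens);
    arsort($counts); // sort by count, descending, keeping the token keys
    return $counts;
}

/**
 * Build n-grams by sliding a window of $n tokens over the list.
 * Hypothetical helper, not part of PHP Text Analysis.
 *
 * @param string[] $tokens
 * @return string[]
 */
function sketchNgrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    $grams = [];
    $last = count($tokens) - $n; // index of the last full window
    for ($i = 0; $i <= $last; $i++) {
        $grams[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $grams;
}

$tokens  = ['the', 'quick', 'fox', 'saw', 'the', 'lazy', 'fox'];
$freq    = sketchFrequencies($tokens);
$bigrams = sketchNgrams($tokens, 2);
```

With the seven tokens above, `$freq` maps `the` and `fox` to 2 and every other token to 1, and `$bigrams` holds six two-word strings starting with `the quick`.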

Beyond Analysis

PHP Text Analysis is a lightweight Information Retrieval and NLP library built using PHP. In addition to the analysis tools, PHP Text Analysis can be used to create a search engine that supports simple and advanced query types. This is especially useful when your data models have raw text that must be searchable.

Adapters
Engines
Indexes
Query
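
The idea behind a text search engine can be sketched with a tiny inverted index. This is a plain-PHP illustration under simplifying assumptions (lowercasing, whitespace splitting, AND-only queries); `buildInvertedIndex` and `searchAll` are hypothetical helpers and are not the library's Adapters, Engines, Indexes, or Query classes.

```php
<?php
declare(strict_types=1);

/**
 * Build an inverted index: token => list of ids of documents containing it.
 * Hypothetical helper, not part of PHP Text Analysis.
 *
 * @param array<int,string> $documents id => raw text
 * @return array<string,int[]>
 */
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $id => $text) {
        $tokens = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
        // array_unique so each document is listed once per token
        foreach (array_unique($tokens) as $token) {
            $index[$token][] = $id;
        }
    }
    return $index;
}

/**
 * Return the ids of documents that contain every term in the query.
 * Hypothetical helper, not part of PHP Text Analysis.
 */
function searchAll(array $index, string $query): array
{
    $terms = preg_split('/\s+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    if ($terms === []) {
        return [];
    }
    $result = null;
    foreach ($terms as $term) {
        $postings = $index[$term] ?? [];
        // Intersect posting lists so only documents matching all terms remain
        $result = $result === null ? $postings : array_intersect($result, $postings);
    }
    return array_values($result);
}

$documents = [
    1 => 'PHP text analysis tools',
    2 => 'Search engine built with PHP',
    3 => 'Text search and indexing',
];
$index      = buildInvertedIndex($documents);
$hits       = searchAll($index, 'php text');
$searchHits = searchAll($index, 'search');
```

Here `$hits` contains only document 1, the one document with both terms, while `$searchHits` contains documents 2 and 3.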

Suggestions on Performance

Performance is always challenging. Here are a couple of suggestions for improving the speed of your code.

Use the whitespace tokenizer; it performs better than the general tokenizer
Apply the filter classes to the whole text/corpus and avoid applyTransformation method calls within the TokenDoc class. Those calls are useful when each token must be validated or transformed individually. Many of the filter classes have been rewritten to better support the whole-text approach
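
The difference between the two approaches can be sketched in plain PHP. The functions below are hypothetical stand-ins rather than the library's filter or tokenizer classes, and they assume ASCII text for simplicity. The first normalizes the whole text in a fixed number of calls before splitting on whitespace; the second splits first and then runs the same normalization once per token.

```php
<?php
declare(strict_types=1);

/**
 * Normalize the whole text once, then split on whitespace.
 * Hypothetical helper, not part of PHP Text Analysis.
 */
function filterThenTokenize(string $text): array
{
    $text = strtolower($text);
    // Replace anything that is not a letter, digit, or whitespace with a space
    $text = preg_replace('/[^a-z0-9\s]+/', ' ', $text) ?? '';
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/**
 * Split first, then normalize token by token (one set of calls per token).
 * Hypothetical helper, not part of PHP Text Analysis.
 */
function tokenizeThenFilter(string $text): array
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $out = [];
    foreach ($tokens as $token) {
        $token = preg_replace('/[^a-z0-9]+/', '', strtolower($token)) ?? '';
        if ($token !== '') {
            $out[] = $token; // drop tokens that were punctuation only
        }
    }
    return $out;
}

$text      = 'The quick, brown fox. The END!';
$wholeText = filterThenTokenize($text);
$perToken  = tokenizeThenFilter($text);
```

For this input both functions return the same six lowercase tokens. Results can differ when punctuation appears inside a word, since the whole-text version replaces it with a space while the per-token version deletes it.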
