Home
yooper edited this page Aug 21, 2016 · 14 revisions
Want to process text using PHP? Well, you picked the right library for the task.
PHP Text Analysis provides a variety of tools for:
- Analysis
  - Date Analysis - use to extract dates from a given corpus
  - Frequency Distribution - provides the basic tools for simple analysis and is used as a base for many other algorithms
  - Rapid Automatic Keyword Extraction (RAKE) - use the RAKE algorithm to extract keywords quickly and automatically
- Collections - data structures for managing documents during analysis
- Collocation - helps you find terms that co-occur more often than would be expected by chance.
- Console - a command line interface for performing base indexing and text mining analysis with PHP
- Entity Extraction - helps you find entities such as people, places and dates
- Downloaders - download third-party data files from the web
- Filters - a set of tools for normalizing terms and tokens before data analysis begins
- Phonetics - phonetic algorithms for fixing data; helpful when you need to perform record linkage tasks with PHP
- Ngrams - PHP code for generating n-grams from a given set of tokens or terms
- Stemmers - several stemmers are available for normalizing data sets prior to further analysis
- Tokenizers - a common set of tokenizers is available for breaking up the corpus into tokens or sentences
- Utilities - helper utilities for manipulating text data
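To make a few of the ideas in this list concrete, here is a minimal plain-PHP sketch of whitespace tokenization, a frequency distribution, and n-gram generation. It deliberately does not call the PHP Text Analysis API: the `sketch_*` function names are hypothetical and exist only for illustration, so treat this as a picture of what those tools compute, not of how the library implements them.

```php
<?php
// Illustration only: plain PHP, not the PHP Text Analysis API.
// All sketch_* names are hypothetical helpers invented for this example.

/** Split text into lowercase tokens on runs of whitespace. */
function sketch_tokenize(string $text): array
{
    return preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
}

/** Count how often each token occurs, most frequent first. */
function sketch_freq_dist(array $tokens): array
{
    $counts = array_count_values($tokens);
    arsort($counts); // sort by count, descending, keeping token keys
    return $counts;
}

/** Build n-grams by joining each run of $n adjacent tokens. */
function sketch_ngrams(array $tokens, int $n = 2, string $sep = ' '): array
{
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode($sep, array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$tokens  = sketch_tokenize('the quick fox and the lazy dog');
$freq    = sketch_freq_dist($tokens);   // token => count
$bigrams = sketch_ngrams($tokens, 2);   // adjacent token pairs
```

The frequency distribution is just a token-to-count map, which is why it can serve as a base for other algorithms such as keyword extraction or collocation scoring.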
PHP Text Analysis is a lightweight Information Retrieval and NLP library built using PHP. In addition to the analysis tools, PHP Text Analysis can be used to create a search engine that supports simple and advanced query types. This is especially useful when your data models have raw text that must be searchable.
- Adapters
- Engines
- Indexes
- Query
Performance is always challenging. Here are a couple of suggestions for improving the speed of your code.
- Use the whitespace tokenizer; it performs better than the general tokenizer
- Use the filter classes on the whole text/corpus, and avoid applyTransformation method calls within the TokenDoc class. Those calls are useful when each token must be validated or transformed individually. Many of the filter classes have been rewritten to better support the whole-text approach
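
The second suggestion can be sketched in plain PHP. Here `strtolower` stands in for a filter class and `explode` for a whitespace tokenizer; both are assumptions made for illustration and are not the library's own classes. The point is only the difference between one transformation call on the whole text and one call per token.

```php
<?php
// Illustration only: strtolower() stands in for a filter class and
// explode() for a whitespace tokenizer. Neither is the library's API.

$text = 'The Quick FOX and the LAZY dog';

// Per-token approach: tokenize first, then transform every token in turn.
// This costs one transformation call per token.
$perToken = array_map('strtolower', explode(' ', $text));

// Whole-text approach: transform the text once, then tokenize.
// This costs a single transformation call regardless of token count.
$wholeText = explode(' ', strtolower($text));

// For a filter such as lowercasing, both approaches yield the same tokens.
$same = ($perToken === $wholeText);
```

This equivalence holds for filters that do not depend on token boundaries. A filter that must inspect or validate each token on its own still needs the per-token path.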