forked from Anonyfox/meteor-tags
Added Language dependent Configuration (0.0.8)
+ Added a filter attribute to deactivate the high pass filter (Default: filter activated)
+ Added language dependent configuration
+ Filter possible tags with Yaki itself
+ Added README.md content
+ Added an Inspector
+ Fixed some Bugs
Francesco Möller committed Mar 19, 2015
1 parent 284fd0b
commit e295990
Showing 6 changed files with 175 additions and 53 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,56 @@ | ||
# Yaki | ||
Yaki can capture relevant tags from any bunch of text. Works on the client and on the server. | ||
|
||
*Beware*: This is an early alpha test release and NOT suitable for production. | ||
Features of Yaki: | |
- Uses term normalizations to construct a list of terms | |
- Uses stopword lists and a language-dependent alphabet as dictionaries | |
- Calculates tag relevance via statistical methods such as entropy and the standard normal distribution | |
- Uses [n-grams](http://en.wikipedia.org/wiki/N-gram) for stemming and similarity detection | |
- Can find word combinations (in case of multiple occurrences) | |
- Currently supported languages: English and German | |
- Uses language-dependent feature configurations to improve tagging quality | |
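
The n-gram similarity named in the feature list can be illustrated with a small sketch. It is written in plain JavaScript and is not taken from Yaki's source: the helper names `kGrams` and `similarity` are hypothetical, and the Dice coefficient is an assumed measure, since this commit does not show how Yaki actually compares words.

```javascript
// Hypothetical helper, not part of Yaki's API:
// split a word into overlapping character k-grams.
function kGrams(word, k) {
  const w = word.toLowerCase();
  if (w.length <= k) return [w]; // short words form a single gram
  const grams = [];
  for (let i = 0; i + k <= w.length; i++) {
    grams.push(w.slice(i, i + k));
  }
  return grams;
}

// Assumed measure: Dice coefficient over the two k-gram sets.
// Returns a value between 0 (no shared grams) and 1 (identical sets).
function similarity(a, b, k = 4) {
  const ga = new Set(kGrams(a, k));
  const gb = new Set(kGrams(b, k));
  let shared = 0;
  for (const g of ga) {
    if (gb.has(g)) shared++;
  }
  return (2 * shared) / (ga.size + gb.size);
}

// Two words would count as variants of one stem when their
// similarity reaches a per-language threshold.
console.log(similarity("demonstrate", "demonstrates") >= 0.6);
```

With a measure like this, a lower threshold merges more word variations into one stem, and a higher threshold keeps them apart.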
|
||
#### License | ||
Text Retrieval classification: *morphology* and parts of *syntax* (without vocabulary) | ||
|
||
***Beware***: This is an early alpha test release and NOT suitable for production. | ||
|
||
### Installation | ||
|
||
```shell | ||
$ meteor add nefiltari:yaki | ||
``` | ||
|
||
### How-To | ||
For simple tagging (most features are activated by default) use the following syntax: | |
```coffee | ||
console.log Yaki("This is a sample text to demonstrate the tagging engine `Yaki`.").extract() | ||
# -> [ 'demonstrate', 'yaki', 'engine', 'tagging' ] | ||
``` | ||
|
||
If you know the language, pass it in the options object given as the second parameter (use the top-level-domain abbreviation, e.g. `de`). | |
The default language is English. | |
Pass already known tags in the same object to give those words a stronger weight. | |
```coffee | ||
text = "Dieser Beispieltext demonstriert das Tagging in deutscher Sprache." | ||
console.log Yaki(text, {language: 'de', tags: ['yaki']}).extract() | ||
# -> [ 'yaki', 'demonstriert', 'beispieltext', 'deutscher', 'sprache' ] | ||
``` | ||
|
||
You can normalize and `clean()` an array of words, fragments or tags with Yaki. | ||
```coffee | ||
fragments = ['(legend)', 'advanced.', 'MultiColor', '-> HTTP <-'] | ||
console.log Yaki(fragments).clean() | ||
# -> [ 'legend', 'advanced', 'multicolor', 'http' ] | ||
``` | ||
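
As a rough illustration of what such a normalization involves, the sketch below reproduces the sample output above in plain JavaScript. `cleanFragments` is a hypothetical stand-in, not Yaki's `clean()`; the real method may apply further language-dependent rules that this commit does not show.

```javascript
// Hypothetical stand-in for the normalization step; Yaki's real clean() may differ.
// Lowercases each fragment, strips everything that is not a letter or digit,
// and drops fragments that end up empty.
function cleanFragments(fragments) {
  return fragments
    .map((f) => f.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, ""))
    .filter((f) => f.length > 0);
}

const fragments = ["(legend)", "advanced.", "MultiColor", "-> HTTP <-"];
console.log(cleanFragments(fragments));
```

Note that this sketch also removes inner whitespace, so a fragment containing several words would be joined into one.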
|
||
### ToDo | ||
|
||
- [ ] Instead of transferring the heavy stopword-lists to the client, proxy client requests through | ||
a server method | ||
- [x] Improve the algorithm to find multi-word phrases instead of just single words | ||
- [x] Refactor the source code to improve readability and performance even further | ||
- [ ] Add more test cases to ensure quality and enable better collaboration | ||
|
||
### License | ||
|
||
This code is licensed under the LGPL 3.0. Do whatever you want with this code, but I'd like to get improvements and bugfixes back. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Configuration | ||
This file defines, for each language, the specific parameters used when calculating significance. | |
|
||
Configuration = @Configuration = | ||
## English | ||
English has shorter words and fewer capitalized ones. Word entropy also gets a stronger | |
weight because most word variations are short and simple. | |
|
||
en: | ||
# Stemming with k-grams (Yaki.stem) | |
k: 4 | ||
similarity: 0.6 | ||
# Calculation (Yaki.calculate) | ||
entropieStrength: 3 | ||
frequencyCoefficient: 1.0 | ||
capitalizeBonus: 10 | ||
akkronymBonus: 15 | ||
positionCoefficient: 1.0 | ||
tagBonus: 20 | ||
# Word Combination (Yaki.combine) | ||
combinationOccurences: 2 | ||
# Analyse (Yaki.analyse) | ||
minQuality: 3 | ||
## German | ||
German has very long words and more capitalized ones. It needs a | |
lower similarity threshold because its morphology produces more word variations. | |
|
||
de: | ||
# Stemming with k-grams (Yaki.stem) | |
k: 4 | ||
similarity: 0.4 | ||
# Calculation (Yaki.calculate) | ||
entropieStrength: 2 | ||
frequencyCoefficient: 1.0 | ||
capitalizeBonus: 2 | ||
akkronymBonus: 15 | ||
positionCoefficient: 1.0 | ||
tagBonus: 20 | ||
# Word Combination (Yaki.combine) | ||
combinationOccurences: 2 | ||
# Analyse (Yaki.analyse) | ||
minQuality: 5 |
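
A per-language table like this is typically read through a lookup with a fallback. The sketch below shows one way to do that in plain JavaScript; it copies only a subset of the values above, and `configFor` is a hypothetical helper, since the code that consumes `Configuration` is not part of this excerpt. Falling back to English is an assumption based on the README naming English as the default language.

```javascript
// Subset of the values from the configuration in this commit.
const Configuration = {
  en: { k: 4, similarity: 0.6, capitalizeBonus: 10, minQuality: 3 },
  de: { k: 4, similarity: 0.4, capitalizeBonus: 2, minQuality: 5 },
};

// Hypothetical helper: return the settings for a language code,
// falling back to English when the language is not configured.
function configFor(language) {
  return Object.prototype.hasOwnProperty.call(Configuration, language)
    ? Configuration[language]
    : Configuration.en;
}

console.log(configFor("de").similarity); // German uses the lower threshold
console.log(configFor("fr") === Configuration.en); // unknown codes fall back to English
```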