English sentence segmentation rules based on SRX standard.

apohllo/srx-english

srx-english

DESCRIPTION

‘srx-english’ is a Ruby library containing English sentence and word segmentation rules. The sentence segmentation rules are based on the rules defined by Marcin Miłkowski: morfologik.blogspot.com/2009/11/talking-about-srx-in-lt-during-ltc.html
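
For readers unfamiliar with SRX, the following is a minimal sketch of how SRX-style rules decide where a sentence ends. It is not the code generated by ‘srx2ruby’ and does not use the gem's rule set: the `Rule` struct, the `RULES` list and the `split_sentences` helper are hypothetical names introduced only for illustration. The sketch assumes the usual SRX model, in which each rule pairs a "before break" and an "after break" pattern, and the first rule that matches at a position decides whether a break is made there.

```ruby
# Illustrative only: a hand-written imitation of SRX-style segmentation.
# Rule, RULES and split_sentences are hypothetical; the patterns below are
# assumptions, not the rules shipped with srx-english.
Rule = Struct.new(:breaks, :before, :after)

RULES = [
  # Exception rule (no break): do not split after these abbreviations.
  Rule.new(false, /\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.\z/, /\A\s/),
  # Break rule: split after sentence-final punctuation followed by whitespace.
  Rule.new(true, /[.?!]+\z/, /\A\s+/)
]

def split_sentences(text, rules = RULES)
  sentences = []
  start = 0
  # Test every position between two characters. This is quadratic in the
  # text length, which is acceptable for a sketch but not for real use.
  (1...text.length).each do |pos|
    before = text[0...pos]
    after = text[pos..-1]
    # The first rule whose two patterns both match decides the outcome.
    rule = rules.find { |r| before =~ r.before && after =~ r.after }
    next unless rule && rule.breaks
    sentences << text[start...pos].strip
    start = pos
  end
  rest = text[start..-1].strip
  sentences << rest unless rest.empty?
  sentences
end

puts split_sentences("This is e.g. Mr. Smith, who talks slowly... And this is another sentence.")
# This is e.g. Mr. Smith, who talks slowly...
# And this is another sentence.
```

Placing the exception rule before the break rule is what keeps "e.g." and "Mr." from ending a sentence; swapping the two entries in `RULES` would split after every period that is followed by a space.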

FEATURES/PROBLEMS

  • This library is generated by ‘srx2ruby’, which has some limitations, so it might not be 100% compliant with the SRX standard.

INSTALL

Standard rubygems installation:

$ gem install srx-english

BASIC USAGE

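The library defines the SRX::English::SentenceSplitter class, which iterates over the sentences found in a text, and the SRX::English::WordSplitter class, which iterates over the tokens of a sentence: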

require 'srx/english/sentence_splitter'

text = <<-END
  This is e.g. Mr. Smith, who talks slowly... And this is another sentence.
END

splitter = SRX::English::SentenceSplitter.new(text)
splitter.each do |sentence|
  # Remove line breaks so that each sentence is printed on a single line.
  puts sentence.gsub(/\n|\r/, "")
end
# This is e.g. Mr. Smith, who talks slowly...
# And this is another sentence.

require 'srx/english/word_splitter'

sentence = 'My home is my castle.'
splitter = SRX::English::WordSplitter.new(sentence)
# Each token is yielded together with its type and its start and end offsets.
splitter.each do |word, type, start_offset, end_offset|
  puts "'#{word}' #{type}"
end
# 'My' word
# ' ' other
# 'home' word
# ' ' other
# 'is' word
# ' ' other
# 'my' word
# ' ' other
# 'castle' word
# '.' punct

LICENSE

Copyright © 2011 Aleksander Pohl, Marcin Miłkowski, Jarosław Lipski

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

FEEDBACK
