Update StringNormalizer.java
Unicode line separator (U+2028) and paragraph separator (U+2029) are now treated as equivalent to a standard white space during String normalization.
rccarrasco committed Dec 2, 2013
1 parent 71982bd commit cddb2ae
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions src/main/java/eu/digitisation/io/StringNormalizer.java
@@ -31,13 +31,13 @@ public class StringNormalizer {
             = java.text.Normalizer.Form.NFC;
 
     /**
-     * Reduce whitespace.
+     * Reduce whitespace (including line and paragraph separators)
      *
      * @param s a string.
      * @return The string with simple spaces between words.
      */
     public static String reduceWS(String s) {
-        return s.replaceAll("\\p{Space}+", " ").trim();
+        return s.replaceAll("(\\p{Space}|\\p{general_category=Zl}|\\p{general_category=Zp})+", " ").trim();
     }
 
     /**
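
The effect of the change can be checked with a small standalone sketch. The class name `ReduceWsSketch` and the method names `reduceOld` and `reduceNew` are hypothetical and not part of the repository; only the two regular expressions are copied from the diff. It assumes the default behaviour of `java.util.regex.Pattern` (no `UNICODE_CHARACTER_CLASS` flag), under which `\p{Space}` covers only the ASCII whitespace characters.

```java
// Hypothetical demo class; only the two patterns come from the commit.
class ReduceWsSketch {

    // Built from code points so the separators are visible in the source.
    static final String LINE_SEP = String.valueOf((char) 0x2028);
    static final String PARA_SEP = String.valueOf((char) 0x2029);

    // Pattern before the commit: POSIX-style \p{Space} only.
    static String reduceOld(String s) {
        return s.replaceAll("\\p{Space}+", " ").trim();
    }

    // Pattern after the commit: also matches general categories Zl and Zp.
    static String reduceNew(String s) {
        return s.replaceAll(
                "(\\p{Space}|\\p{general_category=Zl}|\\p{general_category=Zp})+",
                " ").trim();
    }

    public static void main(String[] args) {
        String input = "one" + LINE_SEP + "two" + PARA_SEP + " three";

        // The old pattern leaves both separators in place.
        System.out.println(reduceOld(input).contains(LINE_SEP)); // true

        // The new pattern collapses each run into a single space.
        System.out.println(reduceNew(input)); // one two three
    }
}
```

A run that mixes separators with ordinary whitespace, such as U+2029 followed by a space, is replaced by a single space because the alternation sits inside one `+` group.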