From 9ce236299edd4ab58d7cd2b473fe3427db55a019 Mon Sep 17 00:00:00 2001
From: Oliver <480930+rivo@users.noreply.github.com>
Date: Wed, 27 Jul 2022 00:01:06 +0100
Subject: [PATCH] Updated README to reflect all recent changes.

---
 README.md | 108 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 92 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index b2cc6d0..89fc21a 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,12 @@
 [![Go Reference](https://pkg.go.dev/badge/github.com/rivo/uniseg.svg)](https://pkg.go.dev/github.com/rivo/uniseg)
 [![Go Report](https://img.shields.io/badge/go%20report-A%2B-brightgreen.svg)](https://goreportcard.com/report/github.com/rivo/uniseg)
 
-This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](http://unicode.org/reports/tr29/) (Unicode version 14.0.0).
-
-At this point, only the determination of grapheme cluster boundaries is implemented.
+This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/) and Unicode Line Breaking according to [Unicode Standard Annex #14](https://unicode.org/reports/tr14/) (Unicode version 14.0.0).
 
 ## Background
 
+### Grapheme Clusters
+
 In Go, [strings are read-only slices of bytes](https://blog.golang.org/strings). They can be turned into Unicode code points using the `for` loop or by casting: `[]rune(str)`. However, multiple code points may be combined into one user-perceived character or what the Unicode specification calls "grapheme cluster". Here are some examples:
 
 |String|Bytes (UTF-8)|Code points (runes)|Grapheme clusters|
@@ -17,7 +17,19 @@ In Go, [strings are read-only slices of bytes](https://blog.golang.org/strings).
 |🏳️‍🌈|14 bytes: `f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88`|4 code points: `1f3f3 fe0f 200d 1f308`|1 cluster: `[1f3f3 fe0f 200d 1f308]`|
 |🇩🇪|8 bytes: `f0 9f 87 a9 f0 9f 87 aa`|2 code points: `1f1e9 1f1ea`|1 cluster: `[1f1e9 1f1ea]`|
 
-This package provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
+This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
+
+### Word Boundaries
+
+Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.
+
+### Sentence Boundaries
+
+Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.
+
+### Line Breaking
+
+Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).
 
 ## Installation
 
@@ -25,26 +37,90 @@ This package provides a tool to iterate over these grapheme clusters. This may b
 go get github.com/rivo/uniseg
 ```
 
-## Basic Example
+## Examples
+
+### Counting Characters in a String
 
 ```go
-package uniseg
+n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
+fmt.Println(n)
+// 2
+```
 
-import (
-	"fmt"
+### Using the [`Graphemes`](https://pkg.go.dev/github.com/rivo/uniseg#Graphemes) Class
 
-	"github.com/rivo/uniseg"
-)
+This is the most convenient method of iterating over grapheme clusters:
+
+```go
+gr := uniseg.NewGraphemes("👍🏼!")
+for gr.Next() {
+	fmt.Printf("%x ", gr.Runes())
+}
+// [1f44d 1f3fc] [21]
+```
+
+### Using the [`Step`](https://pkg.go.dev/github.com/rivo/uniseg#Step) or [`StepString`](https://pkg.go.dev/github.com/rivo/uniseg#StepString) Function
 
-func main() {
-	gr := uniseg.NewGraphemes("👍🏼!")
-	for gr.Next() {
-		fmt.Printf("%x ", gr.Runes())
+This is orders of magnitude faster than the `Graphemes` class, but it requires the handling of states and boundaries:
+
+```go
+str := "🇩🇪🏳️‍🌈"
+state := -1
+var c string
+for len(str) > 0 {
+	c, str, _, state = uniseg.StepString(str, state)
+	fmt.Printf("%x ", []rune(c))
+}
+// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]
+```
+
+### Advanced Examples
+
+Breaking into grapheme clusters and evaluating line breaks:
+
+```go
+str := "First line.\nSecond line."
+state := -1
+var (
+	c          string
+	boundaries int
+)
+for len(str) > 0 {
+	c, str, boundaries, state = uniseg.StepString(str, state)
+	fmt.Print(c)
+	if boundaries&uniseg.MaskLine == uniseg.LineCanBreak {
+		fmt.Print("|")
+	} else if boundaries&uniseg.MaskLine == uniseg.LineMustBreak {
+		fmt.Print("‖")
 	}
-	// Output: [1f44d 1f3fc] [21]
 }
+// First |line.
+// ‖Second |line.‖
+```
+
+If you're only interested in word segmentation, use [`FirstWord`](https://pkg.go.dev/github.com/rivo/uniseg#FirstWord) or [`FirstWordInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstWordInString):
+
+```go
+str := "Hello, world!"
+state := -1
+var c string
+for len(str) > 0 {
+	c, str, state = uniseg.FirstWordInString(str, state)
+	fmt.Printf("(%s)\n", c)
+}
+// (Hello)
+// (,)
+// ( )
+// (world)
+// (!)
 ```
 
+Similarly, use
+
+- [`FirstGraphemeCluster`](https://pkg.go.dev/github.com/rivo/uniseg#FirstGraphemeCluster) or [`FirstGraphemeClusterInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstGraphemeClusterInString) for grapheme cluster determination only,
+- [`FirstSentence`](https://pkg.go.dev/github.com/rivo/uniseg#FirstSentence) or [`FirstSentenceInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstSentenceInString) for sentence segmentation only, and
+- [`FirstLineSegment`](https://pkg.go.dev/github.com/rivo/uniseg#FirstLineSegment) or [`FirstLineSegmentInString`](https://pkg.go.dev/github.com/rivo/uniseg#FirstLineSegmentInString) for line breaking / word wrapping (although using [`Step`](https://pkg.go.dev/github.com/rivo/uniseg#Step) or [`StepString`](https://pkg.go.dev/github.com/rivo/uniseg#StepString) is preferred as it will observe grapheme cluster boundaries).
+
 ## Documentation
 
 Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.
@@ -55,7 +131,7 @@ This package does not depend on any packages outside the standard library.
 
 ## Sponsor this Project
 
-[Become a Sponsor on GitHub](https://github.com/sponsors/rivo?metadata_source=uniseg_readme) to further this project! Plans for future releases include the implementation of the rest of UAX29 (word boundaries, sentence boundaries) as well as UAX14 (line breaking).
+[Become a Sponsor on GitHub](https://github.com/sponsors/rivo?metadata_source=uniseg_readme) to support this project!
 
 ## Your Feedback