Further README improvements
janlelis committed Oct 20, 2024
1 parent 1c94ad4 commit 37d08f2
Showing 1 changed file with 31 additions and 35 deletions.
66 changes: 31 additions & 35 deletions README.md
@@ -17,9 +17,7 @@ CLDR version (used for sub-region flags): **45** (April 2024)
gem "unicode-emoji"
```

## Usage

### Regex
## Usage – Regex Matching

The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.

@@ -41,11 +39,9 @@ string = "String which contains all kinds of emoji:
string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]
```

#### Regex: Which Type of Emoji?

There are multiple levels of Emoji detection:

#### Main Regexes
### Main Regexes

Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
@@ -54,7 +50,7 @@ Regex | Description | Example Matches | Example Non-Matc
`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵` | `😴︎`, `▶`, `🏻`, `1`, `1⃣`
`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` | `1⃣`

##### Include Text Emoji
#### Include Text Emoji

By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:

@@ -64,7 +60,7 @@ Regex | Description | Example Matches | Example Non-Matc
`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `😴︎`, `▶`, `1⃣` | `🏻`, `🇵🇵`, `1`
`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵`, `😴︎`, `▶`, `1⃣` | `🏻`, `1`

##### Singleton Regexes
#### Singleton Regexes

Matches only simple one-codepoint (+ optional variation selector) Emoji:

@@ -73,7 +69,7 @@ Regex | Description | Example Matches | Example Non-Matc
`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `1`
`Unicode::Emoji::REGEX_TEXT` | Matches only textual singleton Emoji (except for singleton components, like digits) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `1`

##### Comparison
### Comparison

1) Fully-qualified RGI Emoji ZWJ sequence
2) Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
@@ -91,52 +87,52 @@ Regex | Description | Example Matches | Example Non-Matc
Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap
-|-|-|-|-|-|-|-|-|-|-|-|-
REGEX | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
REGEX_INCLUDE_TEXT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
REGEX_VALID | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌
REGEX_VALID_INCLUDE_TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
REGEX_WELL_FORMED | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌
REGEX_WELL_FORMED_INCLUDE_TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
REGEX_POSSIBLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
REGEX_BASIC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌
REGEX_TEXT | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅
REGEX INCLUDE TEXT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
REGEX VALID | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌
REGEX VALID INCLUDE TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
REGEX WELL FORMED | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌
REGEX WELL FORMED INCLUDE TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
REGEX POSSIBLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
REGEX BASIC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌
REGEX TEXT | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅

¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)

See spec files for detailed examples about which regex matches which kind of Emoji.
See [spec files](/spec) for detailed examples about which regex matches which kind of Emoji.

#### Picking the Right Emoji Regex
### Picking the Right Emoji Regex

- Usually you just want `REGEX` (RGI set)
- If you want broader matching (any ZWJ sequences, more sub-region flags), choose `REGEX_VALID`
- Even broader is `REGEX_WELL_FORMED`, which will also match any region flag and any tag sequence
- Use `_INCLUDE_TEXT` suffix with any of the above, if you want to also match basic textual Emoji
- And finally there is also the option to use `REGEX_POSSIBLE` , which is a simplified test for possible Emoji that might contain false positives. However, the regex is less complex and [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check.
- If you need to match any region flag and any tag sequence, choose `REGEX_WELL_FORMED`
- Use the `_INCLUDE_TEXT` suffix with any of the above, if you want to also match basic textual Emoji
- And finally, there is also the option to use `REGEX_POSSIBLE`, which is a simplified test for possible Emoji. It might produce false positives; however, the regex is less complex and is [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check.

#### Examples
### Examples

Desc | Emoji | Escaped | `REGEX` (RGI) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed) | `REGEX_POSSIBLE`
-----|-------|---------|---------------|-----------------------|-----------------------------------|-----------------
RGI ZWJ Sequence | "🤾🏽‍♀️" | `\u{1F93E 1F3FD 200D 2640 FE0F}` | Yes | Yes | Yes | Yes
Valid ZWJ Sequence | "🤠‍🤢" | `\u{1F920 200D 1F922}` | No | Yes | Yes | Yes
Known Region | "🇵🇹" | `\u{1F1F5 1F1F9}` | Yes | Yes | Yes | Yes
Unknown Region | "🇵🇵" | `\u{1F1F5 1F1F5}` | No | No | Yes | Yes
RGI Tag Sequence | "🏴󠁧󠁢󠁳󠁣󠁴󠁿" | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | Yes | Yes | Yes | Yes
Valid Tag Sequence | "🏴󠁧󠁢󠁡󠁧󠁢󠁿" | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | No | Yes | Yes | Yes
Well-formed Tag Sequence | "😴󠁧󠁢󠁡󠁡󠁡󠁿" | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | No | No | Yes | Yes
RGI ZWJ Sequence | 🤾🏽‍♀️ | `\u{1F93E 1F3FD 200D 2640 FE0F}` | ✅ | ✅ | ✅ | ✅
Valid ZWJ Sequence | 🤠‍🤢 | `\u{1F920 200D 1F922}` | ❌ | ✅ | ✅ | ✅
Known Region | 🇵🇹 | `\u{1F1F5 1F1F9}` | ✅ | ✅ | ✅ | ✅
Unknown Region | 🇵🇵 | `\u{1F1F5 1F1F5}` | ❌ | ❌ | ✅ | ✅
RGI Tag Sequence | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | ✅ | ✅ | ✅ | ✅
Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | ❌ | ✅ | ✅ | ✅
Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | ❌ | ❌ | ✅ | ✅
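
The "Escaped" column can be reproduced in plain Ruby, which helps when checking what an invisible sequence actually contains. The following is only a sketch: the helper names `escaped` and `regional_indicator_pair?` are made up for this illustration and are not part of the gem. It also shows that the "Known Region" and "Unknown Region" rows are structurally identical (two Regional Indicator codepoints), so telling them apart requires region data rather than shape alone.

```ruby
# Hypothetical helper: format a string like the "Escaped" column above.
def escaped(string)
  "\\u{" + string.codepoints.map { |cp| cp.to_s(16).upcase }.join(" ") + "}"
end

# Hypothetical helper: Regional Indicator Symbols occupy U+1F1E6..U+1F1FF.
def regional_indicator_pair?(string)
  cps = string.codepoints
  cps.size == 2 && cps.all? { |cp| (0x1F1E6..0x1F1FF).cover?(cp) }
end

known   = "\u{1F1F5 1F1F9}" # "Known Region" row
unknown = "\u{1F1F5 1F1F5}" # "Unknown Region" row

puts escaped(known)                    # \u{1F1F5 1F1F9}
puts regional_indicator_pair?(known)   # true
puts regional_indicator_pair?(unknown) # true
```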

Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.

More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).

#### Extended Pictographic Regex
### Extended Pictographic Regex

`Unicode::Emoji::REGEX_PICTO` matches single codepoints with the **Extended_Pictographic** property. For example, it will match `✀` BLACK SAFETY SCISSORS.

`Unicode::Emoji::REGEX_PICTO_NO_EMOJI` matches single codepoints with the **Extended_Pictographic** property, but excludes Emoji characters.

See [character.construction/picto](https://character.construction/picto) for a list of all non-Emoji pictographic characters.

#### Partial Regexes
### Partial Regexes

**Please note:** Might get removed or renamed in the future. This is the same as `\p{Emoji}`.

@@ -146,7 +142,7 @@ Regex | Description | Example Matches | Example Non-Matc
------------------------------|-------------|-----------------|--------------------
`Unicode::Emoji::REGEX_ANY` | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please note that this will match Emoji parts rather than complete Emoji, for example, single digits! | `😴`, `▶`, `🏻`, `🛌`, `🏽`, `🇵`, `🇹`, `2`, `🏴`, `🤾`, `♀`, `🤠`, `🤢` | -
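
To see why matching Emoji parts can be surprising, here is a small sketch using Ruby's built-in `\p{Emoji}` property, which the note above describes as equivalent. It uses plain Ruby rather than the gem, so results depend on the Unicode data shipped with your Ruby version, and `emoji_property_parts` is a hypothetical helper name.

```ruby
# Hypothetical helper: collect every codepoint that has the Emoji property.
def emoji_property_parts(string)
  string.scan(/\p{Emoji}/)
end

keycap   = "2\u{FE0F}\u{20E3}"                  # keycap sequence for the digit 2
handball = "\u{1F93E 1F3FD 200D 2640 FE0F}"     # RGI ZWJ sequence from the examples

# Only the digit matches; the variation selector and the keycap mark do not.
p emoji_property_parts(keycap)                  # ["2"]

# Base, skin tone modifier, and female sign match as three separate parts.
p emoji_property_parts(handball).size           # 3

# Ordinary text containing a digit also produces a match.
p emoji_property_parts("Room 2")                # ["2"]
```

This is the reason a partial regex like this is better suited to inspecting individual codepoints than to extracting complete Emoji from text.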

### List
## Usage – List

Use `Unicode::Emoji::LIST` or the **list** method to get an ordered and categorized list of Emoji:

@@ -161,11 +157,11 @@ Unicode::Emoji.list("Food & Drink", "food-asian")
=> ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
```

Please note that categories might change with future versions of the Emoji standard, also this has not happened often.
Please note that categories might change with future versions of the Emoji standard, although this has not happened often.

A list of all Emoji (generated from this gem) can be found at [character.construction/emoji](https://character.construction/emoji).

### Properties Data
## Usage – Properties Data

Allows you to access the codepoint data from Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
