A lightweight, intuitive wrapper around Intl.Segmenter for seamless segment-aware string operations in TypeScript and JavaScript

Not-Jayden/segment-string

🧩 segment-string

A lightweight, intuitive wrapper around Intl.Segmenter for seamless segment-aware string operations in TypeScript and JavaScript.

👪 All Contributors: 2 🤝 Code of Conduct: Kept 🧪 Coverage 📝 License: MIT 📦 npm version 💪 TypeScript: Strict


Key Features

  • Intuitive Intl.Segmenter Wrapper: Simplifies text segmentation with a clean API.
  • Standards-Based: Built on native Intl.Segmenter for robust compatibility.
  • Lightweight & Tree-Shakeable: Minimal footprint with optimal bundling.
  • Highly Performant: Uses iterators for efficient, on-demand processing.
  • Full TypeScript Support: Strict types for safe, predictable usage.

Installation

npm install segment-string

Getting Started

segment-string is a lightweight wrapper for Intl.Segmenter, designed to simplify locale-sensitive text segmentation in JavaScript and TypeScript. It lets you easily segment and manipulate text by graphemes, words, or sentences, ideal for handling complex cases like multi-character emojis or language-specific boundaries.

import { SegmentString } from "segment-string";

const str = new SegmentString("Hello, world! 👩‍👩‍👧‍👦🌍🌈");

// Segment by grapheme
console.log([...str.graphemes()]); // ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', '👩‍👩‍👧‍👦', '🌍', '🌈']
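
To show why grapheme-aware segmentation matters, here is a minimal sketch that uses the native Intl.Segmenter directly rather than segment-string. The helper name countGraphemesSketch is hypothetical and only serves the illustration; it is not part of the library.

```typescript
// Hypothetical helper, not part of segment-string: counts graphemes natively.
function countGraphemesSketch(input: string, locale: string = "en"): number {
	const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
	return Array.from(segmenter.segment(input)).length;
}

// Family emoji: four people joined by three zero-width joiners (U+200D).
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

const codeUnits = family.length; // 11 UTF-16 code units
const codePoints = Array.from(family).length; // 7 code points
const graphemes = countGraphemesSketch(family); // 1 user-perceived character

console.log({ codeUnits, codePoints, graphemes });
```

The same emoji counts as 11, 7, or 1 depending on the unit of measurement, which is why the grapheme count and the string length differ for strings like the one above.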

SegmentString Class

The SegmentString class encapsulates a string and provides methods for segmentation, counting, and retrieving segments at specified indices with locale and granularity options.

Constructor

new SegmentString(str: string, locales?: Intl.LocalesArgument);
  • str: The string to segment.
  • locales: Optional locales argument for segmentation.

Methods

segments(granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): Iterable<string>

Segments the string by the specified granularity and returns the segments as strings.

rawSegments(granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): Intl.Segments | Iterable<Intl.SegmentData>

Returns raw Intl.SegmentData objects based on granularity and options.
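
For reference, the sketch below shows what raw Intl.SegmentData objects contain when produced by the native Intl.Segmenter. It does not call segment-string; the variable names are illustrative only.

```typescript
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Copy the fields of interest out of each Intl.SegmentData object.
const rawSummary = Array.from(wordSegmenter.segment("Hi there"), (data) => ({
	segment: data.segment, // the text of the segment
	index: data.index, // UTF-16 offset of the segment in the input
	isWordLike: data.isWordLike, // only present for word granularity
}));

console.log(rawSummary);
// [
//   { segment: "Hi", index: 0, isWordLike: true },
//   { segment: " ", index: 2, isWordLike: false },
//   { segment: "there", index: 3, isWordLike: true },
// ]
```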

segmentCount(granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): number

Counts segments in the string based on the specified granularity.

segmentAt(index: number, granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): string | undefined

Retrieves the segment at a specific index, supporting negative indices.

rawSegmentAt(index: number, granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): Intl.SegmentData | undefined

Returns the raw segment data at a specific index, supporting negative indices.
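
The sketch below illustrates one way index lookup with negative indices could work on top of the native Intl.Segmenter. It assumes negative indices count back from the last segment, as with Array.prototype.at; the names segmentAtSketch and GranularitySketch are hypothetical, and this is not the library's actual implementation.

```typescript
type GranularitySketch = "grapheme" | "word" | "sentence";

// Hypothetical helper: returns the segment at `index`, or undefined if out of bounds.
function segmentAtSketch(
	input: string,
	index: number,
	granularity: GranularitySketch,
	locale: string = "en",
): string | undefined {
	const segmenter = new Intl.Segmenter(locale, { granularity });
	const segments = Array.from(segmenter.segment(input), (data) => data.segment);
	// Assumption: a negative index is an offset from the end of the segment list.
	const resolved = index < 0 ? segments.length + index : index;
	return segments[resolved];
}

const sample = "a\u{1F44D}b"; // "a", thumbs-up emoji, "b"
const first = segmentAtSketch(sample, 0, "grapheme"); // "a"
const last = segmentAtSketch(sample, -1, "grapheme"); // "b"
const middle = segmentAtSketch(sample, -2, "grapheme"); // the thumbs-up emoji
const missing = segmentAtSketch(sample, 5, "grapheme"); // undefined

console.log({ first, last, middle, missing });
```

Collecting all segments into an array keeps the sketch short; an implementation concerned with long inputs could stop iterating early for non-negative indices.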

graphemes(options?: SegmentationOptions): Iterable<string>

Returns an iterable of grapheme segments as strings.

rawGraphemes(options?: SegmentationOptions): Iterable<Intl.SegmentData>

Returns an iterable of raw grapheme segments.

graphemeCount(options?: SegmentationOptions): number

Counts grapheme segments in the string.

graphemeAt(index: number, options?: SegmentationOptions): string | undefined

Returns the grapheme at a specific index, supporting negative indices.

rawGraphemeAt(index: number, options?: SegmentationOptions): Intl.SegmentData | undefined

Returns the raw grapheme data at a specific index, supporting negative indices.

words(options?: WordSegmentationOptions): Iterable<string>

Returns an iterable of word segments, with optional filtering for word-like segments.

rawWords(options?: WordSegmentationOptions): Iterable<Intl.SegmentData>

Returns an iterable of raw word segments, with optional filtering for word-like segments.

wordCount(options?: WordSegmentationOptions): number

Counts word segments in the string.
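
As an illustration of how word-like filtering changes a word count, the sketch below counts segments with the native Intl.Segmenter. The helper countWordsSketch and its wordLikeOnly flag are hypothetical stand-ins for the isWordLike option, under the assumption that the option restricts counting to segments whose isWordLike field is true.

```typescript
// Hypothetical helper: counts word segments, optionally only word-like ones.
function countWordsSketch(input: string, wordLikeOnly: boolean, locale: string = "en"): number {
	const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
	let count = 0;
	for (const data of segmenter.segment(input)) {
		// Punctuation and whitespace segments have isWordLike === false.
		if (!wordLikeOnly || data.isWordLike) count++;
	}
	return count;
}

const allSegments = countWordsSketch("Hello, world!", false); // 5: 'Hello', ',', ' ', 'world', '!'
const wordLike = countWordsSketch("Hello, world!", true); // 2: 'Hello', 'world'

console.log({ allSegments, wordLike });
```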

wordAt(index: number, options?: WordSegmentationOptions): string | undefined

Returns the word at a specific index, supporting negative indices.

rawWordAt(index: number, options?: WordSegmentationOptions): Intl.SegmentData | undefined

Returns the raw word data at a specific index, supporting negative indices.

sentences(options?: SegmentationOptions): Iterable<string>

Returns an iterable of sentence segments.

rawSentences(options?: SegmentationOptions): Iterable<Intl.SegmentData>

Returns an iterable of raw sentence segments.

sentenceCount(options?: SegmentationOptions): number

Counts sentence segments in the string.

sentenceAt(index: number, options?: SegmentationOptions): string | undefined

Returns the sentence at a specific index, supporting negative indices.

rawSentenceAt(index: number, options?: SegmentationOptions): Intl.SegmentData | undefined

Returns the raw sentence data at a specific index, supporting negative indices.
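
For sentence granularity, the following sketch shows what the native Intl.Segmenter produces; it does not call segment-string, and the variable names are illustrative. Note that each sentence segment keeps its trailing whitespace, so the segments concatenate back to the original string.

```typescript
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const paragraph = "Hello there. How are you? Fine!";

const sentenceList = Array.from(sentenceSegmenter.segment(paragraph), (data) => data.segment);
console.log(sentenceList); // ["Hello there. ", "How are you? ", "Fine!"]

// Trim each segment if the trailing whitespace is not wanted.
const trimmedSentences = sentenceList.map((sentence) => sentence.trim());
console.log(trimmedSentences); // ["Hello there.", "How are you?", "Fine!"]
```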

[Symbol.iterator](): Iterator<string>

Returns an iterator over the graphemes of the string.
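
A class becomes usable with for...of and spread syntax by implementing [Symbol.iterator]. The sketch below shows the general pattern with a hypothetical GraphemeIterableSketch class built on the native Intl.Segmenter; it is an illustration of the technique, not the source of SegmentString.

```typescript
// Hypothetical class: iterating an instance yields the graphemes of its string.
class GraphemeIterableSketch implements Iterable<string> {
	readonly value: string;
	readonly locale: string;

	constructor(value: string, locale: string = "en") {
		this.value = value;
		this.locale = locale;
	}

	// Generator method: graphemes are produced one at a time as the caller iterates.
	*[Symbol.iterator](): Iterator<string> {
		const segmenter = new Intl.Segmenter(this.locale, { granularity: "grapheme" });
		for (const data of segmenter.segment(this.value)) {
			yield data.segment;
		}
	}
}

// "mañana" written with a combining tilde (U+0303): 7 code units, 6 graphemes.
const decomposed = new GraphemeIterableSketch("man\u0303ana");
const graphemeList = Array.from(decomposed);

console.log(decomposed.value.length); // 7
console.log(graphemeList.length); // 6
```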


Example Usage

import { SegmentString } from "segment-string";

const text = new SegmentString("Hello, world! 👩‍👩‍👧‍👦🌍🌈");

// Segmenting by words
for (const word of text.words()) {
	console.log(word); // logs in turn: 'Hello', ',', ' ', 'world', '!', ' ', '👩‍👩‍👧‍👦', '🌍', '🌈'
}

// Segmenting graphemes and counting
console.log([...text.graphemes()]); // ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', '👩‍👩‍👧‍👦', '🌍', '🌈']
console.log("Grapheme count:", text.graphemeCount()); // 17
console.log("String length:", text.toString().length); // 29

// Accessing a specific word
const secondWord = text.wordAt(1, { isWordLike: true }); // 'world'
console.log(secondWord);

SegmentSplitter Class

Alternatively, the SegmentSplitter class allows you to create an instance that can be directly used with JavaScript's String.prototype.split method for basic segmentation.

Constructor

new SegmentSplitter<T extends Granularity>(granularity: T, options?: SegmentationOptions<T>);
  • granularity: Specifies the segmentation granularity level ('grapheme', 'word', or 'sentence').
  • options: Optional settings to customize the segmentation for the given granularity.

Example Usage

import { SegmentSplitter } from "segment-string";

const str = "Hello, world!";
const wordSplitter = new SegmentSplitter("word", { isWordLike: true });
const words = str.split(wordSplitter);
console.log(words); // ["Hello", "world"]
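
String.prototype.split accepts any object that has a [Symbol.split] method, which is the standard mechanism that makes a splitter object usable as a separator. The sketch below demonstrates that mechanism with a hypothetical WordSplitterSketch class built on the native Intl.Segmenter; it is an assumption about the general approach, not the source of SegmentSplitter.

```typescript
// Hypothetical class: splits a string into its word-like segments.
class WordSplitterSketch {
	// Called by String.prototype.split with the string and the optional limit.
	[Symbol.split](input: string, limit?: number): string[] {
		const segmenter = new Intl.Segmenter("en", { granularity: "word" });
		const words: string[] = [];
		for (const data of segmenter.segment(input)) {
			if (data.isWordLike) words.push(data.segment);
		}
		return limit === undefined ? words : words.slice(0, limit);
	}
}

const splitAll = "Hello, world!".split(new WordSplitterSketch());
const splitOne = "Hello, world!".split(new WordSplitterSketch(), 1);

console.log(splitAll); // ["Hello", "world"]
console.log(splitOne); // ["Hello"]
```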

Individual Functions

getRawSegments

function getRawSegments(
	str: string,
	granularity: Granularity,
	options?: SegmentationOptions | WordSegmentationOptions,
): Intl.Segments | Iterable<Intl.SegmentData>;
  • Description: Returns raw Intl.SegmentData objects based on granularity and options.
  • Parameters:
    • str: The string to segment.
    • granularity: Specifies the segmentation level ('grapheme', 'word', or 'sentence').
    • options: Optional settings. locales selects the locale, and isWordLike filters the result down to word-like segments.
  • Returns: An iterable of raw Intl.SegmentData.

getSegments

function getSegments(
	str: string,
	granularity: Granularity,
	options?: SegmentationOptions | WordSegmentationOptions,
): Iterable<string>;
  • Description: Returns segments of the string as plain strings.
  • Parameters: Same as getRawSegments.
  • Returns: An iterable of segments as strings.

segmentCount

function segmentCount(
	str: string,
	granularity: Granularity,
	options?: SegmentationOptions | WordSegmentationOptions,
): number;
  • Description: Returns the count of segments based on granularity and options.
  • Parameters: Same as getRawSegments.
  • Returns: Number of segments.

rawSegmentAt

function rawSegmentAt(
	str: string,
	index: number,
	granularity: Granularity,
	options?: SegmentationOptions | WordSegmentationOptions,
): Intl.SegmentData | undefined;
  • Description: Returns the raw segment data at a specified index, supporting negative indices.
  • Parameters: Similar to getRawSegments, plus an index parameter.
  • Returns: The Intl.SegmentData at the specified index, or undefined if out of bounds.

segmentAt

function segmentAt(
	str: string,
	index: number,
	granularity: Granularity,
	options?: SegmentationOptions | WordSegmentationOptions,
): string | undefined;
  • Description: Returns the segment at a specified index, supporting negative indices.
  • Parameters: Similar to getRawSegments, plus an index parameter.
  • Returns: The segment at the specified index or undefined if out of bounds.

filterRawWordLikeSegments

function filterRawWordLikeSegments(
	segments: Intl.Segments,
): Iterable<Intl.SegmentData>;
  • Description: Filters and returns an iterable of raw word-like segment data where isWordLike is true.
  • Parameters:
    • segments: The segments to filter.
  • Returns: An iterable of Intl.SegmentData for each word-like segment.

filterWordLikeSegments

function filterWordLikeSegments(segments: Intl.Segments): Iterable<string>;
  • Description: Filters and returns an iterable of word-like segments as strings where isWordLike is true.
  • Parameters:
    • segments: The segments to filter.
  • Returns: An iterable of strings for each word-like segment.
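
A filter that returns an iterable can be written as a generator, so segments are examined only as the caller consumes them. The sketch below shows that pattern with the native Intl.Segmenter; the name wordLikeSegmentsSketch is hypothetical and the code is an illustration under that assumption, not the library's implementation.

```typescript
// Hypothetical generator: lazily yields the text of each word-like segment.
function* wordLikeSegmentsSketch(segments: Iterable<Intl.SegmentData>): Generator<string> {
	for (const data of segments) {
		if (data.isWordLike) yield data.segment;
	}
}

const lazySegmenter = new Intl.Segmenter("en", { granularity: "word" });
const lazyWords = wordLikeSegmentsSketch(lazySegmenter.segment("one, two, three"));

// Pull a single value; the rest of the input has not been filtered yet.
const firstWord = lazyWords.next().value; // "one"

// Drain the remaining values from the same generator.
const remainingWords = Array.from(lazyWords); // ["two", "three"]

console.log(firstWord, remainingWords);
```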

💙 This package was templated with create-typescript-app.
