Expose incremental utf8 decoding APIs #2408

Open: wants to merge 1 commit into base: master

32 changes: 27 additions & 5 deletions core/src/Streamly/Internal/Unicode/Stream.hs
@@ -190,8 +190,23 @@ encodeLatin1Lax = encodeLatin1
-- UTF-8 decoding
-------------------------------------------------------------------------------

-- Int helps in cheaper conversion from Int to Char
-- | CodePoint represents a specific character in the Unicode standard.
--
-- It is meant to be used with the resumable decoding APIs such as
-- 'resumeDecodeUtf8Either'.
--
-- On decoding failure we return the current 'CodePoint' and the 'DecodeState'
-- in 'DecodeError'.
type CodePoint = Int

-- | DecodeState refers to the number of bytes remaining to complete the current
-- UTF-8 character decoding.
--
-- It is meant to be used with the resumable decoding APIs such as
-- 'resumeDecodeUtf8Either'.
--
-- On decoding failure we return the current 'CodePoint' and the 'DecodeState'
-- in 'DecodeError'.
type DecodeState = Word8

-- We can divide the errors in three general categories:
@@ -410,17 +425,24 @@ decodeUtf8EitherD :: Monad m
=> D.Stream m Word8 -> D.Stream m (Either DecodeError Char)
decodeUtf8EitherD = resumeDecodeUtf8EitherD 0 0

-- |
-- | Decode a bytestream as UTF-8 encoded characters, returning an 'Either'
-- stream.
--
-- This function is similar to 'decodeUtf8', but instead of replacing the
-- invalid codepoint encountered, it returns a 'Left' 'DecodeError'.
--
-- When decoding is successful and a valid character is encountered, the
-- function returns 'Right Char'.
--
-- /Pre-release/
{-# INLINE decodeUtf8Either #-}
decodeUtf8Either :: Monad m
=> Stream m Word8 -> Stream m (Either DecodeError Char)
decodeUtf8Either = decodeUtf8EitherD

-- |
-- | Resuming the decoding of a bytestream given a 'DecodeState' and a
-- 'CodePoint'.
--
-- /Pre-release/
-- >>> decodeUtf8Either = resumeDecodeUtf8Either 0 0
{-# INLINE resumeDecodeUtf8Either #-}
resumeDecodeUtf8Either
:: Monad m
7 changes: 7 additions & 0 deletions core/src/Streamly/Unicode/Stream.hs
@@ -81,6 +81,13 @@ module Streamly.Unicode.Stream
, decodeUtf8'
, decodeUtf8Chunks

-- ** Resumable UTF-8 Decoding
, DecodeError(..)
, DecodeState
, CodePoint

Review comment (Member):

We can return a more intelligent decode error:

data DecodeUTF8Error = DecodeUTF8Incomplete Word32 | DecodeUTF8NonStarter Word8 | DecodeUTF8Invalid

This should be enough to build a resumable decoder. When resuming we should supply the Word32 from DecodeUTF8Incomplete.
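
A rough sketch of how the error type suggested above could drive a resumable decoder, written against plain lists rather than Streamly streams so that it stands alone. The comment does not say how the Word32 is laid out, so the packing used here (total length, bytes still needed, and code point bits so far) is an assumption, as are the names packState, unpackState, finish and resumeDecodeSketch. Validation of overlong forms and surrogates is done only when a character completes, which is simpler than what a production decoder would likely do.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8, Word32)

-- Error type as proposed in the review comment.
data DecodeUTF8Error
    = DecodeUTF8Incomplete Word32  -- input ended mid-character; resume token
    | DecodeUTF8NonStarter Word8   -- byte that cannot begin a character
    | DecodeUTF8Invalid            -- any other malformed sequence
    deriving (Eq, Show)

-- Hypothetical token layout (an assumption, not taken from the PR):
-- bits 28-31 total sequence length, bits 24-27 bytes still needed,
-- bits 0-20 code point bits accumulated so far. Token 0 is a fresh start.
packState :: Int -> Int -> Int -> Word32
packState total need acc =
    fromIntegral ((total `shiftL` 28) .|. (need `shiftL` 24) .|. acc)

unpackState :: Word32 -> (Int, Int, Int)
unpackState w =
    ( fromIntegral (w `shiftR` 28)
    , fromIntegral ((w `shiftR` 24) .&. 0xF)
    , fromIntegral (w .&. 0x1FFFFF)
    )

-- Validate a completed code point: reject overlong forms, surrogates,
-- and values beyond U+10FFFF.
finish :: Int -> Int -> Either DecodeUTF8Error Char
finish total cp
    | cp < minCp = Left DecodeUTF8Invalid
    | cp >= 0xD800 && cp <= 0xDFFF = Left DecodeUTF8Invalid
    | cp > 0x10FFFF = Left DecodeUTF8Invalid
    | otherwise = Right (chr cp)
  where
    minCp = case total of
        3 -> 0x800
        4 -> 0x10000
        _ -> 0x80

-- Decode one chunk of bytes, starting from a resume token.
resumeDecodeSketch :: Word32 -> [Word8] -> [Either DecodeUTF8Error Char]
resumeDecodeSketch token = go t0 n0 a0
  where
    (t0, n0, a0) = unpackState token

    -- Chunk ended on a character boundary.
    go _ 0 _ [] = []
    -- Chunk ended mid-character: hand back a token to resume from.
    go total need acc [] =
        [Left (DecodeUTF8Incomplete (packState total need acc))]
    -- Expecting the first byte of a character.
    go _ 0 _ (b : bs)
        | b < 0x80 = Right (chr (fromIntegral b)) : go 0 0 0 bs
        | b >= 0xC2 && b <= 0xDF = go 2 1 (fromIntegral b .&. 0x1F) bs
        | b >= 0xE0 && b <= 0xEF = go 3 2 (fromIntegral b .&. 0x0F) bs
        | b >= 0xF0 && b <= 0xF4 = go 4 3 (fromIntegral b .&. 0x07) bs
        | otherwise = Left (DecodeUTF8NonStarter b) : go 0 0 0 bs
    -- Expecting a continuation byte (10xxxxxx).
    go total need acc (b : bs)
        | b .&. 0xC0 /= 0x80 = Left DecodeUTF8Invalid : go 0 0 0 (b : bs)
        | need > 1 = go total (need - 1) acc' bs
        | otherwise = finish total acc' : go 0 0 0 bs
      where
        acc' = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)
```

With this shape, a caller feeding chunks would start with token 0, and whenever the last element of a result is a DecodeUTF8Incomplete, drop it and pass its Word32 as the token for the next chunk. Only at the true end of input does a trailing DecodeUTF8Incomplete become a real error.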

, decodeUtf8Either
, resumeDecodeUtf8Either

-- * Elimination (Encoding)
, encodeLatin1
, encodeLatin1'