error: input contains invalid UTF-8 around byte 30 (of 68) #53

talalriaz · 2022-01-26T12:35:46Z

I encountered this error while running the following code:

import pycld2 as cld2
text = """
Happy Tailors Day! Hackett We�re celebrating with a special offer
"""
isReliable, textBytesFound, details = cld2.detect(text)

Here is the error:

error: input contains invalid UTF-8 around byte 30 (of 68)
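
To see which character the error message points at, the byte offset can be mapped back to a character. The sketch below is only illustrative: `char_at_byte` is a hypothetical helper, and the sample assumes the character shown as "�" above is the C1 control character U+0092, which the post itself does not confirm (the byte counts in the error message are at least consistent with a two-byte character at that position).

```python
from typing import Tuple


def char_at_byte(text: str, byte_offset: int) -> Tuple[int, str]:
    """Return (character index, character) covering the given UTF-8 byte offset."""
    pos = 0
    for i, ch in enumerate(text):
        # "surrogatepass" lets lone surrogates be measured instead of raising.
        width = len(ch.encode("utf-8", "surrogatepass"))
        if pos <= byte_offset < pos + width:
            return i, ch
        pos += width
    raise IndexError("byte offset is past the end of the text")


# ASSUMPTION: U+0092 stands in for the unreadable character in the report.
text = "\nHappy Tailors Day! Hackett We\u0092re celebrating with a special offer\n"
data = text.encode("utf-8")

index, char = char_at_byte(text, 30)
print(len(data))                 # total number of UTF-8 bytes
print(index, hex(ord(char)))     # position and code point of the suspect character
```

Under that assumption the encoded text is 68 bytes long and byte 30 falls on the control character, matching the numbers in the error message.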

ned2 · 2022-03-28T12:14:25Z

There has been some great exploration of this problem in this polyglot issue and [also in the older cld2 project](https://github.com/mikemccand/chromium-compact-language-detector/issues/22) that pycld2 is forked from (some of the reports there come from people using Polyglot, which actually depends on pycld2 rather than on that older cld2 project).

I have not tried it yet, but this solution, which uses a regex to strip the two offending UTF-8 control characters from the input, looks like the most elegant one to me.
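
The linked solution is not reproduced in this thread, so the following is only a minimal sketch of the same idea, not that code. It swaps the regex for the standard-library `unicodedata` module and assumes the two offending kinds of character are the Unicode categories Cc (control) and Cs (surrogate). The name `strip_bad_chars` and the choice to keep newlines and tabs are assumptions made for illustration.

```python
import unicodedata

# ASSUMPTION: these are the categories that trip up the detector.
# Cc = control characters, Cs = surrogate code points.
_BAD_CATEGORIES = {"Cc", "Cs"}

# Ordinary whitespace controls are kept so line structure survives.
_KEEP = {"\n", "\r", "\t"}


def strip_bad_chars(text: str) -> str:
    """Drop control and surrogate characters, keeping common whitespace."""
    return "".join(
        ch
        for ch in text
        if ch in _KEEP or unicodedata.category(ch) not in _BAD_CATEGORIES
    )


# U+0092 is used here as an example of a stray control character.
sample = "Hackett We\u0092re celebrating"
cleaned = strip_bad_chars(sample)
print(repr(cleaned))
```

The cleaned string can then be passed to `cld2.detect(cleaned)` in place of the raw text. Stripping removes the character outright, so "We\u0092re" becomes "Were"; replacing it with an apostrophe or a space instead may suit some inputs better.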

ned2 mentioned this issue Mar 28, 2022

Error on language detection for some unicode characters (control characters) aboSamoor/polyglot#71

Open

omazapa mentioned this issue Nov 30, 2023

input contains invalid UTF-8 around byte... colav/Kahi_plugins#129

Closed
