[Bug] ps recognizes UTF8 as UTF16 #4697

Semnodime · 2024-11-03T00:53:41Z

Work environment

rizin 0.8.0 @ linux-x86-64
commit: 73d85d2

Expected behavior

ps should detect the string (hex f0 9f 9f aa f0 9f 9f aa 00, which decodes to 🟪🟪) as UTF-8 and display it as such.

Actual behavior

The string is detected as UTF16BE. (Even if the data really were UTF-16, it would still be parsed incorrectly, but that is a separate bug.)

Steps to reproduce the behavior

ELF AMD64

[0x0007ed51]> pxc
- offset -   0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF  comment
0x0007ed51  f09f 9faa f09f 9faa 0025 6868 75ef b88f  .........%hhu...  ; data.0007ed51  ; str.hhu
[0x0007ed51]> psj
{"string":"\u00f0\u009f\u009f\u00aa\u00f0\u009f\u009f\u00aa%\u0068\u0068\u0075\u00ef\u00b8\u008f\u00e2\u0083\u00a3\u0000\u0059\u006f\u0075\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0021\u0000\u002f\u0062\u0069\u006e\u002f\u0073\u0068\u0000Y\u006f\u0075\u0020\u006d\u0061\u0079\u0020\u0068\u0061\u0076\u0065\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0070\u0075\u007a\u007a\u006c\u0065\u0020\u0062\u0075\u0074\u0020\u0079\u006f\u0075\u0020\u0064\u0069\u0064\u0020\u006e\u006f\u0074\u0020\u0073\u006f\u006c\u0076\u0065\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0020\u003b\u0029","offset":519505,"section":".rodata","length":122,"type":"utf16be"}
[0x0007ed51]> ps+j
{"string":"\u009f\u009f\u00aa\u00f0\u009f\u009f\u00aa\u0000\u0025\u0068\u0068\u0075\u00ef\u00b8\u008f\u00e2\u0083\u00a3Y\u006f\u0075\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0021/\u0062\u0069\u006e\u002f\u0073\u0068","offset":519505,"section":".rodata","length":50,"type":"utf16be"}
[0x0007ed51]> ps
龪龪%桨痯뢏ꌀ奯甠獯汶敤⁴桥\xe2\x81\xa3桡汬敮来℀⽢楮⽳栀Y潵\xe2\x81\xad慹\xe2\x81\xa8慶攠獯汶敤⁴桥⁰畺穬攠扵琠祯甠摩搠湯琠獯汶攠瑨攠捨慬汥湧攠㬩
[0x0007ed51]> ps+
龟꫰龟ꨀ╨桵迢莣Y潵\xe2\x81\xb3潬癥搠瑨攠捨慬汥湧攡/扩港獨
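
To make the misdetection concrete, here is a minimal Python sketch (not rizin code) that decodes the first ten bytes from the hexdump above both ways. The variable names are only for illustration. It shows that the bytes are valid UTF-8 up to the NUL terminator, and that a UTF-16BE reading pairs the terminator with the following 0x25 byte, so the string does not end there.

```python
# Bytes from the hexdump: the UTF-8 string, its NUL terminator, and the '%' that follows.
raw = bytes.fromhex("f09f9faaf09f9faa0025")

# Read as UTF-8, stopping at the first NUL byte.
as_utf8 = raw.split(b"\x00", 1)[0].decode("utf-8")

# Read as UTF-16BE: every two bytes form one 16-bit code unit.
as_utf16be = raw.decode("utf-16-be")

# Print code points instead of glyphs so the output does not depend on the terminal.
print([f"U+{ord(c):04X}" for c in as_utf8])
print([f"U+{ord(c):04X}" for c in as_utf16be])
```

The UTF-8 reading yields U+1F7EA twice (🟪🟪). The UTF-16BE reading yields U+F09F, U+9FAA, U+F09F, U+9FAA, U+0025. U+9FAA is 龪 and U+0025 is %, which matches the start of the ps output; U+F09F is a private-use code point, which presumably is why it does not show up there.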

wargio · 2024-11-03T11:53:57Z

I believe this is due to the guess encoding. You can enforce UTF-8 by setting str.search.encoding=utf8:

[0x00000000]> e str.search.encoding
guess
[0x00000000]> e str.search.encoding=?
ascii
8bit
utf8
utf16le
utf32le
utf16be
utf32be
guess

wargio · 2024-11-03T11:55:06Z

Also, since those characters are emoji, I am fairly sure we do not handle them correctly when guessing.
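
For reference, emoji such as 🟪 lie outside the Basic Multilingual Plane, so in UTF-8 they are always four-byte sequences. The following Python sketch shows the check a guesser would need to accept such a sequence. It is an illustration under that assumption only, not rizin's actual guessing code, and the function name decode_utf8_4byte is hypothetical.

```python
def decode_utf8_4byte(seq: bytes) -> int:
    """Return the code point of one 4-byte UTF-8 sequence, or raise ValueError."""
    if len(seq) != 4:
        raise ValueError("need exactly 4 bytes")
    b0, b1, b2, b3 = seq
    # Lead byte of a 4-byte sequence has the bit pattern 11110xxx.
    if b0 & 0xF8 != 0xF0:
        raise ValueError("not a 4-byte lead byte")
    # Each continuation byte has the bit pattern 10xxxxxx.
    for b in (b1, b2, b3):
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
    cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
    # Reject overlong encodings and values beyond the Unicode range.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("overlong or out of range")
    return cp


print(hex(decode_utf8_4byte(bytes.fromhex("f09f9faa"))))
```

For the bytes f0 9f 9f aa from the report, this returns 0x1f7ea, the code point of 🟪.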

XVilka added this to the 0.8.0 milestone Nov 3, 2024

XVilka added test-required RzUtil labels Nov 3, 2024
