[Bug] ps recognizes UTF8 as UTF16 #4697

Open
Semnodime opened this issue Nov 3, 2024 · 2 comments


Semnodime commented Nov 3, 2024

Work environment

rizin 0.8.0 @ linux-x86-64
commit: 73d85d2

Expected behavior

ps should detect the string (hex f0 9f 9f aa f0 9f 9f aa 00, which decodes to 🟪🟪) as UTF-8 and display it as such.

Actual behavior

ps reports the string as UTF16BE. (The bytes are also decoded incorrectly for UTF-16, even if that were the actual encoding, but that is a separate bug.)
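
The ambiguity can be checked outside rizin. The following is a minimal Python sketch (not rizin code; the variable names are only illustrative) that decodes the eight payload bytes from this report both ways. Both decodings succeed, which is presumably why a guess can go wrong here:

```python
# Payload bytes from the report, without the trailing NUL terminator.
raw = bytes.fromhex("f09f9faaf09f9faa")

# Strict UTF-8 decoding succeeds and yields two 4-byte sequences.
as_utf8 = raw.decode("utf-8")
utf8_codepoints = [f"U+{ord(c):04X}" for c in as_utf8]

# UTF-16BE decoding also succeeds, because none of the 16-bit units
# falls in the surrogate range (D800-DFFF).
as_utf16be = raw.decode("utf-16-be")
utf16_codepoints = [f"U+{ord(c):04X}" for c in as_utf16be]

print(utf8_codepoints)   # ['U+1F7EA', 'U+1F7EA']
print(utf16_codepoints)  # ['U+F09F', 'U+9FAA', 'U+F09F', 'U+9FAA']
```

U+1F7EA is the purple square emoji; U+9FAA is the 龪 character that shows up in the ps output, and U+F09F lies in the private use area.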

Steps to reproduce the behavior

ELF AMD64

[0x0007ed51]> pxc
- offset -   0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF  comment
0x0007ed51  f09f 9faa f09f 9faa 0025 6868 75ef b88f  .........%hhu...  ; data.0007ed51  ; str.hhu
[0x0007ed51]> psj
{"string":"\u00f0\u009f\u009f\u00aa\u00f0\u009f\u009f\u00aa%\u0068\u0068\u0075\u00ef\u00b8\u008f\u00e2\u0083\u00a3\u0000\u0059\u006f\u0075\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0021\u0000\u002f\u0062\u0069\u006e\u002f\u0073\u0068\u0000Y\u006f\u0075\u0020\u006d\u0061\u0079\u0020\u0068\u0061\u0076\u0065\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0070\u0075\u007a\u007a\u006c\u0065\u0020\u0062\u0075\u0074\u0020\u0079\u006f\u0075\u0020\u0064\u0069\u0064\u0020\u006e\u006f\u0074\u0020\u0073\u006f\u006c\u0076\u0065\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0020\u003b\u0029","offset":519505,"section":".rodata","length":122,"type":"utf16be"}
[0x0007ed51]> ps+j
{"string":"\u009f\u009f\u00aa\u00f0\u009f\u009f\u00aa\u0000\u0025\u0068\u0068\u0075\u00ef\u00b8\u008f\u00e2\u0083\u00a3Y\u006f\u0075\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0021/\u0062\u0069\u006e\u002f\u0073\u0068","offset":519505,"section":".rodata","length":50,"type":"utf16be"}
[0x0007ed51]> ps
龪龪%桨痯뢏ꌀ奯甠獯汶敤⁴桥\xe2\x81\xa3桡汬敮来℀⽢楮⽳栀Y潵\xe2\x81\xad慹\xe2\x81\xa8慶攠獯汶敤⁴桥⁰畺穬攠扵琠祯甠摩搠湯琠獯汶攠瑨攠捨慬汥湧攠㬩
[0x0007ed51]> ps+
龟꫰龟ꨀ╨桵迢莣Y潵\xe2\x81\xb3潬癥搠瑨攠捨慬汥湧攡/扩港獨
XVilka added this to the 0.8.0 milestone Nov 3, 2024
Member

wargio commented Nov 3, 2024

I believe this is due to the encoding guess. You can enforce UTF-8 by setting str.search.encoding=utf8:

[0x00000000]> e str.search.encoding
guess
[0x00000000]> e str.search.encoding=?
ascii
8bit
utf8
utf16le
utf32le
utf16be
utf32be
guess

Member

wargio commented Nov 3, 2024

Also, since those characters are emoji, I am quite sure we do not handle them correctly when guessing.
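
One possible direction for the guess, sketched in Python purely as an illustration: treat a NUL-terminated byte run as UTF-8 when it passes strict validation and contains at least one multibyte sequence. This is an assumption about what a fix could look like, not a description of how rizin's guess currently works; the function name `is_strict_multibyte_utf8` is hypothetical.

```python
def is_strict_multibyte_utf8(data: bytes) -> bool:
    """Hypothetical check: True if the bytes up to the first NUL are
    valid UTF-8 and contain at least one non-ASCII sequence."""
    # Consider only the bytes before the first NUL, as for a C string.
    payload = data.split(b"\x00", 1)[0]
    # Empty or pure-ASCII input gives no evidence for multibyte UTF-8.
    if not payload or payload.isascii():
        return False
    try:
        # Python's decoder is strict: it rejects truncated sequences,
        # overlong forms, and encoded surrogates.
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The first 16 bytes shown by pxc in the report.
dump = bytes.fromhex("f09f9faaf09f9faa0025686875efb88f")
print(is_strict_multibyte_utf8(dump))  # True
```

Running such a check before trying the UTF-16 variants would classify the bytes from this report as UTF-8, since arbitrary UTF-16 data rarely forms well-formed multibyte UTF-8 by accident.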
