On Linux, command line arguments cannot contain multi-byte codepoints because of limitation on the String type #167

Open
lucabol opened this issue Jan 29, 2024 · 5 comments

Comments

@lucabol
Contributor

lucabol commented Jan 29, 2024

This is the offending method. Copilot suggests simple UTF-8 -> UTF-16 code, so it may be worth adding to zerolib, but perhaps Copilot is oversimplifying:

        private static unsafe string Ctor(sbyte* ptr)
        {
            sbyte* cur = ptr;
            while (*cur++ != 0) ;

            string result = FastNewString((int)(cur - ptr - 1));
            for (int i = 0; i < cur - ptr - 1; i++)
            {
                if (ptr[i] > 0x7F)
                    Environment.FailFast(null);
                Unsafe.Add(ref result._firstChar, i) = (char)ptr[i];
            }
            return result;
        }
@MichalStrehovsky
Member

You're running into the FailFast, right? Yeah, zerolib does cut corners like that. This constructor has different behaviors depending on whether this is Windows or Linux (it's "current codepage to UTF-16" on Windows, and "UTF-8 to UTF-16" on Linux).

@lucabol
Contributor Author

lucabol commented Jan 29, 2024

Well, it's not 'UTF-8 to UTF-16' on Linux, it is 'ASCII to UTF-16' (by design). It seems too limiting, even for zerolib. But perhaps you feel differently?

@MichalStrehovsky
Member

I'm not opposed to adding a helper for this. It might not be the only place where UTF-8 to UTF-16 conversion would be useful.
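
For illustration, here is a minimal sketch of what such a UTF-8 to UTF-16 helper could look like. It is not zerolib code: the names `Utf8Sketch`, `DecodeOne`, and `Decode` are hypothetical, it takes a `byte[]` rather than walking the `sbyte*` directly, and it assumes `new string(char[], int, int)` is available, which may not hold in zerolib. Replacing malformed input with U+FFFD is only one possible policy; failing fast would be another. The sketch reads unsigned bytes because an `sbyte` tops out at 127, so a comparison against `0x7F` on an `sbyte` can never be true.

```csharp
using System;

// Hypothetical helper, not part of zerolib.
static class Utf8Sketch
{
    // Decodes one UTF-8 sequence starting at src[i] and advances i past it.
    // Returns the code point, or -1 if the sequence is malformed.
    static int DecodeOne(byte[] src, ref int i)
    {
        int b0 = src[i++];
        if (b0 < 0x80) return b0; // ASCII fast path

        int extra, min, cp;
        if ((b0 & 0xE0) == 0xC0) { extra = 1; min = 0x80; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; min = 0x800; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp = b0 & 0x07; }
        else return -1; // stray continuation byte or invalid lead byte

        for (int k = 0; k < extra; k++)
        {
            // A missing or non-continuation byte ends the sequence early;
            // the offending byte is left unconsumed so decoding can resync.
            if (i >= src.Length || (src[i] & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (src[i++] & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;
        return cp;
    }

    public static string Decode(byte[] utf8)
    {
        // Each decoded sequence yields at most one UTF-16 unit per input byte
        // (a surrogate pair only comes from a 4-byte sequence), so this fits.
        char[] buf = new char[utf8.Length];
        int n = 0;
        for (int i = 0; i < utf8.Length; )
        {
            int cp = DecodeOne(utf8, ref i);
            if (cp < 0) cp = 0xFFFD; // assumed policy: substitute, do not fail

            if (cp < 0x10000)
            {
                buf[n++] = (char)cp;
            }
            else
            {
                // Split a supplementary code point into a surrogate pair.
                cp -= 0x10000;
                buf[n++] = (char)(0xD800 + (cp >> 10));
                buf[n++] = (char)(0xDC00 + (cp & 0x3FF));
            }
        }
        return new string(buf, 0, n);
    }
}
```

With this sketch, `Utf8Sketch.Decode(new byte[] { 0xC3, 0xA9 })` yields the one-character string "é", where the ASCII-only constructor above would not produce it.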

@lucabol
Contributor Author

lucabol commented Jan 29, 2024

I am keeping track of everything that 'feels like' zerolib in a single file. Perhaps worth doing a simple PR, or discussion, when I am finished.

In case you wonder: I am playing around with exposing UTF-8 command-line args in the spirit of utf8everywhere, then moving on to experimenting with a no-allocation programming model (as in MISRA C), and finishing with a linear allocator. At least, this is the rough idea. We'll see.

@ghost

ghost commented Jan 29, 2024

On Windows, I would use the Windows API. On Linux, I would use libiconv. Don't write your own encoding converter.
