Multi-language support design #16

Closed
kichikuou opened this issue Aug 31, 2021 · 37 comments

Comments

@kichikuou
Owner

Background

Originally System1-3 only supported Japanese characters, encoded in Shift-JIS. ASCII characters are included in Shift-JIS, but they were interpreted as commands rather than text messages.

SysEng (by @RottenBlock) enabled ASCII characters in messages, by enclosing them with quotation marks (' or "). It was merged to system3-sdl2.

@silas1037 is working on GBK character encoding support, for Simplified Chinese translation. The code change is fairly simple, thanks to the similarity between GBK and Shift-JIS.

Now let's step back a bit, and explore possible alternative designs for multi-language support.

Unicode vs non-unicode

Unicode

xsystem35-sdl2 and xsys35c have complete UTF-8 support, as described in "Unicode mode" document. I think a similar approach is possible in system3-sdl2.

Pros:

  • We can cover many languages with a single implementation. Note that Unicode alone is not a complete solution; for example, we don't get right-to-left language support for free.
  • UTF-8 is the "native" encoding of SDL2. We could just pass UTF-8 strings to SDL2 functions.
  • Gaiji characters can be mapped to Unicode's Private Use Areas.
  • Characters outside the target character set can be used. For example, symbols like ◎ can be used even when translating to a Western language that uses the Latin-1 character set.

Cons:

  • Implementation is not trivial, compared to the GBK support.
  • Shift-JIS support is still needed, for original AliceSoft games.

Non-unicode

Instead of using Unicode, we may support different encodings for each character set (Shift-JIS for Japanese, GBK for Simplified Chinese, Big5 for Traditional Chinese, etc.).

Pros:

  • We already have working code for GBK, by @silas1037. It breaks Shift-JIS support, though, so some refinement is needed.
  • GBK support only required replacing the Unicode conversion table and the implementation of a few functions such as is_[12]byte_message(). Many non-Unicode encodings have a similar structure, so they could be supported in a similar way.
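
As a rough illustration of why a GBK port mostly comes down to swapping a few byte-classification functions, here is a minimal sketch. The function names are hypothetical and are not the ones used in system3-sdl2; only the lead-byte ranges (0x81-0x9F and 0xE0-0xFC for Shift-JIS, 0x81-0xFE for GBK) come from the encodings themselves.

```python
from typing import Callable, List


def is_sjis_lead_byte(b: int) -> bool:
    """True if b is the first byte of a 2-byte Shift-JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_gbk_lead_byte(b: int) -> bool:
    """True if b is the first byte of a 2-byte GBK character."""
    return 0x81 <= b <= 0xFE


def split_chars(data: bytes, is_lead: Callable[[int], bool]) -> List[bytes]:
    """Split a byte string into 1-byte and 2-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte consumes the following byte as well, if there is one.
        n = 2 if is_lead(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + n])
        i += n
    return chars


sjis_chars = split_chars("あA".encode("shift_jis"), is_sjis_lead_byte)
gbk_chars = split_chars("兰A".encode("gbk"), is_gbk_lead_byte)
print(sjis_chars, gbk_chars)
```

Note that a byte such as 0xB1 is a 1-byte half-width katakana in Shift-JIS but a lead byte in GBK, which is one way a GBK-only change can break Shift-JIS games.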

Cons:

  • We need a separate Unicode conversion table for each language.
  • To avoid conflicts, Gaiji encoding must be defined individually for each target character encoding. For example, 0xeb9f-0xebfc and 0xec40-0xec9e are used for Gaiji in Shift-JIS, but these ranges overlap with regular character ranges in GBK.
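
The overlap can be checked with a short sketch using Python's built-in `gbk` codec. The gaiji ranges are the ones quoted above; the helper name `is_sjis_gaiji` is made up for this example.

```python
# Gaiji ranges used by System3 games in Shift-JIS, as quoted in this issue.
GAIJI_RANGES = [(0xEB9F, 0xEBFC), (0xEC40, 0xEC9E)]


def is_sjis_gaiji(code: int) -> bool:
    """True if a 2-byte code falls inside the Shift-JIS gaiji ranges."""
    return any(lo <= code <= hi for lo, hi in GAIJI_RANGES)


code = 0xEB9F
raw = code.to_bytes(2, "big")

# The same two bytes are an ordinary character when read as GBK,
# so a GBK build cannot reuse this range for gaiji.
as_gbk = raw.decode("gbk")
print(is_sjis_gaiji(code), repr(as_gbk))
```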

Language tag

Currently, there is no way to declare the character encoding used in ADISK.DAT. For the interpreter (or decompiler), the character encoding has to be provided separately, e.g. by a command-line flag. Should we have a language / character-encoding tag in the head of the scenario file, like the (now deprecated) REV tag of SysEng?

This may not be so beneficial, since many games require the correct game ID to work properly, and the character encoding can be determined from the game ID.

Compiler / decompiler

I'm not familiar with Sys0Decompiler's internals, but it seems that simply changing the encoding name was enough for GBK compilation, and it would possibly work for UTF-8 or other encodings too. Decompilation would be straightforward as well, if the character encoding is known.

It would be nice if the compiler had some abstract notation for Gaiji, so that users don't have to care about the actual representation in the target character encoding.

Zenkaku-Hankaku conversion

System3 games store certain Zenkaku (2-byte) characters as Hankaku (1-byte) characters, to save precious floppy disk space. There is no reason to perform such conversion in non-Shift-JIS encodings.
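
To make the space saving concrete, here is a small sketch of a hankaku-to-zenkaku expansion. It assumes the 1-byte characters are half-width katakana standing in for hiragana, as a later comment in this thread suggests; the interpreter's actual conversion table is not shown here, so the NFKC-based mapping below is only an approximation.

```python
import unicodedata


def hankaku_to_hiragana(text: str) -> str:
    """Expand half-width katakana to full-width hiragana (approximate)."""
    out = []
    # NFKC turns half-width katakana into full-width katakana.
    for ch in unicodedata.normalize("NFKC", text):
        cp = ord(ch)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above the matching hiragana.
        if 0x30A1 <= cp <= 0x30F6:
            ch = chr(cp - 0x60)
        out.append(ch)
    return "".join(out)


stored = "ｱｲｳ"                      # 1 byte each in Shift-JIS
shown = hankaku_to_hiragana(stored)  # 2 bytes each in Shift-JIS
print(shown, len(stored.encode("shift_jis")), len(shown.encode("shift_jis")))
```

In UTF-8 both forms take 3 bytes per character, so the conversion saves nothing there.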

@kichikuou
Owner Author

I think if we are serious about i18n, we should implement Unicode support because it clearly has some advantages. Otherwise, we can simply support GBK and forget about this topic until the next need (if any) arises.

@RottenBlock any thoughts? I don't know much about Sys0Decompiler and I have no experience translating games, so I may be overlooking something.

@silas1037 If we decide to implement Unicode support, are you willing to use it in your zh translation?

@silas1037

silas1037 commented Aug 31, 2021

Yeah, of course.
The excellent performance of xsystem35 finally convinced me to use Unicode support for RanceIV. The final decision depends on stability, reliability and ease of use, so I will have to test the new system many times for my TL.

Currently, there is no way to declare the character encoding used in ADISK.DAT. For the interpreter (or decompiler), the character encoding has to be provided separately, e.g. by a command-line flag. Should we have a language / character-encoding tag in the head of the scenario file, like the (now deprecated) REV tag of SysEng?

I think nothing is better than system3.ini~. Also, I'm still confused about unpacking and repacking Nise~tower, because of a small incompatibility between the compiler and the emulator.

@RottenBlock

RottenBlock commented Sep 1, 2021

I agree that unicode support is best. silas1037 isn't the first to come asking about localization support in other languages, though they are the first to go the extra mile to implement it! It would be better to provide support for everyone if we're willing to take those steps. I think you're right that command line arguments / system3.ini will work without the need for a language tag. I don't think there's any huge problems on Sys0Decompiler's side, and I have the luxury of simply adding a Unicode radio button to the layout and so don't even have to worry about command line arguments.

I don't think you're overlooking much. There are some languages that don't cooperate nicely with monospace fonts, but the end result is more "ugly" than "broken." Here's a Reddit thread arguing about it, for what that's worth. Devanagari is apparently especially bad. But even in those cases, supporting Unicode is still a necessary first step.

One problem that might be unavoidable: are there any latter-day System 3.5 games that send their text through a DLL? Those might take additional work, if you haven't already handled them. Sorry, I haven't experimented much with xsystem35.

"It would be nice if the compiler had some abstract notation for Gaiji, so that users don't have to care about the actual representation in the target character encoding."

At present, Sys0Decompiler turns gaiji into hexadecimal notation during decompilation, e.g. "0xEBBA", and turns them back during compilation. It could be a little clearer what's going on, though, especially in a Unicode environment. Unicode uses the notation U+#### to refer to character codes, so I'm half-tempted to change it to G+####.

@kichikuou
Owner Author

Okay, sounds like we all agreed that Unicode without a language tag is the best option. I'll give it a shot this weekend.

I don't think you're overlooking much. There are some languages that don't cooperate nicely with monospace fonts, but the end result is more "ugly" than "broken." Here's a Reddit thread arguing about it, for what that's worth. Devanagari is apparently especially bad. But even in those cases, supporting Unicode is still a necessary first step.

Yeah text rendering is a hard problem. Even English has a problem -- system3-sdl2 can render proportional (non-monospace) fonts, but still draws characters one-by-one, so kerning isn't working.

One problem that might be unavoidable: are there any latter-day System 3.5 games that send their text through a DLL? Those might take additional work, if you haven't already handled them. Sorry, I haven't experimented much with xsystem35.

xsystem35 has its own implementation of the DLLs, and the Unicode mode is supported.

At present, Sys0Decompiler turns gaiji into hexadecimal notation during decompilation, e.g. "0xEBBA", and turns them back during compilation. It could be a little clearer what's going on, though, especially in a Unicode environment. Unicode uses the notation U+#### to refer to character codes, so I'm half-tempted to change it to G+####.

G+#### sounds good.
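
For illustration only, the proposed notation could be converted along these lines. This is a sketch, not Sys0Decompiler's code: the function names are invented, and it assumes only 4-digit hex tokens inside the Shift-JIS gaiji ranges mentioned earlier should be rewritten.

```python
import re

# Shift-JIS gaiji ranges quoted earlier in this issue.
GAIJI_RANGES = [(0xEB9F, 0xEBFC), (0xEC40, 0xEC9E)]
HEX_TOKEN = re.compile(r"0x([0-9A-Fa-f]{4})")
G_TOKEN = re.compile(r"G\+([0-9A-Fa-f]{4})")


def _is_gaiji(code: int) -> bool:
    return any(lo <= code <= hi for lo, hi in GAIJI_RANGES)


def hex_to_g(source: str) -> str:
    """Rewrite 0x#### gaiji tokens as G+####, leaving other hex untouched."""
    def repl(m):
        code = int(m.group(1), 16)
        return f"G+{code:04X}" if _is_gaiji(code) else m.group(0)
    return HEX_TOKEN.sub(repl, source)


def g_to_codes(source: str) -> list:
    """Return the gaiji codes referenced by G+#### tokens in a string."""
    return [int(m.group(1), 16) for m in G_TOKEN.finditer(source)]


print(hex_to_g("0xEBBA"), g_to_codes("abc G+EBBA def"))
```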

@kichikuou
Owner Author

I've implemented Unicode support in the unicode branch. I'd appreciate it if you could test it before I merge it into the master branch.

(Very hacky) compiler change is here: kichikuou/Sys0Decompiler@4f303f6

With these, I was able to recompile and run Rance 4.1 zh version in Unicode.

How to test:

  1. Checkout and build the above system3-sdl2 and Sys0Decompiler branches.
  2. Recompile ADISK.DAT with the new compiler. If your source files are not in UTF-8, you can change the input encoding by modifying "utf-8" of the following line in SystemVersion.cs and DecompilerForm.cs:
       private Encoding sourceEncoding = Encoding.GetEncoding("utf-8");
  3. Add the following line to system3.ini:
       encoding = utf-8
    
  4. Run it with system3 built in step 1.

Caveats:

  • Using save data created in a non-Unicode version may cause a crash.
  • Compiler change is very hacky and incomplete; it always generates Unicode output, and decompilation is not supported at all.
  • In the compiler, gaiji references (0xeb9f-0xebfc, 0xec40-0xec9e) are automatically remapped to the Unicode Private Use Area (U+E000-). For your convenience, it also accepts the gaiji ranges used in the GBK port (0xff40-0xff9d, 0xff9e-0xfffc), but please don't expect this to be permanently supported.
  • Zenkaku-hankaku conversion still works in Unicode mode. I'm not 100% sure if this is desirable behavior.
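
The gaiji remapping can be pictured with the sketch below. It assumes the 188 gaiji codes are assigned sequentially starting at U+E000 and that the invalid Shift-JIS trail byte 0x7F is skipped; the compiler's real layout may differ, and the function names are hypothetical.

```python
from typing import Optional

PUA_BASE = 0xE000  # start of the Unicode Private Use Area


def gaiji_index(code: int) -> Optional[int]:
    """Return a 0-based gaiji index for a Shift-JIS code, or None."""
    lead, trail = code >> 8, code & 0xFF
    if lead == 0xEB and 0x9F <= trail <= 0xFC:
        return trail - 0x9F                      # indices 0..93
    if lead == 0xEC and 0x40 <= trail <= 0x9E and trail != 0x7F:
        idx = trail - 0x40
        if trail > 0x7F:
            idx -= 1                             # 0x7F is not a valid trail byte
        return 94 + idx                          # indices 94..187
    return None


def gaiji_to_pua(code: int) -> Optional[str]:
    """Map a Shift-JIS gaiji code to a Private Use Area character (assumed layout)."""
    idx = gaiji_index(code)
    return None if idx is None else chr(PUA_BASE + idx)


print(hex(ord(gaiji_to_pua(0xEB9F))), hex(ord(gaiji_to_pua(0xEC9E))))
```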

@silas1037

silas1037 commented Sep 4, 2021

Good! I will test it soon, hopefully.

EDIT: I have tested it on new compiled game successfully. Good job.

@kichikuou
Owner Author

@silas1037 Thank you for testing! I merged the changes into the master branch.

@RottenBlock Are you interested in integrating the Unicode output support code to Sys0Decompiler?

@RottenBlock

RottenBlock commented Sep 5, 2021

Yes, I'd like to integrate these updates soon, but I'm busy for the next few days. I'm sorry if that slows you down, you're making serious progress!

@RottenBlock

RottenBlock commented Sep 9, 2021

Sorry this is going so slowly. After working on the decompiler for a few days, I finally got around to testing the code you put up, which I probably should have done from the start. Unfortunately, I'm getting invalid characters when system3-sdl2 tries to run compiled code. Just to be sure, I tried using your version of Sys0Decompiler and the results are the same.

Here's what I'm doing: I've compiled Little Vampire to include the word "兰斯" (Rance in Simplified Chinese) in both the opening text and the startup menu (i.e. using both a page file and in the AG00 file). In both cases, it comes out looking like this:

https://imgur.com/a/o1cBKD4

Any ideas what's going wrong? I'll probably have to send you my code, but I'll wait to hear which files you want instead of sending everything in a big wave.

Maybe the problem is that I haven't set the -encoding config to a value, but the program currently doesn't use the -encoding config. Or maybe I'm misunderstanding?

@kichikuou
Owner Author

The string looks like the result of decoding "\xE5\x85\xB0\xE6\x96\xAF" ("兰斯" in UTF-8) as Shift JIS. I guess ADISK.DAT is generated in UTF-8 as expected.
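
This diagnosis is easy to reproduce with Python's built-in codecs. The sketch below only demonstrates the mis-decoding; it says nothing about how system3-sdl2 decodes text internally.

```python
text = "兰斯"

# UTF-8 bytes as they would be stored in the recompiled data.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'\xe5\x85\xb0\xe6\x96\xaf'

# Reading those bytes as Shift-JIS yields unrelated characters (mojibake).
mojibake = utf8_bytes.decode("shift_jis", errors="replace")
print(mojibake)

# Reading them with the right encoding restores the original string.
restored = utf8_bytes.decode("utf-8")
print(restored)
```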

Did you specify the -encoding utf-8 command-line option, or encoding = utf8 in system3.ini? That tells system3 that UTF-8 encoding should be used to interpret multi-byte strings.

@silas1037

System1 and 2 don't work well either. I failed to start Alice Yakata2 and Little Vampire in Unicode.

@RottenBlock

Ah, my bad, for some reason I was convinced the -encoding parameter wasn't hooked up yet. Not sure how I expected it to work, really. So that's entirely on me.

That said, now I'm getting this:

https://imgur.com/a/ytFjrXr

So the second character is correct but the first has become a dot, for some reason? Again, both in the text and the menu.

@silas1037

So the second character is correct but the first has become a dot, for some reason? Again, both in the text and the menu.

It's a font issue. You can use simhei.ttf for Simplified Chinese.

@RottenBlock

Oh, of course, thank you!

@kichikuou
Owner Author

System1 and 2 don't work well either. I failed to start Alice Yakata2 and Little Vampire in Unicode.

Do you see any error messages?
Note that recompiling with UTF-8 changes the CRC32 of ADISK.DAT, so automatic game detection no longer works. You have to specify the game option.
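
A minimal sketch of why detection breaks, assuming the game is identified by looking up a CRC32 of the data file in a table. The table contents and the `detect_game` helper are hypothetical, and the real interpreter may checksum the file differently.

```python
import zlib
from typing import Dict, Optional


def detect_game(data: bytes, table: Dict[int, str]) -> Optional[str]:
    """Look up a game ID by the CRC32 of its data (hypothetical scheme)."""
    return table.get(zlib.crc32(data))


# Stand-ins for the contents of ADISK.DAT before and after recompilation.
original = "こんにちは".encode("shift_jis")
recompiled = "こんにちは".encode("utf-8")

# The table only knows the checksum of the original Shift-JIS data.
table = {zlib.crc32(original): "example_game"}

print(detect_game(original, table))    # found
print(detect_game(recompiled, table))  # not found: the bytes changed
```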

@silas1037

silas1037 commented Sep 9, 2021

Do you see any error messages?
Note that recompiling with UTF-8 changes the CRC32 of ADISK.DAT, so automatic game detection no longer works. You have to specify the game option.

No, it just crashed. The game option is definitely set. Maybe I could send you the pack.

@kichikuou
Owner Author

Ah another thing I forgot to mention.

Sys0Decompiler in #16 (comment) had a bug that AG00.DAT was not generated correctly. It's fixed in kichikuou/Sys0Decompiler@0fb6d95, please try it if you haven't yet.

@RottenBlock

RottenBlock commented Sep 9, 2021

Here's the updated version of Sys0Decompiler so far. It includes some features from 0.7.6 (including the removal of the REV tag) and can compile to UTF8, complete with updated GUI. The decompile form is also updated, but decompiling still doesn't work.

https://www.mediafire.com/file/evlkrp0u0k6wead/Sys0Decompiler_0.8_%2528WIP%2529.zip/file

@silas1037

silas1037 commented Sep 9, 2021

Thank you, solved.
@RottenBlock Could you please see the comment in fandom?
So has v0.8 already merged the UTF-8 fix by kichikuou?

@kichikuou
Owner Author

Here's the updated version of Sys0Decompiler so far. It includes some features from 0.7.6 (including the removal of the REV tag) and can compile to UTF8, complete with updated GUI. The decompile form is also updated, but decompiling still doesn't work.

Great progress!

It seems "Text Output Encoding" decompile option is not working (output is Shift-JIS regardless of the selection).

This is not a bug, but I noticed that Mugen Houyou fails to (re)compile in UTF-8 because a page overflows the 64KB address space. Messages take much more space in UTF-8: in particular, a hiragana character takes only 1 byte in SJIS when stored as hankaku, but 3 bytes in UTF-8 regardless of zenkaku or hankaku. This is unfortunate, but I don't think it's fixable.
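
The size difference can be estimated with a short sketch. The message here is artificial (30,000 repeated characters), and `encoded_size` is a made-up helper; real pages also contain commands, so this only illustrates the text portion.

```python
PAGE_MAX = 0xFFFF  # 65535 bytes: the largest offset a 16-bit address can hold


def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes a message occupies in the given encoding."""
    return len(text.encode(encoding))


zenkaku = "あ" * 30000   # full-width hiragana
hankaku = "ｱ" * 30000    # half-width form, as stored to save space

sizes = {
    "sjis, hankaku": encoded_size(hankaku, "shift_jis"),
    "sjis, zenkaku": encoded_size(zenkaku, "shift_jis"),
    "utf-8": encoded_size(zenkaku, "utf-8"),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes, fits in a page: {size <= PAGE_MAX}")
```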

@silas1037

An idea: would it be possible to implement a backlog function for system3?

@RottenBlock

RottenBlock commented Sep 9, 2021

It seems "Text Output Encoding" decompile option is not working (output is Shift-JIS regardless of the selection).

Ah, yes, I haven't gotten around to that yet, but I can probably do it before anything else!

This is not a bug, but I noticed that Mugen Houyou fails to (re)compile in UTF-8 because a page overflows the 64KB address space. Messages take much more space in UTF-8: in particular, a hiragana character takes only 1 byte in SJIS when stored as hankaku, but 3 bytes in UTF-8 regardless of zenkaku or hankaku. This is unfortunate, but I don't think it's fixable.

Yes, I saw that in Rance 4.1 when it tries to compile the Dangerous Tengu Legend novella. Because the novella is so unusual, I was hoping it wouldn't affect other games, but I guess Mugen Houyou clinches it. I'm afraid the most I can do is add an error message, but I'll add that all the same.

@silas1037

silas1037 commented Sep 12, 2021

It seems that 0xEBAB (12) is not shown on the Save/Load page, but displays normally in the main message. This looks like system3-sdl2's problem, since I put the 1995 ADISK into it and the bug still exists.

@RottenBlock

RottenBlock commented Sep 13, 2021

I think the reason for the above is that system3-sdl2 no longer supports the original ShiftJIS gaiji range, even when running in ShiftJIS mode. Or am I mistaken?

@silas1037

silas1037 commented Sep 13, 2021

I think the reason for the above is that system3-sdl2 no longer supports the original ShiftJIS gaiji range, even when running in ShiftJIS mode. Or am I mistaken?

But the new ADISK was compiled by the 0.8 compiler, so it should have been converted to U+XX.

@RottenBlock

RottenBlock commented Sep 13, 2021

EDIT: Nevermind this.

@kichikuou
Owner Author

I think the issue is that save data created at 12:45pm is displayed as PM :45.

This is not an issue with system3-sdl2 but a bug in the Rance 4.1 script. In page0001.adv, the Y 239 command stores the hour part (0-23) of the save time to D04. Think about what would happen if it were 12.
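
The script itself is not quoted in this thread, so the following is only a guess at how such a bug can arise: a naive 24-hour to 12-hour conversion that subtracts 12 from any hour of 12 or more yields 0 at noon, and there may be no glyph for 0. Both functions are illustrative and are not taken from page0001.adv.

```python
def naive_12h(hour: int):
    """Naive conversion (assumed bug): noon becomes PM 0."""
    if hour >= 12:
        return "PM", hour - 12
    return "AM", hour


def fixed_12h(hour: int):
    """Conventional conversion: 0 and 12 are both displayed as 12."""
    period = "PM" if hour >= 12 else "AM"
    return period, hour % 12 or 12


for h in (0, 11, 12, 23):
    print(h, naive_12h(h), fixed_12h(h))
```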

@RottenBlock

RottenBlock commented Sep 13, 2021

I think the issue is that save data created at 12:45pm is displayed as PM :45.

This is not an issue with system3-sdl2 but a bug in the Rance 4.1 script. In page0001.adv, the Y 239 command stores the hour part (0-23) of the save time to D04. Think about what would happen if it were 12.

Oh right, that! That's fixed in my version of rance41 and 42, so if silas1037 wants, they can check that for my fix. silas, check that and see if it fixes the problem you posted on the wiki, as well, please!

@silas1037

silas1037 commented Sep 13, 2021

Great, it's back to normal ♪
Alright, I will do a further check with a script comparison.

@kichikuou
Owner Author

Now game titles and language-dependent string constants can be overridden by system3.ini (see this commit message for details).

With this and the Unicode support, I believe it's no longer necessary to modify the system3-sdl2 code for translation.

@RottenBlock I'll keep (little_vampire|rance41|rance42)_eng game ids in nact_crc32.cpp for backward compatibility, but please use the id without _eng suffix when you release the next version.

@RottenBlock

Sure, that sounds good to me.

The decompiler should be fully compatible now! I'd appreciate any tests anyone wants to run. I just need to update the manuals at this point to account for UTF-8 and the like.

https://www.mediafire.com/file/lsmmyce9d362ql7/Sys0Decompiler+0.8+(WIP2).zip/file

Concerning the max page size we discussed up the thread: I assumed that it caps out at a full 65535 bytes, does that sound correct? It's possible I've forgotten some small detail that might change things by a byte or two.

@kichikuou
Owner Author

Yay, decompilation of a Unicode game worked!

AFAICT system3-sdl2 should be able to handle 65535-byte pages.

The error message for the max page size worked for Rance 4.1, but for Mugen Houyou it raised an unhandled System.OverflowException at SystemVersion.cs:434 trying to generate a 16-bit address:

	labelMap[strLabel].DestinationAddress = Convert.ToUInt16(outputStream.Position);// curAddress;

Maybe the check for outputStream.Length > PAGE_MAX should be inside the foreach(string line in lines) loop above.
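
The suggestion can be sketched in Python as follows. This is not Sys0Decompiler's code: the `label:` syntax, the little-endian address format, and the `PageOverflowError` name are all assumptions made for the example. The point is only that checking the size inside the loop guarantees every label address fits in 16 bits before it is converted.

```python
import struct

PAGE_MAX = 0xFFFF  # largest address representable in 16 bits


class PageOverflowError(Exception):
    """Raised when a compiled page exceeds the 16-bit address space."""


def compile_page(lines, encoding="utf-8"):
    out = bytearray()
    labels = {}
    for line in lines:
        if line.endswith(":"):
            # Safe: the check below ran after every previous line,
            # so len(out) is known to fit in an unsigned 16-bit value.
            labels[line[:-1]] = struct.pack("<H", len(out))
        else:
            out += line.encode(encoding)
        # Check inside the loop, before any later label needs an address.
        if len(out) > PAGE_MAX:
            raise PageOverflowError(f"page is {len(out)} bytes, max {PAGE_MAX}")
    return bytes(out), labels


page, labels = compile_page(["start:", "abc", "end:"])
print(page, labels)
```

With the check placed after the loop instead, a label that follows an oversized message would hit the 16-bit conversion first and fail with a less helpful error.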

Other than that, it's working perfectly so far. :)

@RottenBlock

RottenBlock commented Sep 18, 2021

Thanks, I think I've got that now, I've added checks to every compile-side ToInt16 in the program:

https://www.mediafire.com/file/lsmmyce9d362ql7/Sys0Decompiler+0.8+(WIP2).zip/file

Edit: Somehow I forgot to check for escape characters in messages. This has also been fixed.

@kichikuou
Owner Author

Confirmed that the unhandled exception in Mugen Houyou has been fixed, thanks!

@RottenBlock

https://www.mediafire.com/file/6llezbie6koe84s/Sys0Decompiler_0.8_Source.zip/file

Additional fixes in this one. Officially launched it over at the wiki.

@kichikuou
Owner Author

Congrats on the release!

It seems manuals are not included in the binary distribution (Sys0Decompiler 0.8 Release.zip). Is that intentional?

@kichikuou
Owner Author

I've updated README.md with a link to Sys0Decompiler.

I think we can close this. @RottenBlock @silas1037 Thank you for your cooperation!


3 participants