[Postgresql] Hangul(Korean alphabet) encoding problem #66

Open
bjtj opened this issue Feb 9, 2025 · 16 comments

@bjtj

bjtj commented Feb 9, 2025

Note: This is my first time submitting a GitHub issue. Please let me know if anything is inappropriate. I appreciate your understanding.

Problem

Description:

  • When I use Babashka SQL pods with PostgreSQL, Hangul (Korean alphabet) encoding gets corrupted.
  • Inserting Hangul data results in corruption.

Environment:

  • Windows 10
  • babashka v1.3.191
  • PostgreSQL 17.2 on x86_64-windows, compiled by msvc-19.42.34435, 64-bit

bb.edn:

{:pods {org.babashka/postgresql {:version "0.1.0"}}}

main.clj:

(require '[pod.babashka.postgresql :as pg])

(def db {:dbtype   "postgresql"
         :host     "localhost"
         :dbname   "mytest"
         :user     "test"
         :password "test"
         :port     5432})

(pg/execute! db ["SELECT '안녕?'"])

result:

[{:?column? "ì•ˆë…•?"}]

expected:

[{:?column? "안녕?"}]

plus:

After inserting Hangul (Korean) data and querying it in psql, the encoding appears to be corrupted.

Server & Client encoding configuration

(pg/execute! db ["show server_encoding;"])

  • [{:server_encoding "UTF8"}]

(pg/execute! db ["show client_encoding;"])

  • [{:client_encoding "UTF8"}]

Comparative experiment

  1. This Python code works as expected:

import psycopg

def main():
    with psycopg.connect('dbname=mytest user=test password=test') as conn:
        with conn.cursor() as cur:
            # execute() returns the cursor itself, so fetch the row before printing
            print(cur.execute("SELECT '안녕?'").fetchone())

main()

  2. Clojure + next.jdbc works as expected.
  • I tried using Babashka with next.jdbc, but next.jdbc doesn't seem to work in Babashka. Is that correct?

Thanks.

@bjtj changed the title from "Hangul(Korean alphabet) encoding problem" to "[Postgresql] Hangul(Korean alphabet) encoding problem" on Feb 9, 2025

@borkdude
Collaborator

borkdude commented Feb 9, 2025

I've tried to reproduce this on macOS and I get the following:

(require '[babashka.pods :as p])

(p/load-pod 'org.babashka/postgresql {:version "0.1.3"})

(require '[pod.babashka.postgresql :as pg])

(def db {:dbtype   "postgresql"
         :host     "localhost"
         :dbname   "postgres"
         :user     "test"
         :password "test"
         :port     5432})

(prn (-> (pg/execute! db ["SELECT '안녕?' as foo"])
         first
         :foo))

The output is "안녕?", which looks correct.
Note that I used pod version 0.1.3. Could you try this on WSL2 (Linux) to see if it works for you? Then we could narrow it down to a bb, pod, or OS issue.

@bjtj
Author

bjtj commented Feb 9, 2025

I tried it on WSL2 as you suggested. When I tested connecting to the PostgreSQL server on the host machine from WSL2, it worked fine. 👀

So this appears to be an issue that occurs only on Windows.

Thanks,

@borkdude
Collaborator

borkdude commented Feb 10, 2025

Can you try this:

(require '[babashka.pods :as p])

(p/load-pod 'org.babashka/postgresql {:version "0.1.3"})

(require '[pod.babashka.postgresql :as pg])

(def db {:dbtype   "postgresql"
         :host     "localhost"
         :dbname   "postgres"
         :user     "test"
         :password "test"
         :port     5432})

(spit "foo.txt" 
  (-> (pg/execute! db ["SELECT '안녕?' as foo"])
         first
         :foo))

and then open foo.txt in a text editor (e.g. VSCode) to check whether the characters are correct in there?
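
If the editor's rendering is itself in doubt, the returned string can also be inspected numerically. The following is only a sketch and was not run in this thread; `code-units` is a hypothetical helper name, and the sample string is written with `\u` escapes so that the check does not depend on how the script file is encoded.

```clojure
;; Hypothetical helper (not part of babashka or any pod API):
;; list the UTF-16 code units of a string as upper-case hex.
;; Hangul syllables sit in the Basic Multilingual Plane, so here
;; one code unit corresponds to one code point.
(defn code-units [s]
  (mapv #(format "%04X" (int %)) s))

;; "\uC548\uB155?" is the test string from this issue, written with escapes.
(def expected-units (code-units "\uC548\uB155?"))

(println expected-units)
```

For an intact string this prints ["C548" "B155" "003F"]. A longer list of unrelated values would indicate that the bytes were decoded with the wrong charset before the string ever reached the file or the console.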

@bjtj
Author

bjtj commented Feb 10, 2025

I tested it right away with the code you sent, but the result was the same.

foo.txt:

ì•ˆë…•?

screenshot:

Image

Image

Thanks,

@borkdude
Collaborator

Can you test just writing the string directly to the file without the database to see if the same problem occurs? Perhaps it's not even a sql pod issue. Thanks

@bjtj
Author

bjtj commented Feb 10, 2025

Of course. I tested it right away, and there was no problem when saving the file directly.

Image

For reference, there was no issue with the SQLite pod, org.babashka/go-sqlite3 {:version "0.1.0"}.

Image

It seems to be a problem that is not easy to solve.

Thanks,

@borkdude
Collaborator

Have you tried pod version 0.1.3 on Windows?

@bjtj
Author

bjtj commented Feb 10, 2025

Yes, all attempts were made on Windows. Pod version 0.1.3 was also tested on Windows. So far, the issue has occurred on Windows, but there were no problems in WSL2.

@borkdude
Collaborator

Could you also test hsqldb on Windows with pod version 0.1.3? If that works, then we know the problem is specific to the postgres pod on Windows.

@bjtj
Author

bjtj commented Feb 10, 2025

Yes, no problem. I will test it and let you know.

@bjtj
Author

bjtj commented Feb 10, 2025

Oh, it seems the same issue occurs in hsqldb as well.

bb.edn:

{:pods {org.babashka/hsqldb {:version "0.1.3"}}}

Image

@borkdude
Collaborator

Interesting

@borkdude
Collaborator

I think it makes sense to upgrade all the builds to the latest Oracle GraalVM (23), add a test for this, and then see what happens. It could be a matter of setting -J-Dfile.encoding=UTF-8 during the build, as described here:

oracle/graal#2492

@borkdude
Collaborator

One more idea: the problem could be with string decoding via the single-arg constructor, (String. v), which uses the platform default charset.

Can you try in your version of bb the following:

(String. (.getBytes "안녕?"))
;; vs
(String. (.getBytes "안녕?") java.nio.charset.StandardCharsets/UTF_8)

to see if you see a different result?
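
The comparison above can be made explicit with a small sketch. It is an illustration under stated assumptions, not code from the pods: `round-trips?`, `decode-explicit` and `decode-default` are hypothetical names, and the sample string uses `\u` escapes so the source file encoding plays no role.

```clojure
(import '(java.nio.charset Charset StandardCharsets))

;; The test string from this issue, written with escapes.
(def sample "\uC548\uB155?")

;; Decode with an explicit charset: independent of platform settings.
(defn decode-explicit [^bytes bs]
  (String. bs StandardCharsets/UTF_8))

;; Decode with the single-arg constructor: uses the default charset.
(defn decode-default [^bytes bs]
  (String. bs))

;; Hypothetical helper: encode as UTF-8, decode with `decode`, compare.
(defn round-trips? [^String s decode]
  (= s (decode (.getBytes s StandardCharsets/UTF_8))))

(def explicit-ok? (round-trips? sample decode-explicit))
(def default-ok?  (round-trips? sample decode-default))

(println "default charset:" (str (Charset/defaultCharset)))
(println "explicit UTF-8 round trip ok?" explicit-ok?)
(println "default charset round trip ok?" default-ok?)
```

The explicit variant should always report true. The default variant is expected to report true only where the default charset is UTF-8, so a false there would point at the process's default charset rather than at the database.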

@bjtj
Author

bjtj commented Feb 12, 2025

First, I tested it with the code you provided.

Image

Image

I also saved it as a file, and both the file size and the data were identical.

Image

I think the values are changing in the process of transferring the data to the database.

(pg/execute! db ["CREATE TABLE IF NOT EXISTS foo (text VARCHAR(256))"])
(pg/execute! db ["INSERT INTO foo (text) VALUES (?)" (String. (.getBytes "안녕?"))])
(pg/execute! db ["INSERT INTO foo (text) VALUES (?)" (String. (.getBytes "안녕?")
                                                              java.nio.charset.StandardCharsets/UTF_8)])

Image

I'm not sure if I fully understood your code, but it seems that you're simply passing the values to next.jdbc, so I'll check the next.jdbc side.

(defn execute!
  ([db-spec sql-params]
   (execute! db-spec sql-params nil))
  ([db-spec sql-params opts]
   ;; (.println System/err (str sql-params))
   (let [conn (->connectable db-spec)
         res  (jdbc/execute! conn sql-params opts)]
     res)))

Thanks,

@borkdude
Collaborator

I didn't mean that you would insert the result into the database. The original output looks similar to this:

(String. (.getBytes "안녕?") (java.nio.charset.Charset/forName "CP1252"))
"ì•ˆë…•?"

so I think it's an encoding mismatch somewhere. I read that from JDK 18 onwards the default encoding is UTF-8 unless otherwise specified, so this might fix it. The SQL pods are currently still built with JDK 11, so upgrading should help.

I'll try to reproduce this problem on my own Windows machine. Thanks for your patience.
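
The suspected mismatch can be reproduced on any platform by naming both charsets explicitly. This is a sketch under the assumption that the wrong charset is CP1252 (windows-1252), as in the snippet above; `encode` and `decode` are hypothetical helper names. The reverse step is shown only to confirm the diagnosis and is not proposed as a fix.

```clojure
(import '(java.nio.charset Charset StandardCharsets))

(def cp1252 (Charset/forName "windows-1252"))

;; Hypothetical helpers with explicit charsets, so nothing here
;; depends on the platform default.
(defn encode ^bytes [^String s ^Charset cs] (.getBytes s cs))
(defn decode [^bytes bs ^Charset cs] (String. bs cs))

;; The test string from this issue, written with escapes.
(def original "\uC548\uB155?")

;; UTF-8 bytes misread as CP1252: every byte becomes its own character.
(def mojibake (decode (encode original StandardCharsets/UTF_8) cp1252))

;; Reversing the wrong decode recovers the original bytes and text.
(def repaired (decode (encode mojibake cp1252) StandardCharsets/UTF_8))

(println "mojibake length:" (count mojibake))
(println "repaired equals original?" (= original repaired))
```

The three-character string turns into seven characters because each Hangul syllable occupies three bytes in UTF-8. That the reverse step recovers the text is consistent with a one-off charset mismatch rather than data loss.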
