EUC-KR charset is not parsable #87

Open
vanniktech opened this issue Sep 25, 2024 · 7 comments

@vanniktech

I'm using version 0.1.9 with the ktor module to parse the response from this website: http://www.bodnara.co.kr/rss/rss_bodnara.xml

I get my source reader via response.bodyAsChannel().toByteArray().openSourceReader() and then call Ksoup.parse with an XML parser and the charset EUC-KR. However, this does not work on Android:

io.ktor.utils.io.charsets.MalformedInputException: Input length = 1
  at io.ktor.utils.io.charsets.CharsetJVMKt.throwExceptionWrapped(CharsetJVM.kt:370)
  at io.ktor.utils.io.charsets.CharsetJVMKt.decode(CharsetJVM.kt:241)
  at io.ktor.utils.io.charsets.EncodingKt.decode(Encoding.kt:103)
  at io.ktor.utils.io.charsets.EncodingKt.decode$default(Encoding.kt:101)
  at com.fleeksoft.ksoup.io.CharsetImpl.decode(CharsetImpl.kt:47)
  at com.fleeksoft.ksoup.io.KByteBuffer.readText(KByteBuffer.kt:63)
  at com.fleeksoft.ksoup.ported.io.StreamDecoder.implRead(StreamDecoder.kt:147)
  at com.fleeksoft.ksoup.ported.io.StreamDecoder.lockedRead(StreamDecoder.kt:87)
  at com.fleeksoft.ksoup.ported.io.StreamDecoder.read(StreamDecoder.kt:50)
  at com.fleeksoft.ksoup.ported.io.InputSourceReader.read(InputSourceReader.kt:46)
  at com.fleeksoft.ksoup.parser.CharacterReader.doBufferUp(CharacterReader.kt:76)
  at com.fleeksoft.ksoup.parser.CharacterReader.bufferUp(CharacterReader.kt:58)
  at com.fleeksoft.ksoup.parser.CharacterReader.current(CharacterReader.kt:222)
  at com.fleeksoft.ksoup.parser.TokeniserState$Data.read(TokeniserState.kt:12)
  at com.fleeksoft.ksoup.parser.Tokeniser.read(Tokeniser.kt:38)
  at com.fleeksoft.ksoup.parser.TreeBuilder.stepParser(TreeBuilder.kt:129)
  at com.fleeksoft.ksoup.parser.TreeBuilder.runParser(TreeBuilder.kt:112)
  at com.fleeksoft.ksoup.parser.TreeBuilder.parse(TreeBuilder.kt:77)
  at com.fleeksoft.ksoup.parser.Parser.parseInput(Parser.kt:61)
  at com.fleeksoft.ksoup.helper.DataUtil.parseInputSource(DataUtil.kt:179)
  at com.fleeksoft.ksoup.helper.DataUtil.parseInputSource(DataUtil.kt:77)
  at com.fleeksoft.ksoup.helper.DataUtil.load(DataUtil.kt:44)
  at com.fleeksoft.ksoup.Ksoup.parse(Ksoup.kt:70)

I saw commit 0b76b21, but I'm not on Windows, so it should work?
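
For readers unfamiliar with this exception type: on the JVM, a strict CharsetDecoder throws MalformedInputException when the bytes are not a valid sequence for the charset, for example when a two-byte EUC-KR character is cut in half. The sketch below only illustrates how the exception type can arise; whether this is what happens inside Ksoup or Ktor 2 on Android is not established in this thread. The helper name strictDecode is hypothetical.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.Charset
import java.nio.charset.MalformedInputException

// Hypothetical helper: decode with a fresh decoder, which reports
// malformed input by throwing instead of substituting a replacement char.
fun strictDecode(bytes: ByteArray, charsetName: String): String =
    Charset.forName(charsetName)
        .newDecoder()
        .decode(ByteBuffer.wrap(bytes))
        .toString()

fun main() {
    // Two Hangul syllables, two bytes each in EUC-KR.
    val full = "보드".toByteArray(Charset.forName("EUC-KR"))
    println(strictDecode(full, "EUC-KR"))

    // Drop the last byte so the second character is incomplete.
    val cut = full.copyOf(full.size - 1)
    try {
        strictDecode(cut, "EUC-KR")
    } catch (e: MalformedInputException) {
        println("Strict decoding failed: ${e.message}")
    }
}
```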

@itboy87
Collaborator

itboy87 commented Sep 25, 2024

@vanniktech Can you please mention which variant you are using?

@vanniktech
Author

com.fleeksoft.ksoup:ksoup-ktor2:0.1.9

@itboy87
Collaborator

itboy87 commented Sep 25, 2024

@vanniktech I tested with the following code and it worked fine:

val doc = Ksoup.parseGetRequest("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
assertEquals("보드나라::전체기사", doc.selectFirst("title")?.text())

and this also worked fine:

val httpResponse = NetworkHelperKtor.instance.get("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
val doc = Ksoup.parse(sourceReader = httpResponse.asSourceReader(), baseUri = "", parser = Parser.xmlParser())
assertEquals("보드나라::전체기사", doc.selectFirst("title")?.text())

Can you please share the code you use to read the bytes from the web?

@vanniktech
Author

This also works for me on the JVM (Desktop / Mac):

suspend fun main() {
  val url = "http://www.bodnara.co.kr/rss/rss_bodnara.xml"
  val request = HttpRequestBuilder().apply {
    url(url)
  }

  val response = HttpClient().get(request)
  val document = Ksoup.parse(
    sourceReader = response.bodyAsChannel().toByteArray().openSourceReader(),
    baseUri = url,
    charsetName = response.charset(),
    parser = Parser.xmlParser(),
  )

  println(document)
}

fun HttpResponse.charset() = headers[HttpHeaders.ContentType]?.asContentTypeOrNull()?.parameter("charset")
  ?: "UTF-8"

// https://youtrack.jetbrains.com/issue/KTOR-6241/Lenient-Content-Type-Parsing
internal fun String.asContentTypeOrNull() =
  runCatching { ContentType.parse(replace(", charset=", "; charset=")) }.getOrNull()

The same code crashes on Android though with the exception from my original issue. Did you try it on an Android emulator?
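
The charset lookup in the snippet above can also be sketched without Ktor types, which makes the lenient handling of a comma-separated parameter easier to see. This is only an illustration: charsetFromContentType is a hypothetical name, the header values in main are made-up examples, and the UTF-8 fallback mirrors the original helper.

```kotlin
// Hypothetical stand-alone version of the charset lookup.
// Accepts both "; charset=" and the non-standard ", charset=" separator.
fun charsetFromContentType(header: String?): String {
    if (header == null) return "UTF-8"
    return header.split(';', ',')
        .map { it.trim() }
        .firstOrNull { it.startsWith("charset=", ignoreCase = true) }
        ?.substringAfter('=')
        ?.trim('"', ' ')
        ?.takeIf { it.isNotEmpty() }
        ?: "UTF-8" // same default as the original helper
}

fun main() {
    // Made-up header values for illustration only.
    println(charsetFromContentType("text/xml, charset=EUC-KR"))
    println(charsetFromContentType("text/xml; charset=\"utf-8\""))
    println(charsetFromContentType(null))
}
```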

@itboy87
Collaborator

itboy87 commented Sep 26, 2024

@vanniktech Yes, there is an issue with the EUC-KR charset in Android with Ktor 2, but it’s working fine with Ktor 3. I’m looking into whether I can fix it on my end.

@itboy87
Collaborator

itboy87 commented Sep 26, 2024

@vanniktech I’m trying to fix the issue; in the meantime, you can try this, which works fine:

val httpResponse = NetworkHelperKtor.instance.get("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
val doc = Ksoup.parse(
    html = httpResponse.bodyAsText(),
    baseUri = "",
    parser = Parser.xmlParser(),
)

Reading text from ChannelBody and parsing it works fine.
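
When the raw bytes are already in hand (as with toByteArray() earlier in the thread), a similar effect can be sketched by decoding the whole body to a String in one step and passing that String to Ksoup.parse(html = ...). The sketch below uses only the JVM standard library; decodeBody is a hypothetical helper, and it has not been verified on Android as part of this thread.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: decode the complete body at once on JVM/Android.
// Falls back to UTF-8 when the charset name is missing or unsupported.
fun decodeBody(bytes: ByteArray, charsetName: String?): String {
    val charset = runCatching { Charset.forName(charsetName ?: "UTF-8") }
        .getOrDefault(Charsets.UTF_8)
    return String(bytes, charset)
}

fun main() {
    // Stand-in for the response bytes; a real call would use the HTTP body.
    val sample = "보드나라".toByteArray(Charset.forName("EUC-KR"))
    val text = decodeBody(sample, "EUC-KR")
    println(text)
    // The decoded text could then be passed on, for example:
    // Ksoup.parse(html = text, baseUri = "", parser = Parser.xmlParser())
}
```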

@vanniktech
Author

That's neat. I've changed it. Maybe Ktor 3 has a better charset implementation due to the switch to kotlinx-io?
