Add JSON extraction embedded in HTML script element #4106

Open: hkcomori wants to merge 1 commit into master

hkcomori

I want to extract JSON embedded in HTML script elements so that it can be processed with a JSON dotted path.
To do this, I have added a format that outputs only the bare content of an item.
"Barejson" is a term I coined because a plain format name could not describe the behavior.
If you have a better name, I would be happy to adopt it.

This format can output only one item, so an error occurs if more or fewer than one item is found.
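
The "exactly one item" rule can be illustrated with a minimal Python sketch. This is not the code in this PR; `bare_content`, `BareContentError`, and the item shape (a dict with a `content` key) are hypothetical names chosen for illustration.

```python
import json


class BareContentError(ValueError):
    """Raised when there is not exactly one item (hypothetical name)."""


def bare_content(items):
    """Return the content of the single item, unchanged.

    `items` is assumed to be a list of dicts with a "content" key.
    """
    if len(items) != 1:
        raise BareContentError(f"expected exactly 1 item, found {len(items)}")
    return items[0]["content"]


raw = bare_content([{"title": "JSON", "content": '{"key": "value"}'}])
print(json.loads(raw)["key"])  # value
```

Because the output is the content alone, with no feed wrapper around it, there is no sensible way to emit zero or several items, which is why the sketch raises instead of guessing.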

This was prompted by the following discussion:
FreshRSS/FreshRSS#6406

@dvikan
Contributor

dvikan commented May 14, 2024

Sorry, I don't understand the use case here.

Maybe show an example usage?

@hkcomori
Author

I want to use a JSON dotted path to get information from JSON that is embedded in a script element, such as the following one on this page.
It contains information about the articles that should appear in the RSS feed.
Some of this information can be read from the HTML, but some is available only in the JSON.

<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"workId":"018d6a5c-b9f2-77db-9191-e7cc6fbfdce2", ... }</script>

The JSON must be separated from the HTML because a JSON dotted path cannot be applied to HTML.
For this purpose, the feature in this PR extracts the JSON, and the JSON dotted path then processes the result.
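
To make the two steps concrete, here is a rough Python sketch using only the standard library. It is an illustration of the idea, not the implementation in this PR or in FreshRSS: `ScriptJsonExtractor`, `extract_embedded_json`, and `dotted_get` are hypothetical names, and the sample page is a made-up, shortened stand-in for the real one.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text of the <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_embedded_json(html, script_id="__NEXT_DATA__"):
    """Step 1: separate the JSON from the surrounding HTML."""
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    return json.loads("".join(parser.chunks))


def dotted_get(data, path):
    """Step 2: walk nested dicts (and lists, by index) along 'a.b.c'."""
    current = data
    for key in path.split("."):
        current = current[int(key)] if isinstance(current, list) else current[key]
    return current


# Made-up sample page; the real page carries far more data.
sample_page = (
    "<html><body>"
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props":{"pageProps":{"workId":"example-work-id"}}}'
    "</script></body></html>"
)

data = extract_embedded_json(sample_page)
print(dotted_get(data, "props.pageProps.workId"))  # example-work-id
```

The dotted path only makes sense once step 1 has produced a parsed JSON document, which is the separation described above.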

@hkcomori
Author

XPathBridge example:

Enter web page URL: https://comic-walker.com/detail/KC_003160_S?episodeType=latest
Item selector: //script[@id="__NEXT_DATA__"]
Item title selector: "JSON"
Item description selector: ./text()
Use raw item description: true
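
For readers unfamiliar with the selectors, the following Python sketch mimics what they pick out. It is only an approximation under stated assumptions: it uses `xml.etree.ElementTree`, which supports a limited XPath subset and needs well-formed markup, so the input is a made-up miniature page rather than the real one, and `run_selectors` is a hypothetical helper, not part of XPathBridge.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the fetched page.
PAGE = (
    "<html><body>"
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"workId": "example-work-id"}}}'
    "</script>"
    "</body></html>"
)


def run_selectors(page):
    root = ET.fromstring(page)
    items = []
    # Item selector: //script[@id="__NEXT_DATA__"]
    for node in root.findall(".//script[@id='__NEXT_DATA__']"):
        items.append({
            # Item title selector: the literal string "JSON"
            "title": "JSON",
            # Item description selector: ./text(), kept as-is
            # (assumed to be the effect of "Use raw item description")
            "content": node.text or "",
        })
    return items


items = run_selectors(PAGE)
print(len(items))  # 1
print(json.loads(items[0]["content"])["props"]["pageProps"]["workId"])  # example-work-id
```

The single resulting item has the embedded JSON text as its description, which is the content the proposed format would then output on its own.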

@hkcomori
Author

Would it be better to create a bridge that extracts RSS items from the embedded JSON, instead of adding a format like this for intermediate files?

@dvikan
Contributor

dvikan commented May 15, 2024

  1. are you aware that there already exists a JsonFormat?

  2. Have you tested this PR and it does what you need?

@hkcomori
Author

hkcomori commented May 15, 2024

  1. are you aware that there already exists a JsonFormat?

Of course, I first tried JsonFormat.
I expected the following results:

{
    ...
    "content": {
        "key": "value"
    }
}

In fact, the content was converted to a string, so the raw JSON could not be extracted:

{
    ...
    "content": "{\"key\": \"value\"}"
}

  2. Have you tested this PR and it does what you need?

Yes. I confirmed that the result is the raw JSON content and that a JSON dotted path can process it.
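
The difference between the two outputs shown above can be reproduced with a short Python sketch. It does not call RSS-Bridge's JsonFormat; it only imitates the two shapes with the standard `json` module to show why a dotted path such as `content.key` fails on the second one.

```python
import json

embedded = {"key": "value"}

# Expected shape: "content" is a nested JSON object.
expected = json.dumps({"content": embedded})

# Observed shape: "content" is a string that itself contains JSON.
observed = json.dumps({"content": json.dumps(embedded)})

print(expected)  # {"content": {"key": "value"}}
print(observed)  # {"content": "{\"key\": \"value\"}"}

# A path like content.key works directly on the expected shape...
print(json.loads(expected)["content"]["key"])  # value

# ...but the observed shape needs a second decoding step first,
# which a plain dotted-path lookup does not perform.
inner = json.loads(json.loads(observed)["content"])
print(inner["key"])  # value
```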
