This repository has been archived by the owner on Jun 5, 2020. It is now read-only.

Process embedded JSON in HTML with metafacture-fix #3

Closed
fsteeg opened this issue Feb 5, 2020 · 7 comments

Comments

@fsteeg
Member

fsteeg commented Feb 5, 2020

For a scenario as in https://github.com/programmieraffe/oerhoernchen20#technical-background, looking at a resource with embedded JSON like https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f, we want to process that resource with metafacture-fix to create JSON output that can be indexed with Elasticsearch. Fixes should be configurable in a UI like http://test.lobid.org/fix.

@fsteeg fsteeg self-assigned this Feb 5, 2020
@fsteeg
Member Author

fsteeg commented Feb 5, 2020

With HTML input support in metafacture/metafacture-core#312 and URL input support in metafacture/metafacture-fix#6, we can access the script content with metafacture-fix:

http://test.lobid.org/fix/xtext-service/run?flux="https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f"|open-http|decode-html|fix|encode-json(prettyPrinting="true")&fix=map(html.head.script.value,json)&data=

My initial idea was to set up something like this:

"https://www.oerbw.de/..." | open-http | decode-html | fix("1.fix") | decode-json | fix("2.fix") | encode-json

That is, parse the HTML, pick out the JSON data in the first Fix (with something like the map(html.head.script.value,json) Fix in the link above), decode that as JSON, pass it to a second Fix to pick out the fields we need, and encode the final JSON for the index. However, since the data flowing out of the first Fix would have to be an entire record, not a field, this would not exactly fit into the workflow architecture. Instead, it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:

"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json

That is, we decode an HTML document as JSON, by looking for embedded JSON in the HTML.

@acka47
Contributor

acka47 commented Feb 5, 2020

it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:

"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json

This looks fine. However, it should then first try to get JSON-only via the accept header ($ curl -H "accept: application/json" https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f), as this would be the ideal way if servers provide it. I don't know whether it makes sense to build those two approaches into one decode-json command.

@fsteeg
Member Author

fsteeg commented Feb 5, 2020

However, it should then first try to get JSON-only via the accept header ($ curl -H "accept: application/json" https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f), as this would be the ideal way if servers provide it.

The accept header is actually a config option of the open-http step (see metafacture/metafacture-core@9be4ec0), so if the service supported it, we could set up the Flux like this:

"https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f" | open-http(accept="application/json") | decode-json | fix | encode-json

You can test that in http://test.lobid.org/fix with a Flux like:

"http://lobid.org/gnd/5093230-5" | open-http(accept="application/json") | as-lines

"http://lobid.org/gnd/5093230-5" | open-http(accept="text/html") | as-lines
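For readers who want to try the same content negotiation outside of Flux, here is a minimal Python sketch using only the standard library. It is not how open-http is implemented; the helper names `build_request` and `fetch_lines` are made up for this illustration, and whether a given server honours the accept header is an assumption that has to be checked per service.

```python
from typing import List
from urllib.request import Request, urlopen


def build_request(url: str, accept: str) -> Request:
    """Create a GET request that asks the server for the given media type."""
    return Request(url, headers={"Accept": accept})


def fetch_lines(url: str, accept: str) -> List[str]:
    """Fetch the URL and return the response body split into lines.

    Roughly comparable to `open-http(accept=...) | as-lines` in the Flux
    examples above (hypothetical helper, requires network access).
    """
    with urlopen(build_request(url, accept)) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset).splitlines()
```

Calling `fetch_lines("http://lobid.org/gnd/5093230-5", "application/json")` and then the same URL with `"text/html"` should show the two representations, provided the service is reachable.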

@fsteeg
Member Author

fsteeg commented Feb 6, 2020

Instead, it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:
"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json
That is, we decode an HTML document as JSON, by looking for embedded JSON in the HTML.

When we discussed this today, @dr0i objected that this is rather confusing, as we would open an HTML document with a JSON decoder. Additionally, it would pull the jsoup dependency into the metafacture-json project. Instead, we came up with a small module that only extracts the JSON from the HTML. This could be part of metafacture-html and would have no dependency on metafacture-json. It would be used like this:

"https://www.oerbw.de/..." | open-http | extract-json | decode-json | fix | encode-json

@fsteeg
Member Author

fsteeg commented Feb 6, 2020

Deployed to http://test.lobid.org/fix:

Flux:

"https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f" | open-http | extract-script | decode-json | fix | encode-json(prettyPrinting="true")

Fix:

map(name, title)
map(description, description)

Output:

{
  "description" : "Das Lehrvideo ist Teil der Lehrveranstaltung \"Mathematik für Designer\". In diesem 3. Lehrvideo wird die projektive Geometrie der Ebene als Grundlage von homogenen Koordinaten erklärt. Im nächsten Video wird dann erklärt, wie affine Abbildungen mit homogenen Koordinaten dargestellt werden können.",
  "title" : "Projektive Geometrie und Homogene Koordinaten"
}
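To make the effect of the two map rules concrete, here is a small Python sketch of the renaming they perform on a decoded JSON record. This is an illustration only, not metafacture-fix code: `apply_map_rules` and `RULES` are hypothetical names, and the sketch assumes that fields without a rule are dropped, as the output above suggests.

```python
from typing import Any, Dict, List, Tuple

# (source field, target field) pairs mirroring map(name, title) and
# map(description, description) from the Fix above.
RULES: List[Tuple[str, str]] = [
    ("name", "title"),
    ("description", "description"),
]


def apply_map_rules(record: Dict[str, Any],
                    rules: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Return a new record that holds only the mapped fields, renamed.

    Assumption: unmapped fields are dropped and missing source fields
    are silently skipped.
    """
    return {target: record[source]
            for source, target in rules
            if source in record}
```

Given a record with `name`, `description` and further fields, the result contains only `title` and `description`.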

I've used extract-script (instead of extract-json) to have it both more generic (can get any script) and more HTML-specific (since it's part of metafacture-html). For now it always takes the first script. If we need other scripts in other examples we could easily extend the component to support an index, e.g. to get the second script: extract-script("2").
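The behaviour described above (take the first script, optionally select another by index) can be sketched in Python with the standard-library HTML parser. This is a stand-in for illustration, not the metafacture-html component, which per the discussion relies on jsoup; `ScriptCollector` and `extract_script` are hypothetical names, and the 1-based index is an assumption derived from the extract-script("2") example.

```python
from html.parser import HTMLParser
from typing import List


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.scripts: List[str] = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_script(html: str, index: int = 1) -> str:
    """Return the content of the index-th script (1-based, assumed convention)."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts[index - 1].strip()
```

The returned string can then be handed to a JSON parser (`json.loads` in Python), which corresponds to the decode-json step in the Flux above.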

@acka47
Contributor

acka47 commented Feb 7, 2020

+1

@acka47
Contributor

acka47 commented Apr 6, 2020

Moved to https://gitlab.com/oersi/oersi-etl/-/issues/3. Closing.

@acka47 acka47 closed this as completed May 15, 2020

2 participants