This repository was archived by the owner on Jun 5, 2020. It is now read-only.

Process HTML DOM elements with metafacture-fix #2

Closed
fsteeg opened this issue Feb 5, 2020 · 4 comments

@fsteeg
Member

fsteeg commented Feb 5, 2020

The scenario is the one described in https://github.com/programmieraffe/oerhoernchen20#technical-background: starting from a sitemap like https://www.hoou.de/sitemap.xml, we find OER materials like https://www.hoou.de/materials/tutorial-lernen-lernen and process each resource with metafacture-fix to create JSON output that can be indexed with Elasticsearch. The Fixes should be configurable in a UI like http://test.lobid.org/fix.

@fsteeg fsteeg self-assigned this Feb 5, 2020
@fsteeg
Member Author

fsteeg commented Feb 5, 2020

With HTML input support in metafacture/metafacture-core#312 and URL input support in metafacture/metafacture-fix#6, we can use metafacture-fix to convert the full DOM structure of something like https://www.hoou.de/materials/tutorial-lernen-lernen to JSON:

http://test.lobid.org/fix/xtext-service/run?flux="https://www.hoou.de/materials/tutorial-lernen-lernen"|open-http|decode-html|fix|encode-json(prettyPrinting="true")&fix=map(_else)&data=
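The link above carries the Flux and the Fix as raw query parameters, including quotes, pipes and parentheses. As a minimal sketch, the same URL could be assembled with percent-encoding so it survives copy and paste or use from a script. The endpoint and the parameter names `flux`, `fix` and `data` are taken from the link; the helper name `build_run_url` is hypothetical, and whether the service is still reachable has not been verified.

```python
from urllib.parse import urlencode

# Endpoint and parameter names (flux, fix, data) are copied from the link above.
RUN_ENDPOINT = "http://test.lobid.org/fix/xtext-service/run"


def build_run_url(source_url: str, fix: str, data: str = "") -> str:
    """Return a percent-encoded run URL for one source page and one Fix.

    Hypothetical helper: it only builds the URL, it does not call the service.
    """
    # Same Flux as in the link: open the URL, decode HTML, apply the Fix, emit JSON.
    flux = (
        f'"{source_url}"'
        "|open-http|decode-html|fix"
        '|encode-json(prettyPrinting="true")'
    )
    query = urlencode({"flux": flux, "fix": fix, "data": data})
    return f"{RUN_ENDPOINT}?{query}"


url = build_run_url(
    "https://www.hoou.de/materials/tutorial-lernen-lernen",
    "map(_else)",
)
print(url)
```

Encoding the parameters avoids problems with clients that reject unescaped quotes or pipes in a URL.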

To pick out just the title and the description, in http://test.lobid.org/fix, we can use a Fix like:

map(html.head.title.value, title)
map(html.body.div.div.div.div.div.div.div.p.value, description)

With the Flux from the link above:

"https://www.hoou.de/materials/tutorial-lernen-lernen"|open-http|decode-html|fix|encode-json(prettyPrinting="true")

We get some concise JSON back:

{
  "title" : "Tutorial: Lernen lernen - HOOU",
  "description" : "Das Bewusstsein und die Kenntnis über Ihren Lernstil kann Ihnen helfen, Ihren Lernansatz und damit auch den Lernerfolg zu optimieren. In diesem Modul reflektieren Sie Ihren Lernstil und dessen Implikationen und entwickeln individuelle Lernstrategien. Zudem hilft Ihnen das Wissen über unterschiedliche Lernstile beim Lernen in der Gruppe oder bei der Teamarbeit."
}
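To make the effect of the two `map` rules concrete, here is a small Python sketch of a dotted-path lookup over a nested record. It is an illustration only, not how metafacture-fix is implemented: the record layout (nested dicts with a `value` key), the placeholder description text and the helper names `get_path` and `apply_maps` are all assumptions.

```python
from typing import Any, Optional

TITLE_PATH = "html.head.title.value"
DESCRIPTION_PATH = "html.body.div.div.div.div.div.div.div.p.value"


def get_path(record: dict, path: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; None if any step is missing."""
    node: Any = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def apply_maps(record: dict, rules: list[tuple[str, str]]) -> dict:
    """Copy every source path that resolves to its target field name."""
    out = {}
    for source, target in rules:
        value = get_path(record, source)
        if value is not None:
            out[target] = value
    return out


def nested_body(div_depth: int) -> dict:
    """Build a toy body: a paragraph wrapped in div_depth nested divs."""
    body: dict = {"p": {"value": "(description text)"}}
    for _ in range(div_depth):
        body = {"div": body}
    return body


# Hypothetical, heavily simplified stand-in for the decoded page.
depth = DESCRIPTION_PATH.split(".").count("div")
record = {
    "html": {
        "head": {"title": {"value": "Tutorial: Lernen lernen - HOOU"}},
        "body": nested_body(depth),
    }
}
rules = [(TITLE_PATH, "title"), (DESCRIPTION_PATH, "description")]

print(apply_maps(record, rules))
# One wrapper div fewer: in this sketch the description path no longer resolves.
record["html"]["body"] = nested_body(depth - 1)
print(apply_maps(record, rules))
```

In this sketch a path that does not resolve simply yields no field, so the second call returns only the title.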

So this basically works. However, the path html.body.div.div.div.div.div.div.div.p.value is brittle: if the page's internal structure changes, the Fix has to change with it. It would be better to have support for conditionals in the Fix and to pick the description entry from html.head.meta.content instead, see metafacture/metafacture-fix#10.
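The idea of selecting the description by a condition rather than by position can be sketched outside of Fix. The Python below uses only the standard library's `html.parser`; it is not the conditional syntax discussed in metafacture/metafacture-fix#10, and the class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # The condition: only meta elements whose name attribute is "description".
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def extract_description(html_text: str) -> Optional[str]:
    """Return the meta description of an HTML document, or None if absent."""
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.description


# Hypothetical sample page, not the real HOOU markup.
sample = (
    "<html><head><title>Example</title>"
    '<meta charset="utf-8">'
    '<meta name="description" content="A short summary.">'
    "</head><body><div><div><p>Other text</p></div></div></body></html>"
)
print(extract_description(sample))
```

Because the lookup depends on the `name` attribute and not on nesting depth, moving or re-wrapping elements in the body does not affect the result in this sketch.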

@fsteeg
Member Author

fsteeg commented Feb 6, 2020

Both this and #3 basically work (we get a title and a description). Maybe it makes sense to continue with the bigger picture (collecting sources from the sitemap.xml, indexing the results) instead of improving the way we extract the description at this point?
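As a rough sketch of the "collecting sources from the sitemap.xml" step, the Python below reads `loc` entries from a sitemap and keeps those that look like material pages. The namespace is the one defined by the sitemaps.org protocol; the `/materials/` filter, the function name, the flat `urlset` layout and the sample document (including its second URL) are assumptions, not taken from the real HOOU sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def material_urls(sitemap_xml: str, marker: str = "/materials/") -> list[str]:
    """Return all <loc> URLs of a flat urlset that contain the marker.

    Hypothetical helper: sitemap index files (nested sitemaps) are not handled.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if marker in url:
            urls.append(url)
    return urls


# Hypothetical sample; the second entry is made up for illustration.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.hoou.de/materials/tutorial-lernen-lernen</loc></url>
  <url><loc>https://www.hoou.de/some-other-page</loc></url>
</urlset>"""

for url in material_urls(sample_sitemap):
    print(url)
```

Each URL returned this way could then be fed into the Flux shown earlier in this thread, one resource at a time.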

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Feb 6, 2020
@acka47
Contributor

acka47 commented Feb 7, 2020

Maybe it makes sense to continue with the bigger picture (collecting sources from the sitemap.xml, indexing the results) instead of improving the way we extract the description at this point?

+1

@acka47 acka47 removed their assignment Feb 7, 2020
@acka47
Contributor

acka47 commented Apr 6, 2020

Moved to https://gitlab.com/oersi/oersi-etl/-/issues/2. Closing.

@acka47 acka47 closed this as completed Apr 6, 2020