Add HTML input support #312

Closed
fsteeg opened this issue Jan 22, 2020 · 4 comments

@fsteeg
Member

fsteeg commented Jan 22, 2020

For HTML processing as in https://github.com/programmieraffe/oerhoernchen20 (with sources configured as in https://github.com/programmieraffe/oerhoernchen20/blob/master/scrapy/projects.json), we need an HtmlReader. It would sit between HttpOpener and XmlDecoder in a workflow, parsing HTML to XML (e.g. with https://jsoup.org/), to create JSON index data for https://github.com/orgs/hbz/projects/4.

@fsteeg fsteeg self-assigned this Jan 22, 2020
@fsteeg fsteeg changed the title Add HtmlHandler Add HtmlReader Jan 22, 2020
fsteeg added a commit that referenced this issue Jan 23, 2020
Parse HTML with jsoup, write XML. See example in test.

See #312
fsteeg added a commit that referenced this issue Jan 24, 2020
To use with decode-xml, but how to test?

See #312
@fsteeg
Member Author

fsteeg commented Jan 24, 2020

Current state, on http://test.lobid.org/fix (which includes metafacture-html from branch 312-html), use:

Data:

<div id=1><h1>Faust</h1><b>Goethe</b></div><div id=2><h1>Prozess</h1><b>Kafka

Flux:

html-to-xml|decode-xml|handle-generic-xml("div")|fix|encode-json

Fix:

map(_id, id)
map(h1.value,title)
map(b.value,author)
/*map(_else)*/

@fsteeg
Member Author

fsteeg commented Jan 29, 2020

To test a real world example, I did:

The full input data fails due to an org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 29901; The value of attribute "value" associated with an element type "input" must not contain the '<' character. at org.metafacture.xml.XmlDecoder.process(XmlDecoder.java:69). To verify the basic idea, I used just the beginning of the HTML source, until the head.title element, as input data:

<!DOCTYPE html> <html lang="de"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="msapplication-TileColor" content="#ffffff"> <meta name="msapplication-TileImage" content="/assets/images/favicon/ms-icon-144x144-082c0e8e414258ffcec59a1e87679f9c.png"> <meta name="viewport" content="width=device-width, initial-scale=1"> <!-- EMBER_CLI_FASTBOOT_TITLE --> <meta name="ember-cli-head-start" content> <meta property="title" content="Tutorial: Lernen lernen - HOOU"> <title>Tutorial: Lernen lernen - HOOU</title>

With this, we get a JSON result:

{"title":"Tutorial: Lernen lernen - HOOU"}

The problem above is caused by value="<embed src=&quot;https://https://www.hoou.de/m/undefined?embedded=true&quot; style=&quot;overflow:hidden;height:100vh;width:100%&quot;></embed>" attributes. If we remove these (there are 3 in this example), we can process the full input. If we use map(_else) as Fix, we can see the field structure, and pick out more information with a Fix like this:

map(head.title.value,title)
map(body.div.div.div.div.div.div.div.p.value,content)

We get a JSON result:

{"title":"Tutorial: Lernen lernen - HOOU","content":"Das Bewusstsein und die Kenntnis über Ihren Lernstil kann Ihnen helfen, Ihren Lernansatz und damit auch den Lernerfolg zu optimieren. In diesem Modul reflektieren Sie Ihren Lernstil und dessen Implikationen und entwickeln individuelle Lernstrategien. Zudem hilft Ihnen das Wissen über unterschiedliche Lernstile beim Lernen in der Gruppe oder bei der Teamarbeit."}

I tried a few other input examples, but all HOOU content has these value="<embed ..." attributes, and input from other sites (from https://github.com/programmieraffe/oerhoernchen20/blob/master/scrapy/projects.json) has other XML parsing issues, like undeclared or unclosed entities. I think this proves what @dr0i already suspected when we discussed this: converting these real-world HTML inputs to XML does not really work.
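
The attribute problem described above can be reproduced with the JDK's own SAX parser, independent of Metafacture. The following is a minimal sketch, not code from this repository: the class name `AttributeParseCheck` and the helper `parseError` are made up for illustration, and the input is a shortened version of the `value="<embed ...` attribute quoted earlier. The escaped variant is only there to show that the same check passes for well-formed input; it is not what was done in this thread (the attributes were removed by hand).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class AttributeParseCheck {

    /** Returns null if the input is well-formed XML, else the parser's error message. */
    static String parseError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Well-formedness violations are reported as fatal errors and rethrown here.
            return e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A raw '<' inside an attribute value is not allowed in XML.
        String raw = "<input value=\"<embed src=&quot;x&quot;></embed>\"/>";
        // The same content with '<' escaped as &lt; is well-formed.
        String escaped = "<input value=\"&lt;embed src=&quot;x&quot;>&lt;/embed>\"/>";
        System.out.println("raw:     " + parseError(raw));
        System.out.println("escaped: " + parseError(escaped));
    }
}
```

The first call reports a parse error for the attribute and the second returns null. The exact wording of the error message depends on the parser implementation and locale, so it is not shown here.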

Since the problem happens when the XML is parsed (at org.metafacture.xml.XmlDecoder.process(XmlDecoder.java:69)), not when the HTML is parsed, I think this could be a viable approach: parse the HTML with jsoup and generate metadata events directly, without the intermediate XML representation. In a Flux, this would basically work like this: decode-html|fix|encode-json. Internally, it would require a setup similar to XmlDecoder, but extending DefaultObjectPipe<Reader, StreamReceiver>, passing on the HTML DOM elements, instead of doing the SAX parsing.
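
The idea of emitting events directly from the parsed tree can be sketched as follows. This is a simplified illustration under several assumptions, not the actual implementation: `Node` is a hypothetical stand-in for a parsed HTML element (jsoup would provide the real DOM), `Receiver` is a reduced stand-in modeled on the entity and literal events of a stream receiver, and emitting text only for leaf elements as a `value` literal is an assumption chosen to match the Fix paths used in this thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
class DirectEventSketch {

    /** Stand-in for a parsed HTML element. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children;

        Node(String tag, String text, List<Node> children) {
            this.tag = tag;
            this.text = text;
            this.children = children;
        }

        static Node leaf(String tag, String text) {
            return new Node(tag, text, Collections.<Node>emptyList());
        }

        static Node of(String tag, Node... children) {
            return new Node(tag, "", Arrays.asList(children));
        }
    }

    /** Reduced stand-in for a stream receiver: entities nest, literals carry values. */
    interface Receiver {
        void startEntity(String name);
        void literal(String name, String value);
        void endEntity();
    }

    /** Emits one entity per element; only leaf elements emit their text, as a "value" literal. */
    static void emit(Node node, Receiver receiver) {
        receiver.startEntity(node.tag);
        if (node.children.isEmpty()) {
            if (!node.text.isEmpty()) {
                receiver.literal("value", node.text);
            }
        } else {
            for (Node child : node.children) {
                emit(child, receiver);
            }
        }
        receiver.endEntity();
    }

    /** Flattens the event stream into dotted "path=value" lines. */
    static List<String> flatten(Node root) {
        final List<String> out = new ArrayList<>();
        final Deque<String> path = new ArrayDeque<>();
        emit(root, new Receiver() {
            public void startEntity(String name) { path.addLast(name); }
            public void literal(String name, String value) {
                out.add(String.join(".", path) + "." + name + "=" + value);
            }
            public void endEntity() { path.removeLast(); }
        });
        return out;
    }

    public static void main(String[] args) {
        Node doc = Node.of("html",
                Node.of("head", Node.leaf("title", "Tutorial: Lernen lernen - HOOU")),
                Node.of("body", Node.of("div", Node.leaf("p", "Das Bewusstsein ..."))));
        for (String line : flatten(doc)) {
            System.out.println(line);
        }
    }
}
```

For the small tree in `main`, this prints `html.head.title.value=Tutorial: Lernen lernen - HOOU` and `html.body.div.p.value=Das Bewusstsein ...`. No XML is serialized or re-parsed at any point, so characters like `<` in attribute values never reach an XML parser.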

@fsteeg fsteeg changed the title Add HtmlReader Add HTML input support Jan 29, 2020
fsteeg added a commit that referenced this issue Feb 4, 2020
With `decode-html` flux command

See #312
fsteeg added a commit that referenced this issue Feb 4, 2020
Set generated record ID, only process content of leaf nodes

See #312
@fsteeg
Member Author

fsteeg commented Feb 4, 2020

I implemented the new HtmlDecoder described above. The basic approach seems to work, and it is deployed to http://test.lobid.org/fix:

Data: (Full, unmodified) source of https://www.hoou.de/materials/tutorial-lernen-lernen

Flux: decode-html|fix|encode-json(prettyPrinting="true")

Fix:

map(html.head.title.value,title)
map(html.body.div.div.div.div.div.div.div.p.value, description)

Output:

{
  "title" : "Tutorial: Lernen lernen - HOOU",
  "description" : "Das Bewusstsein und die Kenntnis ..."
}

I suggest we leave this open until we are done with some full real-world scenarios. For these, I think we should continue with support for full Flux workflows in the UI (metafacture/metafacture-fix#6) and conditionals in Fix (metafacture/metafacture-fix#10). @acka47, you might still want to play around with this a little already (use map(_else) as Fix to see the full field structure).

@fsteeg
Member Author

fsteeg commented Mar 16, 2020

metafacture/metafacture-fix#6 is closed and #313 is merged, closing.
