Add HTML input support #312
Comments
Parse HTML with jsoup, write XML. See example in test. See #312
To use with decode-xml, but how to test? See #312
Current state on http://test.lobid.org/fix (which includes metafacture-html from branch 312-html); use:
Data:
Flux:
Fix:
To test a real world example, I did:
The full input data fails due to an
With this, we get a JSON result:
The problem above is caused by
We get a JSON result:
I tried a few other input examples, but all HOOU content has these
Since the problem happens when the XML is parsed (
With `decode-html` flux command See #312
Set generated record ID, only process content of leaf nodes See #312
Implemented the new HtmlDecoder described above; the basic approach seems to work. Deployed to http://test.lobid.org/fix:
Data: (full, unmodified) source of https://www.hoou.de/materials/tutorial-lernen-lernen
Flux:
Fix:
Output:
I suggest we leave this open until we are done with some full real-world scenarios. For these, I think we should continue with support for full Flux workflows in the UI (metafacture/metafacture-fix#6) and conditionals in Fix (metafacture/metafacture-fix#10). @acka47, you might still want to play around with this a little already (use
metafacture/metafacture-fix#6 is closed and #313 is merged; closing.
For HTML processing as in https://github.com/programmieraffe/oerhoernchen20 (with sources configured as in https://github.com/programmieraffe/oerhoernchen20/blob/master/scrapy/projects.json), we need an HtmlReader, to be used between HttpOpener and XmlDecoder (parsing HTML to XML, e.g. with https://jsoup.org/) in a workflow to create JSON index data for https://github.com/orgs/hbz/projects/4.
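To illustrate the "parsing HTML to XML" step in isolation, here is a minimal sketch. It is not the Metafacture implementation and does not use jsoup: Python's standard-library `html.parser` stands in for jsoup, and the names `HtmlToXml` and `html_to_xml` are hypothetical. It only shows the general idea of turning lenient HTML (unclosed tags, void elements like `<br>`) into well-formed XML that a downstream XML decoder could consume. Unlike jsoup or a browser, it does not apply HTML's implicit-closing rules, so the resulting tree may nest differently.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class HtmlToXml(HTMLParser):
    """Hypothetical helper: feeds HTML parse events into an XML tree builder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open element names

    @staticmethod
    def _attrs(attrs):
        # Valueless attributes (e.g. <input disabled>) become empty strings.
        return {name: (value if value is not None else "") for name, value in attrs}

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, self._attrs(attrs))
        if tag in VOID_ELEMENTS:
            self.builder.end(tag)  # close immediately, XML needs a matching end
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Explicitly self-closed tag such as <br/>.
        self.builder.start(tag, self._attrs(attrs))
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag, ignore it
        # Close any unclosed descendants first, then the tag itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)


def html_to_xml(html: str) -> str:
    """Return a well-formed XML serialization of the first top-level element."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever the input left open
        parser.builder.end(parser.open_tags.pop())
    return ET.tostring(parser.builder.close(), encoding="unicode")
```

Calling `html_to_xml("<div><p>Hello<br>world</div>")` yields a string that `xml.etree.ElementTree.fromstring` can parse, even though the input never closes `<p>` or `<br>`. Comments, doctype declarations, and any second top-level element are dropped in this sketch, and attribute names are not checked for XML validity; a real implementation based on jsoup would handle these cases more carefully.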