-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace ocamlnet HTML parser with Lambda Soup #15
Conversation
@@ -1,4 +1,4 @@ | |||
(library | |||
(name river) | |||
(public_name river) | |||
(libraries cohttp cohttp-lwt cohttp-lwt-unix syndic netstring lambdasoup)) | |||
(libraries cohttp cohttp-lwt cohttp-lwt-unix str syndic lambdasoup)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not listing str
gave a warning on OCaml 5.2.0.
let len, prefix_content = len_prefix_of_html content len in | ||
(len, Element (tag, args, prefix_content)) | ||
|
||
let prefix_of_html html len = snd (len_prefix_of_html html len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These prefix_of
... functions appeared to be dead code. Since they also referenced module Nethtml
, I removed them.
match l with | ||
| [] -> [] | ||
| a :: tl -> ( | ||
match f a with None -> filter_map tl f | Some a -> a :: filter_map tl f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This became dead code.
Netencoding.Html.encode ~prefer_name:false ~in_enc:`Enc_utf8 () | ||
|
||
let decode_document html = Nethtml.decode ~enc:`Enc_utf8 html | ||
let encode_document html = Nethtml.encode ~enc:`Enc_utf8 html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These entity-translating functions are not necessary with Lambda Soup, as Markup.ml applies all the necessary encoding and decoding internally, as required in the HTML5 specification. HTML5 defaults to UTF-8.
Nethtml.Element ("img", attrs, sub) | ||
| Nethtml.Element (e, attrs, sub) -> | ||
Nethtml.Element (e, attrs, resolve ?xmlbase sub) | ||
| Data _ as d -> d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This recursive traversal became soup $$ "a[href]"
and soup $$ "img[src]"
.
| Data _ as d -> Some d | ||
|
||
let relaxed_html40_dtd = | ||
(* Allow <font> inside <pre> because blogspot uses it! :-( *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this refers to. However, in HTML5, the <font>
tag is allowed in <pre>
, and Lambda Soup loads that correctly.
The results of trying this PR on the large set of blogs scraped by OCaml.org are here: ocaml/ocaml.org#2609 (comment)
This looks fine to me. |
Thanks @aantron! |
CHANGES: - Replace ocamlnet HTML parser with Lambda Soup (tarides/river#15, @aantron)
As suggested in ocaml/ocaml.org#2609 (comment).
It looks like Lambda Soup was already being used in some of the newer code in
meta.ml
by @tmattio. This PR also replaces usage of theNethtml
parser from ocamlnet by Lambda Soup.I tested this by running
example/aggregate_feeds.ml
and it seems to still give plausible output. Are there other tests I should run?