Replace ocamlnet HTML parser with Lambda Soup #15

Merged: 2 commits into tarides:master on Nov 8, 2024

Conversation

@aantron (Contributor) commented on Aug 7, 2024:

As suggested in ocaml/ocaml.org#2609 (comment).

It looks like Lambda Soup was already being used in some of the newer code in meta.ml by @tmattio. This PR also replaces the usage of the Nethtml parser from ocamlnet with Lambda Soup.
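
For context, a rough sketch of the flavor of the change (illustrative only, not the PR's exact code): HTML is parsed with Soup.parse and queried with CSS selectors, instead of going through Nethtml.parse and matching on Nethtml.document lists by hand.

(* Illustrative sketch: parse a fragment and collect link targets with a
   CSS selector rather than a recursive match over Nethtml nodes. *)
let () =
  let soup = Soup.parse "<p>Hello, <a href='/post'>world</a>!</p>" in
  Soup.(soup $$ "a[href]")
  |> Soup.to_list
  |> List.filter_map (Soup.attribute "href")
  |> List.iter print_endline (* prints: /post *)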

I tested this by running example/aggregate_feeds.ml and it seems to still give plausible output. Are there other tests I should run?

@@ -1,4 +1,4 @@
 (library
  (name river)
  (public_name river)
- (libraries cohttp cohttp-lwt cohttp-lwt-unix syndic netstring lambdasoup))
+ (libraries cohttp cohttp-lwt cohttp-lwt-unix str syndic lambdasoup))
@aantron (Contributor, Author) commented:

Not listing str gave a warning on OCaml 5.2.0.

let len, prefix_content = len_prefix_of_html content len in
(len, Element (tag, args, prefix_content))

let prefix_of_html html len = snd (len_prefix_of_html html len)
@aantron (Contributor, Author) commented on Aug 7, 2024:

These prefix_of... functions appeared to be dead code. Since they also referenced module Nethtml, I removed them.

match l with
| [] -> []
| a :: tl -> (
match f a with None -> filter_map tl f | Some a -> a :: filter_map tl f)
@aantron (Contributor, Author) commented:

This became dead code.

Netencoding.Html.encode ~prefer_name:false ~in_enc:`Enc_utf8 ()

let decode_document html = Nethtml.decode ~enc:`Enc_utf8 html
let encode_document html = Nethtml.encode ~enc:`Enc_utf8 html
@aantron (Contributor, Author) commented:

These entity-translating functions are not necessary with Lambda Soup, as Markup.ml applies all the necessary encoding and decoding internally, as required by the HTML5 specification. HTML5 defaults to UTF-8.
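
A small illustration of that point (assumed behavior of Lambda Soup/Markup.ml, not code from this PR): entities are decoded during parsing, and serialization re-escapes only what the HTML5 algorithm requires, so there is no separate decode_document/encode_document step.

(* Illustration: entity handling is built into parse and to_string. *)
let () =
  let soup = Soup.parse "<p>caf&eacute; &amp; more</p>" in
  List.iter print_endline (Soup.texts soup);
  (* prints the already-decoded UTF-8 text: café & more *)
  print_endline (Soup.to_string soup)
  (* prints, approximately: <p>café &amp; more</p> *)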

Nethtml.Element ("img", attrs, sub)
| Nethtml.Element (e, attrs, sub) ->
Nethtml.Element (e, attrs, resolve ?xmlbase sub)
| Data _ as d -> d
@aantron (Contributor, Author) commented:

This recursive traversal became soup $$ "a[href]" and soup $$ "img[src]".
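
Roughly along these lines (a hedged sketch, not the code added by this PR; the abs helper and the use of the Uri library are assumptions):

(* Sketch: rewrite relative hrefs/srcs with selectors instead of a
   hand-written recursion over Nethtml.document values. *)
let resolve_links ~xmlbase html =
  let open Soup in
  (* Assumed helper: resolve a possibly-relative link against xmlbase. *)
  let abs link =
    Uri.(to_string (resolve "" (of_string xmlbase) (of_string link)))
  in
  let soup = parse html in
  soup $$ "a[href]"
  |> iter (fun a ->
       match attribute "href" a with
       | Some h -> set_attribute "href" (abs h) a
       | None -> ());
  soup $$ "img[src]"
  |> iter (fun img ->
       match attribute "src" img with
       | Some s -> set_attribute "src" (abs s) img
       | None -> ());
  to_string soup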

| Data _ as d -> Some d

let relaxed_html40_dtd =
(* Allow <font> inside <pre> because blogspot uses it! :-( *)
@aantron (Contributor, Author) commented:

I'm not sure what this refers to. However, in HTML5, the <font> tag is allowed in <pre>, and Lambda Soup parses it correctly.

@aantron (Contributor, Author) commented on Aug 9, 2024:

The results of trying this PR on the large set of blogs scraped by OCaml.org are here: ocaml/ocaml.org#2609 (comment)

I've switched this PR to pin the River PR rather than ocamlnet.

Based on a recommendation from @sabine, I ran

rm -rf data/planet/*
make scrape

to check for differences in the scraped blogs. Aside from many sources returning 404s or timing out, the differences actually due to the River PR were insignificant:

  • The literal texts of self-closing tags like <hr/> were replaced by <hr>, per the serialization algorithm in the HTML5 spec (see the short sketch after this list).
  • There is no aggressive encoding of characters like single quotes with e.g. &rsquo;. The serialization algorithm in the HTML5 spec specifies UTF-8 and does not call for encoding these as entities. Characters that must be encoded still are -- for example, &gt;. The previous serializer used in River encoded aggressively as if outputting Latin-1.
  • <pre> is followed by a newline. The HTML5 parser requires swallowing this newline when parsing, and the output is semantically equivalent.
  • Changes like c29f667#diff-460316799c4fad86d97128339c0d566dd168995dd19a3b3b832840c856ebc577R27 are the result of HTML5 error correction and represent how browsers load HTML in practice.
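
A tiny sketch of what the first two points look like in practice (assumed output, shown approximately):

(* Round-tripping a fragment: the void element loses its "/" and
   &rsquo; comes back as a literal UTF-8 quote. *)
let () =
  print_endline (Soup.to_string (Soup.parse "<p>it&rsquo;s fine</p><hr/>"))
  (* prints, approximately: <p>it’s fine</p><hr> *)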

I've uploaded the data diff as a throwaway commit: c29f667

I wonder if the diff at c29f667#diff-e7720bbc02cf67fbb25f22d45e6c0e65048b0225ebbe9819beb97c6b159f54beL110 is due to the blog post being updated after it was scraped for OCaml.org.

This looks fine to me.

@tmattio merged commit 56a7010 into tarides:master on Nov 8, 2024
@tmattio (Contributor) commented on Nov 8, 2024:

Thanks @aantron!

tmattio added a commit to tmattio/opam-repository that referenced this pull request on Nov 8, 2024
CHANGES:

- Replace ocamlnet HTML parser with Lambda Soup (tarides/river#15, @aantron)