Replace ocamlnet HTML parser with Lambda Soup #15

Merged: 2 commits into tarides:master on Nov 8, 2024

Conversation

@aantron (Contributor) commented on Aug 7, 2024:

As suggested in ocaml/ocaml.org#2609 (comment).

It looks like Lambda Soup was already being used in some of the newer code in meta.ml by @tmattio. This PR also replaces the usage of the Nethtml parser from ocamlnet with Lambda Soup.
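
For context, a rough sketch of the flavor of the change (illustrative only, not the PR's exact code): HTML is parsed with Soup.parse and queried with CSS selectors, instead of going through Nethtml.parse and matching on Nethtml.document lists by hand.

(* Illustrative sketch: parse a fragment and collect link targets with a
   CSS selector rather than a recursive match over Nethtml nodes. *)
let () =
  let soup = Soup.parse "<p>Hello, <a href='/post'>world</a>!</p>" in
  Soup.(soup $$ "a[href]")
  |> Soup.to_list
  |> List.filter_map (Soup.attribute "href")
  |> List.iter print_endline (* prints: /post *)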

I tested this by running example/aggregate_feeds.ml and it seems to still give plausible output. Are there other tests I should run?

@@ -1,4 +1,4 @@
 (library
  (name river)
  (public_name river)
- (libraries cohttp cohttp-lwt cohttp-lwt-unix syndic netstring lambdasoup))
+ (libraries cohttp cohttp-lwt cohttp-lwt-unix str syndic lambdasoup))
@aantron (Contributor, Author) commented:

Not listing str gave a warning on OCaml 5.2.0.

let len, prefix_content = len_prefix_of_html content len in
(len, Element (tag, args, prefix_content))

let prefix_of_html html len = snd (len_prefix_of_html html len)
@aantron (Contributor, Author) commented on Aug 7, 2024:

These prefix_of... functions appeared to be dead code. Since they also referenced module Nethtml, I removed them.

match l with
| [] -> []
| a :: tl -> (
match f a with None -> filter_map tl f | Some a -> a :: filter_map tl f)
@aantron (Contributor, Author) commented:

This became dead code.

Netencoding.Html.encode ~prefer_name:false ~in_enc:`Enc_utf8 ()

let decode_document html = Nethtml.decode ~enc:`Enc_utf8 html
let encode_document html = Nethtml.encode ~enc:`Enc_utf8 html
@aantron (Contributor, Author) commented:

These entity-translating functions are not necessary with Lambda Soup, as Markup.ml applies all the necessary encoding and decoding internally, as required by the HTML5 specification. HTML5 defaults to UTF-8.
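
A small illustration of that point (assumed behavior of Lambda Soup/Markup.ml, not code from this PR): entities are decoded during parsing, and serialization re-escapes only what the HTML5 algorithm requires, so there is no separate decode_document/encode_document step.

(* Illustration: entity handling is built into parse and to_string. *)
let () =
  let soup = Soup.parse "<p>caf&eacute; &amp; more</p>" in
  List.iter print_endline (Soup.texts soup);
  (* prints the already-decoded UTF-8 text: café & more *)
  print_endline (Soup.to_string soup)
  (* prints, approximately: <p>café &amp; more</p> *)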

Nethtml.Element ("img", attrs, sub)
| Nethtml.Element (e, attrs, sub) ->
Nethtml.Element (e, attrs, resolve ?xmlbase sub)
| Data _ as d -> d
@aantron (Contributor, Author) commented:

This recursive traversal became soup $$ "a[href]" and soup $$ "img[src]".
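
Roughly along these lines (a hedged sketch, not the code added by this PR; the abs helper and the use of the Uri library are assumptions):

(* Sketch: rewrite relative hrefs/srcs with selectors instead of a
   hand-written recursion over Nethtml.document values. *)
let resolve_links ~xmlbase html =
  let open Soup in
  (* Assumed helper: resolve a possibly-relative link against xmlbase. *)
  let abs link =
    Uri.(to_string (resolve "" (of_string xmlbase) (of_string link)))
  in
  let soup = parse html in
  soup $$ "a[href]"
  |> iter (fun a ->
       match attribute "href" a with
       | Some h -> set_attribute "href" (abs h) a
       | None -> ());
  soup $$ "img[src]"
  |> iter (fun img ->
       match attribute "src" img with
       | Some s -> set_attribute "src" (abs s) img
       | None -> ());
  to_string soup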

| Data _ as d -> Some d

let relaxed_html40_dtd =
(* Allow <font> inside <pre> because blogspot uses it! :-( *)
@aantron (Contributor, Author) commented:

I'm not sure what this refers to. However, in HTML5, the <font> tag is allowed in <pre>, and Lambda Soup parses it correctly.

@aantron (Contributor, Author) commented on Aug 9, 2024:

The results of trying this PR on the large set of blogs scraped by OCaml.org are here: ocaml/ocaml.org#2609 (comment)

I've switched this PR to pin the River PR rather than ocamlnet.

Based on a recommendation from @sabine, I ran

rm -rf data/planet/*
make scrape

to check for differences in the scraped blogs. Aside from many sources returning 404s or timing out, the differences actually due to the River PR were insignificant:

  • The literal texts of self-closing tags like <hr/> were replaced by <hr>, per the serialization algorithm in the HTML5 spec (see the short sketch after this list).
  • There is no aggressive encoding of characters like single quotes with e.g. &rsquo;. The serialization algorithm in the HTML5 spec specifies UTF-8 and does not call for encoding these as entities. Characters that must be encoded still are -- for example, &gt;. The previous serializer used in River encoded aggressively as if outputting Latin-1.
  • <pre> is followed by a newline. The HTML5 parser requires swallowing this newline when parsing, and the output is semantically equivalent.
  • Changes like c29f667#diff-460316799c4fad86d97128339c0d566dd168995dd19a3b3b832840c856ebc577R27 are the result of HTML5 error correction and represent how browsers load HTML in practice.
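
A tiny sketch of what the first two points look like in practice (assumed output, shown approximately):

(* Round-tripping a fragment: the void element loses its "/" and
   &rsquo; comes back as a literal UTF-8 quote. *)
let () =
  print_endline (Soup.to_string (Soup.parse "<p>it&rsquo;s fine</p><hr/>"))
  (* prints, approximately: <p>it’s fine</p><hr> *)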

I've uploaded the data diff as a throwaway commit: c29f667

I wonder if the diff at c29f667#diff-e7720bbc02cf67fbb25f22d45e6c0e65048b0225ebbe9819beb97c6b159f54beL110 is due to the blog post being updated after it was scraped for OCaml.org.

This looks fine to me.

@tmattio merged commit 56a7010 into tarides:master on Nov 8, 2024
@tmattio (Contributor) commented on Nov 8, 2024:

Thanks @aantron!

tmattio added a commit to tmattio/opam-repository that referenced this pull request on Nov 8, 2024
CHANGES:

- Replace ocamlnet HTML parser with Lambda Soup (tarides/river#15, @aantron)