Use libxml and dom libraries? #95

alphapapa · 2017-07-19T11:21:19Z

Hi there,

I've been using org-cliplink for years, and it's great, been very useful.

With Emacs 25 and eww-readable, I discovered that libxml and the dom library make it fairly easy to get the HTML title of a page. Here's some example code.

Given this, I wonder if org-cliplink should make use of this code instead of calling out to external processes. Now, I wouldn't be surprised if there are some edge cases that I haven't found that org-cliplink already handles. :) But if so, I would like to find them and fix them, because I'm using this code in some other things, and it seems to work well.

Thanks!

(defun ap/org-cliplink (&optional url)
  "Insert Org link to URL using title of page at URL.
If URL is not given, look for first URL in kill-ring."
  (interactive)
  (let* ((url (or url (ap/get-first-url-in-kill-ring)))
         (html (ap/url-html url))
         (title (ap/html-title html))
         (link (org-make-link-string url title)))
    (insert link)))

(defun ap/get-first-url-in-kill-ring ()
  "Return first URL in kill-ring, or nil if none."
  (cl-loop for item in kill-ring
           if (s-starts-with? "http" item)
           return item))

(defun ap/url-html (url)
  "Return HTML from URL as string."
  (let* ((response-buffer (url-retrieve-synchronously url nil t))
         (encoded-html (with-current-buffer response-buffer
                         (pop-to-buffer response-buffer)
                         ;; Skip HTTP headers
                         (re-search-forward "\r$" nil t)
                         (delete-region (point-min) (point))
                         (buffer-string))))
    (kill-buffer response-buffer)     ; Not sure if necessary to avoid leaking buffer
    (with-temp-buffer
      ;; For some reason, running `decode-coding-region' in the
      ;; response buffer has no effect, so we have to do it in a
      ;; temp buffer.
      (insert encoded-html)
      (condition-case nil
          ;; Fix undecoded text
          (decode-coding-region (point-min) (point-max) 'utf-8)
        (coding-system-error nil))
      (buffer-string))))

(defun ap/html-title (html)
  "Return title of HTML page.
Uses the `dom' library."
  ;; Based on `eww-readable'
  (let* ((dom (with-temp-buffer
                (insert html)
                (libxml-parse-html-region (point-min) (point-max))))
         (title (caddr (car (dom-by-tag dom 'title)))))
    title))

The text was updated successfully, but these errors were encountered:

rexim · 2017-07-19T12:03:59Z

Hi @alphapapa!

Thanks for filing this ticket!

With Emacs 25 and eww-readable, I discovered that libxml and the dom library make it fairly easy to get the HTML title of a page.

I didn't know about that, thanks!

Given this, I wonder if org-cliplink should make use of this code instead of calling out to external processes.

org-cliplink calls out to an external process only when org-cliplink-transport-implementation is set to curl and it's completely optional. By default org-cliplink uses url.el (which you use in your example too, btw). And it's not even about parsing. The optional external process is started only for making an HTTP request. The parsing of the response is always done on the Emacs side here:

org-cliplink/org-cliplink.el

Line 428 in 6c134fd

(defun org-cliplink-extract-title-from-html (html)

And it's extremely dumb and simple. :)

So, here are my questions:

Is libxml-parse-html-region faster than org-cliplink-extract-title-from-html? Honestly, I'd like to see some benchmarks on that. :) Intuitively, it feels like it has to be slower because it parses the entire page, but that org-cliplink-extract-title-from-html just searches for a couple of substrings in a raw text.
How does libxml-parse-html-region handle broken HTML? Even though org-cliplink-extract-title-from-html is simple it's extremely fault-tolerant, because it doesn't care about correctness of the HTML document.
What are the benefits of using libxml-parse-html-region? Probably some tricky cases that we don't cover. Like capitalized title tags <TITLE>title</TITLE>. I saw such tags in the wild nature and org-cliplink couldn't parse them. :)

Unfortunately, I'm not actively developing org-cliplink anymore because I don't have resources for that, but I'm open for any discussions and pull requests. :)

alphapapa · 2017-07-19T14:37:24Z

My main concern is the use of temp files, gzipping and gunzipping, that I see in the minibuffer every time I use org-cliplink. This other code doesn't have to do any of that. I haven't done benchmarks, but it seems faster. I don't know about the broken HTML issue, but I'm guessing that the simplicity of the HTML->HEAD->TITLE structure will work pretty well. I seem to recall looking at the Emacs source code for that function, and I remember seeing something about handling malformed documents, but don't quote me on that. ;)

By the way, I looked at that link you gave, and I noticed something: will it work with capitalized HTML tags? e.g. if it's <TITLE>blah</TITLE>, will that still work?

rexim · 2017-07-19T15:15:39Z

My main concern is the use of temp files, gzipping and gunzipping, that I see in the minibuffer every time I use org-cliplink. This other code doesn't have to do any of that.

Now THAT concern should've been in the title of ticket! Please feel free to speak up any concerns you have.

I completely agree with you. I was even trying to get rid of all of that once #85. As far as I know modern version of url.el supports zlib #89 and doesn't require this hack. This hack is only needed for cURL. And I introduced cURL, because I needed something better than url.el #60, but apparently I didn't document why. :)

But I slightly remember that cURL was needed for #55, because url.el was not completely asynchronous, and it was blocking the UI for a decent amount of time and when I was experimenting the anchor didn't show up fast enough. Something like that. Hope it makes sense. :)

So #89 should resolve your concerns. I'm not sure if I can find any time to work on it. But I'm open to any contributions. :)

By the way, I looked at that link you gave, and I noticed something: will it work with capitalized HTML tags? e.g. if it's <TITLE>blah</TITLE>, will that still work?

No. org-cliplink currently doesn't parse such titles. I thought that was exactly what I said in my previous message. :)

Like capitalized title tags <TITLE>title</TITLE>. I saw such tags in the wild nature and org-cliplink couldn't parse them.

Please correct me if my wording was wrong.

alphapapa · 2017-07-19T18:57:30Z

Now THAT concern should've been in the title of ticket! Please feel free to speak up any concerns you have.

Well, I was just wondering if some of the code in this project could essentially be replaced with code that's now in Emacs itself.

But I slightly remember that cURL was needed for #55, because url.el was not completely asynchronous, and it was blocking the UI for a decent amount of time and when I was experimenting the anchor didn't show up fast enough. Something like that. Hope it makes sense. :)

I guess that's a cool feature, but I'm not sure it's necessary. It doesn't usually take but a moment, and I'm not going to be doing anything else with Emacs while waiting for it to complete. :)

So #89 should resolve your concerns. I'm not sure if I can find any time to work on it. But I'm open to any contributions. :)

Is it common for Emacs to be built without zlib support? I'm guessing that the vast majority of users have it built-in.

No. org-cliplink currently doesn't parse such titles. I thought that was exactly what I said in my previous message. :)

Guess I failed to parse your message. ;) Anyway, that could be fixed very easily by setting case-fold-search around the string-match. Something like:

(defun org-cliplink-extract-title-from-html (html)
  (let ((case-fold-search t))
    (when (string-match (rx "<title>" (minimal-match (0+ anything)) "</title>") html)
      (match-string 1))))

rexim · 2017-07-19T19:38:36Z

I guess that's a cool feature, but I'm not sure it's necessary. It doesn't usually take but a moment, and I'm not going to be doing anything else with Emacs while waiting for it to complete. :)

Yeah, I guess I wanted this feature for myself. I have a couple of pretty intense use cases. :D

Guess I failed to parse your message. ;) Anyway, that could be fixed very easily by setting case-fold-search around the string-match.

You know what. I think I lied to you, sorry. I just double checked and org-cliplink actually parses capitalized title tags. I guess I had that problem before but forgot that I fixed it. Sorry again. :)

Ok, I think you motivated me to work on all of that. I propose the following:

Let's define the scope of this specific ticket to be reimplementing org-cliplink-extract-title-from-html using dom and libxml. Feel free to work on it if you want. ;) We have a bunch of integration tests to rely on.
I'm taking Utilize zlib support in gzip handling #89 and will try to do something about it in the nearest future. I will try to keep my status updated in Utilize zlib support in gzip handling #89 but cannot promise anything. Feel free to ping me if I disappear.

What risks I see at the moment: I haven't run CI for a long time and it may be broken at the moment. It may take a considerable amount of effort to recover it. Probably will be merging PRs bypassing the CI step for awhile.

alphapapa · 2017-07-19T21:41:28Z

No matter really, everything still works fine. No need to fix what ain't broken, just thought I'd mention it. :)

I mean, the regex match works fine (though you might consider using the function in my last comment, which does it in a simpler way), so there's no need to parse the whole DOM anyway. While it may be faster overall (maybe) to avoid writing temp files, just getting the title is surely faster with the regexp.

d12frosted · 2018-05-26T06:08:59Z

Hey,

I know this issue is old, but using libxml solves several problems I've encountered while using this wonderful package. First of all, some pages have attributes on title tag. Current implementation fails to get that title. Secondly, as @alphapapa already mentioned, capitalized title tag also is not parsed at all.

rexim · 2018-05-26T06:13:47Z

@d12frosted The attribute issue is a known one #72. For upper case titles I created a separete one #101. Both of the issues can be solved without libxml. Have you encountered anything else?

d12frosted · 2018-05-26T06:27:16Z

Yeah, both of them can be fixed without libxml, but libxml makes it less bloody 😸

P. S. thanks for lightning fast response!

alphapapa · 2018-05-26T09:47:31Z

@d12frosted Not to poach, but you might find this package useful too: https://github.com/alphapapa/org-web-tools

rexim · 2018-05-26T14:07:49Z

@d12frosted here is why I like the current approach:

I think simple regex search is faster than the proper XML parsing of the whole document,
The current approach should work even with broken HTML documents

What do you think?

d12frosted · 2018-05-27T11:46:12Z

I think it should be faster, especially for really big documents. But you know how they say - a performance test is better than a human prediction. 😸
It should, but I am not sure that I care about it. 😈

rexim added the research label Jul 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use libxml and dom libraries? #95

Use libxml and dom libraries? #95

alphapapa commented Jul 19, 2017

rexim commented Jul 19, 2017

alphapapa commented Jul 19, 2017 •

edited

Loading

rexim commented Jul 19, 2017 •

edited

Loading

alphapapa commented Jul 19, 2017 •

edited

Loading

rexim commented Jul 19, 2017

alphapapa commented Jul 19, 2017

d12frosted commented May 26, 2018

rexim commented May 26, 2018

d12frosted commented May 26, 2018

alphapapa commented May 26, 2018

rexim commented May 26, 2018

d12frosted commented May 27, 2018

Use libxml and dom libraries? #95

Use libxml and dom libraries? #95

Comments

alphapapa commented Jul 19, 2017

rexim commented Jul 19, 2017

alphapapa commented Jul 19, 2017 • edited Loading

rexim commented Jul 19, 2017 • edited Loading

alphapapa commented Jul 19, 2017 • edited Loading

rexim commented Jul 19, 2017

alphapapa commented Jul 19, 2017

d12frosted commented May 26, 2018

rexim commented May 26, 2018

d12frosted commented May 26, 2018

alphapapa commented May 26, 2018

rexim commented May 26, 2018

d12frosted commented May 27, 2018

alphapapa commented Jul 19, 2017 •

edited

Loading

rexim commented Jul 19, 2017 •

edited

Loading

alphapapa commented Jul 19, 2017 •

edited

Loading