Rewrite HTML plugin design #65
Comments
Here is an example of how an intermediate language for HTML based on troff could be used by standard text processing utilities like AWK:
% cat test.mm
.tb html
.tb head
.tb link
.ta rel "alternate"
.ta type "application/rss+xml"
.ta href "/feed.xml"
.te link
.tb link
.ta rel "alternate"
.ta type "application/atom+xml"
.ta href "/atom.xml"
.te link
.te head
.tb body
.tb p
Hello from the body
.te p
.te body
.te html
Where commands begin with a dot in the first character. Commands are
And here is the AWK program that injects the links after the body:
% cat parse.awk
BEGIN { n=0 }
/^\.tb head/ { inhead=1 }
/^\.tb link/ && inhead { inlink=1; href=""; type="" }
/^\.ta type/ && inlink { type=$3 } # FIXME: Handle spaces
/^\.ta href/ && inlink { href=$3 }
/^\.te link/ && inlink \
&& href != "" \
&& type != "" { hrefs[n]=href; types[n]=type; n++; inlink=0 }
{ print } # Print the page as is by default
/^\.tb body/ && n > 0 {
print ".tb div"
print ".ta class dillo-plugin-rss"
for (i = 0; i < n; i++) {
print ".tb p"
printf "Feed with type %s at %s\n", types[i], hrefs[i]
print ".te p"
}
print ".te div"
}
After running it:
% awk -f parse.awk < test.mm > test.pp
% diff -up test.mm test.pp
--- test.mm 2024-01-21 13:48:35.493905662 +0100
+++ test.pp 2024-01-21 13:56:26.554871231 +0100
@@ -12,6 +12,15 @@
.te link
.te head
.tb body
+.tb div
+.ta class dillo-plugin-rss
+.tb p
+Feed with type "application/rss+xml" at "/feed.xml"
+.te p
+.tb p
+Feed with type "application/atom+xml" at "/atom.xml"
+.te p
+.te div
.tb p
Hello from the body
.te p
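The FIXME in the AWK program is about attribute values that contain spaces, which `$3` would cut at the first blank. Below is a minimal sketch in C of one way to split a `.ta` line. It assumes the value is always wrapped in double quotes and never contains an escaped quote; the `parse_attr` name and that quoting rule are assumptions for illustration, not part of the proposal.

```c
#include <string.h>

/* Hypothetical helper: split a line such as
 *     .ta title "My feed title"
 * into name ("title") and value ("My feed title").
 * Assumption: the value is wrapped in double quotes, with no escaping.
 * Returns 0 on success, -1 if the line is not a well-formed .ta command
 * or one of the output buffers is too small. */
int parse_attr(const char *line, char *name, size_t nlen,
               char *value, size_t vlen)
{
    if (strncmp(line, ".ta ", 4) != 0)
        return -1;

    /* The attribute name ends at the first blank or newline */
    const char *p = line + 4;
    size_t n = strcspn(p, " \t\n");
    if (n == 0 || n >= nlen)
        return -1;
    memcpy(name, p, n);
    name[n] = '\0';

    /* The value is everything between the first and the last quote */
    const char *open = strchr(p + n, '"');
    if (open == NULL)
        return -1;
    const char *close = strrchr(open + 1, '"');
    if (close == NULL)
        return -1;

    size_t v = (size_t)(close - (open + 1));
    if (v >= vlen)
        return -1;
    memcpy(value, open + 1, v);
    value[v] = '\0';
    return 0;
}
```

A plugin could call `parse_attr(line, name, sizeof name, value, sizeof value)` for every line and ignore the ones that return -1.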
Here is a rewrite of the previous plugin in C, showing how we can partially parse a pseudo-HTML document:
#include <stdio.h>
#include <string.h>
#define MAXLINKS 32
#define MAXLINE 4096
struct link {
char *href;
char *type;
};
struct state {
int nlinks;
struct link links[MAXLINKS];
int in_head;
int in_link;
int in_body;
int emitted;
};
void parsebegin(struct state *st, char *token)
{
if (strncmp(token, "head", 4) == 0) {
st->in_head = 1;
} else if (st->in_head && strncmp(token, "link", 4) == 0) {
st->in_link = 1;
} else if (strncmp(token, "body", 4) == 0) {
st->in_body = 1;
}
}
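char *cleanstr(char *str)
{
/* Strip the trailing newline; tolerate NULL (failed strdup) and empty strings */
if (str == NULL)
return NULL;
size_t n = strlen(str);
if (n > 0 && str[n-1] == '\n')
str[n-1] = '\0';
return str;
}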
void parseattr(struct state *st, char *token)
{
if (!st->in_link)
return;
struct link *link = &st->links[st->nlinks];
if (strncmp(token, "type", 4) == 0) {
link->type = cleanstr(strdup(token + 5));
} else if (strncmp(token, "href", 4) == 0) {
link->href = cleanstr(strdup(token + 5));
}
}
void parseend(struct state *st, char *token)
{
struct link *link = &st->links[st->nlinks];
if (st->in_head && strncmp(token, "head", 4) == 0) {
st->in_head = 0;
} else if (st->in_body && strncmp(token, "body", 4) == 0) {
st->in_body = 0;
} else if (st->in_link && strncmp(token, "link", 4) == 0) {
st->in_link = 0;
/* Accept */
if (link->href && link->type)
st->nlinks++;
}
}
void parseline(struct state *st, char *line)
{
if (st->nlinks >= MAXLINKS)
return;
int n = strlen(line);
if (n < 4)
return;
int a = line[0], b = line[1], c = line[2], d = line[3];
if (a != '.' || d != ' ')
return;
if (b != 't')
return;
char *next = line + 4;
if (c == 'b')
parsebegin(st, next);
else if (c == 'a')
parseattr(st, next);
else if (c == 'e')
parseend(st, next);
}
void post(struct state *st)
{
if (!st->in_body || st->emitted)
return;
printf(".tb div\n");
printf(".ta class dillo-plugin-rss\n");
for (int i = 0; i < st->nlinks; i++) {
struct link *link = &st->links[i];
printf(".tb p\n");
printf("Feed with type %s at %s\n", link->type, link->href);
printf(".te p\n");
}
printf(".te div\n");
st->emitted = 1;
}
int main()
{
char line[MAXLINE];
struct state st = { 0 };
while (fgets(line, MAXLINE, stdin)) {
parseline(&st, line);
fprintf(stdout, "%s", line);
post(&st);
}
return 0;
}
And here is the comparison with perf:
It uses 65 times fewer instructions (but is not 65 times faster).
Cloudflare has done some work in this area to rewrite parts of the websites they intercept using a stream parser. They have a blog post where they explain some details. As far as I understand, this would be an example where we transform the tree in memory, updating it chunk by chunk. They claim they process a large document (8 MiB) at up to 160 MiB/s, but they don't mention the hardware used. Maybe it could serve as a comparison if we manage to do something similar with an intermediate language.
If we continue with the intermediate language idea, we should also make it cover more than just HTML. For example, it would be nice to have access to the HTTP response headers too. This makes me think that plugins may also need to rewrite the requests and not only the responses. For example, we may want to redirect requests before they are made, or change HTTP headers.
Another problem we face is how to handle transformations that require two or more passes. An example of this is the table of contents, where we index the sections (h1, h2, ...) and then display a menu at the top of the page with the table of contents. A way to solve this problem is to allow plugins to work with auxiliary streams and allow adding a reference to inject content from other streams. Example:
<html>
<body>
<h1>Main title</h1>
<p>blah blah</p>
<h2>Section 1</h2>
<p>blah blah</p>
<h2>Section 2</h2>
<p>blah blah</p>
<h1>Another title</h1>
</body>
</html>
To produce something like this:
<html>
<body>
<div class="toc">
<ul>
<li>Main title
<ul>
<li>Section 1</li>
<li>Section 2</li>
</ul>
</li>
<li>Another title</li>
</ul>
</div>
<h1>Main title</h1>
<p>blah blah</p>
<h2>Section 1</h2>
<p>blah blah</p>
<h2>Section 2</h2>
<p>blah blah</p>
<h1>Another title</h1>
</body>
</html>
We could inject an element that includes content from another file descriptor. Like this:
<html>
<body>
<specialref fd="42"/>
<h1>Main title</h1>
<p>blah blah</p>
<h2>Section 1</h2>
<p>blah blah</p>
<h2>Section 2</h2>
<p>blah blah</p>
<h1>Another title</h1>
</body>
</html>
And then the plugin would write the table of contents to another fd as it is being processed from the main stream. The content is not blocked and can continue to be processed in stream mode. The TOC plugin should also give the headings a unique id, so we can properly link them. We could also inject the content after the main stream is processed, but that would require the plugin to store all the intermediate information in memory. The clean solution is allowing multiple streams.
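To make the multiple-stream idea more concrete, here is a rough sketch in C of a single-pass TOC plugin working on the troff-like intermediate language rather than on HTML. Everything specific in it is an assumption made for illustration: the `toc_pass` name, the `specialref` element with its `fd` attribute written as `.tb`/`.ta`/`.te` commands, the `toc-N` ids, and the "level id title" line format of the auxiliary stream. It also assumes a heading title fits on one text line.

```c
#include <stdio.h>
#include <string.h>

#define MAXLINE 4096

/* Hypothetical TOC pass: copies `in` to `out` line by line. Right after
 * ".tb body" it emits a placeholder element that refers to aux_fd, and
 * every h1/h2 title is appended to `aux` as it streams by, so nothing
 * has to be buffered in memory. */
void toc_pass(FILE *in, FILE *out, FILE *aux, int aux_fd)
{
    char line[MAXLINE];
    int level = 0;       /* 0: outside a heading; 1 or 2: inside h1/h2 */
    int nheadings = 0;

    while (fgets(line, sizeof line, in)) {
        if (strcmp(line, ".tb h1\n") == 0 || strcmp(line, ".tb h2\n") == 0) {
            level = line[5] - '0';
            nheadings++;
            fputs(line, out);
            /* Give the heading a unique id so the TOC can link to it */
            fprintf(out, ".ta id \"toc-%d\"\n", nheadings);
            continue;
        }

        if (strncmp(line, ".te h", 5) == 0) {
            level = 0;
        } else if (level > 0 && line[0] != '.') {
            /* Title text: one TOC entry per line in the auxiliary stream */
            fprintf(aux, "%d toc-%d %s", level, nheadings, line);
        }

        fputs(line, out);

        if (strcmp(line, ".tb body\n") == 0) {
            /* Placeholder to be filled with the auxiliary stream content */
            fputs(".tb specialref\n", out);
            fprintf(out, ".ta fd \"%d\"\n", aux_fd);
            fputs(".te specialref\n", out);
        }
    }
}
```

The main stream is never held back: the placeholder goes out as soon as the body starts, and whoever resolves it reads the auxiliary stream once the document ends.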
Users should have the ability to modify pages on their own, easily and by using their preferred language. They should be able to place rules so that pages matching the rules perform some changes and others don't.
Here are some examples that could use such a feature:
<link> tags for alternate pages and display them (RSS).
Ideally, we would like to have a design such that it has the following features:
For this to happen, we would need to make a decision about how the data is sent from the website to the plugins. We have some options, which are not mutually exclusive.
Raw HTML
Just send the page as-is to the plugin stdin and then read the stdout to get the transformed HTML. This is the simplest design, but has the drawback that we would (likely) need to implement an HTML parser in each plugin and parse the page again in each transformation.
Intermediate language
Instead of using HTML, we could transform it into an intermediate form that is easier to parse, so that plugins can simply disregard all the content they are not interested in and apply match rules that only require a minimal amount of processing. A simple language should allow users to write simple sed or awk plugins to perform simple tasks, without having to parse the whole document tree. This would reduce the amount of processing for plugins, but it would require learning a new language, which may be costly.
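As a rough illustration of what the browser side could emit, here is a sketch in C of a serializer for a line-oriented language with `.tb`, `.ta` and `.te` commands, as used in the examples in this thread. The function names and the `struct attr` type are hypothetical. Protecting a text line that starts with a dot by prefixing `\&` is a convention borrowed from troff and is only an assumption here. Escaping of quotes inside attribute values and of newlines inside text is left out.

```c
#include <stdio.h>

/* Hypothetical attribute pair, for illustration only */
struct attr {
    const char *name;
    const char *value;
};

/* Emit the opening of an element followed by its attributes */
void emit_begin(FILE *out, const char *tag,
                const struct attr *attrs, int nattrs)
{
    fprintf(out, ".tb %s\n", tag);
    for (int i = 0; i < nattrs; i++)
        fprintf(out, ".ta %s \"%s\"\n", attrs[i].name, attrs[i].value);
}

/* Emit one line of text. A leading dot would be mistaken for a
 * command, so it is protected with the troff-style \& prefix. */
void emit_text(FILE *out, const char *text)
{
    if (text[0] == '.')
        fputs("\\&", out);
    fprintf(out, "%s\n", text);
}

/* Emit the end of an element */
void emit_end(FILE *out, const char *tag)
{
    fprintf(out, ".te %s\n", tag);
}
```

Since every command sits on its own line and starts with a dot, a plugin can match the lines it cares about with a regular expression and pass the rest through untouched.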
Document tree in memory
Another option is to allow plugins to read and modify the document tree in memory. As we will be processing the HTML in stream mode, we cannot wait until the whole tree is created and then post-process it. It must be updated in iterations where new content is added to the tree and this new content can be sent to plugins for processing. The plugins could hook into some elements or rules so only that content is sent to them.
This is probably the most efficient way to do it, but it would require plugins to be written in a way that is compatible with the document tree API, and that would also restrict the languages. Furthermore, as we change the API the plugins will become outdated, so this is not such a great idea.
Use JavaScript
Finally, the option that I would hate the most is to implement something similar to (or just the same as) JavaScript, where the plugins are written in a language that can be executed by the browser to manipulate the document tree. This would hide internal changes in the API and allow writing simpler programs. However, this would only allow plugins to be written in JS, and the emulation of the language would add a performance cost.
This option may also not be suitable for the stream mode, where the document tree is still loading, and may cause cascade effects when two plugins are hooked in the same change events. In any case, this would require us to implement support for JavaScript, which would not be an easy task.
To determine which option or options to implement, a simple plan is to just try to code some plugins as a proof of concept and see how they behave. Then, we would have real data on how the performance is affected, instead of just performing some premature optimization.
See also #56