Html api/stop at funky comments #7

Draft
wants to merge 11 commits into
base: trunk


dmsnell
Owner

@dmsnell dmsnell commented Apr 5, 2023

Trac ticket:


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@dmsnell dmsnell force-pushed the html-api/stop-at-funky-comments branch 3 times, most recently from a5d3f64 to 61b15f1 Compare April 6, 2023 01:01
@dmsnell dmsnell force-pushed the html-api/stop-at-funky-comments branch from 61b15f1 to d36fb25 Compare April 12, 2023 15:19
dmsnell added a commit that referenced this pull request Sep 14, 2023
The HTML API should be able to generate excerpts from HTML documents given
a specific maximum length.

In this patch we're exploring the addition of text and HTML chunks that can
be extracted while processing in order to do just this. The text chunks are
similar to `.textContent` on the DOM while the HTML chunks contain raw and
unprocessed HTML.

These functions should likely remain low-level in the Tag Processor and be
exposed from the HTML Processor to ensure that proper semantics are heeded
when extracting this information, such as how `PRE` tags ignore a leading
newline inside their content or how `SCRIPT` and `STYLE` content isn't
part of what we want with something like `strip_tags()`.

In the process of this work it's evident again that the Tag Processor ought
to expose the ability to visit every token, and that non-tag tokens should
be classified. This has already been explored in #7.
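
To make the semantics described in this commit message concrete, here is a minimal sketch of length-limited text extraction. It is not the WordPress HTML API and not the patch itself: it uses Python's standard `html.parser` as a stand-in, and the names `ExcerptExtractor` and `text_excerpt` are hypothetical. It assumes the excerpt limit is counted in characters and that truncation may fall mid-word.

```python
from html.parser import HTMLParser


class ExcerptExtractor(HTMLParser):
    """Hypothetical helper: collects text chunks, roughly like `.textContent`."""

    SKIPPED = {"script", "style"}  # content that should not appear in an excerpt

    def __init__(self, max_length):
        super().__init__(convert_charrefs=True)
        self.max_length = max_length  # assumption: limit counted in characters
        self.chunks = []
        self.length = 0
        self.skip_depth = 0        # > 0 while inside SCRIPT or STYLE
        self.at_pre_start = False  # True immediately after a <pre> opening tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        self.at_pre_start = tag == "pre"

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        self.at_pre_start = False

    def handle_data(self, data):
        # PRE ignores a single leading newline inside its content.
        if self.at_pre_start and data.startswith("\n"):
            data = data[1:]
        self.at_pre_start = False
        if self.skip_depth or self.length >= self.max_length:
            return
        # Keep only as much text as still fits within the limit.
        data = data[: self.max_length - self.length]
        self.chunks.append(data)
        self.length += len(data)


def text_excerpt(html, max_length):
    """Return at most `max_length` characters of text from `html`."""
    parser = ExcerptExtractor(max_length)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    doc = "<p>Hello <b>world</b></p><script>alert(1)</script><pre>\ncode</pre>"
    print(text_excerpt(doc, 100))  # Hello worldcode
    print(text_excerpt(doc, 8))    # Hello wo
```

The sketch tracks state per token rather than stripping tags with a regular expression, which is why the `PRE` and `SCRIPT`/`STYLE` rules can be applied at all; a real implementation in the HTML Processor would also need to handle the remaining parsing rules that this stand-in ignores.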
dmsnell added a commit that referenced this pull request Sep 18, 2023