HTML5 introduces microdata by adding the attributes itemscope, itemid, itemtype, itemprop and itemref. These attributes provide valuable information about the semantic role of the parts of a document. This information can also be very useful for parsing the contents of a website as the author intended, rather than estimating that intent with statistical or other heuristics.
An effort to standardize the values of these attributes is available at http://schema.org/, which defines various types of documents, such as Article: http://schema.org/Article
One example of a website I encountered that uses this effectively is http://tweakers.net/. The ArticleExtractor by itself does a poor job on this website, as it includes not only the article text but also several (but not all) user comments.
In my setup, I currently handle this by first checking for any HTML element with an itemprop=articleBody or itemprop=description attribute and using that element's text when available, rather than invoking BoilerPipe. It would be great if this knowledge could somehow be incorporated into a library such as BoilerPipe, which focuses on extracting the article from such an HTML document.
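To make the pre-check above concrete, here is a rough sketch in Python using only the standard library. It is not BoilerPipe's API and not necessarily how my setup does it: the names `ItempropTextParser` and `extract_article_text` are made up for illustration, and `fallback` is a hypothetical callable standing in for whatever extractor you use (for example, a wrapper around ArticleExtractor). It assumes reasonably well-nested markup; pages that rely on implicitly closed tags would need a real HTML tree builder.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags that should act as word boundaries in the extracted text (not exhaustive).
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "h4",
              "h5", "h6", "blockquote", "section", "tr", "td"}


class ItempropTextParser(HTMLParser):
    """Collects the text of the first non-void element whose itemprop matches."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = wanted
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # set once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth and tag in BLOCK_TAGS:
            self.chunks.append(" ")
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            return
        # itemprop may hold several space-separated property names.
        props = (dict(attrs).get("itemprop") or "").split()
        if self.wanted in props:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.chunks.append(data)


def extract_article_text(html, fallback, props=("articleBody", "description")):
    """Prefer microdata-marked text; otherwise defer to the fallback extractor."""
    for prop in props:
        parser = ItempropTextParser(prop)
        parser.feed(html)
        parser.close()
        text = " ".join("".join(parser.chunks).split())  # normalize whitespace
        if text:
            return text
    return fallback(html)
```

The properties are tried in order, so articleBody wins over description when both are present, and the heuristic extractor only runs when the page carries neither.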