v4.9.0
Remove sanitize-html
The dependency is introducing a bug related to malformed URLs: apostrophecms/sanitize-html#274
In fact, I found it's no longer necessary, since htmlparser2 is already present as part of the cheerio load method.
Result: Smaller bundle, less parsing time.
Set Up Case-Insensitive CSS Rules
One of the things sanitize-html did was normalize some common aspects of the HTML markup.
Because sanitize-html is no longer a dependency, and after discovering that CSS rules can be case-insensitive, I enabled that wherever possible.
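As a rough illustration of what a case-insensitive rule looks like: CSS attribute selectors accept an `i` flag before the closing bracket, which makes the attribute value match regardless of letter case. The `insensitive` helper below is hypothetical and only builds the selector string; it is not part of metascraper.

```javascript
// Hypothetical helper (not a metascraper API): builds an attribute selector
// whose value is matched regardless of letter case, via the CSS `i` flag.
const insensitive = (tag, attr, value) => `${tag}[${attr}="${value}" i]`

const selector = insensitive('meta', 'property', 'og:title')
// selector is 'meta[property="og:title" i]'
// With the flag, the value "og:title" also matches markup such as
// <meta property="OG:Title" content="...">
```

A selector like this could then be passed to a cheerio-style `$()` call, assuming the underlying selector engine supports the flag.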
Result: Better data detection, less initial parsing time.
Improve Date Rules
Following the case-insensitive CSS rules improvement, I re-checked the rules bundled in metascraper-date.
I found some interesting opportunities for improvement: some rules can be merged into one, and others can be made more generic, which improves data accuracy.
I also tried to prioritize update over create, so the output reflects the last modification date rather than the creation date.
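The prioritization idea can be sketched as follows. This is a minimal illustration, not the metascraper-date implementation: the `pickDate` helper and the `modified` / `published` field names are assumptions made for the example.

```javascript
// Hypothetical helper: returns the first valid date among the candidates,
// checking the modification date before the creation date.
const pickDate = ({ modified, published } = {}) => {
  // Candidates in priority order: last modification first, creation second.
  for (const value of [modified, published]) {
    if (!value) continue
    const date = new Date(value)
    // Skip values that cannot be parsed into a valid date.
    if (!Number.isNaN(date.getTime())) return date.toISOString()
  }
  return undefined
}

const date = pickDate({
  modified: '2019-01-02T00:00:00Z',
  published: '2018-01-01T00:00:00Z'
})
// date is '2019-01-02T00:00:00.000Z'
```

When the modification date is missing or unparseable, the sketch falls back to the creation date, so a value is still detected.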
Result: Better date accuracy, more values detected.
Improve URL detection
URL detection has been improved so that more kinds of URLs can be detected.
A URL is a subtype of URI. What I want to ensure is that as much data as possible is detected.
Now the URL-related helpers in metascraper-helpers can also detect URIs, such as base64-encoded data image URIs or magnet URIs.
The challenge here is doing that while still supporting the original functionality. I added a lot of tests to ensure that.
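To make the URL versus URI distinction concrete, here is a small sketch built on the standard `URL` class available in Node.js. The `classifyUri` helper and its return labels are hypothetical and do not reflect how metascraper-helpers is actually implemented.

```javascript
// Hypothetical helper: classifies a string by its URI scheme using the
// WHATWG URL parser that ships with Node.js.
const classifyUri = value => {
  let parsed
  try {
    parsed = new URL(value)
  } catch (_) {
    // The parser throws when the input is not an absolute URI.
    return 'invalid'
  }
  switch (parsed.protocol) {
    case 'http:':
    case 'https:':
      return 'url'
    case 'data:':
      return 'data-uri'
    case 'magnet:':
      return 'magnet-uri'
    default:
      return 'other-uri'
  }
}

classifyUri('https://example.com/logo.png') // 'url'
classifyUri('data:image/png;base64,iVBORw0KGgo=') // 'data-uri'
classifyUri('magnet:?xt=urn:btih:abc123') // 'magnet-uri'
classifyUri('not a uri') // 'invalid'
```

Keeping the `http:` and `https:` branch separate from the others is one way to preserve the original URL behavior while accepting the wider URI family.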
Result: Better URL detection, with URI support.