v4.9.0

@Kikobeats Kikobeats released this 10 Jan 18:32
· 1368 commits to master since this release
ca32573

Remove sanitize-html

The dependency introduced a bug with malformed URLs: apostrophecms/sanitize-html#274

In fact, I found it is no longer necessary, since htmlparser2 already runs as part of the cheerio load method.

Result: Smaller bundle, less parsing time.

Set Up Case-Insensitive CSS Rules

One thing sanitize-html did was normalize some common variations in the HTML markup.

Because it is no longer a dependency, and after discovering that CSS rules can be case-insensitive, I enabled case-insensitive matching wherever possible.
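
As an illustration only, and assuming the feature in question is the `i` flag on CSS attribute selectors (for example `[property="og:title" i]`), the comparison it enables can be sketched in plain JavaScript. `attrMatches` and `metaTags` are hypothetical names made up for this sketch, not part of metascraper or cheerio.

```javascript
// Hypothetical helper: mimics how an attribute selector compares values.
// With `insensitive: true` it behaves like [attr="value" i]; otherwise like [attr="value"].
// Note: CSS uses ASCII case-insensitivity; toLowerCase() is a close approximation.
function attrMatches (actual, expected, { insensitive = false } = {}) {
  if (typeof actual !== 'string') return false
  return insensitive
    ? actual.toLowerCase() === expected.toLowerCase()
    : actual === expected
}

// Hypothetical meta tags as they might appear in inconsistently cased markup.
const metaTags = [
  { property: 'OG:Title', content: 'Hello' },
  { property: 'og:title', content: 'World' }
]

// The strict comparison only finds the exact-case tag.
const strict = metaTags.filter(tag => attrMatches(tag.property, 'og:title'))

// The case-insensitive comparison finds both tags.
const relaxed = metaTags.filter(tag =>
  attrMatches(tag.property, 'og:title', { insensitive: true })
)
```

With a selector engine that supports the flag, the same effect comes from the selector itself, so no markup normalization step is needed beforehand.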

Result: Better data detection, less initial parsing time.

Improve Date Rules

Building on the case-insensitive CSS rules improvement, I re-checked the rule bundle in metascraper-date.

I found some interesting opportunities for improvement: some rules can be merged into one, and others can be made more generic, improving data accuracy.

I also tried to prioritize update over create, so the output reflects the last modification date rather than the creation date.

Result: Better date accuracy, more values detected.
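
A minimal sketch of the "update over create" idea, under the assumption that candidate dates have already been extracted from the page. The field names in `DATE_FIELDS` and the `pickDate` function are hypothetical and do not mirror the actual metascraper-date rules.

```javascript
// Hypothetical candidate fields, ordered so "modified" values win over "published"/"created" ones.
const DATE_FIELDS = ['dateModified', 'datePublished', 'dateCreated']

// Return the first candidate that parses to a valid date, as an ISO 8601 string.
function pickDate (candidates) {
  for (const field of DATE_FIELDS) {
    const value = candidates[field]
    if (!value) continue
    const date = new Date(value)
    // Skip values that do not parse to a valid date and try the next field.
    if (!Number.isNaN(date.getTime())) return date.toISOString()
  }
  return undefined
}

// Both dates are present, so the modification date is preferred.
const picked = pickDate({
  datePublished: '2019-01-01T10:00:00.000Z',
  dateModified: '2019-01-09T08:30:00.000Z'
})

// The modification date is unusable, so the publication date is used instead.
const fallback = pickDate({
  dateModified: 'not a date',
  datePublished: '2019-01-01T10:00:00.000Z'
})
```

Ordering the fields in a single list keeps the priority explicit and makes it easy to merge rules that differ only in which field they read.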

Improve URL detection

URL detection has been improved so that more kinds of URLs can be detected.

A URL is a subtype of URI. What I want to ensure is that we detect as much data as possible.

Now the URL-related helpers in metascraper-helpers can detect URIs, such as base64-encoded data image URIs or magnet URIs.

The challenge here is doing that while still supporting the original functionality. I added a lot of tests to ensure that.

Result: Better URL detection, with URI support.
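
To make the distinction concrete, here is a rough sketch of scheme-aware detection using the standard WHATWG `URL` constructor. It is not the metascraper-helpers implementation: `isUri` and the `ALLOWED_PROTOCOLS` allow-list are assumptions made for this example.

```javascript
// Hypothetical allow-list; the real helpers may accept a different set of schemes.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'data:', 'magnet:'])

// Return true when `value` parses as an absolute URI with an allowed scheme.
function isUri (value) {
  if (typeof value !== 'string') return false
  try {
    // The URL constructor throws a TypeError for relative or malformed input.
    const { protocol } = new URL(value.trim())
    return ALLOWED_PROTOCOLS.has(protocol)
  } catch (err) {
    return false
  }
}

const samples = [
  'https://example.com/post',
  'data:image/png;base64,iVBORw0KGgo=',
  'magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a',
  '/relative/path',
  'javascript:alert(1)'
]

// Keeps the first three samples; the relative path and the javascript: URI are rejected.
const detected = samples.filter(isUri)
```

An explicit allow-list keeps the original http/https behaviour intact while opting in to extra schemes one at a time.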