Scrape Webpages using JS and Autocode
A quick way to see the `crawler.query` API in action is to use it to scrape the top 30 stories on Hacker News right now. To do that, open this link:

autocode.standardlibrary.com/new/?workflow=crawler%2Fquery%2Fselectors

This will open up an interface that looks something like this:
This is part of Autocode called Maker Mode, which can be used to pre-generate API logic. You can edit the code at any time by clicking on the code example to the right. You can see I've already pre-filled the following settings:
- `url` is `https://news.ycombinator.com/`
- `userAgent` is `standardlibrary/crawler/query` (this is the default)
- `includeMetadata` is `false` (if `true`, the response will include additional metadata in a `meta` field)
- `selectorQueries` is an array with one object, the values being `{"selector":"a.storylink","resolver":"text"}`
These settings generate the following code:

```javascript
// Store API Responses
const result = {crawler: {}};

console.log(`Running [Crawler → Query (scrape) a provided URL based on CSS selectors]...`);

result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `text`
    }
  ]
});
```
This can now be run by hitting Run Code in the bottom right. As of writing this, the Hacker News Front Page looks like this:
So my returned response was (I've truncated to only five results):
```json
{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "text": "Zig cc: A drop-in replacement for GCC/Clang"
      },
      {
        "text": "I got my file from Clearview AI"
      },
      {
        "text": "A Novel Mechanical Ventilator Designed for Mass Scale Production"
      },
      {
        "text": "Little Snitch and the deprecation of kernel extensions"
      },
      {
        "text": "Doctors turn to social media to develop Covid-19 treatments in real time"
      }
    ]
  ]
}
```
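The scraped values live in `queryResults`: an array with one entry per selector query, each of which is itself an array with one object per matched element. As a minimal sketch (using an abbreviated mock of the response above, not a live API call), you could pull the titles out like this:

```javascript
// Abbreviated sample shaped like the crawler.query response above
const pageData = {
  url: 'https://news.ycombinator.com/',
  queryResults: [
    [
      {text: 'Zig cc: A drop-in replacement for GCC/Clang'},
      {text: 'I got my file from Clearview AI'}
    ]
  ]
};

// queryResults[0] corresponds to the first (and only) selector query;
// map each matched element to its resolved text
const titles = pageData.queryResults[0].map((entry) => entry.text);
console.log(titles);
```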
And that's it! That's how easy the `crawler.query` API is to use.
You might be wondering how to customize this further. First, the `resolver` attribute can take one of four values: `text`, `html`, `attr`, and `map`.

- `text` returns the element text
- `html` returns the element HTML
- `attr` returns an HTML attribute of the element; you must add an additional `attr` key with a value like `"attr": "href"`
- `map` returns a nested CSS selector query; this requires an additional `mapQueries` attribute expecting another array of `selectorQueries`
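To make the four options concrete, here's an illustrative `selectorQueries` array (not a live call, just the configuration object) showing the extra key each resolver needs:

```javascript
// One entry per resolver type; `attr` needs an extra `attr` key,
// `map` needs a nested `mapQueries` array
const selectorQueries = [
  {selector: 'a.storylink', resolver: 'text'},               // element text
  {selector: 'a.storylink', resolver: 'html'},               // element HTML
  {selector: 'a.storylink', resolver: 'attr', attr: 'href'}, // one attribute
  {
    selector: 'tr[id]',
    resolver: 'map',                                         // nested query
    mapQueries: [
      {selector: 'a.storylink', resolver: 'text'}
    ]
  }
];
console.log(selectorQueries.length);
```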
For this query:
```javascript
result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `attr`,
      'attr': `href`
    }
  ]
});
```
We would expect a response that looks like this:
```json
{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "attr": "https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html"
      },
      {
        "attr": "https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4"
      },
      {
        "attr": "https://arxiv.org/abs/2003.10405"
      },
      {
        "attr": "https://blog.obdev.at/little-snitch-and-the-deprecation-of-kernel-extensions/"
      },
      {
        "attr": "https://www.bloomberg.com/news/articles/2020-03-24/covid-19-mysteries-yield-to-doctors-new-weapon-crowd-sourcing"
      }
    ]
  ]
}
```
We can use `map` to make subqueries (called `mapQueries`) against a selector to parse data in parallel. For example, if we want to combine the above two queries (get both title and URL simultaneously)...
```javascript
result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `tr[id]:not([id="pagespace"])`,
      'resolver': `map`,
      'mapQueries': [
        {
          'selector': 'a.storylink',
          'resolver': 'text'
        },
        {
          'selector': 'a.storylink',
          'resolver': 'attr',
          'attr': 'href'
        }
      ]
    }
  ]
});
```
And we'll get a result like this...
```json
{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "mapResults": [
          [
            {
              "text": "Zig cc: A drop-in replacement for GCC/Clang"
            }
          ],
          [
            {
              "attr": "https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "I got my file from Clearview AI"
            }
          ],
          [
            {
              "attr": "https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "A Novel Mechanical Ventilator Designed for Mass Scale Production"
            }
          ],
          [
            {
              "attr": "https://arxiv.org/abs/2003.10405"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "Little Snitch and the deprecation of kernel extensions"
            }
          ],
          [
            {
              "attr": "https://blog.obdev.at/little-snitch-and-the-deprecation-of-kernel-extensions/"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "Doctors turn to social media to develop Covid-19 treatments in real time"
            }
          ],
          [
            {
              "attr": "https://www.bloomberg.com/news/articles/2020-03-24/covid-19-mysteries-yield-to-doctors-new-weapon-crowd-sourcing"
            }
          ]
        ]
      }
    ]
  ]
}
```
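Each entry in the outer `queryResults` array now carries a `mapResults` array with one sub-array per `mapQuery`, in order: index 0 holds the `text` results and index 1 the `attr` results. A small sketch (again on an abbreviated mock of the response, not a live call) of zipping those into `{title, url}` pairs:

```javascript
// Abbreviated sample shaped like the mapResults response above
const pageData = {
  queryResults: [
    [
      {
        mapResults: [
          [{text: 'Zig cc: A drop-in replacement for GCC/Clang'}],
          [{attr: 'https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html'}]
        ]
      },
      {
        mapResults: [
          [{text: 'I got my file from Clearview AI'}],
          [{attr: 'https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4'}]
        ]
      }
    ]
  ]
};

// mapResults[0] holds the `text` query results, mapResults[1] the `attr` results
const stories = pageData.queryResults[0].map((row) => ({
  title: row.mapResults[0][0].text,
  url: row.mapResults[1][0].attr
}));
console.log(stories);
```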