Scrape Webpages using JS and Autocode
A quick way to see the `crawler.query` API in action is to use it to scrape the top 30 stories on Hacker News right now. To do that, open this link:

autocode.standardlibrary.com/new/?workflow=crawler%2Fquery%2Fselectors

This will open up an interface that looks something like this:
This is part of Autocode called Maker Mode, which can be used to pre-generate API logic. You can edit the code at any time by clicking on the code example to the right. You can see I've already pre-filled the following settings:
- `url` is `https://news.ycombinator.com/`
- `userAgent` is `standardlibrary/crawler/query` (this is the default)
- `includeMetadata` is `false` (if `true`, the response will include additional metadata in a `meta` field)
- `selectorQueries` is an array with one object, the values being `{"selector":"a.storylink","resolver":"text"}`
These settings generate the following code:

```javascript
// Store API Responses
const result = {crawler: {}};

console.log(`Running [Crawler → Query (scrape) a provided URL based on CSS selectors]...`);

result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `text`
    }
  ]
});
```
This can now be run by hitting Run Code in the bottom right. As of writing this, the Hacker News Front Page looks like this:
So my returned response was (I've truncated to only five results):
```json
{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "text": "Zig cc: A drop-in replacement for GCC/Clang"
      },
      {
        "text": "I got my file from Clearview AI"
      },
      {
        "text": "A Novel Mechanical Ventilator Designed for Mass Scale Production"
      },
      {
        "text": "Little Snitch and the deprecation of kernel extensions"
      },
      {
        "text": "Doctors turn to social media to develop Covid-19 treatments in real time"
      }
    ]
  ]
}
```
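The scraped values live in `queryResults`: an array with one entry per selector query, each of which is itself an array with one object per matched element. As a minimal sketch (using an abbreviated mock of the response above, not a live API call), you could pull the titles out like this:

```javascript
// Abbreviated sample shaped like the crawler.query response above
const pageData = {
  url: 'https://news.ycombinator.com/',
  queryResults: [
    [
      {text: 'Zig cc: A drop-in replacement for GCC/Clang'},
      {text: 'I got my file from Clearview AI'}
    ]
  ]
};

// queryResults[0] corresponds to the first (and only) selector query;
// map each matched element to its resolved text
const titles = pageData.queryResults[0].map((entry) => entry.text);
console.log(titles);
```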
And that's it! That's how easy the `crawler.query` API is to use.
You might be wondering how to customize this further. First, the `resolver` attribute can take one of four values: `text`, `html`, `attr`, and `map`.

- `text` returns the element text
- `html` returns the element HTML
- `attr` returns an HTML attribute of the element; you must add an additional `attr` key with a value like `"attr": "href"`
- `map` returns a nested CSS selector query; this requires an additional `mapQueries` attribute expecting another array of `selectorQueries`
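To make the four options concrete, here's an illustrative `selectorQueries` array (not a live call, just the configuration object) showing the extra key each resolver needs:

```javascript
// One entry per resolver type; `attr` needs an extra `attr` key,
// `map` needs a nested `mapQueries` array
const selectorQueries = [
  {selector: 'a.storylink', resolver: 'text'},               // element text
  {selector: 'a.storylink', resolver: 'html'},               // element HTML
  {selector: 'a.storylink', resolver: 'attr', attr: 'href'}, // one attribute
  {
    selector: 'tr[id]',
    resolver: 'map',                                         // nested query
    mapQueries: [
      {selector: 'a.storylink', resolver: 'text'}
    ]
  }
];
console.log(selectorQueries.length);
```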
For this query:
```javascript
result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `attr`,
      'attr': `href`
    }
  ]
});
```
We would expect a response that looks like this:
```json
{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "attr": "https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html"
      },
      {
        "attr": "https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4"
      },
      {
        "attr": "https://arxiv.org/abs/2003.10405"
      },
      {
        "attr": "https://blog.obdev.at/little-snitch-and-the-deprecation-of-kernel-extensions/"
      },
      {
        "attr": "https://www.bloomberg.com/news/articles/2020-03-24/covid-19-mysteries-yield-to-doctors-new-weapon-crowd-sourcing"
      }
    ]
  ]
}
```
We can use `map` to make subqueries (called `mapQueries`) against a selector to parse data in parallel. For example, if we want to combine the above two queries (get both title and URL simultaneously)...
```javascript
result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `tr[id]:not([id="pagespace"])`,
      'resolver': `map`,
      'mapQueries': [
        {
          'selector': 'a.storylink',
          'resolver': 'text'
        },
        {
          'selector': 'a.storylink',
          'resolver': 'attr',
          'attr': 'href'
        }
      ]
    }
  ]
});
```
And we'll get a result like this...
```json
{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "mapResults": [
          [
            {
              "text": "Zig cc: A drop-in replacement for GCC/Clang"
            }
          ],
          [
            {
              "attr": "https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "I got my file from Clearview AI"
            }
          ],
          [
            {
              "attr": "https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "A Novel Mechanical Ventilator Designed for Mass Scale Production"
            }
          ],
          [
            {
              "attr": "https://arxiv.org/abs/2003.10405"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "Little Snitch and the deprecation of kernel extensions"
            }
          ],
          [
            {
              "attr": "https://blog.obdev.at/little-snitch-and-the-deprecation-of-kernel-extensions/"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "Doctors turn to social media to develop Covid-19 treatments in real time"
            }
          ],
          [
            {
              "attr": "https://www.bloomberg.com/news/articles/2020-03-24/covid-19-mysteries-yield-to-doctors-new-weapon-crowd-sourcing"
            }
          ]
        ]
      }
    ]
  ]
}
```
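Each entry in the outer `queryResults` array now carries a `mapResults` array with one sub-array per `mapQuery`, in order: index 0 holds the `text` results and index 1 the `attr` results. A small sketch (again on an abbreviated mock of the response, not a live call) of zipping those into `{title, url}` pairs:

```javascript
// Abbreviated sample shaped like the mapResults response above
const pageData = {
  queryResults: [
    [
      {
        mapResults: [
          [{text: 'Zig cc: A drop-in replacement for GCC/Clang'}],
          [{attr: 'https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html'}]
        ]
      },
      {
        mapResults: [
          [{text: 'I got my file from Clearview AI'}],
          [{attr: 'https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4'}]
        ]
      }
    ]
  ]
};

// mapResults[0] holds the `text` query results, mapResults[1] the `attr` results
const stories = pageData.queryResults[0].map((row) => ({
  title: row.mapResults[0][0].text,
  url: row.mapResults[1][0].attr
}));
console.log(stories);
```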