Realtime-Web-Scraping

Scrape webpages using JS and Autocode

Crawler Query (Scraping) Example

A quick example of the crawler.query API is scraping the top 30 stories currently on Hacker News. To do that, open this link:

autocode.standardlibrary.com/new/?workflow=crawler%2Fquery%2Fselectors

This will open up an interface that looks something like this:

[Image: Maker Mode Example]

This is part of Autocode called Maker Mode, which can be used to pre-generate API logic. You can edit the code at any time by clicking on the code example to the right. You can see I've already pre-filled the following settings:

  • url is https://news.ycombinator.com/
  • userAgent is standardlibrary/crawler/query (this is the default)
  • includeMetadata is false (if true, the response will include additional metadata in a meta field)
  • selectorQueries is an array containing one object: {"selector":"a.storylink","resolver":"text"}

These settings generate the code:

// Store API Responses
const result = {crawler: {}};

console.log(`Running [Crawler → Query (scrape) a provided URL based on CSS selectors]...`);
result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `text`
    }
  ]
});

You can now run this by hitting Run Code in the bottom right. As of this writing, the Hacker News front page looks like this:

[Image: HN Top 5]

So my returned response was (I've truncated to only five results):

{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "text": "Zig cc: A drop-in replacement for GCC/Clang"
      },
      {
        "text": "I got my file from Clearview AI"
      },
      {
        "text": "A Novel Mechanical Ventilator Designed for Mass Scale Production"
      },
      {
        "text": "Little Snitch and the deprecation of kernel extensions"
      },
      {
        "text": "Doctors turn to social media to develop Covid-19 treatments in real time"
      }
    ]
  ]
}
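
Each object in selectorQueries produces one array of results in queryResults, in the same order. So to pull just the titles out of the response above, you could do something like this (a minimal sketch, picking up the result object from the generated code):

// queryResults[0] holds the results for our single selectorQuery
const titles = result.crawler.pageData.queryResults[0].map(item => item.text);
console.log(titles);
// e.g. [ 'Zig cc: A drop-in replacement for GCC/Clang', 'I got my file from Clearview AI', ... ]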

And that's it! That's how easy the crawler.query API is to use.

Web Scraping, Next Steps

You might be wondering how to customize this further. First, the resolver attribute can take one of four values: text, html, attr, and map.

  • text returns the element text
  • html returns the element HTML (see the sketch after this list)
  • attr returns an HTML attribute of the element; you must add an additional attr key with a value like "attr": "href"
  • map resolves a nested CSS selector query; it requires an additional mapQueries attribute expecting another array of selectorQueries
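
Of the four, only html isn't demonstrated below. As a minimal sketch (the response shape for html results is an assumption here, since it isn't shown above), a query for the raw markup of each story link would look like:

result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `html` // returns each matched element's HTML instead of its text
    }
  ]
});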

Using "resolver": "attr"

For this query:

result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `a.storylink`,
      'resolver': `attr`,
      'attr': `href`
    }
  ]
});

We would expect a response that looks like this:

{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "attr": "https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html"
      },
      {
        "attr": "https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4"
      },
      {
        "attr": "https://arxiv.org/abs/2003.10405"
      },
      {
        "attr": "https://blog.obdev.at/little-snitch-and-the-deprecation-of-kernel-extensions/"
      },
      {
        "attr": "https://www.bloomberg.com/news/articles/2020-03-24/covid-19-mysteries-yield-to-doctors-new-weapon-crowd-sourcing"
      }
    ]
  ]
}

Using "resolver": "map"

We can use map to make subqueries (called mapQueries) against a selector and parse data in parallel. For example, to combine the two queries above (getting both title and URL simultaneously), we can select each story row with tr[id]:not([id="pagespace"]) and run both subqueries against every matched row:

result.crawler.pageData = await lib.crawler.query['@0.0.1'].selectors({
  url: `https://news.ycombinator.com/`,
  userAgent: `standardlibrary/crawler/query`,
  includeMetadata: false,
  selectorQueries: [
    {
      'selector': `tr[id]:not([id="pagespace"])`,
      'resolver': `map`,
      'mapQueries': [
        {
          'selector': 'a.storylink',
          'resolver': 'text'
        },
        {
          'selector': 'a.storylink',
          'resolver': 'attr',
          'attr': 'href'
        }
      ]
    }
  ]
});

And we'll get a result like this:

{
  "url": "https://news.ycombinator.com/",
  "userAgent": "standardlibrary/crawler/query",
  "queryResults": [
    [
      {
        "mapResults": [
          [
            {
              "text": "Zig cc: A drop-in replacement for GCC/Clang"
            }
          ],
          [
            {
              "attr": "https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "I got my file from Clearview AI"
            }
          ],
          [
            {
              "attr": "https://onezero.medium.com/i-got-my-file-from-clearview-ai-and-it-freaked-me-out-33ca28b5d6d4"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "A Novel Mechanical Ventilator Designed for Mass Scale Production"
            }
          ],
          [
            {
              "attr": "https://arxiv.org/abs/2003.10405"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "Little Snitch and the deprecation of kernel extensions"
            }
          ],
          [
            {
              "attr": "https://blog.obdev.at/little-snitch-and-the-deprecation-of-kernel-extensions/"
            }
          ]
        ]
      },
      {
        "mapResults": [
          [
            {
              "text": "Doctors turn to social media to develop Covid-19 treatments in real time"
            }
          ],
          [
            {
              "attr": "https://www.bloomberg.com/news/articles/2020-03-24/covid-19-mysteries-yield-to-doctors-new-weapon-crowd-sourcing"
            }
          ]
        ]
      }
    ]
  ]
}
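
In practice you'll probably want to flatten those mapResults into plain objects. A minimal sketch, assuming the response shape shown above (each mapResults array mirrors the order of the mapQueries):

// Combine the text and attr subquery results into { title, url } pairs
const stories = result.crawler.pageData.queryResults[0].map(row => ({
  title: row.mapResults[0][0] && row.mapResults[0][0].text,
  url: row.mapResults[1][0] && row.mapResults[1][0].attr
}));
console.log(stories);
// e.g. [ { title: 'Zig cc: A drop-in replacement for GCC/Clang',
//          url: 'https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html' }, ... ]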
