Clauneck is a Ruby gem designed to scrape specific information from a series of URLs, either directly provided or fetched from Google search results via SerpApi's Google Search API. It extracts and matches patterns such as email addresses and social media handles from the web pages, and stores the results in a CSV file.
Unlike Google Chrome extensions that need you to visit webpages one by one, Clauneck excels in bringing the list of websites to you by leveraging SerpApi’s Google Search API.
- Cold Email Marketing with Open-Source Email Extractor: A blog post about the use case of the tool
The script writes the results to a CSV file. If it cannot find a given piece of information on a website, it labels that field as null. For unknown errors that happen along the way (connection errors, encoding errors, etc.), the fields are filled with error.
Website | Information | Type of Information |
---|---|---|
serpapi.com | contact@serpapi.com | Email |
serpapi.com | serpapicom | |
serpapi.com | serpapicom | |
serpapi.com | serp_api | |
serpapi.com | null | Tiktok |
serpapi.com | channel/UCUgIHlYBOD3yA3yDIRhg_mg | Youtube |
serpapi.com | serpapi | Github |
serpapi.com | serpapi | Medium |
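The labeling rules for the CSV output can be sketched in Ruby as follows. This is a minimal illustration, not the gem's actual code: the regular expressions, the `PATTERNS` constant, and the `extract_rows` helper are assumptions made for this example.

```ruby
# Hypothetical patterns for illustration; Clauneck's real patterns may differ.
PATTERNS = {
  "Email"  => /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  "Github" => %r{github\.com/([A-Za-z0-9_.-]+)},
  "Tiktok" => %r{tiktok\.com/@([A-Za-z0-9_.]+)}
}.freeze

# Returns one [website, information, type] row per pattern.
# A missing match is labeled "null"; any exception raised while
# obtaining the page body labels every field as "error".
def extract_rows(website)
  body = yield # the block is expected to return the page HTML as a String
  PATTERNS.map do |type, pattern|
    match = body.match(pattern)
    info = match ? (match[1] || match[0]) : "null"
    [website, info, type]
  end
rescue StandardError
  PATTERNS.keys.map { |type| [website, "error", type] }
end

html = '<a href="mailto:contact@serpapi.com">Mail</a> ' \
       '<a href="https://github.com/serpapi">GitHub</a>'
rows = extract_rows("serpapi.com") { html }
failed = extract_rows("serpapi.com") { raise IOError, "connection reset" }
```

Here `rows` holds the email address and the GitHub handle plus a null entry for Tiktok, while `failed` holds an error entry for every type.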
Since SerpApi offers free credits that renew every month, and lists of free public proxies are available online, this tool can be used at no cost. With a free SerpApi account, you may extract data from approximately 10,000 pages (100 results per search page, and up to 100 search pages).
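The 10,000 figure is 100 results per search page multiplied by up to 100 search pages. A rough sketch of that arithmetic, using the `num` and `start` parameters of SerpApi's Google Search API, might look like this. The `page_params` helper and the `RESULTS_PER_PAGE` constant are hypothetical; how Clauneck paginates internally is not stated here.

```ruby
# Assumption: each search requests 100 results and offsets by 100 per page.
RESULTS_PER_PAGE = 100

# Builds one parameter hash per search page (hypothetical helper).
def page_params(query, pages)
  (0...pages).map do |page|
    { q: query, num: RESULTS_PER_PAGE, start: page * RESULTS_PER_PAGE }
  end
end

requests = page_params("site:*.ai AND inurl:/contact OR inurl:/contact-us", 100)
max_urls = requests.sum { |request| request[:num] }
```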
- For collecting URLs to scrape, one of the following is required:
  - SerpApi API Key: You may Register to Claim Free Credits
  - List of URLs in a text document (the URLs should be Google web cache links that start with https://webcache.googleusercontent.com)
- For scraping URLs, one of the following is required:
  - List of Proxies in a text document (You may use public proxies. Only HTTP proxies are accepted.)
  - Rotating Proxy IP
Add this line to your application's Gemfile:
gem 'clauneck'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install clauneck
You can use Clauneck as a command line tool or within your Ruby scripts.
In the command line, use the clauneck command with options as follows:
clauneck --api_key YOUR_SERPAPI_KEY --output results.csv --q "site:*.ai AND inurl:/contact OR inurl:/contact-us"
In your Ruby script, call the Clauneck.run method:
require 'clauneck'

api_key = "<SerpApi API Key>" # Visit https://serpapi.com/users/sign_up to get free credits.
params = {
  "q": "site:*.ai AND inurl:/contact OR inurl:/contact-us"
}
Clauneck.run(api_key: api_key, params: params)
You can visit the Documentation for SerpApi's Google Search API to get insight on which parameters you can use to construct searches.
Google supports search operators in queries. They enhance your ability to customize your search and get more precise results. For example, this search query:
"site:*.ai AND inurl:/contact OR inurl:/contact-us"
will search for websites ending with .ai and at the /contact or /contact-us paths.
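If you build such queries for several domains or paths, a small helper can assemble the string. The `contact_query` method below is a hypothetical convenience written for this example; it is not part of Clauneck.

```ruby
# Hypothetical helper: joins a site: filter with one inurl: term per path.
def contact_query(tld, paths)
  inurl_terms = paths.map { |path| "inurl:#{path}" }.join(" OR ")
  "site:*.#{tld} AND #{inurl_terms}"
end

# The resulting hash has the same shape as the params passed to Clauneck.run.
params = { q: contact_query("ai", ["/contact", "/contact-us"]) }
```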
You may check out Google Search Operators: The Complete List (44 Advanced Operators) for a list of more operators.
You can use your own proxies for scraping the web caches of the links you have acquired. Only HTTP proxies are accepted. The proxies should be listed one per line in the following format:
http://username:password@ip:port
http://username:password@another-ip:another-port
or if they are public proxies:
http://ip:port
http://another-ip:another-port
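Before running a large job, it may help to check that each line of the proxy file matches one of these formats. The sketch below uses Ruby's standard `URI` library; the `parse_proxy` helper and the sample addresses are assumptions for illustration and say nothing about how Clauneck validates proxies itself.

```ruby
require "uri"

# Hypothetical validator: returns the proxy's parts for an http:// URL
# with a host, and nil for any other scheme or for a malformed line.
def parse_proxy(line)
  uri = URI.parse(line.strip)
  return nil unless uri.scheme == "http" && uri.host

  { host: uri.host, port: uri.port, user: uri.user, password: uri.password }
rescue URI::InvalidURIError
  nil
end

# Placeholder addresses from the documentation range 203.0.113.0/24.
lines = [
  "http://username:password@203.0.113.10:8080",
  "http://203.0.113.11:3128",
  "socks5://203.0.113.12:1080"
]
proxies = lines.filter_map { |line| parse_proxy(line) }
```

Only the two HTTP entries survive; the SOCKS entry is dropped because only HTTP proxies are accepted.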
You can add the --proxy option in the command line to use the file:
clauneck --api_key YOUR_SERPAPI_KEY --proxy proxies.txt --output results.csv --q "site:*.ai AND inurl:/contact OR inurl:/contact-us"
or use the rotating proxy link directly:
clauneck --api_key YOUR_SERPAPI_KEY --proxy "http://username:password@ip:port" --output results.csv --q "site:*.ai AND inurl:/contact OR inurl:/contact-us"
You may also use it in a script:
require 'clauneck'

api_key = "<SerpApi API Key>" # Visit https://serpapi.com/users/sign_up to get free credits.
params = {
  "q": "site:*.ai AND inurl:/contact OR inurl:/contact-us"
}
proxy = "proxies.txt" # Path to a text file with one proxy per line
Clauneck.run(api_key: api_key, params: params, proxy: proxy)
or directly use the rotating proxy link:
require 'clauneck'

api_key = "<SerpApi API Key>" # Visit https://serpapi.com/users/sign_up to get free credits.
params = {
  "q": "site:*.ai AND inurl:/contact OR inurl:/contact-us"
}
proxy = "http://username:password@ip:port" # Rotating proxy URL
Clauneck.run(api_key: api_key, params: params, proxy: proxy)
If no proxy is provided, the system IP address is used. This can work for small-scale projects, but it is not recommended.
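The three cases (proxy file, single proxy URL, no proxy) can be pictured with the sketch below. The `resolve_proxies` helper and the round-robin rotation are assumptions made for this example; Clauneck's own handling of the proxy argument may differ.

```ruby
require "tempfile"

# Hypothetical helper: turns the proxy argument into a list of proxy URLs.
# nil means "no proxy", so the list is empty and the system IP would be used.
def resolve_proxies(proxy)
  return [] if proxy.nil?
  return [proxy] unless File.file?(proxy)

  File.readlines(proxy, chomp: true).map(&:strip).reject(&:empty?)
end

# Demonstration with a temporary proxy file (addresses are placeholders).
file = Tempfile.new("proxies")
file.write("http://203.0.113.10:8080\nhttp://203.0.113.11:3128\n")
file.close

from_file = resolve_proxies(file.path)
from_url = resolve_proxies("http://username:password@203.0.113.10:8080")
no_proxy = resolve_proxies(nil)

# Round-robin rotation: each request takes the next proxy in the list.
rotation = from_file.cycle
first_three = 3.times.map { rotation.next }
file.unlink
```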
Instead of providing search parameters, the user can directly feed a Google Search URL for the web cache links to be collected by SerpApi's Google Search API.
The user may utilize their own list of URLs to be scraped. The URLs should start with https://webcache.googleusercontent.com, and be listed one per line. For example:
https://webcache.googleusercontent.com/search?q=cache:LItv_3DO2N8J:https://serpapi.com/&cd=10&hl=en&ct=clnk&gl=cy
https://webcache.googleusercontent.com/search?q=cache:_gaXFsYVmCgJ:https://serpapi.com/search-api&cd=9&hl=en&ct=clnk&gl=cy
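To check a URL list before passing it to the tool, you could verify the prefix and recover the original page address from each cache link. The sketch below relies only on Ruby's standard `URI` library; the `original_url` helper is hypothetical and assumes the `q=cache:<id>:<url>` layout seen in the examples above.

```ruby
require "uri"

CACHE_PREFIX = "https://webcache.googleusercontent.com"

# Hypothetical helper: returns the original page URL embedded in a
# Google web cache link, or nil if the line is not a cache link.
def original_url(cache_link)
  return nil unless cache_link.start_with?(CACHE_PREFIX)

  query = URI.decode_www_form(URI.parse(cache_link).query.to_s).to_h
  query["q"].to_s[/\Acache:[^:]+:(.+)\z/, 1]
rescue URI::InvalidURIError, ArgumentError
  nil
end

links = [
  "https://webcache.googleusercontent.com/search?q=cache:LItv_3DO2N8J:https://serpapi.com/&cd=10&hl=en&ct=clnk&gl=cy",
  "https://serpapi.com/search-api"
]
originals = links.map { |link| original_url(link) }
```

The first link resolves to the cached page's address, and the second yields nil because it is not a web cache link.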
You can find cached links manually in Google Search results.
Clauneck accepts the following options:
- --api_key: Your SerpApi key. It is required if you're not providing the --urls option.
- --proxy: Your proxy file or proxy URL. Defaults to system IP if not provided.
- --pages: The number of pages to fetch from Google using SerpApi. Defaults to 1.
- --output: The CSV output file in which to store the results. Defaults to output.csv.
- --google_url: The Google URL that contains the webpages you want to scrape. It should be a Google Search Results URL.
- --urls: The URLs you want to scrape. If provided, the gem will not fetch URLs from Google.
- --help: Shows the help message and exits.
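The defaults and the --api_key requirement listed above can be mirrored with Ruby's standard `OptionParser`. This is a hypothetical re-creation for illustration, not Clauneck's actual command line code; the `parse_options` method and the argument placeholders are assumptions, and --q is taken from the usage examples.

```ruby
require "optparse"

# Hypothetical parser that reproduces the documented options and defaults.
def parse_options(argv)
  options = { pages: 1, output: "output.csv" }
  parser = OptionParser.new do |opts|
    opts.on("--api_key KEY", String) { |v| options[:api_key] = v }
    opts.on("--proxy PROXY", String) { |v| options[:proxy] = v }
    opts.on("--pages N", Integer) { |v| options[:pages] = v }
    opts.on("--output FILE", String) { |v| options[:output] = v }
    opts.on("--google_url URL", String) { |v| options[:google_url] = v }
    opts.on("--urls URLS", String) { |v| options[:urls] = v }
    opts.on("--q QUERY", String) { |v| options[:q] = v }
  end
  parser.parse(argv)

  # --api_key is only optional when the URLs are supplied directly.
  if options[:api_key].nil? && options[:urls].nil?
    raise ArgumentError, "--api_key is required unless --urls is given"
  end
  options
end

options = parse_options(["--api_key", "YOUR_SERPAPI_KEY", "--pages", "3"])
```

Options that are not passed keep their defaults, so `options[:output]` stays at output.csv in this call.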
Bug reports and pull requests are welcome on GitHub at https://github.com/serpapi/clauneck.
The gem is available as open source under the terms of the MIT License.