Onigumo

About

Onigumo is yet another web crawler. It “crawls” websites or web apps and stores their data in a structured form suitable for further machine processing.

Architecture

The crawling part of Onigumo is composed of three sequentially interconnected components:

the Operator,
the Downloader,
the Parser.

The flowcharts below illustrate the flow of data between those parts:

flowchart LR
    subgraph Crawling
        direction BT
        spider_parser(🕷️ PARSER)
        spider_operator(🕷️ OPERATOR)
        onigumo_downloader[DOWNLOADER]
    end

    start([START]) --> onigumo_feeder[FEEDER]

    onigumo_feeder -- .raw --> Crawling
    onigumo_feeder -- .urls --> Crawling
    onigumo_feeder -- .json --> Crawling

    Crawling --> spider_materializer(🕷️ MATERIALIZER)

    spider_materializer --> done([END])

    spider_operator -. "<hash>.urls" .-> onigumo_downloader
    onigumo_downloader -. "<hash>.raw" .-> spider_parser
    spider_parser -. "<hash>.json" .-> spider_operator

flowchart LR
    subgraph "🕷️ Spider"
        direction TB
        spider_parser(PARSER)
        spider_operator(OPERATOR)
        spider_materializer(MATERIALIZER)
    end

    subgraph Onigumo
        onigumo_feeder[FEEDER]
        onigumo_downloader[DOWNLOADER]
    end

    onigumo_feeder -- .json --> spider_operator
    onigumo_feeder -- .urls --> onigumo_downloader
    onigumo_feeder -- .raw --> spider_parser

    spider_parser -. "<hash>.json" .-> spider_operator
    onigumo_downloader -. "<hash>.raw" .-> spider_parser
    spider_operator -. "<hash>.urls" .-> onigumo_downloader

    spider_operator ---> spider_materializer
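
The cycle drawn in the flowcharts above (Operator to Downloader to Parser and back to the Operator) can be sketched in memory as a simple loop. This is a minimal illustration in Python, not Onigumo's actual implementation: the function `crawl`, its parameters, and the `max_pages` limit are hypothetical names chosen for the example, and the real components exchange `<hash>.urls`, `<hash>.raw`, and `<hash>.json` files rather than in-memory values.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seed_urls: Iterable[str],
    download: Callable[[str], object],
    parse: Callable[[object], dict],
    operate: Callable[[dict], Iterable[str]],
    max_pages: int = 100,
) -> dict:
    """Run the Operator -> Downloader -> Parser cycle until no URLs remain."""
    seeds = list(seed_urls)
    queue = deque(seeds)
    seen = set(seeds)
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        raw = download(url)          # Downloader: URL -> raw content
        structured = parse(raw)      # Parser: raw content -> structured data
        results[url] = structured
        for new_url in operate(structured):  # Operator: structured data -> new URLs
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return results
```

Passing the three stages in as functions keeps the loop independent of how each stage is implemented, which mirrors the separation between the components.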

Operator

The Operator determines which URLs the Downloader should fetch. A Spider supplies these URLs, extracting them from the structured data provided by the Parser.

The Operator’s job is to:

initialize a Spider,
extract new URLs from structured data,
insert those URLs onto the Downloader queue.
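
The steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Onigumo's code: the `LinkSpider` class, the `operate` function, the `"links"` key in the structured data, and the one-URL-per-line layout of the `.urls` files are all hypothetical. Only the `.json` and `.urls` file extensions come from the flowcharts.

```python
import json
from pathlib import Path


class LinkSpider:
    """Hypothetical spider that picks URLs out of the Parser's structured data."""

    def extract_urls(self, structured: dict) -> list[str]:
        # Assumption: the structured data carries a "links" list.
        return [u for u in structured.get("links", []) if u.startswith("http")]


def operate(work_dir: Path, spider: LinkSpider) -> list[Path]:
    """Read every *.json file and queue the URLs it yields as a *.urls file."""
    written = []
    for json_path in sorted(work_dir.glob("*.json")):
        structured = json.loads(json_path.read_text(encoding="utf-8"))
        urls = spider.extract_urls(structured)
        if not urls:
            continue  # nothing to queue for this document
        # Assumption: one URL per line, same base name as the source file.
        urls_path = json_path.with_suffix(".urls")
        urls_path.write_text("\n".join(urls) + "\n", encoding="utf-8")
        written.append(urls_path)
    return written
```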

Downloader

The Downloader fetches and saves the contents and metadata from the unprocessed URL addresses.

The Downloader’s job is to:

read URLs for download,
check for already downloaded URLs,
fetch each URL’s contents along with its metadata,
save the downloaded data.
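
A download pass along these lines might look like the sketch below. It is written in Python with the standard library only and rests on several assumptions that the README does not confirm: the hash is taken to be the SHA-256 hex digest of the URL, and each `<hash>.raw` file is taken to be a JSON envelope holding the metadata and the base64-encoded body. The names `url_hash`, `fetch_http`, and `download` are hypothetical.

```python
import base64
import hashlib
import json
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

Fetcher = Callable[[str], tuple[bytes, dict]]


def url_hash(url: str) -> str:
    # Assumption: SHA-256 of the URL; Onigumo's real hash may differ.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def fetch_http(url: str) -> tuple[bytes, dict]:
    """Fetch a URL and return its body together with basic metadata."""
    with urlopen(url, timeout=10) as response:
        body = response.read()
        metadata = {
            "url": url,
            "status": response.status,
            "headers": dict(response.headers.items()),
        }
    return body, metadata


def download(urls_file: Path, out_dir: Path, fetch: Fetcher = fetch_http) -> list[Path]:
    """Download every URL listed in urls_file that has no .raw file yet."""
    saved = []
    for line in urls_file.read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url:
            continue
        raw_path = out_dir / f"{url_hash(url)}.raw"
        if raw_path.exists():
            continue  # already downloaded
        body, metadata = fetch(url)
        # Assumption: body and metadata are stored together in one envelope.
        envelope = {"metadata": metadata, "body": base64.b64encode(body).decode("ascii")}
        raw_path.write_text(json.dumps(envelope), encoding="utf-8")
        saved.append(raw_path)
    return saved
```

The `fetch` parameter lets a test or an alternative transport replace the HTTP call without touching the deduplication and saving logic.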

Parser

The Parser processes the downloaded content and metadata into a structured form.

The Parser’s job is to:

check for downloaded URLs awaiting processing,
process the content and metadata of the downloaded URLs into a structured form,
save the structured data.
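
As an illustration of these steps, the Python sketch below turns downloaded HTML into a small structured record. It is not Onigumo's parser. It assumes that each `.raw` file is a JSON envelope with a `metadata` object and a base64-encoded `body`, and that the structured form consists of a title and a list of links. The names `LinkAndTitleParser`, `parse_raw`, and `parse_pending` are hypothetical.

```python
import base64
import json
from html.parser import HTMLParser
from pathlib import Path


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_raw(raw_path: Path) -> Path:
    """Convert one .raw envelope into a .json file with structured data."""
    envelope = json.loads(raw_path.read_text(encoding="utf-8"))
    html = base64.b64decode(envelope["body"]).decode("utf-8", errors="replace")
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    structured = {
        "metadata": envelope.get("metadata", {}),
        "title": parser.title.strip(),
        "links": parser.links,
    }
    json_path = raw_path.with_suffix(".json")
    json_path.write_text(json.dumps(structured, indent=2), encoding="utf-8")
    return json_path


def parse_pending(work_dir: Path) -> list[Path]:
    """Parse every .raw file that does not have a matching .json file yet."""
    pending = [p for p in sorted(work_dir.glob("*.raw")) if not p.with_suffix(".json").exists()]
    return [parse_raw(p) for p in pending]
```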

Applications (spiders)

An application (spider) extracts the needed information from the structured form of the data.

The nature of the output data or information depends on the user’s needs and on the form of the web content. It is impossible to create a universal spider that satisfies every requirement arising from the combination of the two. For this reason, you need to write your own spider.
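
To make the point concrete, the sketch below shows two spiders that read the same structured data but need different things from it. It is a hypothetical Python illustration, not Onigumo's spider interface: the `Spider` base class, its `extract` method, and the `title` and `links` keys are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Spider(ABC):
    """Hypothetical contract that every custom spider would implement."""

    @abstractmethod
    def extract(self, structured: dict) -> dict:
        """Return only the information this application needs."""


class TitleSpider(Spider):
    """Keeps nothing but the page title."""

    def extract(self, structured: dict) -> dict:
        return {"title": structured.get("title", "")}


class ExternalLinkSpider(Spider):
    """Keeps only absolute links that point to other hosts."""

    def __init__(self, own_host: str) -> None:
        self.own_host = own_host

    def extract(self, structured: dict) -> dict:
        links = structured.get("links", [])
        # Relative links have an empty netloc and are treated as internal.
        external = [u for u in links if urlparse(u).netloc not in ("", self.own_host)]
        return {"external_links": external}
```

Each spider stays small because it encodes a single user need; supporting a new need means writing another subclass rather than extending a universal one.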

Materializer

Usage

Credits

Licensed under the MIT license.
