Home
Julien Nioche edited this page Mar 28, 2024 · 26 revisions
- Introduction
- Configuration: how to configure StormCrawler
- User-Agent-Configuration: how the user agent works in StormCrawler and how to configure it
- Registering Metadata for Serialization: If your topology doesn't extend ConfigurableTopology, you will need to manually register storm-crawler's Metadata class for serialization in Storm.
- Status Streams: Understanding how streams are used in Storm Crawler
- Debug with Eclipse
- Bolts
  - FetcherBolt(s)
  - IndexingBolts
  - JSoupParserBolt: parse HTML documents
  - SiteMapParserBolt: how to handle sitemaps
- Filters
  - ParseFilters: extract metadata from documents
  - URLFilters: how to filter or normalise outlinks
- Protocol
  - Protocols: Network protocols that are usable in storm-crawler