Configurable website scraper in TypeScript.
- Resource types
- Configurable process pipeline
- Options
- Logger
- Concurrent downloader
- Multi-threaded processing (with native `worker_threads`)
- Process CSS
- Process HTML
- Process sitemaps (paths inside them are not replaced)
- Configurable logging
Note: use multi-threaded processing only if your processing is CPU-bound.
- Main thread
  - download resources from a queue
  - process after download
  - save binary resources to disk
  - send other resources to a worker thread
  - enqueue non-duplicate resources received from worker threads
- Worker thread
  - receive downloaded resources from the main thread
  - process after download
  - parse HTML, CSS, etc.
  - collect referenced resources
  - process and filter referenced resources before download
  - send referenced resources to the main thread
  - save resources to disk
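
The main-thread duties listed above (routing downloaded resources and enqueueing only non-duplicate links) can be sketched roughly as follows. This is a minimal illustration, not this project's actual API: `DedupQueue`, `routeDownloaded`, `DownloadedResource`, and the `example.com` URLs are all hypothetical names chosen for the example, and real worker messaging is left out.

```typescript
// Hypothetical shapes for illustration only; not the library's real types.
type ResourceKind = "binary" | "html" | "css" | "siteMap";

interface DownloadedResource {
  url: string;
  kind: ResourceKind;
}

/** Download queue that ignores URLs it has already seen. */
class DedupQueue {
  private readonly seen = new Set<string>();
  private readonly pending: string[] = [];

  /** Returns true if the URL was new and has been queued. */
  enqueue(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  /** Next URL to download, or undefined when the queue is empty. */
  next(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

/** Binary resources are saved by the main thread; everything else goes to a worker. */
function routeDownloaded(res: DownloadedResource): "saveToDisk" | "sendToWorker" {
  return res.kind === "binary" ? "saveToDisk" : "sendToWorker";
}

const queue = new DedupQueue();
queue.enqueue("https://example.com/");

// Links reported back by a worker thread; the already-seen root URL is dropped.
const fromWorker = [
  "https://example.com/a.css",
  "https://example.com/",
  "https://example.com/logo.png",
];
for (const url of fromWorker) {
  queue.enqueue(url);
}

const queuedCount = queue.size; // 3
const logoRoute = routeDownloaded({ url: "https://example.com/logo.png", kind: "binary" }); // "saveToDisk"
```

Keeping the seen-set on the main thread means workers can report links freely without coordinating with each other.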
- skip or redirect link
- detect resource type
- create resource
- process before download
- download
- process after download
- save to disk
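
The stages that run before the download (skip or redirect the link, detect the resource type, create the resource, process before download) can be sketched as a chain of small functions. This is only a sketch under stated assumptions: `LinkStep`, `BeforeDownloadStep`, `detectResourceType`, and `prepare` are hypothetical names, the type detection by file suffix is a deliberate simplification, and the asynchronous download and later stages are omitted.

```typescript
// All names here are hypothetical and chosen for illustration only.
type ResourceType = "html" | "css" | "siteMap" | "binary";

interface Resource {
  url: string;
  type: ResourceType;
}

// A link step returns the (possibly rewritten) URL, or undefined to skip the link.
type LinkStep = (url: string) => string | undefined;
// A before-download step returns the resource, or undefined to discard it.
type BeforeDownloadStep = (res: Resource) => Resource | undefined;

/** Simplified detection by path suffix; query string and fragment are ignored. */
function detectResourceType(url: string): ResourceType {
  const path = (url.split(/[?#]/)[0] ?? url).toLowerCase();
  if (path.endsWith(".css")) return "css";
  if (path.endsWith("sitemap.xml")) return "siteMap";
  if (path.endsWith("/") || path.endsWith(".html")) return "html";
  return "binary";
}

/** Runs the pre-download stages in order; undefined means "do not download". */
function prepare(
  url: string,
  linkSteps: LinkStep[],
  beforeDownload: BeforeDownloadStep[],
): Resource | undefined {
  let current = url;
  for (const step of linkSteps) {
    const next = step(current);
    if (next === undefined) return undefined; // link skipped
    current = next; // link kept, possibly redirected
  }
  // Detect the type, then create the resource.
  let res: Resource = { url: current, type: detectResourceType(current) };
  for (const step of beforeDownload) {
    const next = step(res);
    if (next === undefined) return undefined; // resource discarded
    res = next;
  }
  return res;
}

// Example steps: upgrade http to https, then skip links outside the site.
const linkSteps: LinkStep[] = [
  (url) => url.replace(/^http:/, "https:"),
  (url) => (url.startsWith("https://example.com/") ? url : undefined),
];
// Example before-download step: discard anything under /private/.
const beforeDownload: BeforeDownloadStep[] = [
  (res) => (res.url.includes("/private/") ? undefined : res),
];

const kept = prepare("http://example.com/style.css?v=2", linkSteps, beforeDownload);
// kept: { url: "https://example.com/style.css?v=2", type: "css" }
const skipped = prepare("https://other.test/page.html", linkSteps, beforeDownload);
// skipped: undefined
```

Returning `undefined` from any step is one simple way to express "skip"; a real pipeline could equally use a dedicated result type.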