All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added optional ENV variable `REDIS_DB`.
- Fixed the `Content-Length` calculation method in POST requests to `callbackUri`.
- Added the scenario option `config.options.proxyUrls`.
- Added action `evaluate`.
- Added strategy `evaluate` in the action `collectData`.
- Database connections now always use the UTC timezone.
- Added error logging in the Crawler's `errorHandler` handler.
- Fixed the value of the field `finishedAt` in the result returned from the Crawler.
- Fixed cookie transfer between requests with different sessions.
- Added field `active` for scenario schedulers. The field is required in POST/PUT endpoints, and it is also returned by GET endpoints.
- Added endpoints `PUT /api/scenario-schedulers/:scenarioSchedulerId/activate` and `PUT /api/scenario-schedulers/:scenarioSchedulerId/deactivate`.
- Changed the pm2 configuration file for the docker images `68publishers/crawler:{version}-app`. The application is launched in `fork` mode instead of `cluster` mode.
- Fixed zombie chrome processes: the worker container is now run with the option `--init`, so an init process runs as PID 1 in the container.
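As an illustration, the compose-file equivalent of the `--init` flag might look like this (the service name and image tag are assumptions):

```yaml
services:
    worker:
        image: 68publishers/crawler:1.0.0-worker   # tag is an assumption
        init: true   # run an init process as PID 1 so defunct chrome processes get reaped
```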
- Added separate docker images for development (`dev-app`, `dev-scheduler`, `dev-worker`).
- Added separate docker images for production (`prod-app`, `prod-scheduler`, `prod-worker`). The existing all-in-one image now has the target `prod-all`.
- Log files in `var/log` are now kept for 7 days only.
- The crawler waits for the request queue to empty before exiting (until now, it only waited during an aborted scenario).
- Changed the Release GitHub Action. Images for specific entrypoints are now also released with the pattern `68publishers/crawler:{$version}-{app|scheduler|worker}`.
- Developer container images have been reduced in size (chrome is installed for the `worker` image only).
- Updated dependency `puppeteer` to version `20.7.3`.
- Updated the chrome installation instructions in the `Dockerfile`.
- Added scenario option `waitUntil`.
- Added option `waitUntil` for the action `clickWithRedirect`.
- Changed the postgres pool options to prevent the "Connection terminated unexpectedly" error.
- Fixed the filter `filter[status]=aborted` for the endpoint `GET /api/scenarios`.
- Added new scenario status `aborted`.
- Added endpoint `PUT /api/scenarios/:scenarioId/abort`. A scenario can be aborted until it is finished (`completed` or `failed`).
- Added field `finishedAt` for scenarios.
- Added a database migration.
- Added filters `finishedBefore` and `finishedAfter` for the endpoint `GET /api/scenarios`.
- Added action `runScene`.
- Added action `runSceneConditionally` with the first implemented condition, `isElementVisible`.
- Added scenario option `options.maxConcurrency`.
- Added scenario option `options.session.maxPoolSize`.
- Added scenario option `options.session.maxSessionUsageCount`.
- Added scenario option `options.session.transferredCookies`: cookies with names defined by this option are transferred between all sessions in the pool.
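For context, a scenario body combining these session options might be shaped roughly like this (all values and the surrounding structure are illustrative assumptions, not a documented example):

```json
{
    "options": {
        "maxConcurrency": 5,
        "session": {
            "maxPoolSize": 10,
            "maxSessionUsageCount": 50,
            "transferredCookies": ["PHPSESSID", "csrf_token"]
        }
    }
}
```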
- Cookies obtained via the action `collectCookies` are now stored and compared using the domain without a leading dot.
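The normalization described above can be sketched as a tiny helper (the function name is hypothetical; the project's internal implementation may differ):

```javascript
// Strip a single leading dot so ".example.com" and "example.com"
// are stored and compared as the same cookie domain.
const normalizeCookieDomain = (domain) => domain.replace(/^\./, '');

console.log(normalizeCookieDomain('.example.com')); // → example.com
```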
- Added npm scripts `dev:app`, `dev:worker`, `dev:scheduler`.
- Added separate services `worker` and `scheduler` in `docker-compose.yml`.
- Added service `migrations` in `docker-compose.yml` that runs before the services `app`, `worker` and `scheduler` start.
- Added queue `scheduler_queue`. The scheduler is refreshed when the worker processes a new job named `refresh`.
- The application has been divided into 3 processes: `app`, `worker` and `scheduler`.
- Moved controllers, routes and the Application class into the new directory `src/application`.
- Updated the pm2-runtime configuration file.
- Removed npm script `dev`.
- Added action `enqueueLinksByClicking`.
- Added the previously deleted option `baseUrl` for the action `enqueueLinks`.
- Updated dependency `crawlee` to version `^3.4.0`.
- Fixed cleanup after a failed crawling process.
- The action `clickWithRedirect` now changes `request.loadedUrl` to the current URL after a redirect.
- Added Swagger UI on the endpoint `/api-docs`.
- Changed `command` for the service `redis` in `docker-compose.yml`: added the option `--appendonly yes`.
- Updated README.
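The resulting redis service definition presumably looks something like this (the image tag is an assumption):

```yaml
services:
    redis:
        image: redis:7-alpine   # tag is an assumption
        command: ["redis-server", "--appendonly", "yes"]   # enable AOF persistence
```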
- Added new initial scenario status `waiting`. A scenario is marked as `running` when it actually runs.
- Added support for logging into Sentry. Two ENV variables are processed: `SENTRY_DSN` (optional, enables logging) and `SENTRY_SERVER_NAME` (`crawler` by default).
- Added filters `createdBefore` and `createdAfter` for the endpoint `GET /api/scenarios`.
- Added filters `createdBefore`, `createdAfter`, `updatedBefore` and `updatedAfter` for the endpoint `GET /api/scenario-schedulers`.
- The logger now allows calling the methods `warning()` and `error()` with Error objects.
- A scenario is marked as `failed` if no URL has been successfully crawled.
- Removed the option `baseUrl` for the action `enqueueLinks`; the browser detects it automatically.
- Fixed saving of the fields `visitedUrls.[*].foundOnUrl` for failed requests.
- Added a database migration that fixes all previously created results in the group `data`.
- Changed the structure of collected data: collected values are now under the key `values`.
- Listing APIs now return the fields `next` and `previous` as objects with the shape `{"url": "next_page_url", "limit": int, "page": int}`.
- All error responses have been unified.
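Based on the shape above, a listing response would carry pagination objects along these lines (the host, URLs and numbers are invented for illustration):

```json
{
    "next": {
        "url": "https://crawler.example.com/api/scenarios?page=3&limit=20",
        "limit": 20,
        "page": 3
    },
    "previous": {
        "url": "https://crawler.example.com/api/scenarios?page=1&limit=20",
        "limit": 20,
        "page": 1
    }
}
```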
- Fixed unhandled promises in controllers.
- Fixed image pushing into the DockerHub.
- Changed the name of the package to `@68publishers/crawler` in `package*.json` files.
- The application :)