20 Feb 16:44

nickscamara

e417f83

Self-Host Overhaul - v1.5.0 (Latest)

Self-Host Fixes

Reworked Guide: The SELF_HOST.md and docker-compose.yaml have been updated for clarity and compatibility
Kubernetes Improvements: Updated self-hosted Kubernetes deployment examples for compatibility and consistency (#1177)
Self-Host Fixes: Numerous fixes aimed at improving self-host performance and stability (#1207)
Proxy Support: Added proxy support tailored for self-hosted environments (#1212)
Playwright Integration: Added fixes and continuous integration for the Playwright microservice (#1210)
Search Endpoint Upgrade: Added SearXNG support for the /search endpoint (#1193)

Core Fixes & Enhancements

Crawl Status Fixes: Fixed various race conditions in the crawl status endpoint (#1184)
Timeout Enforcement: Added timeout for scrapeURL engines to prevent hanging requests (#1183)
Query Parameter Retention: Map function now preserves query parameters in results (#1191)
Screenshot Action Order: Ensured screenshots execute after specified actions (#1192)
PDF Scraping: Improved handling for PDFs behind anti-bot measures (#1198)
Map/scrapeURL Abort Control: Integrated AbortController to stop scraping when the request times out (#1205)
SDK Timeout Enforcement: Enforced request timeouts in the SDK (#1204)
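The timeout and abort items above share one idea: when the deadline passes, the in-flight work is cancelled rather than left running. The release implements this with AbortController in the API; the following is only a rough Python analogue using `asyncio.wait_for`, not the project's code. The `scrape` coroutine and the function name are hypothetical.

```python
import asyncio


async def scrape_with_timeout(scrape, url, timeout_s):
    """Await the hypothetical `scrape(url)` coroutine, cancelling it on timeout.

    Returns the scrape result, or None if the deadline passed first.
    """
    try:
        # wait_for cancels the inner task when the timeout expires,
        # which plays the role AbortController plays in JavaScript.
        return await asyncio.wait_for(scrape(url), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
```

A caller would run it as `asyncio.run(scrape_with_timeout(my_scrape, "https://example.com", 30))`, where `my_scrape` is whatever coroutine performs the request.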

New Features & Additions

Proxy & Stealth Options: Introduced a proxy option and stealthProxy flag (#1196)
Deep Research (Alpha): Launched an alpha implementation of deep research (#1202)
LLM Text Generator: Added a new endpoint for llms.txt generation (#1201)

Docker & Containerization

Production-Ready Docker Image: A streamlined, production-ready Docker image is now available to simplify self-hosted deployments.

For the complete details, check out the full changelog.

What's Changed

fix(crawl-status): consider concurrency limited jobs as prioritized (FIR-851) by @mogery in #1184
fix(scrapeURL/sb): enforce timeout (FIR-980) by @mogery in #1183
fix(map): do not remove query parameters from results (FIR-1015) by @mogery in #1191
fix(scrapeURL/fire-engine): perform format screenshot after specified actions (FIR-985) by @mogery in #1192
Update self-hosted Kubernetes deployments examples for compatibility and consistency by @tetuyoko in #1177
fix(v1/types): fix extract -> json rename (FIR-1072) by @mogery in #1195
feat(v1): proxy option / stealthProxy flag (FIR-1050) by @mogery in #1196
fix(v1/types): fix extract -> json rename, ROUND II (FIR-1072) by @mogery in #1199
(feat/deep-research) Alpha implementation of deep research by @nickscamara in #1202
Add llmstxt generator endpoint by @ericciarla in #1201
fix(concurrency-limit): move to renewing a lock on each active job instead of estimating time to complete (FIR-1075) by @mogery in #1197
SELFHOST FIXES (FIR-1105) by @mogery in #1207
feat(v1/map): stop mapping if timed out via AbortController (FIR-747) by @mogery in #1205
Playwright page error schema by @makeiteasierapps in #1172
feat(ci/self-host): add playwright microservice tests by @mogery in #1210
feat(scrapeURL): handle PDFs behind anti-bot (FIR-722) by @mogery in #1198
Use correct list typing for py 3.8 support by @niazarak in #931
feat(map): mock support (FIR-1109) by @mogery in #1213
Add searxng for search endpoint by @loorisr in #1193
feat(sdk): enforce timeout on client-side if set (FIR-864) by @mogery in #1204
feat(self-host): proxy support (FIR-1111) by @mogery in #1212
temp by @mogery in #1218
gemini extractor Implementation by @aparupganguly in #1206

New Contributors

@tetuyoko made their first contribution in #1177
@makeiteasierapps made their first contribution in #1172
@niazarak made their first contribution in #931
@loorisr made their first contribution in #1193

Full Changelog: v1.4.4...v1.5.0

Contributors

tetuyoko, niazarak, and 6 other contributors

14 Feb 16:03

nickscamara

v1.4.4

7ac2b99

v1.4.4

🚀 Features & Enhancements

Scrape API: Added action & wait time validation (#1146)
Extraction Improvements:
- Added detection of PDF/image sub-links & extracted text via Gemini (#1173)
- Multi-entity prompt enhancements for extraction (#1181)
- Show sources out of __experimental in extraction (#1180)
Environment Setup: Added Serper & Search API env vars to docker-compose (#1147)
Credit System Update: Now displays "tokens" instead of "credits" when out of tokens (#1178)

✏️ Examples

Gemini 2.0 Crawler: Implemented new crawling example (#1161)
Gemini TrendFinder: https://github.com/mendableai/gemini-trendfinder
Normal Search to Open Deep Research: https://github.com/nickscamara/open-deep-research

🐛 Fixes

HTML Transformer: Updated free_string function parameter type (#1163)
Gemini Crawler: Updated library & improved PDF link extraction (#1175)
Crawl Queue Worker: Only reports successful page count in num_docs (#1179)
Scraping & URLs:
- Fixed relative URL conversion (#584)
- Enforced scrape rate limit in batch scraping (#1182)

What's Changed

[FIR-796] feat(api/types): Add action and wait time validation for scrape requests by @ftonato in #1146
Implemented Gemini 2.0 crawler by @aparupganguly in #1161
Add Serper and Search API env vars to docker-compose by @RealLukeMartin in #1147
fix(html-transformer): Update free_string function parameter type by @carterlasalle in #1163
Add detection of PDF/image sub-links and extract text via Gemini by @mayooear in #1173
fix: update gemini library. extract pdf links from scraped content by @mayooear in #1175
feat(v1/checkCredits): say "tokens" instead of "credits" if out of tokens by @mogery in #1178
feat(v1/extract) Show sources out of __experimental by @nickscamara in #1180
(feat/extract) Multi-entity prompt improvements by @nickscamara in #1181
fix(queue-worker/crawl): only report successful page count in num_docs (FIR-960) by @mogery in #1179
fix: relative url 2 full url use error base url by @dolonfly in #584
fix(v1/batch/scrape): use scrape rate limit by @mogery in #1182

New Contributors

@RealLukeMartin made their first contribution in #1147
@carterlasalle made their first contribution in #1163
@mayooear made their first contribution in #1173
@dolonfly made their first contribution in #584

Full Changelog: v1.4.3...v1.4.4

Contributors

ftonato, RealLukeMartin, and 6 other contributors

07 Feb 16:41

nickscamara

v1.4.3

2b7b740

Examples Week - v1.4.3

Summary of changes

Open Deep Research: An open-source version of OpenAI Deep Research.
R1 Web Extractor Feature: New extraction capability added.
O3-Mini Web Crawler: Introduces a lightweight crawler for specific use cases.
Updated Model Parameters: Enhancements to o3-mini_company_researcher.
URL Deduplication: Fixes handling of URLs ending with /, index.html, index.php, etc.
Improved URL Blocking: Uses tldts parsing for better blocklist management.
Valid JSON via rawHtml in Scrape: Ensures valid JSON extraction.
Product Reviews Summarizer: Implements summarization using o3-mini.
Scrape Options for Extract: Adds more configuration options for extracting data.
O3-Mini Job Resource Extractor: Extracts job-related resources using o3-mini.
Cached Scrapes for Extract Evals: Improves performance by using cached scrape data for extraction evals.
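To make the URL deduplication item concrete, here is a minimal sketch of the idea. The release's fix works by generating URL permutations (see generateURLPermutations in the PR list); this sketch swaps in a simpler technique, reducing each URL to a canonical form, and is not the project's implementation. The function names and the list of index file names are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of index files; the notes mention index.html, index.php, "etc."
INDEX_FILES = ("index.html", "index.php")


def canonical_url(url):
    """Strip a trailing index file and trailing slash so variants compare equal."""
    parts = urlsplit(url)
    path = parts.path
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    path = path.rstrip("/")
    # Keep the query string, drop the fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With this sketch, `https://example.com/docs`, `https://example.com/docs/`, and `https://example.com/docs/index.html` all collapse to a single entry.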

What's Changed

You forgot an 'e' by @sami0596 in #1118
added cached scrapes to extract by @rafaelsideguide in #1107
Added R1 web extractor feature by @aparupganguly in #1115
Feature o3-mini web crawler by @aparupganguly in #1120
Updated Model Parameters (o3-mini_company_researcher) by @aparupganguly in #1130
Fix corepack and self hosting setup by @rothnic in #1131
fix(crawl-redis/generateURLPermutations): dedupe index.html/index.php/slash/bare URL ends (FIR-827) by @mogery in #1134
feat(blocklist): Improve URL blocking with tldts parsing by @ftonato in #1117
fix(scrape): allow getting valid JSON via rawHtml (FIR-852) by @mogery in #1138
Implemented prodcut reviews summarizer using o3 mini by @aparupganguly in #1139
[Feat] Added scrapeOptions to extract by @rafaelsideguide in #1133
Feature/o3 mini job resource extractor by @aparupganguly in #1144

New Contributors

@sami0596 made their first contribution in #1118
@aparupganguly made their first contribution in #1115
@rothnic made their first contribution in #1131

Full Changelog: v1.4.2...v1.4.3

Contributors

rothnic, ftonato, and 4 other contributors

31 Jan 16:23

nickscamara

v1.4.2

492b81d

Extract and API Improvements - v1.4.2

We're excited to announce several new features and improvements:

New Features

Added web search capabilities to the extract endpoint via the enableWebSearch parameter
Introduced source tracking with __experimental_showSources parameter
Added configurable webhook events for crawl and batch operations
New timeout parameter for map endpoint
Optional ad blocking with blockAds parameter (enabled by default)
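As a rough sketch of how these parameters might appear in request bodies: `enableWebSearch`, `__experimental_showSources`, `timeout`, and `blockAds` come from the notes above, while the other field names (`urls`, `prompt`, `url`), the helper names, and the timeout unit are assumptions and should be checked against the API docs.

```python
import json


def extract_body(urls, prompt, web_search=False, show_sources=False):
    """Body for an extract request (`urls` and `prompt` are assumed field names)."""
    return {
        "urls": urls,
        "prompt": prompt,
        "enableWebSearch": web_search,
        "__experimental_showSources": show_sources,
    }


def map_body(url, timeout=None):
    """Body for a map request; the unit of `timeout` is not stated in the notes."""
    body = {"url": url}
    if timeout is not None:
        body["timeout"] = timeout
    return body


def scrape_body(url, block_ads=True):
    """Body for a scrape request; blockAds is enabled by default per the notes."""
    return {"url": url, "blockAds": block_ads}


print(json.dumps(extract_body(["https://example.com"], "Find the pricing tiers",
                              web_search=True, show_sources=True), indent=2))
```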

Infrastructure & UI

Enhanced proxy selection and infrastructure reliability
Added domain checker tool to cloud platform
Redesigned LLMs.txt generator interface for better usability

What's Changed

(feat/extract) Refactor and Reranker improvements by @nickscamara in #1100
Fix bad WebSocket URL in CrawlWatcher by @ProfHercules in #1053
(feat/extract) Add sources to the extraction by @nickscamara in #1101
feat(v1/map): Timeout parameter (FIR-393) by @mogery in #1105
fix(scrapeURL/fire-engine): default to separate US-generic proxy list if no location is specified (FIR-728) by @mogery in #1104
feat(scrapeUrl/fire-engine): add blockAds flag (FIR-692) by @mogery in #1106
(feat/extract) Logs analyzeSchemaAndPrompt output did not match the schema by @nickscamara in #1108
(feat/extract) Improved completions to use model's limits by @nickscamara in #1109
feat(v0): store v0 users (team ID) in Redis for collection (FIR-698) by @mogery in #1111
feat(github/ci): connect to tailscale (FIR-748) by @mogery in #1112
(feat/conc) Move fully to a concurrency limit system by @nickscamara in #1045
Added instructions for empty string to extract prompts by @rafaelsideguide in #1114

New Contributors

@ProfHercules made their first contribution in #1053

Full Changelog: 1.4.1...v1.4.2
Firecrawl website changelog: https://firecrawl.dev/changelog

Contributors

nickscamara, ProfHercules, and 2 other contributors

24 Jan 22:50

nickscamara

1.4.1

fa5544a

Extract Improvements - v1.4.1

We've significantly enhanced our data extraction capabilities with several key updates:

Extract now returns a lot more data due to a new re-ranker system
Improved infrastructure reliability
Migrated from Cheerio to a high-performance Rust-based parser for faster and more memory-efficient parsing
Enhanced crawl cancellation functionality for better control over running jobs

What's Changed

Added "today" to extract prompts by @rafaelsideguide in #1084
docs: update cancel crawl response by @ftonato in #1087
port most of cheerio stuff to rust by @mogery in #1089
Re-ranker changes by @nickscamara in #1090
Rerank with lower threshold + back to map if length = 0 by @rafaelsideguide in #1086

Full Changelog: v1.4.0...1.4.1

Contributors

ftonato, nickscamara, and 2 other contributors

20 Jan 14:17

nickscamara

v1.4.0

2d4f4de

Introducing /extract - v1.4.0

Get structured web data with /extract

We’re excited to announce the release of /extract - get data from any website with just a prompt. With /extract, you can retrieve any information from anywhere on a website without being limited by scraping roadblocks or the typical context constraints of LLMs.

No more manual copy-pasting, broken scraping scripts, or debugging LLM calls: it’s never been easier to enrich your data, create datasets, or power AI applications with clean, structured data from any website.

Companies are already using extract to:

Enrich CRM data
Streamline KYB processes
Monitor competitors
Supercharge onboarding experiences
Build targeted prospecting lists

Instead of spending hours manually researching, fixing broken scrapers, or piecing together data from multiple sources, simply specify what information you need and the target website, and let Firecrawl handle the entire retrieval process.

Specifically, you can:

Extract structured data from entire websites using URL wildcards (https://example.com/*)
Define custom schemas to capture exactly what you need—from simple product details to complex organizational structures
Guide the extraction with custom prompts to ensure the LLM focuses on your target information
Deploy anywhere with comprehensive support for Python, Node, cURL, and other popular tools. For no-code workflows, just connect via Zapier or use our API to set up integrations with other tools.
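The three ingredients listed above (a URL wildcard, a custom schema, and a guiding prompt) might combine into a single request body along these lines. This is a sketch under assumptions: the field names `urls`, `prompt`, and `schema`, the helper name, and the example schema are illustrative, so refer to the Extract documentation for the actual contract.

```python
import json


def build_extract_request(url_pattern, prompt, schema):
    """Assemble an extract request body from a URL pattern, prompt, and JSON Schema."""
    return {"urls": [url_pattern], "prompt": prompt, "schema": schema}


# Hypothetical schema describing the structure we want back.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name"],
}

body = build_extract_request(
    "https://example.com/*",  # wildcard: consider pages across the whole site
    "Extract the product name and price.",
    product_schema,
)
print(json.dumps(body, indent=2))
```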

This versatility translates into a wide range of real-world applications—enabling you to enrich web data for just about any use case.

Limitations - (and the road ahead)

Let's be honest: while /extract is pretty awesome at grabbing web data, it's not perfect yet. Here's what we're still working on:
- Big sites are tricky: it can't (yet!) grab every single product on Amazon in one go
- Complex searches need work: things like "find all posts from 2025" aren't quite there
- Sometimes it's a bit quirky: results can vary between runs, though it usually gets what you need

But here's the exciting part: we're seeing the future of web scraping take shape.

Try it out

Curious to try /extract out for yourself?
Visit our playground to try out /extract - you get 500,000 tokens for free
Dive into our Extract Beta documentation for detailed technical guidance and API reference
Want a no-code solution? Connect /extract to thousands of applications through our enhanced Zapier integration

That's all for now! Happy Extracting from the whole Firecrawl team 🔥

Full Changelog: v.1.3.0...v1.4.0

14 Jan 22:40

nickscamara

v.1.3.0

957eea4

v1.3 - /extract improvements

What's Changed

feat: new snips test framework (FIR-414) by @mogery in #1033
(feat/extract) New re-ranker + multi entity extraction by @nickscamara in #1061
__experimental_streamSteps by @nickscamara in #1063

Full Changelog: v1.2.1...v.1.3.0

Contributors

nickscamara and mogery

10 Jan 17:54

nickscamara

v1.2.1

d1f3b96

v1.2.1 - /extract Beta Improvements

What's Changed

Indexes, Caching for /extract, Improvements by @nickscamara in #1037
[SDK] fixed none and undefined on response by @rafaelsideguide in #1034
feat: use new random user agent instead of the old one by @1101-1 in #1038
(feat/extract) Move extract to a queue system by @nickscamara in #1044

/extract (beta) changes

We have updated the /extract endpoint to now be asynchronous. When you make a request to /extract, it will return an ID that you can use to check the status of your extract job. If you are using our SDKs, there are no changes required to your code, but please make sure to update the SDKs to the latest versions as soon as possible.
For those using the API directly, we have made it backwards compatible. However, you have 10 days to update your implementation to the new asynchronous model.
For more details about the parameters, refer to the docs sent to you.
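For direct API users, the asynchronous model described above amounts to "submit, get an ID, poll until done". A minimal polling sketch follows, assuming a hypothetical `fetch_status(job_id)` callable that performs the status request and returns a dict; the `status` key and its terminal values are assumptions, not confirmed by these notes.

```python
import time

# Assumed terminal states; check the docs for the real values.
TERMINAL_STATES = ("completed", "failed", "cancelled")


def wait_for_job(fetch_status, job_id, interval_s=2.0, max_attempts=30,
                 sleep=time.sleep):
    """Poll `fetch_status(job_id)` until the job reaches a terminal state."""
    for _ in range(max_attempts):
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.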

New Contributors

@1101-1 made their first contribution in #1038

Full Changelog: v1.2.0...v1.2.1

Changelog: https://www.firecrawl.dev/changelog#/extract-changes

Contributors

nickscamara, 1101-1, and rafaelsideguide

02 Jan 23:24

nickscamara

v1.2.0

a4b6dfe

v1.2.0 - v1/search is now available!

/v1/search

The search endpoint combines web search with Firecrawl’s scraping capabilities to return full page content for any query.

Include scrapeOptions with formats: ["markdown"] to get complete markdown content for each search result; otherwise, the endpoint returns only SERP results (url, title, description).
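The difference between the two modes comes down to whether the request body carries scrapeOptions. A small sketch, assuming the search term is sent in a `query` field (that name and the helper are assumptions; `scrapeOptions` and `formats` come from the text above):

```python
def build_search_request(query, full_content=False):
    """Build a /v1/search body; add scrapeOptions only when full content is wanted."""
    body = {"query": query}
    if full_content:
        # Ask for complete markdown for each result instead of SERP fields only.
        body["scrapeOptions"] = {"formats": ["markdown"]}
    return body


print(build_search_request("firecrawl self-hosting"))
print(build_search_request("firecrawl self-hosting", full_content=True))
```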

More info in the /v1/search docs.

What's Changed

/extract URL trace by @nickscamara in #1014
(feat/v1) Search by @nickscamara in #1032

Full Changelog: v1.1.1...v1.2.0

Contributors

nickscamara

30 Dec 15:30

nickscamara

v1.1.1

71a8f74

v1.1.1

What's Changed

feat(python-sdk): Make API key optional for self-hosted instances by @RutamBhagat in #990
Sitemap fixes by @mogery in #1010
fixed optional+default bug on llm schema by @rafaelsideguide in #955
[FIR-37] feat: extract and return favicon URL during scraping by @ftonato in #1018
fix: merge mock success data by @yujunhui in #1013
feat(rust-sdk): Make API key optional for self-hosted instances by @RutamBhagat in #991
feat(scrapeURL/pdf): switch to MU (FIR-356) by @mogery in #1016

New Contributors

@ftonato made their first contribution in #1018
@yujunhui made their first contribution in #1013

Full Changelog: v1.1.0...v1.1.1

Contributors

ftonato, yujunhui, and 3 other contributors

Releases: mendableai/firecrawl