fix: missing URLs for extract() with array schema
#1107
🦋 Changeset detected
Latest commit: 6d6c7aa
The changes in this PR will be included in the next version bump.
Greptile Overview
Summary
This PR fixes a critical bug in Stagehand's URL extraction system for array schemas. The issue occurred when using `z.array(z.string().url())` in extraction schemas: instead of returning actual URLs, users received numeric identifiers like '0-302', '0-309', etc.
The bug was in the `injectUrls` function in lib/utils.ts, which handles converting numeric IDs back to URLs after LLM extraction. Stagehand's extraction pipeline works by first transforming URL fields into numeric IDs (making them easier for LLMs to process), then converting those IDs back to actual URLs using a mapping. While the schema transformation correctly identified URL fields in arrays and created appropriate path segments like ['*'], the `injectUrls` function wasn't handling the terminal case where the path ends at an array wildcard.
The fix adds logic to handle this scenario: when the path terminates at an array (rest.length === 0), the function now loops through each array element and converts any valid IDs back to their corresponding URLs. Additionally, a toId helper function was extracted to standardize the ID detection logic that was previously duplicated.
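The terminal-wildcard handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/utils.ts: the function bodies, the `PathSegment` and `IdToUrl` types, and the ID pattern in `toId` are assumptions made for the example. Only the names `injectUrls` and `toId`, the `'*'` wildcard, and the `rest.length === 0` check come from the summary.

```typescript
// Hypothetical types for this sketch; the real signatures may differ.
type PathSegment = string | number;
type IdToUrl = Record<string, string>;

// Assumed ID shape: a plain number, or a string like "0-302".
// Returns a lookup key when the value looks like an ID, otherwise undefined.
function toId(value: unknown): string | undefined {
  if (typeof value === "number") return String(value);
  if (typeof value === "string" && /^\d+-\d+$/.test(value)) return value;
  return undefined;
}

// Walks `data` along `path` and replaces IDs with URLs in place.
function injectUrls(data: unknown, path: PathSegment[], idToUrl: IdToUrl): void {
  if (path.length === 0 || data === null || typeof data !== "object") return;
  const [head, ...rest] = path;

  if (head === "*") {
    if (!Array.isArray(data)) return;
    if (rest.length === 0) {
      // Terminal wildcard: the array elements are the IDs themselves.
      for (let i = 0; i < data.length; i++) {
        const id = toId(data[i]);
        if (id !== undefined && id in idToUrl) data[i] = idToUrl[id];
      }
    } else {
      // Deeper path: recurse into each element.
      for (const item of data) injectUrls(item, rest, idToUrl);
    }
    return;
  }

  const record = data as Record<PathSegment, unknown>;
  if (rest.length === 0) {
    // Terminal object field: replace the ID stored under `head`.
    const id = toId(record[head]);
    if (id !== undefined && id in idToUrl) record[head] = idToUrl[id];
  } else {
    injectUrls(record[head], rest, idToUrl);
  }
}

// Usage with made-up URLs: a bare array of IDs and the path ['*'].
const urls: unknown[] = ["0-302", "0-309"];
injectUrls(urls, ["*"], {
  "0-302": "https://example.com/a",
  "0-309": "https://example.com/b",
});
// urls is now ["https://example.com/a", "https://example.com/b"]
```

In this sketch, values that do not look like IDs, or IDs with no entry in the mapping, are left unchanged; whether the real function does the same is not stated in the summary.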
This change integrates seamlessly with Stagehand's existing URL extraction architecture and ensures that array-based URL schemas work as expected, maintaining consistency with non-array URL field behavior.
Important Files Changed
Changed Files
| Filename | Score | Overview |
|---|---|---|
| lib/utils.ts | 4/5 | Fixed URL injection for array schemas by adding array element processing logic and extracting a toId helper function |
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it fixes a clear bug without affecting existing functionality
- Score reflects focused bug fix with good code organization and clear understanding of the existing system
- No files require special attention as the change is contained and well-implemented
Sequence Diagram
sequenceDiagram
participant User
participant ExtractHandler as "StagehandExtractHandler"
participant Utils as "utils.ts"
participant Schema as "Zod Schema"
participant LLM as "LLM Client"
User->>ExtractHandler: "extract() with array schema containing z.string().url()"
ExtractHandler->>ExtractHandler: "domExtract()"
ExtractHandler->>Utils: "transformSchema(schema, [])"
Note over Utils: Transform z.string().url() to z.number() in arrays
Utils->>Utils: "isKind(schema, Kind.ZodArray)"
Utils->>Utils: "transformSchema(itemType, [...currentPath, '*'])"
Utils->>Utils: "Replace z.string().url() with z.number()"
Utils-->>ExtractHandler: "[transformedSchema, urlPaths]"
ExtractHandler->>LLM: "extract data using transformed schema"
LLM-->>ExtractHandler: "extracted data with numeric IDs"
Note over ExtractHandler: Data contains numeric IDs like ['0-302', '0-309']
ExtractHandler->>Utils: "injectUrls(extractedData, urlPaths, idToUrlMapping)"
Utils->>Utils: "Check if path[0] === '*' and rest.length === 0"
Note over Utils: NEW: Handle arrays directly when no deeper path
Utils->>Utils: "Loop through array items"
Utils->>Utils: "Replace numeric IDs with actual URLs"
Utils-->>ExtractHandler: "Data with URLs injected"
ExtractHandler-->>User: "Array with actual URLs instead of numeric IDs"
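The schema-transformation half of the diagram (replace URL fields with numbers and record a path, appending '*' when descending into an array) can be illustrated with a simplified stand-in. The sketch below deliberately does not use Zod: `MiniSchema` and `sketchTransformSchema` are hypothetical names invented for this example, and the shape of the returned paths is an assumption rather than the library's actual format.

```typescript
// Hypothetical, simplified schema descriptor standing in for Zod types.
type MiniSchema =
  | { kind: "url" }
  | { kind: "string" }
  | { kind: "number" }
  | { kind: "array"; item: MiniSchema }
  | { kind: "object"; shape: Record<string, MiniSchema> };

type PathSegment = string | number;

// Returns the transformed schema plus the paths at which URL fields were found.
function sketchTransformSchema(
  schema: MiniSchema,
  currentPath: PathSegment[],
): [MiniSchema, PathSegment[][]] {
  switch (schema.kind) {
    case "url":
      // A URL field becomes a number; remember where it was.
      return [{ kind: "number" }, [currentPath]];
    case "array": {
      // Descend into the item type with a '*' wildcard segment.
      const [item, paths] = sketchTransformSchema(schema.item, [...currentPath, "*"]);
      return [{ kind: "array", item }, paths];
    }
    case "object": {
      const shape: Record<string, MiniSchema> = {};
      const paths: PathSegment[][] = [];
      for (const [key, child] of Object.entries(schema.shape)) {
        const [next, childPaths] = sketchTransformSchema(child, [...currentPath, key]);
        shape[key] = next;
        paths.push(...childPaths);
      }
      return [{ kind: "object", shape }, paths];
    }
    default:
      // Other leaf types are returned unchanged with no URL paths.
      return [schema, []];
  }
}

// An array of URLs yields the single path ['*'], the case this PR addresses.
const [transformed, urlPaths] = sketchTransformSchema(
  { kind: "array", item: { kind: "url" } },
  [],
);
// urlPaths is [["*"]]
```

A path that ends in '*' is exactly the terminal case the diagram marks as NEW: there is no deeper segment to recurse into, so the injection step has to operate on the array elements directly.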
1 file reviewed, no comments
This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @browserbasehq/stagehand@2.5.1

### Patch Changes

- [#1082](#1082) [`8c0fd01`](8c0fd01) Thanks [@tkattkat](https://github.com/tkattkat)! - Pass stagehand object to agent instead of stagehand page
- [#1104](#1104) [`a1ad06c`](a1ad06c) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix logging for stagehand agent
- [#1066](#1066) [`9daa584`](9daa584) Thanks [@tkattkat](https://github.com/tkattkat)! - Add playwright arguments to agent execute response
- [#1077](#1077) [`7f38b3a`](7f38b3a) Thanks [@tkattkat](https://github.com/tkattkat)! - adds support for stagehand agent in the api
- [#1032](#1032) [`bf2d0e7`](bf2d0e7) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix for zod peer dependency support
- [#1014](#1014) [`6966201`](6966201) Thanks [@tkattkat](https://github.com/tkattkat)! - Replace operator handler with base of new agent
- [#1089](#1089) [`536f366`](536f366) Thanks [@miguelg719](https://github.com/miguelg719)! - Fixed info logs on api session create
- [#1103](#1103) [`889cb6c`](889cb6c) Thanks [@tkattkat](https://github.com/tkattkat)! - patch custom tool support in anthropic cua client
- [#1056](#1056) [`6a002b2`](6a002b2) Thanks [@chrisreadsf](https://github.com/chrisreadsf)! - remove need for duplicate project id if already passed to Stagehand
- [#1090](#1090) [`8ff5c5a`](8ff5c5a) Thanks [@miguelg719](https://github.com/miguelg719)! - Improve failed act error logs
- [#1014](#1014) [`6966201`](6966201) Thanks [@tkattkat](https://github.com/tkattkat)! - replace operator agent with scaffold for new stagehand agent
- [#1107](#1107) [`3ccf335`](3ccf335) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: url extraction not working inside an array
- [#1102](#1102) [`a99aa48`](a99aa48) Thanks [@miguelg719](https://github.com/miguelg719)! - Add current page and date context to agent
- [#1110](#1110) [`dda52f1`](dda52f1) Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for new Gemini Computer Use models

## @browserbasehq/stagehand-evals@1.1.0

### Minor Changes

- [#1057](#1057) [`b7be89e`](b7be89e) Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - added web voyager ground truth (optional), added web bench, and subset of OSWorld evals which run on a browser

### Patch Changes

- [#1072](#1072) [`dc2d420`](dc2d420) Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - improve evals screenshot service - add img hashing diff to add screenshots and change to screenshot intercepts from the agent
- Updated dependencies \[[`8c0fd01`](8c0fd01), [`a1ad06c`](a1ad06c), [`9daa584`](9daa584), [`7f38b3a`](7f38b3a), [`bf2d0e7`](bf2d0e7), [`6966201`](6966201), [`536f366`](536f366), [`889cb6c`](889cb6c), [`6a002b2`](6a002b2), [`8ff5c5a`](8ff5c5a), [`6966201`](6966201), [`3ccf335`](3ccf335), [`a99aa48`](a99aa48), [`dda52f1`](dda52f1)]:
  - @browserbasehq/stagehand@2.5.1

## @browserbasehq/stagehand-examples@1.0.10

### Patch Changes

- Updated dependencies \[[`8c0fd01`](8c0fd01), [`a1ad06c`](a1ad06c), [`9daa584`](9daa584), [`7f38b3a`](7f38b3a), [`bf2d0e7`](bf2d0e7), [`6966201`](6966201), [`536f366`](536f366), [`889cb6c`](889cb6c), [`6a002b2`](6a002b2), [`8ff5c5a`](8ff5c5a), [`6966201`](6966201), [`3ccf335`](3ccf335), [`a99aa48`](a99aa48), [`dda52f1`](dda52f1)]:
  - @browserbasehq/stagehand@2.5.1

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
why
- After a `z.string().url()` was converted to an ID, if it was inside a `z.array()`, it was not getting converted back into a URL
- Instead of URLs, you would receive an array of IDs
what changed
- Updated the `injectUrls` function so that when it hits an array and there is no deeper path, it loops through the array and injects the URLs
test plan