25 Nov 01:10

jamesturk

c75fe2b

0.6.0 Latest

Changelog

0.6.0

Move to supporting Python 3.11 and 3.12.
Move to openai 1.0.
Move to pydantic 2.0.
Add support for the November 2023 model upgrades.

0.5.1 - 2023-06-13

Improve type annotations and remove some ignored errors.
Support for new OpenAI models announced June 13th 2023.
Improved support for model fallbacks. Now if a request has 6k tokens and the model list looks like ['gpt-3.5-turbo', 'gpt-3.5-turbo-16k'], the 16k model will be used automatically since the default 4k model will not be able to handle the request.
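
The fallback rule described above can be sketched as follows. This is a minimal illustration, not scrapeghost's actual code: `pick_model` and the `CONTEXT_LIMITS` table are hypothetical names, and the sketch ignores the room a real request would need to reserve for the response.

```python
# Hypothetical table of context-window sizes in tokens (an assumption for this sketch).
CONTEXT_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
}


def pick_model(models: list[str], request_tokens: int) -> str:
    """Return the first model in `models` whose context window fits the request."""
    for name in models:
        if request_tokens <= CONTEXT_LIMITS[name]:
            return name
    raise ValueError(f"no model in {models} can handle {request_tokens} tokens")


# A 6k-token request skips the 4k model and lands on the 16k one.
print(pick_model(["gpt-3.5-turbo", "gpt-3.5-turbo-16k"], 6000))  # gpt-3.5-turbo-16k
```

Keeping the list ordered from cheapest to largest means small requests still go to the default model.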

0.5.0 - 2023-06-06

Restore PaginatedSchemaScraper and add documentation for pagination.
Documentation improvements.
Small quality-of-life improvements such as better pydantic schema support and more useful error messages.

0.4.4 - 2023-03-31

Deactivate HallucinationCheck by default; it is overly aggressive and needs more work to be useful without raising false positives.
Fix the postprocessors parameter not overriding the defaults.

0.4.2 - 2023-03-26

Fix type bug with JSON nudging.
Improve HallucinationCheck to handle more cases.
More tests!

0.4.1 - 2023-03-24

Fix bug with HallucinationCheck.

0.4.0 - 2023-03-24

New configurable pre- and post-processing pipelines for customizing behavior.
Addition of ScrapeResult object to hold results of scraping along with metadata.
Support for pydantic models as schemas and for validation.
"Hallucination" check to ensure that the data in the response truly exists on the page.
Use post-processing pipeline to "nudge" JSON errors to a better result.
Now fully type-annotated.
Another big refactor: separation of API calls and scraping logic.
Finally, a ghost logo reminiscent of the library's namesake.
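
As a rough illustration of the "hallucination" check idea, the sketch below flags string values in a response that do not appear verbatim in the page. It is an assumption about the general technique, not the library's HallucinationCheck; `missing_values` is a hypothetical helper.

```python
def missing_values(data, page_text: str) -> list[str]:
    """Collect string values from nested JSON-like data that are absent from page_text."""
    if isinstance(data, dict):
        return [m for value in data.values() for m in missing_values(value, page_text)]
    if isinstance(data, list):
        return [m for value in data for m in missing_values(value, page_text)]
    if isinstance(data, str) and data not in page_text:
        return [data]
    return []  # numbers, booleans, None, and strings found on the page


html = "<h1>Ada Lovelace</h1><p>Born 1815</p>"
response = {"name": "Ada Lovelace", "city": "London"}
print(missing_values(response, html))  # ['London']
```

A verbatim match like this is strict: a value the model has reformatted (trimmed whitespace, a normalized date) would be flagged even though the underlying data is on the page.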

0.3.0 - 2023-03-20

Add tests, docs, and complete examples!
Add preprocessors to SchemaScraper to allow for a uniform interface for cleaning & selecting HTML.
Use tiktoken for accurate token counts.
New cost_estimate utility function.
Cost is now tracked on a per-scraper basis (see the total_cost attribute on SchemaScraper objects).
SchemaScraper now takes a max_cost parameter to limit the total cost of a scraper.
Prompt improvements, list mode simplification.
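
The per-scraper cost tracking and cap described above could work along these lines. This is a sketch under stated assumptions: `CostBudget`, `check`, and `record` are hypothetical names, and the per-1,000-token rates are illustrative placeholders rather than values taken from the library.

```python
# Illustrative prices in dollars per 1,000 tokens (assumptions for this sketch).
PROMPT_RATE = 0.03
COMPLETION_RATE = 0.06


class CostBudget:
    """Hypothetical spend tracker; the attribute names mirror the changelog only."""

    def __init__(self, max_cost: float) -> None:
        self.max_cost = max_cost
        self.total_cost = 0.0

    def check(self) -> None:
        """Call before each request; refuse to continue once the cap is reached."""
        if self.total_cost >= self.max_cost:
            raise RuntimeError(
                f"total cost ${self.total_cost:.2f} reached max_cost ${self.max_cost:.2f}"
            )

    def record(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Add the cost of one finished request to the running total."""
        cost = (prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE) / 1000
        self.total_cost += cost
        return cost


budget = CostBudget(max_cost=0.10)
budget.check()                                            # nothing spent yet
budget.record(prompt_tokens=2000, completion_tokens=500)  # roughly $0.09
budget.check()                                            # still under the cap
print(f"{budget.total_cost:.2f}")
```

Checking before each request, rather than after, means the cap can be overshot by at most one request's cost.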

0.2.0 - 2023-03-18

Add list mode, auto-splitting, and pagination support.
Improve xpath and css handling.
Improve prompt for GPT 3.5.
Make it possible to alter parameters when calling scrape.
Logging & error handling.
Command line interface.
See blog post for details: https://jamesturk.net/posts/scraping-with-gpt-part-2/

0.1.0 - 2023-03-17

Initial experiment, see blog post for more: https://jamesturk.net/posts/scraping-with-gpt-4/

13 Jun 22:50

jamesturk

0.5.1

5216267

0.5.1

Changelog

0.5.1 - 2023-06-13

Improve type annotations and remove some ignored errors.
Support for new OpenAI models announced June 13th 2023.
Improved support for model fallbacks. Now if a request has 6k tokens and the model list looks like ['gpt-3.5-turbo', 'gpt-3.5-turbo-16k'], the 16k model will be used automatically since the default 4k model will not be able to handle the request.

06 Jun 22:39

jamesturk

0.5.0

946d252

0.5.0

Changelog

0.5.0 - 2023-06-06

Restore PaginatedSchemaScraper and add documentation for pagination.
Documentation improvements.

31 Mar 16:55

jamesturk

0.4.4

2235dbb

0.4.4

Changelog

0.4.4 - 2023-03-31

Deactivate HallucinationCheck by default; it is overly aggressive and needs more work to be useful without raising false positives.
Fix the postprocessors parameter not overriding the defaults.

26 Mar 18:42

jamesturk

0.4.2

3d17cc2

0.4.2

Changelog

0.4.2 - 2023-03-26

Fix bug with JSON nudging.
Improve HallucinationCheck to handle more cases.

25 Mar 01:05

jamesturk

0.4.1

92d9c55

0.4.1

Changelog

0.4.1 - 2023-03-24

Fix bug with HallucinationCheck.

25 Mar 00:48

jamesturk

0.4.0

605dc82

0.4.0

Changelog

0.4.0 - 2023-03-24

New configurable pre- and post-processing pipelines for customizing behavior.
Addition of ScrapeResult object to hold results of scraping along with metadata.
Support for pydantic models as schemas and for validation.
"Hallucination" check to ensure that the data in the response truly exists on the page.
Use post-processing pipeline to "nudge" JSON errors to a better result.
Now fully type-annotated.
Another big refactor: separation of API calls and scraping logic.
Finally, a ghost logo reminiscent of the library's namesake.

20 Mar 10:09

jamesturk

0.3.0

c2298d1

0.3.0

CLI fix

19 Mar 02:23

jamesturk

0.2.0

991986a

0.2.0

CLI options

18 Mar 03:01

jamesturk

0.1.0

e518012

0.1.0

Update README.md

Releases: jamesturk/scrapeghost
