Releases: jamesturk/scrapeghost

0.6.0

25 Nov 01:10

Changelog

0.6.0

  • move to supporting Python 3.11 and 3.12
  • move to openai 1.0
  • move to pydantic 2.0
  • add support for November 2023 model upgrades

0.5.1 - 2023-06-13

  • Improve type annotations and remove some ignored errors.
  • Support for new OpenAI models announced June 13th 2023.
  • Improved support for model fallbacks. Now if a request has 6k tokens and the model list looks like ['gpt-3.5-turbo', 'gpt-3.5-turbo-16k'], the 16k model will be used automatically since the default 4k model will not be able to handle the request.
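
The fallback rule described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: `CONTEXT_LIMITS` and `pick_model` are hypothetical names, and the token limits are assumed values matching the "4k" and "16k" figures in the note.

```python
# Assumed context-window sizes in tokens (illustrative, not read from the library).
CONTEXT_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
}


def pick_model(models: list[str], request_tokens: int) -> str:
    """Return the first model in `models` whose context window fits the request."""
    for model in models:
        if request_tokens <= CONTEXT_LIMITS[model]:
            return model
    raise ValueError(f"no model in {models} can handle {request_tokens} tokens")


# A 6k-token request skips the 4k model and lands on the 16k model.
chosen = pick_model(["gpt-3.5-turbo", "gpt-3.5-turbo-16k"], 6000)
print(chosen)
```

Keeping the list ordered from cheapest to largest means small requests still go to the first model, and only oversized requests fall through.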

0.5.0 - 2023-06-06

  • Restore PaginatedSchemaScraper and add documentation for pagination.
  • Documentation improvements.
  • Small quality-of-life improvements such as better pydantic schema support and
    more useful error messages.

0.4.4 - 2023-03-31

  • Deactivate HallucinationCheck by default; it is overly aggressive and needs more work to be useful without raising false positives.
  • Bugfix for postprocessors parameter behavior not overriding defaults.

0.4.2 - 2023-03-26

  • Fix type bug with JSON nudging.
  • Improve HallucinationCheck to handle more cases.
  • More tests!

0.4.1 - 2023-03-24

  • Fix bug with HallucinationCheck.

0.4.0 - 2023-03-24

  • New configurable pre- and post-processing pipelines for customizing behavior.
  • Addition of ScrapeResult object to hold results of scraping along with metadata.
  • Support for pydantic models as schemas and for validation.
  • "Hallucination" check to ensure that the data in the response truly exists on the page.
  • Use post-processing pipeline to "nudge" JSON errors to a better result.
  • Now fully type-annotated.
  • Another big refactor, separation of API calls and scraping logic.
  • Finally, a ghost logo reminiscent of the library's namesake.
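
To make the "hallucination" check idea concrete, here is a minimal sketch, assuming the check amounts to confirming that every string in the response occurs verbatim in the page. The function name `find_missing_values` is hypothetical and this is not the library's HallucinationCheck.

```python
def find_missing_values(data, page_text: str) -> list[str]:
    """Collect string values from `data` that never appear in `page_text`."""
    missing = []
    if isinstance(data, dict):
        for value in data.values():
            missing.extend(find_missing_values(value, page_text))
    elif isinstance(data, list):
        for item in data:
            missing.extend(find_missing_values(item, page_text))
    elif isinstance(data, str) and data not in page_text:
        # The value is not a literal substring of the page.
        missing.append(data)
    return missing


html = "<h1>Ada Lovelace</h1><p>Mathematician</p>"
response = {"name": "Ada Lovelace", "roles": ["Mathematician", "Pilot"]}
missing = find_missing_values(response, html)
print(missing)  # ['Pilot']
```

A strict substring test like this one also flags values that were only reformatted (trimmed whitespace, changed case), so a check built this way would be prone to false positives.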

0.3.0 - 2023-03-20

  • Add tests, docs, and complete examples!
  • Add preprocessors to SchemaScraper to allow for uniform interface for cleaning & selecting HTML.
  • Use tiktoken for accurate token counts.
  • New cost_estimate utility function.
  • Cost is now tracked on a per-scraper basis (see the total_cost attribute on SchemaScraper objects).
  • SchemaScraper now takes a max_cost parameter to limit the total cost of a scraper.
  • Prompt improvements, list mode simplification.
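
The per-scraper cost tracking and `max_cost` limit can be pictured with a small sketch. Only the names `total_cost` and `max_cost` come from the notes above; the `CostTracker` class, the `CostLimitExceeded` exception, the per-1k-token price, and the choice to refuse a request before it is made are all assumptions for illustration.

```python
class CostLimitExceeded(Exception):
    """Raised when a request would push spending past the configured limit."""


class CostTracker:
    def __init__(self, max_cost: float) -> None:
        self.max_cost = max_cost
        self.total_cost = 0.0

    def charge(self, tokens: int, price_per_1k: float) -> float:
        """Add the cost of one request, refusing it if the limit would be exceeded."""
        cost = tokens / 1000 * price_per_1k
        if self.total_cost + cost > self.max_cost:
            raise CostLimitExceeded(
                f"total {self.total_cost + cost:.4f} would exceed max_cost {self.max_cost:.4f}"
            )
        self.total_cost += cost
        return cost


tracker = CostTracker(max_cost=0.01)
tracker.charge(2000, price_per_1k=0.002)  # running total is about 0.004
tracker.charge(2000, price_per_1k=0.002)  # running total is about 0.008
try:
    tracker.charge(2000, price_per_1k=0.002)  # would bring the total to about 0.012
    limit_hit = False
except CostLimitExceeded:
    limit_hit = True
print(limit_hit)
```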

0.2.0 - 2023-03-18

  • Add list mode, auto-splitting, and pagination support.
  • Improve xpath and css handling.
  • Improve prompt for GPT 3.5.
  • Make it possible to alter parameters when calling scrape.
  • Logging & error handling.
  • Command line interface.
  • See blog post for details: https://jamesturk.net/posts/scraping-with-gpt-part-2/
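
As a rough picture of auto-splitting, the sketch below groups page fragments into chunks that each stay within a token budget, so each chunk could be sent as its own request. It is an assumption-laden illustration, not the library's code: `split_into_chunks` is a hypothetical name, and tokens are approximated by whitespace-separated words.

```python
def split_into_chunks(fragments: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily group fragments so each chunk stays within `max_tokens`."""
    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for fragment in fragments:
        tokens = len(fragment.split())  # crude word-count stand-in for a tokenizer
        if current and current_tokens + tokens > max_tokens:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current = []
            current_tokens = 0
        current.append(fragment)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


rows = ["a b c", "d e", "f g h i", "j"]
chunks = split_into_chunks(rows, max_tokens=5)
print(chunks)  # [['a b c', 'd e'], ['f g h i', 'j']]
```

A fragment larger than the budget still gets a chunk of its own here; handling that case differently would be a separate design decision.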

0.1.0 - 2023-03-17

0.5.1

13 Jun 22:50

Changelog

0.5.1 - 2023-06-13

  • Improve type annotations and remove some ignored errors.
  • Support for new OpenAI models announced June 13th 2023.
  • Improved support for model fallbacks. Now if a request has 6k tokens and the model list looks like ['gpt-3.5-turbo', 'gpt-3.5-turbo-16k'], the 16k model will be used automatically since the default 4k model will not be able to handle the request.

0.5.0

06 Jun 22:39

Changelog

0.5.0 - 2023-06-06

  • Restore PaginatedSchemaScraper and add documentation for pagination.
  • Documentation improvements.

0.4.4

31 Mar 16:55

Changelog

0.4.4 - 2023-03-31

  • Deactivate HallucinationCheck by default; it is overly aggressive and needs more work to be useful without raising false positives.
  • Bugfix for postprocessors parameter behavior not overriding defaults.

0.4.2

26 Mar 18:42

Changelog

0.4.2 - 2023-03-26

  • Fix bug with JSON nudging.
  • Improve HallucinationCheck to handle more cases.

0.4.1

25 Mar 01:05

Changelog

0.4.1 - 2023-03-24

  • Fix bug with HallucinationCheck.

0.4.0

25 Mar 00:48

Changelog

0.4.0 - 2023-03-24

  • New configurable pre- and post-processing pipelines for customizing behavior.
  • Addition of ScrapeResult object to hold results of scraping along with metadata.
  • Support for pydantic models as schemas and for validation.
  • "Hallucination" check to ensure that the data in the response truly exists on the page.
  • Use post-processing pipeline to "nudge" JSON errors to a better result.
  • Now fully type-annotated.
  • Another big refactor, separation of API calls and scraping logic.
  • Finally, a ghost logo reminiscent of the library's namesake.

0.3.0

20 Mar 10:09
CLI fix

0.2.0

19 Mar 02:23
CLI options

0.1.0

18 Mar 03:01
e518012
Update README.md