About results of the WebScraper #34

@NTU-P04922004

Hi there,

While testing the OpenDeepSearch code, I found that the WebScraper does not return useful results: the content variable in ExtractionResult is simply None. Is this the expected behavior?

Actually, I turned on the debug flag in the WebScraper class and found the following error:

Using Jina Reranker

Debug: Attempting extraction with strategy: no_extraction
Debug: URL: https://www.nps.gov/articles/000/july-2nd-1881-a-second-assassination.htm
Debug: Strategy config: <crawl4ai.extraction_strategy.NoExtractionStrategy object at 0x7cd6a9885390>
[INIT].... → Crawl4AI 0.5.0.post4
[FETCH]... ↓ https://www.nps.gov/articles/000/july-2nd-1881-a-s... | Status: True | Time: 2.03s
[SCRAPE].. ◆ https://www.nps.gov/articles/000/july-2nd-1881-a-s... | Time: 0.115s
[COMPLETE] ● https://www.nps.gov/articles/000/july-2nd-1881-a-s... | Status: True | Total: 2.16s

extraction_config.name no_extraction
Debug: Processed content: None
Debug: Exception occurred during extraction:
Traceback (most recent call last):
  File "/content/OpenDeepSearch/src/opendeepsearch/context_scraping/crawl4ai_scraper.py", line 183, in extract
    extraction_result.raw_markdown_length = len(result.markdown_v2.raw_markdown)
                                                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/crawl4ai/async_webcrawler.py", line 72, in __getattr__
    return getattr(self._results[0], attr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pydantic/main.py", line 986, in __getattr__
    return super().__getattribute__(item)  # Raises AttributeError if appropriate
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/crawl4ai/models.py", line 216, in markdown_v2
    raise AttributeError(
AttributeError: The 'markdown_v2' attribute is deprecated and has been removed. Please use 'markdown' instead, which now returns a MarkdownGenerationResult, with
            following properties:
            - raw_markdown: The raw markdown string
            - markdown_with_citations: The markdown string with citations
            - references_markdown: The markdown string with references
            - fit_markdown: The markdown string with fit text

It seems the extraction fails because crawl4ai_scraper.py (line 183) still reads result.markdown_v2, which Crawl4AI 0.5.0.post4 has removed in favor of markdown. Any ideas?
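Going by the error message above, one possible workaround is to read raw_markdown from result.markdown instead of result.markdown_v2. The sketch below is only an illustration, not the project's actual fix: get_raw_markdown is a hypothetical helper name, and the branch for a plain-string markdown attribute is an assumption about how older Crawl4AI releases behaved. The stand-in objects at the bottom only mimic the attribute layout so the snippet runs without Crawl4AI installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_raw_markdown(result: Any) -> Optional[str]:
    """Return the raw markdown of a crawl result, or None if unavailable.

    Hypothetical helper: it avoids touching `markdown_v2`, which raises
    AttributeError in the Crawl4AI version shown in the traceback.
    """
    markdown = getattr(result, "markdown", None)
    if markdown is None:
        return None
    # Layout described by the error message: `markdown` is an object
    # exposing `raw_markdown`.
    raw = getattr(markdown, "raw_markdown", None)
    if raw is not None:
        return raw
    # Assumed older layout: `markdown` is already a plain string.
    if isinstance(markdown, str):
        return markdown
    return None


# Stand-ins that only imitate the attribute layout (not real Crawl4AI objects).
new_style = SimpleNamespace(markdown=SimpleNamespace(raw_markdown="# Title"))
old_style = SimpleNamespace(markdown="# Title")

raw = get_raw_markdown(new_style)
raw_markdown_length = len(raw) if raw is not None else 0
print(raw_markdown_length)
```

With a helper like this, the failing line in extract could compute the length from the returned string and fall back to 0 when no markdown is available, instead of raising inside the try block and leaving content as None.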
