HTML_TAG_MAPPING error during scrape #701

Closed
beefyandbeef opened this issue Sep 19, 2024 · 2 comments · Fixed by #721
Labels
bug Something isn't working

Comments

@beefyandbeef

I'm using scrapy, playwright, and trafilatura, and I get this error on certain pages.

The KeyError: None indicates that the code is looking up a key in the HTML_TAG_MAPPING dictionary using a value that is None. The error is raised inside the trafilatura library, specifically in the htmlprocessing.py file.
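A minimal sketch of how this kind of error can arise, for illustration only. The dictionary contents below are a hypothetical stand-in (the real HTML_TAG_MAPPING may differ), and the standard library's ElementTree is used in place of whatever element type trafilatura passes around:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for HTML_TAG_MAPPING; the real contents may differ.
HTML_TAG_MAPPING = {"#b": "b", "#i": "i"}

# A <hi> element that carries no 'rend' attribute.
elem = ET.fromstring("<hi>no rend attribute</hi>")
rend = elem.get("rend")  # a missing attribute yields None

try:
    HTML_TAG_MAPPING[rend]  # direct indexing with None as the key
    error_message = None
except KeyError as exc:
    error_message = f"KeyError: {exc}"

print(error_message)
```

Indexing with square brackets raises as soon as the key is absent, and None is just another absent key here.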

To fix this issue, the lookup needs to handle the case where elem.get('rend') returns None instead of assuming the attribute is always present.

Here is a step-by-step plan to address this issue:

  1. Locate the code: identify where the elem.get('rend') call is made.
  2. Handle None values: add a check for cases where elem.get('rend') returns None.

Example Fix
In the trafilatura/htmlprocessing.py file, locate the following line:

"hi": lambda elem: HTML_TAG_MAPPING[elem.get('rend')]

Update it to handle None values:

"hi": lambda elem: HTML_TAG_MAPPING.get(elem.get('rend'), default_value)

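Here, default_value is what .get() returns when the rend value is not a key in HTML_TAG_MAPPING (including when it is None). It should therefore be a fallback of the same kind as the dictionary's values, not one of its keys.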
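The fallback lookup can be sketched as follows. This is an illustration under assumptions, not the library's actual code: the mapping contents, the DEFAULT_TAG value, and the convert_hi helper are all hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for HTML_TAG_MAPPING; the real contents may differ.
HTML_TAG_MAPPING = {"#b": "b", "#i": "i", "#u": "u"}

# Assumed fallback; which tag makes sense is a maintainer decision.
DEFAULT_TAG = "span"


def convert_hi(elem):
    """Return the tag for a <hi> element, using DEFAULT_TAG when
    the 'rend' attribute is missing or not in the mapping."""
    return HTML_TAG_MAPPING.get(elem.get("rend"), DEFAULT_TAG)


with_rend = ET.fromstring('<hi rend="#b">bold</hi>')
without_rend = ET.fromstring("<hi>plain</hi>")
unknown_rend = ET.fromstring('<hi rend="#zz">odd</hi>')

print(convert_hi(with_rend))
print(convert_hi(without_rend))
print(convert_hi(unknown_rend))
```

Note that dict.get also silences unknown rend values, not only None. If unknown values should still surface as errors, an explicit check for None before the lookup would be the narrower change.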

Summary
Locate the issue: find where elem.get('rend') is called.
Handle None values: use HTML_TAG_MAPPING.get(elem.get('rend'), default_value) so that a missing rend attribute falls back to default_value.
This change avoids the KeyError: None when elem.get('rend') returns None.

@adbar
Owner

adbar commented Oct 1, 2024

Hi @beefyandbeef, thanks for the detailed summary, this is a bug indeed. Are you interested in drafting a pull request?

@adbar adbar added the bug Something isn't working label Oct 1, 2024
@adbar
Owner

adbar commented Oct 15, 2024

Note: related to #720.
