Add fallback for extract
Niels Ringler committed Oct 8, 2024
1 parent bd7fe51 commit ef835b1
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion src/helpers.py
@@ -54,5 +54,9 @@ def extract_urlnews(url) -> List[str]:
    # Filter out SVG images and data URI images
    article_images = [img for img in article_images if
                      not (img.lower().endswith('.svg') or img.lower().startswith('data:image/svg+xml'))]

    # Fallback if Article(url) doesn't get enough text
    if len(article.text) < 1000:
        paragraphs = soup.find_all('p')
        text = ' '.join(p.get_text(strip=True) for p in paragraphs)
        article.text = text
    return article.title, article.text, article_images
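
The fallback in this hunk can be sketched in isolation as below. This is a minimal illustration, not the code in src/helpers.py: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs without dependencies, and the names `ParagraphCollector`, `paragraph_text`, `with_fallback`, and `min_chars` are hypothetical. Unlike the committed change, it keeps the primary text when the paragraph scrape turns out to be shorter, which is one possible refinement rather than the commit's behaviour.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._in_p = False    # True while inside a <p>
        self._chunks = []     # text pieces of the current paragraph

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:
                self._flush()  # an unclosed <p> ends at the next <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def paragraph_text(html: str) -> str:
    """Return the text of all <p> elements joined by single spaces."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing unclosed <p>, if any
    return " ".join(parser.paragraphs)


def with_fallback(primary_text: str, html: str, min_chars: int = 1000) -> str:
    """Use paragraph text when the primary extraction is too short.

    The 1000-character default mirrors the threshold in the diff; the
    "only if longer" check is an assumption added for this sketch.
    """
    if len(primary_text) >= min_chars:
        return primary_text
    fallback = paragraph_text(html)
    return fallback if len(fallback) > len(primary_text) else primary_text
```

For example, `with_fallback(article_text, page_html)` would return `article_text` unchanged whenever it already has at least 1000 characters, and otherwise whichever of the two candidates is longer.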
