Conversation
Contributor
Author
This is the final script I used to extract the article data to port to the new site:
"""Script to scrape articles from old OceanParcels website."""
import requests
from bs4 import BeautifulSoup
import json
import re
import sys
def scrape_articles(url):
try:
# Fetch the webpage
response = requests.get(url)
response.raise_for_status() # Raise an exception for bad status codes
# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
# Find all card elements
cards = soup.find_all("div", class_="card")
# List to store extracted article information
articles = []
# Process each card
for card in cards:
try:
# Extract title from h5 element
title_elem = card.find("h5")
title = title_elem.get_text(strip=True) if title_elem else ""
# Extract authors (text immediately after h5)
authors = ""
if title_elem and title_elem.next_sibling:
authors = (
title_elem.next_sibling.strip()
if isinstance(title_elem.next_sibling, str)
else ""
)
# Extract published info (journal, volume, pages)
published_info = ""
if title_elem:
# Find all text between authors and <br/>
next_elem = title_elem.find_next_sibling()
while next_elem and next_elem.name != "br":
if isinstance(next_elem, str):
published_info += next_elem.strip() + " "
else:
published_info += next_elem.get_text(strip=True) + " "
next_elem = next_elem.next_sibling
published_info = published_info.strip()
# Extract DOI from card-link
# Extract DOI from card-link
doi_link = card.find(
"a", class_="card-link", href=lambda href: href and "doi" in href
)
if doi_link:
doi = doi_link.get("href", "")
# Extract abstract from card-body
card_body = card.find("div", class_="card-body")
abstract = card_body.get_text(strip=True) if card_body else ""
# Clean up abstract by replacing newlines and multiple spaces with single space
authors = authors.rstrip(",")
published_info = re.sub(r"\s*,", ",", published_info)
# Create article dictionary
article = {
"title": title,
"published_info": published_info,
"authors": authors,
"doi": doi,
"abstract": abstract,
}
article = {k: re.sub(r"\n\s*", " ", v) for k, v in article.items()}
articles.append(article)
except Exception as card_error:
print(f"Error processing card: {card_error}")
print("Problematic card HTML:")
print(card.prettify())
sys.exit(1)
# Make articles chronological
articles.reverse()
# Save to JSON file
with open("articles.json", "w", encoding="utf-8") as f:
json.dump(articles, f, indent=2, ensure_ascii=False)
print(f"Successfully scraped {len(articles)} articles.")
return articles
except requests.RequestException as e:
print(f"Error fetching URL: {e}")
sys.exit(1)
# Main execution
if __name__ == "__main__":
url = "https://oceanparcels.org/articles.html"
scrape_articles(url) |
Contributor
Author
Updated view @erikvansebille
erikvansebille
approved these changes
Mar 28, 2025
Contributor
Author
Let's merge on Monday :)
27b1712 to b738a5a
Contributor
Author
Should we remove the placeholder?
Member
Good point, I just removed it. I'll write it in the coming week(?). But perhaps you(?) could write a short blog post celebrating the new website launch, highlighting that we thank xarray for the design?
Contributor
Author
done :)
Rename sponsors to funders

Created the new OceanParcels website. Used https://xarray.dev as a starting point.
Made sure in the migration to bring example_data across, as that is how parcels downloads example datasets.

Items still TODO:

Fixes #112