-
Notifications
You must be signed in to change notification settings - Fork 1
/
Web scrapping with beautiful soup.txt
71 lines (45 loc) · 3.07 KB
/
Web scrapping with beautiful soup.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
Installation
First, you need to install the Beautiful Soup library. You can do this by running the following command in your terminal:
pip install beautifulsoup4
Importing Libraries
To use Beautiful Soup, you need to import the BeautifulSoup class from the bs4 module and the requests library to send HTTP requests:
from bs4 import BeautifulSoup
import requests
Sending an HTTP Request
To scrape a website, you first need to send an HTTP request to the website's server to retrieve the HTML content of the page. You can use the get function from the requests library to send an HTTP request:
url = 'https://www.example.com'
page = requests.get(url)
Parsing the HTML Content
Once you have the HTML content of the page, you can parse it using the BeautifulSoup class. You need to pass the HTML content and the parser library (e.g., 'html.parser') as arguments to the BeautifulSoup constructor:
soup = BeautifulSoup(page.content, 'html.parser')
Finding Elements
You can use the find or find_all methods to search for elements in the HTML tree. These methods take the HTML tag of the element as an argument and return a list of matching elements. You can also use optional keyword arguments to specify additional search criteria:
# Find all elements with the 'a' tag
links = soup.find_all('a')
# Find all elements with the 'div' tag and the 'class' attribute set to 'header'
headers = soup.find_all('div', class_='header')
# Find all elements with the 'a' tag and the 'href' attribute starting with '/users'
users = soup.find_all('a', href=lambda x: x and x.startswith('/users'))
Extracting Data
Once you have found the elements you are interested in, you can extract the data you need using the element's attributes and methods. For example, you can use the text attribute to get the text content of the element, the get method to get the value of an attribute, and the find method to search for child elements:
# Get the text content of the element
text = element.text
# Get the value of the 'href' attribute
link = element.get('href')
# Find the first child element with the 'p' tag
paragraph = element.find('p')
Putting It All Together
Here is an example of how you can use the Beautiful Soup library to scrape a website:
# Send an HTTP request to the website
url = 'https://www.example.com'
page = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(page.content, 'html.parser')
# Find all elements with the 'a' tag and the 'href' attribute starting with '/users'
users = soup.find_all('a', href=lambda x: x and x.startswith('/users'))
# Extract the data you need
for user in users:
username = user.text
user_link = user.get('href')
print(username, user_link)
This code sends an HTTP request to the website at the specified URL, parses the HTML content of the page, finds all elements with the a tag and the href attribute starting with '/users', and extracts the text content of the element (the username) and the value of the href attribute (the link to the user's profile). It then prints the username and user link for each user.