Chapter 4_Page 4_Classic exercise on HTML parsing: collecting, retrieving, and processing input data for an ETL data transform

Exercise on HTML parsing: collecting, retrieving, and processing input data for an ETL data transform:

I have a file, myspecs.html, with the following sample content:
<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="generator" content="lshw-B.02.18" /> <style type="text/css"> .first {font-weight: bold; margin-left: none; padding-right: 1em;vertical-align: top; } .second {padding-left: 1em; width: 100%; vertical-align: center; } .id {font-family: monospace;} .indented {margin-left: 2em; border-left: dotted thin #dde; padding-bottom: 1em; } .node {border: solid thin #ffcc66; padding: 1em; background: #ffffcc; } .node-unclaimed {border: dotted thin #c3c3c3; padding: 1em; background: #fafafa; color: red; } .node-disabled {border: solid thin #f55; padding: 1em; background: #fee; color: gray; } </style> <title>geoai-server</title> </head> <body> <div class="indented"> <table width="100%" class="node" summary="attributes of geoai-server"> <thead><tr><td class="first">id:</td><td class="second"><div class="id">geoai-server</div></td></tr></thead> <tbody> <tr><td class="first">description: </td><td class="second">Mini Tower Computer</td></tr> <tr><td class="first">product: </td><td class="second">HP Z620 Workstation (G6C36US#ABA)</td></tr> <tr><td class="first">vendor: </td><td class="second">Hewlett-Packard</td></tr> <tr><td class="first">serial: </td><td class="second">2UA4251XTQ</td></tr> <tr><td class="first">width: </td><td class="second">64 bits</td></tr> <tr><td class="first">capabilities: </td><td class="second"><dfn title="SMBIOS version 2.7">smbios-2.7</dfn> <dfn title="DMI version 2.7">dmi-2.7</dfn> <dfn title="Symmetric Multi-Processing">smp</dfn> <dfn title="32-bit processes">vsyscall32</dfn> </td></tr> <tr><td class="first">configuration:</td><td class="second"><table summary="configuration of geoai-server"><tr><td class="sub-first"> boot</td><td>=</td><td>normal</td></tr><tr><td class="sub-first"> chassis</td><td>=</td><td>mini-tower</td></tr><tr><td 
class="sub-first"> family</td><td>=</td><td>103C_53335X G=D</td></tr><tr><td class="sub-first"> sku</td><td>=</td><td>G6C36US#ABA</td></tr><tr><td class="sub-first"> uuid</td><td>=</td><td>00B085CD-99F0-E311-833C-A0D3C11AFCF0</td></tr></table></td></tr> </tbody></table></div> <div class="indented"> <div class="indented"> <table width="100%" class="node" summary="attributes of core"> <thead><tr><td class="first">id:</td><td class="second"><div class="id">core</div></td></tr></thead> <tbody> <tr><td class="first">description: </td><td class="second">Motherboard</td></tr> <tr><td class="first">product: </td><td class="second">158A</td></tr> <tr><td class="first">vendor: </td><td class="second">Hewlett-Packard</td></tr> <tr><td class="first">physical id: </td><td class="second"><div class="id">0</div></td></tr> <tr><td class="first">version: </td><td class="second">0.00</td></tr> <tr><td class="first">serial: </td><td class="second">2UA4251XTQ</td></tr> </tbody> </table></div>
I want to use Python and libraries such as Beautiful Soup to parse the HTML file above and extract the data as CSV, with the values arranged in columns. How can I do that?

;) ;) ;) --> Too hard, skip it? The AI/ML answered:

Hello! To parse the HTML and extract data from your file, we can use the BeautifulSoup library in Python. The example below shows one way to use BeautifulSoup to produce CSV data from your HTML file.
When you run the code, it reads the data from the HTML file and prints it in CSV form, one table row per line. You can change or customize the code to pick specific values or to process the data the way you need.

Give it a try and let me know if you run into any problems or have other requests. I will do my best to help!

from bs4 import BeautifulSoup

# Read the contents of the HTML file (the file declares UTF-8 encoding)
with open('myspecs.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find every <tr> element in the tables
rows = soup.find_all('tr')

# Build the list of CSV rows
csv_rows = []

# Loop over each row and collect the text of its cells
for row in rows:
    csv_row = []
    cols = row.find_all('td')
    for col in cols:
        csv_row.append(col.text.strip())
    csv_rows.append(csv_row)

# Print the CSV data
for row in csv_rows:
    print(','.join(row))

I solved this exercise as follows:

Step 1. Create a folder named data and copy the file myspecs.html into it:

Step 2. Create a project and open a notebook in it

Step 3. Copy in the source code from the AI's answer, debug it, and run tests to check the code; keep adjusting based on the results until it is fully usable.
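
One possible shape for the adjusted code in Step 3 is sketched below. It assumes every hardware node is a table with class "node", that keys sit in cells with class "first", and that a nested table holds "key = value" triples, as in the sample above. The names SAMPLE_HTML, extract_records, and records_to_csv are illustrative, and the sample is a shortened, hand-made excerpt so the sketch runs without the real file. Calling find_all('tr') on the whole page also returns the rows of the nested configuration table, so nested values appear twice; walking each node table separately avoids that.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the same shape as myspecs.html.
SAMPLE_HTML = """
<div class="indented">
<table width="100%" class="node" summary="attributes of geoai-server">
 <thead><tr><td class="first">id:</td><td class="second"><div class="id">geoai-server</div></td></tr></thead>
 <tbody>
  <tr><td class="first">description: </td><td class="second">Mini Tower Computer</td></tr>
  <tr><td class="first">configuration:</td><td class="second"><table summary="configuration of geoai-server"><tr><td class="sub-first"> boot</td><td>=</td><td>normal</td></tr><tr><td class="sub-first"> chassis</td><td>=</td><td>mini-tower</td></tr></table></td></tr>
 </tbody>
</table>
</div>
"""


def extract_records(html_text):
    """Return one (node_id, field, value) tuple per attribute row."""
    soup = BeautifulSoup(html_text, "html.parser")
    records = []
    for table in soup.find_all("table", class_="node"):
        id_cell = table.find("div", class_="id")
        node_id = id_cell.get_text(strip=True) if id_cell else ""
        for key_cell in table.find_all("td", class_="first"):
            field = key_cell.get_text(strip=True).rstrip(":")
            value_cell = key_cell.find_next_sibling("td")
            if field == "id" or value_cell is None:
                continue  # the id row is already used as node_id
            nested = value_cell.find("table")
            if nested is None:
                records.append((node_id, field, value_cell.get_text(" ", strip=True)))
                continue
            # Nested "key = value" table: emit one record per nested row.
            for sub_row in nested.find_all("tr"):
                cells = [td.get_text(strip=True) for td in sub_row.find_all("td")]
                if len(cells) == 3:
                    records.append((node_id, f"{field}.{cells[0]}", cells[2]))
    return records


def records_to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["node_id", "field", "value"])
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv(extract_records(SAMPLE_HTML))
print(csv_text)
```

To run it on the real file, read myspecs.html into a string and pass that string to extract_records instead of SAMPLE_HTML.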

The next step, analysis and integration with other tools such as Excel, SQL, and so on, is not covered in this exercise.

Solution to the exercise on HTML parsing for ETL input data transform: XML and Excel are easier to use:

How to use the beautifulsoup4 library to parse HTML

!pip install beautifulsoup4

Requirement already satisfied: beautifulsoup4 in c:\python311\lib\site-packages (4.12.3)
Requirement already satisfied: soupsieve>1.2 in c:\python311\lib\site-packages (from beautifulsoup4) (2.5)

from bs4 import BeautifulSoup

# Read the contents of the HTML file
# (raw string so the backslashes in the Windows path are not treated as escapes)
with open(r'C:\Python311\workspaces\data\mySpecs.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find every <tr> element in the tables
rows = soup.find_all('tr')

# Build the list of CSV rows
csv_rows = []

# Loop over each row and collect the text of its cells
for row in rows:
    csv_row = []
    cols = row.find_all('td')
    for col in cols:
        csv_row.append(col.text.strip())
    csv_rows.append(csv_row)

# Print the CSV data
for row in csv_rows:
    print(','.join(row))
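
Joining cells with ','.join(row) produces valid CSV only while no cell contains a comma, a quote, or a line break. The short sketch below compares it with Python's standard csv module, which quotes such cells. The second row is a made-up value used only to trigger the problem; it does not come from myspecs.html.

```python
import csv
import io

rows = [
    ["product:", "HP Z620 Workstation (G6C36US#ABA)"],
    ["note:", "Mini Tower, refurbished"],  # hypothetical cell containing a comma
]

# Naive approach: plain string join, no quoting.
naive_lines = [",".join(row) for row in rows]

# csv.writer quotes any cell that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
quoted_lines = buffer.getvalue().splitlines()

# Reading both versions back shows how many columns each second row has.
naive_parsed = list(csv.reader(naive_lines))
quoted_parsed = list(csv.reader(quoted_lines))
print(len(naive_parsed[1]), len(quoted_parsed[1]))
```

The naive version splits the second row into three columns, while the csv.writer version keeps the original two.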

!pip install pandas
!pip install beautifulsoup4
!pip install openpyxl

Requirement already satisfied: pandas in c:\python311\lib\site-packages (2.1.3)
Requirement already satisfied: numpy<2,>=1.23.2 in c:\python311\lib\site-packages (from pandas) (1.26.1)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\python311\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\python311\lib\site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in c:\python311\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in c:\python311\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: beautifulsoup4 in c:\python311\lib\site-packages (4.12.3)
Requirement already satisfied: soupsieve>1.2 in c:\python311\lib\site-packages (from beautifulsoup4) (2.5)
Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-1.1.0-py3-none-any.whl.metadata (1.8 kB)
Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2

import pandas as pd
from bs4 import BeautifulSoup
# openpyxl does not need to be imported here; pandas uses it as the engine for .xlsx files

def html_to_excel(html_file):
    # Read the contents of the HTML file
    with open(html_file, 'r', encoding='utf-8') as f:
        html_content = f.read()

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all div and table tags in the HTML
    div_tags = soup.find_all('div')
    table_tags = soup.find_all('table')

    # Build the list of metadata records
    metadata_list = []

    # Extract the text of each div and table tag and add it to the metadata list
    for tag in div_tags + table_tags:
        metadata = {}
        metadata['tag'] = tag.name
        metadata['value'] = tag.text.strip()
        metadata_list.append(metadata)

    # Convert the metadata list to a DataFrame
    df = pd.DataFrame(metadata_list)

    # Save the DataFrame to an Excel file next to the HTML file
    excel_file = html_file.replace('.html', '.xlsx')
    df.to_excel(excel_file, index=False)

    return excel_file

# Raw string so the backslashes in the Windows path are not treated as escapes
excel_file = html_to_excel(r'C:\Python311\workspaces\data\mySpecs.html')
print('Excel file created:', excel_file)

Excel file created: C:\Python311\workspaces\data\mySpecs.xlsx
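
The function above stores one row per div or table tag, each holding the whole text of that tag, so nested tags repeat the same text. A layout with one row per hardware node and one column per field is usually easier to filter in Excel. The sketch below shows that pivot with pandas; the records list is hypothetical sample data in the shape a parser of myspecs.html might produce, and records_to_wide is an illustrative name.

```python
import pandas as pd

# Hypothetical long-format records, e.g. produced by a parser of myspecs.html.
records = [
    {"node_id": "geoai-server", "field": "description", "value": "Mini Tower Computer"},
    {"node_id": "geoai-server", "field": "vendor", "value": "Hewlett-Packard"},
    {"node_id": "core", "field": "description", "value": "Motherboard"},
    {"node_id": "core", "field": "vendor", "value": "Hewlett-Packard"},
    {"node_id": "core", "field": "version", "value": "0.00"},
]


def records_to_wide(records):
    """Pivot long records into one row per node and one column per field."""
    df = pd.DataFrame(records)
    wide = df.pivot(index="node_id", columns="field", values="value")
    wide.columns.name = None  # drop the leftover "field" label on the column axis
    return wide.reset_index()


wide_df = records_to_wide(records)
print(wide_df)
```

Fields that a node does not have (here, version for geoai-server) come out as missing values. Calling wide_df.to_excel with a file name of your choice and index=False would then write this table to a workbook, as in the function above.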

import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

def html_to_xml(html_file, xml_file):
    # Read the contents of the HTML file
    with open(html_file, 'r', encoding='utf-8') as f:
        html_content = f.read()

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Create the root element of the XML tree
    root = ET.Element("metadata")

    # Extract the text of each div and table tag and add it to the XML tree
    for tag in soup.find_all(['div', 'table']):
        metadata = ET.SubElement(root, tag.name)
        metadata.text = tag.text.strip()

    # Create an ElementTree object from the XML tree
    tree = ET.ElementTree(root)

    # Write the XML tree to the XML file
    tree.write(xml_file, encoding='utf-8', xml_declaration=True)

    return xml_file

# Call html_to_xml to convert the HTML file to an XML file
# (raw strings so the backslashes in the Windows paths are not treated as escapes)
xml_file = html_to_xml(r'C:\Python311\workspaces\data\mySpecs.html', r'C:\Python311\workspaces\data\mySpecs.xml')

# Report the result to the user (printed once, here, rather than also inside the function)
print('XML file created:', xml_file)

XML file created: C:\Python311\workspaces\data\mySpecs.xml
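
The XML written above holds one element per div or table tag with the tag's full text inside it. If downstream tools need to address individual fields, a structure with one element per node and one child per attribute may be easier to query. The sketch below builds that structure with the standard library only; the records list is hypothetical sample data, and the element names node and attribute are illustrative choices, not part of the original exercise.

```python
import xml.etree.ElementTree as ET

# Hypothetical (node_id, field, value) records, e.g. from a parser of myspecs.html.
records = [
    ("geoai-server", "description", "Mini Tower Computer"),
    ("geoai-server", "vendor", "Hewlett-Packard"),
    ("core", "description", "Motherboard"),
    ("core", "version", "0.00"),
]


def records_to_xml(records):
    """Group records under one <node> element per node_id."""
    root = ET.Element("metadata")
    nodes = {}
    for node_id, field, value in records:
        if node_id not in nodes:
            nodes[node_id] = ET.SubElement(root, "node", id=node_id)
        attribute = ET.SubElement(nodes[node_id], "attribute", name=field)
        attribute.text = value
    return root


root = records_to_xml(records)
ET.indent(root)  # pretty-print; available from Python 3.9
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Individual values can then be looked up with a path expression such as root.find("node[@id='core']/attribute[@name='description']").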

Prototype ISO standard for the conditions of connecting a GPU cluster

graph TD;
    A["TPU data sync at 128Gbps for TensorFlow, Python AI/ML"] --> B["GPU DDR4 33MHz"];
    B --> H["Do not configure Cl-GPU if the CPU-versus-GPU clock bottleneck is > 10%"];
    B --> C["CPU 2.5GHz"];
    C --> D["RAM bus x64"];
    D --> E["DDR3 RAM 1333MHz 0.8ns"];
    D --> I["Do not configure Cl-GPU if the RAM DIMM type differs: DDR3 versus GPU DDR4/5/6"];
    E --> F["Network CNA 25Gbps"];
    E --> J["Do not configure Cl-GPU if the NIC connects to an internal switch port of 1 - 10 Gbps"];
    F --> G["Cluster GPU between 2 host workstations"];
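
The three "do not configure" branches of the diagram can be read as a simple checklist. The sketch below encodes them as they are drawn, without judging whether the thresholds are right for a given setup. HostSpec, cluster_gpu_blockers, and the parameter names are hypothetical, and the bottleneck percentage is assumed to come from an external estimate such as an online bottleneck calculator.

```python
from dataclasses import dataclass


@dataclass
class HostSpec:
    bottleneck_percent: float  # CPU-versus-GPU bottleneck estimate, in percent
    system_ram_type: str       # e.g. "DDR3"
    gpu_memory_type: str       # e.g. "DDR4"
    nic_gbps: float            # speed of the NIC / switch port, in Gbps


def cluster_gpu_blockers(host, max_bottleneck=10.0, slow_nic_gbps=10.0):
    """Return the diagram's reasons for NOT configuring a GPU cluster."""
    blockers = []
    if host.bottleneck_percent > max_bottleneck:
        blockers.append("CPU-versus-GPU bottleneck above threshold")
    if host.system_ram_type.upper() != host.gpu_memory_type.upper():
        blockers.append("system RAM type differs from GPU memory type")
    if host.nic_gbps <= slow_nic_gbps:
        blockers.append("NIC or switch port is 10 Gbps or slower")
    return blockers


ok_host = HostSpec(bottleneck_percent=5.0, system_ram_type="DDR4",
                   gpu_memory_type="DDR4", nic_gbps=25.0)
bad_host = HostSpec(bottleneck_percent=31.0, system_ram_type="DDR3",
                    gpu_memory_type="DDR4", nic_gbps=10.0)

print(cluster_gpu_blockers(ok_host))
print(cluster_gpu_blockers(bad_host))
```

An empty list means none of the three blocking conditions from the diagram applies to that host.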

Check the CPU-versus-GPU clock bottleneck (> 31%): https://pc-builds.com/bottleneck-calculator/result/0wT0XF/1/general-tasks/1920x1080/

Interactive Online Teaching eBook (Vietnamese: Sách điện tử Tương tác dạy học trực tuyến, abbreviated SĐT-TTTT),

English: Interactive Learning eBook (short-name: iLebook),

Author: Lê Toàn Thắng,

Images provided by: AI_Creators,

Edited by: Lê Anh Hà Lan, ML_7b and ChatGPT, B2T,

Business Analyst (BA) is a profession that has long existed worldwide, but in Vietnam it is still fairly new (it appeared a little over 27 years ago).
I truly value this profession. In the course of my work I have faced many challenges, and through them I came to understand my own strengths and weaknesses. I have found that this field has many complex and difficult problems, but solving them feels wonderful.
The BA role exists to solve problems. That can mean turning something bad into something better, or improving what is already good. Bringing value to other people is something I can hardly dispute or ignore.
