Image/Video caption and credits removal #616

Open
hamsarajan opened this issue Jun 6, 2024 · 3 comments
Labels
documentation Docs in need of update or extension question Further information is requested

Comments

@hamsarajan

Thank you for this great library! I've encountered an issue when extracting text in Vietnamese. It seems that image/video captions or credits within articles are not being removed. An example would be this article: https://vietnamnet.vn/voi-va-keo-phanh-tay-o-to-quay-vong-dam-vao-xe-buyt-663482.html. Is there a way to extract only the meaningful parts of the text?

Snippet of Vietnamese text extracted:
Hành khách ngồi bên cạnh ghế tài xế đã kéo phanh tay khiến xe ô tô quay vòng và đâm vào xe buýt đang quay đầu.
Clip: Ô tô quay vòng đâm vào xe buýt
Sự việc này xảy ra trên một con đường tại tỉnh Hồ Bắc, Trung Quốc, vào ngày 22/7 vừa qua.
Hình ảnh từ đoạn video cho thấy, chiếc xe ô tô màu vàng đột nhiên mất lái, quay vòng sang làn đường ngược chiều và đâm vào hông của một chiếc xe buýt đang quay đầu.

Translation:
The passenger sitting next to the driver's seat pulled the handbrake, causing the car to spin and crash into the turning bus.
Clip: Car spins and crashes into bus
This incident happened on a road in Hubei province, China, on July 22.
Images from the video show that the yellow car suddenly lost control, spun into the opposite lane and crashed into the side of a turning bus.

adbar added the "question" label Jun 6, 2024
@adbar
Owner

adbar commented Jun 6, 2024

I lack the time to check the issue right now. You can pass a list of XPath expressions to the extraction function through its prune_xpath parameter.

@lgjluis

lgjluis commented Jun 24, 2024

Can you give us an example of prune_xpath? I have a "blockquote" HTML tag that I want to remove from the extraction. Can you give the code in Python?

adbar added the "documentation" label Jun 24, 2024
@adbar
Owner

adbar commented Jun 24, 2024

I can answer questions on issues but not provide code snippets. If it is not in the documentation, just look at the tests.
