next_page_link and dynamically loaded pages not following strip commands #258

Open
kour1er opened this issue Apr 26, 2021 · 0 comments

kour1er commented Apr 26, 2021

I'm not sure if this is an issue, or me doing something dumb :)

When using the 'next_page_link' command together with the strip command, the strip works on the first page, but the dynamically loaded subsequent pages don't seem to obey it. For example, this arstechnica article, Huawei’s HarmonyOS: “Fake it till you make it” meets OS development, has four pages. Take the following config as an example:

body: //div[contains(@class,'article-content')]
title: //div[@id='story']//h2[@class='title']
date: //div[@class='byline']/span[@class='posted']//abbr/@original-title
date: //div[@class='byline']/span[@class='posted']//abbr
date: //*[@class='byline']//time[@class='date']
author: //p[@class='byline']/span[@class='author']
author: //p[@class='byline']/a
next_page_link: //nav//a[contains(text(), 'Next')]/@href
next_page_link: //span[@class='numbers']//a/span[@class='next']/..
next_page_link: //nav//a/span[contains(text(), 'Next')]/../@href
strip: //p

The p tag only gets stripped on the first page, not on the additional three pages (stripping all p tags is obviously a silly example, but it's just to illustrate). Is there any way to force the rules (in this case the silly strip: //p) on the dynamically loaded subsequent pages?
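
To make the expected behaviour concrete, here is a minimal Python sketch. It is not the project's actual code: the function names `strip_matches` and `extract_all_pages` are hypothetical, the two pages are tiny made-up stand-ins for the real article, and the XPath expressions are simplified to the subset that Python's standard `xml.etree.ElementTree` understands. The point is only that the strip rules run on every fetched page before the bodies are joined.

```python
import xml.etree.ElementTree as ET


def strip_matches(root, path):
    """Remove every element under `root` that matches `path`."""
    # ElementTree elements have no parent pointer, so map child -> parent first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in root.findall(path):
        parent = parents.get(element)
        if parent is not None:
            parent.remove(element)


def extract_all_pages(pages, body_path, strip_paths):
    """Apply the same strip rules to each page, then merge the body contents."""
    merged = ET.Element("div")
    for markup in pages:
        root = ET.fromstring(markup)
        # Strip on *every* page, not just the first one.
        for path in strip_paths:
            strip_matches(root, path)
        body = root.find(body_path)
        if body is not None:
            merged.extend(list(body))
    return merged


# Hypothetical two-page article (assumed markup, for illustration only).
page_1 = "<html><body><div class='article-content'><h2>Part 1</h2><p>text</p></div></body></html>"
page_2 = "<html><body><div class='article-content'><h2>Part 2</h2><p>text</p></div></body></html>"

merged = extract_all_pages(
    [page_1, page_2],
    body_path=".//div[@class='article-content']",
    strip_paths=[".//p"],
)
print(ET.tostring(merged, encoding="unicode"))
```

With this ordering, no p element from either page survives in the merged output, which is the behaviour the config above would be expected to produce across all four pages.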
