Add a parameter to be parsed as item's datetime stamp #4

Open
martinrotter opened this issue Jan 17, 2024 · 38 comments
Labels
done enhancement New feature or request

Comments

@martinrotter
Contributor

No description provided.

@Owyn
Owner

Owyn commented Jan 17, 2024

Would that be useful? 🤔
RSS Guard already sorts new entries correctly even without dates, see martinrotter/rssguard#621 (comment)

Owyn added the enhancement label Jan 17, 2024
@RetroAbstract

> Would that be useful? 🤔

Yes. It would let article filters limit how many articles a CSS2RSS-based feed pulls in on the first fetch after the feed is created (or on the first fetch after the database is cleaned & optimized), and it would let this upcoming feature limit the number of articles properly in general.

Right now the article filter I use successfully limits the initial number of articles fetched for non-CSS2RSS feeds, but it cannot do the same for CSS2RSS feeds because all of their articles share the same timestamp. The article limit also behaves unexpectedly, for example when limiting a CSS2RSS feed to:

  • keeping 1 article, deleting important and non-read articles: at every fetch, RSS Guard oscillates at random between the first and second item found.
  • keeping 2 articles, deleting important and non-read articles: at every fetch, RSS Guard oscillates at random between the first and second item found for the first article, and between the first, second and third item found for the second article.

@Owyn
Owner

Owyn commented Jan 19, 2024

Well, that can of course be done as yet another parameter, but my worry is: will people actually bother to use it and match one more element with a new CSS selector in every feed for this? (even if they'd want to limit the number of fetched articles with the mentioned feature)

@martinrotter
Contributor Author

martinrotter commented Jan 19, 2024

Well, I myself would use that feature.

It would be nice if the provided element were automatically parsed into a date/time object. There is an excellent Python package, "dateparser", which provides this kind of automatic string -> date/time conversion.

@RetroAbstract

> Well, that can be done of course as yet another parameter, but my worry is - will people actually bother to use it and match one more element with a new CSS selector every feed for this? (even if they'd want to limit the number of fetched articles with this mentioned feature)

I would, yes.

@Owyn
Owner

Owyn commented Jan 21, 2024

@martinrotter articles keep getting re-marked as unread when their date changes even a little :-(

So with this feature they will stay unread forever (because many dates are relative).

2024-01-21.16-47-14.mp4

Tried in 4.6.3 - same result
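
To illustrate why relative dates cause this, here is a minimal Python sketch. It is not css2rss code: the `resolve_relative` helper and the "N hours ago" format are assumptions made up for the example. The same text on the page resolves to a different absolute timestamp on every fetch, so a reader that compares timestamps sees a changed article each time.

```python
import re
from datetime import datetime, timedelta

def resolve_relative(text: str, fetched_at: datetime) -> datetime:
    """Turn 'N hours ago' into an absolute time, relative to the fetch time."""
    match = re.fullmatch(r"(\d+) hours? ago", text.strip())
    if match is None:
        raise ValueError(f"unsupported relative date: {text!r}")
    return fetched_at - timedelta(hours=int(match.group(1)))

# The page still shows the same rounded text at both fetches...
first_fetch = datetime(2024, 1, 21, 16, 0)
second_fetch = datetime(2024, 1, 21, 16, 30)
first = resolve_relative("3 hours ago", first_fetch)
second = resolve_relative("3 hours ago", second_fetch)

# ...but the computed timestamps differ by the time between the fetches.
print(first, second, first == second)
```

Matching items by URL instead of by date sidesteps the drift, which is presumably why a URL-based duplicate filter helps here.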

Owyn added a commit that referenced this issue Jan 21, 2024
@Owyn
Owner

Owyn commented Jan 21, 2024

Try the new update with the date-parsing selector (6th argument). Aim it at text nodes containing the date, or at elements whose title or alt attributes contain the date.

For now you'd need a "No duplicates by URL" filter so items don't get re-marked as unread every time (dunno why it's not a standard checkbox option):

function filterMessage() {
  if (msg.isAlreadyInDatabase(MessageObject.SameUrl)) {
    return MessageObject.Ignore;
  }
  else {
    return MessageObject.Accept;
  }
}

@RetroAbstract

New date selector works relatively well, thanks a lot :)

Couldn't get it to work for these pages, even though their items do have dates:

https://jwfan.com/?cat=21
https://support.microsoft.com/en-us/topic/windows-10-update-history-8127c2c6-6edf-4fdf-8b9f-0f7be1ef3562

@martinrotter
Contributor Author

martinrotter commented Jan 23, 2024

Calling script like this fails:

python3 scripts/css2rss.py ".clanek" "" "" "" "" ".date"

I only need to specify "item" selector and date selector (in this case with "date" class)

114, in <module>
    if sys.argv[2][0] == '!':
       ~~~~~~~~~~~^^^
IndexError: string index out of range

@RetroAbstract

> Calling script like this fails:
>
> python3 scripts/css2rss.py ".clanek" "" "" "" "" ".date"
>
> I only need to specify "item" selector and date selector (in this case with "date" class)
>
> 114, in <module>
>     if sys.argv[2][0] == '!':
>        ~~~~~~~~~~~^^^
> IndexError: string index out of range

I think you need to use ~ instead of "" to skip selectors you don't need.

@martinrotter
Contributor Author

> I think you need to use ~ instead of "" to skip selectors you don't need.

OK, I am really a beginner when it comes to CSS selectors.

@Owyn
Owner

Owyn commented Jan 23, 2024

> Calling script like this fails:
>
> python3 scripts/css2rss.py ".clanek" "" "" "" "" ".date"

Both "" and ~ should work to skip elements, I'll fix it 👍
Though ~ should be easier to write and read.
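
The IndexError in the traceback comes from indexing the first character of an empty string: the argument is `""`, so `sys.argv[2][0]` has nothing to return. Below is a minimal sketch of one way to treat both `""` and `~` as "skip this selector". The names `normalize_selector` and `is_negated` are hypothetical and this is not the actual fix committed to css2rss.py; what a leading `!` means to the script is not shown in this thread, so the sketch only checks for it safely.

```python
from typing import Optional

def normalize_selector(arg: str) -> Optional[str]:
    """Return None for a skipped selector ('' or '~'), else the selector."""
    arg = arg.strip()
    if arg in ("", "~"):
        return None
    return arg

def is_negated(arg: str) -> bool:
    """Check for a leading '!' without indexing into an empty string."""
    return arg.startswith("!")  # str.startswith is safe on ''

# Same argument layout as the failing call, with one '~' mixed in.
args = [".clanek", "", "~", "", "", ".date"]
selectors = [normalize_selector(a) for a in args]
print(selectors)
```

Using `startswith` rather than `arg[0]` is the whole point of the sketch: it gives the same answer for non-empty arguments and simply returns False for empty ones.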

Owyn added a commit that referenced this issue Jan 23, 2024
@martinrotter
Contributor Author

Tilde works.

Perhaps someone can help me. I have a website which has one huge <p> element containing <a> elements (which represent titles) and <span> elements (which contain dates/times), like this:

<p>

<a href="....." class="x">Title 1</a>
<span class="y">2024-01-07</span>

<a href="....." class="x">Title 2</a>
<span class="y">2023-07-29</span>

<a href="....." class="x">Title 3</a>
<span class="y">2022-01-07</span>

</p>

@Owyn @RetroAbstract Can either of you experts tell me how to instruct css2rss to produce a feed from this correctly, now with dates?

@Owyn
Owner

Owyn commented Jan 23, 2024

> <span class="y">2022-01-07</span>

The date selector should point at this span element,
e.g. with .y as the 6th argument.

@Owyn
Owner

Owyn commented Jan 23, 2024

> New date selector works relatively well, thanks alot :)
>
> Couldn't get it to work for these pages even if they do have dates to their items:
>
> https://jwfan.com/?cat=21 https://support.microsoft.com/en-us/topic/windows-10-update-history-8127c2c6-6edf-4fdf-8b9f-0f7be1ef3562

[screenshot]

as for the 2nd link I'm not sure what you're trying to match there... update history? there isn't a separate date element there
[screenshot]

@martinrotter
Contributor Author

> <span class="y">2022-01-07</span>
>
> date selector should point at this span element e.g. with .y as the 6th argument

Would this also work if the date element also contains auxiliary text, like this?

<p>

<a href="....." class="x">Title 1</a>
<span class="y">2024-01-07 SOMEMORETEXT</span>

<a href="....." class="x">Title 2</a>
<span class="y">2023-07-29 SOMEMORETEXT</span>

<a href="....." class="x">Title 3</a>
<span class="y">2022-01-07 SOMEMORETEXT</span>

</p>

@Owyn
Owner

Owyn commented Jan 23, 2024

> date is messed

nope:
[screenshot]

@martinrotter
Contributor Author

> nope: [screenshot]

Are these exceptions muted in your script? Because running the script like this

css2rss.py ".a" "" "" "" "" ".y"

does not produce any exceptions, which is weird.

anyway, thanks for your help with this :)

@Owyn
Owner

Owyn commented Jan 23, 2024

> are these exceptions muted in your script? because running my script like this

They are redirected into the article's body, check it :-)

Because failing a whole feed just because an auxiliary feature like date parsing broke would've been too much.
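
A minimal sketch of that pattern, under stated assumptions: the `build_item` helper, the dictionary keys, and the ISO-only date format are made up for illustration and are not taken from css2rss.py. The date failure is caught per item and appended to the item's body, so the rest of the feed is still produced.

```python
from datetime import datetime

def build_item(title: str, body: str, date_text: str) -> dict:
    """Build one feed item; a date failure is reported in the body, not raised."""
    item = {"title": title, "content_html": body}
    try:
        # Assumption: ISO dates only, to keep the example short.
        parsed = datetime.strptime(date_text, "%Y-%m-%d")
        item["date_published"] = parsed.isoformat()
    except ValueError as err:
        # Keep the item and surface the problem at the bottom of its body.
        item["content_html"] += f"<br/>Date parsing error: {err}"
    return item

good = build_item("Title 1", "text", "2024-01-07")
bad = build_item("Title 2", "text", "2024-01-07 SOMEMORETEXT")
print(good)
print(bad)
```

The trade-off is that nothing shows up on the console, so the error is easy to miss unless you open the article itself.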

@RetroAbstract

> as for the 2nd link I'm not sure what you're trying to match there... update history? there isn't a separate date element there

Yes, I can create feeds for both sites with CSS2RSS, but I can't get the date selector to work with them.

The "time" selector does not work for https://jwfan.com/?cat=21, and neither do "published", "entry-meta", "entry-meta > published", etc. The articles get the moment RSS Guard fetches the feed as their timestamp.

For https://support.microsoft.com/en-us/topic/windows-10-update-history-8127c2c6-6edf-4fdf-8b9f-0f7be1ef3562, you're right, there isn't a separate date element, my mistake. Would the script be able to take only the date of each update into account and discard the rest? For example, for the "January 9, 2024—KB5034122 (OS Builds 19044.3930 and 19045.3930)" update: keep the date, use it for the date selector, and discard everything after it?
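
What is being asked for could look roughly like the sketch below. It is only an illustration of the idea, not a css2rss feature: `leading_date` and `LEADING_DATE` are hypothetical names, the "Month D, YYYY" layout is assumed from the example title, and month names are parsed with Python's default (English) locale.

```python
import re
from datetime import datetime
from typing import Optional

# Matches a leading "Month D, YYYY" and ignores whatever follows it.
LEADING_DATE = re.compile(r"^\s*([A-Z][a-z]+ \d{1,2}, \d{4})")

def leading_date(text: str) -> Optional[datetime]:
    """Parse the date at the start of the text; return None if there is none."""
    match = LEADING_DATE.match(text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%B %d, %Y")
    except ValueError:
        # Looked like a date but was not one, e.g. an unknown month name.
        return None

# \u2014 is the dash that separates the date from the KB number on the page.
title = "January 9, 2024\u2014KB5034122 (OS Builds 19044.3930 and 19045.3930)"
print(leading_date(title))
```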

@martinrotter
Contributor Author

Well, what I try to parse is this URL https://antivirus.22web.org/clanky.htm

Here is RAW input: https://www.pastebin.cz/en/p/VVFQEN9

Here is how I call it

cat raw-input.txt | python3 css2rss ".clanek" "" "" "" "" ".date"

Sadly, the "date_published" attribute in the output is empty, and no error message shows up in the article contents either.

@Owyn
Owner

Owyn commented Jan 23, 2024

> The "time" element does not work for https://jwfan.com/?cat=21,

It works for me 🤔 have you tried the script line from my screenshot?
[screenshot]

> error message is not seen in contents of article either.

it should be at the very bottom of it in item_description

@martinrotter
Contributor Author

> it should be at the very bottom of it in item_description

Well, maybe I am blind.
out.json

@RetroAbstract

> It works for me 🤔 have you tried the script line from my screenshot?

Works now, thanks 😄

@Owyn
Owner

Owyn commented Jan 23, 2024

> Well, maybe I am blind.
> out.json

https://support.microsoft.com/en-us/topic/windows-10-update-history-8127c2c6-6edf-4fdf-8b9f-0f7be1ef3562

python css2rss.py "#supLeftNav > div > ul:nth-child(2) > li:nth-child(2)" ~ ~ ~ ~ ".supLeftNavLink"

[screenshot]
[screenshot]

or
[screenshot]

check after 4fe2c89

Now it should report when the date element wasn't found at all (that was probably the case here, because then it won't write any parsing errors, since there's nothing to parse).

@Owyn
Owner

Owyn commented Jan 23, 2024

> cat raw-input.txt | python3 css2rss ".clanek" "" "" "" "" ".date"

that's because the date element isn't inside .clanek element but next to it...

@martinrotter
Contributor Author

Yes, I know. How do I tell it that the date is next to it, then? Please.

@Owyn
Owner

Owyn commented Jan 23, 2024

> Yes i know, how to tell it that it is next then? Pls

Don't tell it that it's next, just choose a root element (1st argument) a level higher
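
A small sketch of why the higher root helps, using hypothetical markup (the real page's structure is not shown in this thread) and the standard library's ElementTree with its XPath-style paths instead of CSS selectors. A date that sits next to the item element cannot be found by searching inside that element, but it can be found inside their common parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the date is a sibling of .clanek, not a child of it.
page = ET.fromstring("""
<div class="list">
  <div class="row">
    <div class="clanek"><a href="/a1">Title 1</a></div>
    <span class="date">2024-01-07</span>
  </div>
  <div class="row">
    <div class="clanek"><a href="/a2">Title 2</a></div>
    <span class="date">2023-07-29</span>
  </div>
</div>
""")

def dates_under(item_path: str) -> list:
    """Look for the date only inside each matched item element."""
    found = []
    for item in page.findall(item_path):
        date = item.find(".//span[@class='date']")
        found.append(date.text if date is not None else None)
    return found

print(dates_under(".//div[@class='clanek']"))  # item root too narrow
print(dates_under(".//div[@class='row']"))     # one level higher
```

This assumes every item has its own wrapper element. If titles and dates are all flat children of one big container, going a level higher yields a single item instead.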

@RetroAbstract

I need some help getting the dates for items on eBay.

For example, for the search term "nintendo switch" sorted by newly listed, the listing time of each item is found in:

<span class="s-item__detail s-item__detail--secondary">
  <span class="s-item__dynamic s-item__listingDate">
   <!--F#f_0-->
   <span class="BOLD">Aug-2 15:37</span>
   <!--F/-->
  </span>

When I use any of the above (s-item__detail, s-item__detail--secondary, s-item__dynamic, s-item__listingDate, BOLD) as the 6th selector in the feed's post processing script box with CSS2RSS, it fails to find date elements.

Any help with this would be much appreciated. 🙂

@Owyn
Owner

Owyn commented Aug 3, 2024

> it fails to find date elements.

Because there are no dates on the eBay ads that are disguised as listed items, that's why it says so.

python css2rss.py "li.s-item" "span.s-item__listingDate" ~ "a[href*=UserReviews]" ~ "span.s-item__listingDate"

[screenshot]

Also, date parsing now uses your local timezone, not UTC, when none is specified (I've just noticed it was wrong here).

And there's now a more detailed message saying exactly how many items didn't get their date parsed or found, so you can tell when only a few of them are failing rather than all of them.
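
For illustration, the local-timezone behaviour can be sketched with the standard library alone. This is an assumption about the general approach, not the code from the commit, and `with_local_timezone` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def with_local_timezone(parsed: datetime) -> datetime:
    """Attach the system's local timezone to a naive datetime; keep aware ones."""
    if parsed.tzinfo is None:
        # astimezone() on a naive datetime assumes it is in local time.
        return parsed.astimezone()
    return parsed

naive = datetime(2024, 8, 2, 15, 37)                       # "Aug-2 15:37", no zone
aware = datetime(2024, 8, 2, 15, 37, tzinfo=timezone.utc)  # zone given explicitly
print(with_local_timezone(naive).isoformat())
print(with_local_timezone(aware).isoformat())
```

The wall-clock time of the naive value stays the same; only the offset is added, and that offset depends on the machine running the script.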

Owyn added a commit that referenced this issue Aug 3, 2024
@RetroAbstract

RetroAbstract commented Aug 4, 2024

> because there are no dates for ebay ads they masked as listed items, that's why it says so
>
> python css2rss.py "li.s-item" "span.s-item__listingDate" ~ "a[href*=UserReviews]" ~ "span.s-item__listingDate"
>
> also, date parsing now uses your local timezone, not UTC when unspecified (I've just noticed it was wrong here)
>
> and there's now a more detailed message saying how many items exactly didn't get their date parsed or found so you'll see when it's just few and not all of them failing

Thanks. I tried python css2rss.py "li.s-item" "span.s-item__listingDate" ~ "a[href*=UserReviews]" ~ "span.s-item__listingDate", but don't get the same results as the ones shown in the screenshot you shared.

[screenshot]

Maya and Beautiful Soup are installed.

What URL are you using?

Update: The user agent I had set to launch with RSS Guard was the cause of me not getting the same results. Changed it and now it works fine. 🙂

@Owyn
Owner

Owyn commented Aug 4, 2024

> but don't get the same results as the ones shown in the screenshot you shared.

Are you sure you're actually using the line I wrote above? It clearly chooses the published date as the title there; there's no way it would randomly choose the actual title instead. You'd need to select elements carefully for that.

Anyway, the latest version should now clearly say how many date elements got matched and parsed (in the feed comment field where you press "fetch now").

[screenshot]
The URL is the one you gave me;
site content might also differ based on the user's country

@RetroAbstract

> Are you sure you're actually using the line I wrote above? because it clearly chooses published date there as the title, there's no way it'd randomly choose the actual title instead - you'd need to carefully select elements for that

100% sure it's the same line you wrote, copied and pasted:

[screenshot]

I also updated css2rss to your latest commit.

> site content might also differ based on the user's country

I thought this too, tried the UK and CA eBays but same results.

@Owyn
Owner

Owyn commented Aug 4, 2024

Perhaps the RSS Guard version also matters. I'm still using the 4.3.3 nowebengine version since it still has the light web viewer, which looks better.

Newer versions should have an option to switch between the full and lite browser, and also some alternative ways to fetch websites for parsing, as I've heard.

@RetroAbstract

RetroAbstract commented Aug 4, 2024

> Perhaps RSSGuard version also matters, I'm still using 4.3.3 nowebengine version since there's still light web-viewer there which looks better

Tried on 4.3.3 No Web Engine and indeed, I now get the results you do.

[screenshot]

> newer versions should have an option to switch between full and lite browser

4.7.2 No Web Engine is my daily driver. I tried toggling "Use legacy article formatting" in Feeds & Articles > Articles on and off, but it made no difference.

> and also some alternative ways to fetch websites for parsing as I've heard

I think those are related to the Discover feeds feature, which to my knowledge cannot be used with CSS2RSS.

[screenshot]

@martinrotter
Contributor Author

> Perhaps RSSGuard version also matters, I'm still using 4.3.3 nowebengine version since there's still light web-viewer there which looks better
>
> newer versions should have an option to switch between full and lite browser and also some alternative ways to fetch websites for parsing as I've heard

BTW, the lightweight variant of the browser is still available in the latest RSS Guard, you just have to enable it in settings.

@Owyn
Owner

Owyn commented Aug 5, 2024

> BTW, lightweight variant of browser is still available in latest RSS Guard, you just have to enable it in settings.

I know, the old one just looks better for me tho so I still use it:
4.3.3:
[screenshot]

4.7.3 (after full database + setting restore - article columns got messed up anyway)
[screenshot]

edit: apparently now there's "legacy article view" which almost brings back the old look (except link colors are now different; it shows them all as visited now for some reason?)
4.7.3 legacy article view:
[screenshot]

Might actually update now, thanks for the legacy view 👍🏻

@martinrotter
Contributor Author

Feel free to report the "visited link color" if it bugs you, could be a bug.

3 participants