Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange issues with yt part #57

Closed
umputun opened this issue Mar 25, 2022 · 6 comments
Closed

Strange issues with yt part #57

umputun opened this issue Mar 25, 2022 · 6 comments

Comments

@umputun
Copy link
Owner

umputun commented Mar 25, 2022

The youtube part is mostly fine, doing things as designed however there are 2 issues we need to address. I tend to believe both are somehow related to the way the youtube XML feed formed and I suspect some out-of-order entries from time to time, but not sure.

  1. Sometimes we have an unhealthy loop with new entries detected, published, and removed right away. I don't really understand how this happens, but it seems to be related to updated entries. Currently, there is a hack in youtube.Service to prevent processing on the same entry for the second time. Obviously, this is just a hack, as this unbounded set will grow with time.
  2. Somehow we get updates in the middle of the feed (the final RSS). This drives podcast clients insane and probably it has something to do with pub date out of order, not sure.

Both issues are reproducible with the current configuration, i.e. with this:

 echo-msk:
   title: Эхо Москвы
   description: Правильный, комбинированный фид избранных передач
   link: http://echo.msk.ru
   language: "ru-ru"
   image: images/echomsk.png
   telegram_channel: echomsk_u
   ext_date: yyyyddmm
   sources:
     - {name: "Живой Гвоздь", url: "https://worker.feed-master.com/yt/media/UCWAIvx2yYLK_xTYD4F2mUNw.xml"}
     - {name: "Александр Плющев", url: "https://worker.feed-master.com/yt/media/UCTVk323gzizpujtn2T_BL7w.xml"}
     - {name: "Дилетант", url: "https://worker.feed-master.com/yt/media/UCuIE7-5QzeAR6EdZXwDRwuQ.xml"}
     - {name: "Майкл Наки", url: "https://worker.feed-master.com/yt/media/UCgpSieplNxXxLXYAzJLLpng.xml"}
     - {name: "Татьяна Фельгенгауэр", url: "https://worker.feed-master.com/yt/media/UC0OiowQ-fTqw3Kq3bfopO_Q.xml"}
     - {name: "Юлия Латынина", url: "https://worker.feed-master.com/yt/media/UCzaqqlriSjVyc795m86GVyg.xml"}
     - {name: "Сергей Пархоменко", url: "https://worker.feed-master.com/yt/media/UC5wRw_R3kzwV5zGe7kzec8A.xml"}
     - {name: "Сергей Алексашенко", url: "https://worker.feed-master.com/yt/media/UCfePVJbfih7dkxrs8AInINg.xml"}
     - {name: "Евгения Альбац", url: "https://worker.feed-master.com/yt/media/UCZdyijiiXyMyrwTYMGOVxKA.xml"}
     - {name: "Виктор Шендерович", url: "https://worker.feed-master.com/yt/media/UCyYl1mXWoe4z_FcxzMaDHCw.xml"}
     - {name: "Александр Невзоров", url: "https://worker.feed-master.com/yt/media/UC8kI2B-UUv7A5u3AOUnHNMQ.xml"}
@paskal paskal self-assigned this Mar 27, 2022
@orthur
Copy link
Contributor

orthur commented Mar 28, 2022

I have noticed that scheduled live streams has one <published> time, after stream has finished <published> time updated. And entries stored in boltDB with key using <published> time fmt.Sprintf("%d-%x", entry.Published.Unix(), h.Sum(nil)) that makes s.Store.Exist(entry) equals to false, hence duplicate appears in a feed.

Channel's rss feed downloaded at Mar 28 19:57:

 <entry>
  <id>yt:video:4wK4JFikFM0</id>
  <yt:videoId>4wK4JFikFM0</yt:videoId>
  <yt:channelId>UCWAIvx2yYLK_xTYD4F2mUNw</yt:channelId>
  <title>Личный прием / @Евгений Ройзман // 28.03.22</title>
  <link rel="alternate" href="https://www.youtube.com/watch?v=4wK4JFikFM0"/>
  <author>
   <name>Живой Гвоздь</name>
   <uri>https://www.youtube.com/channel/UCWAIvx2yYLK_xTYD4F2mUNw</uri>
  </author>
  <published>2022-03-27T19:31:51+00:00</published>
  <updated>2022-03-28T07:29:57+00:00</updated>
 </entry>

Downloaded at Mar 28 22:11:

<entry>
  <id>yt:video:4wK4JFikFM0</id>
  <yt:videoId>4wK4JFikFM0</yt:videoId>
  <yt:channelId>UCWAIvx2yYLK_xTYD4F2mUNw</yt:channelId>
  <title>Личный прием / @Евгений Ройзман // 28.03.22</title>
  <link rel="alternate" href="https://www.youtube.com/watch?v=4wK4JFikFM0"/>
  <author>
   <name>Живой Гвоздь</name>
   <uri>https://www.youtube.com/channel/UCWAIvx2yYLK_xTYD4F2mUNw</uri>
  </author>
  <published>2022-03-28T19:06:22+00:00</published>
  <updated>2022-03-28T19:06:38+00:00</updated>
 </entry>

@umputun
Copy link
Owner Author

umputun commented Mar 28, 2022

I'm not sure this is the case. I tend to believe, for live events as well as "scheduled" events (i.e. ERROR: [youtube] hF6IrOcs78o: This live event will begin in 4 minutes.) it will fail the download process. The failed download won't update the store, so it won't even publish an entry to boltdb

I think, there are some other cases the published property got updated. I'm not sure how to handle it better - we have to have some timestamp in the key to allow time-sorting and this timestamp should be real, I mean we can't set time.Now() here. If you have any bright idea how to address this mutable published pls share, or even better, make a pr.

@umputun
Copy link
Owner Author

umputun commented Mar 29, 2022

this is an example of "unhealthy loop with new entries detected, published, and removed right away"

here we download videoid 45GbA78kSpo, store audio to ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3, and after this remove ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3 right away.

2022/03/29 00:44:52.306 [INFO]  {youtube/service.go:193} new entry [10] 45GbA78kSpo, Юлия Латынина / Войномика.2.0 С Евгением Чичваркиным / 21.03.2022/ LatyninaTV /, Юлия Латынина
2022/03/29 00:44:52.306 [DEBUG] {feed/downloader.go:62} executing command: yt-dlp --extract-audio --audio-format=mp3 --audio-quality=0 -f m4a/bestaudio "https://www.youtube.com/watch?v=45GbA78kSpo" --no-progress -o ce8355c3eee3207b93320b79d978e7c19cb2dc5e.tmp
2022/03/29 00:44:54.094 [INFO]  {lgr/adaptor.go:16} [youtube] 45GbA78kSpo: Downloading webpage
2022/03/29 00:44:54.470 [INFO]  {lgr/adaptor.go:16} [youtube] 45GbA78kSpo: Downloading android player API JSON
2022/03/29 00:44:54.831 [INFO]  {lgr/adaptor.go:16} [info] 45GbA78kSpo: Downloading 1 format(s): 140
2022/03/29 00:44:54.943 [INFO]  {lgr/adaptor.go:16} [download] Destination: ce8355c3eee3207b93320b79d978e7c19cb2dc5e.tmp
2022/03/29 00:44:56.136 [INFO]  {lgr/adaptor.go:16} [download] Download completed
2022/03/29 00:44:56.232 [INFO]  {lgr/adaptor.go:16} [FixupM4a] Correcting container of "ce8355c3eee3207b93320b79d978e7c19cb2dc5e.tmp"
2022/03/29 00:44:57.325 [INFO]  {lgr/adaptor.go:16} [ExtractAudio] Destination: ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3
2022/03/29 00:46:35.577 [INFO]  {lgr/adaptor.go:16} Deleting original file ce8355c3eee3207b93320b79d978e7c19cb2dc5e.tmp (pass -k to keep)
2022/03/29 00:46:35.738 [INFO]  {youtube/service.go:200} downloaded 45GbA78kSpo (Юлия Латынина / Войномика.2.0 С Евгением Чичваркиным / 21.03.2022/ LatyninaTV /) to var/yt/ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3, channel: {Name:Юлия Латынина ID:UCzaqqlriSjVyc795m86GVyg Type: Keep:0 Language:ru-ru}
2022/03/29 00:46:35.738 [INFO]  {store/store.go:46} save 1647886619-a9f72e139d51976e3487ced8ee6f3f621a25c3ca - {ChannelID:UCzaqqlriSjVyc795m86GVyg, VideoID:45GbA78kSpo, Title:Юлия Латынина / Войномика.2.0 С Евгением Чичваркиным / 21.03.2022/ LatyninaTV /, File:var/yt/ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3, Author:Yulia Latynina, Published:2022-03-21 18:16:59 +0000 +0000}
2022/03/29 00:46:35.743 [INFO]  {store/store.go:172} set processed ce8355c3eee3207b93320b79d978e7c19cb2dc5e - {ChannelID:UCzaqqlriSjVyc795m86GVyg, VideoID:45GbA78kSpo, Title:Юлия Латынина / Войномика.2.0 С Евгением Чичваркиным / 21.03.2022/ LatyninaTV /, File:var/yt/ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3, Author:Yulia Latynina, Published:2022-03-21 18:16:59 +0000 +0000}
2022/03/29 00:46:35.744 [INFO]  {youtube/service.go:216} saved 45GbA78kSpo (Юлия Латынина / Войномика.2.0 С Евгением Чичваркиным / 21.03.2022/ LatyninaTV /) to var/yt/ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3, channel: {Name:Юлия Латынина ID:UCzaqqlriSjVyc795m86GVyg Type: Keep:0 Language:ru-ru}
2022/03/29 00:46:35.763 [INFO]  {youtube/service.go:251} removed var/yt/ce8355c3eee3207b93320b79d978e7c19cb2dc5e.mp3 for UCzaqqlriSjVyc795m86GVyg (Юлия Латынина)
2022/03/29 00:46:35.766 [INFO]  {youtube/rss_store.go:35} rss feed file saved to var/yt/UCzaqqlriSjVyc795m86GVyg.xml

this is youtube feed for the channel: https://www.youtube.com/feeds/videos.xml?channel_id=UCzaqqlriSjVyc795m86GVyg

I don't see anything unusual with this entry but somehow it causes the issue above

@umputun
Copy link
Owner Author

umputun commented Mar 29, 2022

this is one more example of get-delete loop

2022/03/29 00:44:44.448 [INFO]  {youtube/service.go:193} new entry [12] 0k8J-vx-R8E, Московские старости от 24.03.1922, Дилетант
2022/03/29 00:44:44.448 [DEBUG] {feed/downloader.go:62} executing command: yt-dlp --extract-audio --audio-format=mp3 --audio-quality=0 -f m4a/bestaudio "https://www.youtube.com/watch?v=0k8J-vx-R8E" --no-progress -o ceafa1e50a60691038e0fac816cc75823105d262.tmp
2022/03/29 00:44:46.180 [INFO]  {lgr/adaptor.go:16} [youtube] 0k8J-vx-R8E: Downloading webpage
2022/03/29 00:44:46.551 [INFO]  {lgr/adaptor.go:16} [youtube] 0k8J-vx-R8E: Downloading android player API JSON
2022/03/29 00:44:46.973 [INFO]  {lgr/adaptor.go:16} [info] 0k8J-vx-R8E: Downloading 1 format(s): 140
2022/03/29 00:44:47.035 [INFO]  {lgr/adaptor.go:16} [download] Destination: ceafa1e50a60691038e0fac816cc75823105d262.tmp
2022/03/29 00:44:47.076 [INFO]  {lgr/adaptor.go:16} [download] Download completed
2022/03/29 00:44:47.158 [INFO]  {lgr/adaptor.go:16} [FixupM4a] Correcting container of "ceafa1e50a60691038e0fac816cc75823105d262.tmp"
2022/03/29 00:44:47.262 [INFO]  {lgr/adaptor.go:16} [ExtractAudio] Destination: ceafa1e50a60691038e0fac816cc75823105d262.mp3
2022/03/29 00:44:51.988 [INFO]  {lgr/adaptor.go:16} Deleting original file ceafa1e50a60691038e0fac816cc75823105d262.tmp (pass -k to keep)
2022/03/29 00:44:52.122 [INFO]  {youtube/service.go:200} downloaded 0k8J-vx-R8E (Московские старости от 24.03.1922) to var/yt/ceafa1e50a60691038e0fac816cc75823105d262.mp3, channel: {Name:Дилетант ID:UCuIE7-5QzeAR6EdZXwDRwuQ Type: Keep:0 Language:ru-ru}
2022/03/29 00:44:52.123 [INFO]  {store/store.go:46} save 1648100460-145cfeb8145f07fd1e9c0a9d49c4b4840db12f02 - {ChannelID:UCuIE7-5QzeAR6EdZXwDRwuQ, VideoID:0k8J-vx-R8E, Title:Дилетант: Московские старости от 24.03.1922, File:var/yt/ceafa1e50a60691038e0fac816cc75823105d262.mp3, Author:Дилетант, Published:2022-03-24 05:41:00 +0000 +0000}
2022/03/29 00:44:52.125 [INFO]  {store/store.go:172} set processed ceafa1e50a60691038e0fac816cc75823105d262 - {ChannelID:UCuIE7-5QzeAR6EdZXwDRwuQ, VideoID:0k8J-vx-R8E, Title:Дилетант: Московские старости от 24.03.1922, File:var/yt/ceafa1e50a60691038e0fac816cc75823105d262.mp3, Author:Дилетант, Published:2022-03-24 05:41:00 +0000 +0000}
2022/03/29 00:44:52.127 [INFO]  {youtube/service.go:216} saved 0k8J-vx-R8E (Дилетант: Московские старости от 24.03.1922) to var/yt/ceafa1e50a60691038e0fac816cc75823105d262.mp3, channel: {Name:Дилетант ID:UCuIE7-5QzeAR6EdZXwDRwuQ Type: Keep:0 Language:ru-ru}
2022/03/29 00:44:52.129 [INFO]  {youtube/service.go:251} removed var/yt/ceafa1e50a60691038e0fac816cc75823105d262.mp3 for UCuIE7-5QzeAR6EdZXwDRwuQ (Дилетант)
2022/03/29 00:44:52.131 [INFO]  {youtube/rss_store.go:35} rss feed file saved to var/yt/UCuIE7-5QzeAR6EdZXwDRwuQ.xml

So far those two are the only cases I can reproduce.

Pls note - after #62 it will happen only once and in order to see it again you will need to remove boltdb file

umputun added a commit that referenced this issue Mar 29, 2022
@umputun
Copy link
Owner Author

umputun commented Mar 29, 2022

another mystery

2022/03/29 12:32:27.003 [INFO]  {logger/logger.go:134} GET - /yt/rss/UC5wRw_R3kzwV5zGe7kzec8A - yt.feed-master.com - unknown ip - 200 (42312) - 5.403089ms
2022/03/29 12:32:27.005 [WARN]  {youtube/service.go:112} failed to get file size for var/yt/3ba58d9587cc398bf42456b1165ace916051be53.mp3: stat var/yt/3ba58d9587cc398bf42456b1165ace916051be53.mp3: no such file or directory

So, the file was removed but the entry in RSS is still in place. RSS is built from bolt directly and removal seems to be fine to me. At least I can't see how it can keep items in bolt but remove the file

this is the log for the actual removal, no errors

2022/03/29 11:25:44.082 [INFO] {youtube/service.go:251} removed var/yt/3ba58d9587cc398bf42456b1165ace916051be53.mp3 for UCuIE7-5QzeAR6EdZXwDRwuQ (Дилетант)

I have added some extra checks to prevent potential failed removal of metadata from reporting the file for removal, however I don't think this was the case

@paskal paskal removed their assignment Mar 30, 2022
@umputun
Copy link
Owner Author

umputun commented Apr 11, 2022

after all the changes around ts adjustment and a separate tracking of processed entires this part seems to be stable and issue-free

@umputun umputun closed this as completed Apr 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants