
Scrapes appear frozen time to time in Zimfarm #1756

Closed · kelson42 opened this issue Jan 27, 2023 · 30 comments

@kelson42
Collaborator

This recipe has its very own freeze pattern at the very start, seemingly even before scraping begins!

See https://farm.openzim.org/pipeline/32b22c3a583bd32c94c53d36/debug

kelson42 added the bug label on Jan 27, 2023
kelson42 added this to the 1.12.0 milestone on Jan 27, 2023
@uriesk
Collaborator

uriesk commented Jan 27, 2023

It works locally:

npm start -- --adminEmail=contact@kiwix.org --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv --customZimDescription="A selection of the best 1 Million Wikipedia articles" --customZimFavicon=https://en.wikipedia.org/static/images/project-logos/enwiki.png --customZimTitle="Wikipedia's 1 Million Top Articles" --filenamePrefix=wikipedia_en_top1m --mwUrl=https://en.wikipedia.org/ --webp

and it works with Docker:

docker run -v /root/test:/output:rw --name mwoffliner_wikipedia_en_top1m ghcr.io/openzim/mwoffliner:TEST-redisupdate mwoffliner --adminEmail=contact@kiwix.org --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv --customZimDescription="A selection of the best 1 Million Wikipedia articles" --customZimFavicon=https://en.wikipedia.org/static/images/project-logos/enwiki.png --customZimTitle="Wikipedia's 1 Million Top Articles" --filenamePrefix=wikipedia_en_top1m --mwUrl=https://en.wikipedia.org/ --webp

Both enter the article-downloading stage fine.

Is there anything special on the Zimfarm?

@kelson42
Collaborator Author

@uriesk Not that I know of... and we are used to downloading lists from that server... but never of that size.

@holta

holta commented Jan 28, 2023

@uriesk https://kinsta.com/blog/increase-max-upload-size-wordpress/#increase-the-max-upload-file-size-in-nginx suggests that the NGINX error "413 Request Entity Too Large" might be overcome by increasing these default settings in /etc/php/7.4/fpm/php.ini and /etc/php/7.4/cli/php.ini (or equivalent):

upload_max_filesize = 64M
post_max_size = 128M

Others recommend increasing these too:

max_execution_time = 30
max_input_time = 60
memory_limit = 128M
max_input_vars = 1000

Followed by something like:

systemctl restart php7.4-fpm
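
Since the 413 is returned by NGINX itself, the NGINX-side knob is client_max_body_size (the same setting referenced later in this thread). A minimal sketch, assuming the relevant server block lives in /etc/nginx/nginx.conf or a site config under /etc/nginx/conf.d/, and using 512m purely as an example value:

http {
    # allow large log/status uploads; 0 would disable the size check entirely
    client_max_body_size 512m;
}

Then validate and reload:

nginx -t && systemctl reload nginx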


@kelson42
Collaborator Author

@rgaudin Any idea? These errors are in the task-worker log: https://farm.openzim.org/pipeline/32b22c3a583bd32c94c53d36/debug

@rgaudin
Member

rgaudin commented Jan 28, 2023

137 is OOM
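
For context, 137 is 128 + 9, i.e. the process received SIGKILL, which is what the kernel/Docker OOM killer sends. A quick way to confirm this locally (a sketch, assuming the container from the docker command above is still present):

docker inspect mwoffliner_wikipedia_en_top1m \
  --format 'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}}'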

@rgaudin
Member

rgaudin commented Jan 28, 2023

Ah sorry, yes, I've seen those. We need a ticket on ZF, but it doesn't affect the scraping.

@kelson42
Collaborator Author

@rgaudin OK, I have already increased the available memory and restarted a scrape.

@kelson42
Collaborator Author

@rgaudin Almost all the time (we had one that somehow managed to get past this and then died later with 137), the scrape is just stuck right after the Redis server starts... so super early. This is the only recipe with that behaviour, and it seems to be the only one with the errors from openzim/zimfarm#738. Locally it seems to work fine, as reported by @uriesk. I have a fairly strong feeling that the problem might come from the Zimfarm (worker?) itself. Would you please be able to verify and maybe find other clues?

@rgaudin
Member

rgaudin commented Jan 29, 2023

I'll look into it, but the task you referenced here failed after 1 day and 50 minutes… not exactly upon startup.

@holta

holta commented Jan 29, 2023

I'll look into it, but the task you referenced here failed after 1 day and 50 minutes… not exactly upon startup.

Indeed, this time https://farm.openzim.org/recipes/wikipedia_en_top1m failed after about half a day according to https://farm.openzim.org/pipeline/81507cc3e015578b61a95d36 ("10 hours, 50 minutes" ?)

ASIDE: This ~40 GB ZIM file will be a Lifeline for people who just cannot afford large microSD cards.

In essentially all countries.

So I'd like to help wherever I can!

@uriesk
Collaborator

uriesk commented Jan 29, 2023

Better not run it on verbose.

The three oldest available builds failed with error code 137 (out of memory).
All scrapes since then ran with verbose.
And when you download the 500 MB log file of the most recent one, you see:

[error] [2023-01-29T11:50:40.491Z] Error downloading article Nottingham_City_Hospital
Failed to run mwoffliner after [49493s]: {
	"message": "Request failed with status code 500",
	"name": "AxiosError",
	"stack": "AxiosError: Request failed with status code 500\n    at settle (file:///tmp/mwoffliner/node_modules/axios/lib/core/settle.js:19:12)\n    at IncomingMessage.handleStreamEnd (file:///tmp/mwoffliner/node_modules/axios/lib/adapters/http.js:512:11)\n    at IncomingMessage.emit (node:events:539:35)\n    at IncomingMessage.emit (node:domain:475:12)\n    at endReadableNT (node:internal/streams/readable:1344:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)",
	

So just one failed article, which we can exclude with articleListToIgnore once 1.12 releases.
Edit: Or just remove it from the article list, ahem.
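
A minimal sketch of that exclusion, assuming the 1.12 flag keeps the articleListToIgnore name and accepts a file the way articleList does:

# hypothetical local ignore list containing the one failing title
echo "Nottingham_City_Hospital" > ignore.tsv
npm start -- --adminEmail=contact@kiwix.org --mwUrl=https://en.wikipedia.org/ \
  --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv \
  --articleListToIgnore=ignore.tsv --filenamePrefix=wikipedia_en_top1m --webp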

I think the Zimfarm just has some issues with the massive verbose output of large scrapes.

@rgaudin
Member

rgaudin commented Jan 30, 2023

wikipedia_en_top1m never worked

When (end)     | Duration                    | Reason
2 months ago   | 1 day, 8 hours, 50 minutes  | OOM (10GB)
1 month ago    | 1 day, 20 hours, 30 minutes | OOM (10GB)
29 days ago    | 1 day, 10 hours, 30 minutes | OOM (15GB)
28 days ago    | 1 day, 3 hours, 30 minutes  | Canceled
23 days ago    | 10 hours, 30 minutes        | Canceled
23 days ago    | 1 hour, 2 minutes           | Canceled
10 days ago    | 20 hours, 3 minutes         | Canceled
2 days ago     | 1 day, 0 hours, 50 minutes  | OOM (15GB)
1 day ago      | 10 hours, 50 minutes        | Exit-code 2 (Request failed with status code 500)
21 minutes ago | 20 hours, 60 minutes        | Canceled

This doesn't match the ticket description at all. This looks like a memory-hungry task that didn't fit within 15GB.

I see that another one has been launched following the openzim/zimfarm#738 fix. I doubt it will have much impact, as that bug was just preventing the task worker from uploading the log. Even if that log were kept in memory, we're talking about 500MB… and the task worker is not resource-limited. So for it to have an impact, the worker would have to be completely maxed out on RAM and you'd have to hope the kernel decides to kill the scraper…

I am ruling Zimfarm out for the moment; please let me know if your findings lead back to it.

I would suggest you test locally by specifying a memory resource limit on your docker command.
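
For instance, a sketch reusing the docker command from earlier in this thread, capped at the farm's 15GB limit (the exact values are an assumption chosen to mirror the table above; setting --memory-swap equal to --memory also disables swap, so the container OOMs the way the farm does):

docker run --memory=15g --memory-swap=15g -v /root/test:/output:rw \
  --name mwoffliner_wikipedia_en_top1m ghcr.io/openzim/mwoffliner:TEST-redisupdate mwoffliner \
  --adminEmail=contact@kiwix.org \
  --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv \
  --customZimFavicon=https://en.wikipedia.org/static/images/project-logos/enwiki.png \
  --filenamePrefix=wikipedia_en_top1m --mwUrl=https://en.wikipedia.org/ --webp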

@uriesk
Collaborator

uriesk commented Jan 30, 2023

@rgaudin
then I suggest looking at the actual issue and not just trying to work around it (like I wanted with that "let's not do verbose").

The 413 Request Entity Too Large regularly happened with verbose output while non-verbose was OK.
Here you get a different one: https://farm.openzim.org/pipeline/4d2a6c335d2fa49beb7afb36/debug, and here: https://farm.openzim.org/pipeline/ba86dc6329a4e5323ec6eb36/debug

The Zimfarm will not be the reason why the scrapes fail. But it can be the reason why they appear frozen (even if they might not be). All those Canceled ones had frozen output. They should have either given us an OOM error, or running output, or whatever error actually appeared (like the one a day ago).
But if the Zimfarm is just preventing the task worker from uploading the log, we don't know what is going on, and when you hit cancel, you don't get the log file either.
The output of the last wikipedia_ceb_all appeared frozen for two days or more, then magically updated and we got a legitimate error.
Frozen output is an issue.
mwoffliner:TEST-redisupdate was supposed to solve freezes that happen within mwoffliner, with timers that monitor the crucial parts and cancel execution with a legitimate error when something freezes. It did do that successfully once, so I guess that is working.
But if the whole Node process freezes, a timer within that frozen process can't help.
Nor does it help when the process actually isn't frozen and we just don't see the output.

So I ask you to stick around, watch those builds, and if something appears to be frozen... check what is going on in the container.
Even if you think it is not a Zimfarm issue anymore since the client_max_body_size change, let's watch whether that is the case; and if freezes still appear and are not Zimfarm-related, you can give the hint that leads the way.
Maybe shell into it and check the node process: is it still running, is it taking CPU? Run redis-cli and check whether the Redis connections still exist and whether commands are still going in.
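
A sketch of that inspection, assuming shell access to the worker host, a hypothetical scraper container name, and Redis listening on its default port inside the container:

docker exec -it <scraper-container> sh
ps aux | grep -i node      # is the node process still there and accumulating CPU time?
redis-cli client list      # are the scraper's Redis connections still open?
redis-cli --stat           # are keys/commands still changing over time?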

I cannot locally test a full scrape of a 6-million-article Wikipedia like ceb or en, or even just a 1-million one. I can only check whether it reaches one of the earlier stages without freezing.
And I never saw a freeze with the ones that are small enough for me to try.
We can only debug this on the Zimfarm.

We can rename this issue to something like "scrapes appear frozen sometimes" if that helps.

kelson42 changed the title from "WPEN top1m recipe stuck at very start" to "scrapes appear frozen time to time in Zimfarm" on Jan 30, 2023
@rgaudin
Member

rgaudin commented Jan 30, 2023

That makes more sense. I understand that, indeed, if you rely on timestamps in the live-updated stdout to tell a running task from a stuck one, then the ZF issue would have made you think those tasks were stuck.

That said:

  • I don't understand the references to the start/beginning
  • Given most runs failed/got stuck after 20h+, I don't understand how it was considered to work locally but not on the farm.

I guess none of this matters now that we have eliminated the main culprit. Hopefully, the current run will enlighten us.

As for monitoring, ping me here or on Slack with a task when you want me to connect and dig up information for you; I'd be happy to help.

@uriesk
Collaborator

uriesk commented Jan 30, 2023

Because they appeared stuck right after starting Redis, before any scraping started:
https://farm.openzim.org/pipeline/4f192c6329a4e5327c141b36/debug

Thanks, let's hope it works out 👍

kelson42 changed the title from "scrapes appear frozen time to time in Zimfarm" to "Scrapes appear frozen time to time in Zimfarm" on Jan 31, 2023
kelson42 modified the milestone from 1.12.0 to 1.13.0 on Feb 1, 2023
@holta

holta commented Feb 1, 2023

FWIW yesterday's run contained:

  • lots of "413 Request Entity Too Large" warnings
  • one "couldn't patch task status=scraper_running HTTP 502: ResponseError (not JSON)" "502 Bad Gateway" warning

Does Scraper stderr /usr/local/sbin/mwoffliner: line 2: 13 Killed mean the job was manually killed?

Thank you to @kelson42 who launched another ZF attempt 3.5 hours ago:
https://farm.openzim.org/pipeline/28c070f7906bf9674d93ad36

@rgaudin
Member

rgaudin commented Feb 1, 2023

FWIW yesterday's run contained:

* lots of "413 Request Entity Too Large" warnings
* one "couldn't patch task status=scraper_running HTTP 502: ResponseError (not JSON)" "502 Bad Gateway" warning

Both were from before the fix. Sorry about the timing conflict

Does Scraper stderr /usr/local/sbin/mwoffliner: line 2: 13 Killed mean the job was manually killed?

No, it was killed by docker due to lack of RAM

@rgaudin
Member

rgaudin commented Feb 1, 2023

I'd like to mention that yesterday's run had 20GB of RAM yet OOM'd. Keep in mind that redis is completely in RAM as well.

The new one running is bound to 25GB.
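
Since Redis keeps the whole scrape state in RAM, a quick check inside the running container (a sketch, assuming Redis on its default port) shows how much of that 25GB Redis itself is consuming:

redis-cli info memory | grep -E 'used_memory_human|used_memory_peak_human'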

@holta

holta commented Feb 1, 2023

I'd like to mention that yesterday's run had 20GB of RAM yet OOM'd. Keep in mind that redis is completely in RAM as well.

The new one running is bound to 25GB.

I should have seen and realized the 137 OOM yesterday, right!

(5GB extra RAM during each attempt can't hurt, if indeed it's that simple!)

🙏

@uriesk
Collaborator

uriesk commented Feb 1, 2023

wikipedia_en_all_maxi made it through the downloading-articles stage with 20GB RAM and is fetching 6.6 million articles. Something fishy is going on.

TEST-redisupdate is merged now, so we can do the next try with mwoffliner:dev.
And I would appreciate it without verbose; I don't enjoy downloading those multi-GB log files.

@holta

holta commented Feb 1, 2023

And I would appreciate it without verbose; I don't enjoy downloading those multi-GB log files.

Great Question @uriesk, given the importance of rapid/continuous improvement:

ASIDE, "Scrapper Log" should really be "Scraper Log" on every job's "Debug" tab (Debug page) like:
https://farm.openzim.org/pipeline/f7b5acc98307812c7e5c7d36/debug

@rgaudin
Member

rgaudin commented Feb 2, 2023

ASIDE, "Scrapper Log" should really be "Scraper Log" on every job's "Debug" tab (Debug page) like: https://farm.openzim.org/pipeline/f7b5acc98307812c7e5c7d36/debug

Fixed the typo (openzim/zimfarm@f83b6ee). Please refresh. Note that I only caught what you meant because I happened to be following this ticket; I don't contribute to mwoffliner, so Zimfarm bugs reported here have limited chances of getting fixed 😉

kelson42 closed this as completed on Feb 2, 2023
@holta

holta commented Feb 2, 2023

So far so good — the latest scrape of wikipedia_en_top1m is still running after almost 34h:

https://farm.openzim.org/pipeline/28c070f7906bf9674d93ad36

@uriesk
Collaborator

uriesk commented Feb 3, 2023

it's done, with an epic 200 GB file 🤔

And one 413 Request Entity Too Large at the very end

@kelson42
Collaborator Author

kelson42 commented Feb 3, 2023

Yes, I get it, the novid format should be configured on this recipe ;)

@holta

holta commented Feb 3, 2023

Yes, I get it, the novid format should be configured on this recipe ;)

Very awesome that it appears so close; thanks to everyone!

Can someone restart the job with --format="novid:maxi" ?

ASIDE: Why does wikipedia_en_all_maxi use --addNamespaces="100" ?

@kelson42
Collaborator Author

kelson42 commented Feb 3, 2023

@holta yes, will do

@uriesk
Collaborator

uriesk commented Feb 3, 2023

@holta @kelson42 that build gave very good insight into the storage requirements of media; I analyzed it in #1767

@holta

holta commented Feb 3, 2023

@holta @kelson42 that build gave very good insight into the storage requirements of media; I analyzed it in #1767

Heroic. Millions of people should thank you and everyone helping here once we're all done, reliably and compactly delivering the "world's knowledge" (i.e. a very meaningful snapshot subset of Wikipedia, in English for starters) every single month!

@holta

holta commented Feb 11, 2023

@uriesk thanks for your help completing this major accomplishment, which has the very real potential to help millions of kids get legit, not-stale encyclopedic knowledge every single month. 🎆

(i.e. if the wikipedia_en_top1m scraping recipe hopefully proves much more reliable/ruggedized than https://farm.openzim.org/recipes/wikipedia_en_all_maxi ❓❗)
