Issues with parallelizing thousands of tasks #49
Hi @drunksaint,
When you run
Good question! This is quite common with Lambdas, and
Although
The idea is that this utility keeps five stacks: (1) arguments, (2) data (thunks and values), (3) executables, (4) outputs, and (5) symlinks. Use your commands like
But of course, if you're going to go down either of these two routes, that means you need to forgo
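(For illustration, a rough sketch of creating one thunk by hand; the flags mirror the gg create-thunk invocation quoted later in this thread, while the hashes, file names, and argument here are made up. The symlink stack isn't shown.)

```sh
EXE_HASH=`gg hash my_program`        # hypothetical binary; goes on the executables stack
IN_HASH=`gg hash shared_input.dat`   # hypothetical input; goes on the data (values) stack

gg create-thunk --value ${IN_HASH} \
                --executable ${EXE_HASH} \
                --output output \
                --placeholder job.task \
                ${EXE_HASH} my_program arg1    # function hash, then its arguments
```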
I think setting the timeout parameter should help with the stragglers.
Hey @sadjad, thanks for the super detailed response! For the thunk creation step, I'll try out one of the approaches you mentioned - I'll probably go with the C++ API since you recommended it. Is either approach faster than the other?

For the execution step, when I mentioned the 3 minute time earlier, I had timed it manually from the time the upload from my local machine completed to the time the download to my local machine started. So I measured just the total Lambda execution time, since I'm running these tests from outside EC2 and upload/download times to my machine are higher than they should be (I should probably start doing this from EC2). The timeout parameter fixed the straggler issue though! But the total time still didn't really improve by much. I ran two tests, one with 1000 jobs and one with 2000. This time, with 2000 jobs and a 15 sec timeout, it still took 2.5 min to run just the Lambda step. I've appended the outputs below. First test:
Second test (I cancelled it during the upload when I realized that I'd like to update the parameters):
And yes, I could see 2000 lambdas running (the number in red).
Could you please measure the time necessary for forcing just one thunk on Lambda? You can run
My hunch is that the overhead of each task (launching a Lambda + downloading the inputs + uploading the output + winding down) compared to the actual job is just too high. And you know, that would be okay if you had 15,000 workers (the same as the number of thunks you have); all of them would start in parallel, take some time to start up, then run their task, finish, and deliver the results. But now, you need to pay that price multiple times. Based on what you said, each function needs around 210 MiB of input data (which would take 4-6s to download) and a small output (which will be uploaded quickly). But your link to the cloud can also add to these overheads, and I can't help but wonder if running the job from an EC2 machine would make a difference. I think this is something you should definitely try.
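(Something along these lines should do for that measurement; the flags follow the gg force invocation that appears later in this thread, and the .task name is a placeholder:)

```sh
# Force one placeholder on Lambda and time the whole round trip
# (launch + input download + execution + output upload).
time gg force --jobs 1 --engine lambda single.task
```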
The C++ API is faster, and both are way faster than the gg infer route. Hope this helps!
Forcing a single thunk without download took ~6s which seems about right.
Running it a second time with a different input takes less time (~3s), perhaps because it reuses the same instance with the cached binary and big input files:
I reran the experiment by splitting the input into 8-line pieces (<2k files) so that each worker runs just once. This had much better results. Using the old
And forcing the thunks took around 30s, which I guess is expected because of the 15s timeout; perhaps this is due to failures the first time.
I ran a second test with a 10s timeout and this also took around 30s. Perhaps this time there were failures two times in a row.
Is it possible to debug these error messages? I see the ChildProcess and 404 messages quite frequently. An unrelated question: is there a command to clear the thunk cache for just the current gg project, both from S3 and locally?
I just ran the test on EC2. Good news! You were right! It took less time despite getting a ton of
Looks like the Lambda execution took around 18s (25s total - 6s upload - 1s download). This is despite the ~8000
Inference took much longer using the
Also, is it possible to get
I looked at gg repl and it seemed quite easy to use, so I thought I'd try it out and see if the performance is good enough for this use case.
The old way using
What I tried with
sending
Hi @drunksaint -- apologies for my delayed response.
Re: the 404 messages, are you sure those blobs exist in the S3 bucket? A 404 usually means that the object is missing, which can happen when there's a mismatch between the local index and the remote bucket.
Sadly, no, not yet. But it's a good thing to consider.
We also talked to AWS support to get our limit increased. If you want 2,000 concurrent Lambdas, you need them to increase your limit to around 10,000 concurrent requests (expected duration × expected requests per second).
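(As a back-of-the-envelope check of that formula; the 5 s duration and 2,000 requests/s here are assumed figures for illustration, not numbers from this thread:)

```sh
# concurrent requests ≈ expected task duration (s) × expected requests per second
echo $(( 5 * 2000 ))   # e.g. ~5 s tasks launched at 2,000/s -> 10,000 concurrent requests
```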
Unfortunately, no. All the thunks have to be completed for a job to finish.
Normally,
Nice, yeah, I think you need to get your limit increased. Please let me know if there's anything else I can help with! --Sadjad
Hey @sadjad! Thanks for your response!
I saw 404s initially, but the execution went through eventually. And the output file was correct and had everything in place. So I'm guessing this might be due to some sort of race condition. I got
Do you know why this might be happening? Is there a way I can debug this?
Hi @drunksaint, My best guess is that the EC2 machine runs out of memory, and then the process is killed. Could you please check if that is the case? --Sadjad
Yup, you're right! I'll move to a fatter instance. I was surprised because the machine has more than 750MB free for ~200MB files.
Got it working end to end on EC2! The whole thing (16k lines in 2k tasks - 8 lines per task) now takes around 23s for the full upload+execution+download cycle, of which upload+download is around 5s. So execution is around 18s (with a timeout of 5s and 2k Lambdas). A single task run independently from EC2 takes around 2.5s, though. I must've run close to 20 tests now, and it always takes around 23-24s. I always see >50 of the ChildProcess and 404 errors - the ratio and number vary between runs - but the runs always complete with a valid output. So the 404s are most likely a race condition. I'm not sure about the ChildProcess errors though. Is it possible that these two errors are leading to the higher-than-expected execution times? The two errors:
and
That still seems like too much... I wonder if we're hitting some limits within S3. Would you be willing to get me some timing logs from an actual run? Here are the steps:
If the job failed for any reason, please get rid of
Hi @drunksaint, Thank you for getting the data! So,
Also, can you run this script over the timing logs? It will convert them to
Thank you! --Sadjad
Hi @sadjad! I thought the do_cleanup numbers were in ms XD. After removing the cleanup stage, run times varied quite a bit - from a second or two less to almost 10s less! I see that many times the run fails due to disk space issues, perhaps because of the lack of cleanup. I'm seeing the 'slow down' error code more often now. The charts look different from run to run, so I've attached stats from multiple runs below:
The timelog folder has the same number of files every time, but I see a different number of error messages each time. Does this log only the successful executions? Let me know if you'd like me to send any more stats or run it more times. Thanks!
Some of the do_cleanup histograms from the above successful runs:
Could this possibly be the reason?
Hello @drunksaint,
Oops, you're right. All numbers are in milliseconds, so
Let's look at one of those >20s runs:
A few points:
Almost 1,700 Lambdas are launched within the first second of the job; compared to that, yours take way longer to start (I now remember that the capacity we asked for gives us 1,500 concurrent Lambdas). This was our request to the Lambda team, which gave us enough capacity to launch 1,500 concurrent Lambdas:
I think the "Expected Average Requests Per Second" field is the key. Do you happen to know the parameters of your limit increase request?
Hi @sadjad
I'm glad XD. I had started adding code to measure the backend creation time XD. Wow, our execution profiles look quite different. The charts are super neat. My support request looked like this:
Yeah, you might be right - I asked for a low average requests per second, so they have provided a low burst capacity but a 10k max. I'll reach out to AWS support and try to get it fixed. Thanks!
@sadjad did you see any 404/503/ChildProcess errors when you ran your simulation?
Not really. I got plenty of "rate limited" errors, but those are okay. I think I saw a few ChildProcess errors, but no 404s or 503s.
Hi @sadjad, I've been busy the last week talking to people at AWS to figure out what the issue is, and that seems to be taking time. It seems like some problem with the burst concurrency at their end. I sent them the two charts you sent me, and they asked me if I could replicate your test exactly and see if I'm still observing the same issue in my region. I'm guessing you're using Ubuntu 18.04 to run the script from an EC2 instance in us-west (Oregon) to Lambdas in us-west (Oregon). I'm using a t3.medium EC2 instance in us-east (N. Virginia) to Lambdas in us-east (N. Virginia). Can you confirm if my guess is right? Also, which instance type are you using? I doubt that'll matter much, but it might help reduce the number of variables in the two tests.
Hi @drunksaint, Actually, I'm running my script from my machine at Stanford (Northern California) to Lambdas in
Also, in my experience, Lambda performance is not the same in all regions. I usually use
Here's the code to my function:

```cpp
#include <fstream>
#include <thread>
#include <chrono>
#include <cstdlib>

#define SIZE 1024 * 1024 * 100

char dummy[SIZE] = { 'j' };

using namespace std;
using namespace chrono;

int main()
{
  dummy[SIZE - 1] = '\0';
  srand( time( nullptr ) );

  this_thread::sleep_for( 5s );

  ofstream fout { "output" };
  fout << rand() << endl;

  return 0;
}
```

Compiled with
Create a large, random input file:
Thunk creation:

```sh
F_HASH=`gg hash prog`
I_HASH=`gg hash input`

for i in {1..2000}
do
  gg create-thunk --value ${I_HASH} --executable ${F_HASH} --output output \
                  --placeholder ${i}.task ${F_HASH} prog ${i}
done
```

Execution:

```sh
gg force --jobs 2000 --engine lambda *.task --timeout 10
```
Thanks @sadjad for the details of your execution! This helps move the conversation forward with the AWS person I'm in touch with. I'm seeing quite different Lambda start profiles in us-west-2 and us-east-1, even from a c5n instance.
It's strange that the step up is always 10+ sec in the us-east case. Do you think this could be related to the 10 sec timeout value in gg? As in, maybe the first set of tasks actually ran but failed due to the 503s, and 10s later replacement tasks were started for all of these failed tasks. Thanks
I have a theory - S3 has an upper limit on how many simultaneous downloads can happen from a single key. Since I'm running from an EC2 instance in the same region, the requests are being sent faster than S3 can keep up with, resulting in the high number of 503s. And gg retries after 10s. I'm not sure if gg retries 503 failures within the same Lambda or if it creates a different Lambda in this case. If it does create a different Lambda, then the failed Lambdas most likely aren't logged in the timelogs. And if it retries in the same Lambda, is the start time measured as the start time of the original execution or the start time of the new execution? When I send requests to Lambdas in us-west from an EC2 instance in us-east, or you send them from your home, we don't see the 503s because the requests are slow enough for S3. And this results in a smoother execution, because gg doesn't hit any errors and doesn't need to retry. Do you think this makes sense? Or is there something wrong with some of the assumptions?
Hey @sadjad! I ran the test without gg and am not seeing the bump up to >10s in the start times. So it looks like there might be something in gg that's causing that. There is one difference in the test though - I'm not using files for the small inputs and the outputs; I'm sending these directly through the Lambda payload. But this shouldn't affect the start time profile.
Hi @drunksaint, Lunch on me for the delayed response!
This definitely has something to do with Lambda's burst concurrency quotas.
Okay, it seems that I got something wrong. I didn't pay enough attention and thought the 503 errors were coming from Lambda, telling us to slow down launching workers... but they're coming from S3. My bad! However, the 10s timeout is for tasks that have started but are still executing after 10s; failed tasks are retried immediately.
Yes, you're right. It creates a different Lambda, and the failed ones are not logged in timelogs (you see one entry per thunk in timelogs, even if it failed multiple times or was retried due to a timeout -- our logging system is definitely lacking).
FWIW, I did run the same experiment from an EC2 machine in the same region, and it ran just as smoothly.
Oh, I didn't know that you had small inputs. They could be the reason that S3 is acting up -- we had a similar problem in another project, where we had to deal with small files (less than a kilobyte in size) on S3 and kept hitting the limits pretty quickly. Since
If that's the issue, I can easily think about an enhancement to
Hi @sadjad
I thought I lost you for some time :)
I've been speaking to someone in AWS. They haven't found the issue yet at their end. I pointed him to gg and he looked intrigued and I believe he is running some tests using gg.
Yeah, I guess I should've been clearer. They are definitely from S3.
OK, then it might be possible that the Lambdas get throttled when they see too many 503s from S3. But the AWS guys didn't bring that up. So I'm still not sure why the 10-12 sec jump happened.
Strange. I was getting 503s even with your test (the sleep for 5 sec) where there are no extra inputs.
Yeah, unfortunately it isn't possible to bundle them.
Thanks, that would be super useful! Can the outputs also be bundled in the response payload? I'm not sure if that's possible with the way gg is set up.
Yes! Give me a couple of days and I'll have that on master :)
Oh awesome! Thanks! :)
Hi! I am trying to run a script that can be parallelized trivially - basically, each line of the input is independent from the others. I have a 15k-line input file that takes around 10 minutes to process on a single core; a single line takes ~1.4s to run. I split the input file into 15k files containing single lines, generated the makefile, and passed it to Lambda through gg with -j=2000. I encountered a few issues:

I run gg infer make -j$(nproc) on a 6-core machine, and the wrapper uses gg create-thunk because I used pyinstaller to build the binary. The wrapper does just two things - it collects the single-line input file and creates a thunk for the command. The binary and common input file are collected just once, manually, outside the wrapper function. I was considering using gg to convert the whole process into something that takes around 5-10 sec max to run, so that it can be run as an API. Is there a faster way to create these thunks?

The command I use is of the format
The single_line_input_file.txt and single_line_output_file.txt change for each execution. The command and big_constant_input_file.txt are the same for every execution. The command is around 70MB and the big_constant_input_file.txt is around 140MB, so the same file is being downloaded to 2k Lambdas in parallel. I remember, @sadjad, you mentioned in a talk that gg does each file transfer in parallel chunks. Perhaps this, combined with the 2k parallel Lambdas trying to download the same 2 files, is hitting the S3 concurrency limit?
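(For what it's worth, a rough sketch of what the wrapper's per-line thunk creation might boil down to, modeled on the gg create-thunk invocation quoted earlier on this page. The file names are the placeholders from this post, the hashes and loop bounds are hypothetical, it assumes --value can be given once per input blob since the data stack can hold several entries, and how the program's arguments locate the staged files is gg-specific and worth checking against its docs.)

```sh
# Hash the two shared blobs once; every thunk references the same pair of objects,
# so they are uploaded to S3 a single time.
CMD_HASH=`gg hash command`                      # the ~70MB binary
BIG_HASH=`gg hash big_constant_input_file.txt`  # the ~140MB shared input

for i in $(seq -w 1 15000)
do
  LINE_HASH=`gg hash single_line_input_${i}.txt`   # hypothetical per-line file name
  gg create-thunk --value ${BIG_HASH} --value ${LINE_HASH} \
                  --executable ${CMD_HASH} \
                  --output single_line_output_file.txt \
                  --placeholder ${i}.task \
                  ${CMD_HASH} command single_line_input_${i}.txt big_constant_input_file.txt
done
```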