
addUserRecord call throws DaemonException #39

Closed
heikkiv opened this issue Jan 27, 2016 · 64 comments

@heikkiv

heikkiv commented Jan 27, 2016

Sometimes calling addUserRecord starts to throw:

com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
    at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:171) ~[amazon-kinesis-producer-0.10.2.jar:na]
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:467) ~[amazon-kinesis-producer-0.10.2.jar:na]
    at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:338) ~[amazon-kinesis-producer-0.10.2.jar:na]

The KPL does not seem to recover from this. All further calls to addUserRecord also fail; restarting the KPL Java process fixes the situation.

This seems to happen when the Kinesis stream is throttling requests, so my guess is that the native process can't write to the stream quickly enough and runs out of memory. If that's the case, my expectation would be that the native process should start discarding older data, and of course that if the native process dies the KPL should recover to a working state.

@cgpassante

I received the same error in production yesterday on a service that has been running smoothly for months...

2016-01-27T00:10:10,299 ERROR [com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask] ShardId shardId-000000000000: Application processRecords() threw an exception when processing shard
com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.

@MusikPolice

I'm running a Lambda function that uses the KPL, and it appears to crash with this exception when the Lambda runs out of memory.

@quiqua

quiqua commented Nov 11, 2016

We keep running into similar issues when using the KPL library on AWS Lambda.
At the moment we work around it by calling flushSync aggressively and reducing RecordMaxBufferedTime (a minimal sketch of this setup follows the list below).

  • KPL version 0.12.1
  • AWS Lambda environment: Java 8, 1024 MB
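
A minimal sketch of that setup, assuming a Lambda handler that reuses one producer across invocations (the stream name, region, and handler shape are placeholders, not our actual code):

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

public class LambdaKplSketch {
    // Reuse the producer across invocations; creating it per call restarts the native child.
    private static final KinesisProducer PRODUCER = new KinesisProducer(
            new KinesisProducerConfiguration()
                    .setRegion("eu-west-1")            // placeholder region
                    .setRecordMaxBufferedTime(100));   // flush buffered records quickly

    public void handleEvent(ByteBuffer payload) {
        PRODUCER.addUserRecord("my-stream", "partition-key", payload); // placeholder names
        // Force buffered records out before the Lambda container is frozen.
        PRODUCER.flushSync();
    }
}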

@pfifer
Contributor

pfifer commented Feb 15, 2017

Thanks for reporting this. We are investigating this, but could use some additional information.
It appears that people are still seeing this with the 0.12.x versions of the KPL. How commonly does this occur?

We will investigate adding memory usage tracking for the KPL native process to help determine how much memory it's consuming.

Can everyone who is affected by this please respond or add a reaction to help us prioritize this issue.

@pfifer pfifer added the bug label Feb 15, 2017
@xujiaxj

xujiaxj commented Feb 17, 2017

We upgraded to 0.12.3 in production two days ago, but since then this issue has occurred much more frequently than with the previous 0.10.2 version. (Please note that as part of the upgrade we also upgraded the AWS SDK from 1.10.49 to 1.11.45.)

We have 10 m4.2xlarge instances in eu-west-1 that produce a high amount of traffic to our Kinesis stream in us-west-2, and they are the ones that frequently run into this issue. Other producer instances in other regions seem to be fine, but they have less traffic.

Our motivation for upgrading was the hope that it would solve the memory leak issue, but it looks like stability has degraded instead.

I will open a separate support ticket to give more information.

@yeshodhan

We are facing a similar issue on our side:
com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:171)
at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:467)
at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:338)

We are using amazon-kinesis-producer v0.10.2, amazon-kinesis-client v1.7.0, and AWS Java SDK v1.11.77.

The issue is intermittent and I'm unable to figure out the root cause.

@quiqua

quiqua commented Mar 2, 2017

We have moved away from the KPL and now use the AWS SDK for Java directly to stream data to Kinesis. The KPL didn't work out for us in the end, as we saw these errors quite frequently.
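
For anyone considering the same move, this is roughly what a plain SDK (v1) put looks like; the stream name and partition key are placeholders, and error handling is omitted:

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class PlainSdkProducer {
    private final AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

    public PutRecordResult put(String data) {
        // One synchronous HTTP call per record; no native child process involved.
        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("my-stream")          // placeholder stream name
                .withPartitionKey("partition-key")
                .withData(ByteBuffer.wrap(data.getBytes()));
        return kinesis.putRecord(request);
    }
}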

@xujiaxj

xujiaxj commented Mar 15, 2017

On a few occasions when we ran into the crashed KPL, we also observed that our JVM could not create more threads, even though the thread count in the JVM was very stable.

Exception in thread "qtp213045289-24084" java.lang.OutOfMemoryError: unable to create new native thread

I wonder whether that's because the KPL process has somehow prevented its parent JVM process from getting more native threads.

@ramachpa

ramachpa commented Apr 5, 2017

We see this when calling the flushSync method as well. We initially thought the host did not have enough memory for the KPL to do its job, but even after increasing the memory we continue to see this.

com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
	at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:175)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:716)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:735)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.flushSync(KinesisProducer.java:760)

@sshrivastava-incontact

I want to point out one observation; not sure how helpful this will be...
I tried (on Windows) with
DATA_SIZE = 10;
SECONDS_TO_RUN = 1;
RECORDS_PER_SECOND = 1;

and the following is the log (which points to the pipe \\.\pipe\amz-aws-kpl-in-pipe- not being found):

[main] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to C:\Users\xxxx~1\AppData\Local\Temp\amazon-kinesis-producer-native-binaries
[main] INFO com.amazonaws.samples.SampleProducer - Starting puts... will run for 1 seconds at 1 records per second
[pool-1-thread-1] INFO com.amazonaws.samples.SampleProducer - Put 1 of 1 so far (100.00 %), 0 have completed (0.00 %)
[main] INFO com.amazonaws.samples.SampleProducer - Waiting for remaining puts to finish...
Exception in thread "main" com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
	at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:171)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:708)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:727)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.flushSync(KinesisProducer.java:752)
	at com.amazonaws.samples.SampleProducer.main(SampleProducer.java:262)
[pool-3-thread-2] ERROR com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
	at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
	at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:61)
	at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:447)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-878f15f7
	at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
	at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
	at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
	at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
	at java.nio.channels.FileChannel.open(Unknown Source)
	at java.nio.channels.FileChannel.open(Unknown Source)
	at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:329)
	at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:61)
	at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:444)
	... 5 more
[kpl-callback-pool-0-thread-0] ERROR com.amazonaws.samples.SampleProducer - Exception during put
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
	at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
	at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:61)
	at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:447)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-878f15f7
	at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
	at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
	at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
	at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
	at java.nio.channels.FileChannel.open(Unknown Source)
	at java.nio.channels.FileChannel.open(Unknown Source)
	at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:329)
	at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:61)
	at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:444)
	... 5 more

@buremba

buremba commented Jul 8, 2017

We plan to use this library for high workloads, but it looks like it doesn't prevent the native process from crashing under high load. When the process is dead, our application stops working; it would be great if you could focus on this issue.

@pfifer
Contributor

pfifer commented Jul 10, 2017

@sshrivastava-incontact what you're seeing is related to running KPL 0.12.x on Windows, which isn't currently supported.

For those running on Linux and Mac OS X: the newest version of the KPL includes some additional logging about how busy the sending process is. See the release notes for 0.12.4 for the meaning of the log messages. Under certain circumstances the native component can actually run itself out of threads, which will trigger a failure of the native process.

@JasonQi-swe

JasonQi-swe commented Jul 17, 2017

When will the KPL work on Windows? I am OK with version 0.12.* not working on Windows. However, version 0.10.* also does not work, since I always get this exception:
ERROR KinesisProducer:148 - Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:61)
at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:447)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-6cd6f933
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:329)
at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:61)
at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:444)
... 5 more

And it is not only me; this person reported the same problem:
https://stackoverflow.com/questions/43113791/getting-error-while-running-amazon-kinesis-producer-sample

Looking forward to your update and thank you so much!

Sincerely

@oletap

oletap commented Jul 26, 2017

I am using Kinesis Producer 0.10.2. It was running fine on Windows 7, but when I try to set it up on CentOS 6.5 I get the error below:

Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Error starting child process
at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:502)
at com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:455)
at com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:61)
at com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:128)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Cannot run program "/tmp/amazon-kinesis-producer-native-binaries/kinesis_producer_544f70b9f23702394d1c2f983b07d4c0242aff01": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:453)
... 7 more
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 8 more

Can anybody help me on this? Thanks.

@bijith

bijith commented Sep 2, 2017

Getting the same exception while trying https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer-sample/src/com/amazonaws/services/kinesis/producer/sample/SampleProducer.java

17/09/01 16:53:58 INFO SampleProducer: Put 1998 of 10000 so far (19.98 %), 0 have completed (0.00 %)
17/09/01 16:53:58 ERROR SampleProducer: Error running task
com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
	at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
	at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)

@udayravuri

It seems that switching to root on Linux and running the SampleProducer.java example lets some puts go through before it fails.

Environment: KPL v 0.12.5 on Linux (Linux els-d93322 2.6.34.7-66.fc13.i686.PAE #1 SMP Wed Dec 15 07:21:49 UTC 2010 i686 i686 i386 GNU/Linux)

This log line suggests that it is trying to extract the binaries to /tmp:
[com.amazonaws.services.kinesis.producer.sample.SampleProducer.main()] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to /tmp/amazon-kinesis-producer-native-binaries

So I switched to root, and now it at least does 2 puts and then fails again with the same exception. But at least it did 2 puts:
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ amazon-kinesis-producer-sample ---
[com.amazonaws.services.kinesis.producer.sample.SampleProducer.main()] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to /tmp/amazon-kinesis-producer-native-binaries
[com.amazonaws.services.kinesis.producer.sample.SampleProducer.main()] INFO com.amazonaws.services.kinesis.producer.sample.SampleProducer - Starting puts... will run for 5 seconds at 2000 records per second
[kpl-daemon-0000] INFO com.amazonaws.services.kinesis.producer.Daemon - Asking for trace
[pool-1-thread-1] INFO com.amazonaws.services.kinesis.producer.sample.SampleProducer - Put 1999 of 10000 so far (19.99 %), 0 have completed (0.00 %)
[pool-1-thread-1] INFO com.amazonaws.services.kinesis.producer.sample.SampleProducer - Put 3999 of 10000 so far (39.99 %), 0 have completed (0.00 %)
[pool-1-thread-1] ERROR com.amazonaws.services.kinesis.producer.sample.SampleProducer - Error running task
com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
at com.amazonaws.services.kinesis.producer.sample.SampleProducer$2.run(SampleProducer.java:216)
at com.amazonaws.services.kinesis.producer.sample.SampleProducer$4.run(SampleProducer.java:298)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@udayravuri

Regarding the exception above, it was my mistake. I was attempting to write to a Kinesis Stream while the actual endpoint was a Kinesis Firehose. After making that correction, I was able to successfully write to Kinesis Firehose using this example: https://github.com/awslabs/aws-big-data-blog/blob/master/aws-blog-firehose-lambda/kinesisFirehose/src/main/java/com/amazonaws/proserv/lambda/KinesisToFirehose.java

@ppearcy

ppearcy commented Oct 6, 2017

Any chance anyone has any workarounds?
Has anyone had any version of the KPL working on any version of Windows with any version of Java?

The issue looks similar:
https://bugs.openjdk.java.net/browse/JDK-8152922

Is the only known workaround to use the native Java producer?

It might be a good idea to update the README, as it is a pretty nasty surprise given the claim that 0.10.2 works on Windows.

@bk-mz

bk-mz commented Nov 12, 2017

From what we have observed, it's not a bug but by-design behaviour.

The message bound for Kinesis is stored (a put operation) on a blocking queue, whose put method declares InterruptedException as a checked exception.

This operation happens on the calling thread in the KPL code, so if the calling thread can be interrupted, the catch logic is invoked:

/**
 * Enqueue a message to be sent to the child process.
 * 
 * @param m
 */
public void add(Message m) {
    if (shutdown.get()) {
        throw new DaemonException(
                "The child process has been shutdown and can no longer accept messages.");
    }
    
    try {
        outgoingMessages.put(m); //<-- HERE 
    } catch (InterruptedException e) {
        fatalError("Unexpected error", e);
    }
}

The fatalError call then terminates the daemon:

private synchronized void fatalError(String message, Throwable t, boolean retryable) {
    if (!shutdown.getAndSet(true)) {
        if (process != null) {
            process.destroy(); //<-- HERE PROCESS IS DESTROYED
        }
        try {
            executor.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) { }
        executor.shutdownNow();
        // other code
    }
}

So, to work around this issue, make sure you are not invoking addUserRecord on a thread that might be interrupted, e.g. any of your server's request-handling threads.

This piece of code did the trick for us; make sure something similar is happening in your code (an alternative using a dedicated executor is sketched after the snippet):

private void sendToKinesis(ByteBuffer buffer) {
    // runAsync without an executor runs on ForkJoinPool.commonPool(),
    // so the (possibly interruptible) calling thread never touches the KPL.
    CompletableFuture.runAsync(() -> {
        try {
            streamProducer.addUserRecord(...);
        } catch (Throwable e) {
            log.error("Failed to send to Kinesis: {}", e.getMessage());
        }
    });
}
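
A variation on this workaround, sketched under the same assumption that the DaemonException is triggered by an interrupt on the calling thread: route every put through a dedicated executor whose thread is never interrupted. The class and the stream/key names are placeholders.

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.amazonaws.services.kinesis.producer.KinesisProducer;

public class KinesisSender {
    private final KinesisProducer producer = new KinesisProducer();
    // A single dedicated thread: interrupts delivered to request-handling threads
    // never reach Daemon.add(), so an InterruptedException cannot shut the daemon down.
    private final ExecutorService kplExecutor = Executors.newSingleThreadExecutor();

    public void sendToKinesis(ByteBuffer buffer) {
        kplExecutor.submit(() ->
                producer.addUserRecord("my-stream", "partition-key", buffer)); // placeholder names
    }
}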

@spjegan

spjegan commented Jan 16, 2018

We are seeing this same issue with version 0.10.2 on Linux. I have not found a root cause. Is there any workaround for this issue? Has anyone tried the workaround proposed above?

Thanks.

@pfifer
Contributor

pfifer commented Jan 18, 2018

@spjegan 0.10.x is no longer supported; an upgrade to 0.12.x is required. In 0.12.x, automatic restarts of a failed daemon process were added.

@pfifer
Contributor

pfifer commented Jan 18, 2018

@head-thrash In 0.12.x a higher-level automatic restart was added that should restart the native process if it exits. You're correct that you will receive the exception while the process is down, but after the process is restarted publishing should be available again.

@pfifer
Contributor

pfifer commented Jan 18, 2018

@ppearcy Windows support was added in Release 0.12.6.

@pfifer
Contributor

pfifer commented Jan 18, 2018

@udayravuri During startup the library attempts to extract the native component to the temp directory. The user your application runs as must have write access to that directory. You can use setTempDirectory on the configuration to control where it extracts to; a minimal example is below. There is a bug (#161) right now with the certificates that we're looking into.
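
A minimal sketch of that configuration; the directory path is a placeholder and must be writable (and executable) by the user the application runs as:

import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

public class TempDirExample {
    public static KinesisProducer create() {
        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                // Extract the native binary somewhere the application user can write to
                // and execute from (e.g. not a volume mounted with noexec).
                .setTempDirectory("/var/app/kpl-native");   // placeholder path
        return new KinesisProducer(config);
    }
}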

@pfifer
Contributor

pfifer commented Jan 18, 2018

@xujiaxj The threading issue was fixed/mitigated in #100, with more information to assist in determining configuration added in #102. These were released in 0.12.4.
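
If your KPL version exposes the pooled threading model (this is my recollection of the configuration associated with these changes; please verify the setters against your version's KinesisProducerConfiguration), capping the native sender's thread usage looks roughly like this:

import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration.ThreadingModel;

public class PooledThreadsExample {
    public static KinesisProducerConfiguration config() {
        // Use a bounded thread pool in the native component instead of one thread per request.
        return new KinesisProducerConfiguration()
                .setThreadingModel(ThreadingModel.POOLED)
                .setThreadPoolSize(64);   // illustrative cap; tune for your workload
    }
}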

@xujiaxj

xujiaxj commented Jan 18, 2018

@pfifer thanks

@bk-mz

bk-mz commented Feb 13, 2018

@pfifer Still, the method returns a ListenableFuture, which leads to the false assumption that every failure will be surfaced through the future.

Developers write onFailure handlers, only to discover that the exception is thrown synchronously somewhere in production :(
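
A sketch of a caller-side guard based on that observation (stream and key names are placeholders): catch the synchronous DaemonException around addUserRecord itself, in addition to registering onFailure on the returned future.

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.producer.DaemonException;
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.UserRecordResult;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

public class GuardedPut {
    public void put(KinesisProducer producer, ByteBuffer data) {
        try {
            // DaemonException is thrown here, on the calling thread, not delivered to onFailure.
            ListenableFuture<UserRecordResult> f =
                    producer.addUserRecord("my-stream", "partition-key", data); // placeholder names
            Futures.addCallback(f, new FutureCallback<UserRecordResult>() {
                @Override public void onSuccess(UserRecordResult result) { /* record accepted */ }
                @Override public void onFailure(Throwable t) { /* per-record put failure */ }
            });
        } catch (DaemonException e) {
            // The child process is down; buffer or retry the record once it has restarted.
        }
    }
}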

@fmthoma

fmthoma commented Apr 9, 2018

What am I doing wrong?

I was trying to run the KPL on an Alpine Docker base image, which is apparently a bad idea. Switching to Debian slim did the trick. Unfortunately, the exception does not convey the reason, and Flink does not show logs other than the caught exceptions.

@pfifer
Contributor

pfifer commented Apr 9, 2018

Unfortunately the KPL is built against glibc, while Alpine Linux uses musl libc. This causes the native component to fail runtime linking and crash. There appear to be some Docker images that include glibc, but I can't vouch for whether they would work or not.

@linksach3

Hi,
I am using the KPL (0.12.8) in an AWS Lambda function written in Java 8. Here is my code snippet:

// Put records and save the Futures.
for (MyRecord myRecord : MyRecordsList) {
    putFutures.add(kinesisProducer.addUserRecord(kinesis_stream_name,
            randomPartitionKey(), ByteBuffer.wrap(convertToJson(myRecord).getBytes("UTF-8"))));
}

// Wait for puts to finish and check the results.
LOGGER.info("Waiting for KinesisProducer puts to finish...");
for (Future<UserRecordResult> f : putFutures) {
    UserRecordResult result = f.get();
    if (result.isSuccessful()) {
        // Increment the success count for this run.
        recordsPut.getAndIncrement();
        long totalTime = result.getAttempts().stream().mapToLong(a -> a.getDelay() + a.getDuration()).sum();
        LOGGER.debug(
                String.format("Successfully put record #%d, sequence_no %s took %d attempts, totalling %d ms.",
                        recordsPut.get(), result.getSequenceNumber(), result.getAttempts().size(), totalTime));
    } else {
        Attempt last = Iterables.getLast(result.getAttempts());
        LOGGER.error(
                String.format("Record failed to put - %s : %s", last.getErrorCode(), last.getErrorMessage()));
        throw new UserRecordFailedException(result);
    }
}
kinesisProducer.flushSync();

This code works fine most of the time. I just ran a load with millions of records and I see that hundreds of them failed with the error below:

com.amazonaws.services.kinesis.producer.UserRecordFailedException
java.util.concurrent.ExecutionException: com.amazonaws.services.kinesis.producer.UserRecordFailedException
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[task/:?]
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[task/:?]
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[task/:?]

at the line f.get().

Can you please suggest a workaround? I am using all default KPL configuration other than the following:
config.setAggregationMaxSize(512000);

@linksach3

Also, in addition to the above, I am seeing the errors below:

java.lang.RuntimeException: EOF reached during read
java.util.concurrent.ExecutionException: java.lang.RuntimeException: EOF reached during read
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[task/:?]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[task/:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[task/:?]

@linksach3

Also: com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.

@cgpassante

I have been able to cause this problem by calling the KPL from a catch block for an InterruptedException in a child thread. I was using the KPL for logging and attempted to log from the catch block when a child thread was interrupted. Unfortunately the KPL has a shutdown hook that runs when the thread it was called from exits, not the thread where the KPL was instantiated. The hook kills the daemon process that writes events to the shards. I would say this is a KPL bug: they should only kill the daemon when the parent thread of the KPL object is killed.

@rakhu

rakhu commented Sep 2, 2018

Getting these frequent failures with the 0.12.9 jar on Windows 2012 R2 Standard (VMware, Intel 2 GHz, 4 processors, 6 GB RAM).
During this time the CPU goes to 100% and brings down the Windows server. As a workaround, in onFailure I call destroy (flushSync and destroy didn't help), but in that case I lose all the outstanding records. Can you help me find a way to solve this? At the least I need a way to get the messages back and reprocess them.

Note: all default Kinesis configuration is used.

@vijaychd

vijaychd commented Sep 4, 2018

It worked fine locally, but I am getting heap size issues running in a Linux Docker container. How much heap does this need?

@indraneelr

Any updates or workarounds for this issue? We are still facing this with 0.12.9.

org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:784)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.flush(KinesisProducer.java:804)
        at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.flushSync(FlinkKinesisProducer.java:404)
        at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.close(FlinkKinesisProducer.java:305)
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:477)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
        at java.lang.Thread.run(Thread.java:748)
1513132 [kpl-daemon-0000] ERROR org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
java.lang.RuntimeException: Child process exited with code 137
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:509)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:487)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:63)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
1513132 [kpl-daemon-0000] INFO org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Restarting native producer process.
1513243 [kpl-daemon-0004] ERROR org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
java.lang.RuntimeException: Error writing message to daemon
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:513)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.sendMessage(Daemon.java:234)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.access$400(Daemon.java:63)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon$2.run(Daemon.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.sendMessage(Daemon.java:230)
        ... 5 more
1513245 [kpl-daemon-0004] INFO org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer - Restarting native producer process.

@Asamkhata071

@pfifer
I am using version 12.11 of amazon-kinesis-producer, which I downloaded from https://jar-download.com/artifacts/com.amazonaws/amazon-kinesis-producer, running on Windows 10.
I am trying to run the sample code from https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-writing.html (Responding to Results Synchronously).

I get the following error:
[main] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Extracting binaries to C:\Users\AppData\Local\Temp\amazon-kinesis-producer-native-binaries
[main] INFO com.amazonaws.services.kinesis.producer.HashedFileCopier - 'C:\Users\AppData\Local\Temp\amazon-kinesis-producer-native-binaries\kinesis_producer_36877407482F2EE24DFC2F3B20F02F0404BFA4EC.exe' already exists, and matches. Not overwriting.
[kpl-daemon-0001] ERROR com.amazonaws.services.kinesis.producer.KinesisProducer - Error in child process
com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:537)
at com.amazonaws.services.kinesis.producer.Daemon.access$1200(Daemon.java:63)
at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:460)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.file.NoSuchFileException: \\.\pipe\amz-aws-kpl-in-pipe-f035bcfc
at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
at java.nio.channels.FileChannel.open(Unknown Source)
at java.nio.channels.FileChannel.open(Unknown Source)
at com.amazonaws.services.kinesis.producer.Daemon.connectToChild(Daemon.java:347)
at com.amazonaws.services.kinesis.producer.Daemon.access$1000(Daemon.java:63)
at com.amazonaws.services.kinesis.producer.Daemon$6.run(Daemon.java:457)
... 3 more
[kpl-daemon-0001] INFO com.amazonaws.services.kinesis.producer.KinesisProducer - Restarting native producer process.
java.util.concurrent.ExecutionException: com.amazonaws.services.kinesis.producer.IrrecoverableError: Unexpected error connecting to child process
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at KPLSimple.main(KPLSimple.java:58)

Which version of the amazon-kinesis-producer jar (aka the KPL jar) should I be using for Windows 10?

@Asamkhata071

@pfifer

I also tried using version 0.10.2, with the same program and configuration as above, and I get the following error:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
starts.
[2019-04-23 10:54:38.002344] [0x00005134] [info] [metrics_manager.h:148] Uploading metrics to monitoring.us-east-2.amazonaws.com:443
[2019-04-23 10:54:38.060220] [0x00005134] [info] [kinesis_producer.cc:79] Created pipeline for stream "KinesisStream"
[2019-04-23 10:54:38.060220] [0x00005134] [info] [shard_map.cc:83] Updating shard map for stream "KinesisStream"
[2019-04-23 10:54:38.158691] [0x00005240] [error] [http_client.cc:148] Failed to open connection to monitoring.us-east-2.amazonaws.com:443 : certificate verify failed
[2019-04-23 10:54:38.249706] [0x00005240] [error] [http_client.cc:148] Failed to open connection to kinesis.us-east-2.amazonaws.com:443 : certificate verify failed

I don't encounter this "certificate verify failed" error with higher versions of the producer (e.g. 12.x). Why is that?

@antgustech

antgustech commented May 14, 2019

Same thing here with 12.11: no errors are thrown, but no logs are created on my S3 bucket either:

  KinesisProducerConfiguration config = new KinesisProducerConfiguration()
             .setRecordMaxBufferedTime(3000)
             .setMaxConnections(5)
             .setRequestTimeout(60000)
             .setRegion("eu-west-1");


     final KinesisProducer kinesis = new KinesisProducer(config);

     System.out.println(kinesis.toString());
     ByteBuffer data = ByteBuffer.wrap(("Test main java").getBytes());
     kinesis.addUserRecord("myStream", "appStarts", data);
     System.out.println("Done");

But if I instead wait for callbacks as below, I get the following errors:

        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                .setRecordMaxBufferedTime(3000)
                .setMaxConnections(5)
                .setRequestTimeout(60000)
                .setRegion("eu-west-1");


 KinesisProducer kinesis = new KinesisProducer(config);

         Thread.sleep(2000);
         FutureCallback<UserRecordResult> myCallback = new FutureCallback<UserRecordResult>() {
             @Override
             public void onFailure(Throwable t) {
                 System.out.println("Failed: " + t.toString());
                 System.out.println(t.getStackTrace().toString());
                 t.printStackTrace();
             }

             @Override
             public void onSuccess(UserRecordResult result) {
                 System.out.println("Success: " + result.toString());
             }
         };

         for (int i = 0; i < 10; ++i) {
             ByteBuffer data = ByteBuffer.wrap("myData".getBytes("UTF-8"));
             ListenableFuture<UserRecordResult> f = kinesis.addUserRecord("myStream", "myPartitionKey", data);
             // If the Future is complete by the time we call addCallback, the callback will be invoked immediately.
             Futures.addCallback(f, myCallback);
         }

         for (int i = 0; i < 5; i++) {
             try {
                 Thread.sleep(10000); //So I can wait and see the callbacks.
             } catch (InterruptedException e) {
                 e.printStackTrace();
             }

         }
Failed: com.amazonaws.services.kinesis.producer.UserRecordFailedException
[Ljava.lang.StackTraceElement;@71d70b5e
com.amazonaws.services.kinesis.producer.UserRecordFailedException
	at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.onPutRecordResult(KinesisProducer.java:197)
	at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.access$000(KinesisProducer.java:131)
	at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler$1.run(KinesisProducer.java:138)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

I have tried older versions such as 0.10, but they cannot connect at all; I get connection errors.

@cgpassante

I can reliably reproduce the error with a very short code snippet. Two points to note: the first request throws the exception, while subsequent requests work. This behavior happens in a Docker container running various OpenJDK-based images (amazoncorretto-8, alpine-openjdk-8, etc.). It does not happen outside of Docker on my laptop running Java HotSpot.

@MaximilianoFelice

MaximilianoFelice commented Jul 10, 2019

I'm trying a really simple example and finding this issue with Scala 2.13.0, Java 1.8 and KPL 0.12.11:

object KinesisTest extends App {
  val kinesis = new KinesisProducer()

  val userRecordsFutures = (1 to 2) map { idx =>
    kinesis.addUserRecord("payments-stream-test", "partitionName", ByteBuffer.wrap("test".getBytes("UTF-8"))).get()
    println(s"Published idx ${idx}")
  }
}

Exception in thread "main" com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
at com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:536)
at com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:349)
at com.example.KinesisTest$.$anonfun$userRecordsFutures$1(KinesisTest.scala:12)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.scala:18)
at scala.collection.immutable.Range.map(Range.scala:59)
at com.example.KinesisTest$.delayedEndpoint$com$example$KinesisTest$1(KinesisTest.scala:11)
at com.example.KinesisTest$delayedInit$body.apply(KinesisTest.scala:8)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:75)
at scala.App.$anonfun$main$1$adapted(App.scala:75)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterable.foreach(Iterable.scala:904)
at scala.App.main(App.scala:75)
at scala.App.main$(App.scala:73)
at com.example.KinesisTest$.main(KinesisTest.scala:8)
at com.example.KinesisTest.main(KinesisTest.scala)

Has anyone found a fix for this issue? It seems to be entirely reproducible and many people have run into it. Is there an official roadmap for fixing this?

@ChengzhiZhao

We are facing a similar issue on KPL 0.12.11. I am following up to see if anyone has ideas on how to get around it; we have tried setting up a DLQ for it. Thanks!

@cgpassante

I faced this issue and Amazon tech support helped me debug it. I don't know Scala, but the Java version of the API returns a future from the addUserRecord method. By inspecting that returned object you can learn why the add failed. In my case it was because I forgot to add credentials to my container, which prevented the KPL daemon from connecting to the stream. That can cause this exception to be thrown when you attempt to add a record.
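
A sketch of that kind of inspection (the helper class is illustrative, not from the KPL samples): unwrap the UserRecordFailedException from the future and read the last Attempt's error code and message.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import com.amazonaws.services.kinesis.producer.Attempt;
import com.amazonaws.services.kinesis.producer.UserRecordFailedException;
import com.amazonaws.services.kinesis.producer.UserRecordResult;
import com.google.common.collect.Iterables;

public class PutResultInspector {
    public static void inspect(Future<UserRecordResult> future) throws InterruptedException {
        try {
            UserRecordResult result = future.get();
            System.out.println("Put succeeded, sequence number " + result.getSequenceNumber());
        } catch (ExecutionException e) {
            if (e.getCause() instanceof UserRecordFailedException) {
                UserRecordResult failed = ((UserRecordFailedException) e.getCause()).getResult();
                Attempt last = Iterables.getLast(failed.getAttempts());
                // The error code/message explain why the record was rejected
                // (missing credentials, missing stream, wrong region, throttling, ...).
                System.err.println(last.getErrorCode() + " : " + last.getErrorMessage());
            } else {
                e.getCause().printStackTrace();
            }
        }
    }
}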

@Cory-Bradshaw
Contributor

Hello,

Thank you everyone for sharing your experience and learning with the community. For an example of how to implement this, see the KPL sample application in this repository, specifically this line. (this is for a test application, and in this case, it just shuts down the sample application after displaying the underlying failure)

This is a general failure condition that occurs when there is any unresolvable configuration problem with the KPL. Usually when this happens it is for one of the following reasons:

  1. Credentials could not be found
  2. Lack of permissions for resources
  3. Stream does not exist
  4. Targeting wrong region (leading to Stream does not exist)

If you are experiencing this problem and can confirm that it is not due to a configuration/access issue, please re-open the issue and provide more details on your configuration, along with any reproduction steps that consistently trigger it, including details about stream creation, IAM users/roles/permissions, container/EC2 instance, etc.

For additional assistance, you may also open a customer support ticket through the AWS console to receive more specific support.

@namedgraph

namedgraph commented Aug 12, 2019

I followed the "Barebones Producer Code" and got this exception when calling kinesis.addUserRecord().
I tried adding a callback after that, but the code does not reach it due to the exception.

The same setup worked using KCL (1.x). I will be trying KCL 2.x now.

@Cory-Bradshaw
Contributor

@namedgraph Try removing the loop from the sample code and writing a single record with the callback above. If this doesn't work, please respond with more details (full code, IAM permissions used, etc.).

I'm not sure why you are concerned about the KCL version here. There aren't any version dependencies between the KPL and the KCL.

@wikier

wikier commented Sep 23, 2019

Any update on this? I've just run into similar issues...

Is the official recommendation to move off the KPL and adopt the SDK directly?

@Cory-Bradshaw
Contributor

Repeating from above:

Hello,

Thank you everyone for sharing your experience and learning with the community. For an example of how to implement this, see the KPL sample application in this repository, specifically this line. (this is for a test application, and in this case, it just shuts down the sample application after displaying the underlying failure)

This is a general failure condition that occurs when there is any unresolvable configuration problem with the KPL. Usually when this happens it is for one of the following reasons:

  1. Credentials could not be found
  2. Lack of permissions for resources
  3. Stream does not exist
  4. Targeting wrong region (leading to Stream does not exist)

If you are experiencing this problem and can confirm that it is not due to a configuration/access issue, please re-open the issue and provide more details on your configuration, along with any reproduction steps that consistently trigger it, including details about stream creation, IAM users/roles/permissions, container/EC2 instance, etc.

For additional assistance, you may also open a customer support ticket through the AWS console to receive more specific support.

@wikier ,

Additionally, this can happen if the KPL process gets overwhelmed because backpressure has not been implemented in the producer application. I highly recommend reading this blog post to understand some of the considerations for configuring and using the KPL (a minimal backpressure sketch follows the link below):

https://aws.amazon.com/blogs/big-data/implementing-efficient-and-reliable-producers-with-the-amazon-kinesis-producer-library/
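
A minimal backpressure sketch along the lines the blog post describes (the threshold and names are illustrative): block the producing caller while the native process has too many records outstanding, instead of letting its buffers grow without bound.

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.producer.KinesisProducer;

public class BackpressuredProducer {
    private static final int MAX_OUTSTANDING = 10_000;  // illustrative threshold
    private final KinesisProducer producer = new KinesisProducer();

    public void put(ByteBuffer data) throws InterruptedException {
        // Wait while the native process is saturated instead of letting its
        // buffers (and memory/thread usage) grow without bound.
        while (producer.getOutstandingRecordsCount() > MAX_OUTSTANDING) {
            Thread.sleep(10);
        }
        producer.addUserRecord("my-stream", "partition-key", data);  // placeholder names
    }
}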

@srinihacks

srinihacks commented Oct 10, 2019

We are facing this issue when we write a high volume of data to the stream. I don't think it is related to the configuration and access issues listed below:

Credentials could not be found
Lack of permissions for resources
Stream does not exist
Targeting the wrong region (leading to Stream does not exist)

We have also implemented backpressure, specifically flushing records when the number of outstanding records reaches the maximum threshold configured by the application. Whenever this error occurs, CPU usage goes high.

There is already an AWS support ticket open to address this issue.

@fmthoma

fmthoma commented Oct 10, 2019

@srinihacks The high CPU problem is also known, see #187. It's the reason we moved away from the KPL in the end, towards the Kinesis Aggregation Library plus the Kinesis client. That simplified a lot of things for us.

@mkemaldurmus

mkemaldurmus commented Apr 26, 2022

ERROR c.a.s.k.producer.KinesisProducer - Error in child process
java.lang.RuntimeException: Child process exited with code 130
        at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
        at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:509)
        at com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:487)
        at com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:63)
        at com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
15:39:49.105 [kpl-daemon-0000] ERROR c.a.s.k.producer.KinesisProducer - Error in child process
java.lang.RuntimeException: Child process exited with code 130
        at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
        at com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:509)
        at com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:487)
        at com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:63)
        at com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

I am using an M1 MacBook and Java 8 for M1. Do you have any advice?
