Christopher Sardegna's Blog

Thoughts on technology, design, data analysis, and data visualization.

Why My Lambdas Unexpectedly Timed Out

Issue Discovery

On occasion, invoking some AWS Lambda functions inside our VPC would fail with an unexpectedly closed connection, specifically a ConnectionClosedError. The proximate cause appeared to be the default timeout configuration in the boto3 library, but the real problem was at a lower level than that.

Debugging Attempts

The Lambdas log the specific parameters that cause failures. The Lambda functions make database queries, so they occasionally run for a long time. The Lambda timeouts are set to 300 seconds, but we received timeout errors after only 120 seconds.

Adjusting Lambda Timeouts

The default timeout in botocore for HTTP connections is 60 seconds. Both read_timeout and connect_timeout had already been overridden to 300:

LAMBDA_CLIENT = boto3.client(
    'lambda', config=botocore.config.Config(retries={'max_attempts': 0},
                                            read_timeout=300, connect_timeout=300))

This should time out only after 5 minutes and never retry a request. Out of an abundance of caution, we reduced these timeouts to 5 seconds:

LAMBDA_CLIENT = boto3.client(
    'lambda', config=botocore.config.Config(retries={'max_attempts': 0},
                                            read_timeout=5, connect_timeout=5))

This crashes when the Lambda takes longer than five seconds to return a value, but with a new error: ReadTimeoutError rather than the original ConnectionClosedError [1].

Since the new error is exactly what a client-side read timeout should produce, and it differs from the original error, the timeout configuration is not the reason for our connection problem.
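The difference between the two failure modes can be reproduced locally without AWS. The following is a minimal sketch using only Python's standard library; the helper names serve_once and call are hypothetical, and raw TCP sockets stand in for the HTTP connection that boto3 manages. It assumes that a read timeout corresponds to the client giving up on a peer that is still connected, while a closed connection corresponds to something on the other side hanging up before any reply arrives.

```python
import socket
import threading
import time


def serve_once(behavior: str, delay: float) -> int:
    """Start a throwaway local TCP server and return its port.

    behavior="slow":  keep the connection open and reply after `delay` seconds.
    behavior="close": hang up after `delay` seconds without replying.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def handle() -> None:
        conn, _ = server.accept()
        with conn:
            time.sleep(delay)
            if behavior == "slow":
                try:
                    conn.sendall(b"done")
                except OSError:
                    pass  # the client already gave up
        server.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


def call(port: int, read_timeout: float) -> str:
    """Connect, wait for a reply, and report how the wait ended."""
    with socket.create_connection(("127.0.0.1", port), timeout=read_timeout) as client:
        try:
            data = client.recv(1024)
        except socket.timeout:
            return "read timeout"  # the client's own limit expired
        except ConnectionError:
            return "connection closed"  # the peer reset the connection
        # An empty read means the peer closed the connection cleanly.
        return "ok" if data else "connection closed"


if __name__ == "__main__":
    print(call(serve_once("slow", 1.0), read_timeout=0.2))
    print(call(serve_once("close", 0.2), read_timeout=2.0))
```

In this sketch, shortening the client's timeout only changes when the first kind of failure appears. It cannot produce the second kind, which is decided by whatever sits on the other end of the connection.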

Traffic Analysis

According to the AWS Lambda Invoke docs:

For functions with a long timeout, your client might be disconnected during synchronous invocation while it waits for a response. Configure your HTTP client, SDK, firewall, proxy, or operating system to allow for long connections with timeout or keep-alive settings.

The EC2 fleet that invokes our Lambda functions is inside a VPC, and so are the Lambdas themselves. Since there is no VPC Endpoint support for Lambda [2], these requests go through the public internet. For us, that means passing through our datacenter, which has a limited keep-alive configuration.
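To make the "keep-alive settings" from the quoted docs concrete, here is a minimal sketch of how a client process could ask the operating system to send TCP keep-alive probes on an otherwise idle connection. The function name enable_keepalive and the values for idle, interval, and probes are assumptions chosen for illustration, not settings taken from our environment. The only constraint suggested by the failures above is that the idle time should be well under the 120 seconds at which connections were being dropped. This configures a plain socket and does not show how to apply the options to the connections boto3 opens.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 15, probes: int = 4) -> socket.socket:
    """Turn on TCP keep-alive probes for `sock` and return it.

    idle:     seconds of silence before the first probe (assumed value)
    interval: seconds between unanswered probes (assumed value)
    probes:   unanswered probes before the OS drops the connection
    """
    # Portable switch: enable keep-alive with the OS default timings.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timing options are platform-specific, so only set
    # the ones this platform's socket module exposes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

The probes are small packets sent while the application is waiting, so intermediate devices that drop idle connections continue to see traffic during a long synchronous invocation.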

  1. The last two exceptions in this exception handler.

  2. It is omitted from the list on the docs page.