On occasion, invoking some AWS Lambda Functions inside of our VPC would result in unexpected connection timeouts, specifically ConnectionClosedError. The proximate cause appeared to be the default timeout configuration in the boto3 library, but the real problem was lower level than that.
The Lambdas log the specific parameters that cause failures. Because the Lambda functions make database queries, they sometimes run for a long time. The Lambda timeouts are set to 300 seconds, but we receive timeout errors after only 120 seconds.
The default botocore timeout for HTTP connections is 60 seconds. Both read_timeout and connect_timeout had already been overridden to 300:
    LAMBDA_CLIENT = boto3.client(
        'lambda',
        config=botocore.config.Config(
            retries={'max_attempts': 0},
            connect_timeout=300,
            read_timeout=300))
This should only time out after 5 minutes and never retry a request. Out of an abundance of caution, we reduced these timeouts to 5 seconds:
    LAMBDA_CLIENT = boto3.client(
        'lambda',
        config=botocore.config.Config(
            retries={'max_attempts': 0},
            connect_timeout=5,
            read_timeout=5))
This crashes when the Lambda takes longer than five seconds to return a value, but it is a new crash: ReadTimeoutError versus the original ConnectionClosedError 1. Since the new error makes sense, the Lambda timeout is not the cause of our connection problem.
According to the AWS Lambda Invoke docs:
For functions with a long timeout, your client might be disconnected during synchronous invocation while it waits for a response. Configure your HTTP client, SDK, firewall, proxy, or operating system to allow for long connections with timeout or keep-alive settings.
Our EC2 fleet that invokes the Lambda Functions is inside of a VPC, and so are the Lambdas themselves. Since there is no VPC Endpoint support for Lambda 2, these requests go through the public internet. For us, that means passing through our datacenter, which has a limited keep-alive configuration.