On occasion, invoking some AWS Lambda Functions inside our VPC would result in connections timing out unexpectedly, surfacing as ConnectionClosedError. The proximate cause appeared to be the default timeout configuration in the boto3 library, but the real problem was lower level than that.
The Lambdas log the specific parameters that cause failures. They make database queries, so on occasion they run for a long time. The Lambda timeouts are set to 300 seconds, yet we receive timeout errors after only 120 seconds.
The default timeout in botocore for HTTP connections is 60 seconds. Both read_timeout and connect_timeout had already been overridden to 300:
import boto3
import botocore.config

LAMBDA_CLIENT = boto3.client(
    'lambda',
    config=botocore.config.Config(
        retries={'max_attempts': 0}, connect_timeout=300, read_timeout=300))
This should only time out after 5 minutes and never retry a request. Out of an abundance of caution, we reduced these timeouts to 5 seconds:
LAMBDA_CLIENT = boto3.client(
    'lambda',
    config=botocore.config.Config(
        retries={'max_attempts': 0}, connect_timeout=5, read_timeout=5))
This crashes when the Lambda takes longer than five seconds to return a value, but it is a new crash: ReadTimeoutError instead of the original ConnectionClosedError.
Since the new error behaves exactly as the configuration predicts, the client's timeout settings are not the reason for our connection problem.
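As a sanity check, here is a minimal sketch of how the two failure modes can be told apart during a synchronous invocation; the function name and payload handling are placeholders rather than our actual code:

import json
from botocore.exceptions import ConnectionClosedError, ReadTimeoutError

def invoke_sync(payload):
    try:
        # Synchronous invocation: the HTTP connection stays open until the
        # Lambda returns a value or a timeout fires.
        response = LAMBDA_CLIENT.invoke(
            FunctionName='long-running-function',  # placeholder name
            InvocationType='RequestResponse',
            Payload=json.dumps(payload))
        return json.loads(response['Payload'].read())
    except ReadTimeoutError:
        # The client waited read_timeout seconds without a response:
        # the expected crash once read_timeout was lowered to 5 seconds.
        raise
    except ConnectionClosedError:
        # The connection was torn down before a response arrived:
        # the original, unexpected failure.
        raise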
According to the AWS Lambda Invoke docs:
For functions with a long timeout, your client might be disconnected during synchronous invocation while it waits for a response. Configure your HTTP client, SDK, firewall, proxy, or operating system to allow for long connections with timeout or keep-alive settings.
Our EC2 fleet that invokes the Lambda Functions is inside a VPC, and so are the Lambdas themselves. Since there is no VPC Endpoint support for Lambda, these requests go through the public internet. For us, that means passing through our datacenter, which has a limited keep-alive configuration.
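One way to follow the docs' advice at the SDK level is TCP keep-alive: recent botocore releases accept a tcp_keepalive flag on Config, which asks the operating system to send keep-alive probes on the underlying socket. A sketch, assuming a botocore version that supports the flag:

import boto3
import botocore.config

# tcp_keepalive is an assumption here: it only exists in newer botocore
# releases, and the probe interval is governed by OS settings
# (e.g. net.ipv4.tcp_keepalive_time on Linux), which may also need tuning
# to fire before the intermediate network drops the idle connection.
LAMBDA_CLIENT = boto3.client(
    'lambda',
    config=botocore.config.Config(
        retries={'max_attempts': 0},
        connect_timeout=300,
        read_timeout=300,
        tcp_keepalive=True))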