How to dig into latencies and 4xx/5xx errors on AWS API Gateway

When you build serverless APIs or applications, API Gateway and Lambda are a common choice. While they provide a lot of benefits, they also introduce new challenges. I'm not going to cover all of those challenges in this post, but I'd like to talk about the two hottest topics: latency and 4xx/5xx errors.
Before jumping into each topic, let's take a look at a basic overview that will give you a better understanding of both.

This is a 10,000 ft view showing how a request from the client travels to your final integration via API Gateway.

Client — API Endpoint — API Gateway — Integration Endpoint — Integration

It is important to understand this picture, as latency and 4xx/5xx errors can be introduced by any of these players. Keep it in mind as we go further. In this post, "Integration" will mostly mean Lambda, but the same principles apply to any integration.

To understand the latency, let's break it down into its parts, based on the 10,000 ft view above.

The first part is between the Client and the API Endpoint. This includes DNS resolution, connection establishment, the TLS handshake, and network latency. API Gateway plays no role here; it is entirely up to the source of the traffic (where the client is), the API endpoint type (Edge-Optimized/Regional/Private), and connection reuse.
For example, assume the traffic originates from an EC2 instance and the endpoint type is Edge-Optimized. The request must leave the EC2 region, go to the nearest CloudFront POP, and then be routed back to the region where your API is located. If the client and your API are within the same AWS region, it is better to use a Regional API endpoint. In any case, make sure the client reuses connections, as this is the biggest contributor to latency in this part.
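
As a minimal sketch of client-side connection reuse: the snippet below keeps one HTTP connection alive across calls with Python's requests library, so only the first request pays for DNS resolution, the TCP connect, and the TLS handshake. The endpoint URL is a hypothetical example.

```python
import requests

# Hypothetical API endpoint URL
API_URL = "https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/prod/items"

# requests.Session keeps the underlying TCP/TLS connection alive,
# so repeated calls skip DNS + connect + handshake.
session = requests.Session()

for i in range(10):
    resp = session.get(API_URL, timeout=5)
    # resp.elapsed is the time-to-response for each call; the first call
    # should be noticeably slower than the warm, reused-connection calls.
    print(i, resp.status_code, resp.elapsed.total_seconds())
```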

The second part is between the API Endpoint and API Gateway. This depends entirely on the endpoint type you choose. The endpoint itself has its own overhead to process the request, plus network hops/latency until the request reaches API Gateway.

The third part is within API Gateway itself. When the request reaches API Gateway, it is processed internally before a request is sent to your integration, and again after the response is received from the integration. You can consider this API Gateway's overhead latency. Several drivers affect this overhead: the features you use (e.g. authorization, request validation, mapping templates), the request/response payload size, whether the request payload is compressed or the response payload needs to be compressed, and so on.
Currently, API Gateway publishes two latency metrics: Latency and IntegrationLatency. Latency is the overall time taken by API Gateway to execute your request, including IntegrationLatency, which is the time between when API Gateway sent the integration request and when it received a response.
You can calculate API Gateway's overhead by subtracting IntegrationLatency from Latency. Note that when you use a custom authorizer, the Latency metric also includes the authorizer's integration latency.
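
If you want to track that overhead continuously, CloudWatch metric math can do the subtraction for you. Below is a minimal sketch using boto3; the API name is a hypothetical placeholder.

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")
dims = [{"Name": "ApiName", "Value": "my-api"}]  # hypothetical API name
now = datetime.now(timezone.utc)

resp = cw.get_metric_data(
    MetricDataQueries=[
        {"Id": "latency", "MetricStat": {
            "Metric": {"Namespace": "AWS/ApiGateway", "MetricName": "Latency",
                       "Dimensions": dims},
            "Period": 300, "Stat": "Average"}},
        {"Id": "integration", "MetricStat": {
            "Metric": {"Namespace": "AWS/ApiGateway",
                       "MetricName": "IntegrationLatency", "Dimensions": dims},
            "Period": 300, "Stat": "Average"}},
        # API Gateway overhead = Latency - IntegrationLatency
        {"Id": "overhead", "Expression": "latency - integration"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
)

for result in resp["MetricDataResults"]:
    print(result["Id"], result["Values"][:3])
```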

The fourth part is between API Gateway and the Integration Endpoint. This is very similar to the first part (Client to API Endpoint). All the same factors apply, except that the TLS handshake is optional, since HTTPS is not mandatory for your integration. API Gateway reuses connections, but this also depends on the connection-reuse policy of the integration side. If the integration endpoint allows only a very short keep-alive duration, or does not support keep-alive at all, more frequent reconnections are required, which hurts latency significantly.

The fifth part is between the Integration Endpoint and the Integration, much like the second part. The Integration Endpoint has its own overhead to process the request, plus network hops/latency until the request reaches your final integration.
When you use Lambda, there are even more players between the Lambda endpoint and your Lambda function. This is why API Gateway's IntegrationLatency is larger than your Lambda function's execution duration. Especially when a cold start happens, IntegrationLatency will be much higher than the execution time. Since API Gateway caps the integration timeout at 29 seconds, the integration request can time out while your Lambda function still executes once the cold start completes. In that case you will see a 29,000 ms IntegrationLatency in API Gateway while your Lambda execution time shows, for example, 100 ms.
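
One cheap way to make cold starts visible in your own logs is a module-level flag, since module scope survives across warm invocations of the same execution environment. A minimal sketch (the business logic is a placeholder):

```python
import time

_cold_start = True  # module scope survives warm invocations


def handler(event, context):
    global _cold_start
    was_cold, _cold_start = _cold_start, False
    started = time.monotonic()

    # ... your business logic goes here (placeholder) ...

    # Logs whether this invocation ran in a fresh execution environment
    # and how long the handler body itself took.
    print({
        "coldStart": was_cold,
        "handlerMs": round((time.monotonic() - started) * 1000),
    })
    return {"statusCode": 200, "body": "ok"}
```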

Finally, the last part is within your integration itself. Many people tend to think of this as the only latency they should see. If you are using Lambda, this is the execution duration you observe. I hope you can now see that this server-side integration latency is just one part of the overall latency.

If you are sensitive to latency, I recommend using X-Ray without any hesitation. Both API Gateway and Lambda support X-Ray, and it will reveal the latency between API Gateway, the Lambda endpoint, and your Lambda function. This will give you a much better idea of where the bottleneck is, how often and how long cold starts happen, and how warm starts behave.
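
Both toggles can be flipped without redeploying. A minimal sketch with boto3, using hypothetical API/stage/function names:

```python
import boto3

# Enable X-Ray tracing on an API Gateway stage (hypothetical ids).
apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="a1b2c3d4e5",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/tracingEnabled", "value": "true"},
    ],
)

# Enable active tracing on the Lambda function (hypothetical name).
lam = boto3.client("lambda")
lam.update_function_configuration(
    FunctionName="my-function",
    TracingConfig={"Mode": "Active"},
)
```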

Otherwise, you can track it a different way. Remember that API Gateway publishes the Latency and IntegrationLatency metrics? The same values can be added to the API Gateway Access Log by putting $context.responseLatency and $context.integrationLatency into the access log format. The access log entry also carries the request time (if you put it into the access log format), telling you when the request arrived at API Gateway. If your client logs the time when it sent the request, you can estimate how long the request took to reach API Gateway (the first and second parts of the latency breakdown).
Then you can calculate the third part (API Gateway's overhead) from the difference between Latency and IntegrationLatency. If you enabled the Execution Log on API Gateway, find the log entries matching the request id from the access log. Based on the timestamps, you can figure out when the request was sent to your integration; then find the timestamp when your integration received the request. The difference between the two timestamps covers the fourth and fifth parts (API Gateway -> Integration Endpoint -> Integration).
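
As a minimal sketch, here is one way to configure such an access log format with boto3. The API id, stage name, and log group ARN are hypothetical, and the CloudWatch Logs group must already exist:

```python
import boto3

# JSON access log format using $context variables for request id,
# arrival time, status, and the two latency values.
log_format = (
    '{"requestId":"$context.requestId",'
    '"requestTime":"$context.requestTime",'
    '"status":"$context.status",'
    '"responseLatency":"$context.responseLatency",'
    '"integrationLatency":"$context.integrationLatency"}'
)

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="a1b2c3d4e5",   # hypothetical REST API id
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/accessLogSettings/destinationArn",
         "value": "arn:aws:logs:us-east-1:123456789012:log-group:apigw-access-logs"},
        {"op": "replace", "path": "/accessLogSettings/format", "value": log_format},
    ],
)
```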

Please remember that timestamp comparisons are not always accurate, as the clocks on each service/host can be skewed.

One of the most frequent questions is "API Gateway returned a 5xx, but there are no errors/logs on my integration (Lambda)." If you go back to the 10,000 ft view at the beginning, you can now understand why that happens: there are many players that can return a 4xx/5xx before the request ever reaches your integration.

If you can access the response that the client received, check the headers. If there are no headers named x-amz-apigw-id and x-amzn-requestid, the error was returned by the API Endpoint in front of API Gateway. For example, if the request hit an Edge-Optimized endpoint and you see X-Amz-Cf-Id but no x-amz-apigw-id, CloudFront returned the error. In this case, the error will not show up in the 4xx/5xx error metrics of API Gateway or Lambda, nor will it appear in API Gateway's access log.
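
A minimal sketch of that header check in Python (the endpoint URL is hypothetical):

```python
import requests

resp = requests.get(
    "https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/prod/items"
)
# HTTP header names are case-insensitive, so normalize before checking.
headers = {k.lower(): v for k, v in resp.headers.items()}

if "x-amz-apigw-id" in headers or "x-amzn-requestid" in headers:
    print("Reached API Gateway; request id:", headers.get("x-amzn-requestid"))
elif "x-amz-cf-id" in headers:
    print("Error returned by CloudFront (edge):", headers["x-amz-cf-id"])
else:
    print("Error returned before API Gateway (no API Gateway headers present)")
```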

If the request reached API Gateway, the response must have the x-amz-apigw-id and x-amzn-requestid headers. If you enabled access logging and execution logging, you can expect to find a corresponding log entry. However, there is a difference between the Access Log and the Execution Log. The Access Log captures any access that hits the API stage where access logging is enabled (provided you configured a correct CloudWatch Logs role and the log group is available). The Execution Log, however, does not capture most 4xx errors, because in those cases your API was accessed but never executed. Imagine a potential attacker sending random requests to your API: you don't want them to pollute your execution log with those attempts.
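
A minimal sketch of enabling the Execution Log (including full request/response data tracing) on a stage with boto3; the ids are hypothetical, and the account needs a CloudWatch Logs role configured for API Gateway:

```python
import boto3

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="a1b2c3d4e5",
    stageName="prod",
    patchOperations=[
        # "/*/*" applies the setting to all resources and methods on the stage
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": "INFO"},
        {"op": "replace", "path": "/*/*/logging/dataTrace", "value": "true"},
    ],
)
```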

From the access log, you can identify the request id when a 5xx error is thrown. Then try to find an execution log entry with that request id. When request/response logging is enabled, you can also see the detailed method request, integration endpoint request, integration endpoint response, and method response, and figure out where the failure occurred. If it failed before the integration request was sent, the problem apparently lies within API Gateway.
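
Execution logs land in a log group named API-Gateway-Execution-Logs_{restApiId}/{stageName}, so you can search them by the request id found in the access log. A minimal sketch (the ids and request id are hypothetical):

```python
import boto3

logs = boto3.client("logs")

resp = logs.filter_log_events(
    logGroupName="API-Gateway-Execution-Logs_a1b2c3d4e5/prod",
    # Request id taken from the matching access log entry (hypothetical value)
    filterPattern='"0f74eb42-0123-4567-89ab-cdef01234567"',
)
for event in resp["events"]:
    print(event["timestamp"], event["message"].strip())
```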

If it failed after the request was sent, check whether API Gateway received ANY response from the integration. If a network/configuration issue prevented the request from reaching the endpoint, there will be no response, but the execution log will show the relevant error. If there is a response with a 4xx/5xx error, the integration "endpoint" returned the error. When the logging level is INFO and the integration is an AWS service (including Lambda), API Gateway writes the AWS service's request id into the Execution Log. However, if the request failed on the endpoint or timed out, there will be no request id.

A failure on the integration endpoint does not necessarily mean that your integration (the Lambda function) caused the error. Again, between API Gateway and your Lambda function sit the Lambda endpoint and Lambda's internal services. If either of them returns a 4xx/5xx before your function is invoked, the error will not show up in Lambda's metrics or logs.

Any 4xx failure while invoking the integration endpoint may surface as a 5xx. For example, if you are using the Lambda proxy integration and Lambda returns a 429 due to throttling, API Gateway translates it into a 500 error.

One common error is "502 Malformed Lambda proxy response". This means API Gateway expected a response in a specific JSON format but received something unexpected. This can happen before your function is hit, or your function may actually return a malformed result. Many people claim their function was implemented to return the correct response, yet it returns a malformed one when an unhandled exception is thrown from the code.
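
A minimal sketch of a well-formed Lambda proxy response in Python. The key point is the catch-all except: without it, an unhandled exception bubbles out of the handler and API Gateway receives something other than the expected JSON shape. do_work is a hypothetical placeholder for your business logic.

```python
import json


def do_work(event):
    # Hypothetical business logic placeholder.
    return {"message": "hello"}


def handler(event, context):
    try:
        body = do_work(event)
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(body),  # "body" must be a string, not a dict
            "isBase64Encoded": False,
        }
    except Exception as exc:
        # Catch unhandled exceptions so API Gateway still receives a
        # well-formed proxy response instead of a malformed error payload.
        return {
            "statusCode": 500,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": str(exc)}),
            "isBase64Encoded": False,
        }
```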

So far, I have briefly gone through each part that can cause latency or 4xx/5xx errors. For both latency and 4xx/5xx errors, I strongly suggest enabling the Access Log to get minimal evidence. If you can enable the Execution Log as well, that is even better for identifying and narrowing down the root cause. And if you are sensitive to latency, don't forget to enable X-Ray!
