Queue Unprocessed Messages
Ensure that your Amazon Simple Queue Service (SQS) queues are not holding a high number of unprocessed messages due to unresponsive or incapacitated consumers. A consumer is an AWS cloud resource, such as an EC2 instance or a Lambda function, that reads messages from the designated SQS queue and does the actual processing. The default threshold for unprocessed SQS messages is 100; you can change this threshold for the rule in your Trend Cloud One™ – Conformity account console.
This rule can help you with the following compliance standards:
- NIST4
For further details on compliance standards supported by Conformity, see here.
This rule can help you work with the AWS Well-Architected Framework.
This rule resolution is part of the Conformity Security & Compliance tool for AWS.
Whether you process raw images, transcode video files, or send out a massive number of emails, you need to keep your SQS consumers healthy and responsive by ensuring their availability and scalability within your environment. Otherwise, you will end up with a large number of messages in your SQS queues, waiting to be processed.
Audit
To determine if there are any SQS queues that hold a high number of unprocessed messages within your AWS account, perform the following actions:
Using AWS Console
01 Sign in to the AWS Management Console.
02 Navigate to Amazon SQS console at https://console.aws.amazon.com/sqs/.
03 In the main navigation panel, under Amazon SQS, choose Queues.
04 Click on the name (link) of the SQS queue that you want to examine.
05 In the Details section choose More to expand the panel with the additional information and check the Messages available attribute value. If the Messages available value is greater than or equal to 100 (default) or to the custom threshold configured in your Trend Cloud One™ – Conformity account, the selected Amazon SQS queue holds too many unprocessed messages, therefore the consumers (workers) assigned to the SQS queue could be unhealthy or incapacitated.
06 Repeat steps no. 4 and 5 for each Amazon SQS queue available within the current AWS region.
07 Change the AWS cloud region from the navigation bar and repeat the Audit process for other regions.
Using AWS CLI
01 Run list-queues command (OSX/Linux/UNIX) to list the URL of each Amazon SQS queue available in the selected AWS cloud region:
aws sqs list-queues --region us-east-1 --query 'QueueUrls[*]'
02 The command output should return an array with the requested SQS queue URLs:
[ "https://sqs.us-east-1.amazonaws.com/123456789012/cc-web-app-worker", "https://sqs.us-east-1.amazonaws.com/123456789012/cc-mobile-app-queue" ]
03 Run get-queue-attributes command (OSX/Linux/UNIX) using the URL of the Amazon SQS queue that you want to examine as the identifier parameter and custom query filters to return the number of messages currently available within the selected SQS queue:
aws sqs get-queue-attributes --region us-east-1 --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/cc-web-app-worker --attribute-names "ApproximateNumberOfMessages" --query 'Attributes.ApproximateNumberOfMessages'
04 The command output should return the number of SQS queue messages available at the request time:
"139"
If the value returned by the get-queue-attributes command output is greater than or equal to 100 (default) or to the custom threshold configured in your Trend Cloud One™ – Conformity account, the selected Amazon SQS queue holds too many unprocessed messages, which suggests that the consumers reading from the SQS queue could be unhealthy or incapacitated.
05 Repeat steps no. 3 and 4 for each Amazon SQS queue available in the selected AWS region.
06 Change the AWS cloud region by updating the --region command parameter value and repeat the Audit process for other regions.
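The CLI audit steps above can also be scripted. The following is a minimal sketch, assuming Python 3 with the boto3 package installed and AWS credentials already configured. The names `find_backlogged_queues`, `audit_region` and `DEFAULT_THRESHOLD` are illustrative only and are not part of any AWS or Conformity API. The SQS client is passed in as a parameter so that the threshold check can be exercised without access to AWS.

```python
DEFAULT_THRESHOLD = 100  # default from this rule; replace with your custom threshold


def find_backlogged_queues(sqs_client, threshold=DEFAULT_THRESHOLD):
    """Return {queue_url: message_count} for queues at or above the threshold.

    `sqs_client` is expected to behave like boto3.client("sqs").
    """
    backlogged = {}
    # "QueueUrls" may be absent when the region has no queues.
    # A single list_queues call returns at most 1,000 queues; pagination is
    # left out of this sketch.
    for url in sqs_client.list_queues().get("QueueUrls", []):
        attrs = sqs_client.get_queue_attributes(
            QueueUrl=url,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        # SQS returns attribute values as strings, for example "139".
        count = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
        if count >= threshold:
            backlogged[url] = count
    return backlogged


def audit_region(region, threshold=DEFAULT_THRESHOLD):
    """Run the check against a real AWS region (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above stays usable offline

    client = boto3.client("sqs", region_name=region)
    return find_backlogged_queues(client, threshold)
```

Calling `audit_region` once per region of interest mirrors the last two steps of the list above.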
Remediation / Resolution
To restore the availability and scalability of your SQS consumers (workers) in order to prevent adding more unprocessed messages to the existing Amazon SQS queues, perform the following actions:
If the consumer/worker is an individual Amazon EC2 instance:
Using AWS CloudFormation
01 CloudFormation template (JSON):
{ "AWSTemplateFormatVersion":"2010-09-09", "Description":"Upgrade the instance type for the worker EC2 instance", "Resources":{ "NewGenerationInstance":{ "Type":"AWS::EC2::Instance", "Properties":{ "InstanceType":"c4.xlarge", "ImageId":"ami-0abcd1234abcd1234", "KeyName":"ssh-key", "SubnetId":"subnet-1234abcd", "SecurityGroupIds":[ "sg-01234abcd1234abcd"], "BlockDeviceMappings":[ { "DeviceName":"/dev/xvda", "Ebs":{ "VolumeSize":"50", "VolumeType":"gp2" } } ] } } } }
02 CloudFormation template (YAML):
AWSTemplateFormatVersion: '2010-09-09'
Description: Upgrade the instance type for the worker EC2 instance
Resources:
  NewGenerationInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: c4.xlarge
      ImageId: ami-0abcd1234abcd1234
      KeyName: ssh-key
      SubnetId: subnet-1234abcd
      SecurityGroupIds:
        - sg-01234abcd1234abcd
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: '50'
            VolumeType: gp2
Using Terraform (AWS Provider)
01 Terraform configuration file (.tf):
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.27"
    }
  }
  required_version = ">= 0.14.9"
}

provider "aws" {
  profile = "default"
  region  = "us-east-1"
}

resource "aws_instance" "sqs-worker-instance" {
  ami = "ami-0abcd1234abcd1234"

  # Upgrade the instance type for the worker EC2 instance
  instance_type = "c4.xlarge"

  lifecycle {
    ignore_changes = [ami]
  }
}
If the consumer is a fleet of Amazon EC2 instances managed by an Auto Scaling Group (ASG):
Using AWS CloudFormation
01 CloudFormation template (JSON):
{ "AWSTemplateFormatVersion":"2010-09-09", "Description": "Increase the size of the consumer ASG in order to manage the load", "Parameters":{ "LaunchTemplateVersionNumber":{ "Type":"String" }, "Subnets":{ "Type":"CommaDelimitedList" } }, "Resources":{ "ASGLaunchTemplate":{ "Type":"AWS::EC2::LaunchTemplate", "Properties":{ "LaunchTemplateData":{ "CreditSpecification":{ "CpuCredits":"unlimited" }, "ImageId": "ami-0abcd1234abcd1234", "InstanceType":"t2.micro" } } }, "ASG": { "Type":"AWS::AutoScaling::AutoScalingGroup", "Properties": { "AutoScalingGroupName": "cc-project5-asg", "MinSize":"1", "MaxSize":"3", "DesiredCapacity":"3", "LaunchTemplate": { "LaunchTemplateId": { "Ref":"ASGLaunchTemplate" }, "Version":{ "Ref":"LaunchTemplateVersionNumber" } }, "VPCZoneIdentifier":{ "Ref":"Subnets" } } } } }
02 CloudFormation template (YAML):
AWSTemplateFormatVersion: '2010-09-09'
Description: Increase the size of the consumer ASG in order to manage the load
Parameters:
  LaunchTemplateVersionNumber:
    Type: String
  Subnets:
    Type: CommaDelimitedList
Resources:
  ASGLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        CreditSpecification:
          CpuCredits: unlimited
        ImageId: ami-0abcd1234abcd1234
        InstanceType: t2.micro
  ASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: cc-project5-asg
      MinSize: '1'
      MaxSize: '3'
      DesiredCapacity: '3'
      LaunchTemplate:
        LaunchTemplateId: !Ref 'ASGLaunchTemplate'
        Version: !Ref 'LaunchTemplateVersionNumber'
      VPCZoneIdentifier: !Ref 'Subnets'
Using Terraform (AWS Provider)
01 Terraform configuration file (.tf):
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.27"
    }
  }
  required_version = ">= 0.14.9"
}

provider "aws" {
  profile = "default"
  region  = "us-east-1"
}

resource "aws_launch_template" "asg-launch-template" {
  name_prefix   = "cc-project5-launch-template"
  image_id      = "ami-0abcd1234abcd1234"
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "auto-scaling-group" {
  name               = "cc-project5-asg"
  availability_zones = ["us-east-1a"]

  # Increase the size of the consumer ASG in order to manage the load
  desired_capacity = 3
  max_size         = 3
  min_size         = 1

  launch_template {
    id      = aws_launch_template.asg-launch-template.id
    version = "$Latest"
  }
}
If the SQS consumer is an AWS Lambda function:
Using AWS CloudFormation
01 CloudFormation template (JSON):
{ "AWSTemplateFormatVersion":"2010-09-09", "Description": "Increase the memory size and timeout for the consumer Lambda function", "Resources":{ "FunctionExecutionRole": { "Type": "AWS::IAM::Role", "Properties": { "RoleName": "LambdaExecutionRole", "AssumeRolePolicyDocument": { "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Principal": { "Service": [ "lambda.amazonaws.com" ] }, "Action": [ "sts:AssumeRole" ] }] }, "Path": "/", "Policies": [{ "PolicyName": "AWSLambdaBasicExecutionRole", "PolicyDocument": { "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": [ "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents" ], "Resource": "*" }] } }] } }, "ConsumerFunction": { "Type": "AWS::Lambda::Function", "Properties": { "FunctionName": "cc-app-worker-function", "Handler": "index.handler", "Role": { "Fn::GetAtt": [ "FunctionExecutionRole", "Arn" ] }, "Code": { "S3Bucket": "cc-lambda-functions", "S3Key": "consumer.zip" }, "Runtime": "nodejs12.x", "MemorySize": 256, "Timeout": 30, "TracingConfig": { "Mode": "Active" } } } } }
02 CloudFormation template (YAML):
AWSTemplateFormatVersion: '2010-09-09'
Description: Increase the memory size and timeout for the consumer Lambda function
Resources:
  FunctionExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: LambdaExecutionRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: AWSLambdaBasicExecutionRole
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
  ConsumerFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: cc-app-worker-function
      Handler: index.handler
      Role: !GetAtt 'FunctionExecutionRole.Arn'
      Code:
        S3Bucket: cc-lambda-functions
        S3Key: consumer.zip
      Runtime: nodejs12.x
      MemorySize: 256
      Timeout: 30
      TracingConfig:
        Mode: Active
Using Terraform (AWS Provider)
01 Terraform configuration file (.tf):
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.27"
    }
  }
  required_version = ">= 0.14.9"
}

provider "aws" {
  profile = "default"
  region  = "us-east-1"
}

resource "aws_iam_role" "function-execution-role" {
  name = "LambdaExecutionRole"
  path = "/"
  managed_policy_arns = [
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  ]
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow"
    }
  ]
}
EOF
}

resource "aws_lambda_function" "lambda-function" {
  filename         = "consumer.zip"
  source_code_hash = filebase64sha256("consumer.zip")
  function_name    = "cc-app-worker-function"
  role             = aws_iam_role.function-execution-role.arn
  handler          = "index.handler"
  runtime          = "nodejs12.x"

  # Increase the memory size and timeout for the consumer Lambda function
  memory_size = 256
  timeout     = 30
}
Using AWS Console
01 Sign in to the AWS Management Console.
02 Navigate to Amazon SQS console at https://console.aws.amazon.com/sqs/.
03 In the main navigation panel, under Amazon SQS, choose Queues.
04 Choose the SQS queue that holds a high number of unprocessed messages and identify the unresponsive/incapacitated consumers of the selected queue.
05 Based on the AWS resource type used for the unresponsive SQS consumer, perform one of the following sets of actions:
- If the consumer/worker is an individual EC2 instance, perform the following operations:
- Navigate to Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- In the navigation panel, under Instances, choose Instances.
- Select the worker EC2 instance that you want to examine and check the instance's Status check. If the Status check is failed and the EC2 resource is unreachable, execute the following:
- Click on the Instance state dropdown button from the console top menu and select Reboot instance.
- In the Reboot instance? confirmation box, review the instance details, then choose Reboot.
- If the Status check is passed, the instance may not have enough capacity to process the necessary SQS messages. To upgrade the resource type, execute the following:
- Click on the Instance state dropdown button from the console top menu and select Stop instance.
- In the Stop instance? confirmation box, review the instance details, then choose Stop.
- Once the instance is stopped (i.e. Instance State is set to stopped), click on the Actions dropdown button from the console top menu, select Instance settings, and choose Change instance type.
- On the Change instance type configuration page, select the appropriate instance type from the Instance type dropdown list, and choose Apply to resize (upgrade) the selected Amazon EC2 instance.
- Click on the Instance state dropdown button from the console top menu and select Start instance. Once the boot sequence is complete, the EC2 instance status should change from Pending to Running.
- If the selected worker instance cannot resume the processing of the available SQS messages after reboot or instance type upgrade, you may need to troubleshoot your worker application.
- If the consumer is a fleet of Amazon EC2 instances managed by an Auto Scaling Group (ASG), perform the following actions:
- Navigate to Amazon EC2 console at https://console.aws.amazon.com/ec2/.
- In the navigation panel, under Auto Scaling, choose Auto Scaling Groups.
- Select the Auto Scaling Group (ASG) provisioned as SQS worker fleet.
- If the associated instances are healthy, the ASG might not have enough capacity to consume the required SQS messages. To increase the size of the Auto Scaling Group in order to handle the load, execute the following:
- Select the Details tab from the console bottom panel and choose Edit.
- Increase the number of EC2 worker instances available in the Desired box to add more compute power to the selected group. Depending on your auto-scaling configuration you may also need to increase the number of instances available in the Max box.
- Choose Save to apply the changes. The Auto Scaling Group will now start to provision new EC2 instances and upgrade the compute capacity of the worker fleet.
- If the consumer ASG cannot resume the SQS queue processing after the capacity upgrade, you may need to troubleshoot your worker application.
- If the SQS consumer is an AWS Lambda function, perform the following actions:
- Navigate to AWS Lambda console at https://console.aws.amazon.com/lambda/.
- In the left navigation panel, under AWS Lambda, choose Functions.
- Click on the name (link) of the function that you want to examine.
- Select the Monitor tab, choose Logs, and select View logs in CloudWatch to access the function logs in Amazon CloudWatch. Choose the right log stream and analyze it for errors. If the function log stream does not have any errors, the selected function might not have enough resources to process the designated SQS messages. To increase the consumer resources, execute the following:
- Choose the Configuration tab, select General configuration, and choose Edit.
- To increase the worker compute capacity, change the size of the memory allocated for the selected function, available in the Memory box, or change the existing timeout value within the Timeout min/sec configuration boxes.
- Choose Save to apply the changes.
- If the Lambda function worker cannot resume the SQS queue processing after the capacity (memory) upgrade, you may need to troubleshoot your worker function.
06 Repeat steps no. 4 and 5 for each SQS queue with unhealthy or incapacitated consumers, available within the current AWS region.
07 Change the AWS cloud region from the navigation bar and repeat the Remediation process for other AWS regions.
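The branching in step 05 can be summarised as a small lookup. This is only a sketch of the decision flow described above, in Python. The consumer-type labels (`"ec2"`, `"asg"`, `"lambda"`), the `recommend_action` name and the action strings are chosen for illustration. The text does not say what to do when an Auto Scaling Group is unhealthy or when the Lambda logs contain errors, so falling back to troubleshooting in those cases is an assumption.

```python
# Consumer-type labels ("ec2", "asg", "lambda") are names chosen for this sketch.
_ACTIONS = {
    ("ec2", False): "reboot the instance",
    ("ec2", True): "stop the instance, change its type, start it again",
    ("asg", True): "increase the Desired (and, if needed, Max) capacity",
    ("lambda", True): "increase the function's Memory and/or Timeout",
}
# Assumed fallback for cases the steps above do not cover explicitly.
_FALLBACK = "troubleshoot the worker application"


def recommend_action(consumer_type, looks_healthy):
    """Return the first remediation to try for a consumer.

    looks_healthy means: status checks passed (EC2), instances are healthy
    (ASG), or the CloudWatch logs show no errors (Lambda).
    """
    if consumer_type not in ("ec2", "asg", "lambda"):
        raise ValueError(f"unknown consumer type: {consumer_type!r}")
    return _ACTIONS.get((consumer_type, bool(looks_healthy)), _FALLBACK)
```

If the recommended action does not restore processing, the next step in every branch is to troubleshoot the worker application or function.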
Using AWS CLI
01 Choose the SQS queue that holds a high number of unprocessed messages and identify the unresponsive/incapacitated consumers of the selected queue.
02 Based on the AWS resource type used for the unresponsive SQS consumer, perform one of the following sets of commands:
- If the worker is a single Amazon EC2 instance, perform the following:
- Run describe-instance-status command (OSX/Linux/UNIX) to describe the operational state of the specified EC2 instance:
aws ec2 describe-instance-status --region us-east-1 --instance-ids i-01234abcd1234abcd --query 'InstanceStatuses[*].{"InstanceStatus":InstanceStatus.Details,"SystemStatus":SystemStatus.Details}'
- The command output should return the requested status information:
[ { "InstanceStatus": [ { "Name": "reachability", "Status": "failed" } ], "SystemStatus": [ { "Name": "reachability", "Status": "failed" } ] } ]
- If "InstanceStatus" and/or "SystemStatus" have the "Status" attribute value set to "failed", as shown in the example above, the EC2 resource is unreachable and requires a reboot. If both "InstanceStatus" and "SystemStatus" have the "Status" attribute value set to "passed", skip to the section with the Amazon EC2 instance upgrade.
- Run reboot-instances command (OSX/Linux/UNIX) to reboot the selected EC2 worker instance (the command does not produce an output):
aws ec2 reboot-instances --region us-east-1 --instance-ids i-01234abcd1234abcd
- If "InstanceStatus" and "SystemStatus" have the "Status" attribute value set to "passed", the instance may not have enough capacity to process the necessary SQS messages. To upgrade the instance resource type, you need to stop the instance by executing stop-instances command (OSX/Linux/UNIX):
aws ec2 stop-instances --region us-east-1 --instance-ids i-01234abcd1234abcd
- The output should return the stop-instances command request metadata:
{ "StoppingInstances": [ { "InstanceId": "i-01234abcd1234abcd", "CurrentState": { "Code": 64, "Name": "stopping" }, "PreviousState": { "Code": 16, "Name": "running" } } ] }
- Run modify-instance-attribute command (OSX/Linux/UNIX) to change (upgrade) the instance type for the worker EC2 instance (the command does not produce an output):
aws ec2 modify-instance-attribute --region us-east-1 --instance-id i-01234abcd1234abcd --instance-type "{\"Value\": \"c4.xlarge\"}"
- Run start-instances command (OSX/Linux/UNIX) to restart the selected Amazon EC2 instance (it may take a few minutes until the instance enters the running state):
aws ec2 start-instances --region us-east-1 --instance-ids i-01234abcd1234abcd
- The output should return the start-instances command request metadata:
{ "StartingInstances": [ { "InstanceId": "i-01234abcd1234abcd", "CurrentState": { "Code": 0, "Name": "pending" }, "PreviousState": { "Code": 80, "Name": "stopped" } } ] }
- If your worker EC2 instance can't resume the processing of the available SQS messages after reboot or capacity upgrade, you may need to troubleshoot your worker application.
- If the consumer is a fleet of EC2 instances managed by an Amazon Auto Scaling Group (ASG), perform the following actions:
- Run describe-auto-scaling-groups command (OSX/Linux/UNIX) to describe the configuration of the EC2 instances (workers) associated with the specified ASG:
aws autoscaling describe-auto-scaling-groups --region us-east-1 --auto-scaling-group-name cc-project5-asg --query 'AutoScalingGroups[*].Instances[]'
- The command output should return the requested configuration information:
[ { "ProtectedFromScaleIn": false, "AvailabilityZone": "us-east-1a", "InstanceId": "i-01234abcd1234abcd", "HealthStatus": "Healthy", "LifecycleState": "InService", "LaunchConfigurationName": "cc-project5-asg-config" }, { "ProtectedFromScaleIn": false, "AvailabilityZone": "us-east-1b", "InstanceId": "i-0abcd1234abcd1234", "HealthStatus": "Healthy", "LifecycleState": "Pending", "LaunchConfigurationName": "cc-project5-asg-config" } ]
- If the "HealthStatus" attribute value for the associated instances (workers) is "Healthy", as shown in the output example above, the group instances are healthy and responsive, therefore the selected ASG might not have enough capacity to consume the required SQS messages. To increase the size of the ASG in order to handle the queue load, execute update-auto-scaling-group command (OSX/Linux/UNIX) using the desired and maximum number of EC2 instances that will run within the ASG as command parameters (the command does not produce an output):
aws autoscaling update-auto-scaling-group --region us-east-1 --auto-scaling-group-name cc-project5-asg --desired-capacity 3 --max-size 3
- If the ASG consumer can't resume the SQS queue processing after the capacity upgrade, you may need to troubleshoot your worker application.
- If the SQS consumer is a Lambda function, run the following commands:
- If the log streams associated with your AWS Lambda function do not contain errors, the Lambda function may not have enough resources to process the required SQS messages. Before increasing the Lambda consumer resources, retrieve the values currently allocated by executing get-function command (OSX/Linux/UNIX):
aws lambda get-function --region us-east-1 --function-name cc-app-worker-function --query 'Configuration.[MemorySize,Timeout]'
- The command output should return the memory allocated to your function as the first value and the processing timeout (in seconds) as the second value. For example, the following Lambda function can use 128 MB of memory and has a maximum of 15 seconds of compute time to run the function code:
[ 128, 15 ]
- To increase the SQS consumer compute capacity, increase the size of the memory allocated for the selected function or increase the existing timeout value (seconds) by executing update-function-configuration command (OSX/Linux/UNIX). The following command example updates the memory size and timeout configuration attributes of an AWS Lambda function named cc-app-worker-function to 256 (MB) and 30 (seconds) using the --memory-size and --timeout parameters:
aws lambda update-function-configuration --region us-east-1 --function-name cc-app-worker-function --memory-size 256 --timeout 30
- The command output should return the new configuration for your Lambda function:
{ "FunctionName": "cc-app-worker-function", "FunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:cc-app-worker-function", "Runtime": "python3.9", "Role": "arn:aws:iam::123456789012:role/service-role/cc-app-worker-function-role-abcdabcd", "Handler": "lambda_function.lambda_handler", "CodeSize": 550, "Timeout": 30, "MemorySize": 256, "LastModified": "2021-08-30T10:00:00.000+0000", "Version": "$LATEST", "VpcConfig": { "SubnetIds": [ "subnet-abcd1234", "subnet-1234abcd" ], "SecurityGroupIds": [ "sg-01234abcd1234abcd" ], "VpcId": "vpc-abcdabcd" }, "TracingConfig": { "Mode": "PassThrough" }, "RevisionId": "abcdabcd-1234-abcd-1234-abcd1234abcd", "State": "Active", "LastUpdateStatus": "Successful", "PackageType": "Zip" }
- If the worker function can't resume the SQS queue processing after the capacity (memory and/or timeout) upgrade, you may need to analyze and troubleshoot your consumer function.
03 Repeat steps no. 1 and 2 for each SQS queue that has unresponsive or incapacitated consumers, available in the selected AWS region.
04 Change the AWS cloud region by updating the --region command parameter value and repeat the Remediation process for other regions.
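The checks and parameter choices in step 02 can be expressed as small helpers that work on the same response shapes the commands above return. This is a hedged sketch in Python, not a definitive implementation. All function names are hypothetical, adding one instance to the ASG and doubling the Lambda memory and timeout are assumptions made for illustration, and `resize_lambda` requires boto3 and credentials. The caps of 10,240 MB and 900 seconds are the Lambda service limits for memory and timeout.

```python
LAMBDA_MAX_MEMORY_MB = 10240  # Lambda service limit for memory
LAMBDA_MAX_TIMEOUT_S = 900    # Lambda service limit for timeout


def any_status_check_failed(status_response):
    """True if any instance or system status detail reports "failed".

    `status_response` has the shape returned by EC2 DescribeInstanceStatus.
    """
    for entry in status_response.get("InstanceStatuses", []):
        for key in ("InstanceStatus", "SystemStatus"):
            for detail in entry.get(key, {}).get("Details", []):
                if detail.get("Status") == "failed":
                    return True
    return False


def plan_asg_capacity(group, extra_instances=1):
    """Return new DesiredCapacity/MaxSize values for an Auto Scaling group.

    `group` is one element of "AutoScalingGroups" from
    DescribeAutoScalingGroups. MaxSize is raised only when the new desired
    capacity would exceed it.
    """
    desired = group["DesiredCapacity"] + extra_instances
    return {"DesiredCapacity": desired, "MaxSize": max(group["MaxSize"], desired)}


def plan_lambda_resources(memory_mb, timeout_s, factor=2):
    """Scale memory and timeout by `factor` (assumed), capped at service limits."""
    return {
        "MemorySize": min(memory_mb * factor, LAMBDA_MAX_MEMORY_MB),
        "Timeout": min(timeout_s * factor, LAMBDA_MAX_TIMEOUT_S),
    }


def resize_lambda(function_name, region):
    """Apply plan_lambda_resources to a real function (needs boto3, credentials)."""
    import boto3  # imported lazily so the pure helpers work without it

    client = boto3.client("lambda", region_name=region)
    cfg = client.get_function_configuration(FunctionName=function_name)
    plan = plan_lambda_resources(cfg["MemorySize"], cfg["Timeout"])
    return client.update_function_configuration(FunctionName=function_name, **plan)
```

With the example values above, `plan_lambda_resources(128, 15)` yields the same 256 MB and 30 seconds used in the update-function-configuration command.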
References
- AWS Documentation
- Amazon SQS FAQs
- Working with Amazon SQS messages
- Resources required to process Amazon SQS messages
- AWS Command Line Interface (CLI) Documentation
- sqs
- list-queues
- get-queue-attributes
- autoscaling
- describe-auto-scaling-groups
- update-auto-scaling-group
- ec2
- describe-instance-status
- reboot-instances
- stop-instances
- modify-instance-attribute
- start-instances
- lambda
- get-function
- update-function-configuration
- CloudFormation Documentation
- Amazon Elastic Compute Cloud resource type reference
- Amazon EC2 Auto Scaling resource type reference
- AWS Lambda resource type reference
- Terraform Documentation
- AWS Provider