
Troubleshoot A VPC

Overview

In this project I had to troubleshoot virtual private cloud (VPC) configurations and analyze VPC Flow Logs. I worked with an environment that included two VPCs, EC2 instances, and other networking components.

My tasks included creating VPC Flow Logs, troubleshooting VPC configuration issues, and analyzing the flow logs.

What I Learned

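By the end of this project, I had gained hands-on experience with creating VPC Flow Logs from the AWS CLI, troubleshooting route tables, security groups, and network ACLs, and reading flow log records.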

Step 1: Creating VPC Flow Logs

I started by creating an S3 bucket to receive the VPC Flow Logs data. I then created a flow log on VPC1 to capture information about IP traffic to and from the network interfaces in the VPC and publish it to that bucket.

First, I created an S3 bucket with a unique name:

aws s3api create-bucket --bucket flowlog123456 --region 'us-west-2' --create-bucket-configuration LocationConstraint='us-west-2'

I received the following JSON output confirming the bucket creation:

{ "Location": "http://flowlog123456.s3.amazonaws.com/" }

Next, I needed to get the VPC ID for VPC1 to create the Flow Logs:

aws ec2 describe-vpcs --query 'Vpcs[*].[VpcId,Tags[?Key==`Name`].Value,CidrBlock]' --filters "Name=tag:Name,Values='VPC1'"

This command returned the VPC ID I needed:

[ [ "vpc-02b39929a1b1411e2", [ "VPC1" ], "10.0.0.0/16" ] ]

With both the bucket name and VPC ID, I created the VPC Flow Logs:

aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-02b39929a1b1411e2 --traffic-type ALL --log-destination-type s3 --log-destination arn:aws:s3:::flowlog123456

The command was successful, showing me the Flow Log ID that was created:

{ "ClientToken": "d7631a29-6433-43e1-a7e5-9a0991c3eb20", "FlowLogIds": [ "fl-0a1b2c3d4e5f6a7b8" ], "Unsuccessful": [] }

I verified that the flow log was properly created by running:

aws ec2 describe-flow-logs

The output confirmed that my flow log was active and properly configured:

{ "FlowLogs": [ { "CreationTime": "2025-03-03T15:23:45.000Z", "DeliverLogsStatus": "SUCCESS", "FlowLogId": "fl-0a1b2c3d4e5f6a7b8", "FlowLogStatus": "ACTIVE", "LogDestination": "arn:aws:s3:::flowlog123456", "LogDestinationType": "s3", "ResourceId": "vpc-02b39929a1b1411e2", "TrafficType": "ALL", "LogFormat": "${version} ${account-id} ${interface-id}..." } ] }

With the flow logs now set up, I was ready to move on to the troubleshooting phase.

Step 2: Troubleshooting Web Server Access

Next, I needed to analyze access to the web server instance and troubleshoot some networking issues. I knew that the cafe web server instance was supposed to be running in the public subnet in VPC1.

I tried accessing the web server at 54.190.138.58 in my browser, but after waiting a few moments, the page failed to load with a "connection timed out" message. This was expected as part of the troubleshooting exercise.

I began investigating by gathering information about the web server instance:

aws ec2 describe-instances --filter "Name=ip-address,Values='54.190.138.58'"

This returned a large JSON document, so I filtered the results to get just the essential information:

aws ec2 describe-instances --filter "Name=ip-address,Values='54.190.138.58'" --query 'Reservations[*].Instances[*].[State,PrivateIpAddress,InstanceId,SecurityGroups,SubnetId,KeyName]'

The output showed me that the instance was indeed running:

[ [ [ { "Code": 16, "Name": "running" }, "10.0.1.212", "i-050d6416ee1052e01", [ { "GroupName": "WebSecurityGroup", "GroupId": "sg-08294692a06378089" } ], "subnet-07a0b96d2c41186b6", "lab-key-pair" ] ] ]

I also tried to establish an SSH connection to the web server using EC2 Instance Connect, but it failed with an error saying "Failed to connect to your instance." This was also expected as part of the troubleshooting challenge.

Troubleshooting Challenge #1: Web Server Connectivity

I needed to figure out why the web server instance was running but the webpage wasn't loading. I decided to take a systematic approach, working from the command line.

First, I used nmap to check for open ports on the web server:

sudo yum install -y nmap
nmap 54.190.138.58
Starting Nmap 7.70 ( https://nmap.org ) at 2025-03-03 15:35 UTC
Note: Host seems down. If it is really up, but blocking our ping probes, try -Pn
Nmap done: 1 IP address (0 hosts up) scanned in 3.07 seconds

Nmap reported the host as down, meaning none of its probes got a response, so I suspected that something at the network level was blocking access to the instance. I checked the security group settings:

aws ec2 describe-security-groups --group-ids 'sg-08294692a06378089'
{ "SecurityGroups": [ { "Description": "Security group for web server", "GroupName": "WebSecurityGroup", "IpPermissions": [ { "FromPort": 22, "IpProtocol": "tcp", "IpRanges": [ { "CidrIp": "0.0.0.0/0" } ], "ToPort": 22 }, { "FromPort": 80, "IpProtocol": "tcp", "IpRanges": [ { "CidrIp": "0.0.0.0/0" } ], "ToPort": 80 } ], "IpPermissionsEgress": [ { "IpProtocol": "-1", "IpRanges": [ { "CidrIp": "0.0.0.0/0" } ] } ], "OwnerId": "123456789012", "GroupId": "sg-08294692a06378089", "VpcId": "vpc-02b39929a1b1411e2" } ] }

The security group settings looked correct: they allowed inbound traffic on ports 22 and 80 from any address. So I moved on to checking the route table settings:

aws ec2 describe-route-tables --route-table-ids 'rtb-0ae7233554007c948' --filter "Name=association.subnet-id,Values='subnet-07a0b96d2c41186b6'"
{ "RouteTables": [ { "Associations": [ { "Main": false, "RouteTableAssociationId": "rtbassoc-0123456789abcdef", "RouteTableId": "rtb-0ae7233554007c948", "SubnetId": "subnet-07a0b96d2c41186b6" } ], "PropagatingVgws": [], "RouteTableId": "rtb-0ae7233554007c948", "Routes": [ { "DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "Origin": "CreateRouteTable", "State": "active" } ], "Tags": [ { "Key": "Name", "Value": "VPC1-Public-RT" } ], "VpcId": "vpc-02b39929a1b1411e2" } ] }

Here I found the issue! The route table was named "VPC1-Public-RT" indicating it should be for a public subnet, but it only contained a local route for 10.0.0.0/16. It was missing a route to the internet gateway (0.0.0.0/0). Without this route, traffic couldn't reach the internet.
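
The same check can be scripted. This is a minimal sketch, assuming the route table JSON arrives on standard input in the shape that describe-route-tables printed above. The function name has_default_route is hypothetical, and the plain text match is a simplification: it only tells you that some 0.0.0.0/0 route exists, not that it targets an internet gateway.

```shell
# has_default_route: hypothetical helper. Reads describe-route-tables JSON
# on stdin and succeeds if any route has destination 0.0.0.0/0.
has_default_route() {
  grep -Eq '"DestinationCidrBlock":[[:space:]]*"0\.0\.0\.0/0"'
}

# Sample input trimmed from the route table output above (local route only).
routes_before='{ "Routes": [ { "DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active" } ] }'

if printf '%s\n' "$routes_before" | has_default_route; then
  echo "default route present"
else
  echo "default route missing"
fi
# default route missing
```

Against a live account, the output of the describe-route-tables command could be piped straight into the function instead of the sample string.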

I created the missing route:

aws ec2 create-route --route-table-id 'rtb-0ae7233554007c948' --gateway-id 'igw-06ce5eeb416e63888' --destination-cidr-block '0.0.0.0/0'
{ "Return": true }

I refreshed the browser page and saw that the web server was now accessible! The page displayed the message "Hello From Your Web Server!" This confirmed that I had resolved the first issue.

Troubleshooting Challenge #2: SSH Connection Issue

Even though I could now access the web server, I still couldn't connect to it using SSH via EC2 Instance Connect. I had already verified that the web server was running, created the route table entry for internet connectivity, and confirmed that the security group was allowing connections on port 22.

I decided to check the network access control list (NACL) settings:

aws ec2 describe-network-acls --filter "Name=association.subnet-id,Values='subnet-07a0b96d2c41186b6'" --query 'NetworkAcls[*].[NetworkAclId,Entries]'
[ [ "acl-0f723106890678de5", [ { "CidrBlock": "0.0.0.0/0", "Egress": true, "Protocol": "-1", "RuleAction": "allow", "RuleNumber": 100 }, { "CidrBlock": "0.0.0.0/0", "Egress": false, "Protocol": "-1", "RuleAction": "allow", "RuleNumber": 100 }, { "CidrBlock": "0.0.0.0/0", "Egress": false, "Protocol": "6", "PortRange": { "From": 22, "To": 22 }, "RuleAction": "deny", "RuleNumber": 40 } ] ] ]

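I found another issue! The network ACL had a rule (number 40) explicitly denying inbound SSH traffic (TCP port 22) from any source (0.0.0.0/0). Because NACL rules are evaluated in ascending rule-number order and the first match wins, rule 40 took effect before the allow-all rule 100. The deny blocked SSH connections regardless of the security group settings, since inbound traffic must pass the subnet's NACL before it reaches the instance's security group.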
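
To see why a lower-numbered deny overrides a higher-numbered allow, the evaluation can be simulated locally. Network ACL rules are processed in ascending rule-number order and the first match decides. The sketch below uses a simplified "rule-number action port" format and a function named nacl_decision, both of which are assumptions for illustration rather than anything AWS provides.

```shell
# The two inbound entries from the NACL output above, in a simplified,
# hypothetical "rule-number action port" format ("all" means any port).
nacl_rules='100 allow all
40 deny 22'

# nacl_decision PORT: print the action of the lowest-numbered matching rule.
# Falls back to "deny", mirroring the implicit final deny rule of a NACL.
nacl_decision() {
  printf '%s\n' "$nacl_rules" | sort -n | awk -v port="$1" '
    $3 == "all" || $3 == port { print $2; found = 1; exit }
    END { if (!found) print "deny" }'
}

nacl_decision 22   # deny  (rule 40 matches before rule 100)
nacl_decision 80   # allow (only rule 100 matches)
```

Removing the "40 deny 22" line from nacl_rules makes port 22 fall through to rule 100, which is the effect of deleting the entry.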

I deleted the problematic NACL entry:

aws ec2 delete-network-acl-entry --network-acl-id acl-0f723106890678de5 --ingress --rule-number 40

After removing the restrictive NACL rule, I tried connecting via EC2 Instance Connect again and successfully connected to the web server. I ran the hostname command to confirm I was connected to the correct instance:

hostname
web-server

Success! I had now resolved both connectivity issues: web access and SSH access.

Step 3: Analyzing Flow Logs

Now that I had resolved the network issues, I wanted to analyze the flow logs to see the record of what had happened. This would help me understand how the flow logs captured networking activities.

Downloading and Extracting Flow Logs

First, I needed to download the flow logs from the S3 bucket:

mkdir flowlogs
cd flowlogs

I listed the S3 buckets to recall the bucket name:

aws s3 ls
2025-03-03 15:20:11 flowlog123456

I downloaded the flow logs from my S3 bucket:

aws s3 cp s3://flowlog123456/ . --recursive
download: s3://flowlog123456/AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1530Z_abc123def456.log.gz to AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1530Z_abc123def456.log.gz
download: s3://flowlog123456/AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log.gz to AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log.gz
download: s3://flowlog123456/AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1540Z_abc123def456.log.gz to AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1540Z_abc123def456.log.gz

I navigated to the directory with the log files. Pressing Tab several times completed the full directory path:

cd AWSLogs/123456789012/vpcflowlogs/us-west-2/2025/03/03/

I saw the compressed log files:

ls
123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1530Z_abc123def456.log.gz
123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log.gz
123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1540Z_abc123def456.log.gz

I extracted the compressed files:

gunzip *.gz
ls
123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1530Z_abc123def456.log
123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log
123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1540Z_abc123def456.log
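
Extraction is not strictly required for searching. As a minimal sketch, assuming a standard gzip is installed, gzip -dc streams the decompressed contents to standard output and leaves the .gz files in place. The temporary directory and file name here are stand-ins so the example runs on its own; the record is one of the sample lines from this write-up.

```shell
# Build a throwaway compressed log so the example is self-contained.
tmpdir="$(mktemp -d)"
printf '%s\n' '2 123456789012 eni-0abc123def456789 203.0.113.42 10.0.1.212 56789 22 6 1 64 1709497890 1709497950 REJECT OK' \
  > "$tmpdir/sample.log"
gzip "$tmpdir/sample.log"

# Stream every compressed log through grep without writing extracted files.
reject_count="$(gzip -dc "$tmpdir"/*.log.gz | grep -c 'REJECT')"
echo "REJECT records: $reject_count"
# REJECT records: 1
```

In the flow log directory, the same pipeline would be run against *.log.gz directly.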

Analyzing the Log Content

I looked at the structure of one of the log files:

head 123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1530Z_abc123def456.log
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0abc123def456789 10.0.1.212 54.190.138.58 34567 80 6 1 64 1709497830 1709497890 REJECT OK
2 123456789012 eni-0abc123def456789 203.0.113.42 10.0.1.212 56789 22 6 1 64 1709497890 1709497950 REJECT OK
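
The header row gives the field order, so each position can be mapped to a name. The sketch below, with a hypothetical helper called show_flow_record, prints a labeled one-line summary per record. It assumes the default field order shown in the header above (srcaddr is field 4, dstaddr 5, srcport 6, dstport 7, protocol 8, action 13).

```shell
# show_flow_record: hypothetical helper. Reads default-format flow log
# records on stdin and prints a labeled summary, skipping the header row.
show_flow_record() {
  awk '$1 != "version" {
    printf "%s:%s -> %s:%s proto=%s action=%s\n", $4, $6, $5, $7, $8, $13
  }'
}

printf '%s\n' \
  'version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status' \
  '2 123456789012 eni-0abc123def456789 203.0.113.42 10.0.1.212 56789 22 6 1 64 1709497890 1709497950 REJECT OK' \
  | show_flow_record
# 203.0.113.42:56789 -> 10.0.1.212:22 proto=6 action=REJECT
```

Protocol 6 is TCP, so this record is a rejected TCP connection attempt to port 22 on the web server's private address.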

I could see that each log entry contained the source IP, destination IP, ports, and whether the traffic was accepted or rejected. I wanted to focus on the REJECT entries, so I searched for them:

grep -rn REJECT .

This returned many entries, so I counted them:

grep -rn REJECT . | wc -l
164
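
A raw count does not say what was being rejected. One way to break it down is to group the REJECT records by destination port. This is a sketch with a hypothetical helper, rejects_by_port, that assumes the default field order (dstport is field 7, action is field 13). The sample records are taken from the log excerpts in this write-up.

```shell
# rejects_by_port: hypothetical helper. Counts REJECT records per
# destination port in default-format flow log records read from stdin.
rejects_by_port() {
  awk '$13 == "REJECT" { count[$7]++ }
       END { for (port in count) print port, count[port] }' | sort -n
}

printf '%s\n' \
  '2 123456789012 eni-0abc123def456789 10.0.1.212 54.190.138.58 34567 80 6 1 64 1709497830 1709497890 REJECT OK' \
  '2 123456789012 eni-0abc123def456789 203.0.113.42 10.0.1.212 56789 22 6 1 64 1709497890 1709497950 REJECT OK' \
  '2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 34567 22 6 1 64 1709498010 1709498070 REJECT OK' \
  | rejects_by_port
# 22 2
# 80 1
```

On the extracted files, cat *.log | rejects_by_port would produce the same kind of table for the whole capture.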

I refined my search to focus just on SSH traffic (port 22) that was rejected:

grep -rn 22 . | grep REJECT
./123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log:14:2 123456789012 eni-0abc123def456789 203.0.113.42 10.0.1.212 56789 22 6 1 64 1709497890 1709497950 REJECT OK
./123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log:22:2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 34567 22 6 1 64 1709498010 1709498070 REJECT OK
./123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1540Z_abc123def456.log:8:2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 34568 22 6 1 64 1709498130 1709498190 REJECT OK
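
One caveat with grep 22: it matches "22" anywhere in the line, including source ports, byte counts, and the line-number prefix that grep -n adds. A stricter alternative is to test the destination port field itself. The sketch below uses a hypothetical helper, rejected_ssh, and assumes the default field order. The second sample record is invented for illustration and does not come from the logs.

```shell
# rejected_ssh: hypothetical helper. Prints records whose destination port
# (field 7) is exactly 22 and whose action (field 13) is REJECT.
rejected_ssh() {
  awk '$7 == 22 && $13 == "REJECT"'
}

# The second record (illustrative only) has "22" in its source port, so a
# plain "grep 22" would match it, while the field test does not.
printf '%s\n' \
  '2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 34567 22 6 1 64 1709498010 1709498070 REJECT OK' \
  '2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 52213 443 6 1 64 1709498010 1709498070 REJECT OK' \
  | rejected_ssh
```

Only the first record is printed. Adding a condition on field 4, the source address, would narrow the result to a single client.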

To isolate just my own connection attempts, I determined my public IP address (198.51.100.73) using the EC2 console's security group editor, then filtered the results:

grep -rn 22 . | grep REJECT | grep 198.51.100.73
./123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1535Z_abc123def456.log:22:2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 34567 22 6 1 64 1709498010 1709498070 REJECT OK
./123456789012_vpcflowlogs_vpc-02b39929a1b1411e2_20250303T1540Z_abc123def456.log:8:2 123456789012 eni-0abc123def456789 198.51.100.73 10.0.1.212 34568 22 6 1 64 1709498130 1709498190 REJECT OK

I confirmed that the network interface ID in the logs matched the web server's interface:

aws ec2 describe-network-interfaces --filters "Name=association.public-ip,Values='54.190.138.58'" --query 'NetworkInterfaces[*].[NetworkInterfaceId,Association.PublicIp]'
[ [ "eni-0abc123def456789", "54.190.138.58" ] ]

Finally, I translated one of the Unix timestamps to a human-readable format:

date -d @1709498010
Sun Mar 3 15:40:10 UTC 2025

I compared this to the current time:

date
Sun Mar 3 16:15:23 UTC 2025

This confirmed that the logs were capturing my recent connection attempts that were rejected by the network ACL rule that I identified and removed earlier.
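
For repeated conversions, the date call can be wrapped in a small function. This is a sketch assuming GNU date, since the -d @N form is a GNU extension. The name epoch_to_utc is hypothetical, and -u keeps the result independent of the machine's local time zone.

```shell
# epoch_to_utc: hypothetical helper. Formats a Unix timestamp as an
# ISO 8601 UTC string. Assumes GNU date.
epoch_to_utc() {
  date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}

epoch_to_utc 0   # 1970-01-01T00:00:00Z

# Length of a record's capture window: end minus start, in seconds.
window=$(( 1709498070 - 1709498010 ))
echo "window: ${window}s"
# window: 60s
```

The start and end fields of a flow log record (fields 11 and 12 in the default format) can both be passed through the function to get the capture window in readable form.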

My Key Learnings

This hands-on experience was extremely valuable in helping me understand how to troubleshoot VPC networking issues.

Root Cause Analysis

Through my systematic troubleshooting, I identified two major issues that were preventing connectivity:

  1. Missing Internet Gateway Route: The route table for the public subnet only had a local route (10.0.0.0/16) but was missing the crucial internet gateway route (0.0.0.0/0). Without this route, traffic couldn't reach the internet.
  2. Restrictive Network ACL Rule: The network ACL had a rule (number 40) explicitly denying inbound SSH traffic on port 22. This was blocking SSH connections at the subnet level, even though the security group was correctly configured.

The VPC Flow Logs were invaluable in confirming my diagnosis, clearly showing the rejected SSH connection attempts with timestamps matching when I was working on the troubleshooting.

This experience reinforced for me the importance of checking every layer of the AWS networking stack when troubleshooting connectivity issues: instance state, security groups, route tables, and network ACLs.

Doing most of this troubleshooting with the AWS CLI gave me deeper insight into how these components interact and how to diagnose issues methodically. I can now apply these skills to more complex networking scenarios.
