As a SOC analyst, one of the most critical tasks is analyzing network flow data to identify potential security threats. In this post, we’ll explore how to combine cloud-based data storage, SQL querying, and AI-powered analysis to streamline this process.
Collecting Flow Data in Amazon Athena
Amazon Athena provides a serverless query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. Here’s how we set up our flow data collection:
-
Store flow logs in an S3 bucket:
- Create an S3 bucket specifically for storing your flow logs.
- Configure your network devices or cloud services to send flow logs to this bucket.
- Ensure proper access controls and encryption are in place for the S3 bucket.
Documentation: Creating an S3 bucket
-
Create an Athena table that maps to the flow log data structure:
- Use Athena’s Data Definition Language (DDL) to create a table schema.
- The schema should match the structure of your flow logs, including fields like source IP, destination IP, ports, and timestamps.
- Specify the S3 location where your flow logs are stored.
Example DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS flow_logs ( `interface-id` string, srcaddr string, dstaddr string, srcport int, dstport int, packets int, bytes int, `start` timestamp, `end` timestamp, action string ) PARTITIONED BY (partition_1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://your-bucket-name/flow-logs/'
Documentation: Creating tables in Athena
-
Configure partitioning for efficient querying:
- Partitioning divides your data into smaller, more manageable chunks.
- In this case, partitioning by date (partition_1) allows for faster queries on specific time ranges.
- After creating the table, add partitions for each date:
ALTER TABLE flow_logs ADD PARTITION (partition_1='2024-12-05') LOCATION 's3://your-bucket-name/flow-logs/2024-12-05/';
- You can automate this process using AWS Glue or Lambda functions to add partitions as new data arrives.
Documentation: Partitioning data in Athena
-
Set up a query execution location:
- Specify an S3 bucket where Athena will store query results.
- This can be done in the Athena console under “Settings”.
Documentation: Working with query results
-
Optimize for performance:
- Use columnar storage formats like Parquet or ORC for better query performance.
- Compress your data to reduce storage costs and improve query speed.
- Consider using Athena’s federated query feature for accessing data from multiple sources.
Documentation: Athena performance tuning
Querying Flow Data
To analyze our flow data, we used the following SQL query in Athena:
SELECT "interface-id", srcaddr, dstaddr, srcport, dstport, packets, bytes, "start", "end", action
FROM flow_logs
WHERE srcaddr IN (
'x.x.x.x',
'y.y.y.y',
-- [additional IPs omitted for brevity]
'z.z.z.z'
)
AND partition_1 IN ('2024-12-05','2024-12-04','2024-12-03','2024-12-02')
This query filters flow logs for specific source IP addresses over a 4-day period, giving us a focused dataset to analyze.
AI-Powered Analysis
After obtaining the query results from Amazon Athena, we downloaded the results as a CSV file for further analysis. This file contained detailed flow log data, including source and destination IP addresses, ports, packet counts, byte counts, and timestamps.
Next, we uploaded the CSV file to our AI language model (LLM) to leverage its capabilities in analyzing the data for suspicious activity. The prompt we used for analysis was as follows:
let's assume you are working as a SOC analyst and you've received some information regarding suspicious activity from someone in your organization. You perform the below query to get some flow logs to analyze:
SELECT "interface-id", srcaddr, dstaddr, srcport, dstport, packets, bytes, "start", "end", action
FROM oci_flow_logs
WHERE srcaddr IN (
'x.x.x.x',
'y.y.y.y',
-- [additional IPs omitted for brevity]
'z.z.z.z'
)
AND partition_1 IN ('2024-12-05','2024-12-04','2024-12-03','2024-12-02')
The analysis and results presented in this section are derived from the output generated by the LLM based on this prompt. The AI identified several potential security concerns:
Unusual Port Activity
The AI flagged multiple connections using non-standard ports, particularly:
- 2,170 connections to port 9844 from various source IPs
- Multiple SSH connections (port 22) with changing destination ports
High Volume Traffic
Several instances of high-volume data transfers were identified:
- 239 connections with either more than 100 packets or more than 10,000 bytes transferred
- Multiple large data transfers to a.a.a.a from various sources, including:
- A connection with 990 packets and 237,100 bytes
- A connection with 1,602 packets and 492,132 bytes
- A connection with 1,068 packets and 283,008 bytes
Repeated Connections
The AI noted patterns of repeated connections that could indicate scanning or other malicious activity:
- 231 instances of repeated connections from the same source to the same destination (more than 5 times)
- Notable examples include:
- 52 connections from b.b.b.b to c.c.c.c
- 35 connections from d.d.d.d to 1.1.1.1 (Cloudflare’s DNS)
Suspicious IP Addresses
The analysis highlighted potentially suspicious IP ranges and behaviors:
- Frequent connections to and from the 198.41.x.x range (Cloudflare’s network)
- High number of connections to 1.1.1.1, possibly indicating DNS-based exfiltration attempts
This analysis demonstrates how integrating LLM capabilities can enhance our ability to detect and respond to potential security threats effectively.
Recommendations
Based on the AI analysis, several follow-up actions were suggested:
- Investigate systems associated with high-volume transfers, especially those involving a.a.a.a
- Analyze traffic patterns for connections to port 9844 across the network
- Review security posture of internal IPs with high connection counts, such as 10.19.53.199 and 10.19.63.178
- Conduct a thorough review of SSH connections, especially those with non-standard destination ports
- Implement additional logging and monitoring for identified IPs and ports of concern
- Consider using threat intelligence to cross-reference external IPs, particularly those in the Cloudflare range
- Investigate the nature of repeated connections to 1.1.1.1 for potential DNS tunneling or data exfiltration
Conclusion
By combining cloud-based data storage, SQL querying, and AI-powered analysis, we’ve created a powerful workflow for identifying potential security threats in network flow data. This approach allows SOC analysts to quickly focus on the most suspicious activities, improving overall security posture and response times.
While AI analysis provides valuable insights, it’s crucial to remember that these findings should be the starting point for further investigation, not definitive conclusions. Human expertise remains essential in contextualizing and acting upon the AI-generated insights.