Use of Shannon Entropy Estimation for DGA Detection
Syed Qutb 12/12/2023 10:30 114
For threat hunters and security researchers, Advanced Persistent Threats (APTs) often stay one step ahead, especially in attacks that rely on Domain Generation Algorithms (DGAs). With a DGA, attackers evade defenders by generating thousands of fully qualified domain names (FQDNs) from rapidly changing seeds, which keeps a virus or ransomware in contact with its command and control (C&C) servers. Taking down these malicious FQDNs is challenging: cybersecurity professionals have to identify the domains one by one, and each domain is often live for only a short period because the DGA seeds rotate quickly. As a result, signature-based detection is close to impossible.
To address this problem, we used a machine learning approach built on the Shannon entropy calculator in Splunk’s URL Toolbox. The approach focuses on identifying anomalies in evolving FQDN patterns by using Shannon entropy. We selected Shannon entropy because it is a good metric for quantifying the uncertainty (information content) of a given domain name. If the entropy score of a domain exceeds the defined threshold (> 4.2 in our case), that is a useful indicator of a DGA-based FQDN. In general, the higher the entropy, the more likely it is that a given domain name was algorithmically generated. The Splunk SPL query we used to analyze the dataset is given below:
| inputlookup dns_dga.csv
| inputlookup dns_legit_1m.csv append=t
| sample ratio=0.0001
| rex field=domain max_match=1 "(?
| table domain subdomain type subtype
| dedup subdomain
| `ut_shannon(subdomain)`
| `ut_meaning(subdomain)`
| eval ut_digit_ratio = 0.0
| eval ut_vowel_ratio = 0.0
| eval ut_domain_length = max(1,len(domain))
| rex field=subdomain max_match=0 "(?
| rex field=subdomain max_match=0 "(?
| eval ut_digit_ratio=if(isnull(digits),0.0,mvcount(digits) / ut_domain_length)
| eval ut_vowel_ratio=if(isnull(vowels),0.0,mvcount(vowels) / ut_domain_length)
| eval ut_consonant_ratio = max(0.0, 1.000000 - ut_digit_ratio - ut_vowel_ratio)
| eval ut_vc_ratio = ut_vowel_ratio / ut_consonant_ratio
| fields - digits - vowels
| fit TFIDF analyzer=char ngram_range=1-3 subdomain into "DNS_DGA_TIDIF_PREPROCESS_MIX_D865K_L1M"
| fit PCA subdomain_tfidf* k=3
| table * PC_*
| fields - digits - vowels - subdomain_*
| fit RandomForestClassifier n_estimators=10 type from PC_* ut_consonant_ratio ut_digit_ratio ut_domain_length ut_meaning_ratio ut_shannon ut_vc_ratio ut_vowel_ratio
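To make the lexical features in the query easier to follow, here is a minimal Python sketch of how they could be computed outside Splunk. It is an illustration under stated assumptions, not the URL Toolbox implementation: it assumes ut_shannon is the character-level Shannon entropy in bits, H = sum(p * log2(1/p)), and it assumes the digit and vowel character classes are 0-9 and a, e, i, o, u. The function names shannon_entropy and lexical_features are hypothetical.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 4.2          # threshold quoted in the article
DIGITS = set("0123456789")       # assumed digit class
VOWELS = set("aeiou")            # assumed vowel class


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits: sum of p * log2(1/p)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def lexical_features(subdomain: str, domain: str) -> dict:
    """Mirror the eval steps of the query for a single record."""
    # As in the query, ratios are normalized by the length of the full domain.
    length = max(1, len(domain))
    chars = subdomain.lower()
    digit_ratio = sum(c in DIGITS for c in chars) / length
    vowel_ratio = sum(c in VOWELS for c in chars) / length
    # Everything that is neither digit nor vowel is counted as "consonant".
    consonant_ratio = max(0.0, 1.0 - digit_ratio - vowel_ratio)
    # Guard against division by zero, which the SPL eval does not do explicitly.
    vc_ratio = vowel_ratio / consonant_ratio if consonant_ratio > 0 else None
    entropy = shannon_entropy(subdomain)
    return {
        "ut_shannon": entropy,
        "ut_domain_length": length,
        "ut_digit_ratio": digit_ratio,
        "ut_vowel_ratio": vowel_ratio,
        "ut_consonant_ratio": consonant_ratio,
        "ut_vc_ratio": vc_ratio,
        "above_threshold": entropy > ENTROPY_THRESHOLD,
    }


if __name__ == "__main__":
    for sub, dom in [("google", "google.com"),
                     ("x7qk2vj9zt4mw8rb1ypd", "x7qk2vj9zt4mw8rb1ypd.net")]:
        print(dom, lexical_features(sub, dom))
```

Note that a string needs at least 19 distinct characters to exceed 4.2 bits, since log2(18) is about 4.17. A fixed threshold therefore flags only long, highly varied names, which is one reason to feed the entropy into a classifier together with the other features, as the query does.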
This solution helps us identify Domain Generation Algorithms (MITRE ATT&CK T1568.002, formerly T1483) in real time, and it has been implemented in our EUNOMATIX MLDETECT app. For more details on the functionality of our ML-based detection framework, please contact EUNOMATIX at info@eunomatix.com.
References
https://attack.mitre.org/techniques/T1568/002/
https://redcanary.com/blog/threat-hunting-entropy/