Detection of Phishing Websites Using Structural Similarities Amongst Attacks
Abstract
Phishing attacks have become a prominent threat to online security, with attackers constantly evolving their techniques to steal sensitive information. Traditional phishing detection methods often rely on analyzing individual features or patterns, which may not capture the dynamic nature of phishing attacks. This paper reports a novel approach to phishing website detection that considers the structural similarities among phishing websites. We developed a clustering-based algorithm that groups together websites exhibiting similar attack patterns. For clustering, we used hierarchical agglomerative and K-means algorithms, while the Longest Common Subsequence (LCS) and fingerprint algorithms were used to calculate the similarity between websites. We collected 3,588 website URLs from two publicly available sources: the PhishTank repository for phishing websites and Alexa for legitimate websites. The HTML of the collected websites was parsed to extract the features needed to compute similarities. Clusters were validated internally with the Silhouette score and externally with the Adjusted Rand Index (ARI). The hierarchical clustering model (with Ward linkage) combined with the fingerprint similarity measure outperformed the other models, with a Silhouette score of 0.85 and an ARI of 0.87. The proposed approach offers several advantages. First, it provides a holistic view of phishing attacks by considering overall structural similarities instead of isolated features. Second, the clustering-based approach enables the detection of previously unknown phishing websites based on their similarity to known attack patterns. Third, it allows for the identification of emerging attack patterns that traditional detection methods might not capture.
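The LCS-based structural similarity described in the abstract can be illustrated with a minimal sketch. It assumes that each page is represented by its sequence of opening HTML tags and that similarity is normalized as 2 * LCS / (len(a) + len(b)); neither choice is stated in the abstract, and the names `tag_sequence`, `lcs_similarity`, and `similarity_matrix` are hypothetical. The fingerprint measure and the clustering step are not shown.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening-tag names in document order (assumed feature)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening-tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), a value in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def similarity_matrix(pages):
    """Pairwise LCS similarity between the tag sequences of HTML pages."""
    seqs = [tag_sequence(p) for p in pages]
    return [[lcs_similarity(s, t) for t in seqs] for s in seqs]


if __name__ == "__main__":
    # Toy pages: two login-form layouts and one unrelated article layout.
    pages = [
        "<html><body><form><input><input><button></button></form></body></html>",
        "<html><body><div><form><input><input><button></button></form></div></body></html>",
        "<html><body><h1></h1><p></p><p></p><a></a></body></html>",
    ]
    for row in similarity_matrix(pages):
        print([round(v, 3) for v in row])
```

A matrix of this kind (or one minus it, as a distance) is the sort of input a clustering step would consume; the two form-based toy pages score much closer to each other than either does to the article page.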
Copyright (c) 2024 Poly Journal of Engineering and Technology (PJET)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.