Detection of Phishing Websites Using Structural Similarities Amongst Attacks

Mgdelawit Bewketu; Esubalew Alemneh

doi:10.20372/pjet.v2i1.1716

Mgdelawit Bewketu, Faculty of Computing, Bahir Dar Institute of Technology, Bahir Dar University, P. O. Box 26, Bahir Dar, Ethiopia
Esubalew Alemneh, ICT4D Research Center, Bahir Dar Institute of Technology, Bahir Dar University, P. O. Box 26, Bahir Dar, Ethiopia

Abstract

Phishing attacks have become a prominent threat to online security, with attackers constantly evolving their techniques to steal sensitive information. Traditional phishing detection methods often rely on analyzing individual features or patterns, which may not capture the dynamic nature of phishing attacks. This paper reports a novel approach to phishing website detection that considers the structural similarities among phishing websites. We developed a clustering-based algorithm that groups together websites exhibiting similar attack patterns. For clustering, we used hierarchical agglomerative and K-means algorithms, whereas Longest Common Subsequence (LCS) and fingerprint algorithms were employed to calculate the similarity of websites. We collected 3,588 website URLs from two publicly available sources: the phishing website repository PhishTank and the Alexa list of legitimate websites. HTML parsing was performed on the collected websites to extract the features needed to compute similarities. Silhouette score was used for internal validation of the clusters and Adjusted Rand Index (ARI) for external validation. The hierarchical clustering model (with Ward linkage) using the fingerprint similarity measure outperformed the other models, with Silhouette score and ARI values of 0.85 and 0.87, respectively. The proposed approach offers several advantages. First, it provides a holistic view of phishing attacks by considering overall structural similarities instead of isolated features. Second, the clustering-based approach enables the detection of previously unknown phishing websites based on their similarity to known attack patterns. Third, it allows for the identification of emerging attack patterns that traditional detection methods might not capture.
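
To make the idea of LCS-based structural similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a page's structure can be approximated by the ordered sequence of its opening HTML tags, and that similarity is normalized as 2 x LCS / (len(a) + len(b)); both choices, along with the names `TagSequenceParser`, `tag_sequence`, `lcs_length`, and `lcs_similarity`, are illustrative assumptions that the abstract does not specify.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order (assumed structural feature)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text are ignored; only the tag name is kept.
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # Standard dynamic program; only the previous row is kept in memory.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Similarity in [0, 1]; the normalization is an assumption of this sketch."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    # Two hypothetical login pages and one unrelated page.
    page_a = "<html><body><form><input><input><button></button></form></body></html>"
    page_b = "<html><body><form><input><button></button></form></body></html>"
    page_c = "<html><body><p></p><p></p></body></html>"
    seqs = [tag_sequence(p) for p in (page_a, page_b, page_c)]
    for i in range(len(seqs)):
        print([round(lcs_similarity(seqs[i], s), 2) for s in seqs])
```

In this toy example the two form-based pages score much closer to each other than to the unrelated page. A pairwise matrix of such scores, converted to distances, is the kind of input that a hierarchical or K-means clustering step could then group.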