Many tools have been developed to detect phishing and keep the internet safe. Most of them rely on static analysis of features extracted from the web page in question. They often examine the page from a limited perspective and lack the scalability needed to be effective. They also tend to produce many false positives (safe web pages classified as phishing), which can be a major nuisance for website owners.
Our research aims to create a solution that detects phishing URLs using machine learning and several views from which a web page is examined. This will enable the solution to work in real time and to detect phishing attacks promptly, which is not possible with current solutions.
A phishing attack is an attempt by an attacker to steal user information such as usernames, passwords, and credit card details. These attacks can take a number of forms, including emails, websites, social media posts, and even chatbots. They are primarily designed to trick the victim into believing they are visiting a legitimate site.
Many phishing URLs look similar to their legitimate counterparts, and the phisher can change the FreeURL (the parts of the URL the phisher can choose freely, such as subdomains, path, and query) to avoid detection. It is therefore important to detect these phishing URLs in real time to prevent users from being exposed to malicious content. The aim of this research is to create a phishing URL detection API that can evaluate the website under stress, from multiple viewpoints, and in real time.
During the evaluation of a website, the API collects various data points about the web page and its structure, including the Uniform Resource Locator (URL), host name, and path, among others. While processing the URL, the API looks for features that academic studies have used to detect phishing domains with machine learning techniques.
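As an illustration of the data collection step, the following is a minimal sketch of how a URL could be broken into the components named above. It uses Python's standard urllib.parse module; the function name split_url and the returned dictionary keys are assumptions made for this example, not part of the API described here.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Break a URL into the parts a URL-based heuristic might inspect.

    Hypothetical helper: the name and the returned keys are
    illustrative only.
    """
    parts = urlsplit(url)
    return {
        "url": url,
        "scheme": parts.scheme,
        # hostname is None when the URL has no "//authority" part,
        # so fall back to an empty string.
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }


if __name__ == "__main__":
    print(split_url("https://login.example.com/account/verify?id=1"))
```

Each heuristic can then read only the component it needs (for example, the host name or the path) instead of re-parsing the raw string.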
The first feature is the URL length. Studies have shown that phishing URLs are typically longer than legitimate ones. The heuristic checks the length of the URL and, if it exceeds 70 characters, marks the URL as suspicious (encoded as -1).
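The length check can be sketched in a few lines. The 70-character threshold and the -1 value for suspicious URLs come from the description above; the value 1 for URLs that pass the check, and the names used here, are assumptions for illustration.

```python
SUSPICIOUS = -1       # encoding stated in the text
LEGITIMATE = 1        # assumption: the text does not give this value
MAX_URL_LENGTH = 70   # threshold stated in the text


def url_length_feature(url: str) -> int:
    """Return SUSPICIOUS when the URL is longer than 70 characters."""
    return SUSPICIOUS if len(url) > MAX_URL_LENGTH else LEGITIMATE


if __name__ == "__main__":
    short_url = "http://example.com/"
    long_url = "http://example.com/" + "a" * 60
    print(url_length_feature(short_url), url_length_feature(long_url))
```

A URL of exactly 70 characters is not flagged in this sketch, since the rule is "more than 70".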
Another feature is the number of slashes in the path of the URL. Phishing URLs were found to use more slashes than legitimate sites, often to make them resemble legitimate URLs and harder to spot. If the path contains more than three slashes, the URL is marked as a potential phishing URL.
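A minimal sketch of the slash count follows, assuming that only the path component is counted, so the two slashes after the scheme are excluded. The function name and the boolean return value are assumptions; the threshold of three is the one given above.

```python
from urllib.parse import urlsplit


def has_excess_slashes(url: str, max_slashes: int = 3) -> bool:
    """Return True when the URL path contains more than max_slashes slashes.

    Hypothetical helper: only the path is inspected, not the
    scheme separator or the query string.
    """
    path = urlsplit(url).path
    return path.count("/") > max_slashes


if __name__ == "__main__":
    print(has_excess_slashes("http://example.com/a/b"))
    print(has_excess_slashes("http://example.com/a/b/c/d"))
```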
The next feature identifies the server to which information entered in forms will be sent. Phishers tend to use a fake server address so that the information entered by the victim cannot be traced. For this reason, the heuristic evaluates the server banner of the URL to mark it as suspicious or safe.
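The first step of such a check, finding where a page's forms would send their data, can be sketched with Python's standard html.parser module. This covers only the extraction of form targets; how the resulting servers are then judged is not specified here and is left out. The class and function names are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FormActionCollector(HTMLParser):
    """Collect the action attribute of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or empty action means the form posts
            # back to the page itself.
            self.actions.append(dict(attrs).get("action") or "")


def form_target_hosts(page_url: str, html: str) -> list:
    """Return the host each form on the page would submit to."""
    collector = FormActionCollector()
    collector.feed(html)
    return [
        urlsplit(urljoin(page_url, action)).hostname or ""
        for action in collector.actions
    ]


if __name__ == "__main__":
    page = (
        '<form action="http://collector.example.net/post.php"></form>'
        '<form action="/login"></form>'
    )
    print(form_target_hosts("http://shop.example.com/index.html", page))
```

Relative actions are resolved against the page URL, so a form that posts to a different host than the page it sits on stands out in the returned list.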
The last heuristic looks for hyphens in the host name of the URL. Phishing URLs were found to often contain more than one hyphen in the host name, which is uncommon for legitimate URLs.
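The hyphen check can be sketched as follows, assuming that a host name is flagged when it contains more than one hyphen. The function name and the boolean return value are assumptions for illustration.

```python
from urllib.parse import urlsplit


def has_repeated_hyphens(url: str) -> bool:
    """Return True when the host name contains more than one hyphen.

    Hyphens in the path or query are ignored; only the host
    name is inspected.
    """
    # hostname is None for URLs without a "//authority" part.
    host = urlsplit(url).hostname or ""
    return host.count("-") > 1


if __name__ == "__main__":
    print(has_repeated_hyphens("http://secure-login-update.example.com/"))
    print(has_repeated_hyphens("http://my-site.example.com/a-b-c"))
```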