Cybercrime investigation using confusion matrix

Jainsiddhant
4 min read · Jun 4, 2021

Cyber-attacks have become one of the world's biggest problems. They cause serious financial damage to countries and people every day, and their growth brings a rise in cybercrime with it. Two key factors in the fight against crime and criminals are identifying the perpetrators of cybercrime and understanding their methods of attack. Machine learning has become a vital technology for cybersecurity: it helps detect cyber threats early and strengthens security infrastructure through pattern detection, real-time cybercrime mapping and thorough penetration testing.

Internet usage has grown rapidly, particularly in the last decade. However, as the Internet becomes part of day-to-day activities, cybercrime is also on the rise. According to a 2020 Cybersecurity Ventures report, cybercrime will cost nearly $6 trillion per annum by 2021.

Cybercrime threats are growing day by day, and at present there is no foolproof framework to stop them. Data analysis and machine learning, however, give us a way to detect these crimes. The approach described here uses a supervised classification algorithm (Naive Bayes) and an unsupervised clustering algorithm (K-means). As with every machine learning model, the first requirement is gathering information and forming a dataset. Here is a sample dataset collected from Kaggle.

Sample Dataset about Cybercrime Incidents

CC = cyber-criminal; CT = cyber terrorist; CH = cyber hacker; TI = Through Internet; TT = Through telecommunication; TS = Through social network.

The next step is data preprocessing, in which we perform feature extraction. This step uses tf-idf vectorization (tf-idf is a numerical statistic intended to reflect how important a word is to a document in a collection). The text is first split into tokens, and irrelevant words (commonly called stop words) are removed. Word weights are then computed with the tf-idf vector, and the features used to classify the cybercrimes are separated out. Tf-idf consists of two parts: Term Frequency (how often a word occurs in a given report, which indicates the importance of the word within that report) and Inverse Document Frequency (which downscales words that appear in many reports across the collection, since such words do little to distinguish one report from another).

The main steps involved in tf–idf are:
(1) tokenize the sentence
(2) evaluate term frequency
(3) evaluate inverse document frequency
(4) calculate the tf–idf score by multiplying the tf and idf results
(5) score the record sentences
(6) find the threshold
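
The six steps above can be sketched in plain Python. This is a minimal illustration, not the exact pipeline used for the dataset: the three reports are invented, the record score is assumed to be the mean tf-idf weight of its terms, and the threshold is assumed to be the mean of all record scores.

```python
import math
import re
from collections import Counter

# Hypothetical incident reports, invented for illustration only.
reports = [
    "phishing email stole bank password",
    "hacker breached bank server through internet",
    "fake social network profile spread phishing link",
]

def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(tokens):
    """Step 2: count of each term divided by the report length."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}

def inverse_document_frequency(tokenized_docs):
    """Step 3: log(N / number of reports that contain the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

tokenized = [tokenize(r) for r in reports]
idf = inverse_document_frequency(tokenized)

# Step 4: tf-idf weight = tf * idf, per term and per report.
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
    for doc in tokenized
]

# Step 5 (assumed scoring rule): mean tf-idf weight of the report's terms.
scores = [sum(weights.values()) / len(weights) for weights in tfidf]

# Step 6 (assumed threshold): the mean of all report scores.
threshold = sum(scores) / len(scores)
selected = [r for r, s in zip(reports, scores) if s >= threshold]

for report, score in zip(reports, scores):
    print(f"{score:.3f}  {report}")
```

Words such as "bank" and "phishing", which occur in two of the three reports, receive a lower idf than words that occur in only one, which is the downscaling effect described earlier.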

Next, Naive Bayes is used for classification and K-means is used for clustering. The data is split using the 70/30 rule: 70 percent of the data is used for training and the remaining 30 percent for validation and testing.
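
As a rough sketch of the split and the classification step (K-means is left out here), the code below shuffles a handful of invented records, keeps 70 percent for training, and fits a small multinomial Naive Bayes classifier with add-one smoothing. The records, the `split_70_30` helper and the `NaiveBayes` class are all hypothetical and written only for illustration; a real project would more likely use an existing library implementation.

```python
import math
import random
from collections import Counter, defaultdict

def split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it 70/30."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labelled records (report text, perpetrator class).
records = [
    ("phishing email stole bank password", "CC"),
    ("fraud email stole card number", "CC"),
    ("fake lottery email demanded payment", "CC"),
    ("stolen card used for online purchase", "CC"),
    ("hacker breached company server", "CH"),
    ("malware infected office network", "CH"),
    ("hacker defaced government website", "CH"),
    ("threat video posted to incite violence", "CT"),
    ("extremist group attacked power grid network", "CT"),
    ("propaganda site recruited members", "CT"),
]

train, test = split_70_30(records)
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
correct = sum(model.predict(text) == label for text, label in test)
print(f"{correct} of {len(test)} held-out records classified correctly")
```

With only ten records the held-out result says nothing about real performance; the point is the order of operations: split first, train on the 70 percent, and evaluate only on the remaining 30 percent.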

Once the model is trained, its accuracy can be evaluated by making predictions on the held-out data and comparing them with the actual labels. A confusion matrix can be used for this check. The following table shows the precision, recall and f1-score of the model.

Precision, recall and f1-score of model
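
To make the confusion matrix itself concrete, here is a small sketch that builds one from two label lists. The actual and predicted labels are invented for illustration and use the CC, CT and CH classes from the dataset legend; the `confusion_matrix` function is a hypothetical helper, not a library call.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]

labels = ["CC", "CT", "CH"]

# Hypothetical ground truth and model output for eight incidents.
actual = ["CC", "CC", "CT", "CH", "CH", "CT", "CC", "CH"]
predicted = ["CC", "CT", "CT", "CH", "CC", "CT", "CC", "CH"]

matrix = confusion_matrix(actual, predicted, labels)

print("actual \\ predicted", labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correctly classified incidents, and every off-diagonal cell shows one specific kind of mistake, for example a cyber hacker case that was labelled as cyber-criminal.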

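Precision: It is the ratio of correctly predicted positive samples to the total number of samples predicted as positive. Here TP, FP, TN and FN are the counts of true positives, false positives, true negatives and false negatives in the confusion matrix.
Precision = TP / (TP + FP)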

Recall: It is the ratio of correctly predicted positive samples to all the samples that actually belong to the positive class.
Recall = TP / (TP + FN)

F1-score: It is the harmonic mean of precision and recall.
F1 Score = 2 × (precision × recall) / (precision + recall)

Accuracy is the fraction of all predictions that the model gets right. It is a reliable measure when the dataset is roughly balanced, so that false positives and false negatives occur in similar numbers and carry similar costs.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
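
The four measures can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical counts (40 true positives, 10 false positives, 20 false negatives, 30 true negatives), not results from the model above, and `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts precision is 40/50 = 0.8 and recall is 40/60, about 0.667, so the F1-score falls between them at about 0.727, while accuracy is 70/100 = 0.7.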

Confusion matrix is used for cybercrime investigation

Conclusion

At present, cybercrime cases are increasing at a very high rate, so models should be designed in such a way that we can draw proper and accurate insights about these activities. The main focus of this work is to find these attacks and analyse them with the help of machine learning. This approach reduces paperwork and makes it easier to identify cases incident-wise or area-wise. Reports of this kind can then be used to take precautionary steps against cybercrime.

Open to any queries and suggestions.

Thank you
