Spam-Detector-ML

A Spam Classifier using Machine Learning and ElasticSearch.

Keywords: java, sparse format, liblinear, weka, Linear Regression, Naive Bayes

Data

A 255 MB corpus (trec07p.tgz) provides a set of emails annotated for spam.

Clean Data

All email files are written in Multipurpose Internet Mail Extensions (MIME) format.

Extract the content using the Apache James MIME4J library.
Parse the HTML body using the Jsoup library.
Clean the content with regex, eliminating non-English characters and unnecessary notation.
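
The regex cleaning step could look roughly like the sketch below. It uses only `java.util.regex` and assumes the text has already been extracted by MIME4J and Jsoup. The class name `EmailTextCleaner` and the exact character classes are assumptions for illustration, not the patterns used in this repository.

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the regex cleaning step. */
final class EmailTextCleaner {
    // Assumption: anything that is not an ASCII letter or whitespace counts as noise.
    private static final Pattern NON_LETTERS = Pattern.compile("[^A-Za-z\\s]+");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Replaces non-English characters and notation with spaces, then collapses whitespace. */
    static String clean(String text) {
        String lettersOnly = NON_LETTERS.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(lettersOnly).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("WIN a FREE prize!!! Click here >>> now"));
    }
}
```

Replacing noise with a space rather than deleting it keeps neighbouring words from being glued together.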

Upload to ElasticSearch

Reformat the cleaned content into JSON using the Gson library.
Randomly assign 80% of the documents as training data and 20% as testing data.
Upload to ElasticSearch using its REST API.
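
One way to make the 80/20 assignment is a seeded shuffle followed by a cut, sketched below. The names `TrainTestSplitter` and `Split` are hypothetical, and an exact cut after shuffling is an assumption; the original code may instead have flipped a weighted coin per document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper illustrating a reproducible random train/test split. */
final class TrainTestSplitter {
    /** Holds the two partitions produced by split(). */
    static final class Split {
        final List<String> train;
        final List<String> test;

        Split(List<String> train, List<String> test) {
            this.train = train;
            this.test = test;
        }
    }

    /** Shuffles a copy of docIds with the given seed and cuts it at trainFraction. */
    static Split split(List<String> docIds, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(docIds);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split(new ArrayList<>(shuffled.subList(0, cut)),
                         new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ids.add("doc" + i);
        }
        Split s = split(ids, 0.8, 42L);
        System.out.println("train=" + s.train.size() + " test=" + s.test.size());
    }
}
```

Fixing the seed keeps the split stable across runs, so results from different feature lists stay comparable.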

Generate Sparse Matrix

Generate lists of spam words using two strategies. These will be the features (columns) of the data matrix.
1. Manually generate a list of spam-related words, for example: “free”, “win”, “porn”, “click here”, etc.
2. Use ElasticSearch to get all unigrams for the entire corpus.
Generate term frequencies using ElasticSearch.
Save the values in sparse format, together with a catalog file recording which docId corresponds to which line of the sparse data.
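
As a rough illustration of the sparse format, the sketch below writes one document as a LibLinear-style line: the label followed by `index:value` pairs with 1-based indices in ascending order. The class name `SparseRowWriter` and the label convention (1 for spam, 0 for ham) are assumptions, and the term frequencies are taken as already fetched from ElasticSearch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper illustrating one line of the sparse data file. */
final class SparseRowWriter {
    /**
     * Builds "label idx:tf idx:tf ...". Keys of termFrequencies are 1-based
     * feature indices (positions in the spam word list); zero counts are skipped.
     */
    static String toSparseLine(int label, Map<Integer, Integer> termFrequencies) {
        StringBuilder line = new StringBuilder().append(label);
        // A TreeMap iterates keys in ascending order, as the sparse format expects.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(termFrequencies).entrySet()) {
            if (e.getValue() != 0) {
                line.append(' ').append(e.getKey()).append(':').append(e.getValue());
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tf = new HashMap<>();
        tf.put(3, 2); // third word of the list appears twice
        tf.put(1, 1); // first word of the list appears once
        System.out.println(toSparseLine(1, tf));
    }
}
```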

Train and Test

Train on the 80% dataset using LibLinear's linear regression model to generate a model file.
Generate predictions on the 20% dataset.
Calculate the precision on the testing set.
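
The evaluation step can be sketched as below, assuming the true labels and the predicted labels have already been read into two parallel arrays. Precision is TP / (TP + FP) for the spam class. Accuracy is included as well because the results section quotes both. The class name `SpamMetrics` is hypothetical.

```java
/** Hypothetical helper illustrating precision and accuracy on the test set. */
final class SpamMetrics {
    /** Fraction of documents predicted as positiveLabel that truly carry that label. */
    static double precision(int[] actual, int[] predicted, int positiveLabel) {
        requireSameLength(actual, predicted);
        int truePositives = 0;
        int falsePositives = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == positiveLabel) {
                if (actual[i] == positiveLabel) {
                    truePositives++;
                } else {
                    falsePositives++;
                }
            }
        }
        int predictedPositives = truePositives + falsePositives;
        return predictedPositives == 0 ? 0.0 : (double) truePositives / predictedPositives;
    }

    /** Fraction of all documents whose predicted label matches the true label. */
    static double accuracy(int[] actual, int[] predicted) {
        requireSameLength(actual, predicted);
        if (actual.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    private static void requireSameLength(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
    }

    public static void main(String[] args) {
        int[] actual = {1, 0, 1, 1, 0};
        int[] predicted = {1, 1, 1, 0, 0};
        System.out.println("precision=" + precision(actual, predicted, 1));
        System.out.println("accuracy=" + accuracy(actual, predicted));
    }
}
```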

Results and Improvements

Using all unigrams as features results in an average precision of 99% 👍

However, the smaller manual list only reached an accuracy of around 75%. To improve the result, I used another machine learning algorithm: Naive Bayes, which is commonly used for spam prediction.

Weka provides a good Naive Bayes library; one only needs to reformat the sparse matrix into the .ARFF format to run the algorithm. The resulting accuracy increased to about 80%, which was still not enough.
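
The reformatting into ARFF might look like the sketch below. It turns one LibLinear-style line into a sparse ARFF data row of the form `{index value, ...}`. ARFF indices are 0-based, and the class attribute is assumed to be declared last, at index `numFeatures`. The class name `SparseToArff`, the nominal values `spam` and `ham`, and the convention that label 1 means spam are all assumptions. The `@relation`, `@attribute` and `@data` header lines still have to be written separately.

```java
/** Hypothetical helper illustrating the sparse-matrix-to-ARFF conversion. */
final class SparseToArff {
    /**
     * Converts "label idx:val idx:val ..." (1-based indices) into a sparse
     * ARFF row "{idx val, ..., classIdx label}" (0-based indices).
     */
    static String toArffRow(String sparseLine, int numFeatures) {
        String[] tokens = sparseLine.trim().split("\\s+");
        StringBuilder row = new StringBuilder("{");
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            int arffIndex = Integer.parseInt(pair[0]) - 1; // shift from 1-based to 0-based
            row.append(arffIndex).append(' ').append(pair[1]).append(", ");
        }
        // Assumption: label "1" marks spam, anything else is ham.
        String label = tokens[0].equals("1") ? "spam" : "ham";
        row.append(numFeatures).append(' ').append(label).append('}');
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(toArffRow("1 1:1 3:2", 5));
    }
}
```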

In the end, I looked into the model learned from the all-unigram training set and took the 30 features with the highest absolute weights, which are the most effective indicators for spam detection. I used the catalog file to find the corresponding words and added them to the short list.
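
Picking the strongest features can be sketched as follows, assuming the weights from the model file have already been parsed into a `double[]` with one entry per feature. The class name `TopWeightSelector` is hypothetical. The returned positions are 0-based, so the matching 1-based feature index in the sparse file is the position plus one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper illustrating selection of the k largest absolute weights. */
final class TopWeightSelector {
    /** Returns the 0-based positions of the k weights with the largest absolute value. */
    static List<Integer> topKByAbsoluteWeight(double[] weights, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            indices.add(i);
        }
        // Sort by descending |weight|; the sort is stable, so ties keep index order.
        indices.sort(Comparator.comparingDouble((Integer i) -> -Math.abs(weights[i])));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        double[] weights = {0.1, -2.5, 0.7, 1.9};
        System.out.println(topKByAbsoluteWeight(weights, 2));
    }
}
```

Ranking by absolute value keeps strongly negative weights too, since a word that strongly signals non-spam is as informative as one that signals spam.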

Rerunning LibLinear on the new list of about 50 spam keywords, the average accuracy reached 96% overall and 98% for the top 50 spam files 😄
