Spam classification using Python and Keras

26 Feb 2018 (CPOL)
In this article I will show how to prepare training and test data, define a simple neural network model, and train and test it.

Introduction

Spam detection is an everyday problem that can be solved in many different ways, for example with statistical methods. Here we will build a spam detector based on Python and the Keras library. Keras is a high-level API for deep learning that can use TensorFlow, Theano or CNTK under the hood. It was created to provide a consistent and user-friendly way to prototype neural networks. We will first show how to transform the given text data into a format that can be processed by a deep learning algorithm. Then we will create a rather naive model, train it with the given training data and test it against a separate set of test data.

Disclaimer

I am neither an experienced Python developer nor an expert in the field of deep learning. In my everyday job I develop Java enterprise applications and have been doing this for almost 20 years. In my free time I like to experiment with new technologies, often a little off the beaten track. So feel free to follow me in exploring this new technology and don't hold back with questions or corrections.

Setup

I won't go into the details of installing Python, Keras or TensorFlow here, let alone the configuration needed to run everything on a GPU. There are plenty of installation recipes on the web. I based my own installation on Anaconda and tried it on both macOS and Windows. For our simple model and small amount of training data there is no need to install and configure the infrastructure for GPU computations.

Preprocessing the data

The input data for our contest task is a single text file containing training and test data in an alphanumeric format. It consists of three blocks of data: two training blocks containing spam and ham (i.e., non-spam) examples, and one block of mixed spam/ham to test our solution. The blocks are divided by header lines. Each data line starts with a label (Spam or Ham) followed by the text to evaluate.

# Spam training data
Spam,<p>But could then once pomp to nor that glee glorious of deigned ...</p>
Spam,<p>His honeyed and land vile are so and native from ah to ah it ...</p>
...

# Ham training data
Ham,<p>Nights chamber with off it nearly i and thing entrance name. Into ...</p>
Ham,<p>Chamber bust me. Above the lenore and stern by on. Have shall ah ...</p>
...

# Test data
Ham,<p>Bust by this expressing at stepped and. My my dreary a and. Shaven we ...</p>
...
Spam,<p>So his chaste my. Mote way fabled as of aye from like old. Goodly rill ...</p>
...

First we will separate the training lines from the test lines, preserving the original line format. We will use the comment # Test data for this separation. We will also shuffle the training and test data.

from random import seed, shuffle

'''
Read the file with the training and test data and return
it as two separate lists. Both lists will be shuffled before
they are returned.
'''
def read_lines():
    train_lines = []
    test_lines = []
    current_lines = []

    with open('SpamDetectionData.txt') as f:
        for line in f.readlines():
            if line.startswith('# Test data'):
                # Everything read so far was training data;
                # from here on we collect the test data
                train_lines = current_lines
                current_lines = test_lines
            elif line.startswith('#'):
                # Ignore other comment lines
                pass
            elif line == '\n':
                # Ignore empty lines
                pass
            else:
                current_lines.append(line)

    test_lines = current_lines
   
    seed(1337)
    shuffle(train_lines)
    shuffle(test_lines)

    print('Read training lines: ', len(train_lines))
    print('Read test lines: ', len(test_lines))

    return train_lines, test_lines


# First split train lines from test lines
train_lines, test_lines = read_lines()

Second, we will split the two blocks into labels and data. We will remove some formatting information but keep the alphanumeric format for now. This is still plain Python.

'''
Take a list of lines from the original input file (train or test), remove
paragraphs and line breaks and split into label and data by using the comma
as divider. Return as two separate lists preserving the sort order.
'''
def split_lines(lines):
    data = []
    labels = []
    maxtokens = 0
    for line in lines:
        # Remove paragraph tags and the trailing line break, then split into
        # label and text at the first comma only
        label_part, data_part = line.replace('<p>', '').replace('</p>', '').replace('\n', '').split(',', 1)
        data.append(data_part)
        labels.append(label_part)
        # Track the length (in characters) of the longest text
        if len(data_part) > maxtokens:
            maxtokens = len(data_part)

    print('maxlen ', maxtokens)

    return data, labels

# Split data from label for each line
train_data_raw, train_labels_raw = split_lines(train_lines)
test_data_raw, test_labels_raw = split_lines(test_lines)

Now Keras joins the game. We use the Tokenizer class from the preprocessing package to vectorize our texts. The tokenizer is initialized using our training data (only the text part). fit_on_texts will create a dictionary of all words used in the training data, along with a rank (index number) for each word. You can look into this dictionary by calling tokenizer.word_index.

# Use the Keras Tokenizer to vectorize the text:
# fit_on_texts will set up the internal vocabulary using all words
# from the training data and attach indices to them;
# texts_to_sequences will transform each text into a sequence of
# integers
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data_raw)
train_data_seq = tokenizer.texts_to_sequences(train_data_raw)
test_data_seq = tokenizer.texts_to_sequences(test_data_raw)

With a call to tokenizer.texts_to_sequences() we transform our text data into lists of word indices.

If we did this with only one sample text, it would look like this:

sample_text = 'But could then once pomp to nor that glee glorious of deigned'

dictionary = {'but':1,'could':2,'then':3,'once':4,'pomp':5,'to':6,'nor':7,'that':8,'glee':9,'glorious':10,'of':11,'deigned':12}

sample_idx = [1,2,3,4,5,6,7,8,9,10,11,12]
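
If you want to try this yourself, the following small sketch (not part of the original listing, it just reuses the sample_text above) shows what the Tokenizer produces for that single text; the exact index order may vary slightly between Keras versions:

# Minimal sketch: run the Keras Tokenizer on the single sample text above
from keras.preprocessing.text import Tokenizer

sample_text = 'But could then once pomp to nor that glee glorious of deigned'

t = Tokenizer()
t.fit_on_texts([sample_text])
print(t.word_index)
# {'but': 1, 'could': 2, 'then': 3, ..., 'deigned': 12}
print(t.texts_to_sequences([sample_text]))
# [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]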

 

Then we will convert this list of lists of indices to a binary numpy matrix. The matrix columns represent the words in the text data, the rows represent the text lines.

from numpy import zeros

'''
While processing the data with Keras, each original text is converted
to a list of indices. These indices point to words in a dictionary
of all words contained in the training data. We convert this to a binary
matrix. The value 1 in the matrix says that a word (a column in the matrix)
is contained in a given text (a row in the matrix).
'''
def vectorize_sequences(sequences, dimension=4000):
    results = zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column whose word index occurs in this text to 1
        results[i, sequence] = 1.
    return results

# Finally the integer sequences are converted to a binary (numpy)
# matrix where rows are for the text lines, columns are for
# the words. 1 = word is inside text, 0 = word is not inside
x_train = vectorize_sequences(train_data_seq, 4000)
print('Lines of training data: ', len(x_train))
x_test = vectorize_sequences(test_data_seq, 4000)
print('Lines of test data: ', len(x_test))

 

Using the index values from the last step we can create a binary matrix as follows:

            chamber  nearly   thing   lenore   shall   ...   glorious
line    1     1.0      1.0      1.0     0.0     0.0    ...      0.0
line    2     1.0      0.0      0.0     1.0     1.0    ...      0.0
...
line 2000     0.0      0.00     0.0     0.0     0.0    ...      1.0

As you can see, the matrix reduces our data to the mere existence of words in a text. We give up the order of words and any recurring word patterns that could indicate spam. This may be a disadvantage when trying to identify spam, but it is easy to compute.
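
To make this concrete, here is a small illustration of my own (not from the original code): it reuses the tokenizer and vectorize_sequences defined above, and the two example sentences are made up from words in the training samples. Two texts containing the same words in a different order end up as identical rows:

# Word order is lost: both sentences produce exactly the same binary row
a = vectorize_sequences(tokenizer.texts_to_sequences(['glee glorious and native land']), 4000)
b = vectorize_sequences(tokenizer.texts_to_sequences(['native land and glorious glee']), 4000)
print((a == b).all())   # True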

We will create a second numpy matrix for the test data. In this case we will only use the vocabulary from the training data, as our model will be trained on this. So the second matrix has the same columns as the one created from the training data and binary flags created from the test set.
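
This works because the Keras Tokenizer, when no oov_token is configured (as in our case), simply drops words it never saw during fit_on_texts. A quick illustrative check (the nonsense word and the printed indices are of course made up):

# Unknown words are silently dropped when building the test sequences
print(tokenizer.texts_to_sequences(['chamber zzzunknownzzz lenore']))
# e.g. [[17, 123]] - only the indices of 'chamber' and 'lenore' remain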

Finally we will transform the labels to a numeric format. As we have exactly one label per line, we will create a binary vector where 1.0 stands for Spam and 0.0 stands for Ham. We will do this for both the training and the test labels.

'''
The label vectorization is quite simple:
  the value 1 is for spam,
  the value 0 is for ham
'''
def vectorize_labels(labels):
    results = zeros(len(labels))
    for i, label in enumerate(labels):
        if (label.lower() == 'spam'):
            results[i] = 1
    return results

# The labels are also converted to a binary vector.
# 1 means spam, 0 means ham
y_train = vectorize_labels(train_labels_raw)
print('Lines of training results: ', len(y_train))
y_test = vectorize_labels(test_labels_raw)
print('Lines of test results: ', len(y_test))

So finally we have four pieces of data:

  1. a binary matrix of words and their occurrence in the training data
  2. a binary matrix of the same words and their occurrence in the test data
  3. a binary vector of classification labels for the training data
  4. a binary vector of classification labels for the test data

We will use 1. and 3. to train our neural network and 2. and 4. to test and evaluate it.

As you can see, preparing the input data can take a fair amount of effort.

Setting up the model

Now we want to create the neural network using Keras. A neural network consists of a set of layers that transform the input data to a prediction. Every layer uses a set of weights as parameters for the transformation. The prediction is compared to the expected value ('training label' in the diagram) using a loss function. In each iteration an optimizer is used to improve the weights (parameters). So learning means minimizing the loss of a model by iteratively changing model parameters.

[Image: Neural network]
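
To make the idea of "learning = minimizing the loss by iteratively changing parameters" more tangible, here is a deliberately tiny, self-contained sketch of my own (it is not Keras code): plain gradient descent fitting a single weight w so that w * x matches y.

# Toy example of the training loop idea: forward pass, loss, weight update
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])       # the 'true' relationship is y = 2 * x

w = 0.0                                   # the single model parameter (weight)
learning_rate = 0.01
for epoch in range(50):
    prediction = w * x                            # forward pass
    loss = np.mean((prediction - y) ** 2)         # loss function (mean squared error)
    gradient = np.mean(2 * (prediction - y) * x)  # how the loss changes with w
    w -= learning_rate * gradient                 # optimizer step: reduce the loss

print(w)   # close to 2.0 after 50 epochs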

Our simple sequential model will use an input layer with 4000 input neurons (in fact we only have 3691 different words in the training data), two hidden layers for internal transformation and one output layer that gives us a scalar prediction value indicating if we have spam or ham.

# Now we build the Keras model
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(8, activation='relu', input_shape=(4000,)))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

We use so-called dense layers for our network, so every neuron in one layer is connected to each neuron in the next layer. Our network looks like this (okay, there are not exactly 4000 neurons in the first layer, but you get the idea):

[Image: Dense network]
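
For illustration, this is roughly what a single dense layer computes for one input vector (the Keras documentation describes it as output = activation(dot(input, kernel) + bias)); the weights below are made-up example values and no activation function is applied yet:

# What a dense layer does: every input is connected to every neuron
import numpy as np

inputs = np.array([1.0, 0.0, 1.0])       # 3 input values
weights = np.array([[0.2, -0.5],         # 3 x 2 weight matrix: 3 inputs, 2 neurons
                    [0.7,  0.1],
                    [0.3,  0.4]])
bias = np.array([0.1, 0.0])

output = np.dot(inputs, weights) + bias  # linear combination of the weighted inputs
print(output)                            # [ 0.6 -0.1] - one value per neuron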

In Keras you can assign an activation function to each layer. If you don't, a neuron's output is simply a linear combination of its weighted inputs. By setting activation functions you can add non-linear behaviour.

For the hidden layers we use the 'relu' function, which is defined as f(x) = max(0, x).

For the output layer we use the 'sigmoid' function, which transforms the output into the (0, 1) interval and is non-linear.
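
Just to illustrate the shape of these two functions, here are minimal numpy versions of my own (Keras ships its own implementations, of course):

# 'relu' cuts off negative values, 'sigmoid' squashes everything into (0, 1)
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(relu(np.array([-2.0, 0.5])))      # [0.  0.5]
print(sigmoid(np.array([-2.0, 0.5])))   # approx. [0.119 0.622]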

We use 'binary_crossentropy' as the loss function and 'rmsprop' as the optimizer.
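
For intuition, the binary crossentropy of a single prediction p against the true label y (1 = Spam, 0 = Ham) can be written as -(y*log(p) + (1-y)*log(1-p)). Here is a tiny sketch of my own; Keras computes this internally, averaged over the batch:

# High confidence in the right class gives a low loss, in the wrong class a high loss
import numpy as np

def binary_crossentropy(y_true, p):
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_crossentropy(1.0, 0.9))   # approx. 0.105 - confidently correct
print(binary_crossentropy(1.0, 0.1))   # approx. 2.303 - confidently wrong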

After calling compile the model is ready to be trained.

Training the model

Now we will use the training data to train our neural network model. The training is done by calling fit with the following parameters:

  • x_train is the training data (the binary matrix of words); a single input record is also called a sample
  • y_train is the vector of training labels (the expected results)
  • epochs defines the number of training passes over the entire dataset; at the end of each epoch the model is validated and progress is logged
  • batch_size defines the number of samples that are processed together and lead to one update of the model weights
  • validation_split defines which fraction of the training data is used to validate our progress. Our value of 0.3 means we use 1400 samples of our training set for training and 600 for validation.

# Train the model
history = model.fit(x_train,y_train,epochs=8,batch_size=100,validation_split=0.3)

The training will give us output for every epoch, so we can see how the model behaves. Ideally the accuracy should increase while the loss decreases. Because we defined a validation_split, there is also a validation phase at the end of each epoch, giving us additional val_loss and val_acc values.

Train on 1400 samples, validate on 600 samples
Epoch 1/8
1400/1400 [==============================] - 0s 249us/step - loss: 0.3301 - acc: 0.9536 - val_loss: 0.1725 - val_acc: 1.0000
Epoch 2/8
1400/1400 [==============================] - 0s 79us/step - loss: 0.1124 - acc: 1.0000 - val_loss: 0.0747 - val_acc: 1.0000
Epoch 3/8
1400/1400 [==============================] - 0s 79us/step - loss: 0.0491 - acc: 1.0000 - val_loss: 0.0342 - val_acc: 1.0000
Epoch 4/8
1400/1400 [==============================] - 0s 80us/step - loss: 0.0227 - acc: 1.0000 - val_loss: 0.0167 - val_acc: 1.0000
Epoch 5/8
1400/1400 [==============================] - 0s 81us/step - loss: 0.0110 - acc: 1.0000 - val_loss: 0.0084 - val_acc: 1.0000
Epoch 6/8
1400/1400 [==============================] - 0s 79us/step - loss: 0.0053 - acc: 1.0000 - val_loss: 0.0042 - val_acc: 1.0000
Epoch 7/8
1400/1400 [==============================] - 0s 79us/step - loss: 0.0026 - acc: 1.0000 - val_loss: 0.0022 - val_acc: 1.0000
Epoch 8/8
1400/1400 [==============================] - 0s 79us/step - loss: 0.0013 - acc: 1.0000 - val_loss: 0.0012 - val_acc: 1.0000
100/100 [==============================] - 0s 60us/step

The fit operation returns this data as 'history'. We can use it to plot diagrams with matplotlib.

from matplotlib import pyplot

def plot_accuracy(history):
    pyplot.plot(history.history['acc'])
    pyplot.plot(history.history['val_acc'])
    pyplot.title('model accuracy')
    pyplot.ylabel('accuracy')
    pyplot.xlabel('epoch')
    pyplot.legend(['training', 'validation'], loc='lower right')
    pyplot.show()

def plot_loss(history):
    pyplot.plot(history.history['loss'])
    pyplot.plot(history.history['val_loss'])
    pyplot.title('model loss')
    pyplot.ylabel('loss')
    pyplot.xlabel('epoch')
    pyplot.legend(['training', 'validation'], loc='upper right')
    pyplot.show()

# summarize history for accuracy
plot_accuracy(history)

# summarize history for loss
plot_loss(history)

As you can see, the accuracy of our model increases very quickly to 1.0.

[Image: Model accuracy]

The loss of our model decreases with each epoch, approaching almost 0.

[Image: Model loss]

Evaluate the model

Finally we want to evaluate the model using our test data. We call evaluate with the test data and test labels to check the model.

# Evaluate the model
results = model.evaluate(x_test, y_test)
print(model.metrics_names)
print('Test result: ', results)

It returns a result containing loss and accuracy. In our case the loss is very low and the accuracy is 100%.

['loss', 'acc']
Test result:  [0.00056361835027928463, 1.0]

If you want to check single records of the test data, or use your model with new data (maybe from incoming mail), you can do so using the predict operation. Here we define a small function test_predict which converts a test text to its vectorized form and calls predict. If the prediction value is > 0.5 we call it Spam, otherwise we call it Ham. We then compare the predicted value with the expected label.

def test_predict(model, testtext, expected_label):
    testtext_list = []
    testtext_list.append(testtext)
    testtext_sequence = tokenizer.texts_to_sequences(testtext_list)
    x_testtext = vectorize_sequences(testtext_sequence)
    prediction = model.predict(x_testtext)[0][0]
   
    print('Prediction: %.3f' % prediction, 'Expected:', expected_label)

    if prediction > 0.5:
        if expected_label == 'Spam':
            return True
    else:
        if expected_label == 'Ham':
            return True
   
    return False

# Manual test over all test records
correct = 0
wrong = 0
for input_text, expected_label in zip(test_data_raw, test_labels_raw):
    if test_predict(model, input_text, expected_label):
        correct = correct + 1
    else:
        wrong = wrong + 1

print('Predictions correct ', correct, ', wrong ', wrong)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
