Identifying Sarcastic Headlines

Bingi Nagesh
7 min read · Nov 11, 2021

Contents

1. Introduction
2. Data Collection
3. Exploratory Data Analysis
4. Preprocessing
5. Experimentation and Results
6. Data Pipeline
7. Deployment
8. Experiments that did not work
9. Conclusion
Code Reference
Contact Links

1. Introduction

Sarcasm is the caustic use of irony, in which words are used to communicate the opposite of their surface meaning, in a humorous way or to mock someone or something — Wikipedia

Sarcasm is the use of words that mean the opposite of what you really want to say especially in order to insult someone, to show irritation, or to be funny — Merriam-Webster

Sarcasm is used in news headlines to mock or criticize industries, corporations, governments, and government policies. Sometimes news publishers intentionally use sarcastic headlines to sound funny or to grab the general public's attention.

In this blog, Bag-of-Words (BoW) and TF-IDF featurization techniques are used, followed by classical ML algorithms for classification. In another approach, pre-trained word embeddings are used to encode words and a neural network is used for classification. BERT encodings were also tried, but the results were very poor. The evaluation metric is accuracy. Finally, the best model is deployed on the Heroku cloud.

2. Data Collection

The data is downloaded from Kaggle and consists of two JSON files.

The headlines are collected from two news websites. TheOnion produces sarcastic takes on current events, and all headlines from its News in Brief and News in Photos categories (which are sarcastic) are collected. Real (and non-sarcastic) news headlines are collected from HuffPost.

Each record in the JSON file consists of three attributes:

  • is_sarcastic: 1 if the headline is sarcastic, otherwise 0
  • headline: the headline of the news article
  • article_link: link to the original news article. Useful in collecting supplementary data

One JSON file contains 28619 records and is used as the train dataset; the other contains 26709 records and is used as the test dataset.

3. Exploratory Data Analysis

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)
test = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)  # assumed filename of the second JSON file
plt.figure()
sns.countplot(data = train, x = "is_sarcastic")
plt.title("Class distribution")
plt.show()

From above, we can see that both classes are almost equally present.

Let us plot the distribution of word length of headlines in the train dataset.
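A rough sketch of how this distribution can be plotted (the exact plotting code is not shown in the post, so this is an assumption):

# assumed plotting code; the original figure is not reproduced here
train["word_length"] = train["headline"].apply(lambda x: len(x.split()))
plt.figure()
sns.histplot(train["word_length"], bins = 40)
plt.title("Distribution of headline word length")
plt.xlabel("number of words in headline")
plt.show()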

From above, we can see that most of the headlines are fewer than 20 words long.

4. Preprocessing

Let us preprocess the headlines.
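The headlines are passed through a decontracted helper that expands English contractions. The helper itself is not shown in the post; a minimal sketch of such a function (an assumption, not necessarily the author's exact implementation) is:

import re

def decontracted(phrase):
    # expand common English contractions (assumed implementation)
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase.lower()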

train["headline"] = train["headline"].apply(decontracted)
test["headline"] = test["headline"].apply(decontracted)

5. Experimentation and Results

5.1 Bag-of-words

Let us extract features using bag-of-words featurizer.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df = 10, max_df = 5000, ngram_range = (1, 3))
vectorizer.fit(train["headline"])
x_train = vectorizer.transform(train["headline"])
x_test = vectorizer.transform(test["headline"])

5.1.1 Logistic Regression

Let us train the logistic regression model and find out accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

model = LogisticRegression(C = 1, max_iter = 200)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 91.3287%

5.1.2 Naive Bayes

Let us train the Naive Bayes model and find out accuracy.

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB(alpha = .033)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 86.9557%

5.1.3 Random Forest

Let us train the Random Forest model and find out accuracy.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 50)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 99.8951%

5.1.4 XGBoost

Let us train the XGBoost model and find out accuracy.

import xgboost as xgb

model = xgb.XGBClassifier(n_estimators = 150, max_depth = 32, verbosity = 1, use_label_encoder = False)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 87.9815%

5.2 TF-IDF

Let us extract features using TF-IDF featurizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df = 10, max_df = 5000, ngram_range = (1, 3))
vectorizer.fit(train["headline"])
x_train = vectorizer.transform(train["headline"])
x_test = vectorizer.transform(test["headline"])

5.2.1 Logistic Regression

Let us train the logistic regression model and find out accuracy.

model = LogisticRegression(C = 3.3, max_iter = 200)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 90.7671%

5.2.2 Naive Bayes

Let us train the Naive Bayes model and find out accuracy.

model = MultinomialNB(alpha = .01)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 87.0979%

5.2.3 Random Forest

Let us train the Random Forest model and find out accuracy.

model = RandomForestClassifier(n_estimators = 50)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 99.9063%

5.2.4 XGBoost

Let us train the XGBoost model and find out accuracy.

model = xgb.XGBClassifier(n_estimators = 150, max_depth = 32, verbosity = 1, use_label_encoder = False)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 91.1752%

5.3 Neural Network with GloVe Embeddings

Let us train the neural network with GloVe embeddings and find out accuracy.

import pickle
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

t = Tokenizer()
t.fit_on_texts(train["headline"])
encoded_train = t.texts_to_sequences(train["headline"])
encoded_test = t.texts_to_sequences(test["headline"])
max_length = 25
padded_train = pad_sequences(encoded_train, maxlen = max_length,
                             padding = "post", truncating = "post")
padded_test = pad_sequences(encoded_test, maxlen = max_length,
                            padding = "post", truncating = "post")
print(padded_train.shape, padded_test.shape, type(padded_train))
vocab_size = len(t.word_index) + 1
print(vocab_size)
with open("/mygdrive/glove_vectors", "rb") as fi:
    glove_model = pickle.load(fi)
glove_words = set(glove_model.keys())
embedding_matrix = np.zeros((vocab_size, 300))  # vector length of each word is 300
for word, i in t.word_index.items():
    if word in glove_words:
        embedding_matrix[i] = glove_model[word]
print(embedding_matrix.shape)

Output:

(28619, 25) (26709, 25) <class 'numpy.ndarray'>
29588
(29588, 300)

Let us define callbacks for our neural network.

from datetime import datetime
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

def checkpoint_path():
    return "./model/weights.{epoch:02d}-{val_accuracy:.4f}.hdf5"

def log_dir():
    return "./logs/fit/" + datetime.now().strftime("%Y-%m-%d-%H:%M:%S")

earlystop = EarlyStopping(monitor = "val_accuracy", patience = 7, verbose = 1,
                          restore_best_weights = True, mode = "max")
reduce_lr = ReduceLROnPlateau(monitor = "val_accuracy", factor = .4642, patience = 3,
                              verbose = 1, min_delta = 0.001, mode = "max")

Creating and training the model:

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Flatten, Dense, Dropout, Activation
from tensorflow.keras.models import Model

tf.keras.backend.clear_session()
input = Input(shape = (max_length, ), name = "input")
embedding = Embedding(input_dim = vocab_size,
                      output_dim = 300,  # GloVe vector size
                      weights = [embedding_matrix],
                      trainable = False)(input)
lstm = LSTM(32)(embedding)
flatten = Flatten()(lstm)
dense = Dense(16, activation = None, kernel_initializer = "he_uniform")(flatten)
dropout = Dropout(.25)(dense)
activation = Activation("relu")(dropout)
output = Dense(2, activation = "softmax", name = "output")(activation)
model = Model(inputs = input, outputs = output)
model.compile(optimizer = "adam", loss = "sparse_categorical_crossentropy", metrics = ["accuracy"])
checkpoint = ModelCheckpoint(filepath = checkpoint_path(), monitor = "val_accuracy",
                             verbose = 1, save_best_only = True, mode = "max")
callbacks_list = [checkpoint, earlystop, reduce_lr]
history = model.fit(padded_train, y_train,
                    validation_data = (padded_test, y_test),
                    epochs = 30, batch_size = 32,
                    callbacks = callbacks_list)

Training history:
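The training curves are not reproduced here; a rough sketch of plotting them from the history object returned by model.fit above (assumed plotting code):

# assumed plotting code for the training history figure
plt.figure()
plt.plot(history.history["accuracy"], label = "train accuracy")
plt.plot(history.history["val_accuracy"], label = "validation accuracy")
plt.xlabel("epoch")
plt.legend()
plt.title("Training history")
plt.show()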

Model testing:

y_pred_softmax = model.predict(padded_test)
y_pred = []
for i in range(len(y_pred_softmax)):
    if y_pred_softmax[i][0] >= 0.5:
        y_pred.append(0)
    else:
        y_pred.append(1)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = "d")
plt.xlabel("predicted label")
plt.ylabel("actual label")
plt.title("test confusion matrix")
plt.show()
print("Accuracy:", 100*accuracy_score(y_test, y_pred))
Accuracy: 99.9925%

5.4 Results

+--------------------------------------+---------------+
| Model | Test Accuracy |
+--------------------------------------+---------------+
| BoW with Logistic Regression | 91.3287% |
| BoW with Naive Bayes | 86.9557% |
| BoW with Random Forest | 99.8951% |
| BoW with XGBoost | 87.9815% |
| TF-IDF with Logistic Regression | 90.7671% |
| TF-IDF with Naive Bayes | 87.0979% |
| TF-IDF with Random Forest | 99.9063% |
| TF-IDF with XGBoost | 91.1752% |
| Neural network with GloVe Embeddings | 99.9925% |
+--------------------------------------+---------------+

From the results, it is clear that the neural network with GloVe embeddings gave the highest test accuracy of 99.9925%.

6. Data Pipeline
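As a rough illustration of the data pipeline, an end-to-end prediction function could look like the sketch below. The function name is hypothetical, and it assumes the tokenizer t, max_length, decontracted, and the trained GloVe-LSTM model from the sections above:

# hypothetical pipeline: raw headline -> decontraction -> tokenization -> padding -> model -> label
import numpy as np

def predict_headline(headline):
    cleaned = decontracted(headline)                      # expand contractions
    seq = t.texts_to_sequences([cleaned])                 # words -> integer ids
    padded = pad_sequences(seq, maxlen = max_length,
                           padding = "post", truncating = "post")
    probs = model.predict(padded)                         # softmax over [non-sarcastic, sarcastic]
    return "SARCASTIC" if np.argmax(probs[0]) == 1 else "NON-SARCASTIC"

print(predict_headline("area man wins argument with himself"))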

7. Deployment

The best model is deployed on Heroku. Link to web-app. The video below shows some sample predictions.
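A minimal sketch of how such a model could be served as a web app (a hypothetical Flask app; the actual deployment code is not shown in the post):

# hypothetical app.py for the Heroku web app; route names are assumptions
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods = ["POST"])
def predict():
    headline = request.form.get("headline", "")
    label = predict_headline(headline)  # pipeline function sketched in the previous section
    return jsonify({"headline": headline, "prediction": label})

if __name__ == "__main__":
    app.run()

On Heroku, an app like this is typically started with gunicorn through a Procfile (e.g., web: gunicorn app:app).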

8. Experiments that did not work: BERT embeddings

Pre-trained BERT embeddings were used instead of GloVe embeddings. A neural network was built on top of them for classification, but the model predicted all news headlines as NON-SARCASTIC.
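For reference, a rough sketch of how sentence-level BERT features could be extracted for such a classifier (assuming the Hugging Face transformers library; the author's exact setup is not shown):

# hypothetical BERT feature extraction; model name and pooling choice are assumptions
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("bert-base-uncased")

def bert_features(headlines, max_length = 25):
    enc = bert_tokenizer(list(headlines), padding = "max_length", truncation = True,
                         max_length = max_length, return_tensors = "tf")
    out = bert(enc)
    # use the [CLS] token embedding as a fixed-length sentence representation
    return out.last_hidden_state[:, 0, :].numpy()

x_train_bert = bert_features(train["headline"])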

9. Conclusion

In this blog, different experiments were done to separate sarcastic news headlines from non-sarcastic ones. The Random Forest model worked well with both BoW and TF-IDF featurizations. The best accuracy, 99.9925%, was achieved by the neural network with GloVe embeddings.

Code Reference

Contact Links

Email: binginagesh@gmail.com
LinkedIn: https://www.linkedin.com/in/bingi-nagesh-5b0412b7/
