Clickbait Detection: Combining Datasets and Training a Bag-of-Words Classifier

This code implements a clickbait detection model using a bag-of-words (BOW) Naive Bayes classifier. It combines two datasets, one containing positive (clickbait) headlines and the other negative (non-clickbait) headlines, labels them accordingly, and shuffles the result. The combined dataset is then split into train, validation, and test sets, a BOW classifier is trained with scikit-learn's Pipeline class, and its performance is evaluated with precision, recall, and F1-score.
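
As a quick illustration of the bag-of-words step (not part of the script below), CountVectorizer maps each headline to a vector of unigram and bigram counts, which is the representation MultinomialNB learns from. The two headlines here are made-up examples:

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up headlines, purely for illustration
headlines = ["You Won't Believe What Happened Next", "Senate Passes Budget Bill"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(headlines)

print(vectorizer.get_feature_names_out())  # the unigram/bigram vocabulary
print(counts.toarray())                    # one row of counts per headline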

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score

# Read datasets (each file is expected to contain one headline per line;
# a tab separator keeps headlines that contain commas in a single column)
clickbait_pos = pd.read_csv('https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait', sep='\t', header=None)
clickbait_neg = pd.read_csv('https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait', sep='\t', header=None)

# Assign labels: 1 = clickbait, 0 = non-clickbait
clickbait_pos[1] = 1
clickbait_neg[1] = 0

# Combine and shuffle datasets
dataset = pd.concat([clickbait_pos, clickbait_neg])
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train, validation, and test sets (20% test, then 10% of the remainder for validation)
train_data, test_data, train_labels, test_labels = train_test_split(dataset[0], dataset[1], test_size=0.2, random_state=42)
train_data, val_data, train_labels, val_labels = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)

# Create Pipeline and train BOW model
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])
pipeline.fit(train_data, train_labels)

# Calculate precision, recall, and F1-score on training set
train_predictions = pipeline.predict(train_data)
train_precision = precision_score(train_labels, train_predictions)
train_recall = recall_score(train_labels, train_predictions)
train_f1 = f1_score(train_labels, train_predictions)

# Calculate precision, recall, and F1-score on validation set
val_predictions = pipeline.predict(val_data)
val_precision = precision_score(val_labels, val_predictions)
val_recall = recall_score(val_labels, val_predictions)
val_f1 = f1_score(val_labels, val_predictions)

print('Training Precision:', train_precision)
print('Training Recall:', train_recall)
print('Training F1-score:', train_f1)
print('Validation Precision:', val_precision)
print('Validation Recall:', val_recall)
print('Validation F1-score:', val_f1)

# Calculate clickbait rate in test set
test_predictions = pipeline.predict(test_data)
clickbait_rate = np.mean(test_predictions)
print('Clickbait Rate in Test Dataset:', clickbait_rate)
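
As a possible extension (not in the original script), the held-out test split created above can be scored with the same metrics, and the trained pipeline can be applied directly to new headline strings; the two headlines below are made up for illustration:

from sklearn.metrics import classification_report

# Precision, recall, and F1-score on the held-out test set
print(classification_report(test_labels, test_predictions, target_names=['not clickbait', 'clickbait']))

# Classify new, made-up headlines with the trained pipeline
new_headlines = ["10 Tricks Doctors Don't Want You To Know", "Central Bank Raises Interest Rates"]
print(pipeline.predict(new_headlines))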

Note: The dataset links above may change. If they are no longer accessible, you may need to locate the files from an alternative source and load a local copy instead, as sketched below.
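
For example, if the two headline files have been downloaded locally, a minimal sketch of a drop-in replacement for the two read_csv calls (the filenames below are hypothetical) is:

import pandas as pd

# Hypothetical local filenames; adjust to wherever the files were saved
with open('clickbait.txt', encoding='utf-8') as f:
    pos_lines = [line.strip() for line in f if line.strip()]
with open('not-clickbait.txt', encoding='utf-8') as f:
    neg_lines = [line.strip() for line in f if line.strip()]

# Build DataFrames with the same column layout as above (0 = text, 1 = label)
clickbait_pos = pd.DataFrame({0: pos_lines, 1: 1})
clickbait_neg = pd.DataFrame({0: neg_lines, 1: 0})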
