System Architecture of AI Content Detector: Client to AI Model
Artificial Intelligence (AI) has revolutionized how we create, consume, and validate content. With the rise of AI-generated articles, essays, and even images, it’s becoming important to know whether content is AI-generated or written by a human. This step-by-step guide will teach you how to build an AI content detector using Python, even if you’re a beginner.
Now, let's build our AI Content Detector.
An AI content detector is a tool that analyzes text to predict whether it was generated by a human or an AI. It uses Natural Language Processing (NLP) techniques and machine learning models trained to differentiate between patterns of human writing and AI-generated text.
AI-generated content is growing rapidly, but it's not always easy to spot. Detecting AI content can help educators check submissions for originality, publishers verify authorship, and platforms maintain trust in the content they host.
Before we begin, ensure you have Python 3 installed, a code editor, and basic familiarity with Python syntax.
First, install the necessary libraries. We'll use scikit-learn, NLTK, pandas, and NumPy.
Run the following commands in your terminal:
pip install scikit-learn nltk pandas numpy
To train your AI content detector, you'll need two types of data: samples of human-written text and samples of AI-generated text.
Save these datasets as separate .csv files, like human_content.csv and ai_content.csv, with a column for the text and another column for the label (human or ai), as in the example below.
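For instance, each file might look like this (the rows are hypothetical; the column names text and label are what the code below expects):
text,label
"The committee met on Tuesday to review the budget.",human
"Certainly! Here are five tips to improve your writing.",ai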
Before training the model, you need to clean the data. Here’s a simple Python script to preprocess it:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
pandas: Used to read and manipulate structured data in CSV format.
nltk.corpus.stopwords: Provides a collection of common stopwords (words like "is", "the", and "and") to remove from the text.
nltk.tokenize.word_tokenize: Splits text into individual words (tokens).
nltk: The Natural Language Toolkit (NLTK) for text processing.
nltk.download('punkt')
nltk.download('stopwords')
human_data = pd.read_csv('human_content.csv')
ai_data = pd.read_csv('ai_content.csv')
Reads two CSV files containing text data for human-generated and AI-generated content using pd.read_csv().
data = pd.concat([human_data, ai_data])
Combines the two DataFrames into a single dataset using pd.concat().
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return ' '.join(filtered_tokens)
text is the string to be cleaned.
It is lowercased with text.lower() to ensure consistent processing.
It is split into tokens with word_tokenize().
Only alphanumeric tokens are kept, via word.isalnum().
Stopwords are removed, via word not in stop_words.
The remaining tokens are rejoined into a single string with ' '.join(filtered_tokens).
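Before applying it to the whole dataset, a quick sanity check on one sentence (a made-up example) shows the effect:
print(preprocess_text("The quick brown fox jumps over the lazy dog!"))
# Output: quick brown fox jumps lazy dog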
data['text'] = data['text'].apply(preprocess_text)
Applies the preprocess_text() function to each row in the text column of the combined DataFrame.
We'll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text into numerical features and train a classifier using the Logistic Regression model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
TfidfVectorizer: Converts text data into numerical features using the TF-IDF technique.
train_test_split: Splits the dataset into training and testing sets.
LogisticRegression: A classification algorithm used to predict the label of input data.
accuracy_score: Calculates the accuracy of the model.
classification_report: Provides detailed evaluation metrics (precision, recall, and F1-score).
X = data['text']
y = data['label']
X: Contains the text data as input features.
y: Contains the corresponding labels ('human' or 'ai', from the label column of the CSV files).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
TfidfVectorizer(): Transforms the text data into a numerical matrix using TF-IDF (Term Frequency-Inverse Document Frequency).
How TF-IDF works: term frequency measures how often a word appears in a document, while inverse document frequency down-weights words that appear across many documents, so the final score highlights words that are distinctive to a given document.
fit_transform(X): Fits the vectorizer on the text and transforms it into a sparse matrix representation.
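To see what this produces, here is a minimal sketch on a toy corpus (the three sentences are made up purely for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["humans write freely", "models write text", "humans read text"]
v = TfidfVectorizer()
m = v.fit_transform(toy)
print(v.get_feature_names_out())  # the learned vocabulary (6 unique terms)
print(m.shape)                    # (3, 6): 3 documents x 6 terms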
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
test_size=0.2: Reserves 20% of the data for testing.
random_state=42: Ensures reproducibility of results.
model = LogisticRegression()
model.fit(X_train, y_train)
Trains the Logistic Regression model by calling fit() on the training data (X_train, y_train).
y_pred = model.predict(X_test)
Generates predictions on the test set (X_test).
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
accuracy_score: Computes the ratio of correctly predicted labels to the total number of predictions.
classification_report: Provides additional metrics (precision, recall, and F1-score for each class). Sample output:
Accuracy: 0.92
              precision    recall  f1-score   support

          ai       0.91      0.93      0.92       120
       human       0.93      0.91      0.92       110
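As a reminder of what these columns mean (standard definitions, not specific to this model): precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall), where TP, FP, and FN are the true positives, false positives, and false negatives for each class.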
You can now input a piece of text to check if it’s AI-generated or human-written.
def predict_content(text):
    processed_text = preprocess_text(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)
    return "AI-generated" if prediction[0] == 'ai' else "Human-written"
Takes a text string as input.
Calls the preprocess_text() function to clean it (removes stopwords, converts to lowercase, and keeps only alphanumeric tokens).
Vectorizes it with vectorizer.transform(), which applies the trained TF-IDF vectorizer to transform the text into a format suitable for the model.
Classifies it with model.predict(), which outputs the predicted label ('ai' or 'human').
Returns "AI-generated" if the prediction is 'ai', otherwise "Human-written".
sample_text = "This is an example sentence generated by AI."
print(predict_content(sample_text))
predict_content(sample_text) processes and classifies the text as either AI-generated or human-written. Output:
AI-generated
vectorizer.transform([processed_text]) ensures the text is converted to the same TF-IDF representation as during training. This function can be expanded by returning a confidence score alongside the label (as sketched below), validating input length, or handling batches of texts at once.
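For instance, here is a minimal sketch of a confidence-aware variant (the function name is hypothetical; predict_proba and classes_ are standard attributes of scikit-learn's LogisticRegression):
def predict_content_with_confidence(text):
    processed = preprocess_text(text)
    vec = vectorizer.transform([processed])
    proba = model.predict_proba(vec)[0]      # probability for each class
    label = model.classes_[proba.argmax()]   # class with the highest probability
    verdict = "AI-generated" if label == 'ai' else "Human-written"
    return verdict, float(proba.max())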
To use your detector later, save the model and vectorizer using joblib:
import joblib
joblib: A library that efficiently serializes (saves) and deserializes (loads) large Python objects like machine learning models.
joblib.dump(model, 'ai_content_detector.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(object, filename): Saves the specified object to a file.
'ai_content_detector.pkl': The saved file for the Logistic Regression model.
'tfidf_vectorizer.pkl': The saved file for the TF-IDF vectorizer.
Later, you can load the saved components like this:
model = joblib.load('ai_content_detector.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')
joblib.load(filename): Reloads the previously saved object.
Ensure everything works after reloading:
sample_text = "This is a test sentence."
processed_text = preprocess_text(sample_text)
vectorized_text = vectorizer.transform([processed_text])
prediction = model.predict(vectorized_text)
print("Prediction after reloading:", "AI-generated" if prediction[0] == 'ai' else "Human-written")
Prediction after reloading: AI-generated
joblib is faster and more efficient than pickle for objects that contain large NumPy arrays (like the ones inside machine learning models). Keep the saved model files (.pkl) in a secure location, and version them if you plan to use multiple models for different tasks.
To make the detector more accurate, train on larger and more diverse datasets, tune the TF-IDF settings (for example, include word n-grams), and experiment with stronger classifiers. One possible TF-IDF tweak is sketched below.
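A minimal sketch of such a tweak (these parameter values are illustrative, not tuned):
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)  # unigrams + bigrams, capped vocabulary
Retrain the model after changing the vectorizer, since the feature space changes.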
To integrate the AI Content Detector with a front-end, you’ll need to build a complete web application. Below is a step-by-step guide to achieve this:
Flask is a lightweight Python web framework that’s perfect for small projects.
pip install Flask
Create a file called app.py:
from flask import Flask, request, jsonify
import joblib
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
# Load the pre-trained model and vectorizer
model = joblib.load('ai_content_detector.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')
app = Flask(__name__)
# Text preprocessing function
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return ' '.join(filtered_tokens)
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    text = data.get('text')
    if not text:
        return jsonify({"error": "No text provided"}), 400

    # Preprocess and predict
    processed_text = preprocess_text(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)
    result = "AI-generated" if prediction[0] == 'ai' else "Human-written"
    return jsonify({"prediction": result})

if __name__ == '__main__':
    app.run(debug=True)
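One practical caveat: if you open index.html directly from disk, the browser treats the request to http://127.0.0.1:5000 as cross-origin and may block it. A common fix (an assumption here, not part of the original setup) is the flask-cors extension, installed with pip install flask-cors and enabled with two lines in app.py:
from flask_cors import CORS
CORS(app)  # allow the static front-end to call the /predict endpoint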
Create a file called index.html in the same directory:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Content Detector</title>
<style>
body {
font-family: Arial, sans-serif;
max-width: 600px;
margin: 50px auto;
text-align: center;
}
textarea {
width: 100%;
height: 150px;
}
button {
margin-top: 10px;
padding: 10px 20px;
font-size: 16px;
}
#result {
margin-top: 20px;
font-size: 18px;
font-weight: bold;
}
</style>
</head>
<body>
<h1>AI Content Detector</h1>
<p>Enter text to check if it's AI-generated or human-written:</p>
<textarea id="inputText" placeholder="Type your content here..."></textarea>
<br>
<button onclick="checkContent()">Check Content</button>
<div id="result"></div>
<script>
async function checkContent() {
const text = document.getElementById('inputText').value;
if (!text) {
alert("Please enter some text!");
return;
}
const response = await fetch('http://127.0.0.1:5000/predict', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text })
});
const result = await response.json();
document.getElementById('result').innerText = result.prediction || "Error in prediction!";
}
</script>
</body>
</html>
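You can also test the back-end directly from the terminal, without the front-end, using curl (assuming the Flask server is running locally on port 5000):
curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"text": "This is an example sentence generated by AI."}'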
To make our AI text detector more robust, we can add several advanced features, such as confidence scores, language detection with langdetect (so the English-only model rejects other languages), and spelling or grammar analysis with a spellchecker. A language-check sketch follows below.
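A minimal sketch of the language check, assuming the langdetect package is installed (pip install langdetect); the wrapper function name is hypothetical:
from langdetect import detect

def predict_content_checked(text):
    # The model was trained on English text only, so reject other languages.
    if detect(text) != 'en':
        return "Unsupported language"
    return predict_content(text)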
When developing any web application, it's very important to consider security (validate and sanitize user input, serve over HTTPS), performance (load the model once at startup, as the Flask app above does), and scalability (run behind a production WSGI server such as Gunicorn instead of Flask's built-in debug server; see the sketch below).
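For example, a common production setup (a sketch, assuming Gunicorn; not part of the original code) replaces app.run(debug=True) with a WSGI server:
pip install gunicorn
gunicorn -w 4 -b 127.0.0.1:5000 app:app  # 4 worker processes serving the Flask app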
Your application should be accessible to everyone, especially users with disabilities. Additionally, consider the ethical implications of AI text detection, including user privacy and data protection.
Creating an AI content detector in Python is both educational and practical. This tutorial covered the basics, but you can take it further by integrating it into a web app or API. If you found this guide helpful, share it with others and explore more resources on our website.
OpenAI API Documentation
Flask Documentation
HTML: MDN Web Docs
CSS: MDN Web Docs
JavaScript: MDN Web Docs
spaCy Documentation
langdetect Documentation
Spellchecker Documentation
Web Accessibility: MDN Web Docs