Top 10 Mathematical Definitions in Data Science, Illustrated
Mathematics is the backbone of data science, providing the foundational concepts that power models and algorithms. Understanding key mathematical definitions can significantly boost your data science journey. In this post, we’ll explore the first ten must-know mathematical definitions for building a strong data science foundation.
Linear regression is a way to predict a result (dependent variable) based on one or more factors (independent variables). Think of it like trying to predict how much a taxi ride will cost based on the distance traveled.
When you plot your data points on a graph, linear regression tries to find the best straight line that connects those points. This line helps make predictions for values that you don’t have yet. The goal is to minimize the errors (the difference between the predicted and actual values).
Let’s say you want to predict someone’s salary based on their years of experience. The more experience they have, the higher their salary is likely to be. The graph might look like this:
X-axis: Years of experience
Y-axis: Salary
If we plot these points, linear regression will create a straight line that best fits the data points. Once we have this line, we can predict the salary for any given number of years of experience.
The mathematical formula for simple linear regression is:
y = mx + b
Where:
y = the predicted value (dependent variable)
x = the independent variable (the input factor)
m = the slope of the line
b = the intercept (the value of y when x = 0)
In multiple linear regression (when you have more than one factor), the equation becomes:
y = m1x1 + m2x2 + … + mnxn + b
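Here is a minimal sketch of the multiple-feature case, using two made-up factors (years of experience and an education-level score); the data values are invented purely for demonstration.
import numpy as np
from sklearn.linear_model import LinearRegression
# Made-up data: each row is [years of experience, education level (1-3)]
X = np.array([[1, 1], [2, 1], [3, 2], [4, 2], [5, 3], [6, 3]])
# Corresponding salaries (in thousands)
y = np.array([30, 33, 38, 41, 48, 52])
model = LinearRegression()
model.fit(X, y)
print("Coefficients (m1, m2):", model.coef_)
print("Intercept (b):", model.intercept_)
For the rest of this section, we’ll stick to the simpler single-feature case.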
Let’s walk through a basic example to predict salary based on years of experience.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Years of experience (independent variable)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
# Corresponding salaries (dependent variable)
y = np.array([30, 32, 35, 37, 40, 45, 50, 55, 60, 65])
# Visualize the data points
plt.scatter(X, y, color='blue')
plt.xlabel('Years of Experience')
plt.ylabel('Salary (in thousands)')
plt.title('Salary vs Years of Experience')
plt.show()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Get the slope (m) and intercept (b)
print("Slope (m):", model.coef_[0])
print("Intercept (b):", model.intercept_)
Slope (m): 3.9568965517241383
Intercept (b): 22.86206896551724
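As a sanity check, you can reproduce a prediction by hand with y = mx + b, using the slope and intercept printed above:
# Manual prediction for 9 years of experience: y = m*x + b
x_value = 9
manual = model.coef_[0] * x_value + model.intercept_
print(f"Manual prediction for {x_value} years: {manual:.2f}")  # About 58.47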
# Predict salaries based on years of experience
predictions = model.predict(X_test)
# Compare actual and predicted values
for i in range(len(X_test)):
    print(f"Years of Experience: {X_test[i][0]} | Actual Salary: {y_test[i]} | Predicted Salary: {predictions[i]:.2f}")
Years of Experience: 9 | Actual Salary: 60 | Predicted Salary: 58.47
Years of Experience: 2 | Actual Salary: 32 | Predicted Salary: 30.78
# Plot the original data points
plt.scatter(X, y, color='blue')
# Plot the regression line
plt.plot(X, model.predict(X), color='red')
plt.xlabel('Years of Experience')
plt.ylabel('Salary (in thousands)')
plt.title('Linear Regression Model')
plt.show()
Linear regression is a simple yet powerful tool for predicting relationships between variables. In this example, we used years of experience to predict salary. By following these steps in Python, you can apply linear regression to your own data for forecasting, trend analysis, and decision-making.
Logistic regression is a machine learning technique used to classify data into categories. Unlike linear regression, which predicts continuous values (like sales or temperatures), logistic regression predicts probabilities.
Suppose you want to predict whether a student will pass an exam based on the number of hours they study. The result is either “Pass” or “Fail”—a categorical outcome. Logistic regression helps draw a decision boundary between these categories and predicts the probability that a student passes.
The key idea behind logistic regression is the logistic (or sigmoid) function:
σ(z) = 1 / (1 + e^(-z))
Where:
z = the linear combination of the inputs (for one feature, z = mx + b)
e = Euler’s number (approximately 2.718)
The sigmoid function transforms any input into a value between 0 and 1, representing a probability.
If the probability is greater than 0.5, the algorithm predicts “Pass.” Otherwise, it predicts “Fail.”
The prediction for logistic regression is:
P(y = 1 | x) = σ(mx + b) = 1 / (1 + e^(-(mx + b)))
Where:
P(y = 1 | x) = the probability that the outcome is 1 (e.g., “Pass”) for input x
m and b = the weight and intercept learned during training
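To make the sigmoid concrete, here is a minimal sketch of the function in plain NumPy, independent of scikit-learn:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Large negative inputs map near 0, large positive inputs map near 1
print(sigmoid(-5), sigmoid(0), sigmoid(5))  # About 0.0067, 0.5, 0.9933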
Let’s predict whether a student passes or fails based on study hours.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
# Study hours and corresponding pass/fail outcomes
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1) # Study hours
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1]) # 0 = Fail, 1 = Pass
# Visualize the data
plt.scatter(X, y, color='blue')
plt.xlabel('Study Hours')
plt.ylabel('Pass/Fail (0 = Fail, 1 = Pass)')
plt.title('Study Hours vs Pass/Fail')
plt.show()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict pass/fail for the test set
predictions = model.predict(X_test)
# Compare actual and predicted outcomes
for i in range(len(X_test)):
    print(f"Study Hours: {X_test[i][0]} | Actual Outcome: {y_test[i]} | Predicted Outcome: {predictions[i]}")
# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
print("Model Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
Study Hours: 9 | Actual Outcome: 1 | Predicted Outcome: 1
Study Hours: 2 | Actual Outcome: 0 | Predicted Outcome: 0
Model Accuracy: 1.0
Confusion Matrix:
[[1 0]
 [0 1]]
# Plot the sigmoid curve
X_range = np.linspace(0, 12, 100).reshape(-1, 1)
predicted_probabilities = model.predict_proba(X_range)[:, 1]
plt.scatter(X, y, color='blue')
plt.plot(X_range, predicted_probabilities, color='red')
plt.xlabel('Study Hours')
plt.ylabel('Probability of Passing')
plt.title('Sigmoid Curve for Logistic Regression')
plt.show()
Logistic regression is great for solving classification problems where the outcome is binary (yes/no, pass/fail, etc.). In this example, we used study hours to predict whether a student passes or fails. The sigmoid function plays a key role in converting predictions to probabilities.
Gradient descent is an optimization technique used to help machine learning models find the best parameters (like weights and biases) by minimizing an error function (also known as a loss function).
In machine learning, models are trained by minimizing errors between predicted and actual outputs. Gradient descent finds the optimal values for weights and biases to reduce these errors.
To visualize gradient descent, think of standing on top of a hill and trying to reach the bottom. You keep moving in the direction where the slope is steepest. This is what gradient descent does—it moves in the direction that decreases the error the fastest.
Given a loss function L(w), gradient descent updates the weights using the following rule:
w = w - α × (dL/dw)
Where:
α = the learning rate (how big a step to take)
dL/dw = the gradient of the loss with respect to the weight w
Let’s minimize a simple quadratic function: L(w) = (w - 3)^2
The goal is to find the value of w that minimizes the loss (which is 3 in this case).
import numpy as np
import matplotlib.pyplot as plt
# Loss function: (w - 3)^2
def loss(w):
    return (w - 3) ** 2
# Gradient of the loss function
def gradient(w):
    return 2 * (w - 3)
# Parameters
learning_rate = 0.1
iterations = 20
w = 0 # Initial guess
loss_history = []
# Gradient descent loop
for i in range(iterations):
    grad = gradient(w)
    w -= learning_rate * grad  # Update weight
    loss_history.append(loss(w))
    print(f"Iteration {i+1}: w = {w:.4f}, Loss = {loss(w):.4f}")
# Plot the loss over iterations
plt.plot(range(iterations), loss_history, marker='o')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.title('Loss Reduction Using Gradient Descent')
plt.show()
In machine learning, gradient descent is essential for training models like linear regression, neural networks, and support vector machines. Understanding how it works will give you the confidence to build more efficient models!
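To connect this back to the first section, here is a minimal sketch (on made-up data) of gradient descent fitting the slope and intercept of a simple linear regression by minimizing the mean squared error:
import numpy as np
# Toy data that exactly follows y = 2x + 1
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
m, b = 0.0, 0.0  # Initial guesses for slope and intercept
learning_rate = 0.01
for i in range(1000):
    error = (m * X + b) - y
    # Gradients of the mean squared error with respect to m and b
    grad_m = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
print(f"m = {m:.3f}, b = {b:.3f}")  # Should approach m = 2, b = 1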
The Normal Distribution (also called the Gaussian Distribution) is a bell-shaped curve used in statistics and data science to represent real-world data distributions. It’s called “normal” because many natural phenomena, such as heights, blood pressure, and test scores, follow this pattern.
The probability density function (PDF) for a normal distribution is given by:
f(x) = (1 / (σ√(2π))) × e^(-(x - μ)² / (2σ²))
Where:
μ = the mean (the center of the distribution)
σ = the standard deviation (the spread of the distribution)
x = the value being evaluated
import numpy as np
import matplotlib.pyplot as plt
# Generate data
mean = 0
std_dev = 1
data = np.random.normal(mean, std_dev, 1000)
# Plot the histogram
plt.hist(data, bins=30, density=True, color='skyblue', alpha=0.7)
# Plot the PDF curve
x = np.linspace(min(data), max(data), 1000)
pdf = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
plt.plot(x, pdf, color='red', label='PDF')
plt.title('Normal Distribution (Mean=0, Std Dev=1)')
plt.legend()
plt.show()
# Generate random values from a normal distribution
data = np.random.normal(loc=10, scale=2, size=1000) # Mean=10, Std Dev=2
print(f"Mean: {np.mean(data):.2f}, Std Dev: {np.std(data):.2f}")
from scipy.stats import norm
# Calculate probability of value being less than or equal to 1.5 in a standard normal distribution
prob = norm.cdf(1.5)
print(f"Probability of being less than or equal to 1.5: {prob:.4f}")
Mean: 10.10, Std Dev: 2.08
Probability of being less than or equal to 1.5: 0.9332
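The same CDF lets you verify the well-known 68-95-99.7 rule for a standard normal distribution:
# Probability of falling within 1, 2, and 3 standard deviations of the mean
for k in [1, 2, 3]:
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {prob:.4f}")
# Prints approximately 0.6827, 0.9545, 0.9973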
By understanding and visualizing the normal distribution, you’ll gain better insights into your datasets and make more informed decisions in data science tasks.
The Z-score (also called a standard score) tells you how far a data point is from the mean of a dataset in terms of standard deviations.
If a data point has:
A Z-score of 0, it sits exactly at the mean
A positive Z-score, it lies above the mean
A negative Z-score, it lies below the mean
Z-scores help standardize different datasets, making them easier to compare. They are commonly used in outlier detection, standardizing features before modeling, and comparing values drawn from different distributions.
The formula is:
z = (x - μ) / σ
Where:
x = the data point
μ = the mean of the dataset
σ = the standard deviation of the dataset
Suppose the average height of a group is 170 cm with a standard deviation of 10 cm. If a person’s height is 185 cm, what is their Z-score?
z = (185 - 170) / 10 = 1.5
This means the person’s height is 1.5 standard deviations above the mean.
import numpy as np
from scipy.stats import zscore
# Sample dataset
data = [170, 165, 180, 175, 160, 185]
# Calculate Z-scores using SciPy
z_scores = zscore(data)
print("Z-scores:", z_scores)
threshold = 2 # Common threshold for outliers
outliers = [data[i] for i in range(len(z_scores)) if abs(z_scores[i]) > threshold]
print("Outliers:", outliers)
mean = np.mean(data)
std_dev = np.std(data)
data_point = 185
z_score = (data_point - mean) / std_dev
print(f"Z-score for {data_point}: {z_score:.2f}")
Z-scores: [-0.29277002 -0.87831007 0.87831007 0.29277002 -1.46385011 1.46385011]
Outliers: []
Z-score for 185: 1.46
A p-value is used in hypothesis testing to measure how compatible the observed data are with the null hypothesis.
The p-value helps answer: If the null hypothesis were true, what is the probability of observing data as extreme as this?
In hypothesis testing, the p-value is calculated based on a test statistic and its distribution. Here’s the general process:
1. State the null hypothesis (H0) and the alternative hypothesis (H1).
2. Compute a test statistic from the sample data.
3. Find the probability, assuming H0 is true, of a test statistic at least as extreme as the one observed. That probability is the p-value.
4. Compare the p-value to a significance level (commonly 0.05) to decide whether to reject H0.
The test statistic varies depending on the type of hypothesis test (t-test, z-test, etc.). Below is the formula for the Z-test (used when the population variance is known):
z = (x̄ - μ) / (σ / √n)
Where:
x̄ = the sample mean
μ = the population mean under the null hypothesis
σ = the population standard deviation
n = the sample size
After calculating the test statistic, the p-value is obtained from the cumulative distribution function (CDF) of the test statistic’s distribution.
For a Z-test (standard normal distribution), with Φ denoting the standard normal CDF:
Two-tailed test: p = 2 × (1 - Φ(|z|))
Right-tailed test: p = 1 - Φ(z)
Left-tailed test: p = Φ(z)
Suppose a company claims that its new weight loss pill causes an average weight loss of 5 kg. You conduct an experiment and find an average weight loss of 4 kg in a sample of 50 users. How do you know if the difference is significant or just random?
After performing the statistical test, you get a p-value of 0.03.
Since 0.03 < 0.05, you have strong evidence to reject the null hypothesis, suggesting the pill may not cause a 5 kg weight loss.
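Here is a minimal sketch of how such a p-value could come out of a Z-test; the population standard deviation of 3.26 kg is a made-up assumption, chosen only so the numbers line up with this example.
import numpy as np
from scipy.stats import norm
sample_mean = 4.0    # Observed average weight loss
claimed_mean = 5.0   # Null hypothesis (the company claim)
sigma = 3.26         # Hypothetical population standard deviation (assumed)
n = 50               # Sample size
z = (sample_mean - claimed_mean) / (sigma / np.sqrt(n))
p_value = 2 * norm.cdf(-abs(z))  # Two-tailed p-value
print(f"Z = {z:.2f}, p-value = {p_value:.3f}")  # About -2.17 and 0.03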
import numpy as np
from scipy.stats import ttest_1samp
# Sample data (weight loss results)
data = [4.1, 4.3, 3.8, 4.0, 4.5, 3.9, 4.2, 4.0]
# Null hypothesis mean (company claim: 5 kg weight loss)
population_mean = 5
# Perform the one-sample t-test
statistic, p_value = ttest_1samp(data, population_mean)
print(f"T-statistic: {statistic:.2f}, P-value: {p_value:.4f}")
# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis: The pill may not cause a 5 kg weight loss.")
else:
    print("Fail to reject the null hypothesis: The pill may indeed cause a 5 kg weight loss.")
T-statistic: -11.22, P-value: 0.0000
Reject the null hypothesis: The pill may not cause a 5 kg weight loss.
Suppose you conduct a test and get a p-value of 0.01: if the null hypothesis were true, there would be only a 1% chance of observing data this extreme. Since 0.01 is below the usual 0.05 threshold, you would reject the null hypothesis.
This simple interpretation can guide better decisions when analyzing data and performing hypothesis tests in real-world scenarios.
Correlation tells us how two variables are related to each other and whether they move in the same direction or opposite directions.
The correlation coefficient r is calculated using this formula:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √(Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²)
Where:
xᵢ, yᵢ = the individual data points
x̄, ȳ = the means of x and y
r ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = {
    'Temperature': [30, 32, 33, 35, 36, 37, 38],
    'IceCreamSales': [100, 150, 200, 300, 400, 500, 600]
}
df = pd.DataFrame(data)
correlation = df.corr()
print("Correlation Matrix:")
print(correlation)
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
Correlation Matrix:
               Temperature  IceCreamSales
Temperature       1.000000       0.972134
IceCreamSales     0.972134       1.000000
If the correlation between Temperature and Ice Cream Sales is 0.97, it means a strong positive relationship—as temperature increases, ice cream sales also increase.
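You can verify the coefficient by applying the formula directly in NumPy:
x = df['Temperature'].to_numpy(dtype=float)
y = df['IceCreamSales'].to_numpy(dtype=float)
numerator = np.sum((x - x.mean()) * (y - y.mean()))
denominator = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
print(f"Pearson r: {numerator / denominator:.6f}")  # 0.972134, matching df.corr()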
Covariance is a statistical measure that shows how two random variables move together. It tells us whether increases in one variable are associated with increases or decreases in another.
The sample covariance is calculated as:
Cov(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
Where:
xᵢ, yᵢ = the individual data points
x̄, ȳ = the means of X and Y
n = the number of data points
import numpy as np
import pandas as pd
data = {
    'Advertising': [100, 150, 200, 250, 300],
    'Sales': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)
cov_matrix = df.cov()
print("Covariance Matrix:")
print(cov_matrix)
Covariance Matrix:
             Advertising   Sales
Advertising       6250.0  1250.0
Sales             1250.0   250.0
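The positive covariance of 1250 confirms that Advertising and Sales move together. NumPy returns the same matrix, since np.cov also uses the sample (n - 1) formula by default:
print(np.cov(df['Advertising'], df['Sales']))
# [[6250. 1250.]
#  [1250.  250.]]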
Entropy is a way to measure how random or uncertain something is. In simple terms, it tells us how messy or unpredictable data is.
Imagine you have two jars of candy:
Jar A contains only one flavor of candy.
Jar B contains an even mix of many different flavors.
If someone asks you to pick a candy from each jar, which jar is easier to predict? Jar A: you always know what you’ll get. Jar B is far less predictable, so it has higher entropy.
Entropy helps in decision-making tasks like building decision trees.
The entropy formula is:
H(X) = -Σ p(xᵢ) × log₂(p(xᵢ))
Where:
p(xᵢ) = the probability of outcome xᵢ
The sum runs over all possible outcomes; with log base 2, the result is measured in bits.
You have a fair coin: the probability of heads is 0.5 and the probability of tails is 0.5.
Entropy = -(0.5 × log₂(0.5) + 0.5 × log₂(0.5)) = -(0.5 × (-1) + 0.5 × (-1)) = 1
The answer is 1 bit, meaning the result is maximally random (the highest possible uncertainty for two outcomes).
Let’s calculate entropy in Python.
from scipy.stats import entropy
import numpy as np
probabilities = [0.5, 0.5] # Fair coin probabilities
entropy_value = entropy(probabilities, base=2)
print(f"Entropy: {entropy_value} bits")
Entropy: 1.0 bits
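For comparison, a biased coin is easier to predict, so its entropy is lower:
# A biased coin (90% heads, 10% tails) is more predictable than a fair one
biased_probabilities = [0.9, 0.1]
print(f"Entropy: {entropy(biased_probabilities, base=2):.4f} bits")  # About 0.4690 bits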
The F1 Score is a way to measure how good a classification model is at making predictions. It’s used when we care about both Precision (how many of the predicted positives are actually correct) and Recall (how many of the actual positives were detected by the model).
Since Precision and Recall can sometimes pull in different directions, the F1 Score provides a single number that balances both. It is the harmonic mean of the two:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Sometimes models can predict too many false positives or false negatives. If your task is critical, like medical diagnosis, you can’t afford mistakes in either direction. The F1 Score helps you balance both errors.
Suppose you have a spam filter with, say, a Precision of 0.7 (70% of the emails it flags are actually spam) and a Recall of 0.875 (it catches 87.5% of the real spam):
F1 = 2 × (0.7 × 0.875) / (0.7 + 0.875) ≈ 0.778
So, the F1 Score is 0.778, indicating a good balance between Precision and Recall.
from sklearn.metrics import f1_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0] # Actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0] # Model predictions
score = f1_score(y_true, y_pred)
print(f"F1 Score: {score}")
F1 Score: 0.75
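You can confirm the score by checking Precision and Recall separately:
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision}, Recall: {recall}")  # Both 0.75 for this data
# F1 = 2 * (0.75 * 0.75) / (0.75 + 0.75) = 0.75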
Mastering these 10 important definitions will give you a solid starting point in data science. Each concept plays a crucial role in various tasks, from building models to interpreting results. In our next blog post, we’ll dive into ten more important mathematical definitions to broaden your understanding of data science. Stay tuned!