The Engineering History of AI: Why Modern Systems Fail in Predictable Ways
The four foundational pillars of Artificial Intelligence: from Llull's Symbolism (1273) to the McCulloch-Pitts Neuron (1943).
Why I’m Writing This
As the author of Neural Networks and Deep Learning with Python, and founder of EmiTechLogic, I spend my days teaching engineers how to build AI systems.
But there’s a disconnect I keep seeing:
Students learn to train models. They don’t learn why those models fail.
Everyone wants to build the next ChatGPT. Nobody wants to study why expert systems collapsed in 1989, why perceptrons couldn’t solve XOR in 1969, or why symbolic AI required an “AI Winter” to course-correct.
But here’s what I’ve discovered after years of teaching neural networks and deploying LLM tutorials: modern AI systems fail for the exact same architectural reasons that symbolic AI failed 40 years ago. We’ve just moved the failure point.
This post traces the engineering constraints that have shaped machine intelligence for roughly 750 years, from Ramon Llull's 13th-century logic machines to GPT-4's hallucinations. Understanding these constraints isn't just academic history. It's practical knowledge that will help you debug your next production deployment.
Part 1: The Combinatorial Explosion—From Llull (1273) to LLM Token Generation (2024)
Ramon Llull’s Ars Magna: The First “AI” and Its Fatal Flaw
Before computers existed, a 13th-century philosopher named Ramon Llull built something remarkable: a mechanical system for generating knowledge.
His “Ars Magna” used rotating paper discs with fundamental concepts:
Truth, Goodness, Power, Wisdom, Justice, etc.
By rotating the discs, you could mechanically combine concepts:
Truth + Wisdom = Enlightened Understanding
Power + Justice = Righteous Authority
Goodness + Eternity = Divine Grace
This wasn't mysticism. It was combinatorial search, the same principle that underlies modern search algorithms and LLM token generation.
Llull’s insight: If thought is combining basic symbols, then a machine could theoretically generate all truths.
Llull’s problem: Combinatorial explosion.
With 9 concepts and 2-way combinations: 81 possibilities
With 9 concepts and 3-way combinations: 729 possibilities
With 10 concepts and 3-way combinations: 1,000 possibilities
The search space grows exponentially with the size of each combination. Without intelligent pruning, you generate mountains of nonsense mixed with occasional insight.
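A quick way to feel that growth is to count the combinations directly. The sketch below is a minimal illustration (the concept list is a stand-in, and the counting matches the ordered, repetition-allowed numbers above):

from itertools import product

concepts = ["Truth", "Goodness", "Power", "Wisdom", "Justice",
            "Will", "Virtue", "Glory", "Eternity"]   # 9 illustrative concepts

for k in (2, 3, 4):
    combos = list(product(concepts, repeat=k))   # ordered, repetition allowed
    print(f"{k}-way combinations of {len(concepts)} concepts: {len(combos):,}")
# 2-way: 81, 3-way: 729, 4-way: 6,561. None of them comes with a truth label.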
The Modern Parallel: LLM Token Generation
When I teach neural networks, students often ask: “Why do LLMs hallucinate?”
The answer traces back to Llull’s combinatorial problem.
Modern LLMs like GPT-4 or Claude don’t “understand” text—they predict the next most probable token from a vocabulary of 50,000+ tokens. At each step:
Current context: "The capital of France is"
Possible next tokens: "Paris" (high probability), "London" (low probability), "banana" (extremely low probability)
Selection: Sample from probability distribution
This is combinatorial search with probabilistic pruning. The model learned from training data which combinations are likely, but it has no “truth checker.”
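A toy version of that selection step. The three-token vocabulary and the logits are invented for illustration; a real model does this over 50,000+ tokens at every position:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative raw scores for the context "The capital of France is"
vocab  = ["Paris", "London", "banana"]
logits = np.array([6.0, 1.5, -4.0])             # what the network actually outputs
probs  = np.exp(logits) / np.exp(logits).sum()  # softmax turns scores into probabilities

print(dict(zip(vocab, probs.round(4))))
# Sampling navigates the probability space; nothing in this loop checks truth.
print([str(rng.choice(vocab, p=probs)) for _ in range(10)])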
The architectural constraint:
Llull had no pruning mechanism (generated everything)
LLMs have training data bias as crude pruning (generate probable things)
Neither system verifies truth—they just navigate possibility spaces
Why this matters for practitioners:
When you deploy an LLM in production, you’re not deploying a “knowledge base”—you’re deploying a probabilistic search engine traversing a high-dimensional space of token combinations.
If the statistically most probable next token is factually wrong (due to training data bias or noise), the model will generate it with complete confidence.
Practical implication: Never ask an LLM to generate factual information without external grounding (databases, APIs, retrieval systems). The architecture fundamentally prioritizes plausibility over veracity.
Part 2: Turing’s Universal Machine and the Platform-Independence of Intelligence
1936: Alan Turing Formalizes the Algorithm
Before Turing, computing machines were purpose-built hardware. A loom wove cloth. A calculator computed numbers. Each machine had one function.
Turing proved something profound: you could build a Universal Machine that could simulate any other machine if given the right instructions (software).
This introduced the concept of the Stored Program—the theoretical foundation for why:
Your GPU can render video games, then train neural networks, then mine cryptocurrency
The same iPhone runs Spotify, Instagram, and Claude
Intelligence can be “software” running on biological or silicon “hardware”
The insight: Intelligence is platform-independent. If thought is a series of state transitions, you can abstract it from biology.
This spawned the Physical Symbol System Hypothesis: any system that manipulates symbols can theoretically achieve human-level intelligence.
Architecture Demo: Finite State Machine (Turing Tape)
The Finite State Machine
Turing's "Universal Machine" abstraction: intelligence as symbol manipulation on an infinite tape. The demo steps a binary-increment algorithm through a transition table (state, read, write, move, next state).
Representation Gap
Notice how the machine doesn't "know" it's adding numbers. It simply follows a Physical Symbol System transition table. To the machine, '1' is not a value; it is a symbol trigger. The intelligence is in the software, abstracted from the silicon tape.
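Here is a minimal sketch of the same idea in code: a binary-increment machine that is nothing but a transition table plus a read/write head. The state names and rule encoding are my own illustration, not Turing's notation:

# (state, symbol_read) -> (symbol_to_write, head_move, next_state)
# The machine never "knows" it is adding 1; it only matches symbols to rules.
RULES = {
    ("carry", "1"): ("0", -1, "carry"),   # flip 1 -> 0 and keep carrying left
    ("carry", "0"): ("1",  0, "halt"),    # absorb the carry and stop
    ("carry", "_"): ("1",  0, "halt"),    # ran off the left edge: write a new leading 1
}

def run(tape_str):
    tape = ["_"] + list(tape_str)          # "_" is the blank symbol
    head, state = len(tape) - 1, "carry"   # start at the rightmost digit
    while state != "halt":
        write, move, state = RULES[(state, tape[head])]
        tape[head] = write
        head += move
    return "".join(tape).lstrip("_")

print(run("1011"))   # -> 1100  (11 + 1 = 12)
print(run("111"))    # -> 1000  (7 + 1 = 8)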
The Modern Reality: Abstraction Creates Distance from Truth
In my book “Neural Networks and Deep Learning with Python,” I emphasize that neural networks are universal function approximators—they can theoretically learn any mapping from input to output.
But “theoretically” does a lot of work in that sentence.
The problem with symbolic abstraction:
When you represent reality as symbols (whether LISP code in 1970 or embeddings in 2024), you create a representation gap between:
The symbols in your system
The actual reality they’re supposed to represent
Example from teaching neural networks:
I often have students build image classifiers. They train a model on ImageNet that achieves 95% accuracy. They’re thrilled—until they deploy it.
In production, the model:
Classifies a husky as a wolf (because training images of wolves often had snow backgrounds)
Fails on slightly rotated images
Confidently misclassifies adversarial examples
What happened?
The model learned to manipulate symbols (pixel patterns → class labels) without understanding the underlying reality. It learned correlations in training data, not causal relationships in the world.
This is Turing’s abstraction taken to its logical conclusion: when intelligence is pure symbol manipulation, it has no ground truth beyond the symbols themselves.
Practical implication:
Test your models on out-of-distribution data before production. The gap between “works on test set” and “works in reality” is the representation gap—and it’s where most deployed AI fails.
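One cheap pre-deployment check, sketched under the assumption that model is any classifier with a predict method and X_test is a batch of images shaped (N, H, W, C); the 90° rotation stands in for whatever distribution shift production will actually throw at you:

import numpy as np

def distribution_shift_check(model, X_test, y_test):
    """Compare in-distribution accuracy with accuracy under a simple known shift."""
    def accuracy(X, y):
        return float(np.mean(model.predict(X) == y))

    clean_acc   = accuracy(X_test, y_test)
    rotated_acc = accuracy(np.rot90(X_test, k=1, axes=(1, 2)), y_test)
    # A large gap here is the representation gap showing up before production does.
    print(f"clean: {clean_acc:.3f}   rotated: {rotated_acc:.3f}")
    return clean_acc, rotated_acc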
Part 3: Why Symbolic AI Failed (1980s) and Why LLMs Fail Differently (2024)
The Promise of Expert Systems
From the 1950s through 1980s, AI meant symbolic AI—encoding human expertise as IF-THEN rules in languages like LISP and Prolog.
The paradigm: Intelligence is rule-following. Encode expert knowledge, get expert performance.
Success stories:
MYCIN: Medical diagnosis (matched expert physicians)
XCON: Computer configuration for Digital Equipment Corporation (saved millions)
DENDRAL: Chemical structure analysis
The collapse:
By the late 1980s, these systems hit a wall. Three fundamental problems:
1. The Frame Problem
How many rules do you need to “make coffee”?
Use the coffee maker (not the toaster)
Use water (not gasoline)
Don’t start a fire
Gravity exists
Cups hold liquids
Hot surfaces burn
[… infinite implicit assumptions humans know]
You can’t encode common sense. The number of contextual rules is infinite.
2. The Knowledge Acquisition Bottleneck
Experts can’t articulate their tacit knowledge as rules:
“How do you diagnose this disease?”
“Well, it just ‘feels’ like pneumonia based on years of experience…”
You can’t program intuition.
3. Brittleness
If the system encounters anything even slightly outside its programmed rules, it doesn't degrade gracefully; it crashes or outputs nonsense.
There’s no “common sense” to fill gaps between logical nodes.
The Neural Network Solution (Sort Of)
When I teach backpropagation in my courses, I explain that neural networks “solve” these problems by:
Learning from examples instead of explicit rules (solves knowledge acquisition)
Generalizing from patterns instead of matching exact cases (reduces brittleness)
Handling noise and ambiguity through probabilistic outputs
But they introduce opposite problems:
Symbolic AI (1980s) vs. Neural Networks (2024):
Too rigid (can't handle anything outside its rules) vs. too fluid (can't maintain rigid constraints)
Interpretable (can trace logical reasoning) vs. black box (can't explain decisions)
Deterministic (same input → same output) vs. stochastic (same input → variable outputs)
Fails explicitly (crashes) vs. fails subtly (hallucinates)
The modern challenge:
In my work building tutorials for EmiTechLogic, I’ve seen this pattern repeatedly:
Students build a model that works beautifully in Jupyter notebooks. Then they try to deploy it and discover:
It can’t maintain business rules (too fluid)
It generates plausible but wrong answers (hallucination)
It’s inconsistent across runs (stochastic)
They can’t debug why it failed (black box)
The architectural insight:
We spent 40 years escaping symbolic AI’s rigidity. Now we’re spending the 2020s trying to add structure back into neural systems through:
Retrieval-Augmented Generation (RAG) – connecting to structured databases
Function calling – forcing models to use deterministic tools
Constitutional AI – embedding rules and values
Chain-of-thought – making reasoning explicit
We’re not moving away from symbolic AI—we’re integrating it with neural systems.
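A minimal sketch of that integration, using retrieval as the structural component. Here retrieve and llm_generate are placeholders for your vector store and model client, not any specific library's API:

def grounded_answer(question, retrieve, llm_generate, top_k=3):
    """Hybrid pipeline: a deterministic lookup step constrains probabilistic generation."""
    documents = retrieve(question, top_k=top_k)    # symbolic side: exact, inspectable retrieval
    context = "\n\n".join(documents)
    prompt = (
        "Answer using ONLY the sources below. "
        "If the sources do not contain the answer, say 'I don't know'.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm_generate(prompt)                    # neural side: fluent, flexible generation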
Part 4: The XOR Problem (1969) and Why LLMs Can’t Do Math (2024)
1969: Minsky and Papert’s Proof That Nearly Killed Neural Networks
In 1969, Marvin Minsky and Seymour Papert published “Perceptrons,” proving mathematically that single-layer perceptrons cannot solve XOR (exclusive OR).
The problem: XOR is not linearly separable. You can’t draw a single straight line to separate true from false outputs.
This proof effectively killed neural network funding for 20 years. Everyone assumed it was a fundamental limitation.
What they missed: Multi-layer networks CAN solve XOR. They just didn’t have backpropagation to train hidden layers yet.
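To make the hidden-layer point concrete, here is a two-layer network with hand-picked weights (not learned ones) that computes XOR using step activations:

import numpy as np

def step(z):
    return int(z > 0)   # threshold activation, McCulloch-Pitts style

def xor_net(x1, x2):
    x = np.array([x1, x2])
    h_or  = step(x @ np.array([1, 1]) - 0.5)    # hidden unit 1: OR
    h_and = step(x @ np.array([1, 1]) - 1.5)    # hidden unit 2: AND
    return step(h_or - 2 * h_and - 0.5)         # output: OR and not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
# No single-layer version of this exists: one straight line can't separate the outputs.
# The hidden layer is what buys the non-linearity.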
2024: The Same Linear Limitation Explains LLM Math Failures
When teaching neural networks, I always cover this history because it explains a modern problem students encounter constantly:
“Why is my LLM so bad at basic arithmetic?”
The answer traces back to the XOR problem—not because of linearity, but because of the deeper principle: some operations require fundamentally different computational structures than pattern matching.
How LLMs process numbers:
# When you give GPT-4: "What is 127 × 384?"
# It doesn't compute—it tokenizes:
tokens = ["What", "is", "127", "×", "384", "?"]
# "127" might tokenize as ["12", "7"] or ["127"] depending on the tokenizer
# The model then predicts: "What token sequence looks like a plausible answer?"
# It learned from training data what "reasonable answers" look like
# It does NOT perform symbolic manipulation: 127 × 384 = 48,768
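You can see the tokenization issue directly. This sketch assumes the tiktoken package is installed and uses the cl100k_base encoding as a representative GPT-style tokenizer; your model's tokenizer may split the digits differently:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("What is 127 × 384?")
print([enc.decode([i]) for i in ids])
# The digits come back as text fragments, not as the quantities 127 and 384.
# Whatever the model does next, it is doing it to strings.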
Math requires:
Precise sequential operations (multiply, then add, then carry)
Exact value preservation across steps
Deterministic symbolic manipulation
Algorithmic procedures
LLMs are trained for:
Approximate pattern recognition
Statistical plausibility
Continuous value spaces
Context-dependent generation
These are architecturally incompatible objectives.
Real example from my teaching:
I have students build a simple calculator using GPT-3.5:
prompt = "Calculate: 1234 + 5678"
response = call_gpt(prompt)
# Expected: "6912"
# Sometimes gets: "6912"
# Sometimes gets: "6911" or "6913" or "7012"
The model treats numbers as tokens to predict, not quantities to compute.
The solution (learned from XOR):
In 1986, backpropagation solved XOR by adding hidden layers—changing the architecture to match the problem.
In 2024, we solve math by adding deterministic computation—changing the architecture to match the problem:
# Don't do this:
result = llm.generate("Calculate 1234 + 5678")
# Do this:
def solve_math(prompt):
    # Use the LLM to parse intent and extract the numbers
    parsed = llm.extract_operation(prompt)
    # Use deterministic code for the computation
    result = eval(parsed.expression)  # Simplified; never eval untrusted text in production
    # Use the LLM to format the natural-language response
    return llm.format_response(result)
This is exactly what GPT-4’s Code Interpreter does—and why it’s reliable for math while raw GPT-4 isn’t.
The lesson: Just as perceptrons needed hidden layers for XOR, LLMs need external tools for operations that don’t match their architectural strengths.
Part 5: Why GPUs Changed Everything (2012) and What It Means for Cost (2024)
The Hardware Lottery: AlexNet and the GPU Revolution
When teaching the history of deep learning, I always emphasize 2012 as the inflection point—not because of algorithmic breakthroughs, but because of hardware.
Architecture Demo: Perceptron & GPU Economics
Perceptron Activation & GPU Parallelism
Visualizing the mathematical bottleneck: from 1986's algorithms to 2012's GPU revolution.
Single inference (the math): the perceptron equation y = σ(w₁x₁ + w₂x₂ + b).
Parallel throughput (the GPU): in 2012, AlexNet used thousands of small cores to process matrix math simultaneously ("The Bitter Lesson").
1986: The Algorithm. Backprop existed, but CPUs processed the math serially; multi-layer networks could theoretically solve complex problems, but training was prohibitively slow.
2012: The Revolution. AlexNet repurposed GPU shaders; training time dropped from weeks to hours. We stopped being smart and started being massive.
2024: The Cost. AI has marginal costs per inference. Unlike SaaS, every extra query scales compute cost linearly: scaling is an operational expense.
The constraint before 2012:
Backpropagation was discovered in 1986. Multi-layer networks could theoretically solve complex problems. But training was prohibitively slow on CPUs.
The breakthrough:
Alex Krizhevsky used NVIDIA GPUs (designed for video game rendering) to train AlexNet for ImageNet 2012. Training time dropped from weeks to hours.
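To get a feel for serial versus batched math, here is a CPU-only NumPy sketch. It demonstrates vectorization rather than a real GPU, so treat the timings as illustrative of the principle, not of AlexNet's actual speedup:

import time
import numpy as np

A = np.random.rand(300, 300)
B = np.random.rand(300, 300)

# Serial in spirit: compute the output one element at a time
t0 = time.perf_counter()
C_slow = np.zeros((300, 300))
for i in range(300):
    for j in range(300):
        C_slow[i, j] = float(np.dot(A[i, :], B[:, j]))
t_slow = time.perf_counter() - t0

# Batched: hand the whole matrix multiply to an optimized parallel kernel
t0 = time.perf_counter()
C_fast = A @ B
t_fast = time.perf_counter() - t0

print(f"element-by-element: {t_slow:.3f}s   batched: {t_fast:.4f}s   "
      f"same result: {np.allclose(C_slow, C_fast)}")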
Part 6: The Context Window Problem and Silent Truncation
A failure I ran into while building LLM tutorials for EmiTechLogic: a code-analysis workflow that asked a model to scan source files for bugs. It worked perfectly for files under 500 lines. Then a student submitted a 2,000-line file.
What I expected: an error message ("File too large").
What happened: the model analyzed the first 1,500 lines, silently ignored the rest, and confidently said "No issues found", even though the bug was on line 1,800.
The architectural constraint:
Transformers batch-process fixed-size windows. When you exceed the window:
Input: [8,500 tokens]
Context limit: 8,192 tokens
Processing: Truncate to 8,192 tokens (no warning)
Model: Generates based on incomplete input
The fix:
Always count tokens before sending (use the tiktoken library; see the sketch after this list)
Implement chunking for long inputs
Add validation to check if output references content from entire input
Use models with larger context windows when needed (but pay attention to cost)
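A minimal sketch of the first two fixes, assuming the tiktoken package and the 8,192-token limit from the example above; call_llm stands in for whatever model client you actually use:

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 8_192      # the limit from the example above
RESPONSE_BUDGET = 1_024    # leave room for the model's answer

def count_tokens(text):
    return len(ENC.encode(text))

def chunk_by_tokens(text, max_tokens):
    # Split into pieces that each fit inside the window
    ids = ENC.encode(text)
    return [ENC.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

def safe_submit(text, call_llm):
    budget = CONTEXT_LIMIT - RESPONSE_BUDGET
    if count_tokens(text) <= budget:
        return [call_llm(text)]
    # Chunk loudly instead of letting the API truncate silently
    return [call_llm(chunk) for chunk in chunk_by_tokens(text, budget)]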
Part 7: Why LLMs Hallucinate—The Architectural Explanation
This is the question I get most when teaching: “Why do LLMs confidently generate false information?”
The answer requires understanding the shift from symbolic to probabilistic AI.
Symbolic AI: Closed-World Assumption
Query: "Who won the 1987 World Series?"
System checks database:
- If found: "Minnesota Twins"
- If not found: "I don't know"
The system operates in a closed world—only facts in the database exist. If something isn’t there, the system says “I don’t know.”
Neural Networks: Open-World Assumption
Query: "Who won the 1987 World Series?"
System:
1. Converts query to vector embedding
2. Traverses high-dimensional space
3. Generates most probable token sequence
4. Outputs: "The [Minnesota/Chicago/Boston/...] [Twins/Cubs/Red Sox/...]"
The model lives in a continuous space where every concept is a point in a high-dimensional cloud. It doesn’t “know” facts—it predicts probable continuations.
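The two failure modes, in caricature. The FACTS dictionary and the candidate probabilities below are invented for illustration:

import random

# Closed world: the symbolic system only knows what is in its database.
FACTS = {"1987 World Series": "Minnesota Twins"}

def closed_world(query):
    return FACTS.get(query, "I don't know")

# Open world: the neural system always produces *something* probable.
def open_world(query):
    candidates = ["Minnesota Twins", "St. Louis Cardinals", "Detroit Tigers"]
    weights    = [0.70, 0.20, 0.10]   # illustrative learned probabilities
    return random.choices(candidates, weights=weights, k=1)[0]

print(closed_world("1987 World Series"))    # Minnesota Twins
print(closed_world("1993 World Series"))    # I don't know
print(open_world("1993 World Series"))      # a confident-sounding guess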
Why Confidence Doesn’t Mean Accuracy
When I explain this in my courses, I use this metaphor:
LLMs are like a student who:
Has read 10,000 textbooks (training data)
Remembers statistical patterns, not specific facts
When asked a question, generates what “sounds right” based on patterns
Has the same confident tone whether they know the answer or are guessing
The technical explanation:
# When you ask: "What is the capital of Bhutan?"
# The model doesn't query a fact database
# It predicts tokens:
P("Thimphu" | context) = 0.85 # High probability
P("Kathmandu" | context) = 0.08 # Lower probability
P("Bangkok" | context) = 0.03 # Low probability
# It samples from this distribution
# "Confidence" reflects probability distribution, not factual certainty
Why hallucinations happen:
Training data gaps: If the model never saw accurate information about topic X, it extrapolates from similar patterns
Conflicting information: Training data contains contradictory facts → model blends them
Problem: Plausibility ≠ truth.
Solution: External grounding, retrieval systems, validation.
Conclusion: The Pattern Across 750 Years
From Ramon Llull’s rotating discs (1273) to GPT-4 (2024), the same fundamental constraints have shaped machine intelligence:
The core challenge: Building systems that balance exploration (generating novel combinations) with verification (ensuring accuracy).
Symbolic AI leaned too far toward verification (brittle rules). Neural networks leaned too far toward exploration (creative hallucination).
The future—what I teach at EmiTechLogic and write about in my books—is integration: architecting the boundary between rigid constraints and flexible generation.
This isn’t just history. It’s the engineering foundation you need to build reliable AI systems in 2025.
For engineers looking to move from theory to implementation, I recommend these high-impact technical deep dives:
Build a Neural Network from Scratch – Master the underlying backpropagation calculus before you start relying on high-level APIs like PyTorch.
Optimizing Chunking for RAG Systems – Learn how to structure document retrieval so your LLM stays grounded in reality, solving the “context window” engineering hurdle.
Primary sources for the four pillars:
2. Logic (Hobbes/Leibniz)
Provides context on Thomas Hobbes' "Reasoning is but Reckoning" and Gottfried Wilhelm Leibniz's development of binary arithmetic and the universal calculus of reasoning.
3. Algorithm (Turing)
"Computing Machinery and Intelligence" (1950)
Turing's seminal paper proposing the 'Imitation Game' (Turing Test) and discussing the limits and capabilities of any computation expressible as a machine algorithm.
Search Term: Alan Turing "Computing Machinery and Intelligence" MIND 1950
4. Activation (McCulloch-Pitts)
“A Logical Calculus of the Ideas Immanent in Nervous Activity” (1943)
The groundbreaking paper that introduced the Formal Neuron Model, reducing the biological neuron to a mathematical “Threshold Gate,” the blueprint for all artificial neural networks.
Search Term: McCulloch and Pitts "A Logical Calculus of the Ideas Immanent in Nervous Activity" 1943
Frequently Asked Questions on the History of AI
1. What are the four core concepts that form the intellectual foundation of AI?
The four pillars are:
Symbolism (Llull): The idea of turning concepts into manipulable tokens.
Logic (Hobbes/Leibniz): Reducing thought to pure binary calculation (0s and 1s).
Algorithm (Turing): Defining the universal recipe for computation (sequential steps).
Activation (McCulloch-Pitts): Modeling the biological decision unit (the neuron).
2. How does Ramón Llull’s Ars Magna relate to modern AI?
Llull’s 13th-century machine was the first systematic attempt to mechanize knowledge by combining concepts (symbols) based on fixed rules. This is the ancestor of Symbolic AI and remains the theoretical basis for modern rule-based expert systems and customer service chatbots.
3. Why is Boolean Logic (0s and 1s) essential if modern AI uses complex math?
Boolean Logic, built on the binary arithmetic Leibniz developed and later formalized as an algebra by George Boole, is the absolute lowest level of all digital computation. Every complex instruction, from high-level code to deep network calculations, must ultimately be translated into sequences of simple, binary True/False switches. It is the fundamental language of all digital processors.
4. What crucial concept did Alan Turing establish with the Universal Turing Machine (UTM)?
The UTM established the concept of the universal algorithm. It provided the theoretical proof that one single machine could perform any computation that can be expressed as a finite, sequential set of logical instructions, thus defining the limits and possibilities of all future computer programs and AI.
5. How is the McCulloch-Pitts Formal Neuron the building block of Deep Learning?
McCulloch and Pitts successfully modeled the biological neuron as a Threshold Gate. This gate only “fires” (outputs a 1) if the combined input signals exceed a fixed value. This simple, all-or-nothing decision unit is the core foundation used to construct the large, interconnected layers found in all modern Neural Network Architectures.
Emmimal Alexander
Emmimal Alexander is an AI educator and the author of Neural Networks and Deep Learning with Python. She is the founder of EmiTechLogic, where she focuses on explaining how modern AI systems are built, trained, and deployed — and why they fail in real-world settings. Her work centers on neural networks, large language models, AI hallucination, and production engineering constraints, with an emphasis on understanding the architectural trade-offs behind modern AI. She is known for translating complex theoretical foundations into practical engineering insights grounded in real system behavior.