Generative Pre-trained Transformer, or GPT.
It all started in 2012 with AlexNet, written by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
Perceptrons took their first steps back in the 1950s and 60s, and in the 1980s Hinton and his colleagues showed how to train multi-layer networks with backpropagation. However, AlexNet was the first to show the world how powerful these methods could truly be.
ImageNet, a massive dataset of millions of labeled images compiled over years, had until 2012 been tackled by extracting object features by hand. Hand-crafted algorithms like SIFT, for instance, were still widely used when I first encountered neural networks in 2015.
However, AlexNet completely overturned this approach. At the annual ImageNet competition in 2012, AlexNet revolutionized the field, roughly halving the error rate of the previous approaches and winning the competition. It was a pivotal moment, ending the quiet period neural networks had endured since the 1980s. Just one year later, virtually every group in the competition was using neural networks.
This was a wake-up call for major companies, sparking a race to recruit machine learning and AI experts. Google acquired DeepMind and gave it autonomy, while Facebook established its own research division, Facebook AI Research (FAIR). Most companies began to embrace open-source research, realizing that open-source code fosters a self-sustaining ecosystem.
The more research these big companies shared, the more interaction it generated: more people benefited from the work, which in turn grew the pool of skilled people building on these platforms.
Within a few years of 2012, papers that would go on to collect hundreds of thousands of citations were being written. Researchers quickly discovered methods to train neural networks much faster, more effectively, and more systematically.
Notable advancements included:
• Principled parameter initialization schemes replacing naive random initialization, enabling faster convergence.
• Residual connections to allow gradients to flow seamlessly from output to input.
• Dropout, which in effect trains an ensemble of sub-networks and curbs overfitting.
• Advanced optimization algorithms like Adam replacing simpler gradient descent methods.
• ReLU (Rectified Linear Unit) replacing costly sigmoid functions, with further innovations like GELU (Gaussian Error Linear Unit) to address ReLU’s issues.
• Batch normalization and layer normalization, ensuring more stable training.
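To make a few of these concrete, here is a minimal PyTorch sketch, my own illustration rather than code from any of the papers, combining a residual connection, GELU, dropout, layer normalization, and the Adam optimizer in one small block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)        # layer normalization
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),                       # GELU instead of a costly sigmoid
            nn.Dropout(p_drop),              # dropout
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # residual connection: gradients can flow straight through the "+ x"
        return x + self.ff(self.norm(x))

model = ResidualBlock(dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # Adam optimizer

x = torch.randn(8, 64)
loss = model(x).pow(2).mean()   # dummy loss, just to show one training step
loss.backward()
optimizer.step()
```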
In addition, NVIDIA’s chips and CUDA architecture, alongside Meta’s PyTorch, became widely accessible. The stage was set for a revolution!
The Rise of Transformers
Google assigned a team a critical task: neural network-based translation—e.g., translating a sentence from French to English.
The team had everything they needed: NVIDIA chips, open-source code, and Google’s TensorFlow framework. However, they faced two major issues. First, existing architectures like LSTMs (Long Short-Term Memory networks) couldn’t fully leverage parallelism. Second, these models had a tendency to quickly forget previous context.
Fortunately, the team was in the right place at the right time. Some of them knew about attention, a mechanism introduced for neural machine translation in 2015, and understood how to parallelize it for language tasks.
Their initial idea was to generate multiple training samples. Take the word “Ahmet,” for example:
• Predicting each next letter creates training examples like:
A -> H, AH -> M, AHM -> E, AHME -> T.
This method became the foundation of one of the architecture’s core ideas: Next Token Prediction.
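In code, the idea looks roughly like this (a character-level sketch of my own for illustration, not the team's actual pipeline; real models operate on tokens rather than letters):

```python
# Every prefix of a sequence becomes a training example
# whose target is the next symbol.
def next_token_pairs(text: str):
    return [(text[:i], text[i]) for i in range(1, len(text))]

print(next_token_pairs("AHMET"))
# [('A', 'H'), ('AH', 'M'), ('AHM', 'E'), ('AHME', 'T')]
```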
The next step was parallelizing next-token prediction. The model could process multiple examples simultaneously within a given window size.
Finally, attention allowed the model to consider previous tokens while processing a given input. These innovations combined to create the greatest weapon of the revolution: the Transformer.
The paper introducing it, Attention Is All You Need, was one of history’s most cleverly titled works.
The Transformer was both efficient—easy to train—and simple to implement. With just a few hundred lines of code, you could write a Transformer model.
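To give a sense of that simplicity, here is a stripped-down sketch of causal self-attention, the core operation: a single head, no batching, my own illustration rather than the paper's reference code. Each position attends only to earlier positions, which is exactly what lets all the next-token predictions in a window be trained in parallel.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, dim); w_q / w_k / w_v: (dim, dim) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (x.shape[-1] ** 0.5)            # scaled dot-product
    mask = torch.triu(torch.ones_like(scores), 1).bool()
    scores = scores.masked_fill(mask, float("-inf"))    # hide future tokens
    return F.softmax(scores, dim=-1) @ v

seq_len, dim = 5, 16
x = torch.randn(seq_len, dim)
w = [torch.randn(dim, dim) for _ in range(3)]
out = causal_self_attention(x, *w)   # (5, 16): one output per position
```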
Scaling Up: Neural Scaling Laws
From this point forward, the bottleneck wasn't the architecture but the data. The more data, the better the Transformer. While everyone had some intuition about what was needed, the definitive answer came in 2020 from OpenAI, co-founded by Ilya Sutskever, who had played a key role in the 2012 revolution.
The Neural Scaling Laws were clear:
• More data
• Larger neural networks
• More compute
When these three elements were combined, models improved systematically. Remarkably, the performance of a model could be predicted before training, simply by considering the data size, model size, and computational power.
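The relationship is roughly a power law in each of the three quantities. A toy sketch with made-up coefficients (the real fitted exponents are in the paper) shows the flavor of such a prediction:

```python
# Loss falls off as a power law in compute (and similarly in data and
# parameters). The constant and exponent below are illustrative values
# only, not the paper's fitted numbers.
def predicted_loss(compute_pf_days: float, a: float = 2.5, b: float = 0.05) -> float:
    return a * compute_pf_days ** (-b)

for c in [1, 10, 100, 1000]:
    print(f"compute = {c:>5} PF-days -> predicted loss ~ {predicted_loss(c):.3f}")
```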
This paper triggered a cold war that is still ongoing today. Countries are now competing for bigger, more powerful chips. The tensions around Taiwan and the U.S.-driven restrictions on ASML's exports to China stem in large part from this. NVIDIA becoming the world's largest company is a direct result of this technological race.
This represents perhaps the greatest scientific breakthrough since the atomic bomb. While its true impact is hard to fully grasp today, its seriousness is undeniable.
Personal Notes
As humans, we stubbornly cling to the notion of being the center of the universe—a lesson we failed to learn from Galileo.
Let me give you a simple example. If you have a pet dog at home and it obeys your command to “give paw,” you call it smart. Now, where do you think GPT stands compared to you or your smart dog?
We must stop placing ourselves at the center of everything. Humanity has created the second most intelligent entity known to us (for now), and this is no longer a joke. Unlike before, we’ve reached something unprecedented. We’ve moved from apes learning a few hundred words to GPT, capable of explaining quantum mechanics in mind-boggling detail.
We need to be incredibly careful, as this is a heavy responsibility that most people still struggle to fully comprehend.
Some Predictions
1. Brainwave Decoding
We’re already reconstructing sentences from brainwaves. Even relatively primitive models like GPT-2 can accomplish this effectively. Although differences in individual brainwaves present a challenge, history suggests these differences may not be as significant as we think. With more brainwave data, this problem might be solvable.
(See: Apple AirPods sensor system)
The ability of corporations to read your thoughts isn't science fiction; it's a terrifyingly real possibility, and it would mean the end of personal freedom as we know it.
2. Embodied AI
Today’s AI systems lack physical bodies, but there’s no reason they couldn’t have one. With a few sensors attached to human limbs to collect motion data, robots can learn to walk flawlessly by training in the cloud, and this is already happening. Companies like Tesla may seem advanced, but they’re a few years behind. (See: Torso by Clone)
NVIDIA is now focused on creating digital twins for robots in the cloud. These twins will pave the way for mass-produced robots costing $10–20K that could outperform humans in speed, durability, and precision.
The implications are chilling. Imagine robots programmed with harmful biases like “Turks are bad”—the consequences could be catastrophic.
These predictions may sound fantastical, but sadly, I see no barriers preventing their realization. On the contrary, trillions of dollars are being invested, and nations are prepared to burn everything down to ensure this technology advances.
Onur
[This piece was translated from Turkish via GPT-4o]