AI language models have undergone a radical architectural transformation since 2019, moving from simple decoder stacks to complex hybrid systems combining dense and sparse components. Sebastian Raschka, PhD, has compiled a comprehensive visual reference documenting these shifts in his LLM Architecture Gallery, showing how the fundamental building blocks of artificial intelligence have evolved to support reasoning, efficiency, and scale.

How Have AI Model Architectures Actually Changed?

The evolution from GPT-2 to modern reasoning models reveals a fundamental shift in design philosophy. GPT-2, released in 2019, used a straightforward recipe: 1.5 billion parameters arranged in a classic decoder stack with multi-head attention (MHA), dropout, GELU activation, and LayerNorm. This was the baseline dense architecture that dominated early large language models (LLMs), AI systems trained on vast amounts of text to predict and generate human language.

Today's models tell a different story. The architecture gallery documents how modern systems have adopted several key innovations that fundamentally change how models process information. Rather than relying solely on dense layers, where every parameter is used for every token, newer designs incorporate mixture-of-experts (MoE) approaches, in which specialized sub-networks handle different types of inputs. DeepSeek V3, for example, contains 671 billion total parameters but activates only 37 billion per token during inference, making it practical to run despite its enormous size.

What Specific Design Choices Make Reasoning Models Different?

The architectural differences between standard models and reasoning-focused variants reveal why some AI systems can now show their work. DeepSeek R1, built on the V3 architecture, keeps the same structural foundation as its predecessor but changes the training recipe to emphasize reasoning-oriented learning.
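The "total versus active parameters" idea can be made concrete with a toy mixture-of-experts layer. The sketch below routes each token to its top-2 of 8 experts, so only a quarter of the expert weights participate per token; all dimensions are illustrative and much smaller than DeepSeek V3's real configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 8 experts, but each token is routed to only its top-2,
# so most expert parameters stay inactive for any given token.
d_model, d_hidden, n_experts, top_k = 16, 64, 8, 2

# One weight matrix per expert (a real MoE uses a full FFN per expert).
experts = [rng.standard_normal((d_model, d_hidden)) * 0.02
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                            # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # chosen expert indices
    out = np.zeros((x.shape[0], d_hidden))
    for t in range(x.shape[0]):
        # Softmax over only the selected experts' router logits.
        w = np.exp(logits[t, top[t]] - logits[t, top[t]].max())
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out, top

x = rng.standard_normal((4, d_model))
y, chosen = moe_forward(x)

total_params = n_experts * d_model * d_hidden   # parameters stored
active_params = top_k * d_model * d_hidden      # parameters used per token
print(y.shape, total_params, active_params)
```

Scaled up, this is the same arithmetic behind DeepSeek V3's ratio: the full expert pool sits in memory, but each token's forward pass touches only the routed slice.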
This distinction matters because it shows that reasoning capability doesn't require inventing entirely new architectures; instead, it emerges from how models are trained on their existing structure.

Modern models have also adopted several technical refinements that improve both performance and efficiency:

- Attention Mechanisms: Newer models use grouped-query attention (GQA) and rotary positional embeddings (RoPE) instead of learned absolute positions, allowing them to handle longer sequences more efficiently and generalize better to unseen context lengths.
- Normalization Strategies: Advanced models employ query-key normalization (QK-Norm) and pre-norm layouts instead of post-norm, improving training stability and gradient flow through deeper networks.
- Sparse Routing: Mixture-of-experts designs with shared experts, as seen in DeepSeek V3 and its successors, balance model capacity with inference efficiency by routing different inputs to specialized sub-networks.
- Local Attention Patterns: Some models, such as Gemma 3, use sliding-window attention with a 5:1 ratio of local to global attention layers, reducing computational cost while maintaining long-range understanding.

The gallery documents models ranging from compact 3-billion-parameter systems like SmolLM3 to massive trillion-parameter designs like Moonshot's Kimi K2, which scales the DeepSeek V3 recipe upward with 32 billion active parameters. This range shows that the underlying architectural principles remain consistent even as scale changes dramatically.

Llama 3, Meta's 8-billion-parameter baseline, demonstrates how pre-norm architectures with grouped-query attention have become the standard reference point for comparing newer models. Meanwhile, OLMo 2 and OLMo 3 use decoder recipes similar to Qwen3's but experiment with different normalization and attention choices, showing how researchers iterate on proven designs rather than starting from scratch.
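The grouped-query attention refinement listed above is easy to sketch: several query heads share a single key/value head, so the KV cache shrinks by the sharing factor. The sketch below uses illustrative head counts and dimensions, not any particular model's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grouped-query attention: 8 query heads share 2 key/value heads,
# shrinking the KV cache 4x versus standard multi-head attention.
n_q_heads, n_kv_heads, d_head, seq_len = 8, 2, 4, 6
group = n_q_heads // n_kv_heads   # query heads per shared KV head

q = rng.standard_normal((n_q_heads, seq_len, d_head))
k = rng.standard_normal((n_kv_heads, seq_len, d_head))  # cached per KV head
v = rng.standard_normal((n_kv_heads, seq_len, d_head))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

outs = []
for h in range(n_q_heads):
    kv = h // group                  # which shared KV head this query head uses
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    outs.append(softmax(scores) @ v[kv])
out = np.stack(outs)                 # (n_q_heads, seq_len, d_head)

# Only n_kv_heads (not n_q_heads) K/V tensors need caching at inference time.
print(out.shape, n_q_heads // n_kv_heads)
```

With full multi-head attention every query head would carry its own K/V cache; sharing them is where the memory savings come from.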
Why Should You Care About These Architectural Differences?

Understanding these architectural shifts matters because they directly affect what AI models can do and how efficiently they do it. A model with 671 billion parameters that uses only 37 billion during inference can deliver reasoning capabilities while remaining practical to deploy. A model with sliding-window attention can process long documents faster than one applying full attention across all tokens. These aren't abstract engineering details; they determine whether an AI system can run on your laptop or requires a data center.

The gallery also reveals a convergence around certain design patterns. Most modern large models now use grouped-query attention rather than full multi-head attention, suggesting the field has reached consensus on this efficiency improvement. Similarly, the adoption of mixture-of-experts by Meta's Llama 4, Qwen's sparse variants, and other flagship models indicates that sparse routing has become essential for scaling beyond certain parameter counts.

Raschka's documentation serves as a reference point showing how much decoder stacks have changed since GPT-2. The gallery includes high-resolution architecture diagrams, also available as a physical poster, making these technical concepts accessible to researchers, engineers, and AI enthusiasts who want to understand the engineering decisions shaping modern language models. By visualizing the architectures side by side, the gallery reveals patterns that might otherwise remain hidden in academic papers and technical documentation.

The evolution from simple dense models to hybrid architectures combining dense prefixes with sparse routing, from learned positional embeddings to rotary encodings, and from post-norm to pre-norm layouts represents a maturation of the field. These changes didn't happen overnight; they emerged from thousands of experiments testing what works at different scales.
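The efficiency claim for sliding-window attention comes down to a simple count: full attention scores every token against every other token, while a local layer scores each token against only its last few. The numbers below are illustrative, not taken from any specific model.

```python
# Attended-position counts per layer (upper bound, ignoring causal masking):
# full attention compares every token with every token, sliding-window
# attention compares each token with only the last `window` tokens.
seq_len, window = 8192, 1024

full_pairs = seq_len * seq_len     # score-matrix entries, full attention
local_pairs = seq_len * window     # score-matrix entries, sliding window

print(full_pairs // local_pairs)   # local layers compute 8x fewer scores here
```

In a layout like Gemma 3's 5:1 local-to-global ratio, five out of every six layers pay only the cheaper local cost, which is where most of the savings on long documents come from.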
The architectural gallery captures this evolution in one place, making it possible to see how reasoning models, efficient models, and large-scale models all represent different solutions to the same fundamental challenge: how to build AI systems that are capable, efficient, and practical to deploy.