Understanding AI/ML

Documenting my learnings from a deep dive into Artificial Intelligence

Artificial Intelligence has quietly reshaped how modern software is built. For decades, most systems were engineered using explicit, human-written rules: if X happens, do Y. That approach works well when the world is predictable, but it breaks down when patterns become too complex, dynamic, or ambiguous for humans to fully describe.

AI represents a fundamental shift in how software works. Instead of relying on hand-coded logic, modern AI systems learn from data. By observing a large volume of examples, these systems infer patterns probabilistically, allowing them to recognize, predict, and even generate outcomes that were never explicitly programmed. This data-driven approach underpins everything from product recommendations and fraud detection to image generation and conversational systems like ChatGPT, and it forms the foundation of what we now call Machine Learning.

Anatomy of AI

AI is best understood as a layered set of approaches, where newer techniques build on those that came before them. This is often visualized as an “onion” model (AI > ML > NN > DL > GenAI). While newer layers enable new and more powerful capabilities, they do not necessarily replace earlier ones. From traditional recommendation systems to modern generative models, these techniques all coexist within the broader AI ecosystem, though some exceptions do exist, which we’ll cover later in the Generative AI section of this article.

AI Layers of Learning

 

History of AI

Artificial Intelligence as a formal field emerged in the 1950s, with the 1956 Dartmouth Workshop often cited as its point of origin. This early era laid the theoretical groundwork for many core ideas in machine learning that are still in use today. However, progress was limited for decades, as researchers lacked the massive datasets and computational power required to train these models at scale. As a result, the period from roughly the 1960s through the late 1990s became known as the “AI Winter,” marked by stalled breakthroughs and reduced investment.

By the 2010s, the industry push into “big data” finally produced sufficient data. As organizations collected large volumes of data on user behavior and site usage, predictive analytics methods that were previously impossible became practical.

In parallel, a new field called Deep Learning, which expands on the original neural network concept, took off. This new approach to AI, pioneered by researchers like Geoffrey Hinton (the “Godfather of Deep Learning”), uses architectures that allow for much deeper, hierarchical analysis of data. It was serendipitously enabled by the rise of GPUs (like those from Nvidia), which provide the parallel processing power needed to train deep networks efficiently. Soon after, the race was on to unlock key use cases, including Natural Language Processing (NLP) and image recognition, through the work of researchers like Ilya Sutskever (of OpenAI fame).

By 2020, we were beginning to see broad adoption of AI/ML, though it wasn’t yet apparent to most. By this point, AI/ML was present in everything from personalized product recommendations on Amazon, music recognition via Shazam, traffic routing via Waze (now Google Maps), and language comprehension capabilities of Siri and Google Translate. Up to this point, the primary focus was on discriminative tasks—pattern recognition, detection, and prediction.

What fundamentally changed in the early 2020s was the widespread accessibility of Generative AI. Rather than simply detecting patterns or making predictions, this new class of models could be used to create new, high-quality text, images, and media. This was enabled by a pivotal new architecture from Google called the Transformer (“Attention Is All You Need”, 2017). This architectural shift allowed models to process vast sequences of data far more efficiently, directly leading to the modern Generative AI boom. Within a couple of years, we saw the launch of ChatGPT (November 2022) and a new class of Diffusion Models for media generation, such as Stable Diffusion and DALL-E.

Paradigms and Algorithms

Next, I’ll go into each of the major types of AI and describe how each works and what it is used for. Below is a visual mental map that I created to simplify what can otherwise become overwhelming:

AI/ML Paradigms and Algorithms

Machine Learning

Machine Learning (ML) is broad, encompassing any algorithm that learns patterns from data to make predictions or decisions without being explicitly programmed. Contrast this with traditional Symbolic systems (aka Expert Systems), like Wolfram Alpha (e.g., Mathematica), which are primarily based on vast collections of human-defined rules. ML includes a variety of statistical techniques that primarily fall into 3 categories: Supervised, Unsupervised, and Reinforcement Learning.

Supervised  

Supervised learning identifies patterns through the use of “labeled” data. Labels are human-annotated notes that accompany each data element, telling the algorithm what it is an example of. There are two main types of Supervised learning that describe how you might use this labeled data:

i. Regression 

Regression modeling is a type of Machine Learning used to predict a continuous numerical value (like temperature or price) based on input data. The goal is to find the best-fitting function (often a line or curve) that mathematically models the relationship between the inputs and the output.

The most common algorithm is Linear Regression, but many others exist for complex relationships. When a model’s prediction differs from the actual data point, the difference is an error. The system learns through an optimization process called gradient descent, which iteratively adjusts the model’s parameters to reduce the overall error until convergence is achieved. Once converged, the model reflects the data and accurate prediction becomes possible.
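To make this concrete, here is a minimal sketch of Linear Regression trained by gradient descent. The synthetic data, learning rate, and iteration count are all illustrative choices invented for this example:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=50)

# Parameters of the line y_hat = w*x + b, starting from zero
w, b = 0.0, 0.0
lr = 0.01  # learning rate (step size)

for _ in range(2000):
    y_hat = w * X + b
    error = y_hat - y                 # prediction error per point
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * X)
    grad_b = 2.0 * np.mean(error)
    w -= lr * grad_w                  # step both parameters downhill
    b -= lr * grad_b
```

Each iteration measures the error, computes the gradient of that error with respect to each parameter, and nudges the line toward a better fit; after enough steps, w and b converge near the true slope and intercept.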

ii. Classification 

Classification predicts a category or label for an item, based on input features associated with that item (e.g., it’s a cat since it has both whiskers and pointy ears). This contrasts with Regression, which predicts a continuous value rather than a discrete category. There are two main types: (i) Binary Classification – choosing between two options (e.g., Spam/Not Spam) and (ii) Multi-Class Classification – choosing from three or more categories (e.g., Cat vs Dog vs Bird).

The goal of the model is to learn a “decision boundary” that accurately separates the data points belonging to different classes. Rather than simply assigning an absolute label, most sophisticated models (e.g., Logistic Regression and Naive Bayes) generate a probability score that represents the model’s confidence that an item belongs to a specific class. K-NN, by contrast, looks at which neighborhood an item seems to cluster into and takes a vote on the most common class among those neighbors. There are many techniques like this, but the goal is the same: to convert the probability gradient into a definitive label, a classification threshold is applied. This threshold is critical because it governs the trade-off between the evaluation metrics Precision and Recall (see the Wikipedia diagram), allowing model performance to be tuned based on the real-world cost of different types of errors.
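Here is a small illustration of how the threshold governs the Precision/Recall trade-off. The spam probability scores and labels below are made up for the example:

```python
import numpy as np

# Hypothetical classifier scores (probability of spam) and true labels (1 = spam)
probs  = np.array([0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.05])
labels = np.array([1,    1,    1,    0,    1,    0,    0,    0   ])

def precision_recall(probs, labels, threshold):
    preds = (probs >= threshold).astype(int)     # apply the classification threshold
    tp = np.sum((preds == 1) & (labels == 1))    # true positives
    fp = np.sum((preds == 1) & (labels == 0))    # false positives
    fn = np.sum((preds == 0) & (labels == 1))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A high threshold favors precision; a low one favors recall
p_hi, r_hi = precision_recall(probs, labels, 0.8)   # strict: flag only confident cases
p_lo, r_lo = precision_recall(probs, labels, 0.25)  # lenient: catch more spam
```

With the strict threshold, everything flagged really is spam (perfect precision) but half the spam slips through; with the lenient threshold, all spam is caught but some legitimate mail is flagged.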

Unsupervised

The next major category of Machine Learning is Unsupervised Learning, which looks for hidden patterns and relationships in data that has no human-affixed labels. This approach is especially valuable when identifying naturally occurring groupings that do not lend themselves to pre-definition, such as uncovering new market segments or finding groups of customers with shared needs that were not known ahead of time.

i. Clustering 

The most common application of this is Clustering, which divides data into groups based on similarity. K-Means Clustering is the most popular algorithm for this task. K-Means is similar to the classification algorithm K-NN in that it relies on proximity in the feature space (vector embeddings), but its ultimate purpose is the opposite. While K-NN assigns a new item to a known, pre-labeled class, K-Means works by iteratively adjusting central points (centroids) until the distance between the data points and their assigned groups is minimized, concluding with the discovery and definition of new, previously unknown cluster memberships. This is especially useful when trying to identify groups such as market segments, when you don’t have any prescribed groupings to start with. Imagine, for example, taking a customer survey and identifying persona/market segments based on similar clusters of needs.
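The K-Means loop can be sketched in a few lines. The two synthetic “customer” groups, the choice of k, and the iteration count are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic groups in a 2-D feature space, with no labels attached
data = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
                  rng.normal([5, 5], 0.5, (50, 2))])

k = 2
centroids = data[rng.choice(len(data), k, replace=False)]  # random initial centroids

for _ in range(20):
    # Assignment step: each point joins its nearest centroid
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its members
    new_centroids = []
    for j in range(k):
        members = data[assign == j]
        new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
    centroids = np.array(new_centroids)
```

After a few iterations the centroids settle onto the two natural groupings, even though the algorithm was never told the groups exist.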

ii. Other

Beyond clustering, Unsupervised Learning also includes other key concepts like Dimensionality Reduction, which simplifies features using techniques such as Principal Component Analysis (PCA) by transforming the data to a lower-dimensional space while retaining crucial information. This can significantly simplify the data beforehand, making subsequent clustering analysis far more efficient. Additionally, Association Rule Mining finds relationships between variables in large databases; it is often used for manual analysis in retail to identify items frequently purchased together (“market basket analysis”), though it’s unlikely to be used in production features.
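A quick sketch of PCA via singular value decomposition, on synthetic 3-D data that really only varies along one direction (the data shape and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# 3-D data that mostly varies along one hidden direction, plus small noise
t = rng.normal(0, 3, 200)
X = np.column_stack([t, 0.5 * t, -t]) + rng.normal(0, 0.1, (200, 3))

Xc = X - X.mean(axis=0)            # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)    # fraction of total variance per component

X_reduced = Xc @ Vt[:2].T          # project onto the top 2 principal components
```

Because the data only truly varies along one direction, the first principal component captures nearly all the variance, so the reduced representation loses almost nothing.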

Reinforcement Learning

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make a sequence of optimal decisions by interacting with a dynamic environment. Unlike supervised learning (which uses fixed labels) or unsupervised learning (which finds patterns), RL involves the agent performing an action and then receiving a numerical reward signal (a positive or negative score) from the environment. The agent’s goal is to learn a policy—a strategy that maps states to optimal actions—that maximizes the total cumulative reward over time, effectively helping the system learn when ambient context is complex or incomplete.

i. Model-Based 

Model-Based RL explicitly learns or is given a model of the environment, allowing the agent to predict future states and rewards before taking an action. The ability to simulate outcomes internally makes it very “sample efficient” (fewer real-world interactions required). A real-world example is a Roomba vacuum cleaner that builds and uses an explicit map of your house to plan an efficient cleaning path.

ii. Model-Free 

Model-Free RL skips modeling the environment and learns instead by trial-and-error, optimizing its actions or the policy itself (eg, Q-Learning). This approach is most useful in complex environments where an optimal model is difficult to create. It is like learning to ride a bike—you don’t need to know the physics behind it, you just learn which actions lead to a successful outcome. This is a common approach in training agents to play video games.
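The trial-and-error idea can be sketched with tabular Q-Learning on a made-up five-state corridor, where only the rightmost state gives a reward. The environment and hyperparameters are invented for illustration:

```python
import random

random.seed(0)
# A tiny 1-D corridor: states 0..4, start at 0, reward only for reaching state 4
n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

for _ in range(500):                   # episodes of trial-and-error
    s = 0
    while s != 4:
        # Epsilon-greedy: usually take the best-known action, sometimes explore
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0
        # Q-learning update: move toward reward + discounted best future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
```

Without ever being given a model of the corridor, the agent learns that “right” is the better action in every state, with values discounted by distance from the reward.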

iii. Multi-Armed Bandit 

This is one more example of Model-Free RL that is worth noting. With MAB, the chosen action does not affect the next state of the environment, meaning there is no long-term sequence of decisions to plan. MAB demonstrates the classic “exploration versus exploitation” trade-off: deciding whether to use the best-known action (exploit) or try an unknown one (explore), and finding the optimal balance along the way. MAB is commonly used in A/B testing, particularly in contexts such as ad optimization, where the system quickly tries to figure out which version is most performant and then optimizes toward showing that version for the duration of the ad campaign.
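A minimal epsilon-greedy bandit sketch makes the explore/exploit balance concrete. The click-through rates for the three ad variants are made up, and hidden from the algorithm:

```python
import random

random.seed(0)
# Hypothetical click-through rates for three ad variants (unknown to the algorithm)
true_rates = [0.05, 0.12, 0.09]
counts = [0, 0, 0]        # times each arm was shown
values = [0.0, 0.0, 0.0]  # running average reward per arm
epsilon = 0.1             # 10% of the time, explore a random arm

for _ in range(20000):
    if random.random() < epsilon:
        arm = random.randrange(3)                     # explore
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit the best-known arm
    reward = 1 if random.random() < true_rates[arm] else 0
    counts[arm] += 1
    # Incremental running-average update of this arm's estimated value
    values[arm] += (reward - values[arm]) / counts[arm]
```

The system quickly concentrates traffic on the best-performing variant while its estimate of that variant’s click-through rate converges to the true value.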

Neural Networks

Neural Networks were originally pioneered in the 1950s with the idea of networks of artificial neurons, called Perceptrons. The concept lay mostly dormant during the “AI Winter”, though the key addition of Backpropagation, which provides the mechanism for learning by propagating error backward through the network, arrived in the 1980s. This paved the way for future breakthroughs that would finally be realized in the mid-2000s, when data and compute power caught up (the Deep Learning era).

In Neural Networks, each neuron takes a set of inputs, multiplies them by their associated weights, adds a bias value, and passes the resulting value through an activation function, generating a calculated output. The network is a series of these layers of Perceptrons and the process of data moving sequentially through them is called the forward pass, which culminates in the final output layer’s activation function making the network’s prediction.

After the forward pass generates a prediction, a “loss function” quantifies the error by comparing the prediction to the true answer. Backpropagation then calculates the gradient of the error with respect to every weight and every bias, working backward from the output layer to the input layer. An optimizer then uses these calculated gradients to adjust the weights and biases across the whole network. This is the continuous, iterative learning process that refines the system, making it more accurate with each cycle, similar to the iterative human learning process.

Feed Forward Neural Network
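The forward pass / loss / backpropagation cycle described above can be sketched end-to-end on the classic XOR problem. The network size, learning rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
# XOR: not linearly separable, so a single neuron cannot solve it, but a hidden layer can
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer: weights and biases
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output layer
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 1.0

for _ in range(10000):
    # Forward pass: weighted sums plus bias, through the activation function
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradient of the squared-error loss, output layer first
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Optimizer step: adjust every weight and bias against its gradient
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

loss = float(np.mean((out - y) ** 2))
```

With each cycle the loss shrinks, and by the end the network has learned a function no single neuron could represent.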

Deep Learning 

By the mid-2000s, the stage was set for the AI breakthroughs that followed. This is the point where reality caught up with theory: the data being collected in the “big data” era and the advances in compute power made possible by cloud data centers running large quantities of GPUs (and Google’s TPUs) were finally coming online. What researchers quickly realized was that the original theories were not incorrect – they just required a level of scaling that was finally possible. Applying Neural Networks at scale, and thus giving the networks far greater depth than before, is the hallmark of the “Deep Learning” movement that began around this time.

Three notable deep learning architectures came to prominence in the DL era:

i. Convolutional Neural Network (CNN)

CNNs are used primarily for analyzing images. They use a “convolutional” layer, in which a small matrix of learnable weights, known as a “kernel”, slides (or “convolves”) across the entire input image. At each step, this filter computes a dot product with the small patch of pixels it covers (typically a 3×3 patch) to check for a matching pattern. Repeating this across the entire image generates a feature map that indicates where that specific pattern (like a vertical line or edge) was detected. By stacking multiple convolutional layers, the network builds a hierarchy of features, learning simple elements in early layers and then combining them into complex concepts (like eyes or entire objects) in the deeper layers. Through this process, a CNN is able to detect the objects present in an image.
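Here is the sliding-kernel computation in miniature: a hand-made vertical-edge kernel convolved over a tiny synthetic image (both invented for this example):

```python
import numpy as np

# A tiny 5x5 "image" with a bright vertical line in the middle column
img = np.zeros((5, 5))
img[:, 2] = 1.0

# A 3x3 kernel that responds to vertical lines (bright center, dark sides)
kernel = np.array([[-1, 2, -1],
                   [-1, 2, -1],
                   [-1, 2, -1]], dtype=float)

# Slide the kernel over every 3x3 patch and take the dot product
out_h, out_w = img.shape[0] - 2, img.shape[1] - 2
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = img[i:i+3, j:j+3]
        feature_map[i, j] = np.sum(patch * kernel)
```

The strong positive responses in the middle column of the feature map mark where the vertical-line pattern was detected; the negative values flag mismatching offsets.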

ii. Recurrent Neural Network (RNN)

RNNs are built for processing sequential data, such as text or audio, where the order and context of elements are crucial (not just their spatial location, as in a CNN). Their unique feature is “directed cycles”: loops that allow information from a previous step in the sequence to influence the processing of the current step. Similar to the human brain, this approach essentially gives the network a short-term working “memory” of past inputs, making RNNs ideal for tasks where context is necessary. RNNs are used for things like text autocomplete, as they can predict the next word in a sentence based on all the words that came before it. Later, Long Short-Term Memory (LSTM) units were added, addressing the struggle to remember information over long sequences (directed cycles helped but were not sufficient) and dramatically improving the ability to handle longer-form text.
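The recurrent loop can be sketched as a single step function whose hidden state is fed back in at every step. The weights here are random and untrained, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = 8      # toy vocabulary size
hidden = 16    # size of the recurrent "memory"

# Randomly initialized weights (untrained, for illustration only)
Wxh = rng.normal(0, 0.1, (vocab, hidden))   # input -> hidden
Whh = rng.normal(0, 0.1, (hidden, hidden))  # hidden -> hidden (the recurrent loop)
Why = rng.normal(0, 0.1, (hidden, vocab))   # hidden -> next-token scores

def step(x_onehot, h_prev):
    # The new hidden state mixes the current input with the previous state
    h = np.tanh(x_onehot @ Wxh + h_prev @ Whh)
    scores = h @ Why          # unnormalized scores for the next token
    return scores, h

h = np.zeros(hidden)
sequence = [0, 3, 5]          # token ids fed in order
for tok in sequence:
    x = np.eye(vocab)[tok]
    scores, h = step(x, h)    # h carries context forward between steps
```

Because h is both an output and an input of each step, every prediction is conditioned on the whole sequence seen so far, which is the short-term “memory” described above.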

iii. Generative Adversarial Network (GAN) 

Lastly, GANs represent a truly innovative framework and were arguably the first glimmer of the coming Generative AI wave. In this model, two different neural networks compete: a Generator and a Discriminator. The Generator network attempts to create realistic synthetic data (like images or music) from random input, while the Discriminator network acts like a critic, trying to correctly distinguish the real data from the training set from the fake data created by the Generator. This process drives both models to iteratively improve: the Generator gets better at producing data that fools the Discriminator, and the Discriminator gets better at detection.

While GANs gained initial fame for early image generation, their usefulness can be quite broad, including things such as data augmentation (eg creating synthetic training examples), super-resolution (enhancing low-quality images), and image-to-image translation (e.g., turning a sketch into a photorealistic scene).

Generative (GenAI)

The Generative AI era is defined by the ability of AI models to create coherent, original content in different “modes”, including text, images, video, and code. Technically, the ability to generate assets began earlier with models like Generative Adversarial Networks (GANs) in 2014, but the revolution was truly unlocked by the Transformer architecture in 2017, following the publication of Google’s seminal paper, “Attention Is All You Need”.

The Attention Mechanism allows for parallel processing of data, enabling the training of models on enormous datasets and leading to the development of highly sophisticated “Foundation Models” that can generalize across numerous tasks. Stemming from this major breakthrough is an entirely new class of models that take advantage of Transformers to leapfrog past their Deep Learning predecessors from just one decade prior. The two primary model types in the Generative AI era are:
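At the core of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)·V, from the “Attention Is All You Need” paper. A minimal sketch with random toy matrices (the sequence length and embedding size are arbitrary):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every query to every key
    # Numerically stable softmax over the keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # each output is a weighted mix of the values

rng = np.random.default_rng(5)
seq_len, d_model = 4, 8               # 4 "tokens", 8-dimensional embeddings
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))
out, weights = attention(Q, K, V)
```

Every token attends to every other token in one matrix multiplication, which is what makes the computation parallelizable across the whole sequence, unlike an RNN’s step-by-step loop.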

i. Large Language Models  

LLMs are the successor to Recurrent Neural Networks (RNNs) for sequential data processing, primarily text. Built entirely on the Transformer architecture, they leverage the attention mechanism to capture long-range contextual relationships within massive datasets. The architecture is split into two dominant forms: Encoder-only models, like BERT, which focus on reading the entire text bidirectionally to deeply understand and encode context for analysis; and Decoder-only models, like the GPT family (e.g., ChatGPT, Claude, Gemini), which excel at generating text by predicting the next word in a sequence. These models function as advanced statistical prediction machines, allowing for fluid conversation, sophisticated writing, and code generation at scale.

ii. Diffusion Models

For media generation, Diffusion Models are the new standard for synthesizing high-quality visual and temporal media such as video, replacing GANs for image and video generation. Well-known examples include Midjourney, Stable Diffusion, DALL-E, and Google’s Nano Banana. These models work through a process of systematically destroying an image with noise and then reversing that process, learning how to recreate it. To achieve their high fidelity, Diffusion Models often incorporate the Vision Transformer (ViT) architecture as their core backbone, particularly for video, which requires temporal attention tracking (adding a third dimension to attention, for time). This allows the generative process to leverage the Transformer’s attention mechanism for modeling global relationships across the pixels of an image or the frames of a video, ensuring the generated content is highly coherent and complex.

Agentic AI 

The power of Large Language Models (LLMs) extends far beyond mere text generation –  they are now capable of taking complex, high-level instructions, understanding user intent, and acting as an orchestrator across multiple tasks and systems. 

The initial step was enabling Multimodal Generative AI, where the LLM routes a creation request to a specialized model, such as passing a prompt to a Stable Diffusion model, to execute a task it cannot do itself. This concept, however, has much deeper implications: the LLM is not merely a translator; it is an Agent orchestrator capable of calling any tool necessary to achieve a goal. This can include web searches, running code, querying databases, or interacting with any 3rd-party service via MCP/API. The ability to understand, plan, act, and reflect autonomously on behalf of the user likely defines the next major wave of innovation beyond foundational Generative AI. And almost as if right on cue, AI research has recently turned to reasoning and world models, which appear to represent the next level of cognition (see LeCun’s JEPA and Sapient’s HRM).

Note – the world of Agentic AI is very interesting and active – I’ll probably write a separate article on this topic alone (stay tuned!).

Use Cases Summary 

Finally, I want to close this article by sharing a compilation of use cases that I put together, to really drive home my understanding of the various algorithms and their applications.

AI Use Cases

Conclusion 

Wrapping up, AI is a data-driven predictive approach to building systems. By gathering data and modeling it properly, it is possible to create systems that excel far beyond the need for humans to write declarative rules. This foundational shift from prediction to creation has opened up a world of exciting possibilities. Although Machine Learning had been in production use on popular sites and apps since the mid-2010s, it only recently captured the imagination of culture at large, when we entered the Generative AI era, starting in 2017, and particularly as punctuated by the release of ChatGPT in November 2022.

Research continues, and the rate of development is expected to only increase. Meanwhile, there are many exciting possibilities and use cases that have yet to be fully realized, even taking a snapshot of what’s possible today alone. And if you put all of this together, it is a wondrous future to consider, and a veritable playground for people who work in the space of innovation and digital products.