Markov Decision Processes (MDP) in Machine Learning

Table of Contents

  • Introduction
  • What is Markov Decision Process (MDP)?
  • Key Terms Related to Markov Decision Process in Machine Learning
  • Partially Observable Markov Decision Process
  • Markov Decision Process Formulation
  • Markov Decision Process Example
  • Applications of Markov Decision Process
  • Markov Chain vs Markov Process


The Markov Decision Process (MDP) is a mathematical framework for modeling decisions whose outcomes are partly random and partly under the decision-maker's control. This framework underlies most reinforcement learning (RL) problems.

What is Markov Decision Process (MDP)?

The Markov decision process in artificial intelligence is a stochastic decision-making framework used to model the behavior of a dynamic system where outcomes are partly random and partly controlled. An MDP evaluates which actions a decision-maker should take given the current state of the system and its environment.

MDP in machine learning relies on several variables, including the agent's actions, the environment, and rewards, to decide the system's next optimal action. Based on factors such as the available states, the set of actions, and the decision-making frequency, MDPs are divided into four types: finite, infinite, discrete, and continuous.

The Markov model has been around since the early 1950s. It is named after the Russian mathematician Andrey Markov, who played a crucial part in shaping the theory of stochastic processes.

Initially, Markov decision processes were used to solve problems in inventory management and control, routing, and queuing optimization. Today, MDPs are also used in robotics, economics, automatic control, manufacturing, and the study of optimization problems via dynamic programming.

We also use the Markov decision process to design intelligent machines that must operate for long periods in environments where actions can have uncertain outcomes. It is primarily popular in two subareas of artificial intelligence: probabilistic planning and reinforcement learning (RL).

Probabilistic planning uses a known model of the environment to achieve an agent's goals, guiding the machine's decisions toward its objectives. Reinforcement learning, on the other hand, lets an agent learn from the feedback it receives through interaction with the environment.

A Markov Decision Process (MDP) model includes:

  • A set of possible world states.

  • A real-valued reward function.

  • A transition model.

  • A set of possible actions.

  • A policy.

Key Terms Related to Markov Decision Process in Machine Learning

Here are a few terms you must understand that are used throughout the blog.

  • State

A state in the Markov decision process in artificial intelligence is a set of tokens representing the agent's current situation. It could be the robot's exact position in the house, its current posture, or the alignment of its legs; what counts as the state depends on how you frame the problem.

  • Model

A model, or transition model, gives the effect of an action in a state. Specifically, T(S, a, S') is a transition, where being in a state S and taking an action a takes us to a state S' (S and S' can be the same). For stochastic actions, we define a probability P(S' | S, a): the probability of reaching state S' when action a is taken in state S. By the Markov property, the effect of an action taken in a state depends only on that state, not on the prior history.
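As a minimal sketch (the state and action names here are hypothetical, not from any library), a transition model for a small MDP can be stored as a mapping from (state, action) pairs to probability distributions over next states:

```python
# Hypothetical transition model T: (state, action) -> {next_state: probability}.
T = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
}

def transition_prob(state, action, next_state):
    """P(S' | S, a); 0.0 if the transition is not defined."""
    return T.get((state, action), {}).get(next_state, 0.0)
```

Each inner distribution sums to 1, reflecting that taking an action always lands the agent in some state.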

  • Actions

Actions are the choices the agent makes at the current time step; A is the set of all possible actions. For example, a robot can move its right or left leg, lift an object, raise an arm, or turn left or right. The set of actions available to the agent is known in advance.

  • Reward

A reward in the Markov model is a real-valued reward function. R(S) is the reward for being in state S, while R(S, a) is the reward for being in state S and taking action a. R(S, a, S') is the reward for being in state S, taking action a, and reaching state S'.
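The three reward signatures just described can be written as plain functions; the states, actions, and values below are purely illustrative:

```python
# Illustrative versions of the three reward forms described above.
def R_state(s):
    """R(S): reward for simply being in state s."""
    return 1.0 if s == "goal" else 0.0

def R_state_action(s, a):
    """R(S, a): reward for taking action a in state s."""
    return -0.1 if a == "move" else 0.0  # e.g. a small movement cost

def R_full(s, a, s_next):
    """R(S, a, S'): reward for reaching s_next from s via action a."""
    return 1.0 if s_next == "goal" else -0.1
```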

  • Policy

A policy captures the thought process behind choosing an action. It is a solution to the Markov decision process in AI: a mapping from S to a, indicating the action taken while in state S. High-reward actions have a high probability of being chosen, and vice versa. An action with a low probability isn't excluded entirely; it is just less likely to be picked.

  • Agent

A reinforcement learning agent is the entity we train to make the right decisions, for example, a robot being trained to move around a house without scratching anything.

  • Environment

The environment in MDP in machine learning means the surroundings an agent interacts with. For example, the room or premises where the robot moves. An agent can’t manipulate the environment but can control its own actions. So, the robot might not decide the place of a chair but can move around it.

Partially Observable Markov Decision Process

A Partially Observable Markov Decision Process (POMDP) is an extension of the standard Markov Decision Process (MDP) that accounts for situations where the agent does not have complete information about the state of the environment. 

In a POMDP, the agent faces uncertainty not only in the environment's dynamics but also in its observations. 

POMDPs find applications in various domains where decision-making must account for incomplete or noisy information, such as robotics, autonomous systems, natural language processing, and healthcare, among others. They provide a powerful framework for modeling and solving problems in uncertain environments.

Markov Decision Process Formulation

Formulating a Markov Decision Process (MDP) involves defining the key components and characteristics of the decision-making problem. Here's a step-by-step guide on how to formulate an MDP:

Step 1: Define the Components

Identify the core components of the problem:

  • States (S): Determine the set of possible states that represent different situations or configurations of the environment. States should encapsulate all relevant information about the system.

  • Actions (A): Specify the actions or decisions that the agent can take. Actions influence the state transitions and impact the system's behavior.

  • Rewards (R): Define the rewards associated with state-action pairs. Rewards represent the immediate desirability or cost of taking a specific action in a particular state.

  • Transitions (T): Describe the dynamics of the environment, specifying the probabilities of transitioning from one state to another based on the agent's actions.
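The four components above can be bundled into one small container. This is a sketch with made-up states and values, not a standard library class:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S: possible world states
    actions: list       # A: available actions
    rewards: dict       # R: (state, action) -> immediate reward
    transitions: dict   # T: (state, action) -> {next_state: probability}

mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    rewards={("s0", "go"): -0.1, ("s1", "stay"): 1.0},
    transitions={("s0", "go"): {"s1": 1.0}, ("s1", "stay"): {"s1": 1.0}},
)
```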

Step 2: Define the Objective

Clearly articulate the objective or goal of the decision-making problem. Decide what the agent aims to achieve, whether it's maximizing cumulative rewards, minimizing costs, reaching a specific state, or another desired outcome.

Step 3: Ensure the Markov Property

Ensure that the problem satisfies the Markov property, which states that the future state (and rewards) depends only on the current state and action, regardless of the entire history of states and actions. This memoryless property simplifies modeling and computation.

Step 4: Formulate Policies

Introduce the concept of policies (denoted as π). Policies represent strategies or decision rules for the agent. A policy defines which action to take in each state. Policies can be deterministic (one action per state) or stochastic (a probability distribution over actions).
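The two kinds of policies can be contrasted in a few lines; the states and actions here are hypothetical:

```python
import random

# Deterministic policy: exactly one action per state.
det_policy = {"s0": "go", "s1": "stay"}

# Stochastic policy: a probability distribution over actions per state.
stoch_policy = {"s0": {"go": 0.8, "stay": 0.2}, "s1": {"stay": 1.0}}

def choose(state, policy, rng=random):
    rule = policy[state]
    if isinstance(rule, str):            # deterministic case
        return rule
    actions, probs = zip(*rule.items())  # stochastic case: sample an action
    return rng.choices(actions, weights=probs, k=1)[0]
```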

Step 5: Define the Objective Function

Define an objective function or criterion that quantifies the agent's goal. This could be the expected cumulative reward, which the agent aims to maximize over time.
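The expected cumulative reward is commonly written as a discounted sum, G = r0 + γ·r1 + γ²·r2 + …, where the discount factor γ weighs future rewards less than immediate ones. A direct translation:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative return: sum of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of reward 1.0 with gamma = 0.9:
# 1.0 + 0.9 + 0.81 = 2.71
```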

Step 6: Solving the MDP

Depending on the complexity of the problem, select appropriate methods and algorithms to find an optimal policy (π*) that maximizes (or minimizes) the objective function. Common approaches include:

  • Dynamic Programming: Value iteration or policy iteration.

  • Monte Carlo methods: Estimate value functions through sampling.

  • Temporal Difference learning: Update value functions based on temporal differences.

  • Reinforcement Learning: Use deep reinforcement learning techniques for complex problems.
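Of these, value iteration is the easiest to sketch. The two-state MDP below is hypothetical and only meant to show the Bellman optimality update:

```python
# Value iteration on a tiny hypothetical MDP: s1 is the rewarding state.
states = ["s0", "s1"]
actions = ["stay", "go"]
T = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}

def value_iteration(gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        # Bellman update: V(s) = max_a [ R(s,a) + gamma * E[V(s')] ]
        V_new = {s: max(R[s, a] + gamma * sum(p * V[s2]
                                              for s2, p in T[s, a].items())
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

V = value_iteration()  # V["s1"] converges to 10, V["s0"] to 9 with gamma=0.9
```

The fixed point follows from the update itself: V(s1) = 1 + 0.9·V(s1) gives 10, and the best move from s0 is "go", giving V(s0) = 0.9·10 = 9.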

Step 7: Policy Execution

Once an optimal policy (or a good policy approximation) is found, it can be executed in the real system or environment, guiding the agent to make decisions that maximize its expected long-term rewards.

Step 8: Continuous Improvement

Implement a feedback loop to continuously update the policy as the agent interacts with the environment. This allows the agent to adapt to changing conditions and improve its decision-making over time.

Markov Decision Process Example

Let's consider a simple example of a Markov Decision Process (MDP) known as the "Frozen Lake" problem. This problem is often used to illustrate the basic concepts of MDPs and reinforcement learning.

Problem Description:

Imagine a frozen lake represented as a grid. The agent starts at the top-left corner and needs to reach the bottom-right corner while avoiding holes in the ice. The agent can take four possible actions at each grid cell: move up, move down, move left, or move right. The ice is slippery, so the agent may not always move in the intended direction.

Here are the key components of this MDP:

  • States (S): Each grid cell in the frozen lake represents a state. There are multiple states in the grid.

  • Actions (A): The agent can take four actions: "Up," "Down," "Left," and "Right."

  • Rewards (R): The agent receives a reward of +1 for reaching the goal (bottom-right corner) and a reward of -1 for falling into a hole. All other transitions have a reward of 0.

  • Transitions (T): Due to the slippery ice, transitions are probabilistic. If the agent chooses to move in a certain direction, there's a 0.7 probability that it will move in the intended direction and a 0.3 probability that it will move in a random direction.


The objective of the agent is to find an optimal policy (a strategy) that maximizes the expected cumulative reward while navigating from the start to the goal.


Let's look at a simplified 4x4 grid representing a portion of the frozen lake. In this grid, "S" represents the start, "G" represents the goal, "H" represents a hole, and "F" represents a safe frozen cell.

S  F  F  F
F  H  F  H
F  F  F  H
H  F  F  G

In this grid, the agent starts at "S" and needs to find a path to "G" while avoiding holes "H." The agent's actions are uncertain due to the slippery ice.
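The same layout can be written down directly as data. This standalone sketch (no RL library involved) encodes the grid and the reward scheme described earlier:

```python
# The 4x4 Frozen Lake layout above, row by row.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

def cell(row, col):
    return GRID[row][col]

def reward(row, col):
    """+1 at the goal, -1 in a hole, 0 elsewhere."""
    return {"G": 1.0, "H": -1.0}.get(cell(row, col), 0.0)
```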

Solving the MDP:

To solve this MDP and find the optimal policy, various reinforcement learning algorithms can be applied, such as Q-learning or policy iteration. The agent learns to take actions that maximize the expected cumulative reward over time and, over many iterations, discovers the optimal strategy for reaching the goal while avoiding holes.

The optimal policy will guide the agent to take actions that increase the chances of reaching the goal and receiving positive rewards.
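As a rough sketch of one such method, tabular Q-learning can be run on this grid without any external library. Everything below (the slip handling and all hyperparameters) is illustrative, not a reference implementation:

```python
import random

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, rng):
    # Slippery ice: 0.7 chance of the intended move, 0.3 chance of a random one.
    if rng.random() < 0.3:
        action = rng.choice(list(ACTIONS))
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), 3)   # clamp to the grid edges
    col = min(max(state[1] + dc, 0), 3)
    c = GRID[row][col]
    return (row, col), {"G": 1.0, "H": -1.0}.get(c, 0.0), c in "GH"

def q_learning(episodes=3000, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): dict.fromkeys(ACTIONS, 0.0) for r in range(4) for c in range(4)}
    for _ in range(episodes):
        s, done = (0, 0), False      # every episode starts at "S"
        while not done:
            # Epsilon-greedy action selection.
            a = (rng.choice(list(ACTIONS)) if rng.random() < eps
                 else max(Q[s], key=Q[s].get))
            s2, r, done = step(s, a, rng)
            target = r if done else r + gamma * max(Q[s2].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

Acting greedily with respect to the learned Q-table (picking `max(Q[s], key=Q[s].get)` in each state) then approximates the optimal policy described above.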

Applications of Markov Decision Process

Markov Decision Processes (MDPs) find applications in a wide range of fields where decision-making under uncertainty is crucial. Here are some notable applications of MDPs:

  • Reinforcement Learning and Robotics:

MDPs are at the core of reinforcement learning, where agents learn to make decisions by interacting with environments. Robots use MDPs to plan and execute actions, enabling them to navigate, manipulate objects, and perform tasks.

  • Game Playing:

MDPs are used in artificial intelligence for game playing. Game agents, such as chess or Go-playing programs, employ MDPs to make optimal moves and decisions to win games.

  • Finance and Portfolio Management:

In finance, MDPs help optimize portfolio allocation and trading strategies. Traders use MDPs to make decisions on buying or selling financial assets to maximize returns while considering risks.

  • Healthcare and Treatment Planning:

MDPs are applied in healthcare for treatment planning and personalized medicine. They assist in determining optimal treatment paths for patients with chronic diseases, considering various factors like patient history and drug interactions.

  • Energy Management:

MDPs play a role in energy management systems. They help control the operation of smart grids, optimizing energy distribution and consumption while minimizing costs and environmental impact.

  • Autonomous Vehicles:

Self-driving cars and drones use MDPs to make real-time decisions on navigation, obstacle avoidance, and route planning while considering traffic, weather, and safety.

  • Natural Language Processing (NLP):

In NLP, MDPs can be used for dialogue management and chatbot interactions. They help chatbots make decisions about what responses to generate based on user input and conversation history.

  • Supply Chain Management:

MDPs are employed in supply chain optimization. They assist in making decisions about inventory management, demand forecasting, and logistics to minimize costs and improve efficiency.

  • Environmental Management:

Conservationists use MDPs to manage natural resources and wildlife. These models aid in making decisions about habitat preservation, species conservation, and ecosystem management.

  • Video Games:

Video game developers use MDPs to create intelligent non-player characters (NPCs) that exhibit complex behaviors and adapt to player actions.

  • Recommendation Systems:

MDPs are used in recommendation systems to decide what products, movies, or content to recommend to users based on their preferences and behaviors.

  • Agriculture and Precision Farming:

MDPs help optimize crop management and irrigation systems in agriculture. They make decisions about when and how much water, fertilizer, or pesticides to apply to maximize yields.

  • Marketing and Advertising:

Marketers use MDPs for optimizing ad campaigns. These models decide which ads to display to users, considering factors like user demographics and ad effectiveness.

  • Pharmaceutical Drug Discovery:

In drug discovery, MDPs are applied to identify potential drug candidates and optimize drug development processes.

  • Security and Anomaly Detection:

MDPs are used in security applications to detect anomalies and make decisions about security protocols and threat responses.

Markov Chain vs Markov Process


| Aspect | Markov Chain | Markov Process |
| --- | --- | --- |
| Definition | A mathematical model that describes transitions between a finite set of states over discrete time steps. | A general term that encompasses Markov Chains and extends to continuous-time processes as well. |
| Time representation | Typically discrete-time, where transitions occur at fixed time intervals. | Can be discrete-time (like Markov Chains) or continuous-time, allowing transitions at any point in time. |
| State space | Finite or countable set of states. | Can be finite, countable, or continuous. |
| Transitions | Transition probabilities describe the chance of moving from one state to another at the next time step. | Transition rates describe how the system moves between states. |
| Memory | Memoryless: future transitions depend only on the current state, not on past history. | Can be memoryless (like Markov Chains) or have memory, depending on the specific process. |
| Variables | Typically involves discrete variables (states). | Can involve both discrete and continuous variables, depending on the process. |
| Examples | Board games (e.g., Monopoly), random walks, weather patterns, and discrete-event simulations. | Queueing systems, continuous-time financial models, Brownian motion, and continuous-state systems. |
