Partially observable Markov decision process

Partially observable Markov decision process

A Partially Observable Markov Decision Process (POMDP) is a generalization of a Markov Decision Process. A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must infer a distribution over the state based on a model of the world and some local observations.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the operation research community, and was later taken over by the Artificial Intelligence and Automated Planning communities.

An exact solution to a POMDP yields the so-called optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.


Formal Definition

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a tuple (S,A,O,P,R), where
* S is a finite set of states,
* A is a finite set of actions,
* P is a finite set of probabilities,
* O is a finite set of observations,
* R: A,S o mathbb{R} is the reward function.Each time period, the environment is in some state s in S, the agent takes an action a in A. Taking action a makes the environment transition to state s' with probability P(s'mid s,a) and the agent receives a reward for it, with expected value r(s,a,s').

Belief update

In s', the agent observes o in O with probability P(omid s',a). Let b be the vector of state probabilities: b(s) denotes the probability that the environment is in state s. Given b(s), then after taking action a and observing o,:b'(s') = eta P(omid,s',a) sum_{sin S} P(s'mid s,a)b(s)where eta=P(omid b,a) is a normalizing constant with P(omid b,a) = sum_{s'in S}P(omid s',a)sum_{sin S}P(s'mid s,a)b(s).

Since the state is Markovian, maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted b' = au(b,a,o).

Belief MDP

The policy maps a belief state space into the action space. The optimal policy can be understood as the solution of a continuous space so-called belief Markov Decision Process [L.P. Kaelbling and M.L. Littman and A.R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence Journal, vol. 101, pages 99-134, 1998] . It is defined as a tuple (B,A, au, ho) where
* B is the set of belief states over the POMDP states,
* A is the same finite set of action as for the original POMDP,
* au is the belief state transition function,
* r:B,A o mathbb{R} is the reward function on belief states. It writes :r(b,a) = sum_{sin S} b(s) R(s,a).

Note that this MDP is defined over a continuous state space.

Policy and Value Function

The agent's policy pi specifies an action a=pi(b) for any belief b. Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When R defines a cost, the objective becomes the minimization of the expected cost.

The expected reward for policy pi starting from belief b is defined as:J^pi(b) = EBigl [ sum_{t=0}^infty gamma^t r(s_t,a_t,s_t') mid b, pi Bigr] where gamma<1 is the discount factor. The optimal policy pi^* is obtained by optimizing the long-term reward.:pi^* = underset{pi}{mbox{argmax J^pi(b_0)where b_0 is the initial belief.

The optimal policy, noted pi^* yields the highest expected reward value for each belief state, compactly represented by the optimal value function, noted V^*. This value function is solution to the Bellman optimality equation::V^*(b) = max_{ain A}Bigl [ r(b,a) + gammasum_{oin O} P(omid b,a) V^*( au(b,a,o)) Bigr] For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex [ The optimal control of partially observable Markov processes, E.J. Sondik, PhD Thesis, Standford University, 1971] . It can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate V^* arbitrarily closely, whose shape remains convex. Value iteration applies dynamic programming update to gradually improve on the value until convergence to an epsilon-optimal value function, and preserves its piecewise linearity and convexity [R.D. Smallwood and E.J. Sondik, The optimal control of partially observable Markov decision processes over a finite horizon, Operations Research, 21, 1071-1088, 1973.] . By improving the value, the policy is implicitly improved. Another dynamic programming technique called policy iteration explicitly represents and improves the policy instead [E.J. Sondik, The optimal control of partially observable Markov processes over the infinite horizon: discounted cost, Operations Research, 26, 282-304, 1978.] [E. Hansen, Solving POMDPs by searching in policy space, In Proceedings of the Fourteenth International Conference on Uncertainty In Artificial Intelligence (UAI-98), 1998.] .

Approximate POMDP solutions

In practice, POMDPs are often computationally intractable to solve exactly, so computer scientists have developed methods that approximate solutions for POMDPs.

Grid-based algorithms [Computationally feasible bounds for partially observed Markov decision processes, Lovejoy, W., Operations Research, 39, 162--175, 1991.] comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action to take for other belief states that are encountered and that aren't in the set of grid points. More recent work makes use of sampling techniques, generalisation techniques and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states Assisting Persons with Dementia during Handwashing Using a Partially Observable Markov Decision Process Jesse Hoey, Axel von Bertoldi, Pascal Poupart, and Alex Mihailidis. Full text available [ here] .] . Dimensionality reduction using PCA has also been explored [Exponential Family PCA for Belief Compression in POMDPs (2003). Nicholas Roy, Geoffrey Gordon. In Advances in Neural Information Processing Systems.] .

POMDP uses

POMDPs model many kinds of real-world problems. Notable work includes the use of a POMDP in assistive technology for persons with dementia .


External links

* [ Tony Cassandra's POMDP pages] with a tutorial, examples of problems modeled as POMDPs, and software for solving them.
* [ zmdp] , a POMDP solver by Trey Smith
* [ APPL] , a fast point-based POMDP solver

Wikimedia Foundation. 2010.

См. также в других словарях:

  • Markov decision process — Markov decision processes (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for… …   Wikipedia

  • Markov model — In probability theory, a Markov model is a stochastic model that assumes the Markov property. Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. Contents 1 Introduction 2 Markov chain… …   Wikipedia

  • List of numerical analysis topics — This is a list of numerical analysis topics, by Wikipedia page. Contents 1 General 2 Error 3 Elementary and special functions 4 Numerical linear algebra …   Wikipedia

  • Automated planning and scheduling — is a branch of artificial intelligence that concerns the realization of strategies or action sequences, typically for execution by intelligent agents, autonomous robots and unmanned vehicles. Unlike classical control and classification problems,… …   Wikipedia

  • List of mathematics articles (P) — NOTOC P P = NP problem P adic analysis P adic number P adic order P compact group P group P² irreducible P Laplacian P matrix P rep P value P vector P y method Pacific Journal of Mathematics Package merge algorithm Packed storage matrix Packing… …   Wikipedia

  • Monte Carlo POMDP — In the class of Markov decision process algorithms, the Monte Carlo POMDP (MC POMDP) is the particle filter version for the partially observable Markov decision process (POMDP) algorithm. In MC POMDP, particles filters are used to update and… …   Wikipedia

  • Preference elicitation — refers to the problem of developing a decision support system capable of generating recommendations to a user, thus assisting him in decision making. It is important for such a system to model user s preferences accurately, find hidden… …   Wikipedia

  • Michael L. Littman — Born August 30th, 1966 Philadelphia, PA …   Wikipedia

  • Reinforcement learning — Inspired by related psychological theory, in computer science, reinforcement learning is a sub area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long term reward .… …   Wikipedia

  • Vaclav E. Benes — Vaclav Edvard Vic Benes (born 1930) is a Czech American mathematician, known for his contributions to the theory of stochastic processes, queueing theory and control theory, as well as the design of telecommunications switches.He studied under… …   Wikipedia

Поделиться ссылкой на выделенное

Прямая ссылка:
Нажмите правой клавишей мыши и выберите «Копировать ссылку»