USE OF REINFORCEMENT LEARNING (RL) FOR PLAN GENERATION IN BELIEF-DESIRE-INTENTION (BDI) AGENT SYSTEMS

The Belief-Desire-Intention (BDI) agent framework is a reactive agent framework based on the idea of intentionality. A known weakness of BDI is its lack of learning capabilities, a consequence of its dependence on an a-priori library of plans. BDI plans are designed by human experts in the domain to which BDI is being applied and are fixed. Any situation the BDI agent encounters that does not have a matching plan can result in erroneous agent operation and even agent failure. Researchers have augmented the BDI framework with various learning frameworks, including decision trees, self-organizing neural networks, hybrid architectures using low-level learners, and metaplans for plan hypothesis abduction and plan modification. Other relevant research tackled the use of a-priori knowledge, previously learned knowledge, and the learning of plans without a-priori knowledge in planning systems, as well as the integration of learning, planning, and execution. These studies were, however, not investigated in relation to BDI systems. This study explores the successful use of Reinforcement Learning (RL), a computational learning framework based on the idea of learning from repeated interactions with the environment, to generate plans in BDI systems without relying on a-priori knowledge.


Motivation
The lack of learning capabilities in BDI systems was recognized as far back as 2004 [1]. Researchers tackled this by augmenting the BDI framework with various learning frameworks, including decision trees, self-organizing neural networks, hybrid architectures using low-level learners, and metaplans for plan hypothesis abduction and plan modification. Other relevant research tackled the use of a-priori knowledge, previously learned knowledge, and the learning of plans without a-priori knowledge in planning systems, as well as the integration of learning, planning, and execution. These studies were, however, not investigated in relation to BDI systems.
Recent research relied on Markov Decision Processes (MDPs) to generate BDI plans from optimal policies for completely specified MDPs [2]. Pereira's work was later extended to work with Partially Observable Markov Decision Processes (POMDPs) [3]. These two studies come closest to the proposed study, with the difference that the proposed study considers neither fully specified MDPs nor POMDPs.
The problem selected for study is justified by the lack of research exploring the generation of plans in BDI systems using reinforcement learning that does not rely on a-priori knowledge.

Thesis Organization
Chapter 2 Planning and Learning discusses the close relationship between planning and learning. In this work, RL represents the learning aspect and BDI represents the planning aspect. Chapter 3 Reinforcement Learning introduces the computational RL framework and provides an introduction to the rigorous mathematical notions that underlie learning from repeated interactions with the environment. Chapter 4 Belief-Desire-Intention (BDI) Agent Systems Framework introduces the BDI framework and highlights its known weakness of relying on an a-priori plan library. The logic-based BDI programming language AgentSpeak is also examined. Chapter 5 Differences Between Proposed Study and Previous Research reviews the previous research that justifies this thesis. The coding implementation for the RL and BDI parts is discussed in Chapter 6 Experimental Implementation. The results are presented in Chapter 7 Results. Finally, a discussion of the results, limitations, and ways to enhance the ideas in this thesis in future work forms Chapter 8 Discussion.

List of References

[1] A. Guerra-Hernández, A. E. Fallah-Seghrouchni, and H. Soldano, "Learning in BDI multi-agent systems," in CLIMA, 2004, pp. 218-233.

[2] D. R. Pereira and G. P. Dimuro, "Um algoritmo para extração de um plano BDI que obedece uma política ótima," in Workshop-Escola de Sistemas de Agentes para Ambientes Colaborativos, Pelotas, Anais do WESAAC 2007, 2007.

[3] D. R. Pereira, L. Vargas, G. P. Dimuro, and A. C. R. Costa, "Constructing BDI plans from optimal POMDP policies, with an application to AgentSpeak programming," in XXXIV -

CHAPTER 2
Planning and Learning

Introduction
Planning and learning are closely connected. It is difficult to think of one without thinking of the other. Zimmerman considers them to be the "most broadly recognized hallmarks of intelligence" and defines them as:

planning - solving problems in which one uses beliefs about actions and their consequences to construct a sequence of actions that achieve one's goals.

learning - using past experience and percepts to improve one's ability to act in the future. [1]

Zimmerman's Model
Zimmerman introduces a 5 dimensional model to characterize automated planning systems that are augmented with a learning component, including an extensive survey of planning and learning [1]. The model allows us to compare, within the model limitations, the different approaches researchers have pursued to combine planning and learning in a graphical way.
The 5 dimensions of the model are: 1. Problem Type. The problem type is a function of the environment. Besides providing a useful model to characterize planning and learning research, the survey discovered few research attempts that used RL for planning.
These included a hybrid approach combining explanation-based learning (EBL) with RL into an explanation-based reinforcement learning (EBRL) algorithm, tested on chess endgames and synthetic maze tasks [2]. Incremental dynamic programming (DYNA) was proposed as an RL architecture that split learning and planning by having actions generated by a reactive system with a planning system acting "independently and conceptually in parallel" [3]. Learning by Observation in Planning Environments (LOPE) was proposed as an architecture integrating learning, planning, and execution [4]. LOPE focuses on learning operator definitions, building plans using the operators, and executing plans that modify the acquired operators.

CHAPTER 3
Reinforcement Learning

Introduction
Reinforcement learning (RL) is a computational learning framework based on the idea of learning from repeated interactions with the environment [1,2,3].
RL agents seek to maximize a reward signal from the environment. As the agent explores the environment it learns the actions that maximize the reward from particular states. The agent does this by selecting the action that will bring the greatest reward, known as the greedy action. Once an agent has completed a start-state to goal-state cycle, the agent has two choices: exploitation or exploration.
An agent can exploit the knowledge it learned interacting with the environment by choosing the greedy action or it can explore the environment by choosing an untried action or trying again a sub-optimal action. This is known as the exploration-exploitation dilemma.
Choosing the greedy action all the time, however, is not an effective strategy.
Immediate high rewards can be followed by low rewards that outweigh the initial high rewards. Immediate low rewards can, conversely, be followed by high rewards that outweigh the initial low rewards. To overcome myopic behavior, RL agents need to balance exploitation with exploration.
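To make this balance concrete, the following minimal Java sketch shows one common way to implement it, ε-greedy action selection: with probability ε the agent explores a random action, otherwise it exploits the greedy action. The class and field names (qValues, epsilon, selectAction) are illustrative assumptions, not the thesis implementation.

import java.util.Random;

// Minimal epsilon-greedy action selection sketch (illustrative names).
public class EpsilonGreedy {
    private final double[][] qValues;   // qValues[state][action], estimated action values
    private final double epsilon;       // exploration probability, e.g. 0.1
    private final Random random = new Random();

    public EpsilonGreedy(double[][] qValues, double epsilon) {
        this.qValues = qValues;
        this.epsilon = epsilon;
    }

    public int selectAction(int state) {
        int numActions = qValues[state].length;
        if (random.nextDouble() < epsilon) {
            // Explore: choose a random action.
            return random.nextInt(numActions);
        }
        // Exploit: choose the greedy action (highest estimated value).
        int best = 0;
        for (int a = 1; a < numActions; a++) {
            if (qValues[state][a] > qValues[state][best]) {
                best = a;
            }
        }
        return best;
    }
}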

RL Problem
To apply reinforcement learning to a problem requires that the problem be characterized as an RL problem. RL problems can be characterized by four elements: a policy, a reward signal, a value function, and, optionally, a model of the environment [2]. The policy maps states to actions and defines the agent's way of behaving. The reward signal is the immediate numerical feedback returned by the environment after each action.

The value function is the expected rewards averaged over many exploration-exploitation trials. It provides an approximation of the value of particular states.
It answers the agent's question: What can I expect if I start in this state and follow what I have learned? It is "what is good in the long run" [2].
The model provides a representation of the environment and of how it reacts to specific actions by the agent. It is used for planning by simulating actions in particular states and observing the rewards, in effect experiencing the environment through simulation. In many domains, however, a model is not available or is infeasible to obtain. Fortunately, RL can learn the world model empirically by interacting with the environment.

RL Problem Formalization
This section introduces a formalization of the RL problem.

Figure 2. Agent-Environment Model [2]

In RL problems we will always have an agent situated in an environment. "An agent is anything that can be viewed as perceiving its environment through sensors and acting upon the environment through actuators." [4]

agent: the RL agent, the part of the system that is learning; the learner
action: $a_t \in A(s_t)$, the action chosen by the agent in a particular state at a particular time
environment: everything outside the agent, or what the agent interacts with
state: $s_t \in S$, where the agent finds itself after choosing an action and receiving the corresponding reward
task: a complete specification of the environment
reward: a numerical value, received after choosing an action, that the agent tries to maximize
policy: $\pi_t(s, a)$, a mapping from states to probabilities of selecting each possible action

The agent in state $s_t$ selects action $a_t \in A(s_t)$, which results in reward $r_{t+1}$ and state $s_{t+1} \in S$, leading to the following sequence:

$s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, \ldots$

As the agent moves from state to state it collects rewards. The sum of all rewards defines the returns.

Returns
RL agents seek to maximize the rewards received, the returns. Returns are defined differently depending on whether the task is episodic or continuing.

Episodic tasks can be broken into discrete and finite episodes. In other words, they do not go on forever. In this case the return is the sum of the rewards from the beginning of the episode, at time $t+1$, until time $T$ when the episode finishes. In this straightforward case, the return $R_t$ can be defined as:

$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T$

Continuing tasks are not so easily broken into distinct episodes and can potentially go on for a long time, even "forever". In this case, rewards received immediately are more valuable than rewards received later. Multiplying each reward by a decreasing factor provides a way to specify the value of immediate rewards. This factor is called the discounting factor $\gamma$, for $0 < \gamma < 1$. The value chosen for the discounting factor provides a way to specify how short-sighted or forward-looking we want the agent to be. The discounted return $R_t$ is defined as:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
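As a small numerical illustration (the numbers here are made up for this example, not taken from the experiments): with $\gamma = 0.9$ and rewards $r_{t+1} = 10$, $r_{t+2} = -1$, $r_{t+3} = 10$, and all later rewards zero, the discounted return is

$R_t = 10 + (0.9)(-1) + (0.9)^2(10) = 10 - 0.9 + 8.1 = 17.2$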

Markov Property
The Markov property is important in RL because it defines systems that do not need to keep history. In other words, what matters is the current state of the system and not the history of how it got there. For some systems there might be infinite ways of getting to a particular state. In problems where keeping history is necessary to predict future actions, the state space would be orders of magnitude bigger. In such cases the problem would be consequently much more complex and difficult to analyze, model and simulate.
Formally, the Markov property is exhibited by systems in which the following two equations are equivalent:

$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\}$   (1)

$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$   (2)

for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$. [2]

Equation (1) gives the probability of the next state being $s'$ and the next reward being $r$ given the entire previous sequence of states, actions, and rewards $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$. Equation (2) gives the probability of the next state being $s'$ and the next reward being $r$ given only the previous state and previous action $s_t, a_t$. If these two equations are equivalent, the state, action, and reward history is not a factor in determining the probability of the next state and its reward. This leads to one-step dynamics that allow the next state and the expected next reward to be predicted from the current state and the current action alone.
An RL task that satisfies the Markov property is a Markov Decision Process (MDP). For finite MDPs, the probability of each possible next state $s'$ is:

$P^a_{ss'} = Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$

The expected value of the next reward is:

$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$

For the purposes of this research, a finite MDP is assumed.

Value Functions
The value function is the expected rewards averaged over many exploration-exploitation trials. Value functions have two equivalent representations: value functions of states, typically denoted $V$, and value functions of state-action pairs, typically denoted $Q$. These functions are evaluated for a given policy $\pi$.
State-value function for policy $\pi$

$V^\pi(s) = E_\pi\{R_t \mid s_t = s\}$

$V^\pi$ is called the state-value function for policy $\pi$; it represents "the expected return when starting in $s$ and following $\pi$ thereafter".

Action-value function for policy $\pi$

$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\}$

$Q^\pi$ is called the action-value function for policy $\pi$; it represents "the expected return starting from $s$, taking action $a$, and thereafter following policy $\pi$".
One useful way of visualizing these equations is through the use of backup diagrams. In backup diagrams, states are denoted by white circles and actions are denoted by smaller black circles. Figure 3(a) shows the diagram for $V^\pi$. It is intuitive that the value for state $s$ will depend on all possible actions $a$ available at state $s$. These actions provide a reward $r$ and take the agent to state $s'$.
Since policies are stochastic and agents have to balance exploration and exploitation, the value for state $s$ will be an approximation. The more the agent explores the environment, the closer the value for state $s$ will come to its true value. Figure 3(b) shows the diagram for $Q^\pi$. In this case, similar to $V^\pi$, the value for the action-value pair $(s, a)$ will depend on the reward $r$ and on the value of the next state $s'$ and the next action $a'$, which constitute the next action-value pair $(s', a')$.

Optimal Functions
Value functions are evaluated for particular policies. It should not be surprising that some policies perform better than others. Given two policies $\pi$ and $\pi'$, $\pi$ is a better policy than $\pi'$, written $\pi \geq \pi'$, if and only if $V^\pi(s) \geq V^{\pi'}(s)$ for all $s \in S$. A policy that performs equal to or better than all other policies is an optimal policy, denoted by $\pi^*$.
Using the optimal policy, the optimal state-value function can be defined as:

$V^*(s) = \max_\pi V^\pi(s)$

The optimal action-value function can be defined as:

$Q^*(s, a) = \max_\pi Q^\pi(s, a)$

for all $s \in S$, $a \in A(s)$.

The optimal state-value function equation above can be rewritten without referencing a policy:

$V^*(s) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

The last equation is the Bellman equation for $V^*$, or the Bellman optimality equation.

Similarly, the optimal action-value function can be rewritten as the Bellman optimality equation for $Q^*$:

$Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$

The backup diagrams for $V^*$ and $Q^*$ are shown in figure 4.
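A useful relation connecting the two optimal value functions, standard in the RL literature although not written out above, is $V^*(s) = \max_{a \in A(s)} Q^*(s, a)$: the optimal value of a state is the value of the best action available in it.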

Learning Tasks
Before RL and its algorithms can be used on a particular problem, the problem needs to be expressed as a learning task.
A learning task is a complete specification of the agent and the environment, and it is defined by 3 sets:
1. The state set S.
2. The action set A.
3. The reward set R.
The environment is generally concerned with the state set and the states $s_t \in S$.
The state only changes as a result of actions selected by the agent.
In any learning task the goal is for the agent to learn what to do in any particular situation that it encounters, that is, to learn a mapping from states to actions, a policy. The collection of all the situations that the agent may encounter defines the state-space of the task. For example, an agent in a 2-dimensional gridworld has at least two pieces of information that it would have to track if we are to hope for any helpful and significant learning: where it is on the X coordinate and where it is on the Y coordinate. Furthermore, if the X coordinate has a maximum dimension of 10 and the Y coordinate has a maximum dimension of 10, then we know that there are 100 possible combinations of < x, y > values. This is the state-space for the problem. The X and Y coordinates are state-variables. They define a state because we find it useful to distinguish < x_t, y_t > from < x_{t+1}, y_{t+1} >, and they are variables because they change dynamically depending on the action chosen by the agent.
The agent is generally concerned with the action set and the actions $a_t \in A(s_t)$. When the agent chooses an action $a_t \in A(s_t)$ in a particular state $s_t \in S$ at a particular time $t$, the result will be a reward $r_{t+1} \in R$ and a new state $s_{t+1} \in S$. The reward is provided by the environment as a result of the agent selecting an action. Since many implementations of RL have synthetic, simulated environments, it is incumbent upon the RL researcher to define a reward set in a manner that is conducive to learning. Usually this involves some amount of trial and error in selecting a reward structure that biases the learning task toward the actual behaviors we want learned.
Once a problem is specified as a learning task, it is ready for learning. But how does the learning happen? The cross product $S \times A$ defines the action values, or Q values, for the learning task. These are state-action tuples. Action values are initialized arbitrarily, since no learning has taken place and the true values are not yet known. As the agent interacts with the environment, the RL algorithm updates the value of the corresponding state-action tuple by backing up a fraction of the value of later states to preceding states to better reflect their true value. Selecting the state-action tuples with maximum value defines a policy. A sketch of such a table for the gridworld example is shown below.
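The sketch is illustrative only; the class and method names are assumptions, not the thesis implementation. It builds the S x A table of action values for the 10 x 10 gridworld example and shows how the greedy policy is read off it.

import java.util.Random;

// Illustrative Q table for a 10x10 gridworld with 4 actions (up, down, left, right).
public class GridworldQTable {
    static final int WIDTH = 10, HEIGHT = 10;   // 100 states <x, y>
    static final int NUM_ACTIONS = 4;

    // qValues[state][action]; states are indexed by x * HEIGHT + y.
    private final double[][] qValues = new double[WIDTH * HEIGHT][NUM_ACTIONS];

    public GridworldQTable() {
        Random random = new Random();
        // Arbitrary initialization: no learning has taken place yet.
        for (int s = 0; s < WIDTH * HEIGHT; s++)
            for (int a = 0; a < NUM_ACTIONS; a++)
                qValues[s][a] = random.nextDouble() * 0.01;
    }

    // Map the state variables <x, y> to a single state index.
    public int stateIndex(int x, int y) {
        return x * HEIGHT + y;
    }

    // The greedy policy selects, for each state, the action with maximum value.
    public int greedyAction(int x, int y) {
        double[] values = qValues[stateIndex(x, y)];
        int best = 0;
        for (int a = 1; a < NUM_ACTIONS; a++)
            if (values[a] > values[best]) best = a;
        return best;
    }
}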

RL Algorithms
A number of algorithms have been developed to implement RL. An important milestone was the development of Q learning by Watkins in 1989 [2], [5]. In its simplest form the update rule for Q learning, also known as one-step Q learning, has the form:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

In 1992 Watkins and Dayan proved that Q learning "converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely" [6]. In other words, "the learned action-value function, $Q^\pi$, directly approximates $Q^*$" [2].
Some of the benefits of Q learning include:
1. It is model-free. It does not need the probability distributions for transitions from state $s$ to state $s'$.
2. It can handle stochastic transitions.
The algorithmic representation for Q learning is taken from [2] and is shown in table 1.

Q Learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
        s ← s'
    until s is terminal

Table 1. Q Learning Algorithm
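To connect the pseudocode to the Java implementation discussed in Chapter 6, the following is a minimal sketch of the one-step Q learning update; the names (qValues, alpha, gamma) are illustrative assumptions and not taken from the thesis code.

// One-step Q learning update for a tabular agent (illustrative sketch).
public class QLearningUpdate {
    private final double[][] qValues;  // qValues[state][action]
    private final double alpha;        // step-size (learning rate), e.g. 0.1
    private final double gamma;        // discounting factor, e.g. 0.9

    public QLearningUpdate(double[][] qValues, double alpha, double gamma) {
        this.qValues = qValues;
        this.alpha = alpha;
        this.gamma = gamma;
    }

    // Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    public void update(int s, int a, double r, int sPrime) {
        double maxNext = qValues[sPrime][0];
        for (int aPrime = 1; aPrime < qValues[sPrime].length; aPrime++) {
            maxNext = Math.max(maxNext, qValues[sPrime][aPrime]);
        }
        qValues[s][a] += alpha * (r + gamma * maxNext - qValues[s][a]);
    }
}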

RL and Planning
A detailed survey of RL and automated planning systems is presented by Partalas [7]. Partalas argues that "there is a close relationship between those two areas as they both deal with the process of guiding an agent, situated in a dynamic environment, in order to achieve a set of predefined goals." Because RL combines planning and learning, the distinctions above blur in practice if not in theory. Since a relevant part of the research for this study will be the selection of feasible RL approaches to generate plans for BDI systems, only a quick summary of the research detailed by Partalas is given here, except in cases where research is directly related to the problem for this study. Partalas describes the possible approaches to combining planning with RL as:

1. Planning for RL. The approaches to planning for RL are categorized in 3 groups:

(a) Based on Dyna [8] - Dyna is an architecture that extends reinforcement learning by including a world model. Sutton summarized some of the limitations of RL stating that "Reinforcement learning architectures are effective at trial-and-error learning, but no more. They can not do any of the things that are considered "cognitive," such as reasoning or planning. They do not learn an internal model of the world's dynamics, of what-causes-what, but only of what-to-do (policy) and how-good-is-it (return predictions). This is an important limitation because potentially much more can be learned in the form of a world model than can be learned by trial and error; the reward signal is just a scalar, while the sensory input signal is a much richer potential source of training information. And what if the goal changes? Typically, a world model can remain relatively intact over goal changes and can assist in achieving the new goal, whereas policy and return predictions must be totally changed." [9] Approaches based on Dyna include:

i. Dyna-Q - Combines the Dyna architecture with Q-learning. It learns a world model to generate hypothetical experience and achieve planning. [10]

ii. Queue-Dyna - Value function estimates are prioritized. The ones with the highest priority are put on a queue and performed. Places where the value function needs to be updated are identified as "update candidates". Two methods are proposed: prediction difference, in which the priority depends on the magnitude of the predicted difference in value, and effect on start-state value, in which the priority depends on the contribution of update candidates to the value of a fixed start-state. [11]

iii. AHC-Relaxation Planning - Combines an adaptive heuristic critic (AHC) architecture with relaxation planning. AHC is an architecture that takes into account the effect of delayed rewards. [12] "Relaxation planning, which is closely related to dynamic programming, is an incremental planning process that consists of a series of shallow (usually one-step look-ahead) searches and ultimately produces the same results as a conventional deep search." [13]

iv. Q-Relaxation Planning - Is "similar to Sutton's Dyna-Q architecture except that only the currently visited state is used to start hypothetical experiences." [13]

v. Exploration Planning [14]

(b) Based on Prioritized Sweeping [15] - The idea is to work backwards from states that have big changes in their value estimation. As state-action pairs are estimated, predecessor states are put on a priority queue according to the size of their potential back-up value. This is repeated a number of times or until the queue is empty.
i. Generalized Prioritized Sweeping [16] - Introduces the Generalized Prioritized Sweeping principle (GenPS), which states: "Update states where the approximation of the value function will change the most. That is, update the states with the largest Bellman error, $E(s) = V(s) - \max_{a \in A} Q(s, a)$."

ii. Structured Prioritized Sweeping [17]

(c) Based on Other - Other interesting approaches include PLANQ-learning [18], Reinforcement Learnt-TOPs [19], and Teleo-Reactive Q-Learning [20, 21]. These, however, differ in the planning component, which is different from the BDI model.

(b) Extraction of Planning Knowledge from Reinforcement Learners - A process for extracting plans is presented after using an RL algorithm to learn a policy [23, 24]. The process presented is "concerned with the ability to plan in an uncertain environment where usually knowledge about the domain is required. Sometimes it is difficult to acquire this knowledge, it may be impractical or costly and thus an alternative way is necessary to overcome this problem." [7] Plans are extracted by successively calculating the probabilities of each action reaching the goal state from an initial state. The plan is selected greedily from the probabilities calculated.
(c) RL Approach to Production Planning [25] - RL, through the use of Q-learning and Monte Carlo simulation, is used to "solve a multi-period production planning problem in a two stage hybrid manufacturing process (a combination of build-to-plan with build-to-order) with a capacity constraint." All references are as cited in [7].
Partalas ends the survey summarizing the approaches for combining learning and planning as "first learn then plan", "first plan then learn" or "interchange learning and planning".

CHAPTER 4
Belief-Desire-Intention (BDI) Agent Systems Framework

Introduction
BDI is a reactive agent framework based on the idea of intentionality [1]. BDI agents have beliefs, desires and intentions [1], [2]. Beliefs are representations of what the agent believes true in its world. Desires represent the state of the world that the agent would like to achieve. Intentions represent what the agent intends to do.
BDI implementations have a library of plans that are triggered depending on the beliefs and desires of the agent. The plans are designed by human experts in the domain to which BDI is being applied.
One of the weaknesses of BDI is its lack of learning capabilities. Researchers have addressed this weakness by augmenting the BDI framework with various learning frameworks including decision trees [3], self-organizing neural networks [4], [5], hybrid-architectures using low level learners, and metaplans for plan hypothesis abduction and plan modifications [4].

AgentSpeak
AgentSpeak is a programming language for BDI agents, based on logic programming, proposed by Rao [6]. It was inspired by the Procedural Reasoning System (PRS), an early BDI architecture with a plan library and explicit symbolic representations of beliefs, desires, and intentions [7], the distributed Multi-Agent Reasoning System (dMARS) [8], and BDI logics [2].

AgentSpeak Components
The architecture of an AgentSpeak agent has four main components as shown in table 2.

Belief Base
Plan Library
Set of Events
Set of Intentions
Table 2. AgentSpeak Components

This study is motivated by precisely this weakness of a fixed a-priori plan library. The idea is to find ways of generating plans without relying on a-priori knowledge.

AgentSpeak Constructs
The main language constructs of AgentSpeak are shown in table 3.

AgentSpeak Syntax
AgentSpeak is based on logic programming. Its syntax for beliefs, desires (goals), and intentions (plans) reflects the underlying logic programming paradigm.
Beliefs represent the information available to an agent about the environment or other agents. Beliefs are represented in symbolic form by predicates.
The belief that wiley is the publisher, for example, is represented as:

publisher(wiley)
Goals represent states of affairs the agent wants to bring about (to come to believe, when goals are used declaratively). There are two types of goals: achievement goals and test goals, as shown in table 4.

AgentSpeak Types of Goals
Achievement
Test
Table 4. AgentSpeak Types of Goals

Achievement goals are used when attempting to change the belief base. An example of an achievement goal is shown below.

!write(book)
The ! makes the goal an achievement goal. In this case, the agent's goal is to write a book.
Test goals are used when attempting to retrieve information from the belief base. An example of a test goal is shown below.

?publisher(P)

The ? makes the goal a test goal. In this case, the agent wants to find the publisher and bind it to the variable P.
A BDI agent reacts to events in the environment by executing plans. Events happen as a consequence of changes in the agent's beliefs or goals. Plans are "recipes for action, representing the agent's know-how".
An AgentSpeak plan has the following general structure: triggering_event : context <-body.
The triggering event denotes the events that the plan is meant to handle.
The context represents the circumstances in which the plan can be used. The body is the course of action to be used to handle the event if the context is believed true at the time a plan is being chosen to handle the event.
Consider the AgentSpeak plan fragment shown below. For this plan, if the agent finds itself in state(0,0,0,0) and the context calculation(done) is true, the agent will drop the belief calculation(done), perform action(n), and add the achievement goal !calculateReturn.
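The fragment referred to is not reproduced here; based on the description above, it has roughly the following form (a reconstruction for illustration, not the verbatim listing from the implementation):

+state(0,0,0,0) : calculation(done)
    <- -calculation(done);
       action(n);
       !calculateReturn.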

Jason
Jason is a Java-based interpreter for an extended version of AgentSpeak. It implements the operational semantics of AgentSpeak and provides a platform for the development of multi-agent systems. Jason was developed and is currently maintained by Jomi F. Hübner and Rafael H. Bordini [9], [10], [11], [12].

CHAPTER 5
Previous Research

Introduction
The research described in Chapters 2, 3, and 4 explores several of the elements proposed as part of this thesis: use of RL, learning plans on BDI systems, and extraction of BDI plans from MDPs and POMDPs.
A more detailed overview of the previous research whose elements come closest to the work done as part of this thesis is given in Section 5.2. Section 5.3 summarizes the research and highlights the key differences.

Research on BDI and Learning
An extensive review of the existing literature in RL and BDI did not uncover any research that made use of RL to learn BDI plans without relying on a-priori knowledge.
As discussed in the Introduction in Chapter 1, the lack of learning capabilities in BDI systems was recognized as far back as 2004 [1]. Researchers tackled this by augmenting the BDI framework with various learning frameworks, including decision trees, self-organizing neural networks, hybrid architectures using low-level learners, and metaplans for plan hypothesis abduction and plan modification. Other relevant research tackled the use of a-priori knowledge, previously learned knowledge, and the learning of plans without a-priori knowledge in planning systems, as well as the integration of learning, planning, and execution. These studies were, however, not investigated in relation to BDI systems.
Recent research relied on Markov Decision Processes (MDPs) to generate BDI plans from optimal policies for completely specified MDPs [2]. Pereira's work was later extended to work with Partially Observable Markov Decision Processes (POMDPs) using the Witness algorithm [3]. These two studies come closest to the work done in this thesis. The difference is that neither fully specified MDPs nor POMDPs were considered in this thesis.
Besides Pereira's work, the research done by Karim, Plans as Products of Learning [4], is also very similar to the idea studied in this thesis. The difference is that Karim used a learning approach that did not rely on RL.
The next sections describe the details of the work done by Karim, Singh, Dixon, Tan, and Pereira.

Plans as Products of Learning [4]
• Target: plan learning and plan improvement

The study investigated a hybrid approach combining a low-level learner with a high-level BDI-based knowledge extractor and executor called the plan generation subsystem (PGS). A PGS algorithm is presented that relies on a-priori clues provided by a domain expert. The low-level learner used was FALCON, a self-organizing neural network. Also investigated was a second approach that used a hypothesis generator to amend existing BDI plans by suggesting and executing plans and updating intentions accordingly.
Two examples were investigated. In the hybrid approach a predator-prey (or pursuit) task was used: four predator agents and one prey agent in a non-toroidal grid. The task to be learned was for the four predator agents to surround the prey agent on all four sides without any communication. In the hypothesis-generator approach a rat world (operant conditioning) task was used. A special BDI interpreter that supported the generation of metaplans for abductions and plan modifications at runtime was used.
Learning Context Conditions for BDI Plans [5]
• Target: conditions for plan selection
• Model: probabilistic
• Learning Element: decision trees
• Goal: learn the probability of success for plans

This study highlights the lack of learning from experience particular to the BDI framework, as well as the limitations of plan selection that relies on boolean formulas specified at design/implementation time as part of a plan library. It explores intelligent plan selection using feedback from plan success or failure to build decision trees that provide the probability of success of plans.
Other relevant research, not directly related to BDI systems, tackled the use of a-priori knowledge, previously learned knowledge, and the learning of plans without a-priori knowledge in planning systems, and the integration of learning, planning, and execution.

Incorporating Prior Knowledge and Previously Learned Information into Reinforcement Learning Agents [6]
• Target: off-policy controller
• Model: hybrid
• Learning Element: hybrid
• Goal: incorporate prior knowledge and previously learned information into RL agents
This study is concerned with the limits and appropriateness of tabula rasa learning and suggests a framework to incorporate a-priori knowledge and previously learned knowledge. The author points out that learning tabula rasa might not be "appropriate" for two reasons:
1. system designers may have already embedded some domain-specific knowledge
2. the agent may have already learned the task

To solve this problem the author proposes an off-policy controller that uses modularized "prior knowledge sources" (PKSs) as inputs to an "exploration control module". An additional contribution of this study is the discussion of two terms that are of interest when designing RL systems:
1. state-space deficiency - refers to features of the state-space that the agent is not able to observe and is thus unable to learn when instructed by a PKS
2. representational deficiency - refers to PKSs that use a history of events to make a decision. Reactive RL agents will not be able to learn this task.
The study proposed two approaches, incremental learning in which a "large, complex task is decomposed into smaller sub-tasks" with the idea that "solving the sub-tasks may be easier than solving the entire task" and composable skill synthesis in which "a problem is broken down into a set of basic skills that the agent must possess in order to complete a task".
Two examples were investigated. For the incremental learning process a simulated robot domain task was used which consisted of mobile robot tag. In this task, the agents learn a controller that is then used as a PKS to the next step in the learning process. The task was incrementally learned by first learning how to score with no defense then learning how to score with a single runner and a single defender, and finally by learning how to score with two runners and a single defender. No communication was allowed between the two runners.
In the composable skill synthesis approach a grid world task was used. The task's goal was for the agent to move from its start state to a goal state without bumping into any walls. The set of source skills used as PKSs was object avoidance (a priori) and goal homing (learned in wall-less gridworld).
FALCON: A Fusion Architecture for Learning, COgnition, and Navigation [7]
• Target: fusion architecture

A simulated minefield navigation task was used to test FALCON. In the task, an autonomous vehicle (AV) started in a random position. The objective was to navigate to a randomly selected target position in a specified time frame without hitting a mine. The AV was equipped with five sonar sensors that provided coarse sonar data. The target and mines were stationary.
Um Algoritmo para Extração de um Plano BDI que Obedece uma Política Ótima (An Algorithm for Extracting BDI Plans from an Optimal Policy) [2]

Pereira presents an analysis of a hybrid BDI-MDP approach and introduces the algorithm policyToIplan, which extracts BDI plans from optimal policies. The analysis is based on completely specified MDPs.
Constructing BDI Plans from Optimal POMDP Policies, with an Application to AgentSpeak Programming [3]
• Target: BDI plans

Pereira presents an analysis of a hybrid BDI-POMDP approach and introduces the algorithm policyToBDIplan, which extracts BDI plans from optimal POMDP policies.

Differences Between Proposed Study and Previous Research
Section 5.2 discussed the research whose elements come closest to the work done under this thesis. There are, however, several key differences.
The goal of this thesis is to use reinforcement learning to generate plans without a-priori knowledge in BDI agent systems. The key idea is that the result of reinforcement learning is a policy, or policies in the general case. Since policies map states to actions, the policies can then be used as input to generate plans in BDI agent systems. The approach can be summarized as a two-step process:
1. Use reinforcement learning as the learning module.
2. Use policies learned as input to generate BDI plans.
None of the previous work combines the elements of RL for plan generation in BDI agent systems. The problem selected for study in this thesis is justified by this lack of research exploring the generation of plans in BDI systems using reinforcement learning that does not rely on a-priori knowledge.

CHAPTER 6
Experimental Implementation

RL Problem Selection
The RL problem selected was based on the problem by Poole and Mackworth [1], described at: http://artint.info/html/ArtInt_262.html#davids-simple-game-ex

RL Problem Description
The original RL problem description is included verbatim from Poole and Mackworth: "There are 25 grid locations the agent could be in. A prize could be on one of the corners, or there could be no prize. When the agent lands on a prize, it receives a reward of 10 and the prize disappears. When there is no prize, for each time step there is a probability that a prize appears on one of the corners. Monsters can appear at any time on one of the locations marked M. The agent gets damaged if a monster appears on the square the agent is on. If the agent is already damaged, it receives a reward of -10. The agent can get repaired (i.e., so it is no longer damaged) by visiting the repair station marked R.
In this example, the state consists of four components: < X, Y, P, D >, where X is the X-coordinate of the agent, Y is the Y-coordinate of the agent, P is the position of the prize (P=0 if there is a prize on P0, P=1 if there is a prize on P1, similarly for 2 and 3, and P=4 if there is no prize), and D is Boolean and is true when the agent is damaged. Because the monsters are transient, it is not necessary to include them as part of the state. There are thus 5 * 5 * 5 * 2 = 250 states. The environment is fully observable, so the agent knows what state it is in. But the agent does not know the meaning of the states; it has no idea initially about being damaged or what a prize is.
The agent has four actions: up, down, left, and right. These move the agent one step -usually one step in the direction indicated by the name, but sometimes in one of the other directions. If the agent crashes into an outside wall or one of the interior walls (the thick lines near the location R), it remains where it was and receives a reward of -1.
The agent does not know any of the story given here. It just knows there are 250 states and 4 actions, which state it is in at every time, and what reward was received each time.
This game is simple, but it is surprisingly difficult to write a good controller for it."

RL Problem Terminology Changes
For this research I changed some of the terminology. The following terms are used: Prizes are referred to as rewards. While this term is also used as part of the RL terminology, it should be clear from context when it refers to RL rewards in general (which can be negative) and when it refers to rewards as the main goal for the agent. Monsters were replaced by damage positions. The reward set is R = {reward = 10, wall bump = −1, damage = −1000}.
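To make the state-space of this task concrete before turning to the implementation, the following Java sketch shows one possible encoding of the < X, Y, P, D > state components and the reward set above; the class name and the indexing scheme are illustrative assumptions, not the thesis implementation.

// Illustrative encoding of the <X, Y, P, D> state and the reward set.
public class GameState {
    // Reward set R = {reward = 10, wall bump = -1, damage = -1000}.
    public static final double REWARD_PRIZE = 10;
    public static final double REWARD_WALL_BUMP = -1;
    public static final double REWARD_DAMAGE = -1000;

    public static final int NUM_STATES = 5 * 5 * 5 * 2;  // 250 states
    public static final int NUM_ACTIONS = 4;             // up, down, left, right

    // x, y in 0..4; reward position p in 0..4 (4 = no reward present); damaged flag d.
    public static int index(int x, int y, int p, boolean damaged) {
        int d = damaged ? 1 : 0;
        return ((x * 5 + y) * 5 + p) * 2 + d;   // unique index in 0..249
    }
}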

RL Problem Implementation
The RL problem was implemented in Java following the RL learning model seen in Chapter 3 and in figure 6.

Figure 6. Agent-Environment Model [2]

The code contains 4 main classes:
1. Agent
2. Environment
3. Simulation
4. GUI

The agent is initialized and then used to initialize the environment. The simulation thread runs the loop that implements the learning functionality by having the agent interact with the environment under diverse control parameters.
The GUI is the entry point because it provides visualization of the agent actions, rewards, resulting states, and learning process. The learning functionality for the RL agent was implemented using the Q learning algorithm. Refer to Section 3.9, Table 1 for the pseudocode for Q learning.
Once the simulation starts, the following events take place: 1. An equiprobable initial policy is generated.
2. The agent explores the environment for 1000 episodes. Each episode lasts 1000 steps.
3. The learned policy is written in text format to a file. The policy is also serialized and written to file for use in the BDI environment.

RL Initial Policies
The initial equiprobable policy is the same for all configurations. The effect of this policy is that when the simulation runs the agent starts without any bias toward any of the possible actions. All actions are equally probable and have the same utility.

Figure 8. Initial Equi-probable Policy for All Learning Configurations
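As a small illustrative sketch (the class and method names are assumptions, not the thesis code), an equiprobable policy can be constructed by giving every action in every state the same selection probability:

// Illustrative construction of an equiprobable initial policy:
// every action in every state has the same selection probability.
public class EquiprobablePolicy {
    public static double[][] create(int numStates, int numActions) {
        double[][] policy = new double[numStates][numActions];
        for (int s = 0; s < numStates; s++) {
            for (int a = 0; a < numActions; a++) {
                policy[s][a] = 1.0 / numActions;   // no bias toward any action
            }
        }
        return policy;
    }
}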

BDI Implementation
The BdiEnvironment class extends Jason's Environment class and is the entry point into Jason's infrastructure. The EnvironmentMechanics class is reused from the RL implementation to provide actual feedback from the environment resulting from the BDI agent's actions. The AgentGenerator class implements the main functionality of this research; its purpose is to convert learned RL policies into BDI agents. The GUI class is also reused from the RL implementation to provide visualization of the BDI agent actions, rewards, and resulting states.

BDI Agent Generation
The AgentGenerator class takes the serialized policy learned by the RL agent and converts it into a BDI agent.
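A minimal sketch of how such a conversion could work is shown below, assuming the learned policy is available as a serialized mapping from state indices to greedy actions; the class name, file name policy.ser, decoding scheme, and plan-text details are illustrative assumptions, not the actual AgentGenerator code.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

// Illustrative policy-to-AgentSpeak conversion: one plan per state,
// each plan performing the greedy action learned by the RL agent.
public class PolicyToAgentSpeakSketch {
    private static final String[] ACTION_NAMES = {"up", "down", "left", "right"};

    public static String toAgentSpeak(int[] greedyActions) {
        StringBuilder plans = new StringBuilder();
        for (int state = 0; state < greedyActions.length; state++) {
            // Decode the state index back into <x, y, p, d> (assumed encoding).
            int d = state % 2, rest = state / 2;
            int p = rest % 5;
            rest /= 5;
            int y = rest % 5;
            int x = rest / 5;
            plans.append("+state(").append(x).append(",").append(y).append(",")
                 .append(p).append(",").append(d).append(") : calculation(done)\n")
                 .append("    <- -calculation(done); ")
                 .append("action(").append(ACTION_NAMES[greedyActions[state]]).append("); ")
                 .append("!calculateReturn.\n");
        }
        return plans.toString();
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Read the serialized greedy policy written by the RL simulation (assumed file name).
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("policy.ser"))) {
            int[] greedyActions = (int[]) in.readObject();
            System.out.println(toAgentSpeak(greedyActions));
        }
    }
}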

CHAPTER 7
Results
This chapter presents the results of RL agents learning using the Q learning algorithm and the use of such learned policies to generate plans for BDI agents without relying on a-priori knowledge.
The Q learning algorithm resulted in RL learned policies that are presented and discussed in Section 7.1 RL Learned Policies. These policies map the greedy action(s) agents need to select in each state and encapsulate the knowledge learned through exploration-exploitation of the environment. Learned policies can show interesting, even counter-intuitive behavior that might not make sense at first. Interpreting learned policies requires keeping track of the actual state of the agent to be able to reference the applicable policy. Section 7.2 Interpretation of RL Learned Policies discusses how to interpret policies and what could happen when it is done incorrectly. Learned policies are highly dependent on the structure of the reward set, and this affects the agent's behavior. The interrelationship of these factors is explored in Section 7.3 Reward Set, Learned Policies, and Agent Behavior. Three environment-agent configurations were tested. These are shown in Table 6. The performance of RL agents during the 1000-episode exploration phase (random action selection) and the 1000-episode ε-greedy phase (90% greedy and 10% random action selection), for all three configurations tested, is shown in Section 7.4 RL Results. The successful use of RL learned policies to generate BDI agents without relying on a-priori knowledge is the topic of Section 7.5 BDI Agent Generation. It includes the generation process and provides an example fragment of the AgentSpeak code generated. The performance of the BDI agents is presented and discussed in Section 7.6 BDI Agent Results. The results of a randomly generated BDI agent, used to examine what happens when a BDI agent blindly follows plans without being able to learn, are presented and discussed in Section 7.7 Random BDI Agent.

RL Learned Policies
Running the Q learning algorithm resulted in learned policies through repeated exploration-exploitation of the environment. Starting from an equiprobable policy, the agent-environment interactions resulted in changes to the action values for most of the actions in the action set. The selection of the greedy action for every state then determined the policies approximating the optimal policies.
The resulting learned policies are different for all RL agents. This is no surprise given that the RL agents tested were working in different environments. Visual examination of the policies shows that the actions for many states have collapsed to a single optimal action. States with more than one action result from those actions having action values that are tied, or from the agent being unable to explore the environment enough to select an optimal action. In general, when the reward position is at position 0, the agent tends to move WEST and NORTH. When the reward position is at position 1, the agent tends to move EAST and NORTH. When the reward position is at position 2, the agent tends to move WEST and SOUTH. Finally, when the reward position is at position 3, the agent tends to move EAST and SOUTH.

In this configuration the agent, similarly to configuration 001, learns how to deal with the damage positions in the environment. The difference between configurations 001 and 002 is that in configuration 001, when the agent steps on a damage position, there is a 50% probability that the agent will not change from undamaged to damaged status. This allows the agent to follow the shortest routes between reward positions as long as it continues to have undamaged status. In configuration 002, stepping on a damage position changes the agent from undamaged to damaged status with 100% probability. As a result, the agent learns to avoid the damage positions at the cost of having to follow longer routes between the reward positions.

Interpretation of RL Learned Policies
One must be careful when interpreting the RL learned policies. Let us examine an example to show how the policies work and how easy it can be to miss a state change that changes the policy in use.
Learned policies can show interesting, even counter-intuitive behavior that is highly dependent on the structure of the reward set. This is the topic discussed more thoroughly in the next section, section 7.3.

Reward Set, Learned Policies, and Agent Behavior
Applying RL to learning tasks can result in interesting and sometimes even counter-intuitive policies that are highly dependent on the structure of the reward set. RL researchers and practitioners need to be careful in selecting the reward structure such that the right behavior is learned.
The following example comes directly from the research done for this thesis.
The RL task description has a repair station where the agent can repair itself if it is damaged. During one of the learning experiments, a reward was set up to reward the agent for reaching the repair station when damaged. The problem with this approach was that it assigned a high reward to the repair action relative to the (negative) reward for the agent getting damaged. The agent learned to exploit this reward differential to get higher rewards: the net reward of getting damaged and then repaired was positive.
This resulted in the following clever behavior by the agent. The agent would get damaged, which would change its status from undamaged (U) to damaged (D). It would follow by going to the repair position to get repaired and receive the reward for changing its status from damaged (D) to undamaged (U). The reward positions were being ignored. The agent learned a behavior that maximized its average returns even though it was not learning what I was intending it to learn.
"The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished.
In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game."

RL Results
All three RL agents learned policies approximating the optimal policies after 1,000 episodes of simulation experience (each episode lasting 1,000 steps).
The results for the RL agent in configuration 000 are shown in figure 21.
In this configuration all damage probabilities were set to 0. With damage probabilities of 0 the agent is free to move around, and the only negative rewards come from the agent bumping into the walls in the environment. As a result, the average returns for this agent are higher than those in configurations 001 and 002. The agent shows an average return of approximately −100 in the exploration phase and 2750 in the ε-greedy phase.
The results for the RL agent in configuration 001 are shown in figure 22. In this configuration all damage probabilities were set to 0.5. With damage probabilities > 0, the agent is not free to move around without the possibility of negative rewards due to damage and bumping into the walls. The average returns clearly show it, with approximately −9500 in the exploration phase and −1000 in the ε-greedy phase.
The results for the RL agent in configuration 002 are shown in figure 23.
In this configuration all damage probabilities are set to 1. The agent is certain to have negative rewards due to damage and bumping into the walls.
This results in even lower average returns of −19000 in the exploration phase and −2500 in the ε-greedy phase.
Why the negative returns when operating in the ε-greedy phase for all three configurations? The negative returns result from the implementation of the ε-greedy phase with 90% greedy and 10% random action selection. Even though the agents are acting greedily, they do so only 90% of the time. The remaining 10% of the time they choose random actions.
The results for all 3 RL agents combined in a single graph are shown in figure 24.
Initial beliefs return(0) and calculation(done) are used to indicate that the agent believes its reward return is 0 and that it is done calculating reward returns.
The plans consist of observing the state, performing the optimal action (ties between actions are broken randomly during agent generation), observing the reward, and calculating the return.
A total of 3 agents were generated: bdi agent000, bdi agent001, and bdi agent002. The full listings are included as an appendix.

BDI Agent Results
After generating the BDI agents, they were evaluated in their environments for 200 episodes lasting 1000 steps each.
The results for bdi agent000, generated from the learned policy of the RL agent in configuration 000, are shown first, followed by the results for the other two agents. The results for all 3 BDI agents combined in a single graph are shown in figure 28.
Why are the average returns positive for the BDI agents and negative for the RL agents? This results from the implementation of the BDI AgentGenerator class. The RL agents use an ε-greedy phase with 90% greedy and 10% random action selection, meaning that they still perform random actions that can result in negative rewards. The BDI agents were implemented by selecting the greedy action for all possible states; they do not perform any random action selection.
In other words, the RL agents used an exploration-exploitation strategy while the BDI agents used only an exploitation strategy.
The difference in average returns for configuration 000 versus 001 and 002 can be explained by the fact that the agent is not concerned with damage positions in configuration 000; it will never get damaged. In contrast, the agents in configurations 001 and 002 have to contend with the damage positions, which lowers their average returns.

Random BDI Agent
A random BDI agent was created by randomly selecting, for each state, one of the available actions as its plan action. After a few iterations it became clear how easy it was for the resulting agent to fall into loops that were not just sub-optimal but completely bad. The agent gets stuck in a 2-step cycle, stepping over a damage position over and over. Because the agent's behavior is hard-coded into its plans, the agent is unable to learn. The end result is a pathological loop with returns of −500,000. Figure 29 shows the initial sequence that ends in a fatal loop for the random BDI agent.

Thesis Research Results
The use of RL as a way to generate plans for BDI agents without relying on a-priori knowledge was successfully demonstrated.
The RL agents explored and exploited their environment, resulting in policies that approximated optimal policies and performed much better than random walks. The BDI agents did not exist until they were generated by the AgentGenerator using the policies learned by the RL agents. Once the BDI agents' plans could be tested against the environment, they were shown to perform successfully.

CHAPTER 8 Discussion
This chapter discusses the contributions of this research, its limitations, and future work that can expand and augment the work done as part of this thesis.

Contributions
The major contribution of this research is the demonstration of the practical and successful use of RL, a computational learning framework based on the idea of learning from repeated interactions with the environment, to generate plans in BDI systems without relying on a-priori knowledge.
The ability to learn and not follow the same plans over and over is arguably a key characteristic of intelligent systems. The results of this thesis provide another tool to augment BDI agents with learning capabilities. One of the benefits of using RL is that plans need not be fixed. RL can learn online and plans generated can change in response to changes in the environment.

Limitations
Even though the generation of BDI plans without relying on a-priori knowledge was successfully demonstrated, there remains a big limitation: how to express the BDI plans from learned RL policies. Human input is still necessary to decide the plans' expression or representation. There are arguably infinite ways to express an equivalent plan.
The situation is analogous to writing a compiler for a higher-level language: somebody needs to decide the target language into which the source is compiled.