Policy¶
Policy.py - abstract class for all policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.DiaAct
import utils.ContextLogger
import ontology.OntologyUtils
import policy.SummaryAction
class policy.Policy.Action(action)¶
Dummy class representing one action. Used for recording; may be overridden by a sub-class.
class policy.Policy.Episode(dstring=None)¶
An episode encapsulates the state-action-reward triplets which may be used for learning. Every entry represents one turn. The last entry should contain TerminalState and TerminalAction.
check()¶
Checks whether the internal state, action, and reward lists have equal length.
getWeightedReward()¶
Returns the reward weighted by normalised accumulated weights. Used for multi-agent learning in a committee.
Returns: the reward weighted by normalised accumulated weights
record(state, action, reward, ma_weight=None)¶
Stores the given state, action, and reward in the internal lists.
tostring()¶
Prints the state, action, and reward lists to screen.
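A hedged usage sketch of the episode container, based only on the record()/check()/tostring() interface documented above; the domain tag, state dictionaries, and act strings are placeholders:

```python
from policy.Policy import Episode, TerminalState, TerminalAction

# One episode per dialogue; 'CamRestaurants' is an illustrative domain tag.
ep = Episode(dstring='CamRestaurants')

# One entry per turn: state, action, reward (placeholder values shown here).
ep.record(state={'area': 'north'}, action='request(food)', reward=-1)
ep.record(state={'area': 'north', 'food': 'thai'}, action='inform(count=5)', reward=-1)

# The last entry should carry the terminal markers and the final reward.
ep.record(state=TerminalState(), action=TerminalAction(), reward=20)

ep.check()      # verify that the state, action, and reward lists have equal length
ep.tostring()   # print the recorded lists to screen
```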
class policy.Policy.EpisodeStack(block_size=100)¶
A handler for episodes. Required if the stack is to become very large: one may not want to hold all episodes in memory, but write them out to file instead.
add_episode(domain_episodes)¶
Items on the stack are dictionaries of episodes keyed by domain (since, with BCM, the system can learn from two or more domains if a multi-domain dialogue occurs).
retrieve_episode(episode_key)¶
NB: this should probably be an iterator, using yield rather than return.
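A small hedged sketch of how episodes might be pushed onto and retrieved from the stack, following the descriptions above; the domain tags and the key format are assumptions:

```python
from policy.Policy import Episode, EpisodeStack

stack = EpisodeStack(block_size=100)

# Each stack item is a dictionary mapping domain tags to the Episode recorded
# for that domain during one (possibly multi-domain) dialogue.
stack.add_episode({'CamRestaurants': Episode('CamRestaurants'),
                   'CamHotels': Episode('CamHotels')})

# Retrieval is by key; the key value used here is purely illustrative.
episode = stack.retrieve_episode(episode_key=0)
```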
class policy.Policy.Policy(domainString, learning=False, specialDomain=False)¶
Interface class for a single-domain policy. Responsible for selecting the next system action and handling the learning of the policy.
To create your own policy model or to change the state representation, derive from this class (see the sketch after the method list below).
act_on(state)¶
Main policy method: maps the belief state to a system action.
This method is automatically invoked by the agent at each turn after the belief state has been tracked.
May initially return 'hello()' as a hard-coded action. Keeps track of the last system action and the last belief state.
Parameters: state (DialogueState) – the belief state to act on
Returns: the next system action of type DiaAct
convertStateAction(state, action)¶
Converts the given state and action to policy-specific representations.
By default, the generic classes State and Action are used. To change this, override this method in a sub-class.
Parameters:
- state (anything) – the state to be encapsulated
- action (anything) – the action to be encapsulated
finalizeRecord(reward, domainInControl=None)¶
Records the final reward along with the terminal system action and terminal state. To change the type of state/action, override convertStateAction().
This method is automatically executed by the agent at the end of each dialogue.
Parameters:
- reward (int) – the final reward
- domainInControl (str) – used by committee: the unique domain string identifier of the domain this dialogue originates in, optional
Returns: None
nextAction(beliefstate)¶
Interface method for selecting the next system action. Should be overridden by a sub-class.
This method is automatically executed by act_on() and is thus called at each turn.
Parameters: beliefstate (dict) – the state the policy acts on
Returns: the next system action
record(reward, domainInControl=None, weight=None, state=None, action=None)¶
Records the current turn reward along with the last system action and belief state.
This method is automatically executed by the agent at the end of each turn.
To change the type of state/action, override convertStateAction(). By default, the last master action is recorded. If you want another action to be recorded, e.g. a summary action, assign the respective object to self.actToBeRecorded in a derived class.
Parameters:
- reward (int) – the turn reward to be recorded
- domainInControl (str) – the unique domain string identifier of the domain the reward originates in
- weight (float) – used by committee: the weight of the reward in case of multi-agent learning
- state (dict) – used by committee: the belief state to be recorded
- action (str) – used by committee: the action to be recorded
Returns: None
restart()¶
Restarts the policy and resets internal variables.
This method is automatically executed by the agent at the beginning/end of each dialogue.
savePolicy(FORCE_SAVE=False)¶
Saves the learned policy model to file. Should be overridden by a sub-class.
This method is automatically executed by the agent either at certain intervals or at least before shutting down the agent.
Parameters: FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when the agent is being powered off.
train()¶
Interface method for initiating training. Should be overridden by a sub-class.
This method is called at the end of each dialogue by the PolicyManager if learning is enabled for the given domain policy.
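As referenced above, a minimal sketch of a custom policy derived from this interface, assuming only the methods documented in this section; the act strings returned by nextAction() are placeholders (a real policy may be expected to return DiaAct or summary-action objects), and the greeting flag is the sketch's own bookkeeping:

```python
# Illustrative sub-class of the Policy interface; not part of PyDial.
from policy import Policy


class MyRulePolicy(Policy.Policy):

    def __init__(self, domainString, learning=False):
        super(MyRulePolicy, self).__init__(domainString, learning)
        self.greeted = False            # own bookkeeping, not part of the interface

    def nextAction(self, beliefstate):
        # Called by act_on() at every turn with the tracked belief state.
        if not self.greeted:
            self.greeted = True
            return 'hello()'            # hard-coded greeting, as described above
        return 'request(area)'          # hypothetical slot name, for illustration only

    def restart(self):
        super(MyRulePolicy, self).restart()
        self.greeted = False

    def train(self):
        # A hand-coded rule has nothing to learn; a learning policy would update
        # its model here from the recorded state-action-reward episode.
        pass
```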
class policy.Policy.State(state)¶
Dummy class representing one state. Used for recording; may be overridden by a sub-class.
class policy.Policy.TerminalAction¶
Dummy class representing one terminal action. Used for recording; may be overridden by a sub-class.
class policy.Policy.TerminalState¶
Dummy class representing one terminal state. Used for recording; may be overridden by a sub-class.
PolicyManager.py - container for all policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import ontology.Ontology
import ontology.OntologyUtils
class policy.PolicyManager.PolicyManager¶
The policy manager manages the policies for all domains.
It provides the interface to get the next system action based on the current belief state in act_on() and to initiate the learning of the policy in train().
_check_committee(committee)¶
Safety tool: checks some logical requirements on the list of domains given by the config.
Parameters: committee (PolicyCommittee) – the committee to be checked
_load_committees()¶
Loads and instantiates the committee as configured in the config file. The new object is added to the internal dictionary.
_load_domains_policy(domainString=None)¶
Loads and instantiates the respective policy as configured in the config file. The new object is added to the internal dictionary. Default is 'hdc'.
Parameters: domainString (str) – the domain the policy will work on. Default is None.
Returns: the new policy object
act_on(dstring, state)¶
Main policy method which maps the provided belief state to the next system action. This is called at each turn by the DialogueAgent.
Parameters:
- dstring (str) – the unique domain string identifier
- state (DialogueState) – the belief state the policy should act on
Returns: the next system action as DiaAct
bootup(domainString)¶
Loads a policy for a given domain.
finalizeRecord(domainRewards)¶
Records the final rewards of all domains. In case of a committee, the recording is delegated.
This method is called once at the end of each dialogue by the DialogueAgent. (One dialogue may contain multiple domains.)
Parameters: domainRewards (dict) – a dictionary mapping from domains to final rewards
Returns: None
getLastSystemAction(domainString)¶
Returns the last system action of the specified domain.
Parameters: domainString (str) – the unique domain string identifier
Returns: the last system action of the given domain, or None
printEpisodes()¶
Prints the recorded episode of the current dialogue.
record(reward, domainString)¶
Records the current turn reward for the given domain. In case of a committee, the recording is delegated.
This method is called each turn by the DialogueAgent.
Parameters:
- reward (int) – the turn reward to be recorded
- domainString (str) – the unique domain string identifier of the domain the reward originates in
Returns: None
restart()¶
Restarts all policies of all domains and resets internal variables.
savePolicy(FORCE_SAVE=False)¶
Initiates saving of the policies of all domains.
Parameters: FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when the agent is being powered off.
train(training_vec=None)¶
Initiates training for the policies of all domains. This is called at the end of each dialogue by the DialogueAgent.
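A hedged sketch of the per-turn and per-dialogue call sequence an agent might drive through the manager, based only on the methods documented above; the domain tag, belief-state object, and reward values are placeholders:

```python
# Illustrative driver loop around PolicyManager; not the actual DialogueAgent code.
from policy.PolicyManager import PolicyManager

manager = PolicyManager()
domain = 'CamRestaurants'                 # illustrative domain tag
dialogue_state = None                     # placeholder for a DialogueState object

manager.bootup(domain)                    # load the policy for this domain

# --- each turn ---
sys_act = manager.act_on(domain, dialogue_state)   # belief state -> system action
manager.record(reward=-1, domainString=domain)     # per-turn reward

# --- end of dialogue ---
manager.finalizeRecord({domain: 20})      # final reward per domain
manager.train()                           # only has an effect if learning is enabled
manager.savePolicy()
```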
PolicyCommittee.py - implementation of the Bayesian committee machine for dialogue management¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import utils.DiaAct
class policy.PolicyCommittee.CommitteeMember¶
Base class defining the interface methods which are needed in addition to the basic functionality provided by Policy.
Committee members should derive from this class.
abstract_actions(actions)¶
Converts a list of domain acts to their abstract form.
Parameters: actions (list of actions) – the actions to be abstracted
getMeanVar_for_executable_actions(belief, abstracted_currentstate, nonExecutableActions)¶
Computes the mean and variance of the Q-value based on the abstracted belief state for each executable action.
Parameters:
- belief (dict) – the unabstracted current domain belief
- abstracted_currentstate (State or subclass) – the abstracted current belief
- nonExecutableActions (list) – actions which are not selected for execution based on the heuristic
getPriorVar(belief, act)¶
Returns the prior variance for a given belief and action.
Parameters:
- belief (dict) – the unabstracted current domain belief state
- act (str) – the unabstracted action
get_Action(action)¶
Converts the unabstracted domain action into an abstracted action to be used for multi-agent learning.
Parameters: action (str) – the last system action
get_State(beliefstate, keep_none=False)¶
Converts the unabstracted domain state into an abstracted belief state to be used with getMeanVar_for_executable_actions().
Parameters: beliefstate (dict) – the unabstracted belief state
unabstract_action(actions)¶
Converts a list of abstract acts to their domain form.
Parameters: actions (list of actions) – the actions to be unabstracted
class policy.PolicyCommittee.PolicyCommittee(policyManager, committeeMembers, learningmethod)¶
Manages everything related to the policy committee. All policy members must inherit from Policy and CommitteeMember.
_bayes_committee_calculator(domainQs, priors, domainInControl, scale)¶
Given the means and variances of the committee members, forms the Bayesian committee distribution for each action, draws a sample from each, and returns the action with the highest sample.
Note: this implementation is probably slow; domainQs could be reformatted and the computation redone via matrices and slicing.
Parameters:
- domainQs (dict of domains to dicts of actions to dicts of variance/mu values) – the means and variances of all Q-value estimates of all domains
- priors (dict of actions and values) – the prior of the Q-value
- domainInControl (str) – the domain the dialogue is in
- scale (float) – a scaling factor used to control exploration during learning
Returns: the next abstract system action
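For intuition, a hedged numpy sketch of a standard Bayesian-committee combination of Gaussian Q-estimates followed by sampling over actions; the formula is the generic BCM combination with a zero-mean prior, and the variable names and weighting are assumptions, not the PyDial implementation:

```python
import numpy as np

def bcm_sample_action(domain_qs, prior_var, scale, rng=np.random):
    """domain_qs: {domain: {action: {'mu': mean, 'variance': var}}} as described above."""
    actions = set(a for qs in domain_qs.values() for a in qs)
    best_act, best_sample = None, -np.inf
    for act in actions:
        members = [qs[act] for qs in domain_qs.values() if act in qs]
        m = len(members)
        # Generic BCM combination with a zero-mean prior:
        #   1/var = sum_i 1/var_i - (m - 1)/prior_var
        inv_var = sum(1.0 / q['variance'] for q in members) - (m - 1) / prior_var
        var = 1.0 / inv_var
        mu = var * sum(q['mu'] / q['variance'] for q in members)
        # Draw one sample per action; 'scale' widens or narrows exploration.
        sample = mu if scale <= 0 else rng.normal(mu, scale * np.sqrt(max(var, 0.0)))
        if sample > best_sample:
            best_act, best_sample = act, sample
    return best_act
```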
_set_multi_agent_learning_weights(comm_meansVars, chosen_act)¶
Sets the reward scalings for each committee member. Implements the NAIVE approach from "Multi-agent learning in multi-domain spoken dialogue systems", Milica Gasic et al., 2015.
Parameters:
- comm_meansVars (dict of domains to dicts of actions to dicts of variance/mu values) – the means and variances of all committee members
- chosen_act (str) – the abstract system action to be executed
Returns: None
act_on(domainInControl, state)¶
Provides the next system action based on the domain in control and the belief state.
The belief state is mapped to an abstract representation which is used for all committee members.
Parameters:
- domainInControl (str) – the unique domain string identifier of the domain in control
- state (DialogueState) – the belief state to act on
Returns: the next system action
finalizeRecord(reward, domainInControl)¶
Records, for each committee member, the reward and the domain the dialogue has been in.
Parameters:
- reward (int) – the final reward to be recorded
- domainInControl (str) – the domain the reward was achieved in
record(reward, domainInControl)¶
Records the turn for the committee members. In case of multi-agent learning, the information held in the committee is used along with the reward to record the (belief, action) pair together with the reward.
Parameters:
- reward (int) – the turn reward to be recorded
- domainInControl (str) – the domain the reward was achieved in
Returns: None
HDCPolicy.py - Handcrafted dialogue manager¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import policy.PolicyUtils
import policy.SummaryUtils
import utils.Settings
import utils.ContextLogger
class policy.HDCPolicy.HDCPolicy(domainString)¶
Handcrafted policy deriving from the Policy base class. Based on the slots defined in the ontology and fixed thresholds, it defines a rule-based policy.
If no information is provided by the user, the system will always ask for the slot information in the same order, based on the ontology.
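A hedged sketch of the kind of per-slot threshold rule a handcrafted policy like this might apply; the slot names, threshold value, belief layout, and act strings are assumptions for illustration, not the actual HDCPolicy logic:

```python
# Illustrative threshold rule; the real HDCPolicy is considerably richer.
ACCEPT_THRESHOLD = 0.8   # assumed: treat a slot value as accepted above this mass

def choose_act(belief, ontology_slots):
    """belief: {slot: {value: prob}}; ontology_slots: slot names in ontology order."""
    for slot in ontology_slots:
        values = belief.get(slot, {})
        top_prob = max(values.values()) if values else 0.0
        if top_prob < ACCEPT_THRESHOLD:
            return 'request({})'.format(slot)   # ask for the first unfilled slot
    return 'inform(...)'                        # all slots filled: offer an entity

print(choose_act({'area': {'north': 0.9}, 'food': {}}, ['area', 'food']))
# -> request(food)
```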
GPPolicy.py - Gaussian Process policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Relevant Config variables [Default values]:
[gppolicy]
kernel = polysort
thetafile = ''
See also
CUED Imports/Dependencies:
import policy.GPLib
import policy.Policy
import policy.PolicyCommittee
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
class policy.GPPolicy.GPPolicy(domainString, learning, sharedParams=None)¶
An implementation of the dialogue policy based on a Gaussian process, using the GPSarsa algorithm to optimise actions, where states are GPState and actions are GPAction.
The class implements the public interfaces from Policy and CommitteeMember.
class policy.GPPolicy.Kernel(kernel_type, theta, der=None, action_kernel_type='delta', action_names=None, domainString=None)¶
The Kernel class defining the kernel for the GPSARSA algorithm.
The kernel is usually divided into a belief part, where a dot product or an RBF kernel is used, and an action part, where the kernel is either the delta function or a handcrafted or distributed kernel.
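A hedged sketch of such a factorised kernel on (belief, action) pairs, with a linear or RBF belief part and a delta action part; this is a generic construction, not the internals of the Kernel class, and the kernel-type names are illustrative:

```python
import numpy as np

def state_action_kernel(b1, a1, b2, a2, kernel_type='linear', sigma=1.0):
    """b1, b2: belief feature vectors (np.ndarray); a1, a2: action labels."""
    if kernel_type == 'rbf':
        belief_k = np.exp(-np.sum((b1 - b2) ** 2) / (2.0 * sigma ** 2))
    else:
        belief_k = float(np.dot(b1, b2))       # linear / dot-product belief kernel
    action_k = 1.0 if a1 == a2 else 0.0        # delta kernel on actions
    return belief_k * action_k

k = state_action_kernel(np.array([0.7, 0.3]), 'request_food',
                        np.array([0.6, 0.4]), 'request_food')
```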
class policy.GPPolicy.GPAction(action, numActions, replace={})¶
Definition of the summary action used for GP-SARSA.
class policy.GPPolicy.GPState(belief, keep_none=False, replace={}, domainString=None)¶
Definition of the state representation needed for the GP-SARSA algorithm. The main requirement is the ability to compute the kernel function over two states.
class policy.GPPolicy.TerminalGPAction¶
Class representing the action object recorded in the (b, a) pair along with the final reward.
class policy.GPPolicy.TerminalGPState¶
Basic object to explicitly denote the terminal state. The policy always transitions into this state at dialogue completion.
GPLib.py - Gaussian Process SARSA algorithm¶
Copyright CUED Dialogue Systems Group 2015 - 2017
This module encapsulates all classes and functionality which implement the GPSARSA algorithm for dialogue learning.
Relevant Config variables [Default values]. X is the domain tag:
[gpsarsa_X]
saveasprior = False
random = False
learning = False
gamma = 1.0
sigma = 5.0
nu = 0.001
scale = -1
numprior = 0
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import policy.PolicyUtils
class policy.GPLib.GPSARSA(in_policyfile, out_policyfile, domainString=None, learning=False, sharedParams=None)¶
Derives from GPSARSAPrior.
Implements the GPSarsa algorithm where the mean can have a predefined value:
- self._num_prior specifies the number of means
- self._prior specifies the prior; if not specified, a zero mean is assumed
Parameters needed to estimate the GP posterior:
- self._K_tilda_inv – inverse of the Gram matrix of dictionary state-action pairs
- self.sharedParams['_C_tilda'] – covariance function needed to estimate the final variance of the posterior
- self.sharedParams['_c_tilda'] – vector needed to calculate self.sharedParams['_C_tilda']
- self.sharedParams['_alpha_tilda'] – vector needed to estimate the mean of the posterior
- self.sharedParams['_d'] and self.sharedParams['_s'] – sufficient statistics needed for the iterative estimation of the posterior
Parameters needed for policy selection:
- self._random – random policy choice
- self._scale – scaling of the standard deviation when sampling the Q-value; if -1, the mean is taken
- self.learning – if True, the policy is in learning mode
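For orientation, a hedged sketch of how the GP posterior mean and variance of the Q-value are typically obtained from quantities like those listed above in GP-Sarsa (the standard formulation after Engel et al.; the variable names mirror the attributes above but the code is illustrative, not the PyDial implementation):

```python
import numpy as np

def gp_q_posterior(k_vec, k_xx, alpha_tilda, C_tilda):
    """k_vec: kernel between the query (b, a) and the dictionary points,
    k_xx: kernel of the query with itself,
    alpha_tilda, C_tilda: sufficient statistics as named above."""
    mean = float(np.dot(k_vec, alpha_tilda))                   # posterior mean of Q(b, a)
    var = float(k_xx - np.dot(k_vec, np.dot(C_tilda, k_vec)))  # posterior variance
    return mean, var

# Action selection would then sample Q ~ N(mean, scale**2 * var) for each
# executable action (or take the mean when scale == -1) and pick the argmax.
```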
class policy.GPLib.GPSARSAPrior(in_policyfile, out_policyfile, numPrior=-1, learning=False, domainString=None, sharedParams=None)¶
Defines the GP prior. Derives from LearnerInterface.
class policy.GPLib.LearnerInterface¶
This class defines the basic interface for the GPSARSA algorithm.
It specifies the policy files:
- self._inputDictFile – input dictionary file
- self._inputParamFile – input parameter file
- self._outputDictFile – output dictionary file
- self._outputParamFile – output parameter file
The self.initial and self.terminal flags are needed for learning, to mark the initial and terminal states of an episode.
HDCTopicManager.py - policy for the front end topic manager¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import utils.Settings
import utils.ContextLogger
class policy.HDCTopicManager.HDCTopicManagerPolicy(dstring=None, learning=None)¶
Handles the dialogue while the system is in the process of finding the topic/domain of the conversation.
At the current stage, this only happens at the beginning of the dialogue, so this policy has to take care of welcoming the user as well as creating actions which disambiguate/clarify the topic of the interaction.
It allows the system to hang up if the topic could not be identified after a specified number of attempts.
WikipediaTools.py - basic tools to access wikipedia¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import utils.Settings
import utils.ContextLogger
class policy.WikipediaTools.WikipediaDM¶
Dialogue Manager interface to Wikipedia – development state.
SummaryAction.py - Mapping between summary and master actions¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.SummaryUtils
import ontology.Ontology
import utils.ContextLogger
import utils.Settings
class policy.SummaryAction.SummaryAction(domainString, empty=False, confreq=False)¶
The summary action class encapsulates the functionality of a summary action along with the conversion from summary to master actions.
Note: the list of all possible summary actions is defined in this class.
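A hedged sketch of the idea behind a summary-to-master conversion: a small, fixed set of abstract (summary) actions is expanded into fully-specified master dialogue acts using the current belief. The action names, belief layout, and expansion rules below are illustrative only, not the actual SummaryAction lists:

```python
# Illustrative mapping only; the real class builds its action list from the ontology.
SUMMARY_ACTIONS = ['request_area', 'request_food', 'inform_byname', 'repeat']

def summary_to_master(summary_act, belief):
    """Expand an abstract summary action into a concrete master act string."""
    if summary_act.startswith('request_'):
        return 'request({})'.format(summary_act[len('request_'):])
    if summary_act == 'inform_byname':
        name = belief.get('name', 'none')        # assumed belief layout
        return 'inform(name="{}")'.format(name)
    return 'repeat()'

print(summary_to_master('request_food', {}))     # -> request(food)
```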
SummaryUtils.py - summarises dialog events for mapping from master to summary belief¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Basic Usage:
>>> import SummaryUtils
Note: no classes; collection of utility methods.
Local module variables:
- global_summary_features (list): global actions/methods
- REQUESTING_THRESHOLD (float): 0.5, minimum value to consider a slot requested
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
PolicyUtils.py - Utility Methods for Policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Note: PolicyUtils.py is a collection of utility functions only (no classes).
Local/file variables:
- ZERO_THRESHOLD: unused
- REQUESTING_THRESHOLD: affects the getRequestedSlots() method
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.DiaAct
import utils.Settings
import policy.SummaryUtils
import utils.ContextLogger
policy.PolicyUtils.REQUESTING_THRESHOLD = 0.5¶
Methods for global action.
policy.PolicyUtils.add_venue_count(input, belief, domainString)¶
Adds the venue count.
Parameters:
- input – string input act
- belief – belief state
- domainString (str) – domain tag, like 'SFHotels'
Returns: act with venue count
policy.PolicyUtils.checkDirExistsAndMake(fullpath)¶
Used when saving a policy – if the directory doesn't exist, it is created.
policy.PolicyUtils.getGlobalAction(belief, globalact, domainString)¶
Method for global actions: returns the action.
Parameters:
- belief (dict) – full belief state
- globalact – string name of the global action, e.g. 'INFORM_REQUESTED'
- domainString (str) – domain tag
Returns: (str) action
policy.PolicyUtils.getInformAcceptedSlotsAboutEntity(acceptanceList, ent, numFeats)¶
Method for the global inform action: returns a filled-out inform() string. (Source note: needs to be cleaned – Dongho.)
Parameters:
- acceptanceList (dict) – slots with value:probability-mass pairs
- ent (dict) – slot:value properties for this entity
- numFeats (int) – result of globalOntology.entity_by_features(acceptedValues)
Returns: (str) filled-out inform() act
policy.PolicyUtils.getInformAction(numAccepted, belief, domainString)¶
Method for the global inform action: returns an inform act via the getInformExactEntity() method, or null() if not enough slots are accepted.
Parameters:
- belief (dict) – full belief state
- numAccepted (int) – number of slots with probability mass > 80%
- domainString (str) – domain tag
Returns: getInformExactEntity(acceptanceList, numAccepted)
policy.PolicyUtils.getInformExactEntity(acceptanceList, numAccepted, domainString)¶
Method for the global inform action: creates an inform act with none or one entity.
Parameters:
- acceptanceList (dict) – slots with value:probability-mass pairs
- numAccepted (int) – number of accepted slots (> 80% probability mass)
- domainString (str) – domain tag
Returns: getInformNoneVenue() or getInformAcceptedSlotsAboutEntity() as appropriate
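A hedged sketch of the overall decision these helpers implement: collect slots whose top value exceeds the acceptance threshold, then either report that no venue matches or inform about one. The threshold value, belief layout, and act strings are assumptions for illustration, not the actual PolicyUtils code:

```python
ACCEPTANCE_THRESHOLD = 0.8   # assumed reading of "> 80% probability mass" above

def accepted_slots(belief):
    """Return {slot: (value, prob)} for slots whose top value is accepted."""
    accepted = {}
    for slot, values in belief.items():
        if values:
            value, prob = max(values.items(), key=lambda kv: kv[1])
            if prob > ACCEPTANCE_THRESHOLD:
                accepted[slot] = (value, prob)
    return accepted

def inform_act(belief, matching_entities):
    accepted = accepted_slots(belief)
    if not accepted:
        return 'null()'
    if not matching_entities:
        constraints = ', '.join('{}="{}"'.format(s, v) for s, (v, _) in accepted.items())
        return 'inform(name=none, {})'.format(constraints)
    ent = matching_entities[0]
    return 'inform(name="{}")'.format(ent.get('name', 'unknown'))
```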
BCM_Tools.py - Script for creating slot abstraction mapping files¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Note
Collection of utility classes and methods
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
This script is used to create a mapping from slot names to abstract slots (like slot0, slot1, etc.), ordered from highest entropy to lowest. The mapping is written to a JSON file.
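A hedged sketch of how such a mapping might be built: compute the entropy of each slot's value distribution, sort descending, and write slotN labels to JSON. The ontology access and file name are placeholders, not the actual script:

```python
import json
import math

def build_abstraction_mapping(slot_value_counts, out_path='slot_abstraction.json'):
    """slot_value_counts: {slot: {value: count}} taken from the domain ontology (placeholder)."""
    def entropy(counts):
        total = float(sum(counts.values()))
        probs = [c / total for c in counts.values() if c > 0]
        return -sum(p * math.log(p, 2) for p in probs)

    # Highest-entropy slot becomes slot0, next slot1, and so on.
    ranked = sorted(slot_value_counts, key=lambda s: entropy(slot_value_counts[s]), reverse=True)
    mapping = {slot: 'slot{}'.format(i) for i, slot in enumerate(ranked)}

    with open(out_path, 'w') as f:
        json.dump(mapping, f, indent=2)
    return mapping

# e.g. build_abstraction_mapping({'food': {'thai': 10, 'indian': 12}, 'area': {'north': 40}})
```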
DeepRL Policies¶
A2CPolicy.py - Advantage Actor-Critic policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the advantage actor-critic with the temporal difference error as an approximation of the advantage function. The network is defined in DRL.a2c.py. You can turn on importance sampling through the parameter A2CPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
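As a reminder of what using the temporal difference as an advantage approximation means in practice, a hedged one-step sketch (generic actor-critic bookkeeping, not the A2CPolicy code):

```python
# One-step TD error used as an advantage estimate:
#   A(s, a) ~= r + gamma * V(s') - V(s)
def td_advantage(reward, gamma, value_s, value_s_next, terminal=False):
    target = reward if terminal else reward + gamma * value_s_next
    return target - value_s

# The actor is then updated with grad log pi(a|s) * td_advantage(...),
# and the critic regresses V(s) towards the same target.
print(td_advantage(reward=-1.0, gamma=0.99, value_s=2.0, value_s_next=2.5))
```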
ACERPolicy.py - Sample Efficient Actor Critic with Experience Replay¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the sample-efficient actor-critic with truncated importance sampling with bias correction, the trust-region policy optimisation method, and RETRACE-like multi-step estimation of the value function. Relevant parameters include ACERPolicy.c and ACERPolicy.alpha. The details of the implementation can be found here: https://arxiv.org/abs/1802.03753
See also: https://arxiv.org/abs/1611.01224 https://arxiv.org/abs/1606.02647
BDQNPolicy.py - deep Bayesian Q network policy¶
Copyright CUED Dialogue Systems Group 2015 - 2018
Implementation of Bayes by Backprop. The prediction is used at both training and testing time. The model is highly dependent on its parameter settings.
See also: https://arxiv.org/abs/1505.05424 http://zacklipton.com/media/papers/bbq-learning-dialogue-policy-lipton2016.pdf
DQNPolicy.py - Deep Q Network policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of Double Deep Q Network. The algorithm is adapted to incorporate the action mask if needed. The details of the implementation can be found here: https://arxiv.org/abs/1711.11486
See also: https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
ENACPolicy.py - Episodic Natural Actor-Critic policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the episodic natural actor-critic. The vanilla gradients are computed in DRL/enac.py using TensorFlow, and the natural gradient is then obtained through the train function. You can turn on importance sampling through the parameter ENACPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
See also: https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2007-125.pdf
TRACERPolicy.py - Trust region advantage Actor-Critic policy with experience replay¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the actor-critic algorithm with off-policy learning and a trust-region constraint for stable training. The network is defined and the approximation of the natural gradient is computed in DRL.na2c.py. You can turn on importance sampling through the parameter TRACERPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
See also: https://arxiv.org/abs/1611.01224 https://pdfs.semanticscholar.org/c79d/c0bdb138e5ca75445e84e1118759ac284da0.pdf
FeudalRL Policies¶
Traditional Reinforcement Learning algorithms fail to scale to large domains due to the curse of dimensionality. A novel Dialogue Management architecture based on Feudal RL decomposes the decision into two steps: a first step in which a master policy selects a subset of primitive actions, and a second step in which a primitive action is chosen from the selected subset. The structural information included in the domain ontology is used to abstract the dialogue state space, taking the decisions at each step using different parts of the abstracted state. This, combined with an information-sharing mechanism between slots, increases scalability to large domains.
For more information, please look at the paper Feudal Reinforcement Learning for Dialogue Management in Large Domains.
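A hedged sketch of the two-step decision described above; the action subsets, state abstractions, and policy callables are placeholders, not the FeudalRL implementation:

```python
# Illustrative two-step feudal decision; the policies are stand-in callables.
MASTER_CHOICES = {
    'slot_independent': ['inform', 'repeat', 'reqmore'],
    'slot_dependent':   ['request', 'confirm', 'select'],
}

def abstract_master_state(belief):            # placeholder state abstraction
    return belief

def abstract_sub_state(belief, subset_name):  # placeholder, slot-wise abstraction
    return belief

def feudal_decision(master_policy, sub_policies, belief):
    # Step 1: the master policy picks a subset of primitive actions
    # (e.g. slot-independent vs. slot-dependent) from an abstracted state.
    subset_name = master_policy(abstract_master_state(belief))
    # Step 2: a primitive action is chosen from that subset, using the part of
    # the abstracted state relevant to the chosen subset.
    primitive = sub_policies[subset_name](abstract_sub_state(belief, subset_name),
                                          MASTER_CHOICES[subset_name])
    return subset_name, primitive
```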