KQLearning
- class codpy.KQLearning.GamesKernel(latent_distribution=None, max_size=None, next_states=None, **kwargs)
Bases:
Kernel
A specific type of kernel for deterministic policies, handling clustering.
Initializes the Kernel class with default or user-defined parameters.
- Parameters:
x – A two-dimensional numpy array.
fx – A two-dimensional numpy array. If x or fx is not None, set() is called.
max_nystrom (int, optional) – Maximum number of Nystrom samples. Defaults to 1000.
reg (float, optional) – Regularization parameter for kernel operations. Defaults to 1e-9.
order (int, optional) – Polynomial order for polynomial kernel functions. Defaults to None (no polynomial regression).
dim (int, optional) – Dimensionality of the input data. Defaults to 1.
set_kernel (callable, optional) – A custom kernel function initializer. If not provided, defaults to self.default_clustering_functor().
kwargs (dict) – Additional keyword arguments for further customization.
- class codpy.KQLearning.GamesKernelClassifier(**kwargs)
Bases:
GamesKernel
A specific type of kernel for stochastic policies. Outputs probabilities.
Initializes the Kernel class with default or user-defined parameters.
- Parameters:
x – A two-dimensional numpy array.
fx – A two-dimensional numpy array. If x or fx is not None, set() is called.
max_nystrom (int, optional) – Maximum number of Nystrom samples. Defaults to 1000.
reg (float, optional) – Regularization parameter for kernel operations. Defaults to 1e-9.
order (int, optional) – Polynomial order for polynomial kernel functions. Defaults to None (no polynomial regression).
dim (int, optional) – Dimensionality of the input data. Defaults to 1.
set_kernel (callable, optional) – A custom kernel function initializer. If not provided, defaults to self.default_clustering_functor().
kwargs (dict) – Additional keyword arguments for further customization.
- codpy.KQLearning.rl_hot_encoder(actions, actions_dim)
One-hot encodes actions over actions_dim.
- Parameters:
actions (numpy.ndarray) – The actions to encode.
actions_dim (int) – The dimension of the actions.
- Returns:
pandas.DataFrame
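As an illustration of what one-hot encoding over actions_dim means, here is a minimal NumPy sketch. The helper name `one_hot` is hypothetical and this is not the codpy implementation (which, per the entry above, returns a pandas.DataFrame); it only shows the encoding itself, assuming integer action indices in the range 0 to actions_dim - 1.

```python
import numpy as np

def one_hot(actions, actions_dim):
    """Hypothetical helper: one row per action, with a 1 in the action's column."""
    actions = np.asarray(actions, dtype=int).ravel()
    encoded = np.zeros((actions.size, actions_dim))
    encoded[np.arange(actions.size), actions] = 1.0
    return encoded

# Three actions taken in an environment with three possible actions.
encoded = one_hot([0, 2, 1], actions_dim=3)
```

Each row sums to one, so the result can be concatenated with the states to form state-action inputs.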
- class codpy.KQLearning.KAgent(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)
Bases:
object
Basic KAgent. Provides most of the methods useful to the other Reinforcement Learning algorithms.
Initializes the KAgent with the given parameters. Every agent has an actor and a critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- __init__(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)
Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- compute_returns(states, actions, next_states, rewards, dones, **kwargs)
Computes \(G_t = R_t + \gamma G_{t+1}\) for the given history.
- Parameters:
states (np.ndarray) – The states of the game, in reverse order.
actions
next_states
rewards
dones
- Returns:
np.ndarray
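The recursion \(G_t = R_t + \gamma G_{t+1}\) can be sketched in a few lines of plain Python. The function name `discounted_returns` is hypothetical and this is not the codpy code; it assumes the rewards and done flags are given in reverse chronological order (latest step first), and that a done flag marks a step after which nothing is accumulated.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Hypothetical helper: returns G_t for inputs in reverse chronological order."""
    returns = []
    running = 0.0
    for reward, done in zip(rewards, dones):
        if done:
            running = 0.0  # assumption: no future return after a terminal step
        running = reward + gamma * running
        returns.append(running)
    return returns

# Chronological rewards 1, 2, 3 (last step terminal), passed in reverse order.
returns = discounted_returns([3.0, 2.0, 1.0], [True, False, False], gamma=0.5)
```

Because the history is already reversed, a single forward pass over the inputs is enough, and the output stays aligned with the reversed history.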
- bellman_optimal_action(games, q_value_function)
Computes the optimal actions for the given \(Q(s,a)\) function, with
\[Q^*(s,a) = R(s,a) + \gamma \max_{a'} Q^{\pi}(s',a').\]
- Parameters:
games (tuple) – SARSD in reverse order.
q_value_function (codpy.kernel.Kernel) – The Q-value function.
- Returns:
np.ndarray – The optimal actions, one-hot encoded.
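To make the Bellman target and the one-hot encoded greedy action concrete, the following NumPy sketch works on tabulated Q-values instead of a codpy kernel. Both helper names are hypothetical, and the arrays stand in for evaluations of the Q-value function at every action; this is an illustration of the formula, not the method's actual implementation.

```python
import numpy as np

def bellman_targets(rewards, next_q_values, gamma=0.99):
    """R(s,a) + gamma * max_a' Q(s',a'), with next_q_values of shape (n, actions_dim)."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    return rewards + gamma * next_q_values.max(axis=1)

def greedy_one_hot(q_values):
    """One-hot encoding of argmax_a Q(s,a) for each row of q_values."""
    q_values = np.asarray(q_values, dtype=float)
    best = q_values.argmax(axis=1)
    encoded = np.zeros_like(q_values)
    encoded[np.arange(len(best)), best] = 1.0
    return encoded

targets = bellman_targets([0.0, 1.0], [[1.0, 3.0], [2.0, 0.0]], gamma=0.5)
best_actions = greedy_one_hot([[0.2, 0.7], [0.9, 0.1]])
```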
- update_probabilities(advantages, games, last_policy, dt=None, kernel=None, clip=None, **kwargs)
Updates the policies for advantage-based algorithms. The advantage is either \(\nabla_{y} Q^\pi_k(\cdot)\) for Policy Gradient methods, or \(R(s,a) + \gamma V^{\pi}(s') - V^{\pi}(s)\) for ActorCritic.
It normalizes the advantages, then computes the new policy as an interpolation between the last policy and the new one.
- Parameters:
advantages (np.ndarray)
games (tuple) – SARSD in reverse order.
last_policy (np.ndarray) – The last policy.
- Returns:
codpy.kernel.Kernel – The new policy.
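The exact normalization and interpolation rule is not spelled out above, so the following NumPy sketch shows only one plausible reading, under loudly stated assumptions: advantages are centred and scaled per state, the candidate policy is a softmax of the normalized advantages, and `dt` is used as a convex interpolation weight. The function name `interpolate_policy` and all three choices are hypothetical, not the codpy behaviour.

```python
import numpy as np

def interpolate_policy(last_policy, advantages, dt=0.1, eps=1e-12):
    """Hypothetical update: blend the last policy with an advantage-driven candidate."""
    last_policy = np.asarray(last_policy, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Assumption: normalize per state by centring and scaling to [-1, 1].
    centred = advantages - advantages.mean(axis=1, keepdims=True)
    scale = np.maximum(np.abs(centred).max(axis=1, keepdims=True), eps)
    normalised = centred / scale
    # Assumption: the candidate policy is a softmax of the normalized advantages.
    logits = normalised - normalised.max(axis=1, keepdims=True)
    candidate = np.exp(logits)
    candidate /= candidate.sum(axis=1, keepdims=True)
    # Convex interpolation keeps every row a valid probability distribution.
    return (1.0 - dt) * last_policy + dt * candidate

new_policy = interpolate_policy([[0.5, 0.5]], [[1.0, -1.0]], dt=0.1)
```

Since both the last policy and the candidate are probability distributions, any `dt` in [0, 1] yields rows that still sum to one, and the action with the larger advantage gains probability mass.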
- get_derivatives_policy_state_action_value_function(games, policy, output_value_function=False, **kwargs)
Solve for
\[\nabla_{y} \theta^\pi = \gamma \Big(K(Z,Z) - \gamma \sum_a \pi^a(S) K(W,Z) \Big)^{-1} \sum_a\Big(Q^{\pi}_k(W) \pi^a(\delta_b(a)-\pi^b)\Big)\]
- Where:
\(Z\) are the state actions
\(W\) are the next state actions
\(K(Z,Z)\) is the Gram matrix of the training points
\(\gamma \sum_a \pi^a(S) K(W,Z)\) is the weighted projection operator onto the next state-actions
\(Q^{\pi}_k(W)\) is the critic evaluated at the next state-actions
\(\pi^a(\delta_b(a)-\pi^b)\) is an adjustment based on probability differences
- Parameters:
games (tuple) – SARSD in reverse order.
policy (np.ndarray) – The policy to be used for weighting the next state-actions.
- Returns:
codpy.kernel.Kernel – The derivative estimator of \(Q(s,a)\).
- optimal_states_values_function(games, kernel=None, full_output=False, **kwargs)
Finds a kernel regressor solving the Bellman equation
\[Q(s,a) = r + \gamma \max_{a'} Q(s',a')\]
The algorithm computes \(Q^n(s,a)\) iteratively:
Solve \(\theta^{\pi}_{n+1/2} = \Big( K(Z, Z) - \gamma \sum_a \pi_{n+1/2}^a(S) K(W^a,Z)\Big)^{-1} R\)
Refine the parameters \(\theta_{n+1}^{\pi} = \lambda \theta^{\pi}_{n+1/2} + (1 - \lambda) \theta_{n}^{\pi}.\)
- Where:
\(Z\) is the concatenation of the states and actions
\(K(Z,Z)\) is the Gram matrix of the current state-action pairs
\(K(W^a,Z)\) is the kernel matrix between the next state-action pairs and the current state-action pairs
\(\pi_{n+1/2}^a(S) = \delta_{\arg \max q^n(S,a) }(S)\) is the greedy policy selecting the action that maximizes the next Q-values, with \(q^n\) the current Q-values
\(R\) is the rewards function
The function then enforces a boundary condition on the Q-values by setting the last Q-values equal to the rewards.
- Parameters:
games (tuple) – SARSD in reverse order.
kernel (codpy.kernel.Kernel) – Kernel to be used. If None, a kernel fit on the returns is used.
- Returns:
codpy.kernel.Kernel – The kernel with the optimal Q-values.
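The two-step iteration above (solve for \(\theta_{n+1/2}\) under the greedy policy, then relax with \(\lambda\)) can be sketched with plain NumPy on a toy problem. Everything here is an assumption made for illustration: the Gaussian kernel, the helper names `gaussian_kernel`, `one_hot` and `fit_q`, the small ridge term `reg` added for numerical stability, and the toy transitions. The boundary condition on the last Q-values is not reproduced, and this is not the codpy implementation.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def one_hot(actions, actions_dim):
    return np.eye(actions_dim)[np.asarray(actions, dtype=int)]

def fit_q(states, actions, next_states, rewards, actions_dim,
          gamma=0.9, lam=0.5, n_iter=20, reg=1e-9):
    n = len(states)
    ridge = reg * np.eye(n)
    Z = np.hstack([states, one_hot(actions, actions_dim)])  # state-action pairs
    K_zz = gaussian_kernel(Z, Z)
    # K(W^a, Z): next states paired with each candidate action a.
    K_wz = [gaussian_kernel(
        np.hstack([next_states, one_hot(np.full(n, a), actions_dim)]), Z)
        for a in range(actions_dim)]
    theta = np.linalg.solve(K_zz + ridge, rewards)  # start from a fit on rewards
    for _ in range(n_iter):
        # Greedy policy from the current Q-values at the next state-actions.
        q_next = np.stack([k @ theta for k in K_wz], axis=1)
        greedy = q_next.argmax(axis=1)
        projection = sum((greedy == a)[:, None] * K_wz[a]
                         for a in range(actions_dim))
        theta_half = np.linalg.solve(K_zz - gamma * projection + ridge, rewards)
        theta = lam * theta_half + (1.0 - lam) * theta  # relaxation step
    return Z, theta

states = np.array([[0.0], [1.0], [2.0], [3.0]])
actions = np.array([1, 1, 0, 1])
next_states = np.array([[1.0], [2.0], [1.0], [3.0]])
rewards = np.array([0.0, 0.0, 0.0, 1.0])
Z, theta = fit_q(states, actions, next_states, rewards, actions_dim=2)
q_fitted = gaussian_kernel(Z, Z) @ theta  # Q-values at the training pairs
```

With `gamma=0` the projection term vanishes and the scheme reduces to a plain kernel regression of the rewards, which is a convenient sanity check.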
- class codpy.KQLearning.KActorCritic(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)
Bases:
KAgent
KActorCritic Kernel algorithm. It is policy-based and uses a GamesKernelClassifier as the actor.
Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- class codpy.KQLearning.KQLearning(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)
Bases:
KActorCritic
Implements the KQLearning algorithm. Uses clustering by default in the train() method.
Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- class codpy.KQLearning.PolicyGradient(**kwargs)
Bases:
KActorCritic
PolicyGradient Kernel algorithm. It is policy-based and uses a GamesKernelClassifier as the actor.
Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- get_advantages(games, policy, **kwargs)
Compute
\[A^{\pi}(s) = \nabla_{y} Q^\pi_k(\cdot) = K(\cdot, Z) \nabla_{y} \theta^\pi.\]
- Parameters:
games (tuple) – SARSD in reverse order.
policy (np.ndarray) – The policy to be used for weighting the next state-actions.
kwargs (dict)
- Returns:
tuple – The advantages of the policy along with a kernel estimator for new advantages on state-action pairs.
- class codpy.KQLearning.KController(state_dim, actions_dim, controller, **kwargs)
Bases:
KAgent
Implements the KController algorithm. Its distinguishing feature is that it tunes a heuristic controller.
Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- __call__(z, **kwargs)
The internal tuned heuristic controller directly outputs the action.
- Parameters:
z (np.ndarray) – The state.
- Returns:
int
- get_function(**kwargs)
Defines the function to be optimized \(\mathcal{L}(R_{k,\lambda_e},\theta)\).
- train(game, **kwargs)
Solves for
\[\theta_{n+1} = \arg \max_{\theta \in \Theta_n} \mathcal{L}(R_{k,\lambda_e},\theta), \quad \Theta_{n} = \bar{\theta_e} \cup \Theta_{N,n}\]
where \(\Theta_{N,n}\) is a screening around the last \(\theta_n\), defined as follows:
\[\Theta_{N,n} = (\theta_n+\alpha^n \Theta_N) \cap \Theta\]
where \(\alpha^n\) is a contracting factor and \(\mathcal{L}(R_{k,\lambda_e},\theta)\) is an objective function that can be defined and tuned to your needs in get_function().
- Parameters:
game (tuple) – SARSD in reverse order.
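The contracting screening search described for train() can be sketched in plain Python. This is a generic illustration under assumptions, not the codpy implementation: the function name `screening_search` is hypothetical, \(\Theta\) is taken to be a box given by `lower` and `upper`, \(\Theta_N\) is a fixed set of random offsets, \(\bar{\theta_e}\) is interpreted as the best parameter found so far, and `objective` stands in for \(\mathcal{L}(R_{k,\lambda_e},\theta)\).

```python
import random

def screening_search(objective, theta0, lower, upper,
                     n_candidates=32, alpha=0.7, n_iter=30, seed=0):
    """Hypothetical contracting screening search maximizing `objective` over a box."""
    rng = random.Random(seed)
    dim = len(theta0)
    # Fixed screening set Theta_N: random offsets scaled by the box half-width.
    base = [[rng.uniform(-1.0, 1.0) * 0.5 * (upper[d] - lower[d])
             for d in range(dim)] for _ in range(n_candidates)]
    best = list(theta0)
    best_value = objective(best)
    for n in range(n_iter):
        scale = alpha ** n  # contracting factor alpha^n
        candidates = [best]  # keep the incumbent so the objective never decreases
        for offset in base:
            cand = [best[d] + scale * offset[d] for d in range(dim)]
            # Intersection with Theta: discard candidates outside the box.
            if all(lower[d] <= cand[d] <= upper[d] for d in range(dim)):
                candidates.append(cand)
        values = [objective(c) for c in candidates]
        i = max(range(len(values)), key=values.__getitem__)
        best, best_value = candidates[i], values[i]
    return best, best_value

def objective(theta):
    # Toy concave objective with its maximum at (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

best, best_value = screening_search(
    objective, theta0=[0.0, 0.0], lower=[-5.0, -5.0], upper=[5.0, 5.0])
```

Keeping the incumbent in every candidate set makes the sequence of objective values non-decreasing, while the shrinking factor `alpha ** n` moves the search from coarse exploration to local refinement.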
- class codpy.KQLearning.KQLearningHJB(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)
Bases:
KQLearning
Implements the Hamilton-Jacobi-Bellman Q-learning algorithm.
Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.
- Parameters:
actions_dim (int) – The action dimension of the environment.
state_dim (int) – The state dimension of the environment.
gamma (float) – Discount factor.
kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.
- optimal_states_values_function(games, kernel=None, full_output=False, maxiter=5, reorder=False, **kwargs)
Solve the Bellman equation
\[Q^{\pi}(s_t,a_t) = R(s_t,a_t) + \gamma \int \left[ \sum_{a \in \mathcal{A}} \pi^a(s_t) Q^{\pi}(s',a)\right] d \mathbb{P}_S(s',s_t,a_t).\]
Numerically, we solve for the set of parameters \(\theta\) of the kernel \(K\) such that:
\[\theta = \Big( K(Z, Z) - \gamma \sum_{a} \pi^a(S)\Gamma(P^a) K(P, Z)\Big)^{-1} R, \quad P = \{ S+F_k(S,a), a \}\]
- Where:
\(K(Z,Z)\) is the kernel matrix of the states and actions.
\(P\) is the set of predicted next state-action pairs.
\(\Gamma(P^a)\) is the transition probability matrix.
\(K(P,Z)\) is the kernel matrix between the predicted next state-action pairs and the current state-action pairs.