KQLearning

class codpy.KQLearning.GamesKernel(latent_distribution=None, max_size=None, next_states=None, **kwargs)[source]

Bases: Kernel

A specific type of kernel for deterministic policies, with clustering support.

Initializes the Kernel class with default or user-defined parameters.

Parameters:
  • x – A bi-dimensional numpy array.

  • fx – A bi-dimensional numpy array. If x or fx is not None, then call set()

  • max_nystrom (int, optional) – Maximum number of Nystrom samples. Defaults to 1000.

  • reg (float, optional) – Regularization parameter for kernel operations. Defaults to 1e-9.

  • order (int, optional) – Polynomial order for polynomial kernel functions. Defaults to None (no polynomial regression).

  • dim (int, optional) – Dimensionality of the input data. Defaults to 1.

  • set_kernel (callable, optional) – A custom kernel function initializer. If not provided, defaults to self.default_clustering_functor().

  • kwargs (dict) – Additional keyword arguments for further customization.

class codpy.KQLearning.GamesKernelClassifier(**kwargs)[source]

Bases: GamesKernel

A specific type of kernel for stochastic policies. Outputs probabilities.

Initializes the Kernel class with default or user-defined parameters.

Parameters:
  • x – A bi-dimensional numpy array.

  • fx – A bi-dimensional numpy array. If x or fx is not None, then call set()

  • max_nystrom (int, optional) – Maximum number of Nystrom samples. Defaults to 1000.

  • reg (float, optional) – Regularization parameter for kernel operations. Defaults to 1e-9.

  • order (int, optional) – Polynomial order for polynomial kernel functions. Defaults to None (no polynomial regression).

  • dim (int, optional) – Dimensionality of the input data. Defaults to 1.

  • set_kernel (callable, optional) – A custom kernel function initializer. If not provided, defaults to self.default_clustering_functor().

  • kwargs (dict) – Additional keyword arguments for further customization.

codpy.KQLearning.rl_hot_encoder(actions, actions_dim)[source]

One-hot encodes actions over actions_dim.

Parameters:
  • actions (numpy.ndarray)

  • actions_dim (int) – The dimension of the actions.

Returns:

pandas.DataFrame
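
As an illustration of what one-hot encoding over actions_dim means, here is a minimal NumPy sketch. It is not codpy's implementation: the helper name `one_hot` is hypothetical, and it returns a plain array rather than the pandas.DataFrame documented above.

```python
import numpy as np

def one_hot(actions, actions_dim):
    """Illustrative one-hot encoder (hypothetical helper, not codpy's code)."""
    actions = np.asarray(actions, dtype=int).ravel()
    encoded = np.zeros((actions.size, actions_dim))
    # Row i gets a single 1 in the column given by actions[i].
    encoded[np.arange(actions.size), actions] = 1.0
    return encoded

encoded = one_hot([2, 0, 1], actions_dim=3)
```

Each row of `encoded` has exactly one nonzero entry, located at the index of the action taken.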

class codpy.KQLearning.KAgent(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)[source]

Bases: object

Basic KAgent. Provides most of the methods useful to the other reinforcement learning algorithms.

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

__init__(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)[source]

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

compute_returns(states, actions, next_states, rewards, dones, **kwargs)[source]

Computes \(G_t = R_t + \gamma G_{t+1}\) for the given history.

Parameters:
  • states (np.ndarray) – The states of the game, in reverse order.

  • actions

  • next_states

  • rewards

  • dones

Returns:

np.ndarray
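
The recursion \(G_t = R_t + \gamma G_{t+1}\) can be sketched as follows. This is a minimal illustration, not codpy's implementation: the function name is hypothetical, and it assumes the rewards and dones arrive in reverse chronological order (last step first), with a done flag marking the final step of an episode.

```python
import numpy as np

def discounted_returns_reversed(rewards, dones, gamma=0.99):
    """Discounted returns for a history stored last step first (hypothetical helper)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for i, (reward, done) in enumerate(zip(rewards, dones)):
        if done:
            # Terminal step: nothing follows it, so G_{t+1} = 0.
            running = 0.0
        running = reward + gamma * running
        returns[i] = running
    return returns

# Three-step episode, reward 1 at every step, stored in reverse order.
returns = discounted_returns_reversed(
    rewards=[1.0, 1.0, 1.0], dones=[True, False, False], gamma=0.5
)
```

Because the history is already reversed, a single forward pass over the arrays accumulates the returns from the end of the episode back to its start.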

bellman_optimal_action(games, q_value_function)[source]

Computes the optimal actions for the given \(Q(s,a)\) function, with

\[Q^*(s,a) = R(s,a) + \gamma \max_{a'} Q^{\pi}(s',a').\]

Parameters:
  • games (tuple) – SARSD in reverse order.

  • q_value_function (codpy.kernel.Kernel) – The Q-value function.

Returns:

np.ndarray The optimal actions one hot encoded.
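
A minimal sketch of the greedy step in the formula above, assuming the Q-values at the next states have already been evaluated into an array with one column per action. The function and argument names are hypothetical and do not reflect codpy's internals.

```python
import numpy as np

def greedy_targets(rewards, q_next, gamma=0.99):
    """Bellman targets and one-hot greedy actions (hypothetical helper).

    q_next[i, a] is assumed to hold Q(s'_i, a).
    """
    q_next = np.asarray(q_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    best = q_next.argmax(axis=1)                  # arg max over a'
    targets = rewards + gamma * q_next.max(axis=1)
    one_hot = np.eye(q_next.shape[1])[best]       # greedy actions, one-hot encoded
    return targets, one_hot

targets, greedy = greedy_targets(
    rewards=[0.0, 1.0], q_next=[[0.2, 0.8], [0.5, 0.1]], gamma=0.5
)
```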

update_probabilities(advantages, games, last_policy, dt=None, kernel=None, clip=None, **kwargs)[source]

Updates the policies for advantage-based algorithms. The advantage either is \(\nabla_{y} Q^\pi_k(\cdot)\) for Policy Gradient methods, or \(R(s,a) + \gamma V^{\pi}(s') - V^{\pi}(s)\) for ActorCritic.

It normalizes the advantages, then computes the new policy as an interpolation between the last policy and the updated one.

Parameters:
  • advantages (np.ndarray)

  • games (tuple) – SARSD in reverse order.

  • last_policy (np.ndarray) – The last policy.

Returns:

codpy.kernel.Kernel The new policy.

get_derivatives_policy_state_action_value_function(games, policy, output_value_function=False, **kwargs)[source]

Solve for

\[\nabla_{y} \theta^\pi = \gamma \Big(K(Z,Z) - \gamma \sum_a \pi^a(S) K(W,Z) \Big)^{-1} \sum_a\Big(Q^{\pi}_k(W) \pi^a(\delta_b(a)-\pi^b)\Big)\]

Where:

  • \(Z\) are the state actions

  • \(W\) are the next state actions

  • \(K(Z,Z)\) is the Gram matrix of the training points

  • \(\gamma \sum_a \pi^a(S) K(W,Z)\) is the weighted projection operator onto the next state-actions

  • \(Q^{\pi}_k(W)\) is the critic evaluated at the next state-actions

  • \(\pi^a(\delta_b(a)-\pi^b)\) is an adjustment based on probability differences

Parameters:
  • games (tuple) – SARSD in reverse order.

  • policy (np.ndarray) – The policy to be used for weighting the next state-actions.

Returns:

codpy.kernel.Kernel The derivative estimator of \(Q(s,a)\)
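
The term \(\pi^a(\delta_b(a)-\pi^b)\) has the same form as the Jacobian of a softmax with respect to its logits. Assuming the policy is parameterized that way (an assumption, not something this page states), the identity can be checked numerically with the sketch below; all names are illustrative.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())  # shift by the max for numerical stability
    return e / e.sum()

def softmax_jacobian(y):
    """J[a, b] = pi_a * (delta_ab - pi_b)."""
    pi = softmax(y)
    return np.diag(pi) - np.outer(pi, pi)

y = np.array([0.1, -0.4, 0.7])
J = softmax_jacobian(y)

# Central finite differences: column b approximates d pi / d y_b.
eps = 1e-6
J_fd = np.column_stack(
    [(softmax(y + eps * e) - softmax(y - eps * e)) / (2 * eps) for e in np.eye(3)]
)
```

Every column of `J` sums to zero, reflecting that the probabilities always sum to one.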

optimal_states_values_function(games, kernel=None, full_output=False, **kwargs)[source]

Find a kernel regressor solving the Bellman equation

\[Q(s,a) = r + \gamma \max_{a'} Q(s',a')\]
The algorithm computes \(Q^n(s,a)\) iteratively:

  1. Solve \(\theta^{\pi}_{n+1/2} = \Big( K(Z, Z) - \gamma \sum_a \pi_{n+1/2}^a(S) K(W^a,Z)\Big)^{-1} R\)

  2. Refine the parameters \(\theta_{n+1}^{\pi} = \lambda \theta^{\pi}_{n+1/2} + (1 - \lambda) \theta_{n}^{\pi}.\)

Where:
  • \(Z\) is the concatenation of the states and actions

  • \(K(Z,Z)\) is the Gram matrix of the current state-action pairs

  • \(K(W^a,Z)\) is the Gram matrix between the next state-action pairs and the current ones

  • \(\pi_{n+1/2}^a(S) = \delta_{\arg \max q^n(S,a) }(S)\) is the greedy policy, which puts all its weight on the action maximizing the next Q-values, with \(q^n\) the current Q-values.

  • \(R\) is the reward function

The function then enforces a boundary condition on the Q-values by setting the last Q-values equal to the rewards.

Parameters:
  • games (tuple) – SARSD in reverse order.

  • kernel (codpy.kernel.Kernel) – Kernel to be used. If None, a kernel fit on the returns is used.

Returns:

codpy.kernel.Kernel The kernel with the optimal Q-values.
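
The two-step iteration above can be sketched on a toy problem. This is an illustration under stated assumptions, not codpy's implementation: a hand-written Gaussian kernel, a three-state chain with deterministic transitions, and hypothetical function names. The boundary condition on the last Q-values is omitted, and `lam` plays the role of \(\lambda\).

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=0.5):
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth**2))

def kernel_q_iteration(states, actions, next_states, rewards,
                       n_actions, gamma=0.9, lam=1.0, n_iter=20):
    n = len(states)
    Z = np.column_stack([states, actions]).astype(float)  # state-action pairs
    K_zz = gaussian_kernel(Z, Z)                           # K(Z, Z)
    # K(W^a, Z): next states paired with each candidate action a.
    K_wz = [
        gaussian_kernel(
            np.column_stack([next_states, np.full(n, a)]).astype(float), Z
        )
        for a in range(n_actions)
    ]
    theta = np.zeros(n)
    for _ in range(n_iter):
        # Greedy policy on the current Q-values at the next states.
        q_next = np.column_stack([K @ theta for K in K_wz])
        pi = np.eye(n_actions)[q_next.argmax(axis=1)]
        # Step 1: solve (K(Z,Z) - gamma * sum_a pi^a K(W^a,Z)) theta = R.
        A = K_zz - gamma * sum(pi[:, [a]] * K_wz[a] for a in range(n_actions))
        theta_half = np.linalg.solve(A, rewards)
        # Step 2: relaxation between the new and the previous parameters.
        theta = lam * theta_half + (1.0 - lam) * theta
    return K_zz @ theta  # Q-values at the training pairs

# Toy chain: states 0..2, action 0 moves left, action 1 moves right,
# reward 1 whenever the next state is 2.
states = np.array([0, 0, 1, 1, 2, 2])
actions = np.array([0, 1, 0, 1, 0, 1])
next_states = np.clip(states + 2 * actions - 1, 0, 2)
rewards = (next_states == 2).astype(float)
q_values = kernel_q_iteration(states, actions, next_states, rewards, n_actions=2)
```

With `lam=1.0` the loop reduces to policy iteration on the training pairs, so on this toy chain the returned values satisfy the Bellman optimality equation up to round-off.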

class codpy.KQLearning.KActorCritic(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)[source]

Bases: KAgent

KActorCritic Kernel algorithm. It is policy-based and uses a GamesKernelClassifier as the actor.

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

get_advantages(games, policy, **kwargs)[source]

Compute the advantage function

\[A^{\pi^a}(s) = R(s,a) + \gamma V^{\pi}(s') - V^{\pi}(s), \quad s'=S(s,a).\]

Where:
  • \(R(s,a)\) is the reward function

  • \(V^{\pi}(s)\) is the value function

  • \(S(s,a)\) is the next state function.

class codpy.KQLearning.KQLearning(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)[source]

Bases: KActorCritic

Implements the KQLearning algorithm. Uses clustering by default in the train() method.

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

class codpy.KQLearning.PolicyGradient(**kwargs)[source]

Bases: KActorCritic

PolicyGradient Kernel algorithm. It is policy-based and uses a GamesKernelClassifier as the actor.

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

get_advantages(games, policy, **kwargs)[source]

Compute

\[A^{\pi}(s) = \nabla_{y} Q^\pi_k(\cdot) = K(\cdot, Z) \nabla_{y} \theta^\pi.\]

Parameters:
  • games (tuple) – SARSD in reverse order.

  • policy (np.ndarray) – The policy to be used for weighting the next state-actions.

  • kwargs (dict)

Returns:

tuple The advantages of the policy along with a kernel estimator for new advantages on state-action pairs.

class codpy.KQLearning.KController(state_dim, actions_dim, controller, **kwargs)[source]

Bases: KAgent

Implements the KController algorithm. Its distinguishing feature is that it tunes a heuristic controller.

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

__call__(z, **kwargs)[source]

The internal tuned heuristic controller directly outputs the action.

Parameters:

z (np.ndarray) – The state.

Returns:

int

get_function(**kwargs)[source]

Defines the function to be optimized \(\mathcal{L}(R_{k,\lambda_e},\theta)\).

train(game, **kwargs)[source]

Solves for

\[\theta_{n+1} = \arg \max_{\theta \in \Theta_n} \mathcal{L}(R_{k,\lambda_e},\theta), \quad \Theta_{n} = \bar{\theta_e} \cup \Theta_{N,n}\]

Where \(\Theta_{N,n}\) is a screening around the last \(\theta_n\), defined as follows:

\[\Theta_{N,n} = (\theta_n+\alpha^n \Theta_N) \cap \Theta\]
where \(\alpha^n\) is a contracting factor and \(\mathcal{L}(R_{k,\lambda_e},\theta)\) is an objective function that can be defined and tuned to your needs in get_function().

Parameters:

game (tuple) – SARSD in reverse order.
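
The contracting screening can be illustrated with a one-dimensional sketch. Everything here is an assumption made for illustration: the function name, the fixed offset grid standing in for \(\Theta_N\), the interval standing in for \(\Theta\), and the choice to keep the current \(\theta_n\) in the candidate set. It does not reproduce how codpy builds \(\bar{\theta_e}\) or the objective.

```python
import numpy as np

def screening_search(objective, theta0, lower, upper, offsets,
                     alpha=0.7, n_iter=30):
    """Maximize `objective` over [lower, upper] by contracting screening."""
    theta = float(theta0)
    for n in range(n_iter):
        # Theta_{N,n} = (theta_n + alpha^n * Theta_N) intersected with Theta.
        candidates = theta + alpha**n * offsets
        candidates = candidates[(candidates >= lower) & (candidates <= upper)]
        # Keep the current point so the objective never decreases.
        candidates = np.append(candidates, theta)
        scores = np.array([objective(c) for c in candidates])
        theta = float(candidates[scores.argmax()])
    return theta

offsets = np.linspace(-1.0, 1.0, 11)  # fixed screening set
theta_star = screening_search(
    lambda t: -(t - 0.3) ** 2, theta0=0.0, lower=-1.0, upper=1.0, offsets=offsets
)
```

Because the screening radius shrinks geometrically, the search refines its resolution around the best point found so far while never accepting a worse candidate.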

class codpy.KQLearning.KQLearningHJB(actions_dim, state_dim, gamma=0.99, kernel_type=<class 'codpy.KQLearning.GamesKernel'>, **kwargs)[source]

Bases: KQLearning

Implements the Hamilton-Jacobi-Bellman Q-learning algorithm.

Initializes the KAgent with the given parameters. Every agent has an actor and critic kernel. Some classes might not use both.

Parameters:
  • actions_dim (int) – The action dimension of the environment.

  • state_dim (int) – The state dimension of the environment.

  • gamma (float) – Discount factor.

  • kernel_type (codpy.kernel.Kernel) – Type of kernel to be used as actor and critic.

optimal_states_values_function(games, kernel=None, full_output=False, maxiter=5, reorder=False, **kwargs)[source]

Solve the Bellman equation

\[Q^{\pi}(s_t,a_t) = R(s_t,a_t) + \gamma \int \left[ \sum_{a \in \mathcal{A}} \pi^a(s_t) Q^{\pi}(s',a)\right] d \mathbb{P}_S(s',s_t,a_t).\]
Numerically, we effectively solve for the set of parameters \(\theta\) of the kernel \(K\) such that:
\[\theta = \Big( K(Z, Z) - \gamma \sum_{a} \pi^a(S)\Gamma(P^a) K(P, Z)\Big)^{-1} R, \quad P = \{ S+F_k(S,a), a \}\]

Where:
  • \(K(Z,Z)\) is the kernel matrix of the states and actions.

  • \(P\) is the set of predicted next state-action pairs, one for each possible action.

  • \(\Gamma(P^a)\) is the transition probability matrix.

  • \(K(P,Z)\) is the kernel matrix of the predicted next state actions and the states and actions.