paper · arXiv · Trust 82 · Primary · Published 7d ago · Live · 4d ago

Regularized Reward-Punishment Reinforcement Learning

We propose KL-Coupled Policy Regularization (KCPR), a policy coordination framework for Reward-Punishment Reinforcement Learning (RPRL). Based on KCPR, we derive KL-Coupled Soft Optimality (KCSO) and develop its deep realization, klDMP. Unlike existing RPRL approaches that optimize reward-seeking and punishment-related policies largely independently, KCPR enables direct interactions between companion policies by treating each as a dynamically learned prior for the other. KCSO yields coupled soft-optimal policies and KL-regularized Bellman operators, allowing reward and punishment information t
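The idea of treating one policy as a learned prior for another can be illustrated with the standard KL-regularized soft-optimal form, pi(a) proportional to prior(a) * exp(Q(a) / tau). The sketch below is a generic single-state illustration of that textbook form, not the paper's KCSO operators or klDMP. The function names, the Q-values, the temperature `tau`, and the one-step coupling order are all assumptions made for this example.

```python
import math


def kl_regularized_policy(q_values, prior, tau):
    """Return pi(a) proportional to prior(a) * exp(q(a) / tau) for one state."""
    # Subtract the max Q-value before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [p * math.exp((q - q_max) / tau) for q, p in zip(q_values, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_value(q_values, prior, tau):
    """Return tau * log sum_a prior(a) * exp(q(a) / tau) for one state."""
    q_max = max(q_values)
    s = sum(p * math.exp((q - q_max) / tau) for q, p in zip(q_values, prior))
    return q_max + tau * math.log(s)


# Hypothetical single-state problem with three actions (illustrative numbers).
q_reward = [1.0, 0.5, 0.0]    # assumed reward-side action values
q_punish = [-2.0, 0.0, -0.5]  # assumed punishment-side values (higher is safer)
uniform = [1.0 / 3.0] * 3
tau = 1.0

# Assumed coupling order: the punishment-side policy is computed first against
# a uniform prior, then serves as the prior for the reward-side policy.
pi_punish = kl_regularized_policy(q_punish, uniform, tau)
pi_reward_coupled = kl_regularized_policy(q_reward, pi_punish, tau)
pi_reward_alone = kl_regularized_policy(q_reward, uniform, tau)
```

With these numbers, the uncoupled reward policy prefers action 0, while the coupled one shifts its preference to action 1, because the prior down-weights the action that the punishment side scores poorly.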

Covers

news · RL without TD learning

Related across the graph

news · RL without TD learning

Topics

cs.LG