arXiv · Published 7d ago
Regularized Reward-Punishment Reinforcement Learning
We propose KL-Coupled Policy Regularization (KCPR), a policy coordination framework for Reward-Punishment Reinforcement Learning (RPRL). Based on KCPR, we derive KL-Coupled Soft Optimality (KCSO) and develop its deep realization, klDMP. Unlike existing RPRL approaches that optimize reward-seeking and punishment-related policies largely independently, KCPR enables direct interactions between companion policies by treating each as a dynamically learned prior for the other. KCSO yields coupled soft-optimal policies and KL-regularized Bellman operators, allowing reward and punishment information t
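The abstract does not state KCSO's equations. The sketch below is only a rough illustration of the general idea it names: soft-optimal policies under a KL prior, with two companion policies each serving as the other's prior. It assumes a known finite MDP, separate reward and punishment signals, and a temperature `alpha`. The functions `soft_backup`, `soft_policy`, and `coupled_soft_iteration` are hypothetical, as are the uniform-mixing weight `mix` and the simultaneous update schedule. None of them is the authors' KCSO or klDMP.

```python
import numpy as np


def soft_backup(q, prior, alpha):
    """KL-regularized soft value per state:
    V(s) = alpha * log sum_a prior(a|s) * exp(Q(s,a) / alpha)."""
    z = q / alpha
    m = z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    return alpha * (m[:, 0] + np.log((prior * np.exp(z - m)).sum(axis=1)))


def soft_policy(q, prior, alpha):
    """Soft-optimal policy under a KL prior: pi(a|s) proportional to prior(a|s) * exp(Q(s,a) / alpha)."""
    z = q / alpha
    w = prior * np.exp(z - z.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)


def coupled_soft_iteration(P, r_pos, r_neg, gamma=0.9, alpha=1.0, mix=0.1, iters=200):
    """Tabular value iteration for two policies that act as each other's KL prior.

    P:     transition tensor, shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    r_pos: reward signal, shape (S, A)
    r_neg: punishment signal (nonpositive), shape (S, A); the punishment
           policy in this sketch maximizes it, i.e. avoids punishment
    mix:   hypothetical weight of a uniform component mixed into each prior,
           which keeps every prior strictly positive
    """
    S, A, _ = P.shape
    uniform = np.full((S, A), 1.0 / A)
    q_r, q_p = np.zeros((S, A)), np.zeros((S, A))
    pi_r, pi_p = uniform.copy(), uniform.copy()
    for _ in range(iters):
        # Each policy's prior is its companion's current policy (plus a uniform mix).
        prior_r = (1 - mix) * pi_p + mix * uniform
        prior_p = (1 - mix) * pi_r + mix * uniform
        # KL-regularized Bellman backups; P @ v broadcasts to shape (S, A).
        q_r = r_pos + gamma * (P @ soft_backup(q_r, prior_r, alpha))
        q_p = r_neg + gamma * (P @ soft_backup(q_p, prior_p, alpha))
        # Update both policies simultaneously from the same pair of priors.
        pi_r = soft_policy(q_r, prior_r, alpha)
        pi_p = soft_policy(q_p, prior_p, alpha)
    return pi_r, pi_p, q_r, q_p


if __name__ == "__main__":
    # One state, two actions, self-loop: action 0 earns reward, action 1 is punished.
    P = np.ones((1, 2, 1))
    pi_r, pi_p, _, _ = coupled_soft_iteration(
        P, np.array([[1.0, 0.0]]), np.array([[0.0, -1.0]])
    )
    print("reward policy:", pi_r, "punishment policy:", pi_p)
```

In this toy example both signals favor action 0, so each policy's prior pushes its companion further toward that action. The uniform mix is a choice made here so the log term in `soft_backup` stays finite and the two priors cannot reinforce each other without bound.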
