Expert systems with applications, 2023-11, Vol.231, p.120693, Article 120693

Details

Author(s) / Contributors
Title
Guided deterministic policy optimization with gradient-free policy parameters information
Is part of
  • Expert systems with applications, 2023-11, Vol.231, p.120693, Article 120693
Place / Publisher
Elsevier Ltd
Year of publication
2023
Source
Alma/SFX Local Collection
Descriptions/Notes
  • Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) are two classical deterministic policy gradient algorithms. Notably, the policies of both DDPG and TD3 depend entirely on the gradients of their critics, which makes the policy unstable during learning and prone to converging to local optima. Although maximum entropy learning can provide more effective exploration, it applies only to algorithms with stochastic policies, not to DDPG or TD3. In this paper, we propose a deterministic policy optimization method that incorporates gradient-free policy parameters information (GFPPI). Specifically, we obtain a new set of policies by injecting Gaussian noise into the policy parameters, and then weight these policy parameters based on the critics to obtain GFPPI. Finally, GFPPI is used as a regularization term in the policy optimization objective to guide the policy update. GFPPI can mitigate premature policy convergence and facilitate exploration under the principle of optimism. We provide a theoretical guarantee of monotonic improvement in expected cumulative return when using the loss function augmented with GFPPI, experimentally analyze the role of GFPPI in policy optimization, and combine it with deterministic policy gradient information for policy optimization. Experiments on OpenAI Gym demonstrate that GFPPI improves sample efficiency and enables the algorithm to achieve higher performance.
  • Highlights:
      • We present the computational details of GFPPI and analyze two operators.
      • We theoretically guarantee the effectiveness of GFPPI.
      • We propose GFPPI-TD3, which mitigates policy update instability.
      • GFPPI-TD3 outperforms SOTA algorithms on six MuJoCo environments.
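  • The perturb-then-weight construction described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the sample count, noise scale, softmax temperature, toy critic, and the interpolation-style regularized update are all illustrative assumptions; softmax weighting stands in for whatever critic-based weighting scheme the paper specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def gfppi(theta, critic_score, n_samples=64, sigma=0.1, beta=50.0):
    """Gradient-free policy parameters information (sketch).

    Inject Gaussian noise into the policy parameters to obtain a set of
    perturbed policies, score each one with the critic, and return a
    softmax-weighted average of the perturbed parameter vectors.
    """
    perturbed = [theta + sigma * rng.standard_normal(theta.shape)
                 for _ in range(n_samples)]
    scores = np.array([critic_score(p) for p in perturbed])
    # Softmax over critic scores (shifted for numerical stability);
    # higher-scoring perturbations get larger weights.
    weights = np.exp(beta * (scores - scores.max()))
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, perturbed))

# Toy stand-in for a learned critic: scores parameters by closeness
# to a hypothetical optimum (purely for demonstration).
optimum = np.array([1.0, -2.0])
critic = lambda p: -np.sum((p - optimum) ** 2)

theta = np.zeros(2)
theta_gf = gfppi(theta, critic)

# Use GFPPI as a regularizer: pull the current parameters toward the
# critic-weighted parameters (lam is an assumed regularization weight).
lam = 0.5
theta_new = theta + lam * (theta_gf - theta)
```

  • Because the weighting favors perturbations the critic scores highly, the update direction is informed by the critic's values rather than its gradients, which is what lets GFPPI complement the purely gradient-based updates of DDPG/TD3.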
Language
English
Identifiers
ISSN: 0957-4174
eISSN: 1873-6793
DOI: 10.1016/j.eswa.2023.120693
Titel-ID: cdi_crossref_primary_10_1016_j_eswa_2023_120693
