Policy Gradient Methods: Algorithms like Proximal Policy Optimization (PPO) or Trust Region Policy Optimization (TRPO) handle continuous action spaces and iteratively refine the policy.
Actor-Critic Algorithms: Soft Actor-Critic (SAC) combines the benefits of policy gradients and Q-learning.
Evolutionary Strategies: If only small parameter adjustments are expected, these can explore efficiently from known initial estimates. - infeasible here, too slow!
In summary, if computational resources and accurate simulation data are available, SAC provides high-quality results efficiently. PPO offers a solid balance between implementation simplicity and quality. Evolutionary strategies are simpler but typically slower and less precise.
Overall, TD3 is an excellent choice if your problem benefits from reduced Q-value overestimation and you have sufficient computational resources. It balances complexity and performance well in continuous control tasks.
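As a rough illustration of how one of these algorithms could be applied, here is a minimal sketch assuming the stable-baselines3 library and a Gymnasium continuous-control environment (Pendulum-v1 is only a stand-in for the actual plant model; the environment choice, hyperparameters, and timestep budget are assumptions, not part of these notes):

```python
# Minimal sketch: training SAC (or PPO/TD3) on a continuous-control task.
# Assumes stable-baselines3 and gymnasium are installed; Pendulum-v1 is only
# a placeholder for the real plant/simulation environment.
import gymnasium as gym
from stable_baselines3 import SAC, PPO, TD3

env = gym.make("Pendulum-v1")            # continuous action space

# SAC: off-policy actor-critic, usually sample-efficient
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate the trained policy for one episode
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```

Swapping `SAC` for `PPO` or `TD3` in the sketch is enough to compare the three candidates on the same environment.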
Dheer's recommendations:
Put the manipulated variables into the u-vector (bound them according to what the system can do), use a state-space representation, include disturbance variables.
Maybe a Riccati controller makes sense? Otherwise simply linear quadratic control (LQR)? (see the sketch after this list)
A descent method (optimization) is also sufficient → no RL necessary.
Important: normalize all quantities so that they are comparable!
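A minimal sketch of the Riccati/LQR idea, assuming a linear state-space model x' = Ax + Bu is available; the matrices and the actuator limit below are placeholders, and clipping the control reflects the note about bounding the u-vector:

```python
# Sketch: LQR gain from the continuous-time algebraic Riccati equation.
# A, B, Q, R are placeholder values; in practice they come from the
# (normalized) state-space model and the chosen cost weighting.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [0.0, -0.5]])       # state matrix (placeholder)
B = np.array([[0.0],
              [1.0]])             # input matrix (placeholder)
Q = np.diag([1.0, 0.1])           # state weighting
R = np.array([[0.01]])            # input weighting

P = solve_continuous_are(A, B, Q, R)      # solve the Riccati equation
K = np.linalg.solve(R, B.T @ P)           # LQR gain: u = -K x

u_max = 1.0                               # actuator limit (assumed)
def control(x):
    u = -K @ x
    return np.clip(u, -u_max, u_max)      # bound the u-vector
```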
Optimization algorithms:
Use a probabilistic method, because measurements might contain errors (to avoid getting stuck in local minima).
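As one possible realization of such a probabilistic method, a sketch using scipy's differential evolution (a stochastic global optimizer); the cost function, the noise term, and the three parameters are placeholders, with parameters normalized to [0, 1] as recommended above:

```python
# Sketch: stochastic global optimization with differential evolution.
# The cost function below is a placeholder for the real, measurement-based
# (and therefore noisy) objective; parameters are normalized to [0, 1].
import numpy as np
from scipy.optimize import differential_evolution

def cost(p_normalized):
    # Placeholder objective: replace with the measurement-based cost.
    return np.sum((p_normalized - 0.3) ** 2) + 0.01 * np.random.randn()

bounds = [(0.0, 1.0)] * 3          # three normalized parameters (assumed)

result = differential_evolution(cost, bounds, seed=0, maxiter=200,
                                polish=False, tol=1e-6)
print(result.x, result.fun)
```

`polish=False` skips the final local (gradient-based) refinement, which is usually preferable when the objective is noisy.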