Reinforcement Learning for Large Language Model Fine-Tuning: A Systematic Literature Review

Lingxiao Kong, Qusai Ramadan, Oussama Zoubia, Jahid Hasan Polash, Mayra Elwes, Mehdi Akbari Gurabi, Lu Jin, Ekaterina Kutafina, Roman Matzutt, Yuanbin Wang, Junqi Xu, Oya Deniz Beyan, Cong Yang, Zeyd Boukhers

Large Language Models (LLMs) have been developed for a wide range of language-based tasks, while Reinforcement Learning (RL) has primarily been applied to decision-making problems such as robotics, game playing, and control systems. Recently, these two paradigms have been increasingly integrated through various synergies. In this literature review, we focus on RL4LLM fine-tuning, where RL techniques are systematically leveraged to fine-tune LLMs and align them with various preferences. Our review provides a comprehensive analysis of 230 recent publications, presenting a methodological taxonomy that organizes current research into three primary method domains: Optimization Algorithm, concerning innovation in core RL update rules; Training Framework, regarding innovation in the orchestration of the training process; and Reward Modeling, addressing how LLMs learn and represent preferences and feedback. Within these primary domains, we further analyze methods and innovations through more granular categories to provide an in-depth summary of RL4LLM fine-tuning research. We address three research questions: 1) an overview of recent methods, 2) key methodological innovations, and 3) limitations and future directions. Our analysis demonstrates the breadth and impact of recent RL4LLM fine-tuning research while highlighting valuable directions for future investigation.