Nicholas

nickoo004

AI & ML interests

ML, NLP, DL, and NN

Recent Activity

reacted to anakin87's post with ❤️ about 19 hours ago

How does LLM training with RL environments work?

It all starts with 𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗩𝗲𝗿𝗶𝗳𝗶𝗮𝗯𝗹𝗲 𝗥𝗲𝘄𝗮𝗿𝗱𝘀:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s).

Consider a more complex tic-tac-toe env ❌⭕
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)

---

What happens at training?

We use 𝗚𝗿𝗼𝘂𝗽 𝗥𝗲𝗹𝗮𝘁𝗶𝘃𝗲 𝗣𝗼𝗹𝗶𝗰𝘆 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 with a tic-tac-toe env.
No critic model needed: the group is the baseline.
Simpler than PPO.

1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
🔁 Repeat

For a deep dive, check out 🌱 https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
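Steps 3 and 4 of the training loop described in the post can be illustrated with a minimal sketch. The function name `group_relative_advantages` and the example scores are hypothetical, chosen only for illustration; they are not taken from the linked course. Some GRPO implementations also divide by the group's standard deviation, which is left out here to match the post's description.

```python
from statistics import mean


def group_relative_advantages(scores):
    """Return each rollout's advantage: its score minus the group mean.

    `scores` holds one scalar reward per rollout in the group
    (hypothetical values; a real environment would compute them).
    """
    baseline = mean(scores)  # the group itself acts as the baseline, no critic model
    return [score - baseline for score in scores]


# Hypothetical scores for N = 4 games sampled from the same starting board.
scores = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(scores)
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
```

Rollouts with a positive advantage scored above the group mean, so the policy update would make them more likely; those with a negative advantage would be made less likely.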

liked a dataset 2 days ago

tencent/MegaStyle-1.4M

published a dataset 6 days ago

nickoo004/kaa-parallel-corpus

Organizations

None yet

models 3

nickoo004/karakalpak-gpt2-v3

Text Generation • 97M • Updated 20 days ago • 374 • 1

nickoo004/gemma-2b-reasoning-keras

Updated Jan 11 • 3

nickoo004/gpt2_karakalpak

Text Generation • 0.1B • Updated Jun 6, 2025 • 6 • 4

datasets 4

nickoo004/kaa-parallel-corpus

Viewer • Updated 6 days ago • 14.1k • 33

nickoo004/gemma-reasoning-gold-15k

Viewer • Updated Jan 9 • 27.1k • 20

nickoo004/FeruzaSpeech_to_fine_tuning

Viewer • Updated Sep 2, 2025 • 13k • 101 • 2

nickoo004/uzbekdata

Viewer • Updated Feb 23, 2025 • 7.27k • 5 • 3