Beating Gemini 3 Flash with a 4B model on Dune
Trained a GRPO RL policy to play a ~90-turn complex strategy game
Trained a GRPO RL policy to play Dune: Imperium – Uprising, a complex strategy board game where early sacrifices compound into late-game advantages. A typical 2-player game runs ~80-90 turns with a massive action space: agent and spy placement, deck-building, combat, and intrigue all interleaved.
Approach
Built a full game engine from scratch that enforces all of the board game's rules, then trained a Qwen3 4B Instruct 2507 model in two stages: first SFT on games played between Minimax M2.5 and Grok 4.1 Fast, then RL via self-play using GRPO. The model learns to evaluate board states and select actions through natural language reasoning.
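GRPO's core mechanic is scoring a group of rollouts from the same starting state and normalizing each rollout's reward against the rest of the group, so no separate value network is needed. A minimal sketch of that advantage computation, assuming one scalar reward per rollout (the function name and reward values here are illustrative, not from this project's codebase):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each rollout's reward normalized
    by the mean and std of its own group of rollouts."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical self-play rollouts scored by the rubric.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that beat their group average get positive advantage and are reinforced; the rest are pushed down, which is what lets a rubric favoring long-horizon outcomes shape play over many turns.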
The key insights were a compact game state representation paired with available actions, and an RL rubric that rewards long-term strategic thinking over locally perfect moves.
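To illustrate what a compact state-plus-actions prompt might look like, here is a hypothetical formatter. The field names (spice, solari, troops, VP) come from the board game itself, but this schema and the function are a guess for illustration, not the project's actual representation:

```python
def format_turn_prompt(state: dict, actions: list[str]) -> str:
    """Serialize a board state and its legal actions into a short
    text prompt the policy model can reason over."""
    lines = [f"Turn {state['turn']} | You are Player {state['player']}"]
    lines.append(
        f"Resources: spice={state['spice']} solari={state['solari']} "
        f"troops={state['troops']} VP={state['vp']}"
    )
    lines.append("Hand: " + ", ".join(state["hand"]))
    lines.append("Legal actions:")
    lines += [f"  {i}. {a}" for i, a in enumerate(actions, 1)]
    lines.append("Choose one action by number and explain your reasoning.")
    return "\n".join(lines)

# Illustrative state and action list.
state = {"turn": 12, "player": 1, "spice": 3, "solari": 2,
         "troops": 4, "vp": 5,
         "hand": ["Dune, the Desert Planet", "Seek Allies"]}
prompt = format_turn_prompt(state, ["Place agent on Arrakeen", "Reveal hand"])
```

Enumerating only the legal actions keeps the model from hallucinating illegal moves and keeps the prompt short enough to fit many turns of play into training context.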
Results
The 4B model consistently beats Gemini 3 Flash when both are given the same game state and asked to choose moves. A model with 1000x fewer parameters outplays a frontier model on a task that requires genuine long-term strategic planning.
More details and codebase coming soon.