Chinese AI researchers have achieved what many thought was years away: a free, open-source AI model that can match or exceed the performance of OpenAI’s most advanced reasoning systems. What makes this even more remarkable is how they did it: by letting the AI teach itself through trial and error, similar to how humans learn.
“DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities,” the research paper reads.
“Reinforcement learning” is a method in which a model is rewarded for good decisions and penalized for bad ones, without being told in advance which is which. Over many decisions, it learns to favor the choices that those rewards reinforced.
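To make the idea concrete, here is a minimal sketch of trial-and-error learning. It is a toy "multi-armed bandit" with an epsilon-greedy agent, not the method used to train DeepSeek-R1-Zero. The function name `run_bandit`, the payoff probabilities, and the exploration rate are all assumptions chosen for illustration.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that learns which action pays off purely from rewards.

    reward_probs is hidden from the agent; it only observes +1 or -1 after acting.
    """
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    value_estimates = [0.0] * n_actions  # learned estimate of each action's payoff
    counts = [0] * n_actions             # how often each action was tried

    for _ in range(steps):
        # Occasionally explore a random action; otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: value_estimates[a])

        # The environment answers with a reward (+1) or a penalty (-1).
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0

        # Incremental average: nudge the estimate toward the observed reward.
        counts[action] += 1
        value_estimates[action] += (reward - value_estimates[action]) / counts[action]

    return value_estimates, counts

# Hypothetical environment: action 2 is rewarded most often.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("preferred action:", best_action, "tries per action:", counts)
```

The agent is never told which action is "good." It only sees rewards and penalties after the fact, and the action that was reinforced most often ends up being the one it prefers.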
Conventionally, training begins with a supervised fine-tuning phase, in which a group of humans shows the model the output they want, giving it context to know what’s good and what isn’t. This leads to the next phase, reinforcement learning, in which the model produces different outputs and humans rank them from best to worst. The process is repeated over and over until the model consistently provides satisfactory results.
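The human ranking step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the output labels "A", "B", and "C" and both helper functions are hypothetical, and real pipelines typically train a separate reward model on such comparisons rather than simply counting wins.

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn one human ranking (best first) into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first item of each pair
    # always ranked higher than the second.
    return list(combinations(ranked_outputs, 2))

def win_counts(rankings):
    """Count how often each output was preferred across all rankings."""
    wins = {}
    for ranking in rankings:
        for preferred, rejected in ranking_to_pairs(ranking):
            wins[preferred] = wins.get(preferred, 0) + 1
            wins.setdefault(rejected, 0)  # make sure losers appear with a score too
    return wins

# Three hypothetical annotators rank the same three model outputs, best first.
rankings = [
    ["B", "A", "C"],
    ["B", "C", "A"],
    ["A", "B", "C"],
]
scores = win_counts(rankings)
print(scores)
```

Outputs that win more comparisons would be the ones the model is pushed toward in the next round, which is the "repeat until satisfactory" loop in miniature.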