AI at the end of 2025
Despite everyone saying we were hitting a wall, 2025 may have been the most surprising year yet in terms of advances in AI capabilities.
In fact, I am writing and publishing this blog post directly from https://claude.ai/code/, asking it to make a PR and update my website. Anyway, I just want to use this post as a reminder of several important things that happened this year in AI.
RLVR and the thinking models
It started with o1 slightly before the beginning of the year, but it exploded with the release of DeepSeek R1 and, for me personally, Gemini 2.5 Pro, which I immediately started using for some of my work and which completely revolutionized the paradigm I had been working with.
The key breakthrough here is RLVR (Reinforcement Learning from Verifiable Rewards). Unlike traditional RLHF, which requires expensive human annotation, RLVR uses binary reward signals from tasks where correctness can be automatically verified: math problems (does the answer match?) and code (do the tests pass?). This makes scaling much easier, since you don’t need humans in the loop.
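To make the idea concrete, here is a toy sketch of what verifiable rewards could look like. These are illustrative stand-ins, not any lab's actual pipeline: a real system would parse answers far more robustly and run candidate code in a sandbox.

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the last number in the completion matches the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def code_reward(completion: str, tests: str) -> float:
    """Binary reward: 1.0 if the candidate code passes its unit tests."""
    scope: dict = {}
    try:
        exec(completion, scope)  # NOTE: sandbox this in any real pipeline
        exec(tests, scope)
        return 1.0
    except Exception:
        return 0.0
```

No human in the loop: the grader is a string match or a test suite, which is exactly what makes this cheap to scale.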
DeepSeek’s R1 paper (also published in Nature) showed something remarkable: R1-Zero, trained with pure RL and no supervised fine-tuning, achieved 71% on AIME 2024 and spontaneously developed self-correction behaviors. The model learned to use chain-of-thought reasoning purely from reward signals. They used GRPO (Group Relative Policy Optimization), which samples multiple outputs per prompt and reinforces the ones that score above the group average - no critic model needed, which makes it very memory efficient.
Another interesting finding: the s1 paper from Stanford showed you need surprisingly few examples (just 1,000) for a model to start reasoning well - hinting that pretrained models already have latent reasoning capabilities that just need to be “activated”.
There’s an ongoing debate on whether RLVR truly expands reasoning capacity or just makes models more efficient samplers of existing capabilities. The evidence suggests both are somewhat true: RLVR definitely makes models better at pass@1, but base models can catch up at high pass@k values.
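For reference, pass@k in these comparisons is usually computed with the standard unbiased estimator over n samples of which c are correct (this is the common formula from the code-generation literature, not something specific to RLVR papers):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples hits one of the c correct ones."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The debate is then about curves of this quantity: RLVR models dominate at k=1, while base models often close the gap as k grows.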
A brief mention of key reasoning models this year:
- o1 series
- DeepSeek R1
- Gemini 2.5 pro
- Kimi K2
- Claude Opus 4.5
- Gemini 3 pro and flash
AI agents - a while loop with tools
Thanks to these enhanced capabilities, the industry shifted from sophisticated graph pipelines to simply a very smart model in a loop, with an objective and a set of tools.
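The whole paradigm really is about this small: a loop that calls the model, executes whatever tool it asks for, and feeds the result back until the model decides it is done. `call_model`, the message format, and the tool registry below are hypothetical stand-ins, not any specific vendor API:

```python
def run_agent(objective, tools, call_model, max_steps=20):
    """Minimal agent loop: model in charge, tools on demand.

    tools: dict mapping tool name -> callable
    call_model: returns {"tool_call": {...}} or {"tool_call": None, "content": ...}
    """
    messages = [{"role": "user", "content": objective}]
    for _ in range(max_steps):
        reply = call_model(messages, tools)
        if reply.get("tool_call") is None:
            return reply["content"]  # model decided it's done
        call = reply["tool_call"]
        result = tools[call["name"]](**call["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```

All the sophistication lives in the model and the tools; the orchestration layer shrinks to almost nothing.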
The immense success of this paradigm has been made evident by Claude Code which is probably the most mind-blowing innovation for me this year. It’s an agentic coding tool that lives in your terminal, understands your entire codebase (via agentic search, not RAG), and handles everything from code migrations to bug fixes through natural language.
Anthropic uses it everywhere internally: data infrastructure teams use it to debug Kubernetes issues, finance teams generate Excel reports from natural language, and the product team wrote 70% of its Vim-mode code in auto-accept mode.
Engineer adoption of AI-assisted coding
By the end of 2025 it is very hard to find engineers who still doubt AI’s effect on productivity and quality of work.
Several well-known engineers use LLMs regularly - for vibe coding, assistance, or finding hard bugs. This ranges from critical cryptography code written in Go, to Python libraries, to system-level C code:
- Simon Willison - Cooking with Claude
- Filippo Valsorda - Claude Debugging
- Andrej Karpathy - Vibe Coding
- Antirez - LLM assisted coding
Lately I delegate most of my code to AI agents, and I spend much more of my time on:
- In-depth study of the context and related libraries
- Scoping, designing, and planning the solution, plus brainstorming with AI
- Delegating the implementation to AI
- In-depth review! NOTE: as I spend less time writing the code, I spend much more time here, making sure I also learn about new concepts and patterns
Context engineering
This was a trend especially in product applications rather than research, and something that, as an applied AI engineer, I have been working on a lot.
As models become more powerful and less sensitive to prompting, and simple workflows can already handle 90% of the job well, managing the context effectively becomes the important part. Questions like:
- What should make it into the context?
- What should be deprecated?
- How do I manage 50+ tools?
- What if one of the tools was called earlier in the history and is now deprecated?
- What if one of the tools had a bug that is now fixed?
- How do I make the model perform custom multi-step sets of instructions?
Some interesting approaches here:
- Skills - reusable instruction sets
- Tool search - retrieving relevant tools dynamically
- Programmatic tool calls
- etc.
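As a toy illustration of the tool-search idea: instead of exposing 50+ tool schemas to the model on every turn, retrieve only the tools relevant to the current request. The keyword-overlap scoring below is a deliberate simplification (all names are made up; real systems would typically use embedding similarity):

```python
def search_tools(query: str, registry: dict[str, str], top_k: int = 3) -> list[str]:
    """registry maps tool name -> natural-language description.
    Returns the names of the best-matching tools for this query."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(desc.lower().split())), name)  # word-overlap score
        for name, desc in registry.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]
```

Only the retrieved tools' schemas then go into the context, which sidesteps both the token cost and the deprecated-tool questions above: tools that no longer exist simply stop being retrievable.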
Multi-modal starts to really feel different
Models like Gemini Nano Banana Pro are much more than simple image-editing models. They have a strong understanding of the world and can perform very complex tasks. Some of the most fun examples I saw:
- Summarizing a 50+ page paper as a whiteboard diagram
- Translating a handwritten menu and converting it into a menu with dish images and descriptions
Browser agents
Having been mind-blown by Claude Code since day one, and having experienced first-hand the dramatic improvement in productivity and quality of work it brings, I feel it is only natural to see these gains extend from engineering to other domains.
The main platform to do that is probably a browser. This is close enough to coding assistants but can also easily extend to many other everyday tasks: planning and booking trips, checking emails, taking notes during meetings, preparing presentations.
The very recent release of Claude for Chrome already feels like a convincing step in that direction.
However, there are significant safety concerns here, so it is still early days. If those get sorted out, the potential is huge in my opinion.
Research frontiers I’m excited about
While I spend most of my time on practical applications, some research developments this year feel like genuine inflection points.
Embodied AI and Robotics
The release of Gemini Robotics in March 2025 (paper) was a big deal. It’s a Vision-Language-Action (VLA) model - essentially adding “physical actions” as an output modality to Gemini. The model can directly control robots, generalizing across novel situations and responding to open vocabulary instructions.
The September update (Gemini Robotics 1.5) added reasoning before action - the robot can now “think” and explain its decisions.
Self-driving goes international
Waymo announced London for 2026 - their first European market.
Meanwhile Tesla continues pushing vision-only FSD and is preparing robotaxi tests. The regulatory landscape is shifting fast.
World models for agent training
Genie 3 (August 2025) might be the most underrated release of the year. It’s a general-purpose world model that generates interactive 3D environments from text prompts at 720p/24fps in real time.
The key breakthrough: world memory. The model remembers what it generated - walk away, come back, and everything is where you left it. It learned physics without a hard-coded engine, purely through self-supervised learning. DeepMind is already using it to train their SIMA agent on navigation tasks.
Training embodied AI in the real world is expensive and dangerous. Unlimited simulated worlds with consistent physics could be what finally gives us that “Move 37 moment” for robotics - agents discovering strategies humans never imagined.
Convergence
The theme across all of this is convergence: LLMs + vision + world understanding + physical action. The pieces are coming together faster than I expected.