Curated Resource

2025: The year in LLMs

my notes

"a year filled with a lot of different trends":

"“reasoning” aka ... Reinforcement Learning from Verifiable Rewards (RLVR) ... Reasoning models with access to tools can plan out multi-step tasks, execute on them and continue to reason about the results such that they can update their plans to better achieve the desired goal... also exceptional at producing and debugging code."

So add in tool access and you get "The year of agents", which he defines as "an LLM that runs tools in a loop to achieve a goal". AI Search became good as a result, but "The “coding agents” pattern is a much bigger deal... I wrote more about how I’m using these in Code research projects with async coding agents like Claude Code and Codex and Embracing the parallel coding agent lifestyle... developers ... embrace(d) LLMs on the command line".
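
As a rough illustration of that definition, here is a minimal "tools in a loop" sketch in Python. The call_llm helper and the shape of its return value are hypothetical stand-ins for whatever model API is in use; this is not Willison's code.

    # Minimal sketch of the agent pattern: an LLM that runs tools in a loop
    # until it stops requesting them. call_llm() and its return format are
    # hypothetical stand-ins for a real model API.
    def run_agent(goal, call_llm, tools, max_steps=20):
        messages = [{"role": "user", "content": goal}]
        for _ in range(max_steps):
            reply = call_llm(messages, tools=list(tools))
            call = reply.get("tool_call")
            if call is None:
                return reply["content"]  # no more tool calls: the model considers the goal achieved
            result = tools[call["name"]](**call["arguments"])  # execute the requested tool
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})  # feed the result back and loop again
        return "stopped after max_steps without finishing"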

Willison runs "in YOLO mode all the time, despite being deeply aware of the risks involved", which means he's falling for "the “Normalization of Deviance” phenomenon, where repeated exposure to risky behaviour without negative consequences leads people and organizations to accept that risky behaviour as normal."

2025 saw China release serious LLMs, of course. Also:

  • "enormous leaps forward ... perform tasks that take humans multiple hours... the length of tasks AI can do is doubling every 7 months"
  • "The most successful consumer product launch of all time" was OpenAI's prompt-driven image editing in March, where "you could upload your own images and use prompts to tell it how to modify them... 100 million ChatGPT signups in a week". Followed by many others, including Nano Banana, which "could generate useful text! ... also clearly best at following image editing instructions"
  • "reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad", without tools, solving challenges designed for the competition, and so not in any training data.
  • "Llama 4 had high expectations" but disappointed: too big.
  • "OpenAI lost their lead... [in 2025]: still have top tier models, but ... challenged across the board", although they still own consumer mindshare, where Gemini is the biggest challenger - see their own victorious 2025 recap. Under the hood, Google has their own "in-house hardware, TPUs, which they’ve demonstrated this year work exceptionally well for both training and inference", giving them a huge advantage over their Nvidia-reliant competitors
  • He "built 110 tools... tools.simonwillison.net. Almost every tool is accompanied by a commit history that links to the prompts and transcripts I used to build them" See: Here’s how I use LLMs to help me write code, Adding AI-generated descriptions to my tools collection , Building a tool to copy-paste share terminal sessions using Claude Code for web Useful patterns for building HTML tools—my favourite post of the bunch.
  • We learnt "Claude 4 might snitch you out to the feds", then SnitchBench showed "almost all [LLMs] do the same thing!"
  • "Andrej Karpathy coined the term “vibe coding” ... The key idea here was “forget that the code even exists”... prototyping software that “mostly works” through prompting alone... I’ve [n]ever seen a new term catch on—or get distorted—so quickly"
  • "Model Context Protocol ... open standard for integrating tool calls with different LLMs... exploded in popularity... may be a one-year wonder [due to] the stratospheric growth of coding agents... the brilliant Skills mechanism" is probably more sigificant: see "Claude Skills are awesome, maybe a bigger deal than MCP. MCP involves web servers and complex JSON payloads. A Skill is a Markdown file in a folder, optionally accompanied by some executable scripts."
  • "everyone seems to want to put LLMs in your web browser ... deeply concerned about the safety implications ... My browser has access to my most sensitive data ... A prompt injection attack against a browsing agent that can exfiltrate or modify that data is a terrifying prospect."
  • AI slop was "crowned word of the year", but has it really changed anything? "The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff... Curation matters more than ever." He does admit that he's not on Facebook, so maybe slop "is a growing tidal wave that I’m innocently unaware of."
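
To make the Skills comparison from the list above concrete: a skill is roughly a folder with a SKILL.md (a short Markdown description of when and how to use it) plus optional scripts. The sketch below scaffolds a hypothetical one in Python; the skill name, header fields, and file contents are invented for illustration, not taken from Willison's post.

    # Hypothetical illustration: a "skill" is just a folder holding a SKILL.md
    # (Markdown with a short name/description header) and, optionally, scripts
    # the agent can run. All names and contents here are invented.
    from pathlib import Path

    skill = Path("skills/summarize-csv")
    (skill / "scripts").mkdir(parents=True, exist_ok=True)
    (skill / "SKILL.md").write_text(
        "---\n"
        "name: summarize-csv\n"
        "description: Summarise a CSV file and report its columns\n"
        "---\n"
        "Run scripts/summarize.py against the CSV, then describe the output.\n"
    )
    (skill / "scripts" / "summarize.py").write_text(
        "import csv, sys\n"
        "rows = list(csv.DictReader(open(sys.argv[1])))\n"
        "print(len(rows), 'rows; columns:', list(rows[0]) if rows else [])\n"
    )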

Read the Full Post

The above notes were curated from the full post simonwillison.net/2025/Dec/31/the-year-in-llms/.

