ChatGPT Is a Blurry JPEG of the Web

Interesting, illuminating (but contested) metaphor for thinking about LLMs from one of my favourite authors, Ted Chiang:

"Think of ChatGPT as a blurry jpeg of all the text on the Web. It retains much of the information... but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation... nonsensical answers to factual questions... are compression artifacts... plausible enough that identifying them requires comparing them against the originals ...

a common technique used by lossy compression algorithms is interpolation ... when [ChatGPT] is prompted to describe... losing a sock in the dryer using the style of the Declaration of Independence: it is taking two points in “lexical space” and generating the text that would occupy the location between them."
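
To make the interpolation point concrete, here is a toy sketch (mine, not Chiang's, and nothing like how JPEG or ChatGPT actually work internally): throw away most of a sequence, then rebuild the gaps by drawing straight lines between the values that were kept. The function names `downsample` and `reconstruct` and the sample data are made up for illustration.

```python
def downsample(samples, step):
    """Lossy step: keep every `step`-th sample (plus the last one), discard the rest."""
    kept = list(range(0, len(samples), step))
    if kept[-1] != len(samples) - 1:
        kept.append(len(samples) - 1)
    return [(i, samples[i]) for i in kept]


def reconstruct(points, length):
    """Rebuild a full-length sequence by linear interpolation.

    Assumes `points` holds at least two (index, value) pairs in index order.
    """
    out = [0.0] * length
    for (i0, v0), (i1, v1) in zip(points, points[1:]):
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)  # position between the two kept points
            out[i] = v0 + t * (v1 - v0)
    return out


original = [0, 1, 4, 9, 16, 25, 36, 49, 64]  # the "Web": squares of 0..8
compressed = downsample(original, 4)         # only 3 of 9 values survive
approx = reconstruct(compressed, len(original))

print(compressed)  # [(0, 0), (4, 16), (8, 64)]
print(approx)      # [0.0, 4.0, 8.0, 12.0, 16.0, 28.0, 40.0, 52.0, 64.0]
```

The rebuilt sequence is exact wherever a value was kept and merely plausible everywhere else: it rises smoothly, like the original, but you can only tell that 8.0 is wrong by comparing it against the 4 in the original.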

But the best way of compressing knowledge is to understand it: "the more the program knows about supply and demand, the more words it can discard when compressing the pages about economics... [But can] we say that it actually understands economic theory?" After all, whatever understanding it has stems from statistical analyses of the text, which "reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.”"

It certainly doesn't do maths well once the numbers get large: "there aren’t many Web pages that contain the text “245 + 821,”". So if it hasn’t mastered basic maths but can write college-level essays, do "statistical regularities in text actually correspond to genuine knowledge of the real world?"

Consider a chatbot which simply quotes relevant pages: "In human students, rote memorization isn’t an indicator of genuine learning, so ChatGPT’s inability to produce exact quotes from Web pages is precisely what makes us think that it has learned something... lossy compression looks smarter than lossless compression."

But what is it actually useful for?

  • search: how could we trust that the answers are based on the "right" parts of the web, rather than propaganda and conspiracy theories?
  • content creation: it's good for content mills, which means it "is not good for people searching for information"
  • creative writing: "starting with a blurry copy of unoriginal work isn’t a good way to create original work... Sometimes it’s only in the process of writing that you discover your original ideas... Your first draft isn’t an unoriginal idea expressed clearly; it’s an original idea expressed poorly", accompanied by your dissatisfaction with it, which drives you to improve it.

So... not that useful: "So just how much use is a blurry jpeg, when you still have the original?"

