The obsession with model size has two interesting aspects.

The first is a parallel obsession (by many of the same people) with a few specific numbers in modern CPUs (cache size, branch predictor size, ROB size, and so on). After interacting with people for many years, it's become clear to me that for most people these numbers have mere totemic significance. You can try all you like to explain that the advances of company A are, most importantly, in the particular *algorithms* for how they use this raw material of caches or branch predictors; it will make no difference. They don't especially know what the numbers mean in a technical sense and, even more importantly, they don't *care*; the numbers exist purely to fulfill a shibboleth role, to indicate that their tribe is doing better or worse than the opposing tribe.

So Altman may be strategic on this front. His primary concern may be less about informing competitors of anything than about stoking the fires of tribal fury. As soon as a single number ("model parameter count") becomes a tribal shibboleth, for most participants in this culture war it becomes unmoored from reality, and all that matters is whether it's growing ("bad!" for many of them) or not.

The second is the question of how models can become better without growing.

There are at least two obvious directions. The immediate one (which we're already going down) is offloading system 2 thinking to the machines (eg Google or Wolfram) that can do it a lot better than an LLM. If someone asks you a fact, or a piece of arithmetic, don't try to synthesize it from your neural net; know enough to recognize the type of question and look up the answer in the right place. That's already 50% of the difference between a person educated enough to know when to use "the library" vs a person convinced their vague recollection of something is probably close enough to the correct answer.
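To make the routing idea concrete, here is a minimal sketch of "recognize the type of question and look it up." Everything in it is an assumed placeholder (the regex classifier, `search_knowledge_base`, `generate_with_llm`); it is not how any particular product does it, just the shape of the idea.

```python
# Sketch: route questions that need exact facts or arithmetic to an external
# tool instead of generating the answer from the model's weights.
# The classifier and the two helper functions are hypothetical placeholders.
import re

def answer(question: str) -> str:
    q = question.strip()
    # Crude "type of question" recognition; a real system would let the model
    # itself (or a trained classifier) make this routing decision.
    if re.fullmatch(r"[\d\s\.\+\-\*/\(\)]+", q):
        return str(eval(q))                      # arithmetic -> calculator, not the LLM
    if q.lower().startswith(("who", "when", "where", "what year")):
        return search_knowledge_base(q)          # factual -> lookup ("the library")
    return generate_with_llm(q)                  # open-ended -> let the model write

def search_knowledge_base(q: str) -> str:
    # Placeholder for a call to a search engine, Wolfram-style solver, etc.
    raise NotImplementedError

def generate_with_llm(q: str) -> str:
    # Placeholder for an ordinary LLM completion call.
    raise NotImplementedError
```

The point is only that the dispatch happens *before* generation, so the model never has to fake precision it doesn't have.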

The second obvious direction follows from the fact that flat attention (ie looking at more and more words backwards) doesn't scale. Take a hint from how TAGE does it (for branch, or more generally pattern, prediction) or how humans do it, with a geometrically increasing "window" on past text. What humans seem to do is embed not just individual words (and word fragments) (ie place them in a space relative to other words) but also sentences, paragraphs, sections and so on. Apart from recursion, our other language superpower is chunking, but LLMs (at least as far as I can tell) do not yet do any "designed" chunking, only whatever small-scale chunking they might get by accident as a side effect of the LLM training. So even if GPT-4's only real innovation is a "second round" of embedding at the sentence rather than the word-fragment level, that should already be enough to substantially improve it; and of course the obvious next thing once that works is to recurse it to paragraphs and larger semantic structures. A toy version of that hierarchy is sketched below.
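Here is a toy sketch of that chunking hierarchy, under stated assumptions: token vectors are pooled into sentence vectors, sentence vectors into paragraph vectors, and so on, so each level attends over a geometrically smaller number of coarser items. The random `embed_tokens` and the mean pooling are illustrative stand-ins (a real model would learn both), and none of this claims to describe what GPT-4 actually does.

```python
# Sketch: hierarchical ("chunked") embeddings -- tokens -> sentences ->
# paragraphs -> document. Pooling here is a simple mean; a learned model
# would use something like attention pooling at each level.
import numpy as np

def embed_tokens(tokens: list[str], dim: int = 64) -> np.ndarray:
    # Stand-in for a learned token / word-fragment embedding.
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return rng.standard_normal((len(tokens), dim))

def pool(vectors: np.ndarray) -> np.ndarray:
    # One chunking step: collapse a span of lower-level vectors into a
    # single higher-level vector.
    return vectors.mean(axis=0)

def hierarchical_embedding(paragraphs: list[list[list[str]]]) -> np.ndarray:
    # Each level re-embeds the level below it; the document vector is a
    # chunk of chunks, i.e. the same operation applied recursively.
    paragraph_vecs = []
    for paragraph in paragraphs:
        sentence_vecs = [pool(embed_tokens(sentence)) for sentence in paragraph]
        paragraph_vecs.append(pool(np.stack(sentence_vecs)))
    return pool(np.stack(paragraph_vecs))

doc = [[["The", "cat", "sat"], ["It", "purred"]],
       [["Then", "it", "left"]]]
print(hierarchical_embedding(doc).shape)   # (64,) -- one vector for the document
```

The payoff of the geometric structure is that attention at the paragraph level sees a handful of items rather than thousands of tokens, which is exactly the scaling property flat attention lacks.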

(This is apart from the sort of trivial low-level parameter tweaking and network layer restructuring that will be ongoing for many years.)
