The path from LLMs to AGI

Published July 14, 2024

This is going to be a philosophical dive into the theoretical limits of Large Language Models (LLMs). This is not meant to be some formal study where I cite sources etc. I’m also not concerned about real-world costs or technical constraints here. Instead, I think this is an interesting philosophical exploration of what intelligence actually is, and from that figuring out if LLMs contain it and how intelligent they can become by just piling on more parameters and capabilities.

I’ll basically use bro-science (the art of deducing “the obvious” from premises without relying on any induction) to reach a surprising conclusion at the end. Let’s start!

What is Intelligence and Do LLMs Contain It?

LLMs are neural networks designed to predict the next word in a sequence based on their training data. So how can such a simple objective produce the amazing things we see LLMs do today?
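
To make that prediction objective concrete, here is a minimal sketch of next-word prediction using a toy bigram counter. This is nothing like a real LLM (which learns a neural network over tokens rather than counting word pairs), and the tiny corpus is made up, but it shows exactly what "predict the next word in a sequence" means as a task:

```python
from collections import defaultdict, Counter

# Toy corpus and bigram counts -- a stand-in for "training data".
corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    counts = bigrams[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' -- the most frequent continuation
print(predict_next("cat"))  # 'sat' (ties are broken by first occurrence)
```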

Here’s the surprising thing about that: In order to predict the next word in a sequence efficiently, it helps to understand what the words actually mean (i.e. what they represent). Thus, LLMs that have this understanding will be able to perform this task more effectively.

Thus, when we’re training LLMs to predict the next word, we’re actually training them to be intelligent.

The reason for this is that intelligence can actually be seen as compression. This idea is related to information theory, where understanding something means you can represent it more compactly. Just think about this: Can you convey something with fewer words if you do not “understand” that thing?

That “understanding” is actually just a higher level representation of the thing that you are trying to convey. Possessing this higher level understanding, being able to relate different high level concepts to each other and have them interact in your head, that is in my opinion the definition of intelligence.
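
One rough way to see the compression intuition, using a generic compressor rather than an LLM: text with regularities that can be "modeled" shrinks far more than random bytes of the same length. The byte counts in the comments are rough expectations, not exact outputs.

```python
import os
import zlib

# Text with structure a compressor can exploit vs. random bytes of the same length.
structured = ("the cat sat on the mat. " * 200).encode()
random_bytes = os.urandom(len(structured))

print(len(structured), "->", len(zlib.compress(structured)))      # 4800 -> a few dozen bytes
print(len(random_bytes), "->", len(zlib.compress(random_bytes)))  # 4800 -> roughly 4800 (no gain)
```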

Here’s a fun test you can try: Give a friend a piece of text and ask them to summarize it in as few words as possible. This is, in my opinion, a more reliable method of testing their true intelligence than traditional IQ tests. You can also try this in interviews if you are hiring: explain something about your product or your company, then ask the candidate to summarize what you explained in as few words as possible. That will, in my opinion and experience, test their intelligence better than any IQ test I’ve seen.

I believe this “compression of information” is actually why nature evolved intelligence in the first place. Humans are able to compress vast amounts of information in tiny brains with extreme efficiency compared to what we can build with computers today.

Can Current LLM Models Become More Intelligent by Just Adding More Parameters?

Adding more parameters to a neural net will allow it to represent exponentially more concepts, and higher level concepts. Therefore, I believe that the answer is yes.

Concepts are hierarchical, as explained in Ray Kurzweil’s book How to Create a Mind. To paraphrase his explanation, let’s say you have a concept called “cat”. How do you create a neural network that can “see” this?

  1. Simple Inputs: Start with basic inputs, such as two straight lines at different angles. Each neuron represents one such line.
  2. Combination of Features: When the two neurons representing the two straight lines at different angles fire together, that represents an ear-like shape (/\). Thus, the higher level neuron which fires together with those two lower level neurons starts to represent a “cat’s ear”. This neuron, in turn, fires together with others (say, the sounds cats make, other shapes that cats have, etc.) to form higher and higher level neurons, which eventually represent the concept of a “cat”.
  3. Higher-Level Abstractions: As you combine more features and patterns, you build higher and higher levels of abstraction, such as the concept of “animal”.
  4. Top-Level Concepts: Eventually, you get to abstract concepts such as meanings, ideas, relationships, etc.

I’m probably completely botching Ray’s explanation here, but hopefully the point is clear. This hierarchical building of concepts allows larger networks to have more “levels”, thus representing more complex ideas with greater detail and accuracy.
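
For what it’s worth, here is a deliberately crude, hand-wired caricature of that hierarchy. The “neurons” are plain functions and the feature names (line_at_45, ear_shape, whiskers) are invented for illustration; a real network would learn these combinations rather than have them wired by hand.

```python
# A hand-wired caricature of the hierarchy described above. Each "neuron"
# is just a function that fires (returns True) when the features one level
# below it fire together.

def line_at_45(pixels):   # level 1: simple oriented edge detector
    return "/" in pixels

def line_at_135(pixels):  # level 1: another edge detector
    return "\\" in pixels

def ear_shape(pixels):    # level 2: two edges firing together -> /\
    return line_at_45(pixels) and line_at_135(pixels)

def whiskers(pixels):     # another level-2 feature (toy stand-in)
    return "~" in pixels

def cat(pixels):          # level 3: mid-level features firing together
    return ear_shape(pixels) and whiskers(pixels)

def animal(pixels):       # level 4: higher abstraction over concepts
    return cat(pixels)    # (a real net would pool many such concepts)

print(cat("/\\~"))     # True  -- edges + whiskers combine into "cat"
print(cat("/\\"))      # False -- ears alone are not enough
print(animal("/\\~"))  # True  -- the abstraction chain keeps going up
```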

Because they can represent such high level concepts, larger networks can incorporate more nuanced details, such as:

  • Contextual Understanding: Who asked a question, the formulation of the question, and what it says about the person asking.
  • Writing Style: The style used and what it implies about the question and the required answer.
  • Higher Abstractions: Embedding higher levels of abstraction to better understand and respond to inputs.

The More Modalities, The Higher The Intelligence

However, today’s LLMs have some limitations that will cap the level and accuracy of understanding they can achieve.

One significant limitation I see is that current networks are primarily “single modality” or “few modalities” in a single network. The LLMs that are multimodal implement this multimodality not in a single neural net, but in multiple specialized neural nets that communicate with each other to achieve the effect. There are some exceptions, but even these implement, to my knowledge at least, at most 1-3 modalities in a single neural net. I am not aware, for example, of neural networks that incorporate elements necessary to represent true reality, such as force, friction, temperature, and smell, among other potential sensory inputs.

This limits the amount of intelligence that an LLM can contain, and the level of accuracy with which it can represent true reality.

For instance, an LLM trained only on words will be limited to forming concepts that relate words to other words. Its understanding of the universe will be based solely on how every word relates to other words, thus forming concepts that only represent these relationships. This means LLMs can only see reality indirectly, through the concepts we created to represent reality (the words and symbols it is trained on). This indirect mapping may not always correspond accurately to actual reality and can often be misleading. I don’t have a mathematical way to prove this, but I’m sure someone somewhere has proven that an indirect mapping (via an intermediary map, let’s call it B) of one set of things (A) to another set of things (C) will not necessarily lead to an accurate mapping of A to C.
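
A toy example makes the A → B → C point concrete: if the intermediate map B collapses distinctions that exist in A, no map from B to C can recover them, however good it is. The “realities” and “words” below are invented purely for illustration.

```python
# A: aspects of physical reality (invented examples)
# B: the words used to describe them
# C: what a word-only model can reconstruct about reality

a_to_b = {
    "ice at -40C": "cold",
    "ice at -1C":  "cold",   # two different realities, one word
    "boiling tea": "hot",
}

b_to_c = {
    "cold": "low temperature",
    "hot":  "high temperature",
}

for a, b in a_to_b.items():
    print(f"{a!r} -> {b!r} -> {b_to_c[b]!r}")

# 'ice at -40C' and 'ice at -1C' become indistinguishable after the detour
# through B, so the composed map A -> C has lost information about A.
```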

The more direct the inputs and the more types of modalities a single network can process, the better it will be able to create a representation of reality as it is.

In addition, such a network can compress at higher levels, thus containing more intelligence.

Finally, true multimodal networks will also be able to be trained on vastly more information, because they can then process all the input physical reality has to offer, not just the symbols that humans created.

Self-Supervised Learning through Predictions vs. Results

Once multimodal input is integrated, there is one more essential element needed for LLMs to achieve a higher level of autonomy in learning: multimodal output.

Multimodal output means generating outputs that create physical consequences, such as physical force.

Having both multimodal input and output allows the LLM to create models predicting what should happen given a certain output and then compare this to what actually happens. If the predictions and results don’t match, the network can self-adjust without human intervention.

This allows it to improve its neural network by itself.
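
Here is a minimal sketch of that predict/act/compare loop, with a one-parameter “forward model” that adjusts itself from prediction error. The world is reduced to a single made-up gain constant standing in for real sensor feedback; the point is only that the error signal comes from reality, not from human labels.

```python
import random

# The "world": result = true_gain * force. The agent does not know true_gain;
# it only sees what happens after it acts. This constant is a made-up stand-in
# for real physics sensed through multimodal input.
true_gain = 2.5

model_gain = 0.5       # the agent's current internal model of the world
learning_rate = 0.1

for _ in range(500):
    force = random.uniform(0.0, 1.0)             # multimodal output: act on the world
    predicted = model_gain * force               # what the model expects to happen
    observed = true_gain * force                 # what actually happens (sensed back as input)
    error = observed - predicted                 # prediction vs. result
    model_gain += learning_rate * error * force  # self-adjust; no human labels involved

print(round(model_gain, 2))  # approaches 2.5, the true behaviour of the world
```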

A larger network, which can represent high level concepts such as human relations and feelings, can then begin to experiment with, for example, how humans react to certain statements or actions, and fine-tune its representations of how to interact with others, group dynamics, feelings, and so on.

This is the final frontier to achieving true superhuman intelligence.

Conclusion

Expanding the size and scope of current LLMs will certainly make them more intelligent. However, they are unlikely to fully and truly understand reality until we add multimodal input and output within a single neural network.

Once this is achieved, the only thing, I believe, stopping AGI from happening is sufficiently powerful computers.

I believe Ray Kurzweil is correct in his predictions. The singularity is indeed near. And the remarkable thing is that it can be achieved with incremental improvements to current technology.
