Unveiling Yann LeCun's Bold Vision: Moving Beyond LLMs for True AI Advancement

Yann LeCun, a leading AI researcher, shares his perspective on the limitations of language models and the need for new architectures that can truly understand and reason about the physical world.

April 17, 2025


Unlock the secrets of the physical world with a revolutionary AI architecture that goes beyond language models. Discover a new approach that empowers machines to reason, plan, and understand the real world like never before.

Why Yann LeCun Is No Longer Interested in LLMs

Yann LeCun, a renowned AI researcher, has expressed his diminishing interest in large language models (LLMs), saying they are now in the hands of industry and product teams who focus on incremental improvements: acquiring more data and compute, and generating synthetic data.

Instead, LeCun is more excited about four areas he considers more interesting and more crucial to the advancement of AI:

  1. Understanding the physical world: Developing machines that can truly understand the physical world, rather than just predicting the next token.

  2. Persistent memory: Enabling machines to have persistent memory, which is an aspect that is not widely discussed.

  3. Reasoning and planning: Improving the way machines reason and plan, as LeCun believes current approaches to reasoning in LLMs are oversimplified.

  4. World models: Exploring the concept of world models, which are mental representations of the physical world that humans acquire in the first few months of life, and how to incorporate them into AI systems.

LeCun argues that the current architectures used in LLMs are not well-suited to dealing with the real world: they are built around predicting the next token, a discrete and limited task. He believes the architectures needed for systems that truly deal with the physical world must be fundamentally different from those used in LLMs.

LeCun's perspective suggests that the path to artificial general intelligence (AGI) may not lie solely in the continued development of LLMs, but in exploring new architectures and approaches that better capture the complexity of the physical world and human-like reasoning.

The Need for Machines to Understand the Physical World

Yann LeCun, a renowned AI researcher, argues that the current focus on large language models (LLMs) is not the most promising path toward artificial general intelligence (AGI). He believes there are more interesting questions to explore, such as how to get machines to understand the physical world, maintain persistent memory, reason, and plan.

LeCun explains that text-based models like LLMs are limited in their ability to deal with the real world: they are trained to predict the next token in a sequence, a relatively simple task compared to understanding and interacting with a physical environment. He emphasizes that the architectures needed to deal with the real world are completely different from those used for language tasks.

LeCun suggests that the key to understanding the physical world lies in the concept of "world models" - internal representations of the world that humans acquire in the first few months of life, which allow us to manipulate our thoughts and deal with reality. These world models are much harder to learn than language models, because they must capture the continuous, high-dimensional nature of the physical world rather than a discrete, finite set of tokens.

To address this challenge, LeCun and his colleagues have been working on a new architecture called the Joint Embedding Predictive Architecture (JEPA), which learns abstract representations of the physical world and reasons about them instead of trying to predict every pixel in a video. This approach, he believes, is a more promising path toward AI systems that can truly understand and interact with the real world, a crucial step toward AGI.

The Limitations of Token-Based Representations

Yann LeCun, a prominent figure in AI research, expresses his diminishing interest in large language models (LLMs), saying they are now in the hands of industry professionals focused on incremental improvements such as acquiring more data and compute, or generating synthetic data. Instead, LeCun highlights four more exciting areas of focus: understanding the physical world, developing persistent memory, reasoning and planning, and building world models.

LeCun argues that token-based representations, the foundation of LLMs, are insufficient for modeling the physical world. Tokens, being discrete and finite, cannot adequately capture the continuous, high-dimensional nature of real-world data. Attempts to train systems to predict videos at the pixel level have largely failed, because they waste resources trying to predict unpredictable details.

LeCun suggests that the solution lies in joint embedding predictive architectures (JEPAs), which learn abstract representations of the world and can manipulate those representations to reason and plan, rather than relying on token-based prediction. These architectures, such as the V-JEPA model, are trained to predict missing or masked parts of a video in an abstract representation space, allowing them to discard irrelevant information and learn more efficiently.
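To make the contrast concrete, here is a deliberately tiny, hypothetical sketch (not Meta's actual V-JEPA code) of why a loss computed in an abstract representation space can ignore unpredictable detail that dominates a pixel-level loss. The `encode` function is a stand-in encoder that keeps only a frame's mean intensity:

```python
# Toy illustration: pixel-space loss vs representation-space loss.
# "Frames" are just lists of pixel intensities; "encode" is a stand-in
# encoder that discards pixel-level detail, keeping only mean intensity.

def encode(frame):
    """Map a frame to a 1-D abstract representation (mean intensity)."""
    return sum(frame) / len(frame)

def pixel_loss(predicted_frame, actual_frame):
    """MSE over every pixel -- dominated by unpredictable noise."""
    return sum((p - a) ** 2
               for p, a in zip(predicted_frame, actual_frame)) / len(actual_frame)

def representation_loss(predicted_frame, actual_frame):
    """MSE between abstract representations -- the noise cancels out."""
    return (encode(predicted_frame) - encode(actual_frame)) ** 2

# Two frames identical except for unpredictable pixel-level noise.
actual    = [0.5 + n for n in (0.1, -0.1, 0.05, -0.05)]
predicted = [0.5, 0.5, 0.5, 0.5]

print(pixel_loss(predicted, actual))           # large: penalizes the noise
print(representation_loss(predicted, actual))  # near zero: noise discarded
```

A model trained on the pixel loss is pushed to model the noise; the representation loss rewards it only for what is actually predictable, which is the intuition behind predicting in an abstract space.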

Furthermore, LeCun draws a distinction between "System 1" and "System 2" thinking, where System 1 is the automatic, reactive mode, while System 2 involves deliberate reasoning and planning. LeCun argues that current AI systems are primarily System 1, and that achieving artificial general intelligence (AGI) will require architectures that can handle System 2-style reasoning and planning.

Introducing the V-JEPA Architecture

V-JEPA (Video Joint Embedding Predictive Architecture) is a novel approach proposed by Yann LeCun and his team as an alternative to current language-model-based AI systems. Unlike traditional token-based prediction models, V-JEPA focuses on learning abstract representations of the physical world from video data.

The key aspects of the V-JEPA architecture are:

  1. Learning from Video Data: V-JEPA is pre-trained on video data, allowing it to efficiently learn concepts about the physical world, similar to how a human baby learns by observing its surroundings.

  2. Representation-level Prediction: V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space, rather than trying to reconstruct every pixel.

  3. Flexibility and Efficiency: By discarding irrelevant information, V-JEPA can learn more efficiently than generative approaches that attempt to fill in every missing pixel.

  4. Physical Plausibility: The V-JEPA system can detect whether a video sequence is physically possible by measuring the prediction error over a sliding window of video frames. This allows the model to learn physically realistic representations.

  5. Reasoning and Planning: The V-JEPA architecture aims to enable machines to reason and plan in an abstract mental space, similar to how humans can mentally manipulate and rotate objects, without relying solely on language-based reasoning.
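The plausibility check in point 4 can be illustrated with a toy sketch. This is an assumption-laden stand-in, not the real model: here each frame's "representation" is a single number and the "predictor" is naive linear extrapolation, but the logic - flag a sequence as implausible when the sliding-window prediction error spikes - is the same:

```python
# Toy plausibility detector: slide a window over per-frame features,
# predict the next feature, and flag the sequence when the error spikes.

def predict_next(window):
    """Stand-in predictor: linear extrapolation from the last two features."""
    return window[-1] + (window[-1] - window[-2])

def prediction_errors(features, window_size=3):
    """Prediction error at each step of a sliding window over the sequence."""
    errors = []
    for t in range(window_size, len(features)):
        predicted = predict_next(features[t - window_size:t])
        errors.append(abs(predicted - features[t]))
    return errors

def is_plausible(features, threshold=0.5):
    """A sequence is plausible if no prediction error exceeds the threshold."""
    return max(prediction_errors(features)) <= threshold

smooth   = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # object moving steadily
teleport = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]   # object jumps impossibly

print(is_plausible(smooth))    # True
print(is_plausible(teleport))  # False: error spikes at the jump
```

In the real system the features would be learned video representations and the predictor a trained network, but the surprise signal works the same way: physically impossible events are exactly the ones the model cannot predict.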

Yann LeCun believes that this family of architectures, which he calls Joint Embedding Predictive Architectures (JEPA), is a promising direction for achieving more general and intelligent AI systems, beyond the current limitations of language models.

The Importance of Reasoning in Abstract Mental Spaces

Yann LeCun emphasizes the importance of reasoning in abstract mental spaces, rather than relying solely on language-based approaches like current large language models (LLMs). He argues that true reasoning and understanding of the physical world cannot be achieved by simply predicting the next token.

LeCun explains that when we reason about the world, we do so in an abstract mental state that is not tied to language. For example, when we imagine rotating a cube in our mind, we perform a mental operation that has nothing to do with language. Similarly, cats can carry out complex planning and navigation tasks without any language at all.

The current approaches to reasoning in LLMs, which augment them with additional reasoning capabilities, are, in LeCun's opinion, a "simplistic way" of viewing reasoning. He believes the proper way to achieve reasoning is through architectures that operate in an abstract latent space, rather than in the discrete token space.

LeCun's team has been working on a new architecture called JEPA (Joint Embedding Predictive Architecture), which aims to learn abstract representations of the physical world and reason about them. Unlike generative models that try to reconstruct every pixel, this approach focuses on learning efficient representations that can be used for prediction and reasoning.

The key idea behind JEPA is to train the model on video data, similar to how a baby learns by observing the world, and then use the learned representation to solve new tasks efficiently with only a few examples. LeCun believes that this type of architecture, which can reason in an abstract mental space, is the path forward toward artificial general intelligence (AGI).

The Distinction Between System 1 and System 2 Thinking

As Yann LeCun explains, the distinction between System 1 and System 2 thinking is crucial to understanding the limitations of current large language models (LLMs) on the path to artificial general intelligence (AGI).

System 1 thinking refers to the automatic, subconscious, reactive processes we can perform without much conscious effort, such as driving a familiar route or holding a conversation. These tasks become compiled into a "policy" that lets us execute them efficiently without extensive planning or reasoning.

In contrast, System 2 thinking involves the deliberate, conscious, effortful processes we use to tackle novel or complex problems. This type of thinking relies on our internal world model and our ability to reason, plan, and predict the consequences of our actions.
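As an illustrative analogy (mine, not LeCun's), the System 1 / System 2 split resembles a controller that reacts instantly from a compiled policy cache when a situation is familiar, and falls back to explicit planning when it is novel, compiling the result for next time. All names here are hypothetical:

```python
# Toy System 1 / System 2 controller: familiar (state, goal) pairs are
# answered from a compiled policy cache; novel ones trigger deliberate
# search, whose result is then compiled into the cache.
from collections import deque

def plan(state, goal):
    """Deliberate "System 2" planning: BFS over +1/-1 moves on integers."""
    frontier = deque([(state, [])])
    seen = {state}
    while frontier:
        s, actions = frontier.popleft()
        if s == goal:
            return actions
        for step in (+1, -1):
            nxt = s + step
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [step]))

policy_cache = {}  # compiled "System 1" reactions: (state, goal) -> actions

def act(state, goal):
    key = (state, goal)
    if key not in policy_cache:        # novel problem: think it through
        policy_cache[key] = plan(state, goal)
    return policy_cache[key]           # familiar problem: react instantly

print(act(0, 3))   # first call plans deliberately: [1, 1, 1]
print(act(0, 3))   # second call is an instant cache hit
```

The interesting part is the asymmetry: planning is slow but general, while the cached policy is fast but only covers situations already solved, which mirrors why driving a familiar route feels effortless while a new city demands attention.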

LeCun argues that current LLMs are primarily focused on System 1-like tasks, such as next-token prediction in language. While these models have made impressive progress, they lack the fundamental capabilities required for System 2 thinking, which is essential for achieving AGI.

To move toward AGI, LeCun suggests we need new architectures, such as the Joint Embedding Predictive Architecture (JEPA) he has been working on, that can learn abstract representations of the world and reason about them flexibly and efficiently. Such architectures would learn from limited data, like a human child, and would not be constrained by the limitations of token-based prediction.

In summary, LeCun's insights highlight the importance of the distinction between System 1 and System 2 thinking, and the need for AI architectures that can capture and reason about the complexities of the physical world, rather than relying solely on language-based models.

The Limitations of Current AI Systems and the Path to AGI

Yann LeCun, a renowned AI researcher, has expressed his diminishing interest in large language models (LLMs), saying they are now in the hands of industry professionals focused on incremental improvements: acquiring more data and compute, and generating synthetic data. Instead, LeCun sees more exciting developments in four key areas:

  1. Understanding the physical world: Machines need to develop a deeper understanding of the physical world and how it operates, beyond just predicting the next token.

  2. Persistent memory: The ability to maintain persistent memory, which is not widely discussed, is crucial for building more capable AI systems.

  3. Reasoning and planning: Current approaches to reasoning in LLMs are oversimplified, and LeCun believes there are better ways to achieve sophisticated reasoning capabilities.

  4. World models: LeCun emphasizes the importance of developing world models - mental representations of the physical world that humans acquire in the first few months of life. These world models allow us to reason about and interact with the real world, which is far more complex than language alone.

LeCun argues that the current architectures used in LLMs, built around next-token prediction, are not well-suited to the complexities of the physical world. He suggests that alternative architectures, such as the Joint Embedding Predictive Architecture (JEPA), which learns abstract representations of the world and can reason about them, are more promising for achieving artificial general intelligence (AGI).

LeCun also draws a parallel to the two modes of human thinking, known as System 1 and System 2. System 1 is the automatic, reactive mode, while System 2 involves more deliberate, abstract reasoning. LeCun believes current AI systems are primarily focused on System 1 capabilities, and that achieving AGI will require developing System 2-like reasoning abilities.

Overall, LeCun's perspective highlights the limitations of current AI approaches and the need for more holistic, world-modeling architectures that can reason about the physical world and plan complex actions, rather than simply predicting the next token in a sequence.

FAQ