Yann LeCun: It is nonsense to rely solely on LLMs to achieve AGI; the future of AI requires the JEPA world model (10,000-word GTC interview)


At a time when large language models (LLMs) are accelerating the world's embrace of AI, Yann LeCun, widely regarded as the father of convolutional neural networks and currently chief AI scientist at Meta, recently made a surprising statement: his interest in LLMs has waned, and he even believes LLM development has hit a bottleneck.

The remark has sparked extensive discussion in the artificial intelligence community.

In an in-depth conversation with NVIDIA Chief Scientist Bill Dally last month, LeCun laid out his views on the future direction of AI, emphasizing understanding of the physical world, persistent memory, reasoning and planning capabilities, and the open-source ecosystem, which he sees as the keys to the next wave of the AI revolution. The following is a summary of the key points.


Bill Dally: Yann, a lot of interesting things have happened in the AI field over the past year. In your opinion, what has been the most exciting development of the past year?

Yann LeCun: Too many to count, but I’ll tell you one thing that might surprise some of you. I am not that interested in Large Language Models (LLMs) anymore.

LLMs are kind of the last thing now; they're in the hands of industry product people, improving them at the margins, trying to get more data and more computing power and generating synthetic data. I think the more interesting questions lie in four areas:

How to make machines understand the physical world, how to give them persistent memory (which not many people talk about), and the last two, how to make them reason and plan.

Of course, there are some efforts to make LLMs do reasoning, but in my opinion this is a very simplified way of looking at reasoning. I think there is probably a better way to do it. So I'm excited about things that many people in the tech community may not get excited about for another five years. Right now they don't seem that exciting, because they live in obscure academic papers.

World Models and Understanding of the Physical World

Bill Dally: But if it's not LLMs reasoning about the physical world, having persistent memory, and planning, then what is it? What will the underlying model be?

Yann LeCun: So, a lot of people are working on world models. What is a world model?

We all have models of the world in our heads. This is basically what allows us to manipulate thoughts. We have a model of the world as it currently is. You know if I push this bottle from the top, it will most likely tip over, but if I push it from the bottom, it will slide. If I press too hard, it might pop.


We have models of the physical world, acquired in the first few months of our lives, that enable us to cope with the real world. Dealing with the real world is much more difficult than dealing with language. We need system architectures that can actually handle the real world, and they are completely different from the ones we deal with today. LLMs predict tokens, but a token can be anything. Our self-driving car models use tokens from sensors and generate tokens that drive the vehicle. In a sense they are reasoning about the physical world, at least about where it's safe to drive and where you won't hit a pole.

Bill Dally: Why are tokens not the right way to represent the physical world?

Yann LeCun: Tokens are discrete. When we talk about tokens, we usually mean a finite set of possibilities. In a typical LLM, the number of possible tokens is around 100,000. When you train a system to predict a token, you can never train it to predict the exact following token in a text sequence.

You can generate a probability distribution over all possible tokens in your dictionary, which is just a long vector of 100,000 numbers between zero and one that sum to one. We know how to do that, but we don't know how to do it with video, with data that is naturally high-dimensional and continuous. Every attempt to get a system to understand the world or build a mental model of the world by training it to predict video at the pixel level has largely failed.
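Author's note: a minimal sketch, with hypothetical sizes, of the contrast LeCun is drawing. Predicting a discrete next token reduces to a softmax over a finite vocabulary, while even a single video frame is a huge continuous vector with no finite set of outcomes to normalize over.

```python
# Hypothetical sizes; illustrates discrete vs. continuous prediction targets.
import torch
import torch.nn.functional as F

vocab_size = 100_000                           # roughly the vocabulary size mentioned above

# Discrete case: a distribution over every possible next token is just a
# softmax over vocab_size logits.
logits = torch.randn(1, vocab_size)            # stand-in for a language model's output
next_token_probs = F.softmax(logits, dim=-1)   # 100,000 numbers that sum to 1
print(next_token_probs.sum())                  # ~1.0

# Continuous case: one 256x256 RGB frame already lives in a ~196,608-dimensional
# continuous space; there is no finite list of outcomes to put a softmax over,
# which is why pixel-level video prediction is so much harder.
frame = torch.rand(3, 256, 256)
print(frame.numel())                           # 196608 real values
```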

Even for training a neural network to learn a good representation of an image, all the techniques that work by reconstructing the image from a corrupted or transformed version have fallen short. They work somewhat, but not as well as the alternative architectures we call joint embedding architectures, which essentially do not try to reconstruct at the pixel level. They try to learn an abstract representation of the image, the video, or whatever natural signal they are trained on, so that predictions can be made in that abstract representation space.

Yann LeCun: The example I often use is this: if I take a video of this room, panning the camera and stopping here, and then ask the system to predict the rest of the video, it might predict that this is a room with people sitting in it, and so on. It cannot predict what each of you looks like. That is completely unpredictable from the initial segment of the video.

There are many things in the world that are unpredictable. If you train a system to make predictions at the pixel level, it spends all its resources trying to figure out details it simply cannot invent. That is a complete waste of resources. Every time we have tried, and I have been working on this for 20 years, to train a system with self-supervised learning to predict video, it hasn't worked. It only works when done at the representation level. Which means those architectures are not generative.

Bill Dally: Are you basically saying that transformers don't have this capability? People have built vision transformers and gotten very good results.

Yann LeCun: That’s not what I meant, because you can use transformers for this. You can put transformers inside those architectures. It’s just that the kind of architecture I’m talking about is called a joint embedding predictive architecture. So, take a piece of video or an image or whatever, run it through an encoder, and you get a representation; then take the continuation of that video, or a corrupted or transformed version of that image, run it through an encoder as well, and now try to make the prediction in that representation space rather than in the input space.

You can use the same training method, which is filling in the blanks, but you do it in this latent space instead of in the raw input space.

Yann LeCun: The difficulty is that if you are not careful and don't use clever techniques, the system collapses. It completely ignores the input and just produces a constant representation that carries little information about the input. Until five or six years ago, we didn't have any techniques to prevent this from happening.
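Author's note: a minimal, hypothetical sketch of a joint embedding predictive training step, not Meta's actual recipe. The encoders and predictor are toy MLPs; the variance penalty is one known remedy (in the spirit of VICReg) for the collapse LeCun describes.

```python
# Toy JEPA-style training step: the loss lives in representation space,
# not pixel space. Real systems (I-JEPA, V-JEPA) use ViT encoders and
# much more careful recipes; everything here is a stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_ctx = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))  # context encoder
enc_tgt = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))  # target encoder
predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

x = torch.rand(32, 784)              # corrupted / partial view (e.g., masked image)
y = torch.rand(32, 784)              # full view

s_x = enc_ctx(x)
with torch.no_grad():                # common trick: no gradients through the target branch
    s_y = enc_tgt(y)

pred_loss = F.mse_loss(predictor(s_x), s_y)   # predict the target representation

# Without extra care the encoders can collapse to a constant output.
# One remedy: penalize dimensions of the representation with low variance.
std = s_x.std(dim=0)
variance_penalty = F.relu(1.0 - std).mean()

loss = pred_loss + variance_penalty
loss.backward()
```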

Now, if you want to use this for an agentic system, one that can reason and plan, what you need is a predictor. Having observed a piece of video, the system has some idea of the current state of the world, and what it needs to do is predict what the next state of the world will be, given an action that I am imagining taking.

So, you need a predictor that, given the state of the world and an action you imagine, can predict the next state of the world. If you have such a system, then you can plan a series of actions to achieve a specific outcome. This is really how we all plan and reason. We are not doing this in the token space.
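Author's note: a hypothetical sketch of what planning with such a predictor could look like: sample candidate action sequences, roll them forward in latent space with the learned predictor, and keep the sequence whose predicted final state is closest to a goal representation. Random-shooting search is just the simplest possible planner; all names and sizes here are made up.

```python
# Planning in latent space with an action-conditioned world model (toy version).
import torch
import torch.nn as nn

state_dim, action_dim, horizon, n_candidates = 64, 8, 5, 256

# predictor(s_t, a_t) -> s_{t+1}: the learned "world model"
predictor = nn.Sequential(nn.Linear(state_dim + action_dim, 128),
                          nn.ReLU(),
                          nn.Linear(128, state_dim))

s0 = torch.randn(state_dim)        # current latent state (would come from an encoder)
goal = torch.randn(state_dim)      # latent representation of the desired outcome

candidates = torch.randn(n_candidates, horizon, action_dim)  # sampled action sequences
states = s0.expand(n_candidates, state_dim)
for t in range(horizon):
    step_input = torch.cat([states, candidates[:, t]], dim=-1)
    states = predictor(step_input)                            # predicted next latent states

cost = ((states - goal) ** 2).sum(dim=-1)   # distance to the goal in latent space
best_plan = candidates[cost.argmin()]       # the action sequence you would execute
```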

Yann LeCun: Let me give you a very simple example. There are a lot of so-called agentic reasoning systems out there, and the way they work is that they randomly generate large numbers of token sequences, and then a second neural network tries to select the best one from all the generated sequences. It's a bit like writing a program without knowing how to program.

Writing random programs, testing them all, and keeping the one that happens to give the right answer is completely hopeless.

Bill Dally: Well, there are actually some papers on super-optimization that suggest doing just that.

Yann LeCun: For short programs you can, but the search space grows exponentially with length, so after a while it becomes completely hopeless.

Author's note: Simply put, today's LLMs play a probability game, picking the most likely answer out of a vast space of text continuations. LeCun argues that the real world has too many variables and is too complex for that, and that the next step is for models to predict the future on their own, the way a child learns from experience that a released ball falls to the ground and that getting close to fire feels hot. The child does not understand the underlying physics, but gains predictive ability from lived experience.

Prospects and Challenges of AGI/AMI

Bill Dally: Well, a lot of people say AGI, or I guess you would call it AMI, is coming. What do you think? When do you think it will appear, and why? Where is the gap?

Yann LeCun: I don’t like the term AGI because people use it to refer to systems with human-level intelligence, whereas human intelligence is, sadly, super-specialized. So, calling it general is a misnomer. I prefer the phrase AMI, which stands for advanced machine intelligence.

It's just a matter of vocabulary, and I think that the concept that I described, of systems that can learn abstract mental models of the world and use them for reasoning and planning, I think we're probably going to have a good grasp of how to make it work at least on a small scale within three to five years. Then it will be a matter of scaling up until we reach human-level AI.

Yann LeCun: Here’s the thing: throughout the history of AI, generation after generation of AI researchers have discovered a new paradigm and claimed that this is it. In ten years, we will have human-level intelligence. We will have machines that are smarter than humans in all areas. This has been going on for 70 years, with such a wave occurring approximately every 10 years.

The current wave is also wrong. The idea that you can achieve human-level intelligence by just scaling up LLMs, or having them generate thousands of token sequences and select the good ones, and in a few years you’ll have a nation of geniuses in a data center, to quote someone who shall remain anonymous, is nonsense. It's complete nonsense.

Certainly, for many applications, systems of the near future will be at PhD level, if you will, but in terms of overall intelligence, no, we are not even close. But when I say far away, it could happen in about ten years.

Bill Dally: That's not too far away. AI has been applied in many ways to improve the human condition and make people’s lives easier. Which application of AI do you think is the most compelling and advantageous?

Yann LeCun: I think the impact of AI on science and medicine is likely to be much greater than we can currently imagine, although it is already quite large. Not just in research such as protein folding and drug design, but also in understanding the mechanisms of life. And there are a lot of short-term consequences. In the United States today, when you get medical imaging done, AI is often involved. If it's a mammogram, it has likely already been pre-screened with a deep learning system to detect tumors. If you go into an MRI machine, the time you have to spend in it has been cut by a factor of about four, because we can now recover high-resolution MRI images from much less data. So there are a lot of short-term consequences.

Yann LeCun: Absolutely, every one of our cars, and NVIDIA is one of the big suppliers in this area, now has at least a driver assistance system or an automatic emergency braking system. These have been mandatory in Europe for several years. These things reduced collisions by 40 percent. They save lives. These are huge applications.

Obviously, this is not generative AI; this is perception, and now some control, for the car. LLMs have many applications in industry and services, some that exist today and some that will arrive in the next few years, but we must also consider their limitations. Deploying a system that achieves the required accuracy and reliability is much harder than most people imagine. That has certainly been the case with autonomous driving: the timeline for Level 5 autonomy has been a receding horizon, and I think it will be the same elsewhere. Where AI usually fails is not in the basic technology or the flashy demos, but when you actually have to deploy it, integrate it with existing systems, and make it reliable enough.

That’s what makes it more difficult, more expensive, and more time-consuming than expected.

Bill Dally: Of course, in an application like a self-driving car, where it has to be correct all the time or someone could get hurt or killed, the level of accuracy has to be almost perfect. But there are many applications where if it gets it right most of the time, it can be very beneficial. Even with some of the medical applications, where you have a doctor doing a second check, or certainly entertainment and education, you just want the good to outweigh the harm and the consequences of getting it wrong to not be catastrophic.

Yann LeCun: Of course. For most of these systems, the most useful ones are the ones that make people more productive and creative. For example, a coding assistant to assist them with coding. This is true in medicine, in art, and in generating text. AI isn’t replacing people; it’s giving them powerful tools.

Well, it might replace at some point, but I don’t think people will accept it. Our relationship to future AI systems, including superintelligence, is that we will be their bosses. We will have a group of super-intelligent virtual humans working for us. I don’t know about you, but I love working with people who are smarter than me. It’s the best thing in the world.

Bill Dally: So, conversely, just as AI can benefit humanity in many ways, it also has a dark side: people will use it to create deepfakes and fake news, and if it's misused it can cause emotional distress. What is your biggest concern about the use of AI? How can we alleviate these concerns?

Yann LeCun: One thing Meta is very familiar with is using AI as a countermeasure against attacks, whether those attacks come from AI or not. One thing that may be surprising is that even though LLMs and various deepfakes have been available for several years, our colleagues responsible for detecting and removing these kinds of attacks tell us they have not seen a huge increase in generated content being posted on social networks, or at least not posted maliciously. Usually it is labeled as synthetic. So we're not seeing the catastrophic scenarios people warned about three or four years ago, when they said this would destroy the information and communication ecosystem.

Yann LeCun: I need to tell you a funny story. In the fall of 2022, my colleagues at Meta, a small team, put together an LLM that was trained on the entire scientific literature. All the technical papers they could get their hands on. It’s called Galactica, and they released a long paper describing how it was trained, open source code, and a demo system you can play around with.

It was heavily slammed in the Twittersphere. People said, "Oh, this is terrible. This is going to kill us. It's going to destroy the scientific communication system. Now any idiot can write something that sounds like a scientific paper about the benefits of eating broken glass or whatever." The wave of negative opinion was so great that my poor colleagues, a team of five, couldn't sleep at night. They took down the demo, leaving the open-source code and the paper, and our conclusion was that the world wasn't ready for this technology and nobody was interested.

Yann LeCun: Three weeks later, ChatGPT appeared, like the second coming of the Messiah. We looked at each other and said, “What just happened?” We couldn’t understand the public’s enthusiasm for this, given the reaction to Galactica.

Bill Dally: A lot of it is a matter of perception. GPT is not trying to write academic papers or do scientific research; it's something you can talk to and ask any question, trying to be more general. In a way, it's more useful to more people, or closer to being useful.

Yann LeCun: There are definitely dangers and there are various abuses. But the antidote to abuse is better AI. As I talked about before, there are unreliable systems. The solution to this problem is better AI systems that have common sense, reasoning skills, the ability to check if an answer is correct, and the ability to assess the reliability of their own answers, which is not the case currently. But those catastrophic scenarios, frankly, I don't believe. People will adapt. I tend to think AI is generally good, even if there are some bad things mixed in.

The Importance and Future of Open Source

Bill Dally: As someone who has homes on both sides of the Atlantic, you have a very global perspective. Where do you think future AI innovations will come from?

Yann LeCun: It can come from anywhere. There are smart people everywhere. No one has a monopoly on good ideas. Some people have a huge superiority complex and think they can come up with all the good ideas without talking to anyone. In my experience as a scientist, this is not the case.

Good ideas come from the interaction and exchange of ideas among many people. In the past decade or so, the exchange of code has become just as important. This is one of the reasons I have been a strong advocate of open-source AI platforms, and why Meta has adopted this philosophy to a large extent. We don't have a monopoly on good ideas, even though we sometimes like to think we do. The recent story of DeepSeek really shows that good ideas can come from anywhere.

Yann LeCun: There are many excellent scientists in China. Here is a story more people should know: if you ask what the most cited paper across all scientific fields of the past 10 years is, it turns out to be a paper published in 2015, exactly 10 years ago, about a particular neural network architecture called ResNet, or residual networks. It came out of Microsoft Research Asia in Beijing and was proposed by a group of Chinese scientists.

The main author is Kaiming He. A year later, he joined Meta's FAIR lab in California, where he stayed for about eight years and recently moved to the Massachusetts Institute of Technology (MIT). This tells you that there are a lot of good scientists around the world and ideas can come from anywhere. But to actually put these ideas into practice, you need a massive infrastructure, a lot of computing resources, and you need to give your friends and colleagues a lot of money to buy the necessary resources. Having an open intellectual community allows progress to happen faster because someone comes up with half a good idea and someone else comes up with the other half. If they communicate, things will happen. If they are all very closed and insular, progress will not happen.

Yann LeCun: Another thing is that in order for innovative ideas to emerge, and you know this as chief scientist at NVIDIA, you need to let people actually innovate, rather than forcing them to produce something every three or six months. This is basically the story of DeepSeek and of LLaMA.

A less well-known story is that in 2022 there were several LLM projects at FAIR. One had a lot of resources and leadership support; the other was a small "pirate" project of a dozen people in Paris who decided to build their own LLM because they needed it for some reason. That project became LLaMA, and the big project you never heard of was discontinued.

So, even if you don’t have all the support, you can still come up with great ideas. If you are somewhat insulated from your management and they let you work on your own, you are likely to come up with better ideas than if you are asked to innovate on a schedule. A dozen people developed LLaMA, it was then chosen as the platform, and a team was built around it to develop LLaMA 2, which was open sourced and caused a small revolution in the industry landscape. As of yesterday, LLaMA has been downloaded more than a billion times. I think that's amazing. I assume that includes many of you, but who are all those people? I mean, you must know them, because they all had to buy NVIDIA hardware to run that stuff. We thank you for selling all those GPUs.

Bill Dally: Let's talk more about open source. I think LLaMA is really innovative in this regard because it is a state-of-the-art LLM and provides open weights so people can download and run it themselves. What are the pros and cons of doing this? The company has obviously invested a huge amount of money to develop the model, train the model, and fine-tune the model, and then give it away for free. What are the benefits of doing this? What are the disadvantages?

Yann LeCun: Well, I think there are disadvantages. If you are a company that expects to earn revenue directly from the service, it may not be to your benefit to reveal all your secrets if that is your only business. But if you’re a company like Meta or Google, where revenue comes from other sources: advertising in Meta’s case, a variety of sources in Google’s case, what matters is not how much revenue you can generate in the short term, but whether you can build the features needed for the product you want to build and get the most smart people in the world contributing to it.

For Meta, it won't hurt if some other company uses LLaMA for other purposes, because they don't have a social network to build on it. This is more threatening to Google because you can use it to build a search engine, which is probably why they are not too aggressive about this approach.

Yann LeCun: Another thing we’ve seen the impact of, first with PyTorch and now with LLaMA, is that they kick-start a whole ecosystem of new startups. We’re seeing this in the larger industry right now where people will sometimes prototype AI systems using proprietary APIs, but when it comes time to deploy, the most cost-effective way to do it is on LLaMA because you can run it on-premise or on some other open source platform. Philosophically, I think the most important factor, the most important reason for wanting to have an open source platform is that in a very short period of time, every interaction we have with the digital world will be mediated by AI systems. I now wear Ray-Ban Meta smart glasses, and through them I can talk to Meta AI and ask it any question.

Yann LeCun: We don’t believe that people will want a single assistant and that those assistants will come from a handful of companies on the West Coast of the United States or in China. We need extremely diverse assistants. They need to be able to speak all the world's languages, understand all the world's cultures, all of its value systems, and all of its centers of interest. They need to have different biases, political views, and so on. We need diverse assistants for the same reason we need diverse media. Otherwise, we'd all be getting the same information from the same sources, which wouldn't be good for democracy or anything else.

We need a platform that anyone can use to build those diverse assistants. Currently, this can only be done through open source platforms. I think this is going to be even more important in the future because if we want to have base models that can speak all the world’s languages and so on, no single entity is going to be able to do that alone. Who is going to collect all the data in all the languages in the world and give it to OpenAI, Meta, Google, or Anthropic? No one.

They want to keep that data. Individual regions of the world will want to contribute their data to a global base model, but will not actually want to hand it over. They may contribute to training a global model. I think this is the model for the future. The base model will be open source and trained in a distributed manner, with different data centers around the world having access to different subsets of the data, essentially training a consensus model. This makes open source platforms completely inevitable, and proprietary platforms, I think, will disappear.

Bill Dally: It makes sense for the language and the diversity of things and the applications. A particular company could download LLaMA and then fine-tune it on proprietary data that they would rather not upload.

Yann LeCun: This is what is happening now. Most AI startups’ business models are built around this. They build specialized systems for vertical applications.

Bill Dally: In Jensen's keynote, he gave a great example of using a generative LLM to do wedding planning and decide who is going to sit at the table. This is a great example of the trade-off between investing effort in training and investing effort in inference.

One scenario is that you can have a very powerful model that you spend a lot of resources on training, or you can build a less powerful model but run it a lot of times so that it can reason and complete the task. What do you think is the trade-off between training time and inference or testing time when building powerful models? Where is the sweet spot?

Yann LeCun: First of all, I think Jensen is absolutely right that you ultimately get more power from a system that can reason.

But I disagree that the current way of reasoning used by LLMs with reasoning skills is the right way. It works, but it's not the right way. When we reason, when we think, we do so in some abstract mental state that has nothing to do with language. You don’t want to be kicking around in token space; you want to be reasoning in your latent space, not in token space.

If I tell you to imagine a cube floating in front of you and then to rotate it 90 degrees around its vertical axis, you can do that in your mind, regardless of language. A cat can do something like it; we can't explain the task to a cat verbally, yet cats do far more complex things than this when planning the trajectory of a jump onto furniture, and none of it depends on language. It is certainly not done in token space; it is done in an abstract mental space. That is the challenge for the next few years: figuring out new architectures that allow this type of reasoning. It is what I have been researching for the past few years.

Bill Dally: Should we expect a new kind of model that allows us to reason in this abstract space?

Yann LeCun: It’s called the JEPA world model. My colleagues and I have published a series of papers on this over the past few years, which can be seen as the first steps in this direction. JEPA stands for joint embedding predictive architecture.

These are models of the world that learn abstract representations and are able to manipulate those representations, perhaps reasoning and generating a sequence of actions to achieve a particular goal. I think this is the future. About three years ago, I wrote a long paper on this issue, explaining how this might work.

Bill Dally: To run these models, you need great hardware. Over the past decade, GPU capability for both training and inference of AI models has increased 5,000 to 10,000 times, from Kepler to Blackwell. We saw today that there is more coming, and scale-out and scale-up provide additional capability. In your opinion, what will happen in the future? What do you expect from us so that you can build your JEPA models and other, more powerful models?

Yann LeCun: Well, keep it coming, because we need all the computing power we can get. This kind of reasoning in abstract space is going to be very computationally expensive at inference time, and it relates to something we are all very familiar with.

Psychologists talk about System 1 and System 2. System 1 is the tasks you do without thinking about them. They’ve become second nature, and you can do them without much thought. For example, if you are an experienced driver, you can drive even without driving assistance and can drive while talking to someone. But if you’re driving for the first time or have only been driving for a few hours, you have to really focus on what you’re doing. You’re planning for various disaster scenarios and so on. That's System 2. You are mobilizing your entire model of the world to figure out what is going to happen, and then planning actions so that good things happen.

Yann LeCun: However, once you are familiar with a task, you can rely on System 1 alone, a reactive policy that lets you complete the task without planning. Deliberate reasoning of the kind I just described is System 2, while the automatic, subconscious, reactive policies are System 1.

The current system is trying to slowly move towards System 2, but ultimately, I think we need a different architecture to implement System 2. If you want a system that can understand the physical world, I don't think it would be a generative architecture. The physical world is much more difficult to understand than language. We think of language as the epitome of human intellectual capacity, but in fact, language is simple because it is discrete. Because it is a communication mechanism, it needs to be discrete to be resistant to noise. Otherwise, you won't be able to understand what I'm saying now. So, for that reason, it's simple. But the real world is much more complicated.

Yann LeCun: Here’s something you may have heard me say before: current LLMs are typically trained on around 30 trillion tokens. A token is usually about 3 bytes, so that is roughly 0.9 × 10^14 bytes; call it 10^14 bytes. It would take any one of us over 400,000 years to read it all, because it is essentially all the text available on the Internet combined.

Now, psychologists tell us that a 4-year-old has been awake for a total of about 16,000 hours, and that roughly 2 MB of data per second flows through the optic nerve into the visual cortex. Multiply 2 MB per second by 3,600 seconds per hour and by 16,000 hours and you get about 10^14 bytes: the amount of data acquired through vision in four years. What your eyes take in over those four years equals the amount of text that would take you 400,000 years to read.
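Author's note: a quick back-of-the-envelope check of the figures above, using the approximate numbers as stated.

```python
# All figures approximate, as in the interview.
tokens = 30e12                     # ~30 trillion training tokens
bytes_per_token = 3
text_bytes = tokens * bytes_per_token                   # ~0.9e14 bytes

optic_nerve_rate = 2e6             # ~2 MB per second into the visual cortex
waking_hours = 16_000              # a 4-year-old's total waking hours
vision_bytes = optic_nerve_rate * waking_hours * 3600   # ~1.15e14 bytes

print(f"text:   {text_bytes:.2e} bytes")    # 9.00e+13
print(f"vision: {vision_bytes:.2e} bytes")  # 1.15e+14 -- same order of magnitude
```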

This tells you that we will never achieve AGI, whatever you mean, through text training alone. This simply cannot happen.

Bill Dally: Going back to hardware, there have been a lot of advances in spiking systems, and their advocates, along with people who draw analogies to biological systems, think there is a place for neuromorphic hardware. Do you think neuromorphic hardware can complement or replace GPUs for AI processing?

Yann LeCun: Not in the short term. Well, I have to tell you a story about this. When I started at Bell Labs in 1988, my group was actually focused on analog hardware for neural networks. They built several generations of neural-network hardware that were completely analog, then hybrid analog-digital, and then completely digital by the mid-1990s.

By then people had rather lost interest in neural networks, so it no longer made sense. The problem with exotic technologies like this is that current digital semiconductor technology sits in such a deep local minimum that it will take a long time, and a lot of investment, for alternative technologies to catch up. And even in principle, it's not clear they have any advantage.

Yann LeCun: Things like analog or spiking neurons or spiking neural networks may have some inherent advantages, but they make hardware reuse very difficult. Every piece of hardware we currently use is too big and too fast, in a sense, so you have to basically reuse the same piece of hardware to calculate different parts of your model.

If you use analog hardware, you cannot multiplex. For every neuron in your virtual neural network, there must be a physical neuron. This means you can't fit a decent-sized neural network on a single chip. You have to use multiple chips, and once you do that, it's very fast but not very efficient, because you need to communicate across the chips and the memory becomes complicated. Ultimately, you need to communicate digitally, because that is the only way to be robust against noise.

Yann LeCun: Actually, the brain provides an interesting piece of information. Most brains, or the brains of most animals, communicate via impulses. A pulse is a binary signal, so it is digital, not analog. Computation at the neuron level may be analog, but communication between neurons is actually digital, except in very small animals. For example, C. elegans, a 1-millimeter-long worm, has 302 neurons. They don't send out pulses because they don't need to communicate over long distances, so at that scale they can use analog communications.

This tells you that even if we want to use exotic techniques like analog computing, we still have to use digital communication somewhere, and the same is probably true for memory. It's not clear it pays off; I've done this calculation many times. I may not know nearly as much about this as you do, but I don't think it's going to happen any time soon.

Bill Dally: Maybe in some corners of edge computing it would make sense. For example, if you want a super cheap microcontroller to run the perception system for your vacuum cleaner or lawn mower, maybe analog compute makes sense there. If you could put the whole thing on a single chip and use something like phase-change memory to store the weights, I know there are people seriously working on building these things, so-called processing-in-memory (PIM), or analog and digital compute-in-memory, technologies. Do you think they will work? Do they have a future?

Yann LeCun: Of course. Some of my colleagues are very interested in this because they want to make a successor to those smart glasses. What you want is some visual processing to be going on all the time. Currently, this is not possible due to power consumption. Just one sensor, like the image sensor, can't be left on all the time in such glasses; the battery will run out in a few minutes.

One potential solution is to do the processing directly on the sensor, so you don't have to move the data off the chip, which is where energy is consumed. Moving data is what consumes energy, not the computation itself. There is a lot of work going on in this area, but we are not there yet. I think this is a promising direction. In fact, biology has already solved this problem. The retina has about 60 million photoreceptors, and in front of our retina, there are four layers of neurons - transparent neurons - that process the signals and compress them into 1 million optic nerve fibers that transmit them to our visual cortex. There's compression, feature extraction, and all sorts of things to get the most useful information out of the vision system.

Bill Dally: What about other emerging technologies? Do you think quantum, superconducting logic, or anything else on the horizon will give us a huge advance in AI processing power?

Yann LeCun: Superconductivity, maybe. I don't know enough about it to really judge. Optics has been very disappointing. I remember being very surprised in the 1980s by talks about optical implementations of neural networks, but they never took off. Technology is evolving, so maybe that will change.

Regarding quantum, I am extremely skeptical about quantum computing. I think the only medium-term application of quantum computing that I can see is simulating quantum systems, like quantum chemistry or something like that. As for anything else, I am extremely skeptical.

Bill Dally: You talk about building AI that can learn by observation, the way a baby or a young animal does. What demands does this place on the hardware? How do we need to develop the hardware to make this happen, and how much of it do you need?

Yann LeCun: It's a question of how much you are willing to buy. As we heard today, the more you buy, the more you save. This won't be cheap. Take video, for example. Let me tell you about an experiment some of my colleagues ran until about a year ago. There is a self-supervised learning technique that uses reconstruction to learn image representations. The project is called MAE, which stands for Masked Autoencoder.

It's basically an autoencoder, a denoising autoencoder. You take an image, corrupt it by removing parts of it (large chunks, actually), and then train a giant neural network to reconstruct the complete image at the pixel level, or at the token level. Then you use the internal representation as the input to a downstream task, like object recognition, trained with supervision.
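Author's note: a toy sketch of the masked-reconstruction idea behind MAE. The real MAE uses a ViT encoder and a lightweight decoder over image patches; the sizes and the tiny MLP here are stand-ins.

```python
# Masked reconstruction in miniature: hide most patches, reconstruct them at
# the pixel level. This is the kind of objective LeCun says is wasteful for video.
import torch
import torch.nn as nn
import torch.nn.functional as F

patch, n_patches = 16, 196                     # e.g., a 224x224 image as 14x14 patches
patch_dim = patch * patch * 3

autoencoder = nn.Sequential(nn.Linear(patch_dim, 256), nn.ReLU(),
                            nn.Linear(256, patch_dim))

image_patches = torch.rand(n_patches, patch_dim)
mask = torch.rand(n_patches) < 0.75            # drop ~75% of the patches
corrupted = image_patches.clone()
corrupted[mask] = 0.0

recon = autoencoder(corrupted)
# The loss is computed at the pixel level, on the patches that were removed.
loss = F.mse_loss(recon[mask], image_patches[mask])
loss.backward()
```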

Yann LeCun: It works OK, but you have to boil a small pond to cool the liquid-cooled GPU clusters it takes. And it is far less effective than the joint embedding architectures. You may have heard of DINO, DINOv2, JEPA, and so on. These are joint embedding architectures; they tend to work better and are actually cheaper to train.

In a joint embedding setup, you essentially have two encoder branches, one for each view of the input. Instead of reconstructing everything at the pixel or token level, you take the full image and a corrupted or transformed version of it, run both through encoders, and then predict the representation of the complete image from the representation of the partially visible or corrupted one. This produces better results at lower cost.

Yann LeCun: Okay, so the team said, “This seems to work OK for images, let’s try it on video.” Now you have to tokenize the video, which basically means converting it into 16×16 patches, and that is a lot of patches even for a short clip. Then you train a huge neural network to reconstruct the missing patches, or perhaps to predict future frames. That would require boiling a small lake, not just a small pond, and it basically failed. The project was stopped.

Yann LeCun: Our current alternative is a project called V-JEPA, and we are about to release the second version. It is one of those joint embedding predictive architectures. It makes predictions about video, but at the level of representations, and it seems to work very well. The first version was trained on very short clips, just 16 frames, and was trained to predict the representation of the full clip from a partially masked version of it.

That system can apparently tell you whether a particular video is physically possible, at least in restricted settings. You could imagine a binary output, "this is plausible" or "this isn't," but it is actually simpler than that: you measure the prediction error the system makes. You slide a 16-frame window over the video and see how well the next few frames are predicted. When something really weird happens in the video, like an object disappearing, changing shape, appearing spontaneously, or violating the laws of physics, the prediction error spikes and the system flags it as an anomaly.
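Author's note: a hypothetical sketch of that sliding-window check. `encode_clip` and `predict_next_repr` stand in for a pretrained V-JEPA-style encoder and predictor; the thresholding rule is an illustrative choice, not the published method.

```python
# Flag time steps where the model's prediction error in representation space spikes.
import torch

def flag_physics_violations(frames, encode_clip, predict_next_repr,
                            window=16, z_threshold=3.0):
    """Return indices of windows where the video violates the model's expectations."""
    errors = []
    for t in range(len(frames) - window - 1):
        context = frames[t : t + window]            # 16-frame context window
        target = frames[t + 1 : t + window + 1]     # shifted window to be predicted
        pred = predict_next_repr(encode_clip(context))
        errors.append(torch.norm(pred - encode_clip(target)).item())
    errors = torch.tensor(errors)
    # Anomaly = prediction error far above this video's typical error.
    z = (errors - errors.mean()) / (errors.std() + 1e-8)
    return (z > z_threshold).nonzero(as_tuple=True)[0]
```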

Bill Dally: These are natural videos, and then you test it on synthetic videos where very strange things happen.

Yann LeCun: If you train it on videos where very strange things happen, that becomes the norm, and it won’t detect them as strange. So you can't do that. It's a bit like the way babies learn intuitive physics. An unsupported object falls, essentially because of gravity; babies learn this at around nine months of age.

If you show a five- or six-month-old baby a scene in which an object appears to float in mid-air, they won't be surprised. But at nine or ten months, they will stare at it wide-eyed, and you can actually measure this. Psychologists have ways of measuring attention, and a prolonged gaze means the baby's internal model of the world has been violated. The baby sees something she thought was impossible, something that doesn't match her expectations, so she has to look at it to correct her internal model, as if to say, “Maybe I should learn about this.”

Bill Dally: You talked about doing reasoning and planning in this joint embedding space. What do we need to get there? What are the bottlenecks in terms of models and hardware?

Yann LeCun: A lot of it is just making it work. We need a good recipe. Before people came up with a good recipe, training even simple convolutional networks was very difficult. Back in the late 2000s, Geoff Hinton was telling everyone that training deep networks with backpropagation was very hard, and that Yann LeCun could do it with ConvNets but was about the only person in the world who could, which was sort of true at the time, though not entirely accurate.

It turns out that it’s not that difficult, but there are a lot of tricks you have to figure out: engineering tricks, intuitions, which nonlinearities to use, and the idea behind ResNet, the most cited paper in all of science over the last 10 years. It's a very simple idea: you add connections that skip each layer, so that by default a layer in a deep neural network essentially computes the identity function, and what the layer learns is a deviation from the identity. This avoids vanishing gradients during backpropagation and lets us train neural networks with 100 or more layers.
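Author's note: a minimal residual block showing the idea just described: the layer's output is the input plus a learned deviation, y = x + f(x), so the default behavior is the identity.

```python
# Minimal residual block: default to the identity, learn only the deviation.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)   # skip connection lets gradients flow straight through

block = ResidualBlock(64)
x = torch.randn(8, 64)
print(block(x).shape)          # torch.Size([8, 64])
```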

Yann LeCun: Nothing really worked until people put together the complete recipe, with residual connections, the Adam optimizer, normalization, and so on. (We just published a paper showing that you don't actually need normalization layers in transformers, things like that.) Until you have the complete recipe and all the tricks, nothing works.

The same was true for NLP, natural language processing. In the mid-2010s there were systems based on denoising autoencoders, like BERT, where you take a piece of text, corrupt it, and train a large neural network to recover the missing words. Eventually this was displaced by GPT-style architectures, where you train the system on whole sequences as a kind of autoencoder but don't need to corrupt the input, because the architecture is causal. That approach has proven extremely successful and scalable.
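Author's note: a toy contrast between the two text objectives just mentioned. The token ids are made up; no real tokenizer or model is involved.

```python
# Denoising (BERT-style) vs. causal (GPT-style) training targets, in miniature.
import torch

tokens = torch.tensor([5, 17, 42, 8, 99, 3])     # a made-up token sequence

# BERT-style denoising: corrupt some positions, predict the originals there.
MASK_ID = 0
masked_positions = torch.tensor([1, 4])
corrupted = tokens.clone()
corrupted[masked_positions] = MASK_ID
bert_targets = tokens[masked_positions]          # predict only what was removed

# GPT-style causal training: no corruption needed; every prefix predicts
# the next token, so inputs and targets are just shifted by one position.
gpt_inputs = tokens[:-1]
gpt_targets = tokens[1:]
```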

Yann LeCun: We have to come up with an equally good recipe for these JEPA architectures so that they can scale to the same degree. That's the missing piece.

Bill Dally: Well, we have a flashing red light ahead of us. Before we adjourn, do you have any final thoughts for the audience?

Yann LeCun: Yes, I want to emphasize the point I made earlier. Advances in AI and the journey toward human-level AI, advanced machine intelligence, or AGI, whatever you want to call it, will require contributions from everyone. It will not come from some single entity doing research and development in secret. That's not going to happen. It won't be one event; it will be many continuous advances along the way.

Humans won't be killed within the first hour of this happening because it won't be an event. It will require contributions from around the world. It will have to be open research and based on open source platforms. If they require extensive training, we will need cheaper hardware. You (Nvidia) need to lower your prices. [laugh]

Bill Dally: You need to talk to Jensen about this.

Yann LeCun: We will have a future with a highly diverse population of AI assistants that will help us in our daily lives, always accompanying us through our smart glasses or other smart devices, and we will be their bosses. They will work for us. It’s like all of us are going to be managers. That's a scary future.

Bill Dally: Well, we have to stop there. I want to thank you for a really intellectually stimulating conversation, and I hope we get a chance to do this again.
