Stanford University professor Fei-Fei Li has already earned her place in the history of AI. She played a major role in the deep learning revolution by laboring for years to create the ImageNet dataset and competition, which challenged AI systems to recognize objects and animals across 1,000 categories. In 2012, a neural network called AlexNet sent shockwaves through the AI research community when it resoundingly outperformed all other types of models and won the ImageNet contest. From there, neural networks took off, powered by the vast amounts of free training data available on the Internet and by GPUs that deliver unprecedented compute power.
In the 13 years since ImageNet, computer vision researchers mastered object recognition and moved on to image and video generation. Li cofounded Stanford’s Institute for Human-Centered AI (HAI) and continued to push the boundaries of computer vision. Just this year she launched a startup, World Labs, which generates 3D scenes that users can explore. World Labs is dedicated to giving AI “spatial intelligence,” or the ability to generate, reason within, and interact with 3D worlds. Li delivered a keynote yesterday at NeurIPS, the massive AI conference, about her vision for machine vision, and she gave IEEE Spectrum an exclusive interview before her talk.
Why did you title your talk “Ascending the Ladder of Visual Intelligence”?
Fei-Fei Li: I think it’s intuitive that intelligence has different levels of complexity and sophistication. In the talk, I want to deliver the sense that over the past decades, especially the past 10-plus years of the deep learning revolution, the things we have learned to do with visual intelligence are just breathtaking. We are becoming more and more capable with the technology. And I was also inspired by Judea Pearl’s “ladder of causality” [in his 2018 book The Book of Why].
The talk also has a subtitle, “From Seeing to Doing.” This is something that people don’t appreciate enough: seeing is closely coupled with interaction and doing things, for animals as well as for AI agents. And this is a departure from language. Language is fundamentally a communication tool that’s used to get ideas across. In my mind, these are very complementary, but equally profound, modalities of intelligence.
Do you mean that we instinctively respond to certain sights?
Li: I’m not just talking about instinct. If you look at the evolution of perception and the evolution of animal intelligence, it’s deeply, deeply intertwined. Every time we’re able to get more information from the environment, the evolutionary force pushes capability and intelligence forward. If you don’t sense the environment, your relationship with the world is very passive; whether you eat or become eaten is a very passive act. But as soon as you are able to take cues from the environment through perception, the evolutionary pressure really heightens, and that drives intelligence forward.
Do you think that’s how we’re creating deeper and deeper machine intelligence? By allowing machines to perceive more of the environment?
Li: I don’t know if “deep” is the adjective I would use. I think we’re creating more capabilities. I think it’s becoming more complex, more capable. I think it’s absolutely true that tackling the problem of spatial intelligence is a fundamental and critical step towards full-scale intelligence.
I’ve seen the World Labs demos. Why do you want to research spatial intelligence and build these 3D worlds?
Li: I think spatial intelligence is where visual intelligence is going. If we are serious about cracking the problem of vision and also connecting it to doing, there’s an extremely simple, laid-out-in-the-daylight fact: The world is 3D. We don’t live in a flat world. Our physical agents, whether they’re robots or devices, will live in the 3D world. Even the virtual world is becoming more and more 3D. If you talk to artists, game developers, designers, architects, doctors, even when they are working in a virtual world, much of this is 3D. If you just take a moment and recognize this simple but profound fact, there is no question that cracking the problem of 3D intelligence is fundamental.
I’m curious about how the scenes from World Labs maintain object permanence and compliance with the laws of physics. That feels like an exciting step forward, since video-generation tools like Sora still fumble with such things.
Li: Once you respect the 3D-ness of the world, a lot of this is natural. For example, in one of the videos that we posted on social media, basketballs are dropped into a scene. Because it’s 3D, it allows you to have that kind of capability. If the scene is just 2D-generated pixels, the basketball will go nowhere.
Or, like in Sora, it might go somewhere but then disappear. What are the biggest technical challenges that you’re dealing with as you try to push that technology forward?
Li: No one has solved this problem, right? It’s very, very hard. You can see [in a World Labs demo video] that we have taken a Van Gogh painting and generated the entire scene around it in a consistent style: the artistic style, the lighting, even what kind of buildings that neighborhood would have. If you turn around and it becomes skyscrapers, it would be completely unconvincing, right? And it has to be 3D. You have to navigate into it. So it’s not just pixels.
Can you say anything about the data you’ve used to train it?
Li: A lot.
Do you have technical challenges regarding compute burden?
Li: It is a lot of compute, the kind that the public sector cannot afford. This is part of the reason I feel excited to take this sabbatical, to do this the private-sector way. And it’s also part of the reason I have been advocating for public-sector compute access, because my own experience underscores the importance of innovation backed by adequate resources.
It would be nice to empower the public sector, since it’s usually more motivated by gaining knowledge for its own sake and knowledge for the benefit of humanity.
Li: Knowledge discovery needs to be supported by resources, right? In the time of Galileo, it was the best telescope that let astronomers observe new celestial bodies. It was Hooke who realized that magnifying glasses could become microscopes, and he discovered cells. Every time there is new technological tooling, it helps knowledge-seeking. And now, in the age of AI, technological tooling involves compute and data. We have to recognize that for the public sector.
What would you like to happen on a federal level to provide resources?
Li: This has been the work of Stanford HAI for the past five years. We have been working with Congress, the Senate, the White House, industry, and other universities to create NAIRR, the National AI Research Resource.
Assuming that we can get AI systems to really understand the 3D world, what does that give us?
Li: It will unlock a lot of creativity and productivity for people. I would love to design my house in a much more efficient way. I know that lots of medical applications involve understanding a very particular 3D world, which is the human body. We always talk about a future where humans will create robots to help us, but robots navigate in a 3D world, and they require spatial intelligence as part of their brain. We also talk about virtual worlds that will allow people to visit places or learn concepts or be entertained. And those use 3D technology, especially the hybrids, what we call AR [augmented reality]. I would love to walk through a national park with a pair of glasses that give me information about the trees, the path, the clouds. I would also love to learn different skills through the help of spatial intelligence.
What kind of skills?
Li: My lame example is if I have a flat tire on the highway, what do I do? Right now, I open a “how to change a tire” video. But if I could put on glasses and see what’s going on with my car and then be guided through that process, that would be cool. But that’s a lame example. You can think about cooking, you can think about sculpting—fun things.
How far do you think we’re going to get with this in our lifetime?
Li: Oh, I think it’s going to happen in our lifetime because the pace of technological progress is really fast. You have seen what the past 10 years have brought. It’s definitely an indication of what’s coming next.