How Robots Learn to Understand Humans

Rudolf Lioutikov is revolutionizing human-robot interaction with his Intuitive Robots Lab at KIT – and is successfully competing with U.S. tech giants in the process

When Rudolf Lioutikov talks about intelligent robots, it does not sound like science fiction. The professor of Machine Learning and Robotics at KIT is concerned with everyday things: smart machines should be able to hand someone a cup or place a glass on a shelf. Still, his vision is revolutionary: Lioutikov wants to develop robots that truly understand humans – robots that not only perform tasks, but also communicate and cooperate naturally with people, even people with no prior technical knowledge.

To achieve this, he uses a new generation of AI models capable of understanding language and images and deriving meaningful actions from them. His goal is for robots to communicate with humans as intuitively as we do with each other – not through complicated commands, but through eye contact, tone of voice, or facial expressions. His team works particularly intensively on improving and further developing these models in-house, an approach that makes it one of the pioneers in Europe. “Robots must not only be able to understand human intentions, but also make themselves understood,” says Lioutikov. With his Intuitive Robots Lab, the 38-year-old is even competing with U.S. tech giants and receiving global recognition for his work.


Technology That Understands Humans – and Vice Versa

The societal need is considerable: in areas such as nursing care, the household, and industry, intelligent machines are needed that can flexibly adapt to new situations – without requiring users to provide large amounts of data or to understand complex systems. This is precisely where Lioutikov's research comes in: “We want to make technology immediately accessible and usable for people.”

But how do researchers intend to achieve the goal of bringing more humanity to technology? Behind the spectacular videos of robots running across a field, climbing steep stairs, or doing somersaults lies a great deal of programming effort. “Current machine learning methods are often not sufficiently user-oriented,” says Lioutikov. “We are developing learning methods that enable robots to learn from interacting with humans – and to deal with incomplete or incorrect information in the process.” This would make robotics more accessible in everyday life.

The Search for the ‘ChatGPT Moment’

Major U.S. corporations such as Google and Meta are investing billions in so‑called Large Behavior Models (LBMs). These AI models are designed to equip robots with general, versatile behavioral capabilities – similar to how large language models like ChatGPT can flexibly perform countless tasks without being programmed or trained for each one individually.

A robot equipped with an LBM could, for instance, set a table, fetch a tool, guide a person to their destination, or open a door – all based on a general understanding of its surroundings, language, and actions.

The problem is that robotics is still searching for its ‘ChatGPT Moment’ – a breakthrough that makes robots as capable and flexible as today’s large language models. LBMs are considered a key technology for this, but the models operate with huge amounts of data and are very complex. They learn from millions of demonstrations, videos, sensor recordings, and voice inputs how humans behave in certain situations and transfer this knowledge to the robot.
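
For readers who want a concrete picture of “learning from demonstrations” in its simplest form, the following is a minimal behavior-cloning sketch in Python (using PyTorch). The data, network size, and variable names are invented for illustration only and are not the models discussed in this article.

```python
# Minimal behavior-cloning sketch: learn a mapping from observations to actions
# from recorded human demonstrations. Purely illustrative; real Large Behavior
# Models combine images, language, and sensor data at far larger scale.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 16, 7   # e.g. joint states + object pose -> 7-DoF arm command

# Toy "demonstration" data: pairs of (observation, action) recorded while a
# human guides the robot (here replaced by random placeholders).
observations = torch.randn(1024, OBS_DIM)
actions = torch.randn(1024, ACT_DIM)

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, ACT_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    pred = policy(observations)      # what the policy would do
    loss = loss_fn(pred, actions)    # how far it is from the human demonstration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the policy maps a new observation to an action command.
with torch.no_grad():
    new_obs = torch.randn(1, OBS_DIM)
    command = policy(new_obs)
```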

Hongyi Zhou, a doctoral student at the Intuitive Robots Lab, works on an experimental setup in which the robot is trained to understand and imitate human movements. (Photo: Magali Hauser, KIT)
Rudolf Lioutikov and doctoral student Pankhuri Vanjani work on efficient vision-language-action models – putting them in competition with U.S. tech giants. (Photo: Magali Hauser, KIT)

Small Models, Big Impact

Rudolf Lioutikov, however, focuses on efficiency. His vision: smaller, more efficient, and explainable LBMs that operate with limited data and are suitable for on-premises use – i.e., locally, without cloud dependency. With a small team, he develops so-called vision-language-action models: AI systems that can see, understand, and act. And with considerable success: the Intuitive Robots Lab at KIT is one of the few research laboratories in Europe actively working on such models – successfully competing with billion-dollar U.S. startups.

“Our models are smaller, faster, and require comparatively little data,” Lioutikov explains. Yet they achieve comparable – or even better – results. The team deliberately relies on local systems, which offer users greater independence and stronger data protection.

With FLOWER, the team has developed the first European vision-language-action model that runs on standard hardware and can be trained in just a few hours – a milestone for resource-efficient robotics. BEAST, in turn, represents movements in a particularly compact and fluid way, similar to a navigation system that smooths out a route. “FLOWER and BEAST have enormous potential, especially in care or household settings where intuitive and reliable interaction is required,” says Lioutikov.
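
The navigation-system analogy can be made concrete with a generic example: fitting a smoothing spline to a noisy recorded motion compresses it into a handful of smooth parameters. The sketch below uses SciPy with made-up data and illustrates only the general idea of compact, fluid movement representation – it is not the BEAST model itself.

```python
# Illustration of compact, smooth movement representation: fit a smoothing
# B-spline to a noisy 1-D trajectory and keep only its few coefficients.
# Generic SciPy example, not the BEAST model described in the article.
import numpy as np
from scipy.interpolate import splrep, splev

t = np.linspace(0.0, 1.0, 200)                                  # time steps of a recorded motion
raw = np.sin(2 * np.pi * t) + 0.05 * np.random.randn(t.size)    # noisy joint positions

# splrep returns a knot vector and coefficients: a far more compact description
# than the 200 raw samples, and smooth by construction.
tck = splrep(t, raw, s=0.5)
print("knots:", len(tck[0]), "coefficients:", len(tck[1]))

# The robot can later evaluate the smooth trajectory at any resolution it needs.
smooth = splev(np.linspace(0.0, 1.0, 50), tck)
```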

Dr. Felix Mescoli, January 28, 2026
Translation: Dipl.-Übers. Veronika Zsófia Lázár

Vision-Language-Action Models
Vision-language-action models (VLAs) are a new class of AI systems that aim to make robots more intelligent and flexible, especially in their interaction with humans. They combine three key components (illustrated with a simplified code sketch after the list):

  1. Vision: The robot perceives its environment visually, for example through cameras or other sensors. It recognizes objects, people, movements, and spatial relationships.
  2. Language: The robot understands and processes natural language. This means it can interpret instructions, questions, or descriptions – similar to ChatGPT, but with reference to the physical world.
  3. Action: Based on what it sees and understands, the robot performs meaningful actions, such as grasping an object, opening a door, or following a person.
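
A minimal sketch of how these three components could fit together in a control loop is shown below. Every class and function name here is a hypothetical placeholder, not the interface of FLOWER or any other real system.

```python
# Hypothetical vision-language-action loop, shown only to make the three
# components concrete; none of these classes exist in a real library.
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes          # camera frame (vision input)
    instruction: str      # natural-language command (language input)

class ToyVLAPolicy:
    """Placeholder policy: maps an observation to a named action."""

    def decide(self, obs: Observation) -> str:
        # A real VLA model would fuse image and text features in a neural
        # network; this stub just keys off a word in the instruction.
        if "door" in obs.instruction.lower():
            return "open_door"
        if "cup" in obs.instruction.lower():
            return "grasp_cup"
        return "wait"

def control_loop(policy: ToyVLAPolicy, obs: Observation) -> None:
    action = policy.decide(obs)   # action derived from vision + language
    print(f"Executing action: {action}")

if __name__ == "__main__":
    control_loop(ToyVLAPolicy(),
                 Observation(image=b"", instruction="Please hand me the cup"))
```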