03/25/2024 09:09
How can a machine learn to orient itself visually in our world? Scientists at the German Research Center for Artificial Intelligence (DFKI) are currently tackling this question and working to develop solutions. At this year's Computer Vision and Pattern Recognition (CVPR) conference in Seattle, USA, researchers from the Augmented Vision department will present their latest technical developments. One of them is MiKASA.
What people intuitively pick up alongside language acquisition is the ability to determine meaning independently of the actual linguistic expression. This means that we can understand an intention or a reference phrased in many different ways and relate it to something in our world.
Machines do not yet have this ability, or only in a rudimentary form. This is set to change thanks to MiKASA, a technology developed by DFKI researchers. The Multi-Key-Anchor Scene-Aware Transformer for 3D Visual Grounding (MiKASA) makes it possible to identify complex spatial relationships and attributes of objects in 3D space and to understand them linguistically.
Context is everything
“If, for example, we see a large cube-shaped object in the kitchen, we can reasonably assume that it might be a dishwasher. If we see a similar shape in the bathroom, the assumption that it is a washing machine is more plausible,” explains Alain Pagani, project leader in the Augmented Vision research department.
Meaning depends on context, and this context is essential for an accurate understanding of our surroundings. Thanks to the scene-aware object recognizer, machines can now draw inferences from the surroundings of a referenced object and thus recognize and identify it more accurately. Another challenge for the software is understanding relative spatial relationships: “the chair in front of the blue screen” is, from a different perspective, “the chair behind the screen”.
To make it clear to the machine that both descriptions refer to the same chair, MiKASA works with the so-called multi-key-anchor concept. It expresses the coordinates of nearby anchor objects relative to the target object and weights the importance of these neighboring objects based on the text description.
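To illustrate the idea in simplified form, the following Python sketch shows how anchor coordinates might be expressed relative to a candidate target object, and how nearby objects could be weighted by their relevance to the text query. It is a toy illustration of the concept described above, not DFKI's actual MiKASA code; all function names, dimensions, and scores are invented for the example.

```python
# Toy sketch of the multi-key-anchor idea (hypothetical, not MiKASA's code):
# anchor coordinates are re-expressed in a target-centric frame, and nearby
# objects are weighted by how relevant they are to the text description.
import numpy as np

def relative_anchor_features(target_center, anchor_centers):
    """Express each anchor position relative to the candidate target object."""
    offsets = anchor_centers - target_center                 # (N, 3) relative vectors
    distances = np.linalg.norm(offsets, axis=1, keepdims=True)
    directions = offsets / np.maximum(distances, 1e-6)
    return np.concatenate([offsets, distances, directions], axis=1)

def weight_anchors_by_text(anchor_text_scores):
    """Softmax over text-relevance scores -> importance of each anchor."""
    e = np.exp(anchor_text_scores - anchor_text_scores.max())
    return e / e.sum()

# Invented scene: a chair (target candidate) with a screen and a table nearby.
target = np.array([1.0, 2.0, 0.0])
anchors = np.array([[1.0, 3.0, 0.8],    # screen
                    [0.2, 2.0, 0.4]])   # table
text_scores = np.array([2.1, 0.3])      # "in front of the blue screen" favors the screen

features = relative_anchor_features(target, anchors)
weights = weight_anchors_by_text(text_scores)
print(features.shape, weights)          # (2, 7) and roughly [0.86 0.14]
```

The sketch only captures the geometric intuition; in the actual system such target-relative features and text-dependent weightings are learned rather than hand-coded.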
Object recognition: more accurate than ever
Semantic cues can also help with object localization. A chair usually stands at a table or against a wall, so the presence of a table or wall indirectly indicates the orientation of the chair.
By combining language models, learned semantics, and object recognition in real 3D space, MiKASA achieves an accuracy of up to 78.6 percent in the Sr3D challenge. This increases the hit rate of object identification by around 10 percent compared with the previously best techniques in the field.
“Seeing” does not mean “understanding”
Before a program can begin to understand its environment, it must first be able to perceive it. Countless sensors provide their data, which is then combined to form an overall impression. The robot then uses this, for example, to orient itself in space.
The problem: as with the human eye, visual information is subject to interference. To make sense of it and assemble a coherent picture from the many data streams, DFKI developed SG-PGM (Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks). Aligning so-called 3D scene graphs provides the basis for a variety of applications: it supports point cloud registration, for example, and helps robots navigate.
To make this possible even in dynamic environments with potential sources of interference, SG-PGM aligns the scene graphs using a neural network. “The software reuses geometric features learned through point cloud registration and links the collected geometric point data to node-level semantic features,” says DFKI's Alain Pagani.
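As a rough illustration of this semantic-geometric fusion, the following sketch concatenates each scene-graph node's semantic feature with pooled point features and then matches nodes across two graphs by cosine similarity. It is a hypothetical toy example, not the published SG-PGM implementation; the feature dimensions and the simple fusion by concatenation are assumptions made for the illustration.

```python
# Hypothetical sketch: fuse node-level semantics with pooled geometric point
# features, then match nodes of two scene graphs by cosine similarity.
import numpy as np

def fuse_node_features(semantic_feats, point_feats, node_point_index):
    """Concatenate each node's semantic vector with its pooled point features."""
    fused = []
    for node_id, sem in enumerate(semantic_feats):
        pts = point_feats[node_point_index == node_id]       # points belonging to this node
        geo = pts.mean(axis=0) if len(pts) else np.zeros(point_feats.shape[1])
        fused.append(np.concatenate([sem, geo]))
    return np.stack(fused)

def match_nodes(feats_a, feats_b):
    """Cosine-similarity matrix between the nodes of two scene graphs."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T
    return sim.argmax(axis=1), sim                            # best match per node in graph A

# Invented data: 2 nodes per graph, 4-dim semantics, 3-dim point features.
rng = np.random.default_rng(0)
sem_a, sem_b = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
pts_a, pts_b = rng.normal(size=(10, 3)), rng.normal(size=(12, 3))
idx_a, idx_b = rng.integers(0, 2, 10), rng.integers(0, 2, 12)

match, _ = match_nodes(fuse_node_features(sem_a, pts_a, idx_a),
                       fuse_node_features(sem_b, pts_b, idx_b))
print(match)   # for each node in graph A, the most similar node in graph B
```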
The software recognizes objects based on their meaning
Essentially, semantics are assigned to a particular set of points (e.g. the meaning “the blue chair is in front of the screen”). The same set can then be identified in another scene graph, and the scene is expanded only by elements that do not recur.
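A minimal sketch of this merging step, under the simplifying assumption that objects are represented by a label and a center point, might look as follows; the distance threshold and the data are made up, and the real system works on matched scene-graph nodes rather than simple labels.

```python
# Toy illustration of the merging step described above (assumed representation:
# label plus center point). A new object extends the scene only if no known
# object with the same label lies close by.
import numpy as np

def merge_scans(known_objects, new_objects, dist_thresh=0.5):
    """known_objects / new_objects: lists of (label, center) tuples."""
    merged = list(known_objects)
    for label, center in new_objects:
        duplicate = any(
            lbl == label and np.linalg.norm(center - c) < dist_thresh
            for lbl, c in merged
        )
        if not duplicate:                      # only non-recurring elements extend the scene
            merged.append((label, center))
    return merged

scene = [("chair", np.array([1.0, 2.0, 0.0])), ("screen", np.array([1.0, 3.0, 0.8]))]
scan  = [("chair", np.array([1.05, 2.02, 0.0])),   # same chair, slightly shifted
         ("table", np.array([0.2, 2.0, 0.4]))]     # genuinely new object

print([lbl for lbl, _ in merge_scans(scene, scan)])   # ['chair', 'screen', 'table']
```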
SG-PGM is thus able to identify interference in a scene with unprecedented accuracy and to derive the most accurate possible overall picture from a large number of sensors. Robots can therefore orient themselves better in 3D space and locate objects precisely. The CVPR organizers recognized this technical advance with a place in the conference program.
With a total of six research papers, the team led by Didier Stricker, head of Augmented Vision research at DFKI, will present, among other things, techniques that identify and capture objects in 3D space on the basis of varying linguistic descriptions and that map the environment completely using sensors.
Scientific contact:
[email protected], Prof. Dr. Didier Stricker, AV Director
[email protected], Dr.-Ing. Alain Pagani, AV – research staff member
Additional information:
Taking place June 17-21 at the Seattle Convention Center, CVPR 2024 is one of the most important events in the field of machine pattern recognition. This year, out of thousands of submissions, the most relevant technical approaches and their developers were once again rewarded with an invitation to the conference. This also applies to the team led by Didier Stricker, head of Augmented Vision research at DFKI.