New SLAM System Enhances Robot Navigation in Crowded Spaces

In a breakthrough that could reshape how robots perceive and navigate complex environments, a team of researchers from Nanjing University of Aeronautics and Astronautics has unveiled a new visual simultaneous localization and mapping (SLAM) framework designed to operate reliably in highly dynamic settings. The system, developed by Lai Shangxiang, Yang Zhong, Jiang Yuhong, Zhang Chi, and Fang Qianhui, addresses one of the most persistent challenges in robotics: the assumption that environments are static. By integrating deep learning with geometric reasoning, the team has created a SLAM solution that significantly outperforms existing methods when people, pets, or moving objects dominate the scene.

For over two decades, SLAM has served as the cornerstone of robotic autonomy, enabling machines to build maps of unknown spaces while tracking their own position within them. From warehouse robots to self-driving cars and augmented reality headsets, SLAM allows devices to move through the world without relying on GPS or pre-existing blueprints. However, most traditional SLAM systems operate under a fundamental assumption — that the world around them is still. This assumption breaks down in real-world environments, where humans walk through rooms, pets dart across floors, and objects are frequently moved. In such settings, conventional SLAM algorithms struggle, often misinterpreting motion as camera movement or generating distorted maps.

The research, published in Applied Science and Technology, introduces a novel approach that redefines how robots handle dynamic content. Rather than treating moving objects as noise to be filtered out, the team’s method actively identifies, segments, and excludes dynamic elements before they can interfere with pose estimation and map construction. The result is a system that maintains high localization accuracy even when a person occupies a large portion of the camera’s field of view — a scenario that typically causes existing systems to fail.

The innovation builds upon ORB-SLAM2, one of the most widely adopted open-source SLAM frameworks, known for its robustness in static environments. ORB-SLAM2 relies on detecting and matching ORB (Oriented FAST and Rotated BRIEF) features across image sequences to estimate camera motion and reconstruct 3D scenes. While effective in controlled settings, its performance degrades rapidly in dynamic spaces because moving objects introduce false correspondences that skew the optimization process. To overcome this, the Nanjing team implemented a dual-stage filtering mechanism that combines semantic understanding with geometric consistency checks.
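
To make the starting point concrete, the sketch below shows the kind of ORB feature detection and matching that ORB-SLAM2's front end builds on, written against OpenCV's Python API. The frame file names are placeholders, and this covers only the feature-level front end, not the full tracking and optimization pipeline.

```python
import cv2

# Load two consecutive grayscale frames (placeholder file names).
frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# ORB detector: FAST keypoints paired with rotated BRIEF descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)

# Brute-force Hamming matcher with cross-checking, as is typical for
# binary descriptors; these matches feed pose estimation downstream.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} putative correspondences")
```

In a dynamic scene, any of these correspondences that land on a moving person or object will bias the pose estimate, which is exactly the failure mode the new system targets.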

At the core of the system is a deep neural network, specifically Mask R-CNN, used to perform pixel-level instance segmentation on each incoming RGB frame. This allows the system to identify objects such as people, animals, and vehicles — entities that are inherently mobile — and mask them out before feature extraction. By removing features associated with these known dynamic objects, the algorithm prevents them from influencing the camera pose estimation. This step leverages prior knowledge: humans are assumed to be in motion, tables are assumed to be stationary, and ambiguous objects like chairs are flagged for further analysis.
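
The paper's own implementation is not published, but the masking step can be sketched with torchvision's off-the-shelf Mask R-CNN. The set of "dynamic" COCO class ids and the 0.5 confidence threshold below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import torch
import torchvision

# Pretrained Mask R-CNN (COCO classes) for pixel-level instance segmentation.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# COCO label ids treated as inherently dynamic here
# (1 = person, 3 = car, 17 = cat, 18 = dog); an illustrative subset.
DYNAMIC_CLASSES = {1, 3, 17, 18}

def dynamic_mask(rgb: np.ndarray, score_thresh: float = 0.5) -> np.ndarray:
    """Return a boolean mask marking pixels that belong to likely-dynamic objects."""
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([tensor])[0]
    mask = np.zeros(rgb.shape[:2], dtype=bool)
    for label, score, m in zip(out["labels"], out["scores"], out["masks"]):
        if int(label) in DYNAMIC_CLASSES and float(score) > score_thresh:
            mask |= (m[0].numpy() > 0.5)  # per-instance soft mask -> binary
    return mask

# ORB features falling inside this mask are discarded before pose estimation.
```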

However, semantic segmentation alone is insufficient. Many dynamic objects may not be recognized by the network due to occlusion, unusual poses, or limited training data. Moreover, static objects can become dynamic through human interaction — a chair pushed across the room, a book picked up from a shelf. To detect such cases, the researchers introduced a geometric consistency module that operates after the initial semantic filtering.

This second stage relies on epipolar geometry — a fundamental principle in multi-view computer vision that describes the geometric relationship between two camera views. When the camera moves, points in the scene should project onto corresponding epipolar lines in the next frame. Static points adhere to this constraint, while dynamic ones deviate due to their independent motion. By estimating the essential matrix between consecutive frames, the system can compute the expected projection of each feature point and measure the deviation from its actual location.
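
A minimal version of such a check is shown below. For simplicity it estimates the fundamental matrix, which plays the same role as the essential matrix when working directly in pixel coordinates, and the one-pixel deviation threshold is an assumed parameter rather than the paper's exact value.

```python
import cv2
import numpy as np

def epipolar_outliers(pts1, pts2, thresh_px=1.0):
    """Flag matches whose distance to the epipolar line exceeds thresh_px.

    pts1, pts2: (N, 2) arrays of matched pixel coordinates in two frames.
    Returns a boolean array where True marks a likely dynamic point.
    """
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    # Epipolar lines in image 2 for points in image 1: l2 = F @ x1 (homogeneous).
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])           # (N, 3) homogeneous coordinates
    lines2 = (F @ x1.T).T                  # (N, 3), each row (a, b, c)
    a, b, c = lines2[:, 0], lines2[:, 1], lines2[:, 2]
    # Point-to-line distance |a*u + b*v + c| / sqrt(a^2 + b^2).
    dist = np.abs(a * pts2[:, 0] + b * pts2[:, 1] + c) / np.sqrt(a ** 2 + b ** 2)
    return dist > thresh_px
```

Static points should sit almost exactly on their epipolar lines; points on a walking person will drift off them, which is what this distance test detects.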

To ensure reliable pose estimation during this process, the team developed an inertial motion model that predicts the camera’s next position based on its previous movement. Assuming constant velocity between high-frame-rate images, the model provides an initial guess for the camera pose, which is then refined using only the static features identified by the semantic network. This hybrid approach avoids the instability that would result from using potentially contaminated features for initial alignment.
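
A constant-velocity prediction of this kind takes only a few lines once camera poses are represented as 4x4 homogeneous transforms; the function below is a generic sketch of that assumption, not the authors' code.

```python
import numpy as np

def predict_pose(T_prev: np.ndarray, T_prev_prev: np.ndarray) -> np.ndarray:
    """Constant-velocity prediction of the next camera pose.

    T_prev, T_prev_prev: 4x4 world-from-camera transforms of the last two
    frames. The relative motion between them is assumed to repeat once more.
    """
    # Relative transform that carried the camera from frame k-2 to frame k-1.
    delta = T_prev @ np.linalg.inv(T_prev_prev)
    # Apply the same increment again to guess the pose of frame k.
    return delta @ T_prev
```

The predicted pose is then refined with the static features that survived the semantic mask, so contaminated correspondences never drive the initial alignment.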

Once a preliminary pose is established, the system performs motion consistency checks on the remaining feature points. Points that exhibit large reprojection errors — indicating they do not conform to the expected camera motion — are flagged as potentially dynamic. To avoid missing parts of moving objects, especially near boundaries where feature density is high, the researchers applied a region-growing algorithm on the depth map. This technique groups spatially connected pixels with similar depth values, effectively expanding the mask to cover entire dynamic objects, even in regions where semantic confidence is low.
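
A simple flood-fill style region-growing pass over the depth image captures the idea: starting from a pixel already flagged as dynamic, it absorbs neighbouring pixels whose depth is similar. The 5 cm depth tolerance is an assumed parameter for illustration.

```python
from collections import deque
import numpy as np

def grow_region(depth: np.ndarray, seed: tuple, depth_tol: float = 0.05) -> np.ndarray:
    """Grow a mask from `seed` over 4-connected pixels with similar depth.

    depth: (H, W) depth map in meters; seed: (row, col) of a flagged dynamic point.
    """
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                # Accept neighbours whose depth is close to the current pixel's.
                if abs(depth[nr, nc] - depth[r, c]) < depth_tol:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```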

A morphological refinement step further sharpens the mask boundaries, ensuring that partial or fragmented detections are consolidated into coherent segments. The final output is a clean set of static features that are then fed into the main SLAM pipeline for tracking, mapping, and loop closure. This separation of dynamic and static content allows the system to maintain a consistent and accurate representation of the environment over time.
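
That clean-up can be approximated with a morphological closing followed by an opening in OpenCV, as sketched below; the 5x5 structuring element is an illustrative choice.

```python
import cv2
import numpy as np

def refine_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Consolidate a fragmented binary mask into coherent segments."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    m = mask.astype(np.uint8) * 255
    # Closing fills small holes inside detections...
    m = cv2.morphologyEx(m, cv2.MORPH_CLOSE, kernel)
    # ...opening then removes isolated speckles outside them.
    m = cv2.morphologyEx(m, cv2.MORPH_OPEN, kernel)
    return m > 0
```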

The team evaluated their method using the TUM RGB-D dataset, a benchmark widely used in the robotics community for testing SLAM performance in indoor environments. The dataset includes sequences with varying degrees of dynamism, from a person sitting still to individuals walking across the room, along with ground-truth trajectories captured by motion capture systems. Testing was conducted on a standard desktop configuration with an Intel i5-9400F processor, Nvidia GTX-1660 GPU, and 16 GB of RAM — hardware representative of mid-tier robotics platforms.

Results showed a dramatic improvement over the baseline ORB-SLAM2 system. In sequences labeled “w.half” and “w.rpy,” where a person moves through the scene, the proposed method reduced absolute trajectory error (ATE) by over 95% in some cases. The root mean square error (RMSE) dropped from more than 2.9 meters with ORB-SLAM2 to just 0.0614 meters with the new system, a reduction of roughly 98%. Even in less dynamic scenarios, such as “s.half” where the person remains seated, the system maintained competitive accuracy, demonstrating its robustness across different conditions.

Perhaps more telling is the comparison with other state-of-the-art dynamic SLAM systems. When benchmarked against DS-SLAM and Detect-SLAM — two recent approaches that also incorporate semantic information — the Nanjing team’s solution achieved superior performance across nearly all test sequences. In the “w.xyz” sequence, for example, the RMSE was 0.0192 meters, outperforming DS-SLAM (0.0247 m) and Detect-SLAM (0.0241 m). These results suggest that the combination of semantic segmentation and geometric verification offers a more reliable and accurate solution than either method alone.

One of the key advantages of the proposed system is its ability to handle edge cases that challenge purely semantic or purely geometric approaches. For instance, in situations where a person stands still for an extended period, semantic segmentation might incorrectly classify them as static. However, the geometric consistency check can still detect subtle movements or inconsistencies in depth projection, allowing the system to retain awareness of potential dynamics. Conversely, in cluttered scenes with many small moving objects, geometric methods alone may generate false positives, but semantic priors help suppress noise by focusing only on object categories likely to move.

The implications of this work extend beyond academic interest. Service robots operating in homes, hospitals, or retail spaces must navigate environments filled with people and moving objects. Autonomous vehicles need to distinguish between static infrastructure and moving pedestrians or cyclists. Augmented reality systems must anchor virtual content to stable surfaces while ignoring transient elements. In all these applications, the ability to reliably separate dynamic and static components of a scene is critical for safety, accuracy, and user experience.

Moreover, the modular design of the system allows for future enhancements. The researchers note that incorporating additional semantic categories — such as doors, drawers, or elevators — could further improve scene understanding. They also envision building richer semantic maps that not only identify objects but also encode their functional relationships and interaction patterns. Such maps could enable higher-level reasoning, allowing robots to anticipate human behavior or adapt to changing environments more intelligently.

Another promising direction is the integration of temporal reasoning. While the current system operates on frame-to-frame comparisons, adding memory of past object states could help track the history of motion and predict future trajectories. This would be particularly valuable in crowded environments where objects may temporarily disappear behind obstacles or undergo complex interactions.

From an engineering perspective, the system strikes a balance between accuracy and computational efficiency. While deep learning models like Mask R-CNN are computationally intensive, the use of ORB features — which are lightweight and fast to compute — helps maintain real-time performance. The inertial motion model further reduces processing load by providing a good initial pose estimate, minimizing the need for exhaustive search during feature matching. On the tested hardware, the system ran at interactive frame rates, making it suitable for deployment on mobile robots and embedded platforms.

The success of this research also highlights the growing synergy between deep learning and classical geometric computer vision. For years, these two paradigms were seen as competing approaches — one data-driven and black-box, the other model-based and interpretable. But recent advances show that combining them can yield systems that are both intelligent and robust. The Nanjing team’s work exemplifies this trend, using deep networks to provide high-level understanding while relying on geometric principles to enforce physical consistency.

Looking ahead, the researchers plan to explore more sophisticated object interaction models and expand the system’s capabilities to outdoor and large-scale environments. They also aim to incorporate feedback from the SLAM backend to refine the dynamic object detection process — a form of mutual enhancement where mapping informs perception and vice versa.

In an era where robots are increasingly expected to operate in human-centric spaces, the ability to navigate dynamic environments is no longer a luxury — it is a necessity. The work by Lai Shangxiang and colleagues represents a significant step toward that goal, offering a practical, high-performance solution that bridges the gap between laboratory research and real-world deployment. As robots move from controlled factories into homes, offices, and public spaces, systems like this will be essential for ensuring they can move safely, reliably, and intelligently through a world that is anything but static.

Lai Shangxiang, Yang Zhong, Jiang Yuhong, Zhang Chi, Fang Qianhui (Nanjing University of Aeronautics and Astronautics). Applied Science and Technology. DOI: 10.11991/yykj.202007016