Simultaneous localization and mapping (SLAM) stands as one of the critical challenges in robot navigation. Recent advancements suggest that methods based on supervised learning deliver impressive performance in front-end odometry, while traditional optimization-based methods still play a vital role in the back-end for minimizing estimation drift. In this paper, we find that such a decoupled paradigm can lead to only sub-optimal performance, consequently curtailing system capabilities and generalization potential. To solve this problem, we propose a novel self-supervised learning framework, imperative SLAM (iSLAM), which fosters reciprocal correction between the front-end and back-end, thus enhancing performance without necessitating any external supervision.

## Bi-level Optimization

The framework of iSLAM consists of three parts, i.e., an odometry network \(f_\theta\), a memory (map) \(M\) and a pose graph optimization.

To achieve co-optimization between the front-end and back-end, we formulate the SLAM task as a bi-level optimization problem,

\[\begin{align} \min_\theta& \;\; \mathcal{U}(f_\theta, \mathcal{L}^*), \\ \operatorname{s.t.}& \;\; P^*=\arg\min_{P} \; \mathcal{L}(f_\theta, P, M), \end{align}\]

where \(P\) denotes the robot poses to be optimized, and \(\mathcal{U}\) and \(\mathcal{L}\) are the higher-level and lower-level objective functions, respectively. \(P^*\) is the optimal pose of the lower-level optimization, while \(\mathcal{L}^*\) is the optimal lower-level objective, i.e., \(\mathcal{L}(f_\theta, P^*, M)\). In this work, both \(\mathcal{U}\) and \(\mathcal{L}\) are geometry-based objective functions, such as the pose transform residuals in pose-velocity graph optimization (PVGO). Because neither objective requires labels, the entire framework is self-supervised. Intuitively, to achieve a lower loss, the odometry network is driven to generate outputs that align with the geometrical reality imposed by the geometry-based objective functions. The acquired geometrical information is stored in the network parameter \(\theta\) and the map \(M\) for future reference, contributing to a more comprehensive understanding of the environment. This framework is named “imperative SLAM” to emphasize the passive nature of this process.
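The alternation between the two levels can be sketched on a toy scalar problem. This is only an illustration under strong assumptions, not the iSLAM implementation: the "network" is a single learnable scale `theta` applied to a raw motion `x`, the second measurement `m` stands in for whatever other geometric constraint the back-end has, and the names `f_theta`, `lower_objective`, `solve_lower`, and `train` are hypothetical. The upper-level objective is taken to be the optimal lower-level value, and no ground-truth label appears anywhere in the loop.

```python
# Toy scalar bi-level loop. Every name and number here is illustrative.

def f_theta(theta, x):
    """Hypothetical 'odometry network': one learnable scale applied to a raw motion x."""
    return theta * x


def lower_objective(f, P, m, lam):
    """L(f, P, M): how well pose P agrees with the network output f and a second measurement m."""
    return (P - f) ** 2 + lam * (P - m) ** 2


def solve_lower(f, m, lam, steps=200, lr=0.1):
    """Back-end stand-in: iteratively minimise L over P with plain gradient descent."""
    P = f  # initialise from the front-end estimate
    for _ in range(steps):
        P -= lr * (2.0 * (P - f) + 2.0 * lam * (P - m))
    return P


def train(theta, x, m, lam=1.0, outer_steps=50, eta=0.1):
    """Alternate a lower-level solve with an upper-level gradient step on theta (U = L*)."""
    history = []
    for _ in range(outer_steps):
        f = f_theta(theta, x)
        P_star = solve_lower(f, m, lam)
        history.append(lower_objective(f, P_star, m, lam))
        # Gradient of U = L* w.r.t. theta with P* held fixed: dL/df * df/dtheta.
        grad = 2.0 * (f - P_star) * x
        theta -= eta * grad
    return theta, history


# The raw motion is 2.0 and the second measurement says 3.0,
# so the geometrically consistent scale is 1.5.
theta_final, losses = train(theta=0.5, x=2.0, m=3.0)
print(theta_final, losses[0], losses[-1])
```

In this toy setting the scale is pulled towards the value that makes both measurements agree, which mirrors the intuition above that the network is driven towards geometric consistency rather than towards labels.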

## One-step Back-propagation

To supervise the front-end, we need to compute the gradient of the higher-level objective \(\mathcal{U}\) with respect to the network parameter \(\theta\). According to the chain rule,

\[\frac{\partial\mathcal{U}}{\partial\theta} = \frac{\partial\mathcal{U}}{\partial f_\theta}\frac{\partial f_\theta}{\partial\theta} + \frac{\partial\mathcal{U}}{\partial \mathcal{L}^*}\left(\frac{\partial \mathcal{L}^*}{\partial f_\theta}\frac{\partial f_\theta}{\partial\theta} + \frac{\partial \mathcal{L}^*}{\partial P^*}\frac{\partial P^*}{\partial\theta}\right).\]

The challenge lies in the last term \(\frac{\partial P^*}{\partial \theta}\), which is hard to compute because \(P^*\) is obtained through an iterative optimization. Back-propagating directly along the forward path requires unrolling those iterations, which is inefficient and prone to numerical instability. We therefore apply an efficient “one-step” strategy that exploits the nature of stationary points. After the lower-level optimization converges, \(\frac{\partial \mathcal{L}^*}{\partial P^*} \approx 0\), so the product \(\frac{\partial \mathcal{L}^*}{\partial P^*}\frac{\partial P^*}{\partial\theta}\) vanishes and the gradient reduces to

\[\frac{\partial\mathcal{U}}{\partial\theta} \approx \left(\frac{\partial\mathcal{U}}{\partial f_\theta} + \frac{\partial\mathcal{U}}{\partial \mathcal{L}^*}\frac{\partial \mathcal{L}^*}{\partial f_\theta}\right)\frac{\partial f_\theta}{\partial\theta}.\]

This eliminates the complex gradient term, bypasses the lower-level optimization iterations, and allows back-propagation to be done in one step.
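The stationary-point argument can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration rather than the authors' code: the lower-level objective is a scalar quadratic, the "network" is `f = theta * x`, the upper-level objective is set to \(\mathcal{U} = \mathcal{L}^*\) (so \(\frac{\partial\mathcal{U}}{\partial f_\theta}\) has no direct term and \(\frac{\partial\mathcal{U}}{\partial \mathcal{L}^*} = 1\)), and all function names are hypothetical. It compares the one-step gradient, computed with \(P^*\) held fixed, against a finite-difference gradient that re-runs the whole iterative solver.

```python
# Toy check of the one-step gradient. All names and values are illustrative.

def lower_objective(f, P, m, lam):
    """Scalar stand-in for L(f, P, M)."""
    return (P - f) ** 2 + lam * (P - m) ** 2


def solve_lower(f, m, lam, steps=200, lr=0.1):
    """Iterative lower-level solver (gradient descent on P)."""
    P = f
    for _ in range(steps):
        P -= lr * (2.0 * (P - f) + 2.0 * lam * (P - m))
    return P


def upper_value(theta, x, m, lam):
    """U(theta) = L*, evaluated through the full lower-level solve."""
    f = theta * x
    return lower_objective(f, solve_lower(f, m, lam), m, lam)


def one_step_grad(theta, x, m, lam):
    """dU/dtheta using only dL/df at the converged P*; dP*/dtheta is never computed."""
    f = theta * x
    P_star = solve_lower(f, m, lam)
    dL_df = 2.0 * (f - P_star)  # partial derivative w.r.t. f with P* held fixed
    df_dtheta = x
    return dL_df * df_dtheta


def finite_diff_grad(theta, x, m, lam, eps=1e-5):
    """Reference gradient: central difference through the whole solver."""
    up = upper_value(theta + eps, x, m, lam)
    down = upper_value(theta - eps, x, m, lam)
    return (up - down) / (2.0 * eps)


def stationarity(theta, x, m, lam):
    """dL/dP at the converged solution; should be close to zero."""
    f = theta * x
    P_star = solve_lower(f, m, lam)
    return 2.0 * (P_star - f) + 2.0 * lam * (P_star - m)


args = (0.8, 2.0, 3.0, 0.5)  # theta, x, m, lam
g_one_step = one_step_grad(*args)
g_reference = finite_diff_grad(*args)
residual = stationarity(*args)
print(g_one_step, g_reference, residual)
```

The two gradients agree here because the solver has converged, so the dropped term is multiplied by a near-zero factor. If the lower-level solve were stopped early, the residual would be non-zero and the one-step gradient would only be an approximation.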

## Front-end Odometry

Our front-end odometry module includes an IMU Preintegrator and a Stereo VO, which estimate frame-to-frame motions from IMU data and stereo images, respectively. The Stereo VO consists of a Monocular VO, which estimates the rotation and translation direction, and a Scale Corrector, which recovers the true scale. The estimated motions are aggregated to form trajectories and transmitted to the back-end for optimization.
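How a translation direction and a recovered scale combine, and how frame-to-frame motions chain into a trajectory, can be sketched as follows. For brevity the sketch assumes planar motion (poses are `(x, y, yaw)`) instead of full 3D poses, and `scaled_motion` and `aggregate` are hypothetical helper names, not functions from the iSLAM code.

```python
import math


def scaled_motion(direction, scale, dyaw):
    """Combine a translation direction (monocular VO) with a metric scale (scale corrector)."""
    dx, dy = direction
    norm = math.hypot(dx, dy)  # normalise so that only the direction is used
    return (scale * dx / norm, scale * dy / norm, dyaw)


def aggregate(motions, start=(0.0, 0.0, 0.0)):
    """Chain frame-to-frame motions, each expressed in the previous body frame, into global poses."""
    x, y, yaw = start
    trajectory = [start]
    for tx, ty, dyaw in motions:
        # Rotate the body-frame translation into the global frame, then accumulate.
        x += math.cos(yaw) * tx - math.sin(yaw) * ty
        y += math.sin(yaw) * tx + math.cos(yaw) * ty
        yaw += dyaw
        trajectory.append((x, y, yaw))
    return trajectory


# Four identical steps (move 1 unit forward, then turn 90 degrees) trace a closed square.
motions = [scaled_motion((1.0, 0.0), 1.0, math.pi / 2) for _ in range(4)]
trajectory = aggregate(motions)
for pose in trajectory:
    print(pose)
```

Because each pose is built from all previous motions, a small error in any single estimate propagates to every later pose, which is the drift that the back-end optimization is meant to reduce.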

## Back-end Pose-velocity Graph Optimization

In the back-end, we employ a pose-velocity graph optimization to integrate the estimates from visual and inertial odometry, which jointly contribute to a more accurate trajectory. The optimization variables, i.e., poses and velocities, are nodes in the graph, connected by four types of geometric constraints. The graph optimization objective, which is also the lower-level objective \(\mathcal{L}\), is the weighted sum of these four constraints. The upper-level objective \(\mathcal{U}\) is chosen to be identical to \(\mathcal{L}\) for simplicity (in general, the two need not be the same).
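A weighted-sum objective over pose and velocity nodes might look like the 1D sketch below. The text does not name the four constraint types, so the four residuals used here (visual relative motion, inertial relative motion, inertial velocity change, and pose-velocity consistency) are assumptions chosen for illustration, as are the function name `pvgo_objective` and all numbers.

```python
def pvgo_objective(poses, vels, vo_dp, imu_dp, imu_dv, dt, weights):
    """Weighted sum of squared residuals over consecutive nodes (1D toy, assumed residuals)."""
    w_vo, w_imu_p, w_imu_v, w_pv = weights
    cost = 0.0
    for i in range(len(poses) - 1):
        dp = poses[i + 1] - poses[i]  # relative motion implied by the pose nodes
        dv = vels[i + 1] - vels[i]    # velocity change implied by the velocity nodes
        cost += w_vo * (dp - vo_dp[i]) ** 2       # agree with visual odometry
        cost += w_imu_p * (dp - imu_dp[i]) ** 2   # agree with inertial relative motion
        cost += w_imu_v * (dv - imu_dv[i]) ** 2   # agree with inertial velocity change
        cost += w_pv * (dp - vels[i] * dt) ** 2   # poses consistent with velocities
    return cost


# Measurements for constant-velocity motion: 1 unit per step, dt = 1.
measurements = dict(vo_dp=[1.0, 1.0], imu_dp=[1.0, 1.0], imu_dv=[0.0, 0.0], dt=1.0)
weights = (1.0, 1.0, 1.0, 1.0)

consistent = pvgo_objective([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], weights=weights, **measurements)
perturbed = pvgo_objective([0.0, 1.5, 2.0], [1.0, 1.0, 1.0], weights=weights, **measurements)
print(consistent, perturbed)
```

A graph optimizer would search over `poses` and `vels` to minimise this value. Since the upper-level objective is chosen identical to the lower-level one, the same scalar would also serve as the training loss for the front-end.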

## Experiments

The figure below shows how much the trajectory error decreases through imperative learning. To demonstrate the generalizability, we use both our Stereo VO and a Monocular VO + true scale as the front-end. In both settings, significant accuracy improvements are observed for both the front-end and back-end on both datasets.

The VO improvement throughout the imperative iterations is visualized below. The original estimate (Iter. 0) is adjusted towards the ground truth (GT) over 6 iterations. These results strongly support our claim about mutual learning between the front-end and back-end.