Feature correspondence is a crucial task in computer vision, but it remains challenging due to the lack of sufficient pixellevel annotated data for point matching. Existing methods to alleviate this data problem, such as simulating correspondence annotation or collecting ground truth using specialized equipment, have limitations.
We propose a new selfsupervised endtoend learning framework, called iMatching, for feature correspondence using bilevel optimization and imperative learning. This framework formulates the problem of feature matching as a bilevel optimization problem, where the model parameters are updated through an optimization procedure (bundle adjustment) that is itself an optimization process. This design eliminates the need for ground truth geometric labels and is flexible, lightweight, and generalizable to any uptodate matching model. We also formulate a specially designed loss function that allows for endtoend learning without the need for differentiating through the lowlevel optimization process. Experimental results show that iMatching significantly boosts the performance of stateoftheart models on pixellevel feature matching tasks.
Bilevel Optimization
The framework consists of a feature correspondence network \(f_{\boldsymbol{\theta}}\) and a bundle adjustment (BA) process. The feature correspondence network produces a set of pixel correspondences, and the BA process seeks to find the optimal camera poses and landmark positions that minimize the total reprojection error.
The system is formulated as a bilevel optimization problem, where the upper level optimizes the network parameters \(\boldsymbol{\theta}\) to minimize the reprojection error, and the lower level optimizes the camera poses \(\mathbf{T}\) and landmark positions \(\mathbf{p}\) to minimize the reprojection error given the feature correspondences inferred by the network.
The bilevel optimization problem is expressed as:
\[\begin{aligned} \min_{\boldsymbol{\theta}}& \;\; \mathcal{R}(f_{\boldsymbol{\theta}}; \mathbf{T}^*, \mathbf{p}^*), \\ \operatorname{s.t.}& \;\; \mathbf{T}^*, \mathbf{p}^* = \arg\min_{\mathbf{T}, \mathbf{p}} \mathcal{R}(\mathbf{T}, \mathbf{p}; f_{\boldsymbol{\theta}}), \end{aligned}\]where \(\mathcal{R}\) is the reprojection error function.
Optimization
We optimize the feature correspondence network \(f_{\boldsymbol{\theta}}\) via gradient descent. To compute the derivative of the reprojection error w.r.t. the network parameters \(\boldsymbol{\theta}\), we use the chain rule: Here is the revised equation using the standard notation for partial derivatives:
\[\frac{\partial \mathcal{R}}{\partial \boldsymbol{\theta}} = \left( \frac{\partial \mathcal{R}}{\partial \mathbf{T}^*} \frac{\partial \mathbf{T}^*}{\partial f_{\boldsymbol{\theta}}} + \frac{\partial \mathcal{R}}{\partial \mathbf{p}^*} \frac{\partial \mathbf{p}^*}{\partial f_{\boldsymbol{\theta}}} + \frac{\partial \mathcal{R}}{\partial f_{\boldsymbol{\theta}}} \right) \frac{\partial f_{\boldsymbol{\theta}}}{\partial \boldsymbol{\theta}}.\]However, this equation is only welldefined if the feature correspondence network is differentiable and the bundle adjustment (BA) process is also differentiable. The method is compatible with expectationbased matching prediction and regressionbased matching prediction.
To make the BA process differentiable, we leverage the optimization of BA at convergence, which allows us to efficiently backpropagate the gradient through the lowerlevel optimization without compromising its robustness. Specifically, the gradient can be simplified to:
\[\frac{\partial \mathcal{R}}{\partial \boldsymbol{\theta}} = \frac{\partial \mathcal{R}}{\partial f_{\boldsymbol{\theta}}} \frac{\partial f_{\boldsymbol{\theta}}}{\partial \boldsymbol{\theta}}.\]This implies that it is sufficient to evaluate the Jacobian of the reprojection error only once after bundle adjustment and backpropagate through the correspondence argument, treating \(\mathbf{T^*}\) and \(\mathbf{p^*}\) as given.
Experiments
iMatching demonstrates strong performance gains in feature matching and relative pose estimation tasks on unseen, imageonly datasets. In terms of practicalness, it preserves the original model’s inference efficiency. Compared to stateoftheart (SOTA) models, iMatching finetuning boosts performance by an average of 30% and a maximum of 82% on pixellevel feature matching.
Publications

iMatching: Imperative Correspondence Learning.European Conference on Computer Vision (ECCV), 2024.