State Estimation for SLAM

This post contains my notes on the book “State Estimation for Robotics”, covering the state estimation math behind SLAM.

System and Estimation Methods

| | LG (Linear Gaussian) System | NLNG (Non-linear Non-Gaussian) System |
|---|---|---|
| Recursive estimation | KF (Kalman Filter) | EKF (Extended Kalman Filter), IEKF (Iterated Extended Kalman Filter), SPKF (Sigmapoint Kalman Filter), ISPKF (Iterated Sigmapoint Kalman Filter), Particle Filter |
| Batch estimation | $(H^TW^{-1}H)^{-1} H^T W^{-1} z$ | Gradient Descent, Newton, Gauss-Newton, Levenberg-Marquardt |

Recursive vs Batch estimation

The difference between recursive and batch estimation is whether “future” states are used to update “past” states.

In other words, at each timestamp,

  • Recursive estimation: only use past and current info to update the current state
  • Batch estimation: use past and current info to update the current state and some / all past states

Sliding Window

A recursive method does not have to use only the last state to estimate the current state. A sliding window filter uses several past states to estimate/update the current state.

Batch estimation can also work with a sliding window, but it updates all the states inside the window.

Therefore, whether past states are updated with future states is the only criterion for classifying an estimator as recursive or batch.

LG vs NLNG system

System Equations

  • LG system
    • System and observation equations are linear
    • State and noise are Gaussian distributed (a toy simulation of these equations follows this list)
      $$
      \begin{aligned}
      & x_k = A_{k-1} x_{k-1} + v_k + w_k, \quad k=1,\cdots,K \\
      & y_k = C_k x_k + n_k, \quad k=0,\cdots,K \\
      & x_0 \in \mathbb{R}^N \sim \mathcal{N}(\check{x}_0, \check{R}_0) \\
      & w_k \in \mathbb{R}^N \sim \mathcal{N}(0, Q_k) \\
      & n_k \in \mathbb{R}^M \sim \mathcal{N}(0, R_k)
      \end{aligned}
      $$
  • NLNG system
    • System and observation equations are nonlinear
    • State and noise are non-Gaussian
      $$
      \begin{aligned}
      & x_k = f(x_{k-1}, v_k, w_k), \quad k=1,\cdots,K \\
      & y_k = g(x_k, n_k), \quad k=0,\cdots,K \\
      \end{aligned}
      $$
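
To make the LG model concrete, below is a toy simulation of the linear motion and observation equations above (a minimal sketch in Python/numpy; the scalar "matrices", noise variances, and control input are made-up illustration values).

```python
import numpy as np

# Toy LG system: x_k = A x_{k-1} + v_k + w_k,  y_k = C x_k + n_k
# All parameter values are made up for illustration.
rng = np.random.default_rng(0)
A, C = 1.0, 1.0            # scalar motion / observation "matrices"
Q, R = 0.01, 0.25          # process noise Q_k and measurement noise R_k variances
v = 0.1                    # constant control input v_k
K = 50                     # number of time steps

x = np.zeros(K + 1)        # true states, with x_0 = 0
y = np.zeros(K + 1)        # observations
y[0] = C * x[0] + rng.normal(0.0, np.sqrt(R))
for k in range(1, K + 1):
    x[k] = A * x[k - 1] + v + rng.normal(0.0, np.sqrt(Q))   # motion equation
    y[k] = C * x[k] + rng.normal(0.0, np.sqrt(R))           # observation equation

print(x[-1], y[-1])
```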

To estimate the state of an NLNG system, we usually need to make approximations, such as assuming the state is Gaussian or linearizing the equations. These approximations affect the Markov property of the state and lead to biased, non-optimal estimation.

Unbiased and optimal

Unbiased means the expectation of the difference between the estimate and the true state is zero, i.e. $E(\hat{x} - x_{true}) = 0$.

For an LG system, optimal means the estimate cannot be improved given the same noisy input. In other words, the variance of the estimate is the smallest possible, reaching the Cramér-Rao lower bound (CRLB).

For an NLNG system, optimal means maximizing the posterior.

Markov property

The Markov property holds for an LG system but not for an NLNG system. The Markov property says that, given the current and past states, the future state of the system depends only on the current state.

Therefore,

  • For LG system
    • The recursive and batch estimation give the same results
    • They are both unbiased, optimal estimations
    • This is because past states are not affected by future states, so estimating the current state from past states alone (without future states) is enough
  • For NLNG system
    • The recursive and batch estimation give different results
    • Batch estimation is optimal, while recursive estimation is not
    • Both recursive and batch estimation might be biased or unbiased, depending on which specific approximations are made
    • This is because the future state also depends on past states (not only the current state), so estimating the current state from past states alone, as recursive methods do, is not enough

Why the Markov property does not hold in an NLNG system

Although there is an exact system function that describes the evolution of the state from one time step to the next, it only provides a rough approximation of the real system, because an NLNG system can be too complex to model. In other words, it is very difficult to accurately predict the next state based on the current state alone.

Mode and Mean

The mode is the value at which the state’s probability density reaches its maximum.

  • For an LG system, the mode and mean of the state (either prior or posterior) are equal.
  • For an NLNG system, the state is not Gaussian, so the mode and mean are not equal.

In an NLNG system, since the estimation methods need to make approximations such as equation linearization, they may actually estimate the mode of the posterior instead of the mean, leading to biased estimation.

MAP and Bayesian Inference

All estimation methods can be derived from either of the following principles.

  • MAP (Maximum A Posteriori)
    • Get the estimate that maximizes the state posterior
    • It estimates the mode of the posterior probability function, which is not equal to the mean in an NLNG system
  • Bayesian Inference
    • Get the probability function of the state posterior
    • It estimates both the mean and the variance of the posterior probability function

However, the “mean” obtained from Bayesian inference may not be the true mean of the posterior; it may actually be the mode. For example, for batch estimation of an NLNG system, derivations from MAP and from Bayesian inference give the same result, which is the mode of the posterior. This is because both perform equation linearization.

MAP vs MLE

MLE is a simplified version of MAP since it does not consider the prior.

Estimation methods summary

| | System | Assume state in Gaussian | Assume noise in Gaussian | State evolution | Estimate | Unbiased | Optimal | Efficiency |
|---|---|---|---|---|---|---|---|---|
| KF | LG | None | None | None | Mean / mode | Yes | Yes | High |
| Batch estimation | LG | None | None | None | Mean / mode | Yes | Yes | High |
| EKF | NLNG | Yes | Yes | Equation linearization | Neither mean nor mode | Yes | No | High |
| IEKF | NLNG | Yes | Yes | Equation linearization | Mode | No | No | Medium |
| SPKF/UKF | NLNG | Yes | Yes | Sampling sigmapoints | Neither mean nor mode | Yes | No | High |
| ISPKF | NLNG | Yes | Yes | Sampling sigmapoints | Mean | Yes | No | Medium |
| Particle Filter | NLNG | No | No | Large-scale sampling | Mean | Yes | No | Low |
| Batch estimation | NLNG | No | Yes | Equation linearization | Mode | No | Maybe | Low |

Based on the table above, we can draw the following conclusions.

  1. Batch estimation provides the “best” estimate compared with the other methods.
    • For an LG system, it provides the same optimal estimate as the KF.
    • For an NLNG system, it provides the estimate that maximizes the posterior. However, it may reach a local optimum instead of the global optimum.
  2. For an NLNG system, batch estimation exploits all available data to improve the optimality and consistency of the whole state, while filtering methods only focus on the optimality of a single step.
  3. For an NLNG system, filtering methods trade off optimality for real-time efficiency.
  4. For an NLNG system, linearizing the equations pushes the estimate towards the mode instead of the mean, whether it is derived from the MAP or the Bayesian inference perspective. This is because the linearization discards the nonlinear factors that make the posterior non-Gaussian.
  5. For an NLNG system, non-iterative methods such as the EKF or SPKF may give a very bad estimate, far from both the mode and the mean, because they linearize the functions / sample the data at an operating point that may be far from the true state. In that case their estimate has little relationship with the posterior probability function.

NLNG estimations

Estimation methods for an NLNG system can also be categorized by the approximations they make.

  • Do not use future states to update past states: recursive estimation, Bayesian filter
    • Approximate state and noise as Gaussian: generalized Gaussian filter
      • Linearize the system and observation equations for state evolution
        • Non-iterative: EKF
        • Iterative: IEKF
      • Sample states and pass them through the real nonlinear equations for state evolution
        • Non-iterative: SPKF / UKF
        • Iterative: ISPKF
    • Approximate the probability density with a large number of samples: particle filter
  • Use future states to update past states: batch estimation
    • Approximate the noise as Gaussian and linearize the system and observation equations

Applications

State estimation has the following applications in the robotics domain.

| | Sensor | State | Measurements | State Prior | Closed form | Numerical form |
|---|---|---|---|---|---|---|
| Point cloud registration | lidar | single pose | ldmks pos in lidar frame | No | ICP (R form / q form) | G-N |
| Point cloud tracking | lidar | single pose | ldmks pos in lidar frame | No | - | MLE batch est |
| Point cloud tracking | lidar | single pose | ldmks pos in lidar frame | Yes | - | EKF filtering / MAP batch est |
| Frontend frame tracking | camera | single pose | ldmks’ projections in pixel frame | No | PnP (RANSAC): P3P / DLT | G-N |
| Bundle adjustment | camera | several poses and ldmks pos | landmarks’ projections in several pixel frames | No | - | MLE batch est |
| SLAM | camera | several poses and ldmks pos | landmarks’ projections in several pixel frames | Yes | - | EKF filtering / MAP batch est |
| Pose graph optimization | - | several poses | relative poses | No | - | MLE batch est |
  • The state prior comes from motion measurements, which usually come from IMU sensors
    • If there is a state prior, the batch estimation is MAP
    • If there is no state prior, the batch estimation becomes MLE, and there are no filtering methods
  • States at several times
    • Filtering and batch estimation need states at several times
    • Single-time applications do not have filtering or batch estimation methods, e.g. point cloud registration and frontend frame tracking
  • Lidar vs Camera
    • Lidar gives direct, linear measurements of the landmark positions
    • A camera gives nonlinear measurements of the landmark positions

SLAM System Equations

A SLAM system is an NLNG system; its control and observation equations are as follows.

$$
\begin{aligned}
& x_k = f(x_{k-1}, u_k) + w_k \\
& z_{k,j} = h(x_k, y_j) + v_{k,j}
\end{aligned}
$$

where

  • $x_k \in \mathbb{R}^6, k=1,\cdots,N$ is the 6DoF camera pose in Lie algebra form
  • $y_j \in \mathbb{R}^3, j=1,\cdots,P$ is the 3D mappoint position in the world frame
  • $z_{k,j} \in \mathbb{R}^2$ is the 2D feature point position in the pixel frame for mappoint $j$ observed from camera pose $k$
  • $u_k$ is the motion data measured by the IMU
  • $w_k$ is the control noise, assumed Gaussian: $w_k \sim \mathcal{N}(0, R_k)$
  • $v_{k,j}$ is the observation noise, assumed Gaussian: $v_{k,j} \sim \mathcal{N}(0, Q_{k,j})$

SLAM State Estimation

The problem is

  • Given $z_{k,j}, u_k$
  • Estimate $x_k, y_j$

According to Bayes’ rule, the estimate of $x_k, y_j$ can be written as
$$
\begin{split}
(x_k, y_j)^*
&= \arg \max P(x_k, y_j|z_{k,j}, u_k) \\
&= \arg \max \frac{P(z_{k,j}, u_k| x_k, y_j)P(x_k, y_j)}{P(z_{k,j}, u_k)} \\
&= \arg \max P(z_{k,j}, u_k| x_k, y_j)P(x_k, y_j)
\end{split}
$$
Note that $P(z_{k,j}, u_k)$ is a constant with respect to the states.

Based on whether the prior term $P(x_k, y_j)$ is known, which is equivalent to whether the control signal $u_k$ is known, there are several methods for SLAM state estimation.

| | MAP (Maximum A Posteriori) | MLE (Maximum Likelihood Estimation) |
|---|---|---|
| Recursive method | KF, EKF, etc. | - |
| Batch estimation | Bundle adjustment in visual-inertial SLAM; marginalization in sliding window BA | BA in visual SLAM |

MLE Estimation

If we don’t know the prior term and the control signal (e.g. in visual SLAM, we only have the observations), the estimation turns to MLE
$$
(x_k, y_j)^* = \arg \max P(z_{k,j}, u_k| x_k, y_j)P(x_k, y_j) \approx \arg \max P(z_{k,j}| x_k, y_j)
$$

Since $z_{k,j} = h(x_k, y_j) + v_{k,j}$ and we assume the observation noise $v_{k,j}$ is Gaussian, we have $z_{k,j} | x_k, y_j \sim \mathcal{N}(h(x_k, y_j), Q_{k,j})$, so the estimation becomes
$$
\begin{split}
(x_k, y_j)^*
&= \arg \max \mathcal{N}(h(x_k, y_j), Q_{k,j}) \\
&= \arg \max \frac{1}{\sqrt{(2\pi)^k |Q_{k,j}|}} \exp\left(-\frac{1}{2} (z_{k,j}-h(x_k, y_j))^T Q_{k,j}^{-1} (z_{k,j}-h(x_k, y_j)) \right) \\
&= \arg \min -\ln\left( \exp\left(-\frac{1}{2} (z_{k,j}-h(x_k, y_j))^T Q_{k,j}^{-1} (z_{k,j}-h(x_k, y_j)) \right) \right) \\
&= \arg \min \left( (z_{k,j}-h(x_k, y_j))^T Q_{k,j}^{-1} (z_{k,j}-h(x_k, y_j)) \right)
\end{split}
$$
which is a least squares problem; $(x_k, y_j)^*$ can be solved by iterative methods. A small numerical check of this Gaussian / least squares equivalence follows.
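
Below is a minimal sketch of that check (a hypothetical observation model `h` and made-up numbers): the Gaussian negative log-likelihood and half the weighted squared error differ only by a constant, so they share the same minimizer.

```python
import numpy as np

# Hypothetical 2D observation model h(x) and covariance Q (illustration only)
def h(x):
    return np.array([x[0] + x[1], x[0] * x[1]])

Q = np.diag([0.1, 0.2])
Q_inv = np.linalg.inv(Q)
z = np.array([1.2, 0.3])

def neg_log_likelihood(x):
    r = z - h(x)
    # -ln N(z; h(x), Q) = const + 0.5 * r^T Q^{-1} r
    return 0.5 * np.log((2 * np.pi) ** 2 * np.linalg.det(Q)) + 0.5 * r @ Q_inv @ r

def weighted_sq_error(x):
    r = z - h(x)
    return r @ Q_inv @ r

x = np.array([0.9, 0.4])
const = 0.5 * np.log((2 * np.pi) ** 2 * np.linalg.det(Q))
print(neg_log_likelihood(x) - const, 0.5 * weighted_sq_error(x))   # identical values
```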

MLE is usually used in its batch estimation form
$$
\begin{split}
(x, y)^* \approx \arg \max P(z| x, y)
\end{split}
$$
where $x=\{x_1, \cdots, x_N\}$, $y=\{y_1,\cdots,y_P\}$ collect the variables over the whole trajectory. The estimation solves for $(x, y)^*$ by maximizing the probability of all observations $P(z | x, y)$ at once.

Similarly, the batch MLE estimation can be converted into a least squares problem as follows.
$$
\begin{split}
(x, y)^*
&\approx \arg \max P(z| x, y) = \arg \max \prod_{k,j} P(z_{k,j}| x_k, y_j) \\
&= \arg \min \sum_{k, j} \left( (z_{k,j}-h(x_k, y_j))^T Q_{k, j}^{-1} (z_{k,j}-h(x_k, y_j)) \right)
\end{split}
$$

This is the form of minimizing the reprojection error in visual SLAM bundle adjustment; a sketch of this objective follows.
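
A minimal sketch of the batch MLE objective: it sums the weighted reprojection error over all $(k, j)$ observations. `project` is a stand-in for the camera model $h(x_k, y_j)$, and the data layout is an assumption for illustration.

```python
import numpy as np

def project(pose, point):
    # Stand-in for h(x_k, y_j); a real implementation would apply the camera model
    return point[:2] + pose[:2]

def batch_mle_cost(poses, points, observations, Q_inv):
    """observations: list of (k, j, z_kj); Q_inv: 2x2 information matrix of v_{k,j}."""
    cost = 0.0
    for k, j, z in observations:
        e = z - project(poses[k], points[j])   # reprojection error e_{k,j}
        cost += e @ Q_inv @ e                  # e^T Q^{-1} e
    return cost
```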

MAP Estimation

If we know the prior probability, the estimation is as follows.

For recursive method
$$
(x_k, y_j)^* = \arg \max P(z_{k,j}, u_k| x_k, y_j)P(x_k, y_j)
$$
We also need to assume the Markov property of the system and only incrementally update the state vector; see the filters section below for more details.

For batch estimation
$$
\begin{split}
(x, y)^*
&= \arg \max P(z, u| x, y) = \arg \max \prod_k P(u_k| x_{k-1}, x_k) \prod_{k,j} P(z_{k,j}| x_k, y_j) \\
&= \arg \min \left( \sum_k \left( (x_k - f(x_{k-1}, u_k))^T R_k^{-1} (x_k - f(x_{k-1}, u_k)) \right) \right. \\
&\left. + \sum_{k, j} \left( (z_{k,j}-h(x_k, y_j))^T Q_{k, j}^{-1} (z_{k,j}-h(x_k, y_j)) \right) \right)
\end{split}
$$
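
A sketch of this MAP batch objective under the same assumptions as the previous sketch: it adds motion residuals weighted by $R_k^{-1}$ to the reprojection residuals weighted by $Q_{k,j}^{-1}$ (`motion_model` and `project` are placeholder models).

```python
import numpy as np

def motion_model(x_prev, u):
    # Stand-in for f(x_{k-1}, u_k); a real implementation would integrate the motion
    return x_prev + u

def project(pose, point):
    # Stand-in for h(x_k, y_j)
    return point[:2] + pose[:2]

def map_batch_cost(poses, points, controls, observations, R_inv, Q_inv):
    cost = 0.0
    for k in range(1, len(poses)):                       # motion (prior) terms
        r = poses[k] - motion_model(poses[k - 1], controls[k])
        cost += r @ R_inv @ r
    for k, j, z in observations:                         # observation terms
        e = z - project(poses[k], points[j])
        cost += e @ Q_inv @ e
    return cost
```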

Recursive method: filters

For the MAP estimation of the system, if we assume the Markov property, meaning that the current state only depends on the immediately preceding state, we obtain the filtering approach.

Consider the state at time $k$. Denote all observations at time $k$ as $z_k$, and stack the pose and landmarks into $x_k = \{x_k, y_1, \cdots, y_m\}$. Then

$$
\begin{split}
P(x_k | x_0, u_{1:k}, z_{1:k})
&\propto P(z_k | x_k) P(x_k | x_0, u_{1:k}, z_{1:k-1}) \\
&= P(z_k | x_k) \int P(x_k | x_{k-1}, x_0, u_{1:k}, z_{1:k-1}) P(x_{k-1} | x_0, u_{1:k}, z_{1:k-1}) \mathrm{d} x_{k-1} \\
&= P(z_k | x_k) \int P(x_k | x_{k-1}, u_k) P(x_{k-1} | x_0, u_{1:k-1}, z_{1:k-1}) \mathrm{d} x_{k-1}
\end{split}
$$

Here $P(x_{k-1} | x_0, u_{1:k-1}, z_{1:k-1})$ is the posterior distribution at time $k-1$. Therefore, $P(x_k | x_0, u_{1:k}, z_{1:k-1})$ is obtained by propagating the state at time $k-1$ through the control equation.

Kalman Filter (for LG system)

For an LG system, if we ignore the landmarks, the equations are
$$
\begin{aligned}
& x_k = A_k x_{k-1} + u_k + w_k \\
& z_k = C_k x_k + v_k
\end{aligned}
$$
where $w_k \sim \mathcal{N}(0, R_k)$, $v_k \sim \mathcal{N}(0, Q_k)$

Then we can use the Kalman filter to estimate the state $x_k$; a minimal numerical sketch follows the steps below.

  • Predict
    with
    $$
    P(x_k | x_0, u_{1:k}, z_{1:k-1}) = \mathcal{N} (\check{x}_k, \check{P}_k) = \mathcal{N}(A_k \hat{x}_{k-1} + u_k, A_k \hat{P}_{k-1} A_k^T + R_k)
    $$
    then
    $$
    \check{x}_{k} = A_k \hat{x}_{k-1} + u_k, \quad \check{P}_k = A_k \hat{P}_{k-1} A_k^T + R_k
    $$
  • Update (correction)
    with
    $$
    \begin{aligned}
    & P(z_k | x_k) = \mathcal{N} (C_k x_k, Q_k) \\
    & \mathcal{N} (\hat{x}_k, \hat{P}_k) = \eta \mathcal{N} (C_k x_k, Q_k) \mathcal{N} (\check{x}_k, \check{P}_k)
    \end{aligned}
    $$
    then
    $$
    \hat{x}_k = \check{x}_k + K(z_k - C_k \check{x}_k), \quad
    \hat{P}_k = (I - KC_k) \check{P}_k
    $$
    where $K = \check{P}_k C_k^T (C_k \check{P}_k C_k^T + Q_k)^{-1}$ is the Kalman gain
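
A minimal numpy sketch of one predict/update cycle using the equations above; the matrices $A_k$, $C_k$, $R_k$, $Q_k$ and the previous posterior $(\hat{x}_{k-1}, \hat{P}_{k-1})$ are supplied by the caller.

```python
import numpy as np

def kf_step(x_post, P_post, u, z, A, C, R, Q):
    """One Kalman filter cycle for x_k = A x_{k-1} + u_k + w_k, z_k = C x_k + v_k."""
    # Predict
    x_prior = A @ x_post + u
    P_prior = A @ P_post @ A.T + R
    # Update (correction)
    K = P_prior @ C.T @ np.linalg.inv(C @ P_prior @ C.T + Q)     # Kalman gain
    x_new = x_prior + K @ (z - C @ x_prior)
    P_new = (np.eye(len(x_post)) - K @ C) @ P_prior
    return x_new, P_new
```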

Extended Kalman Filter (for NLNG system)

For an NLNG system, if we ignore the landmarks, the equations are
$$
\begin{aligned}
& x_k = f(x_{k-1}, u_k) + w_k \\
& z_k = h(x_k) + v_{k}
\end{aligned}
$$
where $w_k \sim \mathcal{N}(0, R_k)$, $v_k \sim \mathcal{N}(0, Q_k)$

The EKF linearizes the system and observation equations using a Taylor expansion, so the system equations become
$$
\begin{aligned}
& x_k \approx f(\hat{x}_{k-1}, u_k) +
\frac{\partial f (\hat{x}_{k-1})}{\partial x^T} (x_{k-1} - \hat{x}_{k-1}) + w_k \\
& z_k \approx h(\check{x}_k) +
\frac{\partial h(\check{x}_k)}{\partial x^T} (x_k - \check{x}_k) + v_k
\end{aligned}
$$

Define $F_k = \frac{\partial f (\hat{x}_{k-1})}{\partial x^T}, H_k = \frac{\partial h(\check{x}_k)}{\partial x^T}$; note that the two Jacobians are evaluated at different points.

Then this nonlinear system is converted into a linear one, and the estimation steps are as follows.

  • Predict
    $$
    \check{x}_k = f(\hat{x}_{k-1}, u_k), \quad \check{P}_k = F_k \hat{P}_{k-1} F_k^T + R_k
    $$
  • Update (correction)
    $$
    \hat{x}_k = \check{x}_k + K_k (z_k - h(\check{x}_k)), \quad \hat{P}_k = (I - K_k H_k) \check{P}_k
    $$
    with $K_k = \check{P}_k H_k^T (H_k \check{P}_k H_k^T + Q_k )^{-1} $

Different from the KF, the EKF needs to recompute $F_k$ and $H_k$ at each step, as in the sketch below.
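
A minimal sketch of one EKF cycle; the motion model `f`, observation model `h`, and their Jacobians `F_jac`, `H_jac` are passed in as callables (placeholders for an actual system).

```python
import numpy as np

def ekf_step(x_post, P_post, u, z, f, h, F_jac, H_jac, R, Q):
    """One EKF cycle for x_k = f(x_{k-1}, u_k) + w_k, z_k = h(x_k) + v_k."""
    # Predict: linearize f at the previous posterior mean
    F = F_jac(x_post, u)
    x_prior = f(x_post, u)
    P_prior = F @ P_post @ F.T + R
    # Update: linearize h at the predicted (prior) mean
    H = H_jac(x_prior)
    K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + Q)
    x_new = x_prior + K @ (z - h(x_prior))
    P_new = (np.eye(len(x_post)) - K @ H) @ P_prior
    return x_new, P_new
```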

Batch estimation: least squares problem

Problem Construction

  • The batch estimation target, whether MLE or MAP, is eventually converted into a least squares problem.
  • Minimizing the sum of squared errors is equivalent to maximizing the log likelihood of independent Gaussian distributions. This is how least squares connects with probabilities.
  • Least squares is often used to estimate model parameters given observations. Usually there are more equations than unknowns, making the system over-determined; an approximate solution for the parameters can then be obtained by least squares.

Given the observation equations $ z_{k,j} = h(x_k, y_j) + v_{k,j} $, the least squares problem is constructed as

$$
\begin{split}
\mathbf{x}^{*}
&= \arg\min_\mathbf{x} F(\mathbf{x}) \\
&= \arg\min_\mathbf{x} \sum_{k, j} e(\mathbf{x}_{k, j}) \\
&= \arg\min_\mathbf{x} \sum_{k, j} \mathbf{e}(\mathbf{x}_{k, j})^T Q_{k,j}^{-1} \mathbf{e}(\mathbf{x}_{k, j}) \\
&= \arg\min_\mathbf{x} \sum_{k, j} (z_{k,j}-h(x_k, y_j))^T Q_{k, j}^{-1} (z_{k,j}-h(x_k, y_j))
\end{split}
$$
where

  • $\mathbf{x}_{k, j} \in \mathbb{R}^{n+p}$ is the stack of system parameters $x_k$ and environment parameters $y_j$, which is
    $$
    \mathbf{x}_{k, j} = \begin{bmatrix}
    x_k \\
    y_j
    \end{bmatrix}
    $$
  • $\mathbf{x} \in \mathbb{R}^{Nn + Pp} $ is the stack of all state vectors. Usually we put all pose elements together and all mappoint elements together, which is
    $$
    \mathbf{x} = \begin{bmatrix}
    x_1 \\
    \vdots \\
    x_N \\
    y_1 \\
    \vdots \\
    y_P
    \end{bmatrix}
    $$
  • $F(\mathbf{x}) \in \mathbb{R}$ is a scalar, the sum of squared errors
  • $e(\mathbf{x}_{k, j}) \in \mathbb{R}$ is a scalar, the squared error for the observation of state $k$ on environment $j$
  • $\mathbf{e}(\mathbf{x}_{k, j}) \in \mathbb{R}^{m \times 1}$ is a vector, the error function for the observation of state $k$ on environment $j$

More explanations

  • This is a least squares problem in $\mathbf{e}(\mathbf{x}_{k, j})$ rather than in $\mathbf{x}_{k,j}$ directly
  • To solve this least squares problem, a theoretical approach is to take the derivative of the global error function and find its zero. Since the global error function has no maximum (it can grow without bound), that stationary point is the minimum. This is practical for a linear least squares problem (see the linear equations section), which can be solved in a single step.
  • For a nonlinear least squares problem, the derivative is too complex to solve directly, so we need iterative local linearization methods.

Iterative Local Linearization methods

An iterative local linearization method starts from an initial guess of the state vector and then iteratively computes a delta for the state vector so that $F(\mathbf{x})$ reaches a local minimum. Specifically, it has the following steps.

  1. Start from a good initial value $\mathbf{x}_0$; note this is the initial value of all elements of the state vector
  2. Use a Taylor expansion to get a quadratic approximation of $F(\mathbf{x})$ at the point $\mathbf{x_0}$
    • This is an approximation of $F(\mathbf{x})$ around the point $\mathbf{x}_0$
    • It is equivalent to a quadratic approximation of $F(\mathbf{x_0 + \Delta x})$ at $\mathbf{\Delta x} = \mathbf{0}$, which is a quadratic function of $\mathbf{\Delta x}$
  3. Compute the first derivative of the quadratic approximation of $F(\mathbf{x_0+\Delta x})$ with respect to $\mathbf{\Delta x}$
  4. Set the first derivative to zero and solve for $\mathbf{\Delta x}^*$, which makes $F(\mathbf{x_0+\Delta x})$ reach its minimum around the point $\mathbf{x}_0$
  5. Obtain the new state by $\mathbf{x}_1=\mathbf{x_0+\Delta x^*}$
  6. Repeat steps 2-5 until $F(\mathbf{x})$ stops decreasing or a pre-defined number of iterations is reached

Since $F(\mathbf{x})$ is built from $\mathbf{e}(\mathbf{x}_{k, j})$, there are different ways to construct the quadratic approximation of $F(\mathbf{x_k + \Delta x})$; note that $\mathbf{x_k}$ here denotes the fixed current iterate of the full state vector.

Gradient Descent

  • Applying the Taylor expansion to $F(\mathbf{x_k + \Delta x})$, we get
    $$
    F(\mathbf{x_k + \Delta x}) = F(\mathbf{x_k}) + \mathbf{J}^T \mathbf{\Delta x} + \frac{1}{2} \mathbf{\Delta x}^T \mathbf{H} \mathbf{\Delta x} + \cdots
    $$
  • Keeping only the linear term, we get
    $$
    F(\mathbf{x_k + \Delta x}) \approx F(\mathbf{x_k}) + \mathbf{J}^T \mathbf{\Delta x}
    $$
    with
    $$
    \mathbf{J} = \frac{\mathrm{d} F(\mathbf{x})}{\mathrm{d} \mathbf{x}} |_{\mathbf{x_k}} \in \mathbb{R}^{n \times 1}
    $$
  • Then if $\mathbf{\Delta x} = -\lambda \mathbf{J}$, we have $F(\mathbf{x_k + \Delta x}) \approx F(\mathbf{x_k}) - \lambda \mathbf{J}^T \mathbf{J} < F(\mathbf{x_k})$, so we choose
    $$ \mathbf{\Delta x}^{*} = - \lambda \mathbf{J} $$
    The step length $\lambda$ is a positive scalar obtained from a line search or set to a constant value
  • Update the state by $\mathbf{x_{k+1}} \leftarrow \mathbf{x_k} + \mathbf{\Delta x}^{*}$

Pros

  • This method is robust

Cons

  • Requires searching for the step length; setting it to a constant value may not always work
  • Inefficient when the estimate is close to the optimum, i.e. it needs many iterations. This is because the updates follow a zigzag path (the previous step direction is orthogonal to the current gradient, and thus to the current direction)
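
A minimal sketch of the gradient-descent update described above, using a fixed step length instead of a line search; the cost `F`, its gradient `J`, and the toy quadratic example are all illustrative assumptions.

```python
import numpy as np

def gradient_descent(F, J, x0, step=0.1, iters=100, tol=1e-10):
    """Minimize F(x) by repeatedly stepping along -J(x), i.e. Delta x* = -lambda * J."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_new = x - step * J(x)
        if abs(F(x) - F(x_new)) < tol:     # stop when the cost no longer decreases
            return x_new
        x = x_new
    return x

# Toy example: minimize F(x) = (x0 - 1)^2 + (x1 - 2)^2
F = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2
J = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
print(gradient_descent(F, J, [0.0, 0.0]))    # converges towards [1, 2]
```
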
Newton's Method

  • Applying the Taylor expansion to $F(\mathbf{x_k + \Delta x})$, we get
    $$
    F(\mathbf{x_k + \Delta x}) = F(\mathbf{x_k}) + \mathbf{J}^T \mathbf{\Delta x} + \frac{1}{2} \mathbf{\Delta x}^T \mathbf{H} \mathbf{\Delta x} + \cdots
    $$
  • Keeping the quadratic term as well, we get
    $$
    F(\mathbf{x_k + \Delta x}) \approx F(\mathbf{x_k}) + \mathbf{J}^T \mathbf{\Delta x} + \frac{1}{2} \mathbf{\Delta x}^T \mathbf{H} \mathbf{\Delta x}
    $$
    with
    $$
    \mathbf{J} = \frac{\mathrm{d} F(\mathbf{x})}{\mathrm{d} \mathbf{x}} |_{\mathbf{x_k}} \in \mathbb{R}^{n\times 1} \quad
    \mathbf{H} = \frac{\mathrm{d}^2 F(\mathbf{x})}{\mathrm{d} \mathbf{x} \mathrm{d} \mathbf{x}^{T}} |_{\mathbf{x_k}} \in \mathbb{R}^{n \times n}
    $$
  • Computing the first derivative with respect to $\mathbf{\Delta x}$ and setting it to zero, we get
    $$
    \begin{split}
    & \frac{\mathrm{d} F(\mathbf{x_k + \Delta x})}{\mathrm{d} \mathbf{\Delta x}} \approx \mathbf{J} + \mathbf{H \Delta x} = \mathbf{0} \\
    \rightarrow & \mathbf{\Delta x}^{*} = -\mathbf{H}^{-1}\mathbf{J}
    \end{split}
    $$
  • Update the state by $\mathbf{x_{k+1}} \leftarrow \mathbf{x_k} + \mathbf{\Delta x}^{*}$

Pros

  • This method requires fewer steps to converge

Cons

  • The Hessian matrix $\mathbf{H}$ and its inverse are very expensive to compute, especially for a high-dimensional optimization target
  • The inverse of the Hessian matrix may not exist
  • The Hessian matrix may not be positive definite, so the computed step may not be a descent direction
  • The Taylor expansion is only valid in a local neighborhood, but the optimization step may be so large that the approximation becomes invalid
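
A sketch of the Newton step $\mathbf{\Delta x}^* = -\mathbf{H}^{-1}\mathbf{J}$, with the gradient and Hessian supplied as callables; on the toy quadratic below the exact minimum is reached in one step.

```python
import numpy as np

def newton_method(J, H, x0, iters=20, tol=1e-10):
    """Newton's method: solve H * delta = -J at each iterate instead of inverting H."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        delta = np.linalg.solve(H(x), -J(x))
        x = x + delta
        if np.linalg.norm(delta) < tol:
            break
    return x

# Same toy quadratic as before: gradient and (constant) Hessian
J = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
H = lambda x: np.array([[2.0, 0.0], [0.0, 2.0]])
print(newton_method(J, H, [0.0, 0.0]))    # reaches [1, 2] after a single step
```
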
Gauss-Newton

  • Applying the Taylor expansion to $\mathbf{e}(\mathbf{x}_{k, j} + \mathbf{\Delta x}_{k, j})$, we get
    $$
    \mathbf{e}(\mathbf{x}_{k, j} + \mathbf{\Delta x}_{k, j}) = \mathbf{e} (\mathbf{x}_{k, j}) + \mathbf{J}_{k, j} \mathbf{\Delta x}_{k, j} + \frac{1}{2} \mathbf{\Delta x}_{k, j}^T \mathbf{H}_{k, j} \mathbf{\Delta x}_{k, j} + \cdots
    $$
  • Keeping only the linear term, we get
    $$
    \mathbf{e}(\mathbf{x}_{k, j} + \mathbf{\Delta x}_{k, j}) \approx \mathbf{e} (\mathbf{x}_{k, j}) + \mathbf{J}_{k, j} \mathbf{\Delta x}_{k, j}
    $$
    with

$$
\mathbf{J}_{k, j} = \frac{\mathrm{d} \mathbf{e}(\mathbf{x}_{k, j})}{\mathrm{d} \mathbf{x}_{k,j}^T} |_{\mathbf{x}_{k,j}} \in \mathbb{R}^{m \times (n+p)}
$$

  • Then we get the quadratic approximation of $F(\mathbf{x_k + \Delta x})$ as
    $$
    \begin{split}
    e(\mathbf{x}_{k, j}+ \mathbf{\Delta x}_{k,j})
    & = \mathbf{e}(\mathbf{x}_{k, j} + \mathbf{\Delta x}_{k, j})^T Q_{k,j}^{-1} \mathbf{e}(\mathbf{x}_{k, j} + \mathbf{\Delta x}_{k, j}) \\
    & \approx (\mathbf{e} (\mathbf{x}_{k, j}) + \mathbf{J}_{k, j} \mathbf{\Delta x}_{k, j})^T Q_{k,j}^{-1} (\mathbf{e} (\mathbf{x}_{k, j}) + \mathbf{J}_{k, j} \mathbf{\Delta x}_{k, j}) \\
    & = \mathbf{e}_{k,j}^T Q_{k,j}^{-1} \mathbf{e}_{k,j} + 2 \mathbf{e}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j} \mathbf{\Delta x}_{k,j} + \mathbf{\Delta x}_{k,j}^T \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j} \mathbf{\Delta x}_{k,j}
    \end{split}
    $$

$$
\begin{split}
F(\mathbf{x_k + \Delta x})
& \approx \sum_{k,j} (\mathbf{e}_{k,j}^T Q_{k,j}^{-1} \mathbf{e}_{k,j} + 2 \mathbf{e}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j} \mathbf{\Delta x}_{k,j} + \mathbf{\Delta x}_{k,j}^T \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j} \mathbf{\Delta x}_{k,j}) \\
& = F(\mathbf{x_k}) + 2 (\sum_{k,j} \mathbf{e}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}) \mathbf{\Delta x} + \mathbf{\Delta x}^T (\sum_{k, j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}) \mathbf{\Delta x}
\end{split}
$$
Note that here we gather the per-observation increments into the full $\mathbf{\Delta x}$; the corresponding $\mathbf{J}_{k,j}$ are padded with zeros for the state elements that observation $(k,j)$ does not involve

  • Computing the first derivative with respect to $\mathbf{\Delta x}$ and setting it to zero, we get
    $$
    \begin{split}
    \frac{\mathrm{d} F(\mathbf{x_k + \Delta x})}{\mathrm{d} \mathbf{\Delta x}}
    & \approx \frac{\mathrm{d} \left(F(\mathbf{x_k}) + 2 (\sum_{k,j} \mathbf{e}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}) \mathbf{\Delta x} + \mathbf{\Delta x}^T (\sum_{k, j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}) \mathbf{\Delta x}
    \right)}{\mathrm{d} \mathbf{\Delta x}} \\
    & = 2\sum_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{e}_{k,j} + 2 (\sum_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}) \mathbf{\Delta x} \\
    & = \mathbf{0} \\
    \rightarrow & \mathbf{\Delta x}^{*} = -(\sum_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j})^{-1} \sum_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{e}_{k,j}
    \end{split}
    $$
  • Update the state by $\mathbf{x_{k+1}} \leftarrow \mathbf{x_k} + \mathbf{\Delta x}^{*}$

The whole iterative process repeats this update until convergence; a sketch of one Gauss-Newton step is given after the pros and cons below.

Pros

  • Uses $\sum\limits_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}$ to approximate the Hessian matrix, avoiding the cost of computing the true Hessian

Cons

  • $\sum\limits_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}$ is still not guaranteed to be invertible
  • The approximate Hessian may not be positive definite, so the computed step may not be a descent direction
  • The Taylor expansion is only valid in a local neighborhood, but the optimization step may be so large that the approximation becomes invalid
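
A sketch of one Gauss-Newton step for the weighted least squares problem above. Each observation contributes a residual $\mathbf{e}_{k,j}$, a Jacobian $\mathbf{J}_{k,j}$ (already padded to the full state dimension), and a weight $Q_{k,j}^{-1}$; how these are produced is left to the caller.

```python
import numpy as np

def gauss_newton_step(dim_x, residuals):
    """residuals: iterable of (e, J, Q_inv) with e (m,), J (m, dim_x), Q_inv (m, m).
    Returns Delta x* = -(sum J^T Q^-1 J)^-1 * (sum J^T Q^-1 e)."""
    H = np.zeros((dim_x, dim_x))    # approximate Hessian: sum J^T Q^{-1} J
    b = np.zeros(dim_x)             # gradient term:       sum J^T Q^{-1} e
    for e, J, Q_inv in residuals:
        H += J.T @ Q_inv @ J
        b += J.T @ Q_inv @ e
    return np.linalg.solve(H, -b)   # solve the normal equations
```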

Levenberg-Marquardt

There are several ways to interpret the Levenberg-Marquardt method. Here we take the view that it combines Gauss-Newton and Gradient Descent.

  • One problem of the Gauss-Newton method is that when $\Delta x$ is large, the Taylor approximation may no longer be valid, so the L-M method adds the size of $\Delta x$ to the objective.
    $$
    \begin{split}
    \Delta x^{*}
    & = \arg\min_{\mathbf{\Delta x}} \left( F(\mathbf{x_k + \Delta x}) + \lambda \mathbf{\Delta x}^T \mathbf{\Delta x} \right) \\
    & \approx \arg\min_{\mathbf{\Delta x}} \left( F(\mathbf{x_k}) + 2 \mathbf{b}^T \mathbf{\Delta x} + \mathbf{\Delta x}^T \mathbf{H} \mathbf{\Delta x} + \lambda \mathbf{\Delta x}^T \mathbf{\Delta x} \right)
    \end{split}
    $$
    where $\lambda > 0$ is a scalar whose initial value can be set to $\lambda = \max_i (a_{ii})$, the largest diagonal element of $A = J(x_0)^TJ(x_0)$. Here $\mathbf{b} = \sum_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{e}_{k,j}$ and $\mathbf{H} = \sum_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j}$ are taken from the Gauss-Newton derivation above.
  • Compute the minimizer of the quadratic form by setting its first derivative to zero
    $$
    \begin{split}
    \mathbf{0}
    & = \frac{\mathrm{d} (F(\mathbf{x_k}) + 2 \mathbf{b}^T \mathbf{\Delta x} + \mathbf{\Delta x}^T \mathbf{H} \mathbf{\Delta x} + \lambda \mathbf{\Delta x}^T \mathbf{\Delta x})}{\mathrm{d} \mathbf{\Delta x}} \\
    & = 2\mathbf{b} + 2\mathbf{H\Delta x} + 2 \lambda \mathbf{\Delta x} \\
    \rightarrow & \mathbf{\Delta x}^{*} = -(\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{b}
    \end{split}
    $$
  • Compute $\rho$ to measure the validity of the linearization
    $$
    \rho = \frac{F(\mathbf{x_k + \Delta x}) - F(\mathbf{x_k})}{\mathbf{J}^T \mathbf{\Delta x}}
    $$
    • If $\rho < \frac{1}{4}$, the actual decrease is much smaller than the decrease predicted by the linear model, i.e. the approximation is poor. We reject the step, increase $\lambda$ (e.g. $\lambda \leftarrow 2\lambda$), and compute $\Delta x^*$ again; with a larger $\lambda$ the method behaves more like Gradient Descent.
    • If $\rho > \frac{3}{4}$, the approximation is good. We accept the computed $\Delta x^*$ and decrease $\lambda$ (e.g. $\lambda \leftarrow 0.5\lambda$) for the next step to make the updates more efficient; with a smaller $\lambda$ the method behaves more like Gauss-Newton.
  • Update the state by $\mathbf{x_{k+1}} \leftarrow \mathbf{x_k} + \mathbf{\Delta x}^{*}$

Pros

  • Uses $\sum\limits_{k,j} \mathbf{J}_{k,j}^T Q_{k,j}^{-1} \mathbf{J}_{k,j} + \lambda I$ to approximate the Hessian matrix, avoiding the cost of computing the true Hessian; moreover, $\lambda > 0$ ensures the matrix is always invertible and positive definite.
  • Adapting $\lambda$ lets the method fall back to Gradient Descent when the linearization is poor (for robustness) and approach Gauss-Newton when the linearization is good (for efficiency).
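
A sketch of the $\lambda$-adaptation loop described above. `cost_and_linearization` is a placeholder that returns $F(\mathbf{x})$, the stacked gradient term $\mathbf{b}$, and the approximate Hessian $\mathbf{H}$ at $\mathbf{x}$; the thresholds 1/4 and 3/4 and the factors 2 and 0.5 follow the text.

```python
import numpy as np

def levenberg_marquardt(x, cost_and_linearization, lam=1e-3, iters=50):
    """cost_and_linearization(x) -> (F, b, H), with b = sum J^T Q^-1 e, H = sum J^T Q^-1 J."""
    F, b, H = cost_and_linearization(x)
    for _ in range(iters):
        delta = np.linalg.solve(H + lam * np.eye(len(x)), -b)   # damped step
        F_new, b_new, H_new = cost_and_linearization(x + delta)
        # rho = actual decrease / decrease predicted by the linear model (gradient of F is 2b)
        rho = (F_new - F) / (2.0 * b @ delta)
        if rho < 0.25:        # poor approximation: reject the step, move towards Gradient Descent
            lam *= 2.0
            continue
        x, F, b, H = x + delta, F_new, b_new, H_new   # accept the step
        if rho > 0.75:        # good approximation: move towards Gauss-Newton
            lam *= 0.5
    return x
```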

Matrix Form

If we stack all the error terms together and use the state vector $\mathbf{x}$ that gathers all parameters, we get the matrix form of the least squares problem:
$$
\mathbf{x}^\star = \arg\min_\mathbf{x} \mathbf{e}(\mathbf{x})^T \Omega \mathbf{e}(\mathbf{x})
$$
where $\Omega$ is the block-diagonal matrix built from the weights $Q_{k,j}^{-1}$, and
$$
\mathbf{e}(\mathbf{x}) = \begin{bmatrix}
\mathbf{e}(\mathbf{x}_{1, 1}) \\
\vdots \\
\mathbf{e}(\mathbf{x}_{k, j}) \\
\vdots \\
\mathbf{e}(\mathbf{x}_{N, P})
\end{bmatrix} \in \mathbb{R}^{Lm \times 1}
$$
Note that $L \leq NP$ since not every state observes every landmark.

Then we can still use the methods above to solve it iteratively. For example, if we want to use Gauss-Newton, the Jacobian of $\mathbf{e}(\mathbf{x})$ is
$$
\mathbf{J} =
\frac{\mathrm{d} \mathbf{e}(\mathbf{x})}{\mathrm{d} \mathbf{x}} =
\begin{bmatrix}
\mathbf{J}_{11}^{(0)} & \cdots & 0 & \mathbf{J}_{11}^{(1)} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\cdots & \mathbf{J}_{k,j}^{(0)} & \cdots & \cdots & \mathbf{J}_{k,j}^{(1)} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & \cdots & \mathbf{J}_{N, P}^{(0)} & \cdots & \cdots & \mathbf{J}_{N, P}^{(1)}
\end{bmatrix} \in \mathbb{R}^{Lm \times (Nn+Pp)}
$$
where

  • $\mathbf{J}_{k,j}^{(0)}$ is the Jacobian with respect to the pose part (derivative of the error term with respect to the camera pose)
  • $\mathbf{J}_{k,j}^{(1)}$ is the Jacobian with respect to the mappoint part (derivative of the error term with respect to the mappoint position)

Refer to this post for the concrete form used to solve SLAM bundle adjustment.
