Photogrammetric Computer Vision

This post contains my notes on Cyrill Stachniss’s Photogrammetric Computer Vision course. It covers visual features, RANSAC, extrinsics and intrinsics, the fundamental and essential matrices, and more.

Overview of Concepts

| Concept | Given | Goal |
| --- | --- | --- |
| Calibration | points in world and image coordinates | estimate the extrinsics and intrinsics |
| P3P | points in world and image coordinates, intrinsics | estimate the extrinsics |
| Fundamental / Essential matrix | corresponding points from 2 images | compute the matrix and estimate the relative pose of the 2 cameras |
| Epipolar geometry | relative pose of 2 cameras, points in the first image | estimate the positions of the corresponding points in the second image |
| Triangulation | relative pose of 2 cameras, corresponding points in both images | estimate the 3D positions of the points (relative to the camera or absolute) |
| Bundle Adjustment | intrinsics and extrinsics of all cameras, points in all images, all points in world coordinates | estimate the extrinsics of the cameras and all points in world coordinates |

Visual Features

  • Motivation
    • sparser representation of the real objects
    • build correspondences between two images
  • Visual features include the following two parts.
    • Keypoint: a locally distinct location in an image
    • Descriptor: summarizes the local structure (appearance) around the keypoint

Visual Features: Keypoints

Corners and Structure Matrix

  • Keypoints are usually corners or edges in the image, i.e., points with strong intensity changes. The SSD (sum of squared differences) measures the intensity change within a patch when it is shifted by $(\delta u, \delta v)$, where $W_{xy}$ is the local patch around $(x,y)$

$$
\begin{split}
f(x, y)
& = \sum_{(u,v)\in W_{xy}} [I(u, v) - I(u+\delta u, v+\delta v)]^2 \\
& \approx \sum_{(u,v)\in W_{xy}} [I(u, v) - I(u, v) - [J_x J_y][\delta u, \delta v]^T]^2 \\
& = \sum_{(u,v)\in W_{xy}}
\left[\begin{matrix}
\delta u \\
\delta v \\
\end{matrix}
\right]^T
\left[\begin{matrix}
\sum_W J_x^2 & \sum_W J_x J_y \\
\sum_W J_y J_x & \sum_W J_y^2 \\
\end{matrix}
\right]
\left[\begin{matrix}
\delta u \\
\delta v \\
\end{matrix}
\right]
\end{split}
$$

  • The matrix in the middle of the equation is the Structure Matrix, which encodes the changes of image intensities in a local patch. To compute $J_x, J_y$, the Scharr or Sobel operators ($D_x, D_y$, each a $3\times3$ kernel) are usually used.

$$
D_x = \frac{1}{32}
\left[\begin{matrix}
3 & 10 & 3 \\
0 & 0 & 0 \\
-3 & -10 & -3
\end{matrix}
\right]
$$

$$
D_y = \frac{1}{32}
\left[\begin{matrix}
3 & 0 & -3 \\
10 & 0 & -10 \\
3 & 0 & -3
\end{matrix}
\right]
$$

$$
D_x = \frac{1}{8}
\left[\begin{matrix}
1 & 2 & 1 \\
0 & 0 & 0 \\
-1 & -2 & -1
\end{matrix}
\right]
$$

$$
D_y = \frac{1}{8}
\left[\begin{matrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1
\end{matrix}
\right]
$$

  • For a corner, the ideal structure matrix should look like

$$
\begin{split}
M
&=
\left[\begin{matrix}
\sum_W (D_x * I)^2 & \sum_W (D_x * I)(D_y * I) \\
\sum_W (D_y * I)(D_x * I) & \sum_W (D_y * I)^2 \\
\end{matrix}
\right]
&=
\left[\begin{matrix}
\gg 1 & \approx 0 \\
\approx 0 & \gg 1 \\
\end{matrix}
\right]
\end{split}
$$

  • Several keypoint detectors are based on the structure matrix, e.g., Harris, Shi-Tomasi, and Förstner; they just use different criteria to decide whether a point is a corner.

Within a local region, look for the position with the maximum response ($R$ or $\lambda_{\min}$) and select that point.

  • Summary of extracting corners via the structure matrix

    Prepend a blur $B$ at the beginning to remove noise. A small sketch of the whole pipeline follows below.
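As a rough illustration (my own sketch, not code from the course, assuming SciPy is available), the following computes the structure-matrix entries and a Harris-style corner response; the kernels are the Sobel $D_x, D_y$ given above, and the window size and $k$ are arbitrary choices.

```python
import numpy as np
from scipy.signal import convolve2d

def harris_response(I, k=0.05, win=5):
    """Harris-style corner response R = det(M) - k * trace(M)^2 from the structure matrix M."""
    Dx = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float) / 8.0  # Sobel, as above
    Dy = Dx.T
    Jx = convolve2d(I, Dx, mode="same")
    Jy = convolve2d(I, Dy, mode="same")
    box = np.ones((win, win))                       # summation over the local patch W
    Mxx = convolve2d(Jx * Jx, box, mode="same")     # sum_W Jx^2
    Myy = convolve2d(Jy * Jy, box, mode="same")     # sum_W Jy^2
    Mxy = convolve2d(Jx * Jy, box, mode="same")     # sum_W Jx*Jy
    det = Mxx * Myy - Mxy ** 2
    trace = Mxx + Myy
    return det - k * trace ** 2                     # large positive response -> corner
```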

Difference of Gaussians (DoG)

For each image pyramid level, perform the following steps

  • Gaussian smoothing to different extents
  • Difference of Gaussians: find extrema
    • subtract differently blurred images from each other, and keep the result for every pair of blurred images
    • select the extrema from the subtraction results
  • Maxima suppression at edges
    • the above steps also find edges, which are not good for matching
    • eliminate edges via an eigenvalue test (similar to Harris corners)

Illustration of DoG

Visual Features: Descriptors

SIFT (Scale Invariant Feature Transform)

  • Properties
    • Uses DoG to extract keypoints
    • Invariant to translation, rotation, and scale
    • Partially invariant to illumination changes and affine transformations and 3D projections
  • A SIFT feature includes the following parts
    • Location: pixel coordinates of the keypoint
    • Scale: extremum in scale space from DoG
    • Orientation: compute image gradients in a local region, build a histogram, and select the peak as the keypoint orientation
    • Descriptor: a 128-dimensional float vector, aligned with the orientation
  • Steps to calculate the descriptors

    • Pick a $16\times16$ pixel patch around the keypoint, based on its scale and orientation
    • Divide the patch into $4\times4$ sub-grids (each covering $4\times4$ pixels)
    • For each sub-grid, compute the gradients and build a histogram over 8 directions (up, up-right, right, etc.)
    • Concatenate the histograms of all 16 sub-grids to form the final descriptor: $8\times16=128$ dimensions
  • Descriptor matches
    • Pick one descriptor and compare it with each descriptor from the other image
    • Eliminate ambiguous matches
      • only accept a match if the distance to the query keypoint’s first closest match is smaller than half the distance to its second closest match
      • handle the remaining outliers in other ways (e.g., in Bundle Adjustment)
  • SIFT Descriptor Illustration

Binary Descriptors

  • Properties
    • Fast to compute
    • Easy to compare
  • Steps to calculate descriptors

    • Select a patch around a keypoint
    • Select a set of pixel pairs in the patch following a pre-defined rule
    • For each pair, compare the intensities and get a binary result (1 means the first element is larger than the second, 0 otherwise)
    • Concatenate all binary results into a bit string
  • Descriptor matches: Hamming distance
    $$
    d_{\text{Hamming}}(B_1, B_2) = \sum(\text{xor}(B_1, B_2))
    $$
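A minimal sketch of the Hamming-distance comparison for descriptors stored as {0,1} arrays (my own illustration; real implementations pack the bits into bytes and use popcount):

```python
import numpy as np

def hamming_distance(B1, B2):
    """Count the positions where the two binary descriptors differ."""
    return int(np.count_nonzero(np.bitwise_xor(B1, B2)))

# example with two hypothetical 256-bit descriptors
rng = np.random.default_rng(0)
d1 = rng.integers(0, 2, size=256, dtype=np.uint8)
d2 = rng.integers(0, 2, size=256, dtype=np.uint8)
print(hamming_distance(d1, d2))
```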

Different binary descriptors differ in the rule used to select the pairs.

BRIEF

  • 256-bit descriptor
  • 5 ways to generate pairs
    • Uniform random sampling
    • Gaussian sampling
    • First point drawn from a Gaussian distribution, the second from a Gaussian centered around the first one
    • Discrete locations from a coarse polar grid
    • One point kept at the patch center, the other drawn from a coarse polar grid
  • Disadvantages
    • Cannot deal with image rotation

ORB

  • An extension of BRIEF that
    • Learns the optimal sampling pairs from a dataset
    • Adds rotation compensation
  • Pair choice
    • Select 256 pairs using a training database
    • Uncorrelated: each new pair adds new information
    • High variance: makes a feature more discriminative
  • Rotation compensation
    • Estimate the center of mass and orientation
      $$
      C = \left(
      \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}
      \right)
      $$
      $$
      \theta = \arctan (m_{01} / m_{10})
      $$
      $\quad$ where
      $$
      m_{pq} = \sum_{x,y} x^p y^q I(x, y)
      $$
    • Rotate the pairs around the center of mass $C$ by the orientation $\theta$, then generate the descriptor based on the rotated pairs (see the sketch after this list)
  • Disadvantages
    • Not scale invariant; needs an image pyramid to compensate
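Below is a small sketch (mine, not from the course) of the intensity-centroid step used for the rotation compensation; `patch` is assumed to be a square grayscale patch around the keypoint, and `arctan2` is used to obtain the full angular range of $\theta$:

```python
import numpy as np

def patch_orientation(patch):
    """Center of mass C and orientation theta of a patch via the image moments m_pq."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]].astype(float)
    xs -= (patch.shape[1] - 1) / 2.0      # measure coordinates relative to the patch center
    ys -= (patch.shape[0] - 1) / 2.0
    m00 = patch.sum()                     # m_00
    m10 = (xs * patch).sum()              # m_10
    m01 = (ys * patch).sum()              # m_01
    C = (m10 / m00, m01 / m00)            # center of mass
    theta = np.arctan2(m01, m10)          # orientation
    return C, theta
```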

Camera Extrinsics and Intrinsics

  • Goal: describe how a point is mapped to a pixel coordinate

$$
\left[\begin{matrix}
x \\
y \\
1
\end{matrix}\right] = P
\left[\begin{matrix}
X \\
Y \\
Z \\
1
\end{matrix}\right]
$$

  • How a point maps between world and sensor

    • World coordinate system: $S_o$
    • Camera coordinate system: $S_k$
    • Image coordinate system: $S_c$
    • Sensor coordinate system: $S_s$
    • Point in world coordinate system: ${}^o \boldsymbol{X}$ or $\boldsymbol{X}$
    • Point in camera coordinate system: ${}^k \boldsymbol{X}$
    • Point in image coordinate system: ${}^c \boldsymbol{X}$
    • Point in sensor coordinate system: ${}^s \boldsymbol{X}$
  • Extrinsics: from world coordinates to camera coordinates; describe the pose of the camera in the world
  • Intrinsics: the rest of the conversion; describe the mapping of the scene in front of the camera to the pixels in the final image (sensor)

Extrinsics

  • Parameters: 3 for translation, 3 for rotation
  • Euclidean coordinates: map a point from world coordinates to camera coordinates

$$
{}^k \boldsymbol{X}_P = R(\boldsymbol{X}_P - \boldsymbol{X}_O)
$$
$\quad$ where $\boldsymbol{X}_P = [X_P, Y_P, Z_P]^T$ are the world coordinates of point $P$, $\boldsymbol{X}_O = [X_O, Y_O, Z_O]^T$ is the origin of the camera coordinate system in world coordinates, and $R$ is the rotation from the world coordinate system $S_o$ to the camera coordinate system $S_k$

  • Homogeneous coordinates: map a point from world coordinates to camera coordinates

$$
\begin{split}
\left[\begin{matrix}
{}^k \boldsymbol{X}_P \\
1
\end{matrix}\right]
& =
\left[\begin{matrix}
R & 0 \\
0^T & 1
\end{matrix}\right]
\left[\begin{matrix}
I & -\boldsymbol{X}_O \\
0^T & 1
\end{matrix}\right]
\left[\begin{matrix}
\boldsymbol{X}_P \\
1
\end{matrix}\right] \\
& =
\left[\begin{matrix}
R & -R \boldsymbol{X}_O \\
0^T & 1
\end{matrix}\right]
\left[\begin{matrix}
\boldsymbol{X}_P \\
1
\end{matrix}\right]
\end{split}
$$

$\quad$ written in short as

$$
{}^k \mathbf{X}_P = {}^k H \mathbf{X}_P
$$
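A small sketch (my own, with toy numbers) of assembling ${}^k H$ from $R$ and $\boldsymbol{X}_O$ and applying it to a homogeneous world point:

```python
import numpy as np

def world_to_camera_transform(R, X_O):
    """Build the 4x4 homogeneous transform kH = [R, -R X_O; 0^T, 1]."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = -R @ X_O
    return H

R = np.eye(3)                                   # camera aligned with the world axes
X_O = np.array([0.0, 0.0, 5.0])                 # camera origin in world coordinates
X_P = np.array([1.0, 2.0, 3.0, 1.0])            # homogeneous world point
print(world_to_camera_transform(R, X_O) @ X_P)  # -> [ 1.  2. -2.  1.]
```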

Note on notation

  • $\boldsymbol{X}$ denotes Euclidean coordinates
  • $\mathbf{X}$ denotes homogeneous coordinates

Intrinsics

Map a point from camera coordinates to image coordinates in homogeneous coordinates, with central projection applied

$$
\left[\begin{matrix}
{}^c X_P \\
{}^c Y_P \\
1
\end{matrix}\right] =
\left[\begin{matrix}
\frac{c}{ ^k Z_P} ^k X_P \\
\frac{c}{ ^k Z_P} ^k Y_P \\
1
\end{matrix}\right] =
\left[\begin{matrix}
c \ {}^k X_P \\
c \ {}^k Y_P \\
{}^k Z_P
\end{matrix}\right] =
\left[\begin{matrix}
c & 0 & 0 & 0 \\
0 & c & 0 & 0 \\
0 & 0 & 1 & 0
\end{matrix}\right]
\left[\begin{matrix}
{}^k X_P \\
{}^k Y_P \\
{}^k Z_P \\
1
\end{matrix}\right]
$$

where $c$ is the camera constant

written in short as

$$
{}^c \mathbf{X}_P = {}^c P_k {}^k \mathbf{X}_P
$$

Map a point from image coordinates to the ideal sensor coordinate system (ignoring non-linear errors) in homogeneous coordinates

$$
{}^s \mathbf{x} = {}^s H_c {}^c \mathbf{X}_P
$$

with

$$
{}^s H_c =
\left[\begin{matrix}
1 & s & x_H \\
0 & 1+m & y_H \\
0 & 0 & 1
\end{matrix}\right]
$$

$x_H, y_H$ are the principal point, $m$ is the scale difference between the horizontal and vertical directions, and $s$ is the shear compensation, which is close to 0 for digital cameras.

  • Compared to an ideal sensor, a real sensor has non-linear errors, which are caused by
    • Imperfect lenses
    • Imperfect planarity of the sensor
  • The non-linear error is a location-dependent shift: every location in the image has a shift in the sensor coordinate system

$$
\begin{matrix}
{}^a x = {}^s x + \Delta x (\boldsymbol{x}, \boldsymbol{q}) \\
{}^a y = {}^s y + \Delta y (\boldsymbol{y}, \boldsymbol{q})
\end{matrix}
$$

$\quad$ where $\boldsymbol{q}$ are the non-linear parameters. Written in matrix form

$$
\begin{split}
{}^a \mathbf{x}
& = {}^a H_s(\boldsymbol{x}) {}^s \mathbf{x} \\
& = \left[\begin{matrix}
1 & 0 & \Delta x (\boldsymbol{x}, \boldsymbol{q}) \\
0 & 1 & \Delta y (\boldsymbol{y}, \boldsymbol{q}) \\
0 & 0 & 1
\end{matrix}\right]
{}^s \mathbf{x}
\end{split}
$$

  • Model the non-linear error: Barrel Distortion

See more here

Combination

Combining these extrinsic and intrinsic steps gives the full mapping from the world frame to the sensor frame.

  • The mapping from the world coordinate system to the image coordinate system can be written as (in homogeneous coordinates)

$$
\begin{split}
{}^c \mathbf{X}_P
& = {}^c P_k {}^k H \mathbf{X}_P \\
& = {}^c K [R | -R\boldsymbol{X}_O] \mathbf{X}_P \\
& = {}^c K R [I_3 | -\boldsymbol{X}_O] \mathbf{X}_P
\end{split}
$$

How this works

  • ${}^c K$ is the matrix ${}^c P_k$ with the last column dropped
  • $[R | -R\boldsymbol{X}_O]$ is the matrix ${}^k H$ with the last row dropped
  • Calibration Matrix for Ideal Camera

$$
{}^c K =
\left[\begin{matrix}
c & 0 & 0 \\
0 & c & 0 \\
0 & 0 & 1
\end{matrix}\right]
$$

$\quad$ where $c$ is the camera constant.

  • Projection, written in short as

$$
{}^c P = {}^c K R [I_3 | -\boldsymbol{X}_O]
$$

  • The mapping from world coordinates to the sensor coordinate system (in homogeneous coordinates)

$$
\begin{split}
{}^s \mathbf{x}
& = {}^s H_c {}^c K R [I_3 | -\boldsymbol{X}_O] \mathbf{X}_P \\
& = K R [I_3 | -\boldsymbol{X}_O] \mathbf{X}_P
\end{split}
$$

  • Calibration Matrix for Affine Camera

$$
K = {}^s H_c {}^c K =
\left[\begin{matrix}
c & cs & x_H \\
0 & c(1+m) & y_H \\
0 & 0 & 1
\end{matrix}\right]
$$

  • Projection, written in short as (also called the DLT, Direct Linear Transform)

$$
P = K R [I_3 | -\boldsymbol{X}_O]
$$

More about DLT

It is a homogeneous projection matrix and the model of the affine camera (a camera with an affine mapping to the sensor).

It contains 6 extrinsic parameters ($R, \boldsymbol{X}_O$) and 5 intrinsic parameters ($c, x_H, y_H, m, s$), where $c$ is the camera constant, $x_H, y_H$ are the principal point, $m$ is the scale difference between the horizontal and vertical directions, and $s$ is the shear.

Besides, DLT also refers to an approach to calculate camera poses. A small sketch of assembling and applying $P$ follows below.
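The sketch below (my own illustration, with made-up parameter values) assembles $P = K R [I_3 | -\boldsymbol{X}_O]$ from the 5 intrinsic and 6 extrinsic parameters and projects a Euclidean world point to pixel coordinates:

```python
import numpy as np

def projection_matrix(c, x_H, y_H, m, s, R, X_O):
    """Assemble the affine-camera (DLT) projection matrix P = K R [I | -X_O]."""
    K = np.array([[c, c * s, x_H],
                  [0, c * (1 + m), y_H],
                  [0, 0, 1.0]])
    return K @ R @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])

def project(P, X_world):
    """Project a Euclidean 3D point and normalize the homogeneous result."""
    x = P @ np.append(X_world, 1.0)
    return x[:2] / x[2]

# hypothetical numbers, only to exercise the code
P = projection_matrix(c=800, x_H=320, y_H=240, m=0.0, s=0.0,
                      R=np.eye(3), X_O=np.array([0.0, 0.0, -10.0]))
print(project(P, np.array([1.0, 2.0, 5.0])))
```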

  • Map a point from world coordinates to the sensor coordinate system with distortion considered

$$
\begin{split}
{}^a \mathbf{x}
& = {}^a H_s(\boldsymbol{x}) K R [I_3 | -\boldsymbol{X}_O] \mathbf{X}_P \\
& = {}^a K (\boldsymbol{x}, \boldsymbol{q}) R [I_3 | -\boldsymbol{X}_O] \mathbf{X}_P
\end{split}
$$

  • Calibration Matrix for General Camera

$$
{}^a K (\boldsymbol{x}, \boldsymbol{q}) = {}^a H_s(\boldsymbol{x}) K =
\left[\begin{matrix}
c & cs & x_H + \Delta x (\boldsymbol{x}, \boldsymbol{q}) \\
0 & c(1+m) & y_H + \Delta y (\boldsymbol{y}, \boldsymbol{q}) \\
0 & 0 & 1
\end{matrix}\right]
$$

  • Projection

$$
{}^a P (\boldsymbol{x}, \boldsymbol{q}) = {}^a K (\boldsymbol{x}, \boldsymbol{q}) R [I_3 | -\boldsymbol{X}_O]
$$

  • In reality

    We usually follow these 2 steps to perform the conversion from the world frame to the real sensor

    1. Use Affine Camera model $P$ for all points (DLT)
      $$
      {}^s \mathbf{x} = P \mathbf{X}_P
      $$
    2. Use Non-linear model for each point
      $$ {}^a \mathbf{x} = {}^a H_s(\boldsymbol{x}) {}^s \mathbf{x} $$

Then,

from real sensor to world

  1. According to Affine camera model (projection)

$$
\begin{split}
\lambda \mathbf{x}
& = P \mathbf{X}_P \\
& = KR[I_3 | -\boldsymbol{X}_O] \mathbf{X}_P \\
& = KR \boldsymbol{X}_P - KR \boldsymbol{X}_O
\end{split}
$$

where $\lambda$ is a scale factor, since the equation is in homogeneous coordinates

So we could get the inversion

$$
\boldsymbol{X}_P = \boldsymbol{X}_O + \lambda (KR)^{-1} \mathbf{x}
$$

This equation describes a viewing ray: $(KR)^{-1} \mathbf{x}$ gives the direction of the ray from the origin $\boldsymbol{X}_O$ towards the 3D point $\boldsymbol{X}_P$, and $\lambda$ is the unknown distance along it

Camera Categories

Camera categories

Camera Calibration

Calculate the extrinsics and intrinsics of cameras

Direct Linear Transform

  • Assume the model of an affine camera (ignoring non-linear errors).
  • Given: matched point pairs in the world frame and in the pixel frame
  • Goal: estimate the 5 intrinsic and 6 extrinsic parameters
  • Matched pair ($\boldsymbol{x}_i$, $\boldsymbol{X}_i$) relationship
    • Based on DLT, we have
      $$
      {\mathbf{x_i}}=P{\mathbf{X_i}}=\left[\begin{matrix}
      p_{11} & p_{12} & p_{13} & p_{14} \\
      p_{21} & p_{22} & p_{23} & p_{24} \\
      p_{31} & p_{32} & p_{33} & p_{34}
      \end{matrix}\right]
      \mathbf{X}_i
      $$

$$
s\left[\begin{matrix}
x_i^{(x)} \\ x_i^{(y)} \\ 1
\end{matrix}\right] = \left[\begin{matrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34}
\end{matrix}\right]
\left[\begin{matrix} X_i^{(x)} \\ X_i^{(y)} \\ X_i^{(z)} \\ 1 \end{matrix}\right]
$$

  • Analysis
    • Each point pair gives 2 constraints, so 6 point pairs are required to estimate the 12 elements of $P$ (11 independent parameters, since $P$ is defined up to scale)
    • If the intrinsics $K$ are known, this can also be used to estimate the extrinsics
  • Solve by SVD

    Using 6 (or more) point pairs, we can form a homogeneous linear system

    $$
    \rm{M} \boldsymbol{p} = 0
    $$

    With redundant observations, the system is solved via SVD: take the right singular vector of $\rm{M}$ corresponding to the smallest singular value (a sketch is given below).
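A minimal DLT sketch (my own, without the coordinate conditioning a robust implementation would add): build the homogeneous system from the correspondences and take the right singular vector of the smallest singular value.

```python
import numpy as np

def dlt_projection(X_world, x_pixel):
    """Estimate the 3x4 projection matrix P from >= 6 world<->pixel correspondences.

    X_world: (N, 3) world points, x_pixel: (N, 2) observed pixels.
    """
    rows = []
    for (X, Y, Z), (u, v) in zip(X_world, x_pixel):
        Xh = np.array([X, Y, Z, 1.0])
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))   # u-constraint
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))   # v-constraint
    M = np.asarray(rows)
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(3, 4)   # right singular vector of the smallest singular value
```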

Zhang’s Method

  • Assume the model of an affine camera (ignoring non-linear errors). Use a printed checkerboard
  • Goal: estimate the 5 intrinsic parameters
  • Background

    • Every keypoint of the checkerboard is given in the checkerboard coordinate system, which is defined in the XY plane of the checkerboard with the origin at a corner ($Z = 0$)
      $$
      \begin{split}
      \mathbf{x_i}
      & = \left[\begin{matrix}
      c & cs & x_H \\
      0 & c(1+m) & y_H \\
      0 & 0 & 1
      \end{matrix}\right]
      \left[\begin{matrix}
      r_{11} & r_{12} & r_{13} & t_1 \\
      r_{21} & r_{22} & r_{23} & t_2 \\
      r_{31} & r_{32} & r_{33} & t_3
      \end{matrix}\right]
      \left[\begin{matrix}
      X_i \\
      Y_i \\
      Z_i \\
      1
      \end{matrix}\right] \\
      & = K[\mathbf{r}_1, \mathbf{r}_2, \mathbf{t}]
      \left[\begin{matrix}
      X_i \\
      Y_i \\
      1
      \end{matrix}\right]
      = H
      \left[\begin{matrix}
      X_i \\
      Y_i \\
      1
      \end{matrix}\right]
      \end{split}
      $$
  • Matched point pair ($\boldsymbol{x}_i$, $\boldsymbol{X}_i$) relationship
    $$
    \begin{split}
    \mathbf{x_i} = H
    \left[\begin{matrix}
    X_i \\
    Y_i \\
    1
    \end{matrix}\right]
    \end{split}
    $$
  • Analysis
    • 9 unknowns $\rightarrow$ 8 unknown elements considering the scale factor
    • Each point pair gives 2 constraints, so 4 points per plane are required
    • At least 3 views of the plane (with different orientations) are required
  • Compute $K$ from $H$

    Summary

  • For Non-linear Errors
    • e.g.

      use Barrel Distortion

      $$
      \begin{matrix}
      {}^a x = x (1 + q_1 r^2 + q_2 r^4) \\
      {}^a y = y (1 + q_1 r^2 + q_2 r^4)
      \end{matrix}
      $$

      with $[x, y]^T$ being the point as projected by an ideal pinhole camera, $[{}^a x, {}^a y]^T$ being the point actually observed, $r$ being the distance of the pixel to the principal point, and $q_1, q_2$ being additional non-linear parameters.

      To calculate the parameters, an iterative estimation is used (a small sketch of the distortion model follows after this list).

  • Summary
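A small sketch of the barrel-distortion model above (my own; it assumes the radial factor is applied to coordinates measured relative to the principal point $(x_H, y_H)$, which the parameters here make explicit):

```python
import numpy as np

def apply_barrel_distortion(x, y, x_H, y_H, q1, q2):
    """Map ideal pinhole coordinates (x, y) to the distorted coordinates (ax, ay)."""
    dx, dy = x - x_H, y - y_H
    r2 = dx ** 2 + dy ** 2                   # squared distance to the principal point
    factor = 1.0 + q1 * r2 + q2 * r2 ** 2    # 1 + q1 r^2 + q2 r^4
    return x_H + dx * factor, y_H + dy * factor
```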

P3P

Projective 3 Point (P3P), or Spatial Resection (SR), is an algorithm to estimate the extrinsics of a calibrated camera given corresponding points from the image and the world. A camera pose has 6 degrees of freedom; P3P uses the minimum number of point pairs, 3, to compute it.

  • Given: 3D coordinates of the object points $\mathbf{X}_i$ (at least 3 points are required given the known calibration matrix $K$, compared with 6 for DLT)
  • Observed: 2D image coordinates $\mathbf{x}_i$ of the object points
  • Goal: Estimate extrinsic parameters $R, \boldsymbol{X}_O$ of a calibrated camera
  • Steps:

    Then the lengths of the projection rays $s_1, s_2, s_3$ can be calculated by applying the law of cosines to the 3 triangles.

    The solution for $s_1, s_2, s_3$ has up to 4 possibilities, therefore a 4th point is needed to pick the right one.

    e.g. use the vectors $\mathbf{X}_1\mathbf{X}_2, \mathbf{X}_1\mathbf{X}_3$ to calculate the rotation and the translation between the two coordinate systems.
  • Handle Outlier with RANSAC

  • Summary

Fundamental and Essential Matrix

  • Given matched pixel points from 2 frames (a stereo camera, a moving camera)
  • Goal: estimate the relative rotation and translation direction between the 2 frames (translation without scale)
  • Background: Coplanarity Constraint

    The two camera origins and a point in space lie in the same plane

    where $X$ is a point in space, $O^{\prime}, O^{\prime\prime}$ are the origins of the two cameras, and the triple product is $[O^{\prime}X^{\prime}, O^{\prime}O^{\prime\prime}, O^{\prime\prime}X^{\prime\prime}] = O^{\prime}X^{\prime} \cdot (O^{\prime}O^{\prime\prime} \times O^{\prime\prime}X^{\prime\prime})$

    To construct the 3 vectors

    The direction of $O^{\prime}X^{\prime}$ can be constructed by
    $$
    {}^n \mathbf{x}^{\prime} = (R^{\prime})^{-1} (K^{\prime})^{-1} \mathbf{x}^{\prime}
    $$
    Similarly
    $$
    {}^n \mathbf{x}^{\prime\prime} = (R^{\prime\prime})^{-1} (K^{\prime\prime})^{-1} \mathbf{x}^{\prime\prime}
    $$
    The direction $O^{\prime}O^{\prime\prime}$ is
    $$
    \mathbf{b} = \boldsymbol{X_{O^{\prime\prime}}} - \boldsymbol{X_{O^{\prime}}}
    $$

    Then the Constraint could be rewritten as
    $$
    \begin{split}
    0
    &= [{}^n \mathbf{x}^{\prime}, \mathbf{b}, {}^n \mathbf{x}^{\prime\prime}] \\
    &= {}^n \mathbf{x}^{\prime} \cdot (\mathbf{b} \times {}^n \mathbf{x}^{\prime\prime}) \\
    &= {}^n \mathbf{x}^{\prime T} S_b {}^n \mathbf{x}^{\prime\prime} \\
    &= \mathbf{x}^{\prime T} \bbox[border: solid]{(K^{\prime})^{-T}(R^{\prime})^{-T} S_b (R^{\prime\prime})^{-1}(K^{\prime\prime})^{-1}} \mathbf{x}^{\prime\prime}
    \end{split}
    $$

    How to convert

Definitions

  • Fundamental Matrix for uncalibrated camera
    $$
    \begin{split}
    F
    &= (K^{\prime})^{-T}(R^{\prime})^{-T} S_b (R^{\prime\prime})^{-1}(K^{\prime\prime})^{-1} \\
    &= (K^{\prime})^{-T}R^{\prime} S_b R^{\prime\prime T} (K^{\prime\prime})^{-1}
    \end{split}
    $$
    with constraint
    $$
    \mathbf{x}^{\prime T} F \mathbf{x}^{\prime\prime} = 0
    $$
  • Essential matrix for calibrate camera
    $$
    E = R^{\prime} S_b R^{\prime\prime T}
    $$
    with constraint
    $$
    {}^{k}\mathbf{x}^{\prime T} E {}^{k}\mathbf{x}^{\prime\prime} = 0
    $$
    where ${}^k\mathbf{x} = K^{-1}\mathbf{x}$ are the calibrated (normalized) image coordinates.

The matrices $F$ and $E$ are defined only up to scale, which means the constraint remains valid when they are multiplied by a scalar.

Steps

Given corresponding points from the first image $(x^{\prime}_n, y^{\prime}_n)$ and the second image $(x^{\prime\prime}_n, y^{\prime\prime}_n), \ n = 1,2,\cdots,N$

Compute $F$ Matrix for uncalibrated cameras

  • Given: N Corresponding pixel points
    $$
    (x^{\prime}_n, y^{\prime}_n), (x^{\prime\prime}_n, y^{\prime\prime}_n), \ n = 1,2,\cdots,N
    $$

  • Matched point relationship
    $$
    \left[\begin{matrix}
    x^\prime_n & y^\prime_n & 1
    \end{matrix}\right]
    \left[\begin{matrix}
    F_{11} & F_{12} & F_{13} \\
    F_{21} & F_{22} & F_{23} \\
    F_{31} & F_{32} & F_{33}
    \end{matrix}\right]
    \left[\begin{matrix}
    x^{\prime\prime}_n \\
    y^{\prime\prime}_n \\
    1
    \end{matrix}\right]
    =0
    $$
    $\quad$ We could flatten the matrix $F$ into a vector $\hat{F}$ and collect the point terms into a coefficient matrix $A$, giving
    $$
    A \hat{F} = 0
    $$

  • Analysis

    • 9 unknowns $\rightarrow$ 8 unknowns considering the scale factor
    • each point pair gives 1 constraint, so at least 8 matched point pairs are required
  • More than 8 points

    In reality we have more than 8 points; the system is then overdetermined and generally has no exact solution, so we take the right singular vector $\hat{f}$ of $A$ corresponding to the smallest singular value as the final solution.
  • The singular values of matrix $F$ should have the form $[\sigma_1, \sigma_2, 0]^T$, so we need to enforce rank 2 on the estimated $F$ (a sketch of the full 8-point estimation is given at the end of this section)
  • 8-Point Algorithm Summary

  • Bad cases
    • Points all lying on the same plane, leading to $\text{rank}(A) < 8$
    • The two cameras have no translation between them
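A compact sketch of the 8-point estimation described above (my own illustration; the coordinates are used un-normalized here, whereas a practical implementation conditions them first and wraps everything in RANSAC):

```python
import numpy as np

def eight_point_F(x1, x2):
    """Estimate F from N >= 8 correspondences x1[n] <-> x2[n] (pixel coordinates, (N, 2))."""
    N = x1.shape[0]
    A = np.zeros((N, 9))
    for n in range(N):
        xp, yp = x1[n]      # (x'_n, y'_n), first image
        xpp, ypp = x2[n]    # (x''_n, y''_n), second image
        # one row of the constraint x'^T F x'' = 0, with F flattened row-major
        A[n] = [xp * xpp, xp * ypp, xp,
                yp * xpp, yp * ypp, yp,
                xpp, ypp, 1.0]
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)   # enforce rank 2
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```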

Compute $E$ Matrix for calibrated cameras

  • Given: Camera intrinsics $K$ and N Corresponding pixel points
    $$
    (x^{\prime}_n, y^{\prime}_n), (x^{\prime\prime}_n, y^{\prime\prime}_n), \ n = 1,2,\cdots,N
    $$

  • Matched point relationship
    $$
    \left[\begin{matrix}
    {}^k x^\prime_n & {}^k y^\prime_n & 1
    \end{matrix}\right]
    \left[\begin{matrix}
    E_{11} & E_{12} & E_{13} \\
    E_{21} & E_{22} & E_{23} \\
    E_{31} & E_{32} & E_{33}
    \end{matrix}\right]
    \left[\begin{matrix}
    {}^k x^{\prime\prime}_n \\
    {}^k y^{\prime\prime}_n \\
    1
    \end{matrix}\right]
    =0
    $$

  • Analysis

    • 9 unknown elements $\rightarrow$ 8 unknowns considering the scale factor $\rightarrow$ 5 unknowns considering the rotation and direction constraints
    • Each point pair gives 1 constraint, so at least 5 matched point pairs are required
    • Using only 5 point pairs requires a more complex (minimal) solver, so we usually use 8 pairs
    • The singular values of matrix $E$ have the form $[\sigma, \sigma, 0]^T$
    • The rotation cannot be solved in a pure-rotation scenario (no translation)
  • To ensure the estimated matrix $E$ has the required singular values, and since it is defined only up to scale, the non-zero singular values are usually enforced to be 1

  • 8-Point Alogirthm Summary

Compute rotation and direction from $E$

  • If we set $R^\prime = I$, then $R^{\prime\prime}$ represents the second camera’s rotation relative to the first camera.

$$
\begin{split}
E
& = I S_b R^{\prime\prime T} \\
& = U
\left[\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 0
\end{matrix}\right]
V^T \\
& = U
\left[\begin{matrix}
0 & 1 & 0 \\
-1 & 0 & 0 \\
0 & 0 & 0
\end{matrix}\right]
\left[\begin{matrix}
0 & -1 & 0 \\
1 & 0 & 0 \\
0 & 0 & 1
\end{matrix}\right]
V^T \\
& = UZWV^T \\
& = UZU^T UWV^T
\end{split}
$$

  • So,
    $$
    S_b = UZU^T \quad R^{\prime\prime T} = UWV^T
    $$
  • Using different combinations in the decomposition ($W$ vs. $W^T$ and the sign of the baseline) leads to 4 different solutions
  • Triangulating a point and checking whether it lies in front of both cameras filters out the correct solution (a sketch is given below)
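A sketch of the decomposition above (my own; the exact signs and transposes depend on the chosen parameterization, so the four candidates are always disambiguated with the in-front-of-both-cameras test):

```python
import numpy as np

def decompose_essential(E):
    """Return the 4 candidate (R'', b) pairs following E = S_b R''^T with R' = I."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:      # keep proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])
    R_a = Vt.T @ W.T @ U.T        # from R''^T = U W V^T
    R_b = Vt.T @ W @ U.T          # second rotation candidate (W vs. W^T)
    b = U[:, 2]                   # null space of S_b = U Z U^T: baseline direction, up to sign/scale
    return [(R_a, b), (R_a, -b), (R_b, b), (R_b, -b)]
```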

Summary

The solutions above are direct (closed-form) solutions, which means no initial guess or additional information is required. Direct solutions are often combined with RANSAC to deal with outliers, and they can serve as the initial guess for iterative solutions.

Homography Matrix

  • Given: matched point pairs from 2 frames, where points are placed in the same plane of real world
  • What to solve: the relative pose change, and the pixel-to-pixel mapping needed to construct a panorama image
  • Background

    For a matched pixel pair $p_1$, $p_2$, suppose the corresponding space point $P_w$ lies on a plane, so
    $$
    n^T P_w + d = 0 \Rightarrow -\frac{n^T P_w}{d} = 1
    $$

    Set image 1’s pose as the reference, and let the relative pose change from image 1 to image 2 be $R$, $t$, so
    $$
    \begin{gather}
    s_1 \mathbf{p}_1 = K P_w \\
    s_2 \mathbf{p}_2 = K (R P_w + t)
    \end{gather}
    $$

    Since the left side is in homogeneous coordinates, we can drop $s_1, s_2$ if we treat the equality as valid up to scale.

    $$
    \begin{split}
    \mathbf{p}_2 &\simeq K (R P_w + t) \\
    &\simeq K \left(R P_w - t \frac{n^T P_w}{d} \right) \\
    &\simeq K (R - \frac{t n^T}{d}) P_w \\
    &\simeq \bbox[border: solid]{K (R - \frac{t n^T}{d}) K^{-1}} \mathbf{p}_1 \\
    &\simeq H p_1
    \end{split}
    $$

  • Matched pixel point pairs $(u_1, v_1)$, $(u_2, v_2)$ relationship
    $$
    \left[\begin{matrix}
    u_2 \\ v_2 \\ 1
    \end{matrix}\right] \simeq
    \left[\begin{matrix}
    h_1 & h_2 & h_3 \\
    h_4 & h_5 & h_6 \\
    h_7 & h_8 & h_9
    \end{matrix}\right]
    \left[\begin{matrix}
    u_1 \\ v_1 \\ 1
    \end{matrix}\right]
    $$
  • Analysis
    • 9 unknowns $\rightarrow$ 8 unknowns considering the scale factor
    • Each point pair gives 2 constraints, thus 4 point pairs are required
    • The 4 point pairs cannot lie on the same line
    • Can recover the rotation in a pure-rotation scenario, but then it cannot be used for triangulation.
  • Use SVD to solve for the homography matrix (a sketch is given below)
  • Solve $R$, $t$ from the homography matrix - TODO
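A minimal SVD-based homography sketch (my own illustration; it assumes at least 4 non-degenerate correspondences and does no outlier handling):

```python
import numpy as np

def homography_dlt(p1, p2):
    """Estimate H with p2 ~ H p1 from (N, 2) pixel correspondences p1, p2."""
    rows = []
    for (u1, v1), (u2, v2) in zip(p1, p2):
        x1 = np.array([u1, v1, 1.0])
        rows.append(np.concatenate([x1, np.zeros(3), -u2 * x1]))
        rows.append(np.concatenate([np.zeros(3), x1, -v2 * x1]))
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]            # fix the free scale so that h_9 = 1
```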
Epipolar Geometry

  • Motivation
    • Given extrinsic, intrinsic of two frames, and a point $X^{\prime}$ in the first image
    • Find the corresponding point $X^{\prime\prime}$ in the second image
  • Epipolar Geometry
    • Describe geometric relations in image pairs
    • Enables efficient search for and prediction of corresponding points
    • Given a straight-line preserving mapping, reduce the search space from 2D to 1D
  • Some concepts
    • Epipolar axis: the line between the two projection centers
    • Epipoles: the images of the projection centers. All epipolar lines pass through the epipoles
    • Epipolar lines: lines connecting an epipole and the image of a point
  • Compute elements based on projection matrices and fundamental matrix

$$
\mathbf{b} = \boldsymbol{X_{O^{\prime\prime}}} - \boldsymbol{X}_{O^{\prime}}
$$

$\mathbf{b}$ is given in homogeneous coordinates, which means we only know the direction of the baseline

For $\mathbf{l}^{\prime}$ in first image, and $\mathbf{l}^{\prime\prime}$ in second image
$$
\mathbf{l}^{\prime} = F \mathbf{x^{\prime\prime}} \ \text{or} \ \mathbf{l}^{\prime\prime} = F^T \mathbf{x^{\prime}}
$$

If a point lies on the epipolar line, it satisfies
$$
\mathbf{x}^{\prime T} \mathbf{l}^{\prime} = 0 \ \text{or} \ \mathbf{x}^{\prime\prime T} \mathbf{l}^{\prime\prime} = 0
$$

$$
\mathbf{e}^{\prime} = P^{\prime} \mathbf{X_{O^{\prime\prime}}} \ \text{or} \ \mathbf{e}^{\prime\prime} = P^{\prime\prime} \mathbf{X}_{O^{\prime}}
$$

Triangulation

  • Motivation
    • Given the relative orientation of the two frames, the translation between the 2 frames, and corresponding points from the two images
    • Compute the 3D positions of the points in the world

Geometric Approach

The two rays corresponding to matched points from the two images may in general not intersect in space. Therefore, the triangulated point is taken as the midpoint of the shortest segment between the two rays.

Equations for the 2 lines
$$
\begin{matrix}
\boldsymbol{f} = \boldsymbol{X_{O^\prime}} + \lambda R^{\prime T} {}^k \mathbf{x}^\prime \\
\boldsymbol{g} = \boldsymbol{X}_{O^{\prime\prime}} + \mu R^{\prime\prime T} {}^k \mathbf{x}^{\prime\prime}
\end{matrix}
$$

The constraints for point $F$ and $G$ are

$$
[\boldsymbol{X_{O^\prime}} + \hat{\lambda} R^{\prime T} {}^k \mathbf{x}^\prime - (\boldsymbol{X}_{O^{\prime\prime}} + \hat{\mu} R^{\prime\prime T} {}^k \mathbf{x}^{\prime\prime})]^T R^{\prime\prime T} {}^k \mathbf{x}^{\prime\prime} = 0
$$

$$
[\boldsymbol{X_{O^\prime}} + \hat{\lambda} R^{\prime T} {}^k \mathbf{x}^\prime - (\boldsymbol{X}_{O^{\prime\prime}} + \hat{\mu} R^{\prime\prime T} {}^k \mathbf{x}^{\prime\prime})]^T R^{\prime T} {}^k \mathbf{x}^\prime = 0
$$

How it works

  • The $R^{\prime T} {}^k \mathbf{x}^\prime$ and $R^{\prime\prime T} {}^k \mathbf{x}^{\prime\prime}$ are the directions of line $\boldsymbol{f}$ and $\boldsymbol{g}$
  • The line segment between points $F$ and $G$ is orthogonal to both line $\boldsymbol{f}$ and line $\boldsymbol{g}$

Then we can solve for $\hat{\lambda}, \hat{\mu}$ to get the world coordinates of the points $F, G$; the triangulated point $H$ is their midpoint (a small sketch is given below).

The process of using SVD to solve this system can be found here
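A small sketch of the geometric (midpoint) triangulation above (my own; `x1`, `x2` stand for the ray directions ${}^k\mathbf{x}' = K^{-1}\mathbf{x}'$ and ${}^k\mathbf{x}''$ in the respective camera frames):

```python
import numpy as np

def triangulate_midpoint(X_O1, R1, x1, X_O2, R2, x2):
    """Midpoint of the shortest segment between the two viewing rays f and g."""
    d1 = R1.T @ x1                 # direction of line f in the world frame
    d2 = R2.T @ x2                 # direction of line g
    b = X_O2 - X_O1                # baseline
    # the orthogonality constraints give a 2x2 system in (lambda, mu)
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    lam, mu = np.linalg.solve(A, np.array([d1 @ b, d2 @ b]))
    F = X_O1 + lam * d1
    G = X_O2 + mu * d2
    return 0.5 * (F + G)           # triangulated point H
```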

  • Analysis
    • Points can only be triangulated when there is translation between the views
    • The higher the measurement precision, the further away points can be triangulated accurately

Stereo Normal Case

For a stereo camera, the rotations of the 2 cameras $O^\prime, O^{\prime\prime}$ are identical and the cameras only differ by a baseline offset $B$. Simple geometric relations then give the position of the point $P$.

Compute Absolute Orientation (ICP)

  • The result of Relative Orientation is called the Photogrammetric Model.

  • In order to map the Photogrammetric Model to the object reference frame (world coordinate system), control points (points whose world and camera coordinates are both known) are required.
    $$
    {}^o \boldsymbol{X}_n = \lambda R \boldsymbol{X}_n + \boldsymbol{t}
    $$

  • To solve for $\lambda, R, \boldsymbol{t}$, a Least Squares Solution is used.

    Least Square Solution

    • Problem Description
    • Simplify the target by computing the weighted centroid $\boldsymbol{y}_0$ of the $\boldsymbol{y}_n$
      $$
      \boldsymbol{y}_0 = \frac{\sum \boldsymbol{y}_n p_n}{\sum p_n}
      $$
      then
      $$
      \begin{split}
      &\sum || \boldsymbol{y}_n - \boldsymbol{\bar{x}}_n ||^2 p_n \\
      =&\sum || \boldsymbol{y}_n - \boldsymbol{y}_0 - \lambda R \boldsymbol{x}_n - \boldsymbol{t} + \boldsymbol{y}_0 ||^2 p_n \\
      =&\sum || \boldsymbol{y}_n - \boldsymbol{y}_0 - \lambda R (\boldsymbol{x}_n - \boldsymbol{x}_0) ||^2 p_n \\
      \rightarrow&\sum || \lambda^{-\frac{1}{2}} (\boldsymbol{y}_n - \boldsymbol{y}_0) - \lambda^{\frac{1}{2}} R (\boldsymbol{x}_n - \boldsymbol{x}_0) ||^2 p_n \\
      =&\sum [ \lambda^{-\frac{1}{2}} (\boldsymbol{y}_n - \boldsymbol{y}_0) - \lambda^{\frac{1}{2}} R (\boldsymbol{x}_n - \boldsymbol{x}_0) ]^T [ \lambda^{-\frac{1}{2}} (\boldsymbol{y}_n - \boldsymbol{y}_0) - \lambda^{\frac{1}{2}} R (\boldsymbol{x}_n - \boldsymbol{x}_0) ] p_n \\
      =& \sum \lambda^{-1} (\boldsymbol{y}_n - \boldsymbol{y}_0)^T (\boldsymbol{y}_n - \boldsymbol{y}_0) p_n + \sum \lambda (\boldsymbol{x}_n - \boldsymbol{x}_0)^T (\boldsymbol{x}_n - \boldsymbol{x}_0) p_n - \\
      & \quad 2 \sum (\boldsymbol{y}_n - \boldsymbol{y}_0)^T R (\boldsymbol{x}_n - \boldsymbol{x}_0) p_n
      \end{split}
      $$
      where $\boldsymbol{x}_0 = \frac{1}{\lambda}(R^T \boldsymbol{y}_0 - R^T \boldsymbol{t})$
    • Minimize above equation by
      • Compute the first derivatives
      • Set the derivatives to 0
      • Solve the remaining parts
    Actually, here we obtain the result $R = VU^T$ (Proof); a sketch of the full solution is given after this list.

    Summary

    Summary

  • Solving the absolute orientation problem is a key task in photogrammetry and point cloud processing. The least squares approach for computing the similarity transform between two point sets is very important.
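A sketch of the weighted least squares estimation outlined above (my own illustration in the spirit of the Horn/Umeyama solution, including the $R = VU^T$ step):

```python
import numpy as np

def absolute_orientation(x, y, p=None):
    """Similarity transform y ~ lambda * R @ x + t from (N, 3) correspondences with weights p."""
    p = np.ones(len(x)) if p is None else np.asarray(p, dtype=float)
    x0 = (p[:, None] * x).sum(axis=0) / p.sum()       # weighted centroids
    y0 = (p[:, None] * y).sum(axis=0) / p.sum()
    xc, yc = x - x0, y - y0
    H = (p[:, None] * xc).T @ yc                      # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ D @ U.T                                # R = V U^T (up to the reflection fix)
    lam = np.sqrt((p * (yc ** 2).sum(axis=1)).sum() /
                  (p * (xc ** 2).sum(axis=1)).sum())  # scale from the two point-set spreads
    t = y0 - lam * R @ x0
    return lam, R, t
```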

Build 3D Maps of Environment

  • Given 2 cameras, corresponding points from 2 images, control points
  • Goal: to build the 3D maps of the environment
  1. (Photogrammetric model) Compute the relative information (Fundamental or Essential matrix) of the 2 cameras based on corresponding points. Triangulate the corresponding points to get their 3D locations in the camera coordinate system.
  2. Compute the absolute information of the points based on the result of the Photogrammetric Model and the control points.

Illustration

  • Pros
    • Can handle gross errors if Redundancy / Number-of-Observations > 0.5
    • Serve as initial guess for BA
  1. DLT/P3P for each camera using control points
  2. Triangulate for all corresponding points

Illustration

  • Pros
    • Direct solution can be used to find gross errors
    • For DLT
      • Can serve as an initial guess for BA in case the intrinsics are not known
  • Cons
    • Requires that both cameras observe 6 control points (for DLT) or 3 (4) control points (for P3P)
    • Less accurate than 2-step solution
    • For DLT
      • Control points cannot lie in the same plane
  • One big Least Squares approach
  • Estimate 2 camera’s poses and 3D points at the same time

    Illustration

  • Pros
    • Is a statistically optimal solution
    • Could deal with outliers using robust estimators
  • Cons
    • Requires a good estimation as intial state

Bundle Adjustment

  • A least squares approach to estimating camera poses and 3D points
  • Can be used for 3D Reconstruction (with several image pairs)
  • Key Idea (minimize reprojection error)
    • Start with an initial guess
    • Project the estimated 3D points into the estimated camera images
    • Compare locations of the projected 3D points with measured (2D) ones
    • Adjust to minimize error in the images
      $$
      {}^a \mathbf{x}^\prime_{ij} + {}^a \mathbf{v_{x_{ij}}} = \lambda_{ij} {}^a P_j(\boldsymbol{x}_{ij}, \boldsymbol{p}, \boldsymbol{q}) \mathbf{X}_i
      $$

      Explanation

      • ${}^a \mathbf{x}^\prime_{ij}$: the pixel coordinate of point $i$ in image $j$
      • ${}^a \mathbf{v_{x_{ij}}}$ are the corrections
      • $\lambda_{ij}$ is the scale factor
      • $P_j$ is the projection matrix of camera $j$ containing 6DoF of camera, calibration matrix, non-linear parameters
      • $\mathbf{X}_i$ is the world coordinate of point $i$
      • In order to reduce the number of unknown parameters, we use Euclidean coordinates instead of homogeneous coordinates.
        $$
        {}^a \boldsymbol{x}^\prime_{ij} + {}^a \boldsymbol{v_{x_{ij}}} =
        \frac
        { {}^a P_{1:2j} (\boldsymbol{x_{ij}, p, q}) \mathbf{X_i} }
        { {}^a P_{3j} (\boldsymbol{x_{ij}, p, q}) \mathbf{X_i} }
        $$
  • Optimality
    • BA is statistically optimal
    • Exploits all observations and considers the uncertainties and correlations (all available information)
    • Computes orientations, calibration parameters, and point locations with highest precision
    • Assumes Gaussian noise
    • Requires an initial estimate

Below are some aspects of BA.

Compute absolute orientation through Control Points

  • Using camera images alone, we only obtain a photogrammetric model, which has
    • No absolute scale information
    • Unknown rotation and translation with regard to external reference frame
  • We need to estimate a similarity transform to get the absolute orientation
  • Can be done inside BA

In order to get the absolute orientation, control points (whose world coordinates are known) are required. Control points can be treated as noisy or noise-free.

  • Control points with noise. Statistically optimal approach.
  • Control points without noise (fixed points). Statistically suboptimal approach.
    • Enforces the geometry to match the control points
    • Used if the control points cannot be changed

Two steps BA with Control Points

  1. BA with noisy control points
    1.1 Search for gross errors using statistical tests
    1.2 Eliminate erroneous control points
  2. BA with fixed control points

How many control points are required in reality

  • Only a small number of control points is needed (at least 3 points)
  • Typically full control points at the boundaries
  • Compute relative orientation from the first image pair
  • Compute orientation for subsequent images through P3P/RRS

Critical Issues

  • Gross error handling is essential
  • Requires enough new points in each image overlap
  • No singular/critical configurations

Caused by

  • Wrong correspondences
  • Wrong point measurements

Observations per point

  • At least 4 views to identify the observation with a gross error
  • Observed points from 5 to 6 different views typically yield good estimates

Eliminating Gross Errors

  • Only consider features that can be tracked consistently in all 3-6 images
  • Relative orientation (5-Point algo) combined with RANSAC
  • Robust Kernels (Huber Kernel, Blake-Zisserman Kernel)

BA Calculation Process

  • The state vector $\boldsymbol{x}$ contains the positions of $m$ points and the 6DoF poses of $n$ images. We can arrange the state vector to put all point locations $\boldsymbol{k}$ together and all 6DoF poses $\boldsymbol{t}$ together.

$$
\boldsymbol{x} = [\boldsymbol{k}_0, \cdots, \boldsymbol{k}_i, \cdots, \boldsymbol{k}_m, \boldsymbol{t}_0, \cdots, \boldsymbol{t}_j, \cdots, \boldsymbol{t}_n]^T
$$

  • For each observation ${}^a \boldsymbol{x}_{ij}$, which is the pixel coordinate of the $i$-th point in the $j$-th camera, we have an observation function

$$
{}^a \boldsymbol{x_{ij}} = f_{ij} (\boldsymbol{x}) = g_{ij} (\boldsymbol{k}_i, \boldsymbol{t}_j)
$$

  • We solve for $\boldsymbol{x}$ by minimizing the total sum of squared errors. See the Least Square Section for more information.

$$
\begin{split}
\boldsymbol{x}^*
&= \arg\min_{\boldsymbol{x}} F(\boldsymbol{x}) \\
&= \arg\min_{\boldsymbol{x}} \sum_{ij} e_{ij}(\boldsymbol{x}) \\
&= \arg\min_{\boldsymbol{x}} \sum_{ij} \boldsymbol{e_{ij}}(\boldsymbol{x})^T \Omega_{ij} \boldsymbol{e_{ij}}(\boldsymbol{x}) \\
&= \arg\min_{\boldsymbol{x}} \sum_{ij} ({}^a \boldsymbol{x}^\prime_{ij} - f_{ij} (\boldsymbol{x}))^T \Omega_{ij} ({}^a \boldsymbol{x}^\prime_{ij} - f_{ij} (\boldsymbol{x}))
\end{split}
$$

  • To find the minimum of $F(\boldsymbol{x})$, we set the first derivative of the linearized $F(\boldsymbol{x+\Delta x})$ to 0, then use $\boldsymbol{\Delta x}^{*}$ to update the state $\boldsymbol{x}$
    $$
    \begin{split}
    e_{ij}(\boldsymbol{x + \Delta x})
    &= \boldsymbol{e_{ij}}(\boldsymbol{x+\Delta x})^T \Omega_{ij} \boldsymbol{e_{ij}}(\boldsymbol{x+\Delta x}) \\
    &\approx (\boldsymbol{e_{ij}}(\boldsymbol{x}) + \boldsymbol{A_{ij}^T \Delta x})^T \Omega_{ij} (\boldsymbol{e_{ij}}(\boldsymbol{x}) + \boldsymbol{A_{ij}^T \Delta x}) \\
    &= \boldsymbol{e_{ij}}^T (\boldsymbol{x}) \Omega_{ij} \boldsymbol{e_{ij}} (\boldsymbol{x}) + 2 \boldsymbol{e_{ij}}^T (\boldsymbol{x}) \Omega_{ij} \boldsymbol{A_{ij}^T \Delta x} + \boldsymbol{\Delta x}^T \boldsymbol{A_{ij}} \Omega_{ij} \boldsymbol{A_{ij}^T} \boldsymbol{\Delta x} \\
    &= c_i + 2\boldsymbol{b_i^T \Delta x} + \boldsymbol{\Delta x}^T \boldsymbol{H}_i \boldsymbol{\Delta x}
    \end{split}
    $$

Formation of $\boldsymbol{A_{ij}^T}$

$\boldsymbol{A_{ij}^T}$ is the Jacobian of $-\boldsymbol{f_{ij}}$ (i.e., of the error $\boldsymbol{e_{ij}}$), which only depends on $\boldsymbol{k}_i$ and $\boldsymbol{t}_j$

$$
\boldsymbol{A_{ij}}^T =
[0, \cdots, 0, C_{ij}^T, 0, \cdots, 0 \ | \ 0, \cdots, 0, B_{ij}^T, 0, \cdots, 0 ]
$$

$\boldsymbol{A_{ij}}^T$ has size $2\times (3m+6n)$, $C_{ij}^T$ has size $2\times 3$, and $B_{ij}^T$ has size $2\times 6$

$$
\begin{split}
F(\boldsymbol{x+\Delta x})
&= \sum_{ij}e_{ij}(\boldsymbol{x + \Delta x}) \\
&\approx \sum_{ij} (c_i + 2\boldsymbol{b_i^T \Delta x} + \boldsymbol{\Delta x}^T \boldsymbol{H}_i \boldsymbol{\Delta x}) \\
&= F(\boldsymbol{x}) + 2\mathbf{b}^T \boldsymbol{\Delta x} + \boldsymbol{\Delta x}^T \mathbf{H} \boldsymbol{\Delta x}
\end{split}
$$

$\mathbf{b}^T$ and $\mathbf{H}$

$$
\begin{matrix}
\mathbf{b}^T = \sum_{ij} \boldsymbol{e_{ij}}^T \Omega_{ij} \mathbf{A_{ij}} \\
\mathbf{H} = \sum_{ij} \mathbf{A_{ij}} \Omega_{ij} \mathbf{A_{ij}}^T
\end{matrix}
$$

$\quad\quad$ Set the first derivative to 0 to get the minimum $F(\boldsymbol{x + \Delta x})$, we get
$$
\begin{split}
\frac{\mathrm{d} F(\boldsymbol{x + \Delta x})}{\mathrm{d} \boldsymbol{\Delta x}} &= \mathbf{0} \\
\mathbf{H\Delta x} + \mathbf{b} &= \mathbf{0} \\
\mathbf{\Delta x} &= -\mathbf{H}^{-1}\mathbf{b}
\end{split}
$$

  • In reality, the size of $\mathbf{H}$ is very large, so it is expensive to compute $\mathbf{H}^{-1}$ and solve for $\Delta x$ directly.
  • We stack all observations into a vector $\boldsymbol{l}$ and define the new squared error

$$
\begin{split}
F(\boldsymbol{x})
& = \boldsymbol{e}(\boldsymbol{x})^T \Omega \boldsymbol{e}(\boldsymbol{x}) \\
& = (\boldsymbol{l} - f(\boldsymbol{x}))^T \Omega (\boldsymbol{l} - f(\boldsymbol{x}))
\end{split}
$$

  • Set the first derivative of the linearized $F(\boldsymbol{x+\Delta x})$ to 0 and get $\boldsymbol{\Delta x}$

$$
\begin{split}
F(\boldsymbol{x+\Delta x})
& \approx (\boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{A} \boldsymbol{\Delta x})^T \Omega (\boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{A} \boldsymbol{\Delta x}) \\
& = \boldsymbol{e}(\boldsymbol{x})^T \Omega \boldsymbol{e}(\boldsymbol{x}) + 2 \boldsymbol{e}(\boldsymbol{x})^T \Omega \boldsymbol{A} \boldsymbol{\Delta x} + \boldsymbol{\Delta x}^T \boldsymbol{A^T} \Omega \boldsymbol{A} \boldsymbol{\Delta x}
\end{split}
$$

$$
\begin{matrix}
& \frac{\mathrm{d} F(\boldsymbol{x + \Delta x})}{\mathrm{d} \boldsymbol{\Delta x}} = \mathbf{0} \\
\rightarrow & 2 \boldsymbol{A^T} \Omega \boldsymbol{e}(\boldsymbol{x}) + 2 \boldsymbol{A^T} \Omega \boldsymbol{A} \boldsymbol{\Delta x} = \mathbf{0} \\
\rightarrow & N \boldsymbol{\Delta x} = -\boldsymbol{A^T} \Omega \boldsymbol{e}(\boldsymbol{x}) = \boldsymbol{h}
\end{matrix}
$$

  • We analyze the structure of $\boldsymbol{A}$ and $\boldsymbol{N}$: the matrix $\boldsymbol{A}$ is stacked from the $\boldsymbol{A^T_{ij}}$, and it can be divided into a C block (built from the $\boldsymbol{C^T_{ij}}$) and a B block (built from the $\boldsymbol{B^T_{ij}}$).

$$
\boldsymbol{A} \Delta\boldsymbol{x} =
\left[\begin{matrix}
\boldsymbol{A^T_{00}} \\
\boldsymbol{A^T_{01}} \\
\vdots \\
\boldsymbol{A^T_{mn}}
\end{matrix}\right] \boldsymbol{\Delta x} =
[C \ | \ B]
\left[\begin{matrix}
\Delta\boldsymbol{k} \\
\Delta\boldsymbol{t}
\end{matrix}\right]
= C \Delta\boldsymbol{k} + B \Delta\boldsymbol{t}
$$

$$
\mathbf{N} = \boldsymbol{A}^T \Omega \boldsymbol{A}
=\left[\begin{matrix}
C^T \Omega C & C^T \Omega B \\
B^T \Omega C & B^T \Omega B
\end{matrix}\right]
$$

  • Solve $\boldsymbol{\Delta t}$ for the camera poses
    According to the structure of $\mathbf{N}$, the block $\mathbf{N_{kk}} = \boldsymbol{C^T} \Omega\ \boldsymbol{C}$ is block-diagonal with $3\times 3$ blocks, thus it is easy to compute $\mathbf{N}_{kk}^{-1}$. The reduced camera matrix $\overline{\mathbf{N}}_{tt} = \mathbf{N}_{tt} - \mathbf{N}_{tk}\mathbf{N}_{kk}^{-1}\mathbf{N}_{kt}$ (the Schur complement of $\mathbf{N}_{kk}$) has small dimensions and is sparse, so $\boldsymbol{\Delta t}$ can be solved with a **sparse solver** (a dense sketch of this elimination follows below).

Illustraion

  • Solve $\boldsymbol{\Delta k}$ for points locations

$$
\boldsymbol{\Delta k} = \mathbf{N}_{kk}^{-1} (\boldsymbol{h}_k - \mathbf{N}_{kt}\Delta\boldsymbol{t})
$$

  • In real calculation
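A dense toy sketch of the elimination described above (my own illustration; real bundle adjustment code exploits the block-diagonal structure of $\mathbf{N}_{kk}$ and keeps every matrix sparse):

```python
import numpy as np

def solve_ba_step(N_kk, N_kt, N_tt, h_k, h_t):
    """Solve N [dk; dt] = [h_k; h_t] by eliminating the point block first (Schur complement)."""
    N_kk_inv = np.linalg.inv(N_kk)                # block-diagonal 3x3 blocks in practice
    N_tt_bar = N_tt - N_kt.T @ N_kk_inv @ N_kt    # reduced camera system
    h_t_bar = h_t - N_kt.T @ N_kk_inv @ h_k
    dt = np.linalg.solve(N_tt_bar, h_t_bar)       # camera pose update
    dk = N_kk_inv @ (h_k - N_kt @ dt)             # back-substitute the point update
    return dk, dt
```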

Summary

SLAM

  • Localize and map simultaneously
    • Background: a term from the robotics community, allowing a robot to build a map to navigate with
    • Takes: sequences of sensor measurements from cameras, RGB-D sensors, lidars, sonars
    • Localization task: estimating the location and heading of the sensor/robot
    • Mapping task: building a map of the environment
  • Structure of a SLAM system
    • Front end: transforms raw sensor data into intermediate data
    • Back end: optimizes the intermediate data
    • Others
      • Loop closure detection
      • Map
  • BA and SLAM backend

    Bundle Adjustment is Full SLAM using a camera, minimizing the reprojection error of features, with no motion model
  • Definition of SLAM

  • Full SLAM and Online SLAM

  • Commonly used SLAM backend methods
    • Kalman filter (EKF SLAM)
    • Particle filter
    • Least square (Graph-SLAM)

More info about SLAM could be found in this post.

Orthophoto

  • Motivation: an orthophoto is an image of a surface in orthogonal parallel projection. It is important for aligning aerial images with maps, and it allows measuring distances in the orthophoto.
    • e.g. satellite images
    • Normal photo and orthophoto

  • Given: aerial image(s), intrinsics and extrinsics of the camera, 3D information of the scene (DSM, Digital Surface Model)
    • DSM and orthophoto

  • Output: the input image(s) as if acquired with an orthogonal parallel projection
  • Method Illustration

  • Interpolation

Image Stitching

An orthophoto is affected by occlusion: some areas that should appear in the orthophoto cannot be seen in the normal photos, leading to holes. Therefore, we need to stitch multiple normal images to construct the true orthophoto.

Compute Projection Matrix of Planar

When the surface is a plane, we can compute the projection matrix directly without knowing the cameras’ intrinsics and extrinsics. This simplifies the requirements for computing the orthophoto.

Methods

Reference

  1. Cyrill Stachniss’s Photogrammetric Computer Vision course