Multiple View Geometry in Computer Vision

Notes on the book *Multiple View Geometry in Computer Vision*.

Basic Concepts

Different spaces

  • Euclidean space $\mathbf{R}^n$
    • the familiar 2D, 3D, n-D spaces
    • parallel lines do not intersect
  • Projective space $\mathbf{P}^n$
    • Euclidean space augmented with points at infinity
    • any two distinct lines intersect (parallel lines meet at a point at infinity)

Euclidean coordinates $\rightarrow$ homogeneous coordinates (coordinates in projective space): append an extra dimension

Different geometries

  • Euclidean geometry $\rightarrow$ Projective geometry: add points / a line at infinity
  • Projective geometry $\rightarrow$ Affine geometry: the projective plane with a distinguished line (the line at infinity)
  • Projective geometry $\rightarrow$ Euclidean geometry: the projective plane with a line at infinity, plus 2 distinguished points called the circular points (2D) or an absolute conic (3D)

Different transformations

In Euclidean space

  • Euclidean transformation: translate and rotate the space to a different position
  • Linear transformation: rotation and stretching (the origin stays fixed)
  • Affine transformation: a linear transformation plus a translation of the origin, i.e. rotating, then stretching linearly by different ratios in different directions, then translating

Common property: points at infinity remain at infinity

In Projective space

  • Euclidean transformation: any projective transformation that maps the distinguished line of one space to the distinguished line of the other, and the circular points / absolute conic of one space to those of the other
  • Affine transformation: any projective transformation that maps the distinguished line (line at infinity) of one space to the distinguished line of the other
  • Projective transformation: an arbitrary invertible linear mapping of the homogeneous coordinates. The line at infinity need not map to the line at infinity of the other space

Camera projections

The most general imaging projection is represented by an arbitrary $3 \times 4$ matrix of rank 3, acting on the homogeneous coordinates of a point in 3D projective space $\mathbf{P}^3$ and mapping it to the imaged point in 2D projective space $\mathbf{P}^2$. The matrix $P$ is called the camera projection matrix, or camera matrix for short.

$$
\left[\begin{matrix}
x \\
y \\
w
\end{matrix}\right] = P_{3\times 4}
\left[\begin{matrix}
X \\
Y \\
Z \\
T
\end{matrix}\right]
$$
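As a minimal sketch (NumPy, with a made-up $P$ whose values are purely illustrative), the projection is a single matrix–vector product followed by dehomogenization:

```python
import numpy as np

# Illustrative P (an arbitrary rank-3 matrix; the values are made up).
P = np.array([[800.0,   0.0, 320.0, 0.0],
              [  0.0, 800.0, 240.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])

X = np.array([1.0, 2.0, 10.0, 1.0])  # homogeneous world point (X, Y, Z, T)

x = P @ X                            # homogeneous image point (x, y, w)
print(x[:2] / x[2])                  # Euclidean pixel coordinates
```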

Reconstruction from images

  • Projective reconstruction
    • Reconstruct the scene with uncalibrated cameras
    • Result is up to projective transformation (projective ambiguity)
    • Fundamental matrix for 2 views, trifocal tensor for 3 views
  • Euclidean reconstruction
    • Reconstruct the scene with calibrated cameras
    • Result has the correct Euclidean structure and shape

2D Projective Geometry

Homogeneous expression

| Entity | Homogeneous expression | Euclidean expression | DoF |
| --- | --- | --- | --- |
| Point | $(x, y, z)^T$ | $(\frac{x}{z}, \frac{y}{z})^T$ | 2 |
| Line | $(a, b, c)^T$, representing $ax_1 + bx_2 + cx_3 = 0$ | $ax + by + c = 0$ | 2 |
| Conic | $\left[\begin{matrix} a & b/2 & d/2 \\ b/2 & c & e/2 \\ d/2 & e/2 & f \end{matrix}\right]$, representing $a x_1^2 + bx_1 x_2 + c x_2^2 + dx_1 x_3 + ex_2 x_3 + f x_3^2 = 0$ | $ax^2 + bxy + cy^2 + dx + ey + f = 0$ | 5 |
  • A point lies on a line: $\mathbf{l}^T x = 0$, where $\mathbf{l}, x$ are column vectors in homogeneous coordinates
  • The line passing through 2 points: $\mathbf{l} = x \times y$
  • A point lies on a conic: $x^T C x = 0$ (see the sketch below)
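A small sketch of these incidence relations; the specific points, lines, and conic are arbitrary examples:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.0])   # point (1, 2) in homogeneous coordinates
y = np.array([3.0, 0.0, 1.0])   # point (3, 0)

l = np.cross(x, y)               # line through the two points
assert abs(l @ x) < 1e-9 and abs(l @ y) < 1e-9   # both points lie on l

# Dually, the intersection point of two lines is their cross product.
m = np.array([0.0, 1.0, -1.0])   # the line y = 1
p = np.cross(l, m)               # homogeneous intersection of l and m

C = np.diag([1.0, 1.0, -4.0])    # the circle x^2 + y^2 = 4
q = np.array([2.0, 0.0, 1.0])    # point (2, 0)
assert abs(q @ C @ q) < 1e-9     # the point lies on the conic
```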

Projective space and transformation

Projective space $\mathbf{P}^2$ consists of the ordinary points of $\mathbf{R}^2$, with homogeneous coordinates $(x_1, x_2, x_3)^T, x_3 \neq 0$, and the ideal points $(x_1, x_2, 0)^T$.

All ideal points lie on the same line, the line at infinity: $(0, 0, 1)^T$

A projective transformation is an invertible mapping $h$ from $\mathbf{P}^2$ to itself such that points $x_1, x_2, x_3$ lie on the same line if and only if $h(x_1), h(x_2), h(x_3)$ do. It can be represented as

$$
\left[\begin{matrix}
x_1^\prime \\
x_2^\prime \\
x_3^\prime
\end{matrix}\right] =
\left[\begin{matrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{matrix}\right]
\left[\begin{matrix}
x_1 \\
x_2 \\
x_3
\end{matrix}\right]
$$

Matrix $H$ is a non-singular homogeneous matrix (defined up to scale), which gives it 8 DoF.

Projected points, lines and conics

With projective transformation $H$,

  • Points: $x^\prime = H x$
  • Lines: $l^\prime = H^{-T} l$
  • Conics: $C^\prime = H^{-T} C H^{-1}$ (a quick numerical check follows)
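A quick numerical check of these rules, with a random non-singular $H$ (the point, line, and conic are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3))          # almost surely non-singular
H_invT = np.linalg.inv(H).T

x = np.array([1.0, 2.0, 1.0])
l = np.array([1.0, -1.0, 1.0])           # a line through x: 1 - 2 + 1 = 0

x2 = H @ x                               # points transform covariantly
l2 = H_invT @ l                          # lines transform contravariantly
assert abs(l2 @ x2) < 1e-9               # incidence l^T x = 0 is preserved

C = np.diag([1.0, 1.0, -4.0])            # the circle x^2 + y^2 = 4
C2 = H_invT @ C @ np.linalg.inv(H)       # the transformed conic
```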

Different transformations

| Transformation | Matrix | DoF | Determined by | Invariant properties | Extra |
| --- | --- | --- | --- | --- | --- |
| Euclidean | $\left[\begin{matrix} R & t \\ 0^T & 1 \end{matrix}\right]$ | 3 | 2 point pairs | length, area | $R$ is an orthogonal matrix, $RR^T = I$, with only 1 DoF in 2D |
| Similarity | $\left[\begin{matrix} sR & t \\ 0^T & 1\end{matrix}\right]$ | 4 | 2 point pairs | ratio of lengths, angle, circular points | $s$ is a scale factor |
| Affine | $\left[\begin{matrix} A & t \\ 0^T & 1\end{matrix}\right]$ | 6 | 3 point pairs, not collinear | parallelism, ratio of areas, line at infinity | $A = R(\theta)R(-\phi)DR(\phi)$, $D = \text{diag}(d_1, d_2)$. Compared with $sR$, $A$ has 2 more degrees of freedom: the ratio $d_1/d_2$ and the angle $\phi$ |
| Projective | $\left[\begin{matrix} A & t \\ \mathbf{v}^T & v \end{matrix}\right]$ | 8 | 4 point pairs, no 3 collinear | concurrency, collinearity, order of contact | matrix defined up to scale |

An affine transformation maps points at infinity to points at infinity, while a projective transformation may map points at infinity to finite points (and vice versa).

Projective transformation can be decomposed into a chain of transformations, where each matrix in the chain represents a transformation higher in the hierarchy than the previous one.
$$
H = H_s H_A H_P =
\left[\begin{matrix} sR & t \\ 0^T & 1 \end{matrix}\right]
\left[\begin{matrix} K & 0 \\ 0^T & 1 \end{matrix}\right]
\left[\begin{matrix} I & 0 \\ \mathbf{v}^T & v \end{matrix}\right] =
\left[\begin{matrix} A & t \\ \mathbf{v}^T & v \end{matrix}\right]
$$
where $v \neq 0$ and $K$ is an upper-triangular matrix normalized so that $\det K = 1$.

Recover affine properties from image

Affine rectification: given an image (II) that is a projective transformation of a real-world plane (I), compute the image (III) that is an affine transformation of the world plane, in which world-parallel lines appear parallel

This can be done by mapping the vanishing line $(l_1, l_2, l_3)^T$ back to the line at infinity $(0, 0, 1)^T$ (assuming $l_3 \neq 0$) with
$$
\left[\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
l_1 & l_2 & l_3
\end{matrix}\right]
$$

In order to compute the vanishing line, there are several options:

  1. Use 2 pairs of lines that are parallel in the real world to compute 2 intersection points. These 2 intersections lie on the vanishing line (see the sketch below).
  2. Use 3 points on a line with known length ratios to compute the vanishing point via a 1D projective transformation, and find another vanishing point from another line. The 2 vanishing points then determine the vanishing line.
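A sketch of option 1, assuming the four imaged lines are given in homogeneous form and the recovered vanishing line has $l_3 \neq 0$; `affine_rectify` is an illustrative name:

```python
import numpy as np

def affine_rectify(l0a, l0b, l1a, l1b):
    """(l0a, l0b) and (l1a, l1b) image two pairs of world-parallel lines."""
    v0 = np.cross(l0a, l0b)        # vanishing point of the first direction
    v1 = np.cross(l1a, l1b)        # vanishing point of the second direction
    l_inf = np.cross(v0, v1)       # imaged line at infinity (l1, l2, l3)
    l_inf = l_inf / l_inf[2]       # assumes l3 != 0
    # H maps the vanishing line back to (0, 0, 1)^T
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [l_inf[0], l_inf[1], l_inf[2]]])
```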

Recover similarity properties from image

Metric rectification: compute the image that is a similarity transformation of the real-world plane, in which parallel lines appear parallel and angles are correct (e.g. orthogonal lines appear orthogonal)

This can be done by transforming the circular points to their canonical position $(1, \pm i, 0)^T$. The dual conic $C_{\infty}^{\star}$ is formed from these 2 points.

Recover similarity properties from projective images

  • The dual conic can be written $C_{\infty}^{\star} = U \left[\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{matrix}\right] U^T$, and the projectivity $U$ achieves the metric rectification (up to similarity).
  • Choose 2 lines $l, m$ that should be orthogonal in the real world; then $l^T C_{\infty}^{\star} m = 0$ gives 1 linear constraint. Since $C_{\infty}^{\star}$ has 5 DoF, 5 orthogonal line pairs are needed.

Recover similarity properties from images after affine rectification

Each orthogonal line pair gives a constraint on $C_{\infty}^{\star}$, which now has the form
$$
l^T \left[\begin{matrix}
KK^T & 0 \\
0 & 0
\end{matrix}\right] m = 0
$$
Since $KK^T$ is a symmetric $2 \times 2$ matrix defined up to scale, it has only 2 DoF, so 2 orthogonal line pairs suffice, as in the sketch below.
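A sketch of this 2-pair solve on an affine-rectified image; `metric_rectify` is an illustrative name, and the Cholesky factor is determined only up to an orthogonal factor, which is harmless since the result is up to similarity anyway:

```python
import numpy as np

def metric_rectify(line_pairs):
    """line_pairs: iterable of (l, m) homogeneous lines, orthogonal in the world."""
    # Each pair gives one linear constraint on S = KK^T (symmetric, up to scale):
    #   l1*m1*s11 + (l1*m2 + l2*m1)*s12 + l2*m2*s22 = 0
    A = np.array([[l[0]*m[0], l[0]*m[1] + l[1]*m[0], l[1]*m[1]]
                  for l, m in line_pairs])
    _, _, Vt = np.linalg.svd(A)
    s11, s12, s22 = Vt[-1]                 # null vector of the 2 x 3 system
    S = np.array([[s11, s12], [s12, s22]])
    if S[0, 0] < 0:
        S = -S                             # S must be positive definite
    K = np.linalg.cholesky(S)              # S = K K^T
    H = np.eye(3)
    H[:2, :2] = np.linalg.inv(K)           # undoes the remaining affine distortion
    return H
```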

3D Projective Geometry

Homogeneous expression and projective transformation

| Entity | Homogeneous coordinates | DoF | Under projective transformation $H$ |
| --- | --- | --- | --- |
| Point | $(x_1, x_2, x_3, x_4)^T$ | 3 | $x^\prime = H x$ |
| Plane | $(\pi_1, \pi_2, \pi_3, \pi_4)^T$, representing $\pi_1 x_1 + \pi_2 x_2 + \pi_3 x_3 + \pi_4 x_4 = 0$ | 3 | $\pi^\prime = H^{-T} \pi$ |
| Line | $L = AB^T - BA^T$ (Plücker matrix representation), where $A, B$ are points and $L$ is a skew-symmetric homogeneous matrix | 4 | $L^\prime = HLH^T$ |
| Quadric | a symmetric $4\times 4$ matrix $Q$ | 9 | $Q^\prime = H^{-T} Q H^{-1}$ |
  • A point lies on a plane: $\mathbf{\pi}^T x = 0$
  • A point lies on a quadric: $x^T Q x = 0$

Different transformations

Similar to the transformations in 2D, 3D has the corresponding hierarchy of transformations.

  • The Euclidean transformation can be partitioned into: 1) a rotation about the screw axis, which accounts for the rotation and the translation component $t_{\perp}$ perpendicular to the axis, and 2) a translation $t_{\parallel}$ along the screw axis

Recover properties

  • An affine transformation maps the plane at infinity $\pi_{\infty}$ to the plane at infinity of the other space. Mapping the identified plane at infinity back to its canonical position $(0, 0, 0, 1)^T$ recovers affine properties
  • A similarity transformation maps the absolute conic $\Omega_{\infty}$, a (point) conic on the plane at infinity, to the absolute conic of the other space. Metric properties can be recovered once the absolute conic is identified

Estimation - 2D projective transformation

Given point pairs $(x_i, x_i^\prime)$ in 2 projective planes, compute the projective transformation $H$, where $x_i^\prime = Hx_i$.

Since $H$ has 8 DoF and each point pair contributes 2 constraints, at least 4 point pairs are required, with no 3 of the points collinear.

Similarly, 4 line pairs, or mixed configurations such as 3 points + 1 line or 1 point + 3 lines, can also determine $H$.

Direct Linear Transformation (DLT) algorithm

DLT transforms the relations $x_i^\prime = Hx_i, i = 1, 2, \dots, n$ into a linear system $Ah = 0$, where $A$ is a $3n \times 9$ matrix (or $2n \times 9$, keeping only the 2 independent equations per point).

Minimal solution

If $n = 4$, then $A$ is $8 \times 9$ and $h$ spans its 1-dimensional null space, which can be found by Gaussian elimination.

Over-determined solution

If $n > 4$, the equations are over-determined, and $h$ is found by minimizing the algebraic distance $||Ah||$ subject to $||h|| = 1$. The solution is the unit singular vector corresponding to the smallest singular value of $A$: if $A=UDV^T$ with the diagonal entries of $D$ positive and arranged in descending order, then $h$ is the last column of $V$.

Before running DLT, a data-normalization step is essential for less well-conditioned problems, such as the DLT computation of the fundamental matrix or the trifocal tensor. The steps are as follows:

  1. Translate the points (in each image independently) so that their centroid is at the origin
  2. Scale the points so that the average distance from the origin is $\sqrt{2}$; steps 1-2 give normalizing similarities $T$ and $T^\prime$ for the two images
  3. Run DLT on the normalized correspondences to obtain $\tilde{H}$, then denormalize: $H = T^{\prime -1} \tilde{H} T$
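A sketch of the normalized DLT under these steps; `normalize` and `dlt_homography` are illustrative names, and the inputs are $n \times 2$ NumPy arrays of inhomogeneous correspondences:

```python
import numpy as np

def normalize(pts):
    """Similarity T: centroid to the origin, mean distance to sqrt(2)."""
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    return np.array([[s, 0, -s * c[0]],
                     [0, s, -s * c[1]],
                     [0, 0, 1]])

def dlt_homography(x, xp):
    T, Tp = normalize(x), normalize(xp)
    xh = (T @ np.column_stack([x, np.ones(len(x))]).T).T    # normalized x_i
    xph = (Tp @ np.column_stack([xp, np.ones(len(xp))]).T).T
    rows = []
    for (u, v, w), (up, vp, wp) in zip(xh, xph):
        rows.append([0, 0, 0, -wp*u, -wp*v, -wp*w, vp*u, vp*v, vp*w])
        rows.append([wp*u, wp*v, wp*w, 0, 0, 0, -up*u, -up*v, -up*w])
    _, _, Vt = np.linalg.svd(np.asarray(rows))   # 2n x 9 system A h = 0
    Hn = Vt[-1].reshape(3, 3)                    # smallest singular vector
    H = np.linalg.inv(Tp) @ Hn @ T               # denormalize
    return H / H[2, 2]
```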

Minimizing algebraic distance

The computation of minimizing algebraic distance (like DLT algorithm) is cheap but the quantity that is minimized is not geometrically or statistically meaningful.

Often solutions based on algebraic distance are used as a starting point for a non-linear minimization of a geometric or statistical cost function. The non-linear minimization gives the solution a final “polish”.

Minimizing Geometric distance

Apart from minimizing the algebraic distance, another option is to minimize the geometric distance.

When the transformation is affine, the algebraic distance is the same as the geometric distance.

One image transfer error

If the error exists only in the second image, the cost function to minimize is
$$
\sum_i d(x_i^\prime, H \bar{x}_i)^2
$$
where $x_i^\prime$ are the measured point coordinates in the 2nd image, $\bar{x}_i$ are the true point coordinates in the 1st image, and $d$ is the Euclidean distance between 2 points.

This is the MLE for the one-image transfer error, assuming the errors between the observed and estimated points in the 2nd image are Gaussian.

  • Parametrization
    • parameters (9): the vector $h$ of entries of the homography matrix $H$
    • measurements (2n): vector $X = [x_1^\prime, \cdots, x_n^\prime]^T$, made up of the inhomogeneous coordinates of the points $x_i^\prime$
  • Function specification
    $$
    f : h \rightarrow (Hx_1, Hx_2, \cdots , Hx_n)
    $$
    where $Hx_i$ indicates the inhomogeneous coordinates.
  • Optimization target
    $$
    \arg\min_h ||X − f(h)||^2
    $$

Symmetric transfer error

If errors exist in both images, the cost function to minimize is
$$
\sum_i d(x_i^\prime, H x_i)^2 + d(x_i, H^{-1} x_i^\prime)^2
$$

  • Parametrization
    • parameters (9): the vector $h$ of entries of the homography matrix $H$
    • measurements (4n): vector $X = [x_1, \cdots, x_n, x_1^\prime, \cdots, x_n^\prime]^T$, made up of the inhomogeneous coordinates of the points $x_i$ and $x_i^\prime$.
  • Function specification
    $$
    f : h \rightarrow (H^{-1}x_1^\prime, H^{-1}x_2^\prime, \cdots , H^{-1}x_n^\prime, Hx_1, Hx_2, \cdots , Hx_n)
    $$
  • Optimization target
    $$
    \arg\min_h ||X − f(h)||^2
    $$

Reprojection error

Estimating a “correction” for each correspondence.
$$
\sum_i d(x_i, \hat{x}_i)^2 + d(x_i^\prime, \hat{x}_i^\prime)^2 \quad \text{subject to} \quad \hat{x}_i^\prime=\hat{H}\hat{x}_i
$$

Minimizing this cost function involves determining both $\hat{H}$ and a set of subsidiary correspondences $\hat{x}_i, \hat{x}_i^\prime$.

This is the MLE for the two-image case, assuming the errors between the observed and estimated points in both images are Gaussian.

  • Parametrization
    • parameters (2n + 9): the vector $h$ of entries of the homography matrix $H$, and also the coordinates of points $\hat{x_i}$
    • measurements (4n): $X = [x_1, \cdots, x_n, x_1^\prime, \cdots, x_n^\prime]^T$, made up of the inhomogeneous coordinates of the points $x_i$ and $x_i^\prime$
  • Function specification
    $$
    f : (h, \hat{x}_1, \cdots, \hat{x}_n) \rightarrow (\hat{x}_1, \hat{x}_1^\prime, \cdots, \hat{x}_n, \hat{x}_n^\prime)
    $$
    where $\hat{x}_i^\prime = H\hat{x}_i$
  • Optimization target
    $$
    \arg\min_{h,\hat{x}_1, \cdots, \hat{x}_n} ||X − f(h, \hat{x}_1, \cdots, \hat{x}_n)||^2
    $$

Robust estimation

To deal with outliers in the data, the following methods are commonly used (a RANSAC sketch follows the list).

  • RANSAC: see this post
  • Robust cost function to limit the error scale: see the outliers tab of this post
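A minimal RANSAC sketch for homography fitting, reusing the `dlt_homography` sketch from the estimation section; the iteration count and the inlier threshold (in pixels) are illustrative, not prescriptive:

```python
import numpy as np

def ransac_homography(x, xp, iters=1000, thresh=3.0, seed=0):
    rng = np.random.default_rng(seed)
    best_H, best_in = None, np.zeros(len(x), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(x), 4, replace=False)     # minimal sample
        H = dlt_homography(x[idx], xp[idx])
        proj = (H @ np.column_stack([x, np.ones(len(x))]).T).T
        err = np.linalg.norm(proj[:, :2] / proj[:, 2:3] - xp, axis=1)
        inliers = err < thresh                         # one-image transfer error
        if inliers.sum() > best_in.sum():
            best_H, best_in = H, inliers
    # final "polish": re-estimate from all inliers
    return dlt_homography(x[best_in], xp[best_in]), best_in
```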

Conclusion - automatic computation of a homography

Camera Models

Different camera models

A camera is a mapping between the 3D world (object space) and a 2D image. The principal camera of interest is central projection.

The specialized models fall into two major classes – those that model cameras with a finite centre, and those that model cameras with centre “at infinity”. Of the cameras at infinity the affine camera is of particular importance because it is the natural generalization of parallel projection.

Pinhole camera model

The pinhole camera model maps a point $(X, Y, Z)^T$ in the camera's Euclidean coordinate frame to $(f\frac{X}{Z}, f\frac{Y}{Z}, f)^T$ on the image plane, where $f$ is the focal length.

Dropping the last coordinate, the mapping $(X, Y, Z)^T \rightarrow (f\frac{X}{Z}, f\frac{Y}{Z})^T$ goes from Euclidean 3-space $\mathbf{R}^3$ to Euclidean 2-space $\mathbf{R}^2$.

  • Camera centre, optical centre: the centre of projection is called the camera centre
  • Principal axis, principal ray: the line from the camera centre perpendicular to the image plane
  • Principal point: the point where the principal axis meets the image plane
  • Principal plane: the plane through the camera centre parallel to the image plane

Finite camera

| Projection hierarchy | Projection equations | Projection matrix | DoF | Description |
| --- | --- | --- | --- | --- |
| Central projection in homogeneous coordinates | $\left[\begin{matrix} fX \\ fY \\ Z \end{matrix}\right] = \left[\begin{matrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{matrix}\right] \left[\begin{matrix} X \\ Y \\ Z \\ 1 \end{matrix}\right]$ | $P = \text{diag}(f, f, 1)[I \mid 0]$ | 1 | This is the $\mathbf{P}^3 \rightarrow \mathbf{P}^2$ mapping. Note: the homogeneous coordinate $(fX, fY, Z)^T$ in the image plane is the same as the Euclidean coordinate $(f\frac{X}{Z}, f\frac{Y}{Z})^T$ |
| With principal point offset | $\left[\begin{matrix} fX + Zp_x \\ fY + Zp_y \\ Z \end{matrix}\right] = \left[\begin{matrix} f & 0 & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{matrix}\right] \left[\begin{matrix} X \\ Y \\ Z \\ 1 \end{matrix}\right]$ | $P = K[I \mid 0]$, with $K = \left[\begin{matrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{matrix}\right]$ | 3 | The origin of the image plane may not coincide with the principal point, so the offset must be accounted for: $(X, Y, Z)^T \rightarrow (f\frac{X}{Z} + p_x, f\frac{Y}{Z} + p_y)^T$, where $(p_x, p_y)^T$ is the coordinate of the principal point |
| With camera rotation and translation | $x = K[I \mid 0] \left[\begin{matrix} R & -R\widetilde{C} \\ 0^T & 1 \end{matrix}\right] \mathbf{X}$, where $\mathbf{X} = (X, Y, Z, 1)^T$ is the homogeneous world point in the world frame and its camera-frame coordinate is $X_{cam} = R(\widetilde{X} - \widetilde{C})$ | $P = KR[I \mid -\widetilde{C}]$ | 9 | The projection from the world frame $\mathbf{P}^3$ to the image plane $\mathbf{P}^2$ |
| Non-square pixels (CCD camera) | similar to above | $K = \left[\begin{matrix} a_x & 0 & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{matrix}\right]$, with $a_x = fm_x$, $a_y = fm_y$, $x_0 = m_xp_x$, $y_0 = m_yp_y$ | 10 | A CCD camera may have different numbers of pixels per unit distance, $m_x$ and $m_y$, in the image $x$ and $y$ directions; the extra DoF is the aspect ratio $m_x/m_y$ |
| Finite projective camera | $P = KR[I \mid -\widetilde{C}] = M [I \mid -\widetilde{C}] = M [I \mid M^{-1} p_4] = [M \mid p_4]$, where $M = KR$ is non-singular and $p_4 = -M\widetilde{C}$ is the last column of $P$ | $K = \left[\begin{matrix} a_x & s & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{matrix}\right]$ | 11 | $s$ is the skew parameter, non-zero when the image $x$ and $y$ axes are not perpendicular |
| General projective camera | similar to above | $P$ is an arbitrary homogeneous $3\times 4$ matrix of rank 3 | 11 | Still 11 DoF, but the left-hand $3\times 3$ block need not be non-singular. Rank 3 is required because otherwise the range of the mapping would be a line or a point instead of a 2D image |
  • The $3\times 4$ camera projection matrix $P$ represents the projection $x = PX$ (a numeric sketch follows)
  • $K$ is called the camera calibration matrix
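A sketch assembling $P = KR[I \mid -\widetilde{C}]$ from made-up internal and external parameters:

```python
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])                # illustrative calibration

a = np.deg2rad(10.0)
R = np.array([[ np.cos(a), 0.0, np.sin(a)],
              [ 0.0,       1.0, 0.0      ],
              [-np.sin(a), 0.0, np.cos(a)]])         # rotation about the y-axis

C = np.array([0.0, 0.0, -5.0])                       # camera centre C~

P = K @ R @ np.hstack([np.eye(3), -C[:, None]])      # 3 x 4 camera matrix

X = np.array([0.0, 0.0, 5.0, 1.0])                   # homogeneous world point
x = P @ X
print(x[:2] / x[2])                                  # pixel coordinates
```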

The projective camera anatomy

A general projective camera $P = [M \mid p_4]$ maps world points to image points. The meanings of the different parts of the matrix can be read off as follows.

  • The camera centre $C$, in the world coordinate frame, is the null vector of $P$: $PC = 0$
  • The columns $p_1, p_2, p_3$ are the vanishing points of the world $X, Y, Z$ axes, and $p_4$ is the image of the world origin
  • Since the camera centre lies on the principal plane (the plane $p^{3T}$, the third row of $P$), $p^{3T}C = 0$
  • Points on the axis plane $p^{1T}X = 0$ (resp. $p^{2T}X = 0$) map to image points $(0, y, w)^T$ (resp. $(x, 0, w)^T$), i.e. onto the image $y$- or $x$-axis
  • The principal point is the image of the point at infinity $X = (p_{31}, p_{32}, p_{33}, 0)^T$ along the principal axis: $[M \mid p_4]X = M(p_{31}, p_{32}, p_{33})^T$

The camera matrix can be decomposed into a chain: 1) a projective transformation from the world coordinate frame to the camera coordinate frame, 2) a projection from 3D space to the image, and 3) a projective transformation of the image.
$$
P = [3\times 3 \ \text{homography}] \left[\begin{matrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{matrix}\right]
[4\times 4 \ \text{homography}]
$$

Back-projection of points to rays

If we have a point $x$ in the image, the ray it back-projects to in the world coordinate frame can be represented as
$$
X(\lambda) = P^{+}x + \lambda C
$$
where $P^{+} = P^T(PP^T)^{-1}$ is the pseudo-inverse of $P$ (so that $PP^{+} = I$) and $C$ is the homogeneous camera centre.

Specifically, for the finite cameras ($M$ is non-singular), the line is
$$
X(\mu) =
\left[\begin{matrix}
-M^{-1}p_4 \\
1
\end{matrix}\right] +
\mu\left[\begin{matrix}
M^{-1}x \\
0
\end{matrix}\right]
$$

Depth of points

For a finite camera maps a point to the image plane $w(x, y, 1)^T = P(X, Y, Z, T)^T$, then $w$ is the depth of a point $\mathbf{X}$ from the camera centre $C$ in the direction of the principal ray, when the camera matrix is normalized so that $\det M > 0$ and $||m^3|| = 1$.

If the matrix has not been normalized, the depth is
$$
\text{depth}(\mathbf{X};P) = \frac{\text{sign}(\det M)w}{T||m^3||}
$$

Decomposition of the camera matrix

Given a general projective camera $P$, then

The camera centre $C$, for which $PC = 0$, can be obtained from the columns $p_i$ of $P$ by
$$
X = \det([p_2, p_3, p_4]) \quad Y = -\det([p_1, p_3, p_4]) \quad Z = \det([p_1, p_2, p_4]) \quad T = -\det([p_1, p_2, p_3])
$$

For a finite camera, where $P = [M \mid p_4] = [M \mid -M\widetilde{C}]$, the camera orientation and internal parameters can also be found:

$K$ and $R$ are obtained by decomposing $M$ as $M = KR$ using the RQ decomposition (the analogue of the QR decomposition), requiring $K$ to have positive diagonal entries.

$\widetilde{C}$ is then obtained as $-M^{-1}p_4$, which has no scale ambiguity. A sketch of the whole decomposition follows.
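A sketch of this decomposition using SciPy's RQ factorization; the sign fixing enforces the positive-diagonal convention on $K$ and makes $R$ a proper rotation:

```python
import numpy as np
from scipy.linalg import rq

def decompose_camera(P):
    if np.linalg.det(P[:, :3]) < 0:
        P = -P                           # fix the overall sign so det(M) > 0
    M, p4 = P[:, :3], P[:, 3]
    K, R = rq(M)                         # M = K R, K upper triangular
    D = np.diag(np.sign(np.diag(K)))
    K, R = K @ D, D @ R                  # K R is unchanged since D^2 = I
    C = -np.linalg.solve(M, p4)          # camera centre C~ = -M^{-1} p4
    return K / K[2, 2], R, C
```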

Cameras at infinity

These are cameras whose centre lies on the plane at infinity, which means the left-hand $3\times 3$ block of the camera matrix $P$ is singular.

Cameras at infinity can be classified into affine and non-affine cameras; the affine ones are those usually discussed.

Affine camera

An affine camera is one whose camera matrix $P$ has $(0, 0, 0, 1)$ as its last row $P^{3T}$. It is called an affine camera because points at infinity are mapped to points at infinity.

$$
P_{\infty} = \left[\begin{matrix}
K_{2\times 2} & \hat{0} \\
\hat{0}^T & 1
\end{matrix}\right]
\left[\begin{matrix}
\hat{R} & \hat{t} \\
0^T & 1
\end{matrix}\right]
$$
where $K_{2\times 2}$ is a $2\times 2$ upper-triangular matrix, $\hat{R} = (r^{1T}, r^{2T})^T$ is the $2\times 3$ matrix formed by the first 2 rows of a rotation matrix, $\hat{0} = (0, 0)^T$, and $\hat{t} = (-r^{1T}\widetilde{C}, -r^{2T}\widetilde{C})^T$

Compare with finite camera

  • The parallel projection matrix $\left[\begin{matrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}\right]$ replaces the finite camera’s $[I | 0]$
  • The calibration matrix $\left[\begin{matrix} K_{2\times 2} & \hat{0} \\ \hat{0}^T & 1\end{matrix}\right]$ replaces $K$ of a finite camera
  • The principal point is not defined

Hierarchy of affine cameras

| Projection type | Projection matrix | Explanation | DoF |
| --- | --- | --- | --- |
| orthographic projection | $P = \left[\begin{matrix} r^{1T} & t_1 \\ r^{2T} & t_2 \\ 0^T & 1 \end{matrix}\right]$ | $r^{1T}, r^{2T}$ have 6 entries, but as the first 2 rows of a rotation matrix $R$ they are unit vectors orthogonal to each other, contributing 3 constraints, so only 3 unknowns remain. $t_1, t_2$ are the first 2 elements of $t$ | 5 |
| scaled orthographic projection | $P = \left[\begin{matrix} k & 0 & 0 \\ 0 & k & 0 \\ 0 & 0 & 1 \end{matrix}\right] \left[\begin{matrix} r^{1T} & t_1 \\ r^{2T} & t_2 \\ 0^T & 1 \end{matrix}\right] = \left[\begin{matrix} r^{1T} & t_1 \\ r^{2T} & t_2 \\ 0^T & \frac{1}{k} \end{matrix}\right]$ | ~ | 6 |
| weak perspective projection | $P = \left[\begin{matrix} \alpha_x & 0 & 0 \\ 0 & \alpha_y & 0 \\ 0 & 0 & 1 \end{matrix}\right] \left[\begin{matrix} r^{1T} & t_1 \\ r^{2T} & t_2 \\ 0^T & 1 \end{matrix}\right]$ | ~ | 7 |
| affine camera | $P = \left[\begin{matrix} \alpha_x & s & 0 \\ 0 & \alpha_y & 0 \\ 0 & 0 & 1 \end{matrix}\right] \left[\begin{matrix} r^{1T} & t_1 \\ r^{2T} & t_2 \\ 0^T & 1 \end{matrix}\right] = \left[\begin{matrix} m_{11} & m_{12} & m_{13} & t_1 \\ m_{21} & m_{22} & m_{23} & t_2 \\ 0 & 0 & 0 & 1 \end{matrix}\right]$ | $m_{ij}$ are the entries of $M_{2\times 3}$. The $2\times 3$ submatrix $M_{2\times 3}$ has rank 2, and the whole projection matrix $P$ has rank 3 | 8 |

The orthographic projection is the composition of the following matrices:

  • $P = \left[\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}\right]$ maps a point $(X, Y, Z, 1)^T$ to image point $(X, Y, 1)^T$.
  • $H = \left[\begin{matrix} R & t \\ 0^T & 1 \end{matrix}\right]$, where $t = -R\widetilde{C}$, maps a point from world coordinate frame to a camera coordinate frame.

Single View Geometry

The previous section introduced the camera matrix from the viewpoint of point mappings; this section covers the link between other 3D entities (planes, lines, and quadrics) and their images.

Actions of projective cameras

On plane

For points on the plane $\pi$, taken as the world XY-plane, the image point is
$$
x = [\begin{matrix} p_1 & p_2 & p_3 & p_4 \end{matrix}]\left[\begin{matrix}
X \\
Y \\
0 \\
1
\end{matrix}\right] = [\begin{matrix} p_1 & p_2 & p_4 \end{matrix}]\left[\begin{matrix}
X \\
Y \\
1
\end{matrix}\right]
$$
so this mapping is defined by a $3\times 3$ matrix of rank 3.

It indicates that the mapping between a world plane and an image is a general planar homography (a plane to plane projective transformation).

On lines

  • A line in 3-space projects to a line in the image
  • The set of points in space that map to a line in the image is the plane $\pi = P^Tl$ defined by the camera centre and the image line $l$.

On conics

A conic $C$ back-projects to a cone. A cone is a degenerate quadric, i.e. the $4 \times 4$ matrix representing the quadric does not have full rank.
$$
Q_{cone} = P^TCP
$$
where $C$ is the conic, and $Q_{cone}$ is the cone quadric.

The cone vertex, in this case the camera centre $\mathbf{C}$, is the null vector of the quadric matrix, since $Q_{cone}\mathbf{C} = P^TC(P\mathbf{C}) = 0$ (because $P\mathbf{C} = 0$).

On smooth surfaces

The contour generator of a surface is the set of points $X$ on the surface $S$ at which rays from the camera centre are tangent to the surface; it depends only on the relative position of the camera centre and the surface, not on the image plane.

The apparent contour is defined by the intersection of the image plane with the rays to the contour generator, so it depends on the position of the image plane.

On quadrics

A quadric is a smooth surface and so its outline curve is given by points where the back-projected rays are tangent to the quadric surface.

If the quadric is a sphere, its contour generator is a circle. The cone of rays between the camera centre and the sphere is right-circular, so the apparent contour, the intersection of the image plane with the cone, is a conic. In particular, if the sphere centre lies on the camera's $z$-axis, the conic is a circle.

For a general quadric, the contour generator is a conic, and the apparent contour is also a conic.

Under camera matrix $P$, the outline of the quadric $Q$ is the conic $C$ given by
$$
C^{\star} = PQ^{\star}P^T
$$
where $C^{\star}$ and $Q^{\star}$ are the dual conic and the dual quadric.

Fixed camera centre

An object in 3D space and the camera centre define a set of rays, and an image is obtained by intersecting these rays with a plane.

Property: images obtained with the same camera centre can be mapped to one another by a plane projective transformation, which is a $3\times 3$ matrix.
$$
\begin{cases}
P = KR[I \ | \ -\widetilde{C}] \\
P^{\prime} = K^{\prime}R^{\prime}[I \ | \ -\widetilde{C}] \\
\end{cases} \ \Rightarrow \
x^{\prime} = P^{\prime}X = (K^{\prime}R^{\prime})(KR)^{-1}P X = (K^{\prime}R^{\prime})(KR)^{-1}x
$$

Usages when camera centre fixed

  • If the focal length changes while the camera centre stays fixed, the images on the different image planes are related by the projective transformation $x^{\prime} = K^{\prime}K^{-1}x$
  • If the camera undergoes pure rotation, the images are related by the projective transformation $x^\prime = KRK^{-1}x$. This also means the 3D scene cannot be recovered from images taken under pure rotation. One application is image mosaicing (stitching panoramas)

If the camera centre moves

  • Different images are in general not related by a projective transformation, because there is motion parallax.
  • However, if all 3D scene points lie on a single world plane, the images are still related by a projective transformation, because the mapping from that plane to each image plane is a general planar homography.

Entities from calibration matrix

| Usage | Equation | Description |
| --- | --- | --- |
| Define rays | $d = K^{-1}x$ | $x$ is the image point and $d$ is the direction, in the camera's Euclidean frame, of the ray that starts from the camera centre and passes through the image point. $K$ here acts as an affine transformation |
| Measure angles | $\cos \theta = \frac{x_1^T (K^{-T}K^{-1}) x_2}{\sqrt{x_1^T (K^{-T}K^{-1}) x_1} \sqrt{x_2^T (K^{-T}K^{-1}) x_2}}$ | Angle between the 2 rays. Thus a calibrated camera (known $K$) is a direction sensor, like a 2D protractor |
| Define planes | $n = K^Tl$ | The image line $l$ and the camera centre define a plane; $n$ is its normal in the camera's Euclidean frame |
| Image of points at infinity | $x = KR[I \mid -\widetilde{C}] (d^T, 0)^T = KRd$ | The mapping between the plane at infinity $\pi_{\infty}$ and the image is the planar homography $H = KR$. It depends only on the calibration matrix and the camera rotation, not on the camera position |
| Image of the absolute conic | $w = (KR)^{-T} I (KR)^{-1} = (KK^T)^{-1} = K^{-T}K^{-1}$ | The image of the absolute conic $\Omega_{\infty} = I$. Once the conic $w$ is identified in an image, $K$ is determined by Cholesky factorization |
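A sketch of the angle-measurement row with an illustrative $K$; the expected result for these two points is $\arctan(100/800) \approx 7.13^{\circ}$:

```python
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])    # illustrative calibration matrix
omega = np.linalg.inv(K @ K.T)           # image of the absolute conic

x1 = np.array([320.0, 240.0, 1.0])       # the principal point
x2 = np.array([420.0, 240.0, 1.0])       # 100 pixels to the right

cos_t = (x1 @ omega @ x2) / np.sqrt((x1 @ omega @ x1) * (x2 @ omega @ x2))
print(np.degrees(np.arccos(cos_t)))      # angle between the two rays
```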

Vanishing points and lines

| ~ | Vanishing points | Vanishing lines |
| --- | --- | --- |
| Definition | the intersection $v = Kd$ of the image plane with a ray through the camera centre parallel to a world line with direction $d$ | the intersection of the image plane with a plane through the camera centre parallel to the scene plane |
| How to get from an image | intersect the images of world-parallel lines | - 2 vanishing points from two sets of lines parallel to the plane<br>- 3 coplanar equally spaced parallel lines: $l = \left( (l_0 \times l_2)^T (l_1 \times l_2)\right)l_1 + 2\left((l_0 \times l_1)^T (l_2 \times l_1)\right)l_2$ |
| Properties | the vanishing point of a line parallel to a plane lies on the vanishing line of the plane | - all parallel 3D planes intersect $\pi_{\infty}$ in the same line and thus share a vanishing line<br>- a plane with vanishing line $l$ has normal direction $n = K^Tl$ in the camera's Euclidean frame |

Example of vanishing points and lines from an image

Using 2 pairs of corresponding vanishing points in the images of 2 calibrated cameras, the relative camera rotation can be computed. Each pair contributes 2 constraints via $d_i^{\prime} = Rd_i$, where $d_i = K^{-1} v_i / ||K^{-1} v_i||$ is the direction of the vanishing point $v_i$ in the camera frame.

Relationship with conic $w$

  • The vanishing points of lines with perpendicular directions satisfy $v_1^Twv_2 = 0$
  • If a line is perpendicular to a plane then their respective vanishing point $v$ and vanishing line $l$ are related by $l = wv$
  • The vanishing lines of two perpendicular planes satisfy $l_1^T w^{\star} l_2 = 0$

Calibrate cameras

Knowing the conic $w = \left[\begin{matrix} w_1 & w_2 & w_4 \\ w_2 & w_3 & w_5 \\ w_4 & w_5 & w_6 \end{matrix}\right]$, which is homogeneous and symmetric and thus has only 5 DoF, is equivalent to knowing the calibration matrix. Ways to identify $w$ include:

with vanishing points and lines

A known angle between two rays provides a constraint for estimating $w$, though in general the constraint is quadratic. Only when the points $v_1, v_2$ are the vanishing points of perpendicular rays does the constraint on $w$ become linear: $v_1^T w v_2 = 0$.

Vanishing points and lines can therefore provide linear constraints.

Note that zero skew and square pixels are common assumptions on the calibration matrix. They add extra constraints on $w$ and therefore decrease the number of unknowns.

Usage: based on the assumptions of zero skew and square pixels (2 constraints), use 3 mutually orthogonal vanishing points (3 constraints), as sketched below.
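A sketch of this usage, assuming zero skew and square pixels so that $w$ takes the restricted form in the comment, with only 3 DoF up to scale; `calibrate_from_vps` is an illustrative name, and the inputs are the 3 mutually orthogonal vanishing points as homogeneous 3-vectors:

```python
import numpy as np

def calibrate_from_vps(v1, v2, v3):
    # With zero skew and square pixels, omega ~ [[w1,  0, w2],
    #                                            [ 0, w1, w3],
    #                                            [w2, w3, w4]].
    def row(a, b):     # the constraint a^T omega b = 0, linear in (w1..w4)
        return [a[0]*b[0] + a[1]*b[1],
                a[0]*b[2] + a[2]*b[0],
                a[1]*b[2] + a[2]*b[1],
                a[2]*b[2]]
    A = np.array([row(v1, v2), row(v1, v3), row(v2, v3)])
    _, _, Vt = np.linalg.svd(A)                 # null vector of the 3 x 4 system
    w1, w2, w3, w4 = Vt[-1]
    omega = np.array([[w1, 0.0, w2],
                      [0.0, w1, w3],
                      [w2, w3, w4]])
    if w1 < 0:
        omega = -omega                          # omega must be positive definite
    # omega = K^{-T} K^{-1}: recover K by Cholesky factorization
    K = np.linalg.inv(np.linalg.cholesky(omega)).T
    return K / K[2, 2]
```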

with circular points

The images of the circular points, $h_1 \pm i h_2$, lie on $w$, which means $(h_1 \pm i h_2)^T w (h_1 \pm i h_2) = 0$. Separating real and imaginary parts gives 2 constraints: $h_1^T w h_2 = 0$ and $h_1^T w h_1 = h_2^T w h_2$.

Usage: use 3 square patterns (each provides the images of 2 circular points, thus 2 constraints) to estimate the calibration matrix without knowing the world coordinates.

with calibration conic

The calibration conic is the image of a cone with apex angle $45^{\circ}$ whose axis coincides with the principal axis of the camera. Points on the cone satisfy $X^2 + Y^2 = Z^2$; with camera matrix $P = K[I \mid 0]$, they map to the conic
$$
C = K^{-T}
\left[\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & -1
\end{matrix}\right]
K^{-1}
$$

Therefore, once the calibration conic is known, the entries of the calibration matrix can be read off from it directly.

One way to find the calibration conic is to use 3 orthogonal vanishing points.

Summary

Planar homography mapping (a plane to plane projective transformation)

  • Mapping between a world plane and the image: $H = [p_1, p_2, p_4]$
  • Mapping between plane at infinity and image $H = KR$
  • Mapping between different images with the same camera centre $H = (K^{\prime}R^{\prime})(KR)^{-1}$

2 important properties

  1. Images acquired by cameras with the same centre are related by a plane projective transformation
  2. Images of entities on the plane at infinity do not depend on the camera position, only on the camera rotation $R$ and the internal parameters $K$

For conic $w$

  1. If a point $x$ lies on $w$, then $x^T w x = 0$
  2. If 2 points $x_1, x_2$ are conjugate with respect to $w$, then $x_1^T w x_2 = 0$

Estimation - Camera Matrix $P$

Given sufficiently many correspondences $(X_i, x_i)$ between 3D points and their images under the unknown camera mapping, estimate the $3\times 4$ camera matrix $P$, where $x_i = PX_i$.

Since $P$ has 11 DoF and each point pair contributes 2 constraints, at least 6 point pairs are required.

DLT and minimizing geometric errors

Similar to the estimation of 2D projective transformation, the DLT algorithm here could give

  • the minimal solution, which uses exactly 11 equations to solve for the entries of $P$
  • the over-determined solution, which uses more than 11 equations and solves via SVD.

Different geometric errors

The geometric errors also have different forms.

| Error type | Error function | Explanation | Use cases |
| --- | --- | --- | --- |
| errors only in the image | $\sum_i d(x_i, PX_i)^2$ | $x_i$ is the measured point; this is the MLE if the measurement errors are Gaussian | camera estimation from a calibration object |
| errors only in the world | $\sum_i d(X_i, \hat{X}_i)^2$ | $X_i$ is the measured 3D point and $\hat{X}_i$ is the closest point in space to $X_i$ that maps exactly onto $x_i$ via $x_i = P\hat{X}_i$ | equivalent to minimizing the algebraic error |
| errors in both image and world | $\sum_i d_m(x_i, P\hat{X}_i)^2 + d_m(X_i, \hat{X}_i)^2$ | $d_m$ is the Mahalanobis distance with respect to the known error covariance matrices of the measurements $x_i$ and $X_i$ | bundle adjustment in SLAM, optimizing both the camera pose and the positions of the 3D points |

Note: for errors in both image and world, the optimization adjusts both the entries of the camera matrix and the estimated 3D points, so there are $3n + 11$ parameters.

Geometric interpretation of algebraic error

The DLT algorithm in effect minimizes the errors only in the world, with the estimated 3D points displaced from the measured ones in the direction perpendicular to the principal axis of the camera.

Estimation of an affine camera

For affine cameras, the algebraic error and the geometric image error are equal. This means geometric image distances can be minimized by a linear algorithm.

The DLT algorithm then directly minimizes the geometric error in the image, with $||Ap||^2 = \sum_i d(x_i, \hat{x}_i)^2$

Restricted camera estimation

Apply additional constraints to the camera matrix $P = K[R | -R\widetilde{C}]$, where the common constraints for matrix $K$ are

  1. The skew $s$ is zero
  2. The pixels are square: $a_x = a_y$
  3. The principal point $(x_0, y_0)$ is known
  4. The complete camera calibration matrix $K$ is known

With some of the assumptions above, the camera matrix can be solved for by minimizing either the geometric or the algebraic error.

Minimizing geometric errors

e.g. assume constraints 1 and 2 above; then the camera matrix $P$ has only $11-2 = 9$ DoF, and a $9\times 1$ vector $q$ can represent the unknown entries. We can still build the optimization target and solve it with an iterative method, but the dimension of the optimization changes.

For image error only, the function $f: \mathbf{R}^9 \rightarrow \mathbf{R}^{2n}$ is minimized.

For both image and world errors, the function $f: \mathbf{R}^{3n+9} \rightarrow \mathbf{R}^{5n}$ is minimized, because it also includes the positions of the 3D points.

Minimizing algebraic error

After making assumptions such as 1 and 2 above, we get a mapping $g: q \rightarrow p$, where $p$ is the vector of entries of matrix $P$ and $q$ holds the unknown parameters. Minimizing the algebraic error then means minimizing $||Ag(q)||$.

Using reduced measurement matrix $\hat{A}$, where
$$
A^T A = (VDU^T) (UDV^T) = (VD) (DV^T) = \hat{A}^T\hat{A}
$$
could reduce the dimension of $A$ from $2n \times 12$ to $12 \times 12$.

Then minimizing the algebraic error means minimizing $||\hat{A}g(q)||$. This is a mapping from $\mathbf{R}^9$ to $\mathbf{R}^{12}$ and can be solved by an iterative method.

Exterior orientation

If the calibration matrix $K$ is known, the problem reduces to estimating the camera pose (exterior orientation).

Since there are only 6 unknown parameters, 3 point pairs suffice. This minimal problem yields up to 4 solutions, so an additional point pair is used to select the correct one.

Covariance estimation

Assuming all errors are in the image measurements, the expected ML residual error is

$$
\epsilon_{res} = \sigma (1 - \frac{d}{2n})^{\frac{1}{2}}
$$
where $d$ is the number of camera parameters being fitted (11 for the pinhole camera model) and $n$ is the number of point correspondences

Epipolar Geometry and the Fundamental Matrix

Epipolar geometry

Epipolar geometry is the intrinsic projective geometry between 2 views. The 2 images may be captured by 2 different cameras at the same or different times, or by 1 camera at different times. Epipolar geometry is independent of scene structure; it depends only on the cameras' internal parameters and relative pose.

Some concepts

  • Epipole: the intersection of the line joining the camera centres (the baseline) with the image plane. It is also the vanishing point of the baseline direction
  • Epipolar plane: a plane containing the baseline
  • Epipolar line: the intersection of an epipolar plane with the image plane. All epipolar lines intersect at the epipole.

Epipolar geometry implies that for a point $X$ in 3D space, its image $x^{\prime}$ in the 2nd view lies on the epipolar line determined by the 2 camera centres and its image $x$ in the 1st view.

Fundamental matrix $F$

The fundamental matrix $F$ is the algebraic representation of epipolar geometry. It encodes the mapping from a point to its epipolar line.

The epipolar line $l^\prime$ is the projection in the second image of the ray from the point $x$ through the camera centre $C$ of the first camera. Therefore, there is a mapping which described by $F$ as $F: x \rightarrow l^\prime$.

geometric derivation

Explanation

  • Since the mapping between a plane and an image is a 2D homography, the relationship between the image points $x$ and $x^\prime$ (transferred via a plane $\pi$) is also a 2D homography $H_\pi$
  • In 2D geometry, the line through 2 points $e^\prime, x^\prime$ is $e^\prime \times x^\prime$
  • Thus $l^\prime = e^\prime \times x^\prime = [e^\prime]_{\times} H_\pi x = Fx$
  • $[e^\prime]_{\times}$ has rank 2 and $H_\pi$ has rank 3, thus $F$ has rank 2

algebraic derivation

Suppose the 2 cameras' projection matrices are $P, P^\prime$. The first camera centre is $C$, and the ray back-projected from the image point $x$ in the first image through the camera centre is
$$
X(\lambda) = P^+ x + \lambda C
$$
where $PP^+ = I$.

Two special points on this ray are $P^+ x$ (at $\lambda = 0$) and $C$; their projections into the second image define the epipolar line
$$
l^\prime = (P^\prime C) \times (P^\prime P^+ x) = [e^\prime]_{\times}(P^\prime P^+) x = Fx
$$
which makes the geometric derivation explicit, with $H_\pi = P^\prime P^+$.

Specifically, when $P = K[I \mid 0]$ and $P^\prime = K^\prime [R \mid t]$, then $P^+ = \left[\begin{matrix} K^{-1} \\ 0^T \end{matrix}\right]$ and $C = \left[\begin{matrix} 0 \\ 1 \end{matrix}\right]$.

Therefore $F = K^{\prime -T} [t]_{\times} R K^{-1} = K^{\prime -T} R K^T [e]_{\times}$, where $e$ is the epipole in the first image.

properties

Fundamental matrix from special motions

| Motion type | Fundamental matrix | Description |
| --- | --- | --- |
| Pure translation (no rotation, no change of internal parameters) | $F = [e^\prime]_{\times}$, supposing $P = K[I \mid 0]$, $P^\prime = K[I \mid t]$ | A pure camera translation $t$ is equivalent to a fixed camera with the 3D scene translated by $-t$. Points in 3D space then move along straight lines parallel to $t$, and these parallel lines image to lines meeting at the vanishing point $v$. Then, in each image:<br>- Epipole: the vanishing point $v$<br>- Epipolar lines: the imaged parallel lines<br>- $x, x^\prime, e^\prime$ are collinear (auto-epipolar), since $x^{\prime T} [e^\prime]_{\times} x = 0$ |
| General motion | $F = [e^\prime]_{\times} H$ with $H = K^\prime R K^{-1}$, supposing $P = K[I \mid 0]$, $P^\prime = K^\prime[R \mid t]$ | A general motion can be decomposed into 1) a rotation of the 1st camera to align it with the 2nd, 2) a change of internal parameters to match, and 3) a pure translation. The first 2 steps form a 2D projective transformation represented by the homography $H$, giving $x^{\prime T} [e^\prime]_{\times} \hat{x} = 0$, where $\hat{x} = Hx$ is the corrected point in the 1st image |
| Pure planar motion (translation direction orthogonal to the rotation axis) | ~ | This adds an extra constraint on $F$: its symmetric part $F_s$ has rank 2, so $F$ has only 6 DoF |

Retrieve camera matrices from fundamental matrix

If $H$ is a $4\times 4$ matrix representing a projective transformation of 3-space, then the fundamental matrices corresponding to the pairs of camera matrices $(P,P^\prime)$ and $(PH, P^\prime H)$ are the same.

Therefore, although a pair of camera matrices $(P = [I \mid 0], P^\prime = [M \mid m])$ uniquely determines the fundamental matrix $F=[m]_{\times} M$, the converse is not true: the fundamental matrix determines the pair of camera matrices only up to a projective transformation.

However, we can still define a specific canonical form for the pair of camera matrices retrieved from a fundamental matrix:

$$
P = [ I \mid 0 ] \quad P^\prime =
\begin{cases}
[SF \mid e^\prime] \\
\left[ [e^\prime]_\times F \mid e^\prime \right] \\
\left[ [e^\prime]_\times F + e^\prime v^T \mid \lambda e^\prime \right]
\end{cases}
$$

where $S$ is a skew-symmetric matrix, $e^\prime$ is the epipole such that $e^{\prime T} F = 0$. $v$ is any 3-vector, and $\lambda$ is a non-zero scalar.

The essential matrix $E$

The essential matrix is the specialization of the fundamental matrix to normalized image coordinates $\hat{x} = K^{-1}x = [R \mid t]X$, where $\hat{P} = K^{-1}P = [R \mid t]$ is called the normalized camera matrix.

So for $P = [I \mid 0]$ and $P^\prime = [R \mid t]$ $\Rightarrow E=[t]_\times R=R[R^T t]_\times $

In other words, the essential matrix means the camera calibration matrix $K, K^\prime$ is known, where $E = K^{\prime T} F K$.

properties

  1. For points in normalized image coordinates, $\hat{x}^{\prime T} E \hat{x} = 0$
  2. Only 5 DoF: 3 from rotation and 3 from translation, minus 1 for the overall scale ambiguity
  3. The constraints on this $3\times 3$ matrix are
    • it is a homogeneous matrix
    • its rank is 2
    • 2 of its singular values are equal
    • the third singular value is 0

Retrieve camera matrices from essential matrix

The camera matrices may be retrieved from the essential matrix up to scale and a four-fold ambiguity, which means there are four possible solutions.

For a given essential matrix $E = U\,\text{diag}(1,1,0)\,V^T$ ($E$ is homogeneous, so its two non-zero singular values can be set to 1), and first camera $P = [I \mid 0]$ (the known calibration matrices are factored out), the second camera matrix is one of the following
$$
P^\prime = \begin{cases}
[UWV^T | +u_3] \\
[UWV^T | -u_3] \\
[UW^TV^T | +u_3] \\
[UW^TV^T | -u_3]
\end{cases}
$$
where $u_3$ is the last column of $U$, and $W = \left[\begin{matrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{matrix}\right]$

After triangulating a space point with each candidate camera pair, only 1 of the 4 cases places the point in front of both cameras, so only 1 solution is valid. A sketch:
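A sketch of the four candidate solutions and the cheirality test that selects the physical one; it assumes a `triangulate` helper such as the linear-triangulation sketch in the reconstruction chapter below, and `x, xp` are one correspondence in normalized coordinates:

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

def decompose_essential(E):
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:  U = -U     # E is homogeneous, so flipping the
    if np.linalg.det(Vt) < 0: Vt = -Vt   # sign keeps the rotations proper
    t = U[:, 2]                          # translation direction, up to scale
    return [(U @ W @ Vt, t), (U @ W @ Vt, -t),
            (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]

def pick_solution(E, x, xp):
    P = np.hstack([np.eye(3), np.zeros((3, 1))])
    for R, t in decompose_essential(E):
        Pp = np.hstack([R, t[:, None]])
        X = triangulate(P, Pp, x, xp)           # inhomogeneous 3D point
        if X[2] > 0 and (R @ X + t)[2] > 0:     # in front of both cameras
            return R, t
    return None                                  # degenerate input
```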

3D Reconstruction of Cameras and Structure

How and to what extent the spatial layout of a scene and the cameras can be recovered from two views.

Description of the problem

  • Unknown: position, rotation and calibration matrices
  • Given: corresponding points $(x_i, x_i^\prime)$ from 2 images, came from 3D points $X_i$
  • Objective: solve for $P, P^\prime, X_i$ such that $x_i = PX_i \quad x_i^\prime = P^\prime X_i$

General reconstruction method

  1. Compute the fundamental matrix from point correspondences
  2. Compute the camera matrices from the fundamental matrix
  3. Triangulate each point correspondence $(x_i, x_i^\prime)$ with the camera matrices to get the point in space $X_i$ (a triangulation sketch follows)
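A sketch of step 3 via homogeneous linear (DLT) triangulation; `triangulate` is an illustrative helper name, and the image points are homogeneous 3-vectors:

```python
import numpy as np

def triangulate(P, Pp, x, xp):
    """Linear triangulation of one correspondence x <-> xp."""
    # Each view contributes two rows of the form x (p3T X) - w (p1T X) = 0.
    A = np.vstack([x[0] * P[2] - x[2] * P[0],
                   x[1] * P[2] - x[2] * P[1],
                   xp[0] * Pp[2] - xp[2] * Pp[0],
                   xp[1] * Pp[2] - xp[2] * Pp[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                       # homogeneous 3D point
    return X[:3] / X[3]
```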

Stratified reconstruction

Normally, reconstruction is up to a similarity transformation, because the absolute world position (latitude, longitude) and the overall scale are unknown; the best reconstruction achievable without further information is therefore a similarity reconstruction.

Suppose the true cameras and points are $\widetilde{P}, \widetilde{P}^\prime, \widetilde{X}_i$, and the reconstruction result is $P, P^\prime, X_i$. Then their relationship is
$$
\widetilde{P} = PH^{-1} \quad \widetilde{P}^\prime = P^\prime H^{-1} \quad \widetilde{X_i} = HX_i
$$
for some invertible $4\times 4$ matrix $H$, which is the ambiguity of the reconstruction.

The type of the transformation $H$ between the reconstruction result and the true scene defines the type of reconstruction. Different types of reconstruction can also be converted into one another with the respective matrix $H$.

| Reconstruction type | Extra information | How to | Usages |
| --- | --- | --- | --- |
| Projective | none | follow the general reconstruction steps | "at what point does a line intersect a plane?"; "what is the mapping between two views induced by particular surfaces, such as a plane or quadric?" |
| Affine | the image of the plane at infinity, OR knowing that the 2 cameras' calibration matrices are the same | 1. perform projective reconstruction; 2. map the plane at infinity back to $(0, 0, 0, 1)^T$ by applying $H = \left[\begin{matrix} I \mid 0 \\ \pi^T \end{matrix}\right]$ to the last step, where $\pi$ is the coordinate of the plane at infinity in the projective reconstruction | the mid-point of two points and the centroid of a set of points can now be computed; lines can be constructed parallel to other lines and to planes |
| Similarity | the image of the absolute conic $\Omega_\infty$ (the point conic lying on the plane at infinity), OR 2 cameras calibrated apart from their focal lengths | 1. perform projective then affine reconstruction; 2. map $\Omega_\infty$ back to $I$ by applying $H = \left[\begin{matrix} A^{-1} & 0 \\ 0^T & 1 \end{matrix}\right]$ to the last step, where $A$ is obtained by Cholesky factorization from $AA^T = (M^TwM)^{-1}$, $w$ is the image of the absolute conic, and $M$ comes from the camera matrix $P = [M \mid m]$ of the affine reconstruction | measure true shape; provide accurate metric information |
| Similarity | the 2 calibration matrices are known | compute the essential matrix $E$, then follow the general reconstruction steps; this is a direct rather than a stratified reconstruction | ~ |

Images of different reconstructions

(figures: examples of projective, affine, and similarity reconstructions of the same scene)

In real cases

  • Similarity reconstruction is also called metric reconstruction or Euclidean reconstruction, because the scale factor usually cannot be recovered anyway
  • Affine reconstruction - ways to get the plane at infinity
    1. If the camera motion is known to be a pure translation (no rotation, no intrinsic change), the coordinates of points at infinity are the same in both views; 3 such points determine the plane at infinity
    2. 2 parallel lines intersect at a point at infinity; 3 sets of parallel lines identified in the scene determine the plane at infinity
    3. 1 of the cameras is known to be an affine camera
  • Similarity reconstruction - ways to get the absolute conic (the following methods can be combined)

| Given | Constraints | Num of constraints |
| --- | --- | --- |
| a pair of vanishing points $v_1, v_2$ arising from orthogonal scene lines | $v_1^T w v_2 = 0$ | 1 |
| a vanishing point $v$ and a vanishing line $l$ arising from an orthogonal direction and plane | $l = wv$ | 2 |
| some internal parameters of the calibration matrix | $w = K^{-T}K^{-1}$; see the single-view geometry chapter for how to use the constraints | varies |
| the calibration matrix is the same in both images | $w^\prime = H_\infty^{-T} w H_\infty^{-1}$ and $w = w^\prime$ | 4 |

Direct metric reconstruction

2 views to metric reconstruction with $w$

Knowing the image of the absolute conic $w$, it is possible to proceed directly to a metric reconstruction, given at least two views.

The method is to use $w$ to compute the calibration of each camera, by inverting $w = (KK^T)^{-1}$ and applying Cholesky factorization to obtain $K$, and then carry out a calibrated reconstruction using the essential matrix.

projective reconstruction to metric reconstruction with control points

Suppose the 3D locations of $n$ ground control points $X_{Ei}$ are known, corresponding to the image points $x_i, x_i^\prime$, and a projective reconstruction $P, P^\prime, X_i$ is also known.

Then the homography $H$ that maps the projective reconstruction to the metric reconstruction satisfies
$$
X_{Ei} = HX_i, \quad i = 1, \cdots, n
$$
where each point provides 3 linearly independent equations, so at least 5 points are needed to solve for the 15-DoF $H$ (see the sketch at the end of this section).

or

$$
x_i = PH^{-1}X_{Ei}, \quad x_i^\prime = P^\prime H^{-1} X_{Ei}, \quad i = 1, \cdots, n
$$
where each point provides 2 constraints on $H^{-1}$, and the control points $X_{Ei}$ need not appear in both images. If a point does appear in both, the 2 equations provide only 3 constraints, since $x$ and $x^\prime$ satisfy the coplanarity (epipolar) constraint.
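A sketch of the first formulation, solving $X_{Ei} = HX_i$ linearly as a 3D DLT; `homography_3d` is an illustrative name, and the inputs are assumed to be $n \times 4$ homogeneous points ($n \geq 5$, in general position, with finite $X_{Ei}$):

```python
import numpy as np

def homography_3d(X_proj, X_metric):
    """Returns the 4 x 4 H mapping each X_proj[i] to X_metric[i] up to scale."""
    rows = []
    for X, Xe in zip(X_proj, X_metric):
        # Cross-product-style equations Xe[3]*(H X)_i - Xe[i]*(H X)_3 = 0,
        # i = 0, 1, 2; these 3 are independent when Xe[3] != 0.
        for i in range(3):
            r = np.zeros(16)
            r[4*i:4*i + 4] = Xe[3] * X
            r[12:16] = -Xe[i] * X
            rows.append(r)
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(4, 4)
```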