
Photogrammetry and Metashape

Photogrammetry in a nutshell

Imagine you have a bunch of overlapping photos of the same terrain. Photogrammetry is about turning those 2D images into a consistent 3D reconstruction. The rough sequence is:

Feature detection & matching (e.g., SIFT) → find distinctive points in each image and match them across images.

Structure from Motion (SfM) → use those matches to simultaneously estimate:

The 3D coordinates of the matched points (a sparse point cloud), and

The camera poses (position + orientation) from which each image was taken.

Bundle Adjustment → refine all those estimates (camera parameters + 3D points) to minimize reprojection error, i.e., the pixel gap between where each 3D point projects and where it was actually observed.

Dense reconstruction / DEM generation → densify into full point clouds/meshes/rasters using the optimized geometry.

Georeferencing (with ground control points (GCPs) or geotags) is either integrated into the bundle adjustment (as constraints) or applied afterwards to put the model into real-world coordinates.


Key photogrammetry terms and their relationships

SIFT (Scale-Invariant Feature Transform)

It’s an algorithm that detects “features” (keypoints) in an image that are distinctive and stable under scale/rotation changes.

Example: a rock with a particular texture will generate a descriptor; the same rock seen in another overlapping photo should generate a similar descriptor even if the angle or scale changed.

SIFT gives you: locations + descriptors. Descriptors are compared across images to establish matches (correspondences).
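To make the descriptor comparison concrete, here is a minimal sketch of nearest-neighbour matching. It assumes descriptors are plain tuples of floats (toy 4-D vectors here; real SIFT descriptors have 128 dimensions) and it uses a ratio test to reject ambiguous matches. The ratio test and the 0.8 threshold are a common choice added for illustration, not something these notes or Metashape prescribe, and `match_descriptors` is a hypothetical helper name.

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (index_a, index_b) pairs whose best match passes the ratio test."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distance from descriptor i in image A to every descriptor in image B.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if the best candidate clearly beats the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches

# Toy descriptors: the first two in A have a clear partner in B,
# the third is equally close to two candidates and should be rejected.
desc_a = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
desc_b = [(0.0, 0.9, 0.1, 0.0), (0.9, 0.1, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)]

matches = match_descriptors(desc_a, desc_b)
print(matches)
```

Production pipelines use approximate nearest-neighbour search and geometric verification on top of this; the brute-force loop is only meant to show the idea.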

Tie Points / Correspondences

When a feature in image A is matched to the same physical feature in image B (and possibly more images), you get a tie point.

Each tie point corresponds to a single 3D location that projects to one pixel location in each image that observes it.
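One way to picture how pairwise matches become multi-image tie points is to merge them into "tracks" with a union-find structure. This is a sketch under assumptions: observations are labeled as hypothetical (image name, keypoint index) pairs, and `build_tracks` is an illustrative helper, not how any particular package stores its tie points.

```python
def build_tracks(pairwise_matches):
    """Group (image, keypoint) observations linked by pairwise matches into tracks."""
    parent = {}

    def find(x):
        # Register unseen observations as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Each pairwise match says "these two observations are the same 3D point".
    for a, b in pairwise_matches:
        parent[find(a)] = find(b)

    tracks = {}
    for obs in list(parent):
        tracks.setdefault(find(obs), []).append(obs)
    return sorted(sorted(track) for track in tracks.values())

# Made-up matches: keypoint 12 in image A matches 40 in B, which matches 7 in C.
pairwise_matches = [
    (("A", 12), ("B", 40)),
    (("B", 40), ("C", 7)),
    (("A", 3), ("C", 9)),
]

tracks = build_tracks(pairwise_matches)
for track in tracks:
    print(track)
```

Each resulting track is one candidate tie point: a list of pixel observations that should all correspond to the same 3D location.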

Structure from Motion (SfM)

“Structure” = the 3D positions of those tie points.

“Motion” = the camera poses (where and how each photo was taken).

SfM uses the raw matched tie points to bootstrap an initial estimate: given overlapping observations of the same 3D point from different viewpoints, you can triangulate its position and infer the relative motion/orientation between cameras.
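The triangulation idea can be sketched with two viewing rays. This example assumes the camera centres and ray directions are already known, and it uses the simple midpoint method (the point halfway between the closest points on the two rays). That method is chosen here for brevity; real SfM pipelines typically use linear or optimal triangulation and estimate the poses at the same time. `triangulate_midpoint` is a hypothetical helper name.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point midway between the closest points of two viewing rays.

    c1, c2: camera centres; d1, d2: ray directions through the matched pixels.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b  # zero when the rays are parallel (no usable baseline)
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; cannot triangulate")
    s = (b * e - c * d) / denom  # scalar position along ray 1
    t = (a * e - b * d) / denom  # scalar position along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]

# Two cameras 2 units apart, both seeing a point at (1, 2, 10).
target = (1.0, 2.0, 10.0)
c1, c2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
d1 = [p - c for p, c in zip(target, c1)]
d2 = [p - c for p, c in zip(target, c2)]

estimate = triangulate_midpoint(c1, d1, c2, d2)
print(estimate)
```

With noisy matches the two rays no longer intersect exactly, which is why the midpoint (or a least-squares equivalent) is used and why a later refinement step is needed.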

Camera Calibration Parameters

These describe how 3D points in space map to pixels in an image. They are of two types:

Intrinsics (internal camera model):

Focal length (f): sets the scale of the projection; a larger f gives a narrower field of view (more zoom).

Principal point (cx, cy): where the optical axis hits the image plane (usually near image center).

Distortion coefficients: real lenses deviate from an ideal pinhole projection. Radial and tangential distortion make straight lines appear curved, and these coefficients model that.

(Optional) Skew / aspect ratio: for modern cameras the skew is usually near zero and the pixel aspect ratio near 1, so both are often held fixed.

Extrinsics (pose):

Rotation and translation of the camera in space (i.e., position and orientation).

Calibration can be pre-computed (you give Metashape a known intrinsics model) or self-calibrated during bundle adjustment, i.e., the software adjusts intrinsics along with extrinsics and 3D points to best explain all the observations.
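As a rough illustration of how intrinsics and extrinsics combine, the sketch below projects a world point to a pixel. It assumes a generic pinhole model with a single radial distortion coefficient `k1` and a world-to-camera pose given as a rotation matrix `R` and translation `t`. This is not Metashape's exact camera model (which has more distortion terms), and all numeric values are made up.

```python
def project(point_world, R, t, f, cx, cy, k1=0.0):
    """Project a 3D world point to pixel coordinates (u, v).

    R (3x3, as row lists) and t map world to camera coordinates:
    X_cam = R @ X_world + t. f, cx, cy are in pixels.
    """
    # Extrinsics: move the point into the camera frame.
    xc, yc, zc = (
        sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    # Perspective division onto the normalized image plane.
    x, y = xc / zc, yc / zc
    # One-term radial distortion: points shift along the radius from the centre.
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2
    x, y = x * scale, y * scale
    # Intrinsics: scale by focal length and shift to the principal point.
    return (f * x + cx, f * y + cy)

R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = (0.0, 0.0, 0.0)
point = (1.0, 2.0, 10.0)

ideal = project(point, R_identity, t_zero, f=1000.0, cx=960.0, cy=540.0)
distorted = project(point, R_identity, t_zero, f=1000.0, cx=960.0, cy=540.0, k1=-0.1)
print(ideal, distorted)
```

Comparing `ideal` and `distorted` shows why a wrong distortion coefficient matters: the same 3D point lands on a different pixel, and the effect grows towards the image edges.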

Bundle Adjustment

Mathematical optimization that simultaneously refines the 3D point coordinates and all camera parameters (extrinsics, and optionally intrinsics) to minimize reprojection error across all observations.

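It’s called “bundle” because the rays connecting each camera centre to the 3D points it observes form a bundle; the adjustment moves cameras and points together so that all of these bundles fit the observed image data as well as possible.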
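To give a feel for the optimization, here is a heavily simplified sketch that refines only one 3D point while keeping the cameras fixed. It assumes three unrotated pinhole cameras with a shared focal length, noise-free observations, a finite-difference Jacobian and plain Gauss-Newton steps. Real bundle adjustment also updates camera poses (and optionally intrinsics) and typically relies on Levenberg-Marquardt with sparse solvers; all names and values here are illustrative.

```python
F = 1000.0  # focal length in pixels (assumed, shared by all cameras)

def project(point, cam_center):
    """Pinhole projection for an unrotated camera at cam_center looking along +Z."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (F * x / z, F * y / z)

def residuals(point, cams, observations):
    """Stack (projected - observed) pixel residuals over all cameras."""
    res = []
    for cam, (u_obs, v_obs) in zip(cams, observations):
        u, v = project(point, cam)
        res += [u - u_obs, v - v_obs]
    return res

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * 3
    for i in reversed(range(3)):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def refine_point(point, cams, observations, iterations=10, eps=1e-6):
    """Gauss-Newton refinement of one 3D point against its pixel observations."""
    point = list(point)
    for _ in range(iterations):
        r = residuals(point, cams, observations)
        # Finite-difference Jacobian, stored column by column (one per coordinate).
        J_cols = []
        for k in range(3):
            shifted = point[:]
            shifted[k] += eps
            r_shift = residuals(shifted, cams, observations)
            J_cols.append([(a - b) / eps for a, b in zip(r_shift, r)])
        # Normal equations: (J^T J) delta = -J^T r
        JtJ = [[sum(a * b for a, b in zip(ci, cj)) for cj in J_cols] for ci in J_cols]
        Jtr = [sum(a * b for a, b in zip(ci, r)) for ci in J_cols]
        delta = solve3(JtJ, [-g for g in Jtr])
        point = [p + d for p, d in zip(point, delta)]
    return point

cams = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
true_point = (1.0, 2.0, 10.0)
observations = [project(true_point, cam) for cam in cams]

refined = refine_point((0.5, 1.5, 8.0), cams, observations)
print(refined)
```

Starting from a deliberately wrong guess, the refined point moves back towards the position that explains all three observations at once, which is the same principle the full adjustment applies to every point and camera simultaneously.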

Reprojection Error / Image Residuals

For each 3D point and each image observing it, you can project the current estimated 3D point into that image using the current camera model (intrinsics + extrinsics).

The reprojection error is the difference (in pixels) between that projected point and the actual detected feature location in the image.

Bundle adjustment minimizes the sum (often weighted) of squared reprojection errors over all such observations.

Residual = the vector difference (observed minus projected, in the convention used below); the error magnitude is typically its Euclidean norm in the image plane.

Simple numeric example: Suppose a 3D point projects to pixel (100.2, 200.5) in image A under current estimates, but the measured matched feature was actually at (99.7, 200.8).

Residual vector = (−0.5, +0.3)

Reprojection error magnitude ≈ sqrt(0.5² + 0.3²) ≈ 0.583 pixels. Bundle adjustment tries to adjust cameras/points so these per-observation errors shrink globally.
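The numbers above can be checked in a few lines. This is only a sketch of the arithmetic, using the observed-minus-projected convention from the example; the variable names are illustrative.

```python
import math

projected = (100.2, 200.5)  # where the current model puts the point
observed = (99.7, 200.8)    # where the feature was actually detected

# Residual vector: observed minus projected, in pixels.
residual = (observed[0] - projected[0], observed[1] - projected[1])

# Reprojection error magnitude: Euclidean norm of the residual.
error = math.hypot(*residual)

print(residual, round(error, 3))
```

Up to floating-point rounding, the residual is (−0.5, +0.3) and the magnitude is about 0.583 pixels, matching the hand calculation.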


How they tie together in a typical SfM step

Detect features in every image using something like SIFT.

Match descriptors across image pairs → build a graph of correspondences.

Initialize camera poses and sparse 3D points via triangulation from good matches (often starting from a seed pair with strong geometry).

Bundle adjustment takes that rough model and refines it: it adjusts camera extrinsics (pose), intrinsics (if enabled), and 3D point positions so that projected points align with their observed positions (minimizing reprojection error).

Result: a consistent sparse point cloud and camera network; reprojection error statistics are produced (global RMS, per-camera, per-point residuals) which are your immediate diagnostics of internal consistency.
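The reprojection statistics mentioned above can be sketched as follows. The residual values are made up, and RMS is computed here over per-observation error magnitudes; software packages may define their reported RMS slightly differently (for example per coordinate), so treat this as one plausible definition rather than Metashape's.

```python
import math
from collections import defaultdict

# (camera, point id, residual x, residual y) in pixels; illustrative values only.
observations = [
    ("cam0", 0, -0.5, 0.3),
    ("cam0", 1, 0.2, -0.1),
    ("cam1", 0, 0.4, 0.4),
    ("cam1", 1, -1.5, 2.0),
]

def rms(squared_errors):
    """Root mean square of a list of squared error magnitudes."""
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Collect squared reprojection error magnitudes per camera.
squared_by_camera = defaultdict(list)
for camera, _point_id, rx, ry in observations:
    squared_by_camera[camera].append(rx * rx + ry * ry)

global_rms = rms([e for errs in squared_by_camera.values() for e in errs])
per_camera_rms = {cam: rms(errs) for cam, errs in squared_by_camera.items()}

print(f"global RMS: {global_rms:.2f} px")
for cam, value in per_camera_rms.items():
    print(f"{cam}: {value:.2f} px")
```

In this toy data the global RMS is about 1.32 px, but the per-camera breakdown (about 0.44 px for cam0 versus 1.81 px for cam1) shows that one camera carries most of the error, which is the kind of diagnostic the per-camera numbers are for.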


Role of GCPs and camera accuracy, and why they matter

When you introduce external constraints like geotags or GCPs, you’re adding extra “observations” into that optimization:

A GCP gives a known 3D location that a model point should align to (with some expected uncertainty).

A geotag (camera position prior) says “this camera was roughly here” with a stated confidence (e.g., a very tight 5 mm sigma versus a looser one).

Bundle adjustment now balances:

Tie point reprojection errors (from image geometry),

Camera pose priors (geotags),

Ground truth anchors (GCPs),

Calibration parameters (if being adjusted).

Each of these is weighted by its assumed accuracy. Too tight an accuracy on a bad geotag gives it too much weight and can force the whole system to warp, increasing residuals elsewhere. That’s why normalized residuals (residual divided by expected sigma) are useful: they show which constraints are violated by more than their stated accuracy allows.
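A small sketch of the normalized-residual check follows. The constraint names, residuals and sigmas are invented for illustration, and the cut-off of 3 is a common rule of thumb assumed here, not a value taken from these notes or from Metashape.

```python
# (constraint name, residual in metres, assumed sigma in metres); made-up values.
constraints = [
    ("GCP 1", 0.012, 0.010),
    ("GCP 2", 0.008, 0.010),
    ("geotag cam 17", 0.150, 0.005),  # tight 5 mm sigma on a poor geotag
    ("geotag cam 18", 0.900, 1.000),  # large residual, but a loose sigma
]

THRESHOLD = 3.0  # assumed rule of thumb: flag anything beyond 3 sigma

normalized = {name: residual / sigma for name, residual, sigma in constraints}
flagged = [name for name, value in normalized.items() if abs(value) > THRESHOLD]

for name, value in normalized.items():
    print(f"{name}: {value:.1f} sigma")
print("flagged:", flagged)
```

Note that the geotag with the largest raw residual (0.9 m) is not flagged because its sigma is loose, while the 15 cm residual on the 5 mm geotag stands out at roughly 30 sigma. That is the constraint most likely to be distorting the solution.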