CS355 - Digital Forensics

Digital forensics is the use of scientific methods to collect probative facts from digital evidence.

Image acquisition

An imaging sensor converts light energy to a proportional electrical voltage. Sensors are usually CCD or CMOS. The sensor array is responsible for sampling: the digitisation of spatial coordinates, i.e. how frequently the continuous scene is measured across space. The grid spacing size in the sensor array determines the spatial resolution of an image. Quantisation is the discretisation of intensity values: it converts the continuous signal into one of the discrete values a pixel can take.

Before the imaging sensor there is a colour filter array (CFA). The colour filter only allows particular wavelengths of light through, either red, green or blue. The Bayer pattern is a repeated pattern of filters over four pixels which uses twice as many green filters as red and blue to mimic the human eye, which has twice as many green light absorption cells as red or blue.

CFA interpolation is used to recover the complete RGB values for each pixel in the Bayer pattern. Bilinear interpolation is a popular approach for CFA interpolation.

Human eyes do not perceive light the way a camera does. The camera follows a linear relationship between light level and pixel intensity but the human eye does not. Gamma correction accounts for this by translating the actual luminance to a perceived luminance according to the sensitivity of the eye to light. For an image with 8 bits per pixel, the pixel values range from 0 to 255. The equation for gamma correction for such an image is $v_{out} = 255 \times \left(\frac{v}{255}\right)^\gamma$. Different values of $\gamma$ give different results.
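As a minimal sketch, the gamma curve can be applied directly to 8-bit pixel values (NumPy is used for illustration; the helper name is arbitrary):

```python
import numpy as np

def gamma_correct(img, gamma):
    """Apply v_out = 255 * (v/255)^gamma to an 8-bit image."""
    v = img.astype(np.float64)
    out = 255.0 * (v / 255.0) ** gamma
    return np.clip(np.round(out), 0, 255).astype(np.uint8)

pixels = np.array([0, 64, 128, 192, 255], dtype=np.uint8)
print(gamma_correct(pixels, 0.5))   # gamma < 1 brightens mid-tones
print(gamma_correct(pixels, 2.2))   # gamma > 1 darkens mid-tones
```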

A video is a series of images captured at regular intervals. 30 fps is a common standard capture frame rate.

Image representation

The RGB colour model is one of the most common colour spaces used for both acquisition and display. Any point in the RGB color space is a unique combination of red, green and blue.

The Y'UV colour space consists of a luminance component (Y') and chrominance components (U and V). Y'UV was originally developed for adding colour to black and white TV transmissions while providing backwards compatibility.

The YCbCr colour space is a variant of Y'UV. JPEG supports the YCbCr format with 8-bits for each component. YCbCr can be obtained by converting from RGB for each pixel. $$ \begin{bmatrix} Y\\ C_B \\ C_R \end{bmatrix} = \begin{bmatrix} 0.2990 & 0.5870 & 0.1140\\ -0.1687 & -0.3313 & 0.5000\\ 0.5000 & -0.4187 & -0.0813 \end{bmatrix} \begin{bmatrix} R\\G\\B \end{bmatrix} + \begin{bmatrix} 0\\128\\128 \end{bmatrix} $$ This transformation can also be done in reverse from YCbCr to RGB.
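The conversion is a per-pixel matrix multiplication. A minimal sketch using the matrix above (the helper names are illustrative, not part of any particular library):

```python
import numpy as np

# Conversion matrix and offset from the transform above
M = np.array([[ 0.2990,  0.5870,  0.1140],
              [-0.1687, -0.3313,  0.5000],
              [ 0.5000, -0.4187, -0.0813]])
offset = np.array([0.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    """Convert an (H, W, 3) RGB image to 8-bit YCbCr."""
    ycbcr = rgb.astype(np.float64) @ M.T + offset
    return np.clip(np.round(ycbcr), 0, 255).astype(np.uint8)

def ycbcr_to_rgb(ycbcr):
    """Inverse transform back to RGB."""
    rgb = (ycbcr.astype(np.float64) - offset) @ np.linalg.inv(M).T
    return np.clip(np.round(rgb), 0, 255).astype(np.uint8)

# A pure red pixel maps to roughly (Y, Cb, Cr) = (76, 85, 255)
print(rgb_to_ycbcr(np.array([[[255, 0, 0]]], dtype=np.uint8)))
```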

Human eyes are much more sensitive to luminance than chrominance so there are redundancies in chrominance channels. The chrominance information can therefore be reduced without noticeable loss of quality. This process is called chroma-subsampling.

The chroma-subsampling scheme is expressed as a three part ratio, A:B:C where

  • A is the width of the region in which subsampling is performed
  • B is the number of Cb/Cr samples in each row of A pixels (horizontal factor)
  • C is the number of changes in Cb/Cr samples between the first and second row (vertical factor)

4:4:4 subsampling preserves the original image. 4:2:2 subsampling merges every two horizontally adjacent chroma samples into one. 4:2:0 subsampling, applied over a 4×2 region, merges the first two columns into one chroma sample and the last two columns into another. 4:2:0 is the most popular format for digital images.

Consider 4:2:0 subsampling on the following full resolution chroma sample $$ \begin{bmatrix} 20 & 22 & 42 & 40\\ 20 & 22 & 43 & 39 \end{bmatrix} $$

There are several techniques for subsampling.

  • Average - the average of the original block. For the example, this would be $\begin{bmatrix} 21 & 41\end{bmatrix}$ because the average of $\begin{bmatrix}20 & 22\\20&22\end{bmatrix}$ is 21 and the average of $\begin{bmatrix}42 & 40\\43&39\end{bmatrix}$ is 41.
  • Left - the average of the two leftmost chroma pixels of the block, which gives $\begin{bmatrix} 20 & 42\end{bmatrix}$
  • Right - the average of the two rightmost chroma pixels of the block, which gives $\begin{bmatrix}22 & 39\end{bmatrix}$
  • Direct - the top left chroma pixel, which gives $\begin{bmatrix}20 & 42\end{bmatrix}$

Using 4:2:0 subsampling reduces the size of the chrominance channels by a factor of 4 with little noticeable change in quality.
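The techniques above can be written as a short routine. A minimal sketch of 4:2:0 subsampling, assuming a chroma channel with an even number of rows and columns and the conventions described above:

```python
import numpy as np

def subsample_420(chroma, mode="average"):
    """4:2:0 subsampling of a chroma channel, one value per 2x2 block."""
    h, w = chroma.shape
    out = np.empty((h // 2, w // 2), dtype=int)
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            block = chroma[r:r+2, c:c+2].astype(float)
            if mode == "average":
                val = block.mean()
            elif mode == "left":
                val = block[:, 0].mean()    # two leftmost chroma pixels
            elif mode == "right":
                val = block[:, 1].mean()    # two rightmost chroma pixels
            else:                           # "direct"
                val = block[0, 0]           # top-left chroma pixel
            out[r // 2, c // 2] = int(np.floor(val))
    return out

sample = np.array([[20, 22, 42, 40],
                   [20, 22, 43, 39]])
print(subsample_420(sample, "average"))  # [[21 41]]
print(subsample_420(sample, "direct"))   # [[20 42]]
```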

Examples

Based on 2023 paper, question 1b with additional examples

Consider the chrominance channel $I$: $$ I = \begin{bmatrix} 19 & 20 & 20 & 23 \\ 21 & 22 & 24 & 25 \\ 22 & 22 & 24 & 24 \\ 20 & 20 & 22 & 21 \end{bmatrix} $$

Show the subsampled channel for each of the following formats, using floor to get integer values:

  1. 4:2:0 left $$\begin{bmatrix}20 & 22 \\ 21 & 23\end{bmatrix}$$
  2. 4:1:1 direct $$\begin{bmatrix}19 \\ 21 \\ 22 \\ 20\end{bmatrix}$$
  3. 4:2:2 average $$\begin{bmatrix} 19 & 21\\21 & 24\\22 & 24\\20 & 21\end{bmatrix}$$
  4. 4:2:0 average $$\begin{bmatrix}20 & 23\\21 & 22\end{bmatrix}$$
  5. 4:1:1 right $$\begin{bmatrix}21 \\ 24 \\ 24 \\ 21\end{bmatrix}$$
  6. 4:2:2 direct/left $$\begin{bmatrix}19 & 20\\ 21 & 24\\ 22 & 24\\ 20 & 22\end{bmatrix}$$

2021 paper, question 5b Perform 4:2:2 (average) and 4:2:0 (direct) chroma subsampling on the following matrix. Use floor for integer conversion. $$ M = \begin{bmatrix} 5 & 5 & 0 & 5\\ 5 & 0 & 3 & 2\\ 5 & 5 & 2 & 2\\ 2 & 0 & 1 & 0 \end{bmatrix} $$

  1. 4:2:2 (average) $$\begin{bmatrix}5 & 2\\2&2\\5&2\\1&0\end{bmatrix}$$
  2. 4:2:0 (direct) $$\begin{bmatrix}5 & 0 \\5 & 2\end{bmatrix}$$

Comparing images

There are three main methods of comparing the similarity of images

  1. Mean squared error (MSE), measures dissimilarity
  2. Correlation, measures similarity
  3. Structural similarity (SSIM)

The simplest method of measuring similarity is the mean squared error (MSE). The MSE for two images $X$ and $Y$ is defined as $$ MSE(X, Y) = \frac{1}{N}\sum_{i=1}^N(y_i-x_i)^2 $$ where $x_i$ and $y_i$ are pixel values and $N$ is the image size (this requires $X$ and $Y$ to be the same size).

MSE is simple, parameter free and memoryless, and its quadratic form means it can be optimised easily.
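A minimal MSE implementation, assuming both images are NumPy arrays of the same size:

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two equally sized images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    return np.mean((y - x) ** 2)
```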

Correlation coefficients measure the statistical relationship between two variables. By modelling an image as a discrete random variable, the mean, variance and standard deviation can be found. For an image $X = \left[x_1, x_2, ..., x_i, ..., x_N\right]$ where $x_i$ is the value of the $i$-th pixel in the image, the mean is given by $$\bar{x} = \frac{1}{N}\sum_{i=1}^Nx_i$$ the variance is given by $$\operatorname{Var}(X) = \frac{1}{N}\sum_{i=1}^N(x_i - \bar{x})^2$$ and the standard deviation is given by $$\sigma(X) = \sqrt{\operatorname{Var}(X)}$$

Covariance measures the relationship between two images. For images $X$ and $Y$ it is given by $$ \operatorname{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^N(x_i - \bar{x})(y_i - \bar{y}) $$

If $X= Y$ then $\operatorname{Cov}(X, Y) = \operatorname{Var}(X)$

The sample correlation coefficient can be computed as $$ r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \frac{\operatorname{Cov}(X, Y)}{\sigma(X)\sigma(Y)} $$
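These statistics translate directly into code. A minimal sketch of the sample correlation coefficient for two equally sized images (NumPy's `std` uses the same $1/N$ normalisation as the formulas above):

```python
import numpy as np

def correlation(x, y):
    """Sample correlation coefficient r(X, Y) between two images."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())
```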

Image distortions can be classified as non-structural or structural.

Non-structural distortions include

  • Luminance change
  • Contrast change
  • Gamma distortion
  • Spatial shift

Structural distortions include

  • Noise contamination
  • Blurring
  • JPEG blocking
  • Wavelet ringing

The Structural Similarity (SSIM) index is widely used for benchmarking imaging devices. SSIM compares contrast, luminance and structure to determine image similarity. It is more resistant to image distortions than other methods.

Image enhancement

Image enhancement can be used to improve illumination or contrast, remove noise or sharpen an image. These are useful for forensic applications.

Pixel domain

Pixel/spatial domain processing works directly on pixels in an image. Examples include inversion and gamma correction.

Pixel values in an image can be plotted on a histogram. A high contrast image should ideally have a flat histogram spanning the range of pixel intensities.

Histogram equalisation can be used to enhance contrast in images. Firstly, the histogram has to be normalised into a probability mass function (PMF). Consider a histogram of an image with 64 pixels. A bin with frequency 24 would have probability $24/64 = 0.375$. The PMF is the same shape as the histogram, but the $y$-axis uses probabilities instead of frequencies. Flattening the PMF flattens the histogram. To flatten the PMF, it needs to be converted to a Cumulative Distribution Function (CDF).

For a continuous random variable $X$, the CDF is defined as $F_X(x) = P(X \leq x)$. For a discrete random variable the CDF is $F_X(x) = \sum_{x_i \leq x}{P(X = x_i)}$ or, more simply, the sum of the probabilities of all bins up to and including $x$.

To equalise the histogram, each pixel value is mapped through the CDF. Each pixel with value $k$ is replaced with $$ \operatorname{round}\left((L-1)\sum_{j=0}^{k}\frac{n_j}{n}\right) $$ where $L$ is the number of possible pixel values, $n_j$ is the number of pixels with value $j$ and $n$ is the total number of pixels.
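A minimal sketch of histogram equalisation for a single-channel 8-bit image, following the formula above:

```python
import numpy as np

def equalise(img, L=256):
    """Histogram equalisation of a single-channel 8-bit image."""
    hist = np.bincount(img.ravel(), minlength=L)   # histogram
    pmf = hist / img.size                          # normalise to a PMF
    cdf = np.cumsum(pmf)                           # cumulative distribution
    mapping = np.round((L - 1) * cdf).astype(np.uint8)
    return mapping[img]                            # replace each pixel value k
```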

For RGB, histogram equalisation can be performed on each RGB channel separately or alternatively the image can be converted to YCbCr and equalisation is done on the luminance channel only.

Histogram equalisation can be generalised to histogram matching. An image histogram can be transformed to other histograms.

Another enhancement is noise removal. Image noise is unwanted structures in an image that degrade its quality. Noise is a random quantity, often unknown. For practical purposes, it is assumed that noise is additive and can be modelled as a random variable of a known distribution function, such as Gaussian noise which follows the Gaussian distribution.

Gaussian noise removal can be done using local averaging. Every pixel value is replaced by the average of its neighbouring pixel values. This reduces noise but causes information loss, causing images to become blurry.

Local averaging is a convolution operation. A 3x3 local averaging mask applied to a 3x3 section of an image $i_1\dots i_9$ centred on $i_5$ replaces $i_5$ with $$ \frac{1}{9}\sum_{k=1}^{9}i_k $$ which is the sum of the element-wise product of the kernel $$ \frac{1}{9} \begin{bmatrix} 1&1&1\\1&1&1\\1&1&1 \end{bmatrix} $$ with the image patch $\begin{bmatrix} i_1&i_2&i_3\\i_4&i_5&i_6\\i_7&i_8&i_9 \end{bmatrix}$. The kernel is also called the filter or weights.

Salt and pepper noise is usually caused by faulty sensors. Only a few pixels are modified and replaced by black or white. This type of noise can be removed with a median filter, which takes the median of neighbouring pixels.
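Both filters are available in common libraries. The sketch below uses SciPy's ndimage filters as one possible implementation; the noisy input here is synthetic, purely for illustration:

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

# Local averaging (3x3 mean filter) suits Gaussian noise; a 3x3 median
# filter suits salt and pepper noise. `noisy` is any 2D grayscale array.
noisy = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(np.float64)
denoised_mean = uniform_filter(noisy, size=3)    # convolution with the 1/9 kernel
denoised_median = median_filter(noisy, size=3)   # median of each 3x3 neighbourhood
```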

Frequency domain

Two methods to transform an image from the pixel domain to the frequency domain are

  • 2D Discrete Fourier Transform (DFT)
  • 2D Discrete Cosine Transform (DCT)

Any function can be expressed in terms of harmonic functions of different frequencies. A discrete function with length $M$ can be represented as the sum of harmonic functions $$ f(x) = \sum_{u=0}^{M-1}F(u)e^{\frac{2\pi iux}{M}} $$ As an orthonormal basis this is $$ f(x) = \sum_{u=0}^{M-1}F(u)\frac{1}{\sqrt{M}}e^{\frac{2\pi iux}{M}} $$ where $F(u)$ is a Fourier coefficient.

This can be represented using matrices $$ \begin{bmatrix} f(0)\\ f(1)\\f(2) \\ \vdots \\ f(M-1) \end{bmatrix} = \frac{1}{\sqrt{M}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1\\ 1 & e^{\frac{2\pi i}{M}} & e^{\frac{4\pi i}{M}} & \cdots & e^{\frac{2(M-1)\pi i}{M}} \\ 1 & e^{\frac{4\pi i}{M}} & e^{\frac{8\pi i}{M}} & \cdots & e^{\frac{4(M-1)\pi i}{M}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & e^{\frac{2(M-1)\pi i}{M}} & e^{\frac{4(M-1)\pi i}{M}} & \cdots & e^\frac{2(M-1)^2\pi i}{M} \end{bmatrix} \begin{bmatrix} F(0) \\ F(1) \\ F(2) \\ \vdots \\ F(M-1) \end{bmatrix} $$ This is called the inverse Fourier transform and transforms frequency into space/time. The forward transform is found by taking the conjugate transpose, which transposes the matrix and negates the imaginary parts of the exponents: $$ \begin{bmatrix} F(0) \\ F(1) \\ F(2) \\ \vdots \\ F(M-1) \end{bmatrix} = \frac{1}{\sqrt{M}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1\\ 1 & e^{\frac{-2\pi i}{M}} & e^{\frac{-4\pi i}{M}} & \cdots & e^{\frac{-2(M-1)\pi i}{M}} \\ 1 & e^{\frac{-4\pi i}{M}} & e^{\frac{-8\pi i}{M}} & \cdots & e^{\frac{-4(M-1)\pi i}{M}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & e^{\frac{-2(M-1)\pi i}{M}} & e^{\frac{-4(M-1)\pi i}{M}} & \cdots & e^\frac{-2(M-1)^2\pi i}{M} \end{bmatrix} \begin{bmatrix} f(0)\\ f(1)\\f(2) \\ \vdots \\ f(M-1) \end{bmatrix} $$ This is called the forward Fourier transform and transforms time/space into frequency.

The 1D Fourier transform can be extended to the 2D Fourier transform. An $M \times N$ image can be expressed as $$ f(x,y) = \sum_{u=0}^{M-1}\sum_{v=0}^{N-1}F(u,v)e^{2\pi i\left(\frac{ux}{M} + \frac{vy}{N}\right)} $$

The 2D DFT is given by $$ F(u, v) = \frac{1}{M}\frac{1}{N}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}f(x,y)e^{-2\pi i\left(\frac{ux}{M} + \frac{vy}{N}\right)} $$

Most noise is found in high frequency components of an image. Removing high frequency coefficients in the frequency domain can remove noise in the pixel domain but does mean some information is lost.

Different masks can be applied to the frequency domain for noise removal

  • Ideal
  • Gaussian
  • Butterworth

These filters can be either high or low pass to remove high or low frequencies.
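As an illustration, an ideal low-pass filter zeroes every frequency coefficient beyond a chosen cutoff radius. A minimal sketch using NumPy's FFT routines; the cutoff is an arbitrary parameter:

```python
import numpy as np

def ideal_lowpass(img, cutoff):
    """Remove high-frequency components with an ideal low-pass mask.
    `cutoff` is the radius (in frequency bins) kept around the centre."""
    F = np.fft.fftshift(np.fft.fft2(img))            # centre the spectrum
    h, w = img.shape
    y, x = np.ogrid[:h, :w]
    dist = np.sqrt((y - h / 2) ** 2 + (x - w / 2) ** 2)
    F[dist > cutoff] = 0                             # zero high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```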

Notch filters can be used to remove repetitive spectral noise from an image.

The Discrete Cosine Transform (DCT) is similar to DFT. It uses cosines as a basis instead of harmonic functions. The forward 1D DCT is given by $$ F(u) = \sum_{x=0}^{N-1} f(x)\alpha(u)\cos\left[\frac{\pi(2x+1)u}{2N}\right] $$ and the inverse DCT is given by $$ f(x) = \sum_{u=0}^{N-1} F(u)\alpha(u)\cos\left[\frac{\pi(2x+1)u}{2N}\right] $$ where $\alpha(u) = \begin{cases}\sqrt{\frac{1}{N}} \text{ for } u = 0\\\sqrt{\frac{2}{N}} \text{ otherwise}\end{cases}$

The forward and inverse transforms have the same basis and the basis is orthonormal, real and symmetric.

The DC component is given by $F(u = 0)$ which simplifies to $$ \frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}f(x) $$

The 2D DCT is given by $$ F(u, v) = \sum_{x=0}^{M-1}\sum_{y=0}^{N-1}f(x,y)\alpha(u)\alpha(v)\cos\left[\frac{\pi(2x+1)u}{2M}\right]\cos\left[\frac{\pi(2y+1)v}{2N}\right] $$ and the DC component is given by $$ \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x,y) $$ DCT is ideal for compression because values can be discarded without noticeable loss of image quality.
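The 2D DCT can be computed directly from the definition. The sketch below is a naive (slow) implementation rather than an optimised library call, and checks that the DC component equals the pixel sum divided by $\sqrt{MN}$:

```python
import numpy as np

def dct2(f):
    """Direct implementation of the 2D DCT defined above."""
    M, N = f.shape
    alpha = lambda k, K: np.sqrt(1.0 / K) if k == 0 else np.sqrt(2.0 / K)
    F = np.zeros((M, N))
    x = np.arange(M)[:, None]
    y = np.arange(N)[None, :]
    for u in range(M):
        for v in range(N):
            basis = (np.cos(np.pi * (2 * x + 1) * u / (2 * M)) *
                     np.cos(np.pi * (2 * y + 1) * v / (2 * N)))
            F[u, v] = alpha(u, M) * alpha(v, N) * np.sum(f * basis)
    return F

block = np.arange(64, dtype=float).reshape(8, 8)
F = dct2(block)
assert np.isclose(F[0, 0], block.sum() / np.sqrt(block.size))  # DC component
```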

Digital watermarking

Digital watermarking inserts a signature pattern into data before it is distributed. There are several types

  • Blind - doesn't require original data to recover the watermark
  • Visible - claim image ownership
  • Private - can only be detected by authorised persons
  • Robust - resistant to cropping, rotation or resizing

Pixel domain

One method of watermarking is bitplane substitution. A bitplane consists of the value of a specific bit of each pixel in an image. The LSB bitplane is every least significant bit in the image. The LSB bitplane has the least effect on the image when replaced with a watermark so can be used for invisible watermarking. Using higher significance bitplanes makes the watermark more visible. Watermarks can either be unrelated images or be derived from the original image.

Bitplane substitution is simple and fast, allows for visible or invisible watermarks and does not necessarily require the original image to recover the watermark but is fragile and vulnerable to cropping.
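A minimal sketch of LSB bitplane substitution for an 8-bit grayscale cover image and a binary watermark of the same shape (helper names are illustrative):

```python
import numpy as np

def embed_lsb(img, watermark):
    """Replace the LSB bitplane of `img` with a binary watermark (0/1)."""
    return (img & 0xFE) | watermark.astype(img.dtype)

def extract_lsb(img):
    """Recover the LSB bitplane; no original image required."""
    return img & 1

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, (8, 8), dtype=np.uint8)
mark = rng.integers(0, 2, (8, 8), dtype=np.uint8)
stego = embed_lsb(cover, mark)
assert np.array_equal(extract_lsb(stego), mark)
```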

Frequency domain

Watermarking can be done in the frequency domain by altering components of the DCT or DFT of an image. This makes invisible watermarks which are more robust to common attacks.

The DCT basis can be divided into low, mid and high frequency regions. The low and mid frequency regions are perceptually important so are used for watermarking. The watermark can then be embedded using LSB substitution or spread spectrum watermarking.

Spread spectrum (SS) watermarking uses a random, Gaussian-distributed sequence as a watermark. This watermark can then be embedded into the perceptually important DCT coefficients. For coefficients $h$ and watermark $w$, the watermarked coefficients $h^*$ can be obtained using several methods such as

  • $h^* = h(1+\alpha w)$
  • $h^* = h(e^{\alpha w})$
  • $h^* = h(1+\alpha hw)$

where $\alpha$ is some constant. Higher values of $\alpha$ cause more distortion to the watermarked image.

Hybrid watermarking uses a combination of pixel and frequency domain watermarking on blocks in an image.

Watermarking in the frequency domain is more robust to cropping, compression, removal and filtering than in the pixel domain.

Recovered watermarks can be compared to the original watermark. For pixel domain watermarking, MSE, correlation or SSIM can be used. For spread spectrum, the similarity is given by $$ \operatorname{sim}(\hat{w}, w) = \frac{\hat{w}w^T}{\sqrt{\hat{w}\hat{w}^T}} $$ where $w$ is the original watermark and $\hat{w}$ is the recovered watermark.
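A minimal sketch of spread spectrum embedding using $h^* = h(1+\alpha w)$ together with the similarity detector above. It assumes the original coefficients are available at detection time (a non-blind scheme); all values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(100, 20, 1000)        # perceptually important DCT coefficients
w = rng.normal(0, 1, 1000)           # Gaussian spread-spectrum watermark
alpha = 0.1

h_star = h * (1 + alpha * w)         # embed: h* = h(1 + alpha w)

# Detection: recover an estimate of the watermark, then compute sim(w_hat, w)
w_hat = (h_star - h) / (alpha * h)
sim = (w_hat @ w) / np.sqrt(w_hat @ w_hat)
print(sim)                           # a large value indicates the watermark is present
```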

Attacks on watermarks include

  • Compression
  • Filtering
  • Jitter attack - randomly duplicates and deletes visually imperceptible data

Compression

Raw images have lots of redundant data. Both lossless and lossy compression can be applied to images. Types of redundancy include

  • Spatial - neighbouring pixels have very similar values
  • Psychovisual - details in images humans can't see
  • Coding - pixel values are stored with more bits than an optimal code would need

JPEG is the most common image compression algorithm. It has the following steps

  1. Colour space conversion - Convert RGB to YCbCr, chrominance channels can be subsampled
  2. Division into sub-images - Divide image into non-overlapping blocks called macroblocks or minimum coded units (MCUs), blocks are usually 8x8 or 16x16
  3. DCT on each block
  4. Quantiser - apply matrix that roughly models the human sensitivity to the different DCT coefficients, different matrices are used for different image qualities, results in lots of 0 coefficients
  5. Huffman coding - efficiently and losslessly encode quantised DCT coefficients

Compression-based forensics

Compression-based forensic techniques rely mostly on detecting artifacts introduced by multiple quantisation.

One technique is double compression detection. Quantisation changes DCT coefficients so double quantisation artifacts will be visible in the distribution of DCT coefficients. Quantisation can be expressed as $q_a(u) = \left\lfloor\frac{u}{a}\right\rfloor$ where $a$ is the quantisation factor. De-quantisation is given by $q_a(u)a = \left\lfloor\frac{u}{a}\right\rfloor a$ and double quantisation is given by $q_{ab}(u) = \left\lfloor q_a(u)\frac{a}{b}\right\rfloor = \left\lfloor\left\lfloor\frac{u}{a}\right\rfloor\frac{a}{b}\right\rfloor$.

The histograms of the DCT coefficients can be analysed to look for quantisation artifacts. Periodic empty bins or periodic peaks in the histogram are likely double quantisation artifacts.
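The periodic empty bins can be demonstrated numerically. A minimal sketch using arbitrary quantisation factors $a=3$ then $b=2$, chosen purely for illustration:

```python
import numpy as np

def quantise(u, a):
    return np.floor(u / a)

def double_quantise(u, a, b):
    # quantise with step a, de-quantise, then re-quantise with step b
    return np.floor(quantise(u, a) * a / b)

u = np.arange(0, 200)
double = double_quantise(u, 3, 2)

# The histogram of the doubly quantised values shows periodic empty bins
bins = np.arange(double.max() + 2)
print(np.histogram(double, bins=bins)[0])
```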

Another technique is JPEG ghost detection. By looking at how the quantisation error changes when using different quantisation factors it is possible to see artifacts where different quantisation factors were used for different parts of an image.

Copy-move forgery

Copy-move forgery involves copying part of an image to another part of the same image. It can be detected using several methods

  • Exhaustive search by circular shift and match
  • Block matching in the pixel or frequency domains

Exhaustive search works by repeatedly circular shifting the image and looking for matching segments. Erosion and dilation are used to reduce false positives. Exhaustive search is computationally expensive.

Block matching works by converting blocks in an image to vectors and searching for matches. In the spatial domain, the block matching algorithm is as follows

  1. Slide a $b \times b$ block along the image from top left to bottom right
  2. Convert each block into a vector of length $b^2$ along with its position to form the feature table
  3. Sort the feature table lexicographically, identify consecutive similar rows as these correspond to similar blocks
  4. Identify the position of the similar blocks

This can also be done in the frequency domain using blocks of DCT coefficients. Lexicographic sorting in the pixel domain is dominated by the first pixel of each block, which might not be representative of the whole block. In the frequency domain, blocks are sorted by their DC component, which contains the most information about the block, so this can produce better results.
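A minimal pixel-domain sketch of the block matching algorithm above, reporting positions of identical blocks (a real detector would use a similarity threshold and offset statistics rather than exact equality):

```python
import numpy as np

def block_match(img, b=4):
    """Build the feature table, sort it lexicographically and report
    positions of consecutive identical rows (candidate copied blocks)."""
    h, w = img.shape
    features = []
    for r in range(h - b + 1):
        for c in range(w - b + 1):
            block = img[r:r+b, c:c+b].ravel()        # vector of length b^2
            features.append((tuple(block), (r, c)))  # feature plus position
    features.sort(key=lambda f: f[0])                # lexicographic sort
    matches = []
    for (v1, p1), (v2, p2) in zip(features, features[1:]):
        if v1 == v2 and p1 != p2:                    # consecutive similar rows
            matches.append((p1, p2))
    return matches
```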

Feature matching

Image features can also be used for forgery detection. Image features can be global or local depending on whether they capture properties of the entire image or a smaller part of it. Features can be pixel values, features in the frequency domain, statistical features or features that encode shape, colour or texture. A good feature should be compact and robust against geometric and photometric distortion.

Local binary pattern (LBP) is a feature that can efficiently encode the texture information of an image region. It considers the values of neighbouring pixels to identify patterns. Rotation invariant LBP considers the LBP at different rotations and takes the minimum.

An LBP histogram can be used to detect copy-move forgery. The histogram uses a bin for each LBP value which isn't compact, so a more compact representation uses bins for each uniform LBP value and a bin for non-uniform LBP values. A value is uniform if there are at most 2 changes in bit from left to right. For example, 11101111 (2 changes) is uniform but 01101001 (5 changes) is not.
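The uniformity test is a simple bit-transition count. A minimal sketch, treating the LBP as a bit string:

```python
def is_uniform(pattern):
    """True if an LBP bit string has at most 2 changes from left to right."""
    changes = sum(pattern[i] != pattern[i + 1] for i in range(len(pattern) - 1))
    return changes <= 2

print(is_uniform("11101111"))  # True  (2 changes)
print(is_uniform("01101001"))  # False (5 changes)
```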

The gradients, $G$ of an image are given by $$ G_x = \nabla f_x * f(x, y)\\ G_y = \nabla f_y * f(x, y) $$ where $*$ denotes convolution and $$ \begin{aligned} \nabla f_x &= \begin{bmatrix} 1 & -1 \end{bmatrix} \text{ or } \begin{bmatrix} -1 & 1 \end{bmatrix} \\ \nabla f_y &= \begin{bmatrix} 1 \\ -1 \end{bmatrix} \text{ or }\begin{bmatrix} -1 \\ 1 \end{bmatrix} \end{aligned} $$ for forward and backward difference respectively.

The magnitude is given by $G_{mag} = \sqrt{G_x^2 + G_y^2}$ or $G_{mag} = |G_x| + |G_y|$ and the direction is given by $G_{ang} = \tan^{-1}\left(\frac{G_y}{G_x}\right)$.

$G_{mag}$ and $G_{ang}$ are used to produce a histogram where each bin covers a range of gradient directions and is weighted by gradient magnitude. This is called a Histogram of Oriented Gradients (HOG) and can be used like the LBP histogram to detect features.
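A minimal sketch of the gradient computation and a magnitude-weighted orientation histogram (a simplified, whole-image version of HOG; real HOG descriptors are computed per cell and normalised per block):

```python
import numpy as np

def gradients(img):
    """Forward-difference gradients, magnitude and direction of a 2D image."""
    f = img.astype(np.float64)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, :-1] = f[:, 1:] - f[:, :-1]     # horizontal difference
    gy[:-1, :] = f[1:, :] - f[:-1, :]     # vertical difference
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx)
    return mag, ang

def hog_histogram(mag, ang, bins=9):
    """Orientation histogram: angles binned, weighted by gradient magnitude."""
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist
```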

Source Device Identification

Source device identification considers two problems

  • Verification - verifies if a given image was captured by a given device
  • Recognition - determines which device captured an image

Source device identification can be done using the following

  • Chrominance subsampling (only useful if different devices use different subsampling methods)
  • EXIF header - camera model, exposure, date/time, resolution (may be removed)
  • Watermarking - some camera models embed invisible watermarks for device identification
  • Sensor Pattern Noise (SPN) - imperfections and noise characteristics of imaging sensors

SPN is intrinsic to the image acquisition process so can't be avoided and is resistant to image processing such as JPEG compression.

Sensor noise consists of shot noise and SPN. Shot noise is random noise caused by the number of photons collected by a sensor and can be modelled as a Poisson distribution. SPN is a deterministic distortion component and stays approximately the same over multiple images.

SPN can be divided into Fixed Pattern Noise (FPN) and Photo-response Non-Uniformity (PRNU). PRNU can also be divided into Pixel Non-Uniformity (PNU) and low frequency defects.

FPN refers to the variation in pixel sensitivity when a sensor array is not exposed to light. It is also called the dark current noise. FPN can be suppressed by subtracting a dark image taken by the same sensor from a given image.

PRNU is the dominant component of SPN. The primary part is PNU which arises from the different sensitivity of pixels to light. This is caused by properties of silicon and manufacturing imperfections. PNU is not affected by temperature or other external conditions. PNU is unique and can be used for source device identification.

Low frequency defects are caused by light refraction from dust and optical surfaces.

Sensor noise can be modelled as $$ y_{ij} = f_{ij}(x_{ij}+\eta_{ij}) + c_{ij} + \epsilon_{ij} $$ where $y$ is the pre-processed pixel output, $f$ is PRNU, $x$ is the light level of the $ij$ th sensor, $\eta$ is shot noise, $c$ is FPN and $\epsilon$ is random noise.

The PRNU can be found by ignoring shot noise and random noise and rearranging $$ f_{ij} = \frac{y_{ij} - c_{ij}}{x_{ij}} $$ but the pre-processed pixel value, $y_{ij}$ is not available so estimation has to be used instead.

SPN can be approximated using multiple reference images. The scene content can be suppressed from each image by de-noising and then subtracting the de-noised image from the original image to get the noise residual. The average noise residual over all the images gives a reference estimate for the SPN.

The SPN of a single test image can be estimated by assuming no post-processing takes place, giving $SPN_{ij} = y_{ij} - F(y_{ij})$ where $F$ is a de-noising filter. Rearranging the sensor noise model and ignoring FPN and random noise gives $SPN_{ij} = f_{ij}(x_{ij}+\eta_{ij}) - F(f_{ij}(x_{ij}+\eta_{ij}))$. Assuming the filtered image is similar to the original, this can be written as $SPN_{ij} = f_{ij}(x_{ij}+\eta_{ij}) - f_{ij}(\hat{x}_{ij} + \hat{\eta}_{ij})$, which simplifies to $SPN_{ij} = f_{ij}(\Delta x_{ij} + \Delta \eta_{ij})$ where $\Delta x_{ij}$ is the high frequency scene detail and $\Delta \eta_{ij}$ is the shot noise residual.
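A minimal sketch of the estimation procedure: compute noise residuals with a de-noising filter (a Gaussian filter is used here purely as a stand-in for $F$), average them over reference images, and correlate a test residual against the reference SPN:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(img, sigma=1.0):
    """Noise residual: original minus a de-noised version of the image."""
    img = img.astype(np.float64)
    return img - gaussian_filter(img, sigma)

def reference_spn(images):
    """Average noise residual over several images from the same device."""
    return np.mean([noise_residual(im) for im in images], axis=0)

def correlate(a, b):
    """Sample correlation between a test residual and the reference SPN."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```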

Video forensics

As videos are just sequences of images, the same forgeries occur and the same forensic techniques can be applied to detect them.

One type of video-specific forgery is frame deletion. This can be detected through analysis of prediction errors.

In video compression, not every frame is stored independently. Some frames are intra-coded as complete images; these are called I-frames or keyframes. P-frames are predicted from previous I- or P-frames, and B-frames are predicted from both previous and following frames. The difference between a real frame and its prediction is called the error frame.

Video frames are divided into a sequence of groups of pictures which are encoded and decoded independently.