Digital Humans

Person-Specific Gaussian Splatting

The Challenge of Digital Humans

Representing humans is one of the hardest problems in computer vision. Unlike static scenes, humans deform — they move, gesture, and express. A good digital human model must handle articulated body motion, facial expressions, clothing dynamics, and hair — all in real time.

Person-specific Gaussian Splatting adapts the 3DGS framework for this challenge by combining Gaussians with parametric body models like SMPL (body) and FLAME (face).

Canonical vs. Posed Space

The fundamental idea is to separate appearance from motion. Gaussians are defined in a canonical space (a neutral T-pose or rest expression), then deformed into the target pose at render time.

This separation means the model can generalize to novel poses and expressions not seen during training — the appearance is stored once and the deformation can be driven by any set of pose parameters.

Pose-Dependent Deformation

x′ = W(x, θ, w) = ∑ₖ wₖ · Bₖ(θ) · x

x = canonical position, θ = pose parameters, wₖ = skinning weights (with ∑ₖ wₖ = 1), Bₖ(θ) = bone transforms
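The blend above can be sketched in a few lines of numpy. This is a minimal illustration, not any library's API: bone transforms are 4×4 rigid matrices, and the weighted sum of transforms is applied to the point in homogeneous coordinates.

```python
import numpy as np

def lbs_point(x, bone_transforms, weights):
    """Blend a canonical point x into posed space.

    x               : (3,) canonical position
    bone_transforms : (K, 4, 4) per-bone rigid transforms B_k(theta)
    weights         : (K,) skinning weights w_k, summing to 1
    """
    x_h = np.append(x, 1.0)                          # homogeneous coordinates
    blended = np.einsum("k,kij->ij", weights, bone_transforms)
    return (blended @ x_h)[:3]

# Example: a point influenced equally by an identity bone and a bone
# translated by +1 along x ends up halfway between the two.
B = np.stack([np.eye(4), np.eye(4)])
B[1, 0, 3] = 1.0
x_posed = lbs_point(np.zeros(3), B, np.array([0.5, 0.5]))
# x_posed == [0.5, 0.0, 0.0]
```

In a full avatar each Gaussian carries its own weight vector, so the same function is applied (vectorized) to all Gaussian centers at once.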

Pose-Driven Gaussian Avatar

Move the mouse left/right to rotate the head and up/down to control the expression (smile). Watch how the Gaussians deform with the pose.

Linear Blend Skinning (LBS)

The deformation from canonical to posed space uses Linear Blend Skinning, the same technique used in game engines and animation software. Each Gaussian has skinning weights that determine how much it follows each bone in the skeleton.

However, LBS has known limitations — it can produce artifacts at joints (the "candy wrapper" effect). Modern approaches add a learned residual deformation on top of LBS to correct these artifacts.

Corrected Deformation

x' = LBS(x, θ) + Δx(θ, x)

Δx is a learned MLP that predicts pose-dependent corrections
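A toy version of this corrected deformation can be sketched as follows. The network sizes, the 72-dimensional pose vector (24 joints × 3 axis-angle parameters, SMPL-style), and the random weights are all illustrative; in a real model the MLP weights are trained jointly with the Gaussian parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual network for Δx(θ, x); sizes and weights are illustrative.
N_POSE = 72                                   # e.g. 24 joints × 3 axis-angle params
W1, b1 = rng.normal(0, 0.1, (64, 3 + N_POSE)), np.zeros(64)
W2, b2 = rng.normal(0, 0.01, (3, 64)), np.zeros(3)

def delta_x(x_canonical, theta):
    """Two-layer MLP predicting a pose-dependent 3D correction."""
    h = np.tanh(W1 @ np.concatenate([x_canonical, theta]) + b1)
    return W2 @ h + b2

# The correction is added on top of the (precomputed) LBS result:
x_lbs = np.array([0.1, 0.2, 0.3])             # posed position from LBS
x_corrected = x_lbs + delta_x(np.zeros(3), np.zeros(N_POSE))
```

Because the residual is small and conditioned on pose, it can bend geometry near joints where the purely linear blend collapses.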

Deformation Field

Move the mouse left/right to control the deformation strength; move above or below center to toggle between the canonical and deformed views.

Expression Blending

Facial expressions are modeled using blendshapes — a set of predefined expression offsets that are linearly combined. The FLAME model provides linear expression blendshapes learned from 3D scans; using the first 50 principal components captures most of the variation in human facial expression.

Each Gaussian's position in the canonical space is offset by a weighted combination of these blendshapes, allowing the model to transition smoothly between expressions.

Expression Blendshapes

x_expr = x_canonical + ∑ᵢ ψᵢ · B_expr,i(x)

ψ ∈ ℝ⁵⁰ = expression coefficients, B_expr,i = i-th blendshape
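This linear combination is straightforward to vectorize over all Gaussians. The shapes below are illustrative (1000 Gaussians, 50 blendshapes), and the random offsets stand in for learned per-Gaussian blendshape directions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GAUSS, N_EXPR = 1000, 50
x_canonical = rng.normal(size=(N_GAUSS, 3))          # canonical Gaussian centers
B_expr = rng.normal(0, 0.01, (N_EXPR, N_GAUSS, 3))   # per-Gaussian blendshape offsets

def apply_expression(x, psi):
    """x_expr = x + Σ_i ψ_i · B_expr,i  (linear combination of offsets)."""
    return x + np.einsum("i,ind->nd", psi, B_expr)

psi = np.zeros(N_EXPR)
psi[0] = 1.0                                          # activate the first blendshape
x_expr = apply_expression(x_canonical, psi)
# With ψ = e_0, the total offset is exactly B_expr[0]
assert np.allclose(x_expr - x_canonical, B_expr[0])
```

Interpolating ψ between two coefficient vectors then yields the smooth expression transitions described above.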

Training from Monocular Video

A key advantage of person-specific Gaussian Splatting is that it can be trained from just a monocular video — a single camera recording of a person. The pipeline:

  1. Track — Fit SMPL/FLAME parameters to each frame (pose, expression, shape)
  2. Initialize — Place Gaussians on the mesh surface in canonical space
  3. Optimize — Train Gaussian parameters + deformation network to match all frames
  4. Render — Drive the avatar with new pose/expression parameters in real time
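The optimization step (step 3) can be sketched as a skeleton loop. Everything here is a stand-in: the tracked frames are random arrays, and the placeholder renderer averages colors instead of splatting. A real pipeline renders with a differentiable Gaussian rasterizer and backpropagates a photometric loss through the deformation network and Gaussian parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in tracked video: per-frame pose parameters and target images
frames = [{"theta": rng.normal(size=72),            # tracked SMPL-style pose
           "image": rng.random((8, 8, 3))}          # corresponding video frame
          for _ in range(4)]

params = {"centers": rng.normal(size=(100, 3)),     # canonical Gaussian centers
          "colors":  rng.random((100, 3))}          # per-Gaussian colors

def render(params, theta):
    """Placeholder renderer: a real one deforms the Gaussians by theta
    and splats them with a differentiable rasterizer."""
    mean_color = params["colors"].mean(axis=0)
    return np.broadcast_to(mean_color, (8, 8, 3))

for epoch in range(2):
    for frame in frames:
        pred = render(params, frame["theta"])
        loss = float(np.mean((pred - frame["image"]) ** 2))  # photometric L2
        # a real trainer would take a gradient step on params here
```

The key structural point is that the same canonical parameters are reused across every frame; only the pose θ changes, which is what lets the trained avatar be driven by novel poses in step 4.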

Applications

Person-specific Gaussian avatars enable exciting applications:

  • Telepresence — Photorealistic avatars for VR/AR communication
  • Virtual try-on — Realistic clothing simulation on personal avatars
  • Film & games — Rapid digital double creation from phone video
  • Accessibility — Sign language avatars, expression-driven communication aids