Methodology Reference

Documentation for all metrics and statistics in the SERUM dashboard. This page is auto-generated from the methodology registry.

SERUM Metrics

Metrics from the SERUM multi-pass analysis pipeline. Confidence scores use a 1-10 integer scale (LLM self-reported certainty, enforced by Pydantic validators).

Activity Stability (Vocabulary Jaccard)

Jaccard similarity of the unique state label sets between consecutive activity passes. Measures vocabulary overlap, not per-frame agreement.

J(A,B) = |A ∩ B| / |A ∪ B|

Scale: 0 – 1 (ratio)

Higher values indicate more stable vocabulary across passes. Penalizes valid refinement (e.g., "editing_video" becoming "editing_video_in_premiere" counts as instability).

eval/fsm/passes_to_fsm.py:748-758
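
A minimal sketch of the vocabulary Jaccard described above, written for illustration and not taken from the referenced file. The function name, the example labels, and the handling of two empty vocabularies are assumptions.

```python
def vocabulary_jaccard(labels_a, labels_b):
    """Jaccard similarity of the unique label sets of two passes."""
    a, b = set(labels_a), set(labels_b)
    union = a | b
    if not union:
        # Assumption: two empty vocabularies are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Per-frame labels from two consecutive activity passes (made-up data).
pass_a = ["browsing_web", "editing_video", "editing_video", "reading_email"]
pass_b = ["browsing_web", "editing_video_in_premiere", "reading_email"]

score = vocabulary_jaccard(pass_a, pass_b)
print(score)  # 2 shared labels out of 4 distinct labels
```

Because only the unique label sets are compared, the repeated "editing_video" frames have no effect, and the refinement to "editing_video_in_premiere" lowers the score even though the frames describe the same activity.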

Intent Stability (Vocabulary Jaccard)

Jaccard similarity of the unique intent label sets between consecutive intent passes. Measures vocabulary overlap, not per-frame agreement.

J(A,B) = |A ∩ B| / |A ∪ B|

Scale: 0 – 1 (ratio)

Penalizes valid refinement. Intents are meant to become more specific across passes, so lower stability is not necessarily bad.

eval/fsm/passes_to_fsm.py:748-770

Activity-Intent Consistency

Measures alignment between activity states and their inferred intents, computed as the fraction of frames in which the activity label and the intent label share at least one significant word.

consistency = avg(token_overlap(activity, intent))

Scale: 0 – 1 (ratio)

Values closer to 1.0 indicate activities and intents are well-aligned. Low values may indicate mismatched abstraction levels.

eval/fsm/passes_to_fsm.py:772-788
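
One possible reading of the consistency formula, sketched under stated assumptions: labels are snake_case, a "significant word" is any token not in a small stopword list, and token_overlap returns 1 when the two labels share such a word and 0 otherwise. The stopword list and all function names are hypothetical; the referenced file may define these differently.

```python
# Hypothetical stopword list; the actual definition of "significant" is not given here.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "for", "and", "with"}


def significant_tokens(label):
    """Split a snake_case label into lowercase words, dropping stopwords."""
    return {tok for tok in label.lower().split("_") if tok and tok not in STOPWORDS}


def token_overlap(activity, intent):
    """1.0 if the two labels share at least one significant word, else 0.0."""
    shared = significant_tokens(activity) & significant_tokens(intent)
    return 1.0 if shared else 0.0


def consistency(activities, intents):
    """Average token_overlap over frame-aligned activity/intent labels."""
    pairs = list(zip(activities, intents))
    if not pairs:
        return 0.0  # assumption: no frames means no measurable consistency
    return sum(token_overlap(a, i) for a, i in pairs) / len(pairs)


# Made-up frame-aligned labels.
activities = ["editing_video", "browsing_web", "reading_email", "writing_code"]
intents = ["produce_video_for_client", "research_the_topic", "triage_email", "fix_bug"]

score = consistency(activities, intents)
print(score)  # frames 1 and 3 share a word ("video", "email"); frames 2 and 4 do not
```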

Entropy

Shannon entropy measuring the diversity/disorder of state labels. Higher entropy means more varied states.

H = -Σ p(x) × log₂(p(x))

Scale: 0 – log₂(n) (bits, n = number of unique states)

p(x) is the probability of each unique state. Higher values indicate more diverse state labels.

eval/fsm/passes_to_fsm.py:184-192

Normalized Entropy

Entropy normalized to [0,1] range for comparison across different state vocabularies.

H_norm = H / log₂(n)

Scale: 0 – 1 (ratio)

n = number of unique states. 1.0 = maximum disorder (uniform distribution over the n states). 0 = all frames have the same state (n = 1, so H = 0).

eval/fsm/passes_to_fsm.py:701
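
The entropy and normalized-entropy formulas above can be sketched as follows. This is an illustration, not the code in the referenced file; the function names and the choice to return 0 when only one state is present (where log₂(n) = 0) are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(labels):
    """Entropy divided by log2(n), where n is the number of unique labels."""
    n = len(set(labels))
    if n <= 1:
        # Assumption: log2(1) = 0 would divide by zero, so a single-state pass maps to 0.
        return 0.0
    return shannon_entropy(labels) / math.log2(n)


uniform = ["coding", "browsing", "email", "chat"]     # four equally likely states
skewed = ["coding", "coding", "coding", "browsing"]   # one dominant state

print(shannon_entropy(uniform), normalized_entropy(uniform))
print(shannon_entropy(skewed), normalized_entropy(skewed))
```

The uniform example reaches the maximum of log₂(4) = 2 bits, which normalizes to 1.0; the skewed example falls below 1.0 because one state dominates.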

Average Confidence

Mean confidence score across all frames in a pass. The LLM's self-reported certainty on a 1-10 integer scale, enforced by Pydantic validators. Anchored as: 1-3 = occluded/blurry/ambiguous, 4-6 = visible but multiple descriptions possible, 7-8 = clearly visible with minor ambiguity, 9-10 = only one reasonable description.

avg_conf = Σ confidence_i / n

Scale: 1 – 10 (integer score)

≥8: High; ≥6: Good; ≥4: Moderate; ≥2: Low

Known limitation: 8B VLMs exhibit ceiling effects on verbalized confidence, with scores clustering at 9-10 regardless of actual ambiguity. This metric is most informative for identifying genuinely unclear frames (scores 1-3) rather than distinguishing between "good" and "great" labels. Inter-pass label agreement may be a more reliable uncertainty signal.

gum/gum/prompts/state_prompt.py:34

High Confidence Ratio

Proportion of frames with confidence ≥ 8 (out of 10).

ratio = count(conf ≥ 8) / total_frames

Scale: 0 – 1 (ratio)

Typically very high (97%+) due to the confidence ceiling effect in small VLMs. Most useful for comparing across videos rather than interpreting absolute values.

eval/fsm/passes_to_fsm.py:661
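
A short sketch of the two confidence summaries (average confidence and high confidence ratio), assuming the per-frame scores are already available as a list of integers from 1 to 10. The function names and sample scores are illustrative only.

```python
def average_confidence(scores):
    """Mean of the per-frame confidence scores."""
    return sum(scores) / len(scores)


def high_confidence_ratio(scores, threshold=8):
    """Share of frames whose confidence is at or above the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Made-up scores for eight frames of one pass.
scores = [9, 10, 9, 3, 8, 9, 10, 7]

avg_conf = average_confidence(scores)
ratio = high_confidence_ratio(scores)
print(avg_conf, ratio)
```

In this sample the single score of 3 is the kind of genuinely unclear frame the average is most useful for surfacing, while the ratio counts the six frames at 8 or above.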

Self-Transition Rate

Proportion of consecutive frame pairs where the primary state remains the same.

rate = count(state_t == state_{t+1}) / (n - 1)

Scale: 0 – 1 (ratio)

High values (90%+) indicate stable, long-duration states. Low values suggest frequent state changes or jittery labeling.

eval/fsm/passes_to_fsm.py:673-677
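
The rate formula above can be sketched like this. The function name and the return value for sequences with fewer than two frames are assumptions, and the code is not taken from the referenced file.

```python
def self_transition_rate(states):
    """Share of consecutive frame pairs whose state label is unchanged."""
    if len(states) < 2:
        # Assumption: with no consecutive pairs, report 0 rather than divide by zero.
        return 0.0
    same = sum(1 for cur, nxt in zip(states, states[1:]) if cur == nxt)
    return same / (len(states) - 1)


states = ["coding", "coding", "coding", "browsing", "coding"]
rate = self_transition_rate(states)
print(rate)  # 2 unchanged pairs out of 4
```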

Confidence Delta

Change in confidence between first and last activity pass for a frame.

Δ = conf_final_pass - conf_first_pass

Scale: -9 – 9 (score difference)

Positive values indicate refinement improved confidence. Negative values suggest later passes were less certain.

web-dev/app/components/MetricsViz.tsx (RefinementChain)

Frames

Total number of screenshots analyzed in this pass. Each frame is a single point-in-time observation of the user's screen.

eval/fsm/passes_to_fsm.py:699

Unique States

Number of distinct primary state labels assigned across all frames in this pass. A lower count means the model consolidated frames into fewer categories.

eval/fsm/passes_to_fsm.py:715

Unique Intents

Number of distinct intent labels inferred across all frames in this pass. Intents describe the purpose behind the observed actions.

eval/fsm/passes_to_fsm.py:718

Total Passes

Number of analysis passes run over the footage. Odd passes annotate observable actions, even passes infer intents. The two levels are mutually informative: action evidence anchors intent inference, intent context disambiguates ambiguous actions.

Vocabulary empirically stabilizes by pass 8 (schematic equilibrium). Running beyond this point yields diminishing returns, though additional passes may still refine individual frame labels.

eval/fsm/passes_to_fsm.py:790

Schematic Equilibrium

The point at which the raw state vocabulary converges to a stable size across successive passes. Both action and intent vocabularies empirically stabilize by pass 8, indicating the iterative refinement has reached conceptual stability.

Scale: 1 – 12 (pass number)

Convergence is measured by vocabulary size plateauing across consecutive same-type passes. Schematic equilibrium validates the multi-pass design: early passes produce noisy, inconsistent labels; later passes converge to a stable behavioral taxonomy without an imposed ontology.

eval/charts.py

Behavioral Complexity Profile

Per-video comparison of normalized entropy, vocabulary size, and self-transition rate from the last activity/intent pass.

Higher entropy and vocabulary indicate more diverse behavior. Higher self-transition rate indicates longer dwell times.

web-dev/lib/comparison-utils.ts

State Embedding (SentenceBERT)

Each state label is converted from snake_case to natural language and encoded into a 384-dimensional dense vector using the all-MiniLM-L6-v2 SentenceBERT model. Semantically similar states (e.g., "editing code" and "writing code") produce nearby vectors.

The embedding captures semantic similarity between state labels. States that describe similar activities cluster together in the projected 2D space.

eval/embedding/compute.py

UMAP 2D Projection

High-dimensional SentenceBERT embeddings are projected to 2D using UMAP (Uniform Manifold Approximation and Projection), which preserves local neighborhood structure and much of the global topology.

UMAP(n_neighbors=15, min_dist=0.1, metric=cosine)

Nearby points represent semantically similar states. Clusters indicate groups of related activities. Edge thickness shows transition frequency. The projection is computed separately per pass; use Global mode for cross-pass comparison.

eval/embedding/compute.py

Markov Metrics

Metrics from the split-half Markov evaluation. Probabilities use a 0-1 continuous scale (Laplace-smoothed transition frequencies).

Transition Probability

Probability of transitioning from one state to another, learned from the training set using a first-order Markov chain with Laplace smoothing.

P(j|i) = (count(i→j) + 1) / (Σ_k count(i→k) + |V|)

Scale: 0 – 1 (probability)

≥0.5: High; ≥0.2: Medium; ≥0: Low

Higher probability means a transition is more likely. Self-transition probabilities indicate how long the model expects the user to stay in a state. Laplace (add-one) smoothing ensures no transition has zero probability (Chen & Goodman 1999).

eval/markov/model.py:32-59
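
A minimal sketch of the Laplace-smoothed estimate above, assuming the vocabulary V is the set of states seen in the training sequence. The function name and the training data are illustrative and not taken from the referenced file.

```python
from collections import Counter, defaultdict


def fit_transition_probs(states):
    """First-order Markov transition probabilities with add-one smoothing."""
    vocab = sorted(set(states))
    counts = defaultdict(Counter)
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1

    probs = {}
    for i in vocab:
        row_total = sum(counts[i].values())
        # P(j|i) = (count(i->j) + 1) / (sum_k count(i->k) + |V|)
        probs[i] = {j: (counts[i][j] + 1) / (row_total + len(vocab)) for j in vocab}
    return probs


train = ["coding", "coding", "browsing", "coding", "email"]
probs = fit_transition_probs(train)

print(probs["browsing"])  # one observed transition (browsing -> coding), |V| = 3
print(probs["email"])     # no observed outgoing transitions: uniform over V
```

Every row sums to 1, and a state with no observed outgoing transitions ("email" here) falls back to a uniform row, which is what keeps perplexity finite on unseen transitions.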

Top-1 Accuracy

Fraction of test transitions where the model's single highest-probability prediction matched the actual next state.

top1_acc = correct_top1 / n_transitions

Scale: 0 – 1 (ratio)

Higher is better. Compare against majority-class baseline to assess whether transition structure adds predictive value beyond marginal frequency.

eval/markov/evaluate.py:82-86

Top-3 Accuracy

Fraction of test transitions where the actual next state appeared among the model's 3 highest-probability predictions.

top3_acc = correct_top3 / n_transitions

Scale: 0 – 1 (ratio)

Higher is better. A large gap between top-1 and top-3 suggests the model knows the likely transitions but struggles to pick the single best one.

eval/markov/evaluate.py:83-88
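
Top-1 and top-3 accuracy differ only in how many ranked predictions are checked, so one function can illustrate both. This is a sketch under assumptions: the transition table is hypothetical, ties are broken alphabetically, and a test state missing from the table counts as a miss.

```python
def top_k_accuracy(probs, test_states, k):
    """Fraction of test transitions whose true next state is in the top-k predictions."""
    transitions = list(zip(test_states, test_states[1:]))
    if not transitions:
        return 0.0
    hits = 0
    for cur, actual in transitions:
        row = probs.get(cur, {})
        # Rank by descending probability; ties broken alphabetically (an assumption).
        ranked = sorted(row, key=lambda s: (-row[s], s))
        hits += actual in ranked[:k]
    return hits / len(transitions)


# Hypothetical transition table over four states.
probs = {
    "coding":   {"coding": 0.6, "browsing": 0.2, "email": 0.15, "chat": 0.05},
    "browsing": {"coding": 0.3, "browsing": 0.4, "email": 0.2, "chat": 0.1},
    "email":    {"coding": 0.25, "browsing": 0.15, "email": 0.5, "chat": 0.1},
    "chat":     {"coding": 0.5, "browsing": 0.1, "email": 0.1, "chat": 0.3},
}
test = ["coding", "coding", "browsing", "email", "chat", "coding"]

top1 = top_k_accuracy(probs, test, k=1)
top3 = top_k_accuracy(probs, test, k=3)
print(top1, top3)
```

Here two of the five test transitions are predicted exactly, and two more fall within the top three, which is the kind of top-1/top-3 gap described above.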

Perplexity

Measures how surprised the model is on average. A perplexity of N means the model is as uncertain as choosing uniformly among N states.

perplexity = exp(-mean(log(p_i)))

Scale: 1 – ∞ (effective choices)

Lower is better. Standard language model evaluation metric (Jurafsky & Martin 2024, §3.4). All models use Laplace smoothing to avoid infinite perplexity from unseen transitions.

eval/markov/evaluate.py:90-95
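
The perplexity formula above, sketched for a list of probabilities that the model assigned to the observed test transitions. The function name is illustrative; the input is assumed to contain only positive probabilities, which Laplace smoothing guarantees.

```python
import math


def perplexity(probabilities):
    """exp of the negative mean natural-log probability of the observed transitions."""
    if not probabilities:
        raise ValueError("need at least one transition")
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log)


# A model that always assigns 0.25 to the true next state behaves like a
# uniform choice among four states.
uniform_four = perplexity([0.25, 0.25, 0.25])
certain = perplexity([1.0, 1.0])
print(uniform_four, certain)
```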

Cross-Video Predictability

Best-pass Markov top-1 accuracy compared across videos. A next-action prediction task: User Models trained on the first 60% of frames predict the held-out final 40%, measuring whether captured behavioral dynamics generalize to unseen activity.

Scale: 0 – 1 (ratio)

Action passes typically score higher than intent passes due to smaller vocabulary. Videos with repetitive activities (e.g., coding) score highest. The 60/40 temporal split avoids data leakage from shuffled frames.

web-dev/lib/comparison-utils.ts
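
A sketch of the 60/40 temporal split, assuming frames are already in chronological order. The function name and the rounding of the cut point are assumptions. The key property is that no shuffling happens, so every training frame precedes every test frame.

```python
def temporal_split(frames, train_fraction=0.6):
    """Split an ordered sequence into an earlier training part and a later test part."""
    cut = round(len(frames) * train_fraction)  # assumption: round to the nearest frame
    return frames[:cut], frames[cut:]


frames = list(range(10))  # stand-in for ten chronologically ordered frames
train, test = temporal_split(frames)
print(train, test)
```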

Analysis Metrics

Metrics from the semantic and temporal analysis package. Uses SentenceBERT embeddings for vocabulary consolidation, alignment measurement, and convergence tracking across passes.

Vocabulary Consolidation (Label Normalization)

Agglomerative clustering of state labels by SentenceBERT cosine similarity. Merges synonym clusters (e.g., "editing_code" and "writing_code") using an optimal threshold calibrated on 200 human-annotated label pairs.

cosine_distance(embed(a), embed(b)) < t* → merge (t* = 0.4268, F1 = 0.822)

Scale: 0 – 100 (percent reduction)

Higher reduction percentage means the raw vocabulary had more synonyms. The normalized label for each cluster is the most frequent member. Threshold t* = 0.4268 selected by maximizing F1 on human annotations (precision = 0.768, recall = 0.883).

eval/analysis/semantic.py

Semantic Alignment (Activity-Intent)

Per-frame cosine similarity between the SentenceBERT embeddings of the activity state and the intent state from paired passes (P1-P2, P3-P4, P5-P6).

alignment = cosine(embed(activity), embed(intent))

Scale: 0 – 1 (cosine similarity)

Values above 0.5 indicate strong semantic agreement between what the user is doing (activity) and why (intent). The distribution histogram reveals whether alignment is consistent or bimodal.

eval/analysis/semantic.py

Intent Grounding Score

How well each intent is grounded in observable activities. Computed as the maximum cosine similarity between the intent embedding and all activity embeddings.

grounding = max(cosine(embed(intent), embed(activity_i)))

Scale: 0 – 1 (cosine similarity)

Grounded intents (similarity > 0.5) have a close observable activity counterpart. Ungrounded intents may represent abstract goals without direct behavioral evidence.

eval/analysis/semantic.py
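
The alignment and grounding formulas both reduce to cosine similarity between embedding vectors. The sketch below uses toy 3-dimensional vectors in place of real 384-dimensional SentenceBERT embeddings, so the numbers are illustrative only; the function names and vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grounding_score(intent_vec, activity_vecs):
    """Maximum cosine similarity between one intent and all activity vectors."""
    return max(cosine(intent_vec, a) for a in activity_vecs)


# Toy vectors standing in for embed(activity) and embed(intent).
activity_vecs = {
    "editing_video": [1.0, 0.0, 0.0],
    "reading_email": [0.0, 1.0, 0.0],
}
intent_vec = [3.0, 4.0, 0.0]  # hypothetical intent embedding

alignment = cosine(activity_vecs["editing_video"], intent_vec)
grounding = grounding_score(intent_vec, activity_vecs.values())
print(alignment, grounding)
```

With these toy vectors the intent is closest to "reading_email", so its grounding score exceeds its alignment with "editing_video", and it would count as grounded under the 0.5 threshold.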

Cross-Pass State Lineage Drift

Tracks how state labels evolve across same-type passes (P1→P3→P5 for activities, P2→P4→P6 for intents). Drift measures how far the final label has moved from the root label.

drift = 1 - cosine(embed(root), embed(final))

Scale: 0 – 1 (semantic distance)

Low drift means the concept is stable across passes. High drift indicates significant refinement or reconceptualization.

eval/analysis/semantic.py

Vocabulary Convergence

Tracks vocabulary size and overlap between successive same-type passes using Jaccard and semantic overlap metrics.

Jaccard = |V_i ∩ V_{i+1}| / |V_i ∪ V_{i+1}|

Scale: 0 – 1 (similarity)

Increasing Jaccard between later passes indicates vocabulary stabilization. Trend is classified as converging, diverging, or stable.

eval/analysis/convergence.py

Transition Matrix Divergence

Jensen-Shannon divergence between transition probability matrices of successive same-type passes. Measures structural behavioral change.

JSD(T_i || T_{i+1}) computed row-by-row over shared states

Scale: 0 – 1 (divergence)

Decreasing JSD indicates the transition structure is stabilizing. Converging transitions together with a converging vocabulary suggest the model has found a stable behavioral representation.

eval/analysis/convergence.py
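
A sketch of a row-by-row Jensen-Shannon divergence over shared states. Several choices here are assumptions not stated above: base-2 logarithms (which give the 0 to 1 range), restricting and renormalizing each row to the shared states, and averaging the per-row values into one number. All function names are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


def shared_row(matrix, state, shared):
    """Row of `state` restricted to shared states and renormalized (an assumption)."""
    raw = [matrix[state].get(j, 0.0) for j in shared]
    total = sum(raw)
    if total == 0:
        return [1.0 / len(shared)] * len(shared)
    return [x / total for x in raw]


def transition_matrix_divergence(t_a, t_b):
    """Mean row-wise JSD over the states present in both matrices."""
    shared = sorted(set(t_a) & set(t_b))
    if not shared:
        return 0.0
    rows = [jsd_bits(shared_row(t_a, s, shared), shared_row(t_b, s, shared)) for s in shared]
    return sum(rows) / len(rows)


# Made-up transition matrices from two successive same-type passes.
t_early = {"coding": {"coding": 0.8, "browsing": 0.2}, "browsing": {"coding": 0.5, "browsing": 0.5}}
t_late = {"coding": {"coding": 0.6, "browsing": 0.4}, "browsing": {"coding": 0.5, "browsing": 0.5}}

divergence = transition_matrix_divergence(t_early, t_late)
print(divergence)
```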

Normalized Mutual Information

NMI between each pass's state sequence and the final pass. Measures how much information each pass shares with the final result.

NMI(X,Y) = 2 × MI(X,Y) / (H(X) + H(Y))

Scale: 0 – 1 (ratio)

NMI > 0.9 indicates a pass is near-redundant with the final pass. Low NMI in early passes shows meaningful refinement occurred.

eval/analysis/convergence.py
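
The NMI formula above can be sketched directly from label counts, assuming the two passes are frame-aligned sequences of equal length. The function names and the value returned when both sequences are constant are assumptions.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy in bits of a label sequence."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def mutual_information_bits(xs, ys):
    """Mutual information in bits between two frame-aligned label sequences."""
    total = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / total) * math.log2((c / total) / ((px[x] / total) * (py[y] / total)))
        for (x, y), c in pxy.items()
    )


def nmi(xs, ys):
    """NMI(X,Y) = 2 * MI(X,Y) / (H(X) + H(Y))."""
    denom = entropy_bits(xs) + entropy_bits(ys)
    if denom == 0:
        # Assumption: two constant sequences are treated as fully redundant.
        return 1.0
    return 2 * mutual_information_bits(xs, ys) / denom


final_pass = ["coding", "coding", "email", "email"]
renamed = ["writing_code", "writing_code", "reading_email", "reading_email"]
unrelated = ["coding", "email", "coding", "email"]

print(nmi(renamed, final_pass), nmi(unrelated, final_pass))
```

Unlike vocabulary Jaccard, NMI is insensitive to renaming: the "renamed" pass partitions frames exactly as the final pass does and scores 1.0 despite sharing no label strings.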