PVChat: Personalized Video Chat with One-Shot Learning

Yufei Shi^1,5,† Weilong Yan^2,† Gang Xu^4 Yumeng Li^3 Yucheng Chen^1,5 Zhenxi Li^1,5 Fei Richard Yu^4 Ming Li^4,✉ Si Yong Yeo^1,5,✉

^1 MedVisAI Lab ^2 National University of Singapore ^3 Nankai University ^4 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) ^5 Lee Kong Chian School of Medicine, Nanyang Technological University

† Equal contribution. ✉ Email: yufei005@e.ntu.edu.sg, yanweilong@u.nus.edu

Video demonstration of PVChat's personalized video chat capabilities

Abstract

PVChat is a personalized video chat model built on one-shot learning: from a single example of a subject, it learns to recognize that identity and to understand video content involving it. Existing video chat systems struggle to recognize personalized information; they require extensive training data and often fail to distinguish between similar identities.

Our method uses one-shot learning for robust personalized video understanding: after seeing a single example of a specific individual, the system can recognize that person and answer queries about them. The approach combines a specialized data collection pipeline, identity-preserving video generation, and ReMoH (ReLU Routing Mixture-of-Heads), a novel attention mechanism for enhanced characteristic learning.

PVChat demonstrates superior performance on personalized video understanding benchmarks, significantly outperforming existing models in accurately answering questions about personalized information while maintaining strong general video understanding capabilities.

Figure 1. Examples of PVChat's one-shot learning ability on personalized subjects (e.g., <Nz> and <Ab>). PVChat answers questions about personalized information correctly where other models fail.

Key Contributions

One-Shot Personalized Learning: First video chat system capable of accurate personalized understanding with just one example
Systematic Data Collection: Comprehensive pipeline for generating high-quality personalized video data with identity preservation
ReMoH Technique: Novel ReLU Routing Mixture-of-Heads attention mechanism for better learning of subject-specific characteristics
Robust Identity Recognition: Hard negative sampling strategy ensures accurate discrimination between similar identities

Method Overview

PVChat employs a systematic data collection and training pipeline specifically designed for personalized video understanding:

Figure 2. The systematic data collection pipeline. For positive data collection, the original videos are processed by DeepFaceLab for high-quality face extraction and by InternVideo2 for demographic characteristics, both of which improve identity preservation. ConsisID and LivePortrait with PhotoMaker then use this identity information to generate videos with varied backgrounds and with different motions or expressions, respectively. To make the model's perception robust, hard negative samples are either generated as negative videos from faces found by similar-face retrieval or sampled from the CelebV-HQ dataset. These negative samples help the model recognize both identity and content accurately.
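
The similar-face retrieval step for hard negatives can be pictured as a nearest-neighbor search over face embeddings. The following is a minimal sketch under stated assumptions: it uses cosine similarity over precomputed embeddings, and the function names, gallery ids, and toy 3-dimensional vectors are hypothetical. This page does not state which embedding model or similarity measure PVChat actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_hard_negatives(query_embedding, gallery, k=2, exclude_ids=()):
    """Return the ids of the k gallery faces most similar to the query identity.

    The most similar non-target faces are the hardest to tell apart,
    which is what makes them useful as negative samples.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), face_id)
        for face_id, embedding in gallery.items()
        if face_id not in exclude_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [face_id for _, face_id in scored[:k]]


# Toy gallery of hypothetical face embeddings (real ones would be high-dimensional).
gallery = {
    "target": [1.0, 0.0, 0.0],
    "face_a": [0.9, 0.1, 0.0],
    "face_b": [0.0, 1.0, 0.0],
    "face_c": [0.8, 0.3, 0.1],
}

hard_negatives = retrieve_hard_negatives(
    gallery["target"], gallery, k=2, exclude_ids={"target"}
)
print(hard_negatives)
```

In this toy gallery, `face_a` and `face_c` point in nearly the same direction as the target and are selected, while the orthogonal `face_b` is an easy negative and is skipped.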

The framework incorporates the following key components:

Identity-Preserving Generation: Uses DeepFaceLab for high-quality face processing and InternVideo2 for demographic characteristics
Video Variation Synthesis: Employs ConsisID, LivePortrait, and PhotoMaker to generate diverse backgrounds and expressions
Hard Negative Sampling: Selects challenging similar faces to improve discrimination capabilities
ReMoH Training: ReLU-routed mixture-of-heads attention for enhanced characteristic learning
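
This page does not detail ReMoH's internals. In the PVChat paper, ReMoH is an attention mechanism that routes among attention heads with a ReLU gate, and the sketch below illustrates only that gating idea in plain Python. The function name `relu_routed_mix`, the toy values, and the unnormalized weighted sum are assumptions for illustration, not the authors' implementation.

```python
def relu(x):
    """Rectified linear unit: negative router scores become exactly zero."""
    return x if x > 0.0 else 0.0


def relu_routed_mix(head_outputs, router_logits):
    """Combine per-head output vectors using ReLU-gated router scores.

    Heads whose router logit is <= 0 receive a gate of 0 and are skipped,
    so the router can switch heads off entirely rather than merely
    down-weighting them as a softmax would.
    """
    if len(head_outputs) != len(router_logits):
        raise ValueError("need exactly one router logit per head")
    gates = [relu(logit) for logit in router_logits]
    mixed = [0.0] * len(head_outputs[0])
    for gate, output in zip(gates, head_outputs):
        if gate == 0.0:
            continue  # inactive head contributes nothing
        for i, value in enumerate(output):
            mixed[i] += gate * value
    return mixed, gates


# Three hypothetical heads, each producing a 2-dimensional output for one token.
head_outputs = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
router_logits = [0.5, -1.0, 0.25]  # the second head is routed off

mixed, gates = relu_routed_mix(head_outputs, router_logits)
print(mixed, gates)
```

The design point the sketch highlights is sparsity: a ReLU gate yields exact zeros, so the number of active heads can vary per input instead of being fixed in advance as with top-k routing.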

Published in ICCV 2025