📢 Note: This project is still under active development, and the benchmark will be continuously maintained.
If you find this project helpful, please give us a ⭐️ on GitHub to follow the latest updates.
TeleEgo is a comprehensive omni benchmark designed for multi-person, multi-scene, multi-task, and multimodal long-term memory reasoning in egocentric video streams. It reflects realistic personal assistant scenarios where continuous egocentric video data is collected across hours or even days, requiring models to maintain and reason over memory, understanding, and cross-memory reasoning. Omni here means that TeleEgo covers the full spectrum of roles, scenes, tasks, modalities, and memory horizons, offering all-round evaluation for egocentric AI assistants.
TeleEgo provides:
- 🧠 Omni-scale, diverse egocentric data from 5 roles across 4 daily scenarios.
- 🎤 Multi-modal annotations: video, narration, and speech transcripts.
- ❓ Fine-grained QA benchmark: 3 cognitive dimensions, 12 subcategories.
- Participants: 5 (gender-balanced)
- Scenarios:
  - Work & Study
  - Lifestyle & Routines
  - Social Activities
  - Outings & Culture
- Recording: 3 days per participant (~14.4 hours each)
- Modalities:
  - Egocentric video streams
  - Speech & conversations
  - Narration and event descriptions
TeleEgo-QA evaluates models along three main dimensions:
- **Memory**
  - Short-term / Long-term / Ultra-long Memory
  - Entity Tracking
  - Temporal Comparison & Interval
- **Understanding**
  - Causal Understanding
  - Intent Inference
  - Multi-step Reasoning
  - Cross-modal Understanding
- **Cross-Memory Reasoning**
  - Cross-temporal Causality
  - Cross-entity Relation
  - Temporal Chain Understanding
Each QA instance includes:
- Question type: Single-choice, Multi-choice, Binary, Open-ended
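For orientation, a QA file can be inspected with a few lines of Python. This is a minimal sketch: the file path follows the dataset layout shown later in this README, and no assumptions are made about the exact field names — the script simply prints whatever keys the first record contains.

```python
import json

# Minimal sketch: peek at the QA schema for participant P1.
# The path comes from the dataset layout below; no field names are
# assumed -- we just print whatever keys the first record contains.
with open("teleego_data/QAs/merged_P1_A.json", encoding="utf-8") as f:
    qa_data = json.load(f)

sample = qa_data[0] if isinstance(qa_data, list) else qa_data
print(type(qa_data).__name__, len(qa_data))
print(sorted(sample) if isinstance(sample, dict) else sample)
```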
The repository is organized as follows:

```
TeleEgo/
├── teleego_data/                  # Dataset samples / metadata
│   ├── outputs/                   # Output results
│   ├── QAs/                       # Question-Answer pairs
│   └── video_merged/              # Merged video files
├── weights/                       # Pre-trained weights (MiniCPM-o, Qwen2.5-Omni, ...)
├── evaluate_gemini25_pro.py       # Evaluation script for Gemini 2.5 Pro
├── evaluate_gpt_4o.py             # Evaluation script for GPT-4o
├── evaluate_minicpm_o.py          # Evaluation script for MiniCPM-o
├── evaluate_qwen25_omni.py        # Evaluation script for Qwen2.5-Omni
├── evaluate_qwen25_vl.py          # Evaluation script for Qwen2.5-VL
├── evaluate_videochat_online.py   # Evaluation script for VideoChat-Online
├── metrics.py                     # Evaluation metrics
├── utils.py                       # Utility functions
├── run.sh                         # Execution script
└── README.md                      # This file
```
- Download the dataset from Hugging Face: 🔗 TeleEgo Dataset, or from Baidu Netdisk: 🔗 TeleEgo Dataset.
- Organize the dataset in the following structure:
```
./TeleEgo/teleego_data/
├── QAs/                       # Question-Answer dataset
│   ├── merged_P1_A.json       # QA data for participant P1
│   ├── merged_P2_A.json       # QA data for participant P2
│   ├── merged_P3_A.json       # QA data for participant P3
│   ├── merged_P4_A.json       # QA data for participant P4
│   └── merged_P5_A.json       # QA data for participant P5
├── outputs/                   # Evaluation outputs
│   ├── gemini25_pro/          # Results for Gemini 2.5 Pro
│   ├── gpt-4o/                # Results for GPT-4o
│   ├── minicpm_o/             # Results for MiniCPM-o
│   ├── qwen25_omni/           # Results for Qwen2.5-Omni
│   ├── qwen25_vl/             # Results for Qwen2.5-VL
│   └── videochat-online/      # Results for VideoChat-Online
└── video_merged/              # Merged long videos with timestamps
    ├── merged_P1.mp4          # P1's 3-day video merged into one file
    ├── merged_P2.mp4          # P2's 3-day video merged into one file
    ├── merged_P3.mp4          # P3's 3-day video merged into one file
    ├── merged_P4.mp4          # P4's 3-day video merged into one file
    ├── merged_P5.mp4          # P5's 3-day video merged into one file
    ├── timeline_P1.json       # P1's timestamp mapping file
    ├── timeline_P2.json       # P2's timestamp mapping file
    ├── timeline_P3.json       # P3's timestamp mapping file
    ├── timeline_P4.json       # P4's timestamp mapping file
    └── timeline_P5.json       # P5's timestamp mapping file
```
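After downloading, the layout can be sanity-checked with a short script. This is a minimal sketch based only on the directory tree above; it is not part of the repository, and `ROOT` should be adjusted if the data lives elsewhere.

```python
from pathlib import Path

# Minimal sketch: sanity-check the dataset layout shown above.
# Adjust ROOT if the data lives somewhere else.
ROOT = Path("./TeleEgo/teleego_data")

expected = [ROOT / "QAs" / f"merged_P{i}_A.json" for i in range(1, 6)]
expected += [ROOT / "video_merged" / f"merged_P{i}.mp4" for i in range(1, 6)]
expected += [ROOT / "video_merged" / f"timeline_P{i}.json" for i in range(1, 6)]

missing = [str(p) for p in expected if not p.exists()]
print("All expected files present." if not missing else f"Missing: {missing}")
```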
Set up your environment according to the official requirements of the model you want to evaluate:
- Qwen2.5-Omni: Follow the official Qwen2.5-Omni setup guide
- MiniCPM-o: Follow the official MiniCPM-o setup guide
- Qwen2.5-VL: Follow the official Qwen2.5-VL setup guide
- VideoChat-Online: Follow the official VideoChat-Online setup guide
- GPT-4o / Gemini 2.5 Pro: Configure your API credentials in `run.sh`
To evaluate a model on a specific GPU, use the following command format:
```bash
sh run.sh <eval_function> <gpu_id>
```

Examples:

```bash
# Evaluate Qwen2.5-Omni on GPU 0
sh run.sh eval_qwen25_omni 0
```

Available evaluation functions:

- `eval_qwen25_omni` - Qwen2.5-Omni model
- `eval_qwen25_vl` - Qwen2.5-VL model
- `eval_minicpm_o` - MiniCPM-o model
- `eval_videochat_online` - VideoChat-Online model
- `eval_gpt_4o` - GPT-4o (requires API key)
- `eval_gemini25_pro` - Gemini 2.5 Pro (requires API key)
After evaluation, results are saved in `./teleego_data/outputs/<model_name>/`. To compute evaluation metrics:

```bash
python metrics.py
```

This will calculate performance metrics across all evaluation dimensions (Memory, Understanding, Cross-Memory Reasoning).
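For reference, the kind of per-dimension aggregation this step performs can be approximated as below. This is a minimal sketch, not the actual `metrics.py`: the one-JSON-per-file layout and the `dimension` / `correct` record fields are assumptions about the output format, not the script's real schema.

```python
import json
from collections import defaultdict
from pathlib import Path

# Minimal sketch of per-dimension accuracy aggregation. The file layout
# and the "dimension"/"correct" record fields are ASSUMPTIONS about the
# output format, not the actual schema used by metrics.py.
def dimension_accuracy(output_dir: str) -> dict:
    totals = defaultdict(lambda: [0, 0])  # dimension -> [num_correct, num_total]
    for path in Path(output_dir).glob("*.json"):
        for record in json.loads(path.read_text(encoding="utf-8")):
            stats = totals[record["dimension"]]
            stats[0] += int(record["correct"])
            stats[1] += 1
    return {dim: c / n for dim, (c, n) in totals.items() if n}

if __name__ == "__main__":
    print(dimension_accuracy("./teleego_data/outputs/qwen25_omni"))
```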
Submit your results to our 🏆 Online Leaderboard.
If you find TeleEgo useful in your research, please cite:
```bibtex
@article{yan2025teleego,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Yan, Jiaqi and Ren, Ruilong and Liu, Jingren and Xu, Shuning and Wang, Ling and Wang, Yiheng and Zhong, Xinlin and Wang, Yun and Zhang, Long and Chen, Xiangyu and Sun, Changzhi and others},
  journal={arXiv preprint arXiv:2510.23981},
  year={2025}
}
```

This project is licensed under the MIT License. Dataset usage is restricted under a research-only license.
If you have any questions, please feel free to reach out: chxy95@gmail.com.
TeleEgo is an Omni benchmark and a step toward building personalized AI assistants with true long-term memory, reasoning, and decision-making in real-world wearable scenarios.
Made with ❤️ by the Ubiquitous AGI team at TeleAI.

