📄 Paper • 📝 Blog • 🤗 Dataset • 🚀 Demo
VideoScience-Bench evaluates whether video models can go beyond looking plausible to being scientifically correct.
- 200 undergraduate-level scientific scenarios (physics + chemistry)
- 160 scenarios for text-to-video (T2V) evaluation
- 40 scenarios for image-to-video (I2V) evaluation
- 12 topics and 103 concepts, with multi-concept scientific reasoning required within a single prompt
- Evaluation along 5 dimensions (Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, Spatio-Temporal Coherence)
VideoScience-Judge is an automatic evaluation pipeline that supports:
- Prompt-specific checklist generation
- CV-grounded evidence extraction (e.g., object detection, object tracking, motion tracking)
- Salient key-frame selection at moments where scientific phenomena occur
- Final grading with a reasoning-capable VLM
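The stages above can be sketched as a simple pipeline. Every function name and return value below is hypothetical scaffolding for illustration, not the repository's actual API:

```python
# Hypothetical sketch of the VideoScience-Judge stages; function names
# and return shapes are illustrative, not the repository's actual API.

def generate_checklist(prompt: str) -> list[str]:
    # Prompt-specific rubric items (in practice, produced by an LLM call).
    return [f"Does the video faithfully show: {prompt}?"]

def extract_cv_evidence(video_path: str) -> dict:
    # Placeholder for detection / tracking / motion / key-frame outputs.
    return {"detections": [], "tracks": [], "key_frames": []}

def grade(checklist: list[str], evidence: dict) -> dict:
    # A reasoning-capable VLM would score each dimension (1-4) here.
    return {dim: 4 for dim in ("PCS", "PCG", "CDN", "IMB", "STC")}

def judge(prompt: str, video_path: str) -> dict:
    checklist = generate_checklist(prompt)
    evidence = extract_cv_evidence(video_path)
    return grade(checklist, evidence)
```

The checklist and CV evidence stages are optional, but (as the results below show) they substantially improve alignment with expert rankings.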
- Dataset Overview
- Installation
- Usage
- Understand Evaluation Metrics
- VideoScience-Judge Results
- Citation
- License
VideoScience-Bench is curated to stress scientific reasoning in video generation: each prompt typically requires at least 2 interacting scientific concepts to produce the correct phenomenon.
Physics (7):
- Classical Mechanics
- Thermodynamics
- Electromagnetism
- Optics
- Fluid Mechanics
- Material Mechanics
- Modern Physics
Chemistry (5):
- Redox Reactions
- Acid-Base
- Reaction Kinetics
- Solution and Phase Chemistry
- Materials and Solid-State Chemistry
The prompt suite is lightweight and easy to integrate into any video generation harness.
Common fields (as in the HF release):
- `prompt`: the experimental setup + procedure
- `expected phenomenon`: a concise description of what should happen if the laws are obeyed
- `keywords`: fine-grained scientific concepts involved
- `field`: Physics / Chemistry
- `vid`: instance id
```python
from datasets import load_dataset

ds = load_dataset("lmgame/VideoScienceBench")
data = ds["test"]

# sanity check an example
print(data[0]["prompt"])
print(data[0]["expected phenomenon"])
print(data[0]["keywords"])
```

```shell
# Clone the repository
git clone https://github.com/hao-ai-lab/VideoScience.git
cd VideoScience

# Install dependencies
pip install -r requirements.txt
```

FastVideo is a video generation provider that supports two modes of operation:
If you have a deployed FastVideo API server:

```shell
export FASTVIDEO_API_BASE="http://your-fastvideo-server:8000"
export FASTVIDEO_API_KEY="your-api-key"  # Optional, if authentication is required
```

For local GPU inference:

```shell
# Install FastVideo package
pip install fastvideo

# Set the model path (will be downloaded on first use)
export FASTVIDEO_MODEL_PATH="FastVideo/FastWan2.1-T2V-1.3B-Diffusers"
```

Requirements for local inference:
- CUDA-capable GPU with sufficient VRAM
- PyTorch with CUDA support
- Download the data file at `data/database/data_filtered.jsonl`.
- Launch the batched generation script:

```shell
bash scripts/batched_generation_using_csv.sh
```

To generate a single video, run:

```shell
python3 single_generation_frontend.py \
    --provider {provider_name} \
    --model {model_name} \
    --prompt "{your_prompt}"
```

To evaluate all models with VideoScience-Judge, run:

```shell
bash judge/batched_evaluate_all_models.sh
```

We evaluate each generated video on five dimensions (Likert scale, 1–4):
- Prompt Consistency (PCS): is the setup/procedure faithful to the prompt?
- Phenomenon Congruency (PCG): does the correct scientific outcome occur?
- Correct Dynamism (CDN): are motions / dynamics physically consistent?
- Immutability (IMB): are static attributes preserved (no flicker/identity drift)?
- Spatio-Temporal Coherence (STC): is the video coherent over time and space?
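As a concrete illustration, per-dimension Likert scores can be aggregated across videos by simple averaging. The aggregation scheme and the scores below are assumptions for demonstration, not the paper's exact protocol:

```python
# Hedged sketch: averaging per-dimension Likert scores (1-4) across videos.
# The aggregation scheme and the example scores are assumptions.

def mean_scores(per_video: list[dict[str, int]]) -> dict[str, float]:
    dims = ("PCS", "PCG", "CDN", "IMB", "STC")
    return {d: sum(v[d] for v in per_video) / len(per_video) for d in dims}

scores = [
    {"PCS": 4, "PCG": 3, "CDN": 4, "IMB": 4, "STC": 3},
    {"PCS": 2, "PCG": 1, "CDN": 3, "IMB": 4, "STC": 4},
]
print(mean_scores(scores))
# {'PCS': 3.0, 'PCG': 2.0, 'CDN': 3.5, 'IMB': 4.0, 'STC': 3.5}
```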
Manual scientific evaluation is expensive. VideoScience-Judge aims to align with human experts while remaining scalable.
We report ranking correlations between automatic metrics and domain-expert annotations across 7 evaluated video models.
| Metric | Kendall τ | Spearman ρ |
|---|---|---|
| VSci-Judge | 0.81 | 0.89 |
| VSci-Judge (Checklist) | 0.90 | 0.96 |
| VSci-Judge (Checklist + CV evidence) | 0.90 | 0.96 |
| PhyGenEval | 0.52 | 0.61 |
| VideoScore2 | 0.24 | 0.29 |
Note: adding prompt-specific checklists (and optional CV evidence) makes the judge align near-perfectly with expert-ranked model quality on VideoScience-Bench.
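For reference, Kendall's τ compares two rankings by counting concordant versus discordant pairs. A minimal pure-Python sketch, using made-up rankings rather than the paper's data:

```python
# Illustrative Kendall's tau between two rankings of the same models.
# The expert and judge rankings below are made up, not the paper's data.

def kendall_tau(a: list[float], b: list[float]) -> float:
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in both rankings
            elif s < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

expert = [1, 2, 3, 4, 5, 6, 7]        # expert ranking of 7 models
auto_judge = [1, 2, 4, 3, 5, 6, 7]    # hypothetical judge ranking
print(round(kendall_tau(expert, auto_judge), 3))  # → 0.905 (1 discordant pair of 21)
```

A τ of 0.90 therefore means the judge disagrees with experts on only about 5% of model pairs.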
- [optional] Checklist generation: create a prompt-specific evaluative rubric
- [optional] CV-based evidence extraction (recommended): tracking, motion, attribute changes, key frames
- Final grading: a VLM-as-a-judge reasons over the checklist + all evidence
If you use VideoScience in your research, please cite:
```bibtex
@article{hu2025videoscience,
  title={Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench},
  author={Hu, Lanxiang and Shankarampeta, Abhilash and Huang, Yixin and Dai, Zilin and Yu, Haoyang and Zhao, Yujie and Kang, Haoqiang and Zhao, Daniel and Rosing, Tajana and Zhang, Hao},
  journal={arXiv preprint arXiv:2512.02942},
  year={2025}
}
```

This project is released under the MIT License. See LICENSE.
