
 VideoScience-Bench: Benchmarking Scientific Reasoning in Video Generations

📄 Paper · 📝 Blog · 🤗 Dataset · 🚀 Demo


What this repo provides

VideoScience-Bench evaluates whether video models can go beyond looking plausible to being scientifically correct.

  • 200 undergraduate-level scientific scenarios (physics + chemistry)
    • 160 for text-to-video (T2V) evaluation
    • 40 for image-to-video (I2V) evaluation
  • 12 topics and 103 concepts, with multi-concept scientific reasoning required in a single prompt
  • Evaluation along 5 dimensions (Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, Spatio-Temporal Coherence)

VideoScience-Judge is an automatic evaluation pipeline that supports:

  1. Prompt-specific checklist generation
  2. CV-grounded evidence extraction (e.g., object detection, object tracking, motion tracking)
  3. Selection of salient key frames where scientific phenomena occur
  4. Final grading with a reasoning-capable VLM
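The four stages above can be chained as in the following minimal sketch. All function names, signatures, and return shapes here are illustrative placeholders, not the repo's actual API:

```python
# Illustrative sketch of the VideoScience-Judge stages.
# Every function below is a hypothetical stub: the real pipeline uses
# an LLM for checklists, CV models for evidence, and a VLM for grading.

def generate_checklist(prompt: str) -> list[str]:
    # Stage 1: derive a prompt-specific rubric (normally via an LLM).
    return [f"Does the video show: {prompt}?"]

def extract_cv_evidence(video_path: str) -> dict:
    # Stage 2: CV-grounded evidence (detection / tracking summaries).
    return {"tracked_objects": [], "motion_summary": "placeholder"}

def select_key_frames(video_path: str, max_frames: int = 8) -> list[int]:
    # Stage 3: pick salient frame indices where phenomena occur.
    return list(range(max_frames))

def grade_with_vlm(checklist: list[str], evidence: dict,
                   frames: list[int]) -> dict:
    # Stage 4: a reasoning-capable VLM scores each checklist item (1-4).
    return {item: 4 for item in checklist}

def judge(prompt: str, video_path: str) -> dict:
    # Chain the stages: checklist -> evidence -> key frames -> grading.
    checklist = generate_checklist(prompt)
    evidence = extract_cv_evidence(video_path)
    frames = select_key_frames(video_path)
    return grade_with_vlm(checklist, evidence, frames)
```

The sketch only shows how the stages compose; each stub would be replaced by the corresponding model call in the real pipeline.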

Dataset Overview

VideoScience-Bench is curated to stress scientific reasoning in video generation: each prompt typically requires at least 2 interacting scientific concepts to produce the correct phenomenon.

Topics (12)

Physics (7):

  • Classical Mechanics
  • Thermodynamics
  • Electromagnetism
  • Optics
  • Fluid Mechanics
  • Material Mechanics
  • Modern Physics

Chemistry (5):

  • Redox Reactions
  • Acid-Base
  • Reaction Kinetics
  • Solution and Phase Chemistry
  • Materials and Solid-State Chemistry

What each example contains

The prompt suite is lightweight and easy to integrate into any video generation harness.

Common fields (as in the HF release):

  • prompt: the experimental setup + procedure
  • expected phenomenon: a concise description of what should happen if the laws are obeyed
  • keywords: fine-grained scientific concepts involved
  • field: Physics / Chemistry
  • vid: instance id
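For illustration, a single record might look like the following. The field names follow the list above; the values are made up:

```python
# A made-up example record; field names match the HF release schema,
# but all values here are illustrative only.
example = {
    "vid": "phys_0001",
    "field": "Physics",
    "prompt": "An ice cube is placed on a metal plate at room temperature.",
    "expected phenomenon": "The ice cube gradually melts into a puddle.",
    "keywords": ["heat conduction", "phase transition"],
}

# Each prompt pairs an experimental setup with the outcome that a
# scientifically correct video should depict.
print(example["field"], "-", ", ".join(example["keywords"]))
```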

Loading from Hugging Face

from datasets import load_dataset

ds = load_dataset("lmgame/VideoScienceBench")
data = ds["test"]

# Sanity-check an example
print(data[0]["prompt"])
print(data[0]["expected phenomenon"])
print(data[0]["keywords"])

Installation

Basic Setup

# Clone the repository
git clone https://github.com/hao-ai-lab/VideoScience.git
cd VideoScience

# Install dependencies
pip install -r requirements.txt

FastVideo Setup

FastVideo is a video generation provider that supports two modes of operation:

Option 1: Remote API Server (Recommended for Production)

If you have a deployed FastVideo API server:

export FASTVIDEO_API_BASE="http://your-fastvideo-server:8000"
export FASTVIDEO_API_KEY="your-api-key"  # Optional, if authentication is required

Option 2: Local Inference Mode

For local GPU inference:

# Install FastVideo package
pip install fastvideo

# Set the model path (will be downloaded on first use)
export FASTVIDEO_MODEL_PATH="FastVideo/FastWan2.1-T2V-1.3B-Diffusers"

Requirements for local inference:

  • CUDA-capable GPU with sufficient VRAM
  • PyTorch with CUDA support

Usage

1) Batched video generation

  1. Download the data file and place it at data/database/data_filtered.jsonl.
  2. Launch the script:
bash scripts/batched_generation_using_csv.sh

2) Single video generation

python3 single_generation_frontend.py \
  --provider {provider_name} \
  --model {model_name} \
  --prompt "{your_prompt}"

3) VLM-as-a-judge evaluation

bash judge/batched_evaluate_all_models.sh

Evaluation Metrics

We evaluate each generated video on five dimensions (Likert 1–4):

  • Prompt Consistency (PCS): is the setup/procedure faithful to the prompt?
  • Phenomenon Congruency (PCG): does the correct scientific outcome occur?
  • Correct Dynamism (CDN): are motions / dynamics physically consistent?
  • Immutability (IMB): are static attributes preserved (no flicker/identity drift)?
  • Spatio-Temporal Coherence (STC): is the video coherent over time and space?
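If the five per-dimension Likert scores need to be collapsed into a single number per video, an unweighted mean is one option. The aggregation scheme below is an assumption for illustration, not the paper's official formula:

```python
# Combine the five Likert (1-4) dimension scores into one value.
# An unweighted mean is assumed here; the benchmark may weight
# dimensions differently.
DIMENSIONS = ("PCS", "PCG", "CDN", "IMB", "STC")

def overall_score(scores: dict) -> float:
    # Validate that every dimension carries a Likert score in 1-4.
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 4:
            raise ValueError(f"{dim} must be a Likert score in 1-4")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

print(overall_score({"PCS": 4, "PCG": 3, "CDN": 3, "IMB": 4, "STC": 4}))  # 3.6
```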

VideoScience-Judge vs. Human Annotations

Manual scientific evaluation is expensive. VideoScience-Judge aims to stay aligned with human experts while remaining scalable.

Ranking correlation with expert ratings

We report ranking correlations between automatic metrics and domain-expert annotations across 7 evaluated video models.

Metric                               | Kendall τ | Spearman ρ
-------------------------------------|-----------|-----------
VSci-Judge                           | 0.81      | 0.89
VSci-Judge (Checklist)               | 0.90      | 0.96
VSci-Judge (Checklist + CV evidence) | 0.90      | 0.96
PhyGenEval                           | 0.52      | 0.61
VideoScore2                          | 0.24      | 0.29

Note: adding prompt-specific checklists (and optional CV evidence) makes the judge align near-perfectly with expert-ranked model quality on VideoScience-Bench.
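Kendall τ compares two rankings by counting concordant minus discordant pairs. A pure-Python version over made-up judge and expert scores (the numbers below are illustrative, not the benchmark's data):

```python
from itertools import combinations

def kendall_tau(x: list, y: list) -> float:
    # Count concordant (same ordering) vs. discordant pairs across
    # all index pairs, then normalize by the total number of pairs.
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up example: judge scores vs. expert scores for 5 models.
judge_scores = [3.6, 2.1, 3.9, 1.5, 2.8]
expert_scores = [3.4, 2.5, 3.8, 1.2, 2.6]
print(kendall_tau(judge_scores, expert_scores))  # 1.0 (identical ranking)
```

In practice `scipy.stats.kendalltau` and `scipy.stats.spearmanr` handle ties and p-values; the hand-rolled version above just makes the pair-counting idea explicit.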

VideoScience-Judge Features

  1. [optional] Checklist generation: create an evaluative rubric tied to the prompt
  2. [optional] CV-based evidence extraction (recommended): tracking, motion, attribute changes, key frames
  3. Final grading: a VLM-as-a-judge reasons over the checklist and all extracted evidence

Citation

If you use VideoScience in your research, please cite:

@article{hu2025videoscience,
  title={Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench},
  author={Hu, Lanxiang and Shankarampeta, Abhilash and Huang, Yixin and Dai, Zilin and Yu, Haoyang and Zhao, Yujie and Kang, Haoqiang and Zhao, Daniel and Rosing, Tajana and Zhang, Hao},
  journal={arXiv preprint arXiv:2512.02942},
  year={2025}
}

License

This project is released under the MIT License. See LICENSE.
