
BayesianVSLNet - Temporal Video Segmentation with Natural Language using Text-Video Cross Attention and Bayesian Order-priors

CVIU 2025

Description

🔔 News:

  • 🆕 12/2025: Paper with an improved BayesianVSLNet++ version accepted at CVIU.
  • 🔥 7/2024: Code released!
  • 😎 6/2024: Poster presentation at the EgoVis Workshop during CVPR 2024.
  • 🥳 6/2024: Challenge report is available on arXiv!
  • 🏆 6/01/2024: BayesianVSLNet wins the Ego4D Step Grounding Challenge at CVPR24.

BayesianVSLNet

We introduce BayesianVSLNet: Bayesian temporal-order priors for test-time refinement. Our model significantly improves upon traditional grounding models by incorporating a novel Bayesian temporal-order prior during inference, which accounts for cyclic and repetitive actions within the video and improves the accuracy of moment predictions.
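In other words, the network's per-frame relevance scores for a query are re-weighted at inference time by a prior over where the step is expected to fall in the video. The sketch below illustrates this kind of re-weighting with a Gaussian order prior; it is a minimal, hypothetical example (the function name `apply_order_prior` and the `sigma` value are ours, not the repository's implementation).

```python
import numpy as np

def apply_order_prior(frame_scores: np.ndarray, step_idx: int, num_steps: int,
                      sigma: float = 0.1) -> np.ndarray:
    """Re-weight per-frame scores with a Gaussian temporal-order prior.

    Hypothetical sketch: assumes step `step_idx` (0-based) out of `num_steps`
    tends to occur around its relative position in the video.
    """
    num_frames = len(frame_scores)
    t = np.linspace(0.0, 1.0, num_frames)            # normalized frame positions
    mu = (step_idx + 0.5) / num_steps                 # expected relative position of the step
    prior = np.exp(-0.5 * ((t - mu) / sigma) ** 2)    # unnormalized Gaussian prior
    posterior = frame_scores * prior                  # Bayes: posterior ∝ likelihood × prior
    return posterior / (posterior.sum() + 1e-8)       # renormalize for comparability

# Example: refine scores for the 3rd of 10 steps and pick the most likely frame
scores = np.random.rand(512)
refined = apply_order_prior(scores, step_idx=2, num_steps=10)
print(int(refined.argmax()))
```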


Quick start

Install dependencies

git clone https://github.com/cplou99/BayesianVSLNet
cd BayesianVSLNet
pip install -r requirements.txt

Video Features

We use Omnivore-L, EgoVideo, and EgoVLPv2 video features. They should be pre-extracted and placed at ./ego4d-goalstep/step-grounding/data/features/.
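As a quick sanity check before training, you can verify that the pre-extracted features are where the scripts expect them. This is a small convenience snippet, not part of the repository, and the `.pt` extension is an assumption about how the features are stored.

```python
from pathlib import Path

# Hypothetical sanity check: list pre-extracted feature files (extension is an assumption).
feat_dir = Path("./ego4d-goalstep/step-grounding/data/features")
files = sorted(feat_dir.glob("*.pt"))
print(f"Found {len(files)} feature files in {feat_dir}")
```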

Model

The EgoVLPv2 weights, which are used to extract text features, must be placed at ./NaQ/VSLNet_Bayesian/model/EgoVLP_weights.
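For example, a quick check that the weights directory is populated before launching training (a hedged sketch, not repository code):

```python
from pathlib import Path

# Hypothetical check: confirm the EgoVLPv2 weights directory exists and is not empty.
weights_dir = Path("./NaQ/VSLNet_Bayesian/model/EgoVLP_weights")
assert weights_dir.is_dir() and any(weights_dir.iterdir()), \
    f"Place the EgoVLPv2 weights under {weights_dir}"
```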

Train

cd ego4d-goalstep/step_grounding/
bash train_Bayesian.sh experiments/

Inference

cd ego4d-goalstep/step_grounding/
bash infer_Bayesian.sh experiments/

Results

Ego4D Step Grounding Challenge

The challenge is built on the Ego4D-GoalStep dataset and code.

Goal: Given an untrimmed egocentric video, identify the temporal action segment corresponding to a natural language description of the step. Specifically, predict the (start_time, end_time) for a given keystep description.
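As an illustration of the task, a single query and prediction might look like the following. This is a hypothetical example of the problem's input/output shape, not the official Ego4D annotation or submission format.

```python
# Hypothetical illustration of the step-grounding task (not the official format).
query = {
    "video_uid": "example_video",                       # untrimmed egocentric video
    "step_description": "pour the sauce into the pan",  # natural language keystep
}
prediction = {"start_time": 412.3, "end_time": 438.9}   # seconds within the video
```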


The leaderboard 🚀 shows the test-set results of the best approaches. Our method currently holds first place 🚀🔥.

Case study: Robotics

We present qualitative results in a real-world assistive robotics scenario to demonstrate the potential of our approach in enhancing human-robot interaction in practical applications.


πŸ“ Citation

@article{PLOU2025104622,
    title = {Temporal video segmentation with natural language using text-video cross attention and Bayesian order-priors},
    journal = {Computer Vision and Image Understanding},
    pages = {104622},
    year = {2025},
    issn = {1077-3142},
    doi = {10.1016/j.cviu.2025.104622},
    url = {https://www.sciencedirect.com/science/article/pii/S1077314225003455},
    author = {Carlos Plou and Lorenzo Mur-Labadia and Jose J. Guerrero and Ruben Martinez-Cantin and Ana C. Murillo}
}

@misc{plou2024carlorego4dstep,
    title={CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement}, 
    author={Carlos Plou and Lorenzo Mur-Labadia and Ruben Martinez-Cantin and Ana C. Murillo},
    year={2024},
    eprint={2406.09575},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2406.09575}, 
}

Acknowledgements

This work was supported by a DGA scholarship and by DGA project T45_23R, and grants AIA2025-163563-C31, PID2024-159284NB-I00, PID2021-125209OB-I00, PID2021-125514NB-I00 and PID2024-158322OB-I00 funded by MCIN/AEI/10.13039/501100011033 and ERDF.
