fix: heterogenous loading and critic conflict handling #60
LovelyBuggies merged 19 commits into main from
Conversation
## Results at v1.3.6

Since CoMLRL v1.3.6 is primarily based on the feature development in this PR, I present some results obtained at this commit and briefly explain them. Quick insights:
### Writing Collaboration

Since the reward design for arXiv expansion and tldr summarization is quite similar (with different hyperparameters), I primarily use tldr for testing.

```shell
cd LLM_Collab_Writing
python train_magrpo.py --config configs/magrpo_tldr_config.yaml --override agents='[\"Qwen/Qwen2.5-1.5B-Instruct\",\"Qwen/Qwen3-1.7B\"]' agent_model.name=None magrpo.num_agents=2 magrpo.num_turns=1 wandb.project=hetero wandb.name='magrpo_tldr_1.5_1.7'
```

```shell
cd LLM_Collab_Writing
python train_magrpo.py --config configs/magrpo_tldr_config.yaml --override agents='[\"Qwen/Qwen3-1.7B\",\"Qwen/Qwen3-1.7B\"]' agent_model.name=None magrpo.num_agents=2 magrpo.num_turns=1 wandb.project=hetero wandb.name='magrpo_tldr_1.7_1.7'
```
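The `agents` override above passes a JSON-style list of checkpoints while `agent_model.name` is nulled out, so each agent gets its own (possibly different) backbone. A minimal sketch of how such heterogeneous resolution could work — the helper `resolve_agent_models` and its arguments are hypothetical, not CoMLRL's actual API:

```python
import json

def resolve_agent_models(agents_override, agent_model_name, num_agents):
    """Resolve per-agent model names from either a JSON list override
    or a single shared model name (hypothetical helper, for illustration)."""
    if agents_override is not None:
        # The CLI passes the list as a JSON string, e.g.
        # '["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]'.
        names = json.loads(agents_override)
        if len(names) != num_agents:
            raise ValueError(f"expected {num_agents} agents, got {len(names)}")
        return names  # heterogeneous: one checkpoint per agent
    if agent_model_name is None:
        raise ValueError("either agents or agent_model.name must be set")
    return [agent_model_name] * num_agents  # homogeneous fallback

names = resolve_agent_models(
    '["Qwen/Qwen2.5-1.5B-Instruct","Qwen/Qwen3-1.7B"]', None, 2
)
```

Under this reading, the homogeneous 1.7B+1.7B run above is just the same mechanism with the same checkpoint listed twice.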
MAGRPO takes about 17 hours to train on a single H100 with about 45 GB of VRAM usage.

```shell
cd LLM_Collab_Writing
python train_maac.py --config configs/maac_tldr_config.yaml --override agents='[\"Qwen/Qwen2.5-1.5B-Instruct\",\"Qwen/Qwen3-1.7B\"]' agent_model.name=None critic_model.name="Qwen/Qwen3-1.7B" wandb.project=hetero wandb.name='maac_tldr_1.5_1.7'
```

```shell
cd LLM_Collab_Writing
python train_maac.py --config configs/maac_tldr_config.yaml --override agents=None agent_model.name="Qwen/Qwen3-1.7B" critic_model.name="Qwen/Qwen3-1.7B" wandb.project=hetero wandb.name='maac_tldr_1.7_1.7'
```
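In the MAAC runs, `critic_model.name` selects a single centralized critic trained alongside the (possibly heterogeneous) actors. As a rough sketch of the update shape this implies — the function and the scalar value baseline are illustrative, not the repo's implementation:

```python
def maac_advantages(joint_rewards, value_estimates):
    """Advantage = reward of each sampled joint output minus the
    centralized critic's value estimate for the shared prompt.
    All agents share the same baseline, so credit is assigned jointly
    (hypothetical sketch, not CoMLRL's actual update)."""
    return [r - v for r, v in zip(joint_rewards, value_estimates)]

# One prompt, three sampled joint completions:
rewards = [1.0, 0.2, 0.6]   # task reward for each joint sample
values = [0.5, 0.5, 0.5]    # critic's estimate for this prompt
advs = maac_advantages(rewards, values)
# Samples above the baseline get positive advantage.
```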
MAAC takes about 34 hours to train on a single H100 with about 71 GB of VRAM usage.

```shell
cd LLM_Collab_Writing
python train_iac.py --config configs/iac_tldr_config.yaml --override agents='[\"Qwen/Qwen2.5-1.5B-Instruct\",\"Qwen/Qwen3-1.7B\"]' agent_model.name=None critics='[\"Qwen/Qwen2.5-1.5B-Instruct\",\"Qwen/Qwen3-1.7B\"]' critic_model.name=None iac.use_separate_critic=true wandb.project=hetero wandb.name='iac_tldr_1.5_1.7'
```

```shell
cd LLM_Collab_Writing
python train_iac.py --config configs/iac_tldr_config.yaml --override agents='[\"Qwen/Qwen2.5-1.5B-Instruct\",\"Qwen/Qwen3-1.7B\"]' agent_model.name=None critics=None critic_model.name=None iac.use_separate_critic=false wandb.project=hetero wandb.name='iac_tldr_1.5_1.7_shared'
```
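The `iac.use_separate_critic` flag toggles between one critic per agent (taken from the `critics` list) and a single critic shared by all agents, which is what drives the VRAM gap reported below. A hedged sketch of that resolution logic — the helper and the shared-critic default are assumptions, not CoMLRL's real config schema:

```python
import json

def resolve_critics(use_separate_critic, critics_override, agent_names):
    """Pick per-agent critic checkpoints (hypothetical helper).

    Separate critics: one checkpoint per agent (highest VRAM).
    Shared critic: every agent reuses one model, so only a single
    copy needs to be kept in memory.
    """
    if use_separate_critic:
        names = json.loads(critics_override)
        if len(names) != len(agent_names):
            raise ValueError("need one critic per agent")
        return names
    # Shared: here we default the single critic to the first agent's
    # backbone (an assumption for illustration only).
    return [agent_names[0]] * len(agent_names)
```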
IAC takes about 41-48 hours to train on a single H100 with about 80 GB (separate critics) or 48 GB (shared critic) of VRAM usage.

### Code Generation

```shell
cd LLM_Collab_Code_Generation
python train_magrpo.py --config configs/magrpo_che_config.yaml --override agents='[\"Qwen/Qwen2.5-Coder-3B\",\"Qwen/Qwen3-4B-Instruct-2507\"]' agent_model.name=None wandb.project=hetero wandb.name='magrpo_che_3b_4b'
```
MAGRPO takes about 10 hours to train on a single H100 with about 89 GB of VRAM usage.

```shell
cd LLM_Collab_Code_Generation
python train_maac.py --config configs/maac_che_config.yaml --override agents='[\"Qwen/Qwen2.5-Coder-3B\",\"Qwen/Qwen3-4B-Instruct-2507\"]' agent_model.name=None critic_model.name="Qwen/Qwen2.5-Coder-3B" wandb.project=hetero wandb.name='maac_che_3b_4b'
```
MAAC takes about 8 hours to train on a single H200 with about 118 GB of VRAM usage.

```shell
cd LLM_Collab_Code_Generation
python train_iac.py --config configs/iac_che_config.yaml --override agents='[\"Qwen/Qwen2.5-Coder-3B\",\"Qwen/Qwen3-4B-Instruct-2507\"]' agent_model.name=None critics='[\"Qwen/Qwen2.5-Coder-3B\",\"Qwen/Qwen3-4B-Instruct-2507\"]' critic_model.name=None iac.use_separate_critic=true wandb.project=hetero wandb.name='iac_che_3b_4b'
```
IAC takes about 8 hours to train on a single H200 with about 140 GB (separate critics) or 74 GB (shared critic) of VRAM usage.

### Minecraft

For Minecraft, I select house building as a representative task for testing the new interface.

```shell
cd LLM_Collab_Minecraft
python house_build/train/train_magrpo.py --config house_build/configs/house_build_magrpo_config.yaml --override agents='[\"Qwen/Qwen2.5-3B-Instruct\",\"Qwen/Qwen3-4B-Instruct-2507\"]' agent_model.name=None wandb.project=hetero-mc wandb.name='magrpo_house_3B_4B'
```
MAGRPO takes about 8 hours to train on a single H200 with about 108 GB of VRAM usage.

```shell
cd LLM_Collab_Minecraft
python house_build/train/train_maac.py --config house_build/configs/house_build_maac_config.yaml --override agents='[\"Qwen/Qwen2.5-3B-Instruct\",\"Qwen/Qwen3-4B-Instruct-2507\"]' agent_model.name=None critic_model.name="Qwen/Qwen3-4B-Instruct-2507" wandb.project=hetero wandb.name='maac_house_3B_4B'
```
MAAC takes about 10 hours to train on a single H200 with about 138 GB of VRAM usage.

```shell
cd LLM_Collab_Minecraft
python house_build/train/train_iac.py --config house_build/configs/house_build_iac_config.yaml --override agents='[\"Qwen/Qwen2.5-3B-Instruct\",\"Qwen/Qwen3-4B-Instruct-2507\"]' agent_model.name=None critic_model.name=None iac.use_separate_critic=false wandb.project=hetero wandb.name='iac_house_3B_4B_shared'
```
IAC-shared (I use the shared critic for both configurations here to save VRAM) takes about 9 hours to train on a single H200 with about 102 GB (3B+4B) or 114 GB (4B×2) of VRAM usage.











