How to Build a World Model with Language Learning?
Blog: https://www.cnblogs.com/zylyehuo/
Learning to Model the World with Language
Project: dynalang (use together with the project README; refer back to these notes when you hit problems)
Results
Deployment Steps
Step 1: Set up CUDA, cuDNN, and PyTorch
# Check the CUDA toolkit, installed CUDA versions, and the driver
nvcc -V
ls -l /usr/local/ | grep cuda
nvidia-smi
# Remove stray pip-installed NVIDIA packages from the user site to avoid version conflicts
ls ~/.local/lib/python3.8/site-packages | grep nvidia
rm -rf ~/.local/lib/python3.8/site-packages/nvidia*
ls ~/.local/lib/python3.8/site-packages | grep nvidia
# Ignore ~/.local site-packages so only the conda env's packages are used
export PYTHONNOUSERSITE=1
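A quick way to confirm that `PYTHONNOUSERSITE=1` actually takes effect: with the variable set, child interpreters stop considering the `~/.local` user site-packages. A minimal stdlib-only check (the helper name is made up for this sketch):

```python
import os
import subprocess
import sys

def user_site_enabled(extra_env=None):
    """Spawn a child interpreter and report whether the user site dir is active."""
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    out = subprocess.run(
        [sys.executable, "-c", "import site; print(site.ENABLE_USER_SITE)"],
        env=env, capture_output=True, text=True,
    )
    return out.stdout.strip() == "True"

# With PYTHONNOUSERSITE set, ~/.local/lib/python3.x/site-packages is skipped
print(user_site_enabled({"PYTHONNOUSERSITE": "1"}))  # False
```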
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
python -c "import torch; print('PyTorch CUDA:', torch.version.cuda); print('GPU Available:', torch.cuda.is_available())"
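The `+cu117` wheels above must match the toolkit version that `nvcc -V` reports. A small illustrative helper for pulling the release number out of that output (the sample string mimics a CUDA 11.7 install; the helper itself is not part of the project):

```python
import re

def parse_cuda_release(nvcc_output: str) -> str:
    """Pull the toolkit release (e.g. '11.7') out of `nvcc -V` output."""
    m = re.search(r"release (\d+\.\d+)", nvcc_output)
    if not m:
        raise ValueError("could not find a CUDA release in the output")
    return m.group(1)

# Last line of `nvcc -V` for a CUDA 11.7 toolkit (illustrative)
sample = "Cuda compilation tools, release 11.7, V11.7.99"
print(parse_cuda_release(sample))  # 11.7
```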
# Install cuDNN 8.9 from NVIDIA's local repo package (register the repo first, then install)
sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2004-8.9.7.29/cudnn-local-CD2C2DD4-keyring.gpg /usr/share/keyrings/
sudo apt update
sudo apt install libcudnn8 libcudnn8-dev
sudo ldconfig
ldconfig -p | grep cudnn
# Or install cuDNN into the conda env via conda-forge
conda install -c conda-forge "cudnn>=8.9.1"
Step 2: Install Dependencies and Source Code
mkdir VLN_learning
cd VLN_learning
git clone git@github.com:zylyehuo/dynalang.git
cd dynalang
conda create -n dynalang-vln python=3.8
conda activate dynalang-vln
pip install "jax[cuda11_cudnn82]==0.4.8" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
# Verify JAX and the GPU
python3 -c "import jax; print('JAX Devices:', jax.devices())"
# Verify PyTorch and the GPU
python3 -c "import torch; print('Torch CUDA available:', torch.cuda.is_available())"
# Verify that the IPython dependency imports without errors
python3 -c "import IPython; print('IPython loaded successfully')"
pip install homegrid
cd VLN_learning/dynalang
conda activate dynalang-vln
pip install optax==0.1.7
pip install ruamel.yaml==0.17.40
pip install gym==0.26.2
pip install absl-py rich chex optax
pip install tensorflow-cpu==2.13.1 tensorboard==2.13.0 tensorflow-probability==0.21.0
pip install "typing-extensions<4.6.0"
pip install pandas python-dateutil click platformdirs importlib-resources
sudo apt-get install \
libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev \
libsdl1.2-dev libsmpeg-dev subversion libportmidi-dev ffmpeg \
libswscale-dev libavformat-dev libavcodec-dev libfreetype6-dev
git clone https://github.com/ahjwang/messenger-emma
pip install pbr -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -e messenger-emma
cd VLN_learning/dynalang
conda activate dynalang-vln
conda install absl-py==1.4.0
conda env update -f env_vln.yml
conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
git clone https://github.com/jlin816/VLN-CE VLN_CE
git clone https://github.com/jlin816/habitat-lab habitat_lab
cd VLN_learning/dynalang
conda activate dynalang-vln
pip install colored
cd VLN_learning/dynalang
conda activate dynalang-vln
pip install datasets
Step 3: Download the Datasets
Hugging Face (the text dataset roneneldan/TinyStories used in later steps is pulled from the Hugging Face Hub)
Download the Matterport3D data into the VLN-CE directory (the download script requires Python 2.7)
cd /home/yehuo/VLN_learning/dynalang/VLN_CE
conda create -n py27 python=2.7
conda activate py27
python scripts/download_mp.py --task habitat -o data/scene_datasets/mp3d/
cd data/scene_datasets
unzip mp3d/v1/tasks/mp3d_habitat.zip
conda deactivate
Step 4: Modify the Source Code
Files that need edits:
/home/yehuo/VLN_learning/dynalang/dynalang/train.py
/home/yehuo/VLN_learning/dynalang/dynalang/embodied/envs/vln.py
/home/yehuo/VLN_learning/dynalang/habitat_lab/habitat/tasks/vln/vln.py
cd VLN_learning/dynalang
# gym requires Discrete(n) with n >= 1, so widen the empty action space
sed -i 's/spaces.Discrete(0)/spaces.Discrete(1)/g' /home/yehuo/VLN_learning/dynalang/habitat_lab/habitat/tasks/vln/vln.py
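For reference, this is the substitution the `sed` command performs: gym's `spaces.Discrete(n)` rejects `n = 0`, so the empty action-space declaration is widened to a single action. The Python sketch below only demonstrates the textual replacement on a made-up source line:

```python
import re

# Mirror of the sed command above; the source line is made up for the demo.
line = "ACTION_SPACE = spaces.Discrete(0)"
patched = re.sub(r"spaces\.Discrete\(0\)", "spaces.Discrete(1)", line)
print(patched)  # ACTION_SPACE = spaces.Discrete(1)
```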
Run Commands
① HomeGrid
CPU run command
cd VLN_learning/dynalang
conda activate dynalang-vln
WANDB_MODE=disabled JAX_PLATFORMS=cpu CUDA_VISIBLE_DEVICES="" python dynalang/train.py \
--configs defaults debug \
--task homegrid_task \
--logdir ~/logdir/homegrid/debug_run_cpu \
--jax.platform cpu \
--jax.policy_devices 0 \
--jax.train_devices 0 \
--batch_size 1 \
--batch_length 64 \
--encoder.mlp_keys '^token$' \
--decoder.mlp_keys '^token$'
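The `--encoder.mlp_keys` / `--decoder.mlp_keys` values are regular expressions matched against observation key names, so the anchored pattern `^token$` selects exactly the key named `token`. A quick demonstration (the other key names here are illustrative):

```python
import re

# '^token$' is anchored at both ends, so it matches "token" and nothing else
pattern = re.compile(r"^token$")
keys = ["token", "token_embed", "image", "is_read_step"]
print([k for k in keys if pattern.search(k)])  # ['token']
```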
GPU run command (for when GPU memory is insufficient; if memory is sufficient, just use the project's original command)
cd VLN_learning/dynalang
conda activate dynalang-vln
export HF_ENDPOINT=https://hf-mirror.com
export LD_LIBRARY_PATH=/home/yehuo/anaconda3/envs/dynalang-vln/lib:$LD_LIBRARY_PATH
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
unset XLA_PYTHON_CLIENT_PREALLOCATE
unset XLA_PYTHON_CLIENT_MEM_FRACTION
export XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false
export JAX_PLATFORMS=cuda
WANDB_MODE=disabled \
CUDA_VISIBLE_DEVICES=0 \
python dynalang/train.py \
--configs defaults debug \
--task homegrid_task \
--logdir ~/logdir/homegrid/debug_run_gpu \
--jax.platform cuda \
--jax.policy_devices 0 \
--jax.train_devices 0 \
--batch_size 1 \
--batch_length 6 \
--encoder.mlp_keys '^token$' \
--decoder.mlp_keys '^token$' \
--run.log_every 999999 \
--run.eval_every 999999 \
--run.save_every 999999
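For context on the exports above: by default JAX reserves a large fraction of GPU memory up front, and `XLA_PYTHON_CLIENT_ALLOCATOR=platform` switches it to on-demand allocation. The same environment can be assembled from Python before spawning the trainer, e.g. for `subprocess` (a sketch; only the variable names are the real JAX/XLA ones):

```python
import os

# Build the child-process environment used by the GPU launch:
# on-demand allocation instead of JAX's default upfront reservation.
env = dict(os.environ)
env["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"
env["XLA_FLAGS"] = "--xla_gpu_strict_conv_algorithm_picker=false"
env["JAX_PLATFORMS"] = "cuda"
# These two conflict with the platform allocator, so drop them (the `unset`s above)
for var in ("XLA_PYTHON_CLIENT_PREALLOCATE", "XLA_PYTHON_CLIENT_MEM_FRACTION"):
    env.pop(var, None)
print(env["XLA_PYTHON_CLIENT_ALLOCATOR"])  # platform
```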
TensorBoard visualization
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/homegrid/debug_run_cpu_0 --port 6006
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/homegrid/debug_run_gpu_0 --port 6006
② Messenger
CPU run command
cd VLN_learning/dynalang
conda activate dynalang-vln
WANDB_MODE=disabled \
JAX_PLATFORMS=cpu \
CUDA_VISIBLE_DEVICES="" \
sh scripts/run_messenger_s1.sh test_run 0 42 \
--configs defaults debug \
--jax.platform cpu \
--jax.policy_devices 0 \
--jax.train_devices 0 \
--batch_size 1 \
--batch_length 64 \
--encoder.mlp_keys '^token$' \
--decoder.mlp_keys '^token$'
GPU run command (for when GPU memory is insufficient; if memory is sufficient, just use the project's original command)
cd VLN_learning/dynalang
conda activate dynalang-vln
# 1. Stop JAX from grabbing GPU memory up front; allocate on demand instead
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
# 2. Follow the error message's suggestion and disable strict conv algorithm picking (saves memory)
export XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false
# 3. Put the conda env's library path first on the search path
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
WANDB_MODE=disabled \
sh scripts/run_messenger_s2.sh train_from_scratch 0 42 \
--configs debug \
--batch_size 1 \
--batch_length 8 \
--jax.prealloc False
TensorBoard visualization
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/messenger/s1_test_run_42 --port 6006
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/messenger/s2_train_from_scratch_42 --port 6006
③ VLN
CPU run command
cd VLN_learning/dynalang
conda activate dynalang-vln
# 1. Hide the physical GPUs completely so JAX cannot get confused about devices
export CUDA_VISIBLE_DEVICES=""
export JAX_PLATFORMS=cpu
unset XLA_PYTHON_CLIENT_PREALLOCATE
unset XLA_PYTHON_CLIENT_MEM_FRACTION
# 2. Force Habitat down to pure software rendering (everything computed on the CPU)
export HABITAT_SIM_BACKEND=cpu
export LIBGL_ALWAYS_SOFTWARE=1
export GALLIUM_DRIVER=llvmpipe
export MESA_GL_VERSION_OVERRIDE=3.3
export MESA_GLSL_VERSION_OVERRIDE=330
export PYTHONPATH=$PYTHONPATH:$(pwd)/VLN_CE
WANDB_MODE=disabled \
sh scripts/run_vln.sh vln_task debug_cpu 0 \
--jax.platform cpu \
--envs.amount 1 \
--run.actor_batch 1 \
--envs.parallel none \
--run.script train
TensorBoard visualization
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/vln_task_0 --port 6006
④ LangRoom
# Stash local changes and switch to the langroom branch
git stash
git checkout langroom
# To switch back to the previous branch afterwards:
git checkout -
git stash pop
CPU run command
cd VLN_learning/dynalang
conda activate dynalang-vln
# 1. Hide the physical GPUs so lower-level scripts cannot accidentally use them
export CUDA_VISIBLE_DEVICES=""
export JAX_PLATFORMS=cpu
# 2. Launch pure-CPU single-process training
# Positional arguments: EXP_NAME="debug_langroom_cpu" | GPU_IDS="" (empty) | SEED="0"
WANDB_MODE=disabled \
sh run_langroom.sh debug_langroom_cpu "" 0 \
--jax.platform cpu \
--envs.amount 1 \
--run.actor_batch 1 \
--envs.parallel none
TensorBoard visualization
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/langroom --port 6006
⑤ Text Pretraining
Run scripts/run_messenger_s2.sh from Step ② Messenger first; the pretraining command below reads the episodes it saves.
GPU run command (for when GPU memory is insufficient; if memory is sufficient, just use the project's original command)
cd VLN_learning/dynalang
conda activate dynalang-vln
# Environment variables
export HF_ENDPOINT=https://hf-mirror.com
export LD_LIBRARY_PATH=/home/yehuo/anaconda3/envs/dynalang-vln/lib:$LD_LIBRARY_PATH
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
export XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false
# Append --jax.platform cuda at the end of the command
WANDB_MODE=disabled \
CUDA_VISIBLE_DEVICES=0 \
sh scripts/pretrain_text.sh text_pretrain_debug 0 42 \
roneneldan/TinyStories \
/home/yehuo/logdir/messenger/s2_train_from_scratch_42/episodes \
--configs debug \
--batch_size 1 \
--batch_length 16 \
--jax.prealloc False \
--jax.platform cuda
TensorBoard visualization
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/textpt/text_pretrain_debug_42 --port 6006
⑥ Text Finetuning
GPU run command (for when GPU memory is insufficient; if memory is sufficient, just use the project's original command)
cd VLN_learning/dynalang
conda activate dynalang-vln
# 1. GPU and network settings
export HF_ENDPOINT=https://hf-mirror.com
export LD_LIBRARY_PATH=/home/yehuo/anaconda3/envs/dynalang-vln/lib:$LD_LIBRARY_PATH
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
export XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false
export JAX_PLATFORMS=cuda
# 2. Launch text pretraining (append --jax.platform cuda at the end to override the debug default)
WANDB_MODE=disabled \
CUDA_VISIBLE_DEVICES=0 \
sh scripts/pretrain_text.sh text_pretrain_debug 0 42 \
roneneldan/TinyStories \
/home/yehuo/logdir/messenger/s2_train_from_scratch_42/episodes \
--configs debug \
--batch_size 1 \
--batch_length 16 \
--jax.prealloc False \
--jax.platform cuda
TensorBoard visualization
conda activate dynalang-vln
tensorboard --logdir /home/yehuo/logdir/textpt/text_pretrain_debug_42 --port 6006
