LMDeploy is an open-source, lightweight toolkit for deploying pretrained large language models (LLMs), including on edge devices. Quantization is an important step in model deployment: by lowering the numerical precision of the model weights it shrinks the model and speeds up inference while keeping the loss in model quality as small as possible. The following documents quantized deployment of an LLM with LMDeploy.
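To make the size-reduction claim concrete, here is a rough back-of-the-envelope sketch (my own illustration, not part of the course material) of the weight memory a 1.8B-parameter model needs at different precisions; it ignores activations, the KV cache, and quantization metadata such as scales and zero points.

```python
# Rough weight-memory estimate for a 1.8B-parameter model (illustration only).
params = 1.8e9

for name, bits in [("FP16", 16), ("INT8 weights", 8), ("INT4 weights (W4A16)", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:>22}: ~{gib:.2f} GiB of weights")

# FP16 -> ~3.35 GiB, INT8 -> ~1.68 GiB, INT4 -> ~0.84 GiB before overhead,
# which is why a 4-bit (W4A16) model fits comfortably on a small GPU slice.
```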

# InternLM (书生·浦语) Practical Camp, Season 2: Lesson 5 Assignment

This page covers all of the steps for the Lesson 5 assignment of the second season of the camp. For background on model quantization and deployment, see the study notes.

## Assignment requirements

### Basic assignment

Complete the following tasks and record the process with screenshots:

- Set up the lmdeploy runtime environment
- Download the internlm-chat-1.8b model
- Chat with the model from the command line

### Advanced assignment

Complete the following tasks and record the process with screenshots:

- Set the maximum KV Cache fraction to 0.4, enable W4A16 quantization, and chat with the model from the command line.
- Start lmdeploy as an API Server with W4A16 quantization and a KV Cache fraction of 0.4, then chat with the model through both the command-line client and the Gradio web client.
- With W4A16 quantization and a KV Cache fraction of 0.4, run the internlm2-chat-1.8b model through Python code integration.
- Use LMDeploy to run a Gradio demo of the multimodal vision-language model llava.
- Deploy the LMDeploy Web Demo to OpenXLab (the OpenXLab CUDA 12.2 image is not ready yet; this can be skipped for now and done a week later).

## Quantizing an LLM with LMDeploy

### Creating a new environment

The cuda11.7-conda image has compatibility problems with recent versions of lmdeploy, so create a new dev machine from the cuda12.2-conda image and select the 10% A100 GPU option. Also, unlike the previous assignments, the environment built here with studio-conda is based on the "prebuilt" pytorch-2.1.2 environment rather than internlm-base. It is an empty environment, which means that to reproduce it locally you can simply create an empty conda environment with python=3.10.

```bash
studio-conda -t lmdeploy -o pytorch-2.1.2
```

Click to view the full package list of the pytorch-2.1.2 environment:

```
# packages in environment at /root/.conda/envs/lmdeploy: # # Name Version Build Channel _libgcc_mutex 0.1 main defaults _openmp_mutex 5.1 1_gnu defaults asttokens 2.4.1 pypi_0 pypi blas 1.0 mkl defaults brotli-python 1.0.9 py310h6a678d5_7 defaults bzip2 1.0.8 h5eee18b_5 defaults ca-certificates 2024.3.11 h06a4308_0 defaults certifi 2024.2.2 py310h06a4308_0 defaults charset-normalizer 2.0.4 pyhd3eb1b0_0 defaults comm 0.2.2 pypi_0 pypi cuda-cudart 12.1.105 0 nvidia cuda-cupti 12.1.105 0 nvidia cuda-libraries 12.1.0 0 nvidia cuda-nvrtc 12.1.105 0 nvidia cuda-nvtx 12.1.105 0 nvidia cuda-opencl 12.4.127 0 nvidia cuda-runtime 12.1.0 0 nvidia debugpy 1.8.1 pypi_0 pypi decorator 5.1.1 pypi_0 pypi einops 0.7.0 pypi_0 pypi exceptiongroup 1.2.0 pypi_0 pypi executing 2.0.1 pypi_0 pypi ffmpeg 4.3 hf484d3e_0 pytorch filelock 3.13.1 py310h06a4308_0 defaults freetype 2.12.1 h4a9f257_0 defaults gmp 6.2.1 h295c915_3 defaults gmpy2 2.1.2 py310heeb90bb_0 defaults gnutls 3.6.15 he1e5248_0 defaults idna 3.4 py310h06a4308_0 defaults intel-openmp 2023.1.0 hdb19cb5_46306 defaults ipykernel 6.29.4 pypi_0 pypi ipython 8.23.0 pypi_0 pypi jedi 0.19.1 pypi_0 pypi jinja2 3.1.3 py310h06a4308_0 defaults jpeg 9e h5eee18b_1 defaults jupyter-client 8.6.1 pypi_0 pypi jupyter-core 5.7.2 pypi_0 pypi lame 3.100 h7b6447c_0 defaults lcms2 2.12 h3be6417_0 defaults ld_impl_linux-64 2.38 h1181459_1 defaults lerc 3.0 h295c915_0 defaults libcublas 12.1.0.26 0 nvidia libcufft 11.0.2.4 0 nvidia libcufile 1.9.0.20 0 nvidia libcurand 10.3.5.119 0 nvidia libcusolver 11.4.4.55 0 nvidia libcusparse 12.0.2.55 0 nvidia libdeflate 1.17 h5eee18b_1 defaults libffi 3.4.4 h6a678d5_0 defaults libgcc-ng 11.2.0 h1234567_1 defaults libgomp 11.2.0 h1234567_1 defaults libiconv 1.16 h7f8727e_2 defaults libidn2 2.3.4 h5eee18b_0 defaults libjpeg-turbo 2.0.0 h9bf148f_0 pytorch libnpp 12.0.2.50 0 nvidia libnvjitlink 12.1.105 0 nvidia libnvjpeg 12.1.1.14 0 nvidia libpng 1.6.39 h5eee18b_0 defaults libstdcxx-ng 11.2.0 h1234567_1 defaults libtasn1 4.19.0 h5eee18b_0 defaults libtiff 4.5.1 h6a678d5_0 defaults libunistring 0.9.10 h27cfd23_0 defaults libuuid 1.41.5 h5eee18b_0 defaults libwebp-base 1.3.2 h5eee18b_0 defaults llvm-openmp 14.0.6 h9e868ea_0 defaults lz4-c 1.9.4 h6a678d5_0 defaults markupsafe 2.1.3 py310h5eee18b_0 defaults matplotlib-inline 0.1.6 pypi_0 pypi mkl 2023.1.0 h213fc3f_46344 defaults mkl-service 2.4.0 py310h5eee18b_1 defaults mkl_fft 1.3.8 py310h5eee18b_0 defaults mkl_random 1.2.4 py310hdb19cb5_0 defaults mpc 1.1.0 h10f8cd9_1 defaults mpfr 4.0.2 hb69a4c5_1 defaults mpmath 1.3.0 py310h06a4308_0 defaults ncurses 6.4 h6a678d5_0 defaults nest-asyncio 1.6.0 pypi_0 pypi nettle 3.7.3 hbbd107a_1 defaults networkx 3.1 py310h06a4308_0 defaults numpy 1.26.4 py310h5f9d8c6_0 defaults
numpy-base 1.26.4 py310hb5e798b_0 defaults openh264 2.1.1 h4ff587b_0 defaults openjpeg 2.4.0 h3ad879b_0 defaults openssl 3.0.13 h7f8727e_0 defaults packaging 24.0 pypi_0 pypi parso 0.8.4 pypi_0 pypi pexpect 4.9.0 pypi_0 pypi pillow 10.2.0 py310h5eee18b_0 defaults pip 23.3.1 py310h06a4308_0 defaults platformdirs 4.2.0 pypi_0 pypi prompt-toolkit 3.0.43 pypi_0 pypi protobuf 5.26.1 pypi_0 pypi psutil 5.9.8 pypi_0 pypi ptyprocess 0.7.0 pypi_0 pypi pure-eval 0.2.2 pypi_0 pypi pygments 2.17.2 pypi_0 pypi pysocks 1.7.1 py310h06a4308_0 defaults python 3.10.14 h955ad1f_0 defaults python-dateutil 2.9.0.post0 pypi_0 pypi pytorch 2.1.2 py3.10_cuda12.1_cudnn8.9.2_0 pytorch pytorch-cuda 12.1 ha16c6d3_5 pytorch pytorch-mutex 1.0 cuda pytorch pyyaml 6.0.1 py310h5eee18b_0 defaults pyzmq 25.1.2 pypi_0 pypi readline 8.2 h5eee18b_0 defaults requests 2.31.0 py310h06a4308_1 defaults setuptools 68.2.2 py310h06a4308_0 defaults six 1.16.0 pypi_0 pypi sqlite 3.41.2 h5eee18b_0 defaults stack-data 0.6.3 pypi_0 pypi sympy 1.12 py310h06a4308_0 defaults tbb 2021.8.0 hdb19cb5_0 defaults tk 8.6.12 h1ccaba5_0 defaults torchaudio 2.1.2 py310_cu121 pytorch torchtriton 2.1.0 py310 pytorch torchvision 0.16.2 py310_cu121 pytorch tornado 6.4 pypi_0 pypi traitlets 5.14.2 pypi_0 pypi typing_extensions 4.9.0 py310h06a4308_1 defaults tzdata 2024a h04d1e81_0 defaults urllib3 2.1.0 py310h06a4308_1 defaults wcwidth 0.2.13 pypi_0 pypi wheel 0.41.2 py310h06a4308_0 defaults xz 5.4.6 h5eee18b_0 defaults yaml 0.2.5 h7b6447c_0 defaults zlib 1.2.13 h5eee18b_0 defaults zstd 1.5.5 hc292b87_0 defaults
```

Then activate the newly created virtual environment and install lmdeploy version 0.3.0, and wait for the installation to finish:

```bash
conda activate lmdeploy
pip install lmdeploy[all]==0.3.0
```

Click to view the full package list of the lmdeploy environment:

```
# packages in environment at /root/.conda/envs/lmdeploy: # # Name Version Build Channel _libgcc_mutex 0.1 main defaults _openmp_mutex 5.1 1_gnu defaults accelerate 0.29.1 pypi_0 pypi addict 2.4.0 pypi_0 pypi aiofiles 23.2.1 pypi_0 pypi aiohttp 3.9.3 pypi_0 pypi aiosignal 1.3.1 pypi_0 pypi altair 5.3.0 pypi_0 pypi annotated-types 0.6.0 pypi_0 pypi anyio 4.3.0 pypi_0 pypi asttokens 2.4.1 pypi_0 pypi async-timeout 4.0.3 pypi_0 pypi attrs 23.2.0 pypi_0 pypi blas 1.0 mkl defaults brotli-python 1.0.9 py310h6a678d5_7 defaults bzip2 1.0.8 h5eee18b_5 defaults ca-certificates 2024.3.11 h06a4308_0 defaults certifi 2024.2.2 py310h06a4308_0 defaults charset-normalizer 2.0.4 pyhd3eb1b0_0 defaults click 8.1.7 pypi_0 pypi comm 0.2.2 pypi_0 pypi contourpy 1.2.1 pypi_0 pypi cuda-cudart 12.1.105 0 nvidia cuda-cupti 12.1.105 0 nvidia cuda-libraries 12.1.0 0 nvidia cuda-nvrtc 12.1.105 0 nvidia cuda-nvtx 12.1.105 0 nvidia cuda-opencl 12.4.127 0 nvidia cuda-runtime 12.1.0 0 nvidia cycler 0.12.1 pypi_0 pypi datasets 2.18.0 pypi_0 pypi debugpy 1.8.1 pypi_0 pypi decorator 5.1.1 pypi_0 pypi dill 0.3.8 pypi_0 pypi einops 0.7.0 pypi_0 pypi exceptiongroup 1.2.0 pypi_0 pypi executing 2.0.1 pypi_0 pypi fastapi 0.110.1 pypi_0 pypi ffmpeg 4.3 hf484d3e_0 pytorch ffmpy 0.3.2 pypi_0 pypi filelock 3.13.1 py310h06a4308_0 defaults fire 0.6.0 pypi_0 pypi fonttools 4.51.0 pypi_0 pypi freetype 2.12.1 h4a9f257_0 defaults frozenlist 1.4.1 pypi_0 pypi fsspec 2024.2.0 pypi_0 pypi gmp 6.2.1 h295c915_3 defaults gmpy2 2.1.2 py310heeb90bb_0 defaults gnutls 3.6.15 he1e5248_0 defaults gradio 3.50.2 pypi_0 pypi gradio-client 0.6.1 pypi_0 pypi grpcio 1.62.1 pypi_0 pypi h11 0.14.0 pypi_0 pypi httpcore 1.0.5 pypi_0 pypi httpx 0.27.0 pypi_0 pypi huggingface-hub 0.22.2 pypi_0 pypi idna 3.4 py310h06a4308_0 defaults importlib-metadata 7.1.0 pypi_0 pypi importlib-resources 6.4.0
pypi_0 pypi intel-openmp 2023.1.0 hdb19cb5_46306 defaults ipykernel 6.29.4 pypi_0 pypi ipython 8.23.0 pypi_0 pypi jedi 0.19.1 pypi_0 pypi jinja2 3.1.3 py310h06a4308_0 defaults jpeg 9e h5eee18b_1 defaults jsonschema 4.21.1 pypi_0 pypi jsonschema-specifications 2023.12.1 pypi_0 pypi jupyter-client 8.6.1 pypi_0 pypi jupyter-core 5.7.2 pypi_0 pypi kiwisolver 1.4.5 pypi_0 pypi lame 3.100 h7b6447c_0 defaults lcms2 2.12 h3be6417_0 defaults ld_impl_linux-64 2.38 h1181459_1 defaults lerc 3.0 h295c915_0 defaults libcublas 12.1.0.26 0 nvidia libcufft 11.0.2.4 0 nvidia libcufile 1.9.0.20 0 nvidia libcurand 10.3.5.119 0 nvidia libcusolver 11.4.4.55 0 nvidia libcusparse 12.0.2.55 0 nvidia libdeflate 1.17 h5eee18b_1 defaults libffi 3.4.4 h6a678d5_0 defaults libgcc-ng 11.2.0 h1234567_1 defaults libgomp 11.2.0 h1234567_1 defaults libiconv 1.16 h7f8727e_2 defaults libidn2 2.3.4 h5eee18b_0 defaults libjpeg-turbo 2.0.0 h9bf148f_0 pytorch libnpp 12.0.2.50 0 nvidia libnvjitlink 12.1.105 0 nvidia libnvjpeg 12.1.1.14 0 nvidia libpng 1.6.39 h5eee18b_0 defaults libstdcxx-ng 11.2.0 h1234567_1 defaults libtasn1 4.19.0 h5eee18b_0 defaults libtiff 4.5.1 h6a678d5_0 defaults libunistring 0.9.10 h27cfd23_0 defaults libuuid 1.41.5 h5eee18b_0 defaults libwebp-base 1.3.2 h5eee18b_0 defaults llvm-openmp 14.0.6 h9e868ea_0 defaults lmdeploy 0.3.0 pypi_0 pypi lz4-c 1.9.4 h6a678d5_0 defaults markdown-it-py 3.0.0 pypi_0 pypi markupsafe 2.1.3 py310h5eee18b_0 defaults matplotlib 3.8.4 pypi_0 pypi matplotlib-inline 0.1.6 pypi_0 pypi mdurl 0.1.2 pypi_0 pypi mkl 2023.1.0 h213fc3f_46344 defaults mkl-service 2.4.0 py310h5eee18b_1 defaults mkl_fft 1.3.8 py310h5eee18b_0 defaults mkl_random 1.2.4 py310hdb19cb5_0 defaults mmengine-lite 0.10.3 pypi_0 pypi mpc 1.1.0 h10f8cd9_1 defaults mpfr 4.0.2 hb69a4c5_1 defaults mpmath 1.3.0 py310h06a4308_0 defaults multidict 6.0.5 pypi_0 pypi multiprocess 0.70.16 pypi_0 pypi ncurses 6.4 h6a678d5_0 defaults nest-asyncio 1.6.0 pypi_0 pypi nettle 3.7.3 hbbd107a_1 defaults networkx 3.1 py310h06a4308_0 defaults numpy 1.26.4 py310h5f9d8c6_0 defaults numpy-base 1.26.4 py310hb5e798b_0 defaults nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi nvidia-curand-cu12 10.3.5.147 pypi_0 pypi nvidia-nccl-cu12 2.21.5 pypi_0 pypi openh264 2.1.1 h4ff587b_0 defaults openjpeg 2.4.0 h3ad879b_0 defaults openssl 3.0.13 h7f8727e_0 defaults orjson 3.10.0 pypi_0 pypi packaging 24.0 pypi_0 pypi pandas 2.2.1 pypi_0 pypi parso 0.8.4 pypi_0 pypi peft 0.9.0 pypi_0 pypi pexpect 4.9.0 pypi_0 pypi pillow 10.2.0 py310h5eee18b_0 defaults pip 23.3.1 py310h06a4308_0 defaults platformdirs 4.2.0 pypi_0 pypi prompt-toolkit 3.0.43 pypi_0 pypi protobuf 4.25.3 pypi_0 pypi psutil 5.9.8 pypi_0 pypi ptyprocess 0.7.0 pypi_0 pypi pure-eval 0.2.2 pypi_0 pypi pyarrow 15.0.2 pypi_0 pypi pyarrow-hotfix 0.6 pypi_0 pypi pybind11 2.12.0 pypi_0 pypi pydantic 2.6.4 pypi_0 pypi pydantic-core 2.16.3 pypi_0 pypi pydub 0.25.1 pypi_0 pypi pygments 2.17.2 pypi_0 pypi pynvml 11.5.0 pypi_0 pypi pyparsing 3.1.2 pypi_0 pypi pysocks 1.7.1 py310h06a4308_0 defaults python 3.10.14 h955ad1f_0 defaults python-dateutil 2.9.0.post0 pypi_0 pypi python-multipart 0.0.9 pypi_0 pypi python-rapidjson 1.16 pypi_0 pypi pytorch 2.1.2 py3.10_cuda12.1_cudnn8.9.2_0 pytorch pytorch-cuda 12.1 ha16c6d3_5 pytorch pytorch-mutex 1.0 cuda pytorch pytz 2024.1 pypi_0 pypi pyyaml 6.0.1 py310h5eee18b_0 defaults pyzmq 25.1.2 pypi_0 pypi readline 8.2 h5eee18b_0 defaults referencing 0.34.0 pypi_0 pypi regex 2023.12.25 pypi_0 pypi requests 2.31.0 py310h06a4308_1 defaults 
rich 13.7.1 pypi_0 pypi rpds-py 0.18.0 pypi_0 pypi safetensors 0.4.2 pypi_0 pypi semantic-version 2.10.0 pypi_0 pypi sentencepiece 0.2.0 pypi_0 pypi setuptools 68.2.2 py310h06a4308_0 defaults shortuuid 1.0.13 pypi_0 pypi six 1.16.0 pypi_0 pypi sniffio 1.3.1 pypi_0 pypi sqlite 3.41.2 h5eee18b_0 defaults stack-data 0.6.3 pypi_0 pypi starlette 0.37.2 pypi_0 pypi sympy 1.12 py310h06a4308_0 defaults tbb 2021.8.0 hdb19cb5_0 defaults termcolor 2.4.0 pypi_0 pypi tiktoken 0.6.0 pypi_0 pypi tk 8.6.12 h1ccaba5_0 defaults tokenizers 0.15.2 pypi_0 pypi tomli 2.0.1 pypi_0 pypi toolz 0.12.1 pypi_0 pypi torchaudio 2.1.2 py310_cu121 pytorch torchtriton 2.1.0 py310 pytorch torchvision 0.16.2 py310_cu121 pytorch tornado 6.4 pypi_0 pypi tqdm 4.66.2 pypi_0 pypi traitlets 5.14.2 pypi_0 pypi transformers 4.38.2 pypi_0 pypi transformers-stream-generator 0.0.5 pypi_0 pypi tritonclient 2.44.0 pypi_0 pypi typing_extensions 4.9.0 py310h06a4308_1 defaults tzdata 2024.1 pypi_0 pypi urllib3 2.1.0 py310h06a4308_1 defaults uvicorn 0.29.0 pypi_0 pypi wcwidth 0.2.13 pypi_0 pypi websockets 11.0.3 pypi_0 pypi wheel 0.41.2 py310h06a4308_0 defaults xxhash 3.4.1 pypi_0 pypi xz 5.4.6 h5eee18b_0 defaults yaml 0.2.5 h7b6447c_0 defaults yapf 0.40.2 pypi_0 pypi yarl 1.9.4 pypi_0 pypi zipp 3.18.1 pypi_0 pypi zlib 1.2.13 h5eee18b_0 defaults zstd 1.5.5 hc292b87_0 defaults
```

### Downloading the model

As before, create a symlink for the internlm2-chat-1_8b model. The linked path differs slightly from the previous hands-on sessions.

```bash
cd ~
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
```

### Inference test before quantization

This code is essentially the same as the first demo in assignment 2: load the model and call model.chat() to get its output. The main purpose of this step is to check that the model responds correctly and to get a feel for its inference speed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)
```

The result of the run is shown below:

GPU usage is as follows:

With a small modification to the code above, we can measure the inference speed before compression:

```python
# python benchmark_transformer.py
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i + 1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.4f} words/s".format(speed))
```

Remember this speed, 16.4092 words/s; we will compare against it later.

### Chatting with the model: lmdeploy chat

With the lmdeploy chat command you can talk to the model directly from the command line, and inference is noticeably faster:

```bash
lmdeploy chat /root/internlm2-chat-1_8b
```

GPU usage looks like this:

In lmdeploy, press Enter twice to send input to the model; type "exit" and press Enter twice to quit the conversation.

This command has many parameters; the help text is available via lmdeploy chat -h, and its output is:

```
usage: lmdeploy chat [-h] [--backend {pytorch,turbomind}] [--trust-remote-code] [--meta-instruction
META_INSTRUCTION] [--cap {completion,infilling,chat,python}] [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME] [--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--model-format {hf,llama,awq}] [--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR] model_path Chat with pytorch or turbomind engine. positional arguments: model_path The path of a model. it could be one of the following options: - i) a local directory path of a turbomind model which is converted by `lmdeploy convert` command or download from ii) and iii). - ii) the model_id of a lmdeploy- quantized model hosted inside a model repo on huggingface.co, such as "internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a model hosted inside a model repo on huggingface.co, such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat ", "baichuan- inc/baichuan2-7b-chat" and so on. Type: str options: -h, --help show this help message and exit --backend {pytorch,turbomind} Set the inference backend. Default: turbomind. Type: str --trust-remote-code Trust remote code for loading hf models. Default: True --meta-instruction META_INSTRUCTION System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template instead. Default: None. Type: str --cap {completion,infilling,chat,python} The capability of a model. Deprecated. Please use --chat-template instead. Default: chat. Type: str PyTorch engine arguments: --adapters [ADAPTERS ...] Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple lora adapters. If only have one adapter, one can only input the path of the adapter.. Default: None. Type: str --tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int --model-name MODEL_NAME The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run `lmdeploy list` to get the supported model names. Default: None. Type: str --session-len SESSION_LEN The max session length of a sequence. Default: None. Type: int --max-batch-size MAX_BATCH_SIZE Maximum batch size. Default: 128. Type: int --cache-max-entry-count CACHE_MAX_ENTRY_COUNT The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float TurboMind engine arguments: --tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int --model-name MODEL_NAME The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run `lmdeploy list` to get the supported model names. Default: None. Type: str --session-len SESSION_LEN The max session length of a sequence. Default: None. Type: int --max-batch-size MAX_BATCH_SIZE Maximum batch size. Default: 128. Type: int --cache-max-entry-count CACHE_MAX_ENTRY_COUNT The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float --model-format {hf,llama,awq} The format of input model. `hf` meaning `hf_llama`, `llama` meaning `meta_llama`, `awq` meaning the quantized model by awq. Default: None. Type: str --quant-policy QUANT_POLICY Whether to use kv int8. Default: 0. Type: int --rope-scaling-factor ROPE_SCALING_FACTOR Rope scaling factor. Default: 0.0. 
Type: float
```

Note the --cache-max-entry-count parameter: it controls the maximum fraction of the remaining GPU memory that the KV cache may occupy, and it defaults to 0.8. This means the later assignment tasks only require changing this one parameter.

### Model quantization and calibration: lmdeploy lite

Before quantizing, install the einops library:

```bash
pip install einops==0.7.0
```

Then run the following command to quantize the model:

```bash
lmdeploy lite auto_awq \
    /root/internlm2-chat-1_8b \
    --calib-dataset 'ptb' \
    --calib-samples 128 \
    --calib-seqlen 1024 \
    --w-bits 4 \
    --w-group-size 128 \
    --work-dir /root/internlm2-chat-1_8b-4bit
```

This command uses the AWQ algorithm to quantize the model weights to 4 bits. The TurboMind inference engine provides efficient 4-bit CUDA kernels with more than 2.4x the performance of FP16. This step takes a very, very long time. When quantization finishes, the new HF-format model is saved to /root/internlm2-chat-1_8b-4bit.

Click to view the output of this run:

```
(lmdeploy) root@intern-studio-160311:~# lmdeploy lite auto_awq \ > /root/internlm2-chat-1_8b \ > --calib-dataset 'ptb' \ > --calib-samples 128 \ > --calib-seqlen 1024 \ > --w-bits 4 \ > --w-group-size 128 \ > --work-dir /root/internlm2-chat-1_8b-4bit Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.60s/it] Move model.tok_embeddings to GPU. Move model.layers.0 to CPU. Move model.layers.1 to CPU. Move model.layers.2 to CPU. Move model.layers.3 to CPU. Move model.layers.4 to CPU. Move model.layers.5 to CPU. Move model.layers.6 to CPU. Move model.layers.7 to CPU. Move model.layers.8 to CPU. Move model.layers.9 to CPU. Move model.layers.10 to CPU. Move model.layers.11 to CPU. Move model.layers.12 to CPU. Move model.layers.13 to CPU. Move model.layers.14 to CPU. Move model.layers.15 to CPU. Move model.layers.16 to CPU. Move model.layers.17 to CPU. Move model.layers.18 to CPU. Move model.layers.19 to CPU. Move model.layers.20 to CPU. Move model.layers.21 to CPU. Move model.layers.22 to CPU. Move model.layers.23 to CPU. Move model.norm to GPU. Move output to CPU. Loading calibrate dataset ... /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at <https://hf.co/datasets/ptb_text_only> You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn( Downloading builder script: 6.50kB [00:00, 24.9MB/s] Downloading readme: 4.21kB [00:00, 19.9MB/s] Downloading data: 5.10MB [01:05, 78.1kB/s] Downloading data: 400kB [00:00, 402kB/s] Downloading data: 450kB [00:09, 48.3kB/s] Generating train split: 100%|██████████████████████████████████████████████████| 42068/42068 [00:00<00:00, 88086.81 examples/s] Generating test split: 100%|████████████████████████████████████████████████████| 3761/3761 [00:00<00:00, 100075.98 examples/s] Generating validation split: 100%|██████████████████████████████████████████████| 3370/3370 [00:00<00:00, 100399.93 examples/s] model.layers.0, samples: 128, max gpu memory: 2.25 GB model.layers.1, samples: 128, max gpu memory: 2.75 GB model.layers.2, samples: 128, max gpu memory: 2.75 GB model.layers.3, samples: 128, max gpu memory: 2.75 GB model.layers.4, samples: 128, max gpu memory: 2.75 GB model.layers.5, samples: 128, max gpu memory: 2.75 GB model.layers.6, samples: 128, max gpu memory: 2.75 GB model.layers.7, samples: 128, max gpu memory: 2.75 GB model.layers.8, samples: 128, max gpu memory: 2.75 GB model.layers.9, samples: 128, max gpu memory: 2.75 GB model.layers.10, samples: 128, max gpu memory: 2.75 GB model.layers.11, samples: 128, max gpu memory: 2.75 GB model.layers.12, samples: 128, max gpu memory: 2.75 GB model.layers.13, samples: 128, max gpu memory: 2.75 GB model.layers.14, samples: 128, max gpu memory: 2.75 GB model.layers.15, samples: 128, max gpu memory: 2.75 GB model.layers.16, samples: 128, max gpu memory: 2.75 GB model.layers.17, samples: 128, max gpu memory: 2.75 GB model.layers.18, samples: 128, max gpu memory: 2.75 GB model.layers.19, samples: 128, max gpu memory: 2.75 GB model.layers.20, samples: 128, max gpu memory: 2.75 GB model.layers.21, samples: 128, max gpu memory: 2.75 GB model.layers.22, samples: 128, max gpu memory: 2.75 GB model.layers.23, samples: 128, max gpu memory: 2.75 GB model.layers.0 smooth weight done. model.layers.1 smooth weight done. model.layers.2 smooth weight done. model.layers.3 smooth weight done. model.layers.4 smooth weight done. model.layers.5 smooth weight done. model.layers.6 smooth weight done. model.layers.7 smooth weight done. model.layers.8 smooth weight done. model.layers.9 smooth weight done. model.layers.10 smooth weight done. model.layers.11 smooth weight done. model.layers.12 smooth weight done. model.layers.13 smooth weight done. model.layers.14 smooth weight done. model.layers.15 smooth weight done. model.layers.16 smooth weight done. model.layers.17 smooth weight done. model.layers.18 smooth weight done. model.layers.19 smooth weight done. model.layers.20 smooth weight done. model.layers.21 smooth weight done. model.layers.22 smooth weight done. model.layers.23 smooth weight done. model.layers.0.attention.wqkv weight packed. model.layers.0.attention.wo weight packed. model.layers.0.feed_forward.w1 weight packed. model.layers.0.feed_forward.w3 weight packed. model.layers.0.feed_forward.w2 weight packed. model.layers.1.attention.wqkv weight packed. model.layers.1.attention.wo weight packed. model.layers.1.feed_forward.w1 weight packed. model.layers.1.feed_forward.w3 weight packed. model.layers.1.feed_forward.w2 weight packed. model.layers.2.attention.wqkv weight packed. model.layers.2.attention.wo weight packed. model.layers.2.feed_forward.w1 weight packed. model.layers.2.feed_forward.w3 weight packed. model.layers.2.feed_forward.w2 weight packed. model.layers.3.attention.wqkv weight packed. model.layers.3.attention.wo weight packed. 
model.layers.3.feed_forward.w1 weight packed. model.layers.3.feed_forward.w3 weight packed. model.layers.3.feed_forward.w2 weight packed. model.layers.4.attention.wqkv weight packed. model.layers.4.attention.wo weight packed. model.layers.4.feed_forward.w1 weight packed. model.layers.4.feed_forward.w3 weight packed. model.layers.4.feed_forward.w2 weight packed. model.layers.5.attention.wqkv weight packed. model.layers.5.attention.wo weight packed. model.layers.5.feed_forward.w1 weight packed. model.layers.5.feed_forward.w3 weight packed. model.layers.5.feed_forward.w2 weight packed. model.layers.6.attention.wqkv weight packed. model.layers.6.attention.wo weight packed. model.layers.6.feed_forward.w1 weight packed. model.layers.6.feed_forward.w3 weight packed. model.layers.6.feed_forward.w2 weight packed. model.layers.7.attention.wqkv weight packed. model.layers.7.attention.wo weight packed. model.layers.7.feed_forward.w1 weight packed. model.layers.7.feed_forward.w3 weight packed. model.layers.7.feed_forward.w2 weight packed. model.layers.8.attention.wqkv weight packed. model.layers.8.attention.wo weight packed. model.layers.8.feed_forward.w1 weight packed. model.layers.8.feed_forward.w3 weight packed. model.layers.8.feed_forward.w2 weight packed. model.layers.9.attention.wqkv weight packed. model.layers.9.attention.wo weight packed. model.layers.9.feed_forward.w1 weight packed. model.layers.9.feed_forward.w3 weight packed. model.layers.9.feed_forward.w2 weight packed. model.layers.10.attention.wqkv weight packed. model.layers.10.attention.wo weight packed. model.layers.10.feed_forward.w1 weight packed. model.layers.10.feed_forward.w3 weight packed. model.layers.10.feed_forward.w2 weight packed. model.layers.11.attention.wqkv weight packed. model.layers.11.attention.wo weight packed. model.layers.11.feed_forward.w1 weight packed. model.layers.11.feed_forward.w3 weight packed. model.layers.11.feed_forward.w2 weight packed. model.layers.12.attention.wqkv weight packed. model.layers.12.attention.wo weight packed. model.layers.12.feed_forward.w1 weight packed. model.layers.12.feed_forward.w3 weight packed. model.layers.12.feed_forward.w2 weight packed. model.layers.13.attention.wqkv weight packed. model.layers.13.attention.wo weight packed. model.layers.13.feed_forward.w1 weight packed. model.layers.13.feed_forward.w3 weight packed. model.layers.13.feed_forward.w2 weight packed. model.layers.14.attention.wqkv weight packed. model.layers.14.attention.wo weight packed. model.layers.14.feed_forward.w1 weight packed. model.layers.14.feed_forward.w3 weight packed. model.layers.14.feed_forward.w2 weight packed. model.layers.15.attention.wqkv weight packed. model.layers.15.attention.wo weight packed. model.layers.15.feed_forward.w1 weight packed. model.layers.15.feed_forward.w3 weight packed. model.layers.15.feed_forward.w2 weight packed. model.layers.16.attention.wqkv weight packed. model.layers.16.attention.wo weight packed. model.layers.16.feed_forward.w1 weight packed. model.layers.16.feed_forward.w3 weight packed. model.layers.16.feed_forward.w2 weight packed. model.layers.17.attention.wqkv weight packed. model.layers.17.attention.wo weight packed. model.layers.17.feed_forward.w1 weight packed. model.layers.17.feed_forward.w3 weight packed. model.layers.17.feed_forward.w2 weight packed. model.layers.18.attention.wqkv weight packed. model.layers.18.attention.wo weight packed. model.layers.18.feed_forward.w1 weight packed. model.layers.18.feed_forward.w3 weight packed. 
model.layers.18.feed_forward.w2 weight packed. model.layers.19.attention.wqkv weight packed. model.layers.19.attention.wo weight packed. model.layers.19.feed_forward.w1 weight packed. model.layers.19.feed_forward.w3 weight packed. model.layers.19.feed_forward.w2 weight packed. model.layers.20.attention.wqkv weight packed. model.layers.20.attention.wo weight packed. model.layers.20.feed_forward.w1 weight packed. model.layers.20.feed_forward.w3 weight packed. model.layers.20.feed_forward.w2 weight packed. model.layers.21.attention.wqkv weight packed. model.layers.21.attention.wo weight packed. model.layers.21.feed_forward.w1 weight packed. model.layers.21.feed_forward.w3 weight packed. model.layers.21.feed_forward.w2 weight packed. model.layers.22.attention.wqkv weight packed. model.layers.22.attention.wo weight packed. model.layers.22.feed_forward.w1 weight packed. model.layers.22.feed_forward.w3 weight packed. model.layers.22.feed_forward.w2 weight packed. model.layers.23.attention.wqkv weight packed. model.layers.23.attention.wo weight packed. model.layers.23.feed_forward.w1 weight packed. model.layers.23.feed_forward.w3 weight packed. model.layers.23.feed_forward.w2 weight packed.
```

Now look at the detailed information of the quantized model and compare it with the model before quantization (a small sketch for comparing the two directories on disk follows after the help text below):

The model is noticeably smaller.

Next, run the W4A16-quantized model with the chat command:

```bash
lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq
```

The output is no different from the un-quantized model, although this time I accidentally triggered an internal special token:

More parameters of LMDeploy's lite functionality can be viewed with the lmdeploy lite -h command:

```
(lmdeploy) root@intern-studio-160311:~# lmdeploy lite -h
usage: lmdeploy lite [-h] {auto_awq,calibrate,kv_qparams,smooth_quant} ...

Compressing and accelerating LLMs with lmdeploy.lite module

options:
  -h, --help            show this help message and exit

Commands:
  This group has the following commands:
  {auto_awq,calibrate,kv_qparams,smooth_quant}
    auto_awq            Perform weight quantization using AWQ algorithm.
    calibrate           Perform calibration on a given dataset.
    kv_qparams          Export key and value stats.
    smooth_quant        Perform w8a8 quantization using SmoothQuant.
```
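Since the original screenshots of the before/after model directories are not reproduced here, the following is a minimal sketch (my own, not part of the course material) for comparing the on-disk size of the original and the W4A16-quantized model; the two directory paths are the ones used above.

```python
import os

def dir_size_gib(path):
    """Sum the sizes of all files under `path`, in GiB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024**3

for path in ["/root/internlm2-chat-1_8b", "/root/internlm2-chat-1_8b-4bit"]:
    print(f"{path}: {dir_size_gib(path):.2f} GiB")
```

Something like `du -sh /root/internlm2-chat-1_8b*` would show the same comparison from the shell.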
### Assignment checkpoint: set the maximum KV Cache fraction to 0.4, enable W4A16 quantization, and chat with the model from the command line

We have already completed W4A16 quantization of the model above, so for this run we simply specify a cache-max-entry-count of 0.4:

```bash
lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.4
```

## Deploying the LLM with LMDeploy

### Setting up an API server: lmdeploy serve api_server

Start lmdeploy as an API Server:

```bash
lmdeploy serve api_server \
    /root/internlm2-chat-1_8b-4bit \
    --model-format awq \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1 \
    --cache-max-entry-count 0.4
```

Here, model-format and quant-policy are the same parameters used for quantized inference above; server-name and server-port are the IP and port the API server listens on; and tp is the degree of tensor parallelism (the number of GPUs). Note that the assignment requires W4A16 quantization with a KV Cache fraction of 0.4, so the model path, model format and cache-max-entry-count all have to be set by hand; the command above already does this. (A minimal example of calling this server over HTTP is sketched at the very end of this page.)

You can run lmdeploy serve api_server -h to see more parameters and usage (this looks largely the same as before...):

```
(lmdeploy) (base) root@intern-studio-160311:~# lmdeploy serve api_server -h usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT] [--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials] [--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]] [--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH] [--backend {pytorch,turbomind}] [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}] [--api-keys [API_KEYS ...]] [--ssl] [--meta-instruction META_INSTRUCTION] [--chat-template CHAT_TEMPLATE] [--cap {completion,infilling,chat,python}] [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME] [--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--model-format {hf,llama,awq}] [--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR] model_path Serve LLMs with restful api using fastapi. positional arguments: model_path The path of a model. it could be one of the following options: - i) a local directory path of a turbomind model which is converted by `lmdeploy convert` command or download from ii) and iii). - ii) the model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as "internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a model hosted inside a model repo on huggingface.co, such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat ", "baichuan-inc/baichuan2-7b-chat" and so on. Type: str options: -h, --help show this help message and exit --server-name SERVER_NAME Host ip for serving. Default: 0.0.0.0. Type: str --server-port SERVER_PORT Server port. Default: 23333. Type: int --allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...] A list of allowed origins for cors. Default: ['*']. Type: str --allow-credentials Whether to allow credentials for cors. Default: False --allow-methods ALLOW_METHODS [ALLOW_METHODS ...] A list of allowed http methods for cors. Default: ['*']. Type: str --allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...] A list of allowed http headers for cors. Default: ['*']. Type: str --qos-config-path QOS_CONFIG_PATH Qos policy config path. Default: . Type: str --backend {pytorch,turbomind} Set the inference backend. Default: turbomind. Type: str --log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET} Set the log level. Default: ERROR. Type: str --api-keys [API_KEYS ...] Optional list of space separated API keys. Default: None. Type: str --ssl Enable SSL. Requires OS Environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'. Default: False --meta-instruction META_INSTRUCTION System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template instead.
Default: None. Type: str --chat-template CHAT_TEMPLATE A JSON file or string that specifies the chat template configuration. Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification. Default: None. Type: str --cap {completion,infilling,chat,python} The capability of a model. Deprecated. Please use --chat-template instead. Default: chat. Type: str PyTorch engine arguments: --adapters [ADAPTERS ...] Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple lora adapters. If only have one adapter, one can only input the path of the adapter.. Default: None. Type: str --tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int --model-name MODEL_NAME The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run `lmdeploy list` to get the supported model names. Default: None. Type: str --session-len SESSION_LEN The max session length of a sequence. Default: None. Type: int --max-batch-size MAX_BATCH_SIZE Maximum batch size. Default: 128. Type: int --cache-max-entry-count CACHE_MAX_ENTRY_COUNT The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float --cache-block-seq-len CACHE_BLOCK_SEQ_LEN The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch Engine, if Lora Adapter is specified, this parameter will be ignored. Default: 64. Type: int TurboMind engine arguments: --tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int --model-name MODEL_NAME The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run `lmdeploy list` to get the supported model names. Default: None. Type: str --session-len SESSION_LEN The max session length of a sequence. Default: None. Type: int --max-batch-size MAX_BATCH_SIZE Maximum batch size. Default: 128. Type: int --cache-max-entry-count CACHE_MAX_ENTRY_COUNT The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float --cache-block-seq-len CACHE_BLOCK_SEQ_LEN The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch Engine, if Lora Adapter is specified, this parameter will be ignored. Default: 64. Type: int --model-format {hf,llama,awq} The format of input model. `hf` meaning `hf_llama`, `llama` meaning `meta_llama`, `awq` meaning the quantized model by awq. Default: None. Type: str --quant-policy QUANT_POLICY Whether to use kv int8. Default: 0. Type: int --rope-scaling-factor ROPE_SCALING_FACTOR Rope scaling factor. Default: 0.0. 
Type: float
```

You can also open http://localhost:23333 directly to view the Swagger documentation of the API for detailed usage instructions. Of course, this first requires forwarding port 23333 to your local machine over SSH:

```bash
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your ssh port>
```

### Connecting to the API server from the command line: lmdeploy serve api_client

Run the command-line client:

```bash
lmdeploy serve api_client http://localhost:23333
```

Once it is running you can chat with the model directly in the terminal:

The server-side output at this point:

Resource usage:

### Connecting to the server through Gradio: lmdeploy serve gradio

Use Gradio as the front end and start a web server. Open a new terminal on the remote dev machine and run:

```bash
lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006
```

Then forward port 6006 locally:

```bash
ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p <your ssh port>
```

Open a browser at http://127.0.0.1:6006 to chat with the model:

Server-side output at this point:

Gradio-side output:

Resource usage:

### Python code integration: lmdeploy.pipeline

Create the file /root/pipeline_kv.py:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# The assignment asks to enable W4A16 quantization and set the KV Cache fraction
# to 0.4, so this differs slightly from the tutorial.
# Set the KV Cache fraction to 0.4.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
# Use the W4A16-quantized model.
model_path = '/root/internlm2-chat-1_8b-4bit'
pipe = pipeline(model_path, backend_config=backend_config)
# Run the pipeline. Passing a list of prompts lets lmdeploy batch them and
# return one output per prompt.
response = pipe(['Hi, pls intro yourself', '上海是', 'please provide three suggestions about time management'])
print(response)
```

Run it and the results come back quickly:

The answers are the same as before; no obvious difference is visible. In the code above, the backend_config argument of pipeline() is optional.

### Comparing inference speed

We measured the speed of the uncompressed transformers baseline earlier; now let's measure LMDeploy's inference speed. Create a Python file benchmark_lmdeploy.py with the following contents:

```python
import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i + 1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.4f} words/s".format(speed))
```

The result:

Compared with the transformers baseline measured earlier, this is about 7x faster.

If pipeline('/root/internlm2-chat-1_8b') is replaced with pipeline('/root/internlm2-chat-1_8b-4bit'), the result is:

about 17x faster than the transformers baseline.

## Quantized deployment of llava with LMDeploy

### Environment setup

Install the llava dependencies in the lmdeploy environment we created:

```bash
pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874
```

### Running compressed llava from Python

Create /root/pipeline_llava.py with the following contents:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')  # use this outside the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

# Download a picture of a tiger from GitHub.
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```

Run it:

Resource usage sticks at 80%:

However, after switching to a different image, it produces no output at all at runtime.

### Running compressed llava through Gradio

Create /root/pipeline_llava_gradio.py with the following contents:

```python
import gradio as gr
from lmdeploy import pipeline

# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')  # use this outside the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model,
                    inputs=[gr.Image(type="pil"), gr.Textbox()],
                    outputs=gr.Chatbot())
demo.launch()
```

As before, forward port 7860 over SSH:

```bash
ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p <your ssh port>
```

Then visit http://127.0.0.1:7860 in a browser.

However, the model produced no output at all. Switching back to the image from the tutorial:

and the output is normal again. So that's where things stand.
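As referenced in the API server section above, here is a minimal sketch of how the running api_server can be called over HTTP. LMDeploy's api_server exposes OpenAI-style endpoints such as /v1/chat/completions; the model name used below is an assumption — check GET /v1/models (or the Swagger page on port 23333) for the name the server actually registered.

```python
import requests

# Assumes the api_server from the section above is running on port 23333
# (forwarded to localhost). The "model" value is an assumption; query
# /v1/models if the request is rejected.
url = "http://localhost:23333/v1/chat/completions"
payload = {
    "model": "internlm2-chat-1_8b-4bit",
    "messages": [{"role": "user", "content": "please provide three suggestions about time management"}],
    "temperature": 0.8,
}
resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```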