
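Deploying Qwen2-Audio on Windows
Qwen2-Audio is an open-source AI audio model released by Alibaba's Tongyi Qianwen (Qwen) team. It supports voice chat and audio analysis: it accepts audio and text input, generates text output, and handles multiple languages and dialects.
This post walks through deploying the Qwen2-Audio model on Windows, covering environment setup, model download, and inference, so that you can get it running quickly and try it out.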
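Environment setup
OS: Windows 11
Installed: Anaconda Navigator
Python 3.12.7
GPU: a 5090D in my case
The 5090D needs a recent PyTorch build before CUDA can be used; with an older build, PyTorch warns that CUDA is unavailable. A build targeting CUDA 12.8 or later is recommended.
The following command installs the latest (nightly) PyTorch build: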
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
If an older version is already installed, uninstall it first and then run the install command above.
pip uninstall torch torchvision torchaudio
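After installing, it can help to confirm that the PyTorch build actually sees the GPU. The following is a minimal sketch, not part of the original setup: the helper names `parse_cuda_version`, `meets_minimum`, and `report` are made up for illustration, and the (12, 8) threshold simply mirrors the CUDA 12.8 recommendation above.

```python
from typing import Optional, Tuple


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Turn a string such as '12.8' into (12, 8); return None for CPU-only builds."""
    if not version:
        return None
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def meets_minimum(version: Optional[str], minimum: Tuple[int, int] = (12, 8)) -> bool:
    """True if the CUDA version PyTorch was built against is at least `minimum`."""
    parsed = parse_cuda_version(version)
    return parsed is not None and parsed >= minimum


def report() -> None:
    """Print what the installed PyTorch build can see; safe to call without a GPU."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    print("torch version:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)
    print("CUDA 12.8 or newer:", meets_minimum(torch.version.cuda))
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    report()
```

If "CUDA available" prints False, or the build reports a CUDA version below 12.8, reinstalling with the nightly cu128 command above is the first thing to try.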
Download and installation
First, clone Qwen2-Audio from GitHub:
git clone https://github.com/QwenLM/Qwen2-Audio
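Once the clone finishes, create a virtual environment from the command line inside that directory.
# Use your own Python version here
conda create --name myenv python=3.12.7
# Activate the virtual environment
conda activate myenv
Install the dependencies listed in requirements_web_demo.txt.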
pip install -r requirements_web_demo.txt
pip install accelerate
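Next, download the model. It is available from both ModelScope (https://modelscope.cn/models/qwen/Qwen2-Audio-7B-Instruct) and Hugging Face (https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct). Five files in particular matter (the file listing was an image and is not reproduced here).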
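As an alternative to downloading through the browser, the model repository can be fetched from Python and the result inspected. This is a hedged sketch: it assumes the `huggingface_hub` package is installed and that the weights ship as `*.safetensors` files; the function names `download_model`, `summarize_model_dir`, and `weight_shards` are hypothetical helpers, not part of Qwen2-Audio.

```python
from pathlib import Path
from typing import List, Tuple


def download_model(local_dir: str, repo_id: str = "Qwen/Qwen2-Audio-7B-Instruct") -> str:
    """Download the whole model repository into local_dir and return its path."""
    # Imported lazily so the inspection helpers below work without huggingface_hub.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def summarize_model_dir(model_dir: str) -> List[Tuple[str, int]]:
    """Return (file name, size in bytes) for each file directly inside model_dir, sorted by name."""
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"model directory not found: {root}")
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())


def weight_shards(model_dir: str) -> List[str]:
    """Names of the *.safetensors files (assumed here to be the weight shards)."""
    return [name for name, _ in summarize_model_dir(model_dir) if name.endswith(".safetensors")]


# Example usage (commented out because the download is large):
# path = download_model(r"D:\Learning\AI\audio\qwen2-audio\Qwen2-Audio-7B-Instruct")
# for name, size in summarize_model_dir(path):
#     print(f"{name}: {size / 1e9:.2f} GB")
```

Comparing the printed sizes with the sizes shown on the model page is a quick way to spot an interrupted download.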
I placed the model in D:\Learning\AI\audio\qwen2-audio\Qwen2-Audio-7B-Instruct. Next, edit web_demo_audio.py in the demo directory:
# First, point the script at the model directory
DEFAULT_CKPT_PATH = 'D:\\Learning\\AI\\audio\\qwen2-audio\\Qwen2-Audio-7B-Instruct'
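The doubled backslashes in that path are there because a single backslash starts an escape sequence in an ordinary Python string. The short sketch below (illustrative only) shows three equivalent spellings and why the unescaped form is risky for this particular path, which contains `\a`.

```python
from pathlib import PureWindowsPath

# Three ways to write the same Windows path as a Python string.
escaped = 'D:\\Learning\\AI\\audio\\qwen2-audio\\Qwen2-Audio-7B-Instruct'  # doubled backslashes
raw = r'D:\Learning\AI\audio\qwen2-audio\Qwen2-Audio-7B-Instruct'          # raw string literal
forward = 'D:/Learning/AI/audio/qwen2-audio/Qwen2-Audio-7B-Instruct'       # forward slashes

# The escaped and raw literals produce identical strings.
print(escaped == raw)

# Windows path handling treats both separators the same way.
print(PureWindowsPath(raw) == PureWindowsPath(forward))

# Pitfall: in a normal string, '\a' is one control character (BEL),
# so '\audio' has 5 characters instead of 6 and the path silently breaks.
print(len('\audio'))
```

Any of the three spellings works for DEFAULT_CKPT_PATH; the raw string is usually the easiest to read.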
Then change the code so that every input tensor is moved to the GPU. The original line moved only input_ids, which leaves the other inputs on the CPU and produces this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
    if not _get_args().cpu_only:
        # inputs["input_ids"] = inputs.input_ids.to("cuda")
        inputs = {k: v.to("cuda") for k, v in inputs.items()}  # Move all inputs to GPU
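To make the effect of that dict comprehension concrete, here is a small sketch that runs without PyTorch. `FakeTensor` is a hypothetical stand-in for `torch.Tensor`, `move_to_device` is an illustrative helper, and the key names in `batch` are examples rather than a guaranteed list of what the processor returns.

```python
from typing import Any, Dict


def move_to_device(batch: Dict[str, Any], device: str) -> Dict[str, Any]:
    """Return a new dict in which every value that has a .to() method is moved to device."""
    return {k: (v.to(device) if hasattr(v, "to") else v) for k, v in batch.items()}


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch (hypothetical helper)."""

    def __init__(self, device: str = "cpu") -> None:
        self.device = device

    def to(self, device: str) -> "FakeTensor":
        # Like torch.Tensor.to, this returns a new object instead of changing self.
        return FakeTensor(device)


batch = {
    "input_ids": FakeTensor(),
    "attention_mask": FakeTensor(),
    "input_features": FakeTensor(),
}
moved = move_to_device(batch, "cuda")
print({k: v.device for k, v in moved.items()})
```

Note that the result is a plain dict, so later code has to use `inputs["input_ids"]` rather than `inputs.input_ids`. The object returned by the processor also offers its own `.to()` method, so `inputs = inputs.to("cuda")` may be a simpler alternative, though that variant is untested here.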
The complete code is as follows:
import gradio as gr
import modelscope_studio as mgr
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from argparse import ArgumentParser

# D:\Learning\AI\audio\qwen2-audio\Qwen2-Audio-7B-Instruct
# DEFAULT_CKPT_PATH = 'Qwen/Qwen2-Audio-7B-Instruct'
# DEFAULT_CKPT_PATH = r'D:\Learning\AI\audio\qwen2-audio\Qwen2-Audio-7B-Instruct'
DEFAULT_CKPT_PATH = 'D:\\Learning\\AI\\audio\\qwen2-audio\\Qwen2-Audio-7B-Instruct'


def _get_args():
    parser = ArgumentParser()
    parser.add_argument("-c", "--checkpoint-path", type=str, default=DEFAULT_CKPT_PATH,
                        help="Checkpoint name or path, default to %(default)r")
    parser.add_argument("--cpu-only", action="store_true", help="Run demo with CPU only")
    parser.add_argument("--inbrowser", action="store_true", default=False,
                        help="Automatically launch the interface in a new tab on the default browser.")
    parser.add_argument("--server-port", type=int, default=8000,
                        help="Demo server port.")
    parser.add_argument("--server-name", type=str, default="127.0.0.1",
                        help="Demo server name.")
    args = parser.parse_args()
    return args


def add_text(chatbot, task_history, input):
    text_content = input.text
    content = []
    if len(input.files) > 0:
        for i in input.files:
            content.append({'type': 'audio', 'audio_url': i.path})
    if text_content:
        content.append({'type': 'text', 'text': text_content})
    task_history.append({"role": "user", "content": content})
    chatbot.append([{
        "text": input.text,
        "files": input.files,
    }, None])
    return chatbot, task_history, None


def add_file(chatbot, task_history, audio_file):
    """Add audio file to the chat history."""
    task_history.append({"role": "user", "content": [{"audio": audio_file.name}]})
    chatbot.append((f"[Audio file: {audio_file.name}]", None))
    return chatbot, task_history


def reset_user_input():
    """Reset the user input field."""
    return gr.Textbox.update(value='')


def reset_state(task_history):
    """Reset the chat history."""
    return [], []


def regenerate(chatbot, task_history):
    """Regenerate the last bot response."""
    if task_history and task_history[-1]['role'] == 'assistant':
        task_history.pop()
        chatbot.pop()
    if task_history:
        chatbot, task_history = predict(chatbot, task_history)
    return chatbot, task_history


def predict(chatbot, task_history):
    """Generate a response from the model."""
    print(f"{task_history=}")
    print(f"{chatbot=}")
    text = processor.apply_chat_template(task_history, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in task_history:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(ele['audio_url'], sr=processor.feature_extractor.sampling_rate)[0]
                    )

    if len(audios) == 0:
        audios = None
    print(f"{text=}")
    print(f"{audios=}")
    inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
    if not _get_args().cpu_only:
        # inputs["input_ids"] = inputs.input_ids.to("cuda")
        inputs = {k: v.to("cuda") for k, v in inputs.items()}  # Move all inputs to GPU

    generate_ids = model.generate(**inputs, max_length=256)
    # generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    generate_ids = generate_ids[:, inputs["input_ids"].size(1):]

    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"{response=}")
    task_history.append({'role': 'assistant',
                         'content': response})
    chatbot.append((None, response))  # Add the response to chatbot
    return chatbot, task_history


def _launch_demo(args):
    with gr.Blocks() as demo:
        gr.Markdown(
            """<p align="center"><img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/blog/qwenaudio/qwen2audio_logo.png" style="height: 80px"/><p>""")
        gr.Markdown("""<center><font size=8>Qwen2-Audio-Instruct Bot</center>""")
        gr.Markdown(
            """\
<center><font size=3>This WebUI is based on Qwen2-Audio-Instruct, developed by Alibaba Cloud. \
(本WebUI基于Qwen2-Audio-Instruct打造,实现聊天机器人功能。)</center>""")
        gr.Markdown("""\
<center><font size=4>Qwen2-Audio <a href="https://modelscope.cn/models/qwen/Qwen2-Audio-7B">🤖 </a>
| <a href="https://huggingface.co/Qwen/Qwen2-Audio-7B">🤗</a> |
Qwen2-Audio-Instruct <a href="https://modelscope.cn/models/qwen/Qwen2-Audio-7B-Instruct">🤖 </a> |
<a href="https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct">🤗</a> |
<a href="https://github.com/QwenLM/Qwen2-Audio">Github</a></center>""")
        chatbot = mgr.Chatbot(label='Qwen2-Audio-7B-Instruct', elem_classes="control-height", height=750)

        user_input = mgr.MultimodalInput(
            interactive=True,
            sources=['microphone', 'upload'],
            submit_button_props=dict(value="🚀 Submit (发送)"),
            upload_button_props=dict(value="📁 Upload (上传文件)", show_progress=True),
        )
        task_history = gr.State([])

        with gr.Row():
            empty_bin = gr.Button("🧹 Clear History (清除历史)")
            regen_btn = gr.Button("🤔️ Regenerate (重试)")

        user_input.submit(fn=add_text,
                          inputs=[chatbot, task_history, user_input],
                          outputs=[chatbot, task_history, user_input]).then(
            predict, [chatbot, task_history], [chatbot, task_history], show_progress=True
        )
        empty_bin.click(reset_state, outputs=[chatbot, task_history], show_progress=True)
        regen_btn.click(regenerate, [chatbot, task_history], [chatbot, task_history], show_progress=True)

    demo.queue().launch(
        share=False,
        inbrowser=args.inbrowser,
        server_port=args.server_port,
        server_name=args.server_name,
    )


if __name__ == "__main__":
    args = _get_args()
    if args.cpu_only:
        device_map = "cpu"
    else:
        device_map = "auto"

    model = Qwen2AudioForConditionalGeneration.from_pretrained(
        args.checkpoint_path,
        torch_dtype="auto",
        device_map=device_map,
        resume_download=True,
    ).eval()
    model.generation_config.max_new_tokens = 2048  # For chat.
    print("generation_config", model.generation_config)
    processor = AutoProcessor.from_pretrained(args.checkpoint_path, resume_download=True)
    _launch_demo(args)
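The script passes the conversation around as `task_history`, a list of role/content dicts, and `predict` walks that list to find audio files. The sketch below isolates that data structure so it can be tried without a model or Gradio. The helpers `build_user_message` and `collect_audio_urls` and the `.wav` file names are hypothetical; they only mirror the shapes used by `add_text` and `predict` above.

```python
from typing import Dict, List, Optional


def build_user_message(text: Optional[str], audio_paths: List[str]) -> Dict:
    """Build one user turn in the same shape add_text() appends to task_history."""
    content = [{"type": "audio", "audio_url": path} for path in audio_paths]
    if text:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def collect_audio_urls(task_history: List[Dict]) -> List[str]:
    """Gather every audio path in the history, mirroring the loop at the top of predict()."""
    urls = []
    for message in task_history:
        # Assistant turns store a plain string, so only list-valued content is scanned.
        if isinstance(message["content"], list):
            for element in message["content"]:
                if element["type"] == "audio":
                    urls.append(element["audio_url"])
    return urls


history = [
    build_user_message("What is said in this clip?", ["clip1.wav"]),  # hypothetical file
    {"role": "assistant", "content": "A short greeting."},
    build_user_message("And this one?", ["clip2.wav"]),  # hypothetical file
]
print(collect_audio_urls(history))
```

Because every turn's audio is collected, each new request reloads all clips from the whole conversation, which is worth keeping in mind for long sessions.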
Now run it.
python web_demo_audio.py
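With that, the model is deployed locally.
I ran into a few problems along the way:
Problem 1
The web UI sometimes had no input box. This puzzled me at first; the input box appeared after I ran the following command.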
pip install modelscope-studio==0.5.2
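Problem 2
After recording audio, editing the text, and sending the message, an error occurred.
The error was along the lines of "Exception in ASGI application". After trying various things, changing the gradio version finally fixed it.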
pip install gradio --upgrade -i https://mirrors.cloud.tencent.com/pypi/simple
For reference, here is my pip package list:
(myenv) D:\Learning\AI\audio\qwen2-audio>pip list
WARNING: Ignoring invalid distribution ~radio-client (D:\ProgramData\anaconda3\envs\myenv\Lib\site-packages)
WARNING: Ignoring invalid distribution ~umpy (D:\ProgramData\anaconda3\envs\myenv\Lib\site-packages)
Package Version
------------------------- ------------------------
accelerate 1.6.0
aiofiles 23.2.1
altair 5.5.0
annotated-types 0.7.0
anyio 4.9.0
asttokens 3.0.0
attrs 25.3.0
audioread 3.0.1
certifi 2025.4.26
cffi 1.17.1
charset-normalizer 3.4.2
click 8.1.8
colorama 0.4.6
comm 0.2.2
contourpy 1.3.2
cycler 0.12.1
debugpy 1.8.14
decorator 5.2.1
executing 2.2.0
fastapi 0.115.12
ffmpy 0.5.0
filelock 3.18.0
fonttools 4.57.0
fsspec 2025.3.2
gradio 5.29.0
gradio_client 1.10.0
groovy 0.1.2
h11 0.16.0
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.30.2
idna 3.10
importlib_resources 6.5.2
ipykernel 6.29.5
ipython 9.2.0
ipython_pygments_lexers 1.1.1
jedi 0.19.2
Jinja2 3.1.4
joblib 1.5.0
jsonschema 4.23.0
jsonschema-specifications 2025.4.1
jupyter_client 8.6.3
jupyter_core 5.7.2
kiwisolver 1.4.8
lazy_loader 0.4
librosa 0.11.0
llvmlite 0.44.0
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
modelscope_studio 0.5.2
mpmath 1.3.0
msgpack 1.1.0
narwhals 1.37.1
nest-asyncio 1.6.0
networkx 3.3
numba 0.61.2
numpy 1.26.4
orjson 3.10.18
packaging 25.0
pandas 2.2.3
parso 0.8.4
pillow 10.4.0
pip 25.1
platformdirs 4.3.7
pooch 1.8.2
prompt_toolkit 3.0.51
psutil 7.0.0
pure_eval 0.2.3
pycparser 2.22
pydantic 2.11.4
pydantic_core 2.33.2
pydub 0.25.1
Pygments 2.19.1
pyparsing 3.2.3
python-dateutil 2.9.0.post0
python-multipart 0.0.20
pytz 2025.2
pywin32 310
PyYAML 6.0.2
pyzmq 26.4.0
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rich 14.0.0
rpds-py 0.24.0
ruff 0.11.8
safehttpx 0.1.6
safetensors 0.5.3
scikit-learn 1.6.1
scipy 1.15.2
semantic-version 2.10.0
setuptools 78.1.1
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soxr 0.5.0.post1
stack-data 0.6.3
starlette 0.46.2
sympy 1.13.3
threadpoolctl 3.6.0
tokenizers 0.21.1
tomlkit 0.12.0
torch 2.8.0.dev20250503+cu128
torchaudio 2.6.0.dev20250504+cu128
torchvision 0.22.0.dev20250504+cu128
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.52.0.dev0
typer 0.15.3
typing_extensions 4.13.2
typing-inspection 0.4.0
tzdata 2025.2
urllib3 2.4.0
uvicorn 0.34.2
wcwidth 0.2.13
websockets 11.0.3
wheel 0.45.1
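Since both problems above came down to package versions, a quick way to compare another environment against this list is to query the installed versions from Python. This is a minimal sketch using the standard library; the `EXPECTED` dict copies a few versions from the list above, and `installed_version` and `compare` are illustrative helper names.

```python
from importlib import metadata
from typing import Dict, Optional

# A few versions copied from the package list above.
EXPECTED = {
    "gradio": "5.29.0",
    "modelscope_studio": "0.5.2",
    "transformers": "4.52.0.dev0",
    "accelerate": "1.6.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare(expected: Dict[str, str]) -> Dict[str, str]:
    """Map each package to 'ok', 'missing', or a note describing the mismatch."""
    result = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            result[package] = "missing"
        elif found == wanted:
            result[package] = "ok"
        else:
            result[package] = f"found {found}, listed {wanted}"
    return result


for name, status in compare(EXPECTED).items():
    print(f"{name}: {status}")
```

A mismatch is not necessarily a problem, but gradio and modelscope_studio are the first two to check if the input box is missing or the ASGI error appears.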
If you run into other problems, feel free to leave a comment.
Reference links
https://juejin.cn/post/7490445394468438028
https://blog.csdn.net/qq_55736201/article/details/142835651
https://github.com/QwenLM/Qwen2-Audio/issues/112

