Slow inference: looking for guidance on speeding it up

#6
by Alanxc - opened

My current setup: I pull qwen2:72b with Ollama, build a task chain with LangChain, wrap it in a Flask POST endpoint, and test it with Postman. The business task is text analysis that outputs JSON, but a single request is painfully slow. Could anyone advise on a sensible way to speed this up? My hardware is 2× H800 80GB (PCIe); during inference the model uses only 40+ GB of VRAM and doesn't even fill one card, so I don't think it's a hardware limitation.

Any recommendations welcome.
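A first diagnostic step is to separate model-side latency from the LangChain/Flask wrapper: call Ollama's HTTP API directly and read the timing fields it returns in the `/api/generate` response (`eval_count` = generated tokens, `eval_duration` = generation time in nanoseconds). The helper below is a minimal sketch; the sample payload is made up for illustration.

```python
# Hedged sketch: compute generation speed from the timing fields that
# Ollama's /api/generate endpoint includes in its JSON response.
# The helper itself is pure Python; only the field names come from Ollama.

def tokens_per_second(resp: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count    -- number of tokens generated
    eval_duration -- time spent generating, in nanoseconds
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative (made-up) response payload: 120 tokens in 30 seconds.
sample = {"eval_count": 120, "eval_duration": 30_000_000_000}
print(tokens_per_second(sample))  # 4.0
```

If the tokens/s measured this way is already low, the bottleneck is the model serving itself rather than the chain or the web layer.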
