
llama.cpp usage

August 11, 2025

# Get the code
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Compilation
# NVIDIA GPU on Windows: install Visual Studio first, then the CUDA Toolkit
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j8

# Intel iGPU on Windows: install Visual Studio first, then oneAPI, and build with the SYCL script
./examples/sycl/build.sh
# Usage: CLI (-ngl 99 offloads all layers to the GPU, -fa enables flash attention)
/dfs/data/LlamaCpp/llama.cpp-master/build/bin/llama-cli -m model.gguf --single-turn -cnv -fa -p "Tell me something about Beijing." -ngl 99
# Usage: server
## Linux: start the model service
/dfs/data/LlamaCpp/llama.cpp-master/build/bin/llama-server -fa -m model.gguf -ngl 99 --ctx-size 8192 --predict 1024 --temp 0.8 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 --rope-freq-base 500000

## Query the server over HTTP with curl
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Tell me something about Beijing.","n_predict": 128}'
	
## Windows

### iGPU builds need the oneAPI environment loaded first
#### In cmd
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
#### In PowerShell
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'

llama.cpp\build\bin\llama-server.exe -m model.gguf -c 2048
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data "{\"prompt\": \"Tell me something about Beijing.\",\"n_predict\": 2048}"
	
# Python: OpenAI-compatible client
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required",
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Tell me something about Beijing."}
    ],
    max_tokens=512,
)

print(completion.choices[0].message.content)
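
The same endpoint supports streaming. A minimal sketch that prints tokens as they arrive, reusing the client above (llama-server serves whatever GGUF it loaded, so the model name is only a placeholder):

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me something about Beijing."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # incremental text, may be None
    if delta:
        print(delta, end="", flush=True)
print()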

# Python: shell out to curl via subprocess
import subprocess, json

prompt="Tell me something about Beijing."
json_data={"prompt": prompt,"n_predict": 512}
#print(json.dumps(json_data))
cmd_line = ["curl", "--request", "POST", "--url", "http://localhost:8080/completion", "--header", "Content-Type: application/json", "--data", json.dumps(json_data)]
result = subprocess.run(cmd_line, capture_output=True, text=True)
print(result.stdout)
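
The same POST can also be made natively with the third-party requests library (assumes pip install requests), avoiding the subprocess round-trip and parsing the JSON reply directly:

import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Tell me something about Beijing.", "n_predict": 512},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])  # /completion returns the generated text in "content"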

HF model to GGUF

# Convert a Hugging Face checkpoint directory to an FP16 GGUF model
python llama.cpp-master/convert_hf_to_gguf.py Model_Path
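
The converter also accepts --outfile and --outtype flags to control the output path and precision; a sketch invoking it from Python (file names are placeholder assumptions):

import subprocess

subprocess.run(
    [
        "python", "llama.cpp-master/convert_hf_to_gguf.py", "Model_Path",
        "--outfile", "model-f16.gguf",  # where to write the GGUF
        "--outtype", "f16",             # e.g. f32, f16, bf16, q8_0
    ],
    check=True,
)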

Offline quantization

llama-quantize model.gguf q4_0
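
llama-quantize also accepts an explicit output path and a thread count as positional arguments; a sketch driving it from Python (file names are placeholder assumptions):

import subprocess

# llama-quantize <input.gguf> [output.gguf] <type> [nthreads]
subprocess.run(
    ["llama-quantize", "model-f16.gguf", "model-q4_0.gguf", "q4_0", "8"],
    check=True,
)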

Converting a self-trained QAT or PTQ model to GGUF format

This gives better quality than the official default quantization method. The tooling is applied internally and is not open source for the time being.

