Overview
This guide provides detailed instructions on deploying and running the DeepSeek V3 model in your local environment. We'll cover the complete process from basic setup to advanced deployment options, helping you choose the most suitable deployment strategy.
Environment Setup
Basic Requirements
- NVIDIA GPU (A100 or H100 recommended) or AMD GPU
- Sufficient system memory (32GB+ recommended)
- Linux operating system (Ubuntu 20.04 or higher recommended)
- Python 3.8 or higher
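Before cloning anything, it is worth confirming the basics above are in place. A minimal sanity check, assuming an NVIDIA machine (use rocm-smi instead on AMD GPUs):

# Verify the GPU driver and visible devices
nvidia-smi

# Verify the Python version (3.8 or higher)
python3 --version

# Verify available system memory
free -h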
Code and Model Preparation
- Clone the official repository:
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
pip install -r requirements.txt
- Download model weights:
  - Download the official model weights from Hugging Face (see the command below)
  - Place the weight files in the designated directory
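One convenient way to fetch the weights is the Hugging Face Hub CLI. The repository id deepseek-ai/DeepSeek-V3 and the target path below are illustrative; point them at the checkpoint and directory you actually intend to use, and make sure you have several hundred GB of free disk space.

# Install the Hugging Face Hub CLI
pip install -U "huggingface_hub[cli]"

# Download the weights into a local directory (path is a placeholder)
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /path/to/DeepSeek-V3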
Deployment Options
1. DeepSeek-Infer Demo Deployment
This is the basic deployment method, suitable for quick testing and experimentation:
# Convert the Hugging Face checkpoint into the demo's sharded format
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 \
    --save-path /path/to/DeepSeek-V3-Demo \
    --n-experts 256 \
    --model-parallel 16

# Run interactive generation across 2 nodes with 8 GPUs each
# (--node-rank and --master-addr are torchrun options, so they must come before generate.py)
torchrun --nnodes 2 --nproc-per-node 8 \
    --node-rank $RANK \
    --master-addr $ADDR \
    generate.py \
    --ckpt-path /path/to/DeepSeek-V3-Demo \
    --config configs/config_671B.json \
    --interactive \
    --temperature 0.7 \
    --max-new-tokens 200
2. SGLang Deployment (Recommended)
SGLang v0.4.1 fully supports DeepSeek-V3 and delivers excellent latency and throughput:
- MLA optimization support
- FP8 (W8A8) support
- FP8 KV cache support
- Torch Compile support
- NVIDIA and AMD GPU support
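A minimal sketch of launching SGLang's OpenAI-compatible server follows. The version pin, model path, tensor-parallel size, and port are illustrative assumptions; check sglang.launch_server --help for the flags your installed version accepts.

# Install SGLang with its serving dependencies (version pin is illustrative)
pip install "sglang[all]>=0.4.1"

# Launch an OpenAI-compatible server, tensor-parallel across 8 GPUs
python -m sglang.launch_server --model-path /path/to/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --port 30000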
3. LMDeploy Deployment (Recommended)
LMDeploy provides enterprise-grade deployment solutions:
- Offline pipeline processing
- Online service deployment
- PyTorch workflow integration
- Optimized inference performance
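As a rough sketch, LMDeploy can expose the model through its OpenAI-compatible api_server. The backend, tensor-parallel size, and port below are placeholder values; verify the exact flags with lmdeploy serve api_server --help for your installed version.

# Install LMDeploy
pip install lmdeploy

# Serve the model with the PyTorch engine, tensor-parallel across 8 GPUs
lmdeploy serve api_server /path/to/DeepSeek-V3 \
    --backend pytorch \
    --tp 8 \
    --server-port 23333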
4. TRT-LLM Deployment (Recommended)
TensorRT-LLM features:
- BF16 and INT4/INT8 weight support
- Upcoming FP8 support
- Optimized inference speed
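TensorRT-LLM takes a two-step approach: convert the checkpoint into TensorRT-LLM format, then compile an inference engine from it. The outline below only sketches that flow; the conversion step is model-specific, so follow the DeepSeek-V3 example in the TensorRT-LLM repository, and treat the paths here as placeholders.

# 1. Convert the Hugging Face checkpoint to TensorRT-LLM format
#    (use the DeepSeek-V3 conversion script from the TensorRT-LLM examples)

# 2. Compile an engine from the converted checkpoint (paths are placeholders)
trtllm-build --checkpoint_dir /path/to/trtllm-ckpt \
    --output_dir /path/to/trtllm-engine

# 3. Point your TensorRT-LLM runtime or server at the compiled engine directory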
5. vLLM Deployment (Recommended)
vLLM v0.6.6 features:
- FP8 and BF16 mode support
- NVIDIA and AMD GPU support
- Pipeline parallelism capability
- Multi-machine distributed deployment
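For instance, vLLM's OpenAI-compatible server can be started with a single command. The version pin, tensor-parallel size, and context length below are illustrative; for multi-machine setups you would additionally set --pipeline-parallel-size and run a Ray cluster, as described in the vLLM docs.

# Install vLLM (version pin is illustrative)
pip install "vllm>=0.6.6"

# Launch an OpenAI-compatible server, tensor-parallel across 8 GPUs
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 8192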
Performance Optimization Tips
- Memory Optimization:
  - Use FP8 or INT8 quantization to reduce memory usage
  - Enable KV cache optimization
  - Set appropriate batch sizes (see the example after this list)
- Speed Optimization:
  - Enable Torch Compile
  - Use pipeline parallelism
  - Optimize input/output processing
- Stability Optimization:
  - Implement error handling mechanisms
  - Add monitoring and logging
  - Regular system resource checks
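As one concrete illustration of the memory-related knobs, here is how the trade-off might look with vLLM's server flags (flag names current as of v0.6.x; substitute the equivalents for whichever engine you chose):

# Reduce memory pressure: FP8 KV cache, conservative GPU memory budget, capped batch size
vllm serve deepseek-ai/DeepSeek-V3 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 64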
Common Issues and Solutions
- Memory Issues:
  - Reduce batch size
  - Use lower precision
  - Enable memory optimization options
- Performance Issues:
  - Check GPU utilization (see the checks below)
  - Optimize model configuration
  - Adjust parallel strategies
- Deployment Errors:
  - Check environment dependencies
  - Verify model weights
  - Review detailed logs
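A few quick commands cover most of these first-line checks on an NVIDIA Linux machine (the weight path is a placeholder):

# Watch GPU utilization and memory in real time
watch -n 1 nvidia-smi

# Verify that installed Python packages have consistent dependencies
pip check

# Confirm the downloaded weight files are present and roughly the expected total size
du -sh /path/to/DeepSeek-V3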
Next Steps
After basic deployment, you can:
- Conduct performance benchmarking
- Optimize configuration parameters
- Integrate with existing systems
- Develop custom features
You now have an overview of the main options for deploying DeepSeek V3 locally. Choose the one that best suits your needs and start building your AI applications!