GPU Inference Performance and Cost Analysis: RTX 4090, P40, and Cloud GPU (A100, H100) Comparison

1. Lambda Labs GPU Instance Creation and Testing

1.1 Log in to the Console

1.2 Download Ollama Models and Test Different DeepSeek Models

# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run DeepSeek-R1 models of increasing size; --verbose prints timing statistics
ollama run deepseek-r1:1.5b --verbose
ollama run deepseek-r1:7b --verbose
ollama run deepseek-r1:14b --verbose
ollama run deepseek-r1:32b --verbose
ollama run deepseek-r1:70b --verbose
ollama run deepseek-r1:671b --verbose

# Check GPU status once, or refresh every second while a model is running
nvidia-smi
watch -n 1 nvidia-smi
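
The --verbose flag makes Ollama print timing statistics after each response, including the prompt eval rate and the eval rate (generation throughput in tokens/s); those eval-rate figures are the numbers compared in the tables below. As a minimal sketch (the prompt is an arbitrary placeholder, not the one used in the original test), a single measurement can be captured like this:

# Hypothetical single-run measurement: pass a prompt and keep only the throughput lines.
# Ollama's timing statistics may be written to stderr, so merge the streams before filtering.
ollama run deepseek-r1:7b --verbose "Explain GPU memory bandwidth in two sentences." 2>&1 | grep -E "eval rate"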

1.3 Test Results

2. Consumer-Grade GPUs: RTX 4090 vs. P40

Model Size | RTX 4090 (tokens/s) | P40 (tokens/s) | Speed Ratio (4090 / P40)
1.5B       | 219-264             | 104-110        | 2.0-2.5x
7B         | 123-139             | 53-58          | 2.3-2.5x
14B        | 78-87               | 17-23          | 3.3-4.3x
32B        | 7.3-7.6             | 9-11           | P40 leads
  • RTX 4090 is the best choice for small to mid-sized models: for 1.5B-14B models, the RTX 4090 achieves 2-4x the speed of the P40 (a quick ratio check is sketched after this list), making it ideal for latency-sensitive applications (e.g., real-time conversations).
  • P40 offers better value for large models: At 32B, P40 outperforms RTX 4090, likely due to memory bandwidth or optimization factors, making it a better choice for budget-conscious large-model inference.
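
The 2-4x claim can be sanity-checked by taking the midpoint of each throughput range above and dividing; a minimal sketch, with the figures simply restating the table values:

# Rough midpoint speed ratio (RTX 4090 / P40) for each model size, from the ranges above
awk 'BEGIN {
  printf "1.5B: %.1fx\n", ((219+264)/2) / ((104+110)/2)
  printf "7B:   %.1fx\n", ((123+139)/2) / ((53+58)/2)
  printf "14B:  %.1fx\n", ((78+87)/2)   / ((17+23)/2)
  printf "32B:  %.2fx  (below 1x: the P40 leads)\n", ((7.3+7.6)/2) / ((9+11)/2)
}'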

3. Cloud GPUs: A100/H100 Cluster Performance and Cost

GPU Configuration | Representative Performance (tokens/s) | Hourly Cost      | Cost Efficiency (tokens/s per $/h)
1x A100           | 1.5B: 206                              | Low ($1.29)      | ~160
8x A100           | 70B: 21                                | Medium ($14.32)  | ~1.5
8x H100           | 671B: 25                               | High ($23.92)    | ~1.0
  • A100 is more cost-effective for small-scale tasks: a single A100 significantly outperforms the clusters in cost efficiency for 1.5B-model inference (the calculation is sketched after this list), making it ideal for individual developers and small-scale applications.
  • Clusters are necessary for large models: per the table above, the 70B model reaches 21 tokens/s on 8x A100 and the 671B model reaches 25 tokens/s on 8x H100, but costs rise sharply, making clusters suitable mainly for enterprise-level, high-load scenarios.
  • H100 offers a significant performance boost: At the same model scale, 8x H100 is 20-50% faster than 8x A100, but the cost increase must be carefully evaluated.
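
The cost-efficiency column is simply throughput divided by hourly price, i.e. (tokens/s) per ($/h). A minimal sketch reproducing the figures from the quoted prices and throughputs:

# Cost efficiency = measured tokens/s divided by the hourly price (tokens/s per $/h)
awk 'BEGIN {
  printf "1x A100, 1.5B:  %.1f\n", 206 / 1.29
  printf "8x A100, 70B:   %.1f\n", 21  / 14.32
  printf "8x H100, 671B:  %.1f\n", 25  / 23.92
}'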

4. Detailed Comparison of GPU Inference Performance

5. Hardware Selection Recommendations

  • Models ≤14B → Prioritize RTX 4090 for high inference speed at low cost.
  • Models ≥32B → Consider multi-GPU P40 setups to balance memory and cost.
  • Standard workloads → Use 8x A100 clusters to balance throughput and cost.
  • Extreme performance needs → Choose 8x H100 clusters, particularly for real-time inference on trillion-parameter models.
  • Use elastic scaling (e.g., AWS spot instances) to reduce resource waste during idle periods.
  • For latency-insensitive tasks (e.g., batch processing), prioritize 1x A100 for better cost efficiency. These rules are condensed into a small helper sketch below.
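
A minimal sketch collapsing the recommendations into a shell helper; the thresholds and labels restate this article's guidance and are assumptions rather than a universal sizing guide:

# Hypothetical helper: map a model's parameter count (integer billions) to the
# hardware class recommended in this article.
recommend_gpu() {
  local size_b=$1
  if   [ "$size_b" -le 14 ]; then echo "RTX 4090 (single card): fast and cheap for <=14B"
  elif [ "$size_b" -le 70 ]; then echo "Multi-GPU P40 on a budget, or 8x A100 for standard throughput"
  else                            echo "8x H100 cluster for extreme / real-time workloads"
  fi
}

recommend_gpu 7     # -> RTX 4090 (single card)
recommend_gpu 32    # -> Multi-GPU P40 or 8x A100
recommend_gpu 671   # -> 8x H100 cluster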

Appendix

  1. Video Source: https://www.youtube.com/watch?v=bOp9ggH4ztE
