GPU Inference Performance and Cost Analysis: RTX 4090, P40, and Cloud GPU (A100, H100) Comparison

As large language models enter practical applications, developers face a dual challenge when selecting hardware: meeting real-time inference speed requirements while controlling the rising cost of compute. This article presents measured data for the DeepSeek-R1 model series (1.5B/7B/14B/32B/70B/671B) and compares consumer-grade GPUs with cloud GPUs, showing how inference performance scales across model sizes. (Test videos and raw data can be found in the appendix at the end.)
1. Lambda Labs GPU Instance Creation and Testing
1.1 Log in to the Console
Visit the Lambda Labs console → select “Instances” → click “Launch instance”.

Choose the required instance type and click “Create”. Wait for the deployment to complete, then click “Launch” to enter the operation interface.

1.2 Download Ollama Models and Test Different DeepSeek Models
Select “Terminal” to enter command-line mode.
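If you prefer a local terminal over the web console, the instance can also be reached over SSH. A minimal sketch, assuming you attached an SSH key when launching; the IP address below is a placeholder for the one shown on the Instances page:
# Connect to the Lambda instance (the default user is ubuntu); replace the IP
# with the address listed in the console.
ssh ubuntu@203.0.113.10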


Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
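A quick sanity check before benchmarking (a small sketch; the version string will differ on your instance):
# Confirm the CLI is installed, then pre-pull a model so the first timed run
# is not dominated by download time.
ollama --version
ollama pull deepseek-r1:7b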
Test commands:
ollama run deepseek-r1:1.5b --verbose
ollama run deepseek-r1:7b --verbose
ollama run deepseek-r1:14b --verbose
ollama run deepseek-r1:32b --verbose
ollama run deepseek-r1:70b --verbose
ollama run deepseek-r1:671b --verbose
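To avoid typing each command by hand, the runs can be scripted. A rough sketch, assuming Ollama is installed; the prompt and model list are placeholders to adjust (the 671B only fits on the multi-GPU cloud instances). With --verbose, Ollama prints the prompt eval rate and eval rate, which is where the tokens/s figures below come from:
# Feed the same prompt to each DeepSeek-R1 size and keep only the throughput
# lines that --verbose prints (prompt eval rate / eval rate).
for size in 1.5b 7b 14b 32b 70b; do
  echo "=== deepseek-r1:${size} ==="
  echo "Explain the attention mechanism in one paragraph." \
    | ollama run "deepseek-r1:${size}" --verbose 2>&1 \
    | grep -E "eval rate"
done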
Monitor NVIDIA GPU status:
$ nvidia-smi
$ watch -n 1 nvidia-smi
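For a record that can be analyzed later, nvidia-smi can also log utilization and memory to a CSV file at a fixed interval (a sketch; the field list and one-second interval are just one reasonable choice):
# Log GPU utilization, memory use, and power draw once per second to a CSV file.
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total,power.draw --format=csv -l 1 > gpu_log.csv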
1.3 Test Results

2. Consumer-Grade GPUs: RTX 4090 vs. P40
Testing shows that the RTX 4090 significantly outperforms the P40 across most model sizes, especially for small to mid-sized models:
| Model Size | RTX 4090 (tokens/s) | P40 (tokens/s) | Speed Ratio |
| --- | --- | --- | --- |
| 1.5B | 219-264 | 104-110 | 2.0-2.5x |
| 7B | 123-139 | 53-58 | 2.3-2.5x |
| 14B | 78-87 | 17-23 | 3.3-4.3x |
| 32B | 73-76 | 9-11 | P40 Leads |
For larger DeepSeek models such as the 70B and 671B, both GPUs struggle to deliver usable inference speeds.
Key Conclusions:
- RTX 4090 is the best choice for small to mid-sized models: for 1.5B-14B models, the RTX 4090 achieves 2-4x the speed of the P40, making it ideal for latency-sensitive applications (e.g., real-time conversations).
- P40 offers better value for large models: at 32B, the P40 came out ahead of the RTX 4090 in this test, likely due to memory bandwidth or optimization factors, making it a better choice for budget-conscious large-model inference.
3. Cloud GPUs: A100/H100 Cluster Performance and Cost
Cloud GPUs offer greater scalability, but hourly cost must be balanced against performance:
| GPU Configuration | Representative Performance (tokens/s) | Hourly Cost | Cost Efficiency (tokens/s per $/h) |
| --- | --- | --- | --- |
| 1x A100 ($1.29/h) | 1.5B: 206 | Low | ~160 |
| 8x A100 ($14.32/h) | 70B: 21 | Medium | ~1.5 |
| 8x H100 ($23.92/h) | 671B: 25 | High | ~1.0 |
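The cost-efficiency column is simply throughput divided by hourly price. The same figures can also be turned into a rough price per million generated tokens; a back-of-the-envelope sketch using the measured numbers above (it ignores idle time and prompt processing):
# Compute tokens/s per $/h and $ per million output tokens for each configuration.
awk 'BEGIN {
  print  "config  tok/s per $/h  $ per Mtok";
  printf "1xA100  %6.1f  %8.2f\n", 206/1.29,  1.29/(206*3600)*1e6;
  printf "8xA100  %6.1f  %8.2f\n", 21/14.32,  14.32/(21*3600)*1e6;
  printf "8xH100  %6.1f  %8.2f\n", 25/23.92,  23.92/(25*3600)*1e6;
}'
Run as written, this puts the single A100 at roughly $2 per million generated tokens versus a few hundred dollars for the clusters, which is the gap the conclusions below describe.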
Key Conclusions:
- A100 is more cost-effective for small-scale tasks: A single A100 significantly outperforms clusters in terms of cost-efficiency for 1.5B model inference, making it ideal for individual developers or small-scale applications.
- Clusters are necessary for large models: the 70B and 671B models reach 21-25 tokens/s on 8x A100 and 8x H100 clusters respectively, but costs rise sharply, making them suitable for enterprise-level, high-load scenarios.
- H100 offers a significant performance boost: at the same model scale, 8x H100 is 20-50% faster than 8x A100, but the cost increase must be weighed carefully.
4. Detailed Comparison of GPU Inference Performance

5. Hardware Selection Recommendations
For Individual Developers / Small Teams:
- Models ≤14B → Prioritize RTX 4090 for high inference speed at low cost.
- Models ≥32B → Consider multi-GPU P40 setups to balance memory and cost (see the sketch below).
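A minimal sketch of the multi-GPU idea, assuming two P40s in one machine and a stock Ollama install: Ollama splits a model's layers across every GPU it is allowed to see, so exposing both cards is usually enough (the device indices should match nvidia-smi; if Ollama already runs as a systemd service, set the variable in the service environment instead):
# Make both P40s visible to the Ollama server, then run a 32B model that
# needs more VRAM than a single 24 GB card comfortably provides.
CUDA_VISIBLE_DEVICES=0,1 ollama serve &
ollama run deepseek-r1:32b --verbose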
For Enterprise Applications:
- Standard workloads → Use 8x A100 clusters to balance throughput and cost.
- Extreme performance needs → Choose 8x H100 clusters, particularly for real-time inference on ultra-large models such as the 671B.
Cloud Cost Optimization Strategies:
- Use elastic scaling (e.g., AWS spot instances) to reduce resource waste during idle periods.
- For latency-insensitive tasks (e.g., batch processing), prioritize 1x A100 for better cost efficiency.
Appendix
- Video Source: https://www.youtube.com/watch?v=bOp9ggH4ztE
- Raw Data: download the accompanying Excel file to check the full figures.
Visit my YouTube channel https://www.youtube.com/@frankfu007 for more AI technology content.
🔥 Want to stay ahead with the latest in AI? Dive into my articles below!
- Education Nano 01 – Modular Wheel-Leg Robot for STEM
- Audio-Visual Synchronization Algorithms in Digital Humans and the TIME_WAIT Challenge in WebSocket Communication
- Building a Voice-Controlled Robot Using OpenAI Realtime API: A Full-Link Implementation from RDK X5 to ES02
- Desktop Balancing Bot(ES02)-Dual-Wheel Legged Robot with High-Performance Algorithm
- Wheeled-Legged Robot ES01: Building with ESP32 & SimpleFOC