GPU Inference Performance and Cost Analysis: RTX 4090, P40, and Cloud GPU (A100, H100) Comparison

As large language models enter practical applications, developers face a dual challenge when selecting hardware: meeting real-time inference speed requirements while controlling the rising cost of compute. This article presents measured data for the DeepSeek-R1 model series (1.5B/7B/14B/32B/70B/671B) and compares consumer-grade GPUs with cloud GPUs, showing how inference performance scales across model sizes. (Test videos and raw data can be found in the appendix at the end.)
1. Lambda Labs GPU Instance Creation and Testing
1.1 Log in to the Console
Visit the Lambda Labs console → select “Instances” → click “Launch instance”.

Choose the required instance type and click “Create”. Wait for the deployment to complete, then click “Launch” to enter the operation interface.

1.2 Download Ollama Models and Test Different DeepSeek Models
Select “Terminal” to enter command-line mode.
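If you prefer a local terminal over the web console, the instance can also be reached over SSH. A minimal sketch, assuming you attached an SSH key when launching; the IP address below is a placeholder for the one shown on the Instances page:
# Connect to the Lambda instance (the default user is ubuntu); replace the IP
# with the address listed in the console.
ssh ubuntu@203.0.113.10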


Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
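A quick sanity check before benchmarking (a small sketch; the version string will differ on your instance):
# Confirm the CLI is installed, then pre-pull a model so the first timed run
# is not dominated by download time.
ollama --version
ollama pull deepseek-r1:7b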
Test commands:
ollama run deepseek-r1:1.5b --verbose
ollama run deepseek-r1:7b --verbose
ollama run deepseek-r1:14b --verbose
ollama run deepseek-r1:32b --verbose
ollama run deepseek-r1:70b --verbose
ollama run deepseek-r1:671b --verbose
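To avoid typing each command by hand, the runs can be scripted. A rough sketch, assuming Ollama is installed; the prompt and model list are placeholders to adjust (the 671B only fits on the multi-GPU cloud instances). With --verbose, Ollama prints the prompt eval rate and eval rate, which is where the tokens/s figures below come from:
# Feed the same prompt to each DeepSeek-R1 size and keep only the throughput
# lines that --verbose prints (prompt eval rate / eval rate).
for size in 1.5b 7b 14b 32b 70b; do
  echo "=== deepseek-r1:${size} ==="
  echo "Explain the attention mechanism in one paragraph." \
    | ollama run "deepseek-r1:${size}" --verbose 2>&1 \
    | grep -E "eval rate"
done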
Monitor NVIDIA GPU status:
$ nvidia-smi
$ watch -n 1 nvidia-smi
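For a record that can be analyzed later, nvidia-smi can also log utilization and memory to a CSV file at a fixed interval (a sketch; the field list and one-second interval are just one reasonable choice):
# Log GPU utilization, memory use, and power draw once per second to a CSV file.
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total,power.draw --format=csv -l 1 > gpu_log.csv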
1.3 Test Results

2. Consumer-Grade GPUs: RTX 4090 vs. P40
Testing shows that the RTX 4090 significantly outperforms the P40 across most model sizes, especially for small to mid-sized models:
| Model Size | RTX 4090 (tokens/s) | P40 (tokens/s) | Speed Ratio |
| --- | --- | --- | --- |
| 1.5B | 219-264 | 104-110 | 2.0-2.5x |
| 7B | 123-139 | 53-58 | 2.3-2.5x |
| 14B | 78-87 | 17-23 | 3.3-4.3x |
| 32B | 73-76 | 9-11 | P40 Leads |
For larger DeepSeek models such as the 70B and 671B, both GPUs struggle to deliver usable inference speeds.
Key Conclusions:
- RTX 4090 is the best choice for small to mid-sized models: for 1.5B-14B models, the RTX 4090 achieves 2-4x the speed of the P40, making it ideal for latency-sensitive applications (e.g., real-time conversations).
- P40 offers better value for large models: at 32B, the P40 came out ahead of the RTX 4090 in this test, likely due to memory bandwidth or optimization factors, making it a better choice for budget-conscious large-model inference.
3. Cloud GPUs: A100/H100 Cluster Performance and Cost
Cloud GPUs offer greater scalability, but hourly cost must be balanced against performance:
| GPU Configuration | Representative Performance (tokens/s) | Hourly Cost | Cost Efficiency (tokens/s per $/h) |
| --- | --- | --- | --- |
| 1x A100 ($1.29/h) | 1.5B: 206 | Low | ~160 |
| 8x A100 ($14.32/h) | 70B: 21 | Medium | ~1.5 |
| 8x H100 ($23.92/h) | 671B: 25 | High | ~1.0 |
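The cost-efficiency column is simply throughput divided by hourly price. The same figures can also be turned into a rough price per million generated tokens; a back-of-the-envelope sketch using the measured numbers above (it ignores idle time and prompt processing):
# Compute tokens/s per $/h and $ per million output tokens for each configuration.
awk 'BEGIN {
  print  "config  tok/s per $/h  $ per Mtok";
  printf "1xA100  %6.1f  %8.2f\n", 206/1.29,  1.29/(206*3600)*1e6;
  printf "8xA100  %6.1f  %8.2f\n", 21/14.32,  14.32/(21*3600)*1e6;
  printf "8xH100  %6.1f  %8.2f\n", 25/23.92,  23.92/(25*3600)*1e6;
}'
Run as written, this puts the single A100 at roughly $2 per million generated tokens versus a few hundred dollars for the clusters, which is the gap the conclusions below describe.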
Key Conclusions:
- A100 is more cost-effective for small-scale tasks: A single A100 significantly outperforms clusters in terms of cost-efficiency for 1.5B model inference, making it ideal for individual developers or small-scale applications.
- Clusters are necessary for large models: the 70B and 671B models reach 21-25 tokens/s on 8x A100 and 8x H100 clusters respectively, but costs rise sharply, making them suitable for enterprise-level, high-load scenarios.
- H100 offers a significant performance boost: at the same model scale, 8x H100 is 20-50% faster than 8x A100, but the cost increase must be weighed carefully.
4. Detailed Comparison of GPU Inference Performance

5. Hardware Selection Recommendations
For Individual Developers / Small Teams:
- Models ≤14B → Prioritize RTX 4090 for high inference speed at low cost.
- Models ≥32B → Consider multi-GPU P40 setups to balance memory and cost (see the sketch below).
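A minimal sketch of the multi-GPU idea, assuming two P40s in one machine and a stock Ollama install: Ollama splits a model's layers across every GPU it is allowed to see, so exposing both cards is usually enough (the device indices should match nvidia-smi; if Ollama already runs as a systemd service, set the variable in the service environment instead):
# Make both P40s visible to the Ollama server, then run a 32B model that
# needs more VRAM than a single 24 GB card comfortably provides.
CUDA_VISIBLE_DEVICES=0,1 ollama serve &
ollama run deepseek-r1:32b --verbose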
For Enterprise Applications:
- Standard workloads → Use 8x A100 clusters to balance throughput and cost.
- Extreme performance needs → Choose 8x H100 clusters, particularly for real-time inference on ultra-large models such as the 671B.
Cloud Cost Optimization Strategies:
- Use elastic scaling (e.g., AWS spot instances) to reduce resource waste during idle periods.
- For latency-insensitive tasks (e.g., batch processing), prioritize 1x A100 for better cost efficiency.
Appendix
- Video Source: https://www.youtube.com/watch?v=bOp9ggH4ztE
- Raw Data: download the accompanying Excel file to check the full figures.
Visit my YouTube channel https://www.youtube.com/@frankfu007 for more AI technology content.
🔥 Want to stay ahead with the latest in AI? Dive into my articles below!
- Education Nano 01 – Modular Wheel-Leg Robot for STEM
- Audio-Visual Synchronization Algorithms in Digital Humans and the TIME_WAIT Challenge in WebSocket Communication
- Building a Voice-Controlled Robot Using OpenAI Realtime API: A Full-Link Implementation from RDK X5 to ES02
- Desktop Balancing Bot(ES02)-Dual-Wheel Legged Robot with High-Performance Algorithm
- Wheeled-Legged Robot ES01: Building with ESP32 & SimpleFOC