45x faster AI performance delivered by new NVIDIA Tesla P4 and P40
By Bryan Chan & Marcus Wong - on 15 Sep 2016, 9:09am

Modern Artificial Intelligence (AI) is capable of doing everything from picking up our calls to filtering our emails and recommending movies to us. However, the quality of the AI experience depends largely on its ability to respond to us in real time, and that demands a lot of computing power, more than most CPU-based technology is capable of delivering.

Enter NVIDIA’s latest Tesla P4 and P40 chips. Designed from the ground up with inferencing in mind, these chips use trained deep neural networks to recognize speech, images or text and are based on NVIDIA’s new Pascal architecture for optimum efficiency. Using specialized inference instructions based on 8-bit (INT8) operations, the Tesla P4 and P40 chips are able to deliver 45x faster response than CPUs and a 4x improvement over GPU solutions launched less than a year ago.

Each Tesla P4 is said to be able to replace 13 CPU-only servers for video inferencing workloads.

The Tesla P4 comes in 50W and 70W designs, allowing it to provide over 60x better energy efficiency than CPUs. NVIDIA estimates that a single server with one Tesla P4 replaces 13 CPU-only servers for video inferencing workloads, for savings of over eight times in total cost of ownership.

It can also transcode and run inference on up to 35 HD video streams in real time, thanks to a dedicated hardware-accelerated decode engine that works in parallel with the GPU doing inference. This allows you to integrate deep learning in the video pipeline, letting you offer smart services to your users.

The Tesla P40 is built for the maximum inference throughput possible.

The Tesla P40, on the other hand, delivers the maximum inference throughput possible for deep-learning deployment and is capable of 47 TOPS (tera-operations per second) of INT8 operations. It has 24GB of GPU memory and memory bandwidth of 346 GB/s, allowing it to do transcoding and inference of 35 HD video streams in real time.

It also delivers over 30x lower latency than a CPU, so a server with eight Tesla P40 accelerators can deliver the equivalent computing performance of 140 CPU-based servers, greatly saving you time and money.
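
To put that consolidation figure in perspective, here is a rough back-of-the-envelope sketch. It uses only the article's figure of eight Tesla P40 accelerators matching 140 CPU-based servers, and it assumes performance scales linearly with the number of accelerators, which may not hold for real workloads. The helper name `p40_servers_needed` is made up for illustration.

```python
import math

# Figures quoted in the article: one server with eight Tesla P40
# accelerators is said to match 140 CPU-based servers.
CPU_SERVERS_PER_P40_SERVER = 140
P40_PER_SERVER = 8

# Implied CPU-server equivalence of a single P40 (assumes linear scaling).
cpu_servers_per_p40 = CPU_SERVERS_PER_P40_SERVER / P40_PER_SERVER


def p40_servers_needed(cpu_servers):
    """Hypothetical helper: eight-GPU P40 servers needed to match a CPU fleet."""
    return math.ceil(cpu_servers / CPU_SERVERS_PER_P40_SERVER)


print(f"One P40 is roughly equivalent to {cpu_servers_per_p40} CPU servers")
print(f"A 1,000-server CPU fleet maps to {p40_servers_needed(1000)} P40 servers")
```

By this estimate each P40 stands in for about 17.5 CPU-based servers, though the real ratio will depend on the model and workload being served.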

Software Tools for Faster Inferencing

Further speeding up the performance of these two cards are two new software innovations: TensorRT and NVIDIA DeepStream SDK.

TensorRT is a library for optimizing deep learning models for production deployment. It is designed to deliver instant responsiveness by maximizing the throughput and efficiency of deep learning applications. It takes trained neural nets – defined with 32-bit or 16-bit operations – and optimizes them for reduced precision INT8 operations.
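
For readers curious what "reduced precision INT8" means in practice, the sketch below shows one common, generic approach: symmetric linear quantization, which maps floating-point values onto signed 8-bit integers using a single scale factor. This is an illustration only and is not TensorRT's actual API or calibration method; the function names and sample weights are made up for the example.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers.

    Illustrative only: not TensorRT's API or calibration procedure.
    """
    max_abs = max(abs(v) for v in values)
    # The scale maps the largest magnitude to 127; guard against all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # made-up sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("INT8 values:", q)
print("Largest round-trip error:", max_error)
```

Each value is stored in one byte instead of four, and the round-trip error stays within half a quantization step. That trade of a little precision for much cheaper arithmetic is what lets INT8 hardware push more operations per second.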

The NVIDIA DeepStream SDK, on the other hand, taps into Pascal servers to simultaneously decode and analyze up to 93 HD video streams in real time. This allows the cards to understand video content at scale for applications such as self-driving cars, interactive robots, filtering and ad placement.

Here are some basic specifications for both GPUs:

Specifications of the Tesla P4 and P40 GPUs
| Specification | Tesla P4 | Tesla P40 |
| --- | --- | --- |
| Single-Precision TFLOPS* | 5.5 | 12 |
| INT8 TOPS* (Tera-Operations Per Second) | 22 | 47 |
| CUDA Cores | 2,560 | 3,840 |
| GPU GDDR5 Memory | 8GB | 24GB |
| Memory Bandwidth | 192GB/s | 346GB/s |
| Power | 50W (or higher) | 250W |

*With boost clock on
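
The specifications above also allow a quick efficiency comparison between the two cards. The sketch below divides the listed INT8 TOPS by the listed power figure. It assumes the P4's 22 TOPS applies to the 50W design, which the article does not confirm, so the P4 result is best read as an upper bound.

```python
# Figures taken from the specifications table (boost clock on).
# Assumption: the P4's 22 INT8 TOPS is achievable in the 50W design.
specs = {
    "Tesla P4": {"int8_tops": 22, "power_watts": 50},
    "Tesla P40": {"int8_tops": 47, "power_watts": 250},
}


def tops_per_watt(int8_tops, power_watts):
    """INT8 tera-operations per second delivered per watt of board power."""
    return int8_tops / power_watts


efficiency = {name: tops_per_watt(**spec) for name, spec in specs.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} INT8 TOPS per watt")
```

On these numbers the P4 comes out at about 0.44 TOPS per watt against roughly 0.19 for the P40, which fits their positioning: the P4 favors efficiency while the P40 favors raw throughput.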

The NVIDIA Tesla P4 and P40 are planned to be available in November and October, respectively, in qualified servers offered by ODM, OEM and channel partners.