NVIDIA TensorRT 3 promises big AI inferencing performance boost

Posted on Tuesday, September 26 2017 @ 12:45 CEST by Thomas De Maesschalck

NVIDIA made a couple of announcements at its GPU Technology Conference in Beijing, the most interesting being the reveal of the new NVIDIA TensorRT 3 AI inference software. Combined with the new Volta GPUs, version 3 of TensorRT promises 3.7x higher performance than the Pascal generation.

AI processing breaks down into two major phases. The first is training, which requires a lot of computational horsepower and teaches a model new things from existing data. Once training is done, the model can be applied to new data; this second phase is called inference. NVIDIA claims its GPU-based solutions can perform inference up to 40x faster than CPUs, at one-tenth the cost.
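To make the distinction concrete, here is a minimal sketch in plain Python. It is not NVIDIA's code and does not use TensorRT or a GPU; the function names train and infer and the toy data are assumptions chosen purely for illustration. Training loops over known data many times to fit the parameters, while inference is a single cheap pass with those parameters frozen.

```python
# Toy illustration of training vs. inference with a one-variable linear model.
# All names and data are hypothetical; no GPU or deep learning framework is used.

def train(xs, ys, lr=0.05, epochs=2000):
    """Training: repeatedly adjust w and b to reduce squared error on known data."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def infer(model, x):
    """Inference: one forward pass using the already-trained parameters."""
    w, b = model
    return w * x + b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1

model = train(xs, ys)            # expensive: 2000 passes over the data
prediction = infer(model, 10.0)  # cheap: a single multiply and add
print(round(prediction, 3))
```

The training call is where the computational horsepower goes; the inference call is what a deployed service runs for every incoming request, which is why its speed and cost matter at scale.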

NVIDIA pitches its TensorRT software as an "off-the-shelf" highly optimized compiler and runtime engine for the deployment of AI algorithms on GPUs. TensorRT helps researchers and companies deploy trained neural nets on GPUs and makes the neural networks run a lot faster. The GPU in question can be as powerful as an array of the flagship Tesla V100 datacenter cards, or as small as the Jetson TX2, which is aimed at embedded applications.

NVIDIA today unveiled new NVIDIA® TensorRT 3 AI inference software that sharply boosts the performance and slashes the cost of inferencing from the cloud to edge devices, including self-driving cars and robots.

The combination of TensorRT 3 with NVIDIA GPUs delivers ultra-fast and efficient inferencing across all frameworks for AI-enabled services -- such as image and speech recognition, natural language processing, visual search and personalized recommendations. TensorRT and NVIDIA Tesla® GPU accelerators are up to 40 times faster than CPUs(1) at one-tenth the cost of CPU-based solutions.(2)

"Internet companies are racing to infuse AI into services used by billions of people. As a result, AI inference workloads are growing exponentially," said NVIDIA founder and CEO Jensen Huang. "NVIDIA TensorRT is the world's first programmable inference accelerator. With CUDA programmability, TensorRT will be able to accelerate the growing diversity and complexity of deep neural networks. And with TensorRT's dramatic speed-up, service providers can affordably deploy these compute intensive AI workloads."

More than 1,200 companies have already begun using NVIDIA's inference platform across a wide spectrum of industries to discover new insights from data and deploy intelligent services to businesses and consumers. Among them are Amazon, Microsoft, Facebook and Google; as well as leading Chinese enterprise companies like Alibaba, Baidu, JD.com, iFLYTEK, Hikvision, Tencent and WeChat.

"NVIDIA's AI platform, using TensorRT software on Tesla GPUs, is an outstanding technology at the forefront of enabling SAP's growing requirements for inferencing," said Juergen Mueller, chief innovation officer at SAP. "TensorRT and NVIDIA GPUs make real-time service delivery possible, with maximum machine learning performance and versatility to meet our customers' needs."

"JD.com relies on NVIDIA GPUs and software for inferencing in our data centers," said Andy Chen, senior director of AI and Big Data at JD. "Using NVIDIA's TensorRT on Tesla GPUs, we can simultaneously inference 1,000 HD video streams in real time, with 20 times fewer servers. NVIDIA's deep learning platform provides outstanding performance and efficiency for JD."

TensorRT 3 is a high-performance optimizing compiler and runtime engine for production deployment of AI applications. It can rapidly optimize, validate and deploy trained neural networks for inference to hyperscale data centers, embedded or automotive GPU platforms.

It offers highly accurate INT8 and FP16 network execution, which can save data center operators tens of millions of dollars in acquisition and annual energy costs. A developer can use it to take a trained neural network and, in just one day, create a deployable inference solution that runs 3-5x faster than their training framework.
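As a rough illustration of what reduced-precision execution means, the sketch below quantizes a handful of floating-point weights to 8-bit integers with a single shared scale factor and then restores them. This is a generic symmetric max-abs scheme written in plain Python, not TensorRT's actual calibration procedure, which the announcement does not describe; the function names and sample weights are hypothetical.

```python
# Generic symmetric INT8 quantization sketch (an assumption for illustration,
# not TensorRT's calibration method).

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [i * scale for i in quantized]

weights = [0.8, -1.27, 0.05, 0.4]   # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The worst-case rounding error is bounded by half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value now fits in one byte instead of four, and integer arithmetic is cheaper than floating point on hardware that supports it. The trade-off is the small rounding error, which is why accuracy at INT8 is a selling point rather than a given.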

To further accelerate AI, NVIDIA introduced additional software, including:

DeepStream SDK: NVIDIA DeepStream SDK delivers real-time, low-latency video analytics at scale. It helps developers integrate advanced video inference capabilities, including INT8 precision and GPU-accelerated transcoding, to support AI-powered services like object classification and scene understanding for up to 30 HD streams in real time on a single Tesla P4 GPU accelerator.

CUDA 9: The latest version of CUDA®, NVIDIA's accelerated computing software platform, speeds up HPC and deep learning applications with support for NVIDIA Volta architecture-based GPUs, up to 5x faster libraries, a new programming model for thread management and updates to debugging and profiling tools. CUDA 9 is optimized to deliver maximum performance on Tesla V100 GPU accelerators.

Inference for the Data Center
Data center managers constantly balance performance and efficiency to keep their server fleets at maximum productivity. Tesla GPU accelerated servers can replace over a hundred hyperscale CPU servers for deep learning inference applications and services, freeing up precious rack space, reducing energy and cooling requirements, and reducing cost as much as 90 percent.

NVIDIA Tesla GPU accelerators provide the optimal inference solution -- combining the highest throughput, best efficiency and lowest latency on deep learning inference workloads to power new AI-driven experiences.

Inference for Self-Driving Cars and Embedded Applications
With NVIDIA's unified architecture, deep neural networks on every deep learning framework can be trained on NVIDIA DGX™ systems in the data center, and then deployed into all types of devices -- from robots to autonomous vehicles -- for real-time inferencing at the edge.

TuSimple, a startup developing autonomous trucking technology, increased inferencing performance by 30 percent after TensorRT optimization. In June, the company successfully completed a 170-mile Level 4 test drive from San Diego to Yuma, Arizona, using NVIDIA GPUs and cameras as the primary sensor. The performance gains from TensorRT allow TuSimple to analyze additional camera data, and add new AI algorithms to their autonomous trucks, without sacrificing response time.
