Apple Teams Up with NVIDIA to Explore Enhanced LLM Performance

In a recent blog entry, Apple engineers revealed new insights about their collaboration with NVIDIA aimed at enhancing text generation speeds using large language models.

Earlier this year, Apple introduced and open-sourced its Recurrent Drafter (ReDrafter) technique, a novel approach to generating text with LLMs that significantly increases speed and delivers “state-of-the-art performance.” The technique merges two methods: beam search (to explore multiple candidate continuations) and dynamic tree attention (to efficiently verify and select among them).
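At its core, ReDrafter is a form of speculative decoding: a cheap draft model proposes several tokens ahead, and the large target model verifies them in a single pass, keeping the longest agreeing prefix. The sketch below illustrates that draft-and-verify loop only; the "models" are toy stand-ins (simple arithmetic rules, invented for illustration), and real systems like ReDrafter use neural drafters and probability-based acceptance rather than exact matching.

```python
def draft_model(context, k):
    """Cheap drafter: guesses the next k tokens (toy rule: last token + 1, mod 10)."""
    out = []
    last = context[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_model(context, proposed):
    """Expensive verifier: counts how many proposed tokens it agrees with,
    and contributes one token of its own after the accepted prefix."""
    accepted = 0
    last = context[-1]
    for tok in proposed:
        if tok != (last + 1) % 10:  # toy agreement rule matching the drafter
            break
        accepted += 1
        last = tok
    correction = (last + 1) % 10  # the target's own next token
    return accepted, correction

def speculative_decode(context, total, k=4):
    """Generate `total` tokens, accepting drafted tokens in batches."""
    context = list(context)
    while len(context) < total:
        proposed = draft_model(context, k)
        accepted, correction = target_model(context, proposed)
        context.extend(proposed[:accepted])
        context.append(correction)  # the target model always adds one token
    return context[:total]

print(speculative_decode([0], 8))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

The speedup comes from the verifier's single pass validating up to `k` drafted tokens at once, so each expensive target-model call can emit several tokens instead of one.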

Building on those research results, Apple teamed up with NVIDIA to bring ReDrafter into a production environment. As part of this effort, ReDrafter was integrated into NVIDIA TensorRT-LLM, a framework designed to accelerate LLM inference on NVIDIA GPUs.

Here are some key outcomes:

To facilitate the integration of ReDrafter, NVIDIA introduced new operators or enhanced existing ones, greatly expanding TensorRT-LLM’s ability to support complex models and decoding techniques. Machine learning developers utilizing NVIDIA GPUs can now easily leverage ReDrafter’s improved token generation speed for their production LLM projects using TensorRT-LLM.

When benchmarking a production model with tens of billions of parameters on NVIDIA GPUs, the NVIDIA TensorRT-LLM inference acceleration framework combined with ReDrafter demonstrated a 2.7x increase in tokens generated per second during greedy decoding. These benchmark findings suggest that this technology could significantly minimize latency for users, while also requiring fewer GPUs and consuming less power.

“Large language models (LLMs) are increasingly integral to production applications, and enhancing inference efficiency can influence both computational costs and user latency,” Apple’s machine learning researchers conclude. “With ReDrafter’s innovative speculative decoding approach now part of the NVIDIA TensorRT-LLM framework, developers can enjoy faster token generation on NVIDIA GPUs for their production LLM applications.”

For more details on the collaboration, check out Apple’s website and the relevant blog post on NVIDIA’s site.
