ExtremeTech has an article about Intel's upcoming 45nm Penryn processor, which features the new SSE4 instructions. The article suggests this processor might be the first step on Intel's road toward CPUs with GPU functionality, since the SSE4 instructions already include a "streaming load instruction" that gives special priority to graphics data, allowing it to bypass the normal CPU cache.
To briefly recap some of the new Penryn features:
Deep Power Down technology: Each processor core contains a voltage regulator sensor that monitors the CPU. Upon receiving what's known as a "MWAIT Level 6" request, the CPU flushes the level-1 cache and saves its state, then flushes the level-2 caches. The chip checks that there is no inbound clock or DMA traffic, then enters the "leakage off" state.
Dynamic Acceleration Technology: When a dual-core chip runs a single-threaded application, one core sits idle. In that case, the active core can enter a "frequency boost" state where its clock speed is ramped up beyond its rated speed; in effect, it is overclocked. The core remains in the accelerated state only for a "thermally significant" amount of time, so that the chip isn't damaged by the increased clock frequency.
VT-x: Intel's hardware support for virtualization, which tracks guest and host state in a structure known as the VMCS (virtual machine control structure). When a virtual machine is run on a Penryn-class chip, the hardware hides the virtualization entry/exit commands from the software, speeding up the transitions into and out of the virtual machine by 25 to 75 percent, according to Intel.
SSE4: While the specific instructions themselves will primarily be of interest to developers, Intel highlighted four areas where the SSE4 instructions will be useful: dot products, for 3D content creation; motion estimation; finding the best sum of absolute differences, a "branchy" operation that usually requires several lines of code; and the streaming load instruction. The architecture also includes a "super shuffle" engine, used to process SSE data formatting more efficiently, and a radix-16 divider that's twice as fast as the one in the previous architecture. The improved motion estimation uses the performance of the CPU to look for motion across the bulk of the image, not just on a per-pixel basis.
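To make the sum-of-absolute-differences idea concrete, here is a minimal scalar sketch in Python of the kind of block search that motion estimation performs. It is not Intel's implementation and does not use any SSE4 instruction; the function names `sad` and `best_match` and the tiny test frame are made up for illustration. It only shows the arithmetic that dedicated hardware instructions are meant to speed up.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(frame, block, size):
    """Exhaustively slide a size x size window over the frame and return
    (score, x, y) for the position whose SAD against `block` is lowest.
    Ties keep the first position found in row-major order."""
    height, width = len(frame), len(frame[0])
    best = None
    for y in range(height - size + 1):
        for x in range(width - size + 1):
            # Cut the candidate window out of the frame.
            candidate = [row[x:x + size] for row in frame[y:y + size]]
            score = sad(candidate, block)
            if best is None or score < best[0]:
                best = (score, x, y)
    return best


# Hypothetical 4x4 "frame" with a bright 2x2 patch in the middle.
frame = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 6, 0],
    [0, 0, 0, 0],
]
block = [[9, 8], [7, 6]]
print(best_match(frame, block, 2))
```

The inner comparison and the "keep the smallest score" step are the branchy parts the summary refers to; a real encoder would restrict the search to a small neighborhood rather than scanning the whole frame.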