Today Intel publicly unveiled its Larrabee GPU architecture. No actual products were announced, though; this is just a closer look at what makes Larrabee tick. One important thing to note is that Larrabee is very different from the GPUs made by AMD and NVIDIA. It is a highly programmable architecture: it will be able to run the DirectX and OpenGL libraries, so it can act as a high-end graphics card, but developers will also be able to write C/C++ code for Larrabee, which makes it attractive for stream computing.
Larrabee is different because it's basically a highly scalable many-core CPU architecture optimized for parallel workloads. Intel says Larrabee will feature a scalable number of in-order x86 cores based on a variant of the original Pentium processor, and some slides suggest that initial versions may have 8 to 48 cores. Each of these cores can issue two instructions per clock and will have an L1 cache consisting of a 32KB instruction cache and a 32KB data cache, along with 256KB of L2 cache and a vector unit that can perform 16 32-bit operations per clock.
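To put these figures in perspective, the short Python sketch below multiplies them out across the 8 to 48 core range. It reads the 256KB of L2 as a per-core figure, which is how the paragraph above presents it. The clock speed has not been disclosed, so `clock_ghz` is a purely hypothetical parameter, and the results are theoretical peaks, not measured performance.

```python
# Per-core figures quoted in the article.
VECTOR_OPS_PER_CLOCK = 16   # 32-bit operations per clock from the vector unit
L2_KB_PER_CORE = 256        # assumed to be a per-core amount


def peak_vector_ops_per_clock(cores: int) -> int:
    """Upper bound on 32-bit vector operations the whole chip performs per clock."""
    return cores * VECTOR_OPS_PER_CLOCK


def total_l2_kb(cores: int) -> int:
    """Aggregate L2 capacity if every core carries its own 256KB."""
    return cores * L2_KB_PER_CORE


def peak_gops(cores: int, clock_ghz: float) -> float:
    """Peak billions of vector operations per second at a HYPOTHETICAL clock."""
    return peak_vector_ops_per_clock(cores) * clock_ghz


for cores in (8, 16, 32, 48):
    print(cores, peak_vector_ops_per_clock(cores), total_l2_kb(cores))
```

At the low end (8 cores) that works out to 128 vector operations per clock and 2MB of L2 in total; at the high end (48 cores) it is 768 operations per clock and 12MB.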
The only fixed-function parts of Larrabee's graphics pipeline are the texture units. These will be used for all the usual texture operations, while the fully programmable x86 cores will do everything else. The texture units aren't part of the main graphics pipeline; instead they are connected through the 1024-bit ring bus (512 bits in each direction), which also connects Larrabee's L2 cache, the memory controllers and the system interface. One benefit is that only the data that needs to be processed by the texture units follows that path, while all other data moves through the main pipeline. This is very useful for GPGPU applications, and Intel claims even some games can benefit greatly from this feature.
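The ring bus numbers are easy to sanity-check. The sketch below converts the 512-bit one-way width into bytes per clock and counts hops between two agents on a bidirectional ring. Intel has not described the routing policy or the number of agents here, so the "take the shorter direction" rule and the agent counts are assumptions made for illustration only.

```python
RING_BITS_PER_DIRECTION = 512  # from the article: 1024 bits total, 512 each way


def bytes_per_clock_per_direction() -> int:
    """Payload the ring can move per clock in one direction."""
    return RING_BITS_PER_DIRECTION // 8


def ring_hops(src: int, dst: int, agents: int) -> int:
    """Hops between two agents, assuming traffic takes the shorter direction."""
    clockwise = (dst - src) % agents
    return min(clockwise, agents - clockwise)


print(bytes_per_clock_per_direction())   # 64
print(ring_hops(0, 7, agents=8))         # 1 (one step the "other way round")
```

Note that 64 bytes is exactly 16 values of 32 bits each, so each direction of the ring is as wide as one clock's worth of 16 32-bit vector operands.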
Another interesting thing is that Larrabee uses bin rendering, which is essentially the same tile-based rendering approach PowerVR used many years ago. Intel claims bin rendering is far more bandwidth-efficient than the hierarchical Z-buffering techniques used by the competition.
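For readers unfamiliar with the idea, here is a minimal Python sketch of the binning step in a tile-based renderer: the screen is cut into tiles and every triangle is filed under each tile its bounding box touches, so that each tile can later be rendered on its own. This is a generic illustration, not Intel's implementation; the 64-pixel tile size and the function name `bin_triangles` are hypothetical.

```python
from collections import defaultdict

TILE = 64  # hypothetical tile edge in pixels; Larrabee's tile size is not given


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen and convert it to tile coordinates.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(10, 10), (20, 10), (10, 20)],   # sits entirely inside tile (0, 0)
    [(60, 10), (70, 10), (60, 20)],   # straddles tiles (0, 0) and (1, 0)
]
print(dict(bin_triangles(triangles, width=128, height=128)))
```

The bounding-box test is conservative: a triangle may be listed in a tile it only nearly touches, which costs a little extra work but never drops geometry. The bandwidth argument for tile-based approaches in general is that a tile's colour and depth data are small enough to stay in on-chip memory while that tile is being rendered.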
Overall, the Intel Larrabee architecture looks pretty impressive on paper and I'm very interested in how it will perform. Rumour has it that the first Larrabee products may be available in late 2009 or 2010; these will likely be 45nm or even 32nm parts. We'll have to wait until then to see if Larrabee is as good as Intel wants us to believe. Intel's current integrated graphics aren't impressive at all, and if the company wants Larrabee to be a success it will need a chip that performs at least on par with the solutions offered by AMD and NVIDIA. Furthermore, Intel will need great drivers for Larrabee, and drivers are another area where its integrated graphics have a pretty poor reputation.
Some of the features of the Larrabee architecture:
The Larrabee architecture has a pipeline derived from the dual-issue Intel Pentium® processor, which uses a short execution pipeline with a fully coherent cache structure. The Larrabee architecture provides significant modern enhancements such as a wide vector processing unit (VPU), multi-threading, 64-bit extensions and sophisticated pre-fetching. This will enable a massive increase in available computational power combined with the familiarity and ease of programming of the Intel architecture.
Larrabee also includes a select few fixed function logic blocks to support graphics and other applications. These units are carefully chosen to balance strong performance per watt, yet contribute to the flexibility and programmability of the architecture.
A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data to be accessed by CPU cores, making the writing of software programs simpler.
The Larrabee native programming model supports a variety of highly parallel applications, including those that use irregular data structures. This enables development of graphics APIs, rapid innovation of new graphics algorithms, and true general purpose computation on the graphics processor with established PC software development tools.
Larrabee features task scheduling which is performed entirely in software, rather than in fixed function logic. Therefore rendering pipelines and other complex software systems can adjust their resource scheduling based on each workload's unique computing demand.
The Larrabee architecture supports four execution threads per core with separate register sets per thread. This allows the use of a simple efficient in-order pipeline, but retains many of the latency-hiding benefits of more complex out-of-order pipelines when running highly parallel applications.
The Larrabee architecture uses a 1024-bit-wide, bi-directional ring network (i.e., 512 bits in each direction) to allow agents to communicate with each other in a low-latency manner, resulting in very fast communication between cores.
The Larrabee architecture fully supports IEEE standards for single and double precision floating-point arithmetic. Support for these standards is a prerequisite for many types of tasks, including financial applications.
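The claim in the feature list that four threads per core let a simple in-order pipeline hide latency can be illustrated with a toy model. In the Python sketch below, each thread does a few cycles of work and then waits on memory, and the core switches to another ready thread whenever the current one stalls. The switch-on-stall policy and the cycle counts (4 cycles of work, 12 cycles of stall) are assumptions chosen for illustration; they are not Larrabee's actual thread-selection rules or latencies.

```python
def core_utilization(threads, work_cycles, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which a toy in-order core has a thread to run."""
    ready_at = [0] * threads              # cycle at which each thread can run again
    remaining = [work_cycles] * threads   # work left before the thread's next stall
    current = 0
    busy = 0
    for cycle in range(total_cycles):
        # Stay on the current thread if it is ready, otherwise try the others in order.
        chosen = None
        for offset in range(threads):
            t = (current + offset) % threads
            if ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is None:
            continue  # every thread is waiting on memory, so the core idles
        current = chosen
        busy += 1
        remaining[chosen] -= 1
        if remaining[chosen] == 0:
            # The thread now stalls (e.g. on a cache miss) for stall_cycles.
            ready_at[chosen] = cycle + 1 + stall_cycles
            remaining[chosen] = work_cycles
    return busy / total_cycles


for n in (1, 2, 4):
    print(n, core_utilization(n, work_cycles=4, stall_cycles=12))
```

With these made-up numbers a single thread keeps the core busy 25% of the time, two threads reach 50%, and four threads cover the stalls completely. Real workloads are far less regular, but the model shows why extra hardware threads can stand in for some of what out-of-order execution would otherwise provide.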