Intel Larrabee finally hits 1TFLOPS - 2.7x faster than nVidia GT200

During the recently held SC09 conference in Portland, Oregon, Intel finally managed to reach its original performance goal for Larrabee. Back in 2006, when the first details about Larrabee emerged, the performance goal was "1 TFLOPS @ 16 cores, 2.0 GHz clock, 150W TDP". During Justin Rattner's keynote, Intel demonstrated the performance of LRB as it stands today.

In the SGEMM performance test [4K by 4K matrix multiply, QCD], Intel achieved 417 GFLOPS using half the cores on the prototype card, and reached 825 GFLOPS by enabling all the cores. Looking at the numbers alone, one might think that these scores are below the level of ATI Radeon 4850 and nVidia GeForce GTX 280/GTX 285. Of course, there is a "but" coming: unlike the theoretical numbers that are usually disclosed by ATI and nVidia, this was an actual SGEMM benchmark calculation used in the HPC community.

darkequitus, 3243d ago

Yawn. I will have another look when some non-HPC stats come out, as I will not be doing weather sims at home. And until they set a release date, I will stick with a Radeon 5850 as soon as they become available again in Dec.

JsonHenry, 3243d ago

Agreed. Until they show some REAL numbers for things we will actually be using, I am sticking with my 5870.

Nihilism, 3243d ago

TFLOPS mean very little in terms of performance for the average user. Can this play Crysis? No, it cannot. If they want to market it as a CPU/GPU solution, then it had better be able to hold its own in the graphics department too, and it won't.

meetajhu, 3243d ago

TFLOPS mean nothing for games! It handles Crysis at maximum settings like an 8800 Ultra. But GeForce 300, or Nvidia Fermi, will be at least 10x faster in games compared to Larrabee!

champ21, 3243d ago

Current-gen GPUs get about 120-150 GB/s of memory bandwidth from their dedicated memory.

Larrabee will be sharing the memory bandwidth between a CPU and a GPU. The Core i7 with its triple-channel memory only has 25 GB/s of memory bandwidth; now imagine sharing that with a GPU.

IMO, for Larrabee to be viable, memory bandwidth would have to grow fivefold just to match the bandwidth of current GPUs.

We all know how important this bandwidth is. I don't care how powerful a GPU is; without the memory bandwidth, nothing is happening.
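
The fivefold estimate in this comment can be checked with a quick sketch. It uses only the figures quoted by the commenter (25 GB/s for the Core i7, 120-150 GB/s for current GPUs), which are the commenter's numbers rather than measurements; the variable and function names are illustrative.

```python
# Figures as quoted in the comment above (not measured values).
CPU_BANDWIDTH_GBS = 25.0                   # Core i7, triple-channel memory
GPU_BANDWIDTH_RANGE_GBS = (120.0, 150.0)   # current-gen dedicated GPU memory


def required_growth(cpu_gbs: float, gpu_gbs: float) -> float:
    """Factor by which CPU memory bandwidth must grow to match a GPU."""
    return gpu_gbs / cpu_gbs


# Growth factor for the low and high end of the quoted GPU range.
low, high = (required_growth(CPU_BANDWIDTH_GBS, g) for g in GPU_BANDWIDTH_RANGE_GBS)
print(f"Required growth: {low:.1f}x to {high:.1f}x")
```

With these inputs the factor falls between 4.8x and 6x, so "fivefold" sits at the low end of the quoted range.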

Ju, 3243d ago

These are pretty amazing numbers. It remains to be seen what real-world applications (other than scientific number crunching) will produce with this baby. Anyhow, (close to) 1TF for 16 cores is a huge achievement. PowerXCell is 160GF with 8 cores; that chip is more than twice as fast. Impressive.

Graphics benchmarks would certainly be interesting, too. But the raw performance of that chip allows a pure SW shader to outperform today's top-of-the-line graphics cards.

ProblemSolver, 3242d ago (edited 3242d ago)

@Ju: Well, the PowerXCell 8i processor doesn't perform at 160 GFLOPS; it does 202
GFLOPS on SGEMM (4k x 4k) utilising 8 SPEs. The PowerXCell's peak performance,
counting only the SPEs (neglecting the PPU), is computed as follows:

8 [SPE @ 3.2GHz] = 8 * (8 flops * 3.2GHz) = 8 * 25.6 Gflops(SP) = 204.8 Gflops(SP)
(SP := single precision)

Hence, the PowerXCell 8i as well as the Cell/B.E. processor inside the PS3
perform the SGEMM computational kernel at ~99% of their peak performance. No
other processor in existence can match this number.

To put the PowerXCell 8i in perspective to Larrabee with respect to the number
of cores, we have to take two PowerXCell 8i processors to get 16 SPEs. Two
PowerXCell 8i processors perform the SGEMM kernel at 406.04 GFLOPS, which
amounts to ~99% of the theoretical peak performance of 409.60 GFLOPS.

// Fast Matrix Multiplication on Cell (SMP) Systems

Larrabee performs the SGEMM kernel with 16 cores at 2 GHz at 825 GFLOPS, which
is only twice as fast as two PowerXCell 8i processors (16 SPEs), where one has to
consider that the vector width of Larrabee is 16 and that of the Cell processor
only 4.

What's the theoretical peak performance of the given Larrabee configuration?

We have,
16 [core @ 2.0GHz] = 16 * (32 flops * 2.0GHz) = 16 * 64 Gflops(SP) = 1024 Gflops(SP)

Now we can compute the efficiency of the SGEMM kernel for Larrabee;
(825 GFLOPS * 100) / 1024 = ~81%

Hence, we have
2 PowerXCell 8i @ SGEMM (4k x 4k) = 406.04 GFLOPS; efficiency = ~99%
Larrabee @ SGEMM (4k x 4k) = 825 GFLOPS; efficiency = ~81%

This shows that Larrabee's computational units get starved for data, a possible
weak spot of Larrabee's architecture. It seems that Larrabee's memory model
(an implicit coherent shared memory model) can't deliver the data fast enough to
compete with the explicit memory model (direct DMA'ing) of the Cell processor.
The explicit memory model of the Cell processor is what makes this processor
so efficient and unique, but also difficult to program for some developers.
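
The peak and efficiency figures in this comment can be reproduced with a short sketch. All inputs (unit counts, flops per cycle, clock speeds, measured SGEMM throughput) are taken as stated in the comment; the function and variable names are illustrative only.

```python
def peak_gflops(units: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical single-precision peak: units * flops/cycle * clock (GHz)."""
    return units * flops_per_cycle * clock_ghz


def efficiency_pct(measured_gflops: float, peak: float) -> float:
    """Measured SGEMM throughput as a percentage of theoretical peak."""
    return 100.0 * measured_gflops / peak


# Inputs as stated in the comment:
# (units, flops per cycle per unit, clock in GHz, measured SGEMM GFLOPS)
configs = {
    "2x PowerXCell 8i (16 SPEs)": (16, 8, 3.2, 406.04),
    "Larrabee (16 cores)": (16, 32, 2.0, 825.0),
}

results = {}
for name, (units, fpc, ghz, measured) in configs.items():
    peak = peak_gflops(units, fpc, ghz)
    results[name] = (peak, efficiency_pct(measured, peak))
    print(f"{name}: peak {peak:.1f} GFLOPS, efficiency {results[name][1]:.1f}%")
```

The efficiency gap depends entirely on the flops-per-cycle and clock assumptions quoted in the comment; a different clock on the demo card would change the Larrabee percentage.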

TABSF, 3242d ago

Great read, ProblemSolver.

Although the Cell may be efficient, that does not mean it's the best.

IBM, Intel and AMD are very scared at the moment, IMHO, of what Nvidia is doing, and that's the reason these chip makers are going down the hybrid route.

Nvidia's plan is to have CPU-free desktops, laptops, nettops, netbooks and servers. Fermi is the big step in Nvidia's overall plan.

Ju, 3242d ago

@ProblemSolver: Interesting thoughts, as always. Just wanted to note that down here.

ProblemSolver, 3242d ago

@TABSF: Best depends on the metric.

If I want a processor that computes scientific workloads at near
peak performance and that offers the most GFLOPS / $1 / Watt, then the Cell
processor is the best in this regard. As such, it's no surprise that the
Cell processor was used in Roadrunner. And the latest order of 2000 PS3s
from the military proves this as well. In HPC (High Performance Computing)
the metric GFLOPS / $1 isn't even the most important one; it's GFLOPS / Watt
that counts. The electricity bill of a cluster can easily exceed the initial
cost of the cluster over its lifetime.

From a scientific viewpoint, I haven't seen any architecture better than
that of the Cell processor. Unfortunately, this processor wasn't backed up
sufficiently with strong libraries from the get-go, making software development
difficult for many game developers. Like I've said before, this architecture is
ahead of its time. I currently don't know whether IBM is going to make another
Cell processor, but I'm pretty sure that we will see a similar architecture
again in the future. Why? To answer that, you have to understand
at what point a coherent (cache-) memory / multi-threaded architecture breaks.
The only advantage of such systems is that they're easier to program. But their
efficiency diminishes when they start to scale. Larrabee will be an indication
of this. Standard programming on Larrabee won't give good performance due to
many cache misses. Hiding those misses via hyper-threading is very limited. The
assumption is that the three other threads (out of four) have to work on the
data currently in the cache, since otherwise they would produce another cache
miss and end up thrashing the cache completely. To gain efficiency on Larrabee,
one has to program as follows:
Given that each core has only 256KB of cache at its disposal, the cache must
be split implicitly. One thread is the main thread that works out of about
200KB of cache. Another thread (a helper thread) loads data (implicitly, by
touching a memory address) into the remaining part of the cache, which is worked
on by threads three and four once thread one undergoes a cache miss. If you
look at it, this is a much more complicated way to gain efficiency compared
to the Cell processor, because the memory hierarchy within Larrabee isn't
programmable. However, as I said, in general it will be easier for the
programmer to throw code at Larrabee, but this code won't run very efficiently.
Cache miss elimination will be key. And if one goes as far as trying to eliminate
most of the cache misses by code restructuring and intelligent threading, then
that person might come to the conclusion that programming on Cell would be much
easier. So why Larrabee at all? Well, Intel has spent many years optimizing
their compilers for coherent (cache-) memory / multi-threaded architectures.
They have no other choice. If they could, they would start anew, for sure.
IBM got the chance to do so, due to the high-volume production of the PS3. But
even then, it takes too much time to build strong compilers and backends to
support the new processor, which is actually Cell's weakest point. To bring the
Cell architecture back on stage, IBM has to spend some more years optimizing
compilers for such an architecture. Possibly that is what they are doing now.
If true, we may likely see something similar to Cell, if not Cell2, in the

TABSF, 3243d ago (edited 3243d ago)

Intel propaganda.

GTX 285 (GT200) has single-precision floating-point performance of 1.1 TFLOP/s.
For GT300 (Fermi) we're looking at around 630 GFLOP/s double precision.

Perkel, 3243d ago


1. Intel Larrabee [LRB, 45nm] - 1006 GFLOPS
2. EVGA GeForce GTX 285 FTW - 425 GFLOPS
3. nVidia Tesla C1060 [GT200, 65nm] - 370 GFLOPS
4. AMD FireStream 9270 [RV770, 55nm] - 300 GFLOPS
5. IBM PowerXCell 8i [Cell, 65nm] - 164 GFLOPS

test is fail itself how can gtx285 be better than firestream, tesla, powerxcell ?
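
The "2.7x faster than nVidia GT200" in the headline appears to match the ratio between the Larrabee and Tesla C1060 [GT200] entries in the list above. That reading is an assumption; the sketch below only reproduces the arithmetic from the listed figures, and the dictionary name is illustrative.

```python
# SGEMM figures in GFLOPS, copied from the list in the comment above.
sgemm_gflops = {
    "Intel Larrabee": 1006,
    "EVGA GeForce GTX 285 FTW": 425,
    "nVidia Tesla C1060": 370,
    "AMD FireStream 9270": 300,
    "IBM PowerXCell 8i": 164,
}

larrabee = sgemm_gflops["Intel Larrabee"]
# Speedup of Larrabee over each entry in the list.
speedup = {name: larrabee / gflops for name, gflops in sgemm_gflops.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

Against the Tesla C1060 entry the ratio rounds to 2.7x, which is consistent with the headline figure.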

TABSF, 3242d ago

@Perkel

Sorry, but your claims are wrong.

The Intel Larrabee figure of 1006 GFLOP/s is correct, but I'm guessing it will be 32nm like Gulftown.
Nvidia GeForce GTX 285 is 1.1 TFLOP/s.
Nvidia Tesla C1060u is 4.1 TFLOP/s with 4 GT200 chips.

"test is fail itself how can gtx285 be better than firestream, tesla, powerxcell ?"

The GTX 285 is better than the PowerXCell 8i. The Tesla S1070u consists of 4 GTX 280 chips and is 4.1 TFLOP/s single precision; not sure about the FireStream.

PowerXcell 8i is 25.6Gflop/s Double precision
GTX 285 is 77.8Gflop/s Double precision
GT 300 (Fermi) is 630Gflop/s Double precision

For parallel processing, the S1070u blades are the best in the world, soon to be followed by the S2070u, which will do 2.5 TFLOP/s double precision on one blade, and that's why IBM has abandoned the Cell and Roadrunner for future HPC.

IBM, Intel and AMD are all going down the CPU/GPU chip route because the benefits of using unified cores (or, as Nvidia has branded them, CUDA cores) are amazing. These unified cores are high speed, consume little power and produce less heat than conventional processors, and yet offer almost 25-fold the performance.

What's better: 8 cores clocked at 3.2GHz (Cell) or 960 cores clocked at 1.7GHz (S1070u)?

The S1070u blade is the best power-to-performance blade in the world.
The S1070u blade is the best price-to-performance blade in the world.

Tell me a blade that will do 2.5 TFLOP/s double precision, because the S2070u blade being released in January or February will do that.
