Q&A: IBM's Kahle Talks Cell, PlayStation 3 Dev Complexity

Following a recent panel on gaming hardware, Gamasutra spoke exclusively with James A. Kahle, IBM Fellow and the lead architect for the PlayStation 3's Cell chip.

Gamasutra: I'm sure you're aware that there's some criticism from game developers in terms of ease of use. Do you think that's just a learning curve thing, or do you think that there are hurdles that the Cell presents compared to what they have been familiar with?

James A. Kahle: Anything besides a single-threaded processor is going to be more difficult. Whether it's a large symmetric multi-processor or the heterogeneous type of structure that Cell is, it presents more challenges. And, to get some of the efficiencies that we have in Cell, we are expecting the software programmer to be a little more structured in how they design work.

feejo 4012d ago (Edited 4012d ago)

I understand that Sony does not plan Cell only for PS3. To me that means the PS3 is not their #1 priority for Cell.
As for games, it means they have to work better, which is good news. Games will improve.

neogeo 4012d ago

When the devs put in the work, the Cell puts out the results.

nuff said

Bebedora 4012d ago

Even the processor is pretty. It HAS to be good!

On a serious note, I think 2008 will be an interesting year for us all.

emaddox84 4012d ago

We are approaching the maximum speed of flat microprocessors due to the speed of electrical signals (the speed of light). There are two approaches to solving this problem, 3-dimensional CPUs or parallel processing systems, and I'd say the latter is more feasible atm. People can complain all they want, but I give props to Sony for forcing devs to start using the Cell. It may be harder and people will complain about it, but it will become the norm in time. Think about when computer science was in its early days and people had to write code in assembler. I'm sure they were griping about how hard it was. Then with assembler they created higher level languages and techniques for doing things. This change came from the fact that they had to do it this way; there was no other choice. Well, we are running out of choices now to increase processing power, and in time better tools will be developed for parallel programming and things will become easier.

karlostomy 4012d ago

you said:
"I give props to Sony for forcing devs to start using the Cell"

Hmm.. I dunno if Sony is doing anyone any favours. Intel and IBM and AMD et al have had general purpose multicore processors for quite some time now. It really is nothing new.

The Cell is not particularly suited to general purpose programming. It is very fast at linear stuff only, like movies at high definition.
Hence for everything else devs have to work harder to get the same result.
What's good about that?

emaddox84 4012d ago

I'm assuming the processors you speak of are ones such as the Duo from Intel and so forth. While these processors do indeed have multiple cores, they are not true parallel processing systems. They are still programmed, for the most part, in a linear fashion, and the chip does most of the work to divide up the threads. From what I understand (although I have no actual experience with the Cell), programming for the Cell is more like having multiple independent single core processors, where the programmers have to divide the work up themselves in a parallel fashion to take advantage of its parallel processing capabilities. While this is much harder to accomplish correctly, the potential performance gains are much, much greater. And in case you are wondering, I'm not just blabbing off about this, I'm a software engineer.

Ju 4012d ago

I don't understand your argument about "not being a true parallel system", emaddox. What's not really parallel in a multicore? The shared cache or what?

I think the idea behind the Cell is that in a real-world application, and especially in a gaming environment where you usually run one very specific application, most of the general purpose workload is done on one core. Threading is nice from a SW perspective, because you simply don't care where your thread runs as long as the OS finds the fastest HW to spawn your thread on. However, sometimes you need specific, dedicated code execution. Imagine a parallel renderer. You really want to make sure that your "tile" is run on dedicated HW.
(Also, I'd guess, the additional cores on a general purpose computer are idling around 80% of the time, anyhow. Even if you do Photoshop and don't apply filters all the time).

You can do that with threading, too, sure. But then you have very much the same structure (threads bound to cores, one main execution unit, syncing child threads, etc.) that you have on the CELL right now. It's nice to have coherent code, so you can test on one core and scale linearly to multiple cores, true. So that's an advantage. Strictly SW speaking, multicores are way nicer to program than the CELL.
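As a rough illustration of the structure described here (one main execution unit that hands out tiles and then syncs its child threads), the following is a minimal sketch using POSIX threads. The names `tile_job`, `shade_tile` and `render_tiles`, the horizontal-band tiling and the "invert" pixel operation are all assumptions made for the example; they are not taken from any renderer or from the Cell SDK, and nothing here pins a thread to a particular core.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TILES 16   /* illustrative upper bound on worker threads */

/* One horizontal band of the frame, handed to one worker thread. */
typedef struct {
    uint8_t *pixels;   /* first pixel of this band */
    size_t   count;    /* number of pixels in this band */
} tile_job;

/* Worker: a stand-in pixel op (invert) that touches only its own band. */
static void *shade_tile(void *arg)
{
    tile_job *job = (tile_job *)arg;
    for (size_t i = 0; i < job->count; i++)
        job->pixels[i] = (uint8_t)(255 - job->pixels[i]);
    return NULL;
}

/* Main unit: split the frame into bands, spawn one thread per band,
 * then wait for all of them. Returns 0 on success, -1 on failure. */
static int render_tiles(uint8_t *frame, size_t width, size_t height,
                        size_t num_tiles)
{
    pthread_t threads[MAX_TILES];
    tile_job  jobs[MAX_TILES];
    size_t    started = 0;
    int       status = 0;

    assert(num_tiles >= 1 && num_tiles <= MAX_TILES);
    size_t rows_per_tile = height / num_tiles;

    for (size_t t = 0; t < num_tiles; t++) {
        size_t first_row = t * rows_per_tile;
        /* The last band absorbs any leftover rows. */
        size_t rows = (t == num_tiles - 1) ? height - first_row
                                           : rows_per_tile;
        jobs[t].pixels = frame + first_row * width;
        jobs[t].count  = rows * width;
        if (pthread_create(&threads[t], NULL, shade_tile, &jobs[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    /* Sync point: the main unit blocks until every child has finished. */
    for (size_t t = 0; t < started; t++)
        pthread_join(threads[t], NULL);
    return status;
}
```

Because each band is disjoint, the workers need no locking; the only synchronization is the join at the end. Which core each thread lands on is still up to the OS scheduler in this sketch.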

But these sub-tasks usually require a specific function set. In the case of a renderer, probably 95% SIMD instructions. Doing things like a matrix transformation (math intensive) is pretty fast (and small) compared to pixel ops on a 720p frame (data intensive). On an x86 you'd end up doing SSE optimization, on the PPC VMX, and then it still doesn't run in parallel.

My (limited) experience with the CELL (and SPU) is that it programs like SSE in C instead of assembler (I have yet to measure intrinsics against real SPU asm). You implement what you want and it even runs in parallel. You need some more work in your master process to sync all SPUs, and you might even end up writing your own SPU kernel (e.g. keeping the IO as is, but you might want to load code dynamically on the SPU with low overhead, and even do some SPU code scheduling and the like).

It still feels like dedicated HW. Say you'd do a decoder: you hand it over to an SPU and all of a sudden you have an additional flexible HW execution unit. But that unit does exactly what you want, when you want it. And it's damn fast.

The other argument they mentioned somewhere is that, because cache coherency is a real nightmare on multicore (3 cores and beyond) and requires a lot of transistors, the CELL is smaller and cooler and scales pretty well. So eventually there is a cost advantage. Maybe somebody has some links (and pictures), but the CELL is a pretty small CPU compared to multicores. Perfect for a gaming console.

I have yet to finish my very first attempts to really take advantage of the SPUs. I have, however, implemented a (pure C) PNG decoder (just for fun), and that very C code compiles straight on the SPU as well as it does on the PPU (except for additional IO code to read and write a pixel line). Not very efficient, but you do not need to do SIMD optimization to get code running in the first place. This allows you to implement and test an algorithm, port it to the SPU and write the sync framework in a first step. Later you can start optimizing the code for the SPU.
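To make the "same C kernel, different IO code" idea concrete, here is a small sketch. It is not the poster's decoder: it only shows one PNG step (undoing the "Sub" filter on a scanline) written as plain C, with line reads and writes isolated behind two function pointers. The names `unfilter_sub`, `unfilter_image`, `mem_image`, `mem_read_line` and `mem_write_line` are hypothetical, and the memory-backed IO hooks are just the simplest possible stand-in for whatever transfer code a real target would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Undo PNG's "Sub" filter in place on one scanline: each byte gets the
 * byte `bpp` positions earlier added to it (modulo 256). */
static void unfilter_sub(uint8_t *line, size_t len, size_t bpp)
{
    assert(bpp >= 1);
    for (size_t i = bpp; i < len; i++)
        line[i] = (uint8_t)(line[i] + line[i - bpp]);
}

/* IO hooks: the only part that would differ between targets. */
typedef void (*read_line_fn)(void *ctx, size_t row, uint8_t *dst, size_t len);
typedef void (*write_line_fn)(void *ctx, size_t row, const uint8_t *src,
                              size_t len);

/* Process an image one line at a time through a small line buffer. */
static void unfilter_image(void *ctx, size_t rows, size_t len, size_t bpp,
                           uint8_t *line_buf,
                           read_line_fn read_line, write_line_fn write_line)
{
    for (size_t row = 0; row < rows; row++) {
        read_line(ctx, row, line_buf, len);
        unfilter_sub(line_buf, len, bpp);
        write_line(ctx, row, line_buf, len);
    }
}

/* Simplest IO implementation: the image already sits in ordinary memory. */
typedef struct {
    uint8_t *pixels;
    size_t   stride;   /* bytes between the starts of consecutive rows */
} mem_image;

static void mem_read_line(void *ctx, size_t row, uint8_t *dst, size_t len)
{
    const mem_image *img = ctx;
    memcpy(dst, img->pixels + row * img->stride, len);
}

static void mem_write_line(void *ctx, size_t row, const uint8_t *src,
                           size_t len)
{
    mem_image *img = ctx;
    memcpy(img->pixels + row * img->stride, src, len);
}
```

The kernel never touches the image directly, only the line buffer, so swapping the two hooks for a different transfer mechanism would leave `unfilter_sub` untouched. That separation is the point of the sketch; no SIMD optimization is attempted.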

emaddox84 4011d ago (Edited 4011d ago)

Thanks for your input, I found your post interesting. I didn't really expect to find someone else replying to me who knew what they were talking about, haha.

Anyhow, you do make some good points. However, I still hold my belief that the Core processors are not true parallel systems. Yes, the chips actually have separate execution cores, but they are still programmed, for the most part, sequentially. And like you said, the cores do not have their own cache, and the instruction window does most of the work load division for the cores on the chip. The potential performance gain in this architecture is nowhere near as great as with processors like the Cell, where the SPU work loads are generally planned by people, as opposed to being determined after the fact, as best it can, by a chip. The division of execution is a reaction, not a preemptive effort. Now, I do realize that OSes do divide threads for execution, but usually that's when you have 2 or more separate programs running at once. That's fine and all, but I'm concerned with better parallel execution of single programs, particularly games in this discussion.

An analogy off the top of my head: it is as if an event planner organizes a series of events (some depending on others) sequentially. Then someone down the line realizes some of these events can be done in parallel and parallelizes them as best they can. Had the planner instead designed the events to be more parallel from the beginning, all the events could probably have finished much sooner.
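The event planner analogy can be put into numbers with a small worked example. The sketch below compares the total time when events run strictly one after another with the finish time when every event starts as soon as the events it depends on are done. The `event` struct and both function names are invented for this illustration, and it assumes there are always enough free workers, which real hardware does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 4   /* illustrative limit on dependencies per event */

/* One event: how long it takes and which earlier events it waits for. */
typedef struct {
    int    duration;
    size_t num_deps;
    size_t deps[MAX_DEPS];   /* indices of earlier events */
} event;

/* Total time if every event runs one after another. */
static int sequential_time(const event *events, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += events[i].duration;
    return total;
}

/* Finish time if each event starts as soon as its dependencies are done
 * (unlimited workers assumed). Events must be listed so that dependencies
 * come first. finish[] is caller-provided scratch space of n entries. */
static int parallel_time(const event *events, size_t n, int *finish)
{
    int latest = 0;
    for (size_t i = 0; i < n; i++) {
        int start = 0;
        assert(events[i].num_deps <= MAX_DEPS);
        for (size_t d = 0; d < events[i].num_deps; d++) {
            size_t dep = events[i].deps[d];
            assert(dep < i);             /* dependencies precede the event */
            if (finish[dep] > start)
                start = finish[dep];
        }
        finish[i] = start + events[i].duration;
        if (finish[i] > latest)
            latest = finish[i];
    }
    return latest;
}
```

For example, three independent events of 3 time units each, followed by a 1-unit event that depends on all three, take 10 units sequentially but only 4 when planned in parallel from the start.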

Here is a good article I read that you will probably be interested in:

Note how it says, "The end result is that each SPE is like a very small vector computer, with its own "CPU" and RAM."

My take is that the software must balance the work load between the SPUs, whether through parallel threading, task-specific SPUs, or a combination of both. It is not done for you through analysis of instructions in an instruction window on the CPU, which, as the above article states, can also become a problem for memory-to-CPU latency. My whole argument was that the Cell is more like having separate CPUs with their own cache, and without the assistance of an instruction window. If you don't program for it, you might as well only have the one PowerPC core.

I guess in the end what I was trying to say was, programming for the Cell is more like programming for systems before the multicore processors existed on the market -- when research labs in colleges had multiprocessor computers with many boards in them, each having its own CPU and memory. And don't get me wrong, I'm not saying multicore processors are bad; they are easy to program for and do have performance increases over traditional single core processors. I just like the Cell for its potential and the change it may bring about in the software community.

Ju 4011d ago (Edited 4011d ago)

I am not sure I understand your "instruction window does most of the work load division". No, on a multicore there is no HW which does that. I think what you are referring to is HW threading within the CPU (like even the CELL does within the PPU core). No, multicores are more like two separate entities, really like 2 (or more) CPUs. The only difference between previous SMP systems (symmetric multi-processor systems) and multicore is that the multiple cores are on one die and the caches are shared (whereas in the other case each CPU is physically separate and has separate caches, plus some CPU-to-CPU interconnects).

No, if you spawn two or more threads on a multicore, they are running in parallel. They really are. However, usually the load is balanced through the OS, not the application. Also, a thread can be shifted from one core to another on the fly while preempted. In that sense it is simply not deterministic, which the CELL (and the SPU) is, simply because it is 100% SW controlled.

Say, on multicore, you would do geometry processing in one thread, and pixel pushing (shading, rasterizing or what not) in another. It would be very much appreciated if both threads stayed on their own dedicated HW (CPU), because it simply doesn't make sense to have sequential execution on one core in this case. Even worse, if you partition the rendering into tiles, it would be nice to let each tile render at the same time, guaranteed. That is easily possible on the CELL, but needs special thread priorities or locking on a multicore (I don't even know if that's implemented in the OSs out now, or whether it would work in pthreads, e.g.).

Your analogy with SPUs being separate vector computers is pretty much OK with me. However, they have no real IO except the interconnects within the CELL (the DMA engine) and a HW messaging system (the mailboxes).

But OTOH they are also very close to real multicore CPUs. Having their own RAM has its benefits but is also a challenge. For example, even code needs to reside in that very memory, which limits the executable size to 256KB and, obviously, allows no jumps outside those bounds. There is no loading from outside the cache (well, this SRAM is kind of cache memory and registers at once. The SPUs don't have registers, but you could see the whole 256KB as registers).

Insomniac has solved the code size problem through what they call "code streaming", described in an article which basically states "code is just data": the code is streamed from XDRAM when it is needed (like a SW driven cache).
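The "SW driven cache" idea can be sketched in a few lines. This is not Insomniac's implementation and uses no Cell SDK calls: it is a generic least-recently-used cache written for illustration, where fixed-size chunks are copied with `memcpy` from a large "main memory" array into a few local slots on demand. The names `code_cache`, `cache_init` and `cache_fetch`, the slot count and the chunk size are all assumptions, and the chunks are treated purely as bytes (nothing is executed).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 2    /* how many chunks fit in local memory (illustrative) */
#define CHUNK_SIZE 64   /* bytes per chunk (illustrative) */

/* A tiny software-managed cache: chunks are copied in from main memory
 * on demand, and the least recently used slot is evicted when full. */
typedef struct {
    unsigned char bytes[SLOT_COUNT][CHUNK_SIZE];
    int           chunk_id[SLOT_COUNT];    /* -1 means empty */
    unsigned long last_used[SLOT_COUNT];
    unsigned long clock;
    unsigned long hits, misses;
} code_cache;

static void cache_init(code_cache *c)
{
    memset(c, 0, sizeof *c);
    for (int s = 0; s < SLOT_COUNT; s++)
        c->chunk_id[s] = -1;
}

/* Return a pointer to chunk `id` in local memory, streaming it in from
 * `main_mem` (an array of CHUNK_SIZE-byte chunks) if it is not resident. */
static const unsigned char *cache_fetch(code_cache *c,
                                        const unsigned char *main_mem,
                                        int id)
{
    int victim = 0;
    assert(id >= 0);
    c->clock++;
    for (int s = 0; s < SLOT_COUNT; s++) {
        if (c->chunk_id[s] == id) {            /* already resident: a hit */
            c->hits++;
            c->last_used[s] = c->clock;
            return c->bytes[s];
        }
        if (c->last_used[s] < c->last_used[victim])
            victim = s;                         /* remember the oldest slot */
    }
    /* Miss: overwrite the oldest slot with the requested chunk. */
    c->misses++;
    memcpy(c->bytes[victim], main_mem + (size_t)id * CHUNK_SIZE, CHUNK_SIZE);
    c->chunk_id[victim]  = id;
    c->last_used[victim] = c->clock;
    return c->bytes[victim];
}
```

A caller asks for a chunk by id before using it and never assumes a previously returned pointer is still valid after a later fetch, since the slot may have been reused.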

The cache analogy is not 100% accurate. Usually caches have some logic which assures data integrity across their "clients": say, in the case of 2 cores, if one core fetches data from main mem and stores it in the cache, the second core would "see" this data automatically (coherent cache). Not so on the CELL. Each SPU has to take care that its contents are in sync with the rest of the system. (Well, there are also non-coherent cache systems out in the wild, I think ARM is one of them, and then it's also a chipset feature, not just a CPU feature).
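What "each SPU has to take care that its contents are in sync" means in practice can be shown with a toy model. In the sketch below, a private buffer only changes when the program explicitly copies data in or out; plain `memcpy` stands in for the DMA transfers a real SPU program would issue. `local_store`, `local_get` and `local_put` are made-up names for this example, not SDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_SIZE 16   /* illustrative capacity, in ints */

/* Private scratch memory of one worker. Nothing keeps it in sync with
 * main memory automatically. */
typedef struct {
    int data[LOCAL_SIZE];
} local_store;

/* Copy `count` ints from main memory into the local store ("get"). */
static void local_get(local_store *ls, const int *main_mem, size_t count)
{
    assert(count <= LOCAL_SIZE);
    memcpy(ls->data, main_mem, count * sizeof(int));
}

/* Copy `count` ints from the local store back to main memory ("put"). */
static void local_put(const local_store *ls, int *main_mem, size_t count)
{
    assert(count <= LOCAL_SIZE);
    memcpy(main_mem, ls->data, count * sizeof(int));
}
```

If main memory changes after a `local_get`, the local copy stays stale until the next explicit `local_get`; likewise, local modifications are invisible to everyone else until a `local_put`. A coherent cache would hide both steps from the programmer.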

Also, some libraries try to implement a preemptive code scheduler for SPUs. Nice thought; however, because the SPU does not have registers, it needs a well thought out (static?) memory layout within the SRAM. E.g. when preempting a thread, you might want to save the current thread state (this is a context switch). Now, because the SRAM (considered as registers) is so big, it does not make sense to put the whole mem onto the stack, but probably a subset. What you end up with is basically a SW driven CPU, which loads code and data transparently on the fly when the SPU kernel needs to re-schedule a thread.

But all that is pretty interesting, IMO. Something to play with. To explore, to learn. Somebody came up with a completely new idea. Sure, this is challenging. But the basics haven't really changed. If you want to program for performance, you gotta understand the basics. Even VB or .Net or C# or what not should not hide that away (sure, for some applications, why not). I'm happy that somebody found the courage to break out of plain mainstream thinking. How boring would that be. (Eventually this will happen, though, I fear).
