Wednesday, June 20, 2007

nVidia Tesla: let's see some benchmarks first

nVidia announced their Tesla GPU platform today.

Stats:

  • 4 GPUs in a 1U rack-mount unit
  • Recommends 1 CPU per GPU
  • 800 watts at max load

That last stat should scare the crap out of anyone who imagines building a full-height rack (42U) out of these. Ask your datacenter guys how excited they'd be to put in a rack that draws 280 amps. That's double what a rack of dual quad-core Clovertown Xeons will draw at full load.
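To spell out the arithmetic behind that amp figure (assuming a 120 V circuit; even at 208 V you're looking at roughly 160 amps):

$$42 \times 800\,\mathrm{W} = 33{,}600\,\mathrm{W}, \qquad \frac{33{,}600\,\mathrm{W}}{120\,\mathrm{V}} = 280\,\mathrm{A}$$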

They also cost $12000 per 1U, according to Gizmodo I think. A rack? A cool half mil (retail). They'll throw in a free nVidia hat, I'm sure.

If you buy these, you'd better be damn sure you can get orders of magnitude better performance on the GPU with your algorithm. 3x the performance for 3x the price won't make it worth the headaches.

I've long been a skeptic of general-purpose computing on the GPU, mostly because of the impracticality of the hardware itself in a farm. When nVidia first approached a company I was working for in 2003, we politely listened to the GPU computation pitch but knew it wasn't workable. We had 1,000 machines in the renderfarm; how could we justify the $1,000,000 expense of putting a Quadro FX GPU in every one of them, much less the extra A/C to deal with the heat?

And that was entirely beside the point that our rendering problems at the time were memory-bound. We usually could not render two main characters in the same pass. (Sadly, this was just before AMD came out with x64. We could really have used that on that project.)

Finally, when 300 people are trying to get their stuff rendered, the perceived latency of the farm is related to the quantity of machines, not the speed of any individual machine. Having a large swath of your frames rendering feels better than knowing the frames that have been waiting for 3 hours will eventually run on really fast machines. Either way, it's hard to justify doubling (or tripling!) the price of a node for an individual machine's speed increase (unless rackspace and/or power are the problem).

And now, getting down to the nitty-gritty... where are the benchmarks for Tesla? I found some for the G80, but nothing for Tesla itself, and nVidia has written an entire GPU-accelerated 3D renderer. You'd think they would just fire up Gelato on one of these Tesla machines and blow our minds with the performance, right?

It seems that nVidia is trying to sell these machines to industry (not just research) by appealing to the financial sector. They've put out a bunch of stats showing they can compute billions of Black-Scholes prices per second using their CUDA toolkit on the G80. An example fed by random number generation is great, but how about shuttling that amount of data in from the network, disk, or even RAM? It doesn't matter how fast the GPU can compute the stuff when your bottleneck is the speed at which you can get option quotes from the CBOT.
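For context on why that Black-Scholes number is so flattering: every option is priced independently, there's no interaction between threads, and the inputs are just random numbers already sitting in GPU memory. A rough sketch of what that kind of kernel looks like (not nVidia's actual CUDA SDK sample; the parameter values are made up):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Standard normal CDF via the complementary error function.
__device__ float normal_cdf(float x) {
    return 0.5f * erfcf(-x * 0.70710678f);   // 0.70710678 ~= 1/sqrt(2)
}

// One thread prices one European call option. No I/O, no thread
// interaction -- which is exactly why it benchmarks so well.
__global__ void black_scholes_call(const float *S, const float *K, const float *T,
                                   float r, float sigma, float *price, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i]) / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    price[i] = S[i] * normal_cdf(d1) - K[i] * expf(-r * T[i]) * normal_cdf(d2);
}

int main() {
    const int n = 1 << 20;  // a million options
    std::vector<float> S(n, 100.0f), K(n, 95.0f), T(n, 0.5f), out(n);

    float *dS, *dK, *dT, *dOut;
    cudaMalloc((void**)&dS, n * sizeof(float));
    cudaMalloc((void**)&dK, n * sizeof(float));
    cudaMalloc((void**)&dT, n * sizeof(float));
    cudaMalloc((void**)&dOut, n * sizeof(float));
    cudaMemcpy(dS, S.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dK, K.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dT, T.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    black_scholes_call<<<(n + 255) / 256, 256>>>(dS, dK, dT, 0.05f, 0.2f, dOut, n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("first price: %f\n", out[0]);
    cudaFree(dS); cudaFree(dK); cudaFree(dT); cudaFree(dOut);
    return 0;
}
```

The only data movement is the one memcpy onto the card, so the kernel never waits on anything. That says nothing about a pipeline that has to pull quotes off a network or a disk first.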

There are crazy number crunchers who can use this, but I'm pretty sure that the overwhelming majority of the time, someone will be better off with x86 (or x64) in terms of cost, power, and ease of programming. And let's not forget about hardware reliability and ease of replacement. What's the MTBF? Is nVidia prepared to service one of these GPU compute servers within two hours, like the support Dell, IBM and HP offer on plain old x86 units?

7 comments:

Turcca said...

I don't think you're considering this from a very fruitful perspective. Let's say you put one Tesla GPU card in your computer ($1,500) and thereby increase its floating-point throughput tenfold.

And what was that about render farms benefiting from quantity of computers, not from the speed of individual machines? Clearly it's the net capacity of the entire render farm that counts: the GFlops of a machine times the quantity of machines in the network.

If you have computers in a rack that do 50 GFlops and cost $2,000 each, then by adding 75% to the cost of each computer you potentially add 1000% to the GFlops.

But let's see the benchmarks first, say for Mental Ray rendering. How well does it scale?

Trimbo said...

Ha. I love that bringing up all of the practical problems of having a render farm of Teslas is not a "fruitful perspective". Oh well, such is the internet.

Are you astroturfing? You talk about 1000% more GFlops -- and I Googled for you and noticed you give the same stat on Tom's Hardware -- but then you conclude your comment with exactly the point of my whole post: before claiming that Tesla is some great speed demon that justifies using it in render farms, we need to see real benchmarks. When I wrote this post, nVidia's benchmarks were contrived examples with zero CPU or bus overhead (e.g. their Black-Scholes benchmark). Waxing poetic about 10x performance isn't worth a ham sandwich without the box in people's hands.

Here's the kicker: nVidia has Gelato... so why haven't we seen that benchmark? They should have been able to compare it to a quad-core Intel chip months ago. I suspect it's because the numbers don't justify the price. We'll see.

As far as quantity of computers vs. individual machine speed goes... I'm not sure if you've worked with a large farm before, but this has to do with the social dynamics of a large render farm under high demand from dozens (or hundreds) of people.

Let's say you have ten people and ten machines, and then you replace those ten machines with one Tesla that, on paper, has the same performance. In the real world, you've slowed down production. Before, you had ten renders going for ten people. Now, you have one render going for one person. The rendering latency just got much, much higher for the other nine people, who are waiting on one machine to finish someone else's complete job before they see their first frame.
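To put toy numbers on that (assuming the Tesla node is 10x faster per frame, jobs run whole and first-come-first-served, each job has $F$ frames, and a frame takes $t$ on one of the old machines):

$$t_{\text{first frame, ten machines}} = t, \qquad t_{\text{first frame, one Tesla, last in queue}} = 9 \cdot \frac{F\,t}{10} + \frac{t}{10}$$

With $F = 10$ frames per job, the last person in the queue waits $9.1\,t$ for their first frame instead of $t$, even though the total throughput on paper is identical.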

Why does this matter, since it's the same total computing time anyway? Well, bad renders, for one. Unless someone is a 3DSlacker™, they'll be checking the render as soon as results start coming back, which is how a bad render gets caught early.

Another reason is that machines go bad. If the one Tesla you bought dies, then you have ten people who can't work. Just like Google uses lots and lots of servers and doesn't care when one goes down, I like farms of lots of machines with the cheapest high performance you can get. If one dies, production barely sees a blip.

10 people is a number I came up with just to make this easy. Try this with 1,000 or 2,000 people working on 5 or 10 different projects, like ILM.

So let's sum up some of the issues even if the price/performance comes out better:

* Latency of seeing results for a large number of users, as outlined above.

* Service and support. What is nVidia's support going to be like? Do they have Dell or HP-like support? Do they have 2 hour support, 24 hour support?

* The power requirements! A rack of these will take an absurd amount of power that no data center can currently support.

* Added coding overhead.

* Vendor lock-in.

On the flip side:

* If the performance really is great and you can justify the proportionally higher expense, one huge benefit is that you need less beefy networking and file servers. My last job with a big farm had, I think, 9 or 11 NetApps serving all of the data to the farm. If the farm were cut in half, that would mean less scaling required on the filer and network infrastructure side.

So... I see the promise there. I've been wanting to use the GPU more for years, but first nVidia needs to justify these things with real-world numbers from Gelato or Mental Ray. They might not be putting those numbers out there because I don't think Mental Ray users are the target. Given the Black-Scholes benchmark, it seems they're targeting financial companies. Before the credit crisis, that probably looked really good as a market to open up. Post-crisis, it might not look so promising. Maybe they'll change direction.

Either way, we'll find out next month, when these are due for release according to their website. I'm excited to see what they can do. A friend of mine used CUDA with his 8800 to get a great performance boost on something he was working on. The strategy of using the GPU for general computing does work -- now it's time to start getting an idea of which applications justify Tesla.

SuezanneC Baskerville said...

Anyone know of any benchmarks at all for the Tesla on any applications that run on normal computers?

Video game benchmarks, spreadsheet benchmarks, synthetic benchmarks -- are there any results to compare?

Trimbo said...

Hey Suezanne, for video games and spreadsheets, Tesla isn't likely to help much. Video games are designed for consumer GPUs like the GeForce, so they're not likely to be doing tons of computation on the GPU that isn't framebuffer-related. Some games may do more of it than others, but probably not enough for Tesla to make a difference. For spreadsheets, it would require writing all-new code to use the GPU from Excel, and I'm not sure anyone has done that yet.

There are some artificial benchmarks like the Black-Scholes I mentioned in my article.

GPU acceleration can be pretty awesome. But Tesla is kind of a lame attempt by nVidia to stave off the inevitable: stream processing ending up on the same die as the CPU. I guess the Cell processor could already be considered one of these, and high-performance computing projects like Roadrunner have used it.

AMD is attempting to do this with their Fusion product. Intel is working on a GPU called Larrabee, but I'm not sure what their plans are for merging that tech onto the main die of the CPU. Supposedly it's going to be done in the current generation from Intel (so within the next 18 months), but I kind of doubt that.

sciencectn said...

How's this for a benchmark?

http://www.elcomsoft.com/eprb.html

Scroll down a little bit, and you'll find a graph comparing password hashing speeds on different GPUs and CPUs.

Trimbo said...

Hey sciencectn, thanks for pointing that out. Once again, though, this is a very limited use of the GPU. Password cracking is almost as artificial as the Black-Scholes benchmark that nVidia uses to tout Tesla.

Also, we have no idea if their code on the Intel processor is multithreaded or uses SSE.

Two years after I first wrote this post, I still believe we need to see something meaningful like H.264 encoding or non-realtime 3D rendering. Again, if Tesla is so great, why isn't nVidia providing benchmarks with their own Gelato renderer? The fact that this blog post is the top result in Google for "tesla gelato benchmark" should tell you something (2 years later!).

BTW, password cracking on the GPU is patented by these guys? That would be funny if it weren't so sad. Patent abuse +1.

sciencectn said...

"Also, we have no idea if their code on the Intel processor is multithreaded or uses SSE."

Well, I'm not sure which they use, but rest assured that the password hashing done on the CPU is as efficient as possible. Only about 4 of the 20 different hash types this software can crack are done on the GPU, so with most of the hashing still on the CPU, I'd assume that side is coded as efficiently as possible.

You do have a point about Nvidia being completely silent on benchmarks. They have videos of software engineers talking about it on their website, but no real benchmarks! It makes no sense.

And have they actually patented GPU password cracking? Because you can't really patent a concept, but you can patent a particular implementation of a concept. It's like trying to patent the idea of using a calculator to play games: you can't really patent that, because there are lots of possible implementations. But I'm not a lawyer, so I don't really know.