DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead
-
[email protected] replied to [email protected] last edited by
It's already happening. This article takes a long look at many of the rising threats to nvidia. Some highlights:
-
Google has been running on their own homemade TPUs (tensor processing units) for years, and say they're on the 6th generation of those.
-
Some AI researchers are building an entirely AMD-based stack from scratch, essentially writing their own drivers and utilities to make it happen.
-
Cerebras.ai is creating their own AI chips using a unique wafer-scale design. They make an AI chip the size of an entire silicon wafer (roughly 30 cm square) with 900,000 cores.
So yeah, it's not just "China AI bad" but that the entire market is catching up and innovating around nvidia's monopoly.
-
-
[email protected] replied to [email protected] last edited by
Yeah I'd like to see size comparisons too. The cuda stack is massive.
-
PTX also removes NVIDIA lock-in.
-
[email protected] replied to [email protected] last edited by
Kind of the opposite, actually. PTX is in essence nvidia-specific assembly, just like how arm or x86_64 assembly is tied to arm or x86_64.
At least with cuda there are efforts like zluda. Cuda is more like what objective-c was on the mac: basically tied to the platform, but at least you could write a compiler for another target in theory.
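To make the lock-in point concrete, here's a minimal sketch (my own toy example, not DeepSeek's code, and it assumes nvcc and an NVIDIA GPU) of inline PTX inside a CUDA kernel. The asm() body is raw PTX, which only NVIDIA's toolchain and hardware understand, so code written this way is tied to nvidia the same way hand-written x86_64 assembly is tied to x86_64:

```cuda
// Toy example: inline PTX in a CUDA kernel. The asm() string is PTX and
// compiles only for NVIDIA targets; the portable CUDA C++ equivalent
// would simply be r = v + 1.
#include <cstdio>

__global__ void add_one(int *data) {
    int v = data[threadIdx.x];
    int r;
    // PTX "add.s32" instruction: r = v + 1
    asm("add.s32 %0, %1, 1;" : "=r"(r) : "r"(v));
    data[threadIdx.x] = r;
}

int main() {
    int h[4] = {0, 1, 2, 3};
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    add_one<<<1, 4>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);  // expect: 1 2 3 4
    return 0;
}
```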
-
[email protected] replied to [email protected] last edited by
Eh, even for many console games it's not optimised that much.
Check out Kaze Emanuar's (& co.) rewrite of the N64's Super Mario 64 engine. He's now building an entirely new game on top of that engine, and it looks considerably better than SM64 did and runs at twice the FPS on original hardware.
But you're probably right that today it happens even less than before.
-
[email protected] replied to [email protected] last edited by
Ah, I'd hoped it was cross-platform, more like OpenCL. Thinking about it, a lower-level language would be more platform-specific.
-
[email protected] replied to [email protected] last edited by
That disregards the massive advancement in technology, hindsight, tooling and theory they can make use of now. There is a world of difference there even with the same hardware. So not comparable imo, it wasn't for a lack of effort on Nintendo's part.
-
[email protected] replied to [email protected] last edited by
IIRC Zluda does support compiling PTX. My understanding is that this is part of why Intel and AMD eventually didn't want to support it - it's not a great idea to tie yourself to someone else's architecture that you have no control over or license to.
OTOH, CUDA itself is just a set of APIs and their implementations on NVIDIA GPUs. Other companies can re-implement them. AMD has already done this with HIP.
-
[email protected] replied to [email protected] last edited by
Wtf, this is literally the opposite of true. PTX is nvidia only.
-
[email protected] replied to [email protected] last edited by
A substantial part of the optimisation was simply not compiling as a debug target. There were plenty of oversights by Nintendo devs (not to discredit all they've accomplished here). And Kaze developed most of the tooling for this himself (because who else develops for the N64?).
It's mostly the result of a couple of really clever and passionate people actually taking it apart to a very low level. Nintendo absolutely could have done most of these optimisations themselves; they don't really rely on many newly discovered techniques or anything. Still, they had deadlines of course, which Kaze & Co. don't.
-
[email protected] replied to [email protected] last edited by
Google was giving me bad search results about PTX, so I just posted an opinion and hoped Cunningham's Law would work.
-
[email protected] replied to [email protected] last edited by
How cunning.
-
What I'm curious to see is how well these types of modifications scale with compute. DeepSeek is restricted to H800s instead of H100s or H200s. These are gimped cards designed to get around export controls, and accordingly they have lower memory bandwidth (~2 vs ~3 TB/s) and, most notably, much slower GPU-to-GPU communication (something like 400 GB/s vs 900 GB/s). The specific reason they used PTX in this application was to help alleviate some of the bottlenecks due to the limited inter-GPU bandwidth, so I wonder whether that would still improve performance on H100 and H200 GPUs.
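As a rough illustration (plain back-of-envelope host code using the approximate figures above, not measured values, and a made-up 10 GB per-step payload), here's what that interconnect gap looks like in wall-clock terms:

```cuda
// Back-of-envelope sketch of the inter-GPU bandwidth gap. Numbers are the
// rough figures quoted above (~400 GB/s H800 vs ~900 GB/s H100), and the
// payload size is purely hypothetical.
#include <cstdio>

int main() {
    const double payload_gb = 10.0;   // hypothetical per-step GPU-to-GPU exchange
    const double h800_gbps  = 400.0;  // approximate H800 interconnect bandwidth
    const double h100_gbps  = 900.0;  // approximate H100 interconnect bandwidth

    printf("H800: %.1f ms per exchange\n", payload_gb / h800_gbps * 1000.0);  // ~25 ms
    printf("H100: %.1f ms per exchange\n", payload_gb / h100_gbps * 1000.0);  // ~11 ms
    return 0;
}
```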
-
[email protected] replied to [email protected] last edited by
I think the thing that Jensen is getting at is that CUDA is merely a set of APIs. Other hardware manufacturers can re-implement the CUDA APIs if they really wanted to (especially since, AFAIK, Google v. Oracle established that re-implementing an API can be fair use). In fact, AMD's HIP implements many of the same APIs as CUDA, and they ship a tool (HIPIFY) to convert code written for CUDA over to HIP.
Of course, this does not guarantee that code originally written for CUDA is going to perform well on other accelerators, since it likely was implemented with NVIDIA's compute model in mind.
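A minimal sketch of that point, assuming the standard CUDA runtime API (this is a generic vector add I wrote for illustration, nothing from the article): the calls HIP re-implements map almost one-to-one onto CUDA's, which is what lets HIPIFY do a mostly mechanical rename.

```cuda
// Generic vector add using the CUDA runtime API. Comments note the HIP
// equivalents that hipify-style conversion would substitute.
// Error checking omitted for brevity.
#include <cuda_runtime.h>   // HIP equivalent: <hip/hip_runtime.h>
#include <cstdio>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // HIP: hipMallocManaged
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // HIP supports the same launch syntax
    cudaDeviceSynchronize();        // HIP: hipDeviceSynchronize

    printf("c[0] = %f\n", c[0]);    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);  // HIP: hipFree
    return 0;
}
```

The API names translate mechanically, but as noted above, performance tuning (block sizes, memory access patterns) is still shaped by NVIDIA's compute model.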
-
[email protected] replied to [email protected] last edited by
Part of this was an optimization that was necessary due to their resource restrictions. Chinese firms can only purchase H800 GPUs instead of H200 or H100. These have much slower inter-GPU communication (less than half the bandwidth!) as a result of export bans by the US government, so this optimization was done to try and alleviate some of that bottleneck. It's unclear to me if this type of optimization would make as big of a difference for a lab using H100s/H200s; my guess is that it probably matters less.