DeepSeek Proves It: Open Source is the Secret to Dominating Tech Markets (and Wall Street has it wrong).
-
The training corpus of these large models seems to be “the internet, YOLO”. It’s fine for them to download every book and paper under the sun, but if a normal person does it...
Believe it or not:
-
But then people would realize that you got copyrighted material and stuff from pirate websites...
-
Sounds legit
-
I wouldn’t call it the accepted terminology at all. Just because some rich assholes try to will it into existence doesn’t mean we have to accept it.
-
We already have all the evidence. This isn’t some developing story, the paper is reproducible. What’s dehumanizing is assuming that Asians can’t make good software.
-
the accepted terminology
No, it isn't. The OSI specifically requires that the training data be available, or at the very least that the source and fee for the data be given, so that a user could get the same copy themselves. That's the purpose of something being "open source". Open source doesn't just mean free to download and use.
https://opensource.org/ai/open-source-ai-definition
Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.
In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
As per their paper, DeepSeek R1 required a very specific training data set, because when they tried the same technique with less curated data, they got R1-Zero, which basically ran fast and spat out a gibberish salad of English, Chinese and Python.
People are calling DeepSeek open source purely because they called themselves open source, but they seem to be just another free-to-download, black-box model. The best comparison is to Meta's Llama, which weirdly nobody has decided is going to upend the tech industry.
In reality, "open source" is a terrible term here. It's a very loose fit for what people are basically trying to say: that anyone could recreate or modify the model because they have the exact "recipe".
-
Well, maybe. Apparently some folks are already doing that, but it's not done yet. Let's wait for the results. If everything is legit, we should have not one but plenty of similar and better models in the near future. If the Chinese did this with 100 chips, imagine what can be done with the 100,000 chips that Nvidia can sell to a US company.
-
the accepted terminology nowadays
Let's just redefine existing concepts to mean things that are more palatable to corporate control, why don't we?
If you don't have the ability to build it yourself, it's not open source. DeepSeek is "freeware" at best. And that's to say nothing of what the data is, where it comes from, and the legal ramifications of using it.
-
Snowden really proved he wasn't a Russian spy when he (checks notes) immediately fled to Russia with troves of American secrets...
-