Seeking Deep: DeepSeek and the Saga of a Shattered Entry Barrier in the World of Artificial Intelligence

DeepSeek has captured the imagination of the AI world and sparked conversations that simply were not happening a few weeks ago.

OpenAI, Google, Meta, and Anthropic have all built their foundational models after spending hundreds of millions of dollars on training with state-of-the-art hardware (such as H100 GPUs), resulting in huge valuations (OpenAI at more than $150Bn) and unprecedented stock gains (Nvidia), with some players already realizing billions in revenue. The $500Bn Stargate project talks of increasing computational power with a city-sized power grid to support it, and everyone started believing that more is better.

Several Chinese companies have previously released AI models but failed to achieve performance on par with their American counterparts. DeepSeek launched its V3 model in late December 2024 and its R1 model in January 2025, and in a short period its assistant became the top-rated AI app on the App Store.

About DeepSeek

As per their paper, DeepSeek invested less than $6Mn in training the V3 model on 2,048 H800 GPUs over a little less than two months. The model surpasses several leading AI models and performs on par with the best, including proprietary ones, on benchmarks such as the American Invitational Mathematics Examination (AIME) 2024 and the Chatbot Arena leaderboard from UC Berkeley, among others, while providing API service at roughly 90% lower cost than OpenAI.

DeepSeek has far-reaching implications for the AI industry, making it a hot topic of discussion. It can impact the entire AI ecosystem, which consists of firms building foundational models, firms consuming those models, service partners, and end consumers. DeepSeek is a milestone in the evolutionary timeline of artificial intelligence, not because of frugality, security standards, organized censored outputs, or geopolitical implications, but because of the innovative optimization techniques it has employed to achieve more with less. At Pythian, this reinforces a value we have held for decades: optimization and tuning techniques are always better than throwing infinite resources at technology problems. First, let's look at the optimization strategies DeepSeek has devised to achieve these outstanding results.

It uses a mixture-of-experts (MoE) architecture to reduce computing costs during pre-training. This methodology, which has been used by several others, including Mistral AI's Mixtral 8x7B model, forms the base of the V3 model. Each "expert" is a neural network, typically a feed-forward network, though it can be more complex. A router sends each token to a particular subset of experts instead of passing every token through all of them. DeepSeek-V3 has 671B total parameters but activates only 37B per token thanks to its MoE architecture.
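To make the idea concrete, here is a minimal routing sketch in PyTorch. It is illustrative only (tiny dimensions, a naive loop over experts) and not DeepSeek's implementation: a router scores the experts for every token, the top-k experts process that token, and their weighted outputs are combined, so only a fraction of the total parameters is active per token.

```python
# Minimal mixture-of-experts sketch (illustrative; not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One 'expert': a small feed-forward network."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.ff(x)

class SimpleMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)    # scores every expert for each token
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                     # each token visits only its chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)        # 10 tokens with a 64-dim hidden state
print(SimpleMoE()(tokens).shape)    # torch.Size([10, 64])
```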

DeepSeek introduced an auxiliary-loss-free load-balancing strategy, efficiently routing tokens among experts depending on their characteristics while keeping the experts evenly loaded. This enhances efficiency and reduces computational costs. V3 also utilizes FP8 mixed-precision training to improve the efficiency and speed of pre-training. This reduces memory consumption compared to FP16 training and, when combined with other techniques such as MTP (multi-token prediction, discussed below), does not degrade the model's performance.
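The sketch below captures the spirit of that bias-based balancing as we read it from the paper; the update rule, rate, and sizes are simplified assumptions, not DeepSeek's training code. A per-expert bias is added to the routing scores only when selecting which experts receive a token, and the bias is nudged down for overloaded experts and up for underloaded ones, so traffic evens out without adding an auxiliary loss term.

```python
# Hedged sketch of auxiliary-loss-free load balancing via per-expert biases.
import torch

n_experts, top_k, update_rate = 8, 2, 0.01
bias = torch.zeros(n_experts)   # adjusted online from observed load, not learned by backprop

def route(scores: torch.Tensor):
    """scores: (tokens, n_experts) raw router outputs."""
    global bias
    # selection uses biased scores; gating weights would still use the original scores
    _, idx = (scores + bias).topk(top_k, dim=-1)
    load = torch.zeros(n_experts)
    load.scatter_add_(0, idx.flatten(), torch.ones(idx.numel()))
    target = idx.numel() / n_experts                        # ideal tokens per expert
    bias = bias - update_rate * torch.sign(load - target)   # nudge toward balance
    return idx, load

scores = torch.randn(32, n_experts)
idx, load = route(scores)
print("tokens per expert:", load.tolist())
```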

Another mechanism it utilizes is multi-head latent attention (MLA) for inference. This is a novel way to address a common GPU challenge: computation is fast, but limits on communication speed can create bottlenecks when moving data between different parts of the GPU. MLA compresses the per-token memory footprint known as the KV cache, improving inference speed. DeepSeek also utilizes multi-token prediction (MTP): instead of predicting only the next token from the preceding ones, the model predicts the subsequent n tokens, which has improved its performance.
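The rough sketch below shows why caching one small latent vector per token, instead of full per-head keys and values, shrinks the KV cache. The dimensions are invented for illustration and are not DeepSeek-V3's actual sizes.

```python
# Back-of-the-envelope comparison of a standard KV cache vs. an MLA-style latent cache.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
seq_len = 4096

standard_cache = seq_len * 2 * n_heads * d_head   # full keys and values for every head
mla_cache = seq_len * d_latent                    # one compressed latent per token

print(f"standard KV cache: {standard_cache:,} values per layer")
print(f"MLA latent cache:  {mla_cache:,} values per layer "
      f"(~{standard_cache / mla_cache:.0f}x smaller)")

# Reconstruction path: cache the latent, re-expand keys/values only when attending.
down_proj = nn.Linear(d_model, d_latent, bias=False)       # run once per token
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # applied at attention time
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

h = torch.randn(seq_len, d_model)
latent = down_proj(h)                                       # this is what gets cached
k = up_k(latent).view(seq_len, n_heads, d_head)
v = up_v(latent).view(seq_len, n_heads, d_head)
print(k.shape, v.shape)
```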

For post-training, DeepSeek uses supervised fine-tuning (SFT) and reinforcement learning (RL) to make the model human-friendly. The objective of SFT is to adjust the parameters of the pre-trained model so it aligns better with a specific domain; this is done with labeled data, minimizing the difference between the model's prediction and the desired output. Reinforcement learning is an experiential approach in which a model learns to reach the best outcome by taking actions, getting feedback through reward functions, and improving from that feedback. DeepSeek overcomes the cold-start problem, the lack of contextual data at the beginning of the RL process, by first fine-tuning the model on several thousand curated examples. This improves the readability of responses while also improving model performance.
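A minimal SFT step in plain PyTorch looks like the following; the toy model and random data are stand-ins, not DeepSeek's setup. The labeled sequence is shifted by one position, and the model's next-token predictions are pushed toward the desired tokens with a cross-entropy loss.

```python
# Minimal supervised fine-tuning (SFT) step with a toy stand-in model.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))       # stand-in for a pre-trained LM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def sft_step(token_ids: torch.Tensor) -> float:
    """One fine-tuning step on a single labeled sequence."""
    inputs, targets = token_ids[:-1], token_ids[1:]          # predict each next token
    logits = model(inputs)                                   # (seq_len - 1, vocab_size)
    loss = F.cross_entropy(logits, targets)                  # gap between prediction and label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

example = torch.randint(0, vocab_size, (16,))                # a labeled training sequence
print(sft_step(example))
```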

Both of these techniques, SFT and RL, are used extensively to train the R1 model from a base model. The R1 model has enhanced reasoning abilities; DeepSeek's paper describes the emergence of reflection as the "aha moment" during reinforcement learning. The model is trained to lay out its reasoning before presenting the actual response, and it can re-assess and modify its approach during this reasoning, or "thinking," process before generating the final answer.
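For illustration, a reasoning-then-answer response might look like the snippet below; the exact tags and wording are our assumption, shown only to make the format concrete.

```python
# Illustrative reasoning-then-answer format (tags are an assumption, not the exact template).
response = """<think>
The user asks for 15% of 240. 10% of 240 is 24 and 5% is 12,
so 15% is 24 + 12 = 36. Double-check: 0.15 * 240 = 36. Correct.
</think>
<answer>36</answer>"""
print(response)
```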

DeepSeek further uses distillation, wherein the reasoning capabilities of larger models like R1 are used to fine-tune and improve smaller models. This technique has paid dividends in improving smaller models' performance while reducing computing costs. Distillation helps with real-world application of models by reducing infrastructure requirements, and therefore cost, during inference, while still maintaining high performance thanks to the knowledge transferred from larger models that would be expensive to run at inference time.
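A hedged sketch of the core distillation step follows: a small student model is trained to match the output distribution of a large teacher, reduced here to toy linear heads and an assumed softening temperature.

```python
# Knowledge distillation sketch: the student learns to imitate the teacher's distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, temperature = 100, 2.0
teacher = nn.Linear(64, vocab_size)       # stand-in for the large model's output head
student = nn.Linear(64, vocab_size)       # much smaller model being distilled
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(hidden: torch.Tensor) -> float:
    with torch.no_grad():                 # the teacher is frozen
        teacher_probs = F.softmax(teacher(hidden) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(hidden) / temperature, dim=-1)
    # KL divergence pulls the student's distribution toward the teacher's
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(torch.randn(8, 64)))
```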

Open Source 

The impact of releasing the DeepSeek models as open source has been felt far and wide. It calls into question the business model of companies whose bread and butter is AI, such as OpenAI and Anthropic, and to a lesser extent more cushioned players like Google, Meta, and Amazon. At the same time, there is widespread enthusiasm in the developer community because of the open-source nature of the models and the intelligent engineering that went into them.

The low cost of training a high-performing AI model will open many new opportunities. The pace of AI adoption has increased in the past few years since the emergence of generative AI. However, many organizations fell into a trough of disillusionment because AI projects demand complicated data pipelines and integrations, require scarce skill sets, and often deliver low ROI, among other reasons. This could be just the push needed to move organizations onto the slope of enlightenment. Now that DeepSeek has shown the way, others will follow, and cheaper models will emerge; distillation can be leveraged further to achieve this. That will allow organizations to experiment with AI technology and integrate it into their broader digital ecosystems, unlocking new business cases. The decision cycle for such projects can shorten significantly, given the potentially lower cost of these models.

Perplexity, which positions itself as a competitor to Google's search engine, has already brought DeepSeek onto its platform. It has taken the open-source version and deployed it in data centers in the US and Europe, preventing any leakage of data to China. The fact that the DeepSeek AI assistant became a top-rated app on the App Store means end consumers like it, or at least find it curious enough to give it this level of attention.

This doesn't mean DeepSeek has no issues. When asked a sensitive question about China, it does not answer; oddly enough, it sometimes starts answering and then deletes the response instantly, citing harmlessness as a principle. This kind of moderation, dictated by a particular country's policy, can introduce bias and impede the free flow of information over the internet. DeepSeek's website states that user data is shared with the government of China, which can deter API usage. There are also doubts about whether the claim of $5.5Mn spent on training is accurate, or whether frontier models such as OpenAI's o1 may have been used to bootstrap its training. And even if the model is open source today, the licensing structure could change tomorrow.

Whatever the case, the cat is out of the bag, and changes in the AI landscape are inevitable. The clutter in the AI space and the pace of change will only increase, making it harder for a non-AI-focused firm to make sense of all the changes and choose the best technology or platform for its specific use. This is where AI-focused partners such as Pythian come into play, helping these firms throughout their AI journey.

Why Pythian?

Pythian focuses on data operations and curation techniques that bring your organization immediate, visible benefits within the ever-evolving AI ecosystem. We believe AI architecture has to be nimble and performant: build once and use it with every LLM, including the newest one on the block. DeepSeek is just the beginning of human ingenuity applied to overcoming resource constraints. Pythian, as your AI partner, can help you prepare for the coming changes and protect and enhance your AI investment.

At Pythian, we stay ahead of the latest and greatest innovations in the world of AI. In a series of upcoming blogs, we will go deeper into each of these optimization techniques. Talk to us today!
