How CIOs Can Battle GPU Poverty in the Age of AI

By adopting a model-first mentality, optimizing utilization and wielding load balancing strategically, CIOs can mitigate the shortage of chips.
May 7th, 2024 11:16am
Featured image: AI-generated art from Pixabay.

The gold rush of the AI era is on, but for many companies, the pickaxes are on backorder. A phenomenon known as “GPU poverty” is plaguing CIOs as demand for artificial intelligence skyrockets, outpacing the ability to build the data centers and, more importantly, the chips needed to power it all.

In a nutshell, GPU poverty means that organizations that would like to use GPUs for AI computing simply cannot buy capacity on these powerful parallel processing systems that are the most efficient way to run many types of machine learning.

This scarcity has its roots in a perfect storm. A global shortage of powerful graphics processing units has led startups to raise money specifically to buy GPUs, an insane tactic when you consider that massive capital expenditure ahead of revenue is exactly the problem cloud computing was built to solve. Then there are the ever-increasing demands of AI workloads.

As more and more enterprises look to either leverage AI services from the likes of OpenAI and Google or to tap into AI models and toolchains in the cloud, they add to the pressure on GPU pricing — putting GPUs further out of reach for startups and other organizations lacking capital.

GPU poverty is rippling up and down the entire supply chain and along the whole toolbelt for AI builders. Data center construction outfits face multiyear backlogs for in-demand core components such as backup generators and electrical transformers. Even finding suitable locations with cheap real estate, cheap and abundant power and fast connectivity to the global internet has become far more daunting.

Then there’s the matter of the missing chips. Semiconductor fabrication plants are struggling to keep up, and their efforts to rapidly build new fabs will only bear fruit over many years.

Meanwhile, hyperscale cloud providers and large enterprises are gobbling up the limited supply of GPU production, driving prices through the roof. For many companies, particularly those without bottomless budgets, difficulty accessing GPUs in the cloud for AI applications is becoming a significant business risk.

Smart CIOs, however, can take the edge off GPU insanity with common-sense steps that reduce the resources needed to run AI in their enterprises.

Use Frugal Models and Inferencing

Just like a resourceful traveler learns to pack light, data scientists can achieve amazing results with smaller, more efficient AI models. For example, Microsoft’s Phi-2 model, which was trained on carefully curated, textbook-quality data, is both compact and resource-efficient, requiring far less compute to fine-tune and to run inference.

Techniques like quantization and pruning allow researchers to shrink behemoth models with little loss of accuracy. Frameworks like TensorFlow Lite are specifically designed for deploying these leaner models on edge devices, and startups like Hugging Face are democratizing access to pre-trained, efficient models. The team responsible for the PyTorch framework is also creating new ways to train models effectively with less data and overhead.
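
As a rough illustration, here is a minimal sketch of loading a compact model such as Phi-2 in reduced precision with the Hugging Face transformers library. It assumes torch, transformers and accelerate are installed; the optional 8-bit path additionally needs bitsandbytes and a CUDA GPU.

```python
# Minimal sketch: load a compact model (Phi-2) in reduced precision so it fits
# on a single modest GPU. Assumes transformers, accelerate and torch are
# installed; the 8-bit option additionally needs bitsandbytes and CUDA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-2"  # ~2.7B parameters, far smaller than frontier LLMs

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Half precision roughly halves memory versus FP32; 8-bit quantization
# (uncomment quantization_config) shrinks it further at a small accuracy cost.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

prompt = "Summarize the benefits of small language models:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```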

Optimize Everything

With the stratospheric prices of GPU time, optimizing AI workloads pays off quickly and well. AI engineering and MLOps teams should aggressively and frequently profile performance to identify bottlenecks. This can mean benchmarking different configurations (batch sizes, number of GPUs) to find the most efficient setup for your specific task, because it’s not always straightforward.
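
For teams using PyTorch, that kind of profiling might look like the following minimal sketch; the tiny model and random data are stand-ins for a real workload.

```python
# Minimal sketch of profiling one training step with PyTorch's built-in
# profiler to see where time goes (data loading, CPU ops, GPU kernels).
import torch
from torch import nn
from torch.profiler import profile, record_function, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(256, 1024, device=device)
y = torch.randint(0, 10, (256,), device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("train_step"):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Sort by GPU (or CPU) time to spot the most expensive operators.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```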

Savvy teams will mix and tune data precisions (FP16, FP32, etc.) during training to reduce memory usage and run larger batch sizes. Managing memory allocation and data movement with techniques like data prefetching, and timing transfers so that data arrives just as compute becomes available, also helps.
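
A minimal PyTorch sketch of these ideas, assuming a CUDA-capable GPU, combines automatic mixed precision with pinned memory and non-blocking host-to-device copies; the model and dataset are placeholders.

```python
# Minimal sketch of mixed-precision training plus simple overlap of data
# transfer and compute (pinned memory, non_blocking copies, worker prefetch).
# Assumes a CUDA GPU; the model and dataset are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def main() -> None:
    device = "cuda"
    model = nn.Linear(1024, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()

    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    # pin_memory lets host-to-GPU copies run asynchronously; extra workers
    # prefetch the next batch while the current one is being processed.
    loader = DataLoader(dataset, batch_size=512, pin_memory=True, num_workers=2)

    for x, y in loader:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():      # run the forward pass in FP16 where safe
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()        # scale to avoid FP16 gradient underflow
        scaler.step(optimizer)
        scaler.update()

if __name__ == "__main__":
    main()
```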

Finding the ideal batch size for AI jobs is crucial. A larger batch size can better utilize the GPU, but one that is too large leads to out-of-memory errors, so experiment to find the sweet spot, as in the sketch below. Also try GPU virtualization software if you have larger GPUs or have reserved a lot of GPU capacity. It lets you carve scarce compute that is reserved for training or large fine-tuning jobs into slices that can also handle the more run-of-the-mill model inference AI applications need in production.
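
One way to run that experiment, sketched here in PyTorch under the assumption of a CUDA GPU, is to sweep batch sizes and measure throughput until an out-of-memory error ends the search.

```python
# Minimal sketch of sweeping batch sizes to find the throughput sweet spot
# before hitting out-of-memory errors. Assumes a CUDA GPU and PyTorch 1.13+
# (for torch.cuda.OutOfMemoryError); the model is a placeholder.
import time
import torch
from torch import nn

device = "cuda"
model = nn.Sequential(nn.Linear(2048, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()

for batch_size in (64, 128, 256, 512, 1024, 2048):
    try:
        x = torch.randn(batch_size, 2048, device=device)
        y = torch.randint(0, 10, (batch_size,), device=device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            loss = loss_fn(model(x), y)
            loss.backward()
            model.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        print(f"batch {batch_size}: {10 * batch_size / elapsed:,.0f} samples/sec")
    except torch.cuda.OutOfMemoryError:
        print(f"batch {batch_size}: out of memory, stopping sweep")
        torch.cuda.empty_cache()
        break
```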

Lastly, deploy on a foundation of containers that enables automatic scaling, if possible, to dynamically adjust the number of GPUs allocated to a workload based on real-time needs. This helps avoid overprovisioning while ensuring enough resources for peak periods.
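
As a rough sketch of the idea, the loop below uses the official Kubernetes Python client to nudge the replica count of a hypothetical GPU-serving Deployment up or down based on utilization. The deployment name, namespace, thresholds and the get_gpu_utilization() helper are all assumptions, not a production autoscaler.

```python
# Illustrative sketch (not a production autoscaler) of adjusting the replica
# count of a GPU-serving Deployment based on observed utilization, using the
# official Kubernetes Python client. get_gpu_utilization() is a stand-in for
# whatever metrics source you use (DCGM, Prometheus, cloud monitoring).
import time
from kubernetes import client, config

DEPLOYMENT, NAMESPACE = "llm-inference", "ai"   # hypothetical names
MIN_REPLICAS, MAX_REPLICAS = 1, 8

def get_gpu_utilization() -> float:
    """Placeholder: return average GPU utilization (0.0-1.0) across the pool."""
    return 0.85

def main() -> None:
    config.load_kube_config()                   # or load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    while True:
        scale = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE)
        replicas = scale.spec.replicas or MIN_REPLICAS
        util = get_gpu_utilization()
        if util > 0.8 and replicas < MAX_REPLICAS:
            replicas += 1                       # scale out before requests queue up
        elif util < 0.3 and replicas > MIN_REPLICAS:
            replicas -= 1                       # scale in to stop paying for idle GPUs
        apps.patch_namespaced_deployment_scale(
            DEPLOYMENT, NAMESPACE, body={"spec": {"replicas": replicas}})
        time.sleep(60)

if __name__ == "__main__":
    main()
```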

Tune Load Balancing for AI

Properly tuned load balancing tackles the challenge of GPU poverty by ensuring AI jobs receive the resources they need without timing out, while also offering enhanced security. It differs from traditional load balancing by recognizing the diverse computational requirements of AI tasks.

By profiling workloads, assessing their CPU and GPU needs, and prioritizing time-sensitive operations, AI-specific load balancers dynamically distribute work across the most suitable hardware. This approach safeguards your expensive GPUs for operations that genuinely demand their capabilities, while offloading CPU-bound work to more cost-effective resources.
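
The routing decision itself can be sketched in a few lines of Python; the job attributes, pool names and priorities below are hypothetical, meant only to illustrate the profile-then-route pattern rather than any particular product's API.

```python
# Illustrative sketch of the routing decision an AI-aware load balancer makes:
# profile each job's needs and send only genuinely GPU-bound work to the
# scarce GPU pool. Names and attributes are hypothetical.
from dataclasses import dataclass

@dataclass
class JobProfile:
    name: str
    needs_gpu: bool          # e.g., large-model training or batched inference
    latency_sensitive: bool  # interactive requests jump the queue

GPU_POOL, CPU_POOL = "gpu-pool", "cpu-pool"

def route(job: JobProfile) -> tuple[str, int]:
    """Return (pool, priority); lower priority numbers are served first."""
    if not job.needs_gpu:
        return CPU_POOL, 1                   # keep GPUs for work that needs them
    if job.latency_sensitive:
        return GPU_POOL, 0                   # interactive inference goes first
    return GPU_POOL, 2                       # heavy batch work waits for free capacity

jobs = [
    JobProfile("embedding-backfill", needs_gpu=False, latency_sensitive=False),
    JobProfile("chatbot-inference", needs_gpu=True, latency_sensitive=True),
    JobProfile("nightly-fine-tune", needs_gpu=True, latency_sensitive=False),
]
for job in jobs:
    pool, priority = route(job)
    print(f"{job.name} -> {pool} (priority {priority})")
```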

Critically, AI-specific load balancing introduces a new dimension of control with token management. In AI systems where tokens play a role (language models), balancing loads isn’t just about hardware efficiency. Load balancers can monitor token usage associated with AI jobs, dynamically rerouting requests to optimize token consumption and prevent cost overruns.

Moreover, by intelligently routing jobs based on their potential security implications and token sensitivities, AI load balancers help isolate high-risk workloads, providing an additional layer of protection for your AI systems. Implementing such a load-balancing strategy requires careful attention to framework integration and robust monitoring, along with an evaluation of whether cloud-based AI load-balancing services can deliver cost savings.

AI-tuned load balancers might deliver more granular control — token-based rate limiting, for example, and algorithms that ship or shift jobs to LLM clusters that are the most economical in terms of token usage or costs.
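
A hedged sketch of what token-aware routing and rate limiting might look like follows; the cluster names, per-token prices, budgets and the four-characters-per-token heuristic are all illustrative assumptions.

```python
# Illustrative sketch of token-aware routing: estimate a request's token count,
# enforce a per-tenant token budget, and pick the LLM cluster with the lowest
# cost per 1K tokens that still has headroom. All names, prices and the
# character-per-token heuristic are assumptions.
from collections import defaultdict

CLUSTERS = {                    # hypothetical clusters: cost and remaining capacity
    "small-model-cluster": {"usd_per_1k": 0.002, "headroom_tokens": 500_000},
    "large-model-cluster": {"usd_per_1k": 0.03, "headroom_tokens": 200_000},
}
TENANT_BUDGET_TOKENS = 100_000  # per-tenant daily token budget (rate limit)
usage = defaultdict(int)

def estimate_tokens(prompt: str) -> int:
    return max(1, len(prompt) // 4)          # crude heuristic: ~4 characters per token

def route_request(tenant: str, prompt: str, needs_large_model: bool) -> str:
    tokens = estimate_tokens(prompt)
    if usage[tenant] + tokens > TENANT_BUDGET_TOKENS:
        raise RuntimeError(f"{tenant} exceeded its token budget; request rejected")

    # Cheapest cluster first, unless the job explicitly needs the large model.
    candidates = ["large-model-cluster"] if needs_large_model else sorted(
        CLUSTERS, key=lambda c: CLUSTERS[c]["usd_per_1k"])
    for cluster in candidates:
        if CLUSTERS[cluster]["headroom_tokens"] >= tokens:
            CLUSTERS[cluster]["headroom_tokens"] -= tokens
            usage[tenant] += tokens
            return cluster
    raise RuntimeError("no cluster has capacity; retry later")

print(route_request("team-a", "Classify this support ticket ...", needs_large_model=False))
```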

The Future Is (Hopefully) Abundant

The good news is that the industry isn’t sitting idly by. Chipmakers are ramping up production, and new chip architectures specifically designed for AI are on the horizon. More AI data centers will come online. Many smart developers and engineering teams are continually improving the way AI models work and reducing the burden for training models while holding the line or even improving on performance.

However, these solutions won’t arrive overnight. In the meantime, by adopting a model-first mentality, optimizing utilization and wielding load balancing strategically, CIOs can mitigate the worst excesses of the current infrastructure bubble and avoid GPU poverty, ensuring that their organizations have enough AI for the jobs that need to be done.
