AI workloads require massive datasets. Parallel file systems running on high-speed networks are the foundational infrastructure for quick access to these datasets.
Success in implementing AI projects depends on the entire data center having the performance, scalability, and availability to drive GPUs at peak utilization. Given the significant investment that GPUs represent in today's market, maximizing their ROI is a must.
While fast networks and parallel file systems are critical, what is often overlooked is system availability. It's reported that many high performance computing (HPC) systems achieve only 60% total availability, mostly because of maintenance windows and unplanned downtime for replacing failed components, upgrading systems, applying software updates, and the like.
Downtime is expensive and unproductive; the more hardware you have, the more failures you get. People make mistakes, too: the bigger the team and the data center, the more mistakes will be made, such as pulling the wrong server, cable, or drive.
Think of all the idle resources: servers, storage, network, staff, power consumption, GPUs, CPUs, unhappy data scientists, and so on. What is the cost per hour? According to ITIC's 2024 Hourly Cost of Downtime Survey, for 90% of all organizations the cost of downtime is at least $300,000 per hour, and 41% of enterprises say it runs between $1 million and $5 million per hour.
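To see what those two figures imply when put together, here is a rough back-of-the-envelope calculation (a sketch in Python; the 60% availability and $300,000-per-hour figures are simply the ones cited above, and applying the hourly rate to every hour of downtime is purely illustrative):

```python
# Back-of-the-envelope estimate of what 60% availability could cost per year,
# using the figures quoted above (illustrative only, not a real cost model).
HOURS_PER_YEAR = 24 * 365            # 8,760 hours
availability = 0.60                  # reported total availability of many HPC systems
hourly_cost = 300_000                # ITIC 2024 floor for 90% of organizations, in USD

downtime_hours = HOURS_PER_YEAR * (1 - availability)
annual_cost = downtime_hours * hourly_cost

print(f"Downtime per year: {downtime_hours:,.0f} hours")   # ~3,504 hours
print(f"Illustrative annual cost: ${annual_cost:,.0f}")     # ~$1.05 billion
```

Even if only a fraction of that downtime actually idles the whole organization, the order of magnitude makes the point: availability is not a footnote to performance.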
A system suited to the high demands of AI applications must start with a parallel file system built on hyperscaler principles. Given the mammoth datasets that feed AI research and development, the system must be designed to scale to thousands of individual nodes and exabytes of capacity.
Such initial considerations enable linear scalability and maximum throughput for next-generation data storage.
The necessity of no single point of failure
The data center is the war zone of modern distributed computing: system failures are common, and problems are a matter of 'when', not 'if'. That reality demands 100% availability from the system design, which means adaptive redundancy must be built into the systems themselves.
There is a simple reason for this: the possibilities of the technology will not be advanced by cumbersome systems that suffer downtime. For a long time, the IT industry has accepted downtime as a necessary evil of system upgrades and maintenance.
However, a more modern view sets a new standard for what is possible from advanced computing. That is why storage infrastructure must be built, from the ground up, to be resilient against platform failure. Hyperscalers have normalized the expectation that systems are available 24/7 at full performance.
Fault tolerance requires software that does not trust the underlying hardware. The base unit is a cluster, which should consist of a minimum of four nodes. Each cluster should be able to resolve failures without downtime, and routine maintenance must likewise be performed without downtime. Advanced projects such as artificial intelligence need continuous uptime.
The maintenance window, in our thinking, is completely obsolete. Complex projects such as medical research require continuous availability. As the industry advances and more elaborate tasks are assigned to HPC systems, the brass ring of storage is uninterrupted system operation.
This is the type of storage architecture we engineer. Meeting the demands of the future requires a system that can run end-to-end checks on everything, including network connections and drives. The bar for operating under dire conditions is now much higher: we build storage architectures that can lose a node, a rack, or even an entire data center, and the system will still run.
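As one illustration of what an end-to-end check can look like in practice (a minimal sketch of the general technique, not the author's actual implementation): data is checksummed when it is written and verified when it is read back, so silent corruption anywhere along the path, whether in a NIC, a cable, or a drive, is detected rather than handed back to the application.

```python
import zlib

def write_with_checksum(payload: bytes) -> dict:
    """Store the payload together with a CRC32 computed at the source."""
    return {"data": payload, "crc32": zlib.crc32(payload)}

def read_with_verification(record: dict) -> bytes:
    """Recompute the checksum on read; refuse to return corrupted data."""
    if zlib.crc32(record["data"]) != record["crc32"]:
        raise IOError("end-to-end check failed: data corrupted in transit or at rest")
    return record["data"]

# Corruption introduced anywhere between write and read is caught.
record = write_with_checksum(b"training shard 0042")
record["data"] = b"training shard 0O42"   # simulate a flipped byte on the wire or on disk
try:
    read_with_verification(record)
except IOError as err:
    print(err)
```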
Abolishing heavy maintenance, adding system flexibility
Another key feature of systems attuned to AI work is the absence of a maintenance window. A modular, heterogeneous system architecture means no planned downtime for upgrades and updates.
In turn, none of the things storage admins and hardware operators need to do on a routine basis requires downtime: updates, hardware repairs and replacements, recabling, hardware refreshes, reconfigurations, or kernel and security patches.
Legacy storage systems that rely on dual controllers have multiple points of failure, and their maintenance windows are disruptive, making continuous operation impossible. A cluster is a much better solution: the modularity of this approach provides an order of magnitude more redundancy.
For example, nodes can be taken out of service, have their components replaced or their software updated, and then rejoin the cluster. This enables true fault tolerance and non-disruptive operations. Clusters should be built on a minimum of four nodes, but also be able to scale to thousands of nodes if necessary.
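A toy sketch helps show why four nodes is a sensible floor (an illustration only, not any particular product's placement logic): with three-way replication across four nodes, any single node can be drained for maintenance or lost to a failure and every piece of data still has live replicas.

```python
from itertools import combinations

NODES = ["node1", "node2", "node3", "node4"]
REPLICAS = 3   # each object is stored on three different nodes

# Toy placement: spread objects across the 3-node combinations of the cluster.
placements = list(combinations(NODES, REPLICAS))
objects = {f"object-{i}": placements[i % len(placements)] for i in range(8)}

def available(obj_nodes, down):
    """An object stays readable as long as at least one replica is on a healthy node."""
    return any(node not in down for node in obj_nodes)

# Drain one node for a software update (or lose it to a failure).
down = {"node2"}
unreachable = [name for name, nodes in objects.items() if not available(nodes, down)]
print(f"Objects unavailable with {down} down: {unreachable}")   # -> []
```

The fourth node also gives the cluster somewhere to rebuild replicas while a peer is out, which is one common rationale for treating four nodes as the minimum.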
Both today's and tomorrow's AI workloads will require storage that can handle hybrid architectures; the ability to add and integrate newer systems that play nicely with previous investments is not to be overlooked. In this sense, compatibility is a budgetary issue.
The ideal storage system runs entirely in user space, with no custom kernel modules or drivers. Consequently, the system can be managed by staff with basic Linux knowledge, and if there is a system issue, you don't have to wait for the one expert to get to the data center.
Final thoughts
The data capacity and power consumption of artificial intelligence are set to grow. AI is already estimated to account for about 20% of global data-center power demand, and as of this writing that share is expected to double by the end of the year. Smarter AI will simply need even more storage.
Performance alone won't carry you once production begins on projects with petabytes of data. Performance looks great on paper, but reality is different: much like a race car, the moment the system hits the track, maintenance becomes an issue.
Reliability, as much as performance, is what keeps the car from sitting in the garage being fixed all the time. That is why building storage for AI requires efficient, flexible, hardware-agnostic systems, built today, that will serve the future being built on top of them.