Facebook Expands the Open Compute Project (OCP) to Cover AI Applications

Facebook serves 2.7 billion people, makes 200 trillion predictions, and performs over 6 billion translations each day. Additionally, 3.5 billion images are analyzed to better recognize and tag content. So it should come as no surprise that AI is used extensively to make all this happen. To support its various AI tasks (training, inference, feature engineering, etc.), Facebook has extended the Open Compute Project (OCP) to include hardware platforms suited to AI-intensive computations.

The company announced the availability of two new platforms: one intended for model training (Zion) and the other (Kings Canyon) suited to inference jobs. These platforms consist of server blades, AI accelerator modules, connectivity, and the chassis.

So why have two separate platforms? The hardware architectures needed for effective training and inference are vastly different. Gear intended for training requires massive amounts of memory and the ability to support floating-point operations, while latency, power, and cost are less important. Inference hardware, on the other hand, requires less memory and does just fine with integer math operations alone. The most critical requirement for inference hardware is latency, since inferences have to be done in real time.
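To illustrate why integer math can be enough for inference, here is a minimal sketch of symmetric 8-bit quantization. It is not taken from Facebook's platforms; the function names (`quantize`, `int_dot`) and the sample values are hypothetical and chosen only to show the idea. Floating-point weights and activations are mapped to small integers, the multiply-accumulate runs entirely on integers, and a single rescale at the end recovers an approximate floating-point result.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one symmetric scale factor.

    Hypothetical helper for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(q_a, q_b):
    """Dot product using integer multiply-accumulate only."""
    return sum(a * b for a, b in zip(q_a, q_b))


# Made-up sample data standing in for one neuron's weights and inputs.
weights = [0.42, -1.30, 0.07, 0.88]
activations = [1.5, 0.25, -2.0, 0.6]

q_w, s_w = quantize(weights)
q_x, s_x = quantize(activations)

# Reference result computed in floating point.
float_result = sum(w * x for w, x in zip(weights, activations))

# Integer accumulation, rescaled once at the end.
int_result = int_dot(q_w, q_x) * s_w * s_x

print(f"floating-point result: {float_result:.4f}")
print(f"8-bit integer approx:  {int_result:.4f}")
```

The two printed values should be close but not identical. That small loss of precision is usually acceptable when serving a trained model, whereas training relies on the finer resolution of floating-point arithmetic to accumulate many small updates.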

Zion (the training platform) is designed to efficiently handle the training of a spectrum of neural networks, including CNNs, LSTMs, and SparseNNs. The AI accelerators are housed in a vendor-agnostic OCP Accelerator Module (OAM), which can support devices from AMD, Habana, Graphcore, Intel, and NVIDIA.

Kings Canyon is specifically designed for inference tasks. The AI accelerator chips are housed in M.2 modules that are also vendor-agnostic and can support devices from Esperanto, Habana, Intel, Marvell, and Qualcomm.

Uber Architecting Next-generation AI Servers

Like Facebook and Google, Uber uses AI in just about every aspect of its business, including recommendation engines, Uber Eats, and fraud detection services, among others. Imagine doing this to accommodate 15 million daily rides in 600 cities across 65 countries. To support such scale, Uber has two large data centers (each consuming 5 MW) and several smaller ones around the world.
