Now that you have a sense of what DeepEP is, let's talk about some technical details (you can skip this section).
High-Throughput and Low-Latency Kernels:
Supports MoE dispatch (sending tokens to different experts) and combine (merging the experts' outputs) with low latency.
Optimized for both NVLink and RDMA communication to improve data transfer speed.
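To make "dispatch" and "combine" concrete, here is a tiny sketch of the data movement they refer to. The function names and the purely local re-ordering are illustrative only; in DeepEP the same pattern is an all-to-all across GPUs over NVLink/RDMA.

```python
import torch

def dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor, num_experts: int):
    """Group each token's hidden state by the expert it was routed to.

    In DeepEP this step is an all-to-all over NVLink/RDMA; here it is just a
    local re-ordering that shows the same data-movement pattern.
    """
    order = torch.argsort(expert_ids)                      # tokens grouped by expert
    counts = torch.bincount(expert_ids, minlength=num_experts)
    return tokens[order], order, counts

def combine(expert_out: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Scatter the experts' outputs back to the original token positions."""
    merged = torch.empty_like(expert_out)
    merged[order] = expert_out
    return merged

# Toy usage: 8 tokens of width 4 routed across 2 experts.
tokens = torch.randn(8, 4)
expert_ids = torch.randint(0, 2, (8,))
grouped, order, counts = dispatch(tokens, expert_ids, num_experts=2)
restored = combine(grouped, order)      # identity here, since no expert ran in between
assert torch.equal(restored, tokens)
```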
Optimized for the Group-Limited Gating Algorithm:
Uses specialized kernels for asymmetric-domain bandwidth forwarding, meaning it efficiently handles data transfer between different hardware domains (for example, forwarding from NVLink to RDMA, both of which are interconnect technologies).
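Roughly, group-limited gating means a token may only pick experts from a few expert groups (in practice, a few nodes), which is what makes the forwarding pattern above efficient. Below is a simplified sketch, assuming experts are laid out group-major and scoring each group by its single best expert; the real router differs in details.

```python
import torch

def group_limited_topk(scores, num_groups, topk_groups, topk_experts):
    """Pick top-k experts, but only from the best few expert groups.

    scores: [num_tokens, num_experts] router affinities. Limiting each token to
    `topk_groups` groups (e.g. nodes) bounds its RDMA destinations; the wider
    fan-out to individual experts then stays on NVLink inside those groups.
    """
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_groups
    grouped = scores.view(num_tokens, num_groups, experts_per_group)

    # Score each group by its single best expert and keep only the best groups.
    group_scores = grouped.max(dim=-1).values                    # [tokens, groups]
    top_groups = group_scores.topk(topk_groups, dim=-1).indices

    # Mask out experts living in discarded groups, then do an ordinary top-k.
    keep = torch.zeros_like(group_scores)
    keep.scatter_(1, top_groups, 1.0)
    expert_mask = keep.repeat_interleave(experts_per_group, dim=1).bool()
    masked = scores.masked_fill(~expert_mask, float("-inf"))
    return masked.topk(topk_experts, dim=-1)                     # values, indices

scores = torch.randn(4, 16)                                      # 4 tokens, 16 experts
values, expert_idx = group_limited_topk(scores, num_groups=4, topk_groups=2, topk_experts=4)
```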
Latency-Sensitive Inference:
Includes low-latency kernels that use RDMA for inference tasks to minimize delays during data processing.
Uses a hook-based method to let communication and computation overlap without occupying GPU compute resources such as SMs (Streaming Multiprocessors).
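The hook-based overlap is easiest to see as a control-flow pattern: issuing the transfer returns immediately with a callable, the GPU keeps computing, and the callable is invoked later to finalize the receive. Here is a minimal stand-in with no RDMA involved (in DeepEP the NIC moves the data in the background, so no SMs are consumed).

```python
import torch

def issue_dispatch(tokens):
    """Start the "communication" and return a hook that finalizes it later.

    In DeepEP's low-latency mode the RDMA NIC carries the transfer, so no SMs
    are spent on it; here the in-flight data is just a stored tensor, which
    keeps the control flow visible.
    """
    in_flight = {"payload": tokens.clone()}        # stands in for data on the wire

    def hook():
        return in_flight.pop("payload")            # "wait" for and return the data

    return hook

hidden = torch.randn(8, 4)
hook = issue_dispatch(hidden)                      # transfer starts, call returns at once

# Overlap: do unrelated compute (e.g. the next micro-batch) while the data moves.
other = torch.randn(8, 4) @ torch.randn(4, 4)

received = hook()                                  # only now is the result consumed
assert torch.equal(received, hidden)
```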
Performance Testing:
Tested on H800 GPUs with CX7 InfiniBand 400 Gb/s RDMA network cards, showing high performance across configurations such as dispatch and combine at various EP (expert parallelism) sizes and network bandwidths.
RDMA and NVLink Integration:
Supports RDMA (Remote Direct Memory Access) for fast data transfer across nodes and NVLink for intra-node communication, making it highly efficient for distributed machine learning workloads.
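The split is purely topological: whether a message rides NVLink or RDMA depends on whether the destination GPU sits in the same node. A toy illustration, assuming ranks are packed eight per node:

```python
def transport_for(dst_rank: int, my_rank: int, gpus_per_node: int = 8) -> str:
    """Pick the link a message travels over in a two-tier NVLink/RDMA topology."""
    same_node = dst_rank // gpus_per_node == my_rank // gpus_per_node
    return "NVLink" if same_node else "RDMA"

# Rank 3 to rank 5 stays inside the node; rank 3 to rank 19 crosses nodes.
assert transport_for(5, 3) == "NVLink"
assert transport_for(19, 3) == "RDMA"
```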
Traffic Isolation and Adaptive Routing:
Uses Virtual Lanes (VL) in InfiniBand to separate different types of traffic, ensuring workloads don't interfere with one another.
Supports adaptive routing to avoid network congestion, although it is currently limited to the low-latency kernels.
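Operationally, traffic isolation is a configuration knob rather than code. Assuming the NVSHMEM-based transport DeepEP builds on, the InfiniBand service level (which the subnet manager maps to a Virtual Lane) is typically selected through an environment variable; the exact variable and value are cluster-specific, so treat this as a sketch:

```python
import os

# Assumption: the transport reads NVSHMEM_IB_SL at initialization and places this
# job's traffic on the corresponding Virtual Lane, keeping it apart from other
# workloads. "1" is a placeholder; use whatever your fabric admin assigns.
os.environ.setdefault("NVSHMEM_IB_SL", "1")
```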
Congestion Control:
Congestion control is disabled, since no significant congestion has been observed in the production environment; this simplifies deployment.
Compatibility:
Works with InfiniBand networks and is theoretically compatible with RDMA over Converged Ethernet (RoCE).
FP8: A low-precision, 8-bit floating-point format used to speed up computation and reduce memory usage at the cost of some precision (a short round-trip sketch follows this list).
RDMA (Remote Direct Memory Access): A technology that lets data be transferred directly between the memory of two computers without involving the CPU, improving speed and reducing latency.
NVLink: A high-bandwidth, energy-efficient interconnect technology developed by NVIDIA to connect GPUs and accelerate data transfer.
SM (Streaming Multiprocessor): The basic processing unit in a GPU that handles the majority of computational work.
Virtual Lanes (VL): Part of InfiniBand's networking technology, where traffic is segregated into different logical channels to prevent interference between traffic types.
Adaptive Routing: A network routing feature that dynamically adjusts the path data takes to avoid congestion, improving overall performance.
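To make the FP8 trade-off concrete, here is a small round trip through PyTorch's e4m3 FP8 dtype with a per-tensor scale. This is a generic illustration of the format, not DeepEP's own quantization path.

```python
import torch

# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype.
x = torch.randn(4, 8) * 10
scale = x.abs().max() / 448.0                  # ~448 is the largest e4m3 value
x_fp8 = (x / scale).to(torch.float8_e4m3fn)    # 1 byte per element instead of 4
x_back = x_fp8.to(torch.float32) * scale
print((x - x_back).abs().max())                # small but nonzero: precision traded for memory
```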