Junnan LI Biao HAN Zhigang SUN Tao LI Xiaoyan WANG
FPGA-based switches are appealing because they balance hardware performance with software flexibility. The packet parser, the foundational component of an FPGA-based switch, identifies and extracts the specific fields used in forwarding decisions, e.g., the destination IP address. However, traditional parsers are too rigid to accommodate new protocols. In addition, FPGAs usually have much lower clock frequencies and fewer hardware resources than ASICs. In this paper, we present PLANET, a programmable packet-level parallel parsing architecture for FPGA-based switches, to overcome these two limitations. First, PLANET offers flexible programmability: its parsing algorithms can be updated at run-time. Second, PLANET heavily exploits parallelism inside packet parsing to compensate for the FPGA's low clock frequency, and reduces resource consumption with a one-block recycling design. We implemented PLANET on an FPGA-based switch prototype with common datacenter protocols integrated. Evaluation results show that our design can parse packets at up to 100 Gbps, while maintaining a relatively low parsing latency and consuming fewer hardware resources than existing proposals.
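To make the parsing task concrete, the following minimal Python sketch (an illustration of what a parser does, not the PLANET hardware design) checks the EtherType of a raw Ethernet frame and, for IPv4 packets, extracts the destination IP address used in forwarding decisions. The function name and test frame are assumptions for illustration; the field offsets follow the standard Ethernet/IPv4 header layouts.

```python
# Illustrative software sketch of the parsing task: identify a packet's
# protocol and extract a field used in forwarding decisions (here, the
# IPv4 destination address). NOT the PLANET FPGA design.
import struct
import socket

ETH_HDR_LEN = 14          # dst MAC (6) + src MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800

def extract_dst_ip(frame: bytes):
    """Return the IPv4 destination address of an Ethernet frame, or None."""
    if len(frame) < ETH_HDR_LEN + 20:       # 20 = minimum IPv4 header size
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None                          # unsupported protocol
    # The IPv4 destination address sits at bytes 16..19 of the IP header.
    dst = frame[ETH_HDR_LEN + 16 : ETH_HDR_LEN + 20]
    return socket.inet_ntoa(dst)

# Hypothetical test frame: EtherType = IPv4, destination IP 10.0.0.1.
frame = (b"\x00" * 12 + struct.pack("!H", ETHERTYPE_IPV4)
         + b"\x45" + b"\x00" * 15 + socket.inet_aton("10.0.0.1"))
print(extract_dst_ip(frame))                 # -> 10.0.0.1
```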
Chenxu WANG Yutong LU Zhiguang CHEN Junnan LI
Training deep learning (DL) models is a computationally intensive process; as a result, training time can become so long that it impedes DL development. High-performance computing clusters, especially supercomputers, are equipped with abundant computing resources, storage resources, and efficient interconnects, and can therefore train DL networks better and faster. In this paper, we propose a method to train DL networks in a distributed fashion with high efficiency. First, we propose a hierarchical synchronous Stochastic Gradient Descent (SGD) strategy, which makes full use of hardware resources and greatly increases computational efficiency. Second, we present a two-level parameter synchronization scheme that reduces communication overhead by transmitting the parameters of the first-level models through shared memory. Third, we optimize parallel I/O by having each reader read data as contiguously as possible, avoiding the high overhead of discontinuous reads. Finally, we integrate the LARS algorithm into our system. The experimental results demonstrate that our approach has substantial performance advantages over unoptimized methods. Compared with the native distributed strategy, our hierarchical synchronous SGD strategy (HSGD) increases computing efficiency by about 20 times.
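As a rough illustration of the hierarchical idea (a sketch under assumptions, not the paper's implementation), the NumPy snippet below averages gradients in two stages: first within each node, as the shared-memory level would, and then across the per-node results, as the inter-node level would. With equal worker counts per node the result matches a flat global average, while only one vector per node would have to cross the network.

```python
# Minimal sketch of two-level (hierarchical) gradient synchronization.
# Level 1 models the intra-node shared-memory reduction; level 2 models
# the inter-node all-reduce. Names and shapes are illustrative.
import numpy as np

def hierarchical_average(grads_per_node):
    """grads_per_node: list of nodes, each a list of per-worker gradients."""
    # Level 1: intra-node reduction (in the paper, done via shared memory).
    node_avgs = [np.mean(node, axis=0) for node in grads_per_node]
    # Level 2: inter-node synchronization over one vector per node
    # (in practice, an all-reduce across the network).
    return np.mean(node_avgs, axis=0)

# Example: 2 nodes x 4 workers, each holding a gradient vector of length 3.
rng = np.random.default_rng(0)
grads = [[rng.standard_normal(3) for _ in range(4)] for _ in range(2)]
global_grad = hierarchical_average(grads)
print(global_grad)  # equals averaging all 8 gradients directly
```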