
  Bokun Wang

I am a Ph.D. student at Texas A&M University, advised by Prof. Tianbao Yang. I received my bachelor's degree in Computer Science from the University of Electronic Science and Technology of China (UESTC) in 2018. My current research interests are stochastic optimization and distributed optimization for machine learning.

Recent Papers [Full List]

  • Theoretically Better and Numerically Faster Distributed Optimization with Smoothness-Aware Quantization Techniques
    Bokun Wang, Mher Safaryan, and Peter Richtárik
    To appear in the 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
    + abstract
    To address the high communication costs of distributed machine learning, a large body of work has been devoted in recent years to designing various compression strategies, such as sparsification and quantization, and optimization algorithms capable of using them. Recently, Safaryan et al. [2021] pioneered a dramatically different compression design approach: they first use the local training data to form local smoothness matrices and then propose to design a compressor capable of exploiting the smoothness information contained therein. While this novel approach leads to substantial savings in communication, it is limited to sparsification as it crucially depends on the linearity of the compression operator. In this work, we generalize their smoothness-aware compression strategy to arbitrary unbiased compression operators, a class that also includes sparsification. Specializing our results to stochastic quantization, we guarantee significant savings in communication complexity compared to standard quantization. In particular, we prove that block quantization with n blocks theoretically outperforms single-block quantization, reducing the communication complexity by a factor of O(n), where n is the number of nodes in the distributed system. Finally, we provide extensive numerical evidence on convex optimization problems that our smoothness-aware quantization strategies outperform existing quantization schemes, as well as the aforementioned smoothness-aware sparsification strategies, with respect to three evaluation metrics: the number of iterations, the total number of bits communicated, and wall-clock time.
  • Finite-Sum Coupled Compositional Stochastic Optimization: Theory and Applications
    Bokun Wang and Tianbao Yang
    In Proc. of the 39th International Conference on Machine Learning (ICML), 2022.
    [paper] [arxiv] [bib] [code] [poster] [slides]
    + abstract
    This paper studies stochastic optimization for a sum of compositional functions, where the inner-level function of each summand is coupled with the corresponding summation index. We refer to this family of problems as finite-sum coupled compositional optimization (FCCO). It has broad applications in machine learning for optimizing non-convex or convex compositional measures/objectives such as average precision (AP), p-norm push, listwise ranking losses, neighborhood component analysis (NCA), deep survival analysis, and deep latent variable models, which deserve finer analysis. Yet existing algorithms and analyses are restricted in one aspect or another. The contribution of this paper is a comprehensive convergence analysis of a simple stochastic algorithm for both non-convex and convex objectives. Our key result is an improved oracle complexity with parallel speed-up, obtained by combining a moving-average-based estimator with mini-batching. Our theoretical analysis also yields new insights for improving the practical implementation, namely sampling batches of equal size for the outer and inner levels. Numerical experiments on AP maximization, NCA, and p-norm push corroborate some aspects of the theory.
  • IntSGD: Adaptive Floatless Compression of Stochastic Gradients
    Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, and Peter Richtárik
    In Proc. of the 10th International Conference on Learning Representations (ICLR), 2022 (Spotlight).
    [paper] [arxiv] [bib] [code] [poster] [slides]
    + abstract
    We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float. This is achieved by multiplying floating-point vectors by a number known to every device and then rounding to integers. In contrast to the prior work on integer compression for SwitchML by Sapio et al. (2021), our IntSGD method is provably convergent and computationally cheaper, as it estimates the scaling of the vectors adaptively. Our theory shows that the iteration complexity of IntSGD matches that of SGD up to constant factors for both convex and non-convex, smooth and non-smooth functions, with and without overparameterization. Moreover, our algorithm can be tailored to the popular all-reduce communication primitive and shows promising empirical performance.
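The scale-and-round idea behind integer compression can be sketched in a few lines. This is a minimal illustration only: `alpha` here is a fixed scale assumed to be known to every device, whereas IntSGD's contribution is to estimate that scale adaptively.

```python
import numpy as np

def int_compress(v, alpha):
    """Quantize a float vector to integers using a scale shared by all devices."""
    return np.rint(v / alpha).astype(np.int64)

def int_decompress(q, alpha):
    """Recover an approximate float vector from the communicated integers."""
    return q.astype(np.float64) * alpha

# Each device communicates only integers; per-coordinate error is at most alpha / 2.
g = np.array([0.731, -1.204, 0.058])
q = int_compress(g, alpha=0.01)      # integer message, e.g. [73, -120, 6]
g_hat = int_decompress(q, alpha=0.01)
```

Because rounding keeps each coordinate within `alpha / 2` of the original, integer messages can also be summed exactly by an all-reduce before decompression.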


Experience

  • Optimization and Machine Learning Lab, King Abdullah University of Science and Technology (KAUST)
    Research intern advised by Prof. Peter Richtárik, September 2020 - August 2021.
  • University of California Davis (UC Davis)
    Research assistant advised by Prof. Shiqian Ma, July 2019 - September 2019.


Teaching

  • Teaching assistant for ECS 32B: Introduction to Data Structures, ECS 154A: Computer Architecture, ECS 271: Machine Learning & Discovery, and ECS 170: Introduction to Artificial Intelligence at the University of California, Davis.


Service

  • Reviewer for NeurIPS 2021, ICML 2022, and AISTATS 2023.