DISTRIBUTED COMPUTING
Automatic Decomposition and Parallelization of Python Programs for Heterogeneous Computing
This paper introduces a system that automatically decomposes, parallelizes, and distributes Python programs across heterogeneous infrastructures, including CPUs, GPUs, Docker containers, Kubernetes clusters, and Slurm environments. By pairing a skeleton-based compiler with lightweight Python decorators, the approach lets non-experts write ordinary notebook code while annotations describe the computational patterns, lowering the barrier to high-performance and distributed computing.
System Overview and Innovation
The system pairs a skeleton-based compiler with lightweight Python decorators: developers write regular notebook code, and annotations describe computational patterns that the compiler transforms into parallel workflows. A backend-aware scheduler executes these workflows across heterogeneous infrastructures, while a custom Jupyter kernel and front-end preserve an interactive, portable experience regardless of backend.
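The paper does not reproduce its decorator API verbatim. As an illustration of the annotation style it describes, the sketch below uses a hypothetical `@task` decorator whose `pattern`, `target`, and `depends_on` arguments are assumed names, not the system's actual interface:

```python
import functools

# Hypothetical registry the Jupyter kernel could consult when
# extracting structure from decorated user code.
TASK_REGISTRY = {}

def task(pattern, target="auto", depends_on=()):
    """Hypothetical decorator: records a function's computational
    pattern, preferred execution target, and data dependencies,
    while leaving the user's code path unchanged."""
    def wrap(fn):
        TASK_REGISTRY[fn.__name__] = {
            "pattern": pattern,            # e.g. "map", "pipeline", "data-parallel"
            "target": target,              # e.g. "cpu", "gpu", "auto"
            "depends_on": tuple(depends_on),
        }
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            return fn(*args, **kwargs)     # user code runs as written
        return inner
    return wrap

@task(pattern="map", target="cpu")
def preprocess(batch):
    return [x / 255.0 for x in batch]

@task(pattern="data-parallel", target="gpu", depends_on=("preprocess",))
def train(data):
    return sum(data)  # stand-in for a training step
```

In this scheme the decorated functions remain plain Python callables; only the side-channel metadata in the registry tells the compiler how to map each segment onto a skeleton.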
Positioning Against Prior Art
The authors survey existing source-to-source compilers (ROSE, PIPS, OMNI, Cetus), workflow engines (StreamFlow, Snakemake), and Jupyter-centric orchestration tools (Jupyter-Workflow). They argue that existing solutions either target homogeneous shared-memory parallelism, require new domain-specific languages, or expose users to explicit workflow management. Their system uniquely blends compiler-based transformation with notebook metadata to preserve Python ergonomics while enabling distributed execution.
Core Architecture Design
The design consists of three main components: (1) a Jupyter kernel that intercepts code and extracts structure from decorators; (2) a compiler back-end that maps annotated segments to high-level "skeletons" (map-reduce, pipeline, data-parallel training) and builds a DAG with explicit data dependencies; and (3) a distributed runtime, running inside GPU-enabled Docker containers, that orchestrates resources across devices and nodes. The backend supports Docker, Kubernetes, and Slurm without requiring changes to user code.
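As a rough sketch of stage (2), annotated segments can be lowered into a DAG whose edges are data dependencies and then executed in any topological order. The task names below are illustrative, not taken from the paper:

```python
from graphlib import TopologicalSorter

# Illustrative workflow DAG: task name -> set of tasks it depends on.
dag = {
    "load":       set(),
    "preprocess": {"load"},
    "train":      {"preprocess"},
    "evaluate":   {"train"},
}

def schedule(dag):
    """Return one valid execution order that respects every
    data dependency (a topological sort of the DAG)."""
    return list(TopologicalSorter(dag).static_order())

order = schedule(dag)
```

A backend-aware scheduler would dispatch each node in such an order (or concurrently, once its predecessors finish) to the target recorded in the node's annotation.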
Practical Demonstration
To demonstrate the approach, the authors implement a convolutional neural network training workflow on the Fashion-MNIST dataset for image classification. Using TensorFlow's MirroredStrategy, the system replicates the model across visible GPUs, performs synchronous gradient aggregation, and manages placement and communication automatically via the compiled execution graph. The notebook workflow handles data loading, preprocessing, training, and evaluation as tasks, with decorators conveying roles, dependencies, and preferred execution targets.
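MirroredStrategy's synchronous step is essentially an all-reduce that averages per-replica gradients so every replica applies the same update. The pure-Python sketch below mimics that averaging; it is not TensorFlow code, and the replica count and gradient values are illustrative:

```python
def allreduce_mean(per_replica_grads):
    """Average gradients elementwise across replicas, as a synchronous
    all-reduce would, so all replicas stay in lock-step."""
    n = len(per_replica_grads)
    width = len(per_replica_grads[0])
    return [sum(g[i] for g in per_replica_grads) / n for i in range(width)]

# Four replicas (e.g. four GPUs), each holding gradients for two parameters.
grads = [
    [0.2, 0.4],
    [0.4, 0.4],
    [0.2, 0.8],
    [0.0, 0.0],
]
avg = allreduce_mean(grads)  # every replica applies this same update
```

In the real system, NCCL's ring-allreduce performs this reduction bandwidth-efficiently across GPU links rather than gathering all gradients in one place.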
Performance Results
Empirical results show nearly linear scaling from one to four GPUs: training time for 20 epochs drops from approximately 1,800 seconds (single GPU) to 480 seconds (four GPUs), a 3.75× speedup with 92-95% average GPU utilization and only 6-8% communication overhead from NCCL ring-allreduce. These gains require only minimal code changes through decorators and kernel metadata rather than extensive manual parallelization and device management.
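The reported figures are internally consistent, as a quick check of speedup and parallel efficiency shows (the helper names are ours, the numbers are the paper's):

```python
def speedup(t_single, t_parallel):
    """Classic speedup: single-device time over parallel time."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, n_devices):
    """Parallel efficiency: speedup divided by device count."""
    return speedup(t_single, t_parallel) / n_devices

s = speedup(1800, 480)        # 1800 s on one GPU vs 480 s on four
e = efficiency(1800, 480, 4)  # fraction of ideal linear scaling
```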
Automated Resource Management
The system automatically handles execution artifacts including model weights, training logs, and performance metrics, which are versioned and organized by the runtime. This automation eliminates the need for manual resource management and provides comprehensive tracking of computational experiments across different execution environments.
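The paper does not specify the artifact layout its runtime uses. One plausible scheme, sketched here with hypothetical paths and a hypothetical `record_run` helper, versions each run's outputs under its own run directory:

```python
import json
import pathlib
import tempfile
import time

def record_run(root, metrics, run_id=None):
    """Store one run's metrics under a versioned run directory.
    Hypothetical layout; the paper's runtime manages this automatically."""
    run_id = run_id or time.strftime("run-%Y%m%d-%H%M%S")
    run_dir = pathlib.Path(root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "metrics.json").write_text(json.dumps(metrics))
    return run_dir

# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    d = record_run(root, {"epochs": 20, "gpus": 4}, run_id="run-0001")
    saved = json.loads((d / "metrics.json").read_text())
```

Model weights and training logs would live alongside `metrics.json` in the same run directory, giving each experiment a self-contained, comparable record.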
Future Directions and Impact
The paper concludes that compiler-assisted decomposition embedded in Jupyter can democratize access to scalable, heterogeneous computing for AI/ML practitioners. Future work targets broader model support (transformers), dynamic resource adaptation, hybrid CPU-GPU and multi-node strategies, integration with PyTorch/JAX, improved fault tolerance and checkpointing, and support for emerging accelerators including TPUs, IPUs, and FPGAs.