AI WORKFLOWS & CLOUD CONTINUUM
AI Workflow Orchestration Across Cloud Continuum: From Jupyter to Heterogeneous Execution
This paper explores strategies for defining, deploying, and orchestrating AI workflows across a "Cloud Continuum" that spans cloud, HPC, and edge environments. Starting from interactive computing platforms—particularly Jupyter—it examines how modified kernels and extensions can execute code on heterogeneous, distributed backends while preserving the notebook user experience. The overarching aim is "code once, run anywhere" for AI workflows, including those used to fine-tune foundation and large language models.
Vision and Motivation
The motivation is to spare practitioners from refactoring notebooks whenever the target infrastructure changes: a workflow authored interactively in Jupyter should run unchanged whether the backend is a public cloud, an HPC cluster, or an edge device. This "code once, run anywhere" goal applies equally to everyday data-science notebooks and to demanding AI workloads such as fine-tuning foundation and large language models.
State of the Art Analysis
After surveying the landscape—workflow engines such as Airflow, Kubeflow, Arvados, Nextflow, and StreamFlow; container technologies like Docker and Singularity; orchestration with Kubernetes; and serverless patterns—the authors motivate combining workflow orchestration with a source-to-source compiler that uses annotations and skeletons to automatically decompose, parallelize, and place code. This blend promises portability across multi-cloud, hybrid cloud, HPC clusters, and edge devices without forcing users to refactor notebooks into separate pipeline DSLs.
Project Objectives and Innovation
The project objectives crystallize around lifting notebook-authored workflow nodes so they are independent of infrastructure specifics. By abstracting environment details and automating distribution, a single prototype can be executed on diverse targets (HPC, cloud, edge), enabling optimized placement across a combined HPC–Cloud fabric while engaging the large Jupyter user base.
Methodological Framework
Methodologically, the work proceeds in four phases: (1) analyze the state of the art in interactive computing on heterogeneous architectures; (2) integrate existing workflow tools with interactive computing; (3) integrate a decorator- and skeleton-based source-to-source compiler with those workflow tools; and (4) fuse notebook-level annotation with the compiler and orchestrators into an efficient, portable execution stack. A proof of concept (PoC) is planned to validate these integrations.
Interactive Execution Strategies
Two concrete strategies are detailed. First, a Jupyter extension provides an interactive UI to create/select execution environments per cell (local, Docker, Kubernetes, SSH/HPC), manage reusable "templates" vs. instantiated environments (both JSON-described), and support Admin/Group/Users sharing for labs and teams—turning cell pipelines into a logical workflow that maps to heterogeneous resources.
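A reusable environment "template" of this kind could be expressed as JSON and instantiated per cell. The sketch below is illustrative only—the field names and the instantiate helper are assumptions, not the extension's actual schema or API:

```python
import json

# Illustrative JSON "template" for a reusable execution environment.
# Field names are hypothetical, not the extension's actual schema.
TEMPLATE = {
    "name": "gpu-training",
    "kind": "kubernetes",  # e.g. local | docker | kubernetes | ssh-hpc
    "image": "python:3.11-slim",
    "resources": {"cpu": 4, "memory": "16Gi", "gpu": 1},
    "sharing": {"admin": ["alice"], "group": ["ml-lab"], "users": []},
}

def instantiate(template: dict, **overrides) -> dict:
    """Create a concrete environment from a template, overriding fields per cell."""
    env = json.loads(json.dumps(template))  # deep copy via JSON round-trip
    env.update(overrides)
    env["template"] = template["name"]      # record which template it came from
    return env

# A cell-specific environment derived from the shared template:
env = instantiate(TEMPLATE, name="gpu-training-cell3",
                  resources={"cpu": 8, "memory": "32Gi", "gpu": 2})
print(env["template"], env["resources"]["gpu"])
```

The template/instance split mirrors the Admin/Group/Users sharing model: admins curate templates, while users stamp out per-cell instances without touching infrastructure details.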
Compiler-Driven Parallelization
Second, decorators drive intra-cell parallelization and on-the-fly "sub-environment" creation (e.g., lightweight container images tailored to a cell's dependencies), leveraging the source-to-source compiler to emit a parallel DAG for the orchestrator. This approach enables automatic decomposition and optimization of computational patterns within individual notebook cells.
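As a rough intuition for how such a decorator might behave, the sketch below fans a function out over its inputs with a thread pool—a stand-in for the compiler rewriting an annotated loop into a parallel DAG for the orchestrator. The decorator name and signature are hypothetical, not the paper's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical decorator illustrating intra-cell parallelism: the decorated
# function is applied to each input item concurrently, approximating what a
# source-to-source compiler could emit as independent DAG nodes.
def parallel_map(workers=4):
    def wrap(fn):
        def run(items):
            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(fn, items))  # preserves input order
        return run
    return wrap

@parallel_map(workers=4)
def preprocess(shard):
    # stand-in for per-shard work inside a notebook cell
    return shard * 2

print(preprocess([1, 2, 3]))
```

In the paper's design, each such task could additionally run in its own "sub-environment" (e.g., a lightweight container image built from the cell's dependencies) rather than a local thread.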
Impact and Future Directions
The paper concludes that cell-level parallelism, dynamic environment customization, and heterogeneous execution substantially improve notebook-based AI development at scale. Future work targets polyglot notebooks, richer execution reporting and performance metrics, deeper integration with embedded/FPGA platforms, and "sub-cell" execution for even finer-grained splitting—pushing the platform toward comprehensive, portable workflow optimization across cloud, HPC, and edge.
Conclusion
This work advances notebook-centric AI development by showing how interactive computing can be coupled with workflow orchestration and source-to-source compilation so that a single notebook runs across cloud, HPC, and edge resources. The proposed approach offers a practical foundation for portable AI workflows that meet the demands of heterogeneous computing environments.
The combination of a systematic survey, concrete extension and decorator strategies, and a planned proof of concept makes this approach valuable for both academic research and industrial AI workflow deployment.