Enabling Resilient Machine Learning Systems

Jan 25, 2023

Resilient machine learning systems are fast, accurate, and flexible. They assist you in your day-to-day tasks for maximum efficiency, they leverage the latest software and hardware for the fastest performance, and they guide you through complex tasks for the best accuracy. The Azure ML team has long focused on bringing you a resilient product, and its latest features take one giant leap in that direction, as illustrated in the graph below (Figure 1). Continue reading to learn more about Azure ML’s latest announcements.

Figure 1. The two steps to building resilient machine learning systems.

1. Speed improvements in ML workflow

When choosing a machine learning cloud platform, speed is top of mind. We want to be efficient by reusing our code, sharing it with others, and using pre-made solutions for our most common problems. And we want our long-running operations, such as data processing and model training, to run blazing fast by taking advantage of the latest hardware and software available in the industry. This is the motivation behind several of Azure ML’s latest features.

Azure ML recently announced the public preview availability of registries, which are organization-wide repositories of assets, such as machine learning models, components (reusable pieces of custom code), and environments (software configuration needed to run your code). With registries, you can share assets with users across different workspaces, which greatly simplifies collaboration. You can also re-use assets from the “azureml” system registry, which is available to all users. For now, this public registry contains reusable components that make it easier to work with Responsible AI, but new capabilities will continuously be added in the future.
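To make the idea of a shared registry more concrete, here is a minimal sketch of how you might list the components published in the “azureml” system registry with the Python SDK v2. It assumes the azure-ai-ml and azure-identity packages are installed and that you are signed in to Azure; the helper name list_registry_components is ours, not part of the SDK.

```python
def list_registry_components(registry_name="azureml"):
    """Return the sorted names of the components published in a registry."""
    # Imported lazily so this file loads even where the SDK is not installed.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    # A registry-scoped client: it points at a registry, not at a workspace.
    registry_client = MLClient(
        credential=DefaultAzureCredential(),
        registry_name=registry_name,
    )
    # components.list() yields component assets; we keep only their names.
    return sorted({component.name for component in registry_client.components.list()})
```

Passing the name of your own organization's registry instead of "azureml" should work the same way, assuming your account has access to it.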

Another recent announcement, also still in public preview, is the integration of Spark with Azure ML. This new feature enables you to run large data wrangling operations efficiently, within Azure ML, by leveraging Azure Synapse Analytics to get access to an Apache Spark pool. Users can choose between two approaches: Managed Spark, where Microsoft takes care of managing the Synapse Spark resources and infrastructure on your behalf, and Attached Spark, which allows you to bring your own Synapse Spark compute into Azure ML. This speedup in data processing can be accessed regardless of whether you’re using notebooks, the Azure ML Studio UI, the CLI, or the SDK.

In addition, Azure ML recently announced the public preview of the Azure Container for PyTorch, a new container that delivers innovative technologies that greatly accelerate training and inference for PyTorch models. This container consists of a DSVM (the Azure DSVM for PyTorch), which can be used independently of Azure ML, and a curated environment (the Azure Curated Environment for PyTorch), which deeply integrates with Azure ML. The performance improvements are significant: we’ve seen gains of up to 85% on training and inference of Hugging Face models! If you’re working with large PyTorch models within Azure ML, switching to this new environment is an easy change that brings a great payoff.

2. Better transparency, accuracy, and accessibility to powerful ML frameworks

Data scientists know that accuracy and performance are no longer the only objectives when developing machine learning systems; fairness and interpretability must be considered as well. To make sure that machine learning solutions are fair and that their predictions are easy to understand and explain, developers and data scientists need tools that help them improve accuracy, assess their models’ fairness, and mitigate any unfairness they observe.

Azure Machine Learning CLI v2 and Azure Machine Learning Python SDK v2 standardize features and terminology across the interfaces to improve the experience of data scientists on Azure. Azure ML Python SDK v2 is an updated Python SDK package that allows developers and data scientists to submit training jobs; manage data, models, and environments; perform managed inferencing (real time and batch); and stitch together multiple tasks and production workflows using Azure ML pipelines. The SDK v2 is on par with the CLI v2 in functionality, and assets and actions work consistently across the two, which speeds up your ML development cycle.
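As an illustration of submitting a training job with the SDK v2, here is a minimal sketch. It assumes the azure-ai-ml and azure-identity packages are installed and that you have an existing workspace. The source folder, script name, environment name, and compute name are hypothetical placeholders that you would replace with your own.

```python
def submit_training_job(subscription_id, resource_group, workspace_name):
    """Submit a simple command job to an Azure ML workspace and return it."""
    # Imported lazily so this file loads even where the SDK is not installed.
    from azure.ai.ml import MLClient, command
    from azure.identity import DefaultAzureCredential

    # Connect to the workspace that will run the job.
    ml_client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name,
    )

    # Describe the job: which code to upload, how to run it, and where.
    job = command(
        code="./src",                          # hypothetical folder containing train.py
        command="python train.py",             # hypothetical training script
        environment="my-training-env@latest",  # hypothetical registered environment
        compute="cpu-cluster",                 # hypothetical compute cluster name
        display_name="sdk-v2-training-sketch",
    )

    # Submitting returns the created job, which you can monitor in the studio.
    return ml_client.jobs.create_or_update(job)
```

The same job could be described in a YAML file and submitted with the CLI v2, which is the consistency between the two interfaces that the paragraph above refers to.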

To improve the speed of your development cycle and the accuracy of your ML models, Azure ML now lets you use automated ML for specific key scenarios, such as vision and natural language processing (NLP) models. You can create NLP models with automated ML via the Azure Machine Learning Python SDK v2 or the Azure Machine Learning CLI v2. This capability allows data scientists to bring their own text data and build custom models for tasks such as multi-class text classification, multi-label text classification, and named entity recognition (NER). Moreover, support for computer vision tasks allows you to easily generate models trained on image data for scenarios like image classification, object detection, and instance segmentation. To train image and NLP models, you can either label your data with the Azure Machine Learning data labeling capability or use data that is already labeled, and you can optimize model performance by specifying the model algorithm and tuning the hyperparameters.
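Here is a minimal sketch of what an automated ML text classification job could look like with the SDK v2. It assumes you already have an MLClient connected to your workspace and that your training and validation data are stored as MLTable folders. The experiment name, the 60-minute limit, and the function name are illustrative choices of ours.

```python
def submit_text_classification(ml_client, compute_name, training_folder,
                               validation_folder, target_column):
    """Configure and submit an automated ML multi-class text classification job."""
    # Imported lazily so this file loads even where the SDK is not installed.
    from azure.ai.ml import Input, automl
    from azure.ai.ml.constants import AssetTypes

    job = automl.text_classification(
        compute=compute_name,
        experiment_name="automl-nlp-sketch",  # illustrative experiment name
        training_data=Input(type=AssetTypes.MLTABLE, path=training_folder),
        validation_data=Input(type=AssetTypes.MLTABLE, path=validation_folder),
        target_column_name=target_column,     # column that holds the labels
        primary_metric="accuracy",
    )
    # Cap the total run time so an experiment cannot run indefinitely.
    job.set_limits(timeout_minutes=60)

    return ml_client.jobs.create_or_update(job)
```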

To improve the transparency and interpretability of your ML solutions, Azure ML offers the Responsible AI dashboard that provides a single interface to help you implement Responsible AI in practice effectively and efficiently. The dashboard is integrated with Azure Machine Learning CLI v2, Azure Machine Learning Python SDK v2, and Azure Machine Learning studio. The tools include:

Model interpretability, to understand predictions, both at the individual and global level.
Counterfactual what-if, to examine feature perturbations and see how they would affect your model predictions.
Causal analysis, to understand the causal effects of treatment features on real-world outcomes.
Data analysis, to understand and explore distributions and statistics in your data.
Model overview and fairness assessment, to identify your model’s fairness issues.
Error analysis, to analyze how model errors are distributed in your data.
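To give a feel for what some of these tools compute, here is a small plain-Python sketch of three of the ideas: error analysis by cohort, a simple fairness measure (the gap in selection rate between groups, sometimes called demographic parity difference), and a counterfactual what-if. This is not how the dashboard is implemented; the data, the toy model, and the function names are made up for illustration.

```python
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Error analysis: fraction of wrong predictions within each cohort."""
    totals, errors = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        errors[group] += int(truth != pred)
    return {group: errors[group] / totals[group] for group in totals}


def selection_rate_gap(y_pred, groups, positive=1):
    """Fairness: largest gap in positive-prediction rate between groups."""
    totals, selected = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        selected[group] += int(pred == positive)
    rates = [selected[group] / totals[group] for group in totals]
    return max(rates) - min(rates)


def what_if(predict, row, feature, new_value):
    """Counterfactual what-if: prediction before and after changing one feature."""
    perturbed = {**row, feature: new_value}
    return predict(row), predict(perturbed)


def toy_model(row):
    """Made-up model: approves (1) when income is at least 50."""
    return int(row["income"] >= 50)


# Made-up labels, predictions, and group membership for eight examples.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

errors = error_rate_by_group(y_true, y_pred, groups)
gap = selection_rate_gap(y_pred, groups)
before, after = what_if(toy_model, {"income": 40, "age": 30}, "income", 60)

print(errors)         # {'a': 0.25, 'b': 0.5}
print(gap)            # 0.75
print(before, after)  # 0 1
```

In this toy data, the model makes twice as many mistakes on group "b" and selects every member of that group, which is the kind of pattern the error analysis and fairness views are meant to surface.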

About the authors:

Francesca Lazzeri is a data scientist. She currently leads an organization of data scientists and engineers at Microsoft, and teaches Python for ML at Columbia University. Previously, she was a researcher in the Technology and Operations Management unit at Harvard University.

Bea Stollnitz is a developer advocate at Microsoft, focusing on Azure ML and other AI/ML technologies. She has a background in scientific machine learning, applied math, and software engineering. She loves to share her knowledge with others through her Azure ML and machine learning blog.

Originally posted on OpenDataScience.com

Written by ODSC - Open Data Science