Key Takeaways
- A unified ML management system requires careful orchestration of multiple components, from experiment tracking with MLflow to model serving with FastAPI.
- Interactive visualization through Streamlit enables rapid prototyping, validation, and stakeholder communication, serving as both a development tool and a platform for model behavior analysis.
- Use containerization technologies like Docker and Kubernetes to meet resource management and scaling requirements, particularly for the monitoring service.
- The monitoring trinity (Prometheus, Grafana, and Evidently AI) provides comprehensive system observability by combining infrastructure metrics, visualization capabilities, and ML-specific monitoring to ensure reliable model performance.
- A dual approach of data drift detection and Shapley Additive exPlanations (SHAP) analysis enables a deep understanding of model behavior and of how feature importance patterns differ between small and large transactions, leading to more interpretable and trustworthy fraud detection.
Machine learning pipelines encompass several key components: data preprocessing, model experimentation, training, deployment, and evaluation. Machine learning engineers often face significant challenges in production environments, such as difficulty reproducing code from notebooks developed by data scientists and finding the correct model version when moving those models toward production.
Deploying a prototype to production requires careful consideration of various aspects, such as code refactoring for scalability, version control implementation, containerization, and automated testing with continuous integration. Adopting a production-oriented mindset and observability practices during the initial stages of model development can significantly facilitate the deployment process and helps us create more robust and maintainable machine learning systems that are better suited for real-world applications.
Building the ML Foundation
As shown in Figure 1 below, a machine learning pipeline starts with data processing, where raw information is collected and processed for analysis. Next, in the experimentation and training stage, data scientists explore algorithms, feature engineering techniques, and model architectures to learn patterns. Then, the trained model is deployed in the inference stage to make predictions on new incoming data through APIs.
In the subsequent monitoring stage, we ensure the model’s effectiveness by tracking system performance metrics and detecting drift. Organizations can streamline their workflows, improve team collaboration, and accelerate the delivery of AI-powered solutions through these comprehensive steps.
ML pipelines serve as a critical, structured framework for managing the increasing complexity of developing and deploying modern machine learning models, enabling data scientists and engineers to focus on innovation rather than getting bogged down in operational details.
Figure 1: Workflow of a Machine Learning Project
Building an Observable ML Pipeline
This article will demonstrate an ML pipeline and its observability for real-world credit card fraud detection. When customers swipe their cards, they expect an instant decision: approve or decline. Behind this split-second decision lies a complex machine learning system that must be accurate, observable, maintainable, and reliable. This article explores how to build an observable fraud detection system, sharing practical insights from development to production monitoring.
The credit card fraud dataset is based on real-world transactions made with credit cards by European cardholders in September 2013. However, it is a simplified and pre-processed version, and this example application demonstrates the workflow of moving from local model development to a centralized platform. The same process applies to real-world fraud detection systems, where extensive transaction details and pre-engineered features, similar to the Kaggle dataset’s structure, would be used to train models that predict fraudulent transactions.
The focus is on building an observable platform and workflow, not the specific model’s ultimate performance on real-world fraud data. The dataset provides the transaction amount, time, and already preprocessed features (V1-V28), which are the principal components obtained with PCA. scikit-learn, an open-source machine learning Python library, was used to develop a logistic regression model for the credit card data.
The model predicts the probability of a transaction being fraudulent. Without focusing much on model development, let’s understand how to transition from localized model development to a scalable, centralized experimentation platform, which facilitates navigation through the complex model creation and implementation process.
Containerization
It’s a good practice to design with scalability in mind from the beginning of a project, using a containerization tool such as Docker to ensure the application code can run anywhere. Within the Docker container, each component is encapsulated with the necessary dependencies and runtime environment to ensure the system’s integrity and ability to scale across different environments. The system uses Apache Kafka as a distributed event streaming platform to handle real-time transaction data. Docker’s horizontal scaling capability and Kafka’s ability to process large data streams allow the system to adapt and stay efficient even under heavy load. We will revisit data streaming when implementing the inference pipeline.
The main advantages of Docker containers are that they are:
- Resource-efficient (no virtual machines).
- Platform-independent.
- Able to simplify dependency management using base images.
The Dockerfile describes how to build the application, which is available as a GitHub project. The fraud detection model has several components, each of which is placed in a separate container, as reflected in docker-compose.yml. See Figure 2 to understand how the Docker containers are structured to support the experimentation and inference pipelines for the fraud detection model.
Figure 2: Docker Container Architecture for Real-time Fraud Detection System
Experiment Tracking & Model Registry with MLflow
Building a reproducible ML pipeline: MLflow is the backbone of our fraud detection system’s experiment tracking and model registry management. By setting up MLflow with mlflow.set_tracking_uri("http://mlflow:5001"), we establish a centralized location for all our experiments and models. Each experiment is organized under the "fraud_detection" namespace using mlflow.set_experiment("fraud_detection") so that different modeling approaches are kept separate.
We use MLflow’s run-tracking capabilities when training our models through the mlflow.start_run() context manager, which captures all relevant information about each training session. Within each run, model parameters such as the model type, scaling method, and feature count, along with performance metrics like accuracy, precision, and recall, create a comprehensive record of each experiment. After successful training, models are registered using mlflow.sklearn.log_model() with a specific version and name. The model registry maintains a clear history of model versions, making staging and production versions and their performance characteristics easy to track.
Setting up experiment tracking:
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://mlflow:5001")
mlflow.set_experiment("fraud_detection")

# run_name, self.model, and signature come from the surrounding training class
with mlflow.start_run(run_name=run_name) as run:
    # Add structured tags here
    # Your training code here

    # Log model to registry
    mlflow.sklearn.log_model(
        sk_model=self.model,
        artifact_path="model",
        signature=signature,
        registered_model_name="fraud_detection_model"
    )
To build and run the Docker containers so that the trained model is registered and accessible in the MLflow registry:
# Build the container
docker-compose build
# Start services in correct order
docker-compose up -d mlflow
docker-compose run --rm train
With this setup, each training run records the model parameters and performance metrics described above, as shown in Figure 3, and the model registry keeps a clear history of model versions and their metrics, making staging and production version tracking easy.
Figure 3: Tracking model performance metrics
One of the most valuable aspects of the experiment and model registry implementation is its ability to compare different experiments. Figure 4 shows a centralized repository for all model experiments, moving away from localized environments such as standalone scripts and notebooks. Data scientists can easily compare the performance of different model versions, analyze the impact of parameter changes, and make informed decisions about which models to promote to production. This systematic approach to experiment tracking ensures that our fraud detection system remains reproducible, maintainable, and production-ready.
Figure 4: Displaying model runs for experiments
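Beyond the UI, the same comparison and promotion can also be scripted against the registry. A minimal sketch using the MLflow client (the chosen version number below is hypothetical):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow:5001")

# List registered versions of the fraud detection model with their stage and source run
for mv in client.search_model_versions("name='fraud_detection_model'"):
    print(mv.version, mv.current_stage, mv.run_id)

# Promote a chosen version to Production once its metrics look good
client.transition_model_version_stage(
    name="fraud_detection_model",
    version="3",  # hypothetical version number
    stage="Production",
)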
Building an Interactive Demo with Streamlit
Streamlit is an excellent choice for creating interactive machine-learning demos due to its simplicity and powerful features. It can be used creatively to set up key features like real-time fraud detection, interactive input controls for transaction details, visual probability scores, feature importance visualization, detailed transaction analysis, and more.
Key Components
The application is structured into essential components that work together to provide a comprehensive fraud detection demo interface in the Streamlit app:
- Model Loading: Seamless integration with MLflow
- User Interface: Clean, two-column layout for input parameters
- Real-time Predictions: Instant feedback on transaction risk
- Visual Analytics: Interactive charts showing feature importance
- Error Handling: Robust error management with user-friendly messages
To build the Streamlit fraud detection demo app, the script would look like this:
# Import libraries here
import mlflow
import streamlit as st

def load_models():
    # This port is defined in the docker-compose.yml file
    mlflow.set_tracking_uri("http://mlflow:5001")
    # Load models from the MLflow registry here

# Creating the Streamlit demo app
def main():
    st.title("Fraud Detection Model Playground")

    # Load MLflow model
    model = load_models()

    # Input form with two columns
    st.subheader("Transaction Details")
    col1, col2 = st.columns(2)

    # Transaction amount input
    with col1:
        amount = st.number_input(
            "Transaction Amount ($)",
            min_value=0.0,
            value=100.0
        )

    # Time sequence input
    with col2:
        time_value = st.slider(
            "Seconds from first transaction",
            min_value=0,
            max_value=172800  # 48 hours
        )

    # Add explanations about the time feature from here
To spin up the Streamlit fraud detection demo app, simply run the command below; the Streamlit app will be live on the port assigned in the docker-compose.yml file under the streamlit_playground key. Then, you will see the UI, as shown in Figure 5.
docker-compose up streamlit_playground
Figure 5: Fraud detection model interactive UI on the Streamlit app
Real-time feature analysis with Streamlit
Another great advantage is building visual analytics that show which feature values drive the model’s prediction. The model considers all features together to make its prediction, and the final fraud probability comes from the combined pattern of all features. Knowing which feature values or characteristics influenced the likelihood of fraud, as shown in Figure 6, greatly benefits model understanding and helps improve the model further (a small sketch of such a chart follows the list below).
Figure 6: Displaying the top 10 features which influence the fraud detection model prediction
- The bars show feature values, NOT fraud probabilities
- Blue/Positive values and Red/Negative values represent different transaction patterns.
- The model learns which combinations of these patterns indicate fraud
- A single blue (positive) value doesn’t necessarily mean fraud
- It’s the combination of all features that determines the final prediction
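A minimal sketch of how such a feature-value chart could be drawn in the Streamlit app, assuming input_df is a one-row DataFrame of the model’s input features (the helper name and plotting choices are illustrative, not the project’s exact code):

import pandas as pd
import plotly.express as px
import streamlit as st

def plot_top_features(input_df: pd.DataFrame) -> None:
    # Pick the 10 features with the largest absolute values for this transaction
    values = input_df.iloc[0]
    top = values.reindex(values.abs().sort_values(ascending=False).index[:10])
    fig = px.bar(
        x=top.values,
        y=top.index,
        orientation="h",
        color=top.values > 0,  # separate positive from negative feature values
        labels={"x": "Feature value", "y": "Feature"},
    )
    st.plotly_chart(fig)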
Practical Advantages of Streamlit Demo
- Rapid Prototyping: Tests model behavior with different inputs instantly
- Validation: Quick iteration and easy verification of model performance across different scenarios
- Model Monitoring: Early detection of model drift or unexpected behaviors
- Hypothesis Testing: Validate assumptions about feature importance and model decisions
- Documentation: Living documentation of model behavior and features
- Collaboration: Shared platform for team discussions about model behavior
- Training: Excellent way for onboarding new team members to the project
Building the Inference Pipeline
We must build an inference pipeline to make model predictions available to the user. The FastAPI service is the primary endpoint for real-time fraud detection predictions, orchestrated via the docker-compose.yml file. It communicates with MLflow to load the latest production model generated from training and processes incoming transaction requests. Apache Kafka is used for streaming capabilities. The streaming service provides the foundation for real-time data processing. It handles transaction streams using Kafka and Zookeeper, enabling future implementations of real-time monitoring and continuous model updating.
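As a simplified sketch, the streaming service could consume transactions from a Kafka topic and forward them to the prediction API; the topic name, broker address, client library (kafka-python), and endpoint path below are assumptions rather than the project’s exact configuration:

import json

import requests
from kafka import KafkaConsumer  # kafka-python

# Assumed broker address and topic name from the docker-compose setup
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    # Forward each transaction to the prediction endpoint (path assumed)
    response = requests.post("http://api:8000/predict", json=message.value)
    print(response.json())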
The inference flow begins when a transaction request hits our API endpoint. The FastAPI service loads the latest model file from MLflow’s model registry using the configured tracking URI (http://mlflow:5001). This setup ensures that the most recent production model is used for predictions while maintaining version control and reproducibility.
Each prediction is logged with its unique identifier, input features, prediction result, and processing time. These data points are crucial for monitoring model performance and detecting various drifts in production, such as changes in features and performance, or even for ensuring data quality.
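A rough sketch of such an endpoint follows, assuming the registered model name from earlier; the request schema, endpoint path, and logging format are illustrative assumptions rather than the project’s exact code:

import time
import uuid

import mlflow
import mlflow.sklearn
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

mlflow.set_tracking_uri("http://mlflow:5001")
# Load the latest Production-stage model from the MLflow model registry
model = mlflow.sklearn.load_model("models:/fraud_detection_model/Production")

app = FastAPI()

class Transaction(BaseModel):
    features: dict  # flat mapping of feature name to value (Time, V1-V28, Amount)

@app.post("/predict")
def predict(transaction: Transaction):
    prediction_id = str(uuid.uuid4())
    start = time.time()
    frame = pd.DataFrame([transaction.features])
    probability = float(model.predict_proba(frame)[0][1])
    latency_ms = (time.time() - start) * 1000
    # Each prediction is logged with its id, inputs, result, and processing time
    print({"id": prediction_id, "inputs": transaction.features,
           "fraud_probability": probability, "latency_ms": latency_ms})
    return {"id": prediction_id, "fraud_probability": probability}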
Resource Management and Scaling
The docker-compose configuration includes careful resource management, particularly for the monitoring service, which handles complex calculations and report generation. We’ve allocated a memory limit of up to 4 GB, with a 2 GB memory reservation, to ensure stable performance.
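A sketch of the corresponding fragment of docker-compose.yml (the monitoring service name and exact keys are assumptions based on the description above):

services:
  monitoring:
    deploy:
      resources:
        limits:
          memory: 4G        # hard memory cap for drift reports and calculations
        reservations:
          memory: 2G        # memory reserved for the monitoring service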
This setup allows our inference pipeline to handle production loads while maintaining reliable performance monitoring. Separating concerns between services (API, streaming, monitoring) allows for independent scaling and maintenance of each component.
Our fraud detection system’s inference pipeline, built with FastAPI and MLflow, serves as a robust foundation for real-time predictions. However, as transaction volumes grow and response time requirements become more stringent, a single server can only handle a limited number of concurrent requests. Additionally, our model serving approach, which loads the model from MLflow for predictions, needs to be optimized for distributed scenarios. So, we must evolve from a single-instance architecture to a distributed system capable of handling millions of transactions while maintaining sub-second response times. The complete implementation of horizontal scaling is a progressive journey that requires careful planning and execution. Let’s explore the essential tools that form the foundation of a scalable fraud detection system.
Kubernetes: The Foundation for Horizontal Scaling
Kubernetes is the natural choice for orchestrating fraud detection services and providing automated deployment, scaling, and management of our containerized applications. Its ability to handle rolling updates, self-heal failed containers, and efficiently manage resources makes it ideal for maintaining consistent performance across multiple instances of our fraud detection API. In addition to Kubernetes, load balancing and service management tools such as NGINX Ingress Controller should be considered for traffic distribution and Istio for service mesh capabilities.
While the Kubernetes implementation deserves its own deep dive, the key is to introduce these components progressively as the transaction volume increases.
Model Observability: Keeping a Pulse on Production ML Systems
Comprehensive monitoring requires a multifaceted approach in production ML systems, as shown in Figure 7, especially for critical applications like fraud detection. The monitoring stack combines three specialized tools: Prometheus, Grafana, and Evidently AI. Each serves a unique purpose in ensuring system reliability and performance.
The Monitoring Trinity
Figure 7: Model Observability Strategies
Prometheus is the metrics collection backbone, gathering system-level and operational metrics. It efficiently collects and stores time-series data about our API performance, resource utilization, and request patterns. It also acts as the system’s health monitor, tracking vital signs like:
- Response times for prediction requests
- API throughput and error rates
- Model loading times
- Resource utilization across services
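As an illustration, the API service could expose such metrics with the prometheus_client library; the metric names, port, and wrapper function below are hypothetical:

import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names for Prometheus to scrape
PREDICTION_LATENCY = Histogram(
    "fraud_prediction_latency_seconds", "Time spent serving a prediction request"
)
PREDICTION_ERRORS = Counter(
    "fraud_prediction_errors_total", "Number of failed prediction requests"
)

# Expose the metrics on a dedicated port for Prometheus to scrape
start_http_server(9100)

def timed_predict(predict_fn, payload):
    # Wrap a prediction call so its latency and errors are recorded
    start = time.time()
    try:
        return predict_fn(payload)
    except Exception:
        PREDICTION_ERRORS.inc()
        raise
    finally:
        PREDICTION_LATENCY.observe(time.time() - start)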
Grafana provides powerful visualization and custom alerting capabilities. It allows system metrics to be combined with business KPIs to transform raw metrics into actionable insights through interactive dashboards that serve technical and business stakeholders. Grafana helps us visualize trends, set alerts, and create comprehensive views combining ML metrics, system performance, and business KPIs.
Evidently AI serves as our ML-specific monitoring tool, focusing on what matters most to data scientists and model performance. It excels at detecting data drift, analyzing model performance shifts, and generating detailed reports about our fraud detection model’s behavior. This tool helps us answer crucial questions like “Is our model still performing as expected?” and “Have transaction patterns significantly changed?”
While each tool could handle some aspects of the others’ responsibilities, their combination provides a robust, specialized monitoring solution that covers all aspects of the fraud detection system – from model performance to system health and business impact.
The two critical aspects of the ML system are data drift and model performance, which we will analyze through Shapley Additive exPlanations (SHAP) analysis to ensure the fraud detection system remains reliable and interpretable.
Data Drift Detection
Our implementation focuses on meaningful drift detection, which is critical in fraud detection, where transaction patterns can shift rapidly. Using Evidently AI, we monitor drift across different transaction segments and visualize data drift, as shown in Figure 8. For demonstration purposes, we simulated real-world scenarios by manipulating transaction patterns to showcase how our system detects and responds to drift. We engineered our data to represent realistic drift scenarios in the following ways:
Transaction Pattern Analysis
We segmented transactions based on meaningful monetary thresholds. This method is particularly valuable for fraud detection as patterns often vary significantly across different transaction sizes.
# Business-defined thresholds
SMALL_TRANSACTION_THRESHOLD = 100 # Transactions <= $100
LARGE_TRANSACTION_THRESHOLD = 500 # Transactions > $500
Transaction Distribution Analysis
We analyze the full spectrum of transaction amounts using key statistical measures: min, max, mean, median, and the P25 and P75 quantiles.
Small transactions (≤ $100) often represent everyday consumer spending.
Large transactions (> $500) might indicate business or high-value transfers. This threshold can be set higher in real-world use cases. In our credit card dataset, enough transactions fall in the higher $500 range that a drift in transaction amounts could be demonstrated.
Figure 8: Transactions drift and distribution analysis
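A minimal sketch of how a segmented drift report like the one in Figure 8 could be generated, assuming an Evidently 0.4-style Report API and a pandas DataFrame df of transactions (names are illustrative):

from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# df: pandas DataFrame of transactions, loaded elsewhere (assumed)
reference = df[df["Amount"] <= SMALL_TRANSACTION_THRESHOLD]   # small transactions
current = df[df["Amount"] > LARGE_TRANSACTION_THRESHOLD]      # large transactions

# Compare the two segments and export an HTML drift report
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("transaction_drift_report.html")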
Feature Drift Analysis
In our fraud detection system, features V1-V28 are PCA-transformed components from the original transaction data. When we analyze drift in these features, we look for significant changes in underlying transaction patterns. A drift in these features could indicate:
- New fraud techniques emerging
- Changes in legitimate transaction patterns
- Shifts in consumer behavior
- Evolution of business payment methods
Drift across any of these features, as shown in Figure 9, may suggest changes in the underlying patterns of credit card transactions.
- Monitor all V1-V28 features for changes over time.
- Analyze drift at various time scales (day-over-day, week-over-week) to capture different patterns.
By carefully monitoring all V1-V28 features for drift, fraud detection systems can adapt to evolving transaction patterns and maintain their effectiveness in identifying potentially fraudulent activities.
Figure 9: Drift detection
The report in Figure 10 shows the overall model performance metrics, the ROC and precision-recall curves, in Evidently.
Figure 10: Model performance metrics
The combination of model performance evaluation metrics and drift monitoring provides a comprehensive view of our model’s health in production. Visualizing these metrics through interactive dashboards and reports enables technical teams and business stakeholders to understand and act on model performance insights.
SHAP Analysis for Model Interpretability
To complement our drift detection, we leverage SHAP values to understand how our model makes decisions across different transaction types. This analysis is crucial for both model validation and stakeholder trust.
To generate a SHAP analysis for the fraud detection model, we compare feature importance for small and large transactions. SHAP analysis can highlight how fraud patterns differ between small and large transactions. For example, certain features might indicate fraud in large transactions more than in small ones. A high frequency of small transactions can also be a sign of potentially fraudulent activity.
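A minimal sketch of this comparison, assuming model is the trained scikit-learn logistic regression and X_small / X_large are DataFrames of small and large transactions (the variable names are illustrative):

import shap

# model: trained LogisticRegression; X_small / X_large: segmented feature DataFrames (assumed)
explainer = shap.LinearExplainer(model, X_small)
shap_values_small = explainer.shap_values(X_small)
shap_values_large = explainer.shap_values(X_large)

# Summary plots comparable to Figure 12, one per transaction segment
shap.summary_plot(shap_values_small, X_small, show=False)
shap.summary_plot(shap_values_large, X_large, show=False)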
Feature Importance:
Figure 11: Feature Importance of Small and Large Transactions
The feature importance plot in Figure 11 compares small and large transactions and indicates:
- Several features show different levels of importance between small and large transactions, revealing how importance shifts with transaction size
- Some features maintain consistent importance across both transaction sizes; V4, V3, and V14 show impact in both cases
- The longer bars indicate a more substantial influence on the model’s predictions
- The orange bars (Current/large transactions) often show different patterns compared to the brown bars (Reference/small transactions)
- The model uses different feature combinations when evaluating small versus large transactions
Figure 12: SHAP summary plots for small and high transaction data
Let us understand how to interpret the summary plots in Figure 12. Each row represents a feature (V1-V28, Time, Amount). The X-axis shows the SHAP value (impact on model output). Each point represents a single transaction. Colors indicate feature values (red = high, blue = low).
How to Read Points
- Points to the right (positive SHAP values) increase fraud probability
- Points to the left (negative SHAP values) decrease fraud probability
- The spread of points shows the range of impact
- Clustering of points shows common patterns
Key Differences Between Plots
Small Transactions (Reference)
- Features V3, V4, and V14 show strong impacts
- The time feature shows a balanced distribution
- V5 has moderate importance
- Most features show concentrated clusters near zero
Large Transactions (Current)
- Time and V5 show increased importance
- The Amount feature becomes more significant
- V3 and V4 maintain a strong influence on model predictions
- A wider spread of points indicates more varied impacts
This visualization helps identify which features are most crucial for fraud detection at different transaction sizes and how their influence changes. Some features become more critical for large transactions, and the wider spread of points indicates more complex patterns in large transactions. Based on the drift detection and SHAP analysis, we know when the model’s performance changes and why these changes occur, enabling proactive maintenance of the ML system.
Challenges in Building ML Systems
Building and maintaining a production ML system for fraud detection presents several interconnected challenges:
While implementing the inference pipeline, security measures must be taken to protect API endpoints and sensitive transaction data. This adds another layer of complexity to the system’s maintenance. Read this additional information on securing APIs for ML and AI applications in cloud environments such as AWS and Google Cloud.
Managing complex distributed systems on Kubernetes with load balancers poses infrastructure and performance challenges, particularly in maintaining sub-second response times for millions of transactions and efficiently managing computing resources across containers. Some real-world case studies demonstrate how organizations have successfully leveraged Kubernetes to address complex infrastructure and performance challenges. For additional insight, read how to overcome Kubernetes challenges for easy deployment.
Monitoring presents unique difficulties in balancing granularity with system performance, particularly in setting appropriate drift detection thresholds and managing alert systems without causing alert fatigue. Read the Monitoring and Explainability of Models in Production paper to understand the importance of having ML-observable systems. You can also read more about detecting dataset shifts in the paper Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. Data drift doesn’t always equate to model degradation, because models can generalize or have varying error tolerances. Therefore, tune drift detection methods and thresholds to your specific use case and data; there is no one-size-fits-all setting for avoiding false alarms. For example, a classifier-based drift score measured by ROC AUC takes values from 0 to 1, where 1 means absolute drift and requires immediate action; you can set the alert threshold to 0.60.
Another example considers a cosine distance metric used to detect drift in text embeddings: a model predicting customer sentiment might only flag drift exceeding a cosine distance of 0.8, whereas a model classifying medical diagnoses based on textual data might require an alert at a cosine distance of 0.2. ML practitioners should experiment with tuning these alert thresholds based on the specific needs of the ML application. For more information on this topic, read drift in ML embeddings.
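As a small illustration of such threshold tuning, assuming reference_embeddings and current_embeddings are NumPy arrays of text embeddings (all names and the sample threshold are illustrative):

from scipy.spatial.distance import cosine

DRIFT_THRESHOLD = 0.8  # sentiment example above; a medical-diagnosis model might use 0.2

# reference_embeddings / current_embeddings: arrays of shape (n_samples, n_dims), assumed
drift_score = cosine(reference_embeddings.mean(axis=0), current_embeddings.mean(axis=0))
if drift_score > DRIFT_THRESHOLD:
    print(f"Embedding drift detected: cosine distance {drift_score:.2f}")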
Operational maintenance requires careful coordination of model updates, system deployments, and documentation while managing technical debt in ML pipelines. All of these require continuous attention to maintain a robust system. The paper “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” outlines 28 specific tests and monitoring needs derived from real-world production ML systems, offering a roadmap to improve production readiness and reduce technical debt. Another paper, Hidden Technical Debt in Machine Learning Systems, uses the concept of technical debt to highlight the hidden maintenance costs in real-world ML systems, exploring ML-specific risk factors and system-level anti-patterns to consider during system design.