Literature Review: Causal Inference as an MLOps Concern

OffComputer by Pablo — Sun, 07 Jun 2026 14:34:37 GMT

This review is based on Ronir Raggio Luiz and Claudio José Struchiner’s book Causal Inference in Epidemiology: The Potential Outcomes Model, published by Editora Fiocruz in 2002 and available through SciELO Books.


Introduction

Causal inference is often introduced as a statistical topic, but it is also highly relevant to software engineering and MLOps. Modern machine learning systems are not only used to predict what is likely to happen; they are increasingly used to decide what action should be taken. Examples include deciding which recommendation to show, which user experience to deploy, which alert to trigger, which customer intervention to apply, or which operational change to make in a production system.

Traditional machine learning focuses mainly on prediction. It asks questions such as: “Given the available data, what outcome is most likely?” Causal inference asks a different question: “What would happen if we changed something?” This difference matters because a model can be highly accurate at prediction while still being unreliable for decision-making. A predictive model may learn patterns from historical data, but those patterns do not automatically tell us whether changing one variable will cause a change in another.

From an MLOps perspective, causal inference can be understood as a discipline for making machine learning systems safer, more interpretable, and more useful for decisions. It provides a framework for thinking about interventions, experiments, observational data, bias, monitoring, and assumptions.

From Prediction to Intervention

Most machine learning pipelines are designed around prediction tasks. Data is collected, features are engineered, models are trained, and predictions are served. This workflow works well when the goal is to estimate an unknown label or forecast a future event.

However, many production systems go further than prediction. They trigger actions. For example, a recommendation system does not only predict what a user may like; it changes what the user sees. A pricing model does not only estimate demand; it may change the price shown to customers. A fraud detection model does not only estimate risk; it may block a transaction or request verification.

This creates a causal problem. Once a system acts on the world, the data it observes later is affected by its own previous decisions. The system becomes part of the data-generating process. In this setting, engineers need to distinguish between correlation and causation. A correlation may be useful for prediction, but it may fail when used to guide interventions.

A causal approach encourages engineers to define the action being evaluated, the alternative action, the expected outcome, and the assumptions needed to compare them fairly.
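One way to make this habit concrete is to write the question down as a small, validated record before any analysis starts. The sketch below is only an illustration: the class name `CausalQuestion`, its fields, and the example values (including the "7-day engagement" outcome) are hypothetical and do not come from the source book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalQuestion:
    """A written-down intervention question (all names here are illustrative)."""
    action: str            # the action being evaluated
    alternative: str       # the action it is compared against
    outcome: str           # the outcome expected to change
    assumptions: tuple = ()  # assumptions needed for a fair comparison

    def __post_init__(self):
        # Refuse vague questions: every part must be stated explicitly.
        for name in ("action", "alternative", "outcome"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be stated explicitly")
        if self.action == self.alternative:
            raise ValueError("action and alternative must differ")

    def as_sentence(self) -> str:
        return (f"What is the effect on {self.outcome} of {self.action}, "
                f"compared with {self.alternative}?")


# Hypothetical example values.
question = CausalQuestion(
    action="showing the new ranking",
    alternative="keeping the old ranking",
    outcome="7-day engagement",
    assumptions=("users are assigned at random",
                 "no interference between users"),
)
print(question.as_sentence())
```

Keeping the assumptions next to the question makes them visible in review, rather than leaving them implicit in the analysis code.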

Counterfactual Thinking in Software Systems

A central idea in causal inference is counterfactual reasoning. In simple terms, this means asking what would have happened under a different decision.

For software engineers, this idea appears naturally in product experimentation and MLOps. Suppose a team deploys a new ranking algorithm and observes that user engagement increases. The causal question is not simply whether engagement increased. The causal question is whether engagement increased because of the new ranking algorithm, compared with what would have happened if the old algorithm had remained in place.

The challenge is that we cannot observe both realities for the same user at the same time. A user either saw the old ranking or the new ranking. A transaction was either blocked or allowed. A customer either received an intervention or did not. The unobserved alternative is the counterfactual.

This is the core difficulty of causal inference. Since the alternative outcome is missing, causal conclusions require careful design, assumptions, or both.
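The missing-outcome problem can be illustrated with a toy simulation. Only in a simulation can both potential outcomes be written down for the same user; a production log would contain just one of them. The names `Unit`, `observe`, `y_old`, and `y_new`, and all the numbers, are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One user with both potential outcomes (knowable only in a simulation)."""
    user_id: str
    y_old: float  # engagement if shown the old ranking
    y_new: float  # engagement if shown the new ranking


def observe(unit: Unit, got_new: bool) -> dict:
    """Return what a production log could contain: one outcome, the other missing."""
    return {
        "user_id": unit.user_id,
        "treated": got_new,
        "observed": unit.y_new if got_new else unit.y_old,
        "counterfactual": None,  # never logged, because it never happened
    }


# Made-up users; the individual effects are +2, 0, and -1.
units = [Unit("u1", 3.0, 5.0), Unit("u2", 4.0, 4.0), Unit("u3", 6.0, 5.0)]

# The "true" effects are visible only because we invented both outcomes.
true_effects = [u.y_new - u.y_old for u in units]
true_average_effect = sum(true_effects) / len(true_effects)

# What the system actually records after deciding who sees the new ranking.
logs = [observe(u, got_new) for u, got_new in zip(units, [True, False, True])]
```

Every row in `logs` has exactly one observed outcome. Any estimate of `true_average_effect` from such logs has to fill in the missing half through design (for example, randomization) or through assumptions.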

Randomized Experiments as the Engineering Gold Standard

In software systems, randomized experiments are often implemented as A/B tests. They are powerful because they create comparable groups by design. If users are randomly assigned to version A or version B, the two groups are similar on average in both observed and unobserved characteristics, so a difference in outcomes beyond sampling noise can be attributed to the product change rather than to pre-existing differences between users.

From an MLOps perspective, randomized experiments are valuable because they provide a clean assignment mechanism. The system knows why a user received one version rather than another: the assignment was random. This reduces the risk that hidden factors are responsible for the observed difference.
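A minimal sketch of such an assignment mechanism follows. It assumes deterministic hashing of the user and experiment identifiers as a stand-in for a random draw, which is a common engineering pattern but not something the source prescribes. The function names `assign_variant` and `difference_in_means` and the `treat_share` parameter are hypothetical.

```python
import hashlib
from statistics import mean


def assign_variant(user_id: str, experiment_id: str, treat_share: float = 0.5) -> str:
    """Map a user to variant A or B, reproducibly for a given experiment."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treat_share else "A"


def difference_in_means(rows) -> float:
    """rows: iterable of (variant, outcome). Returns mean(B) - mean(A)."""
    rows = list(rows)
    a = [y for v, y in rows if v == "A"]
    b = [y for v, y in rows if v == "B"]
    return mean(b) - mean(a)


# The same user always lands in the same arm of the same experiment.
variant = assign_variant("user-123", "ranking-test")
```

Because the assignment depends only on the identifiers and the experiment, it can be recomputed later from the logs, which is what makes the reason for each exposure auditable. The difference in means is only a point estimate; a real analysis would also report uncertainty.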

However, randomized experiments are not always easy or appropriate. They may be expensive, slow, risky, or ethically questionable. Some interventions cannot be randomized in production. Other times, randomization may interfere with user experience, business rules, legal constraints, or platform stability.

Therefore, while A/B testing is a strong tool, causal inference cannot rely only on randomized experiments. MLOps teams also need methods for reasoning with observational data.

Observational Data and the Problem of Confounding

Most production data is observational. It is generated by systems, users, business rules, previous models, and operational constraints. In observational data, actions are usually not assigned randomly.

This creates confounding. Confounding occurs when the group that received an action differs from the group that did not receive it in ways that also affect the outcome.

For example, suppose a platform sends retention emails to users who appear likely to churn. Later, the team observes that users who received emails churn more often than users who did not. A naive analysis might conclude that emails increase churn. But this may be wrong: the email group was already at higher risk before the email was sent.

In software systems, confounding can come from many sources: user behavior, geography, device type, account age, traffic source, prior model scores, manual business rules, seasonality, or platform constraints. A causal analysis must account for these differences before interpreting an observed association as an effect.

This is one reason causal inference should be integrated into MLOps metadata and data lineage. Engineers need to know not only what happened, but also why an action was assigned.
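The retention email example can be reproduced with a small synthetic dataset. All counts below are invented for illustration: emails go mostly to high-risk users, and within each risk group the email lowers churn. The helper names are hypothetical, and adjusting for a single observed risk label is a simplification that assumes no other confounders.

```python
from collections import defaultdict


def make_rows(risk, emailed, n, churned):
    """Build n rows of (risk, emailed, churned_flag) with the given churn count."""
    return [(risk, emailed, 1)] * churned + [(risk, emailed, 0)] * (n - churned)


def churn_rate(rows):
    return sum(flag for _, _, flag in rows) / len(rows)


def naive_difference(rows):
    """Churn rate of emailed users minus churn rate of everyone else."""
    emailed = [r for r in rows if r[1]]
    not_emailed = [r for r in rows if not r[1]]
    return churn_rate(emailed) - churn_rate(not_emailed)


def stratified_difference(rows):
    """Compare within each risk group, then average by group size."""
    strata = defaultdict(list)
    for r in rows:
        strata[r[0]].append(r)
    estimate = 0.0
    for group in strata.values():
        emailed = [r for r in group if r[1]]
        not_emailed = [r for r in group if not r[1]]
        weight = len(group) / len(rows)
        estimate += weight * (churn_rate(emailed) - churn_rate(not_emailed))
    return estimate


# Synthetic data: high-risk users are far more likely to be emailed.
rows = (
    make_rows("high", True, 800, 400)     # 50% churn with email
    + make_rows("high", False, 200, 120)  # 60% churn without
    + make_rows("low", True, 100, 5)      # 5% churn with email
    + make_rows("low", False, 900, 90)    # 10% churn without
)

naive = naive_difference(rows)          # about +0.26: email "increases" churn
adjusted = stratified_difference(rows)  # -0.075: email reduces churn
```

The sign flips once users are compared within the same risk group. This is the pattern a naive dashboard would miss, and it is recoverable here only because the reason for targeting (the risk label) was recorded.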

Treatment Assignment as a First-Class MLOps Component

One of the most useful ideas for software engineering is to treat assignment logic as part of the system architecture.

In causal inference, the assignment mechanism describes how a unit receives an action, treatment, exposure, or intervention. In software terms, this could mean how users are assigned to an experiment group, how recommendations are selected, how customers are targeted for a campaign, how alerts are triggered, how transactions are flagged, or how models choose automated actions.

For causal analysis, this assignment process is critical. If engineers do not record how and why an action was assigned, later analysis may be unreliable.

This suggests an important MLOps principle: decision logs should include enough information to reconstruct the assignment process. This may include model version, policy version, feature values at decision time, eligibility rules, randomization seed, experiment ID, fallback logic, and manual overrides.

Without this information, causal evaluation becomes much harder. The team may know what action occurred and what outcome followed, but not whether the comparison group is valid.
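As a rough sketch, a decision log entry covering the items listed above could look like the following. The class name, field names, and example values are all hypothetical. The `assignment_probability` field is an extra assumption that the text does not list; it is included because weighting-based analyses need it.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    """One logged decision with enough context to reconstruct its assignment."""
    decision_id: str
    unit_id: str
    decided_at: str                  # ISO 8601 timestamp, UTC
    action: str
    model_version: str
    policy_version: str
    features_at_decision: dict       # feature values as known at decision time
    eligibility_rule: str
    eligible: bool
    experiment_id: Optional[str] = None
    randomization_seed: Optional[int] = None
    assignment_probability: Optional[float] = None  # assumption: not in the source list
    used_fallback: bool = False
    manual_override: Optional[str] = None


def to_log_line(record: DecisionRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
record = DecisionRecord(
    decision_id="d-0001",
    unit_id="user-123",
    decided_at="2026-01-15T09:30:00Z",
    action="send_retention_email",
    model_version="churn-model-1.4.2",
    policy_version="policy-7",
    features_at_decision={"account_age_days": 412, "churn_score": 0.81},
    eligibility_rule="opted_in_to_email",
    eligible=True,
    experiment_id="retention-exp-3",
    randomization_seed=42,
    assignment_probability=0.5,
)
line = to_log_line(record)
```

Writing the record at decision time, rather than reconstructing it later from current tables, avoids mixing in information that only became available after the action.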

Propensity and Balancing in Observational Systems

When randomization is unavailable, one practical approach is to make treated and untreated groups more comparable using observed data. In software systems, this often means comparing users, transactions, or entities with similar characteristics before the action occurred.

The general idea is simple: if two users looked similar before an intervention, but one received the intervention and the other did not, then comparing their later outcomes may be more informative than comparing all treated users against all untreated users.

This type of approach is related to propensity-based methods. A propensity score is the estimated probability that a unit receives a given action, given its observed pre-action features. Engineers can then use this score to match, group, or weight units so that comparisons are made between units that were similarly likely to receive the action.
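A minimal sketch of one propensity-based adjustment, inverse propensity weighting, is shown below. It makes several assumptions: there is a single discrete pre-action feature (an account-age bucket), the propensity is estimated as the share treated within each bucket, and no unobserved confounders exist. The data are invented, and the function names are hypothetical.

```python
from collections import defaultdict


def make_rows(stratum, treated, n, successes):
    """Build n rows of (stratum, treated, outcome) with the given success count."""
    return [(stratum, treated, 1)] * successes + [(stratum, treated, 0)] * (n - successes)


def estimate_propensity(rows):
    """Estimate P(treated | stratum) as the share of treated units per stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [treated count, total count]
    for stratum, treated, _ in rows:
        counts[stratum][0] += int(treated)
        counts[stratum][1] += 1
    return {s: t / n for s, (t, n) in counts.items()}


def ipw_effect(rows, propensity):
    """Inverse-propensity-weighted difference in mean outcomes."""
    treated_sum = 0.0
    control_sum = 0.0
    for stratum, treated, outcome in rows:
        e = propensity[stratum]
        if not 0.0 < e < 1.0:
            # Without overlap, one arm has no comparable units in this stratum.
            raise ValueError(f"no overlap in stratum {stratum!r}")
        if treated:
            treated_sum += outcome / e
        else:
            control_sum += outcome / (1.0 - e)
    return (treated_sum - control_sum) / len(rows)


# Synthetic data: a discount is offered mostly to new accounts, which buy less.
# Raw purchase rates are 0.375 (discount) vs 0.400 (no discount).
rows = (
    make_rows("new", True, 600, 180)     # 30% purchase with discount
    + make_rows("new", False, 400, 80)   # 20% without
    + make_rows("old", True, 200, 120)   # 60% with discount
    + make_rows("old", False, 800, 400)  # 50% without
)

propensity = estimate_propensity(rows)  # {"new": 0.6, "old": 0.2}
effect = ipw_effect(rows, propensity)   # about +0.10
```

The raw comparison suggests the discount slightly hurts purchases, while the weighted estimate recovers the +0.10 effect built into both buckets. With many or continuous features, the propensity would typically come from a fitted model instead of bucket frequencies.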

For MLOps, the important lesson is not the mathematical detail. The important lesson is that causal analysis requires pre-action features. Features collected after the action may already be affected by the action and can create misleading conclusions.

Therefore, feature stores and event logs should preserve time-aware data. Engineers need to know what was known before the decision, not only what is known now.
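A point-in-time lookup is one simple way to enforce this. The sketch below keeps a timestamped history for a single feature and returns the latest value recorded at or before the decision time. The class name `FeatureHistory` and its methods are hypothetical, and treating a value recorded exactly at decision time as "known" is a design choice that a real system might make differently.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class FeatureHistory:
    """Timestamped values of one feature for one unit, kept in time order."""

    def __init__(self):
        self._times = []
        self._values = []

    def record(self, observed_at: datetime, value) -> None:
        """Insert a value while keeping the history sorted by time."""
        idx = bisect_right(self._times, observed_at)
        self._times.insert(idx, observed_at)
        self._values.insert(idx, value)

    def as_of(self, decision_time: datetime):
        """Latest value recorded at or before decision_time, else None."""
        idx = bisect_right(self._times, decision_time)
        return self._values[idx - 1] if idx else None


def utc(day: int) -> datetime:
    # Illustrative helper: a day in January 2026, UTC.
    return datetime(2026, 1, day, tzinfo=timezone.utc)


sessions = FeatureHistory()
sessions.record(utc(5), 3)
sessions.record(utc(20), 9)   # recorded after the decision below

known_at_decision = sessions.as_of(utc(10))  # 3, not the later value 9
```

Using the current value (9) in an analysis of a decision made on day 10 would leak post-decision information into what is supposed to be a pre-action feature.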

Validity, Bias, and Monitoring

Causal inference depends heavily on validity. In software engineering terms, validity means that the evaluation actually measures the effect it claims to measure.

Several threats to validity are common in MLOps environments.

Selection bias occurs when the analyzed population is not representative of the population where the decision will be applied. For example, evaluating a model only on users who completed a workflow may ignore users who dropped out earlier.

Measurement bias occurs when the outcome or features are recorded inaccurately. For example, a click may not always represent satisfaction, and a logged conversion may depend on tracking quality.

Specification problems occur when the analysis model does not represent the real decision process well enough. For example, the analysis may ignore eligibility rules, delayed outcomes, or interaction between multiple running experiments.

Interference occurs when one unit’s treatment affects another unit’s outcome. This is common in networked software systems. For example, changing recommendations for one user may affect content popularity, which then affects recommendations for other users.

These issues show that causal inference is not only a modeling problem. It is also a systems problem. Good causal analysis requires reliable logging, stable identifiers, versioned interventions, clear exposure definitions, and monitoring of assumptions.

Causal Inference and the MLOps Lifecycle

Causal inference can be mapped onto each stage of the MLOps lifecycle.

During problem framing, teams should define the decision, the alternative, and the outcome of interest. The question should be written as an intervention question, not only as a prediction question.

During data collection, teams should capture decision-time context. This includes features, assignment rules, model versions, experiment identifiers, and eligibility criteria.

During model development, teams should separate predictive goals from causal goals. A model that predicts an outcome well is not automatically a model that estimates the effect of an action.

During deployment, teams should consider whether the system allows randomized evaluation, phased rollout, or other designs that support causal learning.

During monitoring, teams should track not only model performance but also policy effects. A model may remain accurate while the effect of its recommended action changes over time.

During governance, teams should document assumptions. Causal conclusions often depend on assumptions that cannot be fully tested from the data. Making these assumptions explicit improves review, auditability, and responsible deployment.
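For the monitoring step, tracking a policy effect over time could be sketched as follows. The sketch assumes that a small randomly assigned holdout group continues to receive the alternative action, which the text does not require and which may not be feasible everywhere. The function names, the week labels, and the outcomes are invented.

```python
from collections import defaultdict
from statistics import mean


def effect_by_window(records):
    """records: iterable of (window, in_holdout, outcome).

    Returns {window: mean(policy outcomes) - mean(holdout outcomes)},
    skipping windows where either arm is empty.
    """
    grouped = defaultdict(lambda: {"policy": [], "holdout": []})
    for window, in_holdout, outcome in records:
        arm = "holdout" if in_holdout else "policy"
        grouped[window][arm].append(outcome)
    effects = {}
    for window in sorted(grouped):
        arms = grouped[window]
        if arms["policy"] and arms["holdout"]:
            effects[window] = mean(arms["policy"]) - mean(arms["holdout"])
    return effects


def windows_below(effects, minimum_effect):
    """Windows whose estimated effect fell below a chosen threshold."""
    return [w for w, e in effects.items() if e < minimum_effect]


# Made-up outcomes (1.0 = converted) for two weekly windows.
records = (
    [("2026-W01", False, y) for y in (1.0, 1.0, 0.0, 1.0)]
    + [("2026-W01", True, y) for y in (0.0, 1.0, 0.0, 0.0)]
    + [("2026-W02", False, y) for y in (1.0, 0.0, 0.0, 0.0)]
    + [("2026-W02", True, y) for y in (0.0, 1.0, 0.0, 0.0)]
)

effects = effect_by_window(records)      # {"2026-W01": 0.5, "2026-W02": 0.0}
alerts = windows_below(effects, 0.1)     # ["2026-W02"]
```

A predictive accuracy metric could stay flat across both weeks while the effect disappears, which is why the two are monitored separately. With samples this small the estimates are noisy; a real monitor would add confidence intervals before alerting.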

Practical Implications for Software Engineers

For software engineers, causal inference encourages a shift in mindset.

Instead of asking only, “Can we predict this outcome?” teams should also ask, “What action are we evaluating?” and “Compared with what alternative?”

Instead of logging only predictions and outcomes, systems should log decisions, assignment rules, model versions, and the state of relevant features at decision time.

Instead of assuming that historical data directly answers intervention questions, teams should examine how the data was generated.

Instead of treating A/B testing as separate from machine learning, experimentation should be seen as part of the MLOps feedback loop.

Instead of relying only on dashboards of associations, teams should build evaluation workflows that distinguish prediction quality from decision impact.

Conclusion

Causal inference provides a useful foundation for building machine learning systems that support decisions, not just predictions. Its central contribution is the discipline of comparing what happened with what would have happened under an alternative action.

For MLOps, this means causal thinking should be embedded in system design. Assignment mechanisms, decision logs, feature timing, experiment infrastructure, monitoring, and governance all matter. Without these components, teams may confuse correlation with causation and deploy systems that optimize misleading signals.

A software-engineering view of causal inference does not require starting with complex mathematics. It begins with clear questions, careful logging, explicit assumptions, and disciplined comparison. In this sense, causal inference is not only a statistical framework. It is also an engineering practice for building more reliable decision-making systems.

This literature review was fully generated with the assistance of artificial intelligence. The AI system was used to interpret the source material, translate its main ideas into a general software engineering and MLOps context, organize the discussion, and draft the final text. As a result, the review should be read as an AI-generated synthesis rather than as a human-authored academic analysis.
