Technical | 04/05/2023

Jacob Zweig
Co-Founder, Principal Consultant

## Introduction

Marketing attribution is a critical challenge faced by businesses seeking to understand the effectiveness of their marketing efforts. It involves determining the contributions of different marketing touchpoints (e.g., online ads, social media posts, email campaigns) to a desired customer outcome, such as a purchase or a sign-up. Multi-touch attribution (MTA) models are an advanced class of marketing attribution models that attribute credit to multiple touchpoints in the customer journey, instead of assigning full credit to just one touchpoint.

In this blog post, we will explore various approaches to multi-touch attribution, starting with traditional heuristic models (e.g., last touch, first touch, linear) and progressing to more sophisticated data-driven models based on deep learning architectures (e.g., long short-term memory (LSTM), transformers). We will also discuss the mathematical concepts behind these models and highlight their applications in optimizing marketing strategies and analyzing customer journeys.

## Traditional Attribution Models: Heuristic Approaches

Last touch attribution assigns 100% of the credit for a conversion to the last touchpoint encountered by the customer before the conversion event. The formula for this model can be represented as:

$$\text{Credit}(\text{Touchpoint}_i) = \begin{cases} 1 & \text{if } \text{Touchpoint}_i \text{ is the last touchpoint} \\ 0 & \text{otherwise} \end{cases}$$

While simple to implement, last touch attribution fails to account for the impact of previous touchpoints in the customer journey, potentially leading to biased and incomplete insights.

First touch attribution, as the name suggests, attributes all the credit to the first touchpoint in the customer journey. The formula for this model is similar to that of the last touch model, but with credit assigned to the first touchpoint:

$$\text{Credit}(\text{Touchpoint}_i) = \begin{cases} 1 & \text{if } \text{Touchpoint}_i \text{ is the first touchpoint} \\ 0 & \text{otherwise} \end{cases}$$

This model also has its limitations, as it ignores the influence of subsequent touchpoints on the conversion event.

Linear attribution distributes credit equally among all touchpoints in the customer journey. The formula for linear attribution is given by:

$$\text{Credit}(\text{Touchpoint}_i) = \frac{1}{N}$$

where $$N$$ is the total number of touchpoints in the journey. While the linear model accounts for all touchpoints, it assumes that each touchpoint has an equal impact, which may not always be the case.
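The three heuristic rules above can be written in a few lines. The following is a minimal sketch, assuming a journey is an ordered list of touchpoint labels; the function name `heuristic_credit` and the channel names are illustrative and not part of any library.

```python
def heuristic_credit(journey, model="linear"):
    """Return one credit value per touchpoint position; credits sum to 1."""
    n = len(journey)
    if n == 0:
        return []
    if model == "last_touch":
        # All credit goes to the final touchpoint before conversion.
        return [0.0] * (n - 1) + [1.0]
    if model == "first_touch":
        # All credit goes to the first touchpoint in the journey.
        return [1.0] + [0.0] * (n - 1)
    if model == "linear":
        # Every touchpoint receives 1/N.
        return [1.0 / n] * n
    raise ValueError(f"unknown model: {model}")


# Hypothetical journey used only for illustration.
journey = ["social_ad", "email", "search_ad", "email"]
for name in ("last_touch", "first_touch", "linear"):
    print(name, heuristic_credit(journey, name))
```

Credits are returned per position rather than per channel, so a channel that appears twice in a journey (here `email`) accumulates credit from both positions when results are aggregated.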

## Modern Attribution Models: Deep Learning Approaches

Traditional heuristic models, while straightforward, often lack the ability to capture complex interactions between touchpoints. In contrast, deep learning-based models can leverage historical data to learn patterns and attribute credit more accurately.

### 1. Long Short-Term Memory (LSTM) Networks

Long short-term memory (LSTM) networks are a type of recurrent neural network (RNN) that excel at processing sequences of data, making them suitable for modeling customer journeys. An LSTM-based attribution model takes the sequence of touchpoints as input and predicts the likelihood of conversion. Credit is then attributed to touchpoints based on their contribution to the predicted outcome.

The architecture of an LSTM network includes input, hidden, and output layers, as well as LSTM cells with forget, input, and output gates. The mathematical formulation of an LSTM cell is as follows:

\begin{align*} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ h_t &= o_t \odot \tanh(C_t) \end{align*}

Here $$f_t$$, $$i_t$$, and $$o_t$$ are the forget, input, and output gate activations, respectively; $$\tilde{C}_t$$ is the candidate cell state; $$C_t$$ is the cell state; $$h_t$$ is the hidden state; $$\sigma$$ is the sigmoid activation function; and $$\odot$$ denotes element-wise multiplication. $$W_f$$, $$W_i$$, $$W_C$$, and $$W_o$$ are the weight matrices, and $$b_f$$, $$b_i$$, $$b_C$$, and $$b_o$$ are the bias vectors, of the three gates and the candidate update. The input $$x_t$$ represents the feature vector for touchpoint $$t$$ in the customer journey, $$h_{t-1}$$ is the hidden state from the previous time step, and $$[h_{t-1}, x_t]$$ is their concatenation.
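To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step applied along a short journey. The helper names (`lstm_step`, `init_params`), the random initialization, and the one-hot channel encoding are assumptions for illustration; a production model would normally use a deep learning framework's LSTM layer and learned weights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                     # element-wise products
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases (illustrative, untrained)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params


input_dim, hidden_dim = 3, 4
params = init_params(input_dim, hidden_dim)

# Hypothetical journey: four touchpoints, each a one-hot vector over three channels.
journey = np.eye(input_dim)[[0, 1, 2, 1]]

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in journey:
    h, c = lstm_step(x_t, h, c, params)

print(h.shape)  # final hidden state, which a conversion classifier would consume
```

The final hidden state `h` summarizes the whole journey; a conversion model would typically pass it through a linear layer and a sigmoid to obtain the predicted conversion likelihood.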

Credit attribution can be performed by calculating the gradient of the predicted conversion likelihood with respect to each touchpoint's input features, allowing the model to quantify each touchpoint's influence on the conversion.
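The gradient-based attribution described above can be sketched as follows. Several pieces here are assumptions made to keep the example self-contained: `toy_conversion_model` is a stand-in tanh RNN with random weights (not a trained LSTM), gradients are approximated with central finite differences rather than backpropagation, and each touchpoint's score is taken to be the L2 norm of its gradient, normalized so that the scores sum to 1. Other scoring choices, such as gradient times input, are equally plausible.

```python
import numpy as np


def toy_conversion_model(X, W_h, W_x, w_out):
    """Stand-in sequence model: tanh RNN followed by a sigmoid output."""
    h = np.zeros(W_h.shape[0])
    for x_t in X:
        h = np.tanh(W_h @ h + W_x @ x_t)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))


def gradient_attribution(predict, X, eps=1e-5):
    """Normalized per-touchpoint gradient magnitude via central differences."""
    X = np.asarray(X, dtype=float)
    grads = np.zeros_like(X)
    for t in range(X.shape[0]):
        for j in range(X.shape[1]):
            X_plus, X_minus = X.copy(), X.copy()
            X_plus[t, j] += eps
            X_minus[t, j] -= eps
            grads[t, j] = (predict(X_plus) - predict(X_minus)) / (2 * eps)
    scores = np.linalg.norm(grads, axis=1)  # one score per touchpoint
    total = scores.sum()
    return scores / total if total > 0 else scores


rng = np.random.default_rng(0)
W_h = rng.normal(0.0, 0.5, size=(4, 4))
W_x = rng.normal(0.0, 0.5, size=(4, 3))
w_out = rng.normal(0.0, 0.5, size=4)

# Hypothetical journey: three touchpoints, one-hot over three channels.
X = np.eye(3)[[0, 1, 2]]
credits = gradient_attribution(lambda X_: toy_conversion_model(X_, W_h, W_x, w_out), X)
print(credits)
```

With a trained model in an automatic differentiation framework, the double loop would be replaced by a single backward pass, but the normalization step stays the same.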

### 2. Transformers

The transformer architecture, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), has gained popularity for its ability to model long-range dependencies in sequential data. It has been applied to MTA to capture interactions between touchpoints that may be far apart in the customer journey.

The core component of the transformer is the self-attention mechanism, which allows each touchpoint to attend to all other touchpoints in the customer journey. The self-attention mechanism is mathematically formulated as follows:

\begin{align*} Q &= X \cdot W_Q \\ K &= X \cdot W_K \\ V &= X \cdot W_V \\ A &= \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \\ Z &= A \cdot V \end{align*}

Here $$X$$ is the matrix whose rows are the feature vectors of the touchpoints in the input sequence; $$W_Q$$, $$W_K$$, and $$W_V$$ are the weight matrices for the query, key, and value projections, respectively; $$Q$$, $$K$$, and $$V$$ are the projected query, key, and value matrices; $$A$$ is the attention matrix, whose entry $$A_{ij}$$ is the weight that touchpoint $$i$$ places on touchpoint $$j$$; $$Z$$ is the output; and $$d_k$$ is the dimension of the key vectors.
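The self-attention equations translate almost line for line into NumPy. This is a minimal single-head sketch with random, untrained projection matrices; the dimensions and the function name `self_attention` are chosen only for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # row i: weights touchpoint i puts on each j
    Z = A @ V
    return Z, A


rng = np.random.default_rng(0)
n_touchpoints, d_model, d_k = 4, 6, 3
X = rng.normal(size=(n_touchpoints, d_model))  # hypothetical touchpoint features
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, A = self_attention(X, W_Q, W_K, W_V)
print(A.shape, Z.shape)
```

Each row of `A` sums to 1, so it can be read as how strongly one touchpoint attends to every other touchpoint in the journey.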

Similar to the LSTM-based model, credit attribution can be performed by calculating the gradient of the predicted conversion likelihood with respect to each touchpoint's input features.

### 3. Temporal Convolutional Networks

Temporal Convolutional Networks (TCNs) are a type of neural network designed to handle sequence data. Unlike recurrent neural networks (RNNs), which process a sequence one step at a time, TCNs employ convolutional layers to capture both local and long-range dependencies in a sequence. The use of dilated causal convolutions ensures that the model respects the temporal ordering of the data and can capture dependencies at different time scales. Because they handle sequential data effectively, TCNs have been employed for multi-touch attribution (MTA).

The architecture of a TCN typically consists of multiple layers of dilated causal convolutions, each followed by a non-linear activation function (e.g., ReLU). Additionally, TCNs incorporate residual connections that facilitate the training of deep networks by allowing gradients to flow more easily through the network.

A dilated causal convolution applies its filter to input values spaced a fixed number of steps apart (the dilation rate) and is causal, meaning that the output at time step $$t$$ depends only on the input values up to and including time step $$t$$. The dilation rate is typically increased from layer to layer, often exponentially (1, 2, 4, ...), so that the receptive field grows quickly with depth and the model can capture patterns at different time scales.
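A single-channel dilated causal convolution can be sketched directly from this definition. The function below is a plain, unoptimized illustration (the name `dilated_causal_conv1d` and the zero-padding of positions before the start of the sequence are assumptions); real TCN layers operate on many channels and are implemented with framework convolution primitives.

```python
import numpy as np


def dilated_causal_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x[t] = 0 for t < 0."""
    T = len(x)
    y = np.zeros(T)
    for t in range(T):
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # never look at future inputs or before the sequence start
                y[t] += w * x[idx]
    return y


x = np.arange(1.0, 9.0)        # hypothetical signal: 1, 2, ..., 8
kernel = np.array([1.0, 1.0])  # sums the current value and the value `dilation` steps back
y = dilated_causal_conv1d(x, kernel, dilation=2)
print(y)  # [ 1.  2.  4.  6.  8. 10. 12. 14.]

# Causality check: changing x[5] must not change any output before t = 5.
x_changed = x.copy()
x_changed[5] = 100.0
y_changed = dilated_causal_conv1d(x_changed, kernel, dilation=2)
print(np.array_equal(y[:5], y_changed[:5]))  # True
```

Stacking $$L$$ such layers with kernel size $$k$$ and dilation rates $$1, 2, 4, \ldots, 2^{L-1}$$ gives a receptive field of $$1 + (k - 1)(2^L - 1)$$ time steps, which is why a few layers suffice to cover long journeys.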

Given a sequence of touchpoints in a customer journey, a TCN-based attribution model takes the sequence as input and predicts the likelihood of a conversion. Similar to LSTM and transformer-based models, credit attribution can be performed by calculating the gradient of the predicted conversion likelihood with respect to each touchpoint's input features. This allows the model to quantify the contribution of each touchpoint to the predicted outcome.

The gradients can be calculated using backpropagation, and the obtained attribution scores can be used to understand the effectiveness of various touchpoints and marketing channels.

A key benefit of TCNs is that the convolutions process all positions of the input sequence in parallel rather than step by step, which typically leads to faster training and inference than recurrent models. Because the same filters slide over sequences of any length, TCNs also handle sequences of varying lengths without architectural changes, making them well-suited for modeling customer journeys, which vary in length across customers.

## Applications of Multi-Touch Attribution

MTA models, especially those based on deep learning, provide valuable insights for businesses, including:

1. Budget Optimization: By quantifying the contribution of different marketing channels, MTA allows businesses to allocate their marketing budget more efficiently, maximizing return on investment.

2. Customer Journey Analysis: MTA provides a granular view of the customer journey, revealing the effectiveness of specific touchpoints in driving conversions. This information helps businesses improve customer experience and tailor marketing strategies.

3. Personalization: By understanding the impact of touchpoints on individual customers, MTA enables businesses to create personalized marketing campaigns, enhancing customer engagement and loyalty.

## Example: Budget Optimization Using Multi-Touch Attribution

Budget optimization is a critical aspect of marketing strategy, as it involves allocating marketing resources to achieve the maximum return on investment (ROI). Multi-touch attribution (MTA) models provide a means to accurately attribute credit to different marketing touchpoints, enabling businesses to optimize their marketing budget allocation. In this section, we will discuss the technical details of how MTA results can be used for budget optimization.

MTA models generate attribution weights for each marketing touchpoint, representing the relative contribution of that touchpoint to a desired customer outcome, such as a conversion or purchase. These weights quantify the effectiveness of different touchpoints and channels in driving the desired outcome. For example, an MTA model may output the following attribution weights for a customer journey with three touchpoints:

Touchpoint A (Social Media Ad): 0.3

Touchpoint B (Email Campaign): 0.5

Touchpoint C (Search Ad): 0.2

The weights indicate that the email campaign (Touchpoint B) had the highest contribution to the conversion, followed by the social media ad (Touchpoint A) and the search ad (Touchpoint C).
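Budget decisions are made per channel, so journey-level weights are usually rolled up across many converting journeys. The sketch below shows one simple way to do that, assuming each journey is a list of `(channel, weight)` pairs whose weights sum to 1; the function name `channel_credit` and the numbers are hypothetical.

```python
from collections import defaultdict


def channel_credit(journeys):
    """Sum attribution weights by channel and normalize to shares of total credit."""
    totals = defaultdict(float)
    for journey in journeys:
        for channel, weight in journey:
            totals[channel] += weight
    grand_total = sum(totals.values())
    return {channel: weight / grand_total for channel, weight in totals.items()}


# Two hypothetical converting journeys with illustrative weights.
journeys = [
    [("social_ad", 0.3), ("email", 0.5), ("search_ad", 0.2)],
    [("email", 0.6), ("search_ad", 0.4)],
]

shares = channel_credit(journeys)
for channel, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{channel}: {share:.2f}")
```

In this toy data, email receives 55% of the total credit, search ads 30%, and social ads 15%. Multiplying each journey's weights by its conversion value before summing would give attributed revenue instead of attributed conversions.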

### The Budget Optimization Problem

The goal of budget optimization is to allocate a given marketing budget across different channels in a way that maximizes the overall effectiveness of the marketing campaign. Mathematically, this can be formulated as an optimization problem:

\begin{align*} \max_{\mathbf{x}} & \quad f(\mathbf{x}) \\ \text{subject to} & \quad \sum_{i=1}^n x_i = B \\ & \quad x_i \geq 0, \quad \forall i \in \{1, \ldots, n\} \end{align*}

Here:

$$\mathbf{x} = [x_1, x_2, \ldots, x_n]$$ is the vector of budget allocations for $$n$$ marketing channels.

$$f(\mathbf{x})$$ is the objective function representing the total attributed value (e.g., total conversions, revenue, ROI) resulting from the budget allocation $$\mathbf{x}$$.

$$B$$ is the total marketing budget available for allocation.

$$x_i$$ is the budget allocated to channel $$i$$.

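The objective function $$f(\mathbf{x})$$ can be derived from the attribution weights generated by the MTA model, for example by aggregating the weights into attributed conversions (or revenue) per channel and relating them to each channel's historical spend. It then represents the expected total contribution of all channels to campaign performance under the budget allocation $$\mathbf{x}$$.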
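As an illustration of what $$f(\mathbf{x})$$ might look like, consider two candidate forms. Both are assumptions made for the sake of example and are not prescribed by the MTA model itself. A linear objective uses a constant attributed return $$r_i$$ per unit of spend:

$$f(\mathbf{x}) = \sum_{i=1}^n r_i x_i$$

Under the budget constraint, this objective is maximized by putting the entire budget $$B$$ into the channel with the largest $$r_i$$, which is rarely realistic. A concave objective with diminishing returns avoids this, for instance:

$$f(\mathbf{x}) = \sum_{i=1}^n a_i \log\left(1 + \frac{x_i}{s_i}\right), \qquad \frac{\partial f}{\partial x_i} = \frac{a_i}{s_i + x_i}$$

where $$a_i > 0$$ scales the channel's attributed value and $$s_i > 0$$ controls how quickly returns saturate (both would need to be estimated from data). At an optimum, every channel that receives budget has the same marginal return $$\lambda$$, so that

$$\frac{a_i}{s_i + x_i} = \lambda \quad \Longrightarrow \quad x_i = \frac{a_i}{\lambda} - s_i$$

with $$\lambda$$ chosen such that the allocations sum to $$B$$.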

### Solving the Optimization Problem

To solve the optimization problem, businesses can use various optimization techniques, such as linear programming (when $$f$$ is linear in the allocations), gradient-based methods (when $$f$$ is differentiable), or evolutionary algorithms (when $$f$$ can only be evaluated as a black box). The choice of technique depends on the form of the objective and the constraints of the problem.

When solving the optimization problem, businesses may also consider additional constraints, such as minimum or maximum spending limits for certain channels or ensuring that the allocation is aligned with broader strategic objectives.
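One simple way to solve the problem, including per-channel spending limits, is a greedy allocation that repeatedly gives a small budget increment to the channel with the highest marginal return. The sketch below assumes a separable, concave response of the form $$a_i \log(1 + x_i / s_i)$$ for each channel; that functional form, the parameter values, and the function names are assumptions for illustration, and the greedy rule is only appropriate when returns are diminishing.

```python
import math


def response(a_i, s_i, x_i):
    """Assumed diminishing-returns response: attributed value at spend x_i."""
    return a_i * math.log1p(x_i / s_i)


def allocate_budget(a, s, budget, step=1.0, min_spend=None, max_spend=None):
    """Greedy allocation: each increment goes to the channel with the best marginal return."""
    n = len(a)
    lo = list(min_spend) if min_spend is not None else [0.0] * n
    hi = list(max_spend) if max_spend is not None else [math.inf] * n
    x = lo[:]  # start from the minimum spends
    remaining = budget - sum(x)
    if remaining < 0:
        raise ValueError("minimum spends exceed the total budget")
    while remaining > 1e-9:
        best, best_rate, best_d = None, -math.inf, 0.0
        for i in range(n):
            d = min(step, remaining, hi[i] - x[i])
            if d <= 1e-12:
                continue  # channel i is at its cap
            rate = (response(a[i], s[i], x[i] + d) - response(a[i], s[i], x[i])) / d
            if rate > best_rate:
                best, best_rate, best_d = i, rate, d
        if best is None:
            break  # every channel is capped; leftover budget stays unspent
        x[best] += best_d
        remaining -= best_d
    return x


# Hypothetical response parameters for two channels.
a = [20.0, 10.0]
s = [100.0, 100.0]

allocation = allocate_budget(a, s, budget=250.0)
capped = allocate_budget(a, s, budget=250.0, max_spend=[120.0, math.inf])
print(allocation)
print(capped)
```

Without caps, the allocation lands close to 200 and 50, the point at which both channels have the same marginal return. With the first channel capped at 120, the remaining 130 flows to the second channel. For non-concave or non-separable objectives, a general-purpose constrained optimizer would be a better fit than this greedy rule.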

### Interpreting and Implementing the Solution

The solution to the optimization problem provides the optimal budget allocation across different marketing channels. Businesses can use this information to guide their marketing spend decisions and maximize the effectiveness of their campaigns.

It is important to note that the MTA model and budget optimization are based on historical data and assumptions, and they may not fully capture the dynamic nature of marketing and consumer behavior. Therefore, businesses should continuously monitor the performance of their marketing campaigns, validate the model's predictions, and update the model as needed to reflect changing conditions.

## Conclusion

Multi-touch attribution is an essential tool for understanding the effectiveness of marketing efforts and driving business success. Traditional attribution models, such as last touch, first touch, and linear attribution, offer simple heuristics for credit assignment but lack the ability to capture complex interactions between touchpoints.

In contrast, deep learning-based models, such as LSTM, transformer, and TCN architectures, leverage historical data to model the customer journey in a more sophisticated manner. These models are capable of accounting for both short- and long-range dependencies between touchpoints, which can yield more accurate and meaningful attribution results.

As marketing channels continue to evolve and customer interactions become increasingly complex, MTA models play a crucial role in enabling businesses to make data-driven decisions. By effectively attributing credit to marketing touchpoints, businesses can optimize their marketing spend, gain insights into the customer journey, and deliver personalized experiences to their customers.

While deep learning-based MTA models offer significant advantages, it is important for practitioners to consider the data quality, quantity, and representativeness when implementing these models. Additionally, it is essential to interpret the results with caution and continuously validate the model's performance against real-world outcomes.

Overall, multi-touch attribution is a powerful framework for enhancing marketing effectiveness and driving business growth. As the field of marketing attribution continues to advance, we can expect further innovations in both modeling techniques and applications, leading to even greater value for businesses and customers alike.

