# Leveraging Design of Experiments in Financial Services

**Author**: Alice Liu

At a high level, getting better at credit risk means better understanding *how* the world works. *What* influences the cause and effect of credit risk behavior? For example, if pricing changes, then customer behavior changes (or vice versa). A change in macroeconomic conditions may drive a shift in marketing strategies.

How might one seek to understand and anticipate the expected change, or drive consumer behavior? Design of experiments (DOE), which forms the very foundation of statistics, can support endeavors to understand and identify drivers of credit risk behavior. The field of DOE, according to historical accounts, stemmed from the simple test of a lady tasting tea, where an experiment was designed to test Muriel Bristol’s claim that she could determine whether the tea or the milk was first put into the cup. Of course, in credit risk and the wider financial services industry, the experiments are designed more around the products, policies, and marketing strategies that may have an influential impact on customer behavior and/or the underlying business.

As with any framework, the foundation of DOE depends on its component parts — the necessary nuts, bolts, and gears needed. These components are modular, with the flexibility to be rearranged into a variety of different designs.

**Exploration of DOE Components Through a Toy Example**

The DOE framework revolves around the expected impact of the input factors, as well as other factors that influence the process (both controllable and uncontrollable), on the output of the given process.

With all the component elements of DOE, an experimenter can mix and match to create a variety of commonly used experimental designs. Let’s walk through a toy scenario* and describe what those component elements are.

**Note: the toy scenario is a simplified version of the DOE process, meant to illustrate the process.*

Suppose a bank is interested in a designed experiment for testing out an introductory bonus points offer for new credit card holders. The goal is to optimize customer acquisition by assessing the impact of the introductory bonus point offer, with the **hypothesis** that increasing the offer will result in greater customer acquisition. The **null hypothesis**, in the offer scenario, would be that increasing the offer results in the same number of acquired customers; whereas the **alternate hypothesis** would be that the number of acquired customers will be greater.

The **(statistical) hypothesis** refers to a supposition regarding the process: a statement of the expected impact of the design factors on the output. The “baseline” scenario (i.e., the current impact) is defined as the **null hypothesis** H₀, whereas the experimental scenario (i.e., the impact of the design factors) is defined as the **alternate hypothesis** H₁ (sometimes Hₐ).

Now with the desired hypothesis in mind, what is the **target variable**? This could be the number of applications, the rate of approved applications, the rate of new-to-bank customers, etc. For simplicity, let’s assume the target variable is the number of acquired customers.

Next — what **factors** should be varied? What factors can and can’t be controlled? **Factors** are variables that may influence the experiment outcome and are categorized as:

- **Design factors**: what is purposefully being manipulated to assess changes in outcome
- **Held-constant factors**: factors that may affect the outcome, but are held constant to avoid conflating effects
- **Allowed-to-vary factors**: factors that are likely not to affect the outcome, and are not held constant
- **Nuisance factors**: influential factors on the experiment that aren’t of interest, which could either be controllable (i.e., the experimenter can set the levels) or uncontrollable (i.e., the levels can’t be set, but may be measurable and potentially adjustable for effects *post*-experiment)

In the toy scenario, the main factor of interest is the number of intro bonus points offered. This will be the **design factor**. Let’s assume, for now, there are two levels to the design factor: (1) the regular points offer of 10k and (2) a 5k bump in the points offer to 15k.

The regular points offer of 10k is the “baseline,” which allows for a comparison point for customer behavior relative to the increased 15k offer. This can be thought of as a **control** group, as this group receives no experimental treatment (i.e., changes in treatment).

Assigning customers to each of those two levels **randomly** is an **A/B test**. The A/B test is among the simplest of DOE frameworks, in which one factor (e.g., a control versus a treatment group) is varied at a time and is also known as a simple randomized experiment. The A/B test is popular because it’s effective, easy to understand, and easy to productionize. However, if multiple levels and/or factors are desired, then an A/B test may not work.
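As a minimal sketch of what randomized A/B assignment could look like in practice (the group labels, the 50/50 split, and the fixed seed are assumptions for illustration, not the bank’s actual mechanism):

```python
import random

def assign_ab(customer_ids, seed=42):
    """Randomly assign each customer to control (10k offer) or treatment (15k offer)."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return {cid: rng.choice(["control_10k", "treatment_15k"]) for cid in customer_ids}

groups = assign_ab(range(1000))
# Each customer lands in exactly one group, with roughly half in each
```

Fixing the seed is a design choice: it makes the assignment auditable and repeatable, which matters when an experiment has to survive regulatory or model-risk review.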

Now why are customers assigned randomly? **Randomization** helps to average out the effects of uncontrollable factors on the process, mitigating systematic bias that may be introduced to the experiment. This process involves randomizing the order in which experimental trials are run and the observations (or individuals) the experiment is run on.

Now, for some further complexity, let’s add a third level to the experiment: (3) a 10k bump in the points offer to 20k. Let’s also assume there’s a desire to test APR pricing alongside the bonus points offer, which will likewise have three levels: (1) 20% APR (assume this is the baseline, or what is currently offered), (2) 25% APR, and (3) 30% APR.

Trying all combinations of bonus points and APR price points together, instead of one at a time, results in a **factorial** design. For the three levels of bonus points and three APR price points, there will be a total of 3² = 9 combinations of the design factors.
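Enumerating the full factorial design is a one-liner; a sketch using the toy scenario’s levels (the specific numeric encodings are assumptions):

```python
from itertools import product

bonus_points = [10_000, 15_000, 20_000]  # three levels of the intro bonus offer
apr = [0.20, 0.25, 0.30]                 # three levels of APR pricing

# Full factorial design: every combination of the two design factors
design_segments = list(product(bonus_points, apr))
# 3 levels x 3 levels = 9 design segments
```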

Now, due to the nature of credit card applications, the types of customers (e.g., FICO, income, etc.) will vary, but can’t feasibly be controlled for. Therefore, as customers come in with applications, they should be **randomly assigned** to one of the nine design segments.

In the toy scenario, suppose weekday versus weekend is expected to have an impact (i.e., the customers who apply on the weekend are different than the customers who apply on weekdays); then, the randomization could be split and **blocked** based on time of the week. This doubles the number of trials run, with 18 different trials based on the bonus point offer, APR pricing, and time of the week.

**Blocking** (i.e., using a **block** design) is used to have greater precision when comparing among the controllable factors being tested in the experiment. The idea is to lessen or erase the effects of nuisance factors, which are factors that are not of interest but could influence the outcome of the experiment, resulting in a conflation of cause and effect. The block design essentially runs “mini” versions of the experiment within each block (i.e., grouping similar customers with one another). This design restricts randomization on the factor used to create the blocks.
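One way to sketch blocked randomization for the toy scenario: shuffle applicants within each block, then deal them round-robin across the nine design segments so each block’s “mini” experiment stays balanced. The `(customer_id, block)` input shape and the round-robin balancing are assumptions for illustration:

```python
import random
from itertools import product

def blocked_assignment(applications, seed=0):
    """Randomize segment assignment separately within each block (weekday vs weekend).

    `applications` is a list of (customer_id, block) pairs, where the block label
    is a hypothetical field indicating when the application arrived.
    """
    rng = random.Random(seed)
    segments = list(product([10_000, 15_000, 20_000], [0.20, 0.25, 0.30]))
    assignment = {}
    for block in ("weekday", "weekend"):
        members = [cid for cid, b in applications if b == block]
        rng.shuffle(members)  # randomize order within the block
        # Deal shuffled customers round-robin across the 9 segments so the
        # "mini" experiment inside each block is balanced
        for i, cid in enumerate(members):
            assignment[cid] = (block, segments[i % len(segments)])
    return assignment
```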

Of course, many applicants should be assigned to each of the design segments for **replication**, which boils down to running additional, independent, repeated trials of the experiment, for all factor combinations involved. This allows two items to be determined with greater confidence:

- The experimental error to identify statistical significance (i.e., is customer acquisition truly different among design segments?), and
- The true mean response of the experiment (i.e., what can be expected for customer acquisition for the design segments?).
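A minimal sketch of what replication buys (the acquisition counts below are made up): repeated, independent trials of the same design segment let you estimate both the mean response and the experimental error around it.

```python
from statistics import mean, stdev

# Hypothetical acquisition counts from 5 repeated trials of one design segment
replicates = [52, 47, 55, 49, 51]

avg = mean(replicates)                            # estimate of the true mean response
se = stdev(replicates) / len(replicates) ** 0.5   # experimental (standard) error of the mean
```

With only one trial per segment there is no `se` at all, which is exactly why replication is needed before statistical significance can be assessed.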

For the offer experiment, let’s suppose the number of replicates was determined to be 100 customers per design segment.

Now, let’s suppose the platform for the bonus point and APR offers is online only (e.g., through the website). Assignment of replicates could be based on IP address, for example, though consideration would need to be given to mitigating circumstances such as customers using a VPN or changing location.

Finally, the last step is to determine **sample size** *n* (which is the sum total of all replicates in the experiment) based on how comfortable one is with type I and II errors (α and β, respectively). The total number of samples needed, based on the replicates, for the offer experiment is 100 replicates × 18 segments = 1800 customers.

Recall the description of the two hypotheses, the null and the alternate. The idea is to determine which of the two is “correct” based on the statistical significance of the experimental impact, as measured by the **test statistic**: either reject the null hypothesis (i.e., there’s statistically significant different behavior) or fail to reject it (i.e., there’s no statistically significant difference). As with any decision, there is a chance of making the wrong one, of making an error:

- A **type I error** (i.e., false positive) occurs when the null hypothesis is wrongly rejected; this is related to the **significance level** (and, consequently, the confidence) of the test, or how unlikely a type I error is to occur
- A **type II error** (i.e., false negative) occurs when the null hypothesis is wrongly not rejected; this is related to the **power** of the test, or how unlikely a type II error is to occur

Concluding that increasing the offered bonus points increases customer acquisition, when in fact it doesn’t, is a false positive (type I error). On the other hand, concluding that increasing bonus points doesn’t increase customer acquisition, when in fact it does, is a false negative (type II error).
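To make the decision itself concrete, a two-proportion z-test is one common choice of test statistic for comparing acquisition rates between two segments. A sketch, with made-up counts (48 of 400 acquired at the 10k offer vs. 71 of 400 at the 15k offer):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Z-test for H0: p1 == p2 vs H1: p1 != p2; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error under H0
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value
    return z, p_value

# Hypothetical counts: reject H0 at alpha = 0.05 when p_value < 0.05
z, p = two_proportion_z_test(48, 400, 71, 400)
```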

As mentioned, these two types of errors are important to **sample size** *n* decisions for the experiment, as α and β set, at *minimum*, the thresholds for false positive and false negative rates the experimenter is comfortable with. Some common thresholds: α is set at 0.05 and β is set at 0.2 (therefore, power is 0.8). As sample size increases, the power of the test and the precision of the estimation will also increase, for a given α. Depending on the design chosen, sample size determination can be complicated. Note that this is one of the most important aspects of the experiment, as it will likely determine how long an experiment needs to run for adequate data collection.
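For a simple two-proportion comparison, the standard normal-approximation formula ties α, β, and the expected effect size to a per-arm sample size. A sketch (the 12% baseline and 15% target acquisition rates are assumptions for illustration):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for power = 0.8
    var = p1 * (1 - p1) + p2 * (1 - p2)        # combined variance of the two rates
    return ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

# Hypothetical: detect a lift from 12% to 15% acquisition
n = n_per_arm(0.12, 0.15)
```

Note how sensitive `n` is to the effect size: halving the expected lift roughly quadruples the required sample, which is often what drives how long an experiment must run.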

Once sample size is determined and careful consideration has been given to the components of the designed experiment, all that is left to do is to productionize it. The productionization of experiments can be costly, from both a monetary and time aspect, so it’s vital to make a conscientious effort to design the “right” experiment for business needs.

***Note: the above DOE methodologies focus on “screening” factors to understand and optimize the fundamental process. Other DOE methodologies exist, such as response surface methodology, which focuses on hitting an optimization target or the inclusion of additional constraints or specifications.*

**Use Cases**

Beyond the toy scenario, what other use cases are there for designed experiments for financial services? Several options are identified below:

- **Marketing experiments** (e.g., to reduce overall marketing costs)
- Running **always-on experiments**, which can provide validation for continued or changing behavior
- Designing experiments to reduce variability in **product performance** or to improve product yield/profit
- Determination of **product design** parameters to either formulate new products or identify impacts of those parameters on product performance (e.g., price elasticity, product robustness, etc.)

**Do you have questions? We’d love to answer them!**

Contact the author by email at:

**Interested in 2OS insights?**

Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!