## October 5th

### Session 1 (9:15-10AM)

**Statistics in the Knowledge Economy**

David Banks, Duke University

There is a rich field of statistics associated with manufacturing industries. We know much about quality control, process monitoring, experimental design, and response surface methodology. But today, the focus is upon information technology industries, and statisticians need new tools and new thinking to address their problems. This talk lays out some of the relevant issues and opportunities.

**Order-of-addition Experiments: Design and Analysis**

Dennis Lin, Purdue University

In Fisher (1971), a lady was able to distinguish (by tasting) whether the tea or the milk was added to the cup first. This is probably the first popular order-of-addition (OofA) experiment. In general, there are m required components, and we hope to determine the optimal sequence for adding these m components one after another. It is often unaffordable to test all m! treatments (for example, 10! is about 3.6 million), and the design problem arises. We consider the model in which the response of a treatment depends on the pairwise orders of the components. The optimal design theory under this model is established, and the optimal values of the D-, A-, E-, and M/S-criteria are derived. For the model-free approach, an efficient sequential methodology is proposed, building upon the basic concept of the quick-sort algorithm, to explore the optimal order without any model specification. The proposed method is capable of obtaining the optimal order for large m (≥ 20). This work can be regarded as an early work on OofA experiments with a large number of components. Some theoretical support is also discussed. One case study on job scheduling will be discussed in detail.
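
As a rough illustration of the pairwise-order model mentioned in this abstract (not the speaker's code), the sketch below codes each addition order by one factor per pair of components: +1 if the lower-numbered component is added first, -1 otherwise. The function name `pwo_row` and the ±1 coding convention are assumptions made for the example.

```python
from itertools import combinations, permutations
from math import factorial


def pwo_row(order):
    """Pairwise-order coding of one addition sequence.

    For every pair (j, k) with j < k, the entry is +1 if component j
    is added before component k, and -1 otherwise.
    """
    position = {component: i for i, component in enumerate(order)}
    components = sorted(order)
    return [
        1 if position[j] < position[k] else -1
        for j, k in combinations(components, 2)
    ]


if __name__ == "__main__":
    m = 3
    # Full design: all m! orders, each coded by m*(m-1)/2 pairwise factors.
    for order in permutations(range(1, m + 1)):
        print(order, pwo_row(order))

    # The full design grows quickly: 10 components already need 10! runs.
    print(factorial(10))
```

With m components the model has m(m-1)/2 pairwise factors, far fewer than the m! possible orders, which is why a carefully chosen fraction of the orders can suffice.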

**Case Study: Teaching Engineers Thermodynamics to Monitor Engine Performances**

Eric Bae and G. Geoffrey Vining, Virginia Tech

The ability to predict accurately the critical quality characteristics of aircraft engines is essential for modeling the degradation of engine performance over time. The acceptable margins for error grow smaller with each new generation of engines. This paper focuses on turbine gas temperature (TGT). The goal is to improve the first-principles predictions through the incorporation of pure thermodynamics, as well as available information from the engine health monitoring (EHM) data and appropriate maintenance records. The first step in the approach is to develop the proper thermodynamics model to explain and to predict the observed TGTs. The resulting residuals provide the fundamental information on degradation. The current engineering models are ad hoc adaptations of the underlying thermodynamics not properly tuned by actual data. Interestingly, the pure thermodynamics model uses only two variables: atmospheric temperature and a critical pressure ratio. The resulting predictions of TGT are at least similar, and sometimes superior, to those of these ad hoc models. The next steps recognize that there are multiple sources of variability, some nested within others. Examples include version to version of the engine, engine to engine within version, route to route across versions and engines, maintenance to maintenance cycles within engine, and flight segment to flight segment within maintenance cycle. The EHM data provide an opportunity to explain the various sources of variability through appropriate regression models. Different EHM variables explain different contributions to the variability in the residuals, which provides fundamental insights as to the causes of the degradation over time. The resulting combination of the pure thermodynamics model with proper modeling based on the EHM data yields significantly better predictions of the observed TGT, allowing analysts to see the impact of the causes of the degradation much more clearly.

### Session 2 (10:30AM-12PM)

**Sequential Design for Contour Location with Deep Gaussian Process Surrogates**

Annie Sauer, North Carolina State University, Robert B. Gramacy, Virginia Tech, and Ashwin Renganathan, The Pennsylvania State University

Many research objectives in aerospace engineering, such as developing new wing shapes that optimize efficiency while still meeting safety regulations, require computer simulations to serve as experimental observations because real-world experiments on this scale are not cost effective or practical. The high computational costs associated with such computer “experiments” necessitate statistical surrogate models, trained on limited simulation data, that offer predictions with appropriate uncertainty quantification (UQ) at un-run input configurations. Gaussian processes (GPs) are popular surrogates, but they suffer from stationarity assumptions. Deep GPs (DGPs) offer a more flexible alternative, while still prioritizing UQ when kept in a fully-Bayesian framework. A DGP’s ability to model non-stationary dynamics is important in aerospace applications, where regime shifts naturally occur (e.g., dynamics changing across the sound barrier). When the research objective is to identify a failure region, training data may be chosen sequentially to strategically target inputs on the failure contour. State-of-the-art sequential design for contour location requires numerical optimization of the entropy criteria. Numerical optimization is not compatible with Bayesian DGP surrogates, which require hefty MCMC sampling. We propose a method for sequentially designed contour location that only requires evaluating the surrogate at candidate inputs, crucially avoiding the need for any numerical optimization. Our candidate-contour-location scheme utilizes triangulation candidates and makes acquisitions on the Pareto front of entropy and uncertainty. It facilitates the use of Bayesian DGP surrogates in contour location contexts that were previously unreachable. We showcase its prowess on a motivating aerospace wing-shape simulation.

**Active Learning with Design-based Sampling**

Lin Wang, Purdue University

Gaussian process models have been popular as surrogate models for simulation experiments. Active learning of Gaussian process models is useful for optimizing experimental designs and controlling data sizes yet can be computationally expensive or even prohibitive. We propose a novel active learning approach for Gaussian process models using design-based sampling. Generating training data using the proposed approach is fast and allows accurate modeling with small data sizes. The superiority of the proposed approach is demonstrated with synthetic and real simulation experiments.

**If you have to use a Supersaturated Design …**

John Stufken, George Mason University

For a designed experiment with many factors, when observations are expensive, it is common that the number of model effects is much larger than the number of observations. A design for such a problem is known as a supersaturated design. Experiments that use such designs are intended to differentiate between a few factors that can explain most of the differences in a response variable and the factors that are unimportant. Various methods of analysis have been proposed for such experiments, but correctly identifying the few important factors is very challenging because, typically, many models will fit the data approximately equally well. We will discuss how identifying the important factors can be improved by considering multiple fitted models.

**Design Selection for Multi- and Mixed-level Supersaturated Designs**

Rakhi Singh, Binghamton University

The literature offers various design selection criteria and analysis techniques for screening experiments. The traditional optimality criteria do not work for supersaturated designs; as a result, most criteria aim to minimize some function of pairwise orthogonality between different factors. For two-level designs, the Gauss-Dantzig Selector is often preferred for analysis, but it fails to capture differences in screening performance among different designs. Two recently proposed criteria utilizing large-sample properties of the Gauss-Dantzig Selector by Singh and Stufken (Technometrics, 2023) result in better screening designs. Unfortunately, the straightforward extension of these criteria to higher-level designs is not possible. For example, it is unclear if the Gauss-Dantzig Selector is still an appropriate analysis method for multi- and mixed-level designs. In this talk, I will first argue that group LASSO is a more appropriate method to analyze such data. I will then use large sample properties of group LASSO to propose new optimality criteria and construct novel and efficient designs that demonstrate superior screening performance.

**Deep Gaussian Process Emulation using Stochastic Imputation**

Deyu Ming, University College London

Deep Gaussian processes (DGPs) provide a rich class of models that can better represent functions with varying regimes or sharp changes, compared to conventional GPs. In this talk, we introduce a novel inference method for DGPs for computer model emulation. By stochastically imputing the latent layers, our approach transforms a DGP into a linked GP: a novel emulator developed for systems of linked computer models. This transformation permits an efficient DGP training procedure that only involves optimizations of conventional GPs. In addition, predictions from DGP emulators can be made in a fast and analytically tractable manner by naturally using the closed form predictive means and variances of linked GP emulators. We demonstrate the method in a financial application and show that it is a competitive candidate for DGP surrogate inference, combining efficiency that is comparable to doubly stochastic variational inference and uncertainty quantification that is comparable to the fully-Bayesian approach. We also present an extension to our approach for constructing networks of DGP emulators. Lastly, we showcase the implementation of the technique using our developed R package, dgpsi, which is available on CRAN.

**A Sharper Computational Tool for L2E Regression**

Xiaoqian Liu, The University of Texas MD Anderson Cancer Center

Robust regression has gained more traction due to its ability to handle outliers better than the traditional least squares regression. With the growing complexity of modern datasets, robust structured regression is finding numerous applications in real-world problems. In this talk, we introduce a general framework for robust structured regression under the L2E criterion. This framework estimates regression coefficients and a precision parameter simultaneously. We develop a sharper computational tool for this L2E regression framework. We adopt the majorization-minimization (MM) principle to design a new algorithm for updating the vector of regression coefficients. Our sharp majorization achieves faster convergence than the previous alternating proximal gradient descent algorithm. In addition, we reparametrize the model by substituting precision for scale and estimate precision via a modified Newton’s method. This simplifies and accelerates overall estimation. We also introduce distance-to-set penalties to enable constrained estimation under nonconvex constraint sets. This tactic also improves performance in coefficient estimation and structure recovery. Finally, we demonstrate the merits of our improved tactics through a rich set of simulation examples and a real data application.

### Session 3 (2-3:30PM)

**Experiences with Developing and Operating a Design of Experiments MOOC**

Douglas Montgomery, Arizona State University

This MOOC provides a basic course in designing experiments and analyzing the resulting data. It is intended for engineers, physical/chemical scientists, scientists from other fields such as biotechnology and biology, market researchers, and data analysts from a wide variety of businesses including e-commerce. The course deals with the types of experiments that are frequently conducted in these settings. The prerequisite background is a basic working knowledge of statistical methods. Participants should know how to compute and interpret the sample mean and standard deviation, have previous exposure to the normal distribution, be familiar with the concepts of testing hypotheses (the t-test, for example), constructing and interpreting a confidence interval, and model-fitting using the method of least squares. Most of these concepts are discussed and reviewed as they are needed. We describe how the MOOC is structured, and discuss how applications from various fields and the use of computer software are integrated into the course. The MOOC also features a live, monthly opportunity for participants to attend a “Fireside Chat”, which features a guest presentation on a topic of interest in experimental design, and to interact with the host and guest in a question and answer session. The MOOC has had over 27,000 participants from many different countries.

**Expanding Access to Graduate Education in Analytics & Data Science: Behind the Scenes at Georgia Tech’s Online Masters Program**

Joel Sokol, Georgia Tech

Georgia Tech’s MS in Analytics, an interdisciplinary hybrid analytics/data-science degree, was formed as a collaborative effort between GT’s Colleges of Computing, Business, and Engineering in 2014 to meet rising demand among both students and employers. In 2017, to increase accessibility, GT also began offering the degree online. The online program currently has about 6000 students from over 140 countries and has graduated over 3000 individuals since its inception. This talk will focus on lessons learned related to program design, scope, and content; startup, expansion, and growth; technology requirements; teaching and advising remote students; student enrollment patterns; meeting both student and employer needs; integrating with the rest of campus (much more here than meets the eye!); teaching philosophy and methodologies, etc. New trends and future directions will also be discussed, including opportunities for industry and academic collaboration.

**Broadening the spectrum of OMARS designs**

Peter Goos and José Núñez Ares, University of Leuven

The family of orthogonal minimally aliased response surface designs or OMARS designs bridges the gap between the small definitive screening designs and classical response surface designs, such as central composite designs and Box-Behnken designs. The initial OMARS designs involve three levels per factor and allow large numbers of quantitative factors to be studied efficiently using limited numbers of experimental tests. Many of the OMARS designs possess good projection properties and offer better powers for quadratic effects than definitive screening designs with similar numbers of runs. Therefore, OMARS designs offer the possibility to perform a screening experiment and a response surface experiment in a single step, and thereby offer the opportunity to speed up innovation and process improvement. A technical feature of the initial OMARS designs is that they study every quantitative factor at its middle level the same number of times. As a result, every main effect can be estimated with the same precision using the initial OMARS designs, the power is the same for every main effect, and the quadratic effect of every factor has the same probability of being detected. In this talk, we show how to create OMARS designs in which the main effects of some factors are emphasized at the expense of their quadratic effects, or vice versa. We call the new OMARS designs non-uniform-precision OMARS designs, and show that relaxing the uniform-precision requirement opens a large new can of useful three-level experimental designs. The non-uniform-precision OMARS designs form a natural connection between the initial OMARS designs, involving three levels for every factor and corresponding to one end of the OMARS spectrum, and the mixed-level OMARS designs, which involve three levels for some factors and two levels for other factors and correspond to the other end of the OMARS spectrum.

**A Family of Orthogonal Main Effects Screening Designs for Mixed Level Factors**

There is little literature on screening when some factors are at three levels and others are at two levels. In this talk, I will introduce a family of orthogonal, mixed-level screening designs in multiples of eight runs. The 16-run design can accommodate up to four continuous three-level factors and up to eight two-level factors. The two-level factors can be either continuous or categorical. All of the designs substantially protect the main-effects estimates against bias from active two-factor interactions (2FIs). I will show a direct construction of these designs.

**Predictive Resilience Modeling**

Priscila Silva and Lance Fiondella, University of Massachusetts Dartmouth

Resilience is the ability of a system to respond to, absorb, adapt to, and recover from a disruptive event. Dozens of metrics to quantify resilience have been proposed in the literature. However, fewer studies have proposed models to predict these metrics or the time at which a system will be restored to its nominal performance level after experiencing degradation. This talk presents alternative approaches to model and predict performance and resilience metrics with elementary techniques from reliability engineering and statistics. We will also present a free and open source tool developed to apply the models without requiring detailed understanding of the underlying mathematics, enabling users to focus on resilience assessments in their day-to-day work.

**Semi-destructive Gage R&R: De-confounding Variance Components in Systems with Predictable Degradation**

Douglas Gorman, BD

Sometimes the process of measuring a part changes the characteristic being measured due to deformation, wear, or another degradation mechanism. In these situations, it is natural to treat the test method as destructive, since repeat measurements exhibit variation that is not pure repeatability, but the combination of repeatability and the change in performance of the item being measured. Common strategies for performing gage R&R studies with destructive test methods confound part-to-part variation with measurement system repeatability. In cases where part-to-part variation is large, destructive tests often fail typical gage R&R acceptance criteria. When measured parts exhibit a predictable degradation effect with repeat measurements, repeat measures can de-confound the part-to-part variation and repeatability error. Including the degradation effect in the gage R&R ANOVA model enables the part-to-part variation to be independently estimated. This talk explains how to design and analyze a gage study in such a semi-destructive situation and provides examples from industry.

## October 6th

### Session 4 (8-9:30AM)

**HODOR: A Two-stage Hold-out Design for Online Controlled Experiments on Networks**

Nicholas Larsen, Jonathan Stallrich, and Srijan Sengupta, North Carolina State University

The majority of methods for online controlled experiments rely on the Stable Unit Treatment Value Assumption, which presumes the response of individual users depends only on the assigned treatment, not the treatments of others. Violations of this assumption occur when users are subjected to network interference, a common phenomenon in social media platforms. Standard methods for estimating the average treatment effect typically ignore network effects and produce biased estimates. Additionally, unobserved user covariates, such as variables hidden due to privacy restrictions, that influence user response and network structure may also bias common estimators of the average treatment effect. We demonstrate that network-influential lurking variables can heavily bias popular network clustering-based methods, thereby making them unreliable. We then propose a two-stage design and estimation technique called HODOR (Hold-Out Design for Online Randomized experiments). We show that the HODOR estimator is unbiased for the average treatment effect even when the underlying network is partially-unknown or uncertain. We then derive an optimal allocation to minimize the variance of the estimator without knowledge of the network.

**Monitoring Time-Varying Networks with Principal Component Network Representations**

James Wilson, University of San Francisco

We consider the problem of network representation learning (NRL) for network samples and how NRL can be used for monitoring change in a sequence of time-varying networks. We first introduce a technique, Principal Component Analysis for Networks (PCAN), that identifies statistically meaningful low-dimensional representations of a network sample via subgraph count statistics. We then describe a computationally fast sampling-based procedure, sPCAN, that not only is significantly more efficient than its counterpart, but also enjoys the same advantages of interpretability. We show how the features from (s)PCAN can be used to detect local changes in a sequence of time-varying networks. We explore the time series of co-voting networks of the US Senate paying close attention to eras of political polarization as well as harmonization. If time permits, we will discuss large-sample properties of the methods when the sample of networks analyzed is a collection of kernel-based random graphs.

**County-Level Surveillance of COVID Counts**

Steven E. Rigdon, Saint Louis University

We fit spatially correlated models at the county level for weekly COVID data. Poor data quality in many regions requires special attention. By applying a conditional (spatial) autoregressive model, we are able to assess the effect of a number of variables, such as demographics, vaccine rates, etc., on the transmission rate of the disease.

**Empirical Calibration for a Nonparametric Lower Tolerance Bound**

Caleb King, JMP

In many industries, the reliability of a product is often determined by a quantile of a distribution of a product’s characteristics meeting a specified requirement. A typical approach to address this is to assume a distribution model and compute a one-sided confidence bound on the quantile. However, this can become difficult if the sample size is too small to reliably estimate a parametric model. Linear interpolation between order statistics is a viable nonparametric alternative if the sample size is sufficiently large. In most cases, linear extrapolation from the extreme order statistics can be used, but can result in inconsistent coverage. In this talk, we will present an empirical study used to generate calibrated weights for linear extrapolation that greatly improves the accuracy of the coverage across a feasible range of distribution families with positive support. We will demonstrate this calibration technique using two examples from industry.

**Augmenting Definitive Screening Designs: Going Outside the Box**

Robert Mee, University of Tennessee Knoxville

Definitive screening designs (DSDs) have grown rapidly in popularity since their introduction by Jones and Nachtsheim (2011). Their appeal is that the second-order response surface (RS) model can be estimated in any subset of three factors, without having to perform a follow-up experiment. However, their usefulness as a one-step RS modeling strategy depends heavily on the sparsity of second-order effects and the dominance of first-order terms over pure quadratic terms. To address these limitations, we show how viewing a projection of the design region as spherical and augmenting the DSD with axial points in factors found to involve second-order effects remedies the deficiencies of a stand-alone DSD. We show that augmentation with a second design consisting of axial points is often the Ds-optimal augmentation, as well as minimizing the average prediction variance. Supplemented by this strategy, DSDs are highly effective initial screening designs that support estimation of the second-order RS model in three or four factors.

**Constructing Control Charts for Autocorrelated Data Using an Exhaustive Systematic Samples Pooled Variance Estimator**

Scott Grimshaw, Brigham Young University

SPC with positive autocorrelation is well known to result in frequent false alarms if the autocorrelation is ignored. Here the autocorrelation is treated as a nuisance, not a feature that merits modeling and understanding. This paper proposes exhaustive systematic sampling, which is similar to Bayesian thinning except that no observations are dropped, to create a pooled variance estimator that can be used in Shewhart control charts with competitive performance. The expected value and variance are derived using quadratic forms, an approach that is nonparametric in the sense that no distribution or time series model is assumed. Practical guidance is offered for choosing the systematic sampling interval: large enough for the estimator to be approximately unbiased, but not so large as to inflate its variance. The proposed control charts are compared to time series residual control charts in a simulation study that validates using the empirical reference distribution control limits to preserve the stated in-control false alarm probability and demonstrates similar performance.
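
The abstract does not give the estimator's formula, so the following is only a sketch of one plausible reading: split the series into k systematic samples (every k-th observation, one sample per starting offset, so no observation is dropped), compute the within-sample sums of squares, and pool them by degrees of freedom. The function name `systematic_pooled_variance` and the pooling weights are assumptions, not the author's definition.

```python
from statistics import mean


def systematic_pooled_variance(x, k):
    """Pooled variance over the k systematic samples of a series.

    Systematic sample j (j = 0, ..., k-1) holds x[j], x[j+k], x[j+2k], ...
    Every observation lands in exactly one sample, so none are discarded.
    Assumed pooling rule: total within-sample sum of squares divided by
    total within-sample degrees of freedom.
    """
    subsamples = [x[j::k] for j in range(k)]
    if any(len(s) < 2 for s in subsamples):
        raise ValueError("each systematic sample needs at least two observations")

    sum_of_squares = 0.0
    degrees_of_freedom = 0
    for s in subsamples:
        s_mean = mean(s)
        sum_of_squares += sum((v - s_mean) ** 2 for v in s)
        degrees_of_freedom += len(s) - 1
    return sum_of_squares / degrees_of_freedom


if __name__ == "__main__":
    series = [1, 2, 3, 4, 5, 6]
    print(systematic_pooled_variance(series, 1))  # ordinary sample variance
    print(systematic_pooled_variance(series, 2))  # pooled over two systematic samples
```

With k = 1 this reduces to the ordinary sample variance; larger k spaces the observations within each sample further apart, which is the trade-off in choosing the sampling interval that the abstract describes.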

### Session 5 (10-11:30AM)

**PERCEPT: a New Online Change-point Detection Method Using Topological Data Analysis**

Simon Mak, Duke University

Topological data analysis (TDA) provides a set of data analysis tools for extracting embedded topological structures from complex high-dimensional datasets. In recent years, TDA has been a rapidly growing field which has found success in a wide range of applications, including signal processing, neuroscience and network analysis. In these applications, the online detection of changes is of crucial importance, but this can be highly challenging since such changes often occur in low-dimensional embeddings within high-dimensional data streams. We thus propose a new method, called PERsistence diagram-based ChangE-PoinT detection (PERCEPT), which leverages the learned topological structure from TDA to sequentially detect changes. PERCEPT follows two key steps: it first learns the embedded topology as a point cloud via persistence diagrams, then applies a non-parametric monitoring approach for detecting changes in the resulting point cloud distributions. This yields a non-parametric, topology-aware framework which can efficiently detect online geometric changes. We investigate the effectiveness of PERCEPT over existing methods in a suite of numerical experiments where the data streams have an embedded topological structure. We then demonstrate the usefulness of PERCEPT in two applications on solar flare monitoring and human gesture detection.

**Statistical Approaches for Testing and Evaluating Foundation Models**

Giri Gopalan and Emily Casleton, Los Alamos National Laboratory

A modern thrust of research in artificial intelligence (AI) has focused on foundation models (FMs) – deep neural networks, typically involving a transformer architecture and trained on a prodigious corpus of data in a self-supervised fashion, that can be adapted to solve downstream tasks. The introduction of FMs has been referred to as a paradigm shift away from “narrow AI”, where models are built and trained for specific tasks and generally do not perform well on out-of-distribution data. Before these models are to be trusted by domain scientists, a crucial aspect is to rigorously evaluate FM performance and quantify uncertainties when testing, evaluating, and comparing different FMs, especially as FMs (and derivatives thereof) become widely used in a variety of critical applications; a specific example is event detection and characterization with an FM built on seismic data. Given the fact that FMs are so large (GPT-3, the engine behind the popular ChatGPT, contains 175 billion parameters) and that they are able to address multiple tasks, traditional approaches used to evaluate and quantify uncertainty of AI models need to be recast for FMs.

We will discuss statistically grounded ways to perform uncertainty quantification with FMs, focusing on salient topics such as comparing and contrasting extrinsic and intrinsic evaluations of FMs, adapting to situations where resampling of the training data and iteratively re-fitting FMs is not computationally feasible, and combining information from performance on different tasks for an assessment of FM performance. For instance, an FM that achieves a low loss score when being trained in a self-supervised manner might perform poorly on a set of tasks related to a particular application context – hence, appropriately combining extrinsic and intrinsic FM performance depends on the specific use case. For another example, a battery of tasks that the FM is tested on might have substantial variance in quality, difficulty, and number of questions, all of which should be considered when combining task-specific performance metrics. Finally, it is imperative that comparisons of FM performance account for uncertainty on the metric so that superior performance cannot be attributed to sampling variability, for example. We will attempt to suggest best practices for dealing with issues such as those aforementioned, and our work should be helpful to practitioners who wish to evaluate and assess FMs from a statistically informed perspective.

**Practical Design Strategies for Robust Nonlinear Statistical Modelling**

Timothy E. O’Brien, Loyola University of Chicago

Researchers often find that generalized linear or nonlinear regression models are applicable for scientific assessment including physical, agricultural, environmental and synergy modelling. In line with other processes, these nonlinear models perform better than linear models since they tend to fit the data well and the associated model parameter(s) are typically scientifically meaningful. Generalized nonlinear model parameter estimation presents statistical modelling challenges including computational, convergence and curvature issues. Further, researchers are also often in a position of requiring optimal or near-optimal (so-called “robust”) designs for the given chosen nonlinear model. A common shortcoming of most optimal designs for nonlinear models used in practical settings, however, is that these designs typically focus only on (first-order) parameter variance or predicted variance, and thus ignore the inherent nonlinearity of the assumed model function. Another shortcoming of optimal designs is that they often have only p support points, where p is the number of model parameters. This talk examines modelling and estimation methods connected with generalized nonlinear models useful for physical, environmental and biological modelling and provides concrete novel suggestions related to robust design strategies for these models. Numerous examples will be provided and software methods will be discussed.

**Entropy-Driven Design of Sensitivity Experiments**

David M. Steinberg, Tel Aviv University, Rotem Rozenblum, Consultant, and Amit Teller, Rafael Advanced Defense Systems Ltd.

Sensitivity experiments assess the relationship of a stress variable to a binary response, for example how drop height affects the probability of product damage. These experiments usually employ a sequential design where the next stress level is chosen based on the data obtained thus far. For example, the popular Bruceton (up/down) experiment increases the stimulus if no response is observed and decreases it following a positive response. In this work we present a new entropy-driven design algorithm that chooses stress levels to maximize the mutual information between the next observation and the goals of the inference. The approach is parametric, so the goals could be the parameters themselves or functions of them, such as a quantile or a set of quantiles. We use a Bayesian analysis that quantifies the entropy of the goals from the current posterior distribution. The method can also be used when there are two or more stress variables (e.g. drop height and the hardness of the landing surface). We illustrate the use of the algorithm by simulation and on a real experiment.
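
For readers unfamiliar with the Bruceton (up/down) rule that this abstract uses as its baseline, a minimal sketch follows. It covers only the up/down rule, not the entropy-driven algorithm proposed in the talk; the function names, the fixed step size, and the logistic response curve in the demo are assumptions made for the example.

```python
import math
import random


def bruceton(respond, start, step, n_trials):
    """Run an up/down sequence.

    respond(level) returns True for a positive response (e.g. damage).
    After a response the next stress level is lowered by one step;
    after a non-response it is raised by one step.
    Returns the list of (level, response) pairs in test order.
    """
    level = start
    history = []
    for _ in range(n_trials):
        outcome = bool(respond(level))
        history.append((level, outcome))
        level = level - step if outcome else level + step
    return history


if __name__ == "__main__":
    rng = random.Random(0)

    def simulated_drop_test(height):
        # Hypothetical logistic response curve with its median at height 10.
        p_damage = 1.0 / (1.0 + math.exp(-(height - 10.0)))
        return rng.random() < p_damage

    for level, damaged in bruceton(simulated_drop_test, start=8.0, step=1.0, n_trials=12):
        print(level, damaged)
```

The rule concentrates test levels near the median of the response curve, which is one reason a design aimed at other goals, such as an extreme quantile, may need a different way of choosing the next level.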

**Problem Framing: Essential to Successful Statistical Engineering Applications**

Roger Hoerl, Union College

The first two phases of the statistical engineering process are to identify the problem and to properly structure it. These steps relate to work that is often referred to elsewhere as framing of the problem. While these are obviously critical steps, we have found that problem-solving teams often “underwhelm” these phases, perhaps being over-anxious to get to the analytics. This approach typically leads to projects that are “dead on arrival” because different parties have different understandings of what problem they are actually trying to solve. In this expository article, we point out evidence for a consistent and perplexing lack of emphasis on these first two phases in practice, review some highlights of previous research on the problem, offer tangible advice for teams on how to properly frame problems to maximize the probability of success, and share some real examples of framing challenging problems.

**Testing the Prediction Profiler with Disallowed Combinations—A Statistical Engineering Case Study**

Yeng Saanchi, North Carolina State University

The prediction profiler is an interactive display in JMP statistical software that allows a user to explore the relationships between multiple factors and responses. A common use case of the profiler is for exploring the predicted model from a designed experiment. For experiments with a constrained design region defined by disallowed combinations, the profiler was recently enhanced to obey such constraints. In this case study, we show how our validation approach for this enhancement touched upon many of the fundamental principles of statistical engineering. The team approached the task as a design of experiments problem.

### Session 6 (1:30-3PM)

**Multi-criteria Evaluation and Selection of Experimental Designs from a Catalog**

Mohammed Saif Ismail Hameed, José Núñez Ares, and Peter Goos, University of Leuven

In recent years, several researchers have published catalogs of experimental plans. First, there are several catalogs of orthogonal arrays, which allow experimenting with two-level factors as well as multi-level factors. The catalogs of orthogonal arrays with two-level factors include alternatives to the well-known Plackett-Burman designs. Second, recently, a catalog of orthogonal minimally aliased response surface designs (or OMARS designs) appeared. OMARS designs bridge the gap between the small definitive screening designs and the large central composite designs, and they are economical designs for response surface modeling. Third, catalogs of D- and A-optimal main-effect plans have been enumerated. Each of these catalogs contains dozens, thousands or millions of experimental designs, depending on the number of runs and the number of factors, and choosing the best design for a particular problem is not a trivial matter. In this presentation, we introduce a multi-objective method based on graphical tools to select a design. Our method analyzes the trade-offs between the different experimental quality criteria and the design size, using techniques from multi-objective optimization. Our procedure presents an advantage compared to the optimal design methodology, which usually considers only one criterion for generating an experimental design. Additionally, we will show how our methodology can be used for both screening and optimization experimental design problems. Finally, we will demonstrate a novel software solution, illustrating its application for a few industrial experiments.
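
As a rough illustration of the multi-objective idea, the sketch below filters a small catalog down to its non-dominated (Pareto-optimal) designs. It is not the authors' graphical method. The catalog entries and the two criteria (run size and an average-variance measure, both to be minimized) are hypothetical.

```python
def dominates(a, b):
    """True if criteria vector a is at least as good as b everywhere
    and strictly better somewhere (all criteria are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def pareto_front(catalog):
    """Return the names of designs not dominated by any other design."""
    return [
        name
        for name, crit in catalog.items()
        if not any(
            dominates(other, crit)
            for other_name, other in catalog.items()
            if other_name != name
        )
    ]


if __name__ == "__main__":
    # Hypothetical catalog: name -> (number of runs, average variance).
    catalog = {
        "A": (12, 0.30),
        "B": (16, 0.20),
        "C": (16, 0.25),
        "D": (20, 0.20),
    }
    print(pareto_front(catalog))
```

In this toy catalog, designs C and D are dominated by B, so only A and B remain as candidates for a final trade-off between cost and precision.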

**Functional Response from a Mixture Experiment: Design Choice and Analysis**

Mona Khoddam, Ventana Medical Systems

Mixture experiments are commonly used in situations when the levels of experimental factors consist of varying proportions of several components that add up to a constant. Previous research on the design and analysis of mixture experiments has focused mainly on single-response values. In certain cases, however, the response of interest may be a series of data points collected over a continuum, which is referred to as functional data in the literature. As an example from the chemical industry, viscosity over varying shear rates is a critical-to-customer attribute because it directly affects the consistency or “flow” of products such as shampoo, body wash, or cleansers. Using a single viscosity value at a fixed shear rate would fail to capture the differences in rheological properties among different chemical formulations. Further, it may be of interest to optimize the formulation to target a competing product’s rheological profile. In this presentation, we deal with the problem of designing and analyzing mixture experiments when the response is functional. Space-filling designs (SFD) and $I$-optimal designs, constructed using standard design software such as JMP, are compared and contrasted with respect to measures of predictive performance. Due to the small-sample attributes of running experiments with mixtures, we also demonstrate the advantages of using self-validated ensemble modeling (SVEM) for simultaneous estimation and validation of the predictive models.

**Multi-layer Sliced Designs with Application in AI Assurance**

Qing Guo, Virginia Tech, Peter Chien, University of Wisconsin, Madison, and Xinwei Deng,

Motivated by the importance of enhancing AI assurance with respect to configuration and hyper-parameters in AI algorithms, this work provides an experimental design angle to address this challenging problem. Our focus is to conduct an efficient experimental design to quantify and detect the effects of hyper-parameters affecting the performance of AI algorithms. Specifically, we propose a multi-layer sliced design to enable quantifying the effects of sliced factors and design factors, such that it can account for hyper-parameters having different effects under different configurations of the AI algorithm. Moreover, we develop a novel estimation procedure to estimate the effects of these factors and test for their significance. The performance of the proposed design and analysis is evaluated by both simulation studies and real-world AI applications.

**Planning Reliability Assurance Tests for Autonomous Vehicles**

Simin Zheng and Yili Hong, Virginia Tech, Lu Lu, University of South Florida, and Jian Liu, University of Arizona

Artificial intelligence (AI) technology has become increasingly prevalent and is transforming our everyday life. One important application of AI technology is the development of autonomous vehicles (AVs). AVs combine robotic process automation and AI, and have become an essential representation of AI applications. Hence, the reliability of an AV needs to be carefully demonstrated via an assurance test so that the product can be used with confidence in the field. To plan for an assurance test, one needs to determine how many AVs need to be tested for how many miles. Existing research has made great efforts in investigating reliability demonstration tests in other fields of study for product development and assessment. However, statistical methods have not been utilized in AV test planning. This paper aims to fill this gap by developing statistical methods for planning AV reliability assurance tests based on recurrent event data. We explore the relationship between multiple criteria of interest in the context of planning reliability assurance tests. Specifically, two test plans based on homogeneous and non-homogeneous Poisson processes are developed, with recommendations offered for practical use. The disengagement event data from the California Department of Motor Vehicles AV testing program are used to illustrate the proposed assurance testing methods. We conclude the paper with suggestions for practitioners.
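
To make the planning question concrete, here is a minimal sketch for the simplest textbook case, which is not necessarily the plan developed in the paper. It assumes events follow a homogeneous Poisson process with a common constant rate per mile, so only total fleet mileage matters. The function names, the target rate, and the confidence level are illustrative assumptions.

```python
import math


def poisson_cdf(c, mean):
    """P(N <= c) for N ~ Poisson(mean)."""
    term = math.exp(-mean)
    total = term
    for k in range(1, c + 1):
        term *= mean / k
        total += term
    return total


def required_miles(rate_target, max_events, alpha, tol=1e-9):
    """Smallest total mileage T such that observing at most max_events
    events demonstrates rate <= rate_target with confidence 1 - alpha,
    i.e. P(N <= max_events | rate_target * T) <= alpha."""
    lo, hi = 0.0, 1.0 / rate_target
    # Expand the upper bracket until the passing probability is small enough.
    while poisson_cdf(max_events, rate_target * hi) > alpha:
        hi *= 2.0
    # Bisection: the passing probability decreases monotonically in T.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if poisson_cdf(max_events, rate_target * mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    total = required_miles(rate_target=1e-4, max_events=0, alpha=0.05)
    print(f"total miles: {total:.0f}, per vehicle with 10 AVs: {total / 10:.0f}")
```

With zero allowed events the answer reduces to the closed form $-\ln(\alpha)/\lambda_0$. Allowing more events raises the required mileage but makes the test less likely to fail a fleet that actually meets the target.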

**Assessing Variable Activity for Bayesian Regression Trees**

Akira Horiguchi, Duke University, and Matthew T. Pratola and Thomas J. Santner, The Ohio State University

Bayesian Additive Regression Trees (BART) are non-parametric models that can capture complex exogenous variable effects. In any regression problem, it is often of interest to learn which variables are most active. Variable activity in BART is usually measured by counting the number of times a tree splits on each variable. Such one-way counts have the advantage of fast computations. Despite their convenience, one-way counts have several issues. They are statistically unjustified, cannot distinguish between main effects and interaction effects, and become inflated when measuring interaction effects. A well-established alternative in the literature is Sobol´ indices, a variance-based global sensitivity analysis technique. However, these indices often require Monte Carlo integration, which can be computationally expensive. This paper provides analytic expressions for Sobol´ indices for BART posterior samples. These expressions are easy to interpret and computationally feasible. We also provide theoretical guarantees of contraction rates. Furthermore, we will show a fascinating connection between first-order (main-effects) Sobol´ indices and one-way counts. Finally, we compare these methods using analytic test functions and the En-ROADS climate impacts simulator.
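
For readers unfamiliar with the terminology, the standard definitions of the variance-based indices are sketched below. The notation is an assumption for illustration and is not taken from the paper: $f$ is the regression function, $X = (X_1, \dots, X_d)$ are independent inputs, and $X_{-i}$ denotes all inputs except $X_i$.

$$
S_i = \frac{\mathrm{Var}\big(\mathrm{E}[f(X) \mid X_i]\big)}{\mathrm{Var}\big(f(X)\big)},
\qquad
T_i = 1 - \frac{\mathrm{Var}\big(\mathrm{E}[f(X) \mid X_{-i}]\big)}{\mathrm{Var}\big(f(X)\big)}
$$

Here $S_i$ is the first-order (main-effect) index and $T_i$ is the total-effect index, which also includes all interactions involving $X_i$. The conditional expectations are the integrals that typically call for Monte Carlo approximation when no closed form is available.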

**The Effectiveness and Flexibility of a Predictive Distribution Approach to Process Optimization with Multiple Responses**

John Peterson, PDQ Research and Consulting and Enrique del Castillo, The Pennsylvania State University

One can often view a manufacturing process as a stochastic process that has several sources of variation and uncertainty. There is variation due to input materials, ambient conditions, operator-to-operator changes, etc. In addition, the use of predictive models involves parameter uncertainty and model form lack-of-fit. Therefore, process optimization in the face of such uncertainties, and complicated by the presence of multiple quality responses, can be a challenge for effective and flexible quantification and associated statistical inference. Over the past two decades several papers have appeared which have utilized predictive distribution approaches to effectively address multiple response process optimization problems. These include both early and later stage optimization problems. This presentation will provide an overview of the predictive distribution approach and discuss computational tools as well.