October 8th

Session 1 (9:15-10AM)

 

Assessing Predictive Capability for Binary Classification Models

Mindy Hotchkiss, Enquery Research LLC

Abstract: Classification models for binary outcomes are in widespread use across a variety of industries. Results are commonly summarized in a misclassification table, also known as an error or confusion matrix, which indicates correct versus incorrect predictions for different circumstances. Models are developed to minimize both false positive and false negative errors, but the optimization process used to train the model necessarily involves cost-benefit trade-offs. However, how to obtain an objective assessment of the performance of a given model in terms of predictive capability or benefit is less well understood, due both to the plethora of options described in the literature and to the largely overlooked influence of noise factors, specifically class imbalance. Many popular measures are susceptible to effects arising from underlying differences in how the data are allocated by condition, and these effects cannot be easily corrected.

This talk considers the wide landscape of possibilities from a statistical robustness perspective. Results are shown from sensitivity analyses of several popular metrics under a variety of conditions, highlighting potential concerns with respect to machine learning and ML-enabled systems. Recommendations are provided for correcting imbalance effects, as well as for a simple statistical comparison that disentangles the beneficial effects of the model itself from those of imbalance. Results are generalizable across model types.
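As a quick illustration of the imbalance effect described above (a hedged numerical sketch with invented numbers, not the speaker's analysis), the snippet below holds a classifier's sensitivity and specificity fixed and shows how several popular confusion-matrix summaries shift as class prevalence changes:

```python
import numpy as np

sens, spec = 0.85, 0.85            # fixed true-positive and true-negative rates (assumed)
n = 100_000
for prevalence in (0.5, 0.2, 0.05, 0.01):
    pos = prevalence * n
    neg = n - pos
    tp, fn = sens * pos, (1 - sens) * pos
    tn, fp = spec * neg, (1 - spec) * neg
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    f1 = 2 * precision * sens / (precision + sens)
    balanced_acc = (sens + spec) / 2
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    print(f"prevalence={prevalence:5.2f}  accuracy={accuracy:.3f}  precision={precision:.3f}  "
          f"F1={f1:.3f}  balanced_acc={balanced_acc:.3f}  MCC={mcc:.3f}")
```

Accuracy and balanced accuracy stay flat here, while precision, F1, and MCC fall as the positive class becomes rare, even though the classifier itself has not changed.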

 

Optimal Design of Experiments for Powerful Equivalence Testing

Peter Goos, KU Leuven

Abstract: In manufacturing, an optimized process typically operates under a target condition with specific parameter settings. Products produced under this condition meet the desired quality standards. However, these parameters can inevitably deviate from their target settings during routine production. For instance, the actual pH level of a blend may differ from the target setting due to variability in the materials used to calibrate the pH meter. Similarly, production delays can cause the actual temperature of a material to vary by a few degrees from its target setting. To assess whether products still meet quality specifications despite these uncontrollable variations in process parameters, manufacturers can conduct an equivalence study. In such studies, the ranges of the process parameters correspond to the upper and lower bounds of the observed deviations from the target settings, referred to as the normal operating ranges. The manufacturer then compares the quality attributes of products produced within the normal operating ranges of the process parameters to those produced under the target condition. If the differences in quality fall within an acceptable range, the products are considered practically equivalent, indicating that the manufacturing process is robust and capable of consistently producing quality products.

In this presentation, we adapt existing methods for calculating power in bioequivalence studies to the context of industrial experimental design. We also introduce a novel design criterion, termed “PE-optimality,” to generate designs that allow for powerful equivalence testing in industrial experiments. An adequate design for equivalence testing should provide a high probability of declaring equivalence of the mean responses at various locations within the normal operating ranges of process parameters and the mean response at the target condition, when equivalence truly holds. The PE-optimality design criterion achieves this by performing prospective power analyses at a large number of locations within the experimental region and selecting a design that ensures sufficiently high power across a substantial portion of the region.
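As a rough illustration of the kind of prospective power analysis involved (a hedged sketch with illustrative numbers; the margin, sample size, and known-variance z-tests are assumptions, not the PE-optimality criterion itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma, delta, alpha = 8, 1.0, 1.5, 0.05    # replicates per condition, noise SD, margin, level
true_diff = 0.0                               # equivalence truly holds
se = sigma * np.sqrt(2 / n)                   # SE of the difference between two sample means
z = stats.norm.ppf(1 - alpha)

d_hat = rng.normal(true_diff, se, 100_000)    # simulated estimated differences
declare_equiv = ((d_hat + delta) / se > z) & ((d_hat - delta) / se < -z)   # two one-sided tests
print("estimated probability of declaring equivalence:", declare_equiv.mean())
```

Repeating such a calculation at many locations in the operating region, with the prediction variance implied by a candidate design, is the kind of computation a power-oriented design criterion must perform.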

 

Can ChatGPT Think Like a Quality and Productivity Professional?

Jennifer Van Mullekom and Anne Driscoll, Virginia Tech

Abstract: As the capability of generative AI (GenAI) evolves at a rapid pace, organizations are eager to leverage it for efficiency gains. Various GenAI tools include a data analysis component. More than any other disruptive technology in recent years, GenAI has the potential to change both how data analysis is accomplished and by whom. Consequently, it will change how we educate those who pose data-centric questions and try to answer them in the future.

With all the buzz and the hype, we were curious: “Can ChatGPT Think Like a Quality and Productivity Professional?” Come to the talk and we’ll let you know the answer. Through a pilot at Virginia Tech, faculty and graduate students are evaluating the ChatGPT EDU Advanced Data Analytics tool. Our evaluation is none other than a systematic one using a designed experiment. (After all, we are quality and productivity professionals!) Our evaluation spans select prompt types, specificity of the prompts, and applications in Quality and Productivity. Results will be reported in both quantitative and qualitative terms. Our talk will include a brief overview of GenAI for data analysis, an overview of our study design, use case examples, and the results of our pilot, culminating in recommendations and best practices that you can leverage regardless of your position or organization.

Session 2 (10:30AM-12PM)

 

Monitoring Parametric, Nonparametric, and Semiparametric Linear Regression Models using a Multivariate CUSUM Bayesian Control Chart

Abdel-Salam G. Abdel-Salam, Qatar University

Abstract: In this work we develop a Bayesian multivariate cumulative sum (mCUSUM) control chart for monitoring multiple response variables for count data using their model coefficients. We use the squared error loss function for parameter estimation of models fit using parametric, nonparametric, and semiparametric regression methods. We deploy the nonparametric penalized splines (p-splines) and semiparametric model robust regression 1 (MRR1) methods, comparing their regression accuracy via mean squared error (MSE), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). To assess the chart’s out-of-control detection capabilities, we perform simulation studies for both the hyper-parameters and the sample size. After the results are confirmed, we provide a real application to suicide count data made available by the World Health Organization (WHO) to predict and monitor global suicide rates.

 

To Drive Improvement: Act on Every Count

James Lucas, J. M. Lucas and Associates

Abstract: The act on every count procedure (AOEC) is an effective way to drive improvement. It is useful when a count indicates the occurrence of an event of interest, usually an adverse event, and the goal is to reduce the occurrence of adverse events.

This talk will discuss the AOEC procedure. A unique aspect of this talk is that the title almost tells the whole story, so a major part of the talk is a discussion of situations where the AOEC procedure has been successful in driving improvement. Two situations where the AOEC procedure has been successfully implemented are the FAA’s approach to airline accidents and DuPont’s benchmarked safety system. We then describe two situations where the AOEC procedure should be implemented: hospital errors and police killings of unarmed civilians. We also discuss barriers to the wider use of the AOEC procedure.

The DuPont Company’s Product Quality Management manual differentiated between low-count and rare event quality problems. It stated: “A property that has a low count of nonconformances is presumed to be a phenomenon continuously (or at least often) present in normal product, even though only infrequently counted in a typical routine sample. A rare event quality problem is presumed to be entirely absent in all normal product, and to be due to a specific unusual malfunction in each specific instance of a quality breakdown. It is not always important to distinguish between these two categories of problems because the same counted data CUSUM technology is applied in either case.” The last sentence refers to the fact that the DuPont PQM quality system used CUSUM for monitoring both variables and counts; it is the largest known CUSUM implementation. The AOEC procedure is a special case of a CUSUM, and also of a Shewhart monitoring procedure so it has all the optimality properties of both procedures. The rare event quality problem model is more applicable for the AOEC procedure because each count is considered to be due to its own assignable cause. The goal action for the AOEC procedure is the removal of the assignable cause thereby driving improvement.

As background, “Detection Possibilities When Count Levels Are Low” (Lucas et al., 2025) will be discussed. A “portable” version of the 2-in-m control procedure for detecting an order-of-magnitude shift and a conceptual version of a procedure for detecting a doubling will be provided. This background quantifies why it is almost never feasible to detect shifts of a doubling or smaller when count levels are low. In low-count situations, the AOEC procedure should be considered.
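A hedged numerical sketch of the low-count argument (illustrative rates, not figures from the talk): any signaling rule strict enough to keep false alarms rare at a very low in-control rate will also signal only rarely after the rate doubles.

```python
from scipy.stats import poisson

lam0, lam1 = 0.05, 0.10                      # in-control rate and a doubled rate (illustrative)
for c in (1, 2):
    false_alarm = poisson.sf(c - 1, lam0)    # P(count >= c) per period when in control
    detect = poisson.sf(c - 1, lam1)         # P(count >= c) per period after the doubling
    print(f"signal at >= {c} events: false-alarm prob {false_alarm:.4f}, "
          f"detection prob {detect:.4f}, ~{1 / detect:.0f} periods to signal on average")
```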

 

Vecchia Approximated Bayesian Heteroskedastic Gaussian Processes

Parul Vijay Patil, Virginia Tech

Abstract: Many computer simulations are stochastic and exhibit input-dependent noise. In such situations, heteroskedastic Gaussian processes (hetGPs) make for ideal surrogates as they estimate a latent, non-constant variance. However, existing hetGP implementations are unable to deal with large simulation campaigns and use point estimates for all unknown quantities, including latent variances. This limits applicability to small simulation campaigns and undercuts uncertainty quantification (UQ). We propose a Bayesian framework to fit hetGPs using elliptical slice sampling (ESS) for latent variances, improving UQ, and Vecchia approximation to circumvent computational bottlenecks. We are motivated by the desire to train a surrogate on a large (8-million run) simulation campaign for lake temperature forecasts provided by the Generalized Lake Model (GLM) over depth, day, and horizon. GLM simulations are deterministic, in a sense, but when driven by NOAA weather ensembles they exhibit the features of input-dependent noise: variance changes over all inputs, but in particular increases substantially for longer forecast horizons. We show good performance for our Bayesian (approximate) hetGP compared to alternatives on those GLM simulations and other classic benchmarking examples with input-dependent noise.

 

Impact of the Choice of Hyper-parameters on Statistical Inference of SGD Estimates

Yeng Saanchi, JMP

Abstract: In an era where machine learning is pervasive across various domains, understanding the characteristics of the underlying methods that drive these algorithms is crucial. Stochastic Gradient Descent (SGD) and its variants are key optimization techniques within the broader class of Stochastic Approximation (SA) algorithms. Serving as the foundation for most modern machine learning algorithms, SGD has gained popularity due to its efficient use of inexpensive gradient estimators to find the optimal solution of an objective function. Its computational and memory efficiency makes it particularly well-suited for handling large-scale datasets or streaming data.

SGD and its variants are widely applied in fields such as engineering, computer science, applied mathematics, and statistics. However, due to early stopping, SGD typically produces estimates that are not exact solutions of the empirical loss function. The difference between the SGD estimator and the true minimizer is influenced by factors such as the observed data, the tuning parameters of the SGD method, and the stopping criterion. While these methods have been successful in a wide range of applications, SGD can be erratic and highly sensitive to hyper-parameter choices, often requiring substantial tuning to achieve optimal results.

To explore the impact of step size scheduling on SGD’s accuracy and coverage probability, we conduct a simulation study. Additionally, we propose a new approach for hyper-parameter tuning that combines the double bootstrap method with the Simultaneous Perturbation Stochastic Approximation (SPSA) technique.
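To make the sensitivity concrete, here is a hedged sketch (an illustrative linear-regression model and step-size schedules of our own choosing, not the authors' simulation study) comparing SGD estimates under different schedules a_k = a0 / (1 + k)^gamma against the exact least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1)                      # true coefficients
y = X @ beta + rng.normal(size=n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # exact empirical minimizer

def sgd(a0, gamma, epochs=5):
    b, k = np.zeros(p), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            step = a0 / (1 + k) ** gamma          # step-size schedule a_k = a0 / (1 + k)^gamma
            b -= step * (X[i] @ b - y[i]) * X[i]  # single-observation gradient step
            k += 1
    return b

for a0, gamma in [(0.1, 0.5), (0.01, 0.5), (0.1, 1.0)]:
    gap = np.linalg.norm(sgd(a0, gamma) - beta_ols)
    print(f"a0={a0}, gamma={gamma}: ||beta_SGD - beta_OLS|| = {gap:.4f}")
```

The gap between the SGD iterate and the exact minimizer, and hence the behavior of any inference built on the SGD estimate, depends visibly on the schedule.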

 

Screening Designs for Continuous and Categorical Factors

Ryan Lekivetz, JMP

Abstract: In many screening experiments, we often need to consider both continuous and categorical factors. In this presentation, we introduce a new class of saturated designs that effectively address this need. These designs include m three-level continuous factors and m–1 two-level factors—either categorical or continuous—within just n = 2m runs, where m is at least 4.

A major advantage of our approach is its flexibility: these designs are available for any even number of runs starting from 8. Depending on whether n is a multiple of 8, 4, or just 2, the designs exhibit varying degrees of orthogonality.

We demonstrate that these designs typically have power near one for identifying up to m active main effects when the signal-to-noise ratio is greater than 1.5.

 

Rerandomization Algorithms for Optimal Designs of Network A/B Tests

Qiong Zhang, Clemson

Abstract: A/B testing is an effective method to assess the potential impact of two treatments. For A/B tests conducted by IT companies like Meta and LinkedIn, the test users can be connected and form a social network. Users’ responses may be influenced by their network connections, and the quality of the treatment estimator of an A/B test depends on how the two treatments are allocated across different users in the network. This paper investigates optimal design criteria based on some commonly used outcome models, under assumptions of network-correlated outcomes or network interference. We demonstrate that the optimal design criteria under these network assumptions depend on several key statistics of the random design vector. We propose a framework to develop algorithms that generate rerandomization designs meeting the required conditions of those statistics under a specific assumption. Asymptotic distributions of these statistics are derived to guide the specification of parameters in the algorithms. We validate the proposed algorithms using both synthetic and real-world networks.
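A hedged sketch of the general rerandomization idea (a toy random graph and an acceptance rule of our own choosing, not the paper's algorithms, criteria, or asymptotic calibration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n = 60
A = (rng.random((n, n)) < 0.08).astype(int)       # toy random graph standing in for a social network
A = np.triu(A, 1)
edges = np.argwhere(A)                            # list of (i, j) edges

for n_draws in itertools.count(1):
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, n // 2, replace=False)] = 1   # balanced treatment assignment
    cut_share = np.mean(z[edges[:, 0]] != z[edges[:, 1]])   # share of edges with discordant treatments
    if abs(cut_share - 0.5) < 0.02:               # acceptance window on the design statistic (illustrative)
        break
print(f"accepted after {n_draws} draws; discordant-edge share = {cut_share:.3f}")
```

In practice the accepted statistics and their windows would be the ones dictated by the assumed outcome model, with thresholds guided by the derived asymptotic distributions.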

Session 3 (2-3:30PM)

 

Probability of Detection: Evaluating the Reliability of Nondestructive Inspection Systems

Christine Knott, Air Force Research Laboratory

Abstract: The reliability of nondestructive inspection (NDI) systems is estimated using statistical methods, the most thorough of which is the Probability of Detection (POD) methodology. The Department of the Air Force uses periodic nondestructive re-inspection of critical structural components to maintain aircraft safety, and POD helps establish the length of inspection intervals. The established methods for POD will be provided, followed by a discussion of recent statistical research which extends and improves upon these methods.
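For readers unfamiliar with POD modeling, here is a hedged sketch of the standard hit/miss approach on simulated data (illustrative values only; not Air Force data or the specific extensions to be discussed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
size = rng.uniform(0.2, 3.0, 200)                       # simulated flaw sizes (arbitrary units)
true_pod = 1 / (1 + np.exp(-(np.log(size) - np.log(1.0)) / 0.3))
hit = rng.binomial(1, true_pod)                         # simulated detect / no-detect outcomes

X = sm.add_constant(np.log(size))
fit = sm.GLM(hit, X, family=sm.families.Binomial()).fit()   # logistic POD curve versus log size
b0, b1 = fit.params
a90 = np.exp((np.log(0.9 / 0.1) - b0) / b1)             # flaw size with fitted POD = 0.90
print(f"estimated a90 = {a90:.2f}")
```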

 

Experiment Design and Modeling for Nondestructive Evaluation at NASA

Peter A. Parker, NASA

Abstract: NASA requirements for human-rated spaceflight systems rely on nondestructive evaluation (NDE) methods to reliably detect critical defects in hardware before flight. Diagnostic measurement techniques such as eddy current, ultrasound, and radiography are used to inspect flight hardware in a noninvasive, harmless manner. The performance of an NDE inspection protocol is assessed in a probability of detection (POD) study, which involves experimental demonstration of defect detection and statistical modeling of detection capability and reliability as a function of defect size. POD studies are resource-intensive; therefore, leveraging statistical experiment design is essential. POD studies are inherently interdisciplinary, as they integrate the physics of the inspection modality, material properties, defect morphology, operational inspection access, and human factor interactions with the inspection technique. These studies may confirm that a well-established inspection protocol meets the required detection capability, or they may involve experimentally optimizing an NDE technique for a specific application. NASA’s NDE discipline emerged in the early 1970s to support the Space Shuttle Program. While the statistical methods used in POD assessments have advanced over the past 55 years, POD practice and research remain a narrow field of expertise. This presentation seeks to broaden awareness and promote involvement of practitioners and academics in NDE. It provides an overview of statistical design and modeling aspects of POD studies, highlights recent methodological advancements, and identifies challenging research opportunities to meet evolving aerospace requirements.

 

A Scalable Algorithm for Generating Non-Uniform Space-Filling Designs

Xiankui Yang,

Abstract: Traditional space-filling designs have long been popular for achieving uniform spread or coverage throughout the design space for diverse applications. Non-uniform space-filling (NUSF) designs were recently developed to achieve flexible densities of design points, allowing users to emphasize and de-emphasize different regions of the input space. However, with a point exchange algorithm, the construction of NUSF designs entails substantial computational cost, particularly for higher-dimensional scenarios. To improve computing efficiency and scalability, we propose a new algorithm consistent with the fundamentals of NUSF, termed Quick Non-Uniform Space-Filling (QNUSF) designs. By combining hierarchical clustering with group average linkage and strategic point selection methods, the QNUSF algorithm expedites computation. We present two point selection methods, maximin and minimax QNUSF, to achieve non-uniformity with different emphases on the spread or coverage of the design, facilitating broader adaptability. In addition, QNUSF designs allow great flexibility for handling discrete or continuous, regular or irregular input spaces to achieve the desired density distribution, improving versatility and applicability for different experimental goals. The computational efficiency and performance of QNUSF designs are illustrated via several examples.
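A hedged sketch of the general clustering-plus-selection idea (a simplified stand-in with an invented weight function and a resampling step of our own, not the QNUSF algorithm itself):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
cand = rng.random((400, 2))                                   # candidate points in [0, 1]^2
weight = np.exp(-8 * np.sum((cand - 0.25) ** 2, axis=1))      # desired density, peaked near (0.25, 0.25)

boosted = cand[rng.choice(len(cand), 1000, p=weight / weight.sum())]   # weight-proportional resample
Z = linkage(pdist(boosted), method="average")                 # hierarchical clustering, group average linkage
labels = fcluster(Z, t=20, criterion="maxclust")              # ask for 20 design points
design = np.array([boosted[labels == k].mean(axis=0) for k in np.unique(labels)])
print(len(design), "design points; more of them fall where the weight is high")
```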

A Critique of Neutrosophic Statistical Analysis Illustrated with Interval Data from Designed Experiments

William Woodall, Virginia Tech

Abstract: Recent studies have explored the analysis of data from experimental designs using neutrosophic statistics. These studies have reported neutrosophic bounds on the statistics in analysis of variance tables. In this paper, following Woodall et al. (2025), a simple simulation-based approach is used to demonstrate that the reported neutrosophic bounds on these statistics are either incorrect or too inaccurate to be useful. We explain why the neutrosophic calculations are incorrect using two simple examples.
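A hedged sketch of a simulation check in that spirit (illustrative interval data, not the examples from the paper): sample each observation inside its interval many times, run the ANOVA each time, and record the achievable range of the F statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
lo = np.array([[4.0, 5.1, 4.6, 5.4],      # lower endpoints, group 1 (illustrative intervals)
               [6.2, 5.8, 6.5, 6.0]])     # lower endpoints, group 2
hi = lo + 0.6                             # upper endpoints

f_vals = []
for _ in range(20_000):
    x = rng.uniform(lo, hi)               # one realization with each value inside its interval
    f_vals.append(stats.f_oneway(x[0], x[1]).statistic)
print(f"simulated range of the one-way ANOVA F statistic: [{min(f_vals):.2f}, {max(f_vals):.2f}]")
```

Published neutrosophic bounds can then be checked directly against ranges obtained this way.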

 

Large Row-Constrained Supersaturated Designs for High-throughput Screening

Byran Smucker, Henry Ford Health

Abstract: High-throughput screening, in which large numbers of compounds are traditionally studied one at a time in multiwell plates, is widely used across many areas of the biological and chemical sciences, including drug discovery. To improve the efficiency of these screens, we propose a new class of supersaturated designs that guide the construction of pools of compounds in each well. Because the size of the pools is typically limited by the particular application, the new designs accommodate this constraint and are part of a larger procedure that we call Constrained Row Screening, or CRowS. We develop an efficient computational procedure to construct CRowS designs, provide some initial lower bounds on the average squared off-diagonal values of their main-effects information matrix, and study the impact of the constraint on design quality. We also show via simulation that CRowS is statistically superior to the traditional one-compound-one-well approach as well as to an existing pooling method, and we provide results from two separate applications related to the search for solutions to antibiotic-resistant bacteria.

 

Optimizing User Experience in Statistical Tools through Experimental Design

Jacob Rhyne and Mark Bailey, JMP

Abstract: Modern statistical software is increasingly used by people with a wide range of statistical training to address complex real-world problems. However, developers of such software may not always consider assessing its usability. In this talk, we present case studies that assess the usability of statistical software using a design of experiments approach. Topics include user interaction with the software, correct interpretation of results, and accommodating users of varying expertise levels.

 

October 9th

Session 4 (8-9:30AM)

Predictive Model Monitoring and Assessment in the Lubrizol Q.LIFE® Formulation Optimization System

Kevin Manouchehri, Lubrizol

Abstract: Q.LIFE® is a comprehensive predictive system for solving complex formulation optimization problems. A key piece of this Lubrizol system is its suite of empirical predictive models. Depending on the application, prior knowledge, and the quality, quantity, and nature of the data, models are developed using techniques ranging from least-squares regression to complex ensemble models. The integration of Q.LIFE® into Lubrizol formulation strategy and practice has created the need for models to be assessed and monitored using control charts, residual checks, and internal algorithms that compare similar formulations. Lubrizol’s techniques for model assessment and monitoring as new data are generated, embedded within Q.LIFE®, are demonstrated, along with some of Lubrizol’s ideas and techniques for estimating the error around predictions regardless of model origin.

 

A Case Study in Image Analysis for Engine Cleanliness

Quinn Frank, Lubrizol

Abstract: Laboratory Engine Tests are used in the evaluation of engine oil quality and capability. Assessments range from chemical and physical analysis of the used oil after the completion of the test to the rating of test parts for sludge, varnish, and deposits. The rating of used parts after a test is complete is typically done by trained raters who examine pistons, rings, liners, and screens in a clean environment with standardized time windows, lighting, rating instructions, and other related criteria. Raters receive both training and calibration credentials in mandatory rating workshops where rater repeatability and reproducibility are measured and assessed. In these workshops, differences in ratings are also compared between actual test parts and digital photographs of test parts. When possible, the use of digital photographs is much more cost-effective for training and assessment because it eliminates the need to transport people and test parts around the world. In most cases, however, the ratings are more accurate for the actual test parts than for the photographs.

There is a unique opportunity in the rating of Oil Screen Clogging in the 216-hour Sequence VH Sludge Test for passenger car motor oils. Rater repeatability and reproducibility for both parts and photographs are considered poor. We therefore took the opportunity to conduct a study to generate data and develop a machine learning model and algorithm to rate the digital photographs. Our database consists of ratings by multiple raters on both parts and photographs for an array of clean, dusty, dirty, and completely clogged engine oil screens. Our case study culminates in a comparison of repeatability and reproducibility across three scenarios: raters rating actual parts, raters rating digital photographs, and models rating digital photographs.

 

The Power of Foldover Designs

Jonathan Stallrich, North Carolina State University

Abstract: The analysis of screening designs is often based on a second-order model with linear main effects, two-factor interactions, and quadratic effects. When the main effect columns are orthogonal to all the second-order terms, a two-stage analysis may be conducted, starting with fitting a main-effects-only model. A popular technique to achieve this orthogonality is to take any design and append its foldover runs. In this talk, we show that this foldover technique is even more powerful than originally thought because it also includes opportunities for unbiased estimation of the variance, either by pure error or by lack of fit. We find optimal foldover designs for main effect estimation and other designs that balance main effect estimation and model selection for the important factors. A real-life implementation of our new designs involving 8 factors and 20 runs is discussed.
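A hedged sketch of the foldover property itself (a generic random starting design, not the optimal designs from the talk): appending the sign-switched runs makes every main-effect column exactly orthogonal to the intercept, all two-factor interactions, and all quadratic columns.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
D = rng.choice([-1.0, 1.0], size=(10, 4))          # any starting two-level design
Df = np.vstack([D, -D])                            # append the foldover runs

second_order = [np.ones(len(Df))]                  # intercept
second_order += [Df[:, i] * Df[:, j] for i, j in combinations(range(4), 2)]   # two-factor interactions
second_order += [Df[:, i] ** 2 for i in range(4)]  # quadratic columns

cross = [[Df[:, k] @ s for s in second_order] for k in range(4)]
print(np.abs(cross).max())                         # 0.0: main effects orthogonal to all second-order terms
```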

 

Applications of Weibull Based Parametric Regression for Survival Data

Chad Foster and Sarah Burke, GE Aerospace

Abstract: Survival analysis is routinely used to characterize the life distribution of a part or system based on a time or usage parameter (e.g., hours, number of cycles, days since manufacture). When the population can be divided into groups, a consistent way is required to obtain reliability estimates for sub-groups that may differ only in damage accumulation rate and not in failure mechanism. In the aerospace industry, for instance, if data are limited to the number of flights but failure depends on the length of the flight, the population could be divided into two groups: domestic (short flights) and international (long flights). Separate probability distributions could be fit to each group, iterating to ensure a common shape parameter. Alternatively, the shape parameter and the two scale parameters could be fit through a life regression model.

In practice, more complex situations occur. Components of systems are frequently traded, moved, and reused numerous times throughout their lives. Reliability estimates are required to inform the timing of maintenance activities, part removal, or other field management actions. Life regression models can be useful for managing issues in the field while accounting for these complex populations.

In this presentation, the foundation and background of life regression models using the Weibull distribution are given and supplemented by several examples of field management in the aerospace industry. The method is contrasted with Cox proportional hazards regression, which does not require distributional assumptions but also does not provide survival time estimates. It is also highlighted that many real-world applications exhibit parameter covariance and thus require log-likelihood significance tests. This presentation will guide practitioners on how to use life regression models in actual complex situations.
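A hedged sketch of a Weibull life-regression fit with a common shape parameter and group-specific scales (simulated data, group labels, and a censoring point chosen for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
shape_true, scales_true = 2.0, np.array([800.0, 500.0])   # common shape, group-specific scales
group = rng.integers(0, 2, 300)                            # e.g., 0 = short flights, 1 = long flights
t = scales_true[group] * rng.weibull(shape_true, 300)
cens = t > 900                                             # right-censor at 900 (illustrative)
t = np.minimum(t, 900.0)

def negloglik(par):
    k = np.exp(par[0])                                     # common Weibull shape
    eta = np.exp(np.where(group == 0, par[1], par[2]))     # group-specific scales
    z = (t / eta) ** k
    logpdf = np.log(k / eta) + (k - 1) * np.log(t / eta) - z
    return -np.sum(np.where(cens, -z, logpdf))             # censored units contribute log-survival = -z

fit = minimize(negloglik, x0=[0.0, np.log(600), np.log(600)], method="Nelder-Mead")
print("shape:", np.exp(fit.x[0]), " scales:", np.exp(fit.x[1:]))
```

Likelihood-ratio comparisons between this fit and a model with a single shared scale are the kind of log-likelihood significance test the abstract mentions.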

 

Identifying Prognostic Variables across Multiple Historical Clinical Trials

Xueying Liu, Virginia Tech

Abstract: In clinical trials, it is critical to incorporate prognostic variables to improve precision for estimating treatment effects. Per FDA guidance, these prognostic variables need to be prospectively specified in the statistical analysis plan. Therefore, there is a great need to effectively identify prognostic variables from historical studies that are similar to the new study. In this work, we propose a multi-task learning approach to identify prognostic variables from multiple related historical clinical trials. Specifically, a bi-level constraint variable selection (Bi-CVS) method is developed to identify both study-specific (within-study) and program-level (cross-study) prognostic variables. In addition, we introduce an efficient algorithm specifically designed for clinical trial settings and investigate its consistency property. The performance of our proposed method is evaluated through simulations for various endpoints, including continuous, binary, and time-to-event. A real clinical application is also presented to illustrate our method.

 

Simulation-Based Approach to Designing Analytical Validation Studies to Assess the Stability of Biological Material

Dilsher Dhillon, Freenome

Abstract: In vitro medical devices are required to analytically validate key performance attributes in order to provide the Food and Drug Administration (FDA) with objective evidence of the safety and efficacy of the device. These attributes include, but are not limited to, limits of blank, detection, and quantitation, accuracy, imprecision, linearity, and stability. Experimental designs and standards established by the Clinical and Laboratory Standards Institute (CLSI) are preferred by the FDA to ensure consistency and transparency across medical devices. However, with novel and complex assay designs that did not exist when the relevant CLSI guidelines and associated examples were developed, tailored designs and power calculations may be needed. To illustrate this, we describe a study that evaluates the stability of biological material used for measuring molecular signatures. Currently, there is no standard published by CLSI for evaluating the stability of biological material. However, CLSI EP25Ed2E (Evaluation of Stability of In Vitro Medical Laboratory Test Reagents) provides guidelines on how to establish the stability of reagent materials and recommends using a linear regression of measurement vs. time. The described regression approach assumes that all of the residual variation in the measurements over time is due to measurement error, but this assumption is violated when evaluating the stability of biological material derived from more than one subject, thus impacting design requirements (e.g., sample size (replicates), number of samples and timepoints, and statistical power). In order to account for biological variability, we use linear mixed effects model simulations to conduct the power analysis. Using knowledge of the measurement error from prior studies and expected levels of variation across subjects, we simulate data from multiple time points. We then vary the slope (the increase or decrease in measurement across time) to estimate the type I and type II error rates, conditional on the number of subjects, timepoints, and replicates per timepoint. The results of the simulations show that the number of unique subjects and the number of timepoints have the greatest impact on increasing the power to detect a change in the slope. They also help provide guidance on the required number of unique samples as well as the number of time points and replicates per time point. This simulation-based power calculation and analysis still conforms to the recommended design and regression framework outlined in EP25Ed2E, and it thereby ensures consistency for FDA review. By applying known characteristics of the medical device and the data generating process, simulations provide the ability to modify existing designs to new contexts while still aligning with existing precedent.
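A hedged, minimal version of such a simulation-based power calculation (invented variance components, drift, and design settings; not the study's actual values):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_subj, times, reps = 10, [0, 3, 6, 9, 12], 2      # candidate design to evaluate
slope, sd_subj, sd_err = -0.02, 0.4, 0.3           # assumed drift and variance components

def one_sim():
    rows = []
    for s in range(n_subj):
        u = rng.normal(0, sd_subj)                 # between-subject (biological) variation
        for tm in times:
            for _ in range(reps):
                rows.append((s, tm, 10 + u + slope * tm + rng.normal(0, sd_err)))
    df = pd.DataFrame(rows, columns=["subject", "time", "y"])
    fit = smf.mixedlm("y ~ time", df, groups=df["subject"]).fit()
    return fit.pvalues["time"] < 0.05              # drift detected?

power = np.mean([one_sim() for _ in range(200)])
print("estimated power for this design:", power)
```

Repeating the loop with slope set to zero estimates the type I error rate, and varying n_subj, the number of timepoints, and reps maps out the design trade-offs described above.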

Session 5 (10-11:30AM)

 

Designing Experiments to Identify Optimal Operation Conditions for a Dynamic Cloth Media Primary Wastewater Treatment System

Madison De Boer, Baylor University

Abstract: Operation of wastewater treatment (WWT) processes is essential for protecting human health and the environment and is highly energy intensive, prompting the need for optimal operation that minimizes energy consumption while providing high-quality effluent. Our goal is to identify optimal operational setpoints for a primary cloth media filtration system under dynamic influent conditions. Machine learning techniques like reinforcement learning are gaining traction in WWT, but many facilities lack the automation necessary to adopt these advanced methods.

Here, we apply response surface methodology (RSM) paired with constrained optimization as a practical alternative. We target reducing effluent total suspended solids (TSS) to enhance primary effluent water quality, reducing backwashes per hour to minimize energy consumption, and monitoring tank level changes to account for long-term performance of the filter. RSM is used to identify optimal input settings, maximum tank level, and influent flow rate. The system is tested under various setpoints. Modeling effluent TSS per cycle with a second-order model achieves an R^2 exceeding 75%, demonstrating strong predictive performance.

Under fixed influent flow, optimized setpoints improve filter operation. In varied flow scenarios, the approach enhances TSS removal and long-term filter performance.

 

Microstructure-based Statistical Tests for Material State Comparison

Simon Mason, The Ohio State University

Abstract: In materials science, material properties and performance are heavily tied to the microstructure of materials, that is, the myriad features present at multiple length scales. The development of new and improved industrially important materials relies upon our ability to meaningfully capture and quantify characteristics of these microstructural features. The natural variation in microstructures across samples of a given material suggests a theoretical probability distribution over these patterns, which may be used for formulating tests of statistical hypotheses. The non-Euclidean structure of these objects, however, prevents the use of standard non-parametric tests of homogeneity such as Kolmogorov-Smirnov or Cramer-von Mises. We combine a new approach for metric distribution function-based testing with the development of quantitative descriptors to establish metric distances between microstructure samples. We show that, for a materials domain, this test can be used to determine resolvability limits between neighboring material states in terms of processing parameters, differentiating between similar microstructures. We further examine its use as a tool for recognizing and distinguishing deep-learning-generated microstructures from physics-generated images.

 

Optimal Robust Designs with Both Centered and Baseline Factors

Xietao Zhou, King’s College London

Abstract: Traditional optimal designs are optimal under a pre-specified model. When the final fitted model differs from the pre-specified model, traditional optimal designs may cease to be optimal, and the corresponding parameter estimators may have larger variances. The Q_B criterion has been proposed to offer the capacity to consider hundreds of alternative models that could potentially be useful for data from a multifactor design.

Recently, an alternative parameterization of factorial designs called the baseline parameterization has been considered in the literature. It has been argued that such a parameterization arises naturally if there is a null state of each factor, and the corresponding minimum K-aberration has been explored. In our previous work, we have generalized the Q_B criterion to apply to the baseline parameterization, and it has been shown that the optimal designs found can be projected on more eligible candidate models than the minimal K-aberration design for various specified prior probabilities of main effects and two-factor interactions being in the best model.

In the present work, we have extended the Q_B criterion to the scenario in which eligible candidate models contain both baseline and centered parameterization factors. This is of practical interest when some of the factors naturally have a reasonable null state, alongside other factors whose levels are equally important and are more naturally represented under the centered parameterization. We have compared our optimal designs with their counterparts in the most recent literature and have shown that both the projection capacity over eligible candidate models and the accuracy of estimation of models in terms of the A_s criterion can be improved when the number of runs in the experiment is a multiple of 4. We have also examined and solved the same problem with no restrictions on the number of runs, so that the approach can be applied in a more general way in practice.

The basic framework of the Q_B criterion and its variation for the baseline parameterization will be briefly discussed, followed by a detailed explanation of the new version dealing with factors under both parameterizations, and finishing with an evaluation of the robustness and accuracy of the Q_B optimal designs we have found.

 

Robust Parameter Designs Constructed from Hadamard Matrices

Yingfu Li, University of Houston – Clear Lake

Abstract: The primary objective of robust parameter design is to identify the optimal settings of control factors in a system to minimize the response variance while achieving an optimal mean response. This article investigates fractional factorial designs constructed from Hadamard matrices of orders 12, 16, and 20 to meet the requirements of robust parameter design. These designs allow for the estimation of critical factorial effects, including all control-by-noise interactions and the main effects of both control and noise factors, while saving experimental runs and often providing better estimation of other potentially important interactions. Top candidates for various combinations of control and noise factors are provided, offering practical choices for efficient and resource-constrained experimental designs with minimal runs.

 

Exploratory Image Data Analysis for Quality Improvement Hypothesis Generation

Theodore T. Allen, The Ohio State University

Abstract: Images can provide critical information for quality engineering. Exploratory image data analysis (EIDA) is proposed here as a special case of EDA (exploratory data analysis) for quality improvement problems with image data. The EIDA method aims to obtain useful information from the image data to identify hypotheses for additional exploration relating to key inputs or outputs. The proposed four steps of EIDA are: (1) image processing, (2) image-derived quantitative data analysis and display, (3) salient feature (pattern) identification, and (4) salient feature (pattern) interpretation. Three examples illustrate the methods for identifying and prioritizing issues for quality improvement, identifying key input variables for future study, identifying outliers, and formulating causal hypotheses.

 

Boundary Peeling: An Outlier Detection Method

Maria L. Weese, Miami University of Ohio

Abstract: Unsupervised outlier detection constitutes a crucial phase within data analysis and remains an open area of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce Boundary Peeling, an unsupervised outlier detection algorithm. Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines to flag outliers. The method is similar to convex hull peeling but well suited for high-dimensional data and has flexibility to adapt to different distributions. Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In unimodal and multimodal synthetic data simulations, Boundary Peeling outperforms all state-of-the-art methods when no outliers are present, while maintaining comparable or superior performance in the presence of outliers. Boundary Peeling also performs competitively or better in terms of correct classification, AUC, and processing time on semantically meaningful benchmark datasets.
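A hedged sketch of the peeling idea (a simplified loop with arbitrary settings, not the authors' implementation or its ensemble version):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (300, 5)),
               rng.normal(5, 1, (5, 5))])          # 300 inliers plus 5 planted outliers (rows 300-304)

scores = np.zeros(len(X))
remaining = np.arange(len(X))
for _ in range(5):                                 # a few peels (illustrative)
    svm = OneClassSVM(nu=0.1, gamma="scale").fit(X[remaining])
    scores += svm.decision_function(X)             # signed distance of every point to this boundary
    inside = svm.decision_function(X[remaining]) > 0
    remaining = remaining[inside]                  # peel off points on or outside the boundary
    if len(remaining) < 20:
        break

flagged = np.argsort(scores)[:5]                   # most negative accumulated signed distance
print("flagged observations:", sorted(flagged.tolist()))   # typically the planted rows 300-304
```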

Session 6 (1:30-3PM)

 

Integrating SPC, DOE, and AI/ML for Enhanced Quality

Daksha Chokshi, StatQualTech

Abstract: In today’s competitive landscape, continuous improvement is essential for achieving operational excellence and maintaining a competitive edge. This presentation explores the synergistic integration of Statistical Process Control (SPC), Design of Experiments (DOE), and Artificial Intelligence (AI) to develop a more adaptive and intelligent approach to process optimization and decision-making. SPC establishes a robust framework for monitoring and controlling process variability, ensuring consistent product quality. DOE offers a systematic approach to experimenting with process parameters, identifying optimal conditions that enhance performance. The incorporation of AI/ML further strengthens these traditional methodologies by enabling predictive analytics, anomaly detection, pattern recognition, and automated optimization. By combining these approaches, organizations can transition from reactive quality control to a proactive, data-driven strategy that drives self-learning process improvements.

Several case studies and practical applications will be discussed to illustrate how this triad of methodologies fosters a culture of continuous improvement, empowering organizations to achieve higher levels of productivity, quality, and innovation. The presentation will conclude with an exploration of challenges, implementation strategies, and future directions for AI-driven continuous improvement. 

 

Using Input-Varying Weights to Determine a Soft Changepoint in Mixture Distributions

Di Michelson, Caleb King and Don McCormack, JMP

Abstract: It is quite common for data to come from populations that actually consist of two or more subpopulations. For example, the lifetime distribution of a product may actually consist of multiple distributions depending on specific failure modes. The typical approach in these instances is to use a mixture distribution, where the likelihood of each observation is a weighted combination of several distribution models. These weights may be constant or a function of covariates. Another approach is to consider a changepoint where the distribution model makes a sudden change from one model to another. In this talk, we propose an approach that falls between these two extremes. Instead of a hard changepoint, we use a probit or logistic model that allows the mixture proportion to vary over the range of the variable, with the point at which the mixture is evenly split serving as a “soft changepoint”. We illustrate this new approach using data from an industrial application.
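A hedged sketch of a logistic-weight mixture fit of this general kind (simulated data, normal components, and starting values chosen for illustration; not the JMP implementation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 600)
w_true = 1 / (1 + np.exp(-(x - 6.0) / 0.8))        # mixing weight drifts with x; true soft changepoint at 6
comp2 = rng.random(600) < w_true
y = np.where(comp2, rng.normal(5, 1, 600), rng.normal(0, 1, 600))

def negloglik(par):
    m1, m2, ls1, ls2, c, lr = par
    w = 1 / (1 + np.exp(-(x - c) / np.exp(lr)))    # logistic-varying mixture proportion
    dens = (1 - w) * norm.pdf(y, m1, np.exp(ls1)) + w * norm.pdf(y, m2, np.exp(ls2))
    return -np.sum(np.log(dens + 1e-300))

fit = minimize(negloglik, x0=[0.0, 4.0, 0.0, 0.0, 5.0, 0.0],
               method="Nelder-Mead", options={"maxiter": 10_000, "maxfev": 10_000})
print("estimated soft changepoint (where the weight equals 1/2):", fit.x[4])
```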

 

Monotonic Warpings for Additive and Deep Gaussian Processes

Steven D. Barnett, Virginia Tech

Abstract: Gaussian processes (GPs) are canonical as surrogates for computer experiments because they enjoy a degree of analytic tractability. But that breaks when the response surface is constrained, say to be monotonic. Here, we provide a mono-GP construction for a single input that is highly efficient even though the calculations are non-analytic. Key ingredients include transformation of a reference process and elliptical slice sampling. We then show how mono-GP may be deployed effectively in two ways. One is additive, extending monotonicity to more inputs; the other is as a prior on injective latent warping variables in a deep Gaussian process for (non-monotonic, multi-input) non-stationary surrogate modeling. We provide illustrative and benchmarking examples throughout, showing that our methods yield improved performance over the state-of-the-art on examples from those two classes of problems.

 

Deep Gaussian Processes for Estimation of Failure Probabilities in Complex Systems

Annie S. Booth, Virginia Tech

Abstract: We tackle the problem of quantifying failure probabilities for expensive deterministic computer experiments with stochastic inputs. The computational cost of the computer simulation prohibits direct Monte Carlo (MC) and necessitates a statistical surrogate model, turning the problem into a two-stage enterprise (surrogate training followed by probability estimation). Limited evaluation budgets create a design problem: how should expensive evaluations be allocated between and within the training and estimation stages? One may relegate all simulator evaluations to greedily train the surrogate, with failure probabilities then estimated from “surrogate MC”. But extended surrogate training offers diminishing returns, and surrogate MC relies too stringently on surrogate accuracy. Alternatively, a surrogate trained on a fraction of the simulation budget may be used to inform importance sampling, but this is data hungry and can provide erroneous results when budgets are limited. Instead we propose a two-stage approach: sequentially training Gaussian process (GP) surrogates through contour location, halting training once learning of the failure probability has plateaued, then employing a “hybrid MC” estimator which combines surrogate predictions in certain regions with true simulator evaluations in uncertain regions. Our unique two-stage design strikes an appropriate balance between exploring and exploiting and outperforms alternatives, including both of the aforementioned approaches, on a variety of benchmark exercises. With these tools, we are able to effectively estimate small failure probabilities with only hundreds of simulator evaluations, showcasing functionality with both shallow and deep GPs, and ultimately deploying our method on an expensive computer experiment of fluid flow around an airfoil.

 

Predictive Modeling for Patient Care using AI in a Secure Environment

Sunil Mathur, Boston Medical Center

Abstract: Predictive modeling for patient care using AI in a secure environment requires a combination of advanced machine learning techniques, robust data security measures, and ethical considerations to ensure patient privacy and regulatory compliance. AI models, such as deep learning and ensemble methods, can then be trained on large, de-identified datasets to predict disease risks, recommend personalized treatments, and optimize hospital resource allocation. We propose to use federated learning, allowing AI models to learn from decentralized data without transferring sensitive patient information across networks. Additionally, we propose to use multi-party computation and homomorphic encryption to enable computations on encrypted data, ensuring confidentiality even during AI processing. Real-time anomaly detection systems using AI will help to identify cybersecurity threats, such as unauthorized access or data breaches, further strengthening the secure environment. Finally, explainable AI (XAI) techniques will be integrated to ensure model transparency and clinician trust, allowing healthcare professionals to interpret AI-driven recommendations while maintaining accountability. By implementing privacy-preserving AI methodologies, robust encryption, and continuous monitoring, predictive modeling can transform patient care while upholding the highest security and ethical standards.

 

Dealing with Sample Bias—Alternative Approaches and the Fundamental Questions They Raise

Frederick W. Faltin, Virginia Tech

Abstract: Now and then we hear on the news of some data analysis (generally attributed to “AI”), the outcome of which has gone badly wrong. The root cause is nearly always found to have been some form of sample bias, or its flipside, unintended extrapolation. Awareness of such issues is generally high in the statistical community, but is often very much less so among data scientists more broadly. Statisticians have responded by developing and promoting several very useful lines of research for countering sample bias in observational data. This expository talk presents an overview of some of these approaches, as well as of algorithmic means being developed in other fields to adjust for bias or extrapolation in the process of fitting machine learning models. These alternative approaches raise fundamental questions about what purpose(s) the models being developed are intended to serve, and how our analysis approach needs to adapt to answer the “right” question.

 

Modeling Intentional Test to Failure as Basis for Assay Positive Control Limits

James Garrett, James Garrett LLC

Abstract: Clinical diagnostic assays include a positive run control sample whose result must fall within limits in order to validate that the assay is capable of returning a positive result. These “control limits” are often estimated as normal ranges based on limited historical data, which is fraught in many ways. I propose an experimental design approach in which the diagnostic system is perturbed to failure with both a low-positive reference sample and the positive control, and I apply distributional regression modeling to interpolate the positive control result at which system performance with the reference sample undeniably begins to degrade. For distributional modeling, I compare quantile regression and GAM-LSS methods. I demonstrate with a simulated data set inspired by a real example.

 

Introducing Continuous Restrictions into Spatial Models via Gaussian Random Fields with Linear Boundary Constraints

Yue Ma, The Ohio State University

Abstract: Boundary constraints are extensively used in physical, environmental and engineering models to restrict smooth states (e.g., temperature fields) to follow known physical laws. Examples include fixed-state or fixed-derivative (insulated) boundaries, and boundaries which relate the state and the derivatives (e.g., convective boundaries). Gaussian random fields (GRFs), as flexible, non-parametric models, are widely applied to recover smooth states from discrete spatial measurements across a domain. We formally define boundary-constrained random fields and introduce a representation-based approach to fully enforce linear boundary constraints on GRFs over multi-dimensional, convex domains. This new class of boundary-constrained random fields can be used for recovering smooth states, with known physical mechanisms working at the domain boundaries. Such constrained random fields make flexible priors for modeling smooth states, enable data-driven discovery of dynamic systems, and improve performance and uncertainty quantification of probabilistic solvers for differential equations.

 

Scale-Location-Truncated Beta Regression: Expanding Beta Regression to Accommodate 0 and 1

Mingang Kim, Virginia Tech

Abstract: Beta regression is frequently used when the outcome variable y is bounded within a specific interval, transformed to the (0, 1) domain if necessary. However, standard beta regression cannot handle data observed at the boundary values of 0 or 1, as the likelihood function takes on values of either 0 or ∞. To address this issue, we propose the Scale-Location-Truncated (SLT) beta regression model, which extends the beta distribution’s domain to the [0, 1] interval. By using a scale-location transformation and truncation, the SLT beta distribution assigns positive finite mass to the boundary values, offering a flexible approach for handling values at 0 and 1.

In this paper, we demonstrate the effectiveness of the SLT beta model in comparison to standard beta regression models and other approaches like the Zero-One Inflated Beta (ZOIB) model [Liu and Kong, 2015] and XBX regression [Kosmidis and Zeileis, 2024]. Using empirical and simulated data, we compare the performance, including predictive accuracy, of the SLT beta model with other methods, particularly in cases with observed boundary data values for y. The SLT beta model is shown to offer greater flexibility, supporting both linear and nonlinear relationships. Additionally, we implement the SLT beta model within classical and Bayesian frameworks, employing both hierarchical and non-hierarchical models. This comprehensive implementation demonstrates its broad applicability for modeling bounded data in a range of contexts.

 

Pitfalls and Remedies for Maximum Likelihood Estimation of Gaussian Processes

Ayumi Mutoh, North Carolina State University

Abstract: Gaussian processes (GPs) are nonparametric regression models favored for their nonlinear predictive capabilities, making them popular as surrogate models for computationally expensive computer simulations. Yet, GP performance relies heavily on effective estimation of unknown kernel hyperparameters. Maximum likelihood estimation is the most common tool of choice, but it can be plagued by numerical issues in small data settings. Penalized likelihood methods attempt to overcome likelihood optimization challenges, but their success depends on tuning parameter selection. Common approaches select the penalty weight using leave-one-out cross validation (CV) with root mean squared error (RMSE). Although this method is easy to implement, it is computationally expensive and ignores the uncertainty quantification (UQ) provided by the GP. We propose a novel tuning parameter selection scheme which combines k-fold CV with a score metric that accounts for GP predictive performance and UQ. Additionally, we incorporate a one-standard-error rule to encourage smoother predictive surfaces in the face of limited data, which remedies flat likelihood issues. Our proposed tuning parameter selection for GPs matches the performance of standard MLE when no penalty is warranted, excels in settings where regularization is preferred, and outperforms the benchmark leave-one-out CV with RMSE.

 

The Current Role of Acceptance Sampling in Pharmaceutical Manufacturing

Alson Look, Regeneron

Abstract: The purpose of this presentation is to provide an overview of acceptance sampling as used in pharmaceutical manufacturing. Specifically, we will discuss: (1) a brief history of acceptance sampling; (2) single sampling attributes plans, which will be the focus; (3) Operating Characteristic curves, their importance, and their relationship to Acceptable Quality Levels (AQL) and Lot Tolerance Percent Defective (LTPD); (4) how sample sizes are selected using a statistical software package such as Minitab; and (5) the implications of different sample sizes (risk to internal customers/patients, etc.), different strategies, possible solutions, and open problems.
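A hedged sketch of the OC-curve calculation for a single-sampling attributes plan (an illustrative plan, AQL, and LTPD; not values recommended in the talk):

```python
from scipy.stats import binom

n, c = 125, 3                                      # illustrative single-sampling plan: sample n, accept if <= c defectives
for p in (0.005, 0.01, 0.02, 0.05, 0.08):          # lot fraction defective
    print(f"p = {p:.3f}: probability of acceptance = {binom.cdf(c, n, p):.3f}")

aql, ltpd = 0.01, 0.05                             # illustrative quality levels
print("producer's risk at AQL :", round(1 - binom.cdf(c, n, aql), 3))
print("consumer's risk at LTPD:", round(binom.cdf(c, n, ltpd), 3))
```

Searching over n and c to hit target risks at the AQL and LTPD is essentially what statistical software such as Minitab automates.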

 

Monitoring Functional Anomalies in a Water Treatment Process

Hunter Privett, Baylor University

Abstract: This research is intended for an intermediate statistical audience, with a focus on functional data analysis, outlier detection applied to process control, and water and wastewater treatment data. Batch processes in water and wastewater treatment (W/WWT) often produce data with a repetitive, functional pattern. Detecting faults in these systems is important both for the quality of the system’s effluent water and to prevent system damage. However, the expected nonstationary changes in cyclical behavior over time within these systems and unique treatment process parameters between facilities make fault detection challenging. Some case studies have been done to retroactively assess the efficacy of fault detection in these systems, but case studies are limited in that they rely on assumptions about when a fault may be occurring, rather than a controlled and known fault. In this work, we use recently developed approaches to simulating functional W/WWT datasets in order to compare different process monitoring methods when applied to W/WWT systems for several controlled, simulated faults, and we develop a new real-time monitoring method based on metrics of functional outlyingness. First, a dataset is simulated using functions from an observed ultrafiltration system as a reference, and the functions are contaminated with one of four different types of faults after a period of normal operation. Then, four different fault detection methods are applied, and true positive and false alarm rates are calculated. The first two methods applied are traditional Shewhart charts for each individual measurement using either the original raw values or globally detrended values. The next two methods are based on the functional structure of the data applied to the detrended data. The first is a method we develop that incorporates metrics of a function’s directional outlyingness into a T2 chart. The last method uses a set of T2 and SPE charts based on functional PCA decomposition of the functions. These methods are then compared for four types of faults that include changes to the shape or magnitude of the functions. We demonstrate the importance of accounting for global behavior in W/WWT systems when performing fault detection to increase accuracy. We also demonstrate when a functional-based method improves over a traditional method that ignores the functional structure.
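A hedged sketch of a generic FPCA-score T2 chart of the kind compared above (simulated cycles and an arbitrary drift fault; not the ultrafiltration data or the directional-outlyingness method):

```python
import numpy as np

rng = np.random.default_rng(10)
grid = np.linspace(0, 1, 50)
def cycle(drift=0.0):
    return np.sin(2 * np.pi * grid) + drift * grid + rng.normal(0, 0.05, 50)

train = np.array([cycle() for _ in range(100)])    # in-control batch cycles
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
k = 3                                              # retained functional principal components
scores = (train - mean) @ Vt[:k].T
cov = np.cov(scores, rowvar=False)

def t2(curve):
    z = (curve - mean) @ Vt[:k].T                  # project a new cycle onto the retained components
    return z @ np.linalg.solve(cov, z)

limit = np.quantile([t2(c) for c in train], 0.99)  # empirical control limit from the training cycles
print("in-control cycle T2:", round(t2(cycle()), 2), " limit:", round(limit, 2))
print("drifting cycle T2  :", round(t2(cycle(drift=1.0)), 2))
```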