October 8th

Session 1 (9:15-10AM)

 

Assessing Predictive Capability for Binary Classification Models

Mindy Hotchkiss, Enquery Research LLC

Abstract: Classification models for binary outcomes are in widespread use across a variety of industries. Results are commonly summarized in a misclassification table, also known as an error or confusion matrix, which indicates correct versus incorrect predictions for different circumstances. Models are developed to minimize both false positive and false negative errors, but the optimization process used to train the model necessarily involves cost-benefit trade-offs. However, how to obtain an objective assessment of the performance of a given model in terms of predictive capability or benefit is less well understood, due both to the plethora of options described in the literature and to the largely overlooked influence of noise factors, specifically class imbalance. Many popular measures are susceptible to effects arising from underlying differences in how the data are allocated by condition, and these effects cannot be easily corrected.

This talk considers the wide landscape of possibilities from a statistical robustness perspective. Results are shown from sensitivity analyses of several popular metrics under a variety of conditions, highlighting potential concerns with respect to machine learning and ML-enabled systems. Recommendations are provided for correcting imbalance effects, as well as for a simple statistical comparison that disentangles the beneficial effects of the model itself from those of imbalance. Results are generalizable across model types.
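As a quick illustration of the imbalance effect described above (a hedged numerical sketch with invented numbers, not the speaker's analysis), the snippet below holds a classifier's sensitivity and specificity fixed and shows how several popular confusion-matrix summaries shift as class prevalence changes:

```python
import numpy as np

sens, spec = 0.85, 0.85            # fixed true-positive and true-negative rates (assumed)
n = 100_000
for prevalence in (0.5, 0.2, 0.05, 0.01):
    pos = prevalence * n
    neg = n - pos
    tp, fn = sens * pos, (1 - sens) * pos
    tn, fp = spec * neg, (1 - spec) * neg
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    f1 = 2 * precision * sens / (precision + sens)
    balanced_acc = (sens + spec) / 2
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    print(f"prevalence={prevalence:5.2f}  accuracy={accuracy:.3f}  precision={precision:.3f}  "
          f"F1={f1:.3f}  balanced_acc={balanced_acc:.3f}  MCC={mcc:.3f}")
```

Accuracy and balanced accuracy stay flat here, while precision, F1, and MCC fall as the positive class becomes rare, even though the classifier itself has not changed.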

 

Optimal Design of Experiments for Powerful Equivalence Testing

Peter Goos, KU Leuven

Abstract: In manufacturing, an optimized process typically operates under a target condition with specific parameter settings. Products produced under this condition meet the desired quality standards. However, these parameters can inevitably deviate from their target settings during routine production. For instance, the actual pH level of a blend may differ from the target setting due to variability in the materials used to calibrate the pH meter. Similarly, production delays can cause the actual temperature of a material to vary by a few degrees from its target setting. To assess whether products still meet quality specifications despite these uncontrollable variations in process parameters, manufacturers can conduct an equivalence study. In such studies, the ranges of the process parameters correspond to the upper and lower bounds of the observed deviations from the target settings, referred to as the normal operating ranges. The manufacturer then compares the quality attributes of products produced within the normal operating ranges of the process parameters to those produced under the target condition. If the differences in quality fall within an acceptable range, the products are considered practically equivalent, indicating that the manufacturing process is robust and capable of consistently producing quality products.

In this presentation, we adapt existing methods for calculating power in bioequivalence studies to the context of industrial experimental design. We also introduce a novel design criterion, termed “PE-optimality,” to generate designs that allow for powerful equivalence testing in industrial experiments. An adequate design for equivalence testing should provide a high probability of declaring equivalence of the mean responses at various locations within the normal operating ranges of process parameters and the mean response at the target condition, when equivalence truly holds. The PE-optimality design criterion achieves this by performing prospective power analyses at a large number of locations within the experimental region and selecting a design that ensures sufficiently high power across a substantial portion of the region.
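As a rough illustration of the kind of prospective power analysis involved (a hedged sketch with illustrative numbers; the margin, sample size, and known-variance z-tests are assumptions, not the PE-optimality criterion itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma, delta, alpha = 8, 1.0, 1.5, 0.05    # replicates per condition, noise SD, margin, level
true_diff = 0.0                               # equivalence truly holds
se = sigma * np.sqrt(2 / n)                   # SE of the difference between two sample means
z = stats.norm.ppf(1 - alpha)

d_hat = rng.normal(true_diff, se, 100_000)    # simulated estimated differences
declare_equiv = ((d_hat + delta) / se > z) & ((d_hat - delta) / se < -z)   # two one-sided tests
print("estimated probability of declaring equivalence:", declare_equiv.mean())
```

Repeating such a calculation at many locations in the operating region, with the prediction variance implied by a candidate design, is the kind of computation a power-oriented design criterion must perform.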

 

Can ChatGPT Think Like a Quality and Productivity Professional?

Jennifer Van Mullekom and Anne Driscoll, Virginia Tech

Abstract: As the capability of generative AI (GenAI) evolves at a rapid pace, organizations are eager to leverage it for efficiency gains. Various GenAI tools include a data analysis component. More than any other disruptive technology in recent years, GenAI has the potential to change both how data analysis is accomplished and by whom. Consequently, it will change how we educate those who pose data-centric questions and try to answer them in the future.

With all the buzz and the hype, we were curious: “Can ChatGPT Think Like a Quality and Productivity Professional?” Come to the talk and we’ll let you know the answer. Through a pilot at Virginia Tech, faculty and graduate students are evaluating the ChatGPT EDU Advanced Data Analytics tool. Our evaluation is none other than a systematic one using a designed experiment. (After all, we are quality and productivity professionals!) Our evaluation spans select prompt types, specificity of the prompts, and applications in Quality and Productivity. Results will be reported in both quantitative and qualitative terms. Our talk will include a brief overview of GenAI for data analysis, an overview of our study design, use case examples, and the results of our pilot, culminating in recommendations and best practices that you can leverage regardless of your position or organization.

Session 2 (10:30AM-12PM)

 

Monitoring Parametric, Nonparametric, and Semiparametric Linear Regression Models using a Multivariate CUSUM Bayesian Control Chart

Abdel-Salam G. Abdel-Salam, Qatar University

Abstract: In this work we develop a Bayesian multivariate cumulative sum (mCUSUM) control chart for monitoring multiple response variables for count data using their model coefficients. We use the squared error loss function for parameter estimation of models fit using parametric, nonparametric, and semiparametric regression methods. We deploy the nonparametric penalized splines (p-splines) and semiparametric model robust regression 1 (MRR1) methods, comparing their regression accuracy via mean squared error (MSE), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). To assess the chart’s out-of-control detection capabilities, we perform simulation studies for both the hyper-parameters and the sample size. After the results are confirmed, we provide a real application to suicide count data made available by the World Health Organization (WHO) to predict and monitor global suicide rates.

 

To Drive Improvement: Act on Every Count

James Lucas, J. M. Lucas and Associates

Abstract: The act on every count procedure (AOEC) is an effective way to drive improvement. It is useful when a count indicates the occurrence of an event of interest, usually an adverse event, and the goal is to reduce the occurrence of adverse events.

This talk will discuss the AOEC procedure. A unique aspect of this talk is that the title almost tells the whole story, so a major part of the talk is a discussion of situations where the AOEC procedure has been successful in driving improvement. Two situations where the AOEC procedure has been successfully implemented are the FAA’s approach to airline accidents and DuPont’s benchmarked safety system. We then describe two situations where the AOEC procedure should be implemented: hospital errors and police killings of unarmed civilians. We also discuss barriers to the wider use of the AOEC procedure.

The DuPont Company’s Product Quality Management manual differentiated between low-count and rare event quality problems. It stated: “A property that has a low count of nonconformances is presumed to be a phenomenon continuously (or at least often) present in normal product, even though only infrequently counted in a typical routine sample. A rare event quality problem is presumed to be entirely absent in all normal product, and to be due to a specific unusual malfunction in each specific instance of a quality breakdown. It is not always important to distinguish between these two categories of problems because the same counted data CUSUM technology is applied in either case.” The last sentence refers to the fact that the DuPont PQM quality system used CUSUM for monitoring both variables and counts; it is the largest known CUSUM implementation. The AOEC procedure is a special case of a CUSUM, and also of a Shewhart monitoring procedure so it has all the optimality properties of both procedures. The rare event quality problem model is more applicable for the AOEC procedure because each count is considered to be due to its own assignable cause. The goal action for the AOEC procedure is the removal of the assignable cause thereby driving improvement.

As background, “Detection Possibilities When Count Levels Are Low” (Lucas et al., 2025) will be discussed. A “portable” version of the 2-in-m control procedure for detecting an order-of-magnitude shift and a conceptual version of a procedure for detecting a doubling will be provided. This background quantifies why it is almost never feasible to detect shifts of a doubling or smaller when count levels are low. In low-count situations, the AOEC procedure should be considered.
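A hedged numerical sketch of the low-count argument (illustrative rates, not figures from the talk): any signaling rule strict enough to keep false alarms rare at a very low in-control rate will also signal only rarely after the rate doubles.

```python
from scipy.stats import poisson

lam0, lam1 = 0.05, 0.10                      # in-control rate and a doubled rate (illustrative)
for c in (1, 2):
    false_alarm = poisson.sf(c - 1, lam0)    # P(count >= c) per period when in control
    detect = poisson.sf(c - 1, lam1)         # P(count >= c) per period after the doubling
    print(f"signal at >= {c} events: false-alarm prob {false_alarm:.4f}, "
          f"detection prob {detect:.4f}, ~{1 / detect:.0f} periods to signal on average")
```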

 

Vecchia Approximated Bayesian Heteroskedastic Gaussian Processes

Parul Vijay Patil, Virginia Tech

Abstract: Many computer simulations are stochastic and exhibit input-dependent noise. In such situations, heteroskedastic Gaussian processes (hetGPs) make for ideal surrogates as they estimate a latent, non-constant variance. However, existing hetGP implementations are unable to deal with large simulation campaigns and use point estimates for all unknown quantities, including latent variances. This limits applicability to small simulation campaigns and undercuts uncertainty quantification (UQ). We propose a Bayesian framework to fit hetGPs using elliptical slice sampling (ESS) for latent variances, improving UQ, and Vecchia approximation to circumvent computational bottlenecks. We are motivated by the desire to train a surrogate on a large (8-million run) simulation campaign for lake temperature forecasts provided by the Generalized Lake Model (GLM) over depth, day, and horizon. GLM simulations are deterministic, in a sense, but when driven by NOAA weather ensembles they exhibit the features of input-dependent noise: variance changes over all inputs, but in particular increases substantially for longer forecast horizons. We show good performance for our Bayesian (approximate) hetGP compared to alternatives on those GLM simulations and other classic benchmarking examples with input-dependent noise.

 

Impact of the Choice of Hyper-parameters on Statistical Inference of SGD Estimates

Yeng Saanchi, JMP

Abstract: In an era where machine learning is pervasive across various domains, understanding the characteristics of the underlying methods that drive these algorithms is crucial. Stochastic Gradient Descent (SGD) and its variants are key optimization techniques within the broader class of Stochastic Approximation (SA) algorithms. Serving as the foundation for most modern machine learning algorithms, SGD has gained popularity due to its efficient use of inexpensive gradient estimators to find the optimal solution of an objective function. Its computational and memory efficiency makes it particularly well-suited for handling large-scale datasets or streaming data.

SGD and its variants are widely applied in fields such as engineering, computer science, applied mathematics, and statistics. However, due to early stopping, SGD typically produces estimates that are not exact solutions of the empirical loss function. The difference between the SGD estimator and the true minimizer is influenced by factors such as the observed data, the tuning parameters of the SGD method, and the stopping criterion. While these methods have been successful in a wide range of applications, SGD can be erratic and highly sensitive to hyper-parameter choices, often requiring substantial tuning to achieve optimal results.

To explore the impact of step size scheduling on SGD’s accuracy and coverage probability, we conduct a simulation study. Additionally, we propose a new approach for hyper-parameter tuning that combines the double bootstrap method with the Simultaneous Perturbation Stochastic Approximation (SPSA) technique.
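To make the sensitivity concrete, here is a hedged sketch (an illustrative linear-regression model and step-size schedules of our own choosing, not the authors' simulation study) comparing SGD estimates under different schedules a_k = a0 / (1 + k)^gamma against the exact least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1)                      # true coefficients
y = X @ beta + rng.normal(size=n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # exact empirical minimizer

def sgd(a0, gamma, epochs=5):
    b, k = np.zeros(p), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            step = a0 / (1 + k) ** gamma          # step-size schedule a_k = a0 / (1 + k)^gamma
            b -= step * (X[i] @ b - y[i]) * X[i]  # single-observation gradient step
            k += 1
    return b

for a0, gamma in [(0.1, 0.5), (0.01, 0.5), (0.1, 1.0)]:
    gap = np.linalg.norm(sgd(a0, gamma) - beta_ols)
    print(f"a0={a0}, gamma={gamma}: ||beta_SGD - beta_OLS|| = {gap:.4f}")
```

The gap between the SGD iterate and the exact minimizer, and hence the behavior of any inference built on the SGD estimate, depends visibly on the schedule.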

 

Screening Designs for Continuous and Categorical Factors

Ryan Lekivetz, JMP

Abstract: In many screening experiments, we often need to consider both continuous and categorical factors. In this presentation, we introduce a new class of saturated designs that effectively address this need. These designs include m three-level continuous factors and m–1 two-level factors—either categorical or continuous—within just n = 2m runs, where m is at least 4.

A major advantage of our approach is its flexibility: these designs are available for any even number of runs starting from 8. Depending on whether n is a multiple of 8, 4, or just 2, the designs exhibit varying degrees of orthogonality.

We demonstrate that these designs typically have power near one for identifying up to m active main effects when the signal-to-noise ratio is greater than 1.5.

 

Rerandomization Algorithms for Optimal Designs of Network A/B Tests

Qiong Zhang, Clemson

Abstract: A/B testing is an effective method to assess the potential impact of two treatments. For A/B tests conducted by IT companies like Meta and LinkedIn, the test users can be connected and form a social network. Users’ responses may be influenced by their network connections, and the quality of the treatment estimator of an A/B test depends on how the two treatments are allocated across different users in the network. This paper investigates optimal design criteria based on some commonly used outcome models, under assumptions of network-correlated outcomes or network interference. We demonstrate that the optimal design criteria under these network assumptions depend on several key statistics of the random design vector. We propose a framework to develop algorithms that generate rerandomization designs meeting the required conditions of those statistics under a specific assumption. Asymptotic distributions of these statistics are derived to guide the specification of parameters in the algorithms. We validate the proposed algorithms using both synthetic and real-world networks.
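A hedged sketch of the general rerandomization idea (a toy random graph and an acceptance rule of our own choosing, not the paper's algorithms, criteria, or asymptotic calibration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n = 60
A = (rng.random((n, n)) < 0.08).astype(int)       # toy random graph standing in for a social network
A = np.triu(A, 1)
edges = np.argwhere(A)                            # list of (i, j) edges

for n_draws in itertools.count(1):
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, n // 2, replace=False)] = 1   # balanced treatment assignment
    cut_share = np.mean(z[edges[:, 0]] != z[edges[:, 1]])   # share of edges with discordant treatments
    if abs(cut_share - 0.5) < 0.02:               # acceptance window on the design statistic (illustrative)
        break
print(f"accepted after {n_draws} draws; discordant-edge share = {cut_share:.3f}")
```

In practice the accepted statistics and their windows would be the ones dictated by the assumed outcome model, with thresholds guided by the derived asymptotic distributions.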

Session 3 (2-3:30PM)

 

Probability of Detection: Evaluating the Reliability of Nondestructive Inspection Systems

Christine Knott, Air Force Research Laboratory

Abstract: The reliability of nondestructive inspection (NDI) systems is estimated using statistical methods, the most thorough of which is the Probability of Detection (POD) methodology. The Department of the Air Force uses periodic nondestructive re-inspection of critical structural components to maintain aircraft safety, and POD helps establish the length of inspection intervals. The established methods for POD will be provided, followed by a discussion of recent statistical research which extends and improves upon these methods.
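For readers unfamiliar with POD modeling, here is a hedged sketch of the standard hit/miss approach on simulated data (illustrative values only; not Air Force data or the specific extensions to be discussed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
size = rng.uniform(0.2, 3.0, 200)                       # simulated flaw sizes (arbitrary units)
true_pod = 1 / (1 + np.exp(-(np.log(size) - np.log(1.0)) / 0.3))
hit = rng.binomial(1, true_pod)                         # simulated detect / no-detect outcomes

X = sm.add_constant(np.log(size))
fit = sm.GLM(hit, X, family=sm.families.Binomial()).fit()   # logistic POD curve versus log size
b0, b1 = fit.params
a90 = np.exp((np.log(0.9 / 0.1) - b0) / b1)             # flaw size with fitted POD = 0.90
print(f"estimated a90 = {a90:.2f}")
```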

 

Experiment Design and Modeling for Nondestructive Evaluation at NASA

Peter A. Parker, NASA

Abstract: NASA requirements for human-rated spaceflight systems rely on nondestructive evaluation (NDE) methods to reliably detect critical defects in hardware before flight. Diagnostic measurement techniques such as eddy current, ultrasound, and radiography are used to inspect flight hardware in a noninvasive, harmless manner. The performance of an NDE inspection protocol is assessed in a probability of detection (POD) study, which involves experimental demonstration of defect detection and statistical modeling of detection capability and reliability as a function of defect size. POD studies are resource-intensive; therefore, leveraging statistical experiment design is essential. POD studies are inherently interdisciplinary, as they integrate the physics of the inspection modality, material properties, defect morphology, operational inspection access, and human factor interactions with the inspection technique. These studies may confirm that a well-established inspection protocol meets the required detection capability, or they may involve experimentally optimizing an NDE technique for a specific application. NASA’s NDE discipline emerged in the early 1970s to support the Space Shuttle Program. While the statistical methods used in POD assessments have advanced over the past 55 years, POD practice and research remain a narrow field of expertise. This presentation seeks to broaden awareness and promote involvement of practitioners and academics in NDE. It provides an overview of statistical design and modeling aspects of POD studies, highlights recent methodological advancements, and identifies challenging research opportunities to meet evolving aerospace requirements.

 

A Scalable Algorithm for Generating Non-Uniform Space-Filling Designs

Xiankui Yang,

Abstract: Traditional space-filling designs have long been popular for achieving uniform spread or coverage throughout the design space for diverse applications. Non-uniform space-filling (NUSF) designs were recently developed to achieve flexible densities of design points, allowing users to emphasize and de-emphasize different regions of the input space. However, with a point exchange algorithm, the construction of NUSF designs entails substantial computational cost, particularly for higher-dimensional scenarios. To improve computing efficiency and scalability, we propose a new algorithm consistent with the fundamentals of NUSF, termed Quick Non-Uniform Space-Filling (QNUSF) designs. By combining hierarchical clustering with group average linkage and strategic point selection methods, the QNUSF algorithm expedites computation. We present two point selection methods, maximin and minimax QNUSF, to achieve non-uniformity with different emphases on the spread or coverage of the design, facilitating broader adaptability. In addition, QNUSF designs allow great flexibility for handling discrete or continuous, regular or irregular input spaces to achieve the desired density distribution, improving versatility and applicability for different experimental goals. The computational efficiency and performance of QNUSF designs are illustrated via several examples.
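A hedged sketch of the general clustering-plus-selection idea (a simplified stand-in with an invented weight function and a resampling step of our own, not the QNUSF algorithm itself):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
cand = rng.random((400, 2))                                   # candidate points in [0, 1]^2
weight = np.exp(-8 * np.sum((cand - 0.25) ** 2, axis=1))      # desired density, peaked near (0.25, 0.25)

boosted = cand[rng.choice(len(cand), 1000, p=weight / weight.sum())]   # weight-proportional resample
Z = linkage(pdist(boosted), method="average")                 # hierarchical clustering, group average linkage
labels = fcluster(Z, t=20, criterion="maxclust")              # ask for 20 design points
design = np.array([boosted[labels == k].mean(axis=0) for k in np.unique(labels)])
print(len(design), "design points; more of them fall where the weight is high")
```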

A Critique of Neutrosophic Statistical Analysis Illustrated with Interval Data from Designed Experiments

William Woodall, Virginia Tech

Abstract: Recent studies have explored the analysis of data from experimental designs using neutrosophic statistics. These studies have reported neutrosophic bounds on the statistics in analysis of variance tables. In this paper, following Woodall et al. (2025), a simple simulation-based approach is used to demonstrate that the reported neutrosophic bounds on these statistics are either incorrect or too inaccurate to be useful. We explain why the neutrosophic calculations are incorrect using two simple examples.
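A hedged sketch of a simulation check in that spirit (illustrative interval data, not the examples from the paper): sample each observation inside its interval many times, run the ANOVA each time, and record the achievable range of the F statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
lo = np.array([[4.0, 5.1, 4.6, 5.4],      # lower endpoints, group 1 (illustrative intervals)
               [6.2, 5.8, 6.5, 6.0]])     # lower endpoints, group 2
hi = lo + 0.6                             # upper endpoints

f_vals = []
for _ in range(20_000):
    x = rng.uniform(lo, hi)               # one realization with each value inside its interval
    f_vals.append(stats.f_oneway(x[0], x[1]).statistic)
print(f"simulated range of the one-way ANOVA F statistic: [{min(f_vals):.2f}, {max(f_vals):.2f}]")
```

Published neutrosophic bounds can then be checked directly against ranges obtained this way.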

 

Large Row-Constrained Supersaturated Designs for High-throughput Screening

Byran Smucker, Henry Ford Health

Abstract: High-throughput screening, in which large numbers of compounds are traditionally studied one at a time in multiwell plates, is widely used across many areas of the biological and chemical sciences, including drug discovery. To improve the efficiency of these screens, we propose a new class of supersaturated designs that guide the construction of pools of compounds in each well. Because the size of the pools is typically limited by the particular application, the new designs accommodate this constraint and are part of a larger procedure that we call Constrained Row Screening, or CRowS. We develop an efficient computational procedure to construct CRowS designs, provide some initial lower bounds on the average squared off-diagonal values of their main-effects information matrix, and study the impact of the constraint on design quality. We also show via simulation that CRowS is statistically superior to the traditional one-compound-one-well approach as well as to an existing pooling method, and we provide results from two separate applications related to the search for solutions to antibiotic-resistant bacteria.

 

Optimizing User Experience in Statistical Tools through Experimental Design

Jacob Rhyne and Mark Bailey, JMP

Abstract: Modern statistical software is increasingly used by people with a wide range of statistical training to address complex real-world problems. However, developers of such software may not always consider assessing its usability. In this talk, we present case studies that assess the usability of statistical software using a design of experiments approach. Topics include user interaction with the software, correct interpretation of results, and accommodating users of varying expertise levels.

 

October 9th

Session 4 (8-9:30AM)

Predictive Model Monitoring and Assessment in the Lubrizol Q.LIFE® Formulation Optimization System

Kevin Manouchehri, Lubrizol

Abstract: Q.LIFE® is a comprehensive predictive system for solving complex formulation optimization problems. A key piece of this Lubrizol system is its suite of empirical predictive models. Depending on the application, prior knowledge, and the quality, quantity, and nature of the data, models are developed using techniques ranging from least-squares regression to complex ensemble models. The integration of Q.LIFE® into Lubrizol formulation strategy and practice has created the need for models to be assessed and monitored using control charts, residual checks, and internal algorithms that compare similar formulations. Lubrizol’s techniques for model assessment and monitoring as new data are generated, embedded within Q.LIFE®, are demonstrated, along with some of Lubrizol’s ideas and techniques for estimating the error around predictions regardless of model origin.

 

A Case Study in Image Analysis for Engine Cleanliness

Quinn Frank, Lubrizol

Abstract: Laboratory Engine Tests are used in the evaluation of engine oil quality and capability. Assessments range from chemical and physical analysis of the used oil after the completion of the test to the rating of test parts for sludge, varnish, and deposits. The rating of used parts after a test is complete is typically done by trained raters who examine pistons, rings, liners, and screens in a clean environment with standardized time windows, lighting, rating instructions, and other related criteria. Raters receive both training and calibration credentials in mandatory rating workshops where rater repeatability and reproducibility are measured and assessed. In these workshops, differences in ratings are also compared between actual test parts and digital photographs of test parts. When possible, the use of digital photographs is much more cost-effective for training and assessment because it eliminates the need to transport people and test parts around the world. In most cases, however, the ratings are more accurate for the actual test parts than for the photographs.

There is a unique opportunity in the rating of Oil Screen Clogging in the 216-hour Sequence VH Sludge Test for passenger car motor oils. Rater repeatability and reproducibility for both parts and photographs are considered poor. We therefore took the opportunity to conduct a study to generate data and develop a machine learning model and algorithm to rate the digital photographs. Our database consists of ratings by multiple raters on both parts and photographs for an array of clean, dusty, dirty, and completely clogged engine oil screens. Our case study culminates in a comparison of repeatability and reproducibility across three scenarios: raters rating actual parts, raters rating digital photographs, and models rating digital photographs.

 

The Power of Foldover Designs

Jonathan Stallrich, North Carolina State University

Abstract: The analysis of screening designs is often based on a second-order model with linear main effects, two-factor interactions, and quadratic effects. When the main effect columns are orthogonal to all the second-order terms, a two-stage analysis may be conducted, starting with fitting a main-effects-only model. A popular technique to achieve this orthogonality is to take any design and append its foldover runs. In this talk, we show that this foldover technique is even more powerful than originally thought because it also includes opportunities for unbiased estimation of the variance, either by pure error or by lack of fit. We find optimal foldover designs for main effect estimation and other designs that balance main effect estimation and model selection for the important factors. A real-life implementation of our new designs involving 8 factors and 20 runs is discussed.
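A hedged sketch of the foldover property itself (a generic random starting design, not the optimal designs from the talk): appending the sign-switched runs makes every main-effect column exactly orthogonal to the intercept, all two-factor interactions, and all quadratic columns.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
D = rng.choice([-1.0, 1.0], size=(10, 4))          # any starting two-level design
Df = np.vstack([D, -D])                            # append the foldover runs

second_order = [np.ones(len(Df))]                  # intercept
second_order += [Df[:, i] * Df[:, j] for i, j in combinations(range(4), 2)]   # two-factor interactions
second_order += [Df[:, i] ** 2 for i in range(4)]  # quadratic columns

cross = [[Df[:, k] @ s for s in second_order] for k in range(4)]
print(np.abs(cross).max())                         # 0.0: main effects orthogonal to all second-order terms
```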

 

Applications of Weibull Based Parametric Regression for Survival Data

Chad Foster and Sarah Burke, GE Aerospace

Abstract: Survival analysis is routinely used to characterize the life distribution of a part or system based on a time or usage parameter (e.g., hours, number of cycles, days since manufacture). When the population can be divided into groups, a consistent way is required to obtain reliability estimates for sub-groups that may differ only in damage accumulation rate and not in failure mechanism. In the aerospace industry, for instance, if data are limited to the number of flights but failure depends on the length of the flight, the population could be divided into two groups: domestic (short flights) and international (long flights). Separate probability distributions could be fit to each group, iterating to ensure a common shape parameter. Alternatively, the shape parameter and the two scale parameters could be fit through a life regression model.

In practice, more complex situations occur. Components of systems are frequently traded, moved, and reused numerous times throughout their lives. Reliability estimates are required to inform the timing of maintenance activities, part removal, or other field management actions. Life regression models can be useful for managing issues in the field while accounting for these complex populations.

In this presentation, the foundation and background of life regression models using the Weibull distribution are given and supplemented by several examples of field management in the aerospace industry. The method is contrasted with Cox proportional hazards regression, which does not require distributional assumptions but also does not provide survival time estimates. It is also highlighted that many real-world applications exhibit parameter covariance and thus require log-likelihood significance tests. This presentation will guide practitioners on how to use life regression models in actual complex situations.
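A hedged sketch of a Weibull life-regression fit with a common shape parameter and group-specific scales (simulated data, group labels, and a censoring point chosen for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
shape_true, scales_true = 2.0, np.array([800.0, 500.0])   # common shape, group-specific scales
group = rng.integers(0, 2, 300)                            # e.g., 0 = short flights, 1 = long flights
t = scales_true[group] * rng.weibull(shape_true, 300)
cens = t > 900                                             # right-censor at 900 (illustrative)
t = np.minimum(t, 900.0)

def negloglik(par):
    k = np.exp(par[0])                                     # common Weibull shape
    eta = np.exp(np.where(group == 0, par[1], par[2]))     # group-specific scales
    z = (t / eta) ** k
    logpdf = np.log(k / eta) + (k - 1) * np.log(t / eta) - z
    return -np.sum(np.where(cens, -z, logpdf))             # censored units contribute log-survival = -z

fit = minimize(negloglik, x0=[0.0, np.log(600), np.log(600)], method="Nelder-Mead")
print("shape:", np.exp(fit.x[0]), " scales:", np.exp(fit.x[1:]))
```

Likelihood-ratio comparisons between this fit and a model with a single shared scale are the kind of log-likelihood significance test the abstract mentions.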

 

Identifying Prognostic Variables across Multiple Historical Clinical Trials

Xueying Liu, Virginia Tech

Abstract: In clinical trials, it is critical to incorporate prognostic variables to improve precision for estimating treatment effects. Per FDA guidance, these prognostic variables need to be prospectively specified in the statistical analysis plan. Therefore, there is a great need to effectively identify prognostic variables from historical studies that are similar to the new study. In this work, we propose a multi-task learning approach to identify prognostic variables from multiple related historical clinical trials. Specifically, a bi-level constraint variable selection (Bi-CVS) method is developed to identify both study-specific (within-study) and program-level (cross-study) prognostic variables. In addition, we introduce an efficient algorithm specifically designed for clinical trial settings and investigate its consistency property. The performance of our proposed method is evaluated through simulations for various endpoints, including continuous, binary, and time-to-event. A real clinical application is also presented to illustrate our method.

 

Simulation-Based Approach to Designing Analytical Validation Studies to Assess the Stability of Biological Material

Dilsher Dhillon, Freenome

Abstract: In vitro medical devices are required to analytically validate key performance attributes in order to provide the Food and Drug Administration (FDA) with objective evidence of the safety and efficacy of the device. These attributes include, but are not limited to, limits of blank, detection, and quantitation, accuracy, imprecision, linearity, and stability. Experimental designs and standards established by the Clinical and Laboratory Standards Institute (CLSI) are preferred by the FDA to ensure consistency and transparency across medical devices. However, with novel and complex assay designs that did not exist when the relevant CLSI guidelines and associated examples were developed, tailored designs and power calculations may be needed. To illustrate this, we describe a study that evaluates the stability of biological material used for measuring molecular signatures. Currently, there is no standard published by CLSI for evaluating the stability of biological material. However, CLSI EP25Ed2E (Evaluation of Stability of In Vitro Medical Laboratory Test Reagents) provides guidelines on how to establish the stability of reagent materials and recommends using a linear regression of measurement vs. time. The described regression approach assumes that all of the residual variation in the measurements over time is due to measurement error, but this assumption is violated when evaluating the stability of biological material derived from more than one subject, thus impacting design requirements (e.g., sample size (replicates), number of samples and timepoints, and statistical power). In order to account for biological variability, we use linear mixed effects model simulations to conduct the power analysis. Using knowledge of the measurement error from prior studies and expected levels of variation across subjects, we simulate data from multiple time points. We then vary the slope (the increase or decrease in measurement across time) to estimate the type I and type II error rates, conditional on the number of subjects, timepoints, and replicates per timepoint. The results of the simulations show that the number of unique subjects and the number of timepoints have the greatest impact on increasing the power to detect a change in the slope. They also help provide guidance on the required number of unique samples as well as the number of time points and replicates per time point. This simulation-based power calculation and analysis still conforms to the recommended design and regression framework outlined in EP25Ed2E, and it thereby ensures consistency for FDA review. By applying known characteristics of the medical device and the data generating process, simulations provide the ability to modify existing designs to new contexts while still aligning with existing precedent.
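A hedged, minimal version of such a simulation-based power calculation (invented variance components, drift, and design settings; not the study's actual values):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_subj, times, reps = 10, [0, 3, 6, 9, 12], 2      # candidate design to evaluate
slope, sd_subj, sd_err = -0.02, 0.4, 0.3           # assumed drift and variance components

def one_sim():
    rows = []
    for s in range(n_subj):
        u = rng.normal(0, sd_subj)                 # between-subject (biological) variation
        for tm in times:
            for _ in range(reps):
                rows.append((s, tm, 10 + u + slope * tm + rng.normal(0, sd_err)))
    df = pd.DataFrame(rows, columns=["subject", "time", "y"])
    fit = smf.mixedlm("y ~ time", df, groups=df["subject"]).fit()
    return fit.pvalues["time"] < 0.05              # drift detected?

power = np.mean([one_sim() for _ in range(200)])
print("estimated power for this design:", power)
```

Repeating the loop with slope set to zero estimates the type I error rate, and varying n_subj, the number of timepoints, and reps maps out the design trade-offs described above.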

Session 5 (10-11:30AM)

 

Designing Experiments to Identify Optimal Operation Conditions for a Dynamic Cloth Media Primary Wastewater Treatment System

Madison De Boer, Baylor University

Abstract: Operation of wastewater treatment (WWT) processes is essential for protecting human health and the environment and is highly energy intensive, prompting the need for optimal operation that minimizes energy consumption while providing high-quality effluent. Our goal is to identify optimal operational setpoints for a primary cloth media filtration system under dynamic influent conditions. Machine learning techniques like reinforcement learning are gaining traction in WWT, but many facilities lack the automation necessary to adopt these advanced methods.

Here, we apply response surface methodology (RSM) paired with constrained optimization as a practical alternative. We target reducing effluent total suspended solids (TSS) to enhance primary effluent water quality, reducing backwashes per hour to minimize energy consumption, and monitoring tank level changes to account for long-term performance of the filter. RSM is used to identify optimal input settings, maximum tank level, and influent flow rate. The system is tested under various setpoints. Modeling effluent TSS per cycle with a second-order model achieves an R^2 exceeding 75%, demonstrating strong predictive performance.

Under fixed influent flow, optimized setpoints improve filter operation. In varied flow scenarios, the approach enhances TSS removal and long-term filter performance.

 

Microstructure-based Statistical Tests for Material State Comparison

Simon Mason, The Ohio State University

Abstract: In materials science, material properties and performance are heavily tied to the microstructure of materials, that is, the myriad features present at multiple length scales. The development of new and improved industrially important materials relies upon our ability to meaningfully capture and quantify characteristics of these microstructural features. The natural variation in microstructures across samples of a given material suggests a theoretical probability distribution over these patterns, which may be used for formulating tests of statistical hypotheses. The non-Euclidean structure of these objects, however, prevents the use of standard non-parametric tests of homogeneity such as Kolmogorov-Smirnov or Cramer-von Mises. We combine a new approach for metric distribution function-based testing with the development of quantitative descriptors to establish metric distances between microstructure samples. We show that, for a materials domain, this test can be used to determine resolvability limits between neighboring material states in terms of processing parameters, differentiating between similar microstructures. We further examine its use as a tool for recognizing and distinguishing deep-learning-generated microstructures from physics-generated images.

 

Optimal Robust Designs with Both Centered and Baseline Factors

Xietao Zhou, King’s College London

Abstract: Traditional optimal designs are optimal under a pre-specified model. When the final fitted model differs from the pre-specified model, traditional optimal designs may cease to be optimal, and the corresponding parameter estimators may have larger variances. The Q_B criterion has been proposed to offer the capacity to consider hundreds of alternative models that could potentially be useful for data from a multifactor design.

Recently, an alternative parameterization of factorial designs called the baseline parameterization has been considered in the literature. It has been argued that such a parameterization arises naturally if there is a null state of each factor, and the corresponding minimum K-aberration has been explored. In our previous work, we have generalized the Q_B criterion to apply to the baseline parameterization, and it has been shown that the optimal designs found can be projected on more eligible candidate models than the minimal K-aberration design for various specified prior probabilities of main effects and two-factor interactions being in the best model.

In the present work, we have extended the Q_B criterion to the scenario in which eligible candidate models contain both baseline and centered parameterization factors. This is of practical interest when some of the factors naturally have a reasonable null state, alongside other factors whose levels are equally important and are more naturally represented under the centered parameterization. We have compared our optimal designs with their counterparts in the most recent literature and have shown that both the projection capacity over eligible candidate models and the accuracy of estimation of models in terms of the A_s criterion can be improved when the number of runs in the experiment is a multiple of 4. We have also examined and solved the same problem with no restrictions on the number of runs, so that the approach can be applied in a more general way in practice.

The basic framework of the Q_B criterion and its variation for the baseline parameterization will be briefly discussed, followed by a detailed explanation of the new version dealing with factors under both parameterizations, and finishing with an evaluation of the robustness and accuracy of the Q_B optimal designs we have found.

 

Robust Parameter Designs Constructed from Hadamard Matrices

Yingfu Li, University of Houston – Clear Lake

Abstract: The primary objective of robust parameter design is to identify the optimal settings of control factors in a system to minimize the response variance while achieving an optimal mean response. This article investigates fractional factorial designs constructed from Hadamard matrices of orders 12, 16, and 20 to meet the requirements of robust parameter design. These designs allow for the estimation of critical factorial effects, including all control-by-noise interactions and the main effects of both control and noise factors, while saving experimental runs and often providing better estimation of other potentially important interactions. Top candidates for various combinations of control and noise factors are provided, offering practical choices for efficient and resource-constrained experimental designs with minimal runs.

 

Exploratory Image Data Analysis for Quality Improvement Hypothesis Generation

Theodore T. Allen, The Ohio State University

Abstract: Images can provide critical information for quality engineering. Exploratory image data analysis (EIDA) is proposed here as a special case of EDA (exploratory data analysis) for quality improvement problems with image data. The EIDA method aims to obtain useful information from the image data to identify hypotheses for additional exploration relating to key inputs or outputs. The proposed four steps of EIDA are: (1) image processing, (2) image-derived quantitative data analysis and display, (3) salient feature (pattern) identification, and (4) salient feature (pattern) interpretation. Three examples illustrate the methods for identifying and prioritizing issues for quality improvement, identifying key input variables for future study, identifying outliers, and formulating causal hypotheses.

 

Boundary Peeling: An Outlier Detection Method

Maria L. Weese, Miami University of Ohio

Abstract: Unsupervised outlier detection constitutes a crucial phase within data analysis and remains an open area of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce Boundary Peeling, an unsupervised outlier detection algorithm. Boundary Peeling uses the average signed distance from iteratively peeled, flexible boundaries generated by one-class support vector machines to flag outliers. The method is similar to convex hull peeling but well suited for high-dimensional data and has flexibility to adapt to different distributions. Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In unimodal and multimodal synthetic data simulations, Boundary Peeling outperforms all state-of-the-art methods when no outliers are present, while maintaining comparable or superior performance in the presence of outliers. Boundary Peeling also performs competitively or better in terms of correct classification, AUC, and processing time on semantically meaningful benchmark datasets.
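A hedged sketch of the peeling idea (a simplified loop with arbitrary settings, not the authors' implementation or its ensemble version):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (300, 5)),
               rng.normal(5, 1, (5, 5))])          # 300 inliers plus 5 planted outliers (rows 300-304)

scores = np.zeros(len(X))
remaining = np.arange(len(X))
for _ in range(5):                                 # a few peels (illustrative)
    svm = OneClassSVM(nu=0.1, gamma="scale").fit(X[remaining])
    scores += svm.decision_function(X)             # signed distance of every point to this boundary
    inside = svm.decision_function(X[remaining]) > 0
    remaining = remaining[inside]                  # peel off points on or outside the boundary
    if len(remaining) < 20:
        break

flagged = np.argsort(scores)[:5]                   # most negative accumulated signed distance
print("flagged observations:", sorted(flagged.tolist()))   # typically the planted rows 300-304
```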

Session 6 (1:30-3PM)

 

Integrating SPC, DOE, and AI/ML for Enhanced Quality

Daksha Chokshi, StatQualTech

Abstract: In today’s competitive landscape, continuous improvement is essential for achieving operational excellence and maintaining a competitive edge. This presentation explores the synergistic integration of Statistical Process Control (SPC), Design of Experiments (DOE), and Artificial Intelligence (AI) to develop a more adaptive and intelligent approach to process optimization and decision-making. SPC establishes a robust framework for monitoring and controlling process variability, ensuring consistent product quality. DOE offers a systematic approach to experimenting with process parameters, identifying optimal conditions that enhance performance. The incorporation of AI/ML further strengthens these traditional methodologies by enabling predictive analytics, anomaly detection, pattern recognition, and automated optimization. By combining these approaches, organizations can transition from reactive quality control to a proactive, data-driven strategy that drives self-learning process improvements.

Several case studies and practical applications will be discussed to illustrate how this triad of methodologies fosters a culture of continuous improvement, empowering organizations to achieve higher levels of productivity, quality, and innovation. The presentation will conclude with an exploration of challenges, implementation strategies, and future directions for AI-driven continuous improvement. 

 

Using Input-Varying Weights to Determine a Soft Changepoint in Mixture Distributions

Di Michelson, Caleb King and Don McCormack, JMP

Abstract: It is quite common for data to come from populations that actually consist of two or more subpopulations. For example, the lifetime distribution of a product may actually consist of multiple distributions depending on specific failure modes. The typical approach in these instances is to use a mixture distribution, where the likelihood of each observation is a weighted combination of several distribution models. These weights may be constant or a function of covariates. Another approach is to consider a changepoint where the distribution model makes a sudden change from one model to another. In this talk, we propose an approach that falls between these two extremes. Instead of a hard changepoint, we use a probit or logistic model that allows the mixture proportion to vary over the range of the variable, with the point at which the mixture is evenly split serving as a “soft changepoint”. We illustrate this new approach using data from an industrial application.
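A hedged sketch of a logistic-weight mixture fit of this general kind (simulated data, normal components, and starting values chosen for illustration; not the JMP implementation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 600)
w_true = 1 / (1 + np.exp(-(x - 6.0) / 0.8))        # mixing weight drifts with x; true soft changepoint at 6
comp2 = rng.random(600) < w_true
y = np.where(comp2, rng.normal(5, 1, 600), rng.normal(0, 1, 600))

def negloglik(par):
    m1, m2, ls1, ls2, c, lr = par
    w = 1 / (1 + np.exp(-(x - c) / np.exp(lr)))    # logistic-varying mixture proportion
    dens = (1 - w) * norm.pdf(y, m1, np.exp(ls1)) + w * norm.pdf(y, m2, np.exp(ls2))
    return -np.sum(np.log(dens + 1e-300))

fit = minimize(negloglik, x0=[0.0, 4.0, 0.0, 0.0, 5.0, 0.0],
               method="Nelder-Mead", options={"maxiter": 10_000, "maxfev": 10_000})
print("estimated soft changepoint (where the weight equals 1/2):", fit.x[4])
```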

 

Monotonic Warpings for Additive and Deep Gaussian Processes

Steven D. Barnett, Virginia Tech

Abstract: Gaussian processes (GPs) are canonical as surrogates for computer experiments because they enjoy a degree of analytic tractability. But that breaks when the response surface is constrained, say to be monotonic. Here, we provide a mono-GP construction for a single input that is highly efficient even though the calculations are non-analytic. Key ingredients include transformation of a reference process and elliptical slice sampling. We then show how mono-GP may be deployed effectively in two ways. One is additive, extending monotonicity to more inputs; the other is as a prior on injective latent warping variables in a deep Gaussian process for (non-monotonic, multi-input) non-stationary surrogate modeling. We provide illustrative and benchmarking examples throughout, showing that our methods yield improved performance over the state-of-the-art on examples from those two classes of problems.

 

Deep Gaussian Processes for Estimation of Failure Probabilities in Complex Systems

Annie S. Booth, Virginia Tech

Abstract: We tackle the problem of quantifying failure probabilities for expensive deterministic computer experiments with stochastic inputs. The computational cost of the computer simulation prohibits direct Monte Carlo (MC) and necessitates a statistical surrogate model, turning the problem into a two-stage enterprise (surrogate training followed by probability estimation). Limited evaluation budgets create a design problem: how should expensive evaluations be allocated between and within the training and estimation stages? One may relegate all simulator evaluations to greedily train the surrogate, with failure probabilities then estimated from “surrogate MC”. But extended surrogate training offers diminishing returns, and surrogate MC relies too stringently on surrogate accuracy. Alternatively, a surrogate trained on a fraction of the simulation budget may be used to inform importance sampling, but this is data hungry and can provide erroneous results when budgets are limited. Instead we propose a two-stage approach: sequentially training Gaussian process (GP) surrogates through contour location, halting training once learning of the failure probability has plateaued, then employing a “hybrid MC” estimator which combines surrogate predictions in certain regions with true simulator evaluations in uncertain regions. Our unique two-stage design strikes an appropriate balance between exploring and exploiting and outperforms alternatives, including both of the aforementioned approaches, on a variety of benchmark exercises. With these tools, we are able to effectively estimate small failure probabilities with only hundreds of simulator evaluations, showcasing functionality with both shallow and deep GPs, and ultimately deploying our method on an expensive computer experiment of fluid flow around an airfoil.

 

Predictive Modeling for Patient Care using AI in a Secure Environment

Sunil Mathur, Boston Medical Center

Abstract: Predictive modeling for patient care using AI in a secure environment requires a combination of advanced machine learning techniques, robust data security measures, and ethical considerations to ensure patient privacy and regulatory compliance. AI models, such as deep learning and ensemble methods, can then be trained on large, de-identified datasets to predict disease risks, recommend personalized treatments, and optimize hospital resource allocation. We propose to use federated learning, allowing AI models to learn from decentralized data without transferring sensitive patient information across networks. Additionally, we propose to use multi-party computation and homomorphic encryption to enable computations on encrypted data, ensuring confidentiality even during AI processing. Real-time anomaly detection systems using AI will help to identify cybersecurity threats, such as unauthorized access or data breaches, further strengthening the secure environment. Finally, explainable AI (XAI) techniques will be integrated to ensure model transparency and clinician trust, allowing healthcare professionals to interpret AI-driven recommendations while maintaining accountability. By implementing privacy-preserving AI methodologies, robust encryption, and continuous monitoring, predictive modeling can transform patient care while upholding the highest security and ethical standards.

 

Dealing with Sample Bias—Alternative Approaches and the Fundamental Questions They Raise

Frederick W. Faltin, Virginia Tech

Abstract: Now and then we hear on the news of some data analysis (generally attributed to “AI”), the outcome of which has gone badly wrong. The root cause is nearly always found to have been some form of sample bias, or its flipside, unintended extrapolation. Awareness of such issues is generally high in the statistical community, but is often very much less so among data scientists more broadly. Statisticians have responded by developing and promoting several very useful lines of research for countering sample bias in observational data. This expository talk presents an overview of some of these approaches, as well as of algorithmic means being developed in other fields to adjust for bias or extrapolation in the process of fitting machine learning models. These alternative approaches raise fundamental questions about what purpose(s) the models being developed are intended to serve, and how our analysis approach needs to adapt to answer the “right” question.

 

Modeling Intentional Test to Failure as Basis for Assay Positive Control Limits

James Garrett, James Garrett LLC

Abstract: Clinical diagnostic assays include a positive run control sample whose result must fall within limits in order to validate that the assay is capable of returning a positive result. These “control limits” are often estimated as normal ranges based on limited historical data, which is fraught in many ways. I propose an experimental design approach in which the diagnostic system is perturbed to failure with both a low-positive reference sample and the positive control, and I apply distributional regression modeling to interpolate the positive control result at which system performance with the reference sample undeniably begins to degrade. For distributional modeling, I compare quantile regression and GAM-LSS methods. I demonstrate with a simulated data set inspired by a real example.

 

Introducing Continuous Restrictions into Spatial Models via Gaussian Random Fields with Linear Boundary Constraints

Yue Ma, The Ohio State University

Abstract: Boundary constraints are extensively used in physical, environmental and engineering models to restrict smooth states (e.g., temperature fields) to follow known physical laws. Examples include fixed-state or fixed-derivative (insulated) boundaries, and boundaries which relate the state and the derivatives (e.g., convective boundaries). Gaussian random fields (GRFs), as flexible, non-parametric models, are widely applied to recover smooth states from discrete spatial measurements across a domain. We formally define boundary-constrained random fields and introduce a representation-based approach to fully enforce linear boundary constraints on GRFs over multi-dimensional, convex domains. This new class of boundary-constrained random fields can be used for recovering smooth states, with known physical mechanisms working at the domain boundaries. Such constrained random fields make flexible priors for modeling smooth states, enable data-driven discovery of dynamic systems, and improve performance and uncertainty quantification of probabilistic solvers for differential equations.

 

Scale-Location-Truncated Beta Regression: Expanding Beta Regression to Accommodate 0 and 1

Mingang Kim, Virginia Tech

Abstract: Beta regression is frequently used when the outcome variable y is bounded within a specific interval, transformed to the (0, 1) domain if necessary. However, standard beta regression cannot handle data observed at the boundary values of 0 or 1, as the likelihood function takes on values of either 0 or ∞. To address this issue, we propose the Scale-Location-Truncated (SLT) beta regression model, which extends the beta distribution’s domain to the [0, 1] interval. By using a scale-location transformation and truncation, the SLT beta distribution assigns positive finite mass to the boundary values, offering a flexible approach for handling values at 0 and 1.

In this paper, we demonstrate the effectiveness of the SLT beta model in comparison to standard beta regression models and other approaches like the Zero-One Inflated Beta (ZOIB) model [Liu and Kong, 2015] and XBX regression [Kosmidis and Zeileis, 2024]. Using empirical and simulated data, we compare the performance, including predictive accuracy, of the SLT beta model with other methods, particularly in cases with observed boundary data values for y. The SLT beta model is shown to offer greater flexibility, supporting both linear and nonlinear relationships. Additionally, we implement the SLT beta model within classical and Bayesian frameworks, employing both hierarchical and non-hierarchical models. This comprehensive implementation demonstrates its broad applicability for modeling bounded data in a range of contexts.

 

Pitfalls and Remedies for Maximum Likelihood Estimation of Gaussian Processes

Ayumi Mutoh, North Carolina State University

Abstract: Gaussian processes (GPs) are nonparametric regression models favored for their nonlinear predictive capabilities, making them popular as surrogate models for computationally expensive computer simulations. Yet, GP performance relies heavily on effective estimation of unknown kernel hyperparameters. Maximum likelihood estimation is the most common tool of choice, but it can be plagued by numerical issues in small data settings. Penalized likelihood methods attempt to overcome likelihood optimization challenges, but their success depends on tuning parameter selection. Common approaches select the penalty weight using leave-one-out cross validation (CV) with root mean squared error (RMSE). Although this method is easy to implement, it is computationally expensive and ignores the uncertainty quantification (UQ) provided by the GP. We propose a novel tuning parameter selection scheme which combines k-fold CV with a score metric that accounts for GP predictive performance and UQ. Additionally, we incorporate a one-standard-error rule to encourage smoother predictive surfaces in the face of limited data, which remedies flat likelihood issues. Our proposed tuning parameter selection for GPs matches the performance of standard MLE when no penalty is warranted, excels in settings where regularization is preferred, and outperforms the benchmark leave-one-out CV with RMSE.

 

The Current Role of Acceptance Sampling in Pharmaceutical Manufacturing

Alson Look, Regeneron

Abstract: The purpose of this presentation is to provide an overview of acceptance sampling as used in pharmaceutical manufacturing. Specifically, we will discuss: (1) a brief history of acceptance sampling; (2) single sampling attributes plans, which will be the focus; (3) Operating Characteristic curves, their importance, and their relationship to Acceptable Quality Levels (AQL) and Lot Tolerance Percent Defective (LTPD); (4) how sample sizes are selected using a statistical software package such as Minitab; and (5) the implications of different sample sizes (risk to internal customers/patients, etc.), different strategies, possible solutions, and open problems.
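A hedged sketch of the OC-curve calculation for a single-sampling attributes plan (an illustrative plan, AQL, and LTPD; not values recommended in the talk):

```python
from scipy.stats import binom

n, c = 125, 3                                      # illustrative single-sampling plan: sample n, accept if <= c defectives
for p in (0.005, 0.01, 0.02, 0.05, 0.08):          # lot fraction defective
    print(f"p = {p:.3f}: probability of acceptance = {binom.cdf(c, n, p):.3f}")

aql, ltpd = 0.01, 0.05                             # illustrative quality levels
print("producer's risk at AQL :", round(1 - binom.cdf(c, n, aql), 3))
print("consumer's risk at LTPD:", round(binom.cdf(c, n, ltpd), 3))
```

Searching over n and c to hit target risks at the AQL and LTPD is essentially what statistical software such as Minitab automates.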

 

Monitoring Functional Anomalies in a Water Treatment Process

Hunter Privett, Baylor University

Abstract: This research is intended for an intermediate statistical audience, with a focus on functional data analysis, outlier detection applied to process control, and water and wastewater treatment data. Batch processes in water and wastewater treatment (W/WWT) often produce data with a repetitive, functional pattern. Detecting faults in these systems is important both for the quality of the system’s effluent water and to prevent system damage. However, the expected nonstationary changes in cyclical behavior over time within these systems and unique treatment process parameters between facilities make fault detection challenging. Some case studies have been done to retroactively assess the efficacy of fault detection in these systems, but case studies are limited in that they rely on assumptions about when a fault may be occurring, rather than a controlled and known fault. In this work, we use recently developed approaches to simulating functional W/WWT datasets in order to compare different process monitoring methods when applied to W/WWT systems for several controlled, simulated faults, and we develop a new real-time monitoring method based on metrics of functional outlyingness. First, a dataset is simulated using functions from an observed ultrafiltration system as a reference, and the functions are contaminated with one of four different types of faults after a period of normal operation. Then, four different fault detection methods are applied, and true positive and false alarm rates are calculated. The first two methods applied are traditional Shewhart charts for each individual measurement using either the original raw values or globally detrended values. The next two methods are based on the functional structure of the data applied to the detrended data. The first is a method we develop that incorporates metrics of a function’s directional outlyingness into a T2 chart. The last method uses a set of T2 and SPE charts based on functional PCA decomposition of the functions. These methods are then compared for four types of faults that include changes to the shape or magnitude of the functions. We demonstrate the importance of accounting for global behavior in W/WWT systems when performing fault detection to increase accuracy. We also demonstrate when a functional-based method improves over a traditional method that ignores the functional structure.
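A hedged sketch of a generic FPCA-score T2 chart of the kind compared above (simulated cycles and an arbitrary drift fault; not the ultrafiltration data or the directional-outlyingness method):

```python
import numpy as np

rng = np.random.default_rng(10)
grid = np.linspace(0, 1, 50)
def cycle(drift=0.0):
    return np.sin(2 * np.pi * grid) + drift * grid + rng.normal(0, 0.05, 50)

train = np.array([cycle() for _ in range(100)])    # in-control batch cycles
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
k = 3                                              # retained functional principal components
scores = (train - mean) @ Vt[:k].T
cov = np.cov(scores, rowvar=False)

def t2(curve):
    z = (curve - mean) @ Vt[:k].T                  # project a new cycle onto the retained components
    return z @ np.linalg.solve(cov, z)

limit = np.quantile([t2(c) for c in train], 0.99)  # empirical control limit from the training cycles
print("in-control cycle T2:", round(t2(cycle()), 2), " limit:", round(limit, 2))
print("drifting cycle T2  :", round(t2(cycle(drift=1.0)), 2))
```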