
Introduction
Hyperparameter tuning or optimization is crucial in machine learning model development, significantly impacting model generalization performance, which is how well the model performs when predicting new, unseen data. Unlike model parameters, which the model estimates from the dataset during training, hyperparameters are established before training and govern the model learning process (ES & Bajaj, 2022). To achieve the best generalization performance, it is essential to carefully select the model’s hyperparameters so that the model neither overfits (performing well on the training data but poorly on new data) nor underfits (performing poorly on both the training data and new data) and performs optimally when predicting unseen data. The model is trained over a defined, discretized hyperparameter search space or grid, using different combinations of hyperparameter values, and its performance is evaluated using a validation set or cross-validation. The combination of hyperparameter values that produces the best performance on the validation set is chosen as the optimal set of hyperparameters and usually leads to the best generalization results for the model. The best search strategies are those that quickly identify the most promising search areas, leading to better models in limited training and testing time (Coelho, 2020).
When defining the discretized hyperparameter grid, an essential consideration is identifying which hyperparameters are the most important to focus on, because they have the greatest influence on the model generalization performance. It has been shown that, for a given machine learning task, the majority of the variation in model performance can be attributed to only a few hyperparameters; thus, changing the values of these hyperparameters is likely to make a much more significant difference in model performance (Coelho, 2020; Hutter et al., 2014; van Rijn & Hutter, 2018). Focusing on tuning the values of the most important hyperparameters therefore reaps the greatest benefits by narrowing the search space. Once the search space is defined, another consideration is how the optimal hyperparameter search is conducted. Search strategies vary in sophistication from a manual and uninformed approach to an automated and informed search. With informed search, each iteration of the search learns from previous iterations. The commonly mentioned search approaches, such as random and grid search, are uninformed, whereas approaches based on Bayesian optimization are informed.
Random and grid search are two commonly used techniques for hyperparameter optimization. Once the search space is defined as a discretized grid of possible hyperparameter values, a random search evaluates a random combination of hyperparameters from the grid in each iteration, for a predetermined number of iterations, to identify the combination that produces the optimal model generalization estimate. Conversely, a grid search systematically enumerates all grid combinations to determine the combination that produces the optimal model generalization estimate. Random search offers improved efficiency, but there is a risk of missing the globally optimal parameter combination because it does not take advantage of the structure of the search space. Grid search is computationally demanding, especially when dealing with large datasets, using a model with many hyperparameters, or when the search space is very granularly discretized, resulting in many hyperparameter combinations. In scikit-learn, both RandomizedSearchCV and GridSearchCV automatically support K-fold cross-validation for classification and regression problems (Buitinck et al., 2013; Pedregosa et al., 2011; scikit-learn, n.d.). There are other hyperparameter search techniques, such as Bayesian optimization, which are designed to take advantage of the structure of the search space to produce an optimal hyperparameter combination more quickly (Bischl et al., 2021; Chen, 2021; Saraswathi, 2023).
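For illustration, the short sketch below shows how these two uninformed searches are typically invoked in scikit-learn with built-in K-fold cross-validation; the estimator, grid, and synthetic data are placeholders and are not part of the experiment described later.

```python
# Illustrative only: uninformed search in scikit-learn with built-in K-fold CV.
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=500, n_features=10, random_state=42)

# Grid search: exhaustively evaluates every combination in the discretized grid.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 7, None]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)

# Random search: samples a fixed number of combinations (here 8) from the
# specified lists or distributions, again evaluated with 5-fold CV.
param_dist = {"n_estimators": randint(50, 300), "max_depth": [3, 5, 7, None]}
rand = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_dist,
                          n_iter=8, scoring="neg_mean_squared_error", cv=5,
                          random_state=42)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)
```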
Hyperparameter optimization techniques can be combined to leverage their strengths and mitigate their limitations. Here, we propose a coarse-to-fine approach: a random search with a very coarsely defined search space in step 1, combined with Bayesian optimization in step 2, which uses a finely defined and narrow search space based on the output of the random search and serves as a pragmatic replacement for a grid search. There are several implementations of Bayesian optimization, for example, a customized Hyperopt-based optimizer called HyperoptSearchCV, another Hyperopt adaptation called HyperoptEstimator from Hyperopt-Sklearn, and a Bayesian optimizer called BayesSearchCV from Scikit-Optimize, all of which support cross-validation and are pragmatic substitutes (Bergstra, n.d.-b, n.d.-a; Brownlee, 2020; Komer et al., 2014, 2019; Scikit-Optimize, n.d.; Vichaar, 2023).
Coarse-To-Fine Hyperparameter Optimization
The coarse-to-fine approach is a pragmatic approach that simplifies the task of specifying model hyperparameter ranges, which is often arbitrary and dependent on experience. It also supports automating the hyperparameter optimization process by integrating the search for a promising search space region, thereby enhancing the efficiency of the process. This is achieved by leveraging the initial broad search capability of RandomizedSearchCV and the fine-tuning precision of either Hyperopt or BayesSearchCV. The proposed approach mitigates the dependence on guesswork and intuition often associated with defining a discretized hyperparameter search space and leads to more data-driven, systematic, and effective model tuning.
In summary, the coarse-to-fine hyperparameter tuning approach consists of two stages. First, in step 1, a coarse random search is conducted over a broadly specified or configured hyperparameter grid of the most important model hyperparameters. Step 1 aims to identify a promising region in the configured hyperparameter space where the globally optimal combination of hyperparameter values is likely to be located. Second, in step 2, a more detailed search is conducted using Bayesian optimization over a finer hyperparameter grid derived from the promising region identified in step 1. Step 2 aims to efficiently explore a smaller, more targeted, or granular search space, thus enhancing the possibility of identifying a globally optimal hyperparameter combination without the onerous computational expense of an exhaustive grid search.
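The following is a minimal sketch of the two-step idea, assuming an XGBoost regressor, scikit-learn's RandomizedSearchCV for step 1, and Scikit-Optimize's BayesSearchCV for step 2; the parameter names, ranges, synthetic data, and the around() helper are illustrative assumptions rather than the exact configuration used in the experiment described below.

```python
# Hedged sketch of the coarse-to-fine idea; ranges and helper are illustrative.
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBRegressor

X_train, y_train = make_regression(n_samples=400, n_features=10, random_state=0)

# Step 1: coarse random search over a broadly configured space.
broad_space = {
    "learning_rate": uniform(0.01, 0.5),   # samples from [0.01, 0.51]
    "max_depth": randint(3, 18),           # integers in [3, 17]
    "subsample": uniform(0.5, 0.5),        # samples from [0.5, 1.0]
}
step1 = RandomizedSearchCV(XGBRegressor(), broad_space, n_iter=25,
                           scoring="neg_mean_squared_error", cv=5,
                           random_state=42)
step1.fit(X_train, y_train)
best = step1.best_params_                  # centre of the promising region

def around(value, frac, low, high):
    """Narrow interval spanning +/- frac of the configured range, clipped."""
    span = frac * (high - low)
    return max(low, value - span), min(high, value + span)

# Step 2: Bayesian optimization over a finer space around the step 1 output.
lo, hi = around(best["max_depth"], 0.1, 3, 17)
narrow_space = {
    "learning_rate": Real(*around(best["learning_rate"], 0.1, 0.01, 0.51)),
    "max_depth": Integer(int(round(lo)), int(round(hi))),
    "subsample": Real(*around(best["subsample"], 0.1, 0.5, 1.0)),
}
step2 = BayesSearchCV(XGBRegressor(), narrow_space, n_iter=30,
                      scoring="neg_mean_squared_error", cv=5, random_state=42)
step2.fit(X_train, y_train)
print(step2.best_params_, -step2.best_score_)
```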
Experiment Setup
An experiment was conducted to evaluate the proposed coarse-to-fine approach. The extreme gradient boosting (XGBoost) regression model, implemented via the scikit-learn API, was selected as the estimator for the experiment. The XGBoost model was chosen because it is a popular gradient-boosting algorithm known for its high performance and efficiency in machine-learning problems. A movie rentals dataset containing 15,861 rows and 14 features was used, adapted from the DataCamp DVD movie rental duration prediction project (Che, 2024b; DataCamp, n.d.). The dataset was already cleaned, preprocessed, and split (80:20) into training and test sets, ready for model development. Figure 1 summarizes the dataset features.

Here, the model objective was to predict the movie rental duration in days. Model performance was assessed using the Mean Squared Error (MSE) metric, which measures the average of the squared errors or deviations, and the goal was a model with an MSE < 3.0. A broad range of XGBoost hyperparameters was used in the experiment, selected from among the most important hyperparameters most frequently reported in other studies, as shown in Table 1 (Bartz et al., 2021; Bischl et al., 2021; Coelho, 2020). For instance, in an earlier classification study, the learning rate was found to be the most important XGBoost hyperparameter, followed by subsample and minimum child weight, as seen in Figure 2 (Coelho, 2020). The broadly specified hyperparameter grid configured for step 1 was based on the hyperparameter ranges in Figure 2.
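To make the setup concrete, a minimal baseline sketch is shown below; the file name and target column are hypothetical placeholders, and the split is recreated only to keep the example self-contained (the experiment used the dataset's existing 80:20 split).

```python
# Illustrative baseline (file name and target column are hypothetical).
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("rental_data.csv")                   # placeholder file name
X = df.drop(columns=["rental_duration_days"])         # placeholder target name
y = df["rental_duration_days"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
baseline = XGBRegressor(random_state=42)
baseline.fit(X_train, y_train)
mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"Baseline MSE: {mse:.2f}")                     # target: MSE < 3.0
```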


A baseline model was developed for each of two Hyperopt-based models, identified as “hyperoptsk” for Hyperopt-Sklearn and “hyperoptcv” for HyperoptSearchCV, and a Scikit-Optimize-based model identified as “skoptimize”. Several optimization runs were then conducted to explore model performance using the coarse-to-fine search approach. Two independent variables were of interest: the hyperparameter grid size and the grid resolution. The grid size was defined by a scaling factor between 0.1 and 1.0. Given a hyperparameter configured for step 1, such as max_depth, with values in the range [3, 17], and a promising max_depth value of 7 generated by the step 1 search, the following lines would define the span of the promising region around the generated value:
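The original listing is not reproduced here; the following is a minimal sketch of the idea, assuming the span is derived from the standard deviation of the configured range, scaled by the grid-size factor and clipped to the configured bounds (the exact cheutils implementation may differ).

```python
import numpy as np

# Hedged sketch, not the exact cheutils implementation: derive a narrow
# promising region around the step 1 value from the standard deviation of the
# configured range, scaled by the grid-size factor and clipped to the bounds.
param_bounds = (3, 17)      # configured range for max_depth
promising_value = 7         # promising value suggested by the step 1 search
grid_size = 0.5             # grid-size scaling factor (between 0.1 and 1.0)
grid_resolution = 5         # number of entries in the narrow grid

std_dev = grid_size * np.std(np.arange(param_bounds[0], param_bounds[1] + 1))
low = max(param_bounds[0], promising_value - std_dev)
high = min(param_bounds[1], promising_value + std_dev)
narrow_grid = np.unique(np.linspace(low, high, grid_resolution).round().astype(int))
print(low, high, narrow_grid)   # span ~[4.84, 9.16] -> [5 6 7 8 9]
```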
The grid resolution defines the number of entries in the narrow promising grid, which is narrow because the derived standard deviation is always smaller than the range of the configured parameter. Several optimization runs were conducted for a range of grid sizes ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]), grid resolutions ([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]), and each of the configured model types, using the narrow grid generated from step 1. For each run, the model type (i.e., “hyperoptsk”, “hyperoptcv”, or “skoptimize”), model performance score (MSE), grid size, grid resolution, and optimization duration were recorded, resulting in 390 rows of data. Furthermore, data was also collected to test model performance when a narrow grid was not used in step 2; in that case, the step 2 grid defaulted to the same configured grid used as input to step 1. The optimization duration referred only to the step 2 processing time in seconds. The promising regions generated in step 1 were cached appropriately to improve efficiency. The source code to support the routine model development was bundled in a custom module called “cheutils”, which can be installed from PyPI as indicated below, along with other relevant dependencies (Che, 2024a):
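A minimal install sketch is shown below; cheutils is the package named in the text, while the additional packages are the dependencies assumed by the examples in this article rather than an exact list from the original.

```
pip install cheutils
pip install xgboost scikit-learn hyperopt hpsklearn scikit-optimize
```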
The baseline model produced an MSE value of 1.94, which met the target performance goal (i.e., MSE < 3.0). Several hypotheses were tested in the experiment.
H1: Using a targeted narrow grid in step 2 results in a smaller MSE or improved model performance
H2: A targeted narrow grid in step 2 results in a shorter processing time or optimization duration
H3: A smaller grid size results in a smaller MSE
H4: A smaller grid size results in a shorter processing time
H5: Grid resolution does not impact model performance
H6: Grid resolution does not impact processing time
Discussion of Findings
Using a narrow grid resulted in improved model performance versus not using a narrow grid, as seen in Figure 3, where the MSE was lower when using a narrow grid. However, the median MSE values were poorer than the baseline model performance when not using the narrow grid, which is not surprising since, in that case, step 2 simply used the default configured grid from step 1; the median MSE values were HyperoptSK (1.96), HyperoptCV (1.98), and Skoptimize (4.33). Notably, model performance improved over the baseline for all three models when using the narrow grid: the median MSE values, rounded, were HyperoptSK (1.93), HyperoptCV (1.91), and Skoptimize (1.92). There was no significant variability in the HyperoptCV and Skoptimize performance. A pairwise t-test, used to determine whether the mean difference between two sets of observations is zero, confirmed a significant improvement in model performance (p-value < 0.05) between using the narrow grid and not doing so for all three models. Therefore, the null hypothesis was rejected, and H1 was accepted.
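For reference, the paired comparison can be run with a standard paired t-test as sketched below; the MSE arrays are placeholder values, not the experiment's data.

```python
# Paired (pairwise) t-test sketch; the MSE values below are placeholders.
from scipy.stats import ttest_rel

mse_with_narrow_grid = [1.93, 1.91, 1.92, 1.94, 1.90]     # placeholder values
mse_without_narrow_grid = [1.96, 1.98, 1.97, 2.01, 1.99]  # placeholder values
t_stat, p_value = ttest_rel(mse_with_narrow_grid, mse_without_narrow_grid)
print(t_stat, p_value)  # p < 0.05 -> reject the null of zero mean difference
```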

In terms of optimization or processing time, as shown in Figure 4, there was no discernible difference in processing time for the HyperoptSK model. This was also confirmed by a pairwise t-test, which produced a p-value > 0.05; hence, the null hypothesis could not be rejected. However, for the HyperoptCV model, there was a significant reduction in processing time when using narrow grids; a pairwise t-test produced a p-value < 0.05. For the Skoptimize model, by contrast, there was a significant increase in processing time when using the narrow grids; a pairwise t-test produced a p-value < 0.05. The differences could be explained by differences in each model’s underlying search space sampling algorithms during optimization. Overall, H2 was partly confirmed.

Another question to be answered was whether grid size impacted model performance. As shown in Figure 5, the results were mixed: there were performance differences between some pairs of grid sizes and not between others. A pairwise t-test also produced p-values > 0.05 for some pairs of grid sizes and p-values < 0.05 for others. The performance differences were more pronounced for the HyperoptCV and Skoptimize models. Overall, H3 was partly confirmed.

In terms of optimization processing times, as shown in Figure 6, the results for the HyperoptSK and Skoptimize models were mixed, with some significant differences in processing times (p-value < 0.05) and processing time mainly increasing with grid size. For the HyperoptCV model, there was a discernible pattern of increasing processing time with grid size (p-value < 0.05). Overall, H4 was confirmed for HyperoptCV but only partly for the other two models.

The models in the experiment each use a variety of stochastic sampling algorithms underneath when iterating during model optimization. However, the question is whether the grid resolution impacts the ultimate model performance. In general, the specification of the search space upon which the models all depend is based on setting the search space boundaries and the various distributions that inform the search. In all likelihood, grid resolution does not significantly impact model performance, all else being equal.
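As an illustration of the point about boundaries and distributions, a Hyperopt-style search space is specified roughly as follows; the parameter names and ranges are placeholders, not the experiment's configuration.

```python
# Illustrative Hyperopt-style search space: boundaries plus sampling
# distributions, independent of how finely a grid is later discretized.
from hyperopt import hp

space = {
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.5),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "max_depth": hp.quniform("max_depth", 3, 17, 1),   # quantized integers
}
```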

As shown in Figure 7, even though there are some small differences in performance for the Hyperopt models, the differences are insignificant (p-value > 0.05) across grid resolutions for all models. Hence, H5 was confirmed.
In terms of processing time, as shown in Figure 8, the results for the HyperoptSK and Skoptimize models were mixed, with significant differences between some pairs of grid resolutions (p-value < 0.05) and insignificant differences between others (p-value > 0.05). For the HyperoptCV model, the differences between pairs were all significant (p-value < 0.05), indicating that processing time varies with grid resolution. The differences in processing time can likely be explained by the underlying stochastic algorithms used by each model, and we could not discount the impact of variations in CPU utilization on the host machine on which the experiment was run. Overall, H6 was partly confirmed, but some questions remain unanswered.

Takeaways From the Experiment
In machine learning, hyperparameter tuning is an essential practical step that can boost model performance but can be time-consuming. Here, we explored the coarse-to-fine approach to hyperparameter optimization, using random search as the first step in a two-step process to identify the most promising or baseline search space, while deploying a Bayesian optimization algorithm in the second step to find the combination of hyperparameters most likely to produce the best model generalization performance. On its own, random search offers a stochastic and scalable solution to hyperparameter optimization but may not always yield the globally optimal solution (Bischl et al., 2021). Bayesian optimizers are very pragmatic substitutes for the computationally expensive grid search. Here, we have demonstrated another pragmatic approach to hyperparameter optimization that can simplify hyperparameter specification and aid the automation of the optimization process.
The coarse-to-fine approach proposed here shows that it is possible to enhance model performance without a significant efficiency hit by pre-emptively homing in on the most relevant region of the search space before applying Bayesian optimization. We demonstrated that it is possible to achieve significant performance improvement by feeding the Bayesian optimization model with a narrow grid derived from the output of a random search.
References
Bartz, E., Zaefferer, M., Mersmann, O., & Bartz-Beielstein, T. (2021). Experimental Investigation and Evaluation of Model-based Hyperparameter Optimization (No. arXiv:2107.08761). arXiv. http://arxiv.org/abs/2107.08761
Bergstra, J. (n.d.-a). hpsklearn: Hyperparameter Optimization for sklearn (Version 0.1.0) [Python; MacOS :: MacOS X, Microsoft :: Windows, POSIX, Unix]. Retrieved 2 September 2024, from http://hyperopt.github.com/hyperopt-sklearn/
Bergstra, J. (n.d.-b). hyperopt: Distributed Asynchronous Hyperparameter Optimization (Version 0.2.7) [Python; MacOS :: MacOS X, Microsoft :: Windows, POSIX, Unix]. Retrieved 2 September 2024, from https://hyperopt.github.io/hyperopt
Bischl, B., Binder, M., Lang, M., Pielok, T., Richter, J., Coors, S., Thomas, J., Ullmann, T., Becker, M., Boulesteix, A.-L., Deng, D., & Lindauer, M. (2021). Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges (No. arXiv:2107.05847). arXiv. http://arxiv.org/abs/2107.05847
Brownlee, J. (2020, September 3). Scikit-Optimize for Hyperparameter Tuning in Machine Learning. MachineLearningMastery.Com. https://machinelearningmastery.com/scikit-optimize-for-hyperparameter-tuning-in-machine-learning/
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., Vanderplas, J., Joly, A., Holt, B., & Varoquaux, G. (2013). API design for machine learning software: Experiences from the scikit-learn project (No. arXiv:1309.0238). arXiv. https://doi.org/10.48550/arXiv.1309.0238
Che, F. (2024a). Chewitty/cheutils [Python]. https://github.com/chewitty/cheutils
Che, F. (2024b, September 2). Chedatasets/hyperparams-optimization at main · chewitty/chedatasets. https://github.com/chewitty/chedatasets/tree/main/hyperparams-optimization
Chen, B. (2021, September 6). A Practical Introduction to Grid Search, Random Search, and Bayes Search. Medium. https://towardsdatascience.com/a-practical-introduction-to-grid-search-random-search-and-bayes-search-d5580b1d941d
Coelho, A. (2020). Narrowing the Search: Which Hyperparameters Really Matter? https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter
DataCamp. (n.d.). Predicting Movie Rental Durations — DataCamp Learn. Retrieved 15 August 2024, from https://app.datacamp.com/learn/projects/predicting-movie-rental-durations
ES, S., & Bajaj, A. (2022, July 21). Hyperparameter Tuning in Python: A Complete Guide. Neptune.Ai. https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide
Hutter, F., Hoos, H., & Leyton-Brown, K. (2014). An Efficient Approach for Assessing Hyperparameter Importance. Proceedings of the 31st International Conference on Machine Learning, 32, 754–762. https://proceedings.mlr.press/v32/hutter14.html
Komer, B., Bergstra, J., & Eliasmith, C. (2014). Hyperopt-Sklearn: Automatic Hyperparameter Configuration for Scikit-Learn. 32–37. https://doi.org/10.25080/Majora-14bd3278-006
Komer, B., Bergstra, J., & Eliasmith, C. (2019). Hyperopt-Sklearn. In F. Hutter, L. Kotthoff, & J. Vanschoren (Eds.), Automated Machine Learning (pp. 97–111). Springer International Publishing. https://doi.org/10.1007/978-3-030-05318-5_5
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830. http://jmlr.org/papers/v12/pedregosa11a.html
Saraswathi, M. (2023, February 8). Hyper Parameter Tuning Techniques. Medium. https://medium.com/@monicasaraswathi/hyper-parameter-tuning-techniques-bc266a87d60c
scikit-learn. (n.d.). scikit-learn: Machine learning in Python. Retrieved 24 August 2024, from https://scikit-learn.org/stable/
Scikit-Optimize. (n.d.). scikit-optimize: Sequential model-based optimization toolbox. (Version 0.10.2) [Python; MacOS, Microsoft :: Windows, POSIX, Unix]. Retrieved 2 September 2024, from https://scikit-optimize.readthedocs.io/en/latest/contents.html
van Rijn, J. N., & Hutter, F. (2018). Hyperparameter Importance Across Datasets. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2367–2376. https://doi.org/10.1145/3219819.3220058
Vichaar, S. (2023, May 18). AutoML for creating Machine Learning model: HyperOpt and HyperOpt-Sklearn. Medium. https://shunya-vichaar.medium.com/automl-for-creating-machine-learning-model-hyperopt-and-hyperopt-sklearn-22cca0b59c4a