# -*- coding: utf-8 -*-
"""Copie de Automated hyper-parameter tuning
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1MTUdbzFfjpLK6rpOWe4ORxsOC07y0D4L
https://colab.research.google.com/github/tensorflow/decision-forests/blob/main/documentation/tutorials/automatic_tuning_colab.ipynb
##### Copyright 2023 The TensorFlow Authors.
"""
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""# Automated hyper-parameter tuning
Welcome to the **Automated hyper-parameter tuning** tutorial. In this colab, you
will learn how to improve your models using automated hyper-parameter tuning
with TensorFlow Decision Forests.
More precisely, we will:
1. Train a model **without hyper-parameter tuning**. This model will be used to
measure the quality improvement of hyper-parameter tuning.
1. Train a model **with hyper-parameter tuning** using **TF-DF's tuner**. The
hyper-parameters to optimize will be **defined manually**.
1. Train another model **with hyper-parameter tuning** using **TF-DF's tuner**.
But this time, the hyper-parameters to optimize will be **set
automatically**. **This is the recommended first approach to try when using
hyper-parameter tuning**.
1. Finally, we will train a model **with hyper-parameter tuning** using
**Keras's tuner**.
## Introduction
A learning algorithm trains a machine learning model on a training dataset. The
parameters of a learning algorithm, called "hyper-parameters", control
how the model is trained and impact its quality. Therefore, finding the best
hyper-parameters is an important stage of modeling.
Some hyper-parameters are simple to configure. For example, increasing the
number of trees (`num_trees`) in a random forest increases the quality of the
model until a plateau. Therefore, setting the largest value compatible with the
serving constraints (more trees means a larger model) is a valid rule of thumb.
However, other hyper-parameters have a more complex interaction with the model
and cannot be chosen with such a simple rule. For example, increasing the
maximum tree depth (`max_depth`) of a gradient boosted tree model can either
increase or decrease the quality of the model. Furthermore, hyper-parameters can
interact with each other, and the optimal value of a hyper-parameter cannot
be found in isolation.
There are four main approaches to selecting the hyper-parameter values:
1. **The default approach**: Learning algorithms come with default values.
While not ideal in all cases, those values produce reasonable results in
most situations. This approach is recommended as the first approach to use
in any modeling.
[This page](https://ydf.readthedocs.io/en/latest/hyper_parameters.html)
lists the default values of TF Decision Forests.
2. **The template hyper-parameter approach**: In addition to the default
values, TF Decision Forests also exposes the hyper-parameter templates.
Those are benchmark-tuned hyper-parameter values with excellent performance
but high training cost (e.g. `hyperparameter_template="benchmark_rank1"`).
3. **The manual tuning approach:** You can manually test different
hyper-parameter values and select the one that performs best.
[This guide](https://ydf.readthedocs.io/en/latest/improve_model.html)
gives some advice.
4. **The automated tuning approach:** A tuning algorithm can be used to
automatically find the best hyper-parameter values. This approach often gives
the best results and does not require expertise. The main downside of this
approach is the time it takes on large datasets.
In this colab, we show the **default** and **automated tuning** approaches with
the TensorFlow Decision Forests library.
## Hyper-parameter tuning algorithms
Automated tuning algorithms work by generating and evaluating a large number of
hyper-parameter values. Each of those iterations is called a "trial". The
evaluation of a trial is expensive as it requires training a new model each
time. At the end of the tuning, the hyper-parameter values with the best
evaluation are used.
Tuning algorithms are configured as follows:
**The search space**
The search space is the list of hyper-parameters to optimize and the values they
can take. For example, the maximum depth of a tree could be optimized for values
in between 1 and 32. Exploring more hyper-parameters and more possible values
often leads to better models but also takes more time. The available
hyper-parameters are listed in the
[documentation](https://ydf.readthedocs.io/en/latest/hyper_parameters.html).
When the possible value of one hyper-parameter depends on the value of another
hyper-parameter, the search space is said to be conditional.
**The number of trials**
The number of trials defines how many models will be trained and evaluated.
A larger number of trials generally leads to better models, but takes more time.
**The optimizer**
The optimizer selects the next hyper-parameter values to evaluate based on the
past trial evaluations. The simplest and often reasonable optimizer is the one
that selects the hyper-parameter values at random.
**The objective / trial score**
The objective is the metric optimized by the tuner. Often, this metric is a
measure of quality (e.g. accuracy, log loss) of the model evaluated on a
validation dataset.
*Train-valid-test*
The validation dataset should be different from the training dataset: If the
training and validation datasets are the same, the selected hyper-parameters
will be irrelevant. The validation dataset should also be different from the
testing dataset (also called holdout dataset): Because hyper-parameter tuning is
a form of training, if the testing and validation datasets are the same, you are
effectively training on the test dataset. In this case, you might overfit on
your test dataset without a way to measure it.
*Cross-validation*
In the case of a small dataset, for example a dataset with less than 100k
examples, hyper-parameter tuning can be coupled with
[cross-validation](https://en.wikipedia.org/wiki/Cross-validation_\(statistics\)):
Instead of being evaluated from a single training-test round, the
objective/trial score is evaluated as the average of the metric over multiple
cross-validation rounds.
Similar to the train-valid-and-test datasets, the cross-validation used to
evaluate the objective/score during hyper-parameter tuning should be different
from the cross-validation used to evaluate the quality of the model.
*Out-of-bag evaluation*
Some models, like Random Forests, can be evaluated on the training dataset
using the "out-of-bag evaluation" method. While not as accurate as
cross-validation, the "out-of-bag evaluation" is much faster than
cross-validation and does not require a separate validation dataset.
*In TensorFlow Decision Forests*
In TF-DF, the model
["self" evaluation](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab#model_self_evaluation)
is always a fair way to evaluate a model. For example, an out-of-bag evaluation
is used for Random Forest models while a validation dataset is used for Gradient
Boosted models.
## Hyper-parameter tuning with TF Decision Forests
TF-DF supports automatic hyper-parameter tuning with minimal configuration. In
the next example, we will train and compare two models: One trained with default
hyper-parameters, and one trained with hyper-parameter tuning.
**Note:** Hyper-parameter tuning can take a long time on large datasets. In this
case, it is recommended to use TF-DF with distributed training to drastically
speed up the hyper-parameter tuning.
## Setup
"""
# Install TensorFlow Decision Forests
!pip install tensorflow_decision_forests -U -qq
"""Install [Wurlitzer](https://pypi.org/project/wurlitzer/). Wurlitzer is required to show the detailed training logs in colabs (with `verbose=2`)."""
!pip install wurlitzer -U -qq
"""Import the necessary libraries."""
import tensorflow_decision_forests as tfdf
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import numpy as np
"""The hidden code cell limits the output height in colab."""
#@title Define "set_cell_height".
from IPython.core.magic import register_line_magic
from IPython.display import Javascript
from IPython.display import display
# Some of the model training logs can cover the full
# screen if not compressed to a smaller viewport.
# This magic allows setting a max height for a cell.
@register_line_magic
def set_cell_height(size):
  display(
      Javascript("google.colab.output.setIframeHeight(0, true, {maxHeight: " +
                 str(size) + "})"))
"""## Training a model **without** Automated hyper-parameter tuning
We will train a model on the [Adult](https://archive.ics.uci.edu/ml/datasets/adult) dataset available in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/adult). Let's download the dataset.
"""
# Download a copy of the adult dataset.
!wget -q https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset/adult_train.csv -O /tmp/adult_train.csv
!wget -q https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset/adult_test.csv -O /tmp/adult_test.csv
"""Split the dataset into a training and a testing dataset."""
# Load the dataset in memory
train_df = pd.read_csv("/tmp/adult_train.csv")
test_df = pd.read_csv("/tmp/adult_test.csv")
# , and convert it into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="income")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="income")
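"""As mentioned in the introduction, on small datasets the objective/trial score can be averaged over cross-validation folds. This tutorial does not do this, but the following is a minimal illustrative sketch; the helper name, the fold logic, and `num_folds` are assumptions for illustration, not part of the TF-DF API."""
def cross_val_accuracy(df, label="income", num_folds=5, **hyper_params):
  """Illustrative only: average accuracy of a GBT model over `num_folds` folds."""
  # Shuffle the dataframe and split it into roughly equal folds.
  folds = np.array_split(df.sample(frac=1, random_state=1234), num_folds)
  accuracies = []
  for i in range(num_folds):
    fold_valid_df = folds[i]
    fold_train_df = pd.concat(folds[:i] + folds[i + 1:])
    fold_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(fold_train_df, label=label)
    fold_valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(fold_valid_df, label=label)
    # Train a model with the candidate hyper-parameters on the fold.
    fold_model = tfdf.keras.GradientBoostedTreesModel(**hyper_params)
    fold_model.fit(fold_train_ds, verbose=0)
    fold_model.compile(["accuracy"])
    accuracies.append(
        fold_model.evaluate(fold_valid_ds, return_dict=True, verbose=0)["accuracy"])
  return np.mean(accuracies)

# Example usage (not run here, as it trains `num_folds` models):
# print(cross_val_accuracy(train_df, min_examples=5))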
"""First, we train and evaluate the quality of a [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel) model trained with the default hyper-parameters."""
# Commented out IPython magic to ensure Python compatibility.
# %%time
# # Train a model with default hyper-parameters
# model = tfdf.keras.GradientBoostedTreesModel()
# model.fit(train_ds)
# Evaluate the model
model.compile(["accuracy"])
test_accuracy = model.evaluate(test_ds, return_dict=True, verbose=0)["accuracy"]
print(f"Test accuracy without hyper-parameter tuning: {test_accuracy:.4f}")
"""The default hyper-parameters of the model are available with the `learner_params` function. The definition of those parameters is available in the [documentation](https://ydf.readthedocs.io/en/latest/hyper_parameters.html)."""
print("Default hyper-parameters of the model:\n", model.learner_params)
"""## Training a model **with** automated hyper-parameter tuning and **manual** definition of the hyper-parameters
Hyper-parameter tuning is enabled by specifying the `tuner` constructor argument of the model. The tuner object contains all the configuration of the tuner (search space, optimizer, trial and objective).
**Note:** *In the next section, you will see how to automatically configure hyper-parameter values. However, setting the hyper-parameters manually as shown here is still valuable to understand.*
"""
# Configure the tuner.
# Create a Random Search tuner with 50 trials.
tuner = tfdf.tuner.RandomSearch(num_trials=50)
# Define the search space.
#
# Adding more parameters generally improves the quality of the model, but makes
# the tuning last longer.
tuner.choice("min_examples", [2, 5, 7, 10])
tuner.choice("categorical_algorithm", ["CART", "RANDOM"])
# Some hyper-parameters are only valid for specific values of other
# hyper-parameters. For example, the "max_depth" parameter is mostly useful when
# "growing_strategy=LOCAL" while "max_num_nodes" is better suited when
# "growing_strategy=BEST_FIRST_GLOBAL".
local_search_space = tuner.choice("growing_strategy", ["LOCAL"])
local_search_space.choice("max_depth", [3, 4, 5, 6, 8])
# merge=True indicates that the parameter (here "growing_strategy") is already
# defined, and that new values are added to it.
global_search_space = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_search_space.choice("max_num_nodes", [16, 32, 64, 128, 256])
tuner.choice("use_hessian_gain", [True, False])
tuner.choice("shrinkage", [0.02, 0.05, 0.10, 0.15])
tuner.choice("num_candidate_attributes_ratio", [0.2, 0.5, 0.9, 1.0])
# Uncomment some (or all) of the following hyper-parameters to increase the
# quality of the search. The number of trials should be increased accordingly.
# tuner.choice("split_axis", ["AXIS_ALIGNED"])
# oblique_space = tuner.choice("split_axis", ["SPARSE_OBLIQUE"], merge=True)
# oblique_space.choice("sparse_oblique_normalization",
# ["NONE", "STANDARD_DEVIATION", "MIN_MAX"])
# oblique_space.choice("sparse_oblique_weights", ["BINARY", "CONTINUOUS"])
# oblique_space.choice("sparse_oblique_num_projections_exponent", [1.0, 1.5])
# Commented out IPython magic to ensure Python compatibility.
# %%time
# %set_cell_height 300
#
# # Tune the model. Notice the `tuner=tuner`.
# tuned_model = tfdf.keras.GradientBoostedTreesModel(tuner=tuner)
# tuned_model.fit(train_ds, verbose=2)
#
# # The `num_threads` model constructor argument (not specified in the example
# # above) controls how many trials are run in parallel (one per thread). If
# # `num_threads` is not specified (like in the example above), one thread is
# # allocated for each available CPU core.
# #
# # If the training is interrupted (for example, by pressing on the "stop" button
# # on the top-left of the colab cell), the best model so-far will be returned.
#
# # In the training logs, you can see lines such as `[10/50] Score: -0.45 / -0.40
# # HParams: ...`. This indicates that 10 of the 50 trials have been completed.
# # And that the last trial returned a score of "-0.45" and that the best trial so
# # far has a score of "-0.40". In this example, the model is optimized by
# # logloss. Since scores are maximized and log loss should be minimized, the
# # score is effectively minus the log loss.
# Evaluate the model
tuned_model.compile(["accuracy"])
tuned_test_accuracy = tuned_model.evaluate(test_ds, return_dict=True, verbose=0)["accuracy"]
print(f"Test accuracy with the TF-DF hyper-parameter tuner: {tuned_test_accuracy:.4f}")
"""The hyper-parameters and objective scores of the trials are available in the model inspector. The `score` value is always maximized. In this example, the score is the negative log loss on the validation dataset (selected automatically)."""
# Display the tuning logs.
tuning_logs = tuned_model.make_inspector().tuning_logs()
tuning_logs.head()
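"""The tuning logs are a regular Pandas dataframe, so they can also be sorted to show the best trials first (optional):"""
# Show the 5 best trials, sorted by decreasing score.
tuning_logs.sort_values("score", ascending=False).head(5)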
""" The single rows with `best=True` is the one used in the final model."""
# Best hyper-parameters.
tuning_logs[tuning_logs.best].iloc[0]
"""**Note:** Parameters with NaN value are conditional parameters that are not set.
Next, we plot the evolution of the best score during the tuning.
"""
plt.figure(figsize=(10, 5))
plt.plot(tuning_logs["score"], label="current trial")
plt.plot(tuning_logs["score"].cummax(), label="best trial")
plt.xlabel("Tuning step")
plt.ylabel("Tuning score")
plt.legend()
plt.show()
"""## Training a model **with** automated hyper-parameter tuning and **automatic** definition of the hyper-parameters (*recommended approach*)
As before, hyper-parameter tuning is enabled by specifying the `tuner` constructor argument of the model. Set `use_predefined_hps=True` to automatically configure the search space for the hyper-parameters.
**Note:** The automatic hyper-parameter configuration explores some powerful but slow-to-train hyper-parameters. For example, oblique splits (commented/disabled in the previous section; see `SPARSE_OBLIQUE`) are tested. This means the tuning will be slower, but will hopefully give a result of significantly higher quality.
"""
# Commented out IPython magic to ensure Python compatibility.
# %%time
# %set_cell_height 300
#
# # Create a Random Search tuner with 50 trials and automatic hp configuration.
# tuner = tfdf.tuner.RandomSearch(num_trials=50, use_predefined_hps=True)
#
# # Define and train the model.
# tuned_model = tfdf.keras.GradientBoostedTreesModel(tuner=tuner)
# tuned_model.fit(train_ds, verbose=2)
# Evaluate the model
tuned_model.compile(["accuracy"])
tuned_test_accuracy = tuned_model.evaluate(test_ds, return_dict=True, verbose=0)["accuracy"]
print(f"Test accuracy with the TF-DF hyper-parameter tuner: {tuned_test_accuracy:.4f}")
"""Same as before, display the tuning logs."""
# Display the tuning logs.
tuning_logs = tuned_model.make_inspector().tuning_logs()
tuning_logs.head()
"""Same as before, shows the best hyper-parameters."""
# Best hyper-parameters.
tuning_logs[tuning_logs.best].iloc[0]
"""Finally, plots the evolution of the quality of the model during tuning:"""
plt.figure(figsize=(10, 5))
plt.plot(tuning_logs["score"], label="current trial")
plt.plot(tuning_logs["score"].cummax(), label="best trial")
plt.xlabel("Tuning step")
plt.ylabel("Tuning score")
plt.legend()
plt.show()
"""## Training a model with Keras Tuner *(Alternative approach)*
TensorFlow Decision Forests is based on the Keras framework, and it is compatible with the [Keras tuner](https://keras.io/keras_tuner/#quick-introduction).
Currently, the **TF-DF Tuner** and the **Keras Tuner** are complementary.
*TF-DF Tuner*
- Automatic configuration of the objective.
- Automatic extraction of the validation dataset (if needed).
- Support for model self-evaluation (e.g. out-of-bag evaluation).
- Distributed hyper-parameter tuning.
- Shared dataset access between the trials: the TensorFlow dataset is read only once, which speeds up tuning significantly on small datasets.
*Keras Tuner*
- Support for tuning of the pre-processing parameters.
- Support for the Hyperband optimizer.
- Support for custom objectives.
Let's tune a TF-DF model using the Keras tuner.
"""
# Install the Keras tuner
!pip install keras-tuner -U -qq
import keras_tuner as kt
def build_model(hp):
  """Creates a model."""

  model = tfdf.keras.GradientBoostedTreesModel(
      min_examples=hp.Choice("min_examples", [2, 5, 7, 10]),
      categorical_algorithm=hp.Choice("categorical_algorithm", ["CART", "RANDOM"]),
      max_depth=hp.Choice("max_depth", [4, 5, 6, 7]),
      # The Keras tuner automatically converts boolean parameters to integers.
      use_hessian_gain=bool(hp.Choice("use_hessian_gain", [True, False])),
      shrinkage=hp.Choice("shrinkage", [0.02, 0.05, 0.10, 0.15]),
      num_candidate_attributes_ratio=hp.Choice(
          "num_candidate_attributes_ratio", [0.2, 0.5, 0.9, 1.0]
      ),
  )

  # Define one input layer per feature column of the sub-training dataframe
  # (i.e. everything except the "income" label). Note: `dtype=tf.float32`
  # assumes numerical features; adjust the dtype for categorical (string)
  # columns if needed.
  input_features = [
      tf.keras.Input(shape=(1,), name=feature_name, dtype=tf.float32)
      for feature_name in sub_train_df.columns
      if feature_name != "income"
  ]

  # Connect the input layers to the Decision Forests model.
  outputs = model(input_features)

  # Create the Keras model and optimize the model accuracy as computed on the
  # validation dataset.
  keras_model = tf.keras.Model(inputs=input_features, outputs=outputs)
  keras_model.compile(metrics=["accuracy"])
  return keras_model
# Commented out IPython magic to ensure Python compatibility.
# %%time
#
# keras_tuner = kt.RandomSearch(
# build_model,
# objective="val_accuracy",
# max_trials=50,
# overwrite=True,
# directory="/tmp/keras_tuning")
#
# # Important: The tuning should not be done on the test dataset.
#
# # Extract a validation dataset from the training dataset. The new training
# # dataset is called the "sub-training-dataset".
#
# def split_dataset(dataset, test_ratio=0.30):
# """Splits a panda dataframe in two."""
# test_indices = np.random.rand(len(dataset)) < test_ratio
# return dataset[~test_indices], dataset[test_indices]
#
# sub_train_df, sub_valid_df = split_dataset(train_df)
# sub_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_train_df, label="income")
# sub_valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_valid_df, label="income")
#
# # Tune the model
# keras_tuner.search(sub_train_ds, validation_data=sub_valid_ds)
"""The best hyper-parameter are available with `get_best_hyperparameters`:
"""
# Retrieve the best hyper-parameters found by the Keras tuner.
best_hyper_parameters = keras_tuner.get_best_hyperparameters()[0].values
print("Best hyper-parameters:", best_hyper_parameters)
"""The model should be re-trained with the best hyper-parameters:"""
# Commented out IPython magic to ensure Python compatibility.
# %set_cell_height 300
# Re-train the model with the best hyper-parameters.
# The Keras tuner automatically converts boolean parameters to integers.
best_hyper_parameters["use_hessian_gain"] = bool(best_hyper_parameters["use_hessian_gain"])
best_model = tfdf.keras.GradientBoostedTreesModel(**best_hyper_parameters)
best_model.fit(train_ds, verbose=2)
"""We can then evaluate the tuned model:"""
# Evaluate the model
best_model.compile(["accuracy"])
tuned_test_accuracy = best_model.evaluate(test_ds, return_dict=True, verbose=0)["accuracy"]
print(f"Test accuracy with the Keras Tuner: {tuned_test_accuracy:.4f}")