Monotonic Constraints in LightGBM
Context
In XGBoost and LightGBM, we can add monotonic constraints to force a feature to have a monotonic impact on the output (a minimal usage sketch follows the list below). This helps in several ways:
- Prevent overfitting / reduce variance: when a certain range of the feature has few data samples, we do not want to see weird local behaviors there.
- Encode business judgement: we expect some features to have an explainable, monotonic relationship with the output.
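As a rough illustration, here is a minimal sketch of passing a monotone constraint through the LightGBM Python API. The data and the choice of which feature to constrain are made up for illustration.

```python
import numpy as np
import lightgbm as lgb

# Toy data: the first feature has a genuinely increasing effect on the target,
# the second does not.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
y = 2 * X[:, 0] + np.sin(5 * X[:, 1]) + rng.normal(scale=0.1, size=1000)

train_set = lgb.Dataset(X, label=y)
params = {
    "objective": "regression",
    # One entry per feature: 1 = non-decreasing, -1 = non-increasing, 0 = unconstrained.
    "monotone_constraints": [1, 0],
}
booster = lgb.train(params, train_set, num_boost_round=100)
```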
Three methods to train the model
LightGBM offers three methods to add the constraints (https://lightgbm.readthedocs.io/en/latest/Parameters.html): “basic”, “intermediate” and “advanced”. The latter two are improvements first introduced in the attached paper (a short snippet selecting the method follows the list below).
See the example below: we have two features, and suppose we constrain the horizontal feature. The first two splits are natural, but on the third there is a difference among the methods.
- (c) is the “basic” implementation. Notice that in the first split on the monotonic feature, the two children of that node are assigned outputs of $0.3$ and $0.6$. The method then uses their mid value, $0.45$, as the upper bound of the lower child and the lower bound of the higher child. No matter how high the blue dots should be predicted, their predictions are capped at $0.45$. This creates a gap in between and loses a tremendous amount of flexibility along the other feature, reducing the predictive power.
- (d) is the “intermediate” method, called the “fast” method in the paper. It reduces the gap introduced by the mid-value rule: the parent node additionally memorizes two values, the minimum output of the right child and the maximum output of the left child, and uses them as the bounds instead. This gives the model more flexibility.
- (e) is the “advanced” method, also called the “slow” method. It builds on “fast” and memorizes even more parameters: it records bounds created by combinations of the other features and their thresholds. In this example, the vertical feature had a split at $0.5$. Below $0.5$, the probability is bounded by $0.8$ rather than by $0.5$, the overall minimum that ignores the region. So this method can be more granular in its predictions.
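Continuing the earlier sketch, the implementation is chosen with the `monotone_constraints_method` parameter; the parameter values below come from the LightGBM documentation linked above, while the surrounding setup is illustrative.

```python
# Same setup as before; only the constraint method changes.
params = {
    "objective": "regression",
    "monotone_constraints": [1, 0],
    # "basic" (default), "intermediate" ("fast"), or "advanced" ("slow").
    "monotone_constraints_method": "advanced",
}
booster = lgb.train(params, train_set, num_boost_round=100)
```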
Monotone Penalty
Training with a strict constraint is not always desirable: if the first split is on a constrained feature, all of its descendant nodes are impacted. Tree growth relies on a greedy algorithm, and being too strict about its direction of growth can harm the predictive power.
That’s why a penalty can be helpful.
To accommodate this concern, the penalty is designed as a function of the depth $d$, as follows. In practice, the factor $p$ is multiplied with the information gain of the split. If $\gamma \leq 1$, a penalty is applied at every depth, with strength decreasing as the depth grows. Otherwise, splits on constrained features at depths $d \leq \gamma - 1$ are forbidden entirely, since their gain is multiplied by $0$. In general, the deeper the node, the closer $p$ is to $1$, i.e. the weaker the penalty.
$$ p= \begin{cases}0 & \text { if } \gamma \geq d+1 \\ 1-\frac{\gamma}{2^{d}} & \text { if } \gamma \leq 1 \\ 1-2^{\gamma-1-d} & \text { else }\end{cases} $$
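As a worked illustration of the formula, the sketch below recomputes the factor $p$ for a few depths; in LightGBM, $\gamma$ corresponds to the `monotone_penalty` parameter. The helper function itself is hypothetical, written only to mirror the cases above.

```python
# Hypothetical helper mirroring the penalty factor p(d, gamma) defined above.
# In LightGBM, gamma is set via the `monotone_penalty` parameter and the gain
# of a split on a constrained feature at depth d is multiplied by p.
def monotone_penalty_factor(depth: int, gamma: float) -> float:
    if gamma >= depth + 1:
        return 0.0  # gain is zeroed: the monotone split is effectively forbidden
    if gamma <= 1:
        return 1.0 - gamma / 2 ** depth
    return 1.0 - 2 ** (gamma - 1 - depth)

# With gamma = 2, monotone splits at depths 0 and 1 are forbidden,
# and the penalty fades quickly for deeper nodes.
for d in range(5):
    print(d, monotone_penalty_factor(d, gamma=2.0))
# 0 0.0
# 1 0.0
# 2 0.5
# 3 0.75
# 4 0.875
```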