Utilizing Symbolic Regression to discover a larger class of splits for Decision Trees
This work is based on the paper: K.S. Fong and M. Motani, “Symbolic Regression Enhanced Decision Trees for Classification Tasks,” Annual AAAI Conference on Artificial Intelligence (AAAI), Vancouver, Canada, Feb. 2024. [1]
Decision Tree (DT) is one of the most popular machine learning algorithms (see https://connect.aisingapore.org/aiap-field-guide/ for the AISG structured field guide to AI & ML, with materials for DT). Have you ever wondered what would happen if we used a larger class of splits in DT instead of single-variable splits (more precisely, axis-parallel splits)? In our work, we explore splits that can be non-linear multivariate functions of the features, using another machine learning algorithm called Symbolic Regression. We call our method Symbolic Regression Enhanced Decision Tree (SREDT) [1].
Symbolic Regression (SR) algorithms find closed-form analytical expressions that fulfill a specified objective, most frequently to predict an output from input features. In contrast to black-box machine learning models (such as neural networks), SR algorithms produce compact expressions with few parameters, making them both explainable and interpretable. It is thus no surprise that SR has quickly become a first-class algorithm in fields including physics, materials science, engineering and healthcare.
In the field of SR, genetic programming (GP) has become the most common paradigm for tackling the large search space of equations; even the best-performing SR algorithms incorporate GP at their core. Generally, SR algorithms first initialize a set of randomly generated expressions. These expressions are evaluated against a user-specified objective, and the best-performing ones undergo small changes to create the next iteration of expressions. This evaluate-and-regenerate loop repeats until a ‘good enough’ expression is found, as sketched below.
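To make the GP loop concrete, here is a minimal sketch using the open-source gplearn library on a toy regression problem. This is generic GP-based SR, not the authors’ implementation; the data, hyperparameters and function set are illustrative choices of ours.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

# Toy data: y = x0**2 + x1, which the GP loop should recover (approximately)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 1]

est = SymbolicRegressor(
    population_size=500,            # size of the set of random initial expressions
    generations=20,                 # evaluate-select-mutate iterations
    function_set=('add', 'sub', 'mul'),
    parsimony_coefficient=0.001,    # penalize bloated expressions
    random_state=0,
)
est.fit(X, y)
print(est._program)  # best expression found, e.g. add(mul(X0, X0), X1)
```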
Our Contribution: Symbolic Regression Enhanced Decision Tree (SREDT). We modify the SR objective so that it searches for ‘good’ DT splits, which can be non-linear and multivariate. We employ various objectives and mechanisms (i.e., an expression generator pre-trained on a database of expressions, lookahead over multiple splits, and numerical optimization of constants), which we detail with more technical depth in the paper [1].
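As a rough illustration of what ‘modifying the objective’ could look like, the sketch below scores a candidate symbolic split by the weighted Gini impurity of the partition it induces. The exact fitness used in SREDT is specified in the paper [1]; this is one plausible stand-in, and the function names are our own.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(expr_values, y):
    # Weighted Gini impurity of the partition induced by expr(x) <= 0;
    # lower is better, so SR can minimize this as its objective
    mask = expr_values <= 0
    n = len(y)
    if mask.all() or (~mask).all():
        return np.inf  # degenerate split: everything on one side
    return (mask.sum() / n) * gini(y[mask]) + ((~mask).sum() / n) * gini(y[~mask])

# Example: score the candidate symbolic split x0 * x1 - 0.5 <= 0
X = np.random.rand(200, 2)
y = (X[:, 0] * X[:, 1] > 0.5).astype(int)
print(split_score(X[:, 0] * X[:, 1] - 0.5, y))  # near 0: an (almost) pure split
```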
SREDT Strength #1: Better Prediction Performance. SREDT outperforms DT in both F-score and accuracy. SREDT addresses the weaknesses of DT by utilizing SR to discover a richer class of splitting rules, yielding single splits that would otherwise require multiple splits by DT or by oblique decision trees (ODTs, which use linear multivariate splits), or that, in some cases, no number of axis-parallel splits could reproduce.
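A classic case where a single axis-parallel split is useless is an XOR-like boundary. The toy example below (our construction, not from the paper) shows a depth-1 DT stuck near chance, while the single symbolic split x0 * x1 > 0 classifies perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like: opposite quadrants share a label

# No single axis-parallel split separates the classes
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print("depth-1 DT accuracy:", stump.score(X, y))            # near chance (~0.5)

# One non-linear multivariate split suffices
print("symbolic split accuracy:", ((X[:, 0] * X[:, 1] > 0) == y).mean())  # 1.0
```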
SREDT Strength #2: Smaller Size, Increased Explainability. SREDT is smaller and more compact than DT. To compare sizes, we measure tree depth, the number of leaves, and the number of terms (i.e., splits, operands and operators) in each tree. On all three measures, SREDT is smaller than DT, a property that makes a model more explainable and interpretable.
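For reference, two of the three size measures can be read directly off a fitted scikit-learn tree, as in the sketch below; the term count would additionally require tallying the operands and operators in each split expression. The dataset here is just a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print("depth: ", clf.get_depth())     # tree depth
print("leaves:", clf.get_n_leaves())  # number of leaves
# Term count: additionally parse each split and count operands/operators
```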
SREDT Strength #3: Decreased Inference Time. SREDT makes inferences faster than even DT, itself one of the ML algorithms with the fastest inference time. The improvement is explained by the much lower depth of SREDT compared to both DT and ODT, as discussed above. Since SREDT allows more complex splitting functions, which require more computation, each individual split likely takes longer to evaluate than a DT split. Nonetheless, the reduction in depth far outweighs this extra per-split cost, so the net result is that SREDT still makes inferences faster.
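The trade-off is easy to see with illustrative numbers (ours, not measurements from the paper): per-sample inference cost is roughly depth times cost-per-split.

```python
# Hypothetical, illustrative numbers only: cost ≈ depth * cost_per_split
dt_cost    = 12 * 1.0   # e.g. depth-12 DT, unit cost per axis-parallel comparison
sredt_cost = 3 * 3.0    # e.g. depth-3 SREDT, each symbolic split ~3x more expensive
print(dt_cost, sredt_cost)  # 12.0 vs 9.0: the shallower tree wins despite costlier splits
```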
SREDT Strength #4: Robust to Noise. SREDT also has robust innate feature selection: SREDTs tend to select only a subset of features in the eventual tree and, compared to other machine learning models, SREDT is highly selective of features. This is an advantage in critical real-world problems, where there may be large perturbations in unimportant features. To demonstrate this property, we ran two tests. In the first test, for every existing feature column we generated three additional feature columns, each a random permutation of the original column. In the second test, for every existing feature column we generated three additional feature columns, each the original values plus random Gaussian noise. SREDT was the least affected by the noise in both cases, correctly avoiding the noisy columns most often.
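A minimal sketch of how such distractor columns can be constructed is shown below; the function name and parameters are our own, hypothetical choices, mirroring the two tests described above.

```python
import numpy as np

def add_distractor_columns(X, rng, mode="permute", k=3, sigma=1.0):
    """Append k distractor columns per original feature."""
    cols = [X]
    for j in range(X.shape[1]):
        for _ in range(k):
            if mode == "permute":    # test 1: randomly permuted copies
                cols.append(rng.permutation(X[:, j])[:, None])
            else:                    # test 2: Gaussian-noised copies
                cols.append((X[:, j] + rng.normal(0.0, sigma, X.shape[0]))[:, None])
    return np.hstack(cols)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))
print(add_distractor_columns(X, rng, mode="permute").shape)  # (100, 16)
```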
Other Results: In our full paper [1], we also include results on common real-world datasets, together with some visual examples. More technical details, including a comparison against a wide range of other ML algorithms, can also be found there.
Full Paper Link: https://doi.org/10.1609/aaai.v38i11.29091
[1] K.S. Fong and M. Motani, “Symbolic Regression Enhanced Decision Trees for Classification Tasks,” Annual AAAI Conference on Artificial Intelligence (AAAI), Vancouver, Canada, Feb. 2024.