ml-finance-python

python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
04_xgboost_lightgbm_catboost_tuning.ipynb

      1 {
      2  "cells": [
      3   {
      4    "cell_type": "markdown",
      5    "metadata": {},
      6    "source": [
      7     "# XGBoost, LightGBM and CatBoost Parameter Tuning"
      8    ]
      9   },
     10   {
     11    "cell_type": "markdown",
     12    "metadata": {},
     13    "source": [
     14     "## Imports & Settings"
     15    ]
     16   },
     17   {
     18    "cell_type": "code",
     19    "execution_count": null,
     20    "metadata": {},
     21    "outputs": [],
     22    "source": [
     23     "import warnings\n",
     24     "from pathlib import Path\n",
     25     "from random import shuffle\n",
     26     "from time import time\n",
     27     "import numpy as np\n",
     28     "import pandas as pd\n",
     29     "import xgboost as xgb\n",
     30     "from xgboost.callback import LearningRateScheduler  # replaces reset_learning_rate, which newer xgboost releases no longer provide\n",
     31     "import lightgbm as lgb\n",
     32     "from catboost import Pool, CatBoostClassifier\n",
     33     "from itertools import product\n",
     34     "from sklearn.metrics import roc_auc_score\n",
     35     "from math import ceil"
     36    ]
     37   },
     38   {
     39    "cell_type": "code",
     40    "execution_count": null,
     41    "metadata": {},
     42    "outputs": [],
     43    "source": [
     44     "pd.set_option('display.expand_frame_repr', False)\n",
     45     "warnings.filterwarnings('ignore')\n",
     46     "idx = pd.IndexSlice\n",
     47     "np.random.seed(42)"
     48    ]
     49   },
     50   {
     51    "cell_type": "markdown",
     52    "metadata": {},
     53    "source": [
     54     "Change the path to the data store in `gbm_utils.py` if needed."
     55    ]
     56   },
     57   {
     58    "cell_type": "markdown",
     59    "metadata": {},
     60    "source": [
     61     "If you choose to compile any of the libraries with GPU support, amend the parameters in `gbm_params.py` accordingly."
     62    ]
     63   },
     64   {
     65    "cell_type": "code",
     66    "execution_count": null,
     67    "metadata": {},
     68    "outputs": [],
     69    "source": [
     70     "results_path = Path('results')\n",
     71     "if not results_path.exists():\n",
     72     "    results_path.mkdir(exist_ok=True)"
     73    ]
     74   },
     75   {
     76    "cell_type": "code",
     77    "execution_count": null,
     78    "metadata": {},
     79    "outputs": [],
     80    "source": [
     81     "from gbm_utils import format_time, get_data, get_one_hot_data, factorize_cats, get_holdout_set, OneStepTimeSeriesSplit\n",
     82     "from gbm_params import get_params"
     83    ]
     84   },
     85   {
     86    "cell_type": "markdown",
     87    "metadata": {},
     88    "source": [
     89     "## Learning Rate Schedule"
     90    ]
     91   },
     92   {
     93    "cell_type": "markdown",
     94    "metadata": {},
     95    "source": [
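     96     "Define a declining learning rate schedule. The rate follows a logistic decay centered at iteration `ntot / 1.8`, falling from just under `start_eta = 0.1` at the first iteration to roughly 0.003 at the last:"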
     97    ]
     98   },
     99   {
    100    "cell_type": "code",
    101    "execution_count": null,
    102    "metadata": {},
    103    "outputs": [],
    104    "source": [
    105     "def learning_rate(n, ntot):\n",
    106     "    start_eta = 0.1\n",
    107     "    k = 8 / ntot\n",
    108     "    x0 = ntot / 1.8\n",
    109     "    return start_eta * (1 - 1 / (1 + np.exp(-k * (n - x0))))"
    110    ]
    111   },
    112   {
    113    "cell_type": "markdown",
    114    "metadata": {},
    115    "source": [
    116     "### Visualize Learning Rate Schedule"
    117    ]
    118   },
    119   {
    120    "cell_type": "code",
    121    "execution_count": null,
    122    "metadata": {},
    123    "outputs": [],
    124    "source": [
    125     "ntot = 10000\n",
    126     "x = np.asarray(range(ntot))\n",
    127     "pd.Series(learning_rate(x, ntot)).plot();"
    128    ]
    129   },
    130   {
    131    "cell_type": "markdown",
    132    "metadata": {},
    133    "source": [
    134     "## Cross-Validate GBM Model"
    135    ]
    136   },
    137   {
    138    "cell_type": "markdown",
    139    "metadata": {},
    140    "source": [
    141     "### CV Settings"
    142    ]
    143   },
    144   {
    145    "cell_type": "code",
    146    "execution_count": null,
    147    "metadata": {},
    148    "outputs": [],
    149    "source": [
    150     "GBM = 'lightgbm'\n",
    151     "HOLDOUT = True\n",
    152     "FACTORS = True\n",
    153     "n_splits = 12\n",
    154     "\n",
    155     "result_key = f\"/{GBM}/{'factors' if FACTORS else 'dummies'}/results/2\""
    156    ]
    157   },
    158   {
    159    "cell_type": "markdown",
    160    "metadata": {},
    161    "source": [
    162     "### Create Binary Datasets"
    163    ]
    164   },
    165   {
    166    "cell_type": "markdown",
    167    "metadata": {},
    168    "source": [
    169     "All libraries have their own data format to precompute feature statistics to accelerate the search for split points, as described previously. These can also be persisted to accelerate the start of subsequent training.\n",
    170     "\n",
    171     "The following code constructs binary train and validation datasets for each model to be used with the OneStepTimeSeriesSplit."
    172    ]
    173   },
    174   {
    175    "cell_type": "markdown",
    176    "metadata": {},
    177    "source": [
    178     "The available options vary slightly:\n",
    179     "- xgboost allows the use of all available threads\n",
    180     "- lightgbm explicitly aligns the quantiles that are created for the validation set with the training set\n",
    181     "- The catboost implementation needs feature columns identified using indices rather than labels"
    182    ]
    183   },
    184   {
    185    "cell_type": "code",
    186    "execution_count": null,
    187    "metadata": {},
    188    "outputs": [],
    189    "source": [
    190     "def get_datasets(features, target, kfold, model='xgboost'):\n",
    191     "    cat_cols = ['year', 'month', 'age', 'msize', 'sector']\n",
    192     "    data = {}\n",
    193     "    for fold, (train_idx, test_idx) in enumerate(kfold.split(features)):\n",
    194     "        print(fold, end=' ', flush=True)\n",
    195     "        if model == 'xgboost':\n",
    196     "            data[fold] = {'train': xgb.DMatrix(label=target.iloc[train_idx],\n",
    197     "                                               data=features.iloc[train_idx],\n",
    198     "                                               nthread=-1),                     # use avail. threads\n",
    199     "                          'valid': xgb.DMatrix(label=target.iloc[test_idx],\n",
    200     "                                               data=features.iloc[test_idx],\n",
    201     "                                               nthread=-1)}\n",
    202     "        elif model == 'lightgbm':\n",
    203     "            train = lgb.Dataset(label=target.iloc[train_idx],\n",
    204     "                                data=features.iloc[train_idx],\n",
    205     "                                categorical_feature=cat_cols,\n",
    206     "                                free_raw_data=False)\n",
    207     "\n",
    208     "            # align validation set histograms with training set\n",
    209     "            valid = train.create_valid(label=target.iloc[test_idx],\n",
    210     "                                       data=features.iloc[test_idx])\n",
    211     "\n",
    212     "            data[fold] = {'train': train.construct(),\n",
    213     "                          'valid': valid.construct()}\n",
    214     "\n",
    215     "        elif model == 'catboost':\n",
    216     "            # get categorical feature indices\n",
    217     "            cat_cols_idx = [features.columns.get_loc(c) for c in cat_cols]\n",
    218     "            data[fold] = {'train': Pool(label=target.iloc[train_idx],\n",
    219     "                                        data=features.iloc[train_idx],\n",
    220     "                                        cat_features=cat_cols_idx),\n",
    221     "\n",
    222     "                          'valid': Pool(label=target.iloc[test_idx],\n",
    223     "                                        data=features.iloc[test_idx],\n",
    224     "                                        cat_features=cat_cols_idx)}\n",
    225     "    return data"
    226    ]
    227   },
    228   {
    229    "cell_type": "markdown",
    230    "metadata": {},
    231    "source": [
    232     "### Get Data"
    233    ]
    234   },
    235   {
    236    "cell_type": "code",
    237    "execution_count": null,
    238    "metadata": {},
    239    "outputs": [],
    240    "source": [
    241     "y, features = get_data()\n",
    242     "if FACTORS:\n",
    243     "    X = factorize_cats(features)\n",
    244     "else:\n",
    245     "    X = get_one_hot_data(features)\n",
    246     "\n",
    247     "if HOLDOUT:\n",
    248     "    y, X, y_test, X_test = get_holdout_set(target=y,\n",
    249     "                                           features=X)\n",
    250     "\n",
    251     "    with pd.HDFStore('model_tuning.h5') as store:\n",
    252     "        key = f'{GBM}/holdout/'\n",
    253     "        if not any([k for k in store.keys() if k[1:].startswith(key)]):\n",
    254     "            store.put(key + 'features', X_test, format='t' if FACTORS else 'f')\n",
    255     "            store.put(key + 'target', y_test)\n",
    256     "\n",
    257     "cv = OneStepTimeSeriesSplit(n_splits=n_splits)\n",
    258     "\n",
    259     "datasets = get_datasets(features=X, target=y, kfold=cv, model=GBM)"
    260    ]
    261   },
    262   {
    263    "cell_type": "markdown",
    264    "metadata": {},
    265    "source": [
    266     "### Define Parameter Grid"
    267    ]
    268   },
    269   {
    270    "cell_type": "markdown",
    271    "metadata": {},
    272    "source": [
    273     "The numerous hyperparameters are listed in [gbm_params.py](gbm_params.py). Each library has parameter settings to:\n",
    274     "- specify the overall objectives and learning algorithm\n",
    275     "- design the base learners\n",
    276     "- apply various regularization techniques\n",
    277     "- handle early stopping during training\n",
    278     "- enable the use of a GPU or parallelization on the CPU\n",
    279     "\n",
    280     "The documentation for each library details the various parameters that may refer to the same concept, but which have different names across libraries. [This site](https://sites.google.com/view/lauraepp/parameters) highlights the corresponding parameters for xgboost and lightgbm."
    281    ]
    282   },
    283   {
    284    "cell_type": "markdown",
    285    "metadata": {},
    286    "source": [
    287     "To explore the hyperparameter space, we specify values for key parameters that we would like to test in combination. The sklearn library supports RandomizedSearchCV to cross-validate a subset of parameter combinations that are sampled randomly from specified distributions. \n",
    288     "\n",
    289     "We will implement a custom version that allows us to leverage early stopping while monitoring the current best-performing combinations so we can abort the search process once satisfied with the result rather than specifying a set number of iterations beforehand.\n",
    290     "\n",
    291     "To this end, we specify a parameter grid according to each library's parameters as before, generate all combinations using the built-in Cartesian [product](https://docs.python.org/3/library/itertools.html#itertools.product) generator provided by the itertools library, and randomly shuffle the result. \n",
    292     "\n",
    293     "In the case of LightGBM, we automatically set `max_depth` as a function of the current `num_leaves` value, as shown in the following code:"
    294    ]
    295   },
    296   {
    297    "cell_type": "code",
    298    "execution_count": null,
    299    "metadata": {},
    300    "outputs": [],
    301    "source": [
    302     "param_grid = dict(\n",
    303     "        # common options\n",
    304     "        learning_rate=[.01, .1, .3],\n",
    305     "        # max_depth=list(range(3, 14, 2)),\n",
    306     "        colsample_bytree=[.8, 1],  # except catboost\n",
    307     "\n",
    308     "        # lightgbm\n",
    309     "        # max_bin=[32, 128],\n",
    310     "        num_leaves=[2 ** i for i in range(9, 14)],\n",
    311     "        boosting=['gbdt', 'dart'],\n",
    312     "        min_gain_to_split=[0, 1, 5],  # not supported on GPU\n",
    313     "\n",
    314     "        # xgboost\n",
    315     "        # booster=['gbtree', 'dart'],\n",
    316     "        # gamma=[0, 1, 5],\n",
    317     "\n",
    318     "        # catboost\n",
    319     "        # one_hot_max_size=[None, 2],\n",
    320     "        # max_ctr_complexity=[1, 2, 3],\n",
    321     "        # random_strength=[None, 1],\n",
    322     "        # colsample_bylevel=[.6, .8, 1]\n",
    323     ")"
    324    ]
    325   },
    326   {
    327    "cell_type": "code",
    328    "execution_count": null,
    329    "metadata": {},
    330    "outputs": [],
    331    "source": [
    332     "all_params = list(product(*param_grid.values()))\n",
    333     "n_models = len(all_params)\n",
    334     "np.random.shuffle(all_params)  # shuffles in place using the NumPy seed set above, so the order is reproducible\n",
    335     "\n",
    336     "print('\\n# Models:', n_models)"
    337    ]
    338   },
    339   {
    340    "cell_type": "markdown",
    341    "metadata": {},
    342    "source": [
    343     "### Run Cross Validation"
    344    ]
    345   },
    346   {
    347    "cell_type": "markdown",
    348    "metadata": {},
    349    "source": [
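    350     "The following function `run_cv()` implements cross-validation using the library-specific commands. For xgboost and lightgbm, `train()` also records the training and validation scores in the `scores` dictionary; for catboost, the scores are computed from the predicted probabilities.\n",
    351     "\n",
    352     "When early stopping takes effect, training ends `early_stopping_rounds` iterations after the best validation score, so the last recorded score (read with `[-1]` below) can lag slightly behind the best one."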
    353    ]
    354   },
    355   {
    356    "cell_type": "code",
    357    "execution_count": null,
    358    "metadata": {},
    359    "outputs": [],
    360    "source": [
    361     "def run_cv(test_params, data, n_splits=10, gb_machine='xgboost'):\n",
    362     "    \"\"\"Train-Validate with early stopping\"\"\"\n",
    363     "    result = []\n",
    364     "    cols = ['rounds', 'train', 'valid']\n",
    365     "    for fold in range(n_splits):\n",
    366     "        train = data[fold]['train']\n",
    367     "        valid = data[fold]['valid']\n",
    368     "\n",
    369     "        scores = {}\n",
    370     "        if gb_machine == 'xgboost':\n",
    371     "            model = xgb.train(params=test_params,\n",
    372     "                              dtrain=train,\n",
    373     "                              evals=list(zip([train, valid], ['train', 'valid'])),\n",
    374     "                              verbose_eval=50,\n",
    375     "                              num_boost_round=250,\n",
    376     "                              early_stopping_rounds=25,\n",
    377     "                              evals_result=scores)\n",
    378     "\n",
    379     "            result.append([model.best_iteration,\n",
    380     "                           scores['train']['auc'][-1],\n",
    381     "                           scores['valid']['auc'][-1]])\n",
    382     "        elif gb_machine == 'lightgbm':\n",
    383     "            model = lgb.train(params=test_params,\n",
    384     "                              train_set=train,\n",
    385     "                              valid_sets=[train, valid],\n",
    386     "                              valid_names=['train', 'valid'],\n",
    387     "                              num_boost_round=250,\n",
    388     "                              early_stopping_rounds=25,\n",
    389     "                              verbose_eval=50,\n",
    390     "                              evals_result=scores)\n",
    391     "\n",
    392     "            result.append([model.current_iteration(),\n",
    393     "                           scores['train']['auc'][-1],\n",
    394     "                           scores['valid']['auc'][-1]])\n",
    395     "\n",
    396     "        elif gb_machine == 'catboost':\n",
    397     "            model = CatBoostClassifier(**test_params)\n",
    398     "            model.fit(X=train,\n",
    399     "                      eval_set=[valid],\n",
    400     "                      logging_level='Silent')\n",
    401     "\n",
    402     "            train_score = model.predict_proba(train)[:, 1]\n",
    403     "            valid_score = model.predict_proba(valid)[:, 1]\n",
    404     "            result.append([\n",
    405     "                model.tree_count_,\n",
    406     "                roc_auc_score(y_score=train_score, y_true=train.get_label()),\n",
    407     "                roc_auc_score(y_score=valid_score, y_true=valid.get_label())\n",
    408     "            ])\n",
    409     "\n",
    410     "    df = pd.DataFrame(result, columns=cols)\n",
    411     "    return pd.concat([\n",
    412     "        df.mean(),\n",
    413     "        df.std().rename({c: c + '_std' for c in cols}),\n",
    414     "        pd.Series(test_params)])"
    415    ]
    416   },
    417   {
    418    "cell_type": "markdown",
    419    "metadata": {},
    420    "source": [
    421     "The following code executes an exhaustive search over the parameter grid. The algorithms are already multithreaded, so GridSearchCV does not add parallelization benefits. The 'manual' implementation below allows for more transparency during execution."
    422    ]
    423   },
    424   {
    425    "cell_type": "code",
    426    "execution_count": null,
    427    "metadata": {},
    428    "outputs": [],
    429    "source": [
    430     "results = pd.DataFrame()\n",
    431     "\n",
    432     "start = time()\n",
    433     "for n, test_param in enumerate(all_params, 1):\n",
    434     "    iteration = time()\n",
    435     "\n",
    436     "    cv_params = get_params(GBM)\n",
    437     "    cv_params.update(dict(zip(param_grid.keys(), test_param)))\n",
    438     "    if GBM == 'lightgbm':\n",
    439     "        cv_params['max_depth'] = int(ceil(np.log2(cv_params['num_leaves'])))\n",
    440     "\n",
    441     "    results[n] = run_cv(test_params=cv_params,\n",
    442     "                        data=datasets,\n",
    443     "                        n_splits=n_splits,\n",
    444     "                        gb_machine=GBM)\n",
    445     "    results.loc['time', n] = time() - iteration\n",
    446     "\n",
    447     "    if n > 1:\n",
    448     "        df = results[~results.eq(results.iloc[:, 0], axis=0).all(1)].T\n",
    449     "        if 'valid' in df.columns:\n",
    450     "            df.valid = pd.to_numeric(df.valid)\n",
    451     "            print('\\n')\n",
    452     "            print(df.sort_values('valid', ascending=False).head(5).reset_index(drop=True))\n",
    453     "\n",
    454     "    out = f'\\n\\tModel: {n} of {n_models} | '\n",
    455     "    out += f'{format_time(time() - iteration)} | '\n",
    456     "    out += f'Total: {format_time(time() - start)} | '\n",
    457     "    print(out + f'Remaining: {format_time((time() - start)/n*(n_models-n))}\\n')\n",
    458     "\n",
    459     "    with pd.HDFStore('model_tuning.h5') as store:\n",
    460     "        store.put(result_key, results.T.apply(pd.to_numeric, errors='ignore'))"
    461    ]
    462   }
    463  ],
    464  "metadata": {
    465   "kernelspec": {
    466    "display_name": "Python 3",
    467    "language": "python",
    468    "name": "python3"
    469   },
    470   "language_info": {
    471    "codemirror_mode": {
    472     "name": "ipython",
    473     "version": 3
    474    },
    475    "file_extension": ".py",
    476    "mimetype": "text/x-python",
    477    "name": "python",
    478    "nbconvert_exporter": "python",
    479    "pygments_lexer": "ipython3",
    480    "version": "3.7.0"
    481   },
    482   "toc": {
    483    "base_numbering": 1,
    484    "nav_menu": {},
    485    "number_sections": true,
    486    "sideBar": true,
    487    "skip_h1_title": true,
    488    "title_cell": "Table of Contents",
    489    "title_sidebar": "Contents",
    490    "toc_cell": false,
    491    "toc_position": {},
    492    "toc_section_display": true,
    493    "toc_window_display": true
    494   }
    495  },
    496  "nbformat": 4,
    497  "nbformat_minor": 2
    498 }