{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "d4373dd8", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Vignette.\n", "\n", "This script illustrates sparse linear multi-task regression with\n", "(i) a common feature matrix, (ii) specific feature matrices,\n", "and (iii) privileged information.\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "0610491f", "metadata": {}, "source": [ "# Examples\n", "\n", "Use the class `CoopLassoCV` for multi-task regression\n", "(sharing information between targets and features).\n", "For comparison, use the class `IndepLassoCV` for\n", "independent lasso regressions for multiple targets." ] }, { "cell_type": "markdown", "id": "2afb9f42", "metadata": {}, "source": [ "## Initialisation\n", "Import the function `simulate` to simulate data,\n", "and import the class `CoopLassoCV` to perform linear multi-task regression." ] }, { "cell_type": "code", "execution_count": null, "id": "3c9f462b", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import mean_squared_error, precision_score\n", "from collasso import simulate, CoopLassoCV" ] }, { "cell_type": "markdown", "id": "9f451d44", "metadata": {}, "source": [ "## (i) Multi-task regression with a common feature matrix\n", "\n", "The standard setting for multi-task regression involves\n", "a feature matrix of shape `(n_samples, p_features)`\n", "and a target matrix of shape `(n_samples, q_targets)`.\n", "\n", "Model training requires the feature matrix\n", "and the target matrix of the training samples (`x_train` and `y_train`),\n", "model testing requires the feature matrix of the testing samples (`x_test`)." 
] }, { "cell_type": "markdown", "id": "439609bd", "metadata": {}, "source": [ "Simulate training and test data:" ] }, { "cell_type": "code", "execution_count": null, "id": "cb1c38a2", "metadata": {}, "outputs": [], "source": [ "x_train, y_train, x_test, y_test, beta = simulate()" ] }, { "cell_type": "markdown", "id": "e0cf7ccd", "metadata": {}, "source": [ "Fit linear multi-task regression:" ] }, { "cell_type": "code", "execution_count": null, "id": "948f5452", "metadata": {}, "outputs": [], "source": [ "model = CoopLassoCV()\n", "model.fit(X=x_train, y=y_train)" ] }, { "cell_type": "markdown", "id": "b8bfa278", "metadata": {}, "source": [ "Extract the estimated coefficients, a matrix of shape `(p_features, q_targets)`,\n", "and calculate the precision of the feature selection\n", "(the proportion of non-zero estimates whose true coefficient is non-zero):" ] }, { "cell_type": "code", "execution_count": null, "id": "6b4aab9e", "metadata": {}, "outputs": [], "source": [ "beta_hat = model.coef_.T  # estimated regression coefficients\n", "precision_score(y_true=beta != 0, y_pred=beta_hat != 0, average=\"micro\")" ] }, { "cell_type": "markdown", "id": "9e25f907", "metadata": {}, "source": [ "Make out-of-sample predictions, a matrix of shape `(n_samples, q_targets)`,\n", "and calculate the mean squared error:" ] }, { "cell_type": "code", "execution_count": null, "id": "b5e80bfb", "metadata": {}, "outputs": [], "source": [ "y_hat = model.predict(X=x_test)\n", "mean_squared_error(y_true=y_test, y_pred=y_hat)" ] }, { "cell_type": "markdown", "id": "3d98211a", "metadata": {}, "source": [ "## (ii) Multi-task regression with specific feature matrices\n", "\n", "In some settings, there is no common feature matrix for all targets\n", "but a specific feature matrix for each target.\n", "Then the model requires a feature array of shape `(n_samples, p_features, q_targets)`." 
] }, { "cell_type": "markdown", "id": "910e02bc", "metadata": {}, "source": [ "Simulate training and test data:" ] }, { "cell_type": "code", "execution_count": null, "id": "54f99067", "metadata": {}, "outputs": [], "source": [ "x_train, y_train, x_test, y_test, beta = simulate(kappa=0.5)" ] }, { "cell_type": "markdown", "id": "c42e0f09", "metadata": {}, "source": [ "Fit linear multi-task regression:" ] }, { "cell_type": "code", "execution_count": null, "id": "f85ec553", "metadata": {}, "outputs": [], "source": [ "model = CoopLassoCV()\n", "model.fit(X=x_train, y=y_train)" ] }, { "cell_type": "markdown", "id": "3978d428", "metadata": {}, "source": [ "Extract the estimated coefficients, a matrix of shape `(p_features, q_targets)`,\n", "and calculate the precision of the feature selection:" ] }, { "cell_type": "code", "execution_count": null, "id": "c9201327", "metadata": {}, "outputs": [], "source": [ "beta_hat = model.coef_.T  # estimated regression coefficients\n", "precision_score(y_true=beta != 0, y_pred=beta_hat != 0, average=\"micro\")" ] }, { "cell_type": "markdown", "id": "17069981", "metadata": {}, "source": [ "Make out-of-sample predictions, a matrix of shape `(n_samples, q_targets)`,\n", "and calculate the mean squared error:" ] }, { "cell_type": "code", "execution_count": null, "id": "a2e60e9a", "metadata": {}, "outputs": [], "source": [ "y_hat = model.predict(X=x_test)\n", "mean_squared_error(y_true=y_test, y_pred=y_hat)" ] }, { "cell_type": "markdown", "id": "1f2cdf6f", "metadata": {}, "source": [ "## (iii) Multi-task regression with privileged information\n", "\n", "In some applications, some features are available for model training\n", "but not for model testing (privileged information).\n", "In contrast to primary features,\n", "these auxiliary features must not be selected by the model.\n", "Irrespective of whether there is a common feature matrix for all targets\n", "or a specific feature matrix for each target,\n", "it is possible to exclude the same set of features for all targets,\n", "or specific sets of features for
each target." ] }, { "cell_type": "markdown", "id": "2ae1086f", "metadata": {}, "source": [ "**Step 1** - Simulate data" ] }, { "cell_type": "markdown", "id": "6caa1976", "metadata": {}, "source": [ "**Option A**: common feature matrix" ] }, { "cell_type": "code", "execution_count": null, "id": "e9b65597", "metadata": {}, "outputs": [], "source": [ "x_train, y_train, x_test, y_test, beta = simulate()" ] }, { "cell_type": "markdown", "id": "e877100e", "metadata": {}, "source": [ "**Option B**: specific feature matrices" ] }, { "cell_type": "code", "execution_count": null, "id": "b73babaa", "metadata": {}, "outputs": [], "source": [ "x_train, y_train, x_test, y_test, beta = simulate(kappa=0.5)" ] }, { "cell_type": "markdown", "id": "bcd1594c", "metadata": {}, "source": [ "**Step 2** - Define primary and auxiliary features" ] }, { "cell_type": "markdown", "id": "52696b40", "metadata": {}, "source": [ "**Option A**:\n", "Allow the model to select from the *same set* of features for all targets,\n", "by defining `z` as a vector of shape `(p_features,)`:" ] }, { "cell_type": "code", "execution_count": null, "id": "162957e8", "metadata": {}, "outputs": [], "source": [ "z = np.zeros(x_train.shape[1])\n", "z[0:100] = 1" ] }, { "cell_type": "markdown", "id": "079c1975", "metadata": {}, "source": [ "(Here, all targets have the primary features `x_1,...,x_100`\n", "and the auxiliary features `x_101,...,x_200`.)" ] }, { "cell_type": "markdown", "id": "6fd86460", "metadata": {}, "source": [ "**Option B**:\n", "Allow the model to select from a *different set* of features for each target,\n", "by defining `z` as a matrix of shape `(p_features, q_targets)`:" ] }, { "cell_type": "code", "execution_count": null, "id": "fd64e8d0", "metadata": {}, "outputs": [], "source": [ "z = np.zeros((x_train.shape[1], y_train.shape[1]))\n", "z[0:50, 0] = 1\n", "z[100:175, 1] = 1\n", "z[125:200, 2] = 1" ] }, { "cell_type": "markdown", "id": "64b2a46a", "metadata": {}, "source": [ "(Here, the first
target has the primary features `x_1,...,x_50`,\n", "the second target `x_101,...,x_175`,\n", "and the third target `x_126,...,x_200`.)" ] }, { "cell_type": "markdown", "id": "a74227c4", "metadata": {}, "source": [ "Fit linear multi-task regression:" ] }, { "cell_type": "code", "execution_count": null, "id": "eac24085", "metadata": {}, "outputs": [], "source": [ "model = CoopLassoCV()\n", "model.fit(X=x_train, y=y_train, Z=z)" ] }, { "cell_type": "markdown", "id": "4f19ee35", "metadata": {}, "source": [ "Make out-of-sample predictions and extract the estimated coefficients,\n", "a matrix of shape `(p_features, q_targets)`:" ] }, { "cell_type": "code", "execution_count": null, "id": "2981e502", "metadata": {}, "outputs": [], "source": [ "y_hat = model.predict(X=x_test)\n", "beta_hat = model.coef_.T" ] }, { "cell_type": "markdown", "id": "de914140", "metadata": {}, "source": [ "**Note**: As auxiliary features are not selected,\n", "their values in the test data have no impact on predictions." ] }, { "cell_type": "code", "execution_count": null, "id": "9314e380", "metadata": {}, "outputs": [], "source": [ "# The coefficients of auxiliary features are exactly zero.\n", "assert np.all(beta_hat[z == 0, ...] == 0)\n", "# Overwrite auxiliary features in the test data with missing values.\n", "x_test_new = x_test.copy()\n", "if x_test.ndim == 2 and z.ndim == 2:\n", "    # Common feature matrix: features that are auxiliary for all targets.\n", "    x_test_new[:, np.sum(z, axis=1) == 0] = np.nan\n", "else:\n", "    x_test_new[:, z == 0, ...] = np.nan\n", "# The predictions remain unchanged.\n", "assert np.all(y_hat == model.predict(x_test_new))" ] }, { "cell_type": "markdown", "id": "f7137e66", "metadata": {}, "source": [ "## Related packages\n", "\n", "- **scikit-learn** implements\n", "linear multi-task lasso and elastic net regression,\n", "in the classes `MultiTaskLasso`, `MultiTaskLassoCV`,\n", "`MultiTaskElasticNet` and `MultiTaskElasticNetCV`.
\n", "[GitHub](https://github.com/scikit-learn/scikit-learn)|\n", "[PyPI](https://pypi.org/project/scikit-learn/)|\n", "[website](https://scikit-learn.org)\n", "\n", "- **MuTaR** from Hicham Janati implements\n", "group-norms multi-task linear models\n", "and optimal transport regularised models.
\n", "[GitHub](https://github.com/hichamjanati/mutar)|\n", "[PyPI](https://pypi.org/project/mutar/)|\n", "[website](https://hichamjanati.github.io/mutar/)\n", "\n", "- **scikit-MTR** from Hengzhe Zhang implements\n", "multi-task regression by stacking.
\n", "[GitHub](https://github.com/hengzhe-zhang/Scikit-MTR)|\n", "[PyPI](https://pypi.org/project/scikit-MTR/)" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "main_language": "python", "notebook_metadata_filter": "-all" } }, "nbformat": 4, "nbformat_minor": 5 }