{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false }, "source": [ "# [Introduction to Data Science](http://datascience-intro.github.io/1MS041-2023/) \n", "## 1MS041, 2023 \n", "©2023 Raazesh Sainudiin, Benny Avelin. [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_diabetes" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "dataset = load_diabetes()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _diabetes_dataset:\n", "\n", "Diabetes dataset\n", "----------------\n", "\n", "Ten baseline variables, age, sex, body mass index, average blood\n", "pressure, and six blood serum measurements were obtained for each of n =\n", "442 diabetes patients, as well as the response of interest, a\n", "quantitative measure of disease progression one year after baseline.\n", "\n", "**Data Set Characteristics:**\n", "\n", " :Number of Instances: 442\n", "\n", " :Number of Attributes: First 10 columns are numeric predictive values\n", "\n", " :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n", "\n", " :Attribute Information:\n", " - age age in years\n", " - sex\n", " - bmi body mass index\n", " - bp average blood pressure\n", " - s1 tc, total serum cholesterol\n", " - s2 ldl, low-density lipoproteins\n", " - s3 hdl, high-density lipoproteins\n", " - s4 tch, total cholesterol / HDL\n", " - s5 ltg, possibly log of serum triglycerides level\n", " - s6 glu, blood sugar level\n", "\n", "Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. 
the sum of squares of each column totals 1).\n", "\n", "Source URL:\n", "https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n", "\n", "For more information see:\n", "Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) \"Least Angle Regression,\" Annals of Statistics (with discussion), 407-499.\n", "(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)\n", "\n" ] } ], "source": [ "print(dataset.DESCR)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "X,Y = load_diabetes(return_X_y=True)\n", "X_train,X_test,Y_train,Y_test = train_test_split(X,Y,random_state=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(331, 10) (111, 10) (331,) (111,)\n" ] } ], "source": [ "print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "lr = LinearRegression()\n", "lr.fit(X_train,Y_train)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "plt.scatter(lr.predict(X_test),Y_test)\n", "plt.scatter(lr.predict(X_test),lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "45.120563074396195" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "MAE = np.mean(np.abs(Y_test - lr.predict(X_test)))\n", "MAE" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "321" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = 346\n", "a = 25\n", "span = b-a\n", "span" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from Utils import epsilon_bounded" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "82.75719699590479" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "epsilon = epsilon_bounded(len(Y_test),span*2,0.05)\n", "epsilon" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[-37.6366339215086, 127.87776007030098]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[MAE-epsilon,MAE+epsilon]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wine quality dataset" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "df_red = pd.read_csv('/Users/avelin/Downloads/winequality-red.csv',sep=';')\n", "df_white = pd.read_csv('/Users/avelin/Downloads/winequality-white.csv',sep=';')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], 
"source": [ "df_red['type'] = 1" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table>\n",
 "<thead>\n",
 "<tr><th></th><th>fixed acidity</th><th>volatile acidity</th><th>citric acid</th><th>residual sugar</th><th>chlorides</th><th>free sulfur dioxide</th><th>total sulfur dioxide</th><th>density</th><th>pH</th><th>sulphates</th><th>alcohol</th><th>quality</th><th>type</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>7.4</td><td>0.70</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.9978</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td><td>1</td></tr>\n",
 "<tr><th>1</th><td>7.8</td><td>0.88</td><td>0.00</td><td>2.6</td><td>0.098</td><td>25.0</td><td>67.0</td><td>0.9968</td><td>3.20</td><td>0.68</td><td>9.8</td><td>5</td><td>1</td></tr>\n",
 "<tr><th>2</th><td>7.8</td><td>0.76</td><td>0.04</td><td>2.3</td><td>0.092</td><td>15.0</td><td>54.0</td><td>0.9970</td><td>3.26</td><td>0.65</td><td>9.8</td><td>5</td><td>1</td></tr>\n",
 "<tr><th>3</th><td>11.2</td><td>0.28</td><td>0.56</td><td>1.9</td><td>0.075</td><td>17.0</td><td>60.0</td><td>0.9980</td><td>3.16</td><td>0.58</td><td>9.8</td><td>6</td><td>1</td></tr>\n",
 "<tr><th>4</th><td>7.4</td><td>0.70</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.9978</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td><td>1</td></tr>\n",
 "</tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", "0 7.4 0.70 0.00 1.9 0.076 \n", "1 7.8 0.88 0.00 2.6 0.098 \n", "2 7.8 0.76 0.04 2.3 0.092 \n", "3 11.2 0.28 0.56 1.9 0.075 \n", "4 7.4 0.70 0.00 1.9 0.076 \n", "\n", " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", "0 11.0 34.0 0.9978 3.51 0.56 \n", "1 25.0 67.0 0.9968 3.20 0.68 \n", "2 15.0 54.0 0.9970 3.26 0.65 \n", "3 17.0 60.0 0.9980 3.16 0.58 \n", "4 11.0 34.0 0.9978 3.51 0.56 \n", "\n", " alcohol quality type \n", "0 9.4 5 1 \n", "1 9.8 5 1 \n", "2 9.8 5 1 \n", "3 9.8 6 1 \n", "4 9.4 5 1 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_red.head(5)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "df_white['type'] = 0" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "feature_cols = [col for col in df_red.columns if col!='quality']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['fixed acidity',\n", " 'volatile acidity',\n", " 'citric acid',\n", " 'residual sugar',\n", " 'chlorides',\n", " 'free sulfur dioxide',\n", " 'total sulfur dioxide',\n", " 'density',\n", " 'pH',\n", " 'sulphates',\n", " 'alcohol',\n", " 'type']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_cols" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "target = 'quality'" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "X1 = df_red[feature_cols].to_numpy()\n", "X2 = df_white[feature_cols].to_numpy()\n", "Y1 = df_red[target].to_numpy()\n", "Y2 = df_white[target].to_numpy()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "X = np.concatenate([X1,X2],axis=0)\n", "Y = 
np.concatenate([Y1,Y2],axis=0)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6497, 12) (6497,)\n" ] } ], "source": [ "print(X.shape,Y.shape)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train,X_test,Y_train,Y_test = train_test_split(X,Y,random_state=0)\n", "from sklearn.linear_model import LinearRegression\n", "lr = LinearRegression()\n", "lr.fit(X_train,Y_train)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "plt.scatter(lr.predict(X_test),Y_test)\n", "plt.scatter(lr.predict(X_test),lr.predict(X_test))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.3587525442945037" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.max(np.abs(Y_test-lr.predict(X_test)))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5877140417627457" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "MAE = np.mean(np.abs(Y_test - lr.predict(X_test)))\n", "MAE" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1625" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(Y_test)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.16845176105008947" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "epsilon = epsilon_bounded(len(Y_test),5,0.05)\n", "epsilon" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.4192622807126562, 0.7561658028128351]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[MAE-epsilon,MAE+epsilon]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "lx_course_instance": "2023", "lx_course_name": "Introduction to Data Science", 
"lx_course_number": "1MS041" }, "nbformat": 4, "nbformat_minor": 5 }