"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses).\n",
">\n",
">Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course overview\n",
"\n",
"This is the sixth, and final, module of several (11, 12, 13, 14, 15 and 16), which refocuses the course material in the [first 10 modules](https://github.com/kgdunn/python-basic-notebooks) in a slightly different way. It places more emphasis on\n",
"\n",
"* dealing with data: importing, merging, filtering;\n",
"* calculations from the data;\n",
"* visualization of it.\n",
"\n",
"In short: ***how to extract value from your data***.\n",
"\n",
"## Review so far\n",
"\n",
"In [module 11](https://yint.org/pybasic11) we learned about\n",
"* creating variables, and showing their `type`,\n",
"* performing basic calculations, and the `math` library,\n",
"* lists, as one of the most fundamental Python objects.\n",
"\n",
"In the [module 12](https://yint.org/pybasic12) we took this a step further:\n",
"* and introduced the Pandas library, for `Series` and `DataFrame` objects,\n",
"* learned how to import and write Excel files,\n",
"* do basic operations on DataFrames, and \n",
"* learned about another fundamental Python type, the `dict`ionary.\n",
"\n",
"[Module 13](https://yint.org/pybasic13) we introduced:\n",
"* a general workflow for data processing\n",
"* and how to visualize data with Pandas:\n",
"\n",
" * box plot, \n",
" * time series (sequence) plot, and\n",
" * scatter plots [including showing how you can visualize 5 dimensions!]\n",
"\n",
"[Module 14](https://yint.org/pybasic14) we saw how to create:\n",
"* for loops, for when we need to do things over and over,\n",
"* but we also saw the `groupby` function, which does actions repeatedly on sub-groups of your data.\n",
"* We also introduced the correlation matrix.\n",
"\n",
"Then in [module 15](https://yint.org/pybasic15) we saw:\n",
"* that we could visualize the correlation matrix (2D histogram), to find candidates for regression,\n",
"* using the `LinearRegression` tool from a new library, `scikit-learn`.\n",
"* We also used another new library, `seaborn`, to visualize these regression models.\n",
"\n",
"\n",
"# Module 16 Overview\n",
"\n",
"In this module we will cover a collection of last loose ends. Things you will use regularly in your work. \n",
"\n",
"* Reading in subsets of data\n",
"* Handling missing values with Pandas\n",
"* Filtering data, and the multi-level `groupby` capability of Panda\n",
"* Effective table display in Pandas\n",
"\n",
"Most of them come from this list, with some modifications: https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Data set import and basic checks\n",
"\n",
"We will use a data set that is related to food consumption. It shows, in a relative way, the food consumption habits of European (and soon to be former EU) countries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
"df.info()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualizing the correlation matrix is essential to help understanding relationships. Use the code and the plot below to help answer:\n",
"\n",
"* Countries which consume garlic more than average, also seem to consume a higher amount of ...\n",
"* Which variables are negatively correlated with \"Real coffee\" consumption?\n",
"* Countries with higher consumption of \"Crisp bread\" also show high consumption of which other products?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"sns.set(rc={'figure.figsize':(15, 15)})\n",
"cmap = sns.diverging_palette(220, 10, as_cmap=True)\n",
"sns.heatmap(df.corr(), cmap=cmap, square=True, linewidths=0.2, cbar_kws={\"shrink\": 0.5});"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## List comprehensions\n",
"\n",
"\"*List comprehensions*\" are a quick way to make a list. You can read more, and see some examples here: https://realpython.com/list-comprehension-python/#using-list-comprehensions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print( [i for i in range(10)] )\n",
"print( [i*2+1 for i in range(10)] )\n",
"print( [i*2 for i in range(10) if i > 4] )\n",
"print( [i for i in range(10) if i % 2 == 1] )\n",
"print( [i for i in range(10) if i % 2 == 0] )\n",
"print( [i for i in range(10) if i % 4 == 1] )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading only certain rows\n",
"\n",
"Imagine you had a large data set, and only needed certain rows for your calculations/visualization later on. You can use the `nrows` and `skiprows` arguments to read only a subset of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_subset = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", nrows=5)\n",
"display(df_subset)\n",
"\n",
"df_partial = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", skiprows=[2, 3, 4])\n",
"display(df_partial)\n",
"\n",
"# Requires an extra `engine` input\n",
"df_bottom = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", skipfooter=12, engine='python')\n",
"display(df_bottom)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Skipping every 3rd row, using a list comprehension...\n",
"print([i for i in range(40) if i%3 ==1])\n",
"df_partial = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", \n",
" skiprows=[i for i in range(40) if i%3 ==1])\n",
"display(df_partial)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import specific columns only\n",
"\n",
"If you know the names of the columns you need, you can use the `usecols` input.\n",
"\n",
"Note: this also works for Excel files! You can say, for example, `usecols=\"F,G,BQ\"` if you need columns F, G and BQ only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_subset = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\", \n",
" usecols=[\"Country\", \"Sweetener\", \"Biscuits\", \"Powder soup\", \"Tin soup\"])\n",
"display(df_subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dropping columns or rows\n",
"\n",
"Conversely, you can read in the whole data set, and drop away the columns or rows you do not need."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = (\n",
" pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
" .drop([\"Sweetener\", \"Biscuits\", \"Powder soup\", \"Tin soup\"], axis=1)\n",
")\n",
"display(df)\n",
"df.shape\n",
"\n",
"# Also drop some rows: drop away every 3rd row.\n",
"# You can also leave away 'axis=0' (because that's the default)\n",
"df_subset = df.drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0) \n",
"display(df_subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting an index\n",
"\n",
"You can always make a column from your dataframe to be your `index`, using the `set_index` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
"df = df.set_index('Country')\n",
"display(df)\n",
"\n",
"# Or, in a single line, in a chained operation\n",
"df = (\n",
" pd.read_csv(\"https://openmv.net/file/food-consumption.csv\")\n",
" .drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)\n",
" .set_index('Country')\n",
")\n",
"display(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing and deleting missing values\n",
"\n",
"Pandas generally handles missing values well: for example, the ``df.mean()`` function will work even if there are missing values. But some mathematical tools cannot have missing values, such as when performing a linear regression. So deleting missing data first is an option. It is therefore helpful that you can:\n",
"\n",
"* Find how many missing values are there per column? Or per row?\n",
"* Delete columns with missing values.\n",
"* Deleting rows with any missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Which columns have missing values:\n",
"df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
"display(df.isna().sum())\n",
"\n",
"# Which rows have missing values:\n",
"df.isna().sum(axis=1)\n",
"\n",
"# Display missing values in a heat map\n",
"import matplotlib as mpl \n",
"mpl.rcParams.update(mpl.rcParamsDefault) # reset the visualization parameters\n",
"sns.heatmap(df.isna());"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confirm that the \"Sweetener\", \"Biscuits\", and \"Yoghurt\" columns are not present after running this command (these columns had missing values in them):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Delete columns with missing values\n",
"df.dropna(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confirm that the rows for \"Sweden\", \"Finland\", and \"Spain\", which had missing entries, are not present after this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Delete rows with missing values\n",
"df.dropna(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dropping missing values in all rows, but only for a subset of the columns is possible. For example, drop only rows in the columns for \"Sweetener\" and \"Yoghurt\" (ignore the column for \"Biscuits\"):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(df.dropna(subset=[\"Sweetener\", \"Yoghurt\"], axis=0))\n",
"\n",
"# Note: you can also flip this around. Specify a subset of row names\n",
"# in `subset` and delete from all columns, using `axis=1`.\n",
"df.dropna(subset=[\"Sweden\"], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using iloc and loc\n",
"\n",
"We learned about `.iloc` in the [prior module](https://yint.org/pybasic15). Let's look at this again, and emphasize the difference between `.iloc` and `.loc`. [This article](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/) gives more details about the two if you want some more explanation.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
"\n",
"# \"Instant coffee\" is column 1: make all these values missing\n",
"df.iloc[:, 1] = np.nan\n",
"display(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# But what if don't know, or care, which column index it is? \n",
"# When we know the column's name, then use \".loc\" \n",
"df.loc[:, \"Tea\"] = np.nan\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or you can use a list of column names:\n",
"df.loc[:, [\"Potatoes\",\"Frozen fish\"]] = 98.76\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can use a mixture of iloc and loc:\n",
"df.iloc[[0, 1, 2], :].loc[:, \"Tin soup\"]\n",
"\n",
"# but this is less code:\n",
"df.iloc[[0, 1, 2], :][\"Tin soup\"]\n",
"\n",
"# or even less this way:\n",
"df.iloc[[0, 1, 2]][\"Tin soup\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dropping missing values, specifying a threshold\n",
"\n",
"If you want to delete a column only if there are more than a certain number of missing values:\n",
"\n",
"* Read the data\n",
"* Make a column have a high number of missing values (for demonstration purposes; normally the column is already problematic)\n",
"* Remove that column, because it has a high degree of missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the data, and make every 3rd row a missing value for column \"Tea\"\n",
"df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
"\n",
"df.iloc[[i for i in range(16) if i%3 == 1]][\"Tea\"] = np.nan\n",
"\n",
"# The above code generates a warning. Why?\n",
"display(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# How to solve it? As suggested by the warning, use \".loc\" instead.\n",
"# df.loc[row_indexer, col_indexer] = np.nan\n",
"\n",
"# Create a variable containing all row names:\n",
"row_indexer = df.index\n",
"\n",
"# Now take every third row name:\n",
"row_indexer = df.index[ [i for i in range(16) if i%3 ==1] ]\n",
"row_indexer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Then, set these rows to have missing values:\n",
"df.loc[row_indexer, \"Tea\"] = np.nan\n",
"display(df)\n",
"df.isna().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Finally, we can now delete columns with a threshold (degree) of missing values\n",
"# What value should you fill in here?\n",
"display(df.dropna(thresh=11, axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering rows\n",
"\n",
"* Find which countries have `\"Olive oil\"` consumption of more than 50?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"https://openmv.net/file/food-consumption.csv\").set_index('Country')\n",
"df[ df[\"Olive oil\"] > 50 ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Which countries have `\"Olive oil\"` more than 50, **and** `\"Garlic\"` more than 40?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[ (df[\"Tea\"] > 30) & (df[\"Tea\"] < 80) ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Which countries have `\"Tea\"` more than 80, **or** `\"Oranges\"` more than 90?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[(df[\"Tea\"] > 80) | (df[\"Oranges\"] > 90)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering with the `.query` function\n",
"\n",
"It is sometimes more natural to filter with the `.query` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.query(\"30 < Tea < 80\")\n",
"\n",
"# or use backticks if the column name has a space:\n",
"df.query(\"10 < `Tin soup` < 20\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can have multiple queries:\n",
"\n",
"Find the countries which have \"Real coffee\" and \"Tea\" consumption above 70. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.query(\"(`Real coffee` > 70) or (Tea > 70)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Really powerful is the ability to reference one column against another.\n",
"\n",
"Find all countries where more `\"Instant coffee\"` is drunk more than `\"Real coffee\"`. *These are countries to avoid visiting*. What else do you notice about these countries eating habits?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.query(\"`Instant coffee` > `Real coffee`\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## New data set: raw material properties\n",
"\n",
"For the rest of the notebook we will switch to a new data set, where we characterize the properties of a raw material. As each batch of raw material is acquired, there are 6 measurements taken. There is also an indicator variable (categorical variable) on whether the raw materials outcome was (`Adequate`), or not (`Poor`).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"https://openmv.net/file/raw-material-characterization.csv\").set_index(\"Lot number\")\n",
"\n",
"# Note that the Outcome column is an object. We can explicitly convert it to a categorical variable:\n",
"df[\"Outcome\"] = df[\"Outcome\"].astype('category')\n",
"df.info()\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Single level `groupby`\n",
"\n",
"Recall the `groupby` function from [two modules ago](https://yint.org/pybasic14), which we applied as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Groupby: for plotting\n",
"df.groupby(\"Outcome\").plot.scatter(x='Size5', y=\"Size15\", xlim=(10, 16), ylim=(18, 45), grid=True);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or another combination of the variables plotted:\n",
"df.groupby(\"Outcome\").plot.scatter(x='TMA', y=\"TGA\", xlim=(45, 65), ylim=(550, 770), grid=True);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or using groupby for summaries\n",
"display(df.groupby(\"Outcome\").mean())\n",
"display(df.groupby(\"Outcome\").std())\n",
"\n",
"# Or, call all the summaries together. We will explain the .agg function below.\n",
"df.groupby(\"Outcome\").agg([\"mean\", \"std\"]).round(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multilevel groupby\n",
"\n",
"We can also use `groupby` for multiple levels. Imagine we have a second categorical variable, or some other variable with few discrete values:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First, create a new categorical variable, using a list comprehension\n",
"print( [item for item in df[\"Size5\"]] )\n",
"print( [item for item in df[\"Size5\"] if item <= 13] )\n",
"print( [\"Small\" if item <= 13 else \"Large\" for item in df[\"Size5\"]] )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using what you learned above, you can see how we can quickly create a new column, \n",
"# based on the values in another column.\n",
"df[\"Size\"] = [\"Small\" if item <= 13 else \"Large\" for item in df[\"Size5\"]]\n",
"\n",
"# Now you can use a multi-level groupby:\n",
"display(df.groupby([\"Outcome\", \"Size\", ]).size())\n",
"display(df.groupby([\"Outcome\", \"Size\", ]).mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple groupby summaries\n",
"\n",
"\n",
"In the above, we had to write 2 lines with the `groupby` function: once for `count` and once for `mean`. But you can get them both in 1 table, using the `agg` function. `agg` is short hand for aggregation (which means to form things into a cluster)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# These 2 lines do exactly the same:\n",
"display( df.groupby([\"Outcome\", \"Size\"]).mean() )\n",
"display( df.groupby([\"Outcome\", \"Size\"]).agg('mean') )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Now extend the input to the .agg function, with 2 aggregations:\n",
"display( df.groupby([\"Outcome\", \"Size\"]).agg([\"mean\", \"std\"]) )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can specify an entire collection of aggregations, and on which columns you want to do that:\n",
"agg_func_math = ['count', 'mean', 'median', 'min', 'max', 'std']\n",
"df.groupby(['Outcome'])[[\"Size5\", \"TGA\"]].agg(agg_func_math).round(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add a new column, not at the end (right-hand side)\n",
"\n",
"We saw above that we can create a new column, but that it automatically gets added on the right-hand side of the data frame. If you would like the column elsewhere, use the `.insert()` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.insert(0, 'EmptyColumn', np.nan)\n",
"df.insert(3, 'Column of ones', [1] * df.shape[0])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Replace values in a column\n",
"\n",
"We can do a \"search and replace\" function on the values in a data frame. \n",
"\n",
"Imagine if we wanted to change the `Outcome` column, and instead of `Adequate` and `Poor` we would rather have `Good` and `Bad`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[\"Outcome-newname\"] = df['Outcome'].replace({\"Adequate\": \"Good\", \"Poor\": \"Bad\"})\n",
"df\n",
"\n",
"# Try setting the `Outcome` column to numeric values: Adequate -> 1 and Poor -> 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Styling table displays\n",
"\n",
"To help emphasize your message in a table, you might want to colour your table appropriately.\n",
"\n",
"You can read about all the options on this page in the Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"https://openmv.net/file/raw-material-characterization.csv\").set_index(\"Lot number\")\n",
"df.style.bar(color=['lightblue'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# How to style only a subset of the columns:\n",
"(df.style\n",
" .hide_index() # if you don't need your index column, you can drop it away \n",
" .bar(color='green', subset=['Size5', 'Size10', 'Size15'])\n",
" .set_caption('Raw material outcomes')\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"cmap = cmap=sns.diverging_palette(0, 50, as_cmap=True)\n",
"\n",
"# Double sort: first on `Outcome`, then on `Size5`\n",
"df.sort_values([\"Outcome\", \"Size5\"], inplace=True)\n",
"\n",
"(df[[\"Outcome\", \"Size5\", \"Size10\", \"Size15\"]].style\n",
" .background_gradient(cmap)\n",
" .set_precision(2)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show missing values with a colour. First, create an artificial missing value:\n",
"df.iloc[4, 3] = np.nan\n",
"df.head(7).style.set_precision(2).highlight_null('red')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the minimum and maximum values with different colours:\n",
"(df.style\n",
" .set_precision(2)\n",
" .highlight_min(axis=0, color=\"lightblue\")\n",
" .highlight_max(axis=0, color='orange')\n",
" .highlight_null('red')\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Challenge\n",
"\n",
"1. Apply table styling to the Foods data set, at the start of this notebook. Can you visualize, in a colourful way, some of the food consumption trends?\n",
"2. Apply this same table styling to a small/medium sized data set of your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"End of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# IGNORE this. Execute this cell to load the notebook's style sheet.\n",
"from IPython.core.display import HTML\n",
"css_file = './images/style.css'\n",
"#HTML(open(css_file, \"r\").read())"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
},
"toc": {
"base_numbering": "1",
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "376.22283935546875px"
},
"toc_section_display": true,
"toc_window_display": true
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}