Channel: Shane Lynn

Python Pandas read_csv – Load Data from CSV Files


CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we’ll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.

Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.

Load CSV files to Python Pandas

The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the “read_csv” function in Pandas:

# Load the Pandas libraries with alias 'pd' 
import pandas as pd 

# Read data from file 'filename.csv' 
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later) 
data = pd.read_csv("filename.csv") 

# Preview the first 5 lines of the loaded data 
data.head()

While this code seems simple, an understanding of four fundamental concepts is required to fully grasp and debug the operation of the data loading procedure if you run into issues:

  1. Understanding file extensions and file types – what do the letters CSV actually mean? What’s the difference between a .csv file and a .txt file?
  2. Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
  3. Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
  4. CSV data formats and errors – common errors with the function.

Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.

1. File Extensions and File Types

The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.

  1. Data is stored on your computer in individual “files”, or containers, each with a different name.
  2. Each file contains data of different types – the internals of a Word document is quite different from the internals of an image.
  3. Computers determine how to read files using the “file extension”, that is the code that follows the dot (“.”) in the filename.
  4. So, a filename is typically in the form “<random name>.<file extension>”. Examples:
    • project1.DOCX – a Microsoft Word file called Project1.
    • shanes_file.TXT – a simple text file called shanes_file
    • IMG_5673.JPG – An image file called IMG_5673.
    • Other well known file types and extensions include: XLSX: Excel, PDF: Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc. See a complete list of extensions here.
  5. A CSV file is a file with a “.csv” file extension, e.g. “data.csv”, “super_information.csv”. The “CSV” in this case lets the computer know that the data contained in the file is in “comma separated value” format, which we’ll discuss below.

File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.

Folder with file extensions showing. Before working with CSV files, ensure that you can see your file extensions in your operating system. Different file contents are denoted by the file extension, or letters after the dot, of the file name. e.g. TXT is text, DOCX is Microsoft Word, PNG are images, CSV is comma-separated value data.

To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can’t see the “.txt” extension in your folder when you view it, you will have to change your settings.

  • In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options (or File Explorer Options, as it is now called) > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
  • In Mac OS: Open Finder > In menu, click Finder > Preferences, Click Advanced, Select the checkbox for “Show all filename extensions”.

2. Data Representation in CSV files

A “CSV” file, that is, a file with a “csv” filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.

CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.

An example table data set and the corresponding CSV-format data is shown in the diagram below.

Comma-separated value files, or CSV files, are simple text files where commas and newlines are used to define tabular data in a structured way.

Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.

Other Delimiters / Separators – TSV files

The comma separation scheme is by far the most popular method of storing tabular data in text files.

However, the choice of the ‘,’ comma character to delimit columns is arbitrary, and can be substituted where needed. Popular alternatives include tab (“\t”) and semi-colon (“;”). Tab-separated files are known as TSV (Tab-Separated Value) files.

When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
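As a minimal sketch of the sep parameter, the snippet below loads the same small table from tab-separated and semicolon-separated text. The data is made up for illustration, and io.StringIO stands in for real files on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory "files" standing in for a .tsv and a
# semicolon-separated file on disk.
tsv_text = "name\tage\nAlice\t31\nBob\t45\n"
ssv_text = "name;age\nAlice;31\nBob;45\n"

# The same function reads both; only the sep argument changes.
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
ssv_data = pd.read_csv(io.StringIO(ssv_text), sep=";")

print(tsv_data)
print(tsv_data.equals(ssv_data))
```

Both calls should produce the same two-column DataFrame, since the delimiter is the only difference between the inputs.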

Delimiters in Text Fields – Quotechar

One complication in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it’s important to use a “quote character” in the CSV file to create these fields.

The quote character can be specified in pandas.read_csv using the quotechar argument. By default (as with many systems), it’s set as the standard double quotation mark ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.

In the example shown, a semicolon-delimited file, with quotation marks as a quotechar is loaded into Pandas, and shown in Excel. The use of the quotechar allows the “NickName” column to contain semicolons without being split into more columns.

Demonstration of semicolon separated file data with quote character to prevent unnecessary splits in columns.
Other than commas in CSV files, Tab-separated and Semicolon-separated data is popular also. Quote characters are used if the data in a column may contain the separating character. In this case, the ‘NickName’ column contains semicolon characters, and so this column is “quoted”. Specify the separator and quote character in pandas.read_csv
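A small sketch of the same idea in code, using made-up rows rather than the file in the figure: the “NickName” values contain semicolons, but because they are quoted they stay in a single column.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data; the NickName field is quoted
# because it contains the separator character itself.
ssv_text = (
    'Name;NickName;Age\n'
    'Robert;"Bob; Bobby; Rob";34\n'
    'Elizabeth;"Liz; Beth";29\n'
)

people = pd.read_csv(io.StringIO(ssv_text), sep=";", quotechar='"')

print(people.shape)               # three columns, not five
print(people.loc[0, "NickName"])  # the semicolons survive inside the field
```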

3. Python – Paths, Folders, Files

When you specify a filename to pandas.read_csv, Python will look in your “current working directory”. Your working directory is typically the directory that you started your Python process or Jupyter notebook from.

When a FileNotFoundError occurs, it can be due to a misspelled filename or a working directory mistake.
Pandas searches your ‘current working directory’ for the filename that you specify when opening or loading files. The FileNotFoundError can be due to a misspelled filename, or an incorrect working directory.

Finding your Python Path

Your Python path can be displayed using the built-in os module. The os module brings operating system dependent functionality into Python programs and scripts.

To find your current working directory, the function required is os.getcwd(). The  os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

# Find out your current working directory
import os
print(os.getcwd())

# Out: /Users/shane/Documents/blog

# Display all of the files found in your current working directory
print(os.listdir(os.getcwd()))

# Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the ‘/Users/shane/Documents/blog’ directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.

Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().

File Loading: Absolute and Relative Paths

When specifying file names to the read_csv function, you can supply either absolute or relative file paths.

  • A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters ‘..’ are used to move to a parent directory in a relative path.
  • An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or ‘/’ in Mac or Linux)

It’s recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.

absolute vs relative file paths
Loading the same file with Pandas read_csv using relative and absolute paths. Relative paths are directions to the file starting at your current working directory, where absolute paths always start at the base of your file system.
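The following sketch loads one file through both kinds of path. It creates a throwaway directory and a small made-up CSV file so that it can run anywhere; the directory and file names are illustrative only.

```python
import os
import tempfile

import pandas as pd

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create data/test_file.csv inside the temporary directory.
    os.makedirs(os.path.join(tmp_dir, "data"))
    with open(os.path.join(tmp_dir, "data", "test_file.csv"), "w") as f:
        f.write("id,value\n1,10\n2,20\n")

    # Make the temporary directory the current working directory.
    os.chdir(tmp_dir)
    try:
        relative_path = os.path.join("data", "test_file.csv")
        absolute_path = os.path.abspath(relative_path)

        from_relative = pd.read_csv(relative_path)
        from_absolute = pd.read_csv(absolute_path)
    finally:
        # Restore the original working directory before cleanup.
        os.chdir(start_dir)

print(os.path.isabs(relative_path), os.path.isabs(absolute_path))
print(from_relative.equals(from_absolute))
```

The relative path only resolves correctly while the working directory is the one that contains the data folder, which is why os.chdir() is called before loading.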

4. Pandas CSV File Loading Errors

The most common errors you’ll get while loading data from CSV files into Pandas will be:

  1. FileNotFoundError: File b'filename.csv' does not exist
    A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
  2. UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
    A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding ‘UTF-8’.
  3. pandas.parser.CParserError: Error tokenizing data. (reported as pandas.errors.ParserError in more recent Pandas versions)
    Parse Errors can be caused in unusual circumstances to do with your data format – try to add the parameter engine='python' to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
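As a rough sketch of how the first two errors can be caught and handled, the snippet below uses a deliberately missing filename and a few Latin-1 encoded bytes. Both are made up for illustration, and the encoding of your own files may differ.

```python
import io

import pandas as pd

# 1. A path that does not exist raises FileNotFoundError.
try:
    pd.read_csv("this_file_does_not_exist_12345.csv")
    file_found = True
except FileNotFoundError:
    file_found = False

# 2. Bytes written in Latin-1 are not valid UTF-8, so the default
#    decoding fails with a UnicodeDecodeError.
latin1_bytes = "name,city\nZoë,Málaga\n".encode("latin-1")
try:
    pd.read_csv(io.BytesIO(latin1_bytes))
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Passing the encoding explicitly lets the same bytes load cleanly.
cities = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")

print(file_found, decoded_as_utf8)
print(cities)
```

Specifying the encoding argument is an alternative to re-saving the file as UTF-8 in a text editor.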

Advanced CSV Loading

There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:

Specifying Data Types

CSV files do not contain any type information for data. Pandas infers the data types from the values that it reads from the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
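A short sketch of why explicit types matter, using a made-up zip_code column: when left to inference, the codes are read as integers and the leading zero disappears.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data; zip codes look numeric but are really labels.
csv_text = "name,age,zip_code\nAlice,31,02134\nBob,45,90210\n"

# Inferred types: zip_code becomes an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit types: zip_code stays text, age is stored as a 32-bit integer.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32, "zip_code": str},
)

print(inferred["zip_code"].tolist())  # [2134, 90210]
print(typed["zip_code"].tolist())     # ['02134', '90210']
print(typed["age"].dtype)             # int32
```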

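Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters. In recent Pandas releases, date_parser and keep_date_col are deprecated, and date_format is the suggested replacement for date_parser.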

Skipping and Picking Rows and Columns From File

The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. The skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Similarly, the usecols parameter can be used to specify which columns in the data to load.
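The three parameters can be sketched on a tiny made-up table. Note that row indices given to skiprows count lines of the file from zero, so index 0 is the header line.

```python
import io

import pandas as pd

# Hypothetical five-row table.
csv_text = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n5,e,50\n"

# Read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Skip file lines 1 and 2 (the first two data rows); line 0 is the header.
# Passing an int instead (e.g. skiprows=2) would skip the header line too.
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# Load only the id and score columns.
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=["id", "score"])

print(first_two["id"].tolist())  # [1, 2]
print(skipped["id"].tolist())    # [3, 4, 5]
print(list(two_cols.columns))    # ['id', 'score']
```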

Custom Missing Value Symbols

When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
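As an illustration with made-up salary data, the tokens ‘??’ and ‘.’ are treated as ordinary text by default, which also prevents the column from being read as numeric. Adding them through na_values resolves both problems.

```python
import io

import pandas as pd

# Hypothetical export where missing salaries appear as '??', '.' or 'NA'.
csv_text = "name,salary\nAlice,50000\nBob,??\nCarol,.\nDave,NA\n"

# Default behaviour: only 'NA' is recognised as missing.
default = pd.read_csv(io.StringIO(csv_text))

# Custom tokens are added to the default list of missing value markers.
custom = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

print(default["salary"].isna().sum())  # 1
print(custom["salary"].isna().sum())   # 3
print(custom["salary"].dtype)          # float64
```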

# Advanced CSV loading example

data = pd.read_csv(
    "data/files/complex_data_example.tsv",     # relative python path to subdirectory
    sep='\t',                                  # Tab-separated value file.
    quotechar="'",                             # single quote allowed as quote character
    dtype={"salary": int},                     # Parse the salary column as an integer
    usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified.
    parse_dates=['birth_date'],                # Interpret the birth_date column as a date
    skiprows=10,                               # Skip the first 10 rows of the file
    na_values=['.', '??']                      # Take any '.' or '??' values as NA
)

CSV Format Positives and Negatives

As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:

On the plus side:

  • CSV format is universal and the data can be loaded by almost any software.
  • CSV files are simple to understand and debug with a basic text editor
  • CSV files are quick to create and load into memory before analysis.

However, the CSV format has some negative sides:

  • There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
  • There’s no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
  • File encodings can become a problem if there are non-ASCII compatible characters in text fields.
  • CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find however that your CSV data compresses well using zip compression.
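The first of these drawbacks can be seen in a quick round trip. In this sketch, a small made-up DataFrame with a date column is written to CSV text with to_csv and read back: the dates return as plain text unless parse_dates is supplied again.

```python
import io

import pandas as pd

# Hypothetical DataFrame with a proper datetime column.
original = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "joined": pd.to_datetime(["2020-01-15", "2021-06-30"]),
})

# Write to CSV text (index=False leaves out the row index column).
csv_text = original.to_csv(index=False)

# Reading back without help: the date column is no longer a datetime.
reloaded = pd.read_csv(io.StringIO(csv_text))

# Reading back with parse_dates restores the datetime type.
reparsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["joined"])

print(csv_text)
print(pd.api.types.is_datetime64_any_dtype(reloaded["joined"]))
print(pd.api.types.is_datetime64_any_dtype(reparsed["joined"]))
```

Passing a filename to to_csv instead of capturing the returned text writes the same content to disk.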

As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in both the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.

Additional Reading

  1. Official Pandas documentation for the read_csv function.
  2. Python 3 Notes on file paths, working directories, and using the OS module.
  3. Datacamp Tutorial on loading CSV files, including some additional OS commands.
  4. PythonHow Loading CSV tutorial.
  5. Chris Albon Notes on CSV loading in Pandas.

Plotting with Python and Pandas – Libraries for Data Visualisation


Anyone familiar with the use of Python for data science and analysis projects has googled some combination of “plotting in python”, “data visualisation in python”, “barcharts in python” at some point. It’s not uncommon to end up lost in a sea of competing libraries, confused and alone, and just to go home again!

The purpose of this post is to help navigate the options for bar-plotting, line-plotting, scatter-plotting, and maybe pie-charting through an examination of five Python visualisation libraries, with an example plot created in each.

For data scientists coming from R, this is a new pain. R has one primary, well-used, and well-documented library for plotting: ggplot2, a package that provides a uniform API for all plot types. Unfortunately the Python port of ggplot2 isn’t as complete, and may lead to additional frustration.

Pie charts, bar charts, line graphs data visualisations
Data visualisation places raw data in a visual context to convey information easily.

How to choose a visualisation tool

Data visualisation describes any effort to help people understand the significance of data by placing it in a visual context.

Data visualisation describes the movement from raw data to meaningful insights and learning, and is an invaluable skill (when used correctly) for uncovering correlations, patterns, movements, and achieving comparisons of data.

The choice of data visualisation tool is a particularly important decision for analysts involved in dissecting or modelling data. Ultimately, your choice of tool should lead to:

  • Fast iteration speed: the ability to quickly iterate on different visualisations to find the answers you’re searching for or validate throwaway ideas.
  • Unintrusive operation: If every plot requires a Google search, it’s easy to lose focus on the task at hand: visualising. Your tool of choice should be simple to use, unintrusive, and not the focus of your work and effort.
  • Flexibility: The tool(s) chosen should allow you to create all of the basic chart types easily. The basic toolset should include at least bar-charts, histograms, scatter plots, and line charts, with common variants of each.
  • Good aesthetics: If your visualisations don’t look good, no one will love them. If you need to change tool to make your charts “presentation ready”, you may be better off choosing a different tool from the start and saving the effort.

In my experience of Python, to reach a point where you can comfortably explore data in an ad-hoc manner and produce plots in a throwaway fashion, you will most likely need to familiarise yourself with at least two libraries.

Python visualisation setup

To start creating basic visualisations in Python, you will need a suitable system and environment setup, comprising:

  • An interactive environment: A console to execute ad-hoc Python code, and an editor to run scripts. PyCharm, Jupyter notebooks, and the Spyder editor are all great choices, though Jupyter is potentially most popular here.
  • A data manipulation library: Extending Python’s basic functionality and data types to quickly manipulate data requires a library – the most popular here is Pandas.
  • A visualisation library: we’ll go through the options now, but ultimately you’ll need to be familiar with more than one to achieve everything you’d like.
Stack choices for Python Data Visualisation – you will need to choose different tools for interactive environments, data manipulation, and data visualisation. A common and flexible setup comprises Jupyter notebooks, the Pandas library, and Matplotlib.

Example Plotting Data

For the purposes of this blog post, a sample data set from an “EdgeTier“-like customer service system is being used. This data contains the summary details of customer service chat interactions between agents and customers, completely anonymised with some spurious data.

The data is provided as a CSV file and loaded into Python Pandas, where each row details an individual chat session, and there are 8 columns with various chat properties, which should be self-explanatory from the column names.

Sample dataset for plotting examples in Python. The dataset contains 5,477 rows; each row details a chat interaction between a customer and an agent/user. There are 100 different users in the example data.

To follow along with these examples, you can download the sample data here.

Bar Plot Example and Data Preparation

The plot example for this post will be a simple bar plot of the number of chats per user in our dataset for the top 20 users.

For some of the libraries, the data needs to be re-arranged to contain the specific values that you are going to plot (rather than relying on the visualisation library itself to calculate the values). The calculation of “number of chats per user” is easily achieved using the Pandas grouping and summarising functionality:

# Group the data by user_id and count the number of chats that appear for each
chats_per_user = data.groupby(
    'user_id')['chat_id'].count().reset_index()
# Rename the columns in the results
chats_per_user.columns = ['user_id', 'number_chats']
# Sort the results by the number of chats
chats_per_user = chats_per_user.sort_values(
    'number_chats', 
    ascending=False
)
# Preview the results
chats_per_user.head()
sorted and aggregated chats per user data.
Result of Python Pandas summarisation of the chat data to get the number of chats per user in the dataset and sort the results
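If you don’t have the sample data to hand, the same group-count-sort pattern can be tried on a tiny stand-in. The six rows below are invented for illustration and only include the two columns that the calculation needs.

```python
import pandas as pd

# Hypothetical stand-in for the chat dataset: one row per chat.
data = pd.DataFrame({
    "chat_id": [1, 2, 3, 4, 5, 6],
    "user_id": ["User A", "User B", "User A", "User C", "User A", "User B"],
})

# Count chats per user, name the count column, and sort descending.
chats_per_user = (
    data.groupby("user_id")["chat_id"]
    .count()
    .reset_index(name="number_chats")
    .sort_values("number_chats", ascending=False)
)

print(chats_per_user)
```

Here reset_index(name=...) names the count column in one step, which is a compact alternative to assigning the column names afterwards.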

Matplotlib

Matplotlib is the grand-daddy of Python plotting libraries. Initially launched in 2003, Matplotlib is still actively developed and maintained with over 28,000 commits on the official Matplotlib Github repository from 750+ contributors, and is the most flexible and complete data visualisation library out there.

Matplotlib Examples plots. Matplotlib provides a low-level and flexible API for generating visualisations with Python.

Matplotlib provides a low-level plotting API, with a MATLAB style interface and output theme. The documentation includes great examples on how best to shape your data and form different chart types. While providing flexibility, the low-level API can lead to verbose visualisation code, and the end results tend to be aesthetically lacking in the absence of significant customisation efforts.

Many of the higher-level visualisation libraries available are based on Matplotlib, so learning enough basic Matplotlib syntax to debug issues is a good idea.

There are some generic boilerplate imports that are typically used to set up Matplotlib in a Jupyter notebook:

# Matplotlib pyplot provides plotting API
import matplotlib as mpl
from matplotlib import pyplot as plt
# For output plots inline in notebook:
%matplotlib inline
# For interactive plot controls on MatplotLib output:
# %matplotlib notebook

# Set the default figure size for all inline plots
# (note: needs to be AFTER the %matplotlib magic)
plt.rcParams['figure.figsize'] = [8, 5]

Once the data has been rearranged as in the output in “chats_per_user” above, plotting in Matplotlib is simple:

# Show the top 20 users in a bar plot with Matplotlib.
top_n = 20
# Create the bars on the plot
plt.bar(x=range(top_n), # start off with the xticks as numbers 0:19
        height=chats_per_user[0:top_n]['number_chats'])
# Change the xticks to the correct user ids
plt.xticks(range(top_n), chats_per_user[0:top_n]['user_id'], 
           rotation=60)
# Set up the x, y labels, titles, and linestyles etc.
plt.ylabel("Number of chats")
plt.xlabel("User")
plt.title("Chats per users for Top 20 users")
plt.gca().yaxis.grid(linestyle=':')

Note that the .bar() function is used to create bar plots; the location of the bars is provided as the “x” argument, and the height of the bars as the “height” argument. The tick labels on the x-axis are set after the bars are drawn using the xticks function. The bars could have been made horizontal using the barh function, which is similar, but uses “y” and “width”.

matplotlib bar chart output.
Matplotlib Bar Plot created with Bar() function. Note that to create plots in Matplotlib, typically data must be in its final format prior to calling Matplotlib, and the output can be aesthetically quite simple with default themes.
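For comparison, here is a minimal sketch of the horizontal variant using barh with the “y” and “width” arguments. It uses three made-up users rather than the sample dataset, and selects the non-interactive “Agg” backend so that it can run outside a notebook; in Jupyter you would use the %matplotlib magic instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
from matplotlib import pyplot as plt

# Hypothetical pre-aggregated values.
users = ["User A", "User B", "User C"]
number_chats = [3, 2, 1]

# barh takes the bar positions as "y" and the bar lengths as "width".
bars = plt.barh(y=range(len(users)), width=number_chats)

# Label the y ticks with the user ids and put the largest bar on top.
plt.yticks(range(len(users)), users)
plt.gca().invert_yaxis()

plt.xlabel("Number of chats")
plt.ylabel("User")
plt.title("Chats per user")
plt.gca().xaxis.grid(linestyle=':')
```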

Use of this pattern of plot creation first, followed by various pyplot commands (typically imported as “plt”) is common for Matplotlib generated figures, and for other high-level libraries that use matplotlib as a core. The Matplotlib documentation contains a comprehensive tutorial on the range of plot customisations possible with pyplot.

The advantage of Matplotlib’s flexibility and low-level API can become a disadvantage with more advanced plots requiring very verbose code. For example, there is no simple way to create a stacked bar chart (which is a relatively common display format), and the resulting code is very complicated and untenable as a “quick analysis tool”.

Pandas Plotting

The Pandas data management library includes simplified wrappers for the Matplotlib API that work seamlessly with the DataFrame and Series data containers. The DataFrame.plot() function provides an API for all of the major chart types, in a simple and concise set of parameters.

Because the outputs are equivalent to more verbose Matplotlib commands, the results can still be lacking visually, but the ability to quickly generate throwaway plots while exploring a dataset makes these methods incredibly useful.

For Pandas visualisation, we operate on the DataFrame object directly to be visualised, following up with Matplotlib-style formatting commands afterwards to add visual details to the plot.

# Plotting directly from DataFrames with Pandas
chats_per_user[0:20].plot(
    x='user_id', 
    y='number_chats', 
    kind='bar', 
    legend=False, 
    color='blue',
    width=0.8
)
# The plot is now created, and we use Matplotlib style
# commands to enhance the output.
plt.ylabel("Number of chats")
plt.xlabel("User")
plt.title("Chats per users for Top 20 users")
plt.gca().yaxis.grid(linestyle=':')
Bar chart created using Pandas plotting methods directly from a DataFrame. The results are very similar to the output from Matplotlib directly, and are styled using the same commands after plotting.

The plotting interface in Pandas is simple, clear, and concise; for bar plots, simply supply the column name for the x and y axes, and the “kind” of chart you want, here a “bar”.

Plotting with Seaborn

Seaborn is a Matplotlib-based visualisation library that provides a non-Pandas-based high-level API to create all of the major chart types.

Seaborn outputs are beautiful, with themes reminiscent of the ggplot2 library in R. Seaborn is excellent for the more “statistically inclined” data visualisation practitioner, with built-in functions for density estimators, confidence bounds, and regression functions.

Check out the Gallery of examples for Seaborn on the official website to see the range of outputs supported.
# Creating a bar plot with seaborn
import seaborn as sns
sns.set()

sns.barplot(
    x='user_id', 
    y='number_chats', 
    color='salmon', 
    data=chats_per_user[0:20]
)
# Again, Matplotlib style formatting commands are used
# to customise the output details.
plt.xticks(rotation=60)
plt.ylabel("Number of chats")
plt.xlabel("User")
plt.title("Chats per users for Top 20 users")
plt.gca().yaxis.grid(linestyle=':')
Barchart output from the Seaborn library, a matplotlib-based visualisation tool that provides more appealing visuals out-of-the-box for end users.

The Seaborn API is a little different to that of Pandas, but worth knowing if you would like to quickly produce publishable charts. As with any library that creates Matplotlib-based output, the basic commands for changing axis titles, fonts, chart sizes, tick marks and other output details are based on Matplotlib commands.

Output from the “jointplot” function in the Seaborn library. Seaborn includes built-in functionality for fitting regression lines, density estimators, and confidence intervals, producing very visually appealing outputs.

Data Manipulation within Seaborn Plots

The Seaborn library is different from Matplotlib in that manipulation of data can be achieved during the plotting operation, allowing application directly on the raw data (in the above examples, the “chats_per_user” had to be calculated before use with Pandas and Matplotlib).

An example of a raw data operation can be seen below, where the count of chats per language is calculated and visualised in a single operation starting with the raw data:

# Calculate and plot in one command with Seaborn
sns.barplot(      # The plot type is specified with the function
    x='language', # Specify x and y axis column names
    y='chat_id', 
    estimator=len,# The "estimator" is the function applied to 
                  # each grouping of "x"
    data=data     # The dataset here is the raw data with all chats.
)
Seaborn output after manipulating data to get the chats per language and plotting in the same command. Here, the “len” function was used as estimator to count each chat, but other estimators may include calculations of mean, median, standard deviation etc.
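To make the estimator idea concrete, the bar heights that Seaborn calculates are roughly what a Pandas groupby with the same function would return. The sketch below uses a handful of invented rows and Pandas only; it does not reproduce the error bars that Seaborn adds.

```python
import pandas as pd

# Hypothetical raw chat data: one row per chat.
data = pd.DataFrame({
    "chat_id": [1, 2, 3, 4, 5],
    "language": ["en", "en", "fr", "de", "en"],
    "handling_time": [100, 200, 300, 400, 600],
})

# estimator=len on y='chat_id' corresponds to a count per language.
counts = data.groupby("language")["chat_id"].agg(len)

# A mean estimator on y='handling_time' corresponds to a mean per language.
means = data.groupby("language")["handling_time"].agg("mean")

print(counts)
print(means)
```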

Other estimators can be used to get different statistical measures to visualise within each categorical bin. For another example, consider calculating and plotting the average handling time per user:

# Calculate and plot mean handling time per user from raw data.
import numpy as np
sns.barplot(
    x='user_id', y='handling_time', 
    estimator=np.mean,  # "mean" function from numpy as estimator
    data=data,          # Raw dataset fed directly to Seaborn
    order=data['user_id'].value_counts().index.tolist()[0:20]
)
# Matplotlib style commands to customise the chart
plt.xlabel('User')
plt.ylabel('Handling Time (seconds)')
plt.title('Average Handling Time by User')
plt.xticks(rotation=60) 
Example output from Seaborn using a mean estimator to calculate the average handling time per user in the raw dataset. By default, error bars are also included on each bar visualised.

Free Styling with Seaborn

For nicer visuals without learning a new API, it is possible to preload the Seaborn library, apply the Seaborn themes, and then plot as usual with Pandas or Matplotlib, but benefit from the improved Seaborn colours and setup.

Using sns.set() applies the Seaborn theme to all Matplotlib output:

# Getting Seaborn Style for Pandas Plots!
import seaborn as sns
sns.set()         # This command sets the "seaborn" style
chats_per_user[0:20].plot(  # This is Pandas-style plotting
    x='user_id', 
    y='number_chats', 
    kind='bar', 
    legend=False,
    width=0.8
)
# Matplotlib styling of the output:
plt.ylabel("Number of chats")
plt.xlabel("User")
plt.title("Chats per users for Top 20 users")
plt.gca().yaxis.grid(linestyle=':')
Seaborn-styled output from a Pandas plotting command. Using the “sns.set()” command, Seaborn styles are magically applied to all Matplotlib output for your session, improving colours and style for figures for free

For further information on the graph types and capabilities of Seaborn, the walk-through tutorial on the official docs is worth exploring.

Seaborn Stubbornness

A final note on Seaborn is that it’s an opinionated library. One particular example is the stacked-bar chart, which Seaborn does not support. The lack of support is not due to any technical difficulty, but rather, the author of the library doesn’t like the chart type.

It’s worth keeping this limitation in mind as you explore which plot types you will need.

Altair

Altair is a “declarative statistical visualisation” library based on the “Vega-Lite” visualisation grammar.

Altair uses a completely different API to any of the Matplotlib-based libraries above, and can create interactive visualisations that can be rendered in a browser and stored in JSON format. Outputs look very professional, but there are some caveats to be aware of for complex or data heavy visualisations where entire datasets can end up stored in your notebooks or visualisation files.

The API and commands for Altair are very different to the other libraries we’ve examined:

# Plotting bar charts with Altair
import altair as alt

bars = alt.Chart(
    chats_per_user[0:20],    # Using pre-calculated data in this example
    title='Chats per User ID').mark_bar().encode(
        # Axes are created with alt.X and alt.Y if you need to 
        # specify any additional arguments (labels in this case)
        x=alt.X(        
            'user_id', 
            # Sorting the axis was hard to work out:
            sort=alt.EncodingSortField(field='number_chats', 
                                       op='sum', 
                                       order='descending'),
            axis=alt.Axis(title='User ID')), 
        y=alt.Y(
            'number_chats',
            axis=alt.Axis(title='Number of Chats')
        )
).interactive()
bars
Altair visualisation output in Jupyter notebook.

Online Editor for Vega

Altair is unusual in that it generates a JSON representation of the plot, which can then be rendered again in any Vega-compatible application. For example, the output of the last code block displays natively in a Jupyter notebook, but actually generates the following JSON (which can be pasted into this online Vega editor to render again).

{
  "config": {"view": {"width": 400, "height": 300}},
  "data": {"name": "data-84c58b571b3ed04edf7929613936b11e"},
  "mark": "bar",
  "encoding": {
    "x": {
      "type": "nominal",
      "axis": {"title": "User ID"},
      "field": "user_id",
      "sort": {"op": "sum", "field": "number_chats", "order": "descending"}
    },
    "y": {
      "type": "quantitative",
      "axis": {"title": "Number of Chats"},
      "field": "number_chats"
    }
  },
  "selection": {
    "selector001": {
      "type": "interval",
      "bind": "scales",
      "encodings": ["x", "y"],
      "on": "[mousedown, window:mouseup] > window:mousemove!",
      "translate": "[mousedown, window:mouseup] > window:mousemove!",
      "zoom": "wheel!",
      "mark": {"fill": "#333", "fillOpacity": 0.125, "stroke": "white"},
      "resolve": "global"
    }
  },
  "title": "Chats per User ID",
  "$schema": "https://vega.github.io/schema/vega-lite/v2.6.0.json",
  "datasets": {
    "data-84c58b571b3ed04edf7929613936b11e": [
      {"user_id": "User 1395", "number_chats": 406},
      {"user_id": "User 1251", "number_chats": 311},
      {"user_id": "User 1495", "number_chats": 283},
      {"user_id": "User 1497", "number_chats": 276},
      {"user_id": "User 1358", "number_chats": 236},
      {"user_id": "User 1350", "number_chats": 233},
      {"user_id": "User 1472", "number_chats": 230},
      {"user_id": "User 1452", "number_chats": 224},
      {"user_id": "User 1509", "number_chats": 221},
      {"user_id": "User 1391", "number_chats": 220},
      {"user_id": "User 1346", "number_chats": 198},
      {"user_id": "User 1439", "number_chats": 196},
      {"user_id": "User 1519", "number_chats": 178},
      {"user_id": "User 1392", "number_chats": 177},
      {"user_id": "User 1404", "number_chats": 172},
      {"user_id": "User 1308", "number_chats": 141},
      {"user_id": "User 1512", "number_chats": 135},
      {"user_id": "User 1517", "number_chats": 118},
      {"user_id": "User 1478", "number_chats": 115},
      {"user_id": "User 1446", "number_chats": 107}
    ]
  }
}

Altair Data Aggregations

Similar to Seaborn, the Vega-Lite grammar allows transformations and aggregations to be done during the plot render command. As a result, however, all of the raw data is stored with the plot in JSON format, an approach that can lead to very large file sizes if you are not careful.

# Altair bar plot from raw data.

# to allow plots with > 5000 rows - the following line is needed:
alt.data_transformers.enable('json')

# Charting command starts here:
bars = alt.Chart(
    data, 
    title='Chats per User ID').mark_bar().encode(
        x=alt.X(       # Calculations are specified in axes
            'user_id:O', 
            sort=alt.EncodingSortField(
                field='count', 
                op='sum', 
                order='descending'
            )
        ),
        y=alt.Y('count:Q')
).transform_aggregate(    # "transforms" are used to group / aggregate
    count='count()',
    groupby=['user_id']
).transform_window(
    window=[{'op': 'rank', 'as': 'rank'}],
    sort=[{'field': 'count', 'order': 'descending'}]
).transform_filter('datum.rank <= 20')

bars
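
One way to keep the stored JSON small is to aggregate in Pandas first and hand only the summary table to the chart, as in the earlier pre-calculated example. Here is a minimal sketch, assuming a raw DataFrame with one row per chat and a "user_id" column. The sample values are invented purely for illustration, and the real "data" object may be shaped differently.

```python
import pandas as pd

# Hypothetical raw data: one row per chat (values invented for illustration)
data = pd.DataFrame({
    "user_id": ["User 1", "User 2", "User 1", "User 3", "User 1", "User 2"]
})

# Count the rows per user, sort from most to least chats, keep the top 20
chats_per_user = (
    data.groupby("user_id")
    .size()
    .reset_index(name="number_chats")
    .sort_values("number_chats", ascending=False)
    .head(20)
    .reset_index(drop=True)
)
print(chats_per_user)
```

Passing a summary table like this to alt.Chart, rather than the raw data, means only the summary rows are embedded in the chart output.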

The flexibility of the Altair system allows you to publish directly to an HTML page using chart.save("test.html"), and it’s also incredibly easy to quickly allow interaction on the plots in HTML and in Jupyter notebooks for zooming, dragging, selecting, etc. There is a selection of interactive charts in the online gallery that demonstrate the power of the library.

For an example in the online editor – click here!

Interactive elements in Altair allow brushing, selections, zooming, and linking plots together, giving massive flexibility for visualisation and data exploration.

Plotly

Plotly is the final plotting library to enter our review. Plotly is an excellent option to create interactive and embeddable visualisations with zoom, hover, and selection capabilities.

Plotly provides a web-service for hosting graphs, and automatically saves your output into an online account, where there is also an excellent editor. However, the library can also be used in offline mode. To use in an offline mode, there are some imports and commands for setup needed usually:

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

For plotting, then, the two commands required are:

  • plot: to create html output in your working directory
  • iplot: to create interactive plots directly in a Jupyter notebook output.

Plotly itself doesn’t provide a direct interface for Pandas DataFrames, so plotting is slightly different to some of the other libraries. To generate our example bar plot, we separately create the chart data and the layout information for the plot with separate Plotly functions:

# Create the data for the bar chart
bar_data = go.Bar(
    x=chats_per_user[0:20]['user_id'], 
    y=chats_per_user[0:20]['number_chats']
)
# Create the layout information for display
layout = go.Layout(
    title="Chats per User with Plotly",    
    xaxis=dict(title='User ID'),
    yaxis=dict(title='Number of chats')
)
# These two together create the "figure"
figure = go.Figure(data=[bar_data], layout=layout)
# Use "iplot" to create the figure in the Jupyter notebook
iplot(figure)
# use "plot" to create a HTML file for sharing / embedding
plot(figure)

Plotly, with the commands above, creates an interactive chart in the Jupyter notebook cell, with hover functionality and controls added automatically. The output HTML can be shared and embedded as is, with its controls still functional (see here).

There is a rich set of visualisation possibilities with the Plotly library, and the addition of intuitive interactive elements opens up a fantastic method to share results and even use the library as a data exploration tool.

Cufflinks – Using Plotly with Pandas directly

The “cufflinks” library provides bindings between Plotly and Pandas. It lets you create plots from Pandas DataFrames using the existing Pandas plot interface, but with Plotly output.

After installation with “pip install cufflinks”, the interface for cufflinks offline plotting with Pandas is simple:

import cufflinks as cf
# Going offline means you plot only locally, and don't need a plotly username / password
cf.go_offline()

# Create an interactive bar chart:
chats_per_user[0:20].iplot(
    x='user_id',
    y='number_chats',
    kind='bar'
)
Creation of Plotly-based interactive charts from Pandas Dataframes is made simple with the Cufflinks linking library.

A shareable link allows the chart to be shared and edited online on the Plotly graph creator; for an example, see here. The cufflinks interface supports a wide range of visualisations including bubble charts, bar charts, scatter plots, boxplots, heatmaps, pie charts, maps, and histograms.

“Dash” – The Plotly Web-Application creator

Finally, Plotly also includes a web-application framework to allow users to create interactive web applications for their visualisations. Similar to the Rstudio Shiny package for the R environment, Dash allows filtering, selection, drop-downs, and other UI elements to be added to your visualisation and to change the results in real time.

Dash allows very comprehensive and interactive data visualisation web applications to be created using Plotly and Python code alone. Screenshot from the Dash Gallery.

For inspiration, and to see what’s possible, there’s an excellent gallery of worked Dash examples covering various industries and visualisation types. Dash is commonly compared to Bokeh, another Python visualisation library that has dash-boarding capabilities. Most recently, Plotly have also released the “Dash Design Kit”, which eases the styling for Dash developers.

Overall, the Plotly approach, focussed on interactive plots and online hosting, is different to many other libraries and requires almost a full learning path by itself to master.

Wrap Up

What is the Python Visualisation and Plotting library of your future?

Library: Matplotlib
  • Pros: Very flexible. Fine-grained control over plot elements. Forms the basis for many other libraries, so learning the commands is useful.
  • Cons: Verbose code to achieve basic plot types. Default output is basic and needs a lot of customisation for publication.

Library: Pandas
  • Pros: High-level API. Simple interface to learn. Nicely integrated with Pandas data formats.
  • Cons: Plots, by default, are ugly. Limited number of plot types.

Library: Seaborn
  • Pros: Better looking styling. Matplotlib based, so other knowledge transfers. Styling can be used by other Matplotlib-based libraries.
  • Cons: Somewhat inflexible at times, e.g. no stacked bar charts.

Library: Altair
  • Pros: Nice aesthetics on plots. Exports as HTML easily. JSON format and online hosting is useful. Online Vega Editor is useful.
  • Cons: Very different API. Plots actually contain the raw data in the JSON output, which can lead to issues with security and file sizes.

Library: Plotly
  • Pros: Very simple to add interaction to plots. Flexible and well documented library. Simple integration to Pandas with the cufflinks library. “Dash” applications are promising. Online editor and view is useful for sharing and editing.
  • Cons: Very different API again. Somewhat roundabout methods to work offline. Plotly encourages use of its cloud platform.

Overall advice to be proficient and comfortable: Learn the basics of Matplotlib so that you can manipulate graphs after they have been rendered, master the Pandas plotting commands for quick visualisations, and know enough Seaborn to get by when you need something more specialised.

Further Reading & Web Links

Data Visualisation in Python – Pycon Dublin 2018 Presentation

The ability to explore and grasp data structures through quick and intuitive visualisation is a key skill of any data scientist. Different tools in the Python ecosystem require varying levels of mental gymnastics to manipulate and visualise information during a data exploration session.

The array of available Python libraries, each with its own idiosyncrasies, can be daunting for newcomers and data scientists-in-training. In this talk, we will examine the core data visualisation libraries compatible with the popular Pandas data wrangling library. We’ll look at the base-level Matplotlib library first, and then show the benefits of the higher-level Pandas visualisation toolkit, the popular Seaborn visualisation library, and the Vega-Lite based Altair.

This talk was presented at Pycon Ireland 2018, and the aim was to introduce attendees to different libraries for bar plotting, scatter plotting, and line plotting (never pie charting) their way to data visualisation bliss.

Presentation Slides

This presentation has been uploaded to SpeakerDeck for those interested in a downloadable format.

Presentation Contents:

  • Introduction to Data Visualisation.
  • Basic Python Setup for Data Visualisation
    • Main chart types – Barplot, Histogram, Scatter Plot, Line Chart.
    • Core libraries and Python visualisation toolsets.
  • Bar Plot and Stacked Bar plot in Matplotlib, Pandas, Seaborn, Altair.
  • Histograms and Stacked Bar plot in Matplotlib, Pandas, Seaborn, Altair.
  • Scatter Plots and Stacked Bar plot in Matplotlib, Pandas, Seaborn, Altair.
  • Line Plots and Stacked Bar plot in Matplotlib, Pandas, Seaborn, Altair.
  • Other Data visualisation options – Plotly and Bokeh.
  • Data Visualisation mistakes – what to watch out for.
  • Conclusions

PyCon IE Video

Unfortunately (who can bear the sound of their own voice!), there’s a video of the proceedings of the day, where these slides were presented at the Radisson in Dublin in November 2018.

Bar Plots in Python using Pandas DataFrames

Bar Plots – The king of plots?

The ability to render a bar plot quickly and easily from data in Pandas DataFrames is a key skill for any data scientist working in Python.

Nothing beats the bar plot for fast data exploration and comparison of variable values between different groups, or building a story around how groups of data are composed. Often, at EdgeTier, we tend to end up with an abundance of bar charts in both exploratory data analysis work as well as in dashboard visualisations.

The advantage of bar plots (or “bar charts”, “column charts”) over other chart types is that the human eye has evolved a refined ability to compare the length of objects, as opposed to angle or area.

Luckily for Python users, options for visualisation libraries are plentiful, and Pandas itself has tight integration with the Matplotlib visualisation library, allowing figures to be created directly from DataFrame and Series data objects. This blog post focuses on the use of the DataFrame.plot functions from the Pandas visualisation API.

Editing environment

As with most of the tutorials on this site, I’m using a Jupyter Notebook (and trying out Jupyter Lab) to edit Python code and view the resulting output. You can install Jupyter in your Python environment, or get it prepackaged with a WinPython or Anaconda installation (useful on Windows especially).

To import the relevant libraries and set up the visualisation output size, use:

# Set the figure size - handy for larger output
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = [10, 6]
# Set up with a higher resolution screen (useful on Mac)
%config InlineBackend.figure_format = 'retina'

Getting started: Bar charting numbers

The simplest bar chart that you can make is one where you already know the numbers that you want to display on the chart, with no calculations necessary. This plot is easily achieved in Pandas by creating a Pandas “Series” and plotting the values, using the kind="bar" argument to the plotting command.

For example, say you wanted to plot the number of mince pies eaten at Christmas by each member of your family on a bar chart. (I have no idea why you’d want to do that!) Imagine you have two parents (ate 10 each), one brother (a real mince pie fiend, ate 42), one sister (scoffed 17), and yourself (also with a penchant for the mince pie festive flavours, ate 37).

To create this chart, place the pie counts inside a Python list, turn the list into a Pandas Series or DataFrame, and then plot the result using the Series.plot command.

# Import the pandas library with the usual "pd" shortcut
import pandas as pd
# Create a Pandas series from a list of values ("[]") and plot it:
pd.Series([10, 10, 42, 17, 37]).plot(kind="bar")

A Pandas DataFrame could also be created to achieve the same result:

# Create a data frame with one column, "pies"
plotdata = pd.DataFrame({"pies": [10, 10, 42, 17, 37]})
plotdata.plot(kind="bar")
bar chart created directly from a pandas dataframe or series
It’s simple to create bar plots from known values by first creating a Pandas Series or DataFrame and then using the .plot() command.

Dataframe.plot.bar()

For the purposes of this post, we’ll stick with the .plot(kind="bar") syntax; however, there are shortcut functions for the kind parameter to plot(). Direct functions such as .bar() exist on the DataFrame.plot object that act as wrappers around the plotting functions – the chart above can be created with plotdata['pies'].plot.bar(). Other chart types (future blogs!) are accessed similarly:

df = pd.DataFrame()
# Plotting functions:
df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatter
df.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie
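
To see that the two spellings are interchangeable, here is a small sketch that draws the same Series both ways and compares the bar heights. The separate figures and the "Agg" backend are my own additions so the snippet runs outside a notebook; neither is needed in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
from matplotlib import pyplot as plt
import pandas as pd

plotdata = pd.DataFrame(
    {"pies": [10, 10, 42, 17, 37]},
    index=["Dad", "Mam", "Bro", "Sis", "Me"],
)

# Draw the same data with both spellings, each on its own axes
fig1, ax1 = plt.subplots()
plotdata["pies"].plot(kind="bar", ax=ax1)

fig2, ax2 = plt.subplots()
plotdata["pies"].plot.bar(ax=ax2)

# Each bar is a Matplotlib rectangle "patch"; compare their heights
heights_kind = [bar.get_height() for bar in ax1.patches]
heights_bar = [bar.get_height() for bar in ax2.patches]
print(heights_kind == heights_bar)
```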

Bar labels in plots

By default, the index of the DataFrame or Series is placed on the x-axis and the values in the selected column are rendered as bars. Every Pandas bar chart works this way; additional columns become new sets of bars on the chart.

To add or change labels to the bars on the x-axis, we add an index to the data object:

# Create a sample dataframe with a text index
plotdata = pd.DataFrame(
    {"pies": [10, 10, 42, 17, 37]}, 
    index=["Dad", "Mam", "Bro", "Sis", "Me"])
# Plot a bar chart
plotdata.plot(kind="bar")
In Pandas, the index of the DataFrame is placed on the x-axis of bar charts while the column values become the column heights.

Note that the plot command here is actually plotting every column in the dataframe; there just happens to be only one. For example, the same output is achieved by selecting the “pies” column:

# Individual columns chosen from the DataFrame
# as Series are plotted in the same way:
plotdata['pies'].plot(kind="bar")

In real applications, data does not arrive in your Jupyter notebook in quite such a neat format, and the “plotdata” DataFrame that we have here is typically arrived at after significant use of the Pandas GroupBy, indexing/iloc, and reshaping functionality.
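
As a rough illustration of that preparation step, the sketch below builds a small “plotdata” table from raw records with a GroupBy. The raw records are hypothetical (one row per sitting, values invented so the totals match a few of the family figures); real data will need its own cleaning.

```python
import pandas as pd

# Hypothetical raw records: one row per sitting (values invented for illustration)
raw = pd.DataFrame({
    "person": ["Dad", "Mam", "Dad", "Bro", "Bro", "Mam"],
    "pies":   [4, 3, 6, 20, 22, 7],
})

# Sum the pies per person; the group keys become the index,
# which is what Pandas places on the x-axis of a bar chart
plotdata = raw.groupby("person")[["pies"]].sum()
print(plotdata)
```

The result has one row per person and can be passed straight to plotdata.plot(kind="bar"). Note that GroupBy sorts the group keys alphabetically by default, so the bar order may differ from the order in the raw data.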

Labelling axes and adding plot titles

No chart is complete without a labelled x and y axis, and potentially a title and/or caption. With Pandas plot(), labelling of the axis is achieved using the Matplotlib syntax on the “plt” object imported from pyplot. The key functions needed are:

  • “xlabel” to add an x-axis label
  • “ylabel” to add a y-axis label
  • “title” to add a plot title
from matplotlib import pyplot as plt
plotdata['pies'].plot(kind="bar", title="test")

plt.title("Mince Pie Consumption Study Results")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
pandas bar plot with labelled x and y axis and title applied
Pandas bar chart with xlabel, ylabel, and title, applied using Matplotlib pyplot interface.
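
The same labels can also be set on the Matplotlib Axes object that the Pandas plot call returns, which is handy when several figures are open at once. This is a sketch of that alternative, not a replacement for the “plt” approach above; the "Agg" backend line is only there so it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
from matplotlib import pyplot as plt
import pandas as pd

plotdata = pd.DataFrame(
    {"pies": [10, 10, 42, 17, 37]},
    index=["Dad", "Mam", "Bro", "Sis", "Me"],
)

fig, ax = plt.subplots()
# plot() returns the Axes it drew on, so labels can be set on it directly
ax = plotdata["pies"].plot(kind="bar", ax=ax)
ax.set_title("Mince Pie Consumption Study Results")
ax.set_xlabel("Family Member")
ax.set_ylabel("Pies Consumed")
```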

Rotate the x-axis labels

If you have datasets like mine, you’ll often have x-axis labels that are too long for comfortable display. There are two options in this case: rotating the labels to make a bit more space, or rotating the entire chart to end up with a horizontal bar chart. For the first option, the xticks function from Matplotlib is used, with the rotation and potentially horizontalalignment parameters.

plotdata['pies'].plot(kind="bar", title="test")
# Rotate the x-labels by 30 degrees, and keep the text aligned horizontally
plt.xticks(rotation=30, horizontalalignment="center")
plt.title("Mince Pie Consumption Study Results")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
Pandas bar chart with rotated x-axis labels. The Matplotlib “xticks” function is used to rotate the labels on axes, allowing for longer labels when needed.
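
Pandas also accepts a “rot” argument on the plot call itself, which rotates the tick labels without a separate Matplotlib command. A short sketch of that shortcut follows; the explicit figure and "Agg" backend are my own additions for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
from matplotlib import pyplot as plt
import pandas as pd

plotdata = pd.DataFrame(
    {"pies": [10, 10, 42, 17, 37]},
    index=["Dad", "Mam", "Bro", "Sis", "Me"],
)

fig, ax = plt.subplots()
# rot=30 rotates the x-axis tick labels by 30 degrees
plotdata["pies"].plot(kind="bar", rot=30, ax=ax)

# Read the rotation back from the tick labels
rotations = [label.get_rotation() for label in ax.get_xticklabels()]
print(rotations)
```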

Horizontal bar charts

Rotating to a horizontal bar chart is one way to give some variance to a report full of bar charts! Horizontal charts also allow for extra long bar titles. Horizontal bar charts are achieved in Pandas simply by changing the “kind” parameter to “barh” from “bar”.

Remember that the x and y axes will be swapped when using barh, requiring care when labelling.

plotdata['pies'].plot(kind="barh")
plt.title("Mince Pie Consumption Study Results")
plt.ylabel("Family Member")
plt.xlabel("Pies Consumed")
Horizontal bar chart created using the Pandas barh function. Horizontal bar charts are excellent for variety, and in cases where you have long column labels.

Additional series: Stacked and unstacked bar charts

The next step for your bar charting journey is the need to compare series from a different set of samples. Typically this leads to an “unstacked” bar plot.

Let’s imagine that we have the mince pie consumption figures for the previous three years now (2018, 2019, 2020), and we want to use a bar chart to display the information. Here’s our data:

# Create a DataFrame with 3 columns:
plotdata = pd.DataFrame({
    "pies_2018":[40, 12, 10, 26, 36],
    "pies_2019":[19, 8, 30, 21, 38],
    "pies_2020":[10, 10, 42, 17, 37]
    }, 
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)
plotdata.head()
Create a Data Frame with three columns, one for each year of mince pie consumption. We’ll use this data for stacking and unstacking bar charts.

Unstacked bar plots

Out of the box, Pandas plot provides what we need here, putting the index on the x-axis, and rendering each column as a separate series or set of bars, with a (usually) neatly positioned legend.

plotdata = pd.DataFrame({
    "pies_2018":[40, 12, 10, 26, 36],
    "pies_2019":[19, 8, 30, 21, 38],
    "pies_2020":[10, 10, 42, 17, 37]
    }, 
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)
plotdata.plot(kind="bar")
plt.title("Mince Pie Consumption Study")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
Python Pandas un-stacked bar chart. If you select more than one column, Pandas creates, by default, an unstacked bar chart with each column forming one set of columns, and the DataFrame index as the x-axis.

The unstacked bar chart is a great way to draw attention to patterns and changes over time or between different samples (depending on your x-axis). For example, you can tell visually from the figure that the gluttonous brother in our fictional mince-pie-eating family has grown an addiction over recent years, whereas my own consumption has remained conspicuously high and consistent over the duration of data.

With multiple columns in your data, you can always return to plot a single column as in the examples earlier by selecting the column to plot explicitly with a simple selection like plotdata['pies_2019'].plot(kind="bar").

Stacked bar plots

In the stacked version of the bar plot, the bars at each index point in the unstacked bar chart above are literally “stacked” on top of one another.

While the unstacked bar chart is excellent for comparison between groups, to get a visual representation of the total pie consumption over our three year period, and the breakdown of each person’s consumption, a “stacked bar” chart is useful.

Pandas makes this easy with the “stacked” argument for the plot command. As before, our data is arranged with an index that will appear on the x-axis, and each column will become a different “series” on the plot, which in this case will be stacked on top of one another at each x-axis tick mark.

# Adding the stacked=True option to plot() 
# creates a stacked bar plot
plotdata.plot(kind='bar', stacked=True)
plt.title("Total Pie Consumption")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
The Stacked Bar Chart. A stacked bar places the values at each sample or index point in the DataFrame on top of one another. Stacked bar charts are best for examining patterns in the composition of the totals at each sample point.
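
The full height of each stack is simply the row total, so it can be useful to compute those totals alongside the chart, or to use them to order the bars. A small sketch using the same data (ordering by total is my own suggestion, not something the chart above does):

```python
import pandas as pd

plotdata = pd.DataFrame({
    "pies_2018": [40, 12, 10, 26, 36],
    "pies_2019": [19, 8, 30, 21, 38],
    "pies_2020": [10, 10, 42, 17, 37]
    },
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)

# Row totals: the height of each complete stack
totals = plotdata.sum(axis=1)
print(totals)

# Optional: re-order the rows so the tallest stack is plotted first
ordered = plotdata.loc[totals.sort_values(ascending=False).index]
print(list(ordered.index))
```

Calling ordered.plot(kind="bar", stacked=True) would then draw the stacks from tallest to shortest.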

Ordering stacked and unstacked bars

The order of appearance in the plot is controlled by the order of the columns seen in the data set. Re-ordering can be achieved by selecting the columns in the order that you require. Note that the selection column names are put inside a list during this selection example to ensure a DataFrame is output for plot():

# Choose columns in the order to "stack" them
plotdata[["pies_2020", "pies_2018", "pies_2019"]].plot(kind="bar", stacked=True)
plt.title("Mince Pie Consumption Totals")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
showing how order of the stacked bars is achieved
Stacked Bars in Order. The order of the bars in the stacked bar chart is determined by the order of the columns in the Pandas dataframe.

In the stacked bar chart, we’re seeing total number of pies eaten over all years by each person, split by the years in question. It is difficult to quickly see the evolution of values over the samples in a stacked bar chart, but much easier to see the composition of each sample. The choice of chart depends on the story you are telling or point being illustrated.

Wherever possible, make the pattern that you’re drawing attention to in each chart as visually obvious as possible. Stacking bar charts to 100% is one way to show composition in a visually compelling manner.

Stacking to 100% (filled-bar chart)

Showing the composition of the whole as a percentage of the total is a different type of bar chart, but it is useful for comparing the proportional makeup of different samples on your x-axis.

A “100% stacked” bar is not supported out of the box by Pandas (there is no “stack-to-full” parameter, yet!), requiring knowledge from a previous blog post on “grouping and aggregation” functionality in Pandas.

Start with our test dataset again:

plotdata = pd.DataFrame({
    "pies_2018":[40, 12, 10, 26, 36],
    "pies_2019":[19, 8, 30, 21, 38],
    "pies_2020":[10, 10, 42, 17, 37]
    }, index=["Dad", "Mam", "Bro", "Sis", "Me"]
)
plotdata.head()

We can convert each row into “percentage of total” measurements relatively easily with the Pandas apply function, before going back to the plot command:

stacked_data = plotdata.apply(lambda x: x*100/sum(x), axis=1)
stacked_data.plot(kind="bar", stacked=True)
plt.title("Mince Pie Consumption Breakdown")
plt.xlabel("Family Member")
plt.ylabel("Percentage Pies Consumed (%)")
Bars can be stacked to the full height of the figure with “group by” and “apply” functionality in Pandas. Stacking bars to 100% is an excellent way to show relative variations or progression in “proportion of total” per category or group.

For this same chart type (with person on the x-axis), the stacked to 100% bar chart shows us which years make up different proportions of consumption for each person. For example, we can see that 2018 made up a much higher proportion of total pie consumption for Dad than it did my brother.
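
The row-wise percentages can also be computed without apply, by dividing each row by its total with DataFrame.div. This is a sketch of an equivalent calculation, compared against the apply version to show the results agree:

```python
import pandas as pd

plotdata = pd.DataFrame({
    "pies_2018": [40, 12, 10, 26, 36],
    "pies_2019": [19, 8, 30, 21, 38],
    "pies_2020": [10, 10, 42, 17, 37]
    },
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)

# Divide every row by its own total (axis=0 aligns the totals with the index)
stacked_data = plotdata.div(plotdata.sum(axis=1), axis=0) * 100

# The apply-based version from the text, for comparison
stacked_apply = plotdata.apply(lambda x: x * 100 / sum(x), axis=1)

# Largest absolute difference between the two results
max_difference = (stacked_data - stacked_apply).abs().max().max()
print(stacked_data.round(1))
```

Either version can be passed to .plot(kind="bar", stacked=True) in the same way.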

Transposing for a different view

It may be more useful to ask the question – which family member ate the highest portion of the pies each year? This question requires a transposing of the data so that “year” becomes our index variable, and “person” becomes our category.

In this figure, the visualisation tells a different story, where I’m emerging as a long-term glutton with potentially one of the highest portions of total pies each year. (I’ve been found out!)

By default, the DataFrame index is placed on the x-axis of a bar plot. For our data, a more informative visualisation is achieved by transposing the data prior to plotting.
plotdata.transpose().apply(lambda x: x*100/sum(x), axis=1).plot(kind="bar", stacked=True)
plt.title("Mince Pie Consumption Per Year")
plt.xlabel("Year")
plt.ylabel("Pies Consumed (%)")
Plotting the data with “year” as the index variable places year as the categorical variable on our visualisation, allowing easier comparison of year-on-year changes in consumption proportions. The data is transposed from its initial format to place year on the index.

Choosing the X-axis manually

The index is not the only option for the x-axis marks on the plot. Often, the index on your dataframe is not representative of the x-axis values that you’d like to plot. To flexibly choose the x-axis ticks from a column, you can supply the “x” and “y” parameters to the plot function manually.

As an example, we reset the index (.reset_index()) on the existing example, creating a column called “index” with the same values as previously. We can then visualise different columns as required using the x and y parameter values.

“Resetting the index” on a dataframe removes the index and creates a new column from it, by default called “index”.
plotdata.reset_index().plot(
    x="index", y=["pies_2018", "pies_2019"], kind="bar"
)
plt.title("Mince Pie Consumption 18/19")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
More specific control of the bar plots created by Pandas plot() is achieved using the “x”, and “y” parameters. By default, “x” will be the index of the DataFrame, and y will be all numeric columns, but this is simple to overwrite.
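
The “index” column name only appears when the index has no name of its own. If you name the index first, reset_index() uses that name for the new column, which reads better in the plot call. A minimal sketch, where “person” is a name I have chosen for illustration:

```python
import pandas as pd

plotdata = pd.DataFrame({
    "pies_2018": [40, 12, 10, 26, 36],
    "pies_2019": [19, 8, 30, 21, 38],
    "pies_2020": [10, 10, 42, 17, 37]
    },
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)

# Give the index a name, then turn it into a regular column
named = plotdata.rename_axis("person").reset_index()
print(named.columns.tolist())
```

The x-axis column can then be chosen with x="person" rather than x="index".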

Colouring bars by a category

The next dimension to play with on bar charts is different categories of bar. Colour variation in bar fill colours is an efficient way to draw attention to differences between samples that share common characteristics. It’s best not to simply colour all bars differently, but colour by common characteristics to allow comparison between groups. As an aside, if you can, keep the total number of colours on your chart to less than 5 for ease of comprehension.

Manually colouring bars

Let’s colour the bars by the gender of the individuals. Unfortunately, this is another area where Pandas default plotting is not as friendly as it could be. Ideally, we could specify a new “gender” column as a “colour-by-this” input. Instead, we have to manually specify the colours of each bar on the plot, either programmatically or manually.

The manual method is only suitable for the simplest of datasets and plots:

# "pies_2020" holds the same values as the single "pies" column used earlier
plotdata['pies_2020'].plot(kind="bar", color=['black', 'red', 'black', 'red', 'black'])
Bars in Pandas bar charts can be coloured entirely manually by providing a list or Series of colour codes to the “color” parameter of DataFrame.plot()

Colouring by a column

A more scalable approach is to specify the colours that you want for each entry of a new “gender” column, and then sample from these colours. Start by adding a column denoting gender (or your “colour-by” column) for each member of the family.

plotdata = pd.DataFrame({
    "pies": [10, 10, 42, 17, 37], 
    "gender": ["male", "female", "male", "female", "male"]
    }, 
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)
plotdata.head()

Now define a dictionary that maps the gender values to colours, and use the Pandas “replace” function to insert these into the plotting command. Note that colours can be specified as

  • words (“red”, “black”, “blue” etc.),
  • RGB hex codes (“#0097e6”, “#7f8fa6”), or
  • with single-character shortcuts from matplotlib (“k”, “r”, “b”, “y” etc).

I would recommend the Flat UI colours website for inspiration on colour implementations that look great.

# Define a dictionary mapping variable values to colours:
colours = {"male": "#273c75", "female": "#44bd32"}
plotdata['pies'].plot(
    kind="bar", 
    color=plotdata['gender'].replace(colours)
)
Colours can be added to each bar in the bar chart based on the values in a different categorical column. Using a dictionary to “replace” the values with colours gives some flexibility.
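
A close alternative to “replace” is the Series “map” function. With a dictionary, map turns any value that has no entry into NaN instead of leaving it unchanged, which can make a missing colour mapping easier to spot. This is a sketch of that variant; converting to a plain list and the "Agg" backend line are my own choices here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
from matplotlib import pyplot as plt
from matplotlib.colors import to_hex
import pandas as pd

plotdata = pd.DataFrame({
    "pies": [10, 10, 42, 17, 37],
    "gender": ["male", "female", "male", "female", "male"]
    },
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)

colours = {"male": "#273c75", "female": "#44bd32"}

# Look up one colour per row from the "gender" column
bar_colours = plotdata["gender"].map(colours).tolist()

fig, ax = plt.subplots()
plotdata["pies"].plot(kind="bar", color=bar_colours, ax=ax)

# Read the fill colours back from the drawn bars
drawn_colours = [to_hex(bar.get_facecolor()) for bar in ax.patches]
print(drawn_colours)
```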

Adding a legend for manually coloured bars

Because Pandas plotting doesn’t natively support “colour by category”, adding a legend isn’t super simple, and requires some dabbling in the depths of Matplotlib. The colour legend is manually created in this situation, using individual “Patch” objects for the colour displays.

from matplotlib.patches import Patch

colours = {"male": "#273c75", "female": "#44bd32"}
plotdata['pies'].plot(
        kind="bar", color=plotdata['gender'].replace(colours)
).legend(
    [
        Patch(facecolor=colours['male']),
        Patch(facecolor=colours['female'])
    ], ["male", "female"]
)

plt.title("Mince Pie Consumption")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
When colouring bars by a category, the legend must be created manually using some Matplotlib patch commands.
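
With more than a couple of categories, writing out each Patch by hand gets tedious. The legend entries can instead be generated from the colour dictionary, as in this sketch. The “Gender” legend title and the "Agg" backend line are my own additions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
import pandas as pd

plotdata = pd.DataFrame({
    "pies": [10, 10, 42, 17, 37],
    "gender": ["male", "female", "male", "female", "male"]
    },
    index=["Dad", "Mam", "Bro", "Sis", "Me"]
)
colours = {"male": "#273c75", "female": "#44bd32"}

fig, ax = plt.subplots()
plotdata["pies"].plot(
    kind="bar", color=plotdata["gender"].map(colours).tolist(), ax=ax
)

# One labelled Patch per category, built from the dictionary
handles = [Patch(facecolor=colour, label=name) for name, colour in colours.items()]
ax.legend(handles=handles, title="Gender")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```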

Styling your Pandas Barcharts

Fine-tuning your plot legend – position and hiding

With multiple series in the DataFrame, a legend is automatically added to the plot to differentiate the colours on the resulting plot. You can disable the legend with a simple legend=False as part of the plot command.

plotdata[["pies_2020", "pies_2018", "pies_2019"]].plot(
    kind="bar", stacked=True, legend=False
)

The legend position and appearance can be controlled by adding the .legend() function to your plotting command. The main controls you’ll need are loc to define the legend location, ncol to set the number of columns, and title to give the legend a name.

See https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html for a full set of parameters. The available legend locations are:

  • best
  • upper right
  • upper left
  • lower left
  • lower right
  • right
  • center left
  • center right
  • lower center
  • upper center
  • center
# Plot and control the legend position, layout, and title with .legend(...)
plotdata[["pies_2020", "pies_2018", "pies_2019"]].plot(
    kind="bar", stacked=True
).legend(
    loc='upper center', ncol=3, title="Year of Eating"
)
plt.title("Mince Pie Consumption Totals")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
Legend control example for Pandas plots. Location (loc), orientation (through ncol) and title are the key parameters for control.
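When the bars fill the whole plot, none of the named locations may leave room for the legend. A common workaround, sketched below with invented yearly totals, is to combine loc with Matplotlib's bbox_to_anchor parameter to pin the legend just outside the axes. The anchor coordinates used here are an illustrative choice, not a recommendation from this post.

```python
import pandas as pd

# Hypothetical yearly totals, invented for illustration only.
plotdata = pd.DataFrame(
    {
        "pies_2018": [40, 12, 5],
        "pies_2019": [19, 8, 30],
        "pies_2020": [10, 10, 42],
    },
    index=["Dad", "Mam", "Sis"],
)

ax = plotdata.plot(kind="bar", stacked=True)

# loc names the corner of the legend box that gets pinned;
# bbox_to_anchor says where that corner goes, in axes fractions.
# (1.02, 1.0) is just to the right of the top-right corner of the axes.
legend = ax.legend(
    loc="upper left",
    bbox_to_anchor=(1.02, 1.0),
    ncol=1,
    title="Year of Eating",
)

# Re-fit the layout so the legend is less likely to be clipped when saved.
ax.figure.tight_layout()
```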

Applying themes and styles

The default look and feel of the Matplotlib plots produced with the Pandas library is sometimes not aesthetically amazing for those with an eye for colour or design. There are a few easy options for adding visually pleasing themes to your visualisation output.

Using Matplotlib Themes

Matplotlib comes with options for the “look and feel” of the plots. Themes are customisable and plentiful; a comprehensive list can be seen here: https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html

Choose a theme and apply it with the matplotlib.style.use function.

import matplotlib
matplotlib.style.use('fivethirtyeight') 
plotdata.plot(kind="bar")

plt.title("Mince Pie Consumption by 538")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
Bar chart plotted in “fivethirtyeight” style. Matplotlib includes several different themes or styles for your plotting delight, applied with matplotlib.style.use(“theme”)
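Note that matplotlib.style.use changes the style for every plot that follows. If you only want a theme for a single chart, one option is the plt.style.context context manager, sketched below with invented data; plt.style.available lists the style names installed with your Matplotlib version.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Check that the style exists in this Matplotlib installation.
print("fivethirtyeight" in plt.style.available)

# Hypothetical data, invented for illustration only.
plotdata = pd.DataFrame(
    {"pies_2019": [19, 8, 30], "pies_2020": [10, 10, 42]},
    index=["Dad", "Mam", "Sis"],
)

face_before = plt.rcParams["axes.facecolor"]

# The style applies only to plots created inside the "with" block.
with plt.style.context("fivethirtyeight"):
    ax = plotdata.plot(kind="bar")
    ax.set_title("Mince Pie Consumption by 538")

# Outside the block the previous settings are back in place.
restored = plt.rcParams["axes.facecolor"] == face_before
print("Settings restored:", restored)
```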

Styling with Seaborn

A second simple option for theming your Pandas charts is to install the Python Seaborn library, a separate plotting library built on top of Matplotlib. Seaborn comes with five excellent themes (darkgrid, whitegrid, dark, white, and ticks) that can be applied to all of your Pandas plots by importing the library and calling the set() or set_style() functions.

import seaborn as sns
sns.set_style("dark")
plotdata.plot(kind="bar")
plt.title("Mince Pie Consumption in Seaborn style")
plt.xlabel("Family Member")
plt.ylabel("Pies Consumed")
Seaborn “dark” theme. Using seaborn styles applied to your Pandas plots is a fantastic and quick method to improve the look and feel of your visualisation outputs.
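As with the Matplotlib themes, set_style affects every later plot. A minimal sketch of a more contained approach, assuming Seaborn is installed and using invented data, is to wrap a single chart in the sns.axes_style context manager.

```python
import pandas as pd
import seaborn as sns

# The five built-in Seaborn style names.
style_names = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style raises a ValueError for a name it does not recognise,
# so this loop doubles as a check of the list above.
for name in style_names:
    sns.axes_style(name)

# Hypothetical data, invented for illustration only.
plotdata = pd.DataFrame(
    {"pies_2019": [19, 8, 30], "pies_2020": [10, 10, 42]},
    index=["Dad", "Mam", "Sis"],
)

# Only the chart created inside the "with" block gets the style.
with sns.axes_style("whitegrid"):
    ax = plotdata.plot(kind="bar")
    ax.set_title("Mince Pie Consumption in Seaborn style")
```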

More Reading

By now you have hopefully gained a working knowledge of generating bar charts from Pandas DataFrames, and you’re set to embark on a plotting journey. Make sure you catch up on the other posts about loading data from CSV files, to get your data in from Excel or other sources, and then ensure you’re up to speed on the various group-by operations provided by Pandas for maximum flexibility in visualisations.

Outside of this post, just get stuck into practising; it’s the best way to learn.
