
Batch Geocoding with R and Google maps


I’ve recently wanted to geocode a large number of addresses (think circa 60k) in Ireland as part of a visualisation of the Irish property market. Geocoding can be achieved simply in R using the geocode() function from the ggmap library. The geocode function uses Google’s Geocoding API to turn addresses from text into latitude and longitude pairs.

There is a usage limit on the geocoding service of 2,500 addresses per IP address per day for free users. This hard limit cannot be overcome without employing a new IP address or paying for a business account. To ease the pain of restarting an R process every 2,500 addresses / day, I’ve built a script that geocodes addresses up to the API query limit every day, with a few handy features:

  • Once it hits the geocoding limit, it patiently waits for Google’s servers to let it proceed.
  • The script pings Google once per hour during the down time to start geocoding again as soon as possible.
  • A temporary file containing the current data state is maintained during the process. Should the script be interrupted, it will start again from the place it left off once any problems with the data/connection have been rectified.
points in dublin

The R script assumes that you are starting with a database contained in a single *.csv file, “input.csv”, where the addresses are stored in the “Address” column. Feel free to use/modify it to suit your own devices!

Comments are included where possible:

# Geocoding script for large list of addresses.
# Shane Lynn 10/10/2013

#load up the ggmap library
library(ggmap)
# get the input data
infile <- "input"
data <- read.csv(paste0('./', infile, '.csv'))

# get the address list, and append "Ireland" to the end to increase accuracy 
# (change or remove this if your addresses already include a country etc.)
addresses = data$Address
addresses = paste0(addresses, ", Ireland")

#define a function that will process Google's server responses for us.
getGeoDetails <- function(address){   
   #use the geocode function to query Google's servers
   geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
   #now extract the bits that we need from the returned list
   answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
   answer$status <- geo_reply$status

   #if we are over the query limit - want to pause for an hour
   while(geo_reply$status == "OVER_QUERY_LIMIT"){
       print("OVER QUERY LIMIT - Pausing for 1 hour at:") 
       time <- Sys.time()
       print(as.character(time))
       Sys.sleep(60*60)
       geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
       answer$status <- geo_reply$status
   }

   #return NAs if we didn't get a match:
   if (geo_reply$status != "OK"){
       return(answer)
   }   
   #else, extract what we need from the Google server reply into a dataframe:
   answer$lat <- geo_reply$results[[1]]$geometry$location$lat
   answer$long <- geo_reply$results[[1]]$geometry$location$lng   
   if (length(geo_reply$results[[1]]$types) > 0){
       answer$accuracy <- geo_reply$results[[1]]$types[[1]]
   }
   answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
   answer$formatted_address <- geo_reply$results[[1]]$formatted_address

   return(answer)
}

#initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
#if a temp file exists - load it up and count the rows!
tempfilename <- paste0(infile, '_temp_geocoded.rds')
if (file.exists(tempfilename)){
       print("Found temp file - resuming from index:")
       geocoded <- readRDS(tempfilename)
       startindex <- nrow(geocoded)
       print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
for (ii in seq(startindex, length(addresses))){
   print(paste("Working on index", ii, "of", length(addresses)))
   #query the google geocoder - this will pause here if we are over the limit.
   result = getGeoDetails(addresses[ii]) 
   print(result$status)     
   result$index <- ii
   #append the answer to the results file.
   geocoded <- rbind(geocoded, result)
   #save temporary results as we are going along
   saveRDS(geocoded, tempfilename)
}

#now we add the latitude and longitude to the main data
data$lat <- geocoded$lat
data$long <- geocoded$long
data$accuracy <- geocoded$accuracy

#finally write it all to the output files
saveRDS(data, paste0("../data/", infile ,"_geocoded.rds"))
write.table(data, file=paste0("../data/", infile ,"_geocoded.csv"), sep=",", row.names=FALSE)

Let me know if you find a use for the script, or if you have any suggestions for improvements.

Please be aware that it is against the Google Geocoding API terms of service to geocode addresses without displaying them on a Google map. Please see the terms of service for more details on usage restrictions.


Data Science Videos from Dublin WebSummit 2013


The Web Summit, Europe’s largest technology-industry conference, was held in Dublin this week. An annual event since 2010, the Web Summit attracted over ten thousand visitors from over 90 countries. The Web Summit puts Ireland on the international startup and internet scene. With speakers like Elon Musk (of PayPal, SpaceX, and Tesla) and representatives from new and successful internet companies such as Coursera, Stripe, Hailo, Vine, and Mailbox, I was sickened not to have one of the coveted €1,000 tickets. (Granted, only €1,000 for the last week or so!)

The speakers were spread over 4 different stages and spoke on a broad range of topics. Some of the best talks about data science and data visualisation are embedded here:

Des Traynor – Designing Dashboards: From Data to Insights

Des Traynor, founder of Intercom, takes us through the fundamentals of good data display techniques. What are the disadvantages of bubble charts? Why not use a bar chart? What are the common tricks used by people to deceive us with data?

 

David Coallier – Data Science… What Even?!

David Coallier, a data scientist from Engine Yard, speaks about data science – what is it? What tools do you use? What are the best packages for doing data science? What does it take to become a data scientist?

Casper Schlickum – The Big Data Myth

Casper Schlickum, managing director of Xaxis EMEA, talks about the myth of big data. Is anyone actually using big data? Has anything actually changed with the way people do business as a result of the big data explosion? Should we care about big data? Great speech from a leader in the industry.

 

Dwight Merriman, MongoDB

And to add some balance to the big data debate, Dwight Merriman, from MongoDB, talks about Big Data, unstructured data, and data science. What are we using data for now? How has the way we see data changed? What are the actual uses and advantages of Big Data?

I haven’t been able to watch everything, so if you think I’m missing any key talks, please do let me know.

CSV Data Extraction Tool for ROS bag files for Python, R, Matlab


So you’ve been using ROS to record data from a robot that you use? And you have the data in a rosbag file? And you’ve spent a while googling to find out how to extract images, data, imu readings, gps positions, etc. out of said rosbag file?

This post provides a tool to extract data to CSV format for a number of ROS message types. It was initially written for analysis of messages in MATLAB, but it applies equally if you need the data in Python, R, SAS, Excel, or SPSS.

ROS (robot operating system) is a software system gaining popularity in robotics for control and automation.

CRUISE vehicle with SICK sensors for autonomous vehicle research

ROS records data in binary .bag files, or bagfiles for short. Getting data out of so-called bagfiles for analysis in MATLAB, Excel, or <insert your favourite analysis software here> isn’t the easiest thing in the world. I’ve put together a small ROS package to extract data from ROS bag files and create CSV files for use in other applications.

** update 6th July 2015 – This code has now been added to Github at https://github.com/shanealynn/ros_csv_extraction/ **

Thus far, the data extraction tool is compatible with the following ROS message types:

  • sensor_msgs/Image
  • sensor_msgs/Imu
  • sensor_msgs/LaserScan
  • sensor_msgs/NavSatFix
  • gps_common/gpsVel
  • umrr_driver/radar_msg (this was a type used by the CRUISE vehicle (see below))

To install the data extraction tool, download the zip file, extract it somewhere on your ROS_PACKAGE_PATH, and run rosmake data_extraction before using.

The tool can be used in two different ways:

1.) Extract all compatible topics in a bag file

rosrun data_extraction extract_all.py -b <path_to_bag_file> -o <path_to_output_dir>

2.) Extract a single topic

rosrun data_extraction extract_topic.py -b <path_to_bag_file> -o <path_to_output_csv_file> -t <topic_name>
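
If the message type you need isn’t on the list above, the rosbag Python API makes a one-off export fairly painless. The sketch below is an illustrative example only (not part of the packaged tool), assuming a sensor_msgs/NavSatFix topic called /gps/fix in a bag called recording.bag – adjust the topic, filenames, and message fields to suit your own data:

import csv
import rosbag

bag = rosbag.Bag('recording.bag')
with open('gps_fix.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['time', 'latitude', 'longitude', 'altitude'])
    # iterate over every message published on the chosen topic, in time order
    for topic, msg, t in bag.read_messages(topics=['/gps/fix']):
        writer.writerow([t.to_sec(), msg.latitude, msg.longitude, msg.altitude])
bag.close()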

This program was created during a six-month research project completed at the University of Technology Sydney on their CRUISE (CAS Research UTE for Intelligence, Safety, and Exploration) project. One of my main tasks was to transition the data collection software on the CRUISE vehicle to ROS. I was working at the time under an Australian Endeavour award, researching with Dr. Sarath Kodagoda on pedestrian detection systems for autonomous vehicles using radar signals. We investigated a range of classification techniques (support vector machines, naive Bayes classifiers, decision trees, etc.) to determine which obstacles were most likely pedestrians.

The idea was to use ROS as a data collection tool, syncing data from a number of SICK laser rangefinders, cameras, inertial measurement units, radar etc.

If you find it useful, let me know. If you have any additions / suggestions, let me know too.

Online Learning Curriculum for Data Scientists


“Is there any online reading or courses I can do to get into data analysis?”

At my workplace, I get asked the question above. It is usually posed by people with a finance background who are working as management consultants. In this post I propose a learning path for such people to “get into data analysis”. I will assume that the prospective student is someone with decent Excel skills, who is not afraid of a VLOOKUP or a touch of VB, and who can throw together decent plots / dashboards using the same Microsoft package, but who has little or no knowledge of programming / command-line operations.

A data scientist can be defined by Drew Conway‘s Data Science Venn diagram which suggests that data scientists must have a solid mathematical background, skills in coding and computer hacking, and a healthy mix of subject matter expertise.

Data science venn diagram

The courses mentioned below are by no means an “over a weekend” type of engagement – if you are serious about entering the world of data science as a profession, allow yourself at least 3-6 months to complete and study the content of the courses below.

  1. Learn to program.

    R and Python are the two primary scripting languages taking over the world of data science. There is very little that cannot be done with knowledge of these two languages, and I would recommend getting to grips with both during your learning. R is a statistical programming language with a huge number of packages available for every function you could think of. Python is a more general language whose data science capabilities are built up through the numpy and scipy libraries.

    • Try R – Start your journey into R and data visualisation with the free “Try R” online course from CodeSchool.com. Learn the basic syntax and get loading and plotting small data sets.
    • Computing for Data Analysis – Augment your fundamental R knowledge with “Computing for Data Analysis” at Coursera.org.
    • Python Track – Take a trip into Python and get to grips with the basic syntax with the Python track at Codecademy.com.
    • Introduction to Computer Science – Expand this preliminary Python know-how with a full project to create a working search engine in Udacity.com’s Introduction to Computer Science.

     

  2. Learn some maths.

    Data scientists are one part statisticians. Gathering meaningful information from large data sets requires skills in summarising and correlating variables on a regular basis. A solid understanding of the maths behind statistical transformations and machine learning techniques ensures that results are valid and stand up to scrutiny. Note that a lot of the necessary statistics and maths knowledge can be picked up from the machine learning-focussed courses.

    • Introduction to Statistics – Start off with some preliminary statistics at Udacity.com’s “Introduction to Statistics”.
    • Statistics One – Go a bit deeper with “Statistics One” from Princeton at Coursera.org.

     

  3. Learn machine learning and data visualisation.

    The core skill that separates data scientists from data analysts is the ability to move beyond reporting and apply more sophisticated analytical techniques to model variance, extract meaning, and predict variables of interest using your data.

    • Data Analysis – Start with the excellent “Data Analysis” course at Coursera.org, which will give you direct experience in loading, visualising, and modelling real data sets using R. This course is considerably more advanced than the previous “Computing for Data Analysis”; it covers further data analysis techniques and focuses on teaching students how to structure data analysis reports.
    • Machine Learning – Make sure that you take the brilliant, MOOC-starting “Machine Learning” (or “ml-class”) course with Andrew Ng at Coursera.org. Programming skills are a must for this course, which covers linear algebra, regression, neural networks, support vector machines, and recommender systems among others. Andrew Ng provides an excellent background for the topics that are covered.
    • Artificial Intelligence for Robotics – Sebastian Thrun‘s “Artificial Intelligence for Robotics” class is a brilliant introduction to more applied machine learning techniques such as the Kalman filter and particle filters. While perhaps slightly off-topic, the course has a range of interesting and worthwhile Python-based exercises that will only add to your learning journey.
    • Algorithms / Neural Networks – More detailed specific-topic courses can be taken in Algorithms and Network Analysis at Udacity, or Neural Networks for Machine Learning – both of which I’ve personally found useful. The Neural Networks course dips into the realm of “Deep Learning”, a hot but advanced topic in machine learning at Google and Facebook at the moment.
    • Introduction to Hadoop and MapReduce – At some point, you’re going to need to dip your toes into some Big Data, Hadoop, and MapReduce knowledge – get a basic introduction with “Introduction to Hadoop and MapReduce” at Udacity.com, in conjunction with Cloudera.

     

When you have completed the majority of the courses listed above, you’ll be in a very strong position to put your knowledge to use. And practice is the key. Get on Kaggle, download a data set, and get involved!!

 

Self-Organising Maps for Customer Segmentation using R


Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically 2) dimensional representations. In this post, we examine the use of R to create a SOM for customer segmentation. The figures shown here use the 2011 Irish Census information for the greater Dublin area as an example data set. This work is based on a talk given to the Dublin R Users group in January 2014.

If you are keen to get down to business:

  • The slides from a talk on this subject that I gave to the Dublin R Users group in January 2014 are available for download here
  • The code for the Dublin Census data example is available for download here (zip file containing code and data – filesize 25MB)

SOM diagram

SOMs were first described by Teuvo Kohonen in Finland in 1982, and Kohonen’s work in this space has made him the most cited Finnish scientist in the world. Typically, visualisations of SOMs are colourful 2D diagrams of ordered hexagonal nodes.

The SOM Grid

SOM visualisations are made up of multiple “nodes”. Each node has:

  • A fixed position on the SOM grid
  • A weight vector of the same dimension as the input space (e.g. if your input data represents people, it might have variables “age”, “sex”, “height” and “weight”; each node on the grid will also have values for these variables)
  • Associated samples from the input data. Each sample in the input space is “mapped” or “linked” to a node on the map grid. One node can represent several input samples.

The key feature of SOMs is that the topological features of the original input data are preserved on the map. What this means is that similar input samples (where similarity is defined in terms of the input variables (age, sex, height, weight)) are placed close together on the SOM grid. For example, all 55-year-old females that are approximately 1.6m in height will be mapped to nodes in the same area of the grid.  Taller and shorter people will be mapped elsewhere, taking all variables into account. Tall heavy males will be closer on the map to tall heavy females than to short light males, as they are more “similar”.

SOM Heatmaps

Typical SOM visualisations are of “heatmaps”. A heatmap shows the distribution of a variable across the SOM. If we imagine our SOM as a room full of people that we are looking down upon, and we were to get each person in the room to hold up a coloured card that represents their age – the result would be a SOM heatmap. People of similar ages would, ideally, be aggregated in the same area. The same can be repeated for weight, height, etc. Visualisation of different heatmaps allows one to explore the relationships between the input variables.

The figure below demonstrates the relationship between average education level and unemployment percentage using two heatmaps. The SOM for these diagrams was generated using areas around Ireland as samples.

Heatmaps from SOM

 

SOM Algorithm

The algorithm to produce a SOM from a sample data set can be summarised as follows (a toy code sketch of the training loop is given just after these steps):

  1. Select the size and type of the map. The shape can be hexagonal or square, depending on the shape of the nodes you require. Typically, hexagonal grids are preferred since each node then has 6 immediate neighbours.
  2. Initialise all node weight vectors randomly.
  3. Choose a random data point from training data and present it to the SOM.
  4. Find the “Best Matching Unit” (BMU) in the map – the most similar node. Similarity is calculated using the Euclidean distance formula.
  5. Determine the nodes within the “neighbourhood” of the BMU.
    – The size of the neighbourhood decreases with each iteration.
  6. Adjust weights of nodes in the BMU neighbourhood towards the chosen datapoint.
    – The learning rate decreases with each iteration.
    – The magnitude of the adjustment is proportional to the proximity of the node to the BMU.
  7. Repeat steps 3-6 for N iterations / until convergence.

Sample equations for each of the parameters described here are given on Slideshare.
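
For readers who want to see the moving parts without any package machinery, here is a toy sketch of the training loop in Python/numpy. It is an illustration only – the grid size, learning rate, and decay schedule are arbitrary choices of mine, and the kohonen R package used in the rest of this post is what actually does the work here:

import numpy as np

def train_som(data, grid_w=10, grid_h=10, n_iter=2000, lr0=0.05, seed=0):
    """Toy SOM trainer: data is an (n_samples, n_features) numpy array."""
    rng = np.random.default_rng(seed)
    n_nodes = grid_w * grid_h
    # Step 2: random initial weight vector for every node
    weights = rng.normal(size=(n_nodes, data.shape[1]))
    # fixed (x, y) position of each node on the map grid
    grid = np.array([(i % grid_w, i // grid_w) for i in range(n_nodes)], dtype=float)
    radius0 = max(grid_w, grid_h) / 2.0

    for t in range(n_iter):
        # Step 3: present one randomly chosen sample
        x = data[rng.integers(len(data))]
        # Step 4: best matching unit = node with the smallest Euclidean distance
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Steps 5-6: neighbourhood radius and learning rate both shrink over time
        frac = 1.0 - float(t) / n_iter
        radius = max(radius0 * frac, 1.0)
        lr = lr0 * frac
        grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
        neighbours = grid_dist <= radius
        influence = np.exp(-(grid_dist[neighbours] ** 2) / (2 * radius ** 2))
        # nodes closer to the BMU on the grid are pulled harder towards the sample
        weights[neighbours] += lr * influence[:, None] * (x - weights[neighbours])
    return weights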

SOMs in R

Training

The “kohonen” package is a well-documented package in R that facilitates the creation and visualisation of SOMs. To start, you will only require knowledge of a small number of key functions, the general process in R is as follows (see the presentation slides for further details):

# Load the kohonen package 
require(kohonen)

# Create a training data set (rows are samples, columns are variables)
# Here I am selecting a subset of my variables available in "data"
data_train <- data[, c(2,4,5,8)]

# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process. 
data_train_matrix <- as.matrix(scale(data_train))

# Create the SOM Grid - you generally have to specify the size of the 
# training grid prior to training the SOM. Hexagonal and rectangular 
# topologies are possible
som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")

# Finally, train the SOM, options for the number of iterations,
# the learning rates, and the neighbourhood are available
som_model <- som(data_train_matrix, 
		grid=som_grid, 
		rlen=100, 
		alpha=c(0.05,0.01), 
		keep.data = TRUE,
		n.hood="circular" )

Visualisation

The plot function for kohonen objects is used to visualise the quality of your generated SOM and to explore the relationships between the variables in your data set. There are a number of different plot types available. Understanding the use of each is key to exploring your SOM and discovering relationships in your data.

  1. Training Progress:
    As the SOM training iterations progress, the distance from each node’s weights to the samples represented by that node is reduced. Ideally, this distance should reach a minimum plateau. This plot option shows the progress over time. If the curve is continually decreasing, more iterations are required.
    plot(som_model, type="changes")

    Progress of error for each iteration of SOM training
  2. Node Counts
    The kohonen package allows us to visualise the count of how many samples are mapped to each node on the map. This metric can be used as a measure of map quality – ideally the sample distribution is relatively uniform. Large values in some map areas suggest that a larger map would be beneficial. Empty nodes indicate that your map size is too big for the number of samples. Aim for at least 5-10 samples per node when choosing map size.
    plot(som_model, type="count")

    Number of samples per SOM node
  3. Neighbour Distance
    Often referred to as the “U-Matrix”, this visualisation is of the distance between each node and its neighbours. Typically viewed with a grayscale palette, areas of low neighbour distance indicate groups of nodes that are similar. Areas with large distances indicate the nodes are much more dissimilar – and indicate natural boundaries between node clusters. The U-Matrix can be used to identify clusters within the SOM map.
    plot(som_model, type="dist.neighbours")

    Som neighbour distances
  4. Codes / Weight vectors
    The node weight vectors, or “codes”, are made up of normalised values of the original variables used to generate the SOM. Each node’s weight vector is representative of the samples mapped to that node. By visualising the weight vectors across the map, we can see patterns in the distribution of samples and variables. The default visualisation of the weight vectors is a “fan diagram”, where an individual fan representation of the magnitude of each variable in the weight vector is shown for each node. Other representations are available; see the kohonen plot documentation for details.
    plot(som_model, type="codes")

    SOM code view
  5. Heatmaps
    Heatmaps are perhaps the most important visualisation possible for Self-Organising Maps. The use of a weight space view as in (4) that tries to view all dimensions on the one diagram is unsuitable for a high-dimensional (>7 variable) SOM. A SOM heatmap allows the visualisation of the distribution of a single variable across the map. Typically, a SOM investigative process involves the creation of multiple heatmaps, and then the comparison of these heatmaps to identify interesting areas on the map. It is important to remember that the individual sample positions do not move from one visualisation to another, the map is simply coloured by different variables.
    The default Kohonen heatmap is created by using the “property” plot type and providing one of the variables from the set of node weights. In this case we visualise the average education level on the SOM.
    plot(som_model, type = "property", property = som_model$codes[,4], main=names(som_model$data)[4], palette.name=coolBlueHotRed)

    SOM heatmap

    It should be noted that this default visualisation plots the normalised version of the variable of interest. A more intuitive and useful visualisation is of the variable prior to scaling, which involves some R trickery – using the aggregate function to regenerate the variable from the original training set and the SOM node/sample mappings. The result is scaled to the real values of the training variable (in this case, unemployment percent).

    var <- 2 #define the variable to plot 
    var_unscaled <- aggregate(as.numeric(data_train[,var]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2] 
    plot(som_model, type = "property", property=var_unscaled, main=names(data_train)[var], palette.name=coolBlueHotRed)

    SOM heatmap scaled

    It is noteworthy that these two heatmaps immediately show an inverse relationship between unemployment percent and education level in the areas around Dublin. Further heatmaps, visualised side by side, can be used to build up a picture of the different areas and their characteristics.

    Multiple heatmaps

Clustering

Clustering can be performed on the SOM nodes to isolate groups of samples with similar metrics. Manual identification of clusters is completed by exploring the heatmaps for a number of variables and drawing up a “story” about the different areas on the map. An estimate of the number of clusters that would be suitable can be ascertained using a k-means algorithm and examining the plot of “within cluster sum of squares” for an “elbow point”. The kohonen package documentation shows how a map can be clustered using hierarchical clustering. The results of the clustering can be visualised using the SOM plot function again.

mydata <- som_model$codes 
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) 
for (i in 2:15) {
  wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
}
plot(wss)

## use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(som_model$codes)), 6)
# plot these results (pretty_palette is any vector of at least 6 colours):
pretty_palette <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b")
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters") 
add.cluster.boundaries(som_model, som_cluster)

Clusters on SOM

Ideally, the clusters found are contiguous on the map surface. However, this may not be the case, depending on the underlying distribution of variables. To obtain contiguous clusters, a hierarchical clustering algorithm can be used that only combines nodes that are similar AND beside each other on the SOM grid. However, standard hierarchical clustering usually suffices, and any outlying points can be accounted for manually.

The mean values and distributions of the training variables within each cluster are used to build a meaningful picture of the cluster characteristics. The clustering and visualisation procedure is typically an iterative process. Several SOMs are normally built before a suitable map is created. It is noteworthy that the majority of time used during the SOM development exercise will be in the visualisation of heatmaps and the determination of a good “story” that best explains the data variations.

Conclusions

Self-Organising Maps (SOMs) are another powerful tool to have in your data science repertoire. Advantages include:

  • Intuitive method to develop customer segmentation profiles.
  • Relatively simple algorithm, easy to explain results to non-data scientists
  • New data points can be mapped to trained model for predictive purposes.

Disadvantages include:

  • Lack of parallelisation capabilities for VERY large data sets, since the training algorithm is iterative
  • Difficult to represent very many variables in a two-dimensional plane
  • Requires clean, numeric data

Please do explore the slides and code (2014-01 SOM Example code_release.zip) from the talk for more detail. Contact me if there are any problems running the example code etc.

 

Scraping Dublin City Bikes Data Using Python


Dublin bikes by Dublin City Council

FAST TRACK: There is some python code that allows you to scrape bike availability from bike schemes at the bottom of this post…

SLOW TRACK: As a recent aside, I was interested in collecting Dublin Bikes usage data over a long time period for data visualisation and exploration purposes. The Dublinbikes scheme was launched in September 2009, is operated by JCDecaux and Dublin City Council, and is one of the more successful public bike schemes to have been implemented. To date, there have been over 6 million journeys and over 37,000 long-term subscribers to the scheme. The bike scheme has attracted considerable press recently with its expansion to 1,500 bikes and 102 stations around the city.

I wanted to collect, in real time, the status of all of the Dublin Bike Stations across Dublin over a number of months, and then visualise the bike usage and journey numbers at a number of different stations. Things like this:

Example of plot from Dublin bikes

There is no official public API that allows a large number of requests without IP blocking. The slightly-hidden API at the Dublin Cyclocity website started to block me after only a few minutes of requests. However, the good people at Citybik.es provide a wonderful API that provides real-time JSON data for a host of cities in Europe, America, and Australasia.

The code below provides a short and simple scraper that queries the Citybik.es API at a predefined rate and stores all of the results in a CSV or SQLite database file. All data is returned from the API as a JSON dump detailing bike availability at all stations; this data is parsed, converted into a pandas data frame, and inserted into the requested data container. You’ll need Python 2.7 and a few dependencies that can be installed using pip or easy_install.

Maybe you’ll find it useful on your data adventures.

# City Bikes Scraper.
#
# Simple functions to use the citybik.es API to record bike availability in a specific city.
# Settings for scrapers can be changed in lines 18-22
# 
# Built using Python 2.7
#
# Shane Lynn 24/03/2014
# @shane_a_lynn
# http://www.shanelynn.ie

import requests
import pandas as pd
import pandas.io.sql as pdsql
from time import sleep, strftime, gmtime
import json
import sqlite3

# define the city you would like to get information from here:
# for full list see http://api.citybik.es
API_URL = "http://api.citybik.es/dublinbikes.json"

#Settings:
SAMPLE_TIME = 120                   # number of seconds between samples
SQLITE = False                      # If true - data is stored in SQLite file, if false - csv.
SQLITE_FILE = "bikedb.db"           # SQLite file to save data in
CSV_FILE = "output.csv"             # CSV file to save data in

def getAllStationDetails():
    print "\n\nScraping at " + strftime("%Y%m%d%H%M%S", gmtime())

    try:
        # this url has all the details
        decoder = json.JSONDecoder()
        station_json = requests.get(API_URL, proxies='')
        station_data = decoder.decode(station_json.content)
    except:
        print "---- FAIL ----"
        return None

    #remove unnecessary data - space saving
    # we dont need latitude and longitude
    for ii in range(0, len(station_data)):
        del station_data[ii]['lat']
        del station_data[ii]['lng']
        #del station_data[ii]['station_url']
        #del station_data[ii]['coordinates']

    print " --- SUCCESS --- "
    return station_data

def writeToCsv(data, filename="output.csv"):
    """
    Take the list of results and write as csv to filename.
    """
    data_frame = pd.DataFrame(data)
    data_frame['time'] = strftime("%Y%m%d%H%M%S", gmtime())
    data_frame.to_csv(filename, header=False, mode="a")

def writeToDb(data, db_conn):
    """
    Take the list of results and write to sqlite database
    """
    data_frame = pd.DataFrame(data)
    data_frame['scrape_time'] = strftime("%Y%m%d%H%M%S", gmtime())
    pdsql.write_frame(data_frame, "bikedata", db_conn, flavor="sqlite", if_exists="append", )
    db_conn.commit()

if __name__ == "__main__":

    # Create / connect to Sqlite3 database
    conn = sqlite3.connect(SQLITE_FILE) # or use :memory: to put it in RAM
    cursor = conn.cursor()
    # create a table to store the data
    cursor.execute("""CREATE TABLE IF NOT EXISTS bikedata
                      (name text, idx integer, timestamp text, number integer,
                       free integer, bikes integer, id integer, scrape_time text)
                   """)
    conn.commit()

    #run main function
    # we need to run the full collection, parsing, and writing every SAMPLE_TIME seconds.
    while True:
        station_data = getAllStationDetails()
        if station_data:
            if SQLITE:
                writeToDb(station_data, conn)
            else:                
                writeToCsv(station_data, filename=CSV_FILE)

        print "Sleeping for 120 seconds."
        sleep(120)
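
Once the scraper has been running for a while, the collected data can be pulled straight back into pandas for analysis. Here’s a minimal sketch of reading it back, assuming the SQLite option was used and the table/column names created above (adjust them if you changed the schema):

import sqlite3
import pandas as pd

# connect to the same SQLite file the scraper writes to
conn = sqlite3.connect("bikedb.db")

# load the full history of station readings into a DataFrame
bike_data = pd.read_sql_query("SELECT * FROM bikedata", conn)

# parse the scrape timestamps (stored as YYYYmmddHHMMSS strings)
bike_data['scrape_time'] = pd.to_datetime(bike_data['scrape_time'], format="%Y%m%d%H%M%S")

# e.g. average number of free stands per station over the collection period
print(bike_data.groupby('name')['free'].mean())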

 

 

Asynchronous updates to a webpage with Flask and Socket.io


This post is about creating Flask web pages that can be asynchronously updated by your Python Flask application at any point without any user interaction. We’ll be using Python Flask and the Flask-SocketIO plug-in to achieve this. In short, the final result is hosted on GitHub.

What I want to achieve here is a web page that is automatically updated for each user as a result of events that happen in the background on my server system – for example, a continually updating message stream, a notification system, or a specific Twitter monitor / display. In this post, I show how to develop a bare-bones Python Flask application that updates connected clients with random numbers. Flask is an extremely lightweight and simple framework for building web applications using Python.

Flask logo

If you haven’t used Flask before, it’s amazingly simple, and to get started serving a very simple webpage only requires a few lines of Python:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()

Running this file with 

python application.py
 will start a server on your local machine with one page saying “Hello World!”. A quick look through the documentation and the first few sections of the brilliant mega-tutorial by Miguel Grinberg will have you creating multi-page Python-based web applications in no time. However, most of the tutorials out there focus on the production of non-dynamic pages that load when first accessed and don’t describe further updates.

For the purpose of updating the page once our user has first visited, we will be using Socket.IO and the accompanying Flask add-on built by the same Miguel Grinberg, Flask-SocketIO (Miguel appears to be some sort of Python Flask god). Socket.IO is a genius engine that allows real-time bidirectional event-based communication. Gone are the days of static HTML pages that load when you visit; with socket technology, the server can continuously update your view with new information.

For Socket.IO communication, “events” are triggered by either the server or connected clients, and corresponding callback functions are set to execute when these events are detected. Event triggers and event-callback bindings are very simply implemented in Flask (after some initial setup) using:

from flask import Flask, render_template
from flask.ext.socketio import SocketIO, emit

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret!'
socketio = SocketIO(app)

@socketio.on('my event')                          # Decorator to catch an event called "my event":
def test_message(message):                        # test_message() is the event callback function.
    emit('my response', {'data': 'got it!'})      # Trigger a new event called "my response" 
                                                  # that can be caught by another callback later in the program.

if __name__ == '__main__':
    socketio.run(app)

Four special events are handled by the @socketio.on() decorator out of the box – ‘connect’, ‘disconnect’, ‘message’, and ‘json’ – alongside custom-named events like ‘my event’ above. Namespaces can also be assigned to keep things neatly separated, and the send() or emit() functions can be used to send ‘message’ or custom events respectively – see the details on Miguel’s page.
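
As an illustration of the difference between replying to one client and broadcasting to everyone, here is a small sketch of my own (the event names are invented for the example; newer installs import from flask_socketio rather than flask.ext.socketio):

from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret!'
socketio = SocketIO(app)

@socketio.on('my event', namespace='/test')
def handle_my_event(message):
    # reply only to the client that triggered the event
    emit('my response', {'data': message['data']})
    # or push the same event to every connected client in the namespace
    emit('my response', {'data': message['data']}, broadcast=True)

if __name__ == '__main__':
    socketio.run(app)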

On the client side, a little bit of JavaScript wizardry with jQuery is used to handle incoming and trigger outgoing events. I would really recommend the JavaScript path on CodeSchool if you are not familiar with these technologies.

$(document).ready(function(){
    // start up the SocketIO connection to the server - the namespace 'test' is also included here if necessary
    var socket = io.connect('http://' + document.domain + ':' + location.port + '/test');
    // this is a callback that triggers when the "my response" event is emitted by the server.
    socket.on('my response', function(msg) {
        $('#log').append('<p>Received: ' + msg.data + '</p>');
    });
    //example of triggering an event on click of a form submit button
    $('form#emit').submit(function(event) {
        socket.emit('my event', {data: $('#emit_data').val()});
        return false;
    });
});

And that, effectively, is the bones of sending messages between client and server. In this specific example, we want the server to be continually working in the background generating new information, while at the same time allowing new clients to connect, and pushing new information to connected clients. For this purpose, we’ll be using the Python threading module to create a thread that generates random numbers regularly and emits the newest value to all connected clients. Hence, in application.py, we define a thread object that will continually create random numbers and emit them using SocketIO, separately from the main Flask process:

#random number Generator Thread
# imports used by this thread (socketio is the SocketIO instance created above)
from threading import Thread, Event
from random import random
from time import sleep

thread = Thread()
thread_stop_event = Event()

class RandomThread(Thread):
    def __init__(self):
        self.delay = 1
        super(RandomThread, self).__init__()

    def randomNumberGenerator(self):
        """
        Generate a random number every 1 second and emit to a socketio instance (broadcast)
        Ideally to be run in a separate thread?
        """
        #infinite loop of magical random numbers
        print "Making random numbers"
        while not thread_stop_event.isSet():
            number = round(random()*10, 3)
            print number
            socketio.emit('newnumber', {'number': number}, namespace='/test')
            sleep(self.delay)

    def run(self):
        self.randomNumberGenerator()

In our main Flask code then, we start this RandomThead running, and then catch the emitted numbers in Javascript on the client. Client presentation is done, for this example, using a simple bootstrap themed page contained in the Flask Template folder, and the number handling logic is maintained in the static JavaScript file application.js. A running list of 10 numbers is maintained and all connected clients will update simultaneously as new numbers are generated by the server.

$(document).ready(function(){
    //connect to the socket server.
    var socket = io.connect('http://' + document.domain + ':' + location.port + '/test');
    var numbers_received = [];

    //receive details from server
    socket.on('newnumber', function(msg) {
        console.log("Received number" + msg.number);
        //maintain a list of ten numbers
        if (numbers_received.length >= 10){
            numbers_received.shift()
        }
        numbers_received.push(msg.number);
        numbers_string = '';
        for (var i = 0; i < numbers_received.length; i++){
            numbers_string = numbers_string + '<p>' + numbers_received[i].toString() + '</p>';
        }
        $('#log').html(numbers_string);
    });

});

And in the application.py file:

@app.route('/')
def index():
    #only by sending this page first will the client be connected to the socketio instance
    return render_template('index.html')

@socketio.on('connect', namespace='/test')
def test_connect():
    # need visibility of the global thread object
    global thread
    print('Client connected')

    #Start the random number generator thread only if the thread has not been started before.
    if not thread.isAlive():
        print "Starting Thread"
        thread = RandomThread()
        thread.start()

And that’s the job. Flask-served web pages that react to events on the server. I was tempted to add a little graph using HighCharts or AmCharts, but I’m afraid time got the better of me. Perhaps in part 2. The final output should look like this:

Flask screenshot

You can find all of the source code on GitHub, with instructions on how to install the necessary libraries etc. Feel free to adapt to your own needs, and leave any comments if you come up with something neat or have any problems. This functionality is documented in the original documentation for Flask-SocketIO. The documentation and tutorials are quite comprehensive and worth working through if you are interested in more.

Using Python Threading and Returning Multiple Results (Tutorial)


I recently had an issue with a long running web process that I needed to substantially speed up due to timeouts. The delay arose because the system needed to fetch data from a number of URLs. The total number of URLs varied from user to user, and the response time for each URL was quite long (circa 1.5 seconds).

Problems arose with 10-15 URL requests taking over 20 seconds, and my server HTTP connection was timing out. Rather than extending my timeout time, I turned to Python’s threading library. It’s easy to learn, quick to implement, and it solved my problem very quickly. The system was implemented in Python’s web micro-framework, Flask.

Parallel programming allows you to speed up your code execution - very useful for data science and data processing

Using Threads for a low number of tasks

Threading in Python is simple. It allows you to manage concurrent threads doing work at the same time. The library is called “threading“, you create “Thread” objects, and they run target functions for you. You can start potentially hundreds of threads that will operate in parallel. The first solution was inspired by a number of StackOverflow posts and involves launching an individual thread for each URL request. This turned out not to be the ideal solution, but it provides a good learning ground.

You first need to define a “work” function that each thread will execute separately. In this example, the work function is a “crawl” method that retrieves data from a URL. Returning values directly from threads is not possible; instead, we pass in a “results” list that is globally accessible (to all threads), along with the index of the list element in which each thread should store its result once fetched. The crawl() function will look like:

...
import logging
from urllib2 import urlopen
from threading import Thread
from json import JSONDecoder
...

# Define a crawl function that retrieves data from a url and places the result in results[index]
# The 'results' list will hold our retrieved data
# The 'urls' list contains all of the urls that are to be checked for data
results = [{} for x in urls]
def crawl(url, result, index):
    # Keep everything in try/catch loop so we handle errors
    try:
        data = urlopen(url).read()
        logging.info("Requested..." + url)
        result[index] = data
    except:
        logging.error('Error with URL check!')
        result[index] = {}
    return True

To actually start threads in Python, we use the “threading” library and create “Thread” objects. We can specify a target function (‘target’) and set of arguments (‘args’) for each thread and, once started, the threads will execute the function specified, all in parallel. In this case, the use of threads will effectively reduce our URL lookup time to approximately 1.5 seconds, no matter how many URLs there are to check. The code to start the threaded processes is:

#create a list of threads
threads = []
# In this case 'urls' is a list of urls to be crawled.
for ii in range(len(urls)):
    # We start one thread per url present.
    process = Thread(target=crawl, args=[urls[ii], results, ii])
    process.start()
    threads.append(process)

# We now pause execution on the main thread by 'joining' all of our started threads.
# This ensures that each has finished processing the urls.
for process in threads:
    process.join()

# At this point, results for each URL are now neatly stored in order in 'results'

The only peculiarity here is the

join()
  function. Essentially, join() pauses the calling thread (in this case the main thread of the program) until the thread in question has finished processing. Calling join prevents our program from progressing until all URLs have been fetched.

This method of starting one thread for each task will work well unless you have a high number (many hundreds) of tasks to complete.

Using Queue for a high number of tasks

The solution outlined above operated successfully for us, with users of our web application requiring, on average, 9-11 threads per request. The threads were starting, working, and returning results successfully. Issues arose later when users required many more threaded processes (>400). With such requests, Python was starting hundreds of threads and receiving errors like:

error: can't start new thread

File "/usr/lib/python2.5/threading.py", line 440, in start
    _start_new_thread(self.__bootstrap, ())

For these users, the original solution was not viable. There is a limit in your environment to the maximum number of threads that can be started by Python. Another of Python’s built-in libraries for threading, Queue, can be used to get around this obstacle. A queue is essentially used to store a number of “tasks to be done”. Threads can take tasks from the queue when they are available, do the work, and then go back for more. In this example, we needed to ensure a maximum of 50 threads at any one time, while keeping the ability to process any number of URL requests. Setting up a queue in Python is very simple:

...
from Queue import Queue
...
#set up the queue to hold all the urls
q = Queue(maxsize=0)
# Use many threads (50 max, or one for each url)
num_threads = min(50, len(urls))

To return results from the threads, we will use the same technique of passing a results list, along with an index for storage, to each worker thread. The index needs to be included in the Queue when setting up the tasks, since we will not be explicitly calling each “crawl” function with arguments (we also have no guarantee as to which order the tasks are executed in).

#this is where threads will deposit the results
results = [{} for x in urls]
#load up the queue with the urls to fetch and the index for each job (as a tuple):
for i in range(len(urls)):
    #need the index and the url in each queue item.
    q.put((i,urls[i]))

The threaded “crawl” function will be different since it now relies on the queue. The threads are set up to close and return when the queue is empty of tasks.

def crawl(q, result):
    while not q.empty():
        work = q.get()                      #fetch new work from the Queue
        try:
            data = urlopen(work[1]).read()
            logging.info("Requested..." + work[1])
            result[work[0]] = data          #Store data back at correct index
        except:
            logging.error('Error with URL check!')
            result[work[0]] = {}
        #signal to the queue that task has been processed
        q.task_done()
    return True

The Queue object itself is passed to the threads along with the list for storing results. The final location for each result is contained within each queue task – ensuring that the final “results” list ends up in the same order as the original “urls” list. We set up and start the worker threads with this job information:

#set up the worker threads
for i in range(num_threads):
    logging.debug('Starting thread %s', i)
    worker = Thread(target=crawl, args=(q,results))
    worker.setDaemon(True)    #setting threads as "daemon" allows main program to 
                              #exit eventually even if these dont finish 
                              #correctly.
    worker.start()

#now we wait until the queue has been processed
q.join()

logging.info('All tasks completed.')

Our tasks will now not be completely processed in parallel, but rather by 50 threads operating in parallel. Hence, 100 urls will take 2 x 1.5 seconds approx. Here, this delay was acceptable since the number of users requiring more than 50 threads is minimal. However, at least the system is flexible enough to handle any situation.

This setup is well suited to this example of non-computationally-intensive input/output work (fetching URLs), since much of the threads’ time will be spent waiting for data. In data-intensive or data science work, the multiprocessing or celery libraries can be better suited, since they split work across multiple CPU cores. Hopefully the content above gets you on the right track!
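
For comparison, here is a minimal sketch of the process-based alternative mentioned above, using multiprocessing.Pool. The work function is a made-up CPU-bound example of my own, not part of the original web system:

from multiprocessing import Pool

def heavy_work(n):
    """A stand-in CPU-bound task: sum of squares up to n."""
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = [10 ** 6, 2 * 10 ** 6, 3 * 10 ** 6, 4 * 10 ** 6]
    # a pool of 4 worker *processes* side-steps the GIL for CPU-bound work
    pool = Pool(processes=4)
    results = pool.map(heavy_work, inputs)   # results come back in input order
    pool.close()
    pool.join()
    print(results)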

Further information on Python Threading

There is some great further reading on threads and the threading module if you are looking for more in-depth information – the official Python documentation for the threading and Queue modules is a good place to start.

 


Summarising, Aggregating, and Grouping data in Python Pandas


I’ve recently started using Python’s excellent Pandas library as a data analysis tool, and, while the transition from R’s excellent data.table library is frustrating at times, I’m finding my way around and most things work quite well.

One aspect that I’ve recently been exploring is the task of grouping large data frames by different variables, and applying summary functions to each group. This is accomplished in Pandas using the “groupby()” and “agg()” functions of Pandas’ DataFrame objects.

A Sample DataFrame

In order to demonstrate the effectiveness and simplicity of the grouping commands, we will need some data. For an example dataset, I have extracted my own mobile phone usage records. I analyse this type of data using Pandas during my work on KillBiller. If you’d like to follow along – the full csv file is available here.

The dataset contains 830 entries from my mobile phone log spanning a total time of 5 months. The CSV file can be loaded into a pandas DataFrame using the pandas.DataFrame.from_csv() function, and looks like this:

 

index date duration item month network network_type
0 15/10/14 06:58 34.429 data 2014-11 data data
1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile
4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile
5 15/10/14 18:55 4.000 call 2014-11 Tesco mobile
6 16/10/14 06:58 34.429 data 2014-11 data data
7 16/10/14 15:01 602.000 call 2014-11 Three mobile
8 16/10/14 15:12 1050.000 call 2014-11 Three mobile
9 16/10/14 15:30 19.000 call 2014-11 voicemail voicemail
10 16/10/14 16:21 1183.000 call 2014-11 Three mobile
11 16/10/14 22:18 1.000 sms 2014-11 Meteor mobile

The main columns in the file are:

  1. date: The date and time of the entry
  2. duration: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.
  3. item: A description of the event occurring – can be one of call, sms, or data.
  4. month: The billing month that each entry belongs to – of form ‘YYYY-MM’.
  5. network: The mobile network that was called/texted for each entry.
  6. network_type: Whether the number being called was a mobile, international (‘world’), voicemail, landline, or other (‘special’) number.

Phone numbers were removed for privacy. The date column can be parsed using the extremely handy dateutil library.

import pandas as pd
import dateutil

# Load data from csv file
data = pd.DataFrame.from_csv('phone_data.csv')
# Convert date from string to date times
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)

Summarising the DataFrame

Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, mean, max, min, standard deviations and more for columns are easily calculable:

# How many rows are in the dataset?
data['item'].count()
Out[38]: 830

# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0

# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0

# How many entries are there for each month?
data['month'].value_counts()
Out[41]: 
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
dtype: int64

# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9

The need for custom functions is minimal unless you have very specific requirements. The full range of basic statistics that are quickly calculable and built into the base Pandas package are:

Function Description
count Number of non-null observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Unbiased standard deviation
var Unbiased variance
sem Unbiased standard error of the mean
skew Unbiased skewness (3rd moment)
kurt Unbiased kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum

Summarising Groups in the DataFrame

There’s further power put into your hands by mastering the Pandas “groupby()” functionality. Groupby essentially splits the data into different groups depending on a variable of your choice. For example, the expression 

data.groupby('month')
  will split our current DataFrame by month. The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split. The GroupBy object’s .groups variable is a dictionary whose keys are the computed unique groups, with corresponding values being the axis labels belonging to each group. For example:
data.groupby(['month']).groups.keys()
Out[59]: ['2014-12', '2014-11', '2015-02', '2015-03', '2015-01']

len(data.groupby(['month']).groups['2014-11'])
Out[61]: 230

Functions like max(), min(), mean(), first(), and last() can be quickly applied to the GroupBy object to obtain summary statistics for each group – an immensely useful capability. This functionality is similar to the dplyr and plyr libraries for R. Different variables can be excluded / included in each summary requirement.

# Get the first entry for each month
data.groupby('month').first()
Out[69]: 
                       date  duration  item   network network_type
month                                                             
2014-11 2014-10-15 06:58:00    34.429  data      data         data
2014-12 2014-11-13 06:58:00    34.429  data      data         data
2015-01 2014-12-13 06:58:00    34.429  data      data         data
2015-02 2015-01-13 06:58:00    34.429  data      data         data
2015-03 2015-02-12 20:15:00    69.000  call  landline     landline

# Get the sum of the durations per month
data.groupby('month')['duration'].sum()
Out[70]: 
month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

# Get the number of dates / entries in each month
data.groupby('month')['date'].count()
Out[74]: 
month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

# What is the sum of durations, for calls only, to each network
data[data['item'] == 'call'].groupby('network')['duration'].sum()
Out[78]: 
network
Meteor 7200
Tesco 13828
Three 36464
Vodafone 14621
landline 18433
voicemail 1775
Name: duration, dtype: float64

You can also group by more than one variable, allowing more complex queries.

# How many calls, sms, and data entries are in each month?
data.groupby(['month', 'item'])['date'].count()
Out[76]: 
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64

# How many calls, texts, and data are sent per month, split by network_type?
data.groupby(['month', 'network_type'])['date'].count()
Out[82]: 
month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
....
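
A handy trick with multi-indexed results like the two above (an optional extra, not part of the original post) is to unstack() the inner index level, which pivots the output into an easier-to-read table with one column per category:

# Pivot the (month, item) counts into a month x item table
data.groupby(['month', 'item'])['date'].count().unstack()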

Multiple Statistics per Group

The final piece of syntax that we’ll examine is the “agg()” function for Pandas. The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation. The syntax is simple, and is similar to that of MongoDB’s aggregation framework.

summary and aggregation in pandas python

Instructions for aggregation are provided in the form of a python dictionary. Use the dictionary keys to specify the columns upon which you’d like to operate, and the values to specify the function to run.

For example:

# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration':sum,      # find the sum of the durations for each group
                                     'network_type': "count", # find the number of network type entries
                                     'date': 'first'})    # get the first date per group

The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline using “lambda” functions to extract statistics that are not provided by the built-in options.

# Define the aggregation procedure outside of the groupby operation
aggregations = {
    'duration':'sum',
    'date': lambda x: max(x)
}
data.groupby('month').agg(aggregations)

The final piece of the puzzle is the ability to rename the newly calculated columns and to calculate multiple statistics from a single column in the original data frame. Such calculations are possible through nested dictionaries, or by passing a list of functions for a column. Our final example calculates multiple values from the duration column and names the results appropriately. Note that the results have multi-indexed column headers.

# Define the aggregation calculations
aggregations = {
    'duration': { # work on the "duration" column
        'total_duration': 'sum',  # get the sum, and call this result 'total_duration'
        'average_duration': 'mean', # get mean, call result 'average_duration'
        'num_calls': 'count'
    },
    'date': {     # Now work on the "date" column
        'max_date': 'max',   # Find the max, call the result "max_date"
        'min_date': 'min',
        'num_days': lambda x: max(x) - min(x)  # Calculate the date range per group
    },
    'network': ["count", "max"]  # Calculate two results for the 'network' column with a list
}

# Perform groupby aggregation by "month", but only on the rows that are of type "call"
data[data['item'] == 'call'].groupby('month').agg(aggregations)

Aggregation and summarisation of data using pandas python on mobile phone data
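
If the multi-indexed column headers get in the way of further processing, one option (a minimal sketch, not part of the original post) is to collapse them into single-level names after the aggregation:

# Run the aggregation defined above, then join the two header levels
# into single names such as 'duration_total_duration'.
grouped = data[data['item'] == 'call'].groupby('month').agg(aggregations)
grouped.columns = ['_'.join(col).rstrip('_') for col in grouped.columns.values]
grouped = grouped.reset_index()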

The groupby functionality in Pandas is well documented in the official docs and performs at speeds on a par (unless you have massive data and are picky with your milliseconds) with R’s data.table and dplyr libraries.

Fixing Office 2016 installation for Mac – error code 0xD0000006


This is a very quick post to help some people out on installation problems with Office for Mac 2016.

On an excited installation of Excel 2016 on my MacBook, the following error threatened to ruin the day:

“An unknown error has occurred, the error code is: 0xD0000006”

The solution, seemingly unfound elsewhere on the internet, was oddly enough to ensure that the “name” of the computer has no special characters. The MacBook in question had an “á” in the computer name.

To change the name of your computer, open up “System Preferences” by pressing Command-Space and typing “Preferences”. Alternatively, click the Apple symbol at the top left of the screen and click “System Preferences”.

Under the “Sharing” option, you’ll find your computer name.

Hope this helps out.

System Preferences on Macbook Pro.

Fix your excel powerpoint and office problems by changing your computer name

Sharing options on Macbook Pro

Amazon Elastic Beanstalk – Logging to Logentries from Python Application


[Short version] The S3 ingestion script for Amazon applications provided by Logentries will not work for the gzip-compressed log files produced by the Elastic Beanstalk log rotation system. A slightly edited script will work instead and can be found on Github here. [/Short version]

 

Logentries is a brilliant startup, originating here in Dublin, for collecting and analysing log files in the cloud and in real time. The company was founded by Villiam Holub and Trevor Parsons, spinning out of academic research in University College Dublin. It’s a great service, with a decent free tier (5GB per month and 7-day retention).

logentries pricing

At KillBiller, we run a Python backend system to support cost calculations, and this runs on the Amazon Elastic Beanstalk platform, which simplifies scaling, deployment, and database management. The KillBiller backend uses a Python Flask application to serve http requests that come from our user-facing mobile app.

Setting up automatic logging from Elastic Beanstalk (and python) to Logentries proved more difficult than expected. I wanted the automatically rotated logs from the Elastic Beanstalk application uploaded to Logentries. This is achieved by:

  1. Setting up your Amazon Elastic Beanstalk (EB) application to “rotate logs” through S3.
  2. Using AWS Lambda to detect the upload of logs from EB and trigger a script. Logentries provide a tutorial for this.
  3. Using the Lambda script to upload logs from S3 directly to Logentries (this is where the trouble starts).

The problem encountered is that Amazon places a single GZIP compressed file in your S3 bucket during log rotation. The lambda script provided by Logentries will only work with text files.

With only some slight changes, we can edit the script to take the gzip file from S3, unzip to a stream, and using the Python zlib and StringIO libraries, turn this back to normal text for Logentries ingestion. The full edited script is available here, and the parts that have changed from the original version are:

...
import zlib
import StringIO

...
# Get object from S3 Bucket
response = s3.get_object(Bucket=bucket, Key=key)
body = response['Body']
# Read data from object, at this point this is compressed zip binary format.
compressed_data = body.read()
s.sendall("Successfully read the zipped contents of the file.")
# Use zlib library to decompress this binary string
data = zlib.decompress(compressed_data, 16+zlib.MAX_WBITS)
s.sendall("Decompressed data")
# Now continue script as normal to send data to Logentries.
for token in tokens:
    s.sendall('%s %s\n' % (token, "username='{}' downloaded file='{}' from bucket='{}' and uncompressed!."

...
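
For reference, here is a minimal, self-contained sketch of just the download-and-decompress step in isolation (the function name and structure are illustrative, not taken from the Logentries script; it assumes the boto3 library is available):

import zlib
import boto3

def read_gzipped_log(bucket, key):
    """Fetch a gzip-compressed log object from S3 and return its text content."""
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket=bucket, Key=key)
    compressed_data = response['Body'].read()
    # The 16 + zlib.MAX_WBITS window value tells zlib to expect a gzip header
    return zlib.decompress(compressed_data, 16 + zlib.MAX_WBITS).decode('utf-8')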

In terms of other steps, here’s the key points:

1. Turn on Log Rotation for your Elastic Beanstalk Application

This setting is found under “Software Configuration” in the Configuration page of your Elastic Beanstalk application.

amazon-log-rotation

At regular intervals, Amazon will collect logs from every instance in your application, zip them, and place them on an Amazon S3 bucket, under different folders, in my case:

/elasticbeanstalk-eu-west-1-<account number>/resources/environments/logs/publish/<environment id>/<instance id>/<log files>.gz

2. Set up AWS Lambda to detect log changes and upload to Logentries

To complete this step, follow the instructions as laid out by Logentries themselves at this page. The only changes that I think are worth making are:

  • Instead of the “le_lambda.py” file that Logentries provide, use the slightly edited version I have here. You’ll still need the “le_certs.pem” file and need to zip them together when creating your AWS Lambda task.
  • When configuring the event sources for your AWS Lambda job, you can specify a Prefix for the notifications, ensuring that only logs for a given application ID are uploaded, rather than any changes to your S3 bucket (such as manually requested logs). For example:
    resources/environments/logs/publish/e-mpcwnwheky/
  • If you have other Lambda functions (perhaps the original le_lambda.py) being used for non-zipped files, you can use the “suffix” filter to only trigger for “.gz” files; a rough sketch of both filters follows below.
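
The same prefix and suffix filtering can also be set up programmatically as an S3 bucket notification. A rough boto3 sketch (the bucket name, account number, and Lambda ARN below are placeholders; the environment ID is the example one from above):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='elasticbeanstalk-eu-west-1-123456789012',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:eu-west-1:123456789012:function:le_lambda',  # placeholder ARN
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'resources/environments/logs/publish/e-mpcwnwheky/'},
                {'Name': 'suffix', 'Value': '.gz'},
            ]}}
        }]
    }
)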

Now your logs should flow freely to Logentries; get busy making up some dashboards, tags, and alerts.

Analysis of Weather data using Pandas, Python, and Seaborn


The most recent post on this site was an analysis of how often people cycling to work actually get rained on in different cities around the world. You can check it out here.

The analysis was completed using data from the Wunderground weather website and Python, specifically the Pandas and Seaborn libraries. In this post, I will provide the Python code to replicate the work and analyse information for your own city. During the analysis, I used Python Jupyter notebooks to interactively explore and cleanse data; there’s a simple setup if you elect to use something like the Anaconda Python distribution to install everything you need.

If you want to skip data downloading and scraping, all of the data I used is available to download here.

Scraping Weather Data

Wunderground.com has a “Personal Weather Station (PWS)” network for which fantastic historical weather data is available – covering temperature, pressure, wind speed and direction, and of course rainfall in mm – all available on a per-minute level. Individual stations can be examined at specific URLS, for example here for station “IDUBLIND35”.

There’s no official API for the PWS stations that I could see, but there is a very good API for forecast data. However,  CSV format data with hourly rainfall, temperature, and pressure information can be downloaded from the website with some simple Python scripts.

The hardest part here is to actually find stations that contain enough information for your analysis – you’ll need to switch to “yearly view” on the website to find stations that have been around more than a few months, and that record all of the information you want. If you’re looking for temperature info – you’re laughing, but precipitation records are more sparse.

graphs from wunderground data website

Wunderground have an excellent site with interactive graphs to look at weather data on a daily, monthly, and yearly level. Data is also available to download in CSV format, which is great for data science purposes.


import io  # needed for io.StringIO when parsing the CSV response below
import requests
import pandas as pd
from dateutil import parser, rrule
from datetime import datetime, time, date
import time

def getRainfallData(station, day, month, year):
    """
    Function to return a data frame of minute-level weather data for a single Wunderground PWS station.
    
    Args:
        station (string): Station code from the Wunderground website
        day (int): Day of month for which data is requested
        month (int): Month for which data is requested
        year (int): Year for which data is requested
    
    Returns:
        Pandas Dataframe with weather data for specified station and date.
    """
    url = "http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID={station}&day={day}&month={month}&year={year}&graphspan=day&format=1"
    full_url = url.format(station=station, day=day, month=month, year=year)
    # Request data from wunderground data
    response = requests.get(full_url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    data = response.text
    # remove the excess <br> from the text data
    data = data.replace('<br>', '')
    # Convert to pandas dataframe (fails if issues with weather station)
    try:
        dataframe = pd.read_csv(io.StringIO(data), index_col=False)
        dataframe['station'] = station
    except Exception as e:
        print("Issue with date: {}-{}-{} for station {}".format(day,month,year, station))
        return None
    return dataframe
    
# Generate a list of all of the dates we want data for
start_date = "2015-01-01"
end_date = "2015-12-31"
start = parser.parse(start_date)
end = parser.parse(end_date)
dates = list(rrule.rrule(rrule.DAILY, dtstart=start, until=end))

# Create a list of stations here to download data for
stations = ["IDUBLINF3", "IDUBLINF2", "ICARRAIG2", "IGALWAYR2", "IBELFAST4", "ILONDON59", "IILEDEFR28"]
# Set a backoff time in seconds if a request fails
backoff_time = 10
data = {}

# Gather data for each station in turn and save to CSV.
for station in stations:
    print("Working on {}".format(station))
    data[station] = []
    for date in dates:
        # Print periodic status update messages
        if date.day % 10 == 0:
            print("Working on date: {} for station {}".format(date, station))
        done = False
        while done == False:
            try:
                weather_data = getRainfallData(station, date.day, date.month, date.year)
                done = True
            except ConnectionError as e:
                # May get rate limited by Wunderground.com, backoff if so.
                print("Got connection error on {}".format(date))
                print("Will retry in {} seconds".format(backoff_time))
                time.sleep(backoff_time)
        # Add each processed date to the overall data
        data[station].append(weather_data)
    # Finally combine all of the individual days and output to CSV for analysis.
    pd.concat(data[station]).to_csv("data/{}_weather.csv".format(station))

Cleansing and Data Processing

The data downloaded from Wunderground needs a little bit of work. Again, if you want the raw data, it’s here. Ultimately, we want to work out when it’s raining at certain times of the day and aggregate this result to daily, monthly, and yearly levels. As such, we use Pandas to add month, year, and date columns. Simple stuff in preparation, and we can then output plots as required.

station = 'IEDINBUR6' # Edinburgh
data_raw = pd.read_csv('data/{}_weather.csv'.format(station))

# Give the variables some friendlier names and convert types as necessary.
data_raw['temp'] = data_raw['TemperatureC'].astype(float)
data_raw['rain'] = data_raw['HourlyPrecipMM'].astype(float)
data_raw['total_rain'] = data_raw['dailyrainMM'].astype(float)
data_raw['date'] = data_raw['DateUTC'].apply(parser.parse)
data_raw['humidity'] = data_raw['Humidity'].astype(float)
data_raw['wind_direction'] = data_raw['WindDirectionDegrees']
data_raw['wind'] = data_raw['WindSpeedKMH']

# Extract out only the data we need.
data = data_raw.loc[:, ['date', 'station', 'temp', 'rain', 'total_rain', 'humidity', 'wind']]
data = data[(data['date'] >= datetime(2015,1,1)) & (data['date'] <= datetime(2015,12,31))]

# There's an issue with some stations that record rainfall ~-2500 where data is missing.
if (data['rain'] < -500).sum() > 10:
    print("There's more than 10 messed up days for {}".format(station))
    
# remove the bad samples
data = data[data['rain'] > -500]

# Assign the "day" to every date entry
data['day'] = data['date'].apply(lambda x: x.date())

# Get the time, day, and hour of each timestamp in the dataset
data['time_of_day'] = data['date'].apply(lambda x: x.time())
data['day_of_week'] = data['date'].apply(lambda x: x.weekday())    
data['hour_of_day'] = data['time_of_day'].apply(lambda x: x.hour)
# Mark the month for each entry so we can look at monthly patterns
data['month'] = data['date'].apply(lambda x: x.month)

# Is each time stamp on a working day (Mon-Fri)
data['working_day'] = (data['day_of_week'] >= 0) & (data['day_of_week'] <= 4)

# Classify into morning or evening times (assuming travel between 8.15-9am and 5.15-6pm)
data['morning'] = (data['time_of_day'] >= time(8,15)) & (data['time_of_day'] <= time(9,0))
data['evening'] = (data['time_of_day'] >= time(17,15)) & (data['time_of_day'] <= time(18,0))

# If there's any rain at all, mark that!
data['raining'] = data['rain'] > 0.0

# You get wet cycling if it's a working day, and it's raining at the travel times!
data['get_wet_cycling'] = (data['working_day']) & ((data['morning'] & data['raining']) |
                                                   (data['evening'] & data['raining']))

At this point, the dataset is relatively clean, and ready for analysis. If you are not familiar with grouping and aggregation procedures in Python and Pandas, here is another blog post on the topic.

Data after cleansing from Wunderground.com. This data is now in good format for grouping and visualisation using Pandas.

Data summarisation and aggregation

With the data cleansed, we now have non-uniform samples of the weather at a given station throughout the year, at a sub-hour level. To make meaningful plots on this data, we can aggregate over the days and months to gain an overall view and to compare across stations.

# calendar is needed below for month-name abbreviations
import calendar

# Looking at the working days only, create a daily data set of working days:
wet_cycling = data[data['working_day'] == True].groupby('day')['get_wet_cycling'].any()
wet_cycling = pd.DataFrame(wet_cycling).reset_index()

# Group by month for display - monthly data set for plots.
wet_cycling['month'] = wet_cycling['day'].apply(lambda x: x.month)
monthly = wet_cycling.groupby('month')['get_wet_cycling'].value_counts().reset_index()
monthly.rename(columns={"get_wet_cycling":"Rainy", 0:"Days"}, inplace=True)
monthly.replace({"Rainy": {True: "Wet", False:"Dry"}}, inplace=True)    
monthly['month_name'] = monthly['month'].apply(lambda x: calendar.month_abbr[x])

# Get aggregate stats for each day in the dataset on rain in general - for heatmaps.
rainy_days = data.groupby(['day']).agg({
        "rain": {"rain": lambda x: (x > 0.0).any(),
                 "rain_amount": "sum"},
        "total_rain": {"total_rain": "max"},
        "get_wet_cycling": {"get_wet_cycling": "any"}
        })    
rainy_days.reset_index(drop=False, inplace=True)
rainy_days.columns = rainy_days.columns.droplevel(level=0)
rainy_days['rain'] = rainy_days['rain'].astype(bool)
rainy_days.rename(columns={"":"date"}, inplace=True)               

# Add the number of rainy hours per day this to the rainy_days dataset.
temp = data.groupby(["day", "hour_of_day"])['raining'].any()
temp = temp.groupby(level=[0]).sum().reset_index()
temp.rename(columns={'raining': 'hours_raining'}, inplace=True)
temp['day'] = temp['day'].apply(lambda x: x.to_datetime().date())
rainy_days = rainy_days.merge(temp, left_on='date', right_on='day', how='left')
rainy_days.drop('day', axis=1, inplace=True)

print("In the year, there were {} rainy days of {} at {}".format(rainy_days['rain'].sum(), len(rainy_days), station))
print("It was wet while cycling {} working days of {} at {}".format(wet_cycling['get_wet_cycling'].sum(),
                                                                     len(wet_cycling),
                                                                     station))
print("You get wet cycling {} % of the time!!".format(wet_cycling['get_wet_cycling'].sum()*1.0*100/len(wet_cycling)))

At this point, we have two basic data frames which we can use to visualise patterns for the city being analysed.

The "monthly" data frame gives the number of wet and dry commutes per month of the year; "rainy_days" examines how much rain there was on each day in the dataset and how often commuters got wet; "wet_cycling" is a subset of "rainy_days".

Visualisation using Pandas and Seaborn

At this point, we can start to plot the data. It’s well worth reading the documentation on plotting with Pandas, and looking over the API of Seaborn, a high-level data visualisation library that is a level above matplotlib.

This is not a tutorial on how to plot with seaborn or pandas – that’ll be a separate blog post – but rather instructions on how to reproduce the plots shown in this blog post.

Barchart of Monthly Rainy Cycles

The monthly summarised rainfall data is the source for this chart.

import matplotlib.pyplot as plt
import seaborn as sns

# Monthly plot of rainy days
plt.figure(figsize=(12,8))
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=2)
sns.barplot(x="month_name", y="Days", hue="Rainy", data=monthly.sort_values(['month', 'Rainy']))
plt.xlabel("Month")
plt.ylabel("Number of Days")
plt.title("Wet or Dry Commuting in {}".format(station))

Number of days monthly when cyclists get wet commuting at typical work times in Dublin, Ireland.

Heatmaps of Rainfall and Rainy Hours per day

The heatmaps shown on the blog post are generated using the “calmap” python library, installable using pip. Simply import the library and form a Pandas series with a DateTimeIndex, and the library takes care of the rest. I had some difficulty here with font sizes, so I had to increase the size of the plot overall to compensate.

import calmap

temp = rainy_days.copy().set_index(pd.DatetimeIndex(rainy_days['date']))
#temp.set_index('date', inplace=True)
fig, ax = calmap.calendarplot(temp['hours_raining'], fig_kws={"figsize":(15,4)})
plt.title("Hours raining")
fig, ax = calmap.calendarplot(temp['total_rain'], fig_kws={"figsize":(15,4)})
plt.title("Total Rainfall Daily")

Hours raining per day heatmap

The Calmap package is very useful for generating heatmaps. Note that if you have highly outlying data points, these will skew your colour mapping considerably – I’d advise removing or reducing them for visualisation purposes.

Total Daily Rainfall Heatmap

Heatmap of total rainfall daily over 2015. Note that if you are looking at rainfall data like this, outlying values such as that in August in this example will skew the overall visualisation and reduce the colour resolution of smaller values. It’s best to normalise the data or reduce the outliers prior to plotting.
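
For example, one simple way to tame such outliers before plotting (a sketch, not from the original analysis) is to cap the series at a high quantile:

# Cap daily rainfall at the 95th percentile so a single extreme day
# doesn't compress the colour scale for every other day.
capped = temp['total_rain'].clip(upper=temp['total_rain'].quantile(0.95))
fig, ax = calmap.calendarplot(capped, fig_kws={"figsize": (15, 4)})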

Exploratory Line Plots

Remember that Pandas can be used on its own for quick visualisations of the data – this is really useful for error checking and sense checking your results. For example:

temp[['get_wet_cycling', 'total_rain', 'hours_raining']].plot()

Quickly view and analyse your data with Pandas straight out of the box. The .plot() command will plot against the DataFrame index by default, but you can specify x and y variables as required.
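
For instance, to plot one series against an explicit x column rather than the index (a quick illustrative example):

# Plot hours of rain per day against the date column explicitly
rainy_days.plot(x='date', y='hours_raining', figsize=(12, 4))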

Comparison of Every City in Dataset

To compare every city in the dataset, summary stats for each city were calculated in advance and then the plot was generated using the seaborn library. To achieve this as quickly as possible, I wrapped the entire data preparation and cleansing phase described above into a single function called “analyse_station”, used this function on each city’s dataset, and extracted out the pieces of information needed for the plot.

Here’s the wrapped analyse_station function:

def analyse_station(data_raw, station):
    """
    Function to analyse weather data for a period from one weather station.
    
    Args:
        data_raw (pd.DataFrame): Pandas Dataframe made from CSV downloaded from wunderground.com
        station (String): Name of station being analysed (for comments)
    
    Returns:
        dict: Dictionary with analysis in keys:
            data: Processed and cleansed data
            monthly: Monthly aggregated statistics on rainfall etc.
            wet_cycling: Data on working days and whether you get wet or not commuting
            rainy_days: Daily total rainfall for each day in dataset.
    """
    # Give the variables some friendlier names and convert types as necessary.
    data_raw['temp'] = data_raw['TemperatureC'].astype(float)
    data_raw['rain'] = data_raw['HourlyPrecipMM'].astype(float)
    data_raw['total_rain'] = data_raw['dailyrainMM'].astype(float)
    data_raw['date'] = data_raw['DateUTC'].apply(parser.parse)
    data_raw['humidity'] = data_raw['Humidity'].astype(float)
    data_raw['wind_direction'] = data_raw['WindDirectionDegrees']
    data_raw['wind'] = data_raw['WindSpeedKMH']
    
    # Extract out only the data we need.
    data = data_raw.loc[:, ['date', 'station', 'temp', 'rain', 'total_rain', 'humidity', 'wind']]
    data = data[(data['date'] >= datetime(2015,1,1)) & (data['date'] <= datetime(2015,12,31))]
    
    # There's an issue with some stations that record rainfall ~-2500 where data is missing.
    if (data['rain'] < -500).sum() > 10:
        print("There's more than 10 messed up days for {}".format(station))
        
    # remove the bad samples
    data = data[data['rain'] > -500]

    # Assign the "day" to every date entry
    data['day'] = data['date'].apply(lambda x: x.date())

    # Get the time, day, and hour of each timestamp in the dataset
    data['time_of_day'] = data['date'].apply(lambda x: x.time())
    data['day_of_week'] = data['date'].apply(lambda x: x.weekday())    
    data['hour_of_day'] = data['time_of_day'].apply(lambda x: x.hour)
    # Mark the month for each entry so we can look at monthly patterns
    data['month'] = data['date'].apply(lambda x: x.month)

    # Is each time stamp on a working day (Mon-Fri)
    data['working_day'] = (data['day_of_week'] >= 0) & (data['day_of_week'] <= 4)

    # Classify into morning or evening times (assuming travel between 8.15-9am and 5.15-6pm)
    data['morning'] = (data['time_of_day'] >= time(8,15)) & (data['time_of_day'] <= time(9,0))
    data['evening'] = (data['time_of_day'] >= time(17,15)) & (data['time_of_day'] <= time(18,0))

    # If there's any rain at all, mark that!
    data['raining'] = data['rain'] > 0.0

    # You get wet cycling if it's a working day, and it's raining at the travel times!
    data['get_wet_cycling'] = (data['working_day']) & ((data['morning'] & data['raining']) |
                                                       (data['evening'] & data['raining']))
    # Looking at the working days only:
    wet_cycling = data[data['working_day'] == True].groupby('day')['get_wet_cycling'].any()
    wet_cycling = pd.DataFrame(wet_cycling).reset_index()
    
    # Group by month for display
    wet_cycling['month'] = wet_cycling['day'].apply(lambda x: x.month)
    monthly = wet_cycling.groupby('month')['get_wet_cycling'].value_counts().reset_index()
    monthly.rename(columns={"get_wet_cycling":"Rainy", 0:"Days"}, inplace=True)
    monthly.replace({"Rainy": {True: "Wet", False:"Dry"}}, inplace=True)    
    monthly['month_name'] = monthly['month'].apply(lambda x: calendar.month_abbr[x])
    
    # Get aggregate stats for each day in the dataset.
    rainy_days = data.groupby(['day']).agg({
            "rain": {"rain": lambda x: (x > 0.0).any(),
                     "rain_amount": "sum"},
            "total_rain": {"total_rain": "max"},
            "get_wet_cycling": {"get_wet_cycling": "any"}
            })    
    rainy_days.reset_index(drop=False, inplace=True)
    rainy_days.columns = rainy_days.columns.droplevel(level=0)
    rainy_days['rain'] = rainy_days['rain'].astype(bool)
    rainy_days.rename(columns={"":"date"}, inplace=True)               
    
    # Also get the number of hours per day where its raining, and add this to the rainy_days dataset.
    temp = data.groupby(["day", "hour_of_day"])['raining'].any()
    temp = temp.groupby(level=[0]).sum().reset_index()
    temp.rename(columns={'raining': 'hours_raining'}, inplace=True)
    temp['day'] = temp['day'].apply(lambda x: x.to_datetime().date())
    rainy_days = rainy_days.merge(temp, left_on='date', right_on='day', how='left')
    rainy_days.drop('day', axis=1, inplace=True)
    
    print("In the year, there were {} rainy days of {} at {}".format(rainy_days['rain'].sum(), len(rainy_days), station))
    print("It was wet while cycling {} working days of {} at {}".format(wet_cycling['get_wet_cycling'].sum(),
                                                                         len(wet_cycling),
                                                                         station))
    print("You get wet cycling {} % of the time!!".format(wet_cycling['get_wet_cycling'].sum()*1.0*100/len(wet_cycling)))

    return {"data":data, 'monthly':monthly, "wet_cycling":wet_cycling, 'rainy_days': rainy_days}

The following code was used to individually analyse the raw data for each city in turn. Note that this could be done in a more memory efficient manner by simply saving the aggregate statistics for each city at first rather than loading all into memory. I would recommend that approach if you are dealing with more cities etc.

# Load up each of the stations into memory.
stations = [
 ("IAMSTERD55", "Amsterdam"),
 ("IBCNORTH17", "Vancouver"),
 ("IBELFAST4", "Belfast"),
 ("IBERLINB54", "Berlin"),
 ("ICOGALWA4", "Galway"),
 ("ICOMUNID56", "Madrid"),
 ("IDUBLIND35", "Dublin"),
 ("ILAZIORO71", "Rome"),
 ("ILEDEFRA6", "Paris"),
 ("ILONDONL28", "London"),
 ("IMUNSTER11", "Cork"),
 ("INEWSOUT455", "Sydney"),
 ("ISOPAULO61", "Sao Paulo"),
 ("IWESTERN99", "Cape Town"),
 ("KCASANFR148", "San Francisco"),
 ("KNYBROOK40", "New York"),
 ("IRENFREW4", "Glasgow"),
 ("IENGLAND64", "Liverpool"),
 ('IEDINBUR6', 'Edinburgh')
]
data = []
for station in stations:
    weather = {}
    print("Loading data for station: {}".format(station[1]))
    weather['data'] = pd.DataFrame.from_csv("data/{}_weather.csv".format(station[0]))
    weather['station'] = station[0]
    weather['name'] = station[1]
    data.append(weather)

for ii in range(len(data)):
    print("Processing data for {}".format(data[ii]['name']))
    data[ii]['result'] = analyse_station(data[ii]['data'], data[ii]['station'])
 
# Now extract the number of wet days, the number of wet cycling days, and the number of wet commutes for a single chart.
output = []
for ii in range(len(data)):
    temp = {
            "total_wet_days": data[ii]['result']['rainy_days']['rain'].sum(),
            "wet_commutes": data[ii]['result']['wet_cycling']['get_wet_cycling'].sum(),
            "commutes": len(data[ii]['result']['wet_cycling']),
            "city": data[ii]['name']
        }
    temp['percent_wet_commute'] = (temp['wet_commutes'] *1.0 / temp['commutes'])*100
    output.append(temp)
output = pd.DataFrame(output)

The final step in the process is to actually create the diagram using Seaborn.

# Generate plot of percentage of wet commutes
plt.figure(figsize=(20,8))
sns.set_style("whitegrid")    # Set style for seaborn output
sns.set_context("notebook", font_scale=2)
sns.barplot(x="city", y="percent_wet_commute", data=output.sort_values('percent_wet_commute', ascending=False))
plt.xlabel("City")
plt.ylabel("Percentage of Wet Commutes (%)")
plt.suptitle("What percentage of your cycles to work do you need a raincoat?", y=1.05, fontsize=32)
plt.title("Based on Wunderground.com weather data for 2015", fontsize=18)
plt.xticks(rotation=60)
plt.savefig("images/city_comparison_wet_commutes.png", bbox_inches='tight')

Percentage of times you got wet cycling to work in 2015 for cities globally. Galway comes out consistently as one of the wettest places for a cycling commute in the data available, but 2015 was a particularly bad year for Irish weather. Here’s hoping for 2016.

If you do proceed to using this code in any of your work, please do let me know!

How wet is a cycling commute in Ireland? Pretty wet… if you live in Galway.


How often do you get wet cycling to work?

Cycling in Ireland is taking off. The DublinBikes scheme is a massive success with over 10 million journeys, there are large increases in people cycling in Irish cities, there’s a good cyclist community, and infrastructure is slowly improving around the country.

However, Ireland is a rainy place!

It turns out that you’ll get wet 3 times more often if you’re a Galway cyclist when compared to a Dubliner. Dublin is Ireland’s driest cycling city.

End Result: Percentage of times you got wet cycling to work in 2015 for cities globally. We’re wet, but 2015 was a particularly bad year for Irish weather. Here’s hoping for 2016!

Overall, Ireland lives up to the wet reputation – we require raincoats and (very uncool) waterproof pants a bit more than many international spots!

In this post, we examine exactly how many times cycling commuters got rained on in 2015, and we compare this to cities internationally. Data is taken from the Wunderground personal weather stations (PWS) network. If you’re interested in the techniques and repeating this analysis (Python, Pandas, & Seaborn) head here.

 

Average Annual Rainfall across Ireland is heavily biased to the Atlantic Coast, varying from 600-800 mm in Eastern Ireland to over 3000 mm in the West.

When do you cycle to work?

To look at the weather data, we assume that:

  • Your cycle to work happens between 8.15am and 9.00am and you cycle home between 5.15pm and 6.00pm.
  • You only cycle to work on Monday-Friday. (clearly not working at KillBiller.com)
  • If it rains at any time during these periods, with any amount, that’s deemed a “wet cycling day”.

Unfortunately, Irish people are all too familiar with the “rainfall radar”!

Dublin cyclists

In 2015, it rained on 181 of the 365 days at a weather station in Conyngham Road, Dublin*.

Eliminating weekends and looking only at commuting times; in 2015, Dublin cyclists would have gotten wet 35 times, or 13% of their 261 working days!

In Dublin, commuting cyclists got wet thirty-five times in the year, and very few of those wet cycles would have been a downpour. According to the data, Dublin cyclists had a completely dry February and a rain jacket was needed 2-3 times per month on average, even in Winter.

Here’s a monthly breakdown of wet and dry cycles in 2015 for cycling commuters every work day in Dublin. The cycling commute in Dublin is not as wet as you’d think!

Number of days monthly when cyclists get wet commuting at typical work times in Dublin, Ireland. Note the completely dry February, and the relatively even spread of days between winter and summer.

*The rainfall figure is roughly in line with Met Eireann figures for Ireland, where we’d expect 151 wet days on the east coast and we know that 2015 was particularly wet.

How does Dublin compare to cycling commutes in Galway, Cork, and Belfast?

Other Irish cities are not dry havens:

  • Cyclists in Galway required mudguards for 115 working days in 2015, or 44% of their commuting cycles (total of 267 wet days in Galway!);
  • Cork doesn’t fare much better, at 67 working days (27%)*; and
  • Belfast (Queens University) dripped on working cyclists 36% of the time, or on 93 of their trips.

Galway was our wettest city in the 2015 data, with 267 rainy days and 115 wet commuting days. It doesn’t rain in the morning or evening as much in Cork: even with 221 days of rain, commuters were only rained on during 67 (27%) of working days. Belfast was wet on 236 days of 365 in 2015, and cyclists were wet on 35% of their commutes.

* Note that the Cork station at Blackrock was missing about 20 days data in June 2015.

Could you just cycle to work at a different time?

An interesting way to look at the data is to examine for how many hours in the day rain is actually falling. If you had a very flexible workplace, could you just dodge the rain!? Looking at the number of hours per day during which rain was recorded, you can see variations between the cities in Ireland, and that we clearly need raincoats year round.

How many hours a day does it rain in Dublin, Belfast, Cork, and Galway? Darker colours represent more hours with rain – with the darkest red being 24 hours. Unfortunately for the westerners, Galway fared worst here too. Although there is a chance that the particular weather station chosen is in some sort of bath/shower…maybe.

How does Ireland stack up for wet commuting globally?

Looking at 2015 rainfall data from 18 cities globally, Ireland doesn’t fare well, and Galway consistently tops the charts in 2015. The graph below compares the cities internationally. Moving to Madrid anyone??

Percentage of times you got wet cycling to work in 2015 for cities globally. Galway comes out consistently as one of the wettest places for a cycling commute in the data available, but 2015 was a particularly bad year for Irish weather. Here’s hoping for 2016.

A look at the rain-filled hours in some of the international cities shows some stark differences from our Irish homes, and also that these cities have a very defined seasonal variation in weather patterns. We can take some solace that we’re not alone – our Scottish friends also stand out for rainy cycles.

Internationally, rainfall varies a huge amount between seasons. Overall however, Irish cyclists get a rougher deal!

Caveats

One issue that’s been pointed out with this analysis is that it rarely actually takes a full 45 minutes to cycle to/from work or school. Hence, there’s a potential flaw in the analysis in that any rain in the commuting periods makes the day “wet”.

Reducing the definition of a wet day to “a day in which there is no dry 20-minute period at commuting times” might fix this issue – and would likely reduce the percentage of wet days across the board for all cities. If this analysis is completed, it’ll appear here.

Data sources

The data for this article was collected and plotted from weather stations at Wunderground using the Python programming language, specifically using the Pandas and Seaborn libraries. If you’re interested, you can see the details, and how to repeat the analysis, over here. The raw data is also available here, if you’re keen on that sort of thing!

City  Total Wet Days  Total Commutes  Number of Wet Commutes  Percent Wet Commutes (%)
Galway 267 256 115 44.9
Glasgow 186 254 94 37.0
Belfast 236 261 93 35.6
Cork 221 246 67 27.2
Amsterdam 195 262 62 23.7
Dublin 181 261 35 13.4
New York 100 262 31 11.8
Cape Town 123 259 27 10.4
Berlin 124 254 26 10.2
Paris 117 248 25 10.1
Sao Paulo 143 262 26 9.9
Vancouver 115 261 25 9.6
Sydney 120 262 25 9.5
Liverpool 148 261 22 8.4
London 102 261 21 8.0
San Francisco 62 262 15 5.7
Rome 61 261 14 5.4
Madrid 66 262 12 4.6

Updates

This post garnered some attention in the media, and was featured on a number of news and media outlets.

The ggthemr package – Theme and colour your ggplot figures


Theming ggplot figure output

The default colour themes in ggplot2 are beautiful. Your figures look great, the colours match, and you have the characteristic “R” look and feel. The author of ggplot2, Hadley Wickham, has done a fantastic job.

For the tinkerers, there are methods to change every part of the look and feel of your figures. In practice, however, changing all of the defaults can feel laborious and like too much work when you just want a quick change of look and feel.

The ggthemr package was developed by a friend of mine, Ciarán Tobin, who works with me at KillBiller and Edgetier. The package gives a quick and easy way to completely change the look and feel of your ggplot2 figures, as well as quickly create a theme based on your own, or your company’s, colour palette.

In this post, we will quickly examine some of the built-in theme variations included with ggplot2 in R, and then look at the colour schemes available using ggthemr.

Basic themes included in ggplot2

There are eight built-in theme variations in the latest versions of ggplot2. Quickly change the default look of your figures by adding theme_XX() to the end of your plotting commands.

Let’s look at some of these. First create a simple figure, based on the massively overused “iris” dataset, since it’s built into R.

# Define a set of figures to play with using the Iris dataset
point_plot <- ggplot(iris, aes(x=jitter(Sepal.Width), y=jitter(Sepal.Length), col=Species)) + 
 geom_point() + 
 labs(x="Sepal Width (cm)", y="Sepal Length (cm)", col="Species", title="Iris Dataset")

bar_plot <- ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) + 
 geom_bar(stat="summary", fun.y="mean") + 
 labs(x="Species", y="Mean Sepal Width (cm)", fill="Species", title="Iris Dataset")

box_plot <- ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) + 
 geom_boxplot() + 
 labs(x="Species", y="Sepal Width (cm)", fill="Species", title="Iris Dataset")

# Display this figure:
point_plot
# Display this figure with a theme:
point_plot + theme_dark()

The default look and feel can be adjusted by adding an in-built theme from ggplot2.

  • theme_gray() – signature ggplot2 theme
  • theme_bw() – dark on light ggplot2 theme
  • theme_linedraw() – uses black lines on white backgrounds only
  • theme_light() – similar to theme_linedraw() but with grey lines as well
  • theme_dark() – lines on a dark background instead of light
  • theme_minimal() – no background annotations, minimal feel.
  • theme_classic() – theme with no grid lines.
  • theme_void() – empty theme with no elements

Examples of these themes applied to the figure are shown below.

Examples: the default ggplot2 theme, theme_dark(), theme_light(), and theme_void() applied to the scatter plot.

Note that the colour palette for your figures is not affected by these theme changes – only the figure parameters such as the grid lines, outlines, and backgrounds etc.

The “ggthemr” package

The ggthemr package sets up a new theme for your ggplot figures and completely changes their look and feel, from colours to gridlines. The package is available on github, and is installed using the Devtools package:

library(devtools)
devtools::install_github('cttobin/ggthemr')

The ggthemr package is built for “fire-and-forget” usage: you set the theme at the start of your R session, and all of your plots will look different from there.

The command to set a theme is:

# set ggthemr theme
ggthemr("<theme name>")
# plot your existing figure with the new theme
plt
# to remove all ggthemr effects later:
ggthemr_reset()

For example:

library(ggthemr)
ggthemr("dust")
point_plot
bar_plot

Point plot and bar plot with the “dust” theme applied.

“ggthemr” colour themes

There are 17 built-in colour themes with ggthemr, each one providing a unique way to change your ggplot figure look. All are listed on the ggthemr github page. The colour themes available built-in are:

  • flat
  • flat-dark
  • camoflauge
  • chalk
  • copper
  • dust
  • earth
  • fresh
  • grape
  • grass
  • greyscale
  • light
  • lilac
  • pale
  • sea
  • sky
  • solarized

One benefit of using ggthemr is that the default color palettes are replaced for lines and bars – completely changing the look of the charts. A complete view of all themes is available here, and some random examples are shown below.


“Dust” look and feel across charts


“Earth” look and feel colour palette for box plot, scatter diagram, and bar chart.


“Flat” look and feel from ggthemr for various chart types

Further examples: scatter plot with the “grass” theme, box plot with the “pale” theme, box plot and bar chart with the “solarized” theme, and bar chart and scatter diagram with the “dust” theme.

Custom palettes with ggthemr

If you’re in a working environment that has its own custom palette, for example a consultancy or accounting firm, it’s great to have ggplot2 figures match your document templates and PowerPoint files.

Ggthemr allows you to specify a custom theme palette that can be applied easily. Imagine we worked for “Tableau“, the data visualisation and business intelligence platform.

To define the custom theme, get the colour scheme for tableau figures in hex, choose a base theme, then define the swatch for ggthemr:

tableau_colours <- c('#1F77B4', '#FF7F0E', '#2CA02C', '#D62728', '#9467BD', '#8C564B', '#CFECF9', '#7F7F7F', '#BCBD22', '#17BECF')
# you have to add a colour at the start of your palette for outlining boxes, we'll use a grey:
tableau_colours <- c("#555555", tableau_colours)
# remove previous effects:
ggthemr_reset()
# Define colours for your figures with define_palette
tableau <- define_palette(
 swatch = tableau_colours, # colours for plotting points and bars
 gradient = c(lower = tableau_colours[1L], upper = tableau_colours[2L]), #upper and lower colours for continuous colours
 background = "#EEEEEE" #defining a grey-ish background 
)
# set the theme for your figures:
ggthemr(tableau)
# Create plots with familiar tableau look
point_plot
bar_plot
box_plot

Bar plot, scatter plot, and box plot with the ggthemr-created Tableau palette.

Along with the swatch, gradient, and background elements of the figures, define_palette() also accepts specification of the figure text colours, line colours, and minor and major gridlines.

Additional figure controls

There are three additional controls in the ggthemr package allowing further figure adjustments:

  • “Type” controls whether the background colour spills over the entire plot area, or just the axes section, options inner or outer
  • “Spacing” controls the padding between the axes and the axis labels / titles, options 0,1,2.
  • “Layout” controls the appearance and position of the axes and gridlines – options clean, clear, minimal, plain, scientific.

For example, a new figure look can be created with:

ggthemr("earth", type="outer", layout="scientific", spacing=2)
point_plot


“Earth” themed geom_point() with a scientific layout, extra spacing, and outer type specified for ggthemr.

Manually customising a ggplot2 theme

If ggthemr isn’t doing it for you, the in-built ggplot2 theming system is completely customisable. There’s an extensive system in ggplot2 for changing every element of your plots – all defined using the theme() function. For example, the theme_grey() theme is defined as:

theme_grey <- function (base_size = 11, base_family = "") 
{
 half_line <- base_size/2
 theme(line = element_line(colour = "black", size = 0.5, linetype = 1, 
 lineend = "butt"), rect = element_rect(fill = "white", 
 colour = "black", size = 0.5, linetype = 1), text = element_text(family = base_family, 
 face = "plain", colour = "black", size = base_size, lineheight = 0.9, 
 hjust = 0.5, vjust = 0.5, angle = 0, margin = margin(), 
 debug = FALSE), axis.line = element_line(), axis.line.x = element_blank(), 
 axis.line.y = element_blank(), axis.text = element_text(size = rel(0.8), 
 colour = "grey30"), axis.text.x = element_text(margin = margin(t = 0.8 * 
 half_line/2), vjust = 1), axis.text.y = element_text(margin = margin(r = 0.8 * 
 half_line/2), hjust = 1), axis.ticks = element_line(colour = "grey20"), 
 axis.ticks.length = unit(half_line/2, "pt"), axis.title.x = element_text(margin = margin(t = 0.8 * 
 half_line, b = 0.8 * half_line/2)), axis.title.y = element_text(angle = 90, 
 margin = margin(r = 0.8 * half_line, l = 0.8 * half_line/2)), 
 legend.background = element_rect(colour = NA), legend.margin = unit(0.2, 
 "cm"), legend.key = element_rect(fill = "grey95", 
 colour = "white"), legend.key.size = unit(1.2, "lines"), 
 legend.key.height = NULL, legend.key.width = NULL, legend.text = element_text(size = rel(0.8)), 
 legend.text.align = NULL, legend.title = element_text(hjust = 0), 
 legend.title.align = NULL, legend.position = "right", 
 legend.direction = NULL, legend.justification = "center", 
 legend.box = NULL, panel.background = element_rect(fill = "grey92", 
 colour = NA), panel.border = element_blank(), panel.grid.major = element_line(colour = "white"), 
 panel.grid.minor = element_line(colour = "white", size = 0.25), 
 panel.margin = unit(half_line, "pt"), panel.margin.x = NULL, 
 panel.margin.y = NULL, panel.ontop = FALSE, strip.background = element_rect(fill = "grey85", 
 colour = NA), strip.text = element_text(colour = "grey10", 
 size = rel(0.8)), strip.text.x = element_text(margin = margin(t = half_line, 
 b = half_line)), strip.text.y = element_text(angle = -90, 
 margin = margin(l = half_line, r = half_line)), strip.switch.pad.grid = unit(0.1, 
 "cm"), strip.switch.pad.wrap = unit(0.1, "cm"), plot.background = element_rect(colour = "white"), 
 plot.title = element_text(size = rel(1.2), margin = margin(b = half_line * 
 1.2)), plot.margin = margin(half_line, half_line, 
 half_line, half_line), complete = TRUE)
}

To create a theme of your own, you can change the values in this function definition and add it to your plot like so:

ggplot(...) + theme_your_own() # adding a theme to a figure

As you can see, it’s not for the faint-hearted, so for a quick fix I’d recommend looking at ggthemr, and I wish you the best with beautiful plots in your next presentation! Let us know how things go!

Fixing Office 2016 installation for Mac – error code 0xD0000006

$
0
0

This is a very quick post to help some people out on installation problems with Office for Mac 2016.

On excited installation of Excel 2016 on my Macbook, the following error threatened to ruin the day:

“An unknown error has occurred, the error code is: 0xD0000006”

Seemingly unfound on the internet, the solution, oddly enough was to ensure that the “name” of the computer has no special characters. The Macbook in question had an “á” in the Computer name.

To change the name of your computer, open up “System Preferences” by pressing Command-Space and typing “Preferences”. Alternatively, click your Apple symbol on top left and click “Preferences”.

Under the “Sharing” option, you’ll find your computer name.

Hope this helps out.

System Preferences on Macbook Pro.

System Preferences on Macbook Pro.

Fix your excel powerpoint and office problems by changing your computer name

Sharing options on Macbook Pro


Amazon Elastic Beanstalk – Logging to Logentries from Python Application

$
0
0

[Short version] The S3 ingestion script for Amazon applications provided by Logentries will not work for the gzip compressed log files produced by the Elastic Beanstalk log rotation system. A slightly edited script will work instead and can be found on Github here.[/Short Version]Logentries_Logo_2014.svg

 

Logentries is a brilliant startup originating here in Dublin for collecting and analysing log files on the cloud, and in real time. The company was founded by Villiam Holub and Trevor Parsons, spinning out of academic research in University College Dublin. Its a great service, with a decent free tier (5GB per month and 7 day retention).

logentries pricing

At KillBiller, we run a Python backend system to support cost calculations, and this runs on the Amazon Elastic Beanstalk platform, which simplifies scaling, deployment, and database management. The KillBiller backend uses a Python Flask application to serve http requests that come from our user-facing mobile app.

Setting up automatic logging from Elastic Beanstalk (and python) to Logentries proved more difficult than expected. I wanted the automatically rotated logs from the Elastic Beanstalk application uploaded to Logentries. This is achieved by:

  1. Setting up your Amazon Elastic Beanstalk (EB) application to “rotate logs” through S3.
  2. Using AWS Lambda to detect the upload of logs from EB and trigger a script. Logentries provide a tutorial for this.
  3. Use the lamda script to upload logs from S3 directly to Logentries. (this is where the trouble starts).

The problem encountered is that Amazon places a single GZIP compressed file in your S3 bucket during log rotation. The lambda script provided by Logentries will only work with text files.

With only some slight changes, we can edit the script to take the gzip file from S3, unzip to a stream, and using the Python zlib and StringIO libraries, turn this back to normal text for Logentries ingestion. The full edited script is available here, and the parts that have changed from the original version are:

...
import zlib
import StringIO

...
# Get object from S3 Bucket
response = s3.get_object(Bucket=bucket, Key=key)
body = response['Body']
# Read data from object, at this point this is compressed zip binary format.
compressed_data = body.read()
s.sendall("Successfully read the zipped contents of the file.")
# Use zlib library to decompress this binary string
data = zlib.decompress(compressed_data, 16+zlib.MAX_WBITS)
s.sendall("Decompressed data")
# Now continue script as normal to send data to Logentries.
for token in tokens:
    s.sendall('%s %s\n' % (token, "username='{}' downloaded file='{}' from bucket='{}' and uncompressed!."

...

In terms of other steps, here’s the key points:

1. Turn on Log Rotation for your Elastic Beanstalk Application

This setting is found under “Software Configuration” in the Configuration page of your Elastic Beanstalk application.

amazon-log-rotation

At regular intervals, Amazon will collect logs from every instance in your application, zip them, and place them on an Amazon S3 bucket, under different folders, in my case:

/elasticbeanstalk-eu-west-1-<account number>/resources/environments/logs/publish/<environment id>/<instance id>/<log files>.gz

2. Set up AWS Lambda to detect log changes and upload to Logentries

To complete this step, follow the instructions as laid out by Logentries themselves at this page. The only changes that I think are worth making are:

  • Instead of the “le_lambda.py” file that Logentries provide, use the slightly edited version I have here. You’ll still need the “le_certs.pem” file and need to zip them together when creating your AWS Lambda task.
  • When configuring the event sources for your AWS Lambda job, you can specify a Prefix for the notifications, ensuring that only logs for a given application ID are uploaded, rather than any changes to your S3 bucket (such as manually requested logs). For example:
    resources/environments/logs/publish/e-mpcwnwheky/
  • If you have other Lambda functions (perhaps the original le_lambda.py) being used for non-zipped files, you can use the “suffix” filter to only trigger for “.gz” files.

Now your logs should flow freely to Logentries; get busy making up some dashboards, tags, and alerts.

Analysis of Weather data using Pandas, Python, and Seaborn

$
0
0

The most recent post on this site was an analysis of how often people cycling to work actually get rained on in different cities around the world. You can check it out here.

The analysis was completed using data from the Wunderground weather website, Python, specifically the Pandas and Seaborn libraries. In this post, I will provide the Python code to replicate the work and analyse information for your own city. During the analysis, I used Python Jupyter notebooks to interactively explore and cleanse data; there’s a simple setup if you elect to use something like the Anaconda Python distribution to install everything you need.

If you want to skip data downloading and scraping, all of the data I used is available to download here.

Scraping Weather Data

Wunderground.com has a “Personal Weather Station (PWS)” network for which fantastic historical weather data is available – covering temperature, pressure, wind speed and direction, and of course rainfall in mm – all available on a per-minute level. Individual stations can be examined at specific URLS, for example here for station “IDUBLIND35”.

There’s no official API for the PWS stations that I could see, but there is a very good API for forecast data. However,  CSV format data with hourly rainfall, temperature, and pressure information can be downloaded from the website with some simple Python scripts.

The hardest part here is to actually find stations that contain enough information for your analysis – you’ll need to switch to “yearly view” on the website to find stations that have been around more than a few months, and that record all of the information you want. If you’re looking for temperature info – you’re laughing, but precipitation records are more sparse.

graphs from wunderground data website

Wunderground have an excellent site with interactive graphs to look at weather data on a daily, monthly, and yearly level. Data is also available to download in CSV format, which is great for data science purposes.


import io   # needed for StringIO when parsing the CSV response below
import requests
import pandas as pd
from dateutil import parser, rrule
from datetime import datetime, time, date
import time

def getRainfallData(station, day, month, year):
    """
    Function to return a data frame of minute-level weather data for a single Wunderground PWS station.
    
    Args:
        station (string): Station code from the Wunderground website
        day (int): Day of month for which data is requested
        month (int): Month for which data is requested
        year (int): Year for which data is requested
    
    Returns:
        Pandas Dataframe with weather data for specified station and date.
    """
    url = "http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID={station}&day={day}&month={month}&year={year}&graphspan=day&format=1"
    full_url = url.format(station=station, day=day, month=month, year=year)
    # Request data from wunderground data
    response = requests.get(full_url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    data = response.text
    # remove the excess <br> from the text data
    data = data.replace('<br>', '')
    # Convert to pandas dataframe (fails if issues with weather station)
    try:
        dataframe = pd.read_csv(io.StringIO(data), index_col=False)
        dataframe['station'] = station
    except Exception as e:
        print("Issue with date: {}-{}-{} for station {}".format(day,month,year, station))
        return None
    return dataframe
    
# Generate a list of all of the dates we want data for
start_date = "2015-01-01"
end_date = "2015-12-31"
start = parser.parse(start_date)
end = parser.parse(end_date)
dates = list(rrule.rrule(rrule.DAILY, dtstart=start, until=end))

# Create a list of stations here to download data for
stations = ["IDUBLINF3", "IDUBLINF2", "ICARRAIG2", "IGALWAYR2", "IBELFAST4", "ILONDON59", "IILEDEFR28"]
# Set a backoff time in seconds if a request fails
backoff_time = 10
data = {}

# Gather data for each station in turn and save to CSV.
for station in stations:
    print("Working on {}".format(station))
    data[station] = []
    for date in dates:
        # Print periodic status update messages
        if date.day % 10 == 0:
            print("Working on date: {} for station {}".format(date, station))
        done = False
        while done == False:
            try:
                weather_data = getRainfallData(station, date.day, date.month, date.year)
                done = True
            except ConnectionError as e:
                # May get rate limited by Wunderground.com, backoff if so.
                print("Got connection error on {}".format(date))
                print("Will retry in {} seconds".format(backoff_time))
                time.sleep(backoff_time)
        # Add each processed date to the overall data
        data[station].append(weather_data)
    # Finally combine all of the individual days and output to CSV for analysis.
    pd.concat(data[station]).to_csv("data/{}_weather.csv".format(station))

Cleansing and Data Processing

The data downloaded from Wunderground needs a little bit of work. Again, if you want the raw data, it’s here. Ultimately, we want to work out when its raining at certain times of the day and aggregate this result to daily, monthly, and yearly levels. As such, we use Pandas to add month, year, and date columns. Simple stuff in preparation, and we can then output plots as required.

# Imports needed for the cleansing and analysis steps
import calendar
import pandas as pd
from datetime import datetime, time
from dateutil import parser

station = 'IEDINBUR6' # Edinburgh
data_raw = pd.read_csv('data/{}_weather.csv'.format(station))

# Give the variables some friendlier names and convert types as necessary.
data_raw['temp'] = data_raw['TemperatureC'].astype(float)
data_raw['rain'] = data_raw['HourlyPrecipMM'].astype(float)
data_raw['total_rain'] = data_raw['dailyrainMM'].astype(float)
data_raw['date'] = data_raw['DateUTC'].apply(parser.parse)
data_raw['humidity'] = data_raw['Humidity'].astype(float)
data_raw['wind_direction'] = data_raw['WindDirectionDegrees']
data_raw['wind'] = data_raw['WindSpeedKMH']

# Extract out only the data we need.
data = data_raw.loc[:, ['date', 'station', 'temp', 'rain', 'total_rain', 'humidity', 'wind']]
data = data[(data['date'] >= datetime(2015,1,1)) & (data['date'] <= datetime(2015,12,31))]

# There's an issue with some stations that record rainfall ~-2500 where data is missing.
if (data['rain'] < -500).sum() > 10:
    print("There's more than 10 messed up days for {}".format(station))
    
# remove the bad samples
data = data[data['rain'] > -500]

# Assign the "day" to every date entry
data['day'] = data['date'].apply(lambda x: x.date())

# Get the time, day, and hour of each timestamp in the dataset
data['time_of_day'] = data['date'].apply(lambda x: x.time())
data['day_of_week'] = data['date'].apply(lambda x: x.weekday())    
data['hour_of_day'] = data['time_of_day'].apply(lambda x: x.hour)
# Mark the month for each entry so we can look at monthly patterns
data['month'] = data['date'].apply(lambda x: x.month)

# Is each time stamp on a working day (Mon-Fri)
data['working_day'] = (data['day_of_week'] >= 0) & (data['day_of_week'] <= 4)

# Classify into morning or evening times (assuming travel between 8.15-9am and 5.15-6pm)
data['morning'] = (data['time_of_day'] >= time(8,15)) & (data['time_of_day'] <= time(9,0))
data['evening'] = (data['time_of_day'] >= time(17,15)) & (data['time_of_day'] <= time(18,0))

# If there's any rain at all, mark that!
data['raining'] = data['rain'] > 0.0

# You get wet cycling if it's a working day, and it's raining at the travel times!
data['get_wet_cycling'] = (data['working_day']) & ((data['morning'] & data['raining']) |
                                                   (data['evening'] & data['raining']))

At this point, the dataset is relatively clean, and ready for analysis. If you are not familiar with grouping and aggregation procedures in Python and Pandas, here is another blog post on the topic.

Data after cleansing from Wunderground.com. This data is now in good format for grouping and visualisation using Pandas.

Data summarisation and aggregation

With the data cleansed, we now have non-uniform samples of the weather at a given station throughout the year, at a sub-hour level. To make meaningful plots on this data, we can aggregate over the days and months to gain an overall view and to compare across stations.

# Looking at the working days only and create a daily data set of working days:
wet_cycling = data[data['working_day'] == True].groupby('day')['get_wet_cycling'].any()
wet_cycling = pd.DataFrame(wet_cycling).reset_index()

# Group by month for display - monthly data set for plots.
wet_cycling['month'] = wet_cycling['day'].apply(lambda x: x.month)
monthly = wet_cycling.groupby('month')['get_wet_cycling'].value_counts().reset_index()
monthly.rename(columns={"get_wet_cycling":"Rainy", 0:"Days"}, inplace=True)
monthly.replace({"Rainy": {True: "Wet", False:"Dry"}}, inplace=True)    
monthly['month_name'] = monthly['month'].apply(lambda x: calendar.month_abbr[x])

# Get aggregate stats for each day in the dataset on rain in general - for heatmaps.
rainy_days = data.groupby(['day']).agg({
        "rain": {"rain": lambda x: (x > 0.0).any(),
                 "rain_amount": "sum"},
        "total_rain": {"total_rain": "max"},
        "get_wet_cycling": {"get_wet_cycling": "any"}
        })    

# clean up the aggregated data to a more easily analysed set:
rainy_days.reset_index(drop=False, inplace=True) # remove the 'day' as the index
rainy_days.rename(columns={"":"date"}, inplace=True) # The old index column didn't have a name - add "date" as name
rainy_days.columns = rainy_days.columns.droplevel(level=0) # The aggregation left us with a multi-index
                                                           # Remove the top level of this index.
rainy_days['rain'] = rainy_days['rain'].astype(bool)       # Change the "rain" column to True/False values

# Add the number of rainy hours per day this to the rainy_days dataset.
temp = data.groupby(["day", "hour_of_day"])['raining'].any()
temp = temp.groupby(level=[0]).sum().reset_index()
temp.rename(columns={'raining': 'hours_raining'}, inplace=True)
temp['day'] = temp['day'].apply(lambda x: x.to_datetime().date())
rainy_days = rainy_days.merge(temp, left_on='date', right_on='day', how='left')
rainy_days.drop('day', axis=1, inplace=True)

print("In the year, there were {} rainy days of {} at {}".format(rainy_days['rain'].sum(), len(rainy_days), station))
print("It was wet while cycling {} working days of {} at {}".format(wet_cycling['get_wet_cycling'].sum(),
                                                                    len(wet_cycling),
                                                                    station))
print("You get wet cycling {} % of the time!!".format(wet_cycling['get_wet_cycling'].sum()*1.0*100/len(wet_cycling)))

At this point, we have two basic data frames which we can use to visualise patterns for the city being analysed.

"Monthly" data frame gives the number of wet and dry commutes per month of the year "Rainy_days" examines how much rain there was daily in the dataset with how often commuters got wet. "wet_cycling" is a subset of "rainy_days".

Visualisation using Pandas and Seaborn

At this point, we can start to plot the data. It’s well worth reading the documentation on plotting with Pandas, and looking over the API of Seaborn, a high-level data visualisation library that is a level above matplotlib.

This is not a tutorial on how to plot with seaborn or pandas – that’ll be a separate blog post – but rather instructions on how to reproduce the plots shown in this post.

Barchart of Monthly Rainy Cycles

The monthly summarised rainfall data is the source for this chart.

import matplotlib.pyplot as plt
import seaborn as sns

# Monthly plot of rainy days
plt.figure(figsize=(12,8))
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=2)
sns.barplot(x="month_name", y="Days", hue="Rainy", data=monthly.sort_values(['month', 'Rainy']))
plt.xlabel("Month")
plt.ylabel("Number of Days")
plt.title("Wet or Dry Commuting in {}".format(station))

Number of days monthly when cyclists get wet commuting at typical work times in Dublin, Ireland.

Heatmaps of Rainfall and Rainy Hours per day

The heatmaps shown on the blog post are generated using the “calmap” python library, installable using pip. Simply import the library, and form a Pandas series with a DateTimeIndex and the library takes care of the rest. I had some difficulty here with font sizes, so had to increase the size of the plot overall to counter.

import calmap

# Index the daily data by date so calmap can lay the values out as a calendar
temp = rainy_days.copy().set_index(pd.DatetimeIndex(rainy_days['date']))
fig, ax = calmap.calendarplot(temp['hours_raining'], fig_kws={"figsize":(15,4)})
plt.title("Hours raining")
fig, ax = calmap.calendarplot(temp['total_rain'], fig_kws={"figsize":(15,4)})
plt.title("Total Rainfall Daily")

Hours raining per day heatmap

The Calmap package is very useful for generating heatmaps. Note that if you have highly outlying points of data, these will skew your color mapping considerably – I’d advise removing or reducing them for visualisation purposes.
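For example, one way to tame such outliers before plotting is to clip the series at a high percentile before handing it to calmap – a minimal sketch, assuming the temp frame built above (the 99th-percentile cut-off is an arbitrary choice):

# Clip extreme daily totals so one or two outlying days don't dominate the colour scale.
clipped = temp['total_rain'].clip(upper=temp['total_rain'].quantile(0.99))
fig, ax = calmap.calendarplot(clipped, fig_kws={"figsize": (15, 4)})
plt.title("Total Rainfall Daily (clipped at the 99th percentile)")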

Total Daily Rainfall Heatmap

Heatmap of total rainfall daily over 2015. Note that if you are looking at rainfall data like this, outlying values such as that in August in this example will skew the overall visualisation and reduce the colour-resolution of smaller values. It’s best to normalise the data or reduce the outliers prior to plotting.

Exploratory Line Plots

Remember that Pandas can be used on its own for quick visualisations of the data – this is really useful for error checking and sense checking your results. For example:

temp[['get_wet_cycling', 'total_rain', 'hours_raining']].plot()

Quickly view and analyse your data with Pandas straight out of the box. The .plot() command will plot against the axis, but you can specify x and y variables as required.
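For instance, to plot one variable against the date column explicitly rather than against the index – a small illustrative sketch using the temp frame from the heatmap section above:

# Specify x and y explicitly rather than plotting every column against the index.
temp.plot(x='date', y='total_rain', figsize=(12, 4))
plt.ylabel("Total daily rainfall (mm)")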

Comparison of Every City in Dataset

To compare every city in the dataset, summary stats for each city were calculated in advance and then the plot was generated using the seaborn library. To achieve this as quickly as possible, I wrapped the entire data preparation and cleansing phase described above into a single function called “analyse_station”, used this function on each city’s dataset, and extracted out the pieces of information needed for the plot.

Here’s the wrapped analyse_station function:

def analyse_station(data_raw, station):
    """
    Function to analyse weather data for a period from one weather station.
    
    Args:
        data_raw (pd.DataFrame): Pandas Dataframe made from CSV downloaded from wunderground.com
        station (String): Name of station being analysed (for comments)
    
    Returns:
        dict: Dictionary with analysis in keys:
            data: Processed and cleansed data
            monthly: Monthly aggregated statistics on rainfall etc.
            wet_cycling: Data on working days and whether you get wet or not commuting
            rainy_days: Daily total rainfall for each day in dataset.
    """
    # Give the variables some friendlier names and convert types as necessary.
    data_raw['temp'] = data_raw['TemperatureC'].astype(float)
    data_raw['rain'] = data_raw['HourlyPrecipMM'].astype(float)
    data_raw['total_rain'] = data_raw['dailyrainMM'].astype(float)
    data_raw['date'] = data_raw['DateUTC'].apply(parser.parse)
    data_raw['humidity'] = data_raw['Humidity'].astype(float)
    data_raw['wind_direction'] = data_raw['WindDirectionDegrees']
    data_raw['wind'] = data_raw['WindSpeedKMH']
    
    # Extract out only the data we need.
    data = data_raw.loc[:, ['date', 'station', 'temp', 'rain', 'total_rain', 'humidity', 'wind']]
    data = data[(data['date'] >= datetime(2015,1,1)) & (data['date'] <= datetime(2015,12,31))]
    
    # There's an issue with some stations that record rainfall ~-2500 where data is missing.
    if (data['rain'] < -500).sum() > 10:
        print("There's more than 10 messed up days for {}".format(station))
        
    # remove the bad samples
    data = data[data['rain'] > -500]

    # Assign the "day" to every date entry
    data['day'] = data['date'].apply(lambda x: x.date())

    # Get the time, day, and hour of each timestamp in the dataset
    data['time_of_day'] = data['date'].apply(lambda x: x.time())
    data['day_of_week'] = data['date'].apply(lambda x: x.weekday())    
    data['hour_of_day'] = data['time_of_day'].apply(lambda x: x.hour)
    # Mark the month for each entry so we can look at monthly patterns
    data['month'] = data['date'].apply(lambda x: x.month)

    # Is each time stamp on a working day (Mon-Fri)
    data['working_day'] = (data['day_of_week'] >= 0) & (data['day_of_week'] <= 4)

    # Classify into morning or evening times (assuming travel between 8.15-9am and 5.15-6pm)
    data['morning'] = (data['time_of_day'] >= time(8,15)) & (data['time_of_day'] <= time(9,0))
    data['evening'] = (data['time_of_day'] >= time(17,15)) & (data['time_of_day'] <= time(18,0))

    # If there's any rain at all, mark that!
    data['raining'] = data['rain'] > 0.0

    # You get wet cycling if it's a working day, and it's raining at the travel times!
    data['get_wet_cycling'] = (data['working_day']) & ((data['morning'] & data['raining']) |
                                                       (data['evening'] & data['raining']))
    # Looking at the working days only:
    wet_cycling = data[data['working_day'] == True].groupby('day')['get_wet_cycling'].any()
    wet_cycling = pd.DataFrame(wet_cycling).reset_index()
    
    # Group by month for display
    wet_cycling['month'] = wet_cycling['day'].apply(lambda x: x.month)
    monthly = wet_cycling.groupby('month')['get_wet_cycling'].value_counts().reset_index()
    monthly.rename(columns={"get_wet_cycling":"Rainy", 0:"Days"}, inplace=True)
    monthly.replace({"Rainy": {True: "Wet", False:"Dry"}}, inplace=True)    
    monthly['month_name'] = monthly['month'].apply(lambda x: calendar.month_abbr[x])
    
    # Get aggregate stats for each day in the dataset.
    rainy_days = data.groupby(['day']).agg({
            "rain": {"rain": lambda x: (x > 0.0).any(),
                     "rain_amount": "sum"},
            "total_rain": {"total_rain": "max"},
            "get_wet_cycling": {"get_wet_cycling": "any"}
            })    
    rainy_days.reset_index(drop=False, inplace=True)
    rainy_days.columns = rainy_days.columns.droplevel(level=0)
    rainy_days['rain'] = rainy_days['rain'].astype(bool)
    rainy_days.rename(columns={"":"date"}, inplace=True)               
    
    # Also get the number of hours per day where its raining, and add this to the rainy_days dataset.
    temp = data.groupby(["day", "hour_of_day"])['raining'].any()
    temp = temp.groupby(level=[0]).sum().reset_index()
    temp.rename(columns={'raining': 'hours_raining'}, inplace=True)
    temp['day'] = temp['day'].apply(lambda x: x.to_datetime().date())
    rainy_days = rainy_days.merge(temp, left_on='date', right_on='day', how='left')
    rainy_days.drop('day', axis=1, inplace=True)
    
    print("In the year, there were {} rainy days of {} at {}".format(rainy_days['rain'].sum(), len(rainy_days), station))
    print("It was wet while cycling {} working days of {} at {}".format(wet_cycling['get_wet_cycling'].sum(),
                                                                        len(wet_cycling),
                                                                        station))
    print("You get wet cycling {} % of the time!!".format(wet_cycling['get_wet_cycling'].sum()*1.0*100/len(wet_cycling)))

    return {"data":data, 'monthly':monthly, "wet_cycling":wet_cycling, 'rainy_days': rainy_days}

The following code was used to individually analyse the raw data for each city in turn. Note that this could be done in a more memory-efficient manner by saving only the aggregate statistics for each city rather than loading every dataset into memory at once – a sketch of that variant is shown after the code block below. I would recommend that approach if you are dealing with a larger number of cities.

# Load up each of the stations into memory.
stations = [
 ("IAMSTERD55", "Amsterdam"),
 ("IBCNORTH17", "Vancouver"),
 ("IBELFAST4", "Belfast"),
 ("IBERLINB54", "Berlin"),
 ("ICOGALWA4", "Galway"),
 ("ICOMUNID56", "Madrid"),
 ("IDUBLIND35", "Dublin"),
 ("ILAZIORO71", "Rome"),
 ("ILEDEFRA6", "Paris"),
 ("ILONDONL28", "London"),
 ("IMUNSTER11", "Cork"),
 ("INEWSOUT455", "Sydney"),
 ("ISOPAULO61", "Sao Paulo"),
 ("IWESTERN99", "Cape Town"),
 ("KCASANFR148", "San Francisco"),
 ("KNYBROOK40", "New York"),
 ("IRENFREW4", "Glasgow"),
 ("IENGLAND64", "Liverpool"),
 ('IEDINBUR6', 'Edinburgh')
]
data = []
for station in stations:
    weather = {}
    print("Loading data for station: {}".format(station[1]))
    weather['data'] = pd.read_csv("data/{}_weather.csv".format(station[0]))
    weather['station'] = station[0]
    weather['name'] = station[1]
    data.append(weather)

for ii in range(len(data)):
    print("Processing data for {}".format(data[ii]['name']))
    data[ii]['result'] = analyse_station(data[ii]['data'], data[ii]['station'])
 
# Now extract the number of wet days, the number of wet cycling days, and the number of wet commutes for a single chart.
output = []
for ii in range(len(data)):
    temp = {
            "total_wet_days": data[ii]['result']['rainy_days']['rain'].sum(),
            "wet_commutes": data[ii]['result']['wet_cycling']['get_wet_cycling'].sum(),
            "commutes": len(data[ii]['result']['wet_cycling']),
            "city": data[ii]['name']
        }
    temp['percent_wet_commute'] = (temp['wet_commutes'] *1.0 / temp['commutes'])*100
    output.append(temp)
output = pd.DataFrame(output)
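As mentioned above, a more memory-efficient variant would keep only the summary numbers per city and discard each raw dataset as it goes. This is a sketch only, assuming the same analyse_station function and file naming used above:

# Memory-efficient alternative: load, analyse, and summarise one station at a time.
output = []
for station_id, city in stations:
    raw = pd.read_csv("data/{}_weather.csv".format(station_id))
    result = analyse_station(raw, station_id)
    output.append({
        "city": city,
        "total_wet_days": result['rainy_days']['rain'].sum(),
        "wet_commutes": result['wet_cycling']['get_wet_cycling'].sum(),
        "commutes": len(result['wet_cycling']),
    })
    del raw, result  # free the per-city data before moving on
output = pd.DataFrame(output)
output['percent_wet_commute'] = output['wet_commutes'] * 100.0 / output['commutes']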

The final step in the process is to actually create the diagram using Seaborn.

# Generate plot of percentage of wet commutes
plt.figure(figsize=(20,8))
sns.set_style("whitegrid")    # Set style for seaborn output
sns.set_context("notebook", font_scale=2)
sns.barplot(x="city", y="percent_wet_commute", data=output.sort_values('percent_wet_commute', ascending=False))
plt.xlabel("City")
plt.ylabel("Percentage of Wet Commutes (%)")
plt.suptitle("What percentage of your cycles to work do you need a raincoat?", y=1.05, fontsize=32)
plt.title("Based on Wunderground.com weather data for 2015", fontsize=18)
plt.xticks(rotation=60)
plt.savefig("images/city_comparison_wet_commutes.png", bbox_inches='tight')

Percentage of times you got wet cycling to work in 2015 for cities globally. Galway comes out consistently as one of the wettest places for a cycling commute in the data available, but 2015 was a particularly bad year for Irish weather. Here’s hoping for 2016.

If you do proceed to using this code in any of your work, please do let me know!

How wet is a cycling commute in Ireland? Pretty dry!… if you don’t live in Galway.

How often do you get wet cycling to work?

Cycling in Ireland is taking off. The DublinBikes scheme is a massive success with over 10 million journeys, there are large increases in the numbers cycling in Irish cities, there’s a good cyclist community, and infrastructure is slowly improving around the country.

However, Ireland is a rainy place!

It turns out that you’ll get wet 3 times more often if you’re a Galway cyclist when compared to a Dubliner. Dublin is Ireland’s driest cycling city.

End Result: Percentage of times you got wet cycling to work in 2015 for cities globally. We’re wet, but 2015 was a particularly bad year for Irish weather. Here’s hoping for 2016!

Overall, Ireland lives up to the wet reputation – we require raincoats and (very uncool) waterproof pants a bit more than many international spots!

In this post, we examine exactly how many times cycling commuters got rained on in 2015, and we compare this to cities internationally. Data is taken from the Wunderground personal weather stations (PWS) network. If you’re interested in the techniques and repeating this analysis (Python, Pandas, & Seaborn) head here.

 

Average Annual Rainfall across Ireland is heavily biased to the Atlantic Coast, varying from 600-800 mm in Eastern Ireland to over 3000 mm in the West.

When do you cycle to work?

To look at the weather data, we assume that:

  • Your cycle to work happens between 8.15am and 9.00am and you cycle home between 5.15pm and 6.00pm.
  • You only cycle to work on Monday-Friday. (clearly not working at KillBiller.com)
  • If it rains at any time during these periods, with any amount, that’s deemed a “wet cycling day”.

Unfortunately, Irish people are all too familiar with the “rainfall radar”!

Dublin cyclists

In 2015, it rained on 181 of the 365 days at a weather station in Conyngham Road, Dublin*.

Eliminating weekends and looking only at commuting times; in 2015, Dublin cyclists would have gotten wet 35 times, or 13% of their 261 working days!

In Dublin, commuting cyclists got wet thirty-five times in the year, and very few of those wet cycles would have been a downpour. According to the data, Dublin cyclists had a completely dry February and a rain jacket was needed 2-3 times per month on average, even in Winter.

Here’s a monthly breakdown of wet and dry cycles in 2015 for cycling commuters every work day in Dublin. The cycling commute in Dublin is not as wet as you’d think!

Number of days monthly when cyclists get wet commuting at typical work times in Dublin, Ireland. Note the completely dry February, and the relatively even spread of days between winter and summer.

*The rainfall figure is roughly in line with Met Eireann figures for Ireland, where we’d expect 151 wet days on the east coast and we know that 2015 was particularly wet.

How does Dublin compare to cycling commutes in Galway, Cork, and Belfast?

Other Irish cities are not dry havens:

  • Cyclists in Galway required mudguards for 115 working days in 2015, or 44% of their commuting cycles (total of 261 wet days in Galway!);
  • Cork doesn’t fare much better, at 67 working days (27 %)*; and
  • Belfast (Queens University) dripped on working cyclists 36 % of the time, or on 93 of their trips.

Galway was our wettest city in the 2015 data, with 267 rainy days and 115 wet commuting times. It doesn't rain in the morning or evening as much in Cork: even with 221 rainy days, commuters were rained on during only 67 (27%) of working days. Belfast was wet on 236 days of 365 in 2015, and cyclists there got wet on 35% of their commutes.

* Note that the Cork station at Blackrock was missing about 20 days data in June 2015.

Could you just cycle to work at a different time?

An interesting way to look at the data is to examine how many hours in the day is rain actually falling. If you had a very flexible workplace, could you just dodge the rain!? We can look at the number of hours per day during which rain was recorded. As a visual, you can see variations between the cities in Ireland, and that we clearly need raincoats year round.

How many hours a day does it rain in Dublin, Belfast, Cork, and Galway? Darker colours represent more hours with rain – with the darkest red being 24 hours. Unfortunately for the westerners, Galway fared worst here too. Although there is a chance that the particular weather station chosen is in some sort of bath/shower…maybe.

How does Ireland stack up for wet commuting globally?

Looking at 2015 rainfall data from 18 cities globally, Ireland doesn’t fare well. Galway consistently tops the charts in 2015. The graph below compares 19 international cities. Moving to Madrid anyone??

Percentage of times you got wet cycling to work in 2015 for cities globally. Galway comes out consistently as one of the wettest places for a cycling commute in the data available, but 2015 was a particularly bad year for Irish weather. Here’s hoping for 2016.

A look at the rainfall filled hours in some of the international cities obviously shows some stark differences to our Irish homes, and also that these cities have a very defined seasonal variation in weather patterns. We can take some solace that we’re not alone – our Scottish friends also stand out for rainy cycles.

Internationally, rainfall varies a huge amount between seasons. Overall however, Irish cyclists get a rougher deal!

Caveats

One issue that’s been pointed out with this analysis is that it rarely actually takes a full 45 minutes to cycle to/from work or school. Hence, there’s a potential flaw in the analysis in that any rain in the commuting periods makes the day “wet”.

Reducing the definition of a wet day to “a day in which there is no dry period of 20 minutes at commuting times” might fix this issue – and would likely reduce the percentage of wet days across the board for all cities. If this analysis is completed, it’ll appear here.

Data sources

The data for this article was collected and plotted from weather stations at Wunderground using the Python programming language, specifically using the Pandas and Seaborn libraries. If you’re interested, you can see the details, and how to repeat the analysis, over here. The raw data is also available here, if you’re keen on that sort of thing!

City  Total Wet Days  Total Commutes  Number of Wet Commutes  Percent Wet Commutes (%)
Galway 267 256 115 44.9
Glasgow 186 254 94 37.0
Belfast 236 261 93 35.6
Cork 221 246 67 27.2
Amsterdam 195 262 62 23.7
Dublin 181 261 35 13.4
New York 100 262 31 11.8
Cape Town 123 259 27 10.4
Berlin 124 254 26 10.2
Paris 117 248 25 10.1
Sao Paulo 143 262 26 9.9
Vancouver 115 261 25 9.6
Sydney 120 262 25 9.5
Liverpool 148 261 22 8.4
London 102 261 21 8.0
San Francisco 62 262 15 5.7
Rome 61 261 14 5.4
Madrid 66 262 12 4.6

Updates

This post garnered some attention in the media, and was featured on:

The ggthemr package – Theme and colour your ggplot figures

Theming ggplot figure output

The default colour themes in ggplot2 are beautiful. Your figures look great, the colours match, and you have the characteristic “R” look and feel. The author of ggplot2, Hadley Wickham, has done a fantastic job.

For the tinkerers, there are methods to change every part of the look and feel of your figures. In practice, however, changing all of the defaults can feel laborious and like too much work when you just want a quick change of look and feel.

The ggthemr package was developed by a friend of mine, Ciarán Tobin, who works with me at KillBiller and Edgetier. The package gives a quick and easy way to completely change the look and feel of your ggplot2 figures, as well as quickly create a theme based on your own, or your company’s, colour palette.

In this post, we will quickly examine some of the built in theme variations included with ggplot2 in R, and then look at the colour schemes available using ggthemr.

Basic themes included in ggplot2

There’s 8 built-in theme variations in the latest versions of ggplot2. Quickly change the default look of your figures by adding theme_XX() to the end of your plotting commands.

Let’s look at some of these. First create a simple figure, based on the massively overused “iris” dataset, since it’s built-in to R.

# Define a set of figures to play with using the Iris dataset
library(ggplot2)

point_plot <- ggplot(iris, aes(x=jitter(Sepal.Width), y=jitter(Sepal.Length), col=Species)) + 
 geom_point() + 
 labs(x="Sepal Width (cm)", y="Sepal Length (cm)", col="Species", title="Iris Dataset")

bar_plot <- ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) + 
 geom_bar(stat="summary", fun.y="mean") + 
 labs(x="Species", y="Mean Sepal Width (cm)", fill="Species", title="Iris Dataset")

box_plot <- ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) + 
 geom_boxplot() + 
 labs(x="Species", y="Sepal Width (cm)", fill="Species", title="Iris Dataset")

# Display this figure:
point_plot
# Display this figure with a theme:
point_plot + theme_dark()

The default look and feel can be adjusted by adding an in-built theme from ggplot2.

  • theme_gray() – signature ggplot2 theme
  • theme_bw() – dark on light ggplot2 theme
  • theme_linedraw() – uses black lines on white backgrounds only
  • theme_light() – similar to linedraw() but with grey lines as well
  • theme_dark() – lines on a dark background instead of light
  • theme_minimal() – no background annotations, minimal feel.
  • theme_classic() – theme with no grid lines.
  • theme_void() – empty theme with no elements

Examples of these themes applied to the figure are shown below.

Examples: the scatter plot with the default ggplot2 theme, theme_dark(), theme_light(), and theme_void().

Note that the colour palette for your figures is not affected by these theme changes – only the figure parameters such as the grid lines, outlines, and backgrounds etc.

The “ggthemr” package

The ggthemr package sets up a new theme for your ggplot figures, completely changing their look and feel, from colours to gridlines. The package is available on GitHub, and is installed using the devtools package:

library(devtools)
devtools::install_github('cttobin/ggthemr')

The ggthemr package is built for “fire-and-forget” usage: you set the theme at the start of your R session, and all of your plots will look different from there.

The command to set a theme is:

# set ggthemr theme
ggthemr("<theme name>") 
# plot your existing figure with the new theme
plt
# to remove all ggthemr effects later:
ggthemr_reset()

For example:

library(ggthemr)
ggthemr("dust")
point_plot
bar_plot

Point plot and bar plot with the “dust” theme applied.

“ggthemr” colour themes

There are 17 built-in colour themes with ggthemr, each one providing a unique way to change your ggplot figure look. All are listed on the ggthemr github page. The colour themes available built-in are:

  • flat
  • flat-dark
  • camoflauge
  • chalk
  • copper
  • dust
  • earth
  • fresh
  • grape
  • grass
  • greyscale
  • light
  • lilac
  • pale
  • sea
  • sky
  • solarized

One benefit of using ggthemr is that the default color palettes are replaced for lines and bars – completely changing the look of the charts. A complete view of all themes is available here, and some random examples are shown below.

“Dust” look and feel across charts

“Earth” look and feel colour palette for box plot, scatter diagram, and bar chart.

“Flat” look and feel from ggthemr for various chart types

Further examples: a scatter plot with “grass”, a box plot with “pale”, a box plot and bar chart with “solarized”, and a bar chart and scatter diagram with “dust”.

Custom palettes with ggthemr

If you’re in a working environment that has it’s own custom palette, for example a consultancy or accounting firm, it’s great to have ggplot2 figures match your document templates and power point files.

Ggthemr allows you to specify a custom theme palette that can be applied easily. Imagine we worked for “Tableau“, the data visualisation and business intelligence platform.

To define the custom theme, get the colour scheme for tableau figures in hex, choose a base theme, then define the swatch for ggthemr:

tableau_colours <- c('#1F77B4', '#FF7F0E', '#2CA02C', '#D62728', '#9467BD', '#8C564B', '#CFECF9', '#7F7F7F', '#BCBD22', '#17BECF')
# you have to add a colour at the start of your palette for outlining boxes, we'll use a grey:
tableau_colours <- c("#555555", tableau_colours)
# remove previous effects:
ggthemr_reset()
# Define colours for your figures with define_palette
tableau <- define_palette(
 swatch = tableau_colours, # colours for plotting points and bars
 gradient = c(lower = tableau_colours[1L], upper = tableau_colours[2L]), #upper and lower colours for continuous colours
 background = "#EEEEEE" #defining a grey-ish background 
)
# set the theme for your figures:
ggthemr(tableau)
# Create plots with familiar tableau look
point_plot
bar_plot
box_plot

Bar plot, scatter plot, and box plot using the ggthemr-defined Tableau palette.

Along with the swatch, gradient, and background elements of the figures, define_palette() also accepts specification of the figure text colours, line colours, and minor and major gridlines.

Additional figure controls

There are three additional controls in the ggthemr package allowing further figure adjustments:

  • “Type” controls whether the background colour spills over the entire plot area, or just the axes section, options inner or outer
  • “Spacing” controls the padding between the axes and the axis labels / titles, options 0,1,2.
  • “Layout” controls the appearance and position of the axes and gridlines – options clean, clear, minimal, plain, scientific.

For example, a new figure look can be created with:

ggthemr("earth", type="outer", layout="scientific", spacing=2)
point_plot

“Earth” themed geom_point() with a scientific layout, extra spacing, and outer type specified for ggthemr.

Manually customising a ggplot2 theme

If ggthemr isn’t doing it for you, the in-built ggplot2 theming system is completely customisable. There’s an extensive system in ggplot2 for changing every element of your plots – all defined using the theme() function. For example, the theme_grey() theme is defined as:

theme_grey <- function (base_size = 11, base_family = "") 
{
 half_line <- base_size/2
 theme(line = element_line(colour = "black", size = 0.5, linetype = 1, 
 lineend = "butt"), rect = element_rect(fill = "white", 
 colour = "black", size = 0.5, linetype = 1), text = element_text(family = base_family, 
 face = "plain", colour = "black", size = base_size, lineheight = 0.9, 
 hjust = 0.5, vjust = 0.5, angle = 0, margin = margin(), 
 debug = FALSE), axis.line = element_line(), axis.line.x = element_blank(), 
 axis.line.y = element_blank(), axis.text = element_text(size = rel(0.8), 
 colour = "grey30"), axis.text.x = element_text(margin = margin(t = 0.8 * 
 half_line/2), vjust = 1), axis.text.y = element_text(margin = margin(r = 0.8 * 
 half_line/2), hjust = 1), axis.ticks = element_line(colour = "grey20"), 
 axis.ticks.length = unit(half_line/2, "pt"), axis.title.x = element_text(margin = margin(t = 0.8 * 
 half_line, b = 0.8 * half_line/2)), axis.title.y = element_text(angle = 90, 
 margin = margin(r = 0.8 * half_line, l = 0.8 * half_line/2)), 
 legend.background = element_rect(colour = NA), legend.margin = unit(0.2, 
 "cm"), legend.key = element_rect(fill = "grey95", 
 colour = "white"), legend.key.size = unit(1.2, "lines"), 
 legend.key.height = NULL, legend.key.width = NULL, legend.text = element_text(size = rel(0.8)), 
 legend.text.align = NULL, legend.title = element_text(hjust = 0), 
 legend.title.align = NULL, legend.position = "right", 
 legend.direction = NULL, legend.justification = "center", 
 legend.box = NULL, panel.background = element_rect(fill = "grey92", 
 colour = NA), panel.border = element_blank(), panel.grid.major = element_line(colour = "white"), 
 panel.grid.minor = element_line(colour = "white", size = 0.25), 
 panel.margin = unit(half_line, "pt"), panel.margin.x = NULL, 
 panel.margin.y = NULL, panel.ontop = FALSE, strip.background = element_rect(fill = "grey85", 
 colour = NA), strip.text = element_text(colour = "grey10", 
 size = rel(0.8)), strip.text.x = element_text(margin = margin(t = half_line, 
 b = half_line)), strip.text.y = element_text(angle = -90, 
 margin = margin(l = half_line, r = half_line)), strip.switch.pad.grid = unit(0.1, 
 "cm"), strip.switch.pad.wrap = unit(0.1, "cm"), plot.background = element_rect(colour = "white"), 
 plot.title = element_text(size = rel(1.2), margin = margin(b = half_line * 
 1.2)), plot.margin = margin(half_line, half_line, 
 half_line, half_line), complete = TRUE)
}

To create a theme of your own – you can change the values in this function definition, and add it to your plot as so:

ggplot(...) + theme_your_own() # adding a theme to a figure

As you can see, it’s not for the faint-hearted, so for your quick fix, I’d recommend looking at ggthemr, and I wish you the best with beautiful plots in your next presentation! Let us know how things go!

Selecting DataFrame rows and columns using iloc, loc, and ix in Pandas

There are multiple ways to select and index rows and columns from Pandas DataFrames. I find tutorials online focusing on advanced selections of row and column choices a little complex for my requirements.

There’s three main options to achieve the selection and indexing activities in Pandas, which can be confusing. The three selection cases and methods covered in this post are:

  1. Selecting data by row numbers (.iloc)
  2. Selecting data by label or by a conditional statement (.loc)
  3. Selecting in a hybrid approach (.ix)

This blog post, inspired by other tutorials, describes selection activities with these operations. The tutorial is suited for the general data science situation where, typically I find myself:

  1. Each row in your data frame represents a data sample.
  2. Each column is a variable, and is usually named. I rarely select columns without their names.
  3. I need to quickly and often select relevant rows from the data frame for modelling and visualisation activities.

For the uninitiated, the Pandas library for Python provides high-performance, easy-to-use data structures and data analysis tools for handling tabular data in “series” and in “data frames”. It’s brilliant at making your data processing easier and I’ve written before about grouping and summarising data with Pandas.

Summary of iloc and loc methods discussed in this blog post. iloc and loc are operations for retrieving data from Pandas dataframes.

Selection and Indexing Methods for Pandas DataFrames

For these explorations we’ll need some sample data – I downloaded the uk-500 sample data set from www.briandunning.com. This data contains artificial names, addresses, companies and phone numbers for fictitious UK characters. To follow along, you can download the .csv file here. Load the data as follows (the diagrams here come from a Jupyter notebook in the Anaconda Python install):

View the code on Gist.
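The load itself is straightforward – a minimal sketch, with the local file name (“uk-500.csv”) being an assumption:

import pandas as pd

# Read the uk-500 sample data into a DataFrame (file name assumed).
data = pd.read_csv("uk-500.csv")
data.head()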

Example data loaded from CSV file.

1. Selecting pandas data using “iloc”

The iloc indexer for Pandas Dataframe is used for integer-location based indexing / selection by position.

The iloc indexer syntax is data.iloc[<row selection>, <column selection>], which is sure to be a source of confusion for R users. “iloc” in pandas is used to select rows and columns by number, in the order that they appear in the data frame. You can imagine that each row has a row number from 0 to the total rows (data.shape[0])  and iloc[] allows selections based on these numbers. The same applies for columns (ranging from 0 to data.shape[1] )

There are two “arguments” to iloc – a row selector, and a column selector.  For example:

View the code on Gist.
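As a rough illustration of single selections by position (a sketch, not the original gist):

# Single selections by integer position.
data.iloc[0]       # first row of the data frame, returned as a Series
data.iloc[-1]      # last row
data.iloc[:, 0]    # first column, all rows
data.iloc[:, -1]   # last column, all rows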

Multiple columns and rows can be selected together using the .iloc indexer.

View the code on Gist.
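Multiple selections by position might look like the following (again an illustrative sketch):

# Multiple row and column selections by integer position.
data.iloc[0:5]             # first five rows
data.iloc[:, 0:2]          # first two columns, all rows
data.iloc[[0, 3, 6], 0:2]  # rows 0, 3 and 6, first two columns
data.iloc[0:5, 5:8]        # first five rows, columns 5 to 7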

There’s two gotchas to remember when using iloc in this manner:

  1. Note that .iloc returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected. To counter this, pass a single-valued list if you require DataFrame output.

    When using .loc, or .iloc, you can control the output format by passing lists or single values to the selectors.

  2. When selecting multiple columns or multiple rows in this manner, remember that the selection, e.g. [1:5], runs from the first number to one minus the second number: [1:5] selects rows/columns 1, 2, 3, and 4, and in general [x:y] goes from x to y-1.

In practice, I rarely use the iloc indexer, unless I want the first ( .iloc[0] ) or the last ( .iloc[-1] )  row of the data frame.

2. Selecting pandas data using “loc”

The Pandas loc indexer can be used with DataFrames for two different use cases: a) selecting rows by label/index, and b) selecting rows with a boolean/conditional lookup.

The loc indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>].

2a. Label-based / Index-based indexing using .loc

Selections using the loc method are based on the index of the data frame (if any). Where the index is set on a DataFrame, using df.set_index(), the .loc method directly selects based on index values of any rows. For example, setting the index of our test data frame to the persons “last_name”:

View the code on Gist.
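The indexing step itself is a single call – a sketch, using the last_name column of the sample data:

# Use the last_name column as the DataFrame index (modifies data in place).
data.set_index("last_name", inplace=True)
data.head()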

Sample data frame with “last_name” set as the index using .set_index().

Now with the index set, we can directly select rows for different “last_name” values using .loc[<label>]  – either singly, or in multiples. For example:

Selecting single or multiple rows using .loc index selections with pandas. Note that the first example returns a series, and the second returns a DataFrame. You can achieve a single-column DataFrame by passing a single-element list to the .loc operation.

Select columns with .loc using the names of the columns. In most of my data work, typically I have named columns, and use these named selections.

When using the .loc indexer, columns are referred to by names using lists of strings, or “:” slices.

You can select ranges of index labels – the selection data.loc[‘Bruch’:’Julio’] will return all rows in the data frame between the index entries for “Bruch” and “Julio”. The following examples should now make sense:

View the code on Gist.
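Illustrative label-based selections might look like the following sketch – the specific surnames are assumptions, chosen only to show the syntax:

# Label-based selections on the last_name index (surnames are illustrative).
data.loc['Andrade']                            # a single label, returned as a Series
data.loc[['Andrade', 'Veness']]                # a list of labels, returned as a DataFrame
data.loc['Bruch':'Julio']                      # a slice of labels, inclusive of both endpoints
data.loc[['Andrade', 'Veness'], ['first_name', 'email']]   # chosen rows and named columns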

Note that in the last example, data.loc[487] (the row with index value 487) is not equal to data.iloc[487] (the 487th row in the data). The index of the DataFrame can be out of numeric order, and/or a string or multi-value.

2b. Boolean / Logical indexing using .loc

Conditional selections with boolean arrays using data.loc[<selection>] is the most common method that I use with Pandas DataFrames. With boolean indexing or logical selection, you pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values.

In most use cases, you will make selections based on the values of different columns in your data set.

For example, the statement data[‘first_name’] == ‘Antonio’ produces a Pandas Series with a True/False value for every row in the ‘data’ DataFrame, where there are “True” values for the rows where the first_name is “Antonio”. These boolean arrays can be passed directly to the .loc indexer as so:
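A minimal sketch of that pattern (column names follow the uk-500 sample data):

# Boolean indexing: build a True/False Series and pass it to .loc.
antonio_rows = data['first_name'] == 'Antonio'
data.loc[antonio_rows]                      # all rows where first_name is Antonio
data.loc[antonio_rows, ['city', 'email']]   # the same rows, selected columns only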

Using a boolean True/False series to select rows in a pandas data frame – all rows with first name of “Antonio” are selected.

As before, a second argument can be passed to .loc to select particular columns out of the data frame. Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice “:” operation.

Selecting multiple columns with loc can be achieved by passing column names to the second argument of .loc[]

Note that when selecting columns, if one column only is selected, the .loc operator returns a Series. For a single column DataFrame, use a one-element list to keep the DataFrame format, for example:

If selections of a single column are made as a string, a series is returned from .loc. Pass a list to get a DataFrame back.

Make sure you understand the following additional examples of .loc selections for clarity:

View the code on Gist.
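For example, selections of this sort (a sketch; the column values used are illustrative):

# Combining conditions and other common .loc selections.
data.loc[data['city'] == 'London', 'email']                               # one column for matching rows
data.loc[(data['city'] == 'London') & (data['first_name'] == 'Antonio')]  # AND of two conditions
data.loc[data['city'].isin(['London', 'Leeds', 'Bristol'])]               # membership test with isin
data.loc[data['email'].str.endswith('.com'), ['first_name', 'email']]     # string-based condition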

Logical selections and boolean Series can also be passed to the generic [] indexer of a pandas DataFrame and will give the same results: data.loc[data[‘id’] == 9] == data[data[‘id’] == 9] .

3. Selecting pandas data using ix

The ix[] indexer is a hybrid of .loc and .iloc. Generally, ix is label based and acts just as the .loc indexer. However, .ix also supports integer type selections (as in .iloc) where passed an integer. This only works where the index of the DataFrame is not integer based. ix will accept any of the inputs of .loc and .iloc.

This hybrid behaviour is slightly more complex, and I prefer to use .iloc and .loc explicitly to avoid unexpected results.

As an example:

View the code on Gist.
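A brief sketch of the hybrid behaviour, assuming the last_name index set earlier – note that .ix was deprecated and later removed in newer pandas versions, so treat this as illustrative only:

# .ix accepted both labels and integer positions (deprecated in later pandas versions).
data.ix['Andrade']                         # behaves like .loc when given a label
data.ix[['Andrade', 'Veness'], 'email']    # labels for rows, a label for the column
data.ix[0:3]                               # falls back to positional slicing on a non-integer index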

 

Setting values in DataFrames using .loc

With a slight change of syntax, you can actually update your DataFrame in the same statement as you select and filter using .loc indexer. This particular pattern allows you to update values in columns depending on different conditions. The setting operation does not make a copy of the data frame, but edits the original data.

As an example:

View the code on Gist.
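A minimal sketch of the pattern – the values written here are made up for illustration:

# Update a column for rows that match a condition, editing the original frame in place.
data.loc[data['first_name'] == 'Antonio', 'email'] = 'antonio@example.com'
# Or populate a new flag column for matching rows (non-matching rows get NaN).
data.loc[data['city'] == 'London', 'is_london'] = True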

That’s the basics of indexing and selecting with Pandas. If you’re looking for more, take a look at the .iat, and .at operations for some more performance-enhanced value accessors in the Pandas Documentation and take a look at selecting by callable functions for more iloc and loc fun.
