Who would've thought analysing UK Post Office data could be so much fun?
April 15, 2018
This post follows my guest article (kindly hosted by The Data Lab) in which I used Alteryx to perform spatial analysis on open-source UK Post Office locations from Datadaptive. In this post I want to explore that dataset further, focussing on tabular data manipulation functions to compare and contrast how we might approach the same problem in Alteryx and Excel.
Part 1 - Getting stuff done in Excel
Why look at Excel when I'm so fond of Alteryx?
It's true that I have something of a soft spot for Alteryx, but I'm not afraid to stick up for Excel either. Most of us will have used Excel long before experimenting with more advanced data science platforms. Excel is ubiquitous in the workplace, it's cheap compared to Alteryx, and few people are intimidated by it. Excel has some nifty features too, it's ideal for lots of tasks, and although Excel-bashing is popular among today's 'enlightened' data analysts, I'm keen that we give Microsoft their due for creating such an iconic and enduring piece of software.
What question am I trying to answer?
The previous blog post looked at Post Office locations; this time I want to look at Post Office opening hours in the freely-downloadable csv file. The table below shows what the data looks like (for more detail on field descriptions and metadata check out Datadaptive's brilliant user guide):
Although I've truncated many of the columns in the snippet above, I'm interested in columns T through AU which contain information on Post Office opening hours. Here is my question: how long is each Post Office open per week? Let's take a closer look at the data structure to find out.
Opening hours data
The unique Post Office ID is contained in the far left field, LOCOUCD. Opening hours are spread over four columns per day, from morning (eg Col B) to evening (col E), with two optional fields (cols C + D) which tell us whether the Post Office is closed over lunch (as it is for IDs 67, 68 and 74 in the snippet below).
This structure is repeated for the other days of the week, so we have 7 * 4 = 28 fields containing opening hours data. We want to summarise this into a single measure which tells us how many hours each Post Office is open per week.
We can get the information we need from Excel in a couple of minutes. The first step is to create a new column for each day of the week. In this example I'm going to use the 'Insert Sheet Columns'command to add a new column to the right of col E. I've called it 'OpenHours' and defined it using the following formula:
It's then easy to copy that formula down into all rows using the handy fill handle, and Excel capably handles the relative cell references to automatically update the formula on each row (for a great explanation of relative vs absolute references, including the often-misunderstood dollar sign $, click here).
You may also notice that Excel already knows we're dealing with time data in the 'hh:mm' format, so we don't need to worry about converting to and from decimal days or hours when adding or subtracting times (although we do need to be very careful when the total hours exceed 24, see formatting advice here).
To replicate the new field for the other days of the week I can just copy the new column and paste it using the 'Insert Copied Cells' command which neatly shunts everything to the right whilst preserving relative references. Finally, I can sum up the seven totals (one for each day of the week) to get the total hours per Post Office. I've animated those steps in the video below - for a bigger version click here.
You can download my finished spreadsheet and graph here (4.5 MB).
The graph below shows a count of Post Offices by the total number of opening hours per week. The distribution is intriguing, with a clear modal value of 46 hours: 1,056 Post Offices are open for 46 hours a week, which equates to 0900 to 1730 Monday to Friday, plus 0900 to 1230 on Saturdays. I have darkened the 46 hour bar in the chart below (which, surprise surprise, was plotted in Excel):
There are quite a few Post Offices with very limited opening hours, including a number of mobile van-based services which can pop up for as little as half an hour a week in various locations. At the other end of the scale, and impossible to see on the graph above, are two Post Offices which open just under 168 hours a week: New Oxford Street and Birmingham Airport. In case you're wondering, 7 * 24 = 168 so yes, these are in fact 24-hour Post Offices!
So are we done now? No, of course not! Read on to discover how we might approach the same problem in Alteryx, followed by some thoughts on which option is 'better'.
Part 2 - Attacking the same problem in Alteryx
My Alteryx workflow for recreating the analysis above is show below, and you can download it here to unzip and play with. Recall from my previous post that Alteryx workflows are self-documenting recipes for preparing, blending, analysing and outputting your data. Once created you rerun your workflow again and again. Source data recently updated? No problem, just rerun the module. Changes required? No problem, just tweak the module and rerun it. Alteryx is all about repeatability because, in the long run, this saves time, mistakes and hassle.
Whoa! That looks like a lot of steps for something that took 2 mins in Excel!
So I'm going to come clean: this workflow took me 30 mins to develop and debug in Alteryx. The Excel version took two mins, so you might be wondering why I would use Alteryx when it takes 15 times as long? 15x is perhaps a little misleading because I still had to complete a handful of extra steps in Excel to get the results into a graphable format, so let's say Alteryx took 6x longer (30 mins vs 5 mins). Is it worth it? To answer that, we need to understand how the module works, so I'm going to explore each section (A, B and C) in more detail below.
Section A - Input and prepare
Section A brings our data into the workflow and does a little bit of prep and tidying up. There are four steps in this section, and we'll look at each in turn:
Input Data extracts the csv file from the downloaded zip file. That means we don't need to unzip or extract anything before we start because Alteryx natively consumes zip files. If the dataset gets updated frequently we could even use the Download tool to retrieve the latest zip file every time the module runs, but that would rely on the URL of the zip file remaining constant which in this case isn't guaranteed.
Select lets us tidy up our data stream, in this case by dropping unwanted fields and renaming 'LOCOUCD' to 'StoreID'. The Select tool also makes it easy to reorder fields and change data type from, say, string to integer.
Transpose makes light work of pivoting our horizontal data structure into a vertical one, by turning this:
I'll explain the rationale for that un-Excel-like transformation in a second. For the time being, suffice it to say that while we could easily create a new field to summarise each day of the week, this approach wouldn't be as efficient (or future-proof) as transposing all days of the week into a single field.
Formula converts our hh:mm time values (which Alteryx treats as 'dumb' text strings) into the dedicated 'Time' data field type. This will be useful later when we need to add and subtract opening times.
Why didn't you create a new column for each day?
Excel made it quick and easy for us to create and duplicate a new 'total hours' column for each day of the week. In Alteryx we could use the Formula tool to add new fields, which we could define like this:
(TOPMMON-FRAMMON) - (FRPMMON-TOAMMON)
The problem is that we'd need to define a new formula for each day of the week, which in Alteryx is slightly harder than Excel. In this case we'd need to do this 7 times which is hardly the end of the world. But imagine if our opening times were listed by weeks of the year (x52) or even days of the year (x365). In either circumstance it wouldn't make sense to add 52 or 365 new columns when we could transpose our data into a single column instead. A different way of thinking, perhaps, but one which will reap benefits later on.
So where is the equivalent of Excel's fill handle?
To understand the purpose of the transpose tool in Section A we really need to grapple with some fundamental differences between Alteryx and Excel. The latter, as we saw earlier, has some handy time-saving features like the fill handle or automatic updating of relative cell references. These features allow us to derive new fields and copy and paste them with ease.
If you're looking for direct equivalents in Alteryx you're going to be disappointed. That's not to say that Alteryx can't achieve the same results--it can, along with a whole lot more besides--but we need to remember that Excel and Alterx are designed for different purposes.
In Excel all functions, commands and formulas rely on cell references such as A1, B2, C10 etc which specify columns (the alphabetical part) and rows (the numerical part). Alteryx uses absolute field names instead of columns, but it doesn't get hung up on rows because it treats all rows as equals: if you drag a tool onto your workflow, unless told otherwise Alteryx will execute that tool on every row or record in your datastream.
Let's compare that approach with Excel's fill handle for example, which gives us the ability to 'fill' a formula across columns, down rows, or both. In Excel, which tends to be driven by individual commands from the user's mouse or keyboard (I'll discuss VBA macros later), the fill handle saves time because it replicates an initial step many times over. It's still driven by individual commands, however, and so it still relies on manual steps even though it gives the impression of being semi-automated.
Alteryx, by contrast, aims for full automation. Workflows run from top to bottom (or side to side, depending on whether you set the canvas layout to vertical or horizontal)--all they require is one click to set them running, and then (in theory) they shouldn't require any further intervention.
Thus there's a contradiction between the Excel approach, which helps users to speed up manual procedures and make them more manageable, with the Alteryx approach, which encourages users to massage and manipulate their data into a structure that's better suited to fully repeatable automation. This tension is most apparent for new users, particularly those well-versed in Excel, to whom my advice would be this: don't get hung up on trying to do things exactly as you would in Excel,but try to think creatively because the chances are whatever you're trying to do can be done in Alteryx pretty easily.
Watch how I apply that advice in the next two Sections, B and C.
Section B - Calculate Opening Hours
This section takes our tidied and transposed data and calculates opening hours, breaking down the process into 7 discrete steps. As is often the way, there are limitless ways to solve the same problem, and I could probably achieve the same result with fewer tools or more concise syntax if I put my mind to it. But this works fine, it gets the job done, and it allows me to get on with my day. If you can think of a better way then please leave a comment or email me, and don't overlook the Alteryx Community as a consistently friendly and helpful source of inspiration.
RecordID adds a sequential number for each record in the data stream. We'll use this later. Note that there are 325,612 rows of transposed data (which equates to 11,629 stores * 7 days of the week * 4 fields per day). Alteryx takes just over a second to process that each time I hit 'run'.
Another Formula, which calculates a new field called Counter as follows: Mod([RecordID],4) Mod is a bitwise modulo function that gives us the remainder of the RecordID divided by 4. I only want to process every 4th record because we have 4 fields of opening hours per day. Counter will help to accomplish that because every 4th record will have a remainder of 0.
Multi-Row Formula tool allows us to reference values on nearby rows, using Excel-like relative positions. This tool works out how many minutes each store is closed for lunch, if indeed it shuts for lunch at all.
My syntax looks like this: iif([Counter]=0, DateTimeDiff([Row-1:Time],[Row-2:Time],"minutes"),null())
We only run this on every 4th row (where Counter = 0). Row-1 is the row above, while Row-2 is two rows above. You can learn more about the DateTimeDiff function here.
A second Multi-Row Formula calculates the total opening hours per day (including lunch closures which we'll subtract later): iif([Counter]=0, DateTimeDiff([Time],[Row-3:Time],"minutes"),null())
Filter retains records where Counter = 0 (which is passed to the True output) and discards everything else: [Counter] = 0
Multi-Field Formula converts null values to 0. We use the _CurrentField_ variable to apply this formula to more than one field at a time: iif(isnull([_CurrentField_]),0,[_CurrentField_])
Finally, another Formula subtracts lunch closure duration from the morning to evening duration, giving us the number of minutes each store is open per week.
We now have a daily opening duration (in minutes) for each of the 11,629 stores in our dataset, and we've structured our repeatable workflow in such a way that, should the data change in future, we can rerun our analysis in just over a second.
Section C - Summarise results, infill blanks, and export to Excel
This final section might look complicated, but in fact it's pretty straightforward.
We begin with Summarize which applies a 'GroupBy' on StoreID along with a 'Sum' on Open Duration. This returns the total opening duration per store, condensing our 81,403 transposed and filtered records back to the familiar count of 11,629 (one record per store).
A simple Formula converts our opening duration (in minutes) into hours. I used the Floor function to deal with decimal hours, so a result of 10 really represents stores that are open between 10 and 11 hours (>=10 < 11): Floor([Open Duration]/60)
A second Summarize performs a GroupBy and Count on the Hours field. This gives us a count of how many stores are open for each unique entry in the Hours field.
A third Summarize gets the maximum value in the Hours field. In this case it's 167. If the data changed in future I need my max value to update dynamically because it feeds into step 5.
We input the max value into Generate Rows which creates 168 numbered rows (from 0 to our max input 167). Think of these empty rows as the template for the horizontal axis on our graph.
Join matches both streams and performs a 'right outer join' which captures matching hours as well as 'blank' hours. We need to infill these blanks to ensure the horizontal axis on our graph is complete. Without this step, Excel would treat our X axis as categorical so the outliers would appear compressed. By infilling blanks we can trick Excel in giving us a numerical X axis.
For a superbly useful explanation of Alteryx joins vs SQL see here and in particular this diagram here. Suffice it to say that once you've used the Join tool you'll never want to go back to VLOOKUP or INDEX + MATCH in Excel.
Union captures the two distinct outputs from the Join tool and sticks them back together.
Select drops our Max_Hours field which is no longer required.
Formula replaces nulls with zeros. Note that the iif()... syntax is much quicker and more elegant than if then else endif: iif(isnull([Count]),0,[Count])
Ouput exports our results to Excel. Alteryx gives you heaps of output configuration options, and in this case I'm overwriting a hidden sheet in Excel (without affecting the other sheets). My graph will update automatically based on the refreshed data in the hidden sheet. This is a neat trick and a really nice way to update reports without the rigmarole of updating input ranges every day/week/month.
Part 3 - Comparing both approaches
If I only needed to run this task once I'd be quicker using Excel, and that's OK with me. As analysts we need tools that help us get the job done quickly, and in this case Excel provides the tools I need to derive new fields, summarise them and draw a graph. 5 minutes, job done, move onto the next task.
If I needed to run this task every 6 months I might still be better off sticking with Excel, because although I'd need to repeat a few manual steps, that would still be quicker than figuring how to automate things in Alteryx.
But what if I needed to run this task every week? Or even every day? At what point would it become worth my while to invest the time in building an automated repeatable solution? Let's go ahead and work it out. And yes, I'm going to do this in Excel!
Calculating the return on investment
Let's assume that the Excel solution takes 5 mins to run each time, whereas the Alteryx solution takes 30 mins to build the first time, and then 10 secs to load for each subsequent run. We can model both scenarios as straight line 'costs' and then figure out where they intersect:
The gradient of the Excel line is 5, because every iteration takes an extra 5 minutes. The Alteryx line has a gradient of (1/6) because 10 seconds is a sixth of a minute, but we factor in the initial 30 minute investment with our Y intercept of 30. The two lines intersect just beyond 6 iterations, which means that if I'm going to run this task more than 6 times I'm better off doing it in Alteryx. You can download the Excel workings here.
If I were to run the task 10 times I would save 18 mins by adopting Alteryx over Excel. Running it 52 times (once a week for a year) would save 221 mins, and running it once a day (365 times) would save 1,734 mins over the course of a year. That's almost 29 hours, or 4 working days!
So, depending on how often you need to repeat a piece of work, the question surely isn't "Can I afford to build this in Alteryx?" but rather "How can I afford not to build this is Alteryx?"
But what about Excel macros and VBA?
I claimed earlier that most people use Excel in a fairly manual, point-and-click sort of way. Advanced users will know that it's possible to record macros and customise scripts using Office Visual Basic for Applications (VBA).
As a non-programmer and a fairly mediocre script-writer (Python, VB and VBA) I've done enough macro development to know that yes, it's possible, but it's a long, slow and difficult journey. Devising and debugging VBA macros demands patience and a technical aptitude that few 'ordinary' business users possess. Be prepared to spend hours or days digging around in internet forums, blogs and help sites, stepping through lines of code whilst hoping and praying they'll work.Because sometimes they won't, and it can be difficult to figure out why.
The satisfaction you get when an Excel macro does work is immense, but at what cost? Say you spend three hours figuring out how to parse a text file in VBA and perform a VLOOKUP on it: that's a fine achievement,and a justifiably proud technical accomplishment. But if I were to point out that it could be automated in Alteryx in 30 seconds would you still consider those three hours in VBA time well spent?
So yes - generally in Excel VBA it can be done, but I find I can work more quickly and accurately in Alteryx. If you can achieve as much in VBA, SQL, Python or R as I can in Alteryx, for the same amount of time, and with as few errors and bugs, then I respect you immensely.
If on the other hand you're not an expert programmer but you need some analysis done quickly and painlessly, then it might be worth checking out Alteryx. And if you need to repeat that analysis every month/week/day then you should download a free trial asap.
But I don't have time to learn a new application / syntax / language / skill!
If you know, hand on heart, that you're currently working as efficiently and effectively as you could be, then you have no reason to learn any new packages. If there's nothing more for you to learn, if you're at the top of your game, the zenith of your career, then Alteryx isn't likely to make much impact.
For the rest of us, myself included, I think it's important to keep an eye out for new tools and ways of working that enhance productivity and reduce errors. If you feel you're too busy to learn something new, just think how much time you could save if you did manage it. Get your stopwatch out, start recording hours spent on tasks, and start thinking like an accountant. And then ask yourself again if Alteryx might be worth it, because in my experience it generally is.
Are there other advantages to Alteryx over Excel?
Yes, countless. So many, in fact, that I don't know where to start. I'll need another blog post to do the topic justice, but you might find this article by Murphy Troy interesting. For the time being, let me knock it out of the park with the following example.
Here is an animated map of UK Post Office openings over the course of a week, split into 672 x 15-minute chunks. Watch how they come and go over the course of the day, and keep your eyes peeled for the two 24 hour Post Offices (in London and Birmingham):
That's cool but pretty pointless, right?
Yes and no. For sure, the animation above is a gimmick. It doesn't show anything particularly surprising or insightful, and it certainly won't help you find your nearest Post Office. You could argue that all it does is to demonstrate that a) most Post Offices are closed at night, and b) the provision of Post Offices tends to reflect the underlying population of the UK. In other words, big deal, so what?
It might be more helpful to focus on the methodology instead of the content. After all, this entire article is based on some Post Office open data I happened to stumble across. I could've chosen some other whacky topic to analyse, and in future posts I will, so please don't get too hung up on the thematic content here.
Instead, the point I'm trying to make is that with the right tools you can do a lot in a short space of time.Whether it's reformatting tables, summarising fields, automating analysis or creating animations, I can do things in Alteryx which I could barely attempt in other platforms. The visualisation above may be a bit silly, but to me it's still interesting because it's a reminder of how important creativity is to the analytic process. If your software is holding you back then maybe it's time to try something else? Hint hint...
How did you make it, out of interest?
The 672 frames were batch rendered to png files by Alteryx, and then assembled into an mp4 movie using the free and excellent ScreenToGif recorder. My workflow (download it here) is presented below - it's a little more complicated than my previous examples, but in a nutshell it takes all seven days of the week multiplied by 96 15-minute intervals per day, testing each 15-minute 'window' against the opening hours data for every Post Office. In case you're wondering, that's just under 6.5 million conditional tests which takes Alteryx 10 seconds to evaluate.
This process yields a 1 or a 0 depending on whether each Post Office is open or closed at any given time. All the 1s are plotted on the map in red, and all the 0s are ignored. Rinse and repeat 672 times and hey presto, you have a compelling animation.
Alteryx really comes into its own for batch processing, and the workflow above runs the following 'batch macro' once for every record in the input stream:
The macro handles the map rendering and png output, and the entire process takes 3 minutes to render all 672 frames.
That must be really difficult, right?
Not really, no. Look - if I can do it then anyone can. I'm not a programmer or even a particularly gifted analyst. I'm not great at logic and I often struggle to 'think outside the box'. Oh, and I'm terrible at solving anagrams.
But I really love working with data, and in Alteryx I've found a platform that gives me everything I need to solve 99% of the problems that cross my desk. There's no way I could create that animation in Excel VBA let alone Python or R. A talented programmer could, for sure, but as I keep saying, I'm not a programmer, let alone a talented one!
And yet Alteryx allows me to do things which, previously, I couldn't dream of tackling. I have an MSc and PhD in Geographical Information Science yet I'd struggle to tell you how I'd solve this problem in 'conventional' tools such as ArcGIS or QGIS. I could probably get there in the end, but it might take a week or two and the sheer mental effort would exhaust and frustrate me.
In Alteryx, by contrast, I can attack a problem without having any idea how I'll reach the solution, or even what the end result will look like. I tend not to sketch, plan or prepare with Alteryx; rather, I roll my sleeves up and get stuck in. I test ideas, run them, tweak them, fix them and hone them until I have something that works. And then I move onto the next problem. It's an interactive and engaging process which I really enjoy.
Alteryx helps me get stuff done, and makes my job fun too. It puts the thrill back into analytics. Stay tuned for more adventures.
All files used in this blog post can be downloaded here(10 MB). Unzip and extract the folders to anywhere on your hard drive, the Alteryx modules should run wherever they're extracted thanks to this setting. Get your free 14-day Alteryx trial here.