In the last article, we learned to pull Google Analytics data via the API in R and turn it into readable charts that give us a summary-level view of channel performance.
Today, to continue the exploratory analysis theme, we're going to take this further by looking for trends in the data and identifying patterns and seasonality. Why would we do that? Well, most of the time, the behavior of website visitors isn't uniform across the time of day, the day of the week, through the month, or through the year.
It's only natural that some websites get more traffic on specific days or in specific months – not only because you might be running paid ads in specific time slots to optimize acquisition costs, but also because there's a limited number of Jacks and Jills who would rather browse your website at 4 in the morning than be slumbering.
Since there is natural variance in website behavior, we can make better business decisions if we know what that variance looks like. When does our site get the most visitors? Does it also get the most conversions then? Is there a trend we had no idea about? We're going to answer those questions today.
Please note: everything below assumes you've done your homework and have:
- R and RStudio installed;
- the GA API enabled for your Google account.
If you haven't, I highly recommend starting with the first article and then moving on to this one.
Getting your tools ready
As with anything, we need to prepare our tools first. Let's load all the packages that we're going to need.
library(googleAnalyticsR)
library(tidyverse)
library(lubridate)
library(stringr)
library(forcats)
library(ggthemes)
library(scales)
In case you don't have some of these packages, you can always install the missing ones by running install.packages():
install.packages('googleAnalyticsR', dependencies = TRUE)
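If you'd rather check and install everything in one go, a small helper works too – a minimal sketch, where the package list simply mirrors the library() calls above:

needed = c("googleAnalyticsR", "tidyverse", "lubridate", "stringr", "forcats", "ggthemes", "scales")
missing = needed[!needed %in% installed.packages()[, "Package"]] # which ones aren't installed yet
if (length(missing) > 0) install.packages(missing, dependencies = TRUE)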
Don't forget to authorize by running ga_auth().
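If you don't remember your view ID, googleAnalyticsR can also list every view your account has access to. A minimal sketch – ga_account_list() returns one row per view, and its viewId column is what we'll plug into the pull below (the column names here follow googleAnalyticsR's account list output):

ga_auth() # opens a browser window for the OAuth flow
ga_account_list() %>%
  select(accountName, webPropertyName, viewName, viewId)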
Pulling the data
As I said in the beginning, we're going to be exploring Google Analytics data for trends in time. Let's start with the variance throughout the day – the by-hour breakdown.
In the snippet below, I'm pulling the past 90 days' worth of data – you may change that to whatever date range you are exploring. Also notice how we're specifying our date range using two separate variables – we're going to spice things up with a little bit of automation in a few minutes.
start_date = today() - 90 # 90 full days of data...
end_date = today() - 1 # ...ending yesterday
trends_hourly = google_analytics(
  viewId = 1234567, # replace this with your view ID
  date_range = c(
    start_date,
    end_date
  ),
  dimensions = c("hour", "date", "dayOfWeekName", "deviceCategory"),
  metrics = c("sessions", "users", "transactions", "transactionRevenue"),
  anti_sample = TRUE # chunk the request so Google doesn't return sampled data
)
Your trends_hourly data frame should now contain the data for sessions, users, sales and revenue.
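If you want a quick sanity check that the pull worked, glimpse() (it ships with the tidyverse) prints every column with its type and the first few values:

glimpse(trends_hourly) # one row per hour / date / day-of-week / device combination

Now, let's transform the data just a tiny bit.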
trends_hourly_clean = trends_hourly %>%
  transmute(
    hour = hour %>% as.numeric(),
    date,
    day = dayOfWeekName %>%
      as_factor() %>%
      fct_relevel("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
    device = deviceCategory %>% as_factor(),
    sessions,
    users,
    sales = transactions,
    revenue = transactionRevenue
  )
Okay, now we've got ourselves a data frame that contains the data on traffic and key events (in our case those are sales, but you can pull events instead), segmented by device. Let's do some exploring now.
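In fact, we can already answer the "when do we get the most visitors?" question numerically, before plotting anything. A minimal sketch that averages users by device and hour and shows the busiest combinations (avg_users is just a name I made up):

trends_hourly_clean %>%
  group_by(device, hour) %>%
  summarise(avg_users = mean(users), .groups = "drop") %>%
  arrange(desc(avg_users)) %>%
  head(10) # the ten busiest device-hour combinations

A table like this is handy for a quick answer, but the patterns are much easier to take in as a chart.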
Plotting the traffic throughout the day
We've done some beautification to our plots already, so I'm going to go ahead and put together a snippet that will generate a really nice chart of users by hour of the day.
trends_hourly_clean %>%
  ggplot(
    aes(
      x = hour,
      y = users,
      color = device
    )
  ) +
  geom_jitter(alpha = 1/10) +
  geom_smooth(alpha = 1/5, se = FALSE) +
  theme_minimal(base_family = "Helvetica Neue") +
  theme(legend.position = "top") +
  scale_x_continuous(name = "Hour of day", breaks = seq(0, 23, 6), minor_breaks = 0:23) +
  scale_y_continuous(name = "Users", limits = c(0, NA)) +
  scale_color_few("Light", name = "Device:") +
  ggtitle(
    "Users by hour of day, breakdown by device",
    str_c(
      "Based on Google Analytics data from ",
      start_date,
      " to ",
      end_date
    )
  )
Now, let's take a look at what happened.
First, we've plotted each individual data point using geom_jitter() – those are the faint points in the background. You can make them more visible by increasing the alpha parameter in the code from its current 1/10. Dot color denotes one of the three possible devices – mobile, desktop or tablet.
The reason we've used geom_jitter() instead of geom_point() is that the latter is prone to over-plotting, which makes the chart harder to read – in my opinion. The former, on the other hand, plots each data point slightly displaced from its precise value, producing those fuzzy 'clouds' of points, which are easier to comprehend at a glance. The two geoms are pretty interchangeable most of the time, but you should always be wary of geom_jitter()'s lack of precision. If you want to display data points at their exact locations, pick geom_point() every time.
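By the way, one way to get the best of both geoms is to constrain the jitter. Since hour is the discrete axis here, we can displace the points horizontally only and keep the users values exact – a minimal sketch of the replacement layer:

geom_jitter(alpha = 1/10, width = 0.3, height = 0) # jitter along x only; y stays precise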
The lines, in turn, represent a smoothed estimate of the number of people that have visited your website from each of the devices in any given hour. You can read more about how geom_smooth() does the calculations on its official documentation page.
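One thing worth knowing: by default, geom_smooth() picks the smoothing method automatically based on group size (loess for smaller groups, a GAM for larger ones) and prints a message telling you which one it chose. If you want the choice to be explicit and reproducible, name it yourself – a minimal sketch of the adjusted layer:

geom_smooth(method = "loess", alpha = 1/5, se = FALSE) # force loess instead of the automatic pick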
From this chart alone, we can see that:
- people tend to visit the website from desktop devices most of the time;
- neither mobile nor tablet traffic is significantly high at any time;
- the traffic is, quite expectedly, heaviest during the workday.
That's nice, but you might have already noticed something sketchy on the plot, visible most clearly in the desktop traffic. Namely, the dots (geom_jitter()) are scattered all around the line (geom_smooth()), yet the line almost never lies on the point 'clouds' themselves. It's almost as if the line is being pulled by several trends at once, averaging out into a trend that represents non-existent data.
Let's segment this by adding a simple line in the end of our plot snippet:
trends_hourly_clean %>%
  ggplot(
    aes(
      x = hour,
      y = users,
      color = device
    )
  ) +
  geom_jitter(alpha = 1/10) +
  geom_smooth(alpha = 1/5, se = FALSE) +
  theme_minimal(base_family = "Helvetica Neue") +
  theme(legend.position = "top") +
  scale_x_continuous(name = "Hour of day", breaks = seq(0, 23, 6), minor_breaks = 0:23) +
  scale_y_continuous(name = "Users", limits = c(0, NA)) +
  scale_color_few("Light", name = "Device:") +
  ggtitle(
    "Users by hour of day, breakdown by device",
    str_c(
      "Based on Google Analytics data from ",
      start_date,
      " to ",
      end_date
    )
  ) +
  facet_wrap( ~ day, ncol = 1) # that is our magic line
What we've got now is a much clearer picture – we can see that there is indeed a difference between workday and weekend traffic. There is still some variance in the data points, but it's nowhere near the level it was on the averaged plot we started with.
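Since the workday panels look so alike, a natural next step is collapsing the seven day levels into two. Here's a minimal sketch using forcats' fct_collapse() on the data frame we built earlier (daytype is just a name I picked) – you'd then facet on daytype instead of day:

trends_hourly_clean %>%
  mutate(
    daytype = fct_collapse(
      day,
      workday = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
      weekend = c("Saturday", "Sunday")
    )
  )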
Final words
Seeing how similar the weekdays are, you can probably look at them as one, and do the same for the weekends – the fct_collapse() sketch above is one way to do exactly that. There's also a ton of other segmentations you could do that are specific to the type of website and business you're looking at. Here's something you can play with on your own (a sketch for the first one follows the list):
- new vs. returning users;
- traffic channels;
- paid user vs. trial user vs. someone with no account (you'll need to set up custom dimensions for that one);
- OS and browser;
- screen resolution.
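For instance, new vs. returning users is a single extra dimension in the pull. A minimal sketch, assuming the same view and date range as before – userType is the standard Google Analytics dimension for this, with the values 'New Visitor' and 'Returning Visitor':

trends_by_usertype = google_analytics(
  viewId = 1234567, # replace this with your view ID
  date_range = c(start_date, end_date),
  dimensions = c("hour", "userType"),
  metrics = c("sessions", "users"),
  anti_sample = TRUE
)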
Plotting those is going to give you some insight into how people are using your website, and hopefully feed your ideation for testing and website improvement.
This post is part of my R series. In the next posts, we'll look at landing page performance and automated device insights, and explore the possibilities for split-test analysis straight through the API. Stay tuned!