Hands-on Exercise 6.1
1 Visualising and Analysing Time-oriented Data
1.1 Learning Outcome
By the end of this hands-on exercise, you will be able to create the following data visualisations by using R packages:
plotting a calendar heatmap by using ggplot2 functions,
plotting a cycle plot by using ggplot2 functions,
plotting a slopegraph, and
plotting a horizon chart.
1.2 Getting Started
1.2.1 Installing and launching R packages
In this hands-on exercise, the following R packages will be installed and launched: scales, viridis, lubridate, ggthemes, gridExtra, readxl, knitr, data.table, CGPfunctions, ggHoriPlot and tidyverse.
The code chunk below uses p_load() of the pacman package to install any missing packages and load them all:
pacman::p_load(scales, viridis, lubridate, ggthemes,
               gridExtra, readxl, knitr, data.table,
               CGPfunctions, ggHoriPlot, tidyverse)
1.3 Plotting Calendar Heatmap
In this section, you will learn how to plot a calendar heatmap programmatically by using the ggplot2 package.
By the end of this section, you will be able to:
plot a calendar heatmap by using ggplot2 functions and extensions,
write a function in R,
derive specific date- and time-related fields by using base R and the lubridate package, and
perform data preparation tasks by using the tidyr and dplyr packages.
1.3.1 The Data
For the purpose of this hands-on exercise, eventlog.csv file will be used. This data file consists of 199,999 rows of time-series cyber attack records by country.
1.3.2 Importing the data
First, use the code chunk below to import the eventlog.csv file into the R environment and name the resulting data frame attacks.
attacks <- read_csv("data/eventlog.csv")
Rows: 199999 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): source_country, tz
dttm (1): timestamp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
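The import step can be tried without the data file. The sketch below is only an illustration: it feeds read_csv() two records copied from the sample rows displayed in this exercise as literal text (wrapped in I(), which assumes readr 2.0 or later) and uses show_col_types = FALSE to quiet the column specification message. The names sample_csv and sample_attacks are hypothetical.

```r
library(readr)

# Two records copied from the sample rows displayed in this exercise.
# I() marks the string as literal data rather than a file path.
sample_csv <- I("timestamp,source_country,tz
2015-03-12 15:59:16,CN,Asia/Shanghai
2015-03-12 16:00:48,FR,Europe/Paris
")

# show_col_types = FALSE suppresses the column specification message
sample_attacks <- read_csv(sample_csv, show_col_types = FALSE)

# The timestamp column is guessed as a date-time (POSIXct) column
str(sample_attacks$timestamp)
```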
1.3.3 Examining the data structure
It is always a good practice to examine the imported data frame before further analysis is performed.
For example, kable() can be used to review the structure of the imported data frame.
There are three columns, namely timestamp, source_country and tz.
The timestamp field stores date-time values in POSIXct format.
The source_country field stores the source of the attack as an ISO 3166-1 alpha-2 country code.
The tz field stores the time zone of the source IP address.
kable(head(attacks))
timestamp | source_country | tz |
---|---|---|
2015-03-12 15:59:16 | CN | Asia/Shanghai |
2015-03-12 16:00:48 | FR | Europe/Paris |
2015-03-12 16:02:26 | CN | Asia/Shanghai |
2015-03-12 16:02:38 | US | America/Chicago |
2015-03-12 16:03:22 | CN | Asia/Shanghai |
2015-03-12 16:03:45 | CN | Asia/Shanghai |
1.3.4 Data Preparation
Step 1: Deriving weekday and hour of day fields
Before we can plot the calendar heatmap, two new fields, namely wkday and hour, need to be derived. In this step, we will write a function to perform the task.
make_hr_wkday <- function(ts, sc, tz) {
  # Parse the timestamps in the time zone of the source IP address
  real_times <- ymd_hms(ts,
                        tz = tz[1],
                        quiet = TRUE)
  # Keep the country and derive the weekday name and hour of day
  dt <- data.table(source_country = sc,
                   wkday = weekdays(real_times),
                   hour = hour(real_times))
  return(dt)
}
ymd_hms() and hour() are from the lubridate package, and weekdays() is a base R function.
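To see what the two derived fields look like, the sketch below applies the same parsing calls to two made-up timestamps (they are not taken from eventlog.csv). It assumes a single time zone for all values, which is why the function above passes only tz[1]: ymd_hms() accepts one time zone per call.

```r
library(lubridate)

# Hypothetical timestamps, read as local clock times in the zone given below
ts <- c("2015-03-12 15:59:16", "2015-03-14 02:05:00")
real_times <- ymd_hms(ts, tz = "Asia/Shanghai", quiet = TRUE)

derived <- data.frame(
  wkday = weekdays(real_times),         # base R; names follow the session locale
  hour  = lubridate::hour(real_times)   # lubridate; hour of day, 0 to 23
)
print(derived)
```

The hour is written as lubridate::hour() here because data.table also exports a function called hour(), so the explicit prefix avoids depending on package load order.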
Step 2: Deriving the attacks tibble data frame
wkday_levels <- c('Saturday', 'Friday',
                  'Thursday', 'Wednesday',
                  'Tuesday', 'Monday',
                  'Sunday')

attacks <- attacks %>%
  group_by(tz) %>%
  do(make_hr_wkday(.$timestamp,
                   .$source_country,
                   .$tz)) %>%
  ungroup() %>%
  mutate(wkday = factor(
           wkday, levels = wkday_levels),
         hour = factor(
           hour, levels = 0:23))
Besides extracting the necessary data into the attacks data frame, mutate() of the dplyr package is used to convert the wkday and hour fields into factors so that they are ordered as intended when plotted.
The table below shows the tidy tibble after processing.
kable(head(attacks))
tz | source_country | wkday | hour |
---|---|---|---|
Africa/Cairo | BG | Saturday | 20 |
Africa/Cairo | TW | Sunday | 6 |
Africa/Cairo | TW | Sunday | 8 |
Africa/Cairo | CN | Sunday | 11 |
Africa/Cairo | US | Sunday | 15 |
Africa/Cairo | CA | Monday | 11 |
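The effect of giving wkday explicit factor levels can be checked on a small vector. The sketch below uses hypothetical values in arbitrary order and shows that the factor follows the order of wkday_levels rather than alphabetical order. Since ggplot2 places the first level of a discrete y scale at the bottom, Saturday ends up on the bottom row and Sunday on the top row of the heatmap.

```r
wkday_levels <- c("Saturday", "Friday", "Thursday", "Wednesday",
                  "Tuesday", "Monday", "Sunday")

# Hypothetical values, in arbitrary order
x <- factor(c("Monday", "Sunday", "Saturday"), levels = wkday_levels)

as.integer(x)           # position of each value in wkday_levels: 6 7 1
as.character(sort(x))   # sorted by level order: Saturday, Monday, Sunday
```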
1.3.5 Building the Calendar Heatmaps
grouped <- attacks %>%
  count(wkday, hour) %>%
  ungroup() %>%
  na.omit()

ggplot(grouped,
       aes(hour,
           wkday,
           fill = n)) +
  geom_tile(color = "white",
            size = 0.1) +
  theme_tufte(base_family = "Helvetica") +
  coord_equal() +
  scale_fill_gradient(name = "# of attacks",
                      low = "sky blue",
                      high = "dark blue") +
  labs(x = NULL,
       y = NULL,
       title = "Attacks by weekday and time of day") +
  theme(axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Things to learn from the code chunk above:
A tibble data table called grouped is derived by aggregating the attacks by the wkday and hour fields.
A new field called n is derived by using count(), which is shorthand for group_by() followed by a tally of the rows in each group.
na.omit() is used to exclude missing values.
geom_tile() is used to plot tiles (grids) at each x and y position. The color and size arguments are used to specify the border colour and line size of the tiles.
theme_tufte() of the ggthemes package is used to remove unnecessary chart junk. To learn which visual components of the default ggplot2 theme have been excluded, you are encouraged to comment out this line and examine the default plot.
coord_equal() is used to ensure the plot has an aspect ratio of 1:1.
scale_fill_gradient() is used to create a two-colour gradient (low-high).
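The relationship between count() and group_by() can be verified on a toy table. The sketch below uses a hypothetical three-row miniature of the attacks data and compares count() with an explicit group_by() plus summarise(); the names toy, via_count and via_group are illustrative only.

```r
library(dplyr)

# Hypothetical miniature of the attacks data
toy <- tibble(wkday = c("Monday", "Monday", "Sunday"),
              hour  = c(8, 8, 20))

# count() groups by the given columns and stores the row tally in n
via_count <- toy %>% count(wkday, hour)

# The same result written out step by step
via_group <- toy %>%
  group_by(wkday, hour) %>%
  summarise(n = n(), .groups = "drop")

all.equal(as.data.frame(via_count), as.data.frame(via_group))
```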
Because we know that every combination of hour and wkday has values, we can simply count the attacks by hour and wkday and plot the result; there is no need to preprocess the data further.
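If some combinations of weekday and hour were missing, the heatmap would show gaps. A minimal sketch of one way to check for this and fill the gaps with zeros is given below. It uses complete() of tidyr on a hypothetical toy table (2 weekdays by 3 hours) rather than the attacks data; for the full data a complete grid would have 7 x 24 = 168 rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 weekday levels and 3 hour levels
toy <- tibble(
  wkday = factor(c("Mon", "Mon", "Tue"), levels = c("Mon", "Tue")),
  hour  = factor(c(0, 0, 2), levels = 0:2)
)

# Only the observed combinations appear after counting
counts <- toy %>% count(wkday, hour)

# complete() expands factor columns to all their levels;
# the newly added combinations get n = 0
filled <- counts %>%
  complete(wkday, hour, fill = list(n = 0L))

nrow(counts)   # 2 observed cells
nrow(filled)   # 6 cells in the full grid
```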
1.3.6 Building Multiple Calendar Heatmaps
Challenge: Building multiple heatmaps for the top four countries with the highest number of attacks.
1.3.7 Plotting Multiple Calendar Heatmaps
Step 1: Deriving attack by country object
To identify the top four countries with the highest number of attacks, you are required to do the following:
count the number of attacks by country,
calculate the percentage of attacks by country, and
save the results in a tibble data frame.
attacks_by_country <- count(
  attacks, source_country) %>%
  mutate(percent = percent(n/sum(n))) %>%
  arrange(desc(n))
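The sketch below repeats the same three steps on a hypothetical four-row table to show what percent() of the scales package returns. The accuracy = 1 argument is an addition for this illustration so that the rounding is explicit. Note that percent() produces character strings, which is why the rows are sorted on n rather than on the percent column.

```r
library(dplyr)
library(scales)

# Hypothetical toy data: three attacks from CN and one from US
toy <- tibble(source_country = c("CN", "CN", "CN", "US"))

by_country <- count(toy, source_country) %>%
  mutate(percent = percent(n / sum(n), accuracy = 1)) %>%  # character labels
  arrange(desc(n))                                          # sort on the numeric count

by_country
```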
Step 2: Preparing the tidy data frame
In this step, you are required to extract the attack records of the top 4 countries from the attacks data frame and save the data in a new tibble data frame (i.e. top4_attacks).
top4 <- attacks_by_country$source_country[1:4]
top4_attacks <- attacks %>%
  filter(source_country %in% top4) %>%
  count(source_country, wkday, hour) %>%
  ungroup() %>%
  mutate(source_country = factor(
    source_country, levels = top4)) %>%
  na.omit()
1.3.8 Plotting Multiple Calendar Heatmaps
Step 3: Plotting the multiple calendar heatmaps by using the ggplot2 package.
ggplot(top4_attacks,
       aes(hour,
           wkday,
           fill = n)) +
  geom_tile(color = "white",
            size = 0.1) +
  theme_tufte(base_family = "Helvetica") +
  coord_equal() +
  scale_fill_gradient(name = "# of attacks",
                      low = "sky blue",
                      high = "dark blue") +
  facet_wrap(~source_country, ncol = 2) +
  labs(x = NULL, y = NULL,
       title = "Attacks on top 4 countries by weekday and time of day") +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = 7),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6))
1.4 Plotting Cycle Plot
In this section, you will learn how to plot a cycle plot showing the time-series patterns and trend of visitor arrivals from Vietnam programmatically by using ggplot2 functions.
1.4.1 Step 1: Data Import
For the purpose of this hands-on exercise, arrivals_by_air.xlsx will be used.
The code chunk below imports arrivals_by_air.xlsx by using read_excel() of the readxl package and saves it as a tibble data frame called air.
air <- read_excel("data/arrivals_by_air.xlsx")
1.4.2 Step 2: Deriving month and year fields
Next, two new fields called month and year are derived from Month-Year field.
air$month <- factor(month(air$`Month-Year`),
                    levels=1:12,
                    labels=month.abb,
                    ordered=TRUE)
air$year <- year(ymd(air$`Month-Year`))
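The derivation can be tried on a few standalone dates. The sketch below uses three hypothetical dates instead of the Month-Year field and shows that the month becomes an ordered factor labelled with the built-in month.abb abbreviations, while the year is extracted as a number.

```r
library(lubridate)

# Hypothetical dates standing in for the Month-Year field
dates <- ymd(c("2010-01-01", "2010-12-01", "2011-03-01"))

month_f <- factor(month(dates),
                  levels = 1:12,
                  labels = month.abb,
                  ordered = TRUE)

as.character(month_f)     # "Jan" "Dec" "Mar"
month_f[3] < month_f[2]   # TRUE: Mar comes before Dec in the ordered factor
year(dates)               # 2010 2010 2011
```

Because the factor is ordered by calendar position, the cycle plot panels appear from Jan to Dec rather than alphabetically.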
1.4.3 Step 4: Extracting the target country
Next, the code chunk below is used to extract the data for the target country (i.e. Vietnam).
Vietnam <- air %>%
  select(`Vietnam`,
         month,
         year) %>%
  filter(year >= 2010)
1.4.4 Step 5: Computing year average arrivals by month
The code chunk below uses group_by() and summarise() of dplyr to compute, for each month, the average arrivals across years.
hline.data <- Vietnam %>%
  group_by(month) %>%
  summarise(avgvalue = mean(`Vietnam`))
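As a worked example of this aggregation, the sketch below computes the same per-month average on a hypothetical table with two months and two years. The figures are invented for illustration and are not taken from arrivals_by_air.xlsx.

```r
library(dplyr)

# Hypothetical arrivals for two months over two years
toy <- tibble(
  month    = c("Jan", "Jan", "Feb", "Feb"),
  year     = c(2010, 2011, 2010, 2011),
  arrivals = c(10, 20, 30, 50)
)

# One row per month, holding the mean across years
ref <- toy %>%
  group_by(month) %>%
  summarise(avgvalue = mean(arrivals))

ref
```

Each month collapses to a single value (15 for Jan, 40 for Feb), which is what geom_hline() needs to draw one reference line per panel.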
1.4.5 Step 6: Plotting the cycle plot
The code chunk below is used to plot the cycle plot as shown in Slide 12/23.
ggplot() +
  geom_line(data=Vietnam,
            aes(x=year,
                y=`Vietnam`,
                group=month),
            colour="black") +
  geom_hline(aes(yintercept=avgvalue),
             data=hline.data,
             linetype=6,
             colour="red",
             size=0.5) +
  facet_grid(~month) +
  labs(title = "Visitor arrivals from Vietnam by air, Jan 2010-Dec 2019") +
  xlab("") +
  ylab("No. of Visitors") +
  theme_tufte(base_family = "Helvetica") +
  # axis.text.x is a theme() setting rather than a label, so it is set here
  theme(axis.text.x = element_blank())
1.5 Plotting Slopegraph
In this section you will learn how to plot a slopegraph by using R.
Before getting started, make sure that CGPfunctions has been installed and loaded into the R environment. Then, refer to Using newggslopegraph to learn more about the function. Lastly, read more about newggslopegraph() and its arguments in its reference documentation.
1.5.1 Step 1: Data Import
Import the rice data set into the R environment by using the code chunk below.
rice <- read_csv("data/rice.csv")
Rows: 550 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Country
dbl (3): Year, Yield, Production
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1.5.2 Step 2: Plotting the slopegraph
Next, the code chunk below is used to plot a basic slopegraph.
rice %>%
  mutate(Year = factor(Year)) %>%
  filter(Year %in% c(1961, 1980)) %>%
  newggslopegraph(Year, Yield, Country,
                  Title = "Rice Yield of Top 11 Asian Countries",
                  SubTitle = "1961-1980",
                  Caption = "Prepared by: Dr. Kam Tin Seong")
Converting 'Year' to an ordered factor
For effective data visualisation design, factor() is used to convert the value type of the Year field from numeric to factor.
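Since Year is already a factor when filter() runs, it may look odd that it is compared against the numbers 1961 and 1980. The base R sketch below, using three hypothetical years, shows why this works: %in% compares a factor through its labels, so the numeric years still match.

```r
# Hypothetical years converted to a factor, as in the code chunk above
years <- factor(c(1961, 1970, 1980))

levels(years)              # "1961" "1970" "1980"
years %in% c(1961, 1980)   # TRUE FALSE TRUE
```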