Sometimes getting data off the internet is very, very simple - it’s stored in a format that R can handle and just lives on a server somewhere, or it’s in a more complex format and perhaps part of an API but there’s an R package designed to make using it a piece of cake. This chapter will explore how to download and read in static files, and how to use APIs when pre-existing clients are available.
In this first exercise we’re going to look at reading already-formatted datasets - CSV or TSV files, with which you’ll no doubt be familiar! - into R from the internet. This is a lot easier than it might sound because R’s file-reading functions accept not just file paths, but also URLs.
The URLs to those files are in your R session as csv_url and tsv_url.
# Here are the URLs! As you can see they're just normal strings
csv_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1561/datasets/chickwts.csv"
tsv_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_3026/datasets/tsv_data.tsv"
# Read a file in from the CSV URL and assign it to csv_data
csv_data <- read.csv(csv_url)
# Read a file in from the TSV URL and assign it to tsv_data
tsv_data <- read.delim(tsv_url)
# Examine the objects with head()
head(csv_data)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
head(tsv_data)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
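The two reading functions used above differ mainly in the field separator they expect. As a minimal offline sketch (the inline strings are made-up stand-ins for the remote files, not their real contents), the same two functions can parse text passed through the text argument:

```r
# Made-up stand-ins for the contents of a CSV file and a TSV file
csv_text <- "weight,feed\n179,horsebean\n160,horsebean"
tsv_text <- "weight\tfeed\n179\thorsebean\n160\thorsebean"

# read.csv() splits fields on commas, read.delim() splits them on tabs
csv_demo <- read.csv(text = csv_text)
tsv_demo <- read.delim(text = tsv_text)

# Compare the two parsed data frames
identical(csv_demo, tsv_demo)
```

Swapping the text argument for a file path or a URL leaves the rest of each call unchanged.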
Sometimes just reading the file in from the web is enough, but often you’ll want to store it locally so that you can refer back to it. This also lets you avoid having to spend the start of every analysis session twiddling your thumbs while particularly large files download.
Helpfully, R has download.file(), a function that lets you do just that: download a file to a location of your choice on your computer. It takes two required arguments: url, indicating the URL to read from, and destfile, the destination to write the downloaded file to. In this case, we’ve pre-defined the URL - once again, it’s csv_url.
Download the file at csv_url with download.file(), naming the destination file "feed_data.csv".
Read "feed_data.csv" into R with read.csv().
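One possible way to combine the two steps is sketched below. The helper name fetch_csv is hypothetical (it is not part of base R), and the choice to skip the download when the file already exists is an assumption made here to avoid re-downloading large files in every session.

```r
# Hypothetical helper (not part of base R): download a file once,
# then read the local copy on every later call
fetch_csv <- function(url, destfile) {
  # Only download when no local copy exists yet
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile)
  }
  # Read the local copy into a data frame
  read.csv(destfile)
}

# Usage with the names from the exercise (needs a network connection):
# csv_data <- fetch_csv(csv_url, "feed_data.csv")
```

Delete the local file if you ever need a fresh copy from the server.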
Whether you’re downloading the raw files with download.file() or using read.csv() and its sibling functions, at some point you’ll probably need to modify your input data and then save the modified version to disk so you don’t lose the changes.
You could use write.table(), but then you have to worry about accidentally writing out data in a format R can’t read back in. An easy way to avoid this risk is to use saveRDS() and readRDS(), which save R objects in an R-specific file format with the data structure intact. That means you can use them for any type of R object (even ones that don’t turn into tables easily) without worrying that you’ll lose data when reading it back in. saveRDS() takes two arguments: object, the R object to save, and file, the path to save it to. readRDS() expects file, the path to the RDS file to read in.
In this example we’re going to modify the data you already read in, which is predefined as csv_data, and write the modified version out to a file before reading it in again.
# Add a new column: square_weight
csv_data$square_weight <- csv_data$weight ^ 2
# Save it to disk with saveRDS()
saveRDS(object = csv_data, file = "modified_feed_data.RDS")
# Read it back in with readRDS()
modified_feed_data <- readRDS(file = "modified_feed_data.RDS")
# Examine modified_feed_data
str(modified_feed_data)
## 'data.frame': 71 obs. of 3 variables:
## $ weight : int 179 160 136 227 217 168 108 124 143 140 ...
## $ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ square_weight: num 32041 25600 18496 51529 47089 ...
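The same round trip also works for objects that are not tables. The sketch below uses a made-up nested list (its names and values are illustrative only) and a temporary file, then checks that the object read back matches the one that was saved:

```r
# A made-up nested list: awkward to write out as a table
analysis <- list(
  source  = "chickwts",
  weights = c(179, 160, 136),
  notes   = list(cleaned = TRUE, feeds = c("horsebean", "casein"))
)

# Use a temporary file so the sketch does not clutter the working directory
rds_path <- tempfile(fileext = ".RDS")
saveRDS(object = analysis, file = rds_path)

# Read the object back in and compare it with the original
restored <- readRDS(file = rds_path)
identical(analysis, restored)
```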
API Clients
Using API Clients
pageviews
In the last video you were introduced to Application Programming Interfaces, or APIs, along with their intended purpose (as the computer equivalent of the visible web page that you and I might interact with) and their utility for data retrieval. What are APIs for?
Possible Answers
[x] Making parts of a website available to people.
[x] Making parts of a website available to puppies.
[o] Making parts of a website available to computers.
So we know that APIs are server components that make it easy for your code to interact with a service and get data from it. We also know that R features many “clients” - packages that wrap around connections to APIs so you don’t have to worry about the details.
Let’s look at a really simple API client - the pageviews package, which acts as a client to Wikipedia’s API of pageview data. As with other R API clients, it’s formatted as a package, and lives on CRAN - the central repository of R packages. The goal here is just to show how simple clients are to use: they look just like other R code, because they are just like other R code.
# Load pageviews
library(pageviews)
# Get the pageviews for "Hadley Wickham"
hadley_pageviews <- article_pageviews(project = "en.wikipedia", article = "Hadley Wickham")
# Examine the resulting object
str(hadley_pageviews)
## 'data.frame': 1 obs. of 8 variables:
## $ project : chr "wikipedia"
## $ language : chr "en"
## $ article : chr "Hadley_Wickham"
## $ access : chr "all-access"
## $ agent : chr "all-agents"
## $ granularity: chr "daily"
## $ date : POSIXct, format: "2015-10-01"
## $ views : num 53
API etiquette
Getting access tokens
birdnik
As we discussed in the last video, it’s common for APIs to require access tokens - unique keys that verify you’re authorised to use a service. They’re usually pretty easy to use with an API client.
To show how they work, and how easy it can be, we’re going to use the R client for the Wordnik dictionary and word use service - ‘birdnik’ - and an API token we prepared earlier. Birdnik is fairly simple (I wrote it!) and lets you get all sorts of interesting information about word usage in published works. For example, to get the frequency of the use of the word “chocolate”, you would write:
word_frequency(api_key, "chocolate")
In this exercise we’re going to look at the word “vector” (since it’s a common word in R!) using a pre-existing API key (stored as api_key).
Load birdnik.
Using api_key and word_frequency(), get the frequency of the word "vector" in Wordnik’s database. Assign the results to vector_frequency.
if (!require('birdnik')) devtools::install_github("ironholds/birdnik")
## Loading required package: birdnik
## Loading required package: httr
# Load birdnik
library(birdnik)
# Get the word frequency for "vector", using api_key to access it
api_key <- "d8ed66f01da01b0c6a0070d7c1503801993a39c126fbc3382"
vector_frequency <- word_frequency(api_key, "vector")
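In your own scripts you may prefer not to paste a token into the code itself. A minimal sketch of one alternative is to read it from an environment variable. The variable name WORDNIK_API_KEY, the placeholder value and the object name env_api_key are all assumptions made for this illustration, not something birdnik or Wordnik requires.

```r
# WORDNIK_API_KEY is a hypothetical variable name chosen for this sketch.
# Normally it would be set outside the script (for example in ~/.Renviron);
# it is set here, with a placeholder value, only so the example runs alone
Sys.setenv(WORDNIK_API_KEY = "my-placeholder-key")

# Look the key up at run time instead of hard-coding it
env_api_key <- Sys.getenv("WORDNIK_API_KEY")

# Stop early with a clear message if the variable is missing or empty
if (!nzchar(env_api_key)) {
  stop("WORDNIK_API_KEY is not set")
}
```

With a real key stored this way, env_api_key could be passed to word_frequency() in place of api_key.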