[SUNY 2018] Anyone Can Learn Web Scraping

To get the most benefit from this skill-building session, it is recommended that you bring a laptop and work through the examples during the session. To run through them, you'll need the following programs installed. They are all FREE; if you are ever asked for a credit card number, you've gone the wrong way. I will be using a Windows machine, but this should work just the same on a Mac. It is recommended you install them in this order:

  1. The R Project for Statistical Computing
  2. RStudio Desktop
  3. Once R and RStudio Desktop are installed, open RStudio and copy-paste the following command into the pane labeled Console. It will probably take a minute or two if this is your first time installing R:
    1. install.packages("twitteR")
  4. Google Chrome
  5. Once Chrome is installed, install these two free extensions:
    1. Scraper
    2. SelectorGadget (optionally, see its webpage for more information; a short sketch of what to do with the selectors it produces appears just after this list)
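
As promised, here is a minimal sketch of what you can do with a CSS selector of the kind SelectorGadget produces, using the rvest package in R. Note that rvest is not one of the required installs (add it with install.packages("rvest") if you want to try this yourself), and the URL and "h1" selector here are placeholders showing the shape of the call:

# Illustrative only: extract text matching a CSS selector with rvest
library(rvest)

page <- read_html("https://example.com")        # fetch and parse the page
headings <- html_text(html_nodes(page, "h1"))   # grab text matching the selector
print(headings)

The Scraper extension performs this same kind of selector-based extraction directly in the browser, with no R code required.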

You should also have access to either Google Sheets or Excel on the computer you bring.

Here is the code we'll be using to pull data from Twitter, which you can either copy/paste or re-type when needed:

# Tutorial/Demonstration of Twitter Data Calls using R
# by: Richard N Landers (rlanders@tntlab.org)
#
# We'll be grabbing the most recent public posts on #digitalhumanities from Twitter

# Load the Twitter retrieval package
library(twitteR)

# At this point, create an "App" on Twitter after logging in by going to 
# http://apps.twitter.com and "creating an application"
# Once you've created an application, open its settings, go to Keys and Access Tokens,
# generate the tokens, then copy/paste the four strings below
consumer_key <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
consumer_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
access_token <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
access_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Using the keys above, authorize this R session to access the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# Grab up to 1,000 recent tweets containing #digitalhumanities and convert the list to a data frame
humanSearch <- searchTwitter("#digitalhumanities", n=1000)
humanSearch_df <- twListToDF(humanSearch)

# Write the new dataset out as a CSV to play with; remember to set your
# working directory (setwd) first so you know where the file lands.
# row.names = FALSE keeps a stray row-number column out of the file.
write.csv(humanSearch_df, "tweets.csv", row.names = FALSE)
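
A note on that working directory comment: write.csv saves tweets.csv into R's current working directory. Here is a minimal sketch of checking and changing that directory, and of reading the file back in as a sanity check (the path is a placeholder; point it at a folder on your own machine):

# See where R will write files, and change it if you like
getwd()
setwd("C:/Users/yourname/Documents")   # placeholder path -- edit for your machine

# After running write.csv above, read the file back in as a quick check
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
head(tweets$text)   # the first few tweets retrieved

The resulting CSV opens directly in Google Sheets or Excel, which is why one of those is on the preparation list above.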

Finally, you can download the slides from this skill-building session here (coming soon).