Feel free to try the exercises below at your leisure. Solutions will be posted later in the week!

Text-as-Data

  1. Create a regular expression to find words that start with a vowel. Test your regular expression on this vector: test <- c('apple', 'banana', 'kiwi', 'eggplant')

  2. Scrape data from the body of the Wikipedia page here. Using the NRC sentiment lexicon, summarize the proportion of non-stop words in each sentiment category. Compare your findings with a second Wikipedia page here.

  3. Using the Twitter data from the lab assignment (with stop words and URL-related text excluded), produce a word cloud of the word stems used in the tweets (use SnowballC::wordStem()).

  4. Estimate a topic model for Jane Austen’s Emma (which can be accessed in the janeaustenr package). Estimate the model with 5 topics (treating chapters as documents). What are the top 10 words for each topic?