Week 11 Optional Exercises (SOLUTIONS)

Feel free to try the exercises below at your leisure. Solutions will be posted later in the week!

Text-as-Data

Create a regular expression to find words that start with a vowel. Test your regular expression on this vector: test <- c('apple', 'banana', 'kiwi', 'eggplant')
Scrape the text from the body of the Wikipedia page here. Using the NRC sentiment lexicon, summarize the proportion of non-stop words that fall into each sentiment category. Compare your findings with a second Wikipedia page here.
Using the Twitter data from the lab assignment (with stop words and URL-related tokens excluded), produce a word cloud of the word stems used in the tweets (use SnowballC::wordStem()).
Estimate a topic model for Jane Austen’s Emma (which can be accessed in the janeaustenr package). Estimate the model with 5 topics (treating chapters as documents). What are the top 10 words for each topic?
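
One possible approach to the first exercise, as a minimal sketch in base R: anchor the pattern at the start of the string with ^ and follow it with a character class of vowels. The names starts_with_vowel and vowel_words are illustrative, and ignore.case = TRUE is an assumption added so that capitalized words would also match.

```r
# Test vector from the exercise
test <- c('apple', 'banana', 'kiwi', 'eggplant')

# ^ anchors the match to the start of each string;
# [aeiou] matches exactly one vowel at that position
pattern <- '^[aeiou]'

# Logical vector: does each element start with a vowel?
starts_with_vowel <- grepl(pattern, test, ignore.case = TRUE)

# Return the matching words themselves rather than TRUE/FALSE
vowel_words <- grep(pattern, test, value = TRUE, ignore.case = TRUE)

print(starts_with_vowel)  # TRUE FALSE FALSE TRUE
print(vowel_words)        # "apple" "eggplant"
```

The same pattern should also work with stringr::str_detect(test, '^[aeiou]') if you prefer the tidyverse, although that call is case sensitive unless the pattern is wrapped in regex(..., ignore_case = TRUE).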