Welcome

Hello everyone and thank you very much for your interest in “Who said text isn’t data? Text analysis for data journalists with R”. I’m really happy that you decided to learn about text analysis.

While I’m still preparing everything so that I don’t bore you on Thursday, you might be wondering whether you need anything installed on your computer to follow along with the session.

Covid-19 changed a lot of things in our lives, including how this kind of session is supposed to happen. Normally, I would run a hands-on session where you could apply what we are learning. But since this is all virtual, I am planning a session where I talk a little about the theory of text analysis and demonstrate the techniques on my screen.

You will have my slides, all my code, and datasets in a repository, so that, after the session, you can try it by yourself.

But that doesn’t mean you can’t follow along. You can, and that’s why I’m writing to you.

For this session we will be using:

  • R, as a programming language;
  • RStudio, as an IDE (basically, where you will be writing your code);
  • Tidyverse, a set of R packages that is useful for doing data science;
  • Tidytext, as the main package for doing text analysis.

So, if you want to follow along (or if you want to have everything configured correctly so that you can run my examples after the session), here’s what you have to do:

1 - Download and Install R

To download R, go to this website, choose your operating system, and then run the file you have just downloaded.

The installer will look like that of any other program you install on your computer.

If you run into trouble, you can read this.

2 - Download and Install RStudio

You should now have a program called “R” on your computer. That means you can run R code on it. But where do you write that code?

That’s what IDEs (integrated development environments) are made for. “IDE” is basically a fancy name for a program where you write code.

Think of it as “Microsoft Word” for code. You could write in “Notepad”, but a lot of people prefer “Microsoft Word” because it has some features they like.

It’s the same thing for code. RStudio was designed specifically for writing R code.

You can download RStudio here.

3 - Get to know RStudio

Packages are a very important part of R. They are basically code that someone has already written and that you can install and call to help you.
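To make that concrete, here is a small sketch using tools, a package that ships with R, so nothing needs to be installed to try it. The file name is made up for the example.

```r
# Option 1: load the package, then call its functions by name
library(tools)
file_ext("speech.txt")

# Option 2: skip library() and call one function with package::function
extension <- tools::file_ext("speech.txt")
extension
```

Both calls return "txt". The packages we will use in the session work the same way once they are installed.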

If you open RStudio and click on File > New File > R Script, you will have something like this:

I know all those windows might look confusing, but I think this tweet really helps explain what everything does:

4 - Installing Packages

For now, I just want you to focus on the Console. That’s basically where you can write R code, hit Enter, and have it run immediately. We will use that to install our packages.
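Before installing anything, you could get a feel for the Console by typing the two lines below, one at a time, and pressing Enter after each. This is only a warm-up and is not needed for the session.

```r
# R works as a calculator: this prints [1] 4
2 + 2

# Shows which version of R you installed in step 1
R.version.string
```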

I will be using the following packages:

  • Tidyverse - a collection of packages that help us do data science.
  • Tidytext - a package to do text analysis.
  • Glue - a package to make our life easier when using variables in text.
  • Plotly - a package that allows us to make some quick interactive charts.

You install a package by typing install.packages('name_package') in the console.

So, to install the packages I’ve mentioned, you just need to type (or simply copy and paste) this:

  • install.packages('tidyverse')
  • install.packages('tidytext')
  • install.packages('glue')
  • install.packages('plotly')

Don’t be afraid of all the gibberish that it will show you in the console :)
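Once the gibberish stops, one way to confirm that everything was installed is the sketch below. missing_packages is a helper name made up for this example; requireNamespace() is part of base R and returns FALSE when a package cannot be found.

```r
# The four packages used in the session
needed <- c("tidyverse", "tidytext", "glue", "plotly")

# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

still_missing <- missing_packages(needed)

if (length(still_missing) == 0) {
  message("All set, every package is installed.")
} else {
  # Running install.packages(still_missing) again is usually enough
  message("Still missing: ", paste(still_missing, collapse = ", "))
}
```

If a package shows up as missing, running install.packages() for it one more time usually solves the problem.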

5 - Have fun

I know you can’t wait to do some text analysis. Now that you have everything installed and configured, you can start having fun. I’ll leave you some code here:

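# Load the packages installed in step 4
library(tidyverse)
library(tidytext)
library(plotly)

# A tiny corpus: one string per line of text
text <- c("Hello, my name is Rui Barros",
          "and I'm a data journalist.",
          "Thursday, I'll guide you through the wonders of text analysis.",
          "Who said text can't be data?")

# Print the vector to check its contents
text

# Split the lines into words, count them, and draw a bar chart
word_plot <- tibble(line = 1:4, text = text) %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  count(word, sort = TRUE) %>%         # how often each word appears
  mutate(word = reorder(word, n)) %>%  # order the bars by count
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +                       # horizontal bars, easier to read
  labs(title = "Most used words by Rui Barros")

# Turn the static chart into an interactive one
ggplotly(word_plot)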

Why don’t you try to copy and paste it into the script you just created and hit Cmd/Ctrl + Shift + Enter?
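If you are curious about what happens before the chart is drawn, the sketch below stops the same pipeline right after the counting step and prints the table instead. It assumes the packages from step 4 are installed; word_counts is just a name chosen for this example.

```r
library(tidyverse)
library(tidytext)

text <- c("Hello, my name is Rui Barros",
          "and I'm a data journalist.",
          "Thursday, I'll guide you through the wonders of text analysis.",
          "Who said text can't be data?")

word_counts <- tibble(line = 1:4, text = text) %>%
  unnest_tokens(word, text) %>%  # one lowercase word per row
  count(word, sort = TRUE)       # most frequent words first

# Print the table of words and how often each one appears
word_counts
```

“text” and “data” each appear twice in these four lines, so they should be at the top of the table, with every other word appearing once.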

See you on ThuRsday!