layout: true

<div class="my-footer">
<span>bit.ly/tidytext4journalists</span>
</div>

---
class: center, middle

<h1 style ="font-size:8rem; margin-bottom:0px; margin-top:0px"> 📄 </h1>
<h1 style ="margin-bottom:0px; margin-top:0px"> Who said text isn't data? </h1>
<h3 style ="margin-top:0px"> Text analysis for data journalists with R </h3>

.large[**Rui Barros | *data journalist @Público***]

---
class: left

# Three things you should know about me...

--

- .large[I get very nervous when I speak in public.]

--

- .large[English is not my native language.]

--

- .large[I tend to speak fast.]

--

<center>
<h1 style ="font-size:3; margin-bottom:0px; margin-top:0px"> 🙋♀️ </h1>
<h2 style ="margin-bottom:0px; margin-top:0px"> <i>Feel free to interrupt me</i> </h2>
</center>

---

# <center> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/1200px-R_logo.svg.png" height = '50px'> 101 </center>

- .large[**R**: the statistical programming language that we will be using.]

--

- .large[**RStudio**: a program on your computer built to help you write R code easily. It's where you will be writing.]

--

- .large[**Packages**: R code someone already wrote to help you do something.]

--

.box[You install new packages with: <br> `install.packages('name_package')`]

---

# <center> Packages we will be using: </center>

- .large[**[dplyr](https://github.com/tidyverse/dplyr/)**: a package for data manipulation.]

- .large[**[tidytext](https://github.com/juliasilge/tidytext)**: a package for text analysis.]

- .large[**[ggplot2](https://github.com/tidyverse/ggplot2/)**: a package for data visualization.]
---
class: center

# Welcome to the tidyverse

<img src="https://www.tidyverse.org/images/hex-tidyverse.png" height = '250px'>

.box[
Just install the whole thing: <br>
`install.packages("tidyverse")` <br>
`install.packages("tidytext")`
]

---

# <center> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/1200px-R_logo.svg.png" height = '50px'> + <img src="https://www.tidyverse.org/images/hex-tidyverse.png" height = '50px'> 101 - Syntax</center>

- .large[R uses **`<-`** to assign a value, a dataset, a list, or anything else you want to a name.]

--

- .large[R uses **` %>% `** (called a pipe) to pass the output of one operation on to the next one. Basically, just read it as *and then do...*]

--

- .large[You need to load a package before using it.]

.box[
You load a package using: <br>
`library('name_package')`
]

---
class: black, center

# I think we are ready...

<img src="https://media1.tenor.com/images/f250c971767587d622373ceb638e8fbb/tenor.gif?itemid=10300477">

---

# <center>What is text analysis?</center>

<img src="figs/text_to_stats.png">

---
class: middle

<blockquote>
<h2>
Basically, it's all about counting pieces of text.
</h2>
<h3> And then doing some stats wizardry 🤷 </h3>
</blockquote>

---
class: left

# <center style="margin-bottom:100px">Two kinds of use cases in data journalism</center>

.pull-left[
<center>
<h1 style ="font-size:8rem; margin-bottom:0px; margin-top:0px"> 📃 </h1>
</center>

### <center>Text analysis is the story by itself</center>
]

.pull-right[
<center>
<h1 style ="font-size:8rem; margin-bottom:0px; margin-top:0px"> 🔦 </h1>
</center>

### <center>Text analysis is a tool you use while working on a story</center>
]

---
class: inverse, center

# What we will be analysing today

<img src="https://upload.wikimedia.org/wikipedia/en/e/e1/Eurovision_Song_Contest.svg">

---
class: left

# Step 1 - Get the lyrics

- .large[I've scraped all the lyrics submitted by countries to the Eurovision Song Contest between 1956 and 2019 (you can find the scraping code [here](https://gitlab.com/ruimgbarros/tidytext4journalists/-/blob/master/data/eurovision_song_extractor.R)).]

--

- .large[I won't teach you how to do this part 🤷.]

---

# Step 2 - Get to know your data

<center>
<img src="figs/n_songs.png" width = '70%'>
</center>

---

# Step 3 - 🔪 🔪 🔪 ...

- .large[If it's all about counting words, we need to cut the long pieces of text into units.]

--

- .large[We are going to split the text by word, but you can use other units too, such as sequences of n words (called n-grams).]

--

.box[
You do it using: <br>
`unnest_tokens(name_column_with_words, name_column_with_strings)`
]

---

# Step 4 - Counting words...

- .large[Now that we have one word per row, it's time to count words.]

--

- .large[Basically, let's check what the most common word in Eurovision songs is.]

---
class: center

# Most used words

.pull-left[
.large[The]

.large[You]

.large[I]
]

.pull-left[
.large[And]

.large[To]

.large[A]
]

--

<div style='margin-top:350px' > </div>

## What can we learn from these words?

---
class: black, center, middle

# Nothing.
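---

# Steps 3 and 4 in code: a sketch

Here is a minimal sketch of the tokenize-and-count steps, just to make them concrete. The `lyrics` table, its `country` and `lyric` columns and the two invented lyric strings are assumptions made for illustration; they are not the real scraped dataset.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the scraped lyrics (invented text, hypothetical column names)
lyrics <- tibble(
  country = c("Portugal", "Sweden"),
  lyric = c(
    "the sea is calm and the sea is blue",
    "dance with me and love the night"
  )
)

# Step 3: cut each lyric into one lower-cased word per row
words <- lyrics %>%
  unnest_tokens(word, lyric)

# Step 4: count how often each word appears, most frequent first
word_counts <- words %>%
  count(word, sort = TRUE)

word_counts
```

Even in this tiny example the most frequent word is "the", which is the same problem we just saw with the real lyrics.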
---

# Lessons about language

- .large[Humans are very boring and use a lot of words that don’t carry much meaning.]

- .large[These words are useful when we speak and write, but they don't help when we do text mining.]

---
class: black, center, middle

<h1>✋ </h1>
<h1 style ="margin-bottom:0px; margin-top:0px"> Stop Words </h1>

.large[Your solution to words that mean nothing!]

---

# Stop Words

- .large[Stop words are words that are not useful for an analysis, typically extremely common words such as articles and connectors.]

--

- .large[They change from language to language.]

--

- .large[They can be customized for the topic you are analysing (e.g. "virus" is probably not informative in a dataset of texts about COVID-19).]

---

# Stop Words

- .large[`tidytext` already provides a dataset called `stop_words` with English stop words.]

.box[
You take them out using <br>
`anti_join(stop_words)`
]

---
class: black, center, middle

# NOW we can count words

---

# Step 5 - What if there is more than counting?

- .large[Counting can be fun, but we get it, everybody loves "love"...]

--

- .large[There should be more I can do, right?]

---
class: black, center, middle

<h1>⭐ </h1>
<h1 style ="margin-bottom:0px; margin-top:0px"> tf-idf </h1>

.large[term frequency–inverse document frequency]

---

# What the #@*% is tf-idf

- .large[Basically, it measures how important a word is to a document when you compare that document with other documents.]

--

- .large[Think of it as a fingerprint that REALLY gives away who wrote a text.]

---

# tf-idf - "Big Bang Theory" Example

.large[**Character1** - You really think so?]

.large[**Character2** - Of course not. Even in my sleep-deprived state, I've managed to pull off another one of my classic pranks.
***Bazinga!***]

--

<center>
<img src="https://br.web.img2.acsta.net/r_640_360/newsv7/16/11/07/21/20/588402.jpg" width = '70%'>
</center>

---

# tf-idf in Eurovision lyrics

- .large[It really shows how important it is to have a very clean dataset...]

--

- .large[For Portugal: yeah, some sea-related words lol.]

--

- .large[Luxembourg and Switzerland have daddy issues, Belarus and Moldova mommy issues 🤷 (I'm joking but maybe it means something...)]

---
class: black, center, middle

<h1>❤️ </h1>
<h1 style ="margin-bottom:0px; margin-top:0px"> Feelings </h1>

.large[Sentiment analysis]

---

# Sentiment analysis

- .large[Sentiment analysis is all about... you guessed it, counting words.]

--

- .large[We are going to use a dataset in which feelings are already associated with words.]

---
class: right, middle

<h1 class="fa fa-quote-left fa-fw"></h1>

.large[One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.]

<h1 class="fa fa-quote-right fa-fw"></h1>

Julia Silge and David Robinson @ ["Text Mining with R"](https://www.tidytextmining.com/)

---
class: center, middle

<img src="figs/eur_happ_index.png" width = '70%'>

---
class: black, middle, center

# So... how is all this useful?

---

# How can YOU use text analysis

- .large[Analysing text can be the story. Ex: [text analysis of political manifestos](https://rr.sapo.pt/2020/07/10/europeias-2019/o-que-dizem-os-programas-dos-partidos-para-as-europeias-confia-em-mim-mas-nao-me-leias/multimedia/152006/)]

--

- .large[It's very useful for the so-called "snowball approach". Ex: BASE approach.]

--

- .large[It can reveal patterns to get you a story.
Ex: [Portuguese State of the Nation](https://rr.sapo.pt/2019/07/11/politica/quatro-anos-por-19-vezes-as-palavras-e-expressoes-mais-usadas-por-costa-no-discurso-do-estado-da-nacao/especial/157563/)]

---
class: inverse, center
background-image: url("https://media.giphy.com/media/xUOxfjsW9fWPqEWouI/giphy.gif")

# Questions?

---
class: left, middle

<span style ='font-size: 100pt'>🎲</span>

# Thank you!

<a href="https://twitter.com/rui_barros17"><i class="fa fa-twitter fa-fw"></i> @rui_barros17</a><br>
<a href="https://gitlab.com/ruimgbarros/"><i class="fa fa-gitlab fa-fw"></i> @ruimgbarros</a><br>
<a href="https://blog.ruimgbarros.com/" target='_blank'><i class="fa fa-link fa-fw"></i> blog.ruimgbarros.com </a><br>
<a href="mailto:ruimgbarros@gmail.com"><i class="fa fa-paper-plane fa-fw"></i> ruimgbarros@gmail.com</a> // <a href="mailto:rui.barros@publico.pt"><i class="fa fa-paper-plane fa-fw"></i> rui.barros@publico.pt</a>

Made with [**remark.js**](http://remarkjs.com/) and [**xaringan R package**](https://github.com/yihui/xaringan)
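
---

# Appendix: stop words, tf-idf and sentiment in code

This is a rough sketch of how the stop-word, tf-idf and sentiment steps from the talk could look in code, not the exact code behind the charts. The `lyrics` table, its `country` and `lyric` columns and the invented lyric strings are stand-ins for the scraped data. Using the `bing` lexicon is also an assumption: the slides do not say which sentiment lexicon was used.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the scraped lyrics (invented text, hypothetical column names)
lyrics <- tibble(
  country = c("Portugal", "Sweden"),
  lyric = c(
    "the sea is calm and the sea is blue",
    "dance with me and love the night"
  )
)

# One word per row, then drop the English stop words shipped with tidytext
clean_words <- lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words, by = "word")

# tf-idf: treat each country as one "document" and score its words
word_tf_idf <- clean_words %>%
  count(country, word) %>%
  bind_tf_idf(word, country, n) %>%
  arrange(desc(tf_idf))

# Sentiment: keep only words found in the lexicon, then count by feeling
# (the "bing" lexicon is an assumed choice for this sketch)
sentiment_counts <- clean_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(country, sentiment)

word_tf_idf
sentiment_counts
```

A word that shows up in every country's lyrics gets a tf-idf of zero, while a word concentrated in one country's lyrics scores higher. That is why tf-idf works like a fingerprint.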