class: center, middle, inverse, title-slide

# R for Data Analysis
## Strings and Factors
### Ayush Patel
### 29-Jul-2021

---
layout: true

---
name: Introduction
class: left,middle

.pull-left[

## Find me

[__@ayushbipinpatel__](https://twitter.com/ayushbipinpatel) <img src="" width=5%>

[__@AyushBipinPatel__](https://github.com/AyushBipinPatel) <img src="" width=5%>

[__ayushpatel.netlify.app__](https://ayushpatel.netlify.app/) <img src="" width=5%>

[__ayush.ap58@gmail.com__](mailto:ayush.ap58@gmail.com)<img src="" width=5%>

]

.pull-right[

<img src = "https://images.metmuseum.org/CRDImages/ad/original/57258.jpg">

.small[
Image: [John Biglin in a Single Scull by Thomas Eakins](https://images.metmuseum.org/CRDImages/ad/original/57258.jpg)
]

]

---
class: left, middle

.pull-left[

# Prerequisites

.big[You...]

understand __different types of objects, how to create objects and assign values to objects__.
<br>
know __how to access specific values within an object.__
<br>
know __what a function is and how to use a function.__
<br>
know __basics of data wrangling__
<br>

]

.pull-right[

<img src = "https://images.metmuseum.org/CRDImages/ad/original/DT84.jpg">

.small[
[Image: Lake George by John Frederick Kensett ](https://images.metmuseum.org/CRDImages/ad/original/DT84.jpg)
]

]

---

# Before we get to it..

.big[Continue in the Rmarkdown document you used in the last class or create a new one.]

<br>

Load the following library; the `sentences`, `words` and `gss_cat` data come with it.

```r
library(tidyverse)
```

---
class: center, middle

# String Manipulation - the essentials

---

# How to create a string

```r
str1 <- "Hello"
str2 <- 'Double and single quote, either works'
str1
```

```
## [1] "Hello"
```

```r
str2
```

```
## [1] "Double and single quote, either works"
```

---

# Basic string operations - length

```r
str1 <- "why is regex so painful?!"
str2 <- c("what", "is","the", "time?")
```

```r
str_length(str1)
```

```
## [1] 25
```

```r
str_length(str2)
```

```
## [1] 4 2 3 5
```

---

# Basic string operations - Combining Strings

```r
str_c("Hi", "Hello")
```

```
## [1] "HiHello"
```

```r
str_c("Hi", "Hello", sep = "|")
```

```
## [1] "Hi|Hello"
```

```r
str_c("Hi", NA, sep = "|") # careful with missing values
```

```
## [1] NA
```

.big[length recycling]

```r
str_c("Dr", c("Strange","Banner","Stark"),"is a genius", sep = " ")
```

```
## [1] "Dr Strange is a genius" "Dr Banner is a genius"  "Dr Stark is a genius"
```

---

# Basic string operations - Collapse a vector

```r
str3 <- c("what", "is","the", "time?")
str3
```

```
## [1] "what"  "is"    "the"   "time?"
```

```r
str_c(str3, collapse = "/")
```

```
## [1] "what/is/the/time?"
```

---

# Basic string operations - Subsetting Strings

.big[indexing starts with 1]

```r
str_sub("she sells sea shells on the sea shore", 5,15)
```

```
## [1] "sells sea s"
```

.big[No error is generated if the string is too short]

```r
str_sub("aaa",1,15)
```

```
## [1] "aaa"
```

---

# Activity 1

Write code to:

+ Find the maximum and the minimum string length in the `sentences` object.
+ Combine the strings at the 55th and 650th positions of the `sentences` object.
+ Subset every string in the `sentences` object from the center of the string to its end.
---
class: center, middle

# Regular expression essentials

---

# The `.`

.big[Matches any character except a newline]

```r
str_view_all("apple",".")
```
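
If the viewer pane is not available, the same idea can be checked in the console. This is a minimal sketch, not part of the original deck; the input strings are made up for illustration.

```r
library(stringr)

# "." matches any single character, so every letter of "apple" is a match
apple_chars <- str_extract_all("apple", ".")[[1]]
apple_chars
# → "a" "p" "p" "l" "e"

# "." does not match a newline, so "a.b" fails when a line break sits between a and b
dot_newline <- str_detect(c("a-b", "a\nb"), "a.b")
dot_newline
# → TRUE FALSE

# To match a literal dot, escape it (written "\\." inside an R string)
literal_dot <- str_detect(c("a.b", "a-b"), "a\\.b")
literal_dot
# → TRUE FALSE
```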
---

# The anchors

.big[`^` to match the start of the string]

.big[`$` to match the end of the string]

```r
str_view(c("Atoms","are","the","smallest","building","blocks","of","matter"),"^(a|A)")
```
---

# The anchors

.big[`^` to match the start of the string]

.big[`$` to match the end of the string]

```r
str_view(c("Atoms","are","the","smallest","building","blocks","of","matter"),"(e|f)$")
```
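
One thing to watch when combining anchors with `|`: alternation binds loosely, so an anchor applies only to the alternative it is written next to unless the alternatives are grouped. A minimal sketch, using a made-up vector `x` rather than anything from the deck:

```r
library(stringr)

x <- c("Atoms", "are", "DNA", "the", "of")

# Without parentheses the pattern reads as "^a" OR "A" (an "A" anywhere)
loose <- str_detect(x, "^a|A")
loose
# → TRUE TRUE TRUE FALSE FALSE

# Grouping keeps the anchor attached to both alternatives
strict <- str_detect(x, "^(a|A)")
strict
# → TRUE TRUE FALSE FALSE FALSE

# "$" anchors the match to the end of the string
ends_e_or_f <- str_subset(x, "(e|f)$")
ends_e_or_f
# → "are" "the" "of"
```

`"^[aA]"` is an equivalent way to write the grouped version.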
---

# Activity 2

Using the `words` object, create regular expressions that find all words that:

+ Start with "y".
+ End with "x".
+ Are exactly three letters long. (Don't cheat by using `str_length()`!)
+ Have seven letters or more.

.big[HINT: use `str_detect()`]

.small[Question tweaked from R4DS]
???

1) 6
2) 4
3) 110
4) 219

sum(str_detect(words,"x$"),na.rm = T)

sum(str_detect(words,"^..{5,}.$"),na.rm = T)

---
class: middle, center

# Working with Categorical variables

---

Why use factors:

+ Easy sorting and arrangement by levels you design
+ Saves you from typos

---

# Creating a factor

.yscroll[

```r
str4 <- c("Very good", "Bad", "Very Bad", "Good", "OK")
str4
```

```
## [1] "Very good" "Bad"       "Very Bad"  "Good"      "OK"
```

```r
sort(str4)
```

```
## [1] "Bad"       "Good"      "OK"        "Very Bad"  "Very good"
```

```r
mood_levels <- c("Very good", "Good", "OK", "Bad", "Very Bad")
factor(str4,levels = mood_levels) ->fact1
fact1
```

```
## [1] Very good Bad       Very Bad  Good      OK
## Levels: Very good Good OK Bad Very Bad
```

```r
sort(fact1)
```

```
## [1] Very good Good      OK        Bad       Very Bad
## Levels: Very good Good OK Bad Very Bad
```

]

---

## What is the problem with this chart?

.big[What can be done about it?]

.yscroll[

<img src="strings-and-factors_files/figure-html/unnamed-chunk-15-1.png" width="100%" />

```r
levels(gss_cat$rincome)
```

```
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999"
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
```

<br>
<br>

]

---
class: center, middle

# What looks better? Is more useful?

.pull-left[
<img src="strings-and-factors_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]

.pull-right[
<img src="strings-and-factors_files/figure-html/unnamed-chunk-18-1.png" width="100%" />
]

---

# What did I do?
.pull-left[

```r
tibble(
  ice_cream = factor(c("vanilla", "cookiesCream", "SaltedPeanuts",
                       "Chocolate", "grages", "coconuts")),
  unit_sale = c(55, 22, 3, 5, 1000, 52)
) %>%
  ggplot(aes(ice_cream, unit_sale)) +
  geom_col()
```

]

.pull-right[

```r
tibble(
  ice_cream = factor(c("vanilla", "cookiesCream", "SaltedPeanuts",
                       "Chocolate", "grages", "coconuts")),
  unit_sale = c(55, 22, 3, 5, 80, 52)
) %>%
  mutate(
    ice_cream = fct_reorder(ice_cream, unit_sale)
  ) %>%
  ggplot(aes(ice_cream, unit_sale)) +
  geom_col()
```

]

---

# fct_reorder vs fct_relevel

.pull-left[

.big[Use `fct_reorder` when there is no principled order]

Eg:

+ Ice cream flavours
+ Religions
+ Colour preference

```
fct_reorder(
  factor_you_need_to_reorder,
  numeric_var_to_reorder_with
)
```

]

.pull-right[

.big[Use `fct_relevel` when there is a principled order]

Eg:

+ Income Groupings
+ Satisfaction levels
+ Military Ranks

```
fct_relevel(
  factor_you_need_to_relevel,
  column_vec_of_principled_order
)
```

]

---
class: center, middle

## Further exploration

Go through the [cheat sheets](https://www.rstudio.com/resources/cheatsheets/) for stringr and forcats

---
class: center, middle
background-image: url("images/background2.jpg")
background-size: cover
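
---

# fct_reorder vs fct_relevel - a small sketch

The two templates can be tried on toy data. This is a minimal sketch with hypothetical vectors (`flavour`, `unit_sale`, `satisfaction`), not data from the deck. The levels are passed to `fct_relevel()` one by one here.

```r
library(forcats)

# Hypothetical data: flavours have no natural order, so order them by sales
flavour <- factor(c("vanilla", "chocolate", "coconut"))
unit_sale <- c(55, 80, 5)
by_sales <- fct_reorder(flavour, unit_sale)
levels(by_sales)
# → "coconut" "vanilla" "chocolate"

# Hypothetical data: satisfaction has a principled order, so state it explicitly
satisfaction <- factor(c("High", "Low", "Medium", "Low"))
ordered_sat <- fct_relevel(satisfaction, "Low", "Medium", "High")
levels(ordered_sat)
# → "Low" "Medium" "High"
```

In both cases only the order of the levels changes; the values stored in the factor stay the same.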