# Basic Manipulations with `"stringr"` Functions {#basics2}
## Introduction
As we saw in the previous chapters, R provides a useful range of functions for basic string processing and manipulation of `"character"` data. Most of the time these functions are enough to get the job done. Sometimes, however, they behave awkwardly.
As an example, consider the function `paste()`. The default separator is a blank space, which more often than not is what we want to use. But that's secondary. The really annoying thing happens when we paste things that include zero-length arguments. How does `paste()` behave in those cases? See below:
```{r paste_berkeley_ex, tidy=FALSE}
# this works fine
paste("University", "of", "California", "Berkeley")
# this is weird
paste("University", "of", "California", "Berkeley", NULL)
# this is ugly
paste("University", "of", "California", "Berkeley", NULL, character(0),
"Go Bears!")
```
Notice the output from the last example (the _ugly_ one). The objects `NULL`
and `character(0)` have zero length, yet when included inside `paste()` they
are treated as an empty string `""`. Wouldn't it be good if `paste()` removed
zero-length arguments? Sadly, we cannot change how base functions such as
`nchar()` and `paste()` behave. But fear not. There is a very nice package that
solves these problems and provides several functions for carrying out
consistent string processing.
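To make the wish concrete, here is a minimal base R sketch of a `paste()` variant that drops zero-length arguments before pasting. The name `paste_nonempty()` is made up for this illustration; it is not part of base R or of any package.

```{r paste_nonempty_sketch}
# hypothetical helper (not part of base R): paste without zero-length arguments
paste_nonempty <- function(..., sep = " ") {
  args <- list(...)
  # keep only the arguments that contain at least one element
  args <- args[lengths(args) > 0]
  do.call(paste, c(args, list(sep = sep)))
}
# NULL and character(0) no longer leave extra separators behind
berkeley <- paste_nonempty("University", "of", "California", "Berkeley",
                           NULL, character(0), "Go Bears!")
berkeley
```

A wrapper like this works, but it would have to be repeated for every base function that mishandles empty input, which is the gap a dedicated package fills.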
## Package `"stringr"`
Thanks to Hadley Wickham and company, we have the package `"stringr"`, which adds more
functionality to the base functions for handling strings in R. According to the description of the package
(http://cran.r-project.org/web/packages/stringr/index.html):

> `"stringr"` is a set of simple wrappers that make R's string functions more consistent,
> simpler and easier to use. It does this by ensuring that: function and
> argument names (and positions) are consistent, all functions deal with NA's
> and zero length character appropriately, and the output data structures from
> each function matches the input data structures of other functions.

To install `"stringr"` use the function `install.packages()`. Once installed,
load it to your current session with `library()`:
```{r install_stringr, message=FALSE, eval=FALSE}
# installing 'stringr'
install.packages("stringr")
# load 'stringr'
library(stringr)
```
```{r load_stringr, echo=FALSE}
library(stringr)
```
## Basic String Operations
`"stringr"` provides functions for both 1) basic manipulations and 2)
regular expression operations. In this chapter we cover the functions that
deal with basic manipulations.
The following table contains the `"stringr"` functions for basic string operations:
| Function | Description | Similar to |
|:---------------|:----------------------------------------|:--------------|
| `str_c()` | string concatenation | `paste()` |
| `str_length()` | number of characters | `nchar()` |
| `str_sub()` | extracts substrings | `substring()` |
| `str_dup()` | duplicates characters | `strrep()` |
| `str_pad()` | pads a string | _none_ |
| `str_wrap()` | wraps a string paragraph | `strwrap()` |
| `str_trim()` | removes leading and trailing whitespace | `trimws()` |
Notice that all functions start with `"str_"` followed by a term
associated with the task they perform. For example, `str_length()` gives you the
number (i.e. length) of characters in a string. In addition, some functions are
designed to provide a better alternative to already existing functions. This is
the case of `str_length()`, which is intended to be a substitute for `nchar()`.
Other functions had no direct counterpart when `"stringr"` first appeared:
`str_dup()`, which allows you to duplicate characters, predates base R's
`strrep()`, and `str_trim()` predates `trimws()`.
### Concatenating with `str_c()`
Let's begin with `str_c()`. This function is equivalent to `paste()` but
instead of using the white space as the default separator, `str_c()` uses the
empty string `""` which is a more common separator when _pasting_ strings:
```{r str_c_ex}
# default usage
str_c("May", "The", "Force", "Be", "With", "You")
# removing zero length objects
str_c("May", "The", "Force", NULL, "Be", "With", "You", character(0))
```
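Observe another major difference between `str_c()` and `paste()`: a `NULL`
argument is silently removed by `str_c()`. The treatment of `character(0)`
depends on the installed version of `"stringr"`: older releases drop it as well,
whereas recent releases follow the tidyverse recycling rules and return a
zero-length result.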
If you want to change the default separator, you can do that as usual by
specifying the argument `sep`:
```{r str_join}
# changing separator
str_c("May", "The", "Force", "Be", "With", "You", sep = "_")
# 'str_glue' takes its separator through '.sep' instead
str_glue("May", "The", "Force", "Be", "With", "You", .sep = "_")
```
As you can see from the previous examples, an alternative to `str_c()` is `str_glue()`, which takes the separator through the argument `.sep`.
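Like `paste()`, `str_c()` is vectorised and accepts a `collapse` argument. The short sketch below (with made-up inputs) shows both features, together with the fact that a missing value in any piece makes the whole result missing.

```{r str_c_vectors}
library(stringr)
# element-wise concatenation of two vectors
pairs <- str_c(c("a", "b", "c"), c("1", "2", "3"), sep = "-")
pairs
# collapse a vector into a single string
joined <- str_c(c("a", "b", "c"), collapse = "+")
joined
# missing values are contagious
with_na <- str_c("May", NA, "Force")
with_na
```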
### Number of characters with `str_length()`
As we've mentioned before, the function `str_length()` is equivalent to
`nchar()`. Both functions return the number of characters in a string, that is,
the _length_ of a string (do not confuse it with the `length()` of a vector).
Compared to `nchar()`, `str_length()` has a more consistent behavior when
dealing with `NA` values. Older versions of R gave a missing string a length of
2 (the two characters of `NA` as printed), while recent versions return `NA` by
default; `str_length()` has always preserved missing values as `NA`s.
```{r str_length_ex}
# some text (NA included)
some_text <- c("one", "two", "three", NA, "five")
# compare 'str_length' with 'nchar'
nchar(some_text)
str_length(some_text)
```
In addition, `str_length()` has the nice feature that it converts factors to
characters, something that `nchar()` is not able to handle:
```{r str_length_with_factors, error=TRUE}
some_factor <- factor(c(1,1,1,2,2,2), labels = c("good", "bad"))
some_factor
# try 'nchar' on a factor
nchar(some_factor)
# now compare it with 'str_length'
str_length(some_factor)
```
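The difference between the `length()` of a vector and the number of characters in its elements is easy to check. The vector below is made up for illustration:

```{r str_length_vs_length}
library(stringr)
fruits <- c("apple", "kiwi", NA)
# number of elements in the vector
n_elements <- length(fruits)
n_elements
# number of characters in each element (the missing value stays missing)
n_chars <- str_length(fruits)
n_chars
```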
### Substring with `str_sub()`
To extract substrings from a character vector `stringr` provides `str_sub()`
which is equivalent to `substring()`. The function `str_sub()` has the
following usage form:
```
str_sub(string, start = 1L, end = -1L)
```
The three arguments in the function are: a `string` vector, a `start` value
indicating the position of the first character in the substring, and an `end` value
indicating the position of the last character. Here's a simple example with a
single string in which characters from 1 to 5 are extracted:
```{r str_sub_ex1}
lorem <- "Lorem Ipsum"
# apply 'str_sub'
str_sub(lorem, start = 1, end = 5)
# equivalent to 'substring'
substring(lorem, first = 1, last = 5)
# another example
str_sub("adios", 1:3)
```
An interesting feature of `str_sub()` is its ability to work with negative
indices in the `start` and `end` positions. When we use a negative position, `str_sub()` counts backwards from the last character:
```{r str_sub_ex2}
resto <- c("brasserie", "bistrot", "creperie", "bouchon")
# 'str_sub' with negative positions
str_sub(resto, start = -4, end = -1)
# compared to substring (negative positions yield empty strings)
substring(resto, first = -4, last = -1)
```
Similar to `substring()`, we can also give `str_sub()` a set of positions which
will be recycled over the string. But even better, we can give `str_sub()`
a negative sequence, something that `substring()` ignores:
```{r str_sub_ex3}
# extracting sequentially
str_sub(lorem, seq_len(nchar(lorem)))
substring(lorem, seq_len(nchar(lorem)))
# reverse substrings with negative positions
str_sub(lorem, -seq_len(nchar(lorem)))
substring(lorem, -seq_len(nchar(lorem)))
```
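Recycling applies to the `end` argument as well. As a small sketch (the input is arbitrary), fixing `start` at 1 and letting `end` run over a sequence returns every prefix of a string:

```{r str_sub_prefixes}
library(stringr)
# all prefixes of "adios": start is fixed, end is recycled over 1, 2, ..., 5
prefixes <- str_sub("adios", start = 1, end = seq_len(str_length("adios")))
prefixes
```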
We can use `str_sub()` not only for extracting substrings but also for replacing
substrings:
```{r str_sub_ex4}
# replacing 'Lorem' with 'Nullam'
lorem <- "Lorem Ipsum"
str_sub(lorem, 1, 5) <- "Nullam"
lorem
# replacing with negative positions
lorem <- "Lorem Ipsum"
str_sub(lorem, -1) <- "Nullam"
lorem
# multiple replacements
lorem <- "Lorem Ipsum"
str_sub(lorem, c(1,7), c(5,8)) <- c("Nullam", "Enim")
lorem
```
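Replacement combines naturally with negative positions. The sketch below masks everything except the last four characters of an identifier; the value of `account` is invented for the example.

```{r str_sub_mask}
library(stringr)
# made-up identifier used only for illustration
account <- "1234567890123456"
masked <- account
# replace everything except the last four characters with asterisks
str_sub(masked, start = 1, end = -5) <- str_dup("*", str_length(account) - 4)
masked
```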
### Duplication with `str_dup()`
A common operation when handling characters is _duplication_. For a long time
R had no specific function for that purpose (recent versions of base R provide `strrep()`). `"stringr"` offers `str_dup()`, which duplicates and concatenates strings within a character vector.
Its usage requires two arguments:
```
str_dup(string, times)
```
The first input is the `string` that you want to duplicate. The second input,
`times`, is the number of times to duplicate each string:
```{r str_dup_ex}
# default usage
str_dup("hola", 3)
# use with different 'times'
str_dup("adios", 1:3)
# use with a string vector
words <- c("lorem", "ipsum", "dolor", "sit", "amet")
str_dup(words, 2)
str_dup(words, 1:5)
```
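A typical use of duplication is building decorations whose size depends on other text. As a small sketch with a made-up title, the following underlines a heading with as many `=` signs as it has characters:

```{r str_dup_underline}
library(stringr)
title <- "Basic String Operations"
# repeat "=" as many times as there are characters in the title
underline <- str_dup("=", str_length(title))
cat(title, underline, sep = "\n")
```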
### Padding with `str_pad()`
Another handy function that we can find in `stringr` is `str_pad()` for
_padding_ a string. Its default usage has the following form:
```
str_pad(string, width, side = "left", pad = " ")
```
The idea of `str_pad()` is to take a string and pad it with leading or trailing
characters to a specified total `width`. The default padding character is a
space (`pad = " "`), and consequently the returned string will appear to be
either right-aligned (`side = "left"`), left-aligned (`side = "right"`), or
centered (`side = "both"`).
Let's see some examples:
```{r str_pad_ex}
# default usage
str_pad("hola", width = 7)
# pad both sides
str_pad("adios", width = 7, side = "both")
# left padding with '#'
str_pad("hashtag", width = 8, pad = "#")
# pad both sides with '-'
str_pad("hashtag", width = 9, side = "both", pad = "-")
```
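Padding is often used to bring identifiers to a common width. The values in `ids` below are invented for illustration; the last call shows that a string already longer than `width` is returned unchanged.

```{r str_pad_ids}
library(stringr)
ids <- c("7", "42", "1234")
# left-pad with zeros to a common width of 4
padded <- str_pad(ids, width = 4, pad = "0")
padded
# strings already longer than 'width' are returned unchanged
too_long <- str_pad("hashtag", width = 3)
too_long
```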
### Wrapping with `str_wrap()`
The function `str_wrap()` is similar to `strwrap()`, which can be used to
_wrap_ a string to format paragraphs. The idea of wrapping a (long) string is
to first split it into lines according to the given `width`, and then add
the specified indentation to each line (first line with `indent`, following
lines with `exdent`). Its default usage has the following form:
```
str_wrap(string, width = 80, indent = 0, exdent = 0)
```
For instance, consider the following quote (from Douglas Adams) converted into
a paragraph:
```{r douglas_adams, tidy=FALSE}
# quote (by Douglas Adams)
some_quote <- c(
"I may not have gone",
"where I intended to go,",
"but I think I have ended up",
"where I needed to be")
# some_quote in a single paragraph
some_quote <- paste(some_quote, collapse = " ")
```
Now, say you want to display the text of `some_quote` within some pre-specified
column width (e.g. width of 30). You can achieve this by applying `str_wrap()`
and setting the argument `width = 30`:
```{r str_wrap_ex1, tidy=FALSE}
# display paragraph with width=30
cat(str_wrap(some_quote, width = 30))
```
Besides breaking a (long) paragraph into several lines, you may also wish to
add some indentation. Here's how you can indent the first line, as well as the
following lines:
```{r str_wrap_ex2, tidy=FALSE}
# display paragraph with first line indentation of 2
cat(str_wrap(some_quote, width = 30, indent = 2), "\n")
# display paragraph with following lines indentation of 3
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
```
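One practical difference from `strwrap()` is worth keeping in mind: `strwrap()` returns a character vector with one element per line, whereas `str_wrap()` returns a single string with newline characters inserted. The sketch below repeats the quote in a separate variable, `quote_text`, so that it runs on its own:

```{r str_wrap_lines}
library(stringr)
quote_text <- paste("I may not have gone where I intended to go,",
                    "but I think I have ended up where I needed to be")
wrapped <- str_wrap(quote_text, width = 30)
# a single string containing newline characters
length(wrapped)
# split on the newlines to recover the individual lines
wrapped_lines <- strsplit(wrapped, "\n", fixed = TRUE)[[1]]
wrapped_lines
```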
### Trimming with `str_trim()`
One of the typical tasks of string processing is that of parsing a text into
individual words. Usually, you end up with words that have blank spaces, called _whitespace_, on either end. In this situation, you can use the
`str_trim()` function to remove any amount of whitespace at the ends of a
string. Its usage requires only two arguments:
```
str_trim(string, side = "both")
```
The first input is the `string` to be trimmed, and the second input indicates
the `side` on which the whitespace will be removed.
Consider the following vector of strings, some of which have whitespace either
on the left, on the right, or on both sides. Here's what `str_trim()` would do
to them under different settings of `side`:
```{r steal, tidy=FALSE}
# text with whitespaces
bad_text <- c("This", " example ", "has several ", " whitespaces ")
# remove whitespaces on the left side
str_trim(bad_text, side = "left")
# remove whitespaces on the right side
str_trim(bad_text, side = "right")
# remove whitespaces on both sides
str_trim(bad_text, side = "both")
```
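Trimming matters most when strings are compared. In the sketch below the `answers` vector is made up; it shows that tabs and newlines count as whitespace too, and that only trimmed values compare equal to the expected text.

```{r str_trim_input}
library(stringr)
# made-up answers as they might arrive from a form
answers <- c("yes", " yes", "yes\n", "\t yes ")
# without trimming only the first answer matches
answers == "yes"
# tabs and newlines are removed along with spaces
trimmed <- str_trim(answers)
trimmed == "yes"
```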
### Word extraction with `word()`
We end this chapter describing the `word()` function that is designed to
extract words from a sentence:
```
word(string, start = 1L, end = start, sep = fixed(" "))
```
The way in which you use `word()` is by passing it a `string`, together with a
`start` position of the first word to extract, and an `end` position of the
last word to extract. By default, the separator `sep` used between words is a
single space.
Let's see some examples:
```{r word_ex}
# some sentence
change <- c("Be the change", "you want to be")
# extract first word
word(change, 1)
# extract second word
word(change, 2)
# extract last word
word(change, -1)
# extract all but the first word
word(change, 2, -1)
```
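The `sep` argument lets `word()` split on something other than a space. As a brief sketch, using an invented date string whose pieces are separated by dashes:

```{r word_sep}
library(stringr)
# made-up date string whose pieces are separated by dashes
date_string <- "2015-06-30"
# first piece
year_part <- word(date_string, 1, sep = fixed("-"))
year_part
# last piece
day_part <- word(date_string, -1, sep = fixed("-"))
day_part
```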
`"stringr"` has more functions but we'll discuss them in the chapters about
[regular expressions](#regex1).