R: extract all strings matching a pattern and create a relational table
I am looking for a shorter, prettier solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":
      id            string
    1  a 1.001 xxx 123.123
    2  b 23,45 lorem ipsum
    3  c      donald trump
    4  d    ssss 134, 1,45
What I want is to extract all the numbers (no matter whether the decimal delimiter is "." or ","; in this case I assume that the string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" that looks similar to this:
      id  string
    1  a   1.001
    2  a 123.123
    3  b   23.45
    4  c    <NA>
    5  d     134
    6  d    1.45
I managed to do this (code below), but the solution looks pretty ugly to me and is not efficient (two for-loops). Could someone suggest a better way to do it (preferably using dplyr)?
    # data
    data <- data.frame(id = c("a", "b", "c", "d"),
                       string = c("1.001 xxx 123.123",
                                  "23,45 lorem ipsum",
                                  "donald trump",
                                  "ssss 134, 1,45"),
                       stringsAsFactors = FALSE)

    # creating empty data.frame
    len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
    output <- data.frame(id = rep(NA, len), string = rep(NA, len))

    # main solution
    start = 0
    for(i in 1:dim(data)[1]){
      tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
      for(j in (start+1):(start+tmp_len)){
        output[j,1] <- data$id[i]
        output[j,2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
      }
      start = start + tmp_len
    }

    # further modifications
    output$string <- gsub(",", ".", output$string)
    output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                       substring(output$string, 1, nchar(output$string) - 1),
                                       output$string))
    output
1) Base R. This uses relatively simple regular expressions and no packages.
In the first 2 lines of code we replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these 2 lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")
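The order of the two substitutions matters. The following is only an illustrative sketch on the last sample string (the names `right` and `wrong` are made up for this example): replacing ", " first keeps the list separator from turning into a stray trailing dot.

```r
x <- "ssss 134, 1,45"

# order used in the answer: separator commas first, then decimal commas
right <- gsub(",", ".", gsub(", ", " ", x))
right
# [1] "ssss 134 1.45"

# reversed order: every comma becomes a dot first, so "134," turns into "134."
wrong <- gsub(", ", " ", gsub(",", ".", x))
wrong
# [1] "ssss 134. 1.45"
```

With the reversed order the word "134." would still pass the digits-and-dots filter, but it would carry the trailing dot into the result.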
In the next 4 lines of code we trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. We then grep out the elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by 0 or more digits or dots followed by the end of the word, so only words containing just those characters are matched.) We replace any 0-length components with NA. Finally we add data$id as the names. After these 4 lines are run the list l will be list(a = c("1.001", "123.123"), b = "23.45", c = NA, d = c("134", "1.45")).
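To see what the anchored pattern keeps and drops, here is a small sketch on a made-up vector of words (the vector `words` and the stricter pattern at the end are assumptions for illustration, not part of the answer). Junk words that merely contain a digit or a dot, such as "abc123" or "v1.2", are rejected because the whole word has to match.

```r
words <- c("1.001", "xxx", "abc123", "v1.2", "134", "1.2.3", ".")

# anchored pattern from the answer: the whole word must be digits and dots
keep <- grepl("^[0-9.]*$", words)
words[keep]
# [1] "1.001" "134"   "1.2.3" "."

# a stricter alternative (an assumption, not from the answer):
# digits, optionally followed by one dot and more digits
strict <- grepl("^[0-9]+(\\.[0-9]+)?$", words)
words[strict]
# [1] "1.001" "134"
```

The pattern from the answer also lets through tokens such as "1.2.3" or a lone ".", which is harmless for the sample data but may matter for messier input.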
In the last line of code we convert the list l to a data frame with the appropriate names.
    s <- gsub(", ", " ", data$string)
    s <- gsub(",", ".", s)

    l <- strsplit(trimws(s), "\\s+")
    l <- lapply(l, grep, pattern = "^[0-9.]*$", value = TRUE)
    l <- ifelse(lengths(l), l, NA)
    names(l) <- data$id

    with(stack(l), data.frame(id = ind, string = values))
giving:
      id  string
    1  a   1.001
    2  a 123.123
    3  b   23.45
    4  c    <NA>
    5  d     134
    6  d    1.45
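The string column above is still character, whereas the question asked for numbers. As a rough sketch, the same base R steps can be wrapped in a function that also converts to numeric. The function name `extract_numbers` and the column name `value` are hypothetical, and the list is flattened with `rep()` and `unlist()` instead of `stack()`.

```r
# extract_numbers() is a hypothetical helper name, not part of the answer
extract_numbers <- function(id, string) {
  s <- gsub(",", ".", gsub(", ", " ", string))  # normalise the commas
  l <- strsplit(trimws(s), "\\s+")              # split into words
  l <- lapply(l, grep, pattern = "^[0-9.]*$", value = TRUE)
  l[lengths(l) == 0] <- NA_character_           # keep ids without numbers
  data.frame(id = rep(id, lengths(l)),
             value = as.numeric(unlist(l)),     # character -> numeric
             stringsAsFactors = FALSE)
}

data <- data.frame(id = c("a", "b", "c", "d"),
                   string = c("1.001 xxx 123.123", "23,45 lorem ipsum",
                              "donald trump", "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)

out <- extract_numbers(data$id, data$string)
out
```

For the sample data this should give one row per number plus an NA row for id c. Tokens that pass the filter but are not valid numbers (for example "1.2.3") would become NA with a coercion warning.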
2) magrittr. This variation of (1) writes it as a magrittr pipeline.
    library(magrittr)

    data %>%
      transform(string = gsub(", ", " ", string)) %>%
      transform(string = gsub(",", ".", string)) %>%
      transform(string = trimws(string)) %>%
      with(setNames(strsplit(string, "\\s+"), id)) %>%
      lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
      replace(lengths(.) == 0, NA) %>%
      stack() %>%
      with(data.frame(id = ind, string = values))
3) dplyr/tidyr. This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form, id is made a factor so that we can later use complete to recover the id's that are removed by the subsequent filtering, filter removes the junk rows and complete inserts NA rows for each id that would otherwise not appear.
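    library(dplyr)
    library(tidyr)

    data %>%
      mutate(string = gsub(", ", " ", string)) %>%
      mutate(string = gsub(",", ".", string)) %>%
      mutate(string = trimws(string)) %>%
      mutate(string = strsplit(string, "\\s+")) %>%
      # name the column explicitly; bare unnest() is deprecated in tidyr >= 1.0
      unnest(string) %>%
      # factor levels remember every id, even those filtered out below
      mutate(id = factor(id)) %>%
      filter(grepl("^[0-9.]*$", string)) %>%
      complete(id)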
4) data.table
    library(data.table)

    dt <- as.data.table(data)
    dt[, string := gsub(", ", " ", string)][
       , string := gsub(",", ".", string)][
       , string := trimws(string)][
       , string := setNames(strsplit(string, "\\s+"), id)][
       , list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][
       , list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
    # the chain above returns the final table; dt itself was modified by
    # reference by the := steps and now holds the split words as a list column
    dt
Update: Removed the assumption that junk words do not have a digit or dot. Also added (2), (3) and (4) and some improvements.