regex - R - extract all strings matching pattern and create relational table -


I am looking for a shorter and prettier solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":

      id            string
    1  a 1.001 xxx 123.123
    2  b 23,45 lorem ipsum
    3  c      donald trump
    4  d    ssss 134, 1,45

What I want is to extract all numbers, no matter whether the delimiter is "." or "," (in this case I assume that the string "134, 1,45" can be extracted into two numbers: 134 and 1.45), and create a data.frame "output" looking similar to this:

      id  string
    1  a   1.001
    2  a 123.123
    3  b   23.45
    4  c    <NA>
    5  d     134
    6  d    1.45

I managed to do this (code below), but the solution looks pretty ugly to me and is not efficient (two for-loops). Could someone suggest a better way to do it (preferably using dplyr)?

    # data
    data <- data.frame(id = c("a", "b", "c", "d"),
                       string = c("1.001 xxx 123.123",
                                  "23,45 lorem ipsum",
                                  "donald trump",
                                  "ssss 134, 1,45"),
                       stringsAsFactors = FALSE)

    # creating empty data.frame
    len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
    output <- data.frame(id = rep(NA, len), string = rep(NA, len))

    # main solution
    start = 0

    for(i in 1:dim(data)[1]){
      tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
      for(j in (start+1):(start+tmp_len)){
        output[j,1] <- data$id[i]
        output[j,2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
      }
      start = start + tmp_len
    }

    # further modifications: use "." as the decimal mark, then drop a trailing "."
    output$string <- gsub(",", ".", output$string)
    output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                       substring(output$string, 1, nchar(output$string) - 1),
                                       output$string))

    output

1) Base R. This uses relatively simple regular expressions and no packages.

In the first 2 lines of code, replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these 2 lines, s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")

In the next 4 lines of code, trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. Then grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word, so only words made up entirely of those characters are matched.) Replace any zero-length components with NA. Finally, add data$id as the names. After these 4 lines are run, the list l will be list(a = c("1.001", "123.123"), b = "23.45", c = NA, d = c("134", "1.45")).

In the last line of code, convert the list l to a data frame with the appropriate names.

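    # ", " becomes " "; any remaining "," is a decimal mark and becomes "."
    s <- gsub(", ", " ", data$string)
    s <- gsub(",", ".", s)

    # split into words, keep the words made only of digits and dots
    l <- strsplit(trimws(s), "\\s+")
    l <- lapply(l, grep, pattern = "^[0-9.]*$", value = TRUE)
    l <- ifelse(lengths(l), l, NA)   # empty components become NA
    names(l) <- data$id

    # one row per extracted value
    with(stack(l), data.frame(id = ind, string = values))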

giving:

      id  string
    1  a   1.001
    2  a 123.123
    3  b   23.45
    4  c    <NA>
    5  d     134
    6  d    1.45
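To make the word filter in (1) concrete, here is a small sketch of what ^[0-9.]*$ keeps and drops. The tokens are hypothetical and chosen only for illustration, and the stricter pattern at the end is an assumption of mine, not part of the answer; it may be useful if words such as "1.2.3" or a lone "." can occur in the data.

```r
# Hypothetical tokens, chosen only to illustrate the filter
tokens <- c("1.001", "xxx", "123.123", "a1", "1.2.3", "")

# Pattern used in the answer: the whole word must be digits and/or dots
keep <- grepl("^[0-9.]*$", tokens)
tokens[keep]

# Stricter variant (an assumption, not from the answer):
# one or more digits, optionally followed by a single dot and more digits
strict <- grepl("^[0-9]+(\\.[0-9]+)?$", tokens)
tokens[strict]
```

The answer's pattern also accepts "1.2.3" and the empty string. Empty tokens do not arise in (1), because the strings are trimmed and then split on runs of whitespace.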

2) magrittr. This variation of (1) writes it as a magrittr pipeline.

    library(magrittr)

    data %>%
         transform(string = gsub(", ", " ", string)) %>%
         transform(string = gsub(",", ".", string)) %>%
         transform(string = trimws(string)) %>%
         with(setNames(strsplit(string, "\\s+"), id)) %>%       # named list of words
         lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%  # keep numeric words
         replace(lengths(.) == 0, NA) %>%                       # empty components become NA
         stack() %>%
         with(data.frame(id = ind, string = values))

3) dplyr/tidyr. This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form, id is made a factor so that we can later use complete to recover the id's that are removed by the subsequent filtering, filter removes the junk rows, and complete inserts an NA row for each id that would otherwise not appear.

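    library(dplyr)
    library(tidyr)

    data %>%
      mutate(string = gsub(", ", " ", string)) %>%
      mutate(string = gsub(",", ".", string)) %>%
      mutate(string = trimws(string)) %>%
      mutate(string = strsplit(string, "\\s+")) %>%
      unnest(string) %>%                          # column named explicitly; a bare unnest() is deprecated in tidyr >= 1.0.0
      mutate(id = factor(id)) %>%                 # factor levels remember every id
      filter(grepl("^[0-9.]*$", string)) %>%      # drop junk words
      complete(id)                                # re-add ids with no numbers as NA rows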

4) data.table

    library(data.table)

    dt <- as.data.table(data)
    dt[, string := gsub(", ", " ", string)][,
         string := gsub(",", ".", string)][,
         string := trimws(string)][,
         string := setNames(strsplit(string, "\\s+"), id)][,
         list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
         list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]

    # the := steps above modified dt in place; the long result is the value of the chain
    dt

Update: Removed the assumption that junk words do not have a digit or dot. Also added (2), (3) and (4) and some improvements.
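For comparison, here is a loop-free base R sketch that stays closer to the question's own approach: it pulls the matches out directly with regmatches and gregexpr and converts them to numeric. It is not part of the answer above. The pattern and the output column name value are my assumptions, and unlike the whole-word filter in (1) to (4) it would also pick up digits embedded in junk words such as "a1".

```r
data <- data.frame(id = c("a", "b", "c", "d"),
                   string = c("1.001 xxx 123.123",
                              "23,45 lorem ipsum",
                              "donald trump",
                              "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)

# Assumed pattern: digits, optionally followed by one "." or "," and more digits
pattern <- "[0-9]+([.,][0-9]+)?"

# One character vector of matches per row of data
m <- regmatches(data$string, gregexpr(pattern, data$string))

# Rows without any match get a single NA so that their id is kept
m[lengths(m) == 0] <- list(NA_character_)

# Repeat each id once per match and convert "," decimal marks to "."
output <- data.frame(id = rep(data$id, lengths(m)),
                     value = as.numeric(sub(",", ".", unlist(m), fixed = TRUE)),
                     stringsAsFactors = FALSE)
output
```

Because a comma only counts as part of a number when a digit follows it immediately, "134, 1,45" is split into 134 and 1.45 without the preliminary gsub step.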

