csv - R: Histogram of missing data
I have a CSV of second-by-second values that looks like this:

    "x","timestamp","value"
    "1",2016-01-01 00:00:00,124
    "2",2016-01-01 00:00:01,121
    "3",2016-01-01 00:00:02,NA
    "4",2016-01-01 00:00:03,NA
    "5",2016-01-01 00:00:04,NA
    "6",2016-01-01 00:00:05,123
    "7",2016-01-01 00:00:06,122
    "8",2016-01-01 00:00:07,124
    "9",2016-01-01 00:00:08,NA
    "10",2016-01-01 00:00:09,124
So some data is missing, marked NA. I want to make a histogram of the lengths of the missing data blocks. In the given example that would mean counting how many missing data blocks have a length of 1 sec (1), of 2 sec (0), of 3 sec (1), and so on.
In the real-life data set the bins/intervals will be a bit different. I am thinking of these 8 categories:

    = 1 sec
    2-5 sec
    6-10 sec
    11-30 sec
    31-300 sec
    301-3600 sec
    3600-86400 sec
    > 86400 sec
So the idea is to let R code run through the lines of the CSV and, whenever it detects an NA value, count the lines until it finds a real value again. Each of the 8 categories would have an integer variable that is incremented by 1 every time a fitting block of NA values is detected.

As a complete R noob I have no idea how to do that. Any help is highly appreciated :)
I am sure there must be a time-series solution, but this should get you started (using set.seed to generate repeatable random values):
    set.seed(42)

    # create sample data: 100 one-second rows with randomly missing values
    df <- data.frame(x = 1:100,
                     timestamp = seq(from = Sys.time() - 99, to = Sys.time(), by = "secs"),
                     value = sample(c(NA, 1:3), 100, replace = TRUE))

    # runs of identical data (TRUE = missing, FALSE = present)
    runs <- rle(is.na(df$value))

    # which runs are the missing ones
    missing <- which(runs$values)

    # end positions (row numbers) of all runs in the sequence
    positions <- cumsum(runs$lengths)

    # start and end times of each missing run
    start <- df$timestamp[positions[missing] - runs$lengths[missing] + 1]
    end <- df$timestamp[positions[missing]]

    # time difference in seconds; note that end - start is one second
    # less than the number of missing rows (a single NA gives 0)
    delta <- difftime(end, start, units = "secs")

    # combine in a usable data.frame
    output <- data.frame(startrow = positions[missing] - runs$lengths[missing] + 1,
                         endrow = positions[missing],
                         starttime = start,
                         endtime = end,
                         duration = delta)
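To get from the run lengths to the eight categories in the question, something along these lines might work. This is only a sketch: it assumes the file has exactly one row per second (so the number of consecutive NA rows equals the block length in seconds), it uses the ten sample values from the question rather than the real file, and the names na_lengths, breaks, labels, bins and counts are made up for illustration. The question lists 3600 in two categories; here a block of exactly 3600 seconds is assumed to belong to the 301-3600 bin.

```r
# Sample values from the question; NA marks a missing second.
value <- c(124, 121, NA, NA, NA, 123, 122, 124, NA, 124)

# Lengths of the consecutive NA blocks. With one row per second,
# the run length is the block length in seconds.
runs <- rle(is.na(value))
na_lengths <- runs$lengths[runs$values]

# Bin edges for the eight categories. cut() uses intervals that are
# open on the left and closed on the right: (0,1], (1,5], (5,10], ...
# Assumption: exactly 3600 s is counted in the "301-3600" bin.
breaks <- c(0, 1, 5, 10, 30, 300, 3600, 86400, Inf)
labels <- c("1", "2-5", "6-10", "11-30", "31-300",
            "301-3600", "3601-86400", ">86400")

# Assign every NA block to a category and count blocks per category.
# table() on a factor keeps empty categories with a count of 0.
bins <- cut(na_lengths, breaks = breaks, labels = labels)
counts <- table(bins)
print(counts)
```

For the sample data there are two NA blocks, of 3 and 1 rows, so one block lands in the "1" category and one in "2-5". Calling barplot(counts) should then draw the counts per category as bars. If the real file can skip rows entirely instead of writing NA, the run length no longer matches the elapsed time and the timestamps would have to be used instead.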