Introduction & purpose

In the previous post, the data coming from the accelerometer was analysed in an over-simplistic way: just taking some common statistical measures (e.g. the standard deviation) to get an idea of the overall physical activity of the wearable's user over time. What to do when a more advanced analysis of the data is required?

Such a goal would imply analyzing a time series, and thus understanding quite a few far-from-obvious concepts from signal processing. But are these really needed?

Not really, or at least not necessarily. For example, in this podcast from O’Reilly it is argued that applying a neural network should be an easier way to solve time-series problems. This post takes the neural network approach to analyzing the accelerometer data, keeping in mind the question of how easy and practical it is, and which other problems pop up (e.g. convergence, processing time…)

The post includes code in R, and also the accelerometer dataset, in case you want to follow along.

Dataset used

The dataset is the same as the one used in the previous post; you can find it in this R data file. Just download it and then load it into your R session.

Although this post intends to be self-contained, it would of course help to quickly read that previous post.

References

From the O’Reilly podcast already mentioned, here is a nice quote from Christopher Nguyen, co-founder of Arimo, interviewed in it:

Let’s divide the world into before deep learning on time series and after deep learning on time series. Techniques exist for processing time series, yet I would say they’re too hard. It’s possible, but it costs too much or it’s too hard using traditional techniques, and it yields a value that hasn’t been worth the investment. Deep learning flips that around and gets you much larger value for less effort.

Besides, there’s quite a lot of info around on how to actually do it. One nice summary of the most straightforward approach can be found in this Cross-Validated answer; in this post, the approach used is even more naive, but along the same lines. Actually, as usually happens, this is far from a new idea: take a look at the following link, a time-series forecasting competition using neural networks in 2010.

On neural networks themselves, possibly one of the best introductions is the very well known Coursera MOOC by Andrew Ng. As for how to apply them in R, this tutorial covers the essentials. And for the common convergence problems when training a neural net, see Efficient BackProp by Yann LeCun et al., 1998.

Reformulation of the problem so that the neural net can solve it

The following picture is a summary of the accelerometer data available for the exercise. It corresponds to around 8 hours of activity of a subject wearing the wristband.

A straightforward problem would be to detect what kind of activity the subject is doing. In order to simplify the problem, while still keeping it not too easy for the neural net, let’s try to build a “sleep detector”. The cutting point is around minute 260, although it is a kind of lousy threshold, since the subject reported that he woke up several times and even went to the toilet after minute 260. The following would then be the “lousy” output to predict (see details in the R Code section below.)

This “sleep detection” correlates somewhat with what a simple detection of activity levels would do, since of course low-activity periods are more likely to be sleep. Anyhow, if the output to predict were deduced from a reasonable threshold on the activity level, it would look as follows.

And after smoothing it, it is possible to see that the two outputs are kind of similar, but not the same.
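As a hedged sketch of such smoothing (re-deriving the activity-based output as in the baseline code further below; the width of 51 windows, about 5 minutes, is an arbitrary choice of mine):

```r
library(zoo)

# activity-based output: one value per non-overlapping window of 75 samples
activity_low <- rollapply(post_data$mod, width = 75, FUN = sd, by = 75) < 0.01
# smooth it with a centered rolling mean over ~5 minutes, then re-threshold
smoothed <- rollmean(as.numeric(activity_low), k = 51, fill = NA,
                     align = "center") > 0.5
```

The rolling mean simply removes isolated spikes and dips in the binary signal, so short wake-ups during sleep no longer flip the output.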

The “lousy” definition of the expected output should be more challenging for the neural net, but also more interesting, since if the objective were just an activity detector, we would not actually need the neural net. (Still, one could argue for a better labeling of the output as “sleep” or “low activity”.)

Inputs to the neural network

As inputs, the neural network will get a window of accelerometer values corresponding to a certain amount of time. The size of the window is fixed to 75 samples, where 750 samples correspond to 1 minute of data.

How can this value of 75 samples (6 seconds) be decided? From the literature (see references above), one possibility would be to run an auto-correlation analysis on the data, to try and figure out the lag dependence in the data, i.e. how many values in the past have a real effect on the current value. This would make a lot of sense for a generic forecasting problem, e.g. a stock-price time series. However, for the accelerometer data the meaning of the series is actually quite intuitive, and the chosen window of around 6 seconds, while it could be challenged, sounds reasonable. Besides, it fixes the input size of the network to a reasonable value as well.
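For reference, such an auto-correlation analysis is straightforward in R with acf(). This is only a sketch (it assumes post_data$mod is already available, as built in the R Code section), and the 0.2 threshold on the autocorrelation is an arbitrary choice:

```r
# inspect the lag dependence of the signal modulus;
# lag.max of 750 covers one minute of samples
ac <- acf(post_data$mod, lag.max = 750, plot = FALSE)
# lags (in samples) where the autocorrelation is still above 0.2;
# acf() includes lag 0 as the first element, hence the "- 1"
which(abs(ac$acf) > 0.2) - 1
```

The largest such lag would then suggest an upper bound for a meaningful window size.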

The results of other possible values (25 and 150 samples per window) are discussed briefly at the end.

Number and size of the layers

The aim of this post is to keep things simple, fast, and manageable. The starting point is a neural net with just one output neuron (which roughly corresponds to a logistic regression model). Then the size of the network is increased in small steps, until the training process becomes too slow on a laptop.

Convergence of the backpropagation process

In order to keep things simple and manageable, as stated above, it is important to take care of convergence problems, as explained e.g. in the Efficient BackProp paper.

First of all, normalizing the inputs is a simple first step, but an extremely important one -probably a good way to learn this would be to forget it once and struggle with the resulting mess.

If one forgot it with the data of this post, all inputs provided to the neural net would always be positive, and as stated in the paper:

When all the components of an input vector are positive, all the updates of the weights that feed into a node would be the same sign […] As a result, these weights can only all decrease or all increase together for a given input pattern. Thus, if a weight vector must change direction it can only do so by zigzagging, which is inefficient and thus very slow.

Additionally, the inputs should be uncorrelated if possible. The inputs are the 75 samples of accelerometer data in a window, but the windows can be built in at least the following two ways:

  1. by dividing the total number of samples into groups of 75 samples, so that each sample belongs to only one group

  2. by using a sliding window of 75 samples, so that each sample actually appears in 75 windows, in a different position in each of them (i.e. as the 1st input, or the 2nd, 3rd, etc.)

At first sight the second approach looks like the better idea, since it is more similar to a real scenario; however, convergence will be far more difficult, since the inputs will be highly correlated. Thus, to keep it simple, let’s stick to the first approach (although we’ve done some trials with the second, see the Variance section below).
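As a toy illustration of the two schemes (a hypothetical vector of 9 samples and a window of 3, just to show the shapes):

```r
x <- 1:9
# 1. non-overlapping windows: each sample ends up in exactly one row
matrix(x, ncol = 3, byrow = TRUE)
# 2. sliding windows: each sample appears in up to 3 rows, shifting position
embed(x, 3)[, 3:1]  # embed() returns lags in reverse order, hence [, 3:1]
```

The first scheme yields 3 rows out of 9 samples; the second yields 7 rows whose columns are clearly correlated, which is exactly the convergence issue discussed above.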

Finally, the activation function which shows the fastest convergence is the hyperbolic tangent (as predicted by the same paper). Accordingly, the values of the binary outputs are set to [-1, 1].
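A quick numeric check of this last point: tanh maps into (-1, 1), so binary targets of -1 and 1 sit at the asymptotes of the activation, whereas 0/1 targets would only use half of its range:

```r
# tanh saturates towards -1 and 1 for large inputs, and is centered at 0
tanh(c(-10, -1, 0, 1, 10))
range(tanh(seq(-100, 100, by = 0.5)))  # always stays within (-1, 1)
```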

Summary of results

When it comes to evaluating the results of the model, let’s once again try and keep it simple, and stick to the F1 score. Let’s also skip the discussion about the meaning and relative importance of false negatives and false positives in the context of the sleep detector. The assumption is that the F1 score will give a good estimate. (Still, the results below show the accuracy as well, since it is a more intuitive measure to understand.)

But first, let’s establish the results of a baseline classifier based on activity levels. This detector considers that the subject is sleeping whenever the activity level is smaller than 0.01.
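The helper getFBetaScore used in the snippets is not defined in the post; a minimal sketch of it (defaulting to F1, and assuming the confusion matrix has the predictions in the rows and the actual values in the columns, as produced by the table() calls below) could be:

```r
# F-beta score from a 2x2 confusion matrix built with table(predicted, actual)
getFBetaScore <- function(cm, beta = 1) {
  tp <- cm["TRUE", "TRUE"]    # predicted positive, actually positive
  fp <- cm["TRUE", "FALSE"]   # predicted positive, actually negative
  fn <- cm["FALSE", "TRUE"]   # predicted negative, actually positive
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}
```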

# if you haven't done it yet, load the file @ http://sisifospage.tech/data/Acceleromenter8hours.Rdata
baseline <- rollapply(post_data$mod, width=75, FUN=sd, by=75) < 0.01
result <- seq_along(baseline) > 2600
cm.base <- table(baseline, result)
cm.base
##         result
## baseline FALSE TRUE
##    FALSE  1113   65
##    TRUE   1487 2195
sum(baseline == result) / length(baseline)
## [1] 0.6806584362
getFBetaScore(cm.base)
## [1] 0.738808482

Thus the accuracy of the baseline is around 68%, with quite a high rate of false positives (i.e. the detector says that the subject is sleeping just because the level of activity is low). The F1 score is 0.74.

For the simplest neural net, with one output neuron (which behaves quite like logistic regression), the accuracy on the test set is 60% and the F1 score is 0.51 (see R code below). For a neural network with 1 hidden layer of 6 neurons, which still takes just a few minutes to train, the accuracy on the test set is about 71%, while the F1 score is 0.76 -i.e. already much better than the baseline. However, this model already shows overfitting.

A larger neural net, with 2 hidden layers, the 1st with 12 neurons and the 2nd with 3, takes approximately the same time to train, but it does not get significantly better results on the test set -again it suffers from overfitting, see the paragraph on Variance below.

Question: was it simple and/or useful?

Neural networks are not the easiest model to train, any mistake can make you waste a lot of time, and the results are not straightforward to interpret. But yes, if you know nothing about time series, the neural network path looks far more promising.

As a conclusion of the results: even with a relatively simple neural network, the baseline can be beaten. The neural network has the advantage that it is able to learn more complex patterns (see e.g. this paper), and, for starters, it avoids the kind of built-in error of the baseline, which assumes the subject is sleeping whenever the activity is low.

Next steps for a finer analysis

Variance

When increasing the complexity of the model little by little (i.e. reducing bias), we were taking the risk of overfitting. The difference in accuracy/F1 score between the training and the test data is a clear hint that there is overfitting: in the model with 2 hidden layers of 12 & 3 neurons respectively, the F1 score on the training set is 0.93, while on the test set it is 0.72.

Even the models with relatively higher bias already seem to suffer from high variance, which could be addressed either with more training data or by using regularization. In some trials we constructed the windows of samples as a sliding window (explained above), and the accuracy results for the model were significantly better, at the cost of convergence speed. Definitely, the results should improve with more fresh input data, while it also makes sense to tune regularization (e.g. via a weight-decay parameter, such as decay in nnet).

Window size

For this particular case of the “sleep detector”, in our trials, the longer the window, the better the results, which makes sense, more or less. Anyway, the drawbacks of a bigger window are a more complex neural network and fewer windows to feed it. The selected size of 75 looks like a good compromise.

Multi-output classification (other kind of activities)

Of course, one could think about labeling (maybe more carefully) different kinds of activities of the subject and building a multi-output neural net.
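neuralnet supports several responses on the left-hand side of the formula, so a hedged sketch of such a multi-output setup could look as follows (the one-hot columns is_sleep, is_walk and is_still, and the table dt.transpose.multi, are hypothetical: the activities would have to be labeled and added to the data first):

```r
# build a multi-response formula over the s0..s74 inputs used in the post
nn_window_size <- 75
f.multi <- as.formula(paste("is_sleep + is_walk + is_still",
                            paste(paste0("s", 0:(nn_window_size - 1)),
                                  collapse = " + "),
                            sep = " ~ "))
# hypothetical fit, analogous to the single-output ones below:
# nn.multi <- neuralnet(f.multi, dt.transpose.multi, hidden = 6,
#                       act.fct = "tanh", linear.output = FALSE)
```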

R Code

Finally, the R code. For the data preparation, the first step is to download this RData file and then load it into your R session.

load("<your download path>/Acceleromenter8hours.RData")
summary(post_data)
##    header_id              id                  x
##  Min.   :312.0000   Min.   :     1.00   Min.   :-2.2580000
##  1st Qu.:314.0000   1st Qu.: 45643.75   1st Qu.:-0.4850000
##  Median :315.0000   Median : 91299.00   Median :-0.1250000
##  Mean   :314.4156   Mean   : 91323.44   Mean   :-0.1703398
##  3rd Qu.:315.0000   3rd Qu.:137071.00   3rd Qu.: 0.1480000
##  Max.   :316.0000   Max.   :184419.00   Max.   : 1.6050000
##        y                     z                   mod
##  Min.   :-2.48500000   Min.   :-2.2660000   Min.   :0.05151699
##  1st Qu.:-0.47700000   1st Qu.:-0.6840000   1st Qu.:0.98969086
##  Median :-0.20400000   Median :-0.5160000   Median :0.99546472
##  Mean   :-0.04410843   Mean   :-0.2350543   Mean   :0.99840825
##  3rd Qu.: 0.60100000   3rd Qu.: 0.0500000   3rd Qu.:1.00551579
##  Max.   : 1.47600000   Max.   : 1.5110000   Max.   :3.29936206
##      minute          window           sample
##  Min.   :  0.0   Min.   :   0.00   Length:364500
##  1st Qu.:121.0   1st Qu.:1214.75   Class :character
##  Median :242.5   Median :2429.50   Mode  :character
##  Mean   :242.5   Mean   :2429.50
##  3rd Qu.:364.0   3rd Qu.:3644.25
##  Max.   :485.0   Max.   :4859.00

Let’s then calculate the modulus of the acceleration (which will be the input of the neural net), and add some info: the minute to which each sample belongs (there are 750 samples per minute), the window (10 windows of 75 samples per minute), and the unique id of the sample within the window (which will come in handy later). The result is copied into a data.table:

post_data$mod <- sqrt(post_data$x^2 + post_data$y^2 + post_data$z^2)
nn_window_size <- 75
nrow(post_data) / nn_window_size
## [1] 4860
post_data$minute <- floor(0:(nrow(post_data)-1)/ 750)
post_data$window <- floor(0:(nrow(post_data)-1)/ nn_window_size)
post_data$sample <- paste("s",
                          rep(0:(nn_window_size-1), max(post_data$window)+1),
                          sep="")
post_data.lousy <- data.table(post_data)
head(post_data.lousy)
##    header_id  id      x      y     z          mod minute window sample
## 1:       312 501 -0.969 -0.149 0.082 0.9838119739      0      0     s0
## 2:       312 502 -0.836  0.109 0.136 0.8539748240      0      0     s1
## 3:       312 503 -1.106 -0.250 0.203 1.1519309875      0      0     s2
## 4:       312 504 -1.079 -0.145 0.218 1.1103107673      0      0     s3
## 5:       312 505 -0.930  0.035 0.015 0.9307792434      0      0     s4
## 6:       312 506 -1.063 -0.192 0.246 1.1078578429      0      0     s5
tail(post_data.lousy)
##    header_id     id     x      y      z          mod minute window sample
## 1:       316 183414 0.437 -0.403 -0.793 0.9910736602    485   4859    s69
## 2:       316 183415 0.437 -0.403 -0.797 0.9942771243    485   4859    s70
## 3:       316 183416 0.437 -0.403 -0.793 0.9910736602    485   4859    s71
## 4:       316 183417 0.433 -0.403 -0.793 0.9893164307    485   4859    s72
## 5:       316 183418 0.437 -0.403 -0.797 0.9942771243    485   4859    s73
## 6:       316 183419 0.437 -0.403 -0.793 0.9910736602    485   4859    s74

The expected output will be 1 if the sample belongs to minute 260 or later, and -1 otherwise.

post_data.lousy$activity <- ifelse(post_data.lousy$minute >= 260, 1, -1)

Very importantly, normalizing the inputs:

post_data.lousy$mod <- scale(post_data.lousy$mod, center=mean(post_data.lousy$mod), scale=sd(post_data.lousy$mod))

And transposing the data.table so that the samples are organized in windows:

dt.transpose.lousy <- dcast.data.table(post_data.lousy,
                                       window + activity ~ sample,
                                       value.var="mod")

Let’s define a random index to divide the dataset into training and test:

set.seed(1342)
index <- sample(1:nrow(dt.transpose.lousy),
                size=floor(0.75*nrow(dt.transpose.lousy)))

The formula for the neural network:

# formula: activity type from the 75 samples in a window
f.al <- as.formula(paste("activity",
                          paste(paste("s", as.character(0:(nn_window_size - 1)), sep=""),
                                collapse="+"),
                          sep="~"))
f.al
## activity ~ s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8 + s9 +
##     s10 + s11 + s12 + s13 + s14 + s15 + s16 + s17 + s18 + s19 +
##     s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 + s29 +
##     s30 + s31 + s32 + s33 + s34 + s35 + s36 + s37 + s38 + s39 +
##     s40 + s41 + s42 + s43 + s44 + s45 + s46 + s47 + s48 + s49 +
##     s50 + s51 + s52 + s53 + s54 + s55 + s56 + s57 + s58 + s59 +
##     s60 + s61 + s62 + s63 + s64 + s65 + s66 + s67 + s68 + s69 +
##     s70 + s71 + s72 + s73 + s74

And we’re ready to fit the neural net. The simplest neuralnet, with only one output neuron, converges in no time:

require(neuralnet)
nn.lousy.3 <- neuralnet(f.al,
                        dt.transpose.lousy[index,],
                        hidden=0,
                        act.fct="tanh",
                        linear.output=FALSE,
                        stepmax=1e5)

The following is to compute the results for the test set:

pr.nn.lousy.3 <- compute(nn.lousy.3,
                         dt.transpose.lousy[-index, paste("s", as.character(0:(nn_window_size - 1)), sep=""), with=FALSE])

And these would be simple steps to evaluate the model:

threshold <- 0
cm.train <- table(unlist(nn.lousy.3$net.result) > threshold, dt.transpose.lousy[index, ]$activity == 1)
cm.train
##
##         FALSE TRUE
##   FALSE  1440  843
##   TRUE    509  853
sum((unlist(nn.lousy.3$net.result) > threshold) == (dt.transpose.lousy[index, ]$activity == 1)) / length(index)
## [1] 0.6290809328
getFBetaScore(cm.train)
## [1] 0.557880968
cm.test <- table(pr.nn.lousy.3$net.result > threshold, dt.transpose.lousy[-index, ]$activity == 1)
cm.test
##
##         FALSE TRUE
##   FALSE   454  292
##   TRUE    197  272
sum((pr.nn.lousy.3$net.result > threshold) == (dt.transpose.lousy[-index, ]$activity == 1)) / length(dt.transpose.lousy[-index, ]$activity)
## [1] 0.5975308642
getFBetaScore(cm.test)
## [1] 0.5266214908

For a more complex network, this takes in the order of minutes:

nn.lousy.7 <- neuralnet(f.al,
                        dt.transpose.lousy[index,],
                        hidden=6,
                        act.fct="tanh",
                        linear.output=FALSE,
                        stepmax=1e6)

While this one takes about the same time, it is blatantly overfitting:

nn.lousy.8 <- neuralnet(f.al,
                        dt.transpose.lousy[index,],
                        hidden=c(12,3),
                        linear.output=FALSE,
                        stepmax=1e6)

Finally, just to cover some of the trials that are not explained in depth in the post, this would be the code to organize the samples as in a sliding window.

post_data.lag <- post_data.lousy[, c("header_id", "id", "mod", "minute", "activity")]
# note that post_data.lousy already has normalized `mod` and the `activity` properly set
nn_window_size <- 75
# replicate the data as many times as the window size
post_data.lag <- do.call(rbind, replicate(nn_window_size, post_data.lag, simplify=FALSE))

# window id
post_data.lag$window <- floor(0:(nrow(post_data.lag)-1)/ nn_window_size)
# vector of sample ids for just one replication of the data
samples <- paste("s",
                 rep(0:(nn_window_size-1), max(post_data$window)+1),
                 sep="")
# sample id vector replicated for each replica of data, each time adding a lag of 1
post_data.lag$sample <- unlist(lapply(0:(nn_window_size-1), function (i) lag(zoo(samples), i, na.pad=TRUE)))

# transpose as usual
dt.transpose.lag <- dcast.data.table(post_data.lag[! is.na(sample),],
                                     window + activity ~ sample,
                                     value.var="mod")
# take a look at consecutive windows (4860 is equal to nrow(post_data)/nn_window_size)
head(dt.transpose.lag[c(1,4861,9721),
                      c('window', 'activity', 's0', 's1', 's2', 's3', 's4', 's5')])
##    window activity            s0            s1            s2           s3
## 1:      0       -1 -0.3625625899 -3.5876387514  3.8134118783  2.779590818
## 2:   4860       -1  1.8204892528 -0.3625625899 -3.5876387514  3.813411878
## 3:   9720       -1  1.7036379129  1.8204892528 -0.3625625899 -3.587638751
##              s4           s5
## 1: -1.679863493  2.718661660
## 2:  2.779590818 -1.679863493
## 3:  3.813411878  2.779590818
idx <- 73
head(dt.transpose.lag[c(idx,4860+idx,(4860*2)+idx,(4860*3)+idx,(4860*4)+idx,(4860*5)+idx),
                      c('window', 'activity', 's0', 's1', 's2', 's3', 's4', 's5')])
##    window activity           s0            s1            s2            s3
## 1:     72       -1 0.2784741481 0.06855847267 0.44922628880 0.93373239170
## 2:   4932       -1 3.4293477910 0.27847414813 0.06855847267 0.44922628880
## 3:   9792       -1 1.4838500951 3.42934779100 0.27847414813 0.06855847267
## 4:  14652       -1 1.2830928299 1.48385009509 3.42934779100 0.27847414813
## 5:  19512       -1 0.7343646452 1.28309282991 1.48385009509 3.42934779100
## 6:  24372       -1 2.9277401291 0.73436464522 1.28309282991 1.48385009509
##               s4            s5
## 1: 1.43101469279 1.64820899479
## 2: 0.93373239170 1.43101469279
## 3: 0.44922628880 0.93373239170
## 4: 0.06855847267 0.44922628880
## 5: 0.27847414813 0.06855847267
## 6: 3.42934779100 0.27847414813
# the lapply above creates 74 (nn_window_size-1) NA's
nrow(dt.transpose.lag)
dt.transpose.lag.na <- na.omit(dt.transpose.lag)
nrow(dt.transpose.lag.na)

# index
set.seed(234)
index.lag <- sample(1:nrow(dt.transpose.lag.na),
                    size=floor(0.75*nrow(dt.transpose.lag.na)))

# simple neuralnet
nn.lag.3 <- neuralnet(f.al,
                      dt.transpose.lag.na[index.lag,],
                      hidden=0,
                      act.fct="tanh",
                      linear.output=FALSE,
                      stepmax=1e5)

Related posts

  • Wearables: measuring physical activity from accelerometer data
    The very first analysis to try for the data coming from an accelerometer: measuring the physical activity of the user. Actually, the simplest statistical measures already tell a lot -in particular the standard deviation of the Signal Vector Magnitude (i.e. the modulus).