datacloning.github.io/usage.Rmd at master · datacloning/datacloning.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
---
title: "Usage"
output: html_document
layout: default
---

Consider a simple Bernoulli model, $$Y_{i} \sim Binomial(1, p)$$,
 $$i=1, 2, \ldots, n$$.
The data generation corresponding to this model looks like
as follows:

```r
set.seed(4321) # set random seed for reproducibility
n <- 25 # sample size
p <- 0.3 # true parameter value
y <- rbinom(n = n, size = 1, prob = p)
```

## Bayesian analysis in JAGS

```r
library(dclone)
library(rjags)

## model specification
model <- custommodel("model {
    for (i in 1:n) {
        #Y[i] ~ dbin(p, 1) # Binomial(N,p)
        Y[i] ~ dbern(p) # Bernoulli(p)
    }
    p ~ dunif(0.001, 0.999)
}")

## data
dat <- list(Y = y, n = n)

## Bayesian MCMC results
fit <- jags.fit(data = dat, params = "p", model = model)

summary(fit)
plot(fit)
```

### Data cloning based maximum likelihood estimation

To make sure that both locations and clones are independent
(i.i.d.), it is safest to include and extra dimension
and the corresponding loop:

```r
## dclone-ified model specification
model <- custommodel("model {
    for (k in 1:K) {
        for (i in 1:n) {
            Y[i,k] ~ dbin(p, 1)
        }
    }
    p ~ dunif(0.001, 0.999)
}")

## dclone-ified data specification
dat <- list(Y = dcdim(data.matrix(y)), n = n, K = 1)

## data cloning based MCMC results
dcfit <- dc.fit(data = dat, params = "p", model = model,
    n.clones = c(1,2,4,8), unchanged = "n", multiply = "K")

summary(dcfit)
plot(dcfit)

coef(dcfit) # MLE
dcsd(dcfit) # asymptotic SEs
vcov(dcfit) # inverse Fisher information matrix
confint(dcfit) # asymptotic confidence interval
```

Data cloning based diagnostics can be used to spot
identifiability issues (which is not the case here):

```r
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
```

## High performance computing

Because MCMC chains are independent, is can be seen as an
[embarrassingly parallel problem](https://en.wikipedia.org/wiki/Embarrassingly_parallel).
On Windows, the only option is through
clusters (the cluster object `cl` initialized
using the `makeCluster` function and alikes).
On other platforms, forking can be used
(`cl <- 3`) besides clusters.

```r
## model specification
model <- custommodel("model {
    for (i in 1:n) {
        #Y[i] ~ dbin(p, 1) # Binomial(N,p)
        Y[i] ~ dbern(p) # Bernoulli(p)
    }
    p ~ dunif(0.001, 0.999)
}")

## data
dat <- list(Y = y, n = n)

## parallel Bayesian MCMC results
cl <- makeCluster(3)
pfit <- jags.parfit(cl, data = dat, params = "p", model = model)
stopCluster(cl)
```

Data cloning increases the size of the
DAG (directed acyclic graph). This means
increasing computation times.
Parallel computations for can be utilized
to cut down computing times.

```r
## dclone-ified model specification
model <- custommodel("model {
    for (k in 1:K) {
        for (i in 1:n) {
            Y[i,k] ~ dbin(p, 1)
        }
    }
    p ~ dunif(0.001, 0.999)
}")

## dclone-ified data specification
dat <- list(Y = dcdim(data.matrix(y)), n = n, K = 1)

## parallel data cloning based MCMC results
cl <- makeCluster(3)
dcpfit <- dc.parfit(cl, data = dat, params = "p", model = model,
    n.clones = c(1,2,4,8), unchanged = "n", multiply = "K",
    n.chains = 3, partype = "parchains")
stopCluster(cl)
```

<ul class="pager">
<li class="previous"><a href="{{ site.baseurl }}/install.html"><i class="fa fa-caret-left"></i> Install</a></li>
</ul>