Generating overlapping/overlaying decile frequency histograms in Stata

I recently had a dataset with two groups (0 or 1), and a continuous variable. I wanted to show how the overall deciles of that continuous variable varied by group. Step 1 was to generate an overall decile variable with an –xtile– command. Step 2 was to make a frequency histogram. BUT! I wanted these histograms to overlap and not be side-by-side. Stata’s handy –histogram– is a quick and easy way to make histograms by groups using the –by– command, but it makes them side-by-side like this, and not overlapping. (Note: see how to use –twoway histogram– to make overlapping histograms at the end of this post.)

I instead used a collapse command to generate a count of # in each decile by group (using the transparent color command as color percent sign number), like this:

Here’s the code to make both:

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

// make overall deciles of continuousvalue
xtile decilesbygroup = continuousvalue, nq(10)

// now make a frequency histogram of those deciles
set scheme s1color // I like this scheme
hist decilesbygroup, by(var0or1) frequency bin(10)

// make a variable equal to 1 that we will sum in collapse
gen countbygroup = 1
// now sum that variable by the 0 or 1 indicator and deciles
collapse (sum) countbygroup, by(var0or1 decilesbygroup)
// now render the count from above as a bar graph:
set scheme s1color // I like this scheme
twoway ///
(bar countbygroup decilesbygroup if var0or1==0, vertical color(red%40)) ///
(bar countbygroup decilesbygroup if var0or1==1, vertical color(blue%40)) ///
, ///
legend(order(1 "var0or1==0" 2 "var0or1==1")) ///
title("Title!") ///
xtitle("Decile of continuousvalue") ///
xla(1(1)10) ///
yla(0(10)70, angle(0)) ///
ytitle("N in Decile")

You could also offset the deciles by the var0or1 and shrink the bar width a bit to get a frequency histogram where the bars are next to each other, like this:

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

// make overall deciles of continuousvalue
xtile decilesbygroup = continuousvalue, nq(10)

// now make a frequency histogram of those deciles
set scheme s1color // I like this scheme
hist decilesbygroup, by(var0or1) frequency bin(10)

// offset the decilesbygroup by var0or1 a bit:
replace decilesbygroup = decilesbygroup - 0.2 if var0or1==0
replace decilesbygroup = decilesbygroup + 0.2 if var0or1==1

// make a variable equal to 1 that we will sum in collapse
gen countbygroup = 1
// now sum that variable by the 0 or 1 indicator and deciles
collapse (sum) countbygroup, by(var0or1 decilesbygroup)
// now render the count from above as a bar graph:
set scheme s1color // I like this scheme
twoway ///
(bar countbygroup decilesbygroup if var0or1==0, vertical color(red%40) barwidth(0.4)) ///
(bar countbygroup decilesbygroup if var0or1==1, vertical color(blue%40) barwidth(0.4)) ///
, ///
legend(order(1 "var0or1==0" 2 "var0or1==1")) ///
title("Title!") ///
xtitle("Decile of continuousvalue") ///
xla(1(1)10) ///
yla(0(10)70, angle(0)) ///
ytitle("N in Decile")

A few quick notes here: The way that I am specifying the “bins” for the histograms here is different than how Stata specifies bins for histograms, since I’m forcing it to render by decile. If you were to generate a histogram of the “continuousvalue” instead of the above example using “decilebygroup”, you’ll notice that the resulting histograms looks a bit different from each other:

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

// make overall deciles of continuousvalue
xtile decilesbygroup = continuousvalue, nq(10)

// now make a frequency histogram of those deciles
set scheme s1color // I like this scheme
hist decilesbygroup, title("hist decilesbygroup") by(var0or1) frequency bin(10) name(a)
hist continuousvalue, title("hist continuousvalue") by(var0or1)  frequency bin(10) name(b)

Also, this code will only render frequency histograms, not density histograms, which are the default in Stata. You can also use the –twoway hist– command to overlay two bar graphs, but these might not perfectly align with the deciles. But, using the –twoway hist– allows you to use density histograms instead. See the example that follows. I suspect that most people will get what they need with the –twoway hist– command in Stata.

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

set scheme s1color // I like this scheme
twoway ///
(hist continuousvalue if var0or1==0, bin(10) color(red%40) density) ///
(hist continuousvalue if var0or1==1, bin(10) color(blue%40) density) ///
, ///
legend(order(1 "var0or1==0" 2 "var0or1==1")) ///
title("Title!") ///
xtitle("Grouping in 10 Bins") 

Making Box Plots in Stata from scratch

Stata has some handy built-in features to make boxplots. If you are trying to make a typical boxplot in Stata, go read up on the –graph box– command as described on this post.

I was asked to abstract a boxplot from an old paper and re-render it in Stata.

I used the excellent WebPlotDigitizer to abstract the points in these figure. It took me a few rounds of data abstraction to get all of the data. I generated the following variables

  • (1) “row” – rows 1-8 — the “A-D” labels on the x axis are offset at 1.5, 3.5, 5.5, and 7.5, respectively,
  • (2) “group” – an indicator for group 1 or 2,
  • (3-5) “boxlow” “boxmid” and “boxhigh” – the lower, mid, and upper bounds of the box,
  • (6-7) “bar1” and “bar2” – the low and upper end of the bars, and
  • (8) “dot” – extreme values.

I used “input” at the top of a do file to load these data. The colors were taken from ColorBrewer2. The “box” are just really wide pcspike lines with overlaying horizontal bars as floating text boxes to indicate the bottom, median, and top of the box. One problem is the placement of the horizontal lines on and above/below the box itself, since the “―” ascii character isn’t perfectly in the middle of the vertical space that is used to render it. To get around this, I included an offset value that could be modified to shift these lines up and down so they lined up with the boxes. The “whiskers” are an rcap. The extreme values are just scatterplot dots.

Code to make this follows, data here are fake.

// abstracted with this: https://apps.automeris.io/wpd/
clear all

input row group boxlow boxmid boxhigh bar1 bar2 dot
1 1 110 130 140 100 150 .
2 2 115 125 135 90 150 .
3 1 135 145 155 120 175 . 
4 2 80 110 115 70 125 .
5 1 160 175 180 140 200 .
6 2 120 130 140 110 160 .
7 1 145 160 170 135 190 .
8 2 120 135 155 110 160 .
1 . . . . . . 95
1 . . . . . . 100
1 . . . . . . 155
1 . . . . . . 160
2 . . . . . . 80
2 . . . . . . 85
2 . . . . . . 155
2 . . . . . . 160
3 . . . . . . 110
3 . . . . . . 115
3 . . . . . . 180
3 . . . . . . 185
3 . . . . . . 190
4 . . . . . . 60
4 . . . . . . 65
4 . . . . . . 130
4 . . . . . . 135
5 . . . . . . 130
5 . . . . . . 135
5 . . . . . . 210
5 . . . . . . 140
6 . . . . . . 100
6 . . . . . . 105
6 . . . . . . 170
6 . . . . . . 175
7 . . . . . . 125
7 . . . . . . 130
7 . . . . . . 200
7 . . . . . . 210
8 . . . . . . 100
8 . . . . . . 105
8 . . . . . . 170
8 . . . . . . 175
end



// offset to move the bars on the box up or down on Y-axis,
// tweak as needed:
local offset = 2

set scheme s1mono

twoway ///
(rcap bar1 bar2 row, vert lcolor(black)) /// code for 95% CI
(pcspike boxlow row boxhigh row if group==1, vert lwidth(vvvthick) lcolor("141 211 199")) ///
(pcspike boxlow row boxhigh row if group==2, vert lwidth(vvvthick) lcolor("255 255 179")) ///
(scatter dot row , mcolor(black) msymbol(o) msize(tiny)) ///
, ///
text(`=boxmid[1]+`offset'' 1 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[1]+`offset'' 1 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[1]+`offset'' 1 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[2]+`offset'' 2 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[2]+`offset'' 2 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[2]+`offset'' 2 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[3]+`offset'' 3 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[3]+`offset'' 3 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[3]+`offset'' 3 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[4]+`offset'' 4 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[4]+`offset'' 4 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[4]+`offset'' 4 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[5]+`offset'' 5 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[5]+`offset'' 5 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[5]+`offset'' 5 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[6]+`offset'' 6 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[6]+`offset'' 6 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[6]+`offset'' 6 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[7]+`offset'' 7 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[7]+`offset'' 7 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[7]+`offset'' 7 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[8]+`offset'' 8 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[8]+`offset'' 8 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[8]+`offset'' 8 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
legend(order(2 "Group 1" 3 "Group 2") row(2) size(small)) ///
scale(1.3) ///
///
xtitle("X Title") /// 
ytitle("Y Title") ///
///
title("Title") ///
xsize(5) ///
ysize(6) ///
yla(0(50)200, angle(0)) ///
xscale(r(.5 8.5)) ///
xla(1.5 "A" 3.5 "B" 5.5 "C" 7.5 "D") ///
xline(2.5 4.5 6.5)