Extracting numbers from strings in Excel

Stata’s great at taking raw numbers and chugging out graphs with minimal edits. Oftentimes you’ll get results that aren’t raw numbers but instead exist as a string. Instead of getting:

…you’ll get something like “0.80 (0.70, 0.90)” crammed into a single cell.

My previous strategy has been to manually extract these numbers into rows and columns. I just came across these two pages of Excel strategies that will pluck the numbers in the second picture and produce the first one!

Step 1: Split strings into separate columns.

Reference: https://support.office.com/en-us/article/split-text-into-different-columns-with-functions-49ec57f9-3d5a-44b2-82da-50dded6e4a68

Here, we’ll split the string at the spaces. There are two spaces, so you’ll end up with three new variables. First, make three new ‘temp’ columns to the right. You’ll need a different formula for each section’s destination cell.

Left bit, or “0.80”:

=LEFT(A2, SEARCH(" ",A2,1))

Middle bit, or “(0.70,”:

=MID(A2,SEARCH(" ",A2,1)+1,SEARCH(" ",A2,SEARCH(" ",A2,1)+1)-SEARCH(" ",A2,1))

Right bit, or “0.90)”:

=RIGHT(A2,LEN(A2)-SEARCH(" ",A2,SEARCH(" ",A2,1)+1))

This assumes that your string of interest is sitting in cell A2. Change this cell reference as needed. If all goes well, you should have new cells that look like this:

Step 2: Pluck numbers from strings.

Reference: https://www.ablebits.com/office-addins-blog/2017/11/22/excel-extract-number-from-string/

Welp, there’s still some non-numeric text here. Time to pluck out the raw numbers! We’ll pretend that the ‘hrtemp’ cell also has a non-numerical character in it (e.g., a percent sign) for completeness’ sake. (Excel actually considers it a string still, which is why it’s not showing as “0.8”). Make 3 new columns to the right and use this formula to extract the raw numbers contained in the string:

=(SUMPRODUCT(MID(0&B2, LARGE(INDEX(ISNUMBER(--MID(B2, ROW(INDIRECT("1:"&LEN(B2))), 1)) * ROW(INDIRECT("1:"&LEN(B2))), 0), ROW(INDIRECT("1:"&LEN(B2))))+1, 1) * 10^ROW(INDIRECT("1:"&LEN(B2)))/10))/100

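…obviously, you’ll need to change the B2 cell to C2 and D2 as needed. Note that the formula keeps only the digits, so it drops the decimal point (“0.80” becomes 80). The /100 at the end puts the decimal back and assumes two digits after the decimal point. You should get: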

Boom! You should be able to copy the formulas to subsequent rows by hovering over the bottom right of each cell and dragging down. Now you can “import excel” in Stata and use your hr, low95, and high95 variables!
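If you’d rather skip Excel, the same split-then-pluck idea can probably be done inside Stata. This is only a rough sketch and not the method from the two reference pages above. The variable name ‘result’ and the second example string are made up for illustration.

```stata
// Hypothetical example data: one string variable holding "estimate (low, high)"
clear
input str20 result
"0.80 (0.70, 0.90)"
"1.25 (1.10, 1.42)"
end

// Step 1: split at the spaces, which makes part1, part2, and part3
split result, parse(" ") generate(part)
rename (part1 part2 part3) (hr low95 high95)

// Step 2: strip the parentheses and commas, then convert to numbers
destring hr low95 high95, replace ignore("(),")
list hr low95 high95
```

Unlike the Excel formula, destring keeps the decimal point, so there is no dividing by 100 at the end.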

Make a Table 1 in Stata in no time with table1_mc

What’s in a Table 1?

Baseline demographic tables (colloquially known as ‘Table 1’ given their common location) are a core feature of nearly all epidemiologic manuscripts. The columns represent the exposure you are studying. The rows are characteristics of your population that are relevant to your research project. In placebo-controlled RCTs, the columns are drug and placebo. In observational studies, the column is your exposure of interest. Say you are curious about the relationship between smoking and development of breast cancer in a cohort. Here, the columns would be smoking and no smoking.

Wait, I’m looking at a Table 1 that has more than just a column for each exposure!

There are certain variations that you’ll see in Table 1s:

  • A column for the entire population – This always seems overkill to me.
  • A column with P-values – These are of no value in RCTs in my opinion. They are only occasionally helpful in observational studies.

The ultimate design of the Table 1 will be dictated by the target journal. This creates challenges for authors, who may need to rework Table 1s in the submission (and resubmission) process.

Why have Table 1s historically been such a pain in the butt to make in Stata?

Well, Stata doesn’t natively pop out Table 1s. Formulating one either requires manually running –sum– commands over and over again or writing custom code to help automate this for you.

Enter table1_mc

The Stata program table1_mc was released by Mark Chatfield, a biostatistician at the University of Queensland. It’s a derivation of the original table1 program by Phil Clayton. It’s a work of wonder. It automates the generation of a Table 1 with a few simple lines of code. Need to reformat for a new target journal? Make minor changes, hit re-run, and “POOF”, out pops an updated and compliant Table 1.

Step 1: Install the program


ssc install table1_mc

Step 2: Label your variables

Pluck out the variables you’ll include as the exposure and outcome. The table1_mc code will apply your bizarre, space-less variable name to the output unless you are using labels. Use real capitalization and formatting, as you’d want them to appear in the table.

Step 2a: Labeling the variable itself

Let’s say you want to label your systolic blood pressure variable ‘sbp’ to be ‘Systolic blood pressure, mm Hg’. Type:

label variable sbp "Systolic blood pressure, mm Hg" 

Step 2b: Labeling the categories within variables

My suggestion is to generate a numerical ordinal variable and apply the labels to the numbers. The table1_mc program will put things in alphabetical or numerical order. Applying labels to numbers makes it easy to control the order. In this example, I have income categories that I’ll make into a numerical ordinal variable first. In the raw dataset, the categories are stored as strings like “$20k-$34k”.

gen income1234=.
replace income1234=1 if income_4cat=="less than $20k"
replace income1234=2 if income_4cat=="$20k-$34k"
replace income1234=3 if income_4cat=="$35k-$74k"
replace income1234=4 if income_4cat=="$75k and above"
replace income1234=99 if income_4cat=="Refused"


Now label the values in two steps:

1. Define the labels that you want to apply to income1234’s values of 1, 2, 3, 4, or 99, and

2. Apply the stupid labels. I always forget to apply the labels to the categorical values.

label define income_labels 1 "<$20K" 2 "$20k-$34k" 3 "$35k-$74k" 4 "$75k and above" 99 "Refused" // define the labels
label values income1234 income_labels // apply the labels!!

And, while you’re at it, don’t forget to apply a label to the overall ‘income1234’ variable that you made.

label variable income1234 "Annual household income"

Step 3: Make a table 1

The help document (type ‘help table1_mc’) is a must read. Please look at it.

First: Start with ‘table1_mc,’ then the exposure expressed as ‘by(EXPOSURE VARIABLE NAME)’. Then just list out the variables you want in each row one by one. Each variable should have an indicator for the specific data types:

  • Binary:
    • bin – binary with P-value from Pearson’s chi2
    • bine – binary with P-value from Fisher’s exact
  • Continuous:
    • contn – normally distributed, continuous variable, which will give mean and SD
    • contln – log-normally distributed, continuous variable, which will give geometric mean and GSD
    • conts – other continuous variable, which will give median and IQR.
  • Categorical:
    • cat – categorical with P-value from Pearson’s chi2
    • cate – categorical with P-value from Fisher’s exact

After the code telling Stata which data type you are using, you tell it the display format you want for that variable. Stata defaults to a lot of decimals. If you don’t specify, mean age may be presented as ‘42.818742022’. What a mess.

You can probably do 99% of your formatting with two formats:

  • %4.0f – a field four characters wide, nothing after the decimal (e.g., 43)
  • %4.1f – a field four characters wide, one digit after the decimal (e.g., 42.8)
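To see what these two formats do before putting them in a table, you can try them with -display-. This is just a quick check using the mean age from the example above.

```stata
// Same number, two display formats
display %4.0f 42.818742022   // shows 43, padded to four characters
display %4.1f 42.818742022   // shows 42.8
```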

Next, separate each variable with a backslash (‘\’). I like to break each line using the three forward slashes after (‘///’) so that I don’t have one ungodly line of text.

FINALLY, tell it some key options at the end:

  • Ones I recommend including every time:
    • onecol – categorical variables will have a header that’s an extra leading row before they are presented, rather than a whole separate column.
    • missing – this keeps missing values included as their own category. Helpful to show missingness of categorical variables.
    • nospace – this will drop dead spaces before single digit numbers. E.g., it’ll present ‘(3%)’ instead of ‘( 3%)’.
    • saving – output the Table 1 to Excel. Make sure that the Excel file output is not open in an Excel window when trying to overwrite a table. Otherwise, Stata will not run and you will be sad.
  • Simple things to help reformatting for journals:
    • [nothing] – presents n (%)
    • percent – presents a % alone without including the n
    • percent_n – % (n)
    • slashN – n/N instead of just n
    • total(before) – leading column with overall baseline demographics.

Some actual code to run table1_mc!

// install it!
ssc install table1_mc

// now specify things by "myexposure"
table1_mc, by(myexposure) ///
vars( ///
age contn %4.0f \ ///
sex0m1f bin %4.0f \ ///
race0w1b bin %4.0f \ ///
region123 cat %4.0f \ ///
educ1234 cat %4.0f \ ///
income1234 cat %4.0f \ ///
sbp contn %4.0f \ ///
dbp contn %4.0f \ ///
smoke7_ideal bin %4.0f \ ///
pa7_ideal bin %4.0f \ ///
diet7_ideal bin %4.0f \ ///
chol7_ideal bin %4.0f \ ///
fpg7_ideal bin %4.0f \ ///
bmi7_ideal bin %4.0f \ ///
bp7_ideal bin %4.0f \ ///
) ///
nospace percent onecol missing total(before) ///
saving("table 1.xlsx", replace)

…And here’s the (fake) result!

I’m working on an actual analysis right now so replaced all of the data from the actual output above with fake numbers. But you get the idea!

The example table!

Rendering XKCD #2023 “Misleading Graph Makers” in Stata

Let’s render an XKCD comic using Stata!

I loved today’s XKCD comic so I decided to take some time while eating my sandwich to write a .do file script to render it in Stata. There aren’t great smooth line options without figuring out the exact function for each line in Stata, so I approximated the data points. One interesting problem was including quotes in the X axis label since quotation marks are used to define the label and line breaks for labels. The solution was Stata’s compound double quotes: open with a tick followed by a quotation mark (the tick is to the left of number 1 on your keyboard) and close with a quotation mark followed by an apostrophe. This is also a nice example of how to input data in a .do file.
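Here is a tiny sketch of that quoting trick on its own, using -display- rather than a graph option, so you can try it without any data.

```stata
// Plain double quotes would end the string at the first inner quotation mark.
// Compound double quotes let the inner quotation marks through.
display `"The "carefully" chosen Y-axis range"'
```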

End result:

Code follows.

clear all

input id proportion band1 band2 band3 band4 band5 band6 band7 band8 band9 band10
0 . 21 22 23 24 25 26 27 28 29 30
0.3 . 21 22 23.7 25.5 26.3 28 28.8 29.2 29.5 30
0.5 . 20.8 22.5 24.7 27 28 29 29.2 29.4 29.7 30
0.7 . 20.6 25 27.4 28.4 29 29.3 29.5 29.6 29.9 30
0.9 . 20.1 28 28.5 29 29.3 29.5 29.7 29.8 29.9 30
1 23 20.1 28.5 29 29.3 29.5 29.6 29.7 29.8 29.9 30
1.3 . 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
2 23.5 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
3 22.3 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
4 23.5 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
5 23 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
6 28 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
end

set scheme s1mono

graph twoway ///
(connected proportion id, lcolor(gs0) mcolor(gs0)) ///
(scatter band1 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band2 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band3 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band4 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band5 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band6 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band7 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band8 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band9 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band10 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
, ///
title(Y-Axis) ///
xlabel(none) /// 
xline(1(1)6, lpattern(solid) lcolor(gs12)) ///
ylabel(20 "0%" 25 "50%" 30 "100%", angle(0)) ///
aspect(1) ///
ytitle("") ///
legend(off) ///
xtitle(`"People have wised up to the "carefully"' ///
`"chosen Y-axis range" trick, so we misleading"' ///
"graph makers have had to get creative.")

graph export xkcd_2023.png, width(1000) replace

Making Scatterplots and Bland-Altman plots in Stata

NOTE: I have more recently put together user-friendly code on scatterplots and Bland-Altman plots. Go check those pages out.

Scatterplots with a fitted line

This is pretty straightforward in Stata. This is a variation of a figure that I made for a JAMA Internal Medicine paper. (I like to think that the original figure was publication quality, but they had their graphics team redo it in their own format.) This contains a diagonal line of unity and cutoffs for systolic hypertension as vertical/horizontal bars. Different from the publication version, this includes a thick dotted fitted line.

This code actually makes two different scatterplots then merges them together and puts labels over the merged version.

The scatter function here wants a Y-axis variable and X-axis variable. The two variables were ibp_sbp_pair (Y-axis) and average_sbp_omron_23 (X-axis) for systolic and ibp_dbp_pair and average_dbp_omron_23 for diastolic.

Bland-Altman plots

This is from the same paper.

Again, two different B-A plots that are merged then labels applied. The dotted line is relative mean difference, the long dashed lines are +/- 2 SD.

As far as Stata’s graph maker is concerned, this is a scatterplot. You just need to set up all of the variables intentionally to trick it into rendering a B-A plot. The Y-axis is the difference between the variables and the X-axis is a mean of the variables.
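Before the full code, here is a stripped-down sketch of the same trick on simulated data, in case you want to try it without a dataset. Every variable name and number in it is made up. It uses local macros rather than the globals in my code below, so run it all at once from a .do file.

```stata
// Simulated paired readings (hypothetical names and values)
clear
set seed 2023
set obs 100
generate reference  = rnormal(130, 15)
generate comparator = reference + rnormal(2, 6)

// Set up the two Bland-Altman axes
generate pair_mean = (reference + comparator)/2   // x-axis: mean of the pair
generate pair_diff = reference - comparator       // y-axis: difference of the pair

// Store the mean difference and the +/- 2 SD limits
summarize pair_diff
local mean_diff = r(mean)
local lower = r(mean) - 2*r(sd)
local upper = r(mean) + 2*r(sd)

// It is just a scatterplot with three horizontal lines
twoway scatter pair_diff pair_mean, ///
    yline(`mean_diff', lpattern(shortdash)) ///
    yline(`lower' `upper', lpattern(dash)) ///
    legend(off)
```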

Code for both figures follows.

**********scatterplot ave with lines of fit
twoway (lfit ibp_sbp_pair average_sbp_omron_23, lcolor(gray) lpattern(dash) lwidth(vthick)) /// line of fit code
(function y=x, ra(average_sbp_omron_23) clcolor(gs4)) /// diagonal line of unity
(scatter ibp_sbp_pair average_sbp_omron_23 , mcolor(black) msize(vsmall)), /// make dots appear for scatter, y x axis
legend(off) /// hide legend
title("Systolic BP", color(black)) ///
ytitle("") /// no title, will add when merging SBP and DBP
xtitle("") /// ditto
xline(140, lpattern(solid) lcolor(gray)) /// cutoff for systolic hypertension
yline(140, lpattern(solid) lcolor(gray)) /// ditto
graphregion(color(white)) ylabel(, grid glcolor(gs14)) /// white background, light gray lines
xlabel(90(20)170) ylabel(90(20)170) /// where X and Y labels occur
aspectratio(1) // force figure to be a 1x1 square, not a rectangle
graph save 20_sbp_scatterplot_fit.gph, replace // need graph to merge later
graph export 20_sbp_scatterplot_fit.png, width(4000) replace

twoway (lfit ibp_dbp_pair average_dbp_omron_23, lcolor(gray) lpattern(dash) lwidth(vthick)) /// 
(function y=x, ra(average_dbp_omron_23) clcolor(gs4)) ///
(scatter ibp_dbp_pair average_dbp_omron_23, mcolor(black) msize(vsmall)), /// 
legend(off) ///
title("Diastolic BP", color(black)) ///
ytitle("") ///
xtitle("") ///
xline(90, lpattern(solid) lcolor(gray)) ///
yline(90, lpattern(solid) lcolor(gray)) ///
graphregion(color(white)) ylabel(, grid glcolor(gs14)) ///
xlabel(30(20)110) ylabel(30(20)110) ///
aspectratio(1) // square figure, as in the SBP plot; this also ends the twoway command
graph save 21_dbp_scatterplot_fit.gph, replace
graph export 21_dbp_scatterplot_fit.png, width(4000) replace

****combined scatterplot
graph combine 20_sbp_scatterplot_fit.gph 21_dbp_scatterplot_fit.gph, /// 
graphregion(color(white)) ///
b1title("Standard (mmHg)") ///
l1title("IBP (mmHg)")
graph save combined_scatterplots_fit.gph, replace // 
graph export combined_scatterplots_fit.png, width(4000) replace

***************************Bland-altman plots
***sbp
***prep for figure
gen mean_sbp_ave=(average_sbp_omron_23+ibp_sbp_pair)/2 // this will be the x-axis
gen diff_sbp_ave=ibp_sbp_pair-average_sbp_omron_23 // this will be y-axis
sum diff_sbp_ave // this allows you to make a macro of the mean ("r(mean)") of the y axis variable
global mean1=r(mean) // this saves the macro as mean1, to be called later
global lowerCL1=r(mean) - 2*r(sd) // this saves a macro for the mean minus 2 times the SD ("r(sd)")
global upperCL1=r(mean) + 2*r(sd) // ditto, for the mean plus 2 times the SD
***make graph
graph twoway scatter diff_sbp_ave mean_sbp_ave, ///
legend(off) mcolor(black) ///
ytitle("") /// ytitle("Reference Minus Comparator (mmHg)")
xtitle("") /// xtitle("Average of Reference and Comparator (mmHg)")
title("Systolic BP", color(black)) ///
yline($mean1, lpattern(shortdash) lcolor(gray)) /// calls the macro from above
yline($lowerCL1, lpattern(dash) lcolor(gray)) /// ditto
yline($upperCL1, lpattern(dash) lcolor(gray)) ///
graphregion(color(white)) ylabel(, grid glcolor(gs14)) /// white background
ylabel(-40(20)40) xlabel(90(20)170) ///
aspectratio(1.08) // annoyingly, this wasn't a perfectly square figure so this line fixes it.
***save graph
graph save 1_sbp_bland_altman_ave.gph, replace
graph export 1_sbp_bland_altman_ave.png, width(4000) replace

***dbp
***prep for figure
gen mean_dbp_ave=(average_dbp_omron_23+ibp_dbp_pair)/2
gen diff_dbp_ave=ibp_dbp_pair-average_dbp_omron_23
sum diff_dbp_ave
global mean1=r(mean)
global lowerCL1=r(mean) - 2*r(sd)
global upperCL1=r(mean) + 2*r(sd)
***make graph
graph twoway scatter diff_dbp_ave mean_dbp_ave, ///
legend(off) mcolor(black) ///
ytitle("") ///
xtitle("") ///
title("Diastolic BP", color(black)) ///
msize(vsmall) ///
yline($mean1, lpattern(shortdash) lcolor(gray)) ///
yline($lowerCL1, lpattern(dash) lcolor(gray)) ///
yline($upperCL1, lpattern(dash) lcolor(gray)) ///
graphregion(color(white)) ylabel(, grid glcolor(gs14)) ///
ylabel(-40(20)40) xlabel(30(20)110) ///
aspectratio(1.08)
***save graph
graph save 2_dbp_bland_altman_ave.gph, replace
graph export 2_dbp_bland_altman_ave.png, width(4000) replace

***********combined image bland altman
// both .gph files were saved to the working directory above
graph combine 1_sbp_bland_altman_ave.gph 2_dbp_bland_altman_ave.gph, ///
ycommon /// so the y axes are on the same scale
graphregion(color(white)) ///
b1title("Average of IBP and Standard (mmHg)") ///
l1title("IBP Minus Standard (mmHg)") ///
ysize(3)
graph save combined_dbp_sbp_ba.gph, replace //
graph export combined_dbp_sbp_ba.png, width(4000) replace

The confusing nomenclature of epidemiology and biostatistics

This should be simpler.

Epidemiology and biostatistics are awash with synonyms and each institution has its own preferred nomenclature to describe the same general concepts. I started this page as a central place to document the various terms by concept. I’ll plan on revisiting and updating over time.



You probably learned the fundamentals of regression in introductory algebra but may not realize it.  Remember drawing a graph from a slope-intercept equation? Draw a graph where Y is equal to 1/4x plus 5. (Here is the relevant Khan Academy Algebra I video about this.) You take the general equation:

Y = mx + b

…where Y is the y-axis, m is the slope of the line, and b is where the line crosses the y-axis. The equation you will write is:

Y=1/4x + 5

…and you will draw:

This sounding familiar? When you do a linear regression, you do the same thing. Instead, you regress Y on X, or:

Y = β1x1 + β0

And fitting in the variables here, you want to figure out what a predicted cholesterol level will be for folks by a given age. You would regress cholesterol level on age:

Cholesterol level = β1*Age + β0

Here, β1 is the slope of the line for age (x1), the same as the m in Y=mx+b, and β0 is the intercept on the Y-axis, essentially the same as the b. When you run a regression in Stata, you type

regress y x

or here,

regress cholesterol age

Let’s say that Stata spits out something like:

      Source |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx  
       Model |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx   
    Residual |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx    
       Total |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx   

  cholesterol|      coeff       se         t     P>|t|    [95% Conf. Interval]
         age |     0.500     xxxxxxxx    xxxxx   0.000     0.4000      0.60000
       _cons |     100       xxxxxxxx    xxxxx   0.000     90.000      110.000

The β coefficient for age is 0.5. The intercept, or β0, is 100. You would interpret this as cholesterol level = 0.5*age in years + 100. You could plot this using your Algebra 1 skills.

Cholesterol = 0.5*age + 100

Or you can substitute in actual numbers. What is the predicted cholesterol at age 50? Answer: 125.
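You can let Stata do that arithmetic for you. The sketch below first plugs age 50 into the example equation, then shows the same idea after a real regression. Since I can’t share a cholesterol dataset, the second half uses Stata’s bundled auto dataset, with price and mpg standing in for cholesterol and age.

```stata
// Plug age 50 into the example equation: 0.5*age + 100
display 0.5*50 + 100             // shows 125

// Same idea after an actual regression, using the stored coefficients
sysuse auto, clear
regress price mpg
display _b[_cons] + _b[mpg]*20   // predicted price at 20 mpg
```

After -regress-, _b[_cons] is the intercept (β0) and _b[mpg] is the slope (β1), so the last line is just Y = β1x1 + β0 with numbers filled in.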

If you want to make it more complex and add more variables to explain cholesterol level, it’s no longer a straight line on a graph, but the concept is the same. A multiple linear regression adds more X variables. You can figure out what a predicted cholesterol level will be for folks by age, sex, and BMI. You would regress cholesterol level on age, sex, and BMI. (You would code sex as 0 or 1, like female = 1 and male = 0.)

Y =  β1x1 + β2x2 + β3x3 + β0


Y =  β1*Age + β2*Sex + β3*BMI + β0

You get the idea.

Names of Y and X

This is what irks me. There are so many synonyms for Y and X variables. Here is a chart that I’ll update over time with synonyms seen in the wild.

Y variable   |  X variable
Dependent    |  Independent
Outcome      |  Predictor
             |  Exposure variable
             |  Explanatory variable

Code to make a dot and 95% confidence interval figure in Stata

Dot and confidence interval figures in Stata

Stata has a pretty handy -twoway scatter- code that can be combined with -twoway rcap- to make the figure below. Example code at the bottom.

Horizontal version

Vertical version

First step, make an Excel file

I made an Excel file with the below columns called “dot and 95 percent ci data.xlsx” saved in the same folder as my .do file. This figure will display row 1 at the top and row 14 at the bottom. The gaps in between the lines are the absent rows 3, 6, 9, and 12. Group tells Stata if you want a red diamond or blue square. Proportion is the point estimate and low95 and high95 are the surrounding 95% confidence intervals.

Note that the data here are made up and are not related to any actual ongoing clinical investigation. 

Next step, make a .do file

In the same folder as the Excel file, copy/paste/save the code below as a .do file. Close Excel and close Stata then find the .do file from Windows Explorer and double click it. Doing this will force Stata to set the working directory as the folder containing the .do file (and the Excel file).

If the code won’t work, you probably have Excel open. Close it and try again.

That’s it! Code for both versions of the figure follows.

******************************IMPORT DATA HERE**********************************
import excel "dot and 95 percent ci data.xlsx", firstrow clear
destring, replace

******************************CODE STARTS HERE********************************** 
set scheme s1mono // black and white

twoway ///
(rcap low95 high95 row, horizontal) /// code for 95% CI
(scatter row proportion if group ==1, mcolor(red)) /// dot for group 1
(scatter row proportion if group ==2, mcolor(blue)) /// dot for group 2
, legend(row(1) order(2 "legend 1" 3 "legend 2") pos(6)) /// legend at 6 o'clock position
ylabel(1.5 "Model A" 4.5 "Model B" 7.5 "Model C" 10.5 "Model D" 13.5 "Model E", angle(0) noticks) ///
/// note that the labels are 1.5, 4.5, etc so they are between rows 1&2, 4&5, etc.
/// also note that there is a space in between different rows by leaving out rows 3, 6, 9, and 12 
xlabel(.95 " " 1 "1.0" 1.1 "1.1" 1.2 "1.2" 1.3 "1.3" 1.4 "1.4" 1.5 "1.5" 1.6 " ", angle(0)) /// no 1.6 label
title("Title") ///
xtitle("X axis") /// 
ytitle("Y axis") /// 
yscale(reverse) /// y axis is flipped
xline(1.0, lpattern(dash) lcolor(gs8)) ///
/// aspect (next line) is how tall or wide the figure is

graph export "dot and 95 percent ci figure horiz.png", replace width(2000)
//graph export "dot and 95 percent ci figure horiz.tif", replace width(2000)

******************************CODE STARTS HERE**********************************
set scheme s1mono // black and white

twoway ///
(rcap low95 high95 row, vert) /// code for 95% CI
(scatter proportion row if group ==1, mcolor(red)) /// dot for group 1
(scatter proportion row if group ==2, mcolor(blue)) /// dot for group 2
, legend(row(1) order(2 "legend 1" 3 "legend 2") pos(6)) /// legend at 6 o’clock position
xlabel(1.5 "Model A" 4.5 "Model B" 7.5 "Model C" 10.5 "Model D" 13.5 "Model E", angle(0) noticks) ///
/// note that the labels are 1.5, 4.5, etc so they are between rows 1&2, 4&5, etc.
/// also note that there is a space in between different rows by leaving out rows 3, 6, 9, and 12
ylabel(0.9 "0.9" 1 "1.0" 1.1 "1.1" 1.2 "1.2" 1.3 "1.3" 1.4 "1.4" 1.5 "1.5" , angle(0)) /// no 1.6 label
title("Title") ///
xtitle("X axis") ///
ytitle("Y axis") ///
yline(1.0, lpattern(dash) lcolor(gs8)) ///
/// aspect (next line) is how tall or wide the figure is

graph export "dot and 95 percent ci figure vert.png", replace width(2000)
//graph export "dot and 95 percent ci figure vert.tif", replace width(2000)

Making a horizontal stacked bar graph with -graph twoway rbar- in Stata

Making a horizontal stacked bar graph in Stata

I spent a bit of time making a variation of this figure today. (The data here are made up.) I’m pleased with how it came out. I first tried to use the -graph bar, horizontal- command, but it didn’t give me as much customization as -graph twoway rbar…, horizontal-. I think it looks pretty slick.

Start with an Excel file

I made an Excel file called stacked bar graph data.xlsx that I saved in the same folder as a .do file. I closed Stata and reopened that .do file from Windows explorer so that Stata set the working directory as the same folder that contains the .do file. More importantly, it set the working directory as the same folder that also contains the Excel file.

Group is the number of the individual bars, bottom is the bottom of the first segment of a bar, q1 is the top of the first segment of each bar. The rest should be obvious. I made this for quartiles, hence the q1-4 names. You can tweak the numbers by editing the Excel file, hitting save/close, and rerunning the .do file.

Make sure that you save and close Excel before running the .do file or the .do file won’t run and you will be sad. 

My .do file for making this horizontal stacked bar graph

Here’s my code! I hope it’s useful.

******************************IMPORT DATA HERE**********************************
import excel "stacked bar graph data.xlsx", firstrow clear // make sure that excel
//                                                           is closed before you
//                                                           run this script!
destring, replace

******************************CODE STARTS HERE**********************************
capture ssc install scheme-burd, replace // this installs nicer color schemes
// see the schemes here: https://github.com/briatte/burd
set scheme burd4 

graph twoway ///
(rbar bottom q1 group, horizontal) ///
(rbar q1 q2 group, horizontal) ///
(rbar q2 q3 group, horizontal) ///
(rbar q3 q4 group, horizontal) ///
, /// if you modify this file and it stops working, check the placement of this comma
xscale(log) /// make the x axis log scale
xla(0.25 "0.25" 0.5 "0.5" 1 "1" 2.5 "2.5" 5 "5" 10 "10" 25 "25" 50 "50" 100 "100") ///
yla(1 "Group 1" 2 "Group 2" 3 "Group 3" 4 "Group 4") ///
ytitle("Y title") ///
xtitle("X title") ///
text(4 1.2 "Q1", color(white)) /// first number is y second is x.
text(4 4.3 "Q2", color(white)) ///
text(4 20 "Q3", color(white)) ///
text(4 90 "Q4", color(white)) ///
legend(off) // no legend, aspect is the shape of the figure. 1 is tall and thin.
// export the graph as a PNG file
graph export "stacked graph.png", replace width(2000)
// graph export "stacked graph.tif", replace width(2000) // in case you want as a tiff

Downloading and analyzing NHANES datasets with Stata in a single .do file

Learn to love NHANES

NHANES is a robust, nationally-representative cross-sectional study. For the past ~18 years it sampled different communities across the US in 2 year continuous cycles. A few of these years are linked to National Death Index data, so you can assess risk factors at the time of the survey and use time-to-event mortality data to identify novel risk factors for death. Manuscripts using NHANES data have been published across the spectrum of medical journals, all the way up to NEJM. The best part? You can download almost all of NHANES data from the CDC website right now for free. (Here’s a list of all data variables available.)

Manipulating NHANES data is challenging for beginners because of the sheer quantity of individual files and the requirement for weighting. Plus, all of the files are in SAS XPT format so you have to download, import, save, and merge before you can even think about starting an analysis. To make this data management task slightly more complex, the CDC sporadically publishes interval updates of the source data files on their website. Files may be updated for errors or removed entirely without you knowing about it. (I strongly recommend subscribing to the NHANES Listserv to get real-time updates.) If you have NHANES files saved locally from 2-3 years ago, there’s a reasonable chance that you are using outdated databases, which could yield some false conclusions. Re-downloading all of the many files every time you want to do a project is a big headache.

Leverage Stata’s internet connectivity to make NHANES analyses easy

I love that Stata will download datasets for you with just a URL. The .do file below shows you how easy it is to download just the needed files on the fly then do some simple analyses. This means that you don’t have to worry about maintaining your personal database of NHANES files. If the source files are updated by the CDC, no worry! Every time you run this .do file, it’ll grab the freshest data files available. If the source data files are removed for inaccuracies, the file won’t run and you’ll be prompted to investigate. For example, the 2011-2012 Folate lab results were withdrawn February 2018. If you tried to download the FOLATE_G file, CDC’s website will give their version of a 404 error and Stata will stop the .do file cold in its tracks.

What this .do file does

In this short script, you’ll see how to 1. Import the NHANES SAS XPT files directly from the CDC website with just the XPT file’s URL, 2. Save data as Stata .dta files, 3. Merge the .dta files, 4. Review basic coding issues, 5. Run an analysis using weighting, and 6. Display data.

In this example, we’ll look at the 2009-2010 NHANES results and apply weighting to estimate the amount of the US population who have been told that they have high blood pressure. Just copy/paste the code below and save into a .do file. Set the working directory to be in the same folder as your .do file. Or, copy/paste the .do file, save it, close Stata, open the .do file through Explorer, and then run! Opening the .do file from Windows Explorer with Stata closed sets the .do file’s parent folder to be Stata’s working directory.

What this .do file doesn’t do

You won’t be able to merge files with multiple entries per participant. For example, the Prescription Medications – Drug Information questionnaire has a row for each medication a participant takes, along with the ICD code for its use. You’ll need to use the reshape wide command on those variables before merging. BTW, that code is:

bys seqn: gen j= _n
reshape wide rxddrug rxddrgid rxqseen rxddays, i(seqn) j(j)

This also won’t merge with the National Death Index files, which are hosted elsewhere. The NHANES-NDI linkage website provides an example Stata .do script that would be straightforward to include below.

Also, if you are trying to combine analyses from multiple NHANES cycles (say, combining 2009-2010 with 2011-2012), things get a bit more complicated. You’ll need to append .dta files and consider adjusting the weights.

Finally, if you are interested in using Frames functionality to “merge” a dataset (aka skip saving multiple dta files in the process of building an analytical dataset), then check out this post.

// Link to all NHANES datasets: https://wwwn.cdc.gov/nchs/nhanes/default.aspx
// 1. Click on years of interest (e.g., 2009-2010).
// 2. Scroll down, click on data of interest (e.g., demographics).
// 3. Right click on the XPT data file and copy the URL.
// 4. Note that the DOC file containing an overview of the file is right there.
//    Take a peek at its contents and return with questions. 
*********************download the demographics!!**********************
// The demographics file contains the weights.
// You ALWAYS need the demographics file.
// Paste the URL for the demographics of interest below:
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear
// NOTE: if you are using Stata 15 or older, above should be "sasxport"
// and not "sasxport5". Ditto for all "sasxport" commands that follow. 
sort seqn // Sort the file in order of the unique identifiers. Not necessary,
//           but an unsorted file will occasionally cause analytic issues.
// Save as a Stata dataset
save "DEMO_F.dta", replace
*********************download other files*****************************
// Let's look at the "questionnaire" for "blood pressure & cholesterol"
// and "kidney conditions - urology"
// Lost? Go back to the link on the very first line of this .do file
// and click on the year of interest again (2009-2010), scroll down and
// click "questionnaire". 
// Paste the URL for the BP & cholesterol below:
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
// Sort and save as a Stata dataset
sort seqn 
save "BPQ_F.dta", replace
// We aren't going to use the kidney file in this analysis, but as an
// example of how to merge a second dataset, copy/paste the URL for kidney conditions
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
// Sort and save as a Stata dataset
sort seqn
save "KIQ_U_F.dta", replace
*********************merge your datasets******************************
clear all // clear all memory
// start with the demographics file
use "DEMO_F.dta", clear
// Merge demographics with bp & chol
// The unique identifier in this dataset is "seqn", it may be different in other
// NHANES datasets.
merge 1:1 seqn using "BPQ_F.dta"
drop  _merge // this is a merge variable that needs to be dropped prior to 
//              merging other datasets. This variable will be helpful 
//              in exploring causes of merging issues. 
// Merge with kidney conditions datafile
merge 1:1 seqn using "KIQ_U_F.dta"
drop  _merge
// you can now save as a merged data file if you want. 
save nhanesBPall.dta, replace
*********************do an analysis!!*********************************
// Fortunately, Stata makes this easy with built in survey commands.
// You need to tell it that it's survey data with the "svyset" command then use 
// specific functions designed for weighted analyses.
// To explore these available functions: 
// 1. Check out the drop-down menu in stata, statistics --> survey data
// 2. Stata 13 documentation about this code: 
//            https://www.stata.com/manuals13/svysvyset.pdf
// 3. A helpful video from Stata: https://www.youtube.com/watch?v=lRTl8GKsZTE
// For the more recent continuous NHANES ones, here are the variables needed for
// weighting:
//   weight for interview data: wtint2yr 
//   weight for laboratory (MEC) data: wtmec2yr 
//   Sampling units (PSU): sdmvpsu
//   Strata: sdmvstra
//   Single unit: there isn't one. 
// NOTE: Older NHANES datasets use different variables
// We are using interview data (questionnaires), so our command is
svyset sdmvpsu [pweight = wtint2yr], strata(sdmvstra) vce(linearized) singleunit(missing)
// Syntax: "svy: COMMAND var" - for a list of commands, type "help svy_estimation"
// The 4 basic survey descriptive commands:
// 1. mean
// 2. proportion
// 3. ratio
// 4. total
// there are a bunch of other commands, including logistic regression, etc. 
// let's see what the mean age was, the variable is "ridageyr"
svy: mean ridageyr
// you see that the mean age of the US population (301 million people) in 2009-2010 was
// 36.7 years. COOL!
// NEXT:
// Let's see the proportion of people who have been told they have high blood pressure
// VARIABLE: bpq020
// First, look at the documentation for this variable on the source XPT file's DOC page
// https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.htm
// note that this is coded as 1=yes, 2=no, 7=refused, 9=don't know, .=missing
// Only asked if age >=16, so not the entire 301 million US population will be here
// let's look at the responses given
svy: proportion bpq020
// let's make a new variable with 0=no, 1=yes, .=all others
gen toldhtn0n1y =.
replace toldhtn0n1y=0 if bpq020==2 // no (you need a double equals sign after the if)
replace toldhtn0n1y=1 if bpq020==1 // yes
//                                    all others remain as =.
// let's see the proportion who have been told they have high BP
svy: proportion toldhtn0n1y
// SO: of the 234 million Americans >=16 years, 28% have been told that they have htn
// A little program to spit out the results in english:
matrix htnansw = e(b) // e(b) is where the estimated proportions are stored. Type "ereturn list" to see
//                       which matrices are available. Type "matrix list [name of matrix]" to 
//                       see content of each matrix. This "matrix htnansw = e(b)" command will
//                       save the temporary matrix for e(b) under the permanent matrix
//                       "htnansw" that we can then manipulate.
// If you typed "matrix list htnansw", you'd see that the proportion answering "yes"
// is saved in the second column of the first row of this matrix, or [1,2].
local yesbp = htnansw[1,2] // pluck out the value of the 1,2 cell in the saved matrix 
//                            (the yes proportion) as a macro to call later
// NOTE: You call macros by opening with the tick to the left of number 1 on your keyboard,
// writing the name of the macro, then closing with a traditional apostrophe.
// Read about macros here: https://www.ssc.wisc.edu/sscc/pubs/stata_prog1.htm
// save # of people who answered y/n (234 million)
matrix subpop = e(_N_subp) // pluck out the # of people in this population, aka the # of 
//                            americans >=16 years old, as a permanent matrix named "subpop"
local population = subpop[1,1] // make a macro plucking the 1,1 cell where the total
//                                # of americans are in this population
// how data can be presented:
di "Unrounded " "`yesbp'" // just to prove how you can present it. Note that yesbp is a macro.
di "Rounded " round(`yesbp'*100) // round at the decimal after mult by 100
di "Total population is " `population' // note exponent
di "Total population is " %18.0fc `population' // note no exponent but helpful commas
di "Among Americans >=16 years, " round(`yesbp'*100) "% have been told that they have high BP."
di "Among Americans >=16 years, " round(`yesbp'*`population') " Americans have been told they have high BP."
// NOTE: If you run this line-by-line, stata may drop the macros above.  
// Run the script from the very top if you are getting errors in these last few lines. 
// Fin.

ClipSpeak: The most user-friendly, simple text-to-speech app ever

I came across the excellent text-to-speech app, ClipSpeak, last week. It’s a freeware program that you can download here.

Installing and opening ClipSpeak

I extracted the .zip file to my desktop and created a shortcut on my Windows Taskbar for ClipSpeak. Open it up and you’ll notice that nothing happened. (You may notice a new yellow icon next to your clock.)

Using ClipSpeak

Open up anything with text. Yes, anything. A browser. A Word document. A .txt file. Highlight what you want read out loud and hit “copy” (Ctrl-C).


Yes, that’s ClipSpeak reading you your copied text. You can right-click on the system icon of ClipSpeak (next to the clock, possibly buried under the ^ symbol) and increase/decrease the speed under Select Voice.

Why I love this

This is the simplest program ever. There is no interface. It does what you want it to do. Want it to stop reading? Quit the program and open it back up to clear the audio playback.

MS Word’s new Read Aloud feature: Helpful for dyslexia and typo-finding

New in MS Word 365 is Read Aloud. It’s a pretty straightforward, stripped down/simplified text-to-speech (TTS) tool. If you have dyslexia or otherwise are having a difficult time finding typos in your work, try this feature. It’s very helpful.

To turn it on, just head over to the Review tab and click on the Read Aloud button.

It’ll automatically start reading from your cursor forward. Traditional icons for audio playback will float in the top right corner of your document window. These include play/pause, skip back (one paragraph), and skip forward (one paragraph).

There is an icon of a dude with radio waves coming out of his head. That’s the settings button.

I like cranking the speed up by a few notches. You can also switch to other installed voices. I’m a fan of David’s sonorous speech. I’d totally pay a few bucks to have a Gilbert Gottfried voice though.

What if you have an older version of MS Word? Well, before this feature was released, I was using WordTalk v4.3. It’s a free, standalone app from the folks at the University of Edinburgh. It has some nice keyboard shortcuts that you can enable. The interface is very clunky though. In a pinch, it’ll do the trick.

2017-09-28 update

After using Read Aloud consistently for the past month, I still enjoy and recommend its use. However, I find it to be a bit glitchy with documents that have many co-author comments or embedded references from Zotero. In these documents, the playback will stop mid-word on occasion.

There are also some keyboard shortcuts for Read Aloud, documented here:

  • CTRL+Alt+Space – Start or quit Read Aloud
  • CTRL+Space – Play/pause
  • CTRL+Left arrow or CTRL+Right arrow – Skip back or forward a paragraph
  • Alt+Left or Alt+Right – Decrease or increase reading speed

2017-11-08 update

I received a new computer that ships with MS Office 2016, not 365. Read Aloud isn’t in Office 2016. It turns out that you can enable the older Text-to-Speech functionality in a pinch. If you don’t have the MS Speech Platform installed, you can download it here.

First, right click on the ribbon and select “customize the ribbon”.

Next, make a new group under the Review tab. I called mine Speak. Then under Choose Commands From, select All Commands. Scroll down to Speak and add it to your new group called Speak. Hit save.

You should now have a Speak button on your Review tab. Highlight what you want it to read and click Speak. Click it again to stop the reading.

You can adjust the reading speed under the Windows Control Panel –> Speech Recognition –> Text to Speech (on the left).