Mediation analysis in Stata using IORW (inverse odds ratio-weighted mediation)

Mediation is a commonly used tool in epidemiology. Inverse odds ratio-weighted (IORW) mediation was described in 2013 by Eric J. Tchetgen Tchetgen in this publication. It’s a robust mediation technique that can be used in many sorts of analyses, including logistic regression, modified Poisson regression, etc. It is also considered valid if there is an observed exposure*mediator interaction on the outcome.

There have been a handful of publications that describe the implementation of IORW (and its cousin, inverse odds weighting, or IOW) in Stata, including this 2015 AJE paper by Quynh C Nguyen and this 2019 BMJ Open paper by Muhammad Zakir Hossin (see the supplements of each for actual code). I recently had to implement this in a REGARDS project using a binary mediation variable and wanted to post my code here to simplify things. Check out the Nguyen paper above if you need to modify the following code to run IOW instead of IORW, or if you are using a continuous mediation variable rather than a binary one.

A huge thank you to Charlie Nicoli MD, Leann Long PhD, and Boyi Guo (almost) PhD, who helped clarify pieces of the implementation. Also, Charlie wrote about 90% of this code so it’s really his work. I mostly cleaned it up and aligned the approach with the examples from Drs. Nguyen and Hossin in the papers above.

IORW using pweight data (see below for unweighted version)

The particular analysis I was running uses pweighting. This code won’t work in data that doesn’t use weighting. This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. (Although you'll probably 
// have to adjust the svyset commands in the 
// program below to work in your dataset.)
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
//
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 2.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Components of weighting to go into "svyset" command. 
// Your svyset setup might differ; adjust these to match your dataset.
global samplingweight mysamplingweightvar
global id id_num // ID for your participants
global strata mystratavar
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw_weighted
program iorw_weighted, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
*Step 1: svyset data since your dataset is weighted. If your dataset 
// does NOT require weighting for its analysis, do not use this program. 
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
//
*Step 2: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
svy linearized, subpop(${ifstatement}): logit ${exposure} ${mediator} ${covariates}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
//
*Step 3: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights.  Since our dataset uses sampling
// weights, we need to multiply the dataset's weights times the IORW for the 
// exposure group. This step is fundamentally different for non-weighted 
// datasets. 
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures. 
gen weight_iorw = ${samplingweight} if ${exposure}==0
replace weight_iorw = inverseoddsratio*${samplingweight} if ${exposure}==1
//
*Step 4: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. 
// (same as direct effect, but without the IORW)
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log) 
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 5: DIRECT EFFECTS; using IORW weights instead of the weighting that
// is used typically in your analysis. 
svyset ${id}, strata(${strata}) weight(weight_iorw) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 6: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 7: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 8: Now bootstrap the program defined above
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw_weighted
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // this is the 95% confidence intervals using "percentiles" (ie 2.5th and 97.5th percentiles) rather than the default normal distribution method in Stata. The rationale for using percentiles rather than the normal distribution is found in the 4th paragraph of the "analyses" section here (by refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title} (bootstrap count=" e(N_reps) ")*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%" _col(50) %4.2f `beta4' "% (" %4.2f `low954' "%, " %4.2f `high954' "%)"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
noisily di " "
}
// the end.

IORW for datasets that don’t use weighting (probably the one you are looking for)

Here is the code above, except without consideration of weighting in the overall dataset. (Obviously, IORW uses weighting itself.) This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as unweighted data with binary
// exposure using IORW. 
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 1.
global outcome  mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw
program iorw, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
//
*Step 1: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
logit ${exposure} ${mediator} ${covariates} ${ifstatement}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
//
*Step 2: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights. 
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures. 
gen weight_iorw = 1 if ${exposure}==0
replace weight_iorw = inverseoddsratio if ${exposure}==1
//
*Step 3: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. (same as direct effect, but without the IORW)
glm ${outcome} ${exposure} ${covariates} ${ifstatement}, family(poisson) link(log) vce(robust)
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 4: DIRECT EFFECTS; using IORW weights
glm ${outcome} ${exposure} ${covariates} ${ifstatement} [pweight=weight_iorw], family(poisson) link(log) vce(robust)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 5: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 6: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 7: Now bootstrap the program defined above
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // this is the 95% confidence intervals using "percentiles" (ie 2.5th and 97.5th percentiles) rather than the default normal distribution method in Stata. The rationale for using percentiles rather than the normal distribution is found in the 4th paragraph of the "analyses" section here (by refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title} (bootstrap count=" e(N_reps) ")*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%"  _col(50) %4.2f `beta4' " (" %4.2f `low954' ", " %4.2f `high954' ")"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
noisily di " "
}
// the end.

Rounding/formatting a value while creating or displaying a Stata local or global macro

If you are looking to automate the printing of values from macros/regressions or whatnot on your figures, make sure to check out this post.

I love Stata’s macro functionality. It allows you to grab information from r- or e-level data after executing a Stata command and call it back later. (Local macros are temporary; global macros are more persistent.) This is particularly useful for collecting summary statistics in macros after the –sum– command, or for displaying a subset of components from a regression, such as the beta coefficient and 95% confidence intervals, or P-values (details on how to manipulate regression results with macros are here).

One problem in using macros is that raw r- or e-level data are really long and not amenable to output in tables for publication without formatting. I hadn’t previously been able to apply formatting (eg %4.2f) while generating macros, outside of applying the round() function. (I don’t like round() because it’s tricky to code in a program for reasons I won’t get into.) Instead, I have applied the number formatting when displaying the macro. That creates issues when generating string output from numerical macros, since my prior strategy of applying numerical formatting (eg %4.2f) didn’t work when displaying a numerical macro in a string (i.e., embedding it within quotations). Instead, I wanted to apply the format while generating the macro itself.

I came across this post on the Stata List by Nick Cox, which details how to do just that: https://www.stata.com/statalist/archive/2011-05/msg00269.html

It turns out that not only can you apply formatting while generating a macro with the “: display” subcommand, you can also trim extra spaces from the generated macro at the same time. (Note that the old “trim()” function has been replaced by “strtrim()”, “strltrim()”, and other related functions that you can find in –help string functions–.) As a bonus, it turns out that you can also apply formatting (e.g., %4.2f) to a macro when displaying it within a string by using the “: display” subcommand strategically surrounded by opening and closing tick marks.

Here’s a demo script of what I’m getting at.

(Note: I strongly recommend against formatting/rounding/reducing precision of any macros that you are generating if you will later do math on them. Formatting during generation of macros is only useful for macros intended to be displayed later on.)

sysuse auto, clear
sum mpg
//
// 1. generate an unformatted macro, equals sign 
//    is required
local mpg_mean = r(mean) 
// 
// 2. apply formatting while generating the local 
//    macro, note the colon for the macro 
//    subcommand, instead of an equals sign
local mpg_mean_fmt: display %10.2f r(mean)
//
// 3. here's formatting mixed with a trim command,
//    this combines an equals sign and colon. 
local mpg_mean_neat = strltrim("`: display %10.2f r(mean)'")
//
//
// Now let's call the unformatted macro from #1:
display "here it is unformatted: `mpg_mean'"
// note that it's really long after the decimal
//
// Let's apply formatting to #1 OUTSIDE of quotations:
display "here it is formatted " %10.2f `mpg_mean'
// ...and here's formatting to #1 WITHIN quotations:
display "here it is formatted `: display %10.2f `mpg_mean''"
// see all of the leading spaces? Using %3.2f would fix 
// that, but I wanted to show the trim function.
//
// here's the format applied during macro generation in 
// #2 (without a trim function):
display "here's an alt format: `mpg_mean_fmt'"
// still lots of leading spaces.
//
// here's trimming and formatting mixed from #3:
display "here's fmt & trim: `mpg_mean_neat'"
// Bingo!

Extracting variable labels and categorical/ordinal value labels in Stata

Stata allows the labeling of variables and also the individual values of categorical or ordinal variable values. For example, in the –sysuse auto– database, “foreign” is labeled as “Car origin”, 0 is “Domestic”, and 1 is “Foreign”. It isn’t terribly intuitive to extract the variable label of foreign (here, “Car origin”) or the labels from the categorical values (here, “Domestic” and “Foreign”).

Here’s a script that you might find helpful to extract these labels and save them as macros that can later be called back. This example generates a second string variable that applies those labels. I use this sort of code to automate the labeling of figures with the value labels, but this is a pretty simple example for now.

Remember to run the entire script from top to bottom or else Stata might drop your macros. Good luck!

sysuse auto, clear
// Note: saving variable label details at 
//       --help macro--, under "Macro functions 
//       for extracting data attributes"
// 
// 1. Extract the label for the variable itself.
//    If we look at the --codebook-- for the 
//    variable "foreign", we see...
//
codebook foreign
//
//    ...that the label for "foreign" is "Car 
//    origin" (see it in the top right of 
//    the output).  Here's how we grab the 
//    label of "foreign", save it as a macro, 
//    and print it.
//
local foreign_lab: variable label foreign
di "`foreign_lab'"
//
// 2. Extract the label for the values of the variable.
//    If we look at --codebook-- again for "foreign", 
//    we see...
//
codebook foreign
// 
//    ...that 0 is "Domestic" and 1 is "Foreign". 
//    Here's how to grab those labels, save macros, 
//    and print them
//
local foreign_vallab_0: label (foreign) 0 
local foreign_vallab_1: label (foreign) 1 
di "The label of `foreign_lab' 0 is `foreign_vallab_0' and 1 is `foreign_vallab_1'"
//
// 3. Now you can make a variable for the value labels.
//  
gen strL foreign_vallab = "" // this makes a string 
replace foreign_vallab="`foreign_vallab_0'" if foreign==0
replace foreign_vallab="`foreign_vallab_1'" if foreign==1
// 
// 4. You can also label this new string variable 
//    using the label from #1
// 
label variable foreign_vallab "`foreign_lab' as string"
// 
// BONUS: You can automate this with a loop, using 
//    --levelsof-- to extract options for each 
//    categorical variable. There is only one 
//    labeled categorical variable in this dataset 
//    (foreign) so this loop only uses the single one. 
// 
sysuse auto, clear
foreach x in foreign {
	local `x'_lab: variable label `x' // #1 from above
	gen strL `x'_vallab = "" // start of #3
	label variable `x'_vallab "``x'_lab' string"
	levelsof `x', local(valrange)
	foreach n of numlist `valrange' { 
		local `x'_vallab_`n': label (`x') `n' // #2 from above
		replace `x'_vallab = "``x'_vallab_`n''" if `x' == `n' // end of #3
	}
}

Part 4: Defining your population, exposure, and outcome

Importing and merging files

Odds are that you will get raw data to work with that is in a CSV, sas7bdat, Excel’s xlsx file format, or something else. Stata can natively import many (but not all) file types. The simplest thing to do is use the –import– command then immediately save things as a temporary Stata .dta file and later merge them.

Importing commands differ by filetype. Type –help import– for details, but here are the 3 commands I use most frequently. These assume that the files are in the present working directory, which you can see by typing –pwd–. Remember: in Windows, opening Stata by double-clicking the .do file of interest in Windows Explorer sets the folder containing that .do file as the working directory.

// CSV aka comma separated value, importing variable names
// from the first row
import delim using "NAME.csv", clear varnames(1)
save "name_csv.dta", replace // save as Stata dataset

// sas7bdat:
import sas using "NAME.sas7bdat", clear case(lower)
save "name_sas.dta", replace // save as Stata dataset


// Excel, MAKE SURE THAT YOU DON'T ALSO HAVE THE FILE
// OPEN IN EXCEL or it won't import it. This will import
// variable names from the first row.
import excel using "NAME.xlsx", clear firstrow
save "name_xlsx.csv", replace // save as Stata dataset

There are lots of fancy settings within these commands.

To merge, simply merge all of your new .dta files using the –merge– command. This assumes that all files have a variable named “id” that uniquely identifies all rows and is preserved across files. Eg:

use name_csv.dta, clear
merge 1:1 id using name_sas.dta
drop _merge
merge 1:1 id using name_xlsx.dta
drop _merge

The merge command generates a “_merge” variable that tells you where each observation came from. Review this variable and the merge output in Stata very closely. You need to drop the “_merge” variable before running another merge.
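
For example, here’s a minimal pattern for that review step after each merge (the assert line is an optional strictness check that halts the do file if any observation failed to match in both files):

tab _merge // 1=in master only, 2=in using only, 3=matched in both
assert _merge==3 // optional: stop the do file if anything didn't match
drop _merge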

Getting the population, exposure, and outcome correct in your analytical dataset, and being able to come back and fix goofs later

Defining a study population, exposure variable, and outcome variable is a critical early step after determining your analysis plan. Most epidemiology projects come as a huge set of datasets, and you’ll probably need to join multiple files into one when defining your analytical population. Defining your analytical population is an easy place to make errors so you’ll want to have a specific script that you can come back and edit again if and when you find goofs.

For the love of Pete — Please generate your population, exposure, and outcome variables using a script so you can go back and reproduce these variables and fix any bugs you might find!

When you make these variables, you’ll likely need to combine several datasets. This will require mastery of importing datasets (if not in the native format for your statistical program) and combining datasets. For Stata, this means using –import– and –save– commands to bring everything over into Stata format, and then using –merge– commands to combine multiple datasets.

Make a variable for your population that is 0 (not included) or 1 (included)

One option in generating your dataset is to drop everyone who isn’t in your analytical population. I recommend against dropping those individuals. Instead, create a variable to define your population. Name it something simple like “included”, “primary_population”, “group_a”, or whatnot. If you will have multiple populations (say, one defined by prevalent hypertension using JNC7 vs. ACC/AHA 2017 hypertension thresholds), then you should have a variable for each, appended with a simple way to tell them apart, like “group_jnc7” and “group_aha2017”.

Useful code in R and Stata to do this:

  • Count
  • Generate and replace (Stata), mutate (R)
  • Combine these with assignment using the single equals sign “=” (Stata & R; I say “assign” out loud when using it) or “<-” (R)
  • use –if–, –and–, & –or– statements
  • Tests of equality/inequality: >, >=, <, <=, == (“exactly equals”), != (“not equal”); note that testing equality uses a double equals sign, not the single equals sign used for assignment

Example Stata code to count # of people with diabetes, generate a variable for group_a and assign someone to group_a if they have diabetes.

count if diabetes==1
gen group_a=0
replace group_a=1 if diabetes==1
count if group_a==1 

When creating a new variable from another variable, sometimes it’s helpful to first generate a variable that is missing for everyone (a dot), then replace it as 0 for one group and 1 for another. This is especially helpful for strings. Strings must be in quotations, btw. Here’s an example of how to make a numeric variable from “Y” and “N” strings:

gen diabetes = . // start with variable that's missing for all
replace diabetes = 0 if diab_string == "N"
replace diabetes = 1 if diab_string == "Y"

Missing as positive infinity in Stata

Read up about missing values by typing –help missing–. In Stata, missing values are either a dot or a dot with a letter (e.g., “.” or “.a”). Mathematically, Stata considers a dot missing value to be positive infinity, and each dot-with-letter missing value to be an even larger positive infinity, so: one billion trillion < . < .a < .b < .c and so on. Understanding that Stata treats missing values as positive infinity is critically important when using greater-than statements, since “greater than” some value will also capture all missing values. So, imagine you had a population of 1 million people, only 100 of them were asked how many popsicles they have in their freezer, half of those had 0 to 10, and the other half had 11 to 20. If you try to make a variable called “popsicle_count” that is 1 for the lower half (0-10) and 2 for the higher half (11-20), you’d do something like this:

gen popsicle_count = . // everyone has a missing variable 
replace popsicle_count=1 if popsicle <=10
replace popsicle_count = 2 if popsicle >10

…the popsicle_count variable would have 50 people with a value of 1 and 999,950 people with a value of 2. This is because the last line didn’t specify what to do with missing values: everyone who was never asked (popsicle = .) counts as “greater than 10” to Stata. The easy workaround is to add “& popsicle <.” to the last line, which includes only values less than positive infinity, aka non-missing values. The correct way of writing this would be:

gen popsicle_count = . // everyone has a missing variable 
replace popsicle_count=1 if popsicle <=10
replace popsicle_count=2 if popsicle >10 & popsicle <. // NOTE!!

This last line correctly ignores all missing values when assigning a value of 2, since it assigns 2 only to values greater than 10 and less than positive infinity, aka less than a missing value.

Here’s example R code to do the same (df=data frame).

nrow( df %>% filter(diabetes == 1) )
df = df %>% mutate(group_a = ifelse(diabetes == 1, 1, 0) )

Make an inclusion flowchart

These are essential charts in observational epidemiology. As you define your population, generate this sort of figure. Offshoots of the nodes define why folks are dropped from the subsequent node. Here’s how I approach this, folks might have different approaches:

  • Node 1 is the overall population.
  • Node 2 is individuals who you would not drop for baseline eligibility reasons (e.g., a prior event that disqualifies them, or missing data that prevents assessment of their eligibility)
  • Node 3 is individuals who you would not drop because they can be assessed for necessary follow-up (dropping those with incomplete follow-up, death before the required follow-up time, or missing data)
  • Node 4 is individuals who you would not drop because they have all required exposure covariates (if looking at stroke by cholesterol level, people who all have cholesterol measured). This is your analytical population.

If you have two separate populations (eg, different hypertension populations by JNC7 or ACC/AHA 2017), you might opt to make two entirely separate figures. If you have slightly different populations because of multiple exposures (e.g., 3 different inflammatory biomarkers, but you have different missingness between the 3), you might have the last node fork off into different nodes, like this:

I generate these via text output in Stata and then manually draw them in PowerPoint. To make these, I use a series of “sum” and “count” commands along with “noisily display” commands, all inside a quietly block. (Noisily overrides quietly for a single line. When debugging, you might want to remove the quietly wrapper so you can see everything.)

I also make an “include” variable that defines the analytical population(s) of interest.

You can display specific bits of data after a “sum” command, including r(N), which is the N of a sample. If you wonder what bits of data are available after a command like “sum if include==1”, type “return list”.

Example:

quietly {
gen include=1 // make a variable called "include" that is 1 for everyone
count if include ==1 // count the # of rows with an include variable equal to 1, this is everyone. It will save that value as r(N). 
noisily display "Original study population, n= " r(N) // you can print this
//
// now lets print out the people we exclude
count if prevalent_htn_jnc7==1 // this will count the # with prevalent htn to be excluded
noisily display " --> Hypertension at baseline, n= " r(N)
count if prevalent_htn_missing==1 
noisily display " --> Missing bp at baseline, n= " r(N)
//
// now we are going to replace the include variable as 0 for people missing the two things above
replace include = 0 if prevalent_htn_jnc7==1
replace include = 0 if prevalent_htn_missing==1
count if include==1
noisily display "normotensive at baseline, n= " r(N)
// and so on
}

If you are using weighted data, this approach will differ slightly. First, you will have to svyset your data. Next, you will use “svy, subpop(IF COMMAND HERE): total [thing]”. Instead of using “return list”, you use “ereturn list” to see the bits that are saved. The weighted N is e(N_pop), for example.

svyset [pweight=sampleweight]
gen include=1
svy, subpop(if include==1): total include 
ereturn list // notice e(b) is there, it's the beta from the prior estimation
noisily di "original study population, n= " e(b)[1,1]
// above prints out the first cell of the matrix e(b), hence the [1,1]
// type "matrix list e(b)" to see what's in that matrix. 
// now figure out how many are excluded for missing a biomarker
svy, subpop(if include==1 & biomarker==.): total include
// now print it out, but since this uses sampling, it will not be a whole number. Print out a whole number with the %4.0f formatting code. 
noisily di " --> missing biomarker, n= " %4.0f e(b)[1,1]
// now update the include to be 0 for missing biomarkers 
// and display the count for the node
replace include=0 if biomarker==. 
svy, subpop(if include==1): total include
no di "not missing biomarker, n= " %4.0f e(b)[1,1]
// and so on

When you get to the end, you’ll have a variable called “include” that you will use in all of your later analyses to define your analytical population. Depending on your analysis, you might need to make a few different include variables. For example, we commonly run hypertension analyses using both jnc7 and acc/aha 2017 hypertension definitions, so I usually have an “include_jnc7” and also a separate “include_aha2017” variable.
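
As a sketch, building on the flowchart example above (prevalent_htn_aha2017 here is a hypothetical analog of the JNC7 flag; swap in your own exclusion variables):

gen include_jnc7 = 1
replace include_jnc7 = 0 if prevalent_htn_jnc7==1 | prevalent_htn_missing==1
//
gen include_aha2017 = 1 // hypothetical ACC/AHA 2017 version
replace include_aha2017 = 0 if prevalent_htn_aha2017==1 | prevalent_htn_missing==1
// later analyses can then subset, eg: ... if include_jnc7==1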

Defining exposure and outcome

This seems simple, but clearly define what your exposure and outcome are. Each should have a simple 0 or 1 variable (if dichotomous) with an intuitive name. You might need 2 separate outcome variables if you are using different definitions, like “incident_htn_jnc7” and “incident_htn_aha2017”. A minimal sketch follows below.
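
Here’s a hedged sketch with made-up variable names (sbp_followup is a hypothetical follow-up systolic BP; note the “& sbp_followup <.” guard, per the missing-values section above):

gen incident_htn_jnc7 = .
replace incident_htn_jnc7 = 0 if sbp_followup <140
replace incident_htn_jnc7 = 1 if sbp_followup >=140 & sbp_followup <.
// (a real JNC7 definition would also consider diastolic BP and BP meds)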

Table 1

“Table 1” shows core features of the population by the exposure. Don’t include the outcome as a row, but do include demographics and key risk factors/covariates for the outcome (eg if CVD, then diabetes, blood pressure, cholesterol, etc.). Some folks include a 2nd column that presents the N for that row. Some folks also include a P-value comparison as a final column. I tend to generate the P value every time but only present it if the reviewers ask for it.

In Stata, I use the excellent table1_mc program to generate these, which you can read about here. If you are using p-weighted data, you can use this script that I wrote, here.
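
As a rough sketch of what a table1_mc call might look like, using variable names from the IORW example earlier (this assumes you’ve run –ssc install table1_mc–; see –help table1_mc– for the full set of variable types and options):

table1_mc, by(smoking) ///
	vars( ///
	age contn %4.1f \ ///
	sex cat %4.0f ///
	)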

For R, I am told that gtsummary works well.

For lots more details on Table 1s, please continue to the next post, here.

Part 2: Effective collaborations in epidemiology projects

Pin down your authorship list

Determine authorship list before you lift a finger on any analysis and get buy-in from collaborators on that list.

Stay organized

Have one folder where you save everything, and have subfolders within that for groups of documents. I suggest these subfolders:

  • Manuscript
  • Abstract
  • Data and analysis
  • Paper proposal (for REGARDS projects & others with manuscript proposal documents)

…and keep relevant documents within each one. You might want to put an “archive” folder within each subfolder (e.g., manuscript\archive) and move old drafts into the archive folder to reduce clutter.

Give documents a descriptive name. Don’t call it “manuscript [versioning system].docx”; use terms specific to your project. If you are doing a REGARDS paper looking at CRP and risk of hypertension, name it “regards crp htn abstract [versioning system].docx”.

Use an intuitive versioning system. I like revision # then version # (eg r00 v01). Many people use dates. If you use dates to keep track of versions, append your documents with the ISO8601 date convention of YYYY-MM-DD. Trust me. Lots of details on this post.

Have realistic goals and stick to deadlines

Come up with some firm deadlines and do your best to stick with them. If you want an example, here are some goals for moving a project forward.

  • Combine all existing written documents (eg, proposal) into one manuscript.
  • Draft blank tables and decide what figures you want to make. Write methods.
  • Generate baseline characteristics. Describe in results.
  • Generate descriptive statistics/histograms for your exposure and outcome(s). Describe in results.
  • Estimate primary and secondary outcome(s). Describe in results.
  • Complete secondary analyses. Describe in results.
  • Finish first draft. Send to your primary mentor or collaborator.
  • Integrate feedback from primary mentor or collaborator into a second draft. Circulate to coauthors.
  • Integrate feedback from coauthors into a document to be submitted to a journal.
  • Format your manuscript for a specific journal and submit it. (This takes a surprisingly large amount of time.)

Managing your mentor: Send reminder emails more frequently than you probably realize

I block off time to work on your stuff, but clinical priorities or other professional/parenting challenges might bump that time. I try to find other time to work on your stuff, but a big crisis might mean that I don’t have a chance to reschedule.

Please, please, please, please email me early and persistently about your projects. This will never annoy me — these emails are very helpful. Quick focused emails are helpful here, especially if you re-forward your prior email threads. Eg, “Hi Tim, wondering if you had a chance to take a look at that draft from last week, reforwarded here. Thanks, [name].”

Working on revisions

Use tracked changes

And remember to turn them on when you send around a draft!

Append your initials to the end of the document that you are editing for someone else

For me, I’ll change a name to “My cool document v1 tbp.docx”.

Stata-R integration with Rcall

Stata is great because of its intuitive syntax, reasonable learning curve, and dependable implementation. There’s some cutting edge functionality and graphical tools in R that are missing in Stata. I came across the Rcall package that allows Stata to interface with R and use some of these advanced features. (Note: this worked for me as far back as Stata 16. I have no reason to think that this wouldn’t work with Stata 13 or newer.)

Below details:

  • How to get set up with Rcall
  • Example 1: How to make figures with the ‘ggplot2’ and ‘ggstatsplot’ R packages
  • Example 2: How to estimate the Charlson comorbidity index or Elixhauser comorbidity score with the ‘comorbidity’ R package
  • Example 3: Estimating race by last name and sex by first name with the ‘predictrace’ R package

Installation of R, R packages, and Rcall (you only need to do this once)

Download and install R. Open R and install the readstata13 package, which is required to install Rcall. While you’re at it, install ggplot2 and ggstatsplot. Note: ggplot2 is included in the excellent multi-package collection called Tidyverse. We are going to install Tidyverse instead of ggplot2 alone. Tidyverse also installs several other packages useful in data science that you might need later, including dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. I have also gotten an error saying “‘Rcpp_precious_remove’ not provided by package ‘Rcpp'”, which was fixed by installing Rcpp, so install that too.

In R, type:

install.packages("readstata13")
install.packages("tidyverse")
install.packages("ggstatsplot")
install.packages("Rcpp")

It’ll prompt you to set up an install directory and choose your mirror/repository. Just pick one geographically close to you. After these finish installing, you can close R.

Rcall’s installation is within Stata (as usual for Stata programs) but originates from Github, not the usual SSC install. You need to install a separate package to allow you to install things from Github in Stata. From the Stata command line, type:

net install github, from("https://haghish.github.io/github/")

Now install Rcall itself from the Stata command line:

github install haghish/rcall, stable

If all goes well, it should install!

Edit: In July 2024, there seems to be a problem with the installation. If you get an error saying “github package was not found” and “please update your GitHub package”, you might try to install an older version of rcall than the current version (which is 3.1.0 in 7/2024) as such:

github install haghish/rcall, version(3.0.7)

Using Rcall

You should read the Rcall help file (type –help rcall– in Stata) for an overview. Also read the Rcall overview on Github. In brief, you can send datasets from Stata to R using –rcall st.data()–. You can kick things back to Stata with –st.load(name of R frame)–. –rcall clear– reboots R as a new instance.

There are four modes for using Rcall: vanilla, sync, interactive, and console. For our purposes, we are going to focus on the interactive mode since this allows you to manipulate R from within a do file.
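
Before the figure examples, here’s a minimal round trip in interactive mode to show those pieces working together (gp100m is just an arbitrary demo variable; this assumes Rcall is installed as described above):

sysuse auto, clear
rcall clear // start with a fresh instance of R
rcall: data <- st.data() // send the Stata dataset to R
rcall: data$gp100m <- 100/data$mpg // make a new variable in R
rcall: st.load(data) // pull the R data frame back into Stata
list make mpg gp100m in 1/5 // the R-made variable, now in Stata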

Example 1: Make a figure in ggplot2 using Stata and Rcall

Here’s some demo code to make a figure with ggplot2, which is the standard for figures in R. There’s a handy cheat sheet here. This intro page is quite helpful. This overview is excellent. Check out the demo figures from this page as well. If your ggplot command extends across multiple lines, make sure to end each line (except the final line) with the three forward slash (“///”) line break notation that is used by Stata.

// load sysuse auto dataset
sysuse auto, clear
// set up rcall, clean session and load necessary packages
rcall clear // starts with a new instance of R
rcall: library(ggplot2) // load the ggplot2 library
// move Stata's auto dataset over to R and prove it's there.
rcall: data<- st.data() // move auto dataset to r
rcall: names(data) // prove that you can see the variables. 
rcall: head(data, n=10) // now look at the first 10 rows of the data in R
// now make a scatterplot with ggplot2, note the three slashes for line break
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
       geom_point()
rcall: ggsave("ggtest.png", plot=e)
// figure out where that PNG is saved:
rcall: getwd()

Note: rather than using the three forward slashes, you can also change the delimiter to a semicolon, like the following. Just remember to change it back to normal (“cr”). Here’s an equivalent to above with semicolon delimiters. Note that the ggplot bit that spreads across two lines no longer has any slashes. This looks a bit more like “true R code”.

#delimit ;
rcall clear ; 
rcall: library(ggplot2) ; 
rcall: data<- st.data() ;
rcall: names(data) ;
rcall: head(data, n=10) ;
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + 
       geom_point() ;
rcall: ggsave("ggtest.png", plot=e) ; 
rcall: getwd() ; 
#delimit cr

Here’s what it made! It was saved in my Documents folder, but check the output above to see where your working directory is.

You can get much more complex with the figure, like specifying colors by foreign status, specifying dot size by headroom size, adding a loess curve with 95% CI, and adding some labels. You can swap out the “rcall: e <- ggplot(…)" bit above for the following. Remember to end every non-final line with the three forward slashes.

rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
	geom_point(aes(col=foreign, size=headroom)) + ///
	geom_smooth(method="loess") +   ///
	labs(title="ggplot2 demo", x="MPG", y="Weight", caption="Caption!")

Here’s what I got. As an FYI, varying dot size by a third variable can also be done in Stata using weighted markers.

Let’s make a figure in ggstatsplot using Stata and Rcall

Here’s some demo code to make a figure with ggstatsplot (which is very awesome and you should check it out). If your ggstatsplot command extends across multiple lines, make sure to end each line (except the final line) with the three forward slash (“///”) line break notation that is used by Stata.

// load sysuse auto dataset
sysuse auto, clear
// set up rcall, clean session and load necessary packages
rcall clear // starts with a new instance of R
rcall: library(ggstatsplot) // load the ggstatsplot library
rcall: library(ggplot2) // need ggplot2 to save the png
// move Stata's auto dataset over to R and prove it's there.
rcall: data<- st.data() // move auto dataset to r
rcall: names(data) // prove that you can see the variables. 
rcall: head(data, n=10) // now look at the first 10 rows of the data in R
// let's make a violin plot using ggstatsplot
rcall: f <- ggbetweenstats( data = data, x=foreign, y=weight, title="title")
rcall: ggsave("ggstatsplottest.png", plot=f)
// figure out where that PNG is saved:
rcall: getwd()

If you check your working directory (it was my “Documents” folder in Windows), you’ll find this figure as a PNG:

You can automate the output of ggstatsplot figures by editing the ggplot2 components that make them up. For example, to put the y scale on a log axis, you’d insert the following into the parentheses of the “ggbetweenstats” call:

ggplot.component = ggplot2::scale_y_continuous(trans='log') 
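
For instance, dropped into the ggbetweenstats call from the example above, it might look like this (a sketch; the output filename is arbitrary):

rcall: f <- ggbetweenstats(data = data, x=foreign, y=weight, title="title", ///
	ggplot.component = ggplot2::scale_y_continuous(trans='log'))
rcall: ggsave("ggstatsplotlog.png", plot=f)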

Quick do file to automate R-Stata integration and make ggplot2 or ggstatsplot figures

I made a do file that simplifies the setup of Rcall. Specifically, it 1. sets R’s working directory to match your current Stata working directory, 2. starts with a fresh R instance, 3. loads your current Stata dataset in R, and 4. loads ggplot2 and ggstatsplot in R.

To use, just load your data, run a “do” command followed by the URL to my do file, then run whatever ggplot2 or ggstatsplot commands you want. This assumes you have installed R, the required packages, and Rcall (see the very top of this page). If you get an error, try using this alternative version of the do file that doesn’t try to match Stata and R’s working directories.

Example code:

// Step 1: open dataset
sysuse auto, clear
// Step 2: run the do file, hosted on my UVM directory:
do https://www.uvm.edu/~tbplante/rcall_ggplot2_ggstatsplot_setup_v1_0.do
// if errors with above, use this do file instead:
// do https://www.uvm.edu/~tbplante/rcall_ggplot2_ggstatsplot_setup_alt_v1_0.do
// Step 3: run whatever ggplot2 or ggstatsplot code you want:
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
	geom_point(aes(col=foreign, size=headroom)) + ///
	geom_smooth(method="loess") +   ///
	labs(title="ggplot2 demo", x="MPG", y="Weight", caption="Caption!")
rcall: ggsave("ggtest.png", plot=e)

Example 2: Using “comorbidity” R package in Stata with Rcall to estimate Charlson comorbidity index or Elixhauser comorbidity score

Read all about this handy package here and in the PDF reference manual. In R, type:

install.packages("comorbidity") 

Here’s some semicolon-delimited Stata code to run from a Stata do file that applies the Charlson comorbidity index to some Stata data.

webuse australia10, clear // load a default stata dataset with an ICD10 variable
gen id=_n // make an ID by row as there's no ID variable in this dataset

#delimit ;
rcall clear ; 
rcall: library(comorbidity) ; // load comorbidity package
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at data
rcall: head(data, n=10) ; // look at rows
rcall: charlston <- comorbidity(x=data, id="id", code = "cause", 
		map = "charlson_icd10_quan", assign0 = FALSE) ;
rcall: score(x=charlston, weights = "charlson", assign0=FALSE) ;
rcall: mergeddata <- merge(data, charlston, by="id") ; // merge the original & new charlson data
rcall: head(mergeddata, n=10) ; // look at rows
rcall: st.load(mergeddata) ; // kick the merged data back to stata
#delimit cr

Example 3: Predicting race by last name and sex by first name using the ‘predictrace’ package

I came across this ‘predictrace’ package: https://jacobkap.github.io/predictrace/

…which says that it implements the methods described in this paper:

Here’s how to use this package in Stata using Rcall. There is a lot of nuance in this package so make sure to read the paper and the github page.

In R, type:

install.packages("predictrace")

Then in Stata, write a do file that generates a variable called “lastname” containing lowercase last names (for race matching) and “firstname” containing lowercase first names (for sex matching). Below is the code to estimate race by last name (first batch of code) and then sex by first name (second batch of code). This is semicolon delimited.

Estimating race by last name

(Note: this considers “Hispanic” to be a race and not an ethnicity…)

// Clear memory, input dataset of first and last names.
// Here, I'm formatting the strings so they are up to 100 characters
// in length so they don't get clipped (str100). 
// If I specified str5 then "Flintstone" would be "Flint". 
clear all
input str100 firstname str100 lastname
"Jacob" "Peralta"
"Rosa" "Diaz"
"Terrence" "Jeffords"
"Amy" "Santiago"
"Charles" "Boyle"
"Regina" "Linetti"
"Raymond" "Holt"
"Michael" "Hitchcock"
"Norman" "Scully"
end

compress firstname // optional, shortens string format from 100 char to the minimum length
compress lastname // optional, shortens string format from 100 char to the minimum length

// now replace all names with their lower case variant:
replace firstname = lower(firstname) // first name isn't used here fyi
replace lastname = lower(lastname)

#delimit ;
rcall clear ; 
rcall: library(predictrace) ; // load predictrace
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at names of data
rcall: head(data, n=10) ; // look at first 10 rows
rcall: lastnamevector <-data[, "lastname"] ; // make vector from lastname column 
rcall: data = merge(data, predict_race(lastnamevector), by.x = 'lastname', by.y = 'name', sort = FALSE) ; 
rcall: head(data, n=10) ; // look at rows
rcall: st.load(data) ; // kick the merged data back to stata
#delimit cr

Estimating sex by first name

// Clear memory, input dataset of first and last names.
// Here, I'm formatting the strings so they are up to 100 characters
// in length so they don't get clipped (str100). 
// If I specified str5 then "Flintstone" would be "Flint". 
clear all
input str100 firstname str100 lastname
"Jacob" "Peralta"
"Rosa" "Diaz"
"Terrence" "Jeffords"
"Amy" "Santiago"
"Charles" "Boyle"
"Regina" "Linetti"
"Raymond" "Holt"
"Michael" "Hitchcock"
"Norman" "Scully"
end

compress firstname // optional, shortens string format from 100 char to the minimum length
compress lastname // optional, shortens string format from 100 char to the minimum length

// now replace all names with their lower case variant:
replace firstname = lower(firstname)
replace lastname = lower(lastname) // last name isn't used here fyi

#delimit ;
rcall clear ; 
rcall: library(predictrace) ; // load predictrace
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at names of data
rcall: head(data, n=10) ; // look at first 10 rows
rcall: firstnamevector <-data[, "firstname"] ; // make vector from firstname column 
rcall: data = merge(data, predict_gender(firstnamevector), by.x = 'firstname', by.y = 'name', sort = FALSE) ; 
rcall: head(data, n=10) ; // look at rows
rcall: st.load(data) ; // kick the merged data back to stata
#delimit cr

Special thanks to Katherine Wilkinson for her R brilliance in debugging this.

Getting your grant below the page limit using built-in MS Word features

It’s just a little too long!

You’ve toiled on your grant day in and out for weeks on end, and despite chopping out loads of overly verbose text, it’s still over the length. It turns out that there are some built-in settings in MS Word to help you get below the length limit without removing additional text. This post is focused on NIH grant formatting but details here are relevant for most grants. This also assumes that you are already using narrow margins. I made up a 4 page ‘ipsum lorem’ document for this so I can give actual quantifications of what this does to document length.

Hyphenation and justification

I only just learned about hyphenation from Jason Buxbaum in this tweet. Hyphenation breaks longer words across lines with a hyphen, in the style commonly used in novels. Hyphenation will get you back a few lines in a 4 page document.

Justification makes words reach from the left to rightmost extremes of the margin, stretching or compressing the width of the spacing between words to make it fit. Justification’s effect on length is unpredictable. Sometimes it shortens a lot, sometimes it stays the same, sometimes it’s a smidge longer. In my 4 page ipsum lorem document, the length didn’t change. It’s worked to shorten some prior grants, so it’s worth giving a try. (Also, try combining justification with different fonts, see below.)

Here is the button to turn on justification.

Personally, I like ragged lines (“align left”) and not justified lines because I find justified text harder to read. I have colleagues who really like justification because it looks more orderly on a page. If you are going to use justification, please remember to apply it to the entirety of the text and not just a subset of paragraphs for the same reason that you don’t wear a tie with a polo shirt.

You can try combining hyphenation and justification, though I’m not sure it will gain anything. It didn’t in my demo document.

Modifying your size 11 font

Try Georgia, Palatino Linotype, or Helvetica fonts instead of Arial

The NIH guidelines specify size 11 Arial, Georgia, Helvetica, and Palatino Linotype fonts as acceptable options. (Note: Helvetica doesn’t come pre-installed on Windows; it’s pre-installed on Mac.) With lines aligned left, there were no major differences in length between any of the fonts in my ipsum lorem document. But try combining different fonts with justification: in the ipsum lorem document, justified Georgia was a couple of lines shorter than every other combination of alignment and NIH-approved font in Windows.

Condensing fonts

Kudos to Jason Buxbaum for this one. You can shrink the space between your letters without actually changing the font size/size of the letters. Highlight your text then home –> font little arrow –> advanced –> set spacing to “condensed”, then change the selector to 0.1 pt.

This change will give you a few lines back in a 4 page document.

I can’t tell the difference in the letter spacing before and after using 0.1. If you increase to a number larger than 0.1, it might start looking weird, so don’t push it too far.

A word of advice with this feature: if you are too aggressive, you might run afoul of NIH guidelines, which specify 15 characters per linear inch, so double-check the character count in an inch (view –> ruler will allow you to manually check). FYI: all NIH-approved fonts are proportional fonts, so narrow characters like lowercase L (“l”) take up less width than an uppercase W, and a random sample of text that happens to have a lot of narrow letters might have more than 15 characters/linear inch. You might need to sample a few inches to get a better idea of whether or not you are under the 15 character limit. (In contrast, Courier is a monospaced font and every character is exactly the same width.)

Adjust line and paragraph spacing

Both line and paragraph spacing affect the amount of white space on your page. Maintaining white space in your grant is crucial to improve its readability, so don’t squeeze it too much. In my opinion folks will notice shrunken paragraph spacing but not shrunken line spacing. So if you have to choose between modifying line or paragraph spacing, do line spacing first.

You can modify line and paragraph spacing by clicking this tiny checkbox under home tab –> “paragraph”.

Remember to highlight text before changing this (or if you are using MS Word’s excellent built-in Styles, just directly edit the style).

Line spacing

As long as you have 6 or fewer lines per vertical inch (view –> ruler will allow you to manually check), you are set by NIH guidelines. The default line spacing in MS Word is 1.08. Changing it to “single” will give you back about an eighth of a page in a 4 pg document. Today I learned that there’s ANOTHER option called “exactly” that will get you even more than a quarter of a page beyond single spacing. Exact spacing is my new favorite thing. Wow. Thanks to Michael McWilliams for sharing exact line spacing in this tweet. I wouldn’t go below “exactly” at 12 pt: 12 pt spacing works out to exactly 6 lines per vertical inch (72 points per inch / 12 points per line), and anything tighter gets you to about 6.5 lines per inch, which goes against the NIH standard of 6 lines per inch.

Paragraph spacing

The default in MS Word is 0 points before and 8 points after the paragraph. I don’t see a need for any gap between a heading and the following paragraph, so set the paragraph spacing before and after headings to zero. Looks nice here, right?

Now you can tweak the spacing between paragraphs. I like leaving the before at zero and modifying the after. If you modify the before and not the after, you’ll re-introduce the space after the header. Also, leave the “don’t add space between paragraphs of the same style” box unchecked or you’ll have no spacing between most paragraphs.

Here’s the same document from above changing the after spacing from 8 to 6 points.

Looks about the same, right? This got us about 3 lines down on a 4 pg document. Don’t be too greedy here: if you go too far, it’ll look terrible (unless you also indent the first line, but then you run the risk of it looking like a high school essay).

Play around with modifying both paragraph and line spacing. Again, I recommend tweaking line spacing before fiddling with paragraph spacing.

Widow/orphan control, or how to make paragraphs break at the maximum page length

MS Word tries to keep paragraphs together if a small piece of one extends across pages. For example, if the first line of a paragraph is on page 2 and the rest of the paragraph is on page 3, it’ll move that first line so that it ALSO starts on page 3, leaving valuable space unused on page 2. This is called widow/orphan control, and it’s easy to disable. Highlight your text and shut it off under home –> paragraph tiny arrow –> line and page breaks then uncheck the widow/orphan control box.

This gives a couple of lines back in our 4 page document.

Modifying the format of embedded tables or figures

Tables and figures can take up some serious real estate! You might want to nudge a figure or table out of the margin bounds, but that will get you in some serious trouble with the NIH — Stay inside the margins! Try these strategies instead.

Note: I have a separate post that goes in-depth into optimizing tables in MS word and PPT, here. Check it out if you are struggling with general formatting issues for tables.

Wrap text around your tables or figures

Note: In 3/2024, I discovered a better way to get tables and figures to behave that is described at the end of this section. I’m keeping the older pieces here since I think they are still relevant.

Consider reclaiming some unused real estate by wrapping the text around tables or figures. Be warned! Wrapping text unearths the demons of MS Word formatting. For this example, we’ll focus on just wrapping text around a table to make a ‘floating table’. Below is an example of a table without text wrapping.

Right click on your table and select “Table Properties”, then click right or left alignment and set text wrapping to around. (For figures, right click your figure and click “size and position”, which is analogous to the table options. For simplicity, I’ll focus on tables only here.)

Adjust your row width a bit and now you have a nice compact table! But wait, what’s this? The table decided to insert itself between a word and a period! That’s not good.

When MS Word wraps text around a table, it decides the placement of the now-floating table by inserting an invisible anchor followed by a line break. Here, the invisible anchor is placed between “nunc” and the period. Your instinct will be to move the table to fix this problem, and that is the wrong thing to do. Avoid moving the table because the anchor will do unpredictable things to your document’s formatting. This is so well known to create havoc that it led to a viral Tumblr post from user laurelhach:

laurelhach: using microsoft word *moves an image a mm to the left* all text and images shift. four new pages appear. paragraph breaks form a union. a swarm of commas buzzes at the window. in the distance, sirens.

Moving tables is pointless in MS Word because it doesn’t do what you think it does and you will be sad. Move the text instead. Here, highlight that stray period and the rest of the paragraph starting with “Mauris eleifend” and move it to where that weird line break occurred after “nunc”.

There will be a new line break to erase, but the table should now follow the entire paragraph.

If you are hopelessly lost in fighting the MS Word Floating Table Anchor Demon, and the table decides that it doesn’t want to move ever or is shifted way to the right (so much so that it’s sitting off screen on the right), then the invisible anchor might be sitting to the right of the final word in a heading or paragraph. I recommend reverting the floating table to a non-text wrapped table to figure out what’s wrong and fix everything. Right-click the table and open up the “table properties” option again and change the text wrapping to “none”. The table will appear where the invisible anchor is and now you can shift around the text a bit to get it away from the end of a sentence. Now turn back on text wrapping. This usually fixes everything.

Note: I actually made the table intentionally insert itself between “nunc” and the period for this example. This was just a re-enactment so it’s not MS Word’s fault — this time. BUT this really happens. It’s very problematic if you have >1 table or figure on a page because the Floating Table Anchor Demons will fight with each other and your grant’s formatting will pay the price.

Update 3/2024: There’s actually a simpler way to fight the demons of embedded tables and figures: Setting the position relative to the margin. Setting relative to the margin not only makes anchors disappear, it also ensures that your figures don’t extend outside of the margin. We’ve all heard horror stories about grants being rejected because some table or figure snuck beyond the margin limits.

To set the positioning of your table or figure relative to the margin, right click on your table/figure and select “table properties”, set alignment to “right” (or center or left) and text wrapping to “around” then hit the “positioning” button:

You can then set the horizontal and vertical position relative to the margin rather than the paragraph. Then the anchors go away completely! You can set it all the way to the extremes (left/right, top/bottom) or put in a measurement (eg replacing “bottom” with “5 in” in the vertical “position” box will put your figure or table about 2/3rds of the way down a page).

Tables and figures behave much better when you set the positioning relative to the margin.

Changing cell padding *around* your tables and figures

Continuing with the window above (right click table –> table properties –> position), there’s this “distance from the surrounding text” section at the bottom that changes how much white space padding there is around the table/figure. The default padding is 0.13 inches, which I think is a bit too generous. Try changing this to 0.08 inches and you might get to squeeze in a few extra words in the surrounding text before a new line break.

For figures, there’s an analogous series of options that you can get to by right clicking your figure –> size and position. Under “size and position”, set the wrapping to “tight” and then have at it.

Reduce cell padding *inside* your tables

This is especially helpful for tables with lots of cells. Reducing the cell padding shrinks the white space between the text in a cell and the borders of the table. In contrast with the “save the whitespace” principle of line and paragraph spacing, I personally think that less white space in tables improves readability. Here’s before, with default cell margins of 0.08:

Highlight your entire table and you’ll notice a new contextual ribbon with “design” and “layout” tabs appear. Click layout –> cell size little arrow –> cell –> options –> uncheck the “same as the whole table” box then reduce the cell margins.

Here’s that same table with cell margins reduced from 0.08 to 0.03.

Now you can strategically adjust the column size to get back some space.

Note that you can also apply justification and adjust the line and paragraph spacing within your tables, which might help shrink these things down a bit.

Shrink the font in your tables

AFAIK, NIH guidelines don’t specify a font size to use in tables, just something that can be read. I typically use size 8 or 9 font.

Did I miss anything?

If I did, shoot me an email at timothy.plante@uvm.edu!

Part 3: Introduction to Stata

Note in 2023: We are using R and not Stata this summer so this post doesn’t apply to this year’s projects.

Stata is a popular commercial statistical software package that was first released 30+ years ago. It has some really nice features, loads of top-rate documentation, a very active community, and approachable syntax. For beginners, I think it’s the simplest to learn.

Learning how to use Stata

Stata has really, really, really good documentation.

The documentation is outstanding. Let’s say that you want to learn how to use the –destring– command. In the command line (1a under “Stata’s Interface” below), type:

help destring

…and up will pop a focused help file. There’s the “View complete PDF manual entry” option that has EXTENSIVE documentation of the command. (Note: This file seems to only work well with Adobe PDF reader, not alternative PDF readers like Sumatra). If the focused help file isn’t sufficient to answer your questions, try the complete PDF manual.

The focused help file has multiple parts, but the syntax example is gold. Further down you’ll see example uses of the command.

Web searches will find even more answers

Odds are that someone has already hit the same problem you have in using Stata. Queries in your favorite search engine are likely to find answers on the Statalist archive or UCLA’s excellent website.

You can install Stata programs that other users have written

There are MANY MANY MANY user-written programs out there that can be installed and used in your code. You only need to install them once. Most are on Boston College’s repository called SSC. I use the table1_mc program extensively (it makes pretty table 1s, you can read about it here). To install table1_mc from SSC, you type:

ssc install table1_mc

…and Stata will download and install it for you. It’s ready to use when it finishes installing. And there’s no need to re-install it: it will load each time you start Stata.

Quirks of Stata

Stata only works with rectangular datasets

Think of a rectangular dataset as a single spreadsheet in Excel. It has vertical columns and horizontal rows. There’s no data on a Z axis coming out of the computer at your face.

A rectangular dataset is the only type that Stata works with. Other statistical software like R or Python can handle many more complex data structures. For learners, forcing data to fit within a rectangular dataset is a huge advantage in my mind since that structure is intuitive, and you can always browse your data with the built-in data browser (see 3c under Stata’s Interface, below).

Stata only works with one dataset at a time*

One dataset in Stata is akin to one spreadsheet in a workbook in Excel. In Excel, you can have multiple spreadsheets in one .xlsx workbook file, with each spreadsheet appearing on a different tab at the bottom. All spreadsheets are in memory at the same time, and you can do math across spreadsheets in a workbook: summarize costs in one column on spreadsheet A and have the result appear in one cell on spreadsheet B. In Stata, you can only have one spreadsheet (here, dataset) open at a time.* Because of this, Stata users spend a good deal of time merging and appending multiple datasets to make a single dataset that has all of the necessary variables in the best format from the get-go.
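For a flavor of what that merging and appending looks like, here’s a minimal sketch using hypothetical file names (demographics.dta, labs.dta, more_participants.dta) and a hypothetical id variable:

use "demographics.dta", clear
merge 1:1 id using "labs.dta" // add the columns from labs.dta to matching rows
drop _merge // drop the merge-status variable that -merge- creates
append using "more_participants.dta" // stack on additional rows of participants
save "analytical.dta", replace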

A big problem historically with Stata was that datasets are loaded into RAM, and big datasets could be too big for conventional computers. That’s not much of an issue anymore since even cheap computers have several gigabytes of RAM.

*This isn’t true anymore. Starting in version 16, Stata can actually now have multiple datasets in memory, each stored in its own frame. These frames can be very useful in certain scenarios, but for our purposes, we are going to pretend that you can have just one dataset open at a time.

Data are either string or numeric. Their color changes in the data browser

Strings are basically text that is thought of as words and not numbers. But sometimes a dataset will be imported wrong, and things that are actually numbers (“1.5”, “2.5” in different rows of the same column) will be considered strings and not numbers. Sometimes that’s simply an import error. Sometimes it’s because farther down the column there is a word in a cell (“1.5”, “2.5”, “Specimen error”). If any row of a variable contains something that isn’t a number, Stata makes the entire column, and with it the variable, a string.

IMPORTANT: When viewing strings in the data browser (3c under “Stata’s Interface” below), they appear in RED text. When specifying strings in commands, you need to enclose them in quotations (eg count if name=="Old"). Missing strings are two quotes with nothing in between them (eg count if name=="").

In order to do math, you need to have things be numbers. There are several different numerical formats that you can read about here. If something is an integer (nothing after the decimal), it can be byte, int, and long. If something has a decimal point, it’s float or double. Stata does a nice job selecting which numerical format your data should be in, so you probably don’t need to think much about the difference between byte, int, long, float, or double again.
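You can see each variable’s storage type with the –describe– command. A quick illustration using one of Stata’s built-in demo datasets:

sysuse auto, clear // load the built-in 1978 automobile dataset
describe make price mpg // make is a string (str18); price and mpg are numeric (int)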

IMPORTANT: When viewing numeric variables in the data browser, they appear in BLACK text (or BLUE if they have a label applied). When specifying numeric values in commands, no quotations are needed (eg count if quartile==1). Missing numeric values are periods (eg count if quartile==.), and a period behaves like positive infinity (a missing value is bigger than a value of one billion).
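That missing-values-sort-above-everything behavior is a classic trap in if-statements. A minimal sketch of the gotcha, assuming a hypothetical numeric variable age:

count if age > 65 // probably wrong: this ALSO counts rows where age is missing
count if age > 65 & age < . // the usual idiom, which excludes missing values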

To convert from a string to a numerical value (change the “1” to a 1), you use the –destring– command. You might need to include the force and replace options, but read up on those by typing –help destring–.

To convert from a numerical value to a string (change the 1 to a “1”), use the –tostring– command. Note that missing numerical values will go from a dot to a dot in quotations (. becomes “.”), which is not the same as a missing value for a string, which is just empty quotations (“”). It’s a good idea to follow up a –tostring– command with a command that replaces “.” values with “” values.
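Here’s a minimal sketch of both conversions, assuming a hypothetical string variable biomarker_str (mostly numbers, with the occasional word) and a hypothetical numeric variable quartile:

destring biomarker_str, generate(biomarker) force // force sets non-numbers (eg "Specimen error") to .
tostring quartile, generate(quartile_str) // a missing . becomes the string "."
replace quartile_str = "" if quartile_str == "." // recode to a true missing string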

Stata’s output is only 255 characters wide, max

The output window of Stata will print (“display”) the inputted command and results from that command. It will clip the output at up to 255 characters, and insert a line break to the next row. You can specify:

set linesize 255

…so that the output is always 255 characters wide. Otherwise, it’ll adjust the output to match how wide your output window is.

The working directory is your “documents folder” unless you manually set the working directory with the cd command or open up Stata by double clicking on a .do file in Windows explorer

The working directory is where Stata is working from. If you save a dataset with the –save– command, it’ll save it in the working directory unless you specify the full file path (everything from the C: drive on). If you double click on the Stata icon to open it in Windows and type the present working directory command to see where it’s working from (that’s –pwd–), it’ll print out:

. pwd 
C:\Users\USERNAME\Documents

So, if you type:

save "dataset.dta", replace

…it’ll save dataset.dta in C:\Users\USERNAME\Documents

Let’s say that you really want to be working in your OneDrive folder because that’s secure and backed up and your Documents folder isn’t. The directory for your desired folder is:

C:\Users\USERNAME\OneDrive\Research project\Analysis

In order to save your file there, you’d type:

save "C:\Users\USERNAME\OneDrive\Research project\Analysis\dataset.dta", replace

Note that there’s a space in the “Research project” folder name, so the directory needs to be in quotations. If there were no spaces anywhere in the directory, you could omit the quotations. I’m including quotations everywhere here because it’s good practice.

One option is to change your working directory to the OneDrive folder. You use the –cd– command to do that then any save command will automatically save in that folder:

cd "C:\Users\USERNAME\OneDrive\Research project\Analysis\"
save "dataset.dta", replace

Alternatively, you can save your project’s do file in the “C:\Users\USERNAME\OneDrive\Research project\Analysis\” folder. Rather than opening Stata by clicking on the icon, find the do file in your OneDrive folder in Windows Explorer and double click on it. It’ll open Stata AND set that folder as the working directory!! For a new project, this means opening Stata by clicking on its icon, opening a blank do file, saving that do file in your OneDrive folder, closing the Do File Editor and Stata, then reopening Stata by double clicking on your blank do file in Windows Explorer.

Stata is most effectively used with command-line input, specifically through the Do File Editor. There is also a graphical user interface that can be handy.

I think that everything in Stata should be completed through Do files. These are text files with sequential lines of codes that make Stata perform commands in order.

I strongly, strongly, strongly recommend having a set number of do files. The first (“01 cleanup.do”) will merge original data files, generate new variables, print out your inclusion flow diagram, and save your analytical dataset (“analytical.dta”). The rest of the files each do just one thing, like make a table or render a figure. All files except the first will (a) open your analytical dataset (“analytical.dta”) then (b) run some sort of stats without modifying the analytical dataset. I also have one last miscellaneous do file for random things that you might need for the text (“99 misc things for text.do”), like estimating median and IQR follow-up or some one-off t-tests. If you realize in a later do file (say “02 table 1.do”) that you need to generate a new variable or merge in another raw dataset that you didn’t anticipate when you first built your “analytical.dta” dataset, resist the impulse to make new variables in your “02 table 1.do” file. Instead, go back and rework your “01 cleanup.do” file to generate that variable or merge in that raw dataset, rerun “01 cleanup.do” so it overwrites your “analytical.dta” dataset, and then continue working on your “02 table 1.do” file. (A minimal sketch of what a downstream do file looks like follows the list below.)

So, your do file directory might look like this:

  • 01 cleanup.do – (a) Merge original data files, (b) generate new variables, (c) print out an inclusion flow diagram and (d) also builds your “analytical.dta” analytical dataset that all other do files will use.
  • 02 table 1.do – Opens “analytical.dta” then uses table1_mc or something similar to make a table 1.
  • 03 histogram of exposure.do – Opens “analytical.dta” then uses the –histogram– command to draw a distribution of your exposure.
  • 04 event table.do – Opens “analytical.dta” then uses some sum commands to print out the # of events by group of interest.
  • 05 regressions.do – Opens “analytical.dta” then runs your regressions.
  • 06 spline.do – Opens “analytical.dta” then runs some code to make restricted cubic splines.
  • 99 misc things for text.do – Opens “analytical.dta” then runs one-off code for random details that you might need for the text (e.g., median and IQR follow-up).
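As a concrete illustration of the pattern, here’s a minimal sketch of what a downstream do file in this scheme might contain, with hypothetical variable names:

use "analytical.dta", clear // (a) always start from the analytical dataset
summarize age, detail // (b) run stats; never modify or re-save analytical.dta here
tabulate sex exposure, column // eg, characteristics or events by group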

GUI in Stata

There is a graphical user interface (GUI) with clickable menus. You can click through menus and it’ll generate the code and run the command of interest; this can be handy for stealing syntax for an annoying command. The command from the GUI will appear in the Command History (1c below) and you can right click and copy/paste it into your do file.

I find –import excel– to be frustrating and use the GUI probably 90% of the time to generate that command then copy/paste the syntax into my do file.

You’ll be much better off if you use a do file for nearly everything. Try to minimize the use of the GUI.

Stata won’t let you close a dataset in the memory or overwrite an existing dataset without some effort

The –use– command will open up a dataset in the memory. If you don’t have a dataset opened yet, this will open one:

use dataset.dta

Remember that Stata can only have one dataset opened at a time, so any time you open one when you already have a different dataset opened in memory, Stata will need to drop the open dataset. If you spent a lot of time on the open dataset creating new variables or merging with other datasets, closing it will make you lose all of your work unless you have also saved it. Stata doesn’t want you to make this mistake so if you already have a dataset opened and you type in the above command, Stata will say “No” and you won’t be opening the new dataset.

Instead, you need to put “–, clear–” at the end of the command, like this:

use dataset.dta, clear

And now Stata will drop whatever you have open. It’s really just a nice check to keep you from discarding your work accidentally.

Similarly, if you are trying to save a dataset with the –save– command into an empty folder, you just need to type:

save newdataset.dta

…and Stata will save it no problem. HOWEVER, if you are trying to overwrite an existing dataset with that same name, Stata will say “No” and you won’t be saving your dataset today. This is another check. Instead, you just need to use “–, replace–” to overwrite. Example:

save newdataset.dta, replace

Stata’s interface

Here’s a quick overview of the Stata interface in Windows.

  1. Ways to input and interact with commands:
    1a. Command line – This is where you type command by command. Unless you are just poking around in your data, you should avoid using this. Anything that you want to reproduce in your analysis should be done in the Do file editor.
    1b. Open Do file editor button – The Do File Editor is the most important part of Stata in my opinion. A do file is a long text file saving command after command. This is where you should do all of your analytical work.
    1c. Command history – If you use the command line or GUI to make a command, it’ll be saved here. You can right click on old commands and copy/paste them into your do file.
  2. Output window – Your command will appear here with a preceding dot (“. sysuse auto” means that I had previously typed in “sysuse auto”). The output from your do file or command will appear immediately below.
  3. Ways to interact with data
    3a. Variable list – This is a list of variables in the open dataset. You can double click on them and the variable name will be copied to the command line. You can ctrl+click and select multiple and then copy them to the clipboard. This is quite handy.
    3b. Variable and dataset properties – This will let you see details about a selected variable in the variable list and the current dataset in memory.
    3c. Data browser – You can also pop this open with the –bro– command. This shows all data in a spreadsheet format that looks like Excel.

Table 1 with pweights in Stata

The very excellent table1_mc program will automate generation of your Table 1 for nearly all needs (read about it here), except for datasets using pweight. I’ve been toying around with automating Excel table generation using Stata v16+ Frames features. I recently started working on a database that requires pweighting for analyses, and opted to use this as an opportunity to use Frames to automate a pweight-adjusted Table 1.

v1.3 of my code (updated 2024-2) to automate this lives here: https://www.uvm.edu/~tbplante/p_weight_table1_v1_3.do

You can just put:

do https://www.uvm.edu/~tbplante/p_weight_table1_v1_3.do

…in your Stata do file and it’ll pull in the entire script! Note that full instructions will show up in the Stata output window when you run the above line. The instructions below are incomplete. This code does not produce P-values.

How to use this do file

// Step 1a: close all open frames, drop all macros, 
//          and open your dataset
frames reset
macro drop _all
webuse multistage, clear
//
// Step 1b: Figure out where your present working directory is, 
//          this is where the excel spreadsheet will be saved. 
//          Change the working directory with the "cd"
//          command as needed. 
pwd
//
// Step 2: Declare your data to be pweighted
svyset county [pweight=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
//
// Step 3: If your columns require the generation of pweighted 
//         tertiles, quartiles, or whatnot, do that now. 
//         For this example, we'll do by quartile of weight. 
// note: per this website: https://www.stata.com/support/faqs/statistics/percentiles-for-survey-data/
//       ...Only the pweight needs to be specified when making 
//        weighted quartiles. 
xtile weightquart=weight [pweight=sampwgt], n(4) 
//
// Step 4: Recode binary variables so they are 0 and 1 (if needed)
// Note: in this dataset, it's 1 and 2 for male and female, 
//       respectively. 
gen female = 1 if sex==2 // recode sex to female, where 1 is female
replace female=0 if sex==1 // male is now 0
//
// Step 5: Name your variables and options for multiple options
// Note: The variables are already labeled but we are doing it 
//       again for completeness' sake. 
//
// Continuous variables
label variable weight "Weight in lbs"
label variable height "Height in in" // I don't know why people are 400 in tall. that's 33 ft.
// Nominal variables (same process would happen for ordinal or continuous variables)
label variable race "Race" // Race is nominal so need to also define values of race
label define racelabels 1 "White" 2 "Black" 3 "Other"
label values race racelabels // Apply the labels!!!
// Binary variables, no need to apply labels
label variable female "Female sex"
//
// Step 6: Call the do file
// Note: Instructions on this program's use will show right 
//       after it's called. Look at the Stata output window. 
do https://www.uvm.edu/~tbplante/p_weight_table1_v1_3.do
//
// Step 7: Now follow the instructions! That are in the stata 
//         output window!
table1pweight_start table1 1 4 weightquart weight %10.1f
table1pweight_contn_sd  table1 1 4 weightquart height %10.1f
table1pweight_bin  table1 1 4 weightquart female %10.0f
table1pweight_cat  table1 1 4 weightquart race %10.0f
table1pweight_end table1 1 4 weightquart weight %10.2f
//
// CLOSE THE NEW EXCEL FILE OR YOU'LL GET AN ERROR WHEN 
// RUNNING STEP 7.
//
// Step 8: Look at the excel output! Here, it's a file called 
//         table1.xlsx that's sitting in your pwd (see step 
//         1b above). You might notice blanks for the 2nd and 
//         3rd columns, but that's because of a stratum with a 
//         single sampling unit. You can confirm the numbers 
//         using the survey tools.
// remember! This is where your excel file is saved:
pwd

Summer medical student research project series Part 1: Getting set up

Summer goals and expectations

Hi there! Thanks for expressing interest in working on an epidemiology project with me this summer. This project will entail:

  • Using cohort study or clinical trial data to try to extend knowledge of cardiovascular disease (CVD).
  • You getting your hands dirty with statistical coding (e.g., R or Stata).

My expectations for all LCOM summer students are:

  • You have a computer that works and internet that is dependable enough to allow Zoom conferences. You don’t have to be in VT.
  • In advance of the summer, you will submit a manuscript proposal to the cohort that will be used, and apply for funding through the CVRI or LCOM (typically due in February). If we need an IRB proposal (which we likely don’t since there’s probably an existing IRB), then you’ll lead the completion of that.
  • We’ll meet weekly via Zoom for an hour or so during the summer to review the project’s progress. I’ll be available in between meetings via email, Zoom, or whatnot.
  • If we’re working with a biostatistician and using R, you’ll meet with them to learn R pointers.
  • You’ll complete the analysis, with help from me in learning the ins and outs of coding.
  • If doing a REGARDS/LCBR-related project, you’ll attend lab meetings.
  • At the end of the summer you will have: (1) A complete first draft of a manuscript and (2) a completed abstract that can be submitted to a conference.

Things to set up now.

Zotero

I use Zotero as a reference manager. It’s the bomb diggity. We will share references in a private group library that we can both edit. Only the people who have this library shared with them can see its contents.

To install Zotero, do the following:

  1. A free Zotero account. Sign up here. Please tell me your username so I can start a group library with you.
  2. Zotero’s desktop app. Make sure to log into your account in the desktop app. It’s under edit –> preferences.
  3. The Zotero web browser plugin for your web browser of choice. You need to have the Zotero desktop app open for this to work.
  4. The Zotero MS Word plugin (in Zotero desktop app, click Edit –> Preferences –> Cite –> Word Processors). This has been finicky with the specific MS Office install on the LCOM laptops, so it might take some fiddling to get it to work. But! It’ll work.

I’ll send you a shared library invitation. To accept the group library invite, do the following:

  • Go to zotero.org, log in.
  • In the top right, click on your username. A menu should drop down, click “inbox”.
  • Accept the group library invitation.
  • Open up the Zotero desktop app and let it sync (again, you need to be signed in on the desktop app, see #2 above). The group library folder should appear in the left column all the way at the bottom.

Group libraries are awesome because we can compile references that either of us can insert into a document. Please keep the group library organized. If you add a new reference, please make a subfolder to stick it in so you don’t have to search for references one by one.

How to use Zotero

This is covered in a separate post, here.

Microsoft OneDrive

This is through LCOM. Not UVM, not your personal account.

  1. Open the OneDrive on your computer and sign in with your LCOM credentials if you aren’t already.
  2. I’ll share a research folder with you. You’ll need to sync it with your computer. To do that, go to onedrive.com, log in with your LCOM credentials (firstname dot lastname at med dot uvm dot edu).
    • After you log in, you’ll be on the landing page for OneDrive. Click “Shared” on the left column.
    • If you see a mix of files and folders, limit the view to just folders by clicking the icon of a folder to the right of “With You”.
    • Find the research folder and select it. On the top bar click “Add shortcut to My Files” (you might need to make the window full screen to see this option).
    • Allow the OneDrive desktop app to sync. Now all of the files should be available offline in the Onedrive folder on your computer.

Microsoft Word

Unfortunately, writing papers in Google Drive is a bit too onerous. Please download this manuscript file, which you will be using to draft your manuscript.

Your statistical software, option 1: Stata (note: this is what we are using for the 2024 summer session)

At this point you should know if you are using Stata or R. I use Stata. UVM has an institutional subscription to Stata. You can download and install it from the UVM Software page, here. For this you will log in with your UVM (not LCOM) credentials. To download it, hit the down arrow (1) then download. After it’s installed, you’ll need the serial number, code, and authorization to activate it. That’s under “more info” (2).

[Image: Two steps to install Stata from UVM]

Your statistical software, option 2: R – we aren’t using it in the 2023 summer session.

R is a free and open source computer language intended for use in statistics. RStudio is a commercial software that makes using R much more user-friendly. I have only used R a few times and don’t have expertise in the language.

Research credentialing things

CITI training and directory modification

You need to complete the CITI training (both “human subjects training/IRB” and “GCP” trainings) in advance of starting the summer so you can be added to our human subjects IRB; see details at these links:

And please modify your directory listing so you can be added to the IRB as described here: https://www.uvm.edu/sites/default/files/media/Students_-_How_To_Sign_Up_to_Do_Research_at_UVM_2.pdf

Once you have finished these steps, let me know so we can add you to the IRB. FYI, there is a delay between when we request that you be added to the IRB and when you are actually added.

Cornerstone research training

Historically, students have been required to do an online research training session in Cornerstone by the end of the first week. Details for this come from the Student Services office. I don’t have details about how to complete it. IIRC this is specifically for students doing projects that involve interacting with patients or patient data at UVMMC, which isn’t what we are doing. But, look for those details coming your way.