{"id":832,"date":"2021-06-29T15:03:50","date_gmt":"2021-06-29T19:03:50","guid":{"rendered":"https:\/\/blog.uvm.edu\/tbplante\/?p=832"},"modified":"2024-07-01T15:49:38","modified_gmt":"2024-07-01T19:49:38","slug":"part-4-defining-your-population-exposure-and-outcome","status":"publish","type":"post","link":"https:\/\/blog.uvm.edu\/tbplante\/2021\/06\/29\/part-4-defining-your-population-exposure-and-outcome\/","title":{"rendered":"Part 4: Defining your population, exposure, and outcome"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Importing and merging files<\/h2>\n\n\n\n<p>Odds are that you will get raw data to work with that is in a CSV, sas7bdat, Excel&#8217;s xlsx file format, or something else. Stata can natively import many (but not all) file types. The simplest thing to do is use the &#8211;import&#8211; command then immediately save things as a temporary Stata .dta file and later merge them. <\/p>\n\n\n\n<p>Importing commands differ by filetype. Type &#8211;help import&#8211; for details. But here are the 3 commands I use most frequently. These assume that the files are in the present working directory, which you can see by typing &#8211;pwd&#8211;. Remember to open Stata in Windows by double-clicking the .do file of interest in Windows explorer to set the folder that the .do file is sitting in as the working directory. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ CSV aka comma separated value, importing variable names\n\/\/ from the first row\nimport delim using \"NAME.csv\", clear varnames(1)\nsave \"name_csv.dta\", replace \/\/ save as Stata dataset\n\n\/\/ sas7bdat:\nimport sas using \"NAME.sas7bdat\", clear case(lower)\nsave \"name_sas.dta\", replace \/\/ save as Stata dataset\n\n\n\/\/ Excel, MAKE SURE THAT YOU DON'T ALSO HAVE THE FILE\n\/\/ OPEN IN EXCEL or it won't import it. 
This will import\n\/\/ variable names from the first row.\nimport excel using \"NAME.xlsx\", clear firstrow\nsave \"name_xlsx.dta\", replace \/\/ save as Stata dataset<\/code><\/pre>\n\n\n\n<p>These commands have lots of additional options (for example, picking a sheet or cell range in Excel files, or setting the delimiter for text files).<\/p>\n\n\n\n<p>To merge, simply combine all of your new .dta files using the &#8211;merge&#8211; command. This assumes that all files have a variable named &#8220;id&#8221; that uniquely identifies all rows and is preserved across files. For example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>use \"name_csv.dta\", clear\nmerge 1:1 id using \"name_sas.dta\" \/\/ match rows one-to-one on id\ndrop _merge\nmerge 1:1 id using \"name_xlsx.dta\"\ndrop _merge<\/code><\/pre>\n\n\n\n<p>The merge command generates a &#8220;_merge&#8221; variable that tells you where each row came from (1 = only in the dataset in memory, 2 = only in the file you merged in, 3 = matched in both). Review this variable and the output in Stata <em>very closely<\/em>. You need to drop the &#8220;_merge&#8221; variable before merging other datasets. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Getting the population, exposure, and outcome correct in your analytical dataset, and being able to come back and fix goofs later<\/h2>\n\n\n\n<p>Defining a study population, exposure variable, and outcome variable is a critical early step after determining your analysis plan. Most epidemiology projects come as a huge set of datasets, and you&#8217;ll probably need to join multiple files into one when defining your analytical population. Defining your analytical population is an easy place to make errors so you&#8217;ll want to have a specific script that you can come back and edit again if and when you find goofs.<\/p>\n\n\n\n<p><strong><em>For the love of Pete &#8212; Please generate your population, exposure, and outcome variables using a script so you can go back and reproduce these variables and fix any bugs you might find!<\/em><\/strong><\/p>\n\n\n\n<p>When you make these variables, you&#8217;ll likely need to combine several datasets. 
This will require mastery of importing datasets (if not in the native format for your statistical program) and combining datasets. For Stata, this means using &#8211;import&#8211; and &#8211;save&#8211; commands to bring everything over into Stata format, and then using &#8211;merge&#8211; commands to combine multiple datasets. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Make a variable for your population that is 0 (not included) or 1 (included)<\/h3>\n\n\n\n<p>One option in generating your dataset is to drop everyone who isn&#8217;t in your population. I recommend against dropping them. Instead, create a variable to define your population. Name it something simple like &#8220;included&#8221;, &#8220;primary_population&#8221;, &#8220;group_a&#8221; or whatnot. If you will have multiple populations (say, one defined by prevalent hypertension using JNC7 vs. ACC\/AHA 2017 hypertension thresholds), then you should have a variable for each, with a suffix that makes them easy to tell apart, like &#8220;group_jnc7&#8221; and &#8220;group_aha2017&#8221;. <\/p>\n\n\n\n<p>Useful code in R and Stata to do this:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Count <\/li>\n\n\n\n<li>Generate and replace (Stata), mutate (R)<\/li>\n\n\n\n<li>Combine these with the single equals sign &#8220;=&#8221; for assignment (Stata &amp; R, I say out loud &#8220;assign&#8221; when using this) and &#8220;&lt;-&#8221; (R)<\/li>\n\n\n\n<li>Use &#8211;if&#8211;, &#8211;and&#8211;, &amp; &#8211;or&#8211; conditions (and is written &amp; and or is written | in both Stata and R)<\/li>\n\n\n\n<li>Comparison operators: &gt;, &gt;=, &lt;, &lt;=, != (not equal), and == (&quot;equals exactly&quot;), not a single equals sign<\/li>\n<\/ul>\n\n\n\n<p>Example Stata code to count # of people with diabetes, generate a variable for group_a and assign someone to group_a if they have diabetes. 
<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>count if diabetes==1\ngen group_a=0\nreplace group_a=1 if diabetes==1\ncount if group_a==1 <\/code><\/pre>\n\n\n\n<p>When creating a new variable from another variable, sometimes it&#8217;s helpful to start to generate a missing variable first (a dot) then replace a certain group as 0 and another as 1. This is especially helpful for strings. Strings must be in quotations, btw. Here&#8217;s an example of how to make a numeric variable from &#8220;Y&#8221; and &#8220;N&#8221; strings<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gen diabetes = . \/\/ start with variable that's missing for all\nreplace diabetes = 0 if diab_string == \"N\"\nreplace diabetes = 1 if diab_string == \"Y\"<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Missing as positive infinity in Stata<\/h3>\n\n\n\n<p>Read up about missing values by typing &#8211;help missing&#8211;. In Stata, missing values are either a dot or a dot with a letter (e.g., &#8220;.&#8221; or &#8220;.a&#8221;). Mathematically, Stata considers a dot missing value to be positive infinity, and each dot and letter missing value to be even larger positive infinity, so:  one billion trillion &lt; . &lt; .a &lt; .b &lt; .c and so on. Understanding that positive infinity is considered missing in Stata is <strong><em>critically important<\/em><\/strong> when using greater than statements, since anything greater than another value will include all missing values. So, imagine you had a population of 1 million people and only 100 of them were asked how many popsicles they have in their freezer and half had 0 to 10 and the other half had 11 to 20. If you try to make a variable called &#8220;popsicle_count&#8221; that is 1 for the lower half (0-10) and 2 for the higher half (11-20),  you&#8217;d do something like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gen popsicle_count = . 
\/\/ everyone starts with a missing value \nreplace popsicle_count=1 if popsicle &lt;=10<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>replace popsicle_count = 2 if popsicle &gt;10<\/code><\/pre>\n\n\n\n<p>&#8230;the popsicle_count variable would have 50 people with a value of 1 and 999,950 with a value of 2. This is because Stata treats missing values as greater than 10, and the last line didn&#8217;t specify what to do with missing values. The easy workaround here is to add &#8220;&amp; popsicle &lt;.&#8221; to the last line to specify that you only want to include people with a value less than positive infinity, aka people whose value is not missing. The <em>correct<\/em> way of writing this would be:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gen popsicle_count = . \/\/ everyone starts with a missing value \nreplace popsicle_count=1 if popsicle &lt;=10\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>replace popsicle_count=2 if popsicle &gt;10 &amp; popsicle &lt;. \/\/ NOTE!!<\/code><\/pre>\n\n\n\n<p>This last line correctly ignores all missing values when assigning a value of 2, since it applies the value of 2 only to anything greater than 10 and less than positive infinity, aka anything less than a missing value. <\/p>\n\n\n\n<p>Here&#8217;s example R code for the earlier diabetes and group_a example (df=data frame, using the dplyr package).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>library(dplyr) # provides %&gt;%, filter(), and mutate()\nnrow( df %&gt;% filter(diabetes == 1) )\ndf = df %&gt;% mutate(group_a = ifelse(diabetes == 1, 1, 0) )\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Make an inclusion flowchart<\/h3>\n\n\n\n<p>These are essential charts in observational epidemiology. As you define your population, generate this sort of figure. Offshoots of the nodes define why folks are dropped from the subsequent node. 
Here&#8217;s how I approach this, though folks might have different approaches:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Node 1 <\/strong>is the overall population.<\/li>\n\n\n\n<li><strong>Node 2<\/strong> is individuals who you would not drop for <strong><em>baseline<\/em><\/strong> eligibility reasons (reasons to drop: a prior event that disqualifies them, or missing data that prevents assessment of their eligibility) <\/li>\n\n\n\n<li><strong>Node 3<\/strong> is individuals who you would not drop because you can assess them for necessary <strong><em>follow-up<\/em><\/strong> (reasons to drop: incomplete follow-up, died before required follow-up time, missing data)<\/li>\n\n\n\n<li><strong>Node 4<\/strong> is individuals who you would not drop because they had all required <strong><em>exposure <\/em><\/strong>covariates (if looking at stroke by cholesterol level, people who all have a cholesterol measurement). This is your analytical population. <\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"583\" src=\"https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-1-1024x583.png\" alt=\"\" class=\"wp-image-847\" srcset=\"https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-1-1024x583.png 1024w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-1-300x171.png 300w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-1-768x438.png 768w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-1.png 1304w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>If you have two separate populations (e.g., different hypertension populations by JNC7 or ACC\/AHA 2017), you might opt to make two entirely separate figures. 
If you have slightly different populations because of multiple exposures (e.g., 3 different inflammatory biomarkers, but you have different missingness between the 3), you might have the last node fork off into different nodes, like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"313\" src=\"https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-2-1024x313.png\" alt=\"\" class=\"wp-image-848\" srcset=\"https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-2-1024x313.png 1024w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-2-300x92.png 300w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-2-768x235.png 768w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-2.png 1351w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>I generate the numbers for these as text output in Stata, then manually draw the figures in PowerPoint. To get the numbers, I use a series of &#8220;sum&#8221; and &#8220;count&#8221; commands, each followed by a &#8220;noisily display&#8221; command, all inside a quietly block. (Noisily overrides quietly for a single line. When debugging, you might want to comment out the quietly wrapper so you can see all of the output.) <\/p>\n\n\n\n<p>I also make an &#8220;include&#8221; variable that defines the analytical population(s) of interest.<\/p>\n\n\n\n<p>You can display specific bits of data after a &#8220;sum&#8221; command, including r(N), which is the N of a sample. <strong><em>If you wonder what bits of data are available after a command like &#8220;sum if include==1&#8221;, type &#8220;return list&#8221;<\/em><\/strong>. <\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>quietly {\ngen include=1 \/\/ make a variable called \"include\" that is 1 for everyone\ncount if include ==1 \/\/ count the # of rows with an include variable equal to 1, this is everyone. It will save that value as r(N). 
\nnoisily display \"Original study population, n= \" r(N) \/\/ you can print this\n\/\/\n\/\/ now lets print out the people we exclude\ncount if prevalent_htn_jnc7==1 \/\/ this will count the # with prevalent htn to be excluded\nnoisily display \" --&gt; Hypertension at baseline, n= \" r(N)\ncount if prevalent_htn_missing==1 \nnoisily display \" --&gt; Missing bp at baseline, n= \" r(N)\n\/\/\n\/\/ now we are going to replace the include variable as 0 for people missing the two things above\nreplace include = 0 if prevalent_htn_jnc7==1\nreplace include = 0 if prevalent_htn_missing==1\ncount if include==1\nnoisily display \"normotensive at baseline, n= \" r(N)\n\/\/ and so on\n}<\/code><\/pre>\n\n\n\n<p>If you are using weighted data, this approach will differ slightly. First, you will have to svyset your data. Next, you will use &#8220;svy, subpop(IF COMMAND HERE): total [thing]&#8221;. Instead of using &#8220;return list&#8221;, you use &#8220;ereturn list&#8221; to see the bits that are saved. The weighted N is e(N_pop), for example. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>svyset &#091;pweight=sampleweight]\ngen include=1\nsvy, subpop(if include==1): total include \nereturn list \/\/ notice e(b) is there, it's the beta from the prior estimation\nnoisily di \"original study population, n= \" e(b)&#091;1,1]\n\/\/ above prints out the first cell of the matrix e(b), hence the &#091;1,1]\n\/\/ type \"matrix list e(b)\" to see what's in that matrix. \n\/\/ now figure out how many are excluded for missing a biomarker\nsvy, subpop(if include==1 &amp; biomarker==.): total include\n\/\/ now print it out, but since this uses sampling, it will not be a whole number. Print out a whole number with the %4.0f formatting code. \nnoisily di \" --&gt; missing biomarker, n= \" %4.0f e(b)&#091;1,1]\n\/\/ now update the include to be 0 for missing biomarkers \n\/\/ and display the count of the node print the node\nreplace include=0 if biomarker==. 
\nsvy, subpop(if include==1): total include\nnoisily di \"not missing biomarker, n= \" %4.0f e(b)&#091;1,1]\n\/\/ and so on\n<\/code><\/pre>\n\n\n\n<p>When you get to the end, you&#8217;ll have a variable called &#8220;include&#8221; that you will use in all of your later analyses to define your analytical population. Depending on your analysis, you might need to make a few different include variables. For example, we commonly run hypertension analyses using both JNC7 and ACC\/AHA 2017 hypertension definitions, so I usually have an &#8220;include_jnc7&#8221; and also a separate &#8220;include_aha2017&#8221; variable. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Defining exposure and outcome<\/h2>\n\n\n\n<p>This seems simple, but clearly define what your exposure and your outcome are. Each should have a simple 0 or 1 variable (if dichotomous) with an intuitive name. You might need 2 separate outcomes if you are using different definitions, like &#8220;incident_htn_jnc7&#8221; and &#8220;incident_htn_aha2017&#8221;. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Table 1<\/h2>\n\n\n\n<p>&#8220;Table 1&#8221; shows core features of the population by the exposure. Don&#8217;t include the outcome as a row, but include demographics and key risk factors\/covariates for the outcome (e.g., if CVD, then diabetes, blood pressure, cholesterol, etc.). Some folks include a 2nd column that presents the N for that row. Some folks also include a P-value comparison as a final column. I tend to generate the P value every time but only present it if the reviewers ask for it. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"648\" height=\"425\" src=\"https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-3.png\" alt=\"\" class=\"wp-image-849\" srcset=\"https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-3.png 648w, https:\/\/blog.uvm.edu\/tbplante\/files\/2021\/06\/image-3-300x197.png 300w\" sizes=\"auto, (max-width: 648px) 100vw, 648px\" \/><\/figure>\n\n\n\n<p>In Stata, I use the excellent table1_mc program to generate these, which you can read about <a href=\"https:\/\/blog.uvm.edu\/tbplante\/2019\/07\/11\/make-a-table-1-in-stata-in-no-time-with-table1_mc\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. If you are using p-weighted data, you can use this script that I wrote, <a href=\"https:\/\/blog.uvm.edu\/tbplante\/2021\/03\/01\/table-1-with-pweights-in-stata\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. <\/p>\n\n\n\n<p>For R, I am told that <a href=\"http:\/\/www.danieldsjoberg.com\/gtsummary\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">gtsummary<\/a> works well.<\/p>\n\n\n\n<p>For lots more details on Table 1s, please continue to the next post, <a href=\"https:\/\/blog.uvm.edu\/tbplante\/2022\/07\/05\/part-5-baseline-characteristics-in-a-table-1\/\" data-type=\"link\" data-id=\"https:\/\/blog.uvm.edu\/tbplante\/2022\/07\/05\/part-5-baseline-characteristics-in-a-table-1\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Importing and merging files Odds are that you will get raw data to work with that is in a CSV, sas7bdat, Excel&#8217;s xlsx file format, or something else. Stata can natively import many (but not all) file types. 
The simplest thing to do is use the &#8211;import&#8211; command then immediately save things as a temporary &hellip; <a href=\"https:\/\/blog.uvm.edu\/tbplante\/2021\/06\/29\/part-4-defining-your-population-exposure-and-outcome\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Part 4: Defining your population, exposure, and outcome<\/span><\/a><\/p>\n","protected":false},"author":4473,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[678998],"tags":[],"class_list":["post-832","post","type-post","status-publish","format-standard","hentry","category-summer-students"],"_links":{"self":[{"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/posts\/832","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/users\/4473"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/comments?post=832"}],"version-history":[{"count":27,"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/posts\/832\/revisions"}],"predecessor-version":[{"id":1821,"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/posts\/832\/revisions\/1821"}],"wp:attachment":[{"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/media?parent=832"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/categories?post=832"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.uvm.edu\/tbplante\/wp-json\/wp\/v2\/tags?post=832"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}