Understanding Scientific Research – A Comprehensive Guide

Reading research articles can be intimidating and confusing, but it doesn’t have to be that way! Learn how to easily understand research like a pro.

Reading primary literature – meaning research articles – is intimidating, confusing, and seems out of reach for most people who aren’t trained scientists. But it doesn’t have to be that way. Let’s cover how to make reading research articles easy, fun, and approachable.

As Richard Feynman once said, “the first principle is that you must not fool yourself — and you are the easiest person to fool.” We’ll equip you with the tools and strategies to not be fooled with regards to scientific research moving forward.

As part of my neuroscience major in college, we were required to read dozens of research articles related to the field. We spent hours going over every single article, dissecting its strengths, weaknesses, and working to accurately assess what value the paper provided to the scientific community.

Yet despite reading dozens of these neuroscience papers, when I entered medical school, I still didn’t enjoy reading the primary literature. In fact, I avoided doing so unless absolutely necessary. It wasn’t until I began doing research of my own, read hundreds of papers, and published dozens of my own that it all began to click. Being able to understand and assess the scientific literature is so important for separating the truth from the noise, but it doesn’t have to take you years as it did for me.

1 | The Types of Research Studies

When it comes to scientific studies, there are different levels of evidence. Not all studies are created equal, and the study design is a big part of how strong the evidence is.

At the top, randomized controlled trials are the gold standard, the cream of the crop. Second, we have prospective cohort and case-control studies. Prospective means you follow the subjects over time to see the outcomes of interest. Third, we have retrospective cohort or case-control studies, meaning you already have the outcomes of interest, but look back historically and make interpretations. Fourth, we have case series and case reports, which are investigations into individual patient cases. There are other levels, such as systematic reviews, meta-analyses, expert opinion, and others, but for simplicity, we’ll stick to these four levels.

This ranking may not make sense just yet, and that’s ok. We’ll now cover the elements of research, and how they apply to each type of research study, and it will all begin to come together.

Epidemiology, coming from the Greek term epidēmia, translates to “prevalence of disease”. It is the branch of medicine dealing with the incidence, distribution, and control of diseases. If the primary aim of science is discovering the truth and determining cause and effect, then it’s important to note that most observational epidemiological studies cannot establish causality, and therefore they cannot soundly accept or reject a hypothesis. Strong correlations found in observational studies can be compelling enough to take seriously, but there are limitations.

When it comes to observational studies, compared to experimental studies, we have cohort, case-control, and cross-sectional. Without diving into the differences of each type of observational study, understand this generally entails observing large groups of individuals and recording their exposure to risk factors to find associations with possible causes of disease. If they’re retrospective, they’re looking back in time to identify particular characteristics associated with the outcome of interest. These types of studies are prone to confounding and other biases, which can take us further from the truth. We’ll cover this in more detail shortly. Prospective cohort studies recruit subjects and collect baseline information before the subjects have developed the outcome of interest. The advantage of prospective studies is they reduce several types of biases that are commonplace in retrospective studies.

There are four steps to the scientific method:

Make an observation
Come up with a (falsifiable) hypothesis based on this observation
Test the hypothesis through an experiment
Accept or reject hypothesis based on experiment results

To determine causality, meaning if a cause results in an effect (like whether or not red meat causes cancer), the hypothesis must be adequately tested. This is the part that is most commonly overlooked, particularly in disciplines such as nutrition, because doing experiments necessary to establish causality presents several obstacles. For that reason, many researchers turn to doing easier observational studies, and I’m guilty of this too, but the problem is that most of these don’t get us closer to the truth.

The gold standard for determining causality is a well designed randomized controlled trial, or RCT for short. The researchers create inclusion and exclusion criteria to gather a group of subjects qualified for the study. Then, they randomize subjects into two groups. For example, one group receives drug A, and the other group receives a placebo.

By randomly allocating participants into the treatment or control group, much of the bias that affects observational studies is removed. In short, finding cause and effect becomes much easier. If randomized controlled trials are so much better, then why aren’t they always used?

First, they can be very expensive. One report looking at all RCTs funded by the US National Institute of Neurological Disorders and Stroke found 28 trials with a total cost of $335 million.

Second, RCTs take a long time. According to one study, the median time from the start of enrollment to publication was 5 and a half years.

Third, not all RCTs are created equal, and it’s quite challenging to conduct a high-quality RCT. These studies must have adequate randomization, stratification, blinding, sample size, power, proper selection of endpoints, clearly defined selection criteria, and more.

Fourth, there are ethical considerations. If you’re assigning someone to be in the control or experimental group, you can assign them to something you think will be helpful, like a medication or other treatment, or to something you expect to have no effect, like a placebo. But you wouldn’t be able to assign someone to a group that you would expect to harm them. Can you imagine assigning some teenagers to smoke cigarettes and some not to? This is a key distinction between RCTs and observational studies. RCTs generally test interventions that are expected to be beneficial, while questions about potentially harmful exposures are usually left to observational epidemiology.

2 | Relative Risk vs Absolute Risk

To better understand the strengths and weaknesses of any particular research study, we’ll need to explore statistics. Don’t worry, we’ll keep it to basic statistics, nothing too crazy.

Relative risk, in its simplest terms, is the relative difference in risk between two groups. If a certain drug decreases the risk of colon cancer from 0.2% to 0.1%, that’s a 50% relative risk reduction. Decreasing the initial risk, 0.2%, by 50%, gives you a risk of 0.1%. The actual change in the rate of the event occurring is the absolute risk reduction, which in this instance is 0.1 percentage points, because 0.2% – 0.1% = 0.1%.
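To make the arithmetic concrete, here is a minimal Python sketch that computes both numbers from the colon cancer example above. The function name `risk_reduction` is just something I made up for illustration.

```python
def risk_reduction(control_risk: float, treated_risk: float) -> tuple[float, float]:
    """Return (relative risk reduction, absolute risk reduction) as fractions."""
    absolute = control_risk - treated_risk   # change in the actual event rate
    relative = absolute / control_risk       # that change as a share of the baseline risk
    return relative, absolute


# Risks from the example: 0.2% without the drug, 0.1% with it.
rrr, arr = risk_reduction(control_risk=0.002, treated_risk=0.001)
print(f"Relative risk reduction: {rrr:.0%}")
print(f"Absolute risk reduction: {arr * 100:.1f} percentage points")
```

Same data, two very different-sounding numbers, which is exactly why it pays to look for both when reading a study.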

Most studies, and especially journalists, summarize and report results as relative risk changes. These are much more headline-worthy, but they obscure the true impact, which absolute risk communicates far better. After all, what’s more likely to get clicks? “New drug reduces colon cancer risk by 50%!” That would be relative risk reduction. Alternatively, “New drug reduces colon cancer risk from 2 per 1000 to 1 per 1000”. That would be absolute risk reduction.

3 | Confounding & Biases

In the world of research, bias is any systematic error that pushes a study toward false or misleading conclusions.

Let’s start with one of the biggest offenders: confounding.

A confounding variable is one that influences both the independent and dependent variables but wasn’t accounted for in the study. For example, let’s say we’re studying the correlation between bicycling and the sale of ice-cream. As the bicycling rate increases, so does the sale of ice cream. The researchers conclude that bicycling causes people to consume ice cream. The third variable, weather, confounds the relationship between bicycling and ice cream, as when it’s hot outside, people are more likely to bicycle and also more likely to buy ice cream.
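If you want to see confounding happen, here is a small simulation sketch in Python. All of the numbers are invented for illustration: temperature drives both the number of cyclists and ice cream sales, and neither one depends on the other. The helper `pearson` is a hand-rolled correlation function, not part of any library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical daily temperatures in degrees Celsius.
temperature = [rng.uniform(0, 35) for _ in range(2000)]

# Both outcomes depend on temperature plus their own independent noise.
# Neither one appears in the other's formula.
cyclists = [10 * t + rng.gauss(0, 50) for t in temperature]
ice_cream = [5 * t + rng.gauss(0, 25) for t in temperature]

r = pearson(cyclists, ice_cream)
print(f"Correlation between cyclists and ice cream sales: {r:.2f}")
```

The correlation comes out strongly positive even though, by construction, bicycling has no effect on ice cream sales at all.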

Another bias that isn’t properly appreciated, particularly in the world of nutrition, is the healthy user bias. Health-conscious people are more likely to do certain activities. For example, most health-conscious people have heard that red meat is bad, and therefore they’re less likely to eat red meat. People who eat more red meat tend, on average, to be less health-conscious, and therefore are also more likely to smoke, not exercise, and consume soft drinks. When an observational study comes out comparing those who eat red meat to those who don’t, we cannot actually conclude that any difference in outcomes is due to the red meat and not these other factors. Even when researchers are aware of these factors, they are extremely difficult to properly account for.

Selection bias refers to the study population not being representative of the target population, usually due to errors in the selection of subjects into a study, or the likelihood of them staying in the study. In the “lost to follow-up” bias, researchers are unable to follow up with certain subjects, so they don’t know what happened to them, such as whether they developed the outcome of interest. This leads to a selection bias when the loss to follow up is not the same across the exposed and unexposed groups.

There are many other biases, but we don’t have time to explore each and every one here.

4 | Randomization & Statistics

Good research minimizes the effects of confounding and biases. How do we do that?

Randomization is a method where study participants are randomly assigned to a treatment or control group. Randomization is a key part of being able to distinguish cause and effect, as proper randomization balances confounders, both known and unknown, across the groups on average. You cannot do this in observational studies, as subjects self-select into whichever group.
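Here is a rough Python sketch of why random assignment matters. It builds a made-up population where about 30% of people smoke, then compares how evenly smokers end up split under random assignment versus self-selection. The opt-in rates (20% for smokers, 60% for non-smokers) are assumptions chosen purely to illustrate the point.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 people, roughly 30% of whom smoke.
people = [{"smoker": rng.random() < 0.3} for _ in range(10_000)]


def smoker_share(group):
    """Fraction of a group that smokes."""
    return sum(p["smoker"] for p in group) / len(group)


# Randomized assignment: shuffle everyone, then split down the middle.
shuffled = people[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:5_000], shuffled[5_000:]
randomized_gap = abs(smoker_share(treatment) - smoker_share(control))

# Self-selection (assumed behavior): smokers opt in 20% of the time,
# non-smokers opt in 60% of the time.
opted_in, opted_out = [], []
for person in people:
    chance = 0.2 if person["smoker"] else 0.6
    (opted_in if rng.random() < chance else opted_out).append(person)
self_selected_gap = abs(smoker_share(opted_in) - smoker_share(opted_out))

print(f"Smoker gap between groups, randomized:    {randomized_gap:.3f}")
print(f"Smoker gap between groups, self-selected: {self_selected_gap:.3f}")
```

With random assignment the two groups end up with nearly the same share of smokers, so smoking can’t explain a difference in outcomes. With self-selection the groups differ a lot before the treatment even starts.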

When confounding variables are inevitably present, there are statistical methods to “control” or “adjust for” the confounders. Two common ones are stratification and multivariate models.

Stratification fixes the level of the confounders and produces subgroups within which the confounder does not vary. This allows for evaluation of the exposure-outcome association within each stratum of the confounder. This works because, within a single stratum, the confounder is held constant, so it cannot explain any association that remains between the exposure and the outcome.
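Going back to the bicycling and ice cream example, here is a small Python sketch of stratification. The data is simulated with invented numbers, and weather is simplified to a yes/no "hot day" flag so that each stratum holds the confounder perfectly constant.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the run is reproducible

# Each simulated day: (hot?, number of cyclists, ice creams sold).
# Hot weather raises both counts; they do not influence each other.
days = []
for _ in range(4000):
    hot = rng.random() < 0.5
    cyclists = (300 if hot else 100) + rng.gauss(0, 30)
    ice_cream = (200 if hot else 50) + rng.gauss(0, 20)
    days.append((hot, cyclists, ice_cream))

# Crude analysis: ignore the weather entirely.
overall = pearson([d[1] for d in days], [d[2] for d in days])

# Stratified analysis: look at hot days and cool days separately.
within = {}
for level in (False, True):
    stratum = [d for d in days if d[0] == level]
    within[level] = pearson([d[1] for d in stratum], [d[2] for d in stratum])

print(f"Overall correlation:         {overall:.2f}")
print(f"Correlation on cool days:    {within[False]:.2f}")
print(f"Correlation on hot days:     {within[True]:.2f}")
```

The strong overall correlation essentially disappears inside each stratum, which tells you the weather was doing the work. Of course, this only helps for confounders you thought to measure.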

Multivariate models are better at controlling for a greater number of confounders. There are various types, one of the most common of which is linear regression. In its simplest terms, regression is fitting the best straight line to a dataset. Think back to algebra and y = mx + b. We’re trying to find the equation that best describes the linear relationship between the outcome, being y, and the explanatory variable, being x. Logistic regression is a related technique used when the outcome is binary, such as whether or not a patient developed a disease.
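For the simplest case of one x and one y, here is a minimal Python sketch of fitting that best straight line using ordinary least squares. The five data points are made up, and `fit_line` is a name I chose for illustration. Real studies use statistical software and many variables at once, so treat this as the one-variable idea only.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: forces the line through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = fit_line(xs, ys)
print(f"Best-fit line: y = {m:.2f}x + {b:.2f}")
```

The slope m is the number that gets reported as "for every one unit increase in x, y changes by m."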

The important thing to note is that confounding often still persists, even after adjustment. There are almost an infinite number of possibilities that can confound an observation, but researchers can only eliminate or control for the ones they are aware of.

Alex Reinhart, author of Statistics Done Wrong, points out that it’s common to interpret results by saying, “If weight increases by one pound, with all other variables held constant, then heart attack rates increase by X percent. You can quote the numbers from the regression equation, but in the real world, the process of gaining a pound of weight also involves other changes. Nobody ever gains a pound with all other variables held constant, so your regression equation doesn’t translate to reality.”

Because confounding is such a central limitation to observational research, we must be careful when drawing conclusions from these types of studies. With observational epidemiology, it’s incredibly difficult to prove an association right or wrong. While a small minority of these associations may be causal, the overwhelming majority are not. Therefore, we should err on the side of skepticism.

5 | Power & Significance

When you propose a hypothesis in a research study, there are two forms: the null hypothesis, meaning there is no relationship between the two phenomena, and the alternative hypothesis, meaning there is a relationship. The study seeks to provide data to suggest one over the other — note that science doesn’t prove things, as you could in math, but rather provides evidence for or against.

The p-value is the scoring metric that makes the final call. It’s the probability of obtaining results at least as extreme as the ones observed, assuming the null hypothesis is correct. In other words, it answers the question: if no relationship truly exists, how likely is it that chance alone would produce findings like these? Note that it is not the probability that the null hypothesis is true. A smaller p-value is stronger evidence against the null hypothesis. A larger p-value means the data are consistent with chance alone, so we fail to reject the null hypothesis, though that doesn’t prove the null hypothesis is true.

A p-value cutoff is assigned by the researchers to determine the cutoff at which statistical significance is achieved. We call this number α, and it is usually set to 0.05, meaning 5%, or sometimes lower. If the p-value is less than 0.05, we say the results are “statistically significant,” and the null hypothesis is rejected.
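One way to build intuition for p-values is a permutation test, sketched below in Python. This is just one of many ways to get a p-value, and I’m using it here because the logic is easy to follow, not because it’s what any particular study does. The measurements are invented. If the null hypothesis is true, the group labels are meaningless, so we shuffle the labels many times and count how often chance alone produces a difference at least as large as the one we observed.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    size_a = len(group_a)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel subjects at random, as the null implies
        diff = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical outcome measurements for two small groups.
drug = [5.1, 4.8, 5.6, 5.9, 5.3, 5.7, 5.0, 5.4]
placebo = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4, 4.6, 4.0]

alpha = 0.05
p = permutation_p_value(drug, placebo)
print(f"p-value: {p:.4f}")
print("Statistically significant" if p < alpha else "Not statistically significant")
```

Notice the p-value is computed entirely under the assumption that the null hypothesis is true. It tells you how surprising the data would be in that world, nothing more.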

There’s a chance we’re wrong, and we have terms for this, too. When there’s no true effect, but we think there is, we call this a false positive, or a Type I error. We rejected the null hypothesis even though it was true. The opposite, where there is an effect but we think there isn’t, is called a false negative, or a Type II error. We failed to reject the null hypothesis when we should have. The chance of committing a Type II error is called β.

Statistical power is the probability that a study will correctly find a real effect, meaning a true positive. This translates to Power = 1 – β. Power is influenced by four factors:

Probability of a false positive (α, or Type I error rate)
Sample size (N)
Effect size (the magnitude of difference between groups)
Probability of a false negative (β, or Type II error rate)

Keep this in mind, as we’ll be coming back to it.
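To see how these pieces move together, here is a rough Python sketch that approximates power for comparing the means of two equal-sized groups. It makes some simplifying assumptions that aren’t from the discussion above: a two-sided z-test with known variance, and an effect size expressed in standard deviation units (often called Cohen’s d). The function name `approx_power` is my own.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true difference in means divided by the standard
    deviation (an assumed, standardized measure).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                 # cutoff implied by alpha
    shift = effect_size * (n_per_group / 2) ** 0.5    # how far the true effect moves the test statistic
    # Probability the test statistic lands beyond either cutoff.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (20, 64, 200):
    power = approx_power(effect_size=0.5, n_per_group=n)
    print(f"n = {n:>3} per group: power = {power:.2f}, beta = {1 - power:.2f}")
```

Holding α and the effect size fixed, a bigger sample pushes power up and β down. Shrink the effect size or tighten α, and you need more subjects to keep the same power.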

Closely related to p-values are confidence intervals. To find the confidence level, you take 1 – α, so if α is set to 0.05, the confidence level would be 0.95, or 95%. When reading a study that reports a ratio, such as a relative risk or odds ratio, you can quickly determine if statistical significance was achieved by whether or not the confidence interval includes the number 1.00, which represents no difference between groups. If the whole interval is above 1, like 1.05 – 1.27, then a positive association is present with statistical significance, and if the whole interval is below 1, like 0.56 – 0.89, then a negative association is present with statistical significance.
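As an illustration, here is a Python sketch that computes a relative risk and its 95% confidence interval from raw counts, then checks whether the interval includes 1. It uses the common log-based approximation for the interval, which is one standard method among several. The counts are invented, and `relative_risk_ci` is a name I made up.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(events_exposed, n_exposed, events_unexposed, n_unexposed, level=0.95):
    """Relative risk with an approximate confidence interval (log method)."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    # Standard error of ln(RR) under the usual large-sample approximation.
    se_log_rr = sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed
    )
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    low = exp(log(rr) - z * se_log_rr)
    high = exp(log(rr) + z * se_log_rr)
    return rr, low, high


# Hypothetical study: 30 of 200 exposed people and 15 of 200 unexposed
# people developed the outcome.
rr, low, high = relative_risk_ci(30, 200, 15, 200)
print(f"Relative risk: {rr:.2f} (95% CI {low:.2f} to {high:.2f})")
print("Interval includes 1.00" if low <= 1.0 <= high else "Interval excludes 1.00")
```

If the interval had stretched across 1.00, the same relative risk would not have reached statistical significance at the 0.05 level.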

Confidence intervals are commonly misunderstood. A 95% confidence interval of 1.05 – 1.27 doesn’t mean there is a 95% probability that the true effect is between 1.05 and 1.27. Rather, if we were to take 100 different samples and compute a 95% confidence interval for each sample, then we would expect about 95 of the 100 confidence intervals to contain the true value. In other words, about 95% of intervals constructed in this exact manner will include the true value, but about 5% will not.
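You can check this interpretation with a quick simulation, sketched below in Python. It repeats a made-up experiment 1,000 times, builds a 95% confidence interval for the mean each time, and counts how many intervals capture the true value. To keep it simple, I’m assuming the population standard deviation is known, which is rarely the case in real research.

```python
import random
from statistics import NormalDist


def coverage(true_mean=100.0, sigma=15.0, n=30, experiments=1_000, level=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / n ** 0.5  # same width every time because sigma is assumed known

    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / experiments


share = coverage()
print(f"Share of intervals containing the true mean: {share:.1%}")
```

Any single interval either contains the true value or it doesn’t. The 95% describes how often the procedure works across many repeats.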

Lastly, let’s clarify statistical significance versus practical significance. A study can find statistical significance but have no practical significance. This is more common than you think. A common case where this happens is when the sample size is very large. The larger the sample size, the greater the probability that even a tiny true difference will reach statistical significance. At these extremes, even minute differences in outcomes can be statistically significant. If a study finds that a new intervention reduces weight by 0.5 pounds, who cares? It’s not clinically relevant.

The reverse is also true, where a study demonstrates practical significance, yet was unable to achieve statistical significance. If we revisit the four factors that influence power, we see that sample size is the most easily manipulated to over- or underpower a study. Oftentimes, observational studies are overpowered with thousands of subjects, such that any minute difference may yield a statistically significant result. Other studies experience the opposite, whereby they have a small number of subjects, and even if there is a real difference, statistical significance cannot be demonstrated.
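Here is a rough Python sketch of both situations. It computes a two-sided p-value for a difference in average weight between two equal-sized groups, using a z-test with an assumed known standard deviation. The standard deviations and sample sizes are invented, and the observed difference is simply taken as given.

```python
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided z-test p-value for a difference in means between two groups."""
    standard_error = sd * (2 / n_per_group) ** 0.5
    z = mean_diff / standard_error
    return 2 * NormalDist().cdf(-abs(z))


# A clinically trivial 0.5 lb difference (assumed SD of 10 lb).
p_modest_sample = two_sided_p(mean_diff=0.5, sd=10, n_per_group=100)
p_huge_sample = two_sided_p(mean_diff=0.5, sd=10, n_per_group=100_000)

# A clinically meaningful 10 lb difference (assumed SD of 20 lb), tiny study.
p_tiny_sample = two_sided_p(mean_diff=10, sd=20, n_per_group=5)

print(f"0.5 lb difference, 100 per group:     p = {p_modest_sample:.3g}")
print(f"0.5 lb difference, 100,000 per group: p = {p_huge_sample:.3g}")
print(f"10 lb difference, 5 per group:        p = {p_tiny_sample:.3g}")
```

The same half-pound difference flips from "not significant" to "highly significant" purely because the sample grew, while a ten-pound difference fails to reach significance in a study that’s too small. That’s why you should always look at the size of the effect, not just the p-value.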

Each of these components in isolation isn’t enough to make you an expert at deciphering research studies. However, when you put each piece in context and understand the why of how sound science is conducted, you’ll become far better equipped to think critically and make sense of the primary literature yourself, without having to rely on lazy thinking and black and white summaries from journalists.

If you made it to the end of this post, congratulations! This was an incredibly challenging post to make, as there’s so much to research, but I hope you learned something that will make reading research articles in the future easier and more productive.