Which Statistical Test Should I Use? Practical Decision Tree

Which statistical test should I use? Start with the type of outcome you measured, not the name of your project. Then ask whether you are comparing groups or testing a relationship, how many groups or time points you have, and whether the observations are independent or paired.

For a continuous outcome, two independent groups usually point to a Welch independent-samples t-test. Two measurements from the same people point to a paired t-test. Three or more independent groups point to ANOVA, and three or more repeated measurements point to repeated-measures ANOVA or a mixed-effects model.

For categorical counts, think chi-square, Fisher’s exact test, McNemar’s test, or logistic regression. For relationships between numeric variables, think Pearson or Spearman correlation. For prediction or adjustment, think regression.

The test name is only the beginning. A defensible analysis also checks independence, data quality, assumptions, effect size, confidence intervals, missing data, multiple comparisons, and whether the study design can support the conclusion.

Statistical test decision tree branching from outcome type and study design to t-tests, ANOVA, chi-square, correlation, regression, and nonparametric alternatives

Quick answer: choose the test from the outcome and design

Your research question Common test Main alternative Critical detail
Is one sample mean different from a known value? One-sample t-test One-sample Wilcoxon signed-rank or sign test The benchmark and direction of the hypothesis should be defined before examining the result.
Is one observed proportion different from a benchmark? Exact binomial test or one-sample proportion test Confidence-interval-based analysis Use the number of successes and total observations, not only the observed percentage.
Do two independent groups have different means? Welch independent-samples t-test Mann–Whitney U test The people or units in one group must not also appear in the other group.
Did the same participants change from before to after? Paired t-test Wilcoxon signed-rank test Analyze within-person differences rather than treating the measurements as independent groups.
Do three or more independent groups differ? One-way ANOVA or Welch ANOVA Kruskal–Wallis test A significant omnibus result does not identify which groups differ.
Did the same participants complete three or more conditions? Repeated-measures ANOVA Friedman test or mixed-effects model Repeated observations from the same participant are correlated.
Are two independent categorical variables associated? Chi-square test of independence Fisher’s exact test for a sparse 2×2 table Enter observed counts, not percentages alone.
Did a paired yes/no outcome change from before to after? McNemar’s test Exact McNemar test for small discordant counts Ordinary chi-square treats observations as independent and is not appropriate for paired binary data.
Are two continuous variables linearly related? Pearson correlation Spearman rank correlation Correlation does not establish causation.
Do predictors explain a continuous outcome? Linear regression Robust, transformed, nonlinear, or mixed models Inspect residuals and model form, not only coefficient p-values.
Do predictors explain a yes/no outcome? Binary logistic regression Other generalized linear or mixed models Odds ratios are not automatically equivalent to risk ratios.
Are you pooling estimates from several studies? Meta-analysis Structured narrative synthesis The unit of analysis is the study estimate, not an individual participant.

Fastest reliable rule: identify the outcome type first, then the study design. Do not choose a test because it produced the smallest p-value.

A practical statistical test decision tree

Answer these five questions in order. Most common analyses become clear by the fourth question.

Decision Question to ask If yes If no
1. Define the outcome Is the main outcome numeric and meaningfully measured on a scale? Continue toward t-tests, ANOVA, correlation, or linear regression. For categorical outcomes, continue toward binomial, chi-square, Fisher’s exact, McNemar, or logistic regression.
2. Define the purpose Are you comparing groups or conditions? Count the groups and decide whether observations are independent or paired. If testing a relationship or prediction, use correlation or regression.
3. Count the groups Are there exactly two groups or conditions? Think t-test, McNemar, or a two-group nonparametric alternative, depending on the outcome. For three or more groups, think ANOVA, chi-square, repeated-measures methods, or a multi-group alternative.
4. Check dependence Are the same participants measured more than once, or are observations explicitly matched? Use a paired, repeated-measures, clustered, or mixed-effects method. Use an independent-samples method.
5. Check model fit Are the assumptions reasonable for the chosen model? Use the model and report diagnostics. Consider Welch methods, transformation, robust methods, nonparametric tests, permutation tests, or a model designed for the outcome distribution.

The fifth question does not mean “run a normality test and obey it mechanically.” Statistical test selection should consider plots, outliers, sample size, residual behavior, variance differences, measurement scale, study design, and the parameter you actually want to estimate.

Master statistical test selection table

Outcome Predictor or design Common test Alternative or extension Report with it
Continuous One sample compared with a benchmark One-sample t-test One-sample Wilcoxon or sign test Mean difference, confidence interval, effect size
Binary One sample proportion compared with a benchmark Exact binomial test Large-sample proportion test Observed proportion, confidence interval, absolute difference
Continuous Two independent groups Welch t-test Mann–Whitney U, permutation test, robust model Group means, mean difference, confidence interval, standardized effect size
Continuous Two paired measurements Paired t-test Wilcoxon signed-rank or sign test Mean paired difference, confidence interval, paired effect size
Continuous Three or more independent groups One-way ANOVA or Welch ANOVA Kruskal–Wallis or robust ANOVA Group summaries, omnibus result, effect size, post-hoc comparisons
Continuous Three or more repeated conditions Repeated-measures ANOVA Friedman test or mixed-effects model Condition summaries, corrected result if needed, post-hoc comparisons
Categorical Two independent categorical variables Chi-square test of independence Fisher’s exact test for a sparse 2×2 table Counts, percentages, Cramér’s V, risk ratio, or odds ratio as appropriate
Binary Two paired measurements McNemar’s test Exact McNemar test Discordant-pair counts, paired proportions, confidence interval when available
Binary Three or more paired conditions Cochran’s Q test Generalized mixed model Condition proportions, omnibus result, adjusted follow-up tests
Continuous or ordinal Relationship between two variables Pearson correlation Spearman or Kendall correlation Coefficient, confidence interval, scatterplot, sample size
Continuous One or more predictors Linear regression Robust regression, nonlinear model, mixed model Coefficients, confidence intervals, residual diagnostics, model fit
Binary One or more predictors Logistic regression Mixed logistic or other generalized models Odds ratios, confidence intervals, predicted probabilities, calibration
Count One or more predictors Poisson regression Negative binomial or zero-inflated model Rate ratios, confidence intervals, overdispersion diagnostics
Time to event Groups or predictors with censoring Log-rank test or Cox regression Parametric survival or competing-risk models Survival curves, hazard ratios, confidence intervals

First identify your outcome and predictor variables

The outcome is what you are trying to explain, compare, or predict. The predictor identifies groups, exposures, treatments, time points, or characteristics that may relate to that outcome.

Variable type Examples Common analysis direction Common mistake
Continuous Height, blood pressure, reaction time, income, test score T-test, ANOVA, Pearson correlation, linear regression Treating a badly skewed or bounded measurement as automatically normal.
Binary Yes/no, disease/no disease, converted/did not convert Binomial test, chi-square, Fisher’s exact, McNemar, logistic regression Using an independent test for paired binary outcomes.
Nominal categorical Blood type, region, product category, treatment arm Chi-square, multinomial models, ANOVA when used as a group predictor Assigning arbitrary numbers to categories and treating them as continuous.
Ordinal Single Likert item, pain category, satisfaction rank Rank-based methods, ordinal regression, carefully justified scale analysis Assuming the distance between every category is exactly equal.
Count Number of visits, defects, clicks, infections Poisson or negative binomial regression Using ordinary linear regression when counts are highly skewed or overdispersed.
Time to event Time to relapse, churn, death, machine failure Survival analysis Ignoring participants who have not yet experienced the event.

A group variable such as treatment A versus treatment B can be a predictor. A numeric variable such as age can also be a predictor. The correct test depends on the combination of outcome, predictor, study design, and intended interpretation.

T-test vs. ANOVA: when should you use each?

A t-test is usually used to compare two means. ANOVA is usually used to compare three or more means or handle more complex factor structures.

You could run several t-tests across three groups, but that inflates the chance of at least one false-positive result. ANOVA begins with one overall test of whether all group means can reasonably be treated as equal. If the omnibus result is significant, planned contrasts or adjusted post-hoc comparisons identify where differences occur.

Design Use Do not use
Two independent groups Welch independent-samples t-test Paired t-test when the participants are unrelated.
Two matched or repeated measurements Paired t-test Independent t-test, because it discards the pairing.
Three or more independent groups One-way ANOVA or Welch ANOVA A collection of unadjusted pairwise t-tests.
Three or more repeated measurements Repeated-measures ANOVA or mixed-effects model Ordinary one-way ANOVA that assumes all observations are independent.
Two or more factors Factorial ANOVA or regression model Separate one-way analyses that ignore interactions.

With exactly two groups, a two-sided t-test and a corresponding one-factor ANOVA often test the same mean-difference hypothesis. Use the t-test because it communicates the two-group design more directly.

Two independent groups: Welch t-test or Mann–Whitney?

Use an independent-samples test when the observations in group A are unrelated to those in group B. Examples include treatment versus control, customers from two regions, or two different classrooms.

Welch independent-samples t-test

Use Welch’s t-test when the outcome is continuous and the scientific question concerns a difference in group means. It does not require the two groups to have equal variances and is often a more defensible default than the pooled equal-variance version.

Check:

  • observations are independent
  • the outcome is meaningfully numeric
  • extreme outliers are not dominating the result
  • the sampling distribution and residual behavior are reasonable
  • the mean is a meaningful summary for the research question

Mann–Whitney U test

Use Mann–Whitney when the data are ordinal or when a rank-based comparison fits the research question better than comparing means. It evaluates whether observations from one group tend to rank above observations from the other group.

Common interpretation mistake: Mann–Whitney is not automatically a test of medians. A median-shift interpretation requires additional assumptions about the shapes and spreads of the group distributions.

Do not choose Mann–Whitney solely because a normality test returned p < 0.05. Inspect the data, define the estimand, and consider whether Welch’s test, transformation, a permutation test, a robust method, or a generalized model better answers the question.

Two paired measurements: paired t-test or Wilcoxon signed-rank?

Paired data arise when the same participant is measured twice or when observations are deliberately matched. Examples include pre-treatment versus post-treatment blood pressure, left versus right eye, or matched twins.

Paired t-test

The paired t-test analyzes the differences within each pair. The relevant normality assumption concerns the distribution of those paired differences, not the separate before and after distributions.

Report:

  • number of complete pairs
  • mean paired difference
  • confidence interval for the difference
  • t statistic, degrees of freedom, and p-value
  • a paired effect-size measure when useful

Wilcoxon signed-rank test

Use the Wilcoxon signed-rank test when a rank-based paired comparison is appropriate. It uses the directions and ranks of within-pair differences and is not appropriate when the observations are independent.

If the paired differences are highly asymmetric, contain many zeros, or do not support the signed-rank interpretation, a sign test, permutation approach, robust model, or domain-specific method may be more appropriate.

Three or more independent groups: ANOVA or Kruskal–Wallis?

One-way ANOVA

Use one-way ANOVA when you are comparing the means of three or more independent groups and the model assumptions are reasonable.

A significant ANOVA result means the data are inconsistent with all population means being equal under the model. It does not tell you which groups differ or whether the differences are practically important.

Welch ANOVA

Use Welch ANOVA when group variances differ or group sizes are uneven and an unequal-variance mean comparison is appropriate. If the omnibus result is significant, Games–Howell comparisons are a common follow-up because they do not assume equal variances.

Kruskal–Wallis test

Use Kruskal–Wallis for three or more independent groups when a rank-based comparison fits the data and question. A significant result means at least one group distribution tends to differ, but it does not identify which pairs differ.

Omnibus method Common follow-up Correction issue
One-way ANOVA Tukey HSD or prespecified contrasts Do not run every pairwise t-test without adjustment.
Welch ANOVA Games–Howell comparisons Use a follow-up that respects unequal variances.
Kruskal–Wallis Dunn test or adjusted pairwise rank tests Control family-wise error or false discovery rate.

Three or more repeated measurements

Repeated-measures data occur when the same participant, machine, school, location, or other unit contributes several observations. Those observations are correlated and cannot be treated as independent.

Repeated-measures ANOVA

Use repeated-measures ANOVA for a continuous outcome measured under three or more conditions or time points when its assumptions are reasonable. The sphericity assumption concerns the variances of differences among conditions. If sphericity is violated, software may provide Greenhouse–Geisser or Huynh–Feldt corrections.

Friedman test

Use the Friedman test as a rank-based alternative for three or more related conditions. A significant result is an omnibus finding; follow it with adjusted pairwise comparisons if you need to identify the differing conditions.

Mixed-effects model

A mixed-effects model is often better when participants have missing visits, unequal numbers of observations, irregular timing, nested data, multiple grouping levels, or person-specific trajectories. It models within-subject correlation directly instead of requiring a perfectly complete repeated-measures table.

Categorical data: chi-square, Fisher’s exact, McNemar, or logistic regression?

Chi-square test of independence

Use a chi-square test when two categorical variables are measured on independent observations. Examples include treatment group by recovery status or device type by conversion outcome.

Enter actual observed counts. A table containing percentages without the underlying sample sizes is not enough.

Fisher’s exact test

Use Fisher’s exact test for a 2×2 table when sample size is small or expected cell counts are too sparse for the usual chi-square approximation. It calculates an exact conditional probability under its assumptions.

McNemar’s test

Use McNemar’s test when the outcome is binary and measured twice on the same participants, or when binary observations are explicitly matched. It focuses on pairs that changed from one category to the other.

For example, if the same 100 people answer yes/no before and after a campaign, ordinary chi-square is inappropriate because the two sets of responses are not independent.

Exact binomial test

Use an exact binomial test when one observed binary proportion is being compared with a predefined probability. Examples include whether a defect rate differs from 5% or whether a coin-like process differs from 50%.

Logistic regression

Use logistic regression when the outcome is binary and you want to include one or more predictors, adjust for potential confounders, test interactions, or estimate how the odds change with a predictor.

Question Method Useful effect measure
Does one observed proportion differ from a benchmark? Exact binomial test Observed proportion, absolute difference, confidence interval
Are two independent categorical variables associated? Chi-square test Cramér’s V, risk difference, risk ratio, or odds ratio
Is a small independent 2×2 table associated? Fisher’s exact test Odds ratio and confidence interval
Did a paired binary outcome change? McNemar’s test Discordant-pair counts and paired proportion difference
Does one predictor relate to a binary outcome? Simple logistic regression Odds ratio and predicted probabilities
Does the relationship remain after adjustment? Multiple logistic regression Adjusted odds ratios and predicted probabilities

Correlation and regression

Pearson correlation

Use Pearson correlation when the question concerns the strength and direction of a linear relationship between two continuous variables. Inspect a scatterplot first. A strong curved relationship can have a weak Pearson correlation, and one extreme outlier can dominate the result.

Spearman correlation

Use Spearman correlation when the variables are ordinal, the relationship is monotonic rather than strictly linear, or ranks better represent the question. Spearman correlation still requires properly paired observations and does not remove confounding.

Linear regression

Use linear regression when you want to estimate a continuous outcome from one or more predictors. Regression can handle continuous predictors, coded categorical predictors, interactions, covariate adjustment, and several explanatory variables.

Check whether:

  • the functional form is appropriate
  • residual variance is reasonably modeled
  • residuals are independent or their dependence is modeled
  • influential observations are not driving the coefficients
  • predictors are not severely redundant
  • the sample size supports the model complexity

Correlation and regression do not prove causation. Causal conclusions depend on design, timing, randomization, confounding control, measurement quality, and subject-matter assumptions—not merely a significant coefficient.

Parametric vs. nonparametric tests

“Parametric” and “nonparametric” do not mean “good” and “bad,” or simply “normal” and “not normal.” They often target different quantities and rely on different assumptions.

Feature Parametric method Nonparametric or rank-based method
Common target Means, mean differences, model coefficients Ranks, distributional ordering, or other non-mean contrasts
Examples T-tests, ANOVA, Pearson correlation, linear regression Mann–Whitney, Wilcoxon, Kruskal–Wallis, Friedman, Spearman
Strength Direct interpretation of means and model parameters when appropriate Useful for ordinal data and some non-normal or outlier-sensitive situations
Limitation Can mislead when the model form or variance assumptions are badly wrong May answer a different question and can be less efficient in some settings
Still requires Independent or correctly modeled observations, good design, valid measurement Independent or correctly paired observations, valid ranking, good design

Switching to a nonparametric test does not repair pseudoreplication, biased sampling, confounding, missing-not-at-random data, bad measurement, or an outcome that needs a count, binary, ordinal, clustered, or survival model.

Statistical assumption checklist

Assumption or check What it means What to inspect Possible response
Independence One observation does not improperly duplicate or depend on another. Study design, repeated participants, clusters, households, schools, sites. Use paired, repeated-measures, clustered, or mixed-effects methods.
Outcome scale The measurement supports the proposed analysis. Continuous, ordinal, binary, count, proportion, or time-to-event structure. Choose a model designed for that outcome.
Distribution shape The model’s error structure is plausible. Histograms, Q–Q plots, residual plots, skewness, ceiling and floor effects. Transform, use robust methods, ranks, or a generalized model.
Outliers A few observations are not dominating the estimate. Raw plots, boxplots, residuals, leverage, influence statistics. Verify data, report sensitivity analyses, use robust methods when justified.
Equal variance Some traditional methods assume similar variability across groups. Group spread, residual-versus-fitted plots, variance estimates. Use Welch t-test, Welch ANOVA, robust standard errors, or another model.
Linearity Pearson correlation and linear regression model a linear form. Scatterplot and residual patterns. Add nonlinear terms, transform variables, or use another model.
Sphericity Repeated-measures ANOVA assumes a particular covariance structure. Software diagnostics and study structure. Apply a correction or use a mixed-effects model.
Expected cell information Chi-square approximation needs enough expected information in the table. Expected-count table, sparse categories, zero cells. Use Fisher’s exact test, an exact/Monte Carlo method, or a better model.
Sample size and power The design must estimate a meaningful effect with useful precision. Expected effect, variability, alpha, power, attrition, model complexity. Plan sample size before collecting data and report uncertainty afterward.

Independence is usually a design issue, not a box to tick in software. No transformation or p-value can fix data analyzed as independent when the same person, classroom, household, clinic, or laboratory batch contributed repeated observations.

High-stakes research: for clinical, regulated, grant-funded, or publication-critical work, prespecify the primary analysis and involve a statistician before data collection whenever possible.

Worked statistical test examples

Research scenario Likely method Why What could change the choice
Compare mean exam scores in two unrelated teaching groups. Welch independent-samples t-test Continuous outcome and two independent groups. Ordinal scoring, severe outliers, clustering by classroom, or repeated students.
Compare blood pressure before and after treatment in the same patients. Paired t-test Continuous outcome with paired observations. Highly irregular paired differences, missing follow-up, or several time points.
Compare mean recovery time across four independent treatments. One-way ANOVA or Welch ANOVA Continuous outcome and four independent groups. Severe skew, censoring, unequal variance, or site clustering.
Compare satisfaction ratings across three unrelated service plans. Kruskal–Wallis or carefully justified ANOVA A single ordinal rating may favor a rank-based analysis. A validated multi-item scale may support a continuous-score model.
Test whether smoking status is associated with disease status. Chi-square test Both variables are categorical and observations are independent. Use Fisher’s exact test if the 2×2 table is sparse.
Test whether the same people changed from no to yes after training. McNemar’s test The outcome is paired and binary. Use exact McNemar when the number of discordant pairs is small.
Test whether a defect rate differs from a 5% target. Exact binomial test One observed binary proportion is compared with a benchmark. Use a model if defects are clustered by machine, batch, or site.
Test whether age predicts disease after adjusting for sex and treatment. Multiple logistic regression The outcome is binary and several predictors are included. Repeated observations, separation, nonlinear age effect, or rare outcomes.
Test whether hours studied are linearly related to exam score. Pearson correlation or linear regression Both variables are continuous and the question is linear association. Curvature, outliers, ordinal data, or adjustment for confounders.
Compare the same participants under four interface designs. Repeated-measures ANOVA or Friedman test Each participant contributes four related measurements. Missing conditions, order effects, irregular timing, or random slopes.
Pool odds ratios from eight independent studies. Meta-analysis The observations are study-level estimates with uncertainty. Study quality, heterogeneity, publication bias, incompatible outcomes.

Use Jivaro Pvalyzer to choose and check common tests

Pvalyzer is Jivaro’s statistical test selector and summary-statistics calculator. Use it to narrow the choice among common tests and to check calculations when you already have the required summary values.

The relevant calculator modules include common analyses such as independent and paired t-tests, 2×2 categorical tests, and Pearson correlation significance. Check the labels in the live app before entering data because the interface may be expanded or updated over time.

Pvalyzer module What to enter What to read Most common mistake
Independent two-sample t-test For each independent group, enter the mean, standard deviation, and sample size. Read the estimated mean difference, t statistic, degrees of freedom, p-value, and any reported interval or summary sentence. Using this module when the same people were measured twice.
Paired t-test Enter the mean of the within-pair differences, the standard deviation of those differences, and the number of complete pairs. Read the estimated paired change, t statistic, degrees of freedom, and p-value. Entering separate pre- and post-test means and standard deviations without the standard deviation of the paired differences. Those separate summaries are not enough by themselves.
Chi-square 2×2 Enter the four observed cell counts from an independent 2×2 contingency table. Read the chi-square statistic, degrees of freedom, and p-value. Entering percentages instead of counts or using chi-square for paired before-and-after binary data.
Fisher-style exact 2×2 Enter the four observed cell counts from a small or sparse independent 2×2 table. Read the exact-style probability result and any reported association measure. Choosing Fisher’s exact test only after seeing that it gives a more favorable p-value.
Pearson correlation significance Enter the Pearson correlation coefficient and the number of paired observations. Read the correlation test result and p-value. Interpreting significance without inspecting the scatterplot, linearity, outliers, or data quality.

Pvalyzer limitation: a summary-statistics calculator cannot reveal outliers, nonlinear relationships, duplicated observations, miscoding, unequal distribution shapes, influential cases, clustering, or missing-data problems. Use it as a selection and checking tool, not as a replacement for raw-data analysis or methodological review.

For tests not available as a matching Jivaro calculator—such as repeated-measures ANOVA, Mann–Whitney, Wilcoxon, Kruskal–Wallis, Friedman, McNemar, logistic regression, mixed models, and survival analysis—use full statistical software and verify the model assumptions directly.

When your rows are studies: use ForestIQ for meta-analysis

If each row represents a study estimate rather than an individual participant, a t-test or ANOVA is usually the wrong framework. You may need a meta-analysis.

ForestIQ is Jivaro’s mini meta-analysis calculator for combining study-level estimates and generating a forest-plot-style summary.

ForestIQ step What to enter or inspect Common mistake
Choose the effect type Select the effect measure reported consistently across the included studies, such as a ratio or mean-difference measure supported by the live app. Pooling odds ratios, risk ratios, hazard ratios, and mean differences as though they were interchangeable.
Enter each study Add a study label, its effect estimate, and the uncertainty input requested by the app, such as a confidence interval or standard error. Entering point estimates without any measure of study precision.
Read the pooled result Inspect the pooled estimate and confidence interval, along with the individual study estimates. Assuming the pooled estimate is meaningful when the studies measure materially different populations, interventions, or outcomes.
Inspect heterogeneity Review the heterogeneity statistics and the visual spread of study estimates. Using one I² cutoff as an automatic accept/reject rule.
Export the result Use the forest plot and generated summary as a starting point for reporting. Skipping risk-of-bias assessment, sensitivity analysis, study-quality review, or publication-bias checks.

ForestIQ is useful for teaching and quick evidence summaries. It is not a substitute for a systematic review protocol, duplicate screening, risk-of-bias assessment, sensitivity analysis, publication-bias assessment, or dedicated meta-analysis software for publication-critical work.

How to interpret a statistical test result

A complete result is not “p < 0.05.” It should explain the estimated effect, uncertainty, test, sample size, direction, and practical meaning.

What a p-value means

A p-value describes how unusual the observed result, or a more extreme result, would be under the specified null model and assumptions. It is not the probability that the null hypothesis is true. It is not the probability that the result happened “by chance.”

What “not significant” means

A non-significant result does not prove that there is no difference or relationship. It may reflect a small effect, imprecise measurement, limited sample size, high variability, model mismatch, or data that remain compatible with several possible effects.

What statistical significance does not mean

Statistical significance does not automatically mean the effect is large, clinically important, commercially useful, reproducible, or causal. Large samples can make tiny differences statistically significant, while small studies can leave meaningful effects uncertain.

Weak interpretation:
Treatment A was statistically significant, p = 0.03.

Better interpretation:
Treatment A reduced the mean score by 4.2 points compared with treatment B
(95% CI: 0.4 to 8.0 points; Welch t-test, p = 0.03).

The interval includes effects ranging from small to potentially meaningful,
so the result should be interpreted using the scale's practical threshold.

Report effect sizes and confidence intervals

Analysis Useful effect size What it communicates
Two-group continuous comparison Raw mean difference, Cohen’s d, or Hedges’ g How far apart the groups are in original or standardized units.
Paired comparison Mean paired difference and paired standardized effect How much the same units changed.
ANOVA Eta squared, partial eta squared, or omega squared How much variation is associated with the factor under the model.
Chi-square Cramér’s V, risk difference, risk ratio, or odds ratio The strength and practical direction of categorical association.
Correlation Pearson r or Spearman rho The direction and strength of association.
Linear regression Regression coefficient, standardized coefficient, R² Expected outcome change and model-explained variation.
Logistic regression Odds ratio and predicted probability difference How predictors relate to the odds or predicted probability of an outcome.

A confidence interval communicates estimate precision and the range of parameter values reasonably compatible with the data and model. Do not describe a frequentist 95% confidence interval as a 95% probability that the fixed true value lies inside the observed interval.

Post-hoc tests and multiple comparisons

Testing many hypotheses increases the chance of false-positive findings. Decide which comparisons matter before seeing the results whenever possible.

Situation Possible approach What to avoid
All pairwise comparisons after standard ANOVA Tukey HSD Unadjusted pairwise t-tests.
All pairwise comparisons after Welch ANOVA Games–Howell A follow-up method that assumes equal variances.
Selected planned comparisons Prespecified contrasts with an appropriate correction Inventing contrasts after inspecting every result and presenting them as planned.
Many exploratory hypotheses Holm, Bonferroni, or false-discovery-rate control Reporting only the smallest p-values.
Many outcomes and subgroups Declare primary outcomes and distinguish confirmatory from exploratory analyses Treating every analysis as an independent primary test.

When the basic decision tree is not enough

Data complication Method family to consider
Participants nested within schools, clinics, companies, or countries Multilevel or mixed-effects models
Repeated observations with missing visits or unequal timing Mixed-effects or generalized estimating equation models
Count outcomes with overdispersion Negative binomial regression
Many zero counts Zero-inflated or hurdle models
Time-to-event outcome with censoring Kaplan–Meier, log-rank, Cox, or parametric survival models
More than two unordered outcome categories Multinomial logistic regression
Ordered categorical outcome Ordinal regression
Strong nonlinear relationship Splines, generalized additive models, or nonlinear regression
Complex survey weights or clustered sampling Survey-weighted analysis
Very high-dimensional predictors Penalized models, dimension reduction, or specialist methods

Advanced methods should be chosen from the data-generating structure and research objective. They should not be added merely because they sound more sophisticated.

Common statistical test selection mistakes

Mistake Why it matters Better approach
Choosing the test after seeing which gives p < 0.05 Inflates false-positive risk and makes the analysis secretly data-driven. Prespecify the primary method and document justified sensitivity analyses.
Treating repeated data as independent Standard errors and p-values can be wrong. Use paired, repeated-measures, clustered, or mixed-effects methods.
Using chi-square for paired binary data The independence assumption is violated. Use McNemar’s test for paired yes/no outcomes.
Running a normality test and obeying it mechanically Normality tests can be insensitive in small samples and overly sensitive in large samples. Use plots, residuals, design, sample size, outliers, and the target estimand together.
Assuming nonparametric means assumption-free Rank-based tests still require valid independence or pairing and have their own interpretations. State what the selected test actually compares.
Using chi-square on percentages The test requires counts and expected frequencies. Build the contingency table from observed counts.
Reporting only a p-value The reader cannot judge direction, magnitude, or precision. Report the effect estimate, confidence interval, sample size, and descriptive statistics.
Calling correlation causal Confounding, reverse causation, and selection can explain the relationship. Match the conclusion to the study design.
Ignoring missing data Complete-case analysis can reduce power or introduce bias. Describe missingness and consider appropriate imputation or modeling.
Using a complex model with too little data Estimates become unstable and overfit. Reduce model complexity, collect more data, or use carefully validated regularization.

FAQ

Which statistical test should I use for two groups?

Use a Welch independent-samples t-test for two unrelated groups with a continuous outcome. Use a paired t-test when the same participants are measured twice. Mann–Whitney and Wilcoxon are common rank-based alternatives.

Should I use a t-test or ANOVA?

Use a t-test for two groups or conditions. Use ANOVA for three or more groups, repeated conditions, multiple factors, or designs requiring an omnibus comparison.

When should I use a chi-square test?

Use chi-square when two categorical variables are measured on independent observations and you have observed counts in a contingency table. For paired binary data, use McNemar’s test instead.

What is the difference between independent and paired data?

Independent observations come from unrelated participants or units. Paired data come from the same participant measured more than once or from deliberately matched observations.

When should I use a nonparametric test?

Consider a rank-based method for ordinal data or when the rank-based question fits better than a mean comparison. Do not switch automatically because one normality test is significant.

Should I use Pearson or Spearman correlation?

Use Pearson for a linear relationship between continuous variables. Use Spearman for ordinal data or monotonic relationships where ranks are more appropriate. Inspect a scatterplot either way.

Does p < 0.05 prove my hypothesis?

No. A small p-value indicates that the observed result is relatively incompatible with the specified null model under its assumptions. It does not prove the hypothesis, practical importance, reproducibility, or causation.

What if my data are not normally distributed?

Inspect the outcome and model residuals, sample size, outliers, skew, measurement scale, and study design. Options include Welch methods, transformation, robust models, rank-based tests, permutation tests, or a model designed for the outcome distribution.

Can Pvalyzer choose the test for me?

Pvalyzer can help narrow common test choices and check selected calculations from summary data. It cannot inspect raw-data quality, research design, confounding, outliers, missingness, clustering, or every model assumption.

What should I report besides the p-value?

Report sample size, descriptive statistics, effect estimate, confidence interval, exact test, assumptions or diagnostics, and any post-hoc or multiple-testing correction.

Sources and useful links

The right statistical test is the one that matches the outcome, research question, study design, dependence structure, and quantity you want to estimate. Choose that structure before calculating the p-value, then report the effect and uncertainty—not merely whether the result crossed a threshold.

Harry Negron

Harry Negron is the CEO of Jivaro, a writer, and an entrepreneur with a background in science, technology, and digital publishing. He holds a B.S. in Microbiology and Mathematics and a Ph.D. in Genetics, with a specialization in biomedical sciences. His work spans finance, science, health, gaming, and technology, and his projects include free apps, automation tools, and large-scale search utilities. Originally from Puerto Rico and based in Japan since 2018, he brings an international perspective to Jivaro’s content, research, and tools.

Previous
Previous

Supplements Studied for Cholesterol & Triglycerides: Evidence, Safety, and What the Research Says

Next
Next

What Is BMI? How to Calculate Body Mass Index and Understand Your Number