Which statistical test should I use? Start with the type of outcome you measured, not the name of your project. Then ask whether you are comparing groups or testing a relationship, how many groups or time points you have, and whether the observations are independent or paired.
For a continuous outcome, two independent groups usually point to a Welch independent-samples t-test. Two measurements from the same people point to a paired t-test. Three or more independent groups point to ANOVA, and three or more repeated measurements point to repeated-measures ANOVA or a mixed-effects model.
For categorical counts, think chi-square, Fisher’s exact test, McNemar’s test, or logistic regression. For relationships between numeric variables, think Pearson or Spearman correlation. For prediction or adjustment, think regression.
The test name is only the beginning. A defensible analysis also checks independence, data quality, assumptions, effect size, confidence intervals, missing data, multiple comparisons, and whether the study design can support the conclusion.
Quick answer: choose the test from the outcome and design
| Your research question | Common test | Main alternative | Critical detail |
|---|---|---|---|
| Is one sample mean different from a known value? | One-sample t-test | One-sample Wilcoxon signed-rank or sign test | The benchmark and direction of the hypothesis should be defined before examining the result. |
| Is one observed proportion different from a benchmark? | Exact binomial test or one-sample proportion test | Confidence-interval-based analysis | Use the number of successes and total observations, not only the observed percentage. |
| Do two independent groups have different means? | Welch independent-samples t-test | Mann–Whitney U test | The people or units in one group must not also appear in the other group. |
| Did the same participants change from before to after? | Paired t-test | Wilcoxon signed-rank test | Analyze within-person differences rather than treating the measurements as independent groups. |
| Do three or more independent groups differ? | One-way ANOVA or Welch ANOVA | Kruskal–Wallis test | A significant omnibus result does not identify which groups differ. |
| Did the same participants complete three or more conditions? | Repeated-measures ANOVA | Friedman test or mixed-effects model | Repeated observations from the same participant are correlated. |
| Are two independent categorical variables associated? | Chi-square test of independence | Fisher’s exact test for a sparse 2×2 table | Enter observed counts, not percentages alone. |
| Did a paired yes/no outcome change from before to after? | McNemar’s test | Exact McNemar test for small discordant counts | Ordinary chi-square treats observations as independent and is not appropriate for paired binary data. |
| Are two continuous variables linearly related? | Pearson correlation | Spearman rank correlation | Correlation does not establish causation. |
| Do predictors explain a continuous outcome? | Linear regression | Robust, transformed, nonlinear, or mixed models | Inspect residuals and model form, not only coefficient p-values. |
| Do predictors explain a yes/no outcome? | Binary logistic regression | Other generalized linear or mixed models | Odds ratios are not automatically equivalent to risk ratios. |
| Are you pooling estimates from several studies? | Meta-analysis | Structured narrative synthesis | The unit of analysis is the study estimate, not an individual participant. |
Fastest reliable rule: identify the outcome type first, then the study design. Do not choose a test because it produced the smallest p-value.
A practical statistical test decision tree
Answer these five questions in order. Most common analyses become clear by the fourth question.
| Decision | Question to ask | If yes | If no |
|---|---|---|---|
| 1. Define the outcome | Is the main outcome numeric and meaningfully measured on a scale? | Continue toward t-tests, ANOVA, correlation, or linear regression. | For categorical outcomes, continue toward binomial, chi-square, Fisher’s exact, McNemar, or logistic regression. |
| 2. Define the purpose | Are you comparing groups or conditions? | Count the groups and decide whether observations are independent or paired. | If testing a relationship or prediction, use correlation or regression. |
| 3. Count the groups | Are there exactly two groups or conditions? | Think t-test, McNemar, or a two-group nonparametric alternative, depending on the outcome. | For three or more groups, think ANOVA, chi-square, repeated-measures methods, or a multi-group alternative. |
| 4. Check dependence | Are the same participants measured more than once, or are observations explicitly matched? | Use a paired, repeated-measures, clustered, or mixed-effects method. | Use an independent-samples method. |
| 5. Check model fit | Are the assumptions reasonable for the chosen model? | Use the model and report diagnostics. | Consider Welch methods, transformation, robust methods, nonparametric tests, permutation tests, or a model designed for the outcome distribution. |
The fifth question does not mean “run a normality test and obey it mechanically.” Statistical test selection should consider plots, outliers, sample size, residual behavior, variance differences, measurement scale, study design, and the parameter you actually want to estimate.
Master statistical test selection table
| Outcome | Predictor or design | Common test | Alternative or extension | Report with it |
|---|---|---|---|---|
| Continuous | One sample compared with a benchmark | One-sample t-test | One-sample Wilcoxon or sign test | Mean difference, confidence interval, effect size |
| Binary | One sample proportion compared with a benchmark | Exact binomial test | Large-sample proportion test | Observed proportion, confidence interval, absolute difference |
| Continuous | Two independent groups | Welch t-test | Mann–Whitney U, permutation test, robust model | Group means, mean difference, confidence interval, standardized effect size |
| Continuous | Two paired measurements | Paired t-test | Wilcoxon signed-rank or sign test | Mean paired difference, confidence interval, paired effect size |
| Continuous | Three or more independent groups | One-way ANOVA or Welch ANOVA | Kruskal–Wallis or robust ANOVA | Group summaries, omnibus result, effect size, post-hoc comparisons |
| Continuous | Three or more repeated conditions | Repeated-measures ANOVA | Friedman test or mixed-effects model | Condition summaries, corrected result if needed, post-hoc comparisons |
| Categorical | Two independent categorical variables | Chi-square test of independence | Fisher’s exact test for a sparse 2×2 table | Counts, percentages, Cramér’s V, risk ratio, or odds ratio as appropriate |
| Binary | Two paired measurements | McNemar’s test | Exact McNemar test | Discordant-pair counts, paired proportions, confidence interval when available |
| Binary | Three or more paired conditions | Cochran’s Q test | Generalized mixed model | Condition proportions, omnibus result, adjusted follow-up tests |
| Continuous or ordinal | Relationship between two variables | Pearson correlation | Spearman or Kendall correlation | Coefficient, confidence interval, scatterplot, sample size |
| Continuous | One or more predictors | Linear regression | Robust regression, nonlinear model, mixed model | Coefficients, confidence intervals, residual diagnostics, model fit |
| Binary | One or more predictors | Logistic regression | Mixed logistic or other generalized models | Odds ratios, confidence intervals, predicted probabilities, calibration |
| Count | One or more predictors | Poisson regression | Negative binomial or zero-inflated model | Rate ratios, confidence intervals, overdispersion diagnostics |
| Time to event | Groups or predictors with censoring | Log-rank test or Cox regression | Parametric survival or competing-risk models | Survival curves, hazard ratios, confidence intervals |
First identify your outcome and predictor variables
The outcome is what you are trying to explain, compare, or predict. The predictor identifies groups, exposures, treatments, time points, or characteristics that may relate to that outcome.
| Variable type | Examples | Common analysis direction | Common mistake |
|---|---|---|---|
| Continuous | Height, blood pressure, reaction time, income, test score | T-test, ANOVA, Pearson correlation, linear regression | Treating a badly skewed or bounded measurement as automatically normal. |
| Binary | Yes/no, disease/no disease, converted/did not convert | Binomial test, chi-square, Fisher’s exact, McNemar, logistic regression | Using an independent test for paired binary outcomes. |
| Nominal categorical | Blood type, region, product category, treatment arm | Chi-square, multinomial models, ANOVA when used as a group predictor | Assigning arbitrary numbers to categories and treating them as continuous. |
| Ordinal | Single Likert item, pain category, satisfaction rank | Rank-based methods, ordinal regression, carefully justified scale analysis | Assuming the distance between every category is exactly equal. |
| Count | Number of visits, defects, clicks, infections | Poisson or negative binomial regression | Using ordinary linear regression when counts are highly skewed or overdispersed. |
| Time to event | Time to relapse, churn, death, machine failure | Survival analysis | Ignoring participants who have not yet experienced the event. |
A group variable such as treatment A versus treatment B can be a predictor. A numeric variable such as age can also be a predictor. The correct test depends on the combination of outcome, predictor, study design, and intended interpretation.
T-test vs. ANOVA: when should you use each?
A t-test is usually used to compare two means. ANOVA is usually used to compare three or more means or handle more complex factor structures.
You could run several t-tests across three groups, but that inflates the chance of at least one false-positive result. ANOVA begins with one overall test of whether all group means can reasonably be treated as equal. If the omnibus result is significant, planned contrasts or adjusted post-hoc comparisons identify where differences occur.
| Design | Use | Do not use |
|---|---|---|
| Two independent groups | Welch independent-samples t-test | Paired t-test when the participants are unrelated. |
| Two matched or repeated measurements | Paired t-test | Independent t-test, because it discards the pairing. |
| Three or more independent groups | One-way ANOVA or Welch ANOVA | A collection of unadjusted pairwise t-tests. |
| Three or more repeated measurements | Repeated-measures ANOVA or mixed-effects model | Ordinary one-way ANOVA that assumes all observations are independent. |
| Two or more factors | Factorial ANOVA or regression model | Separate one-way analyses that ignore interactions. |
With exactly two groups, a two-sided t-test and a corresponding one-factor ANOVA often test the same mean-difference hypothesis. Use the t-test because it communicates the two-group design more directly.
Two independent groups: Welch t-test or Mann–Whitney?
Use an independent-samples test when the observations in group A are unrelated to those in group B. Examples include treatment versus control, customers from two regions, or two different classrooms.
Welch independent-samples t-test
Use Welch’s t-test when the outcome is continuous and the scientific question concerns a difference in group means. It does not require the two groups to have equal variances and is often a more defensible default than the pooled equal-variance version.
Check:
- observations are independent
- the outcome is meaningfully numeric
- extreme outliers are not dominating the result
- the sampling distribution and residual behavior are reasonable
- the mean is a meaningful summary for the research question
Mann–Whitney U test
Use Mann–Whitney when the data are ordinal or when a rank-based comparison fits the research question better than comparing means. It evaluates whether observations from one group tend to rank above observations from the other group.
Common interpretation mistake: Mann–Whitney is not automatically a test of medians. A median-shift interpretation requires additional assumptions about the shapes and spreads of the group distributions.
Do not choose Mann–Whitney solely because a normality test returned p < 0.05. Inspect the data, define the estimand, and consider whether Welch’s test, transformation, a permutation test, a robust method, or a generalized model better answers the question.
Two paired measurements: paired t-test or Wilcoxon signed-rank?
Paired data arise when the same participant is measured twice or when observations are deliberately matched. Examples include pre-treatment versus post-treatment blood pressure, left versus right eye, or matched twins.
Paired t-test
The paired t-test analyzes the differences within each pair. The relevant normality assumption concerns the distribution of those paired differences, not the separate before and after distributions.
Report:
- number of complete pairs
- mean paired difference
- confidence interval for the difference
- t statistic, degrees of freedom, and p-value
- a paired effect-size measure when useful
Wilcoxon signed-rank test
Use the Wilcoxon signed-rank test when a rank-based paired comparison is appropriate. It uses the directions and ranks of within-pair differences and is not appropriate when the observations are independent.
If the paired differences are highly asymmetric, contain many zeros, or do not support the signed-rank interpretation, a sign test, permutation approach, robust model, or domain-specific method may be more appropriate.
Three or more independent groups: ANOVA or Kruskal–Wallis?
One-way ANOVA
Use one-way ANOVA when you are comparing the means of three or more independent groups and the model assumptions are reasonable.
A significant ANOVA result means the data are inconsistent with all population means being equal under the model. It does not tell you which groups differ or whether the differences are practically important.
Welch ANOVA
Use Welch ANOVA when group variances differ or group sizes are uneven and an unequal-variance mean comparison is appropriate. If the omnibus result is significant, Games–Howell comparisons are a common follow-up because they do not assume equal variances.
Kruskal–Wallis test
Use Kruskal–Wallis for three or more independent groups when a rank-based comparison fits the data and question. A significant result means at least one group distribution tends to differ, but it does not identify which pairs differ.
| Omnibus method | Common follow-up | Correction issue |
|---|---|---|
| One-way ANOVA | Tukey HSD or prespecified contrasts | Do not run every pairwise t-test without adjustment. |
| Welch ANOVA | Games–Howell comparisons | Use a follow-up that respects unequal variances. |
| Kruskal–Wallis | Dunn test or adjusted pairwise rank tests | Control family-wise error or false discovery rate. |
Three or more repeated measurements
Repeated-measures data occur when the same participant, machine, school, location, or other unit contributes several observations. Those observations are correlated and cannot be treated as independent.
Repeated-measures ANOVA
Use repeated-measures ANOVA for a continuous outcome measured under three or more conditions or time points when its assumptions are reasonable. The sphericity assumption concerns the variances of differences among conditions. If sphericity is violated, software may provide Greenhouse–Geisser or Huynh–Feldt corrections.
Friedman test
Use the Friedman test as a rank-based alternative for three or more related conditions. A significant result is an omnibus finding; follow it with adjusted pairwise comparisons if you need to identify the differing conditions.
Mixed-effects model
A mixed-effects model is often better when participants have missing visits, unequal numbers of observations, irregular timing, nested data, multiple grouping levels, or person-specific trajectories. It models within-subject correlation directly instead of requiring a perfectly complete repeated-measures table.
Categorical data: chi-square, Fisher’s exact, McNemar, or logistic regression?
Chi-square test of independence
Use a chi-square test when two categorical variables are measured on independent observations. Examples include treatment group by recovery status or device type by conversion outcome.
Enter actual observed counts. A table containing percentages without the underlying sample sizes is not enough.
Fisher’s exact test
Use Fisher’s exact test for a 2×2 table when sample size is small or expected cell counts are too sparse for the usual chi-square approximation. It calculates an exact conditional probability under its assumptions.
McNemar’s test
Use McNemar’s test when the outcome is binary and measured twice on the same participants, or when binary observations are explicitly matched. It focuses on pairs that changed from one category to the other.
For example, if the same 100 people answer yes/no before and after a campaign, ordinary chi-square is inappropriate because the two sets of responses are not independent.
Exact binomial test
Use an exact binomial test when one observed binary proportion is being compared with a predefined probability. Examples include whether a defect rate differs from 5% or whether a coin-like process differs from 50%.
Logistic regression
Use logistic regression when the outcome is binary and you want to include one or more predictors, adjust for potential confounders, test interactions, or estimate how the odds change with a predictor.
| Question | Method | Useful effect measure |
|---|---|---|
| Does one observed proportion differ from a benchmark? | Exact binomial test | Observed proportion, absolute difference, confidence interval |
| Are two independent categorical variables associated? | Chi-square test | Cramér’s V, risk difference, risk ratio, or odds ratio |
| Is a small independent 2×2 table associated? | Fisher’s exact test | Odds ratio and confidence interval |
| Did a paired binary outcome change? | McNemar’s test | Discordant-pair counts and paired proportion difference |
| Does one predictor relate to a binary outcome? | Simple logistic regression | Odds ratio and predicted probabilities |
| Does the relationship remain after adjustment? | Multiple logistic regression | Adjusted odds ratios and predicted probabilities |
Correlation and regression
Pearson correlation
Use Pearson correlation when the question concerns the strength and direction of a linear relationship between two continuous variables. Inspect a scatterplot first. A strong curved relationship can have a weak Pearson correlation, and one extreme outlier can dominate the result.
Spearman correlation
Use Spearman correlation when the variables are ordinal, the relationship is monotonic rather than strictly linear, or ranks better represent the question. Spearman correlation still requires properly paired observations and does not remove confounding.
Linear regression
Use linear regression when you want to estimate a continuous outcome from one or more predictors. Regression can handle continuous predictors, coded categorical predictors, interactions, covariate adjustment, and several explanatory variables.
Check whether:
- the functional form is appropriate
- residual variance is reasonably modeled
- residuals are independent or their dependence is modeled
- influential observations are not driving the coefficients
- predictors are not severely redundant
- the sample size supports the model complexity
Correlation and regression do not prove causation. Causal conclusions depend on design, timing, randomization, confounding control, measurement quality, and subject-matter assumptions—not merely a significant coefficient.
Parametric vs. nonparametric tests
“Parametric” and “nonparametric” do not mean “good” and “bad,” or simply “normal” and “not normal.” They often target different quantities and rely on different assumptions.
| Feature | Parametric method | Nonparametric or rank-based method |
|---|---|---|
| Common target | Means, mean differences, model coefficients | Ranks, distributional ordering, or other non-mean contrasts |
| Examples | T-tests, ANOVA, Pearson correlation, linear regression | Mann–Whitney, Wilcoxon, Kruskal–Wallis, Friedman, Spearman |
| Strength | Direct interpretation of means and model parameters when appropriate | Useful for ordinal data and some non-normal or outlier-sensitive situations |
| Limitation | Can mislead when the model form or variance assumptions are badly wrong | May answer a different question and can be less efficient in some settings |
| Still requires | Independent or correctly modeled observations, good design, valid measurement | Independent or correctly paired observations, valid ranking, good design |
Switching to a nonparametric test does not repair pseudoreplication, biased sampling, confounding, missing-not-at-random data, bad measurement, or an outcome that needs a count, binary, ordinal, clustered, or survival model.
Statistical assumption checklist
| Assumption or check | What it means | What to inspect | Possible response |
|---|---|---|---|
| Independence | One observation does not improperly duplicate or depend on another. | Study design, repeated participants, clusters, households, schools, sites. | Use paired, repeated-measures, clustered, or mixed-effects methods. |
| Outcome scale | The measurement supports the proposed analysis. | Continuous, ordinal, binary, count, proportion, or time-to-event structure. | Choose a model designed for that outcome. |
| Distribution shape | The model’s error structure is plausible. | Histograms, Q–Q plots, residual plots, skewness, ceiling and floor effects. | Transform, use robust methods, ranks, or a generalized model. |
| Outliers | A few observations are not dominating the estimate. | Raw plots, boxplots, residuals, leverage, influence statistics. | Verify data, report sensitivity analyses, use robust methods when justified. |
| Equal variance | Some traditional methods assume similar variability across groups. | Group spread, residual-versus-fitted plots, variance estimates. | Use Welch t-test, Welch ANOVA, robust standard errors, or another model. |
| Linearity | Pearson correlation and linear regression model a linear form. | Scatterplot and residual patterns. | Add nonlinear terms, transform variables, or use another model. |
| Sphericity | Repeated-measures ANOVA assumes a particular covariance structure. | Software diagnostics and study structure. | Apply a correction or use a mixed-effects model. |
| Expected cell information | Chi-square approximation needs enough expected information in the table. | Expected-count table, sparse categories, zero cells. | Use Fisher’s exact test, an exact/Monte Carlo method, or a better model. |
| Sample size and power | The design must estimate a meaningful effect with useful precision. | Expected effect, variability, alpha, power, attrition, model complexity. | Plan sample size before collecting data and report uncertainty afterward. |
Independence is usually a design issue, not a box to tick in software. No transformation or p-value can fix data analyzed as independent when the same person, classroom, household, clinic, or laboratory batch contributed repeated observations.
High-stakes research: for clinical, regulated, grant-funded, or publication-critical work, prespecify the primary analysis and involve a statistician before data collection whenever possible.
Worked statistical test examples
| Research scenario | Likely method | Why | What could change the choice |
|---|---|---|---|
| Compare mean exam scores in two unrelated teaching groups. | Welch independent-samples t-test | Continuous outcome and two independent groups. | Ordinal scoring, severe outliers, clustering by classroom, or repeated students. |
| Compare blood pressure before and after treatment in the same patients. | Paired t-test | Continuous outcome with paired observations. | Highly irregular paired differences, missing follow-up, or several time points. |
| Compare mean recovery time across four independent treatments. | One-way ANOVA or Welch ANOVA | Continuous outcome and four independent groups. | Severe skew, censoring, unequal variance, or site clustering. |
| Compare satisfaction ratings across three unrelated service plans. | Kruskal–Wallis or carefully justified ANOVA | A single ordinal rating may favor a rank-based analysis. | A validated multi-item scale may support a continuous-score model. |
| Test whether smoking status is associated with disease status. | Chi-square test | Both variables are categorical and observations are independent. | Use Fisher’s exact test if the 2×2 table is sparse. |
| Test whether the same people changed from no to yes after training. | McNemar’s test | The outcome is paired and binary. | Use exact McNemar when the number of discordant pairs is small. |
| Test whether a defect rate differs from a 5% target. | Exact binomial test | One observed binary proportion is compared with a benchmark. | Use a model if defects are clustered by machine, batch, or site. |
| Test whether age predicts disease after adjusting for sex and treatment. | Multiple logistic regression | The outcome is binary and several predictors are included. | Repeated observations, separation, nonlinear age effect, or rare outcomes. |
| Test whether hours studied are linearly related to exam score. | Pearson correlation or linear regression | Both variables are continuous and the question is linear association. | Curvature, outliers, ordinal data, or adjustment for confounders. |
| Compare the same participants under four interface designs. | Repeated-measures ANOVA or Friedman test | Each participant contributes four related measurements. | Missing conditions, order effects, irregular timing, or random slopes. |
| Pool odds ratios from eight independent studies. | Meta-analysis | The observations are study-level estimates with uncertainty. | Study quality, heterogeneity, publication bias, incompatible outcomes. |
Use Jivaro Pvalyzer to choose and check common tests
Pvalyzer is Jivaro’s statistical test selector and summary-statistics calculator. Use it to narrow the choice among common tests and to check calculations when you already have the required summary values.
The relevant calculator modules include common analyses such as independent and paired t-tests, 2×2 categorical tests, and Pearson correlation significance. Check the labels in the live app before entering data because the interface may be expanded or updated over time.
| Pvalyzer module | What to enter | What to read | Most common mistake |
|---|---|---|---|
| Independent two-sample t-test | For each independent group, enter the mean, standard deviation, and sample size. | Read the estimated mean difference, t statistic, degrees of freedom, p-value, and any reported interval or summary sentence. | Using this module when the same people were measured twice. |
| Paired t-test | Enter the mean of the within-pair differences, the standard deviation of those differences, and the number of complete pairs. | Read the estimated paired change, t statistic, degrees of freedom, and p-value. | Entering separate pre- and post-test means and standard deviations without the standard deviation of the paired differences. Those separate summaries are not enough by themselves. |
| Chi-square 2×2 | Enter the four observed cell counts from an independent 2×2 contingency table. | Read the chi-square statistic, degrees of freedom, and p-value. | Entering percentages instead of counts or using chi-square for paired before-and-after binary data. |
| Fisher-style exact 2×2 | Enter the four observed cell counts from a small or sparse independent 2×2 table. | Read the exact-style probability result and any reported association measure. | Choosing Fisher’s exact test only after seeing that it gives a more favorable p-value. |
| Pearson correlation significance | Enter the Pearson correlation coefficient and the number of paired observations. | Read the correlation test result and p-value. | Interpreting significance without inspecting the scatterplot, linearity, outliers, or data quality. |
Pvalyzer limitation: a summary-statistics calculator cannot reveal outliers, nonlinear relationships, duplicated observations, miscoding, unequal distribution shapes, influential cases, clustering, or missing-data problems. Use it as a selection and checking tool, not as a replacement for raw-data analysis or methodological review.
For tests not available as a matching Jivaro calculator—such as repeated-measures ANOVA, Mann–Whitney, Wilcoxon, Kruskal–Wallis, Friedman, McNemar, logistic regression, mixed models, and survival analysis—use full statistical software and verify the model assumptions directly.
When your rows are studies: use ForestIQ for meta-analysis
If each row represents a study estimate rather than an individual participant, a t-test or ANOVA is usually the wrong framework. You may need a meta-analysis.
ForestIQ is Jivaro’s mini meta-analysis calculator for combining study-level estimates and generating a forest-plot-style summary.
| ForestIQ step | What to enter or inspect | Common mistake |
|---|---|---|
| Choose the effect type | Select the effect measure reported consistently across the included studies, such as a ratio or mean-difference measure supported by the live app. | Pooling odds ratios, risk ratios, hazard ratios, and mean differences as though they were interchangeable. |
| Enter each study | Add a study label, its effect estimate, and the uncertainty input requested by the app, such as a confidence interval or standard error. | Entering point estimates without any measure of study precision. |
| Read the pooled result | Inspect the pooled estimate and confidence interval, along with the individual study estimates. | Assuming the pooled estimate is meaningful when the studies measure materially different populations, interventions, or outcomes. |
| Inspect heterogeneity | Review the heterogeneity statistics and the visual spread of study estimates. | Using one I² cutoff as an automatic accept/reject rule. |
| Export the result | Use the forest plot and generated summary as a starting point for reporting. | Skipping risk-of-bias assessment, sensitivity analysis, study-quality review, or publication-bias checks. |
ForestIQ is useful for teaching and quick evidence summaries. It is not a substitute for a systematic review protocol, duplicate screening, risk-of-bias assessment, sensitivity analysis, publication-bias assessment, or dedicated meta-analysis software for publication-critical work.
How to interpret a statistical test result
A complete result is not “p < 0.05.” It should explain the estimated effect, uncertainty, test, sample size, direction, and practical meaning.
What a p-value means
A p-value describes how unusual the observed result, or a more extreme result, would be under the specified null model and assumptions. It is not the probability that the null hypothesis is true. It is not the probability that the result happened “by chance.”
What “not significant” means
A non-significant result does not prove that there is no difference or relationship. It may reflect a small effect, imprecise measurement, limited sample size, high variability, model mismatch, or data that remain compatible with several possible effects.
What statistical significance does not mean
Statistical significance does not automatically mean the effect is large, clinically important, commercially useful, reproducible, or causal. Large samples can make tiny differences statistically significant, while small studies can leave meaningful effects uncertain.
Weak interpretation: Treatment A was statistically significant, p = 0.03. Better interpretation: Treatment A reduced the mean score by 4.2 points compared with treatment B (95% CI: 0.4 to 8.0 points; Welch t-test, p = 0.03). The interval includes effects ranging from small to potentially meaningful, so the result should be interpreted using the scale's practical threshold.
Report effect sizes and confidence intervals
| Analysis | Useful effect size | What it communicates |
|---|---|---|
| Two-group continuous comparison | Raw mean difference, Cohen’s d, or Hedges’ g | How far apart the groups are in original or standardized units. |
| Paired comparison | Mean paired difference and paired standardized effect | How much the same units changed. |
| ANOVA | Eta squared, partial eta squared, or omega squared | How much variation is associated with the factor under the model. |
| Chi-square | Cramér’s V, risk difference, risk ratio, or odds ratio | The strength and practical direction of categorical association. |
| Correlation | Pearson r or Spearman rho | The direction and strength of association. |
| Linear regression | Regression coefficient, standardized coefficient, R² | Expected outcome change and model-explained variation. |
| Logistic regression | Odds ratio and predicted probability difference | How predictors relate to the odds or predicted probability of an outcome. |
A confidence interval communicates estimate precision and the range of parameter values reasonably compatible with the data and model. Do not describe a frequentist 95% confidence interval as a 95% probability that the fixed true value lies inside the observed interval.
Post-hoc tests and multiple comparisons
Testing many hypotheses increases the chance of false-positive findings. Decide which comparisons matter before seeing the results whenever possible.
| Situation | Possible approach | What to avoid |
|---|---|---|
| All pairwise comparisons after standard ANOVA | Tukey HSD | Unadjusted pairwise t-tests. |
| All pairwise comparisons after Welch ANOVA | Games–Howell | A follow-up method that assumes equal variances. |
| Selected planned comparisons | Prespecified contrasts with an appropriate correction | Inventing contrasts after inspecting every result and presenting them as planned. |
| Many exploratory hypotheses | Holm, Bonferroni, or false-discovery-rate control | Reporting only the smallest p-values. |
| Many outcomes and subgroups | Declare primary outcomes and distinguish confirmatory from exploratory analyses | Treating every analysis as an independent primary test. |
When the basic decision tree is not enough
| Data complication | Method family to consider |
|---|---|
| Participants nested within schools, clinics, companies, or countries | Multilevel or mixed-effects models |
| Repeated observations with missing visits or unequal timing | Mixed-effects or generalized estimating equation models |
| Count outcomes with overdispersion | Negative binomial regression |
| Many zero counts | Zero-inflated or hurdle models |
| Time-to-event outcome with censoring | Kaplan–Meier, log-rank, Cox, or parametric survival models |
| More than two unordered outcome categories | Multinomial logistic regression |
| Ordered categorical outcome | Ordinal regression |
| Strong nonlinear relationship | Splines, generalized additive models, or nonlinear regression |
| Complex survey weights or clustered sampling | Survey-weighted analysis |
| Very high-dimensional predictors | Penalized models, dimension reduction, or specialist methods |
Advanced methods should be chosen from the data-generating structure and research objective. They should not be added merely because they sound more sophisticated.
Common statistical test selection mistakes
| Mistake | Why it matters | Better approach |
|---|---|---|
| Choosing the test after seeing which gives p < 0.05 | Inflates false-positive risk and makes the analysis secretly data-driven. | Prespecify the primary method and document justified sensitivity analyses. |
| Treating repeated data as independent | Standard errors and p-values can be wrong. | Use paired, repeated-measures, clustered, or mixed-effects methods. |
| Using chi-square for paired binary data | The independence assumption is violated. | Use McNemar’s test for paired yes/no outcomes. |
| Running a normality test and obeying it mechanically | Normality tests can be insensitive in small samples and overly sensitive in large samples. | Use plots, residuals, design, sample size, outliers, and the target estimand together. |
| Assuming nonparametric means assumption-free | Rank-based tests still require valid independence or pairing and have their own interpretations. | State what the selected test actually compares. |
| Using chi-square on percentages | The test requires counts and expected frequencies. | Build the contingency table from observed counts. |
| Reporting only a p-value | The reader cannot judge direction, magnitude, or precision. | Report the effect estimate, confidence interval, sample size, and descriptive statistics. |
| Calling correlation causal | Confounding, reverse causation, and selection can explain the relationship. | Match the conclusion to the study design. |
| Ignoring missing data | Complete-case analysis can reduce power or introduce bias. | Describe missingness and consider appropriate imputation or modeling. |
| Using a complex model with too little data | Estimates become unstable and overfit. | Reduce model complexity, collect more data, or use carefully validated regularization. |
FAQ
Which statistical test should I use for two groups?
Use a Welch independent-samples t-test for two unrelated groups with a continuous outcome. Use a paired t-test when the same participants are measured twice. Mann–Whitney and Wilcoxon are common rank-based alternatives.
Should I use a t-test or ANOVA?
Use a t-test for two groups or conditions. Use ANOVA for three or more groups, repeated conditions, multiple factors, or designs requiring an omnibus comparison.
When should I use a chi-square test?
Use chi-square when two categorical variables are measured on independent observations and you have observed counts in a contingency table. For paired binary data, use McNemar’s test instead.
What is the difference between independent and paired data?
Independent observations come from unrelated participants or units. Paired data come from the same participant measured more than once or from deliberately matched observations.
When should I use a nonparametric test?
Consider a rank-based method for ordinal data or when the rank-based question fits better than a mean comparison. Do not switch automatically because one normality test is significant.
Should I use Pearson or Spearman correlation?
Use Pearson for a linear relationship between continuous variables. Use Spearman for ordinal data or monotonic relationships where ranks are more appropriate. Inspect a scatterplot either way.
Does p < 0.05 prove my hypothesis?
No. A small p-value indicates that the observed result is relatively incompatible with the specified null model under its assumptions. It does not prove the hypothesis, practical importance, reproducibility, or causation.
What if my data are not normally distributed?
Inspect the outcome and model residuals, sample size, outliers, skew, measurement scale, and study design. Options include Welch methods, transformation, robust models, rank-based tests, permutation tests, or a model designed for the outcome distribution.
Can Pvalyzer choose the test for me?
Pvalyzer can help narrow common test choices and check selected calculations from summary data. It cannot inspect raw-data quality, research design, confounding, outliers, missingness, clustering, or every model assumption.
What should I report besides the p-value?
Report sample size, descriptive statistics, effect estimate, confidence interval, exact test, assumptions or diagnostics, and any post-hoc or multiple-testing correction.
Sources and useful links
- Jivaro Pvalyzer: Statistical Test Selector
- Jivaro ForestIQ: Meta-Analysis Calculator
- NIST/SEMATECH e-Handbook of Statistical Methods
- American Statistical Association Statement on P-Values
- SciPy Statistical Functions Reference
- SciPy: Independent-Samples T-Test
- SciPy: Paired T-Test
- SciPy: One-Way ANOVA
- SciPy: Mann–Whitney U Test
- SciPy: Wilcoxon Signed-Rank Test
- SciPy: Kruskal–Wallis Test
- SciPy: Friedman Test
- SciPy: Chi-Square Contingency Test
- SciPy: Fisher’s Exact Test
- SciPy: Exact Binomial Test
- Statsmodels: McNemar’s Test
- SciPy: Pearson Correlation
- SciPy: Spearman Correlation
- Statsmodels Regression Documentation
The right statistical test is the one that matches the outcome, research question, study design, dependence structure, and quantity you want to estimate. Choose that structure before calculating the p-value, then report the effect and uncertainty—not merely whether the result crossed a threshold.
