HalesAir · Session 06

Statistical Analysis

Halesowen College · T Level Data Analytics

Correlation analysis · Time-series trends · Comparing sites statistically · Turning results into findings

🔗

Part 1
Correlation Analysis

Do variables move together? How strongly?

What is Correlation?

Pearson's r

Measures linear relationship between two variables
Range: +1.0 (perfect positive) to -1.0 (perfect negative)
0 = no linear relationship
|r| > 0.7 = strong · 0.4–0.7 = moderate · < 0.4 = weak
Does NOT imply causation
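
To make the definition concrete, here is a minimal sketch that computes Pearson's r from its formula (covariance divided by the product of the standard deviations) and labels it with the thresholds above. The helper names `pearson_r` and `strength` and the readings are made up for illustration; they are not part of the HalesAir code or data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def strength(r):
    """Label |r| using the slide's thresholds (hypothetical helper)."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"

# Made-up readings for illustration only (not HalesAir data)
temp_c = [14.0, 15.5, 17.0, 18.5, 20.0, 21.5]
hum_pct = [72.0, 70.0, 63.0, 61.0, 55.0, 52.0]

r = pearson_r(temp_c, hum_pct)
print(f"r = {r:.2f} ({strength(r)})")
```

In practice `df.corr()` does this for every pair of columns at once; the hand-rolled version is only there to show what the number means.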

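Example: If temperature and humidity are strongly negatively correlated (r = −0.72), it might mean the sensor heats up and dries out the air around it, which would be a calibration artefact rather than a real environmental relationship. It could equally be genuine, because relative humidity falls as air warms. The correlation alone cannot tell you which.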

	temp_c	hum_pct	pres_hpa	gas_ohm
temp_c	1.00	−0.71	0.12	0.48
hum_pct	−0.71	1.00	−0.09	−0.22
pres_hpa	0.12	−0.09	1.00	0.18
gas_ohm	0.48	−0.22	0.18	1.00

Example correlation matrix from a 24-hour dataset

Computing Correlation in pandas

import seaborn as sns
import matplotlib.pyplot as plt
# Compute correlation matrix
corr = df.corr(numeric_only=True)
print(corr.round(2))
# Visualise as heatmap
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    annot=True, fmt=".2f",
    cmap="coolwarm", center=0,
    ax=ax
)
plt.savefig("correlation_heatmap.png", dpi=150)
# Specific pair: scatter plot
df.plot.scatter(x='temp_c', y='hum_pct',
    alpha=0.3, color='#2dd4bf')
plt.savefig("temp_vs_hum.png", dpi=150)

Interpreting the heatmap

Red = positive correlation (both rise together)
Blue = negative correlation (as one rises, other falls)
Pale/neutral = no linear relationship
Diagonal is always 1.0 (variable vs itself)

pip install seaborn — built on matplotlib, makes beautiful statistical visualisations with minimal code.

Always follow up a correlation with a scatter plot to visually confirm the relationship isn't driven by outliers.
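
A small sketch of why the scatter plot matters: with made-up numbers, one extreme reading is enough to turn an essentially uncorrelated set of points into an apparently strong correlation. The values below are invented for the demonstration.

```python
import numpy as np

# Made-up values: eight unrelated points plus one extreme reading at the end
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 50], dtype=float)
y = np.array([5, 3, 6, 2, 5, 3, 6, 4, 60], dtype=float)

# Pearson's r with and without the final (outlying) point
r_all = np.corrcoef(x, y)[0, 1]
r_trimmed = np.corrcoef(x[:-1], y[:-1])[0, 1]

print(f"with outlier:    r = {r_all:.2f}")
print(f"without outlier: r = {r_trimmed:.2f}")
```

On a scatter plot the lone point far from the cluster is obvious at a glance, which the single r value hides.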

📈

Part 2
Time-Series Trends

Patterns over time · Resampling · Rolling averages

Resampling & Rolling Averages

# Resample to hourly means
df_hourly = df.resample('1h').mean()
# 30-minute rolling average (smoothing)
df['temp_rolling'] = (
    df['temp_c']
    .rolling('30min')
    .mean()
)
# Plot raw vs smoothed
fig, ax = plt.subplots(figsize=(12, 4))
df['temp_c'].plot(ax=ax, alpha=0.25,
    color='#2dd4bf', label='raw')
df['temp_rolling'].plot(ax=ax,
    color='#fbbf24', label='30-min avg')
ax.legend()
plt.savefig("temp_smoothed.png", dpi=150)

Why resample?

10-second data has 8640 points/day — too noisy for trend spotting
Resample to hourly: 24 points/day — patterns become visible
Rolling average: Smooth out noise while preserving real trends
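
The point counts above can be checked with a quick sketch on synthetic data. The daily cycle and noise level here are assumptions chosen for the demo, not measurements, and `demo` is a stand-in for your own DataFrame.

```python
import numpy as np
import pandas as pd

# One day of synthetic 10-second readings (illustrative, not real sensor data)
index = pd.date_range("2024-01-01", periods=8640, freq="10s")
rng = np.random.default_rng(0)
hours = index.hour.to_numpy() + index.minute.to_numpy() / 60
temp = 15 + 4 * np.sin((hours - 9) / 24 * 2 * np.pi) + rng.normal(0, 0.5, len(index))
demo = pd.DataFrame({"temp_c": temp}, index=index)

# Hourly means: one value per hour
hourly = demo.resample("1h").mean()
# 30-minute rolling mean: same number of points as the raw data, but smoother
rolling = demo["temp_c"].rolling("30min").mean()

print(len(demo), "raw points ->", len(hourly), "hourly means")
print("step-to-step noise, raw:     ", round(demo["temp_c"].diff().std(), 3))
print("step-to-step noise, smoothed:", round(rolling.diff().std(), 3))
```

Resampling reduces the number of points, whereas the rolling average keeps every timestamp and only smooths the values.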

Patterns to look for

Morning temperature rise as sunlight hits the sensor
Humidity peaks at night, drops midday
IAQ worsens during rush hours (8–9am, 3–5pm)?
Weekend vs weekday differences in your data
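
One way to look for these patterns is to average by hour of day and by weekday versus weekend. This is a sketch on synthetic data with a built-in afternoon peak (an assumption of the demo); swap `demo` for your own DataFrame with a datetime index.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly temperatures with a daily cycle peaking at 3 pm
# (illustrative only; the peak time is an assumption of this demo)
index = pd.date_range("2024-01-01", periods=14 * 24, freq="1h")
hour = index.hour.to_numpy()
temp = 15 + 4 * np.cos((hour - 15) / 24 * 2 * np.pi)
demo = pd.DataFrame({"temp_c": temp}, index=index)

# Average by hour of day: the daily profile
by_hour = demo.groupby(demo.index.hour)["temp_c"].mean()
print("Warmest hour of the day:", by_hour.idxmax())

# Weekday vs weekend means (Monday = 0 ... Sunday = 6)
is_weekend = demo.index.dayofweek >= 5
day_type = np.where(is_weekend, "weekend", "weekday")
by_day_type = demo.groupby(day_type)["temp_c"].mean()
print(by_day_type)
```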

⚖️

Part 3
Comparing Sites

Is one location significantly different from another?

Do Sites Differ? A Simple Statistical Test

import pandas as pd
from scipy import stats
# Load both site datasets
df_a = pd.read_csv("site_a_clean.csv",
         parse_dates=['timestamp'], index_col='timestamp')
df_b = pd.read_csv("site_b_clean.csv",
         parse_dates=['timestamp'], index_col='timestamp')
# Mann-Whitney U test (non-parametric)
stat, p = stats.mannwhitneyu(
    df_a['temp_c'].dropna(),
    df_b['temp_c'].dropna()
)
print(f"p-value: {p:.4f}")
if p < 0.05:
    print("Sites differ significantly (p < 0.05)")

What the p-value means

p < 0.05 — statistically significant difference
p 0.05–0.1 — marginally significant, interpret with caution
p > 0.1 — no evidence of a difference at this sample size
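
The three bands above can be written as a small lookup, which is handy when you run the test on several variables. `interpret_p` is a hypothetical helper, and treating exactly 0.05 and 0.1 as "marginal" is an assumption, since the slide does not say which band the boundaries belong to.

```python
def interpret_p(p):
    """Map a p-value to the wording used on this slide (hypothetical helper)."""
    if p < 0.05:
        return "statistically significant difference"
    if p <= 0.1:
        return "marginally significant, interpret with caution"
    return "no evidence of a difference at this sample size"

for p in (0.002, 0.07, 0.4):
    print(f"p = {p}: {interpret_p(p)}")
```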

Why Mann-Whitney? Our data is unlikely to be normally distributed; environmental data rarely is. Mann-Whitney is a non-parametric test that doesn't assume normality.

Box plots are the best visual companion to this test — they show median, IQR, and outliers side by side for both sites.

Box Plots — Comparing Distributions

import pandas as pd
import matplotlib.pyplot as plt
# Combine sites into one DataFrame
df_a['site'] = 'Car Park'
df_b['site'] = 'Courtyard'
df_all = pd.concat([df_a, df_b])
# Grouped box plot
df_all.boxplot(
    column='temp_c',
    by='site',
    figsize=(7, 5),
    patch_artist=True
)
plt.suptitle("")  # remove auto title
plt.title("Temperature by Site")
plt.savefig("boxplot_temp.png", dpi=150)

Reading a box plot

Box = 25th to 75th percentile (IQR)
Central line = median (50th percentile)
Whiskers = extend to the most extreme points within 1.5 × IQR of the box
Points beyond whiskers = statistical outliers
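
The numbers behind a box plot can be computed directly. This sketch uses ten made-up temperatures and the usual 1.5 × IQR convention; the variable names are illustrative.

```python
import numpy as np

# Made-up temperatures (°C) with one suspicious high reading
temps = np.array([17.8, 18.0, 18.1, 18.3, 18.4, 18.5, 18.6, 18.8, 19.0, 23.5])

# Box: 25th and 75th percentiles; central line: median
q1, median, q3 = np.percentile(temps, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme readings that are still inside the fences
inside = temps[(temps >= lower_fence) & (temps <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point
outliers = temps[(temps < lower_fence) | (temps > upper_fence)]

print(f"median = {median:.2f}, IQR = {iqr:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("outliers:", outliers.tolist())
```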

The box plot IS your results figure. If the boxes barely overlap, the sites probably differ, but overlap on its own is not a test. Pair it with the p-value for a complete finding.

Turning Results into Findings

A finding is a result with an interpretation. Not just numbers — statements.

❌ Result (not enough)

Mean temperature at the car park was 18.4°C and at the courtyard was 19.1°C.

✓ Finding (good)

The courtyard recorded mean temperatures 0.7°C higher than the car park (Mann-Whitney U, p = 0.002), consistent with reduced airflow in an enclosed space.

Structure of a finding

What: the difference.
How much: magnitude.
Confidence: p-value or effect size.
Why: physical interpretation.
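
If you compute your numbers in Python, a small template keeps all four parts in every finding. `format_finding` is a hypothetical helper written for this sketch; the values passed in are the ones from the example finding above.

```python
def format_finding(higher, lower, quantity, diff, unit, test, p, why):
    """Assemble WHAT + HOW MUCH + CONFIDENCE + WHY into one sentence."""
    return (
        f"The {higher} recorded {quantity} {diff:.1f}{unit} higher than "
        f"the {lower} ({test}, p = {p:.3f}), {why}."
    )

finding = format_finding(
    higher="courtyard", lower="car park",
    quantity="mean temperatures", diff=19.1 - 18.4, unit="°C",
    test="Mann-Whitney U", p=0.002,
    why="consistent with reduced airflow in an enclosed space",
)
print(finding)
```

The WHY still has to come from you; the template only makes sure it is not left out.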

Aim for 3–5 findings in your final report. Each should be one clear, evidenced sentence. They become the backbone of your discussion section and your presentation.

Activity: Analyse Your Data

⏱ 35 minutes

Goal: correlation heatmap + two time-series plots + one written finding.

(8 min) Compute the correlation matrix. Save it as a heatmap. Which pair has the strongest correlation? Is it positive or negative? Write one sentence explaining why this might be.

(10 min) Resample your data to 30-minute means. Plot temperature and IAQ (normalised gas) on the same time axis. Do you see any patterns matching school hours?

(10 min) Combine your dataset with a partner's (different site). Build a box plot comparing temperature between the two locations. Run the Mann-Whitney U test. What is your p-value?

(7 min) Write one complete finding following the structure above: WHAT + HOW MUCH + CONFIDENCE + WHY. This goes directly into your report draft.

💡 Effect size matters as much as the p-value. A statistically significant but tiny difference (0.1°C) may not be practically meaningful. Always report both the statistical and the practical significance.
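
One possible way to report practical significance alongside the p-value is the difference in medians plus the share of reading pairs in which one site is warmer than the other. This is a sketch with made-up readings; the choice of effect-size measure is an assumption, not a course requirement.

```python
import numpy as np
from scipy import stats

# Made-up readings (°C) for illustration only
car_park = np.array([18.3, 18.5, 18.4, 18.6, 18.2])
courtyard = np.array([19.0, 19.2, 19.1, 19.4, 19.3])

# Statistical significance
u_stat, p = stats.mannwhitneyu(courtyard, car_park, alternative="two-sided")

# Practical significance 1: how big is the difference, in real units?
median_diff = np.median(courtyard) - np.median(car_park)

# Practical significance 2: in what share of (courtyard, car park) pairs
# is the courtyard warmer? Ties count as half. 0.5 means no tendency.
warmer = courtyard[:, None] > car_park[None, :]
tied = courtyard[:, None] == car_park[None, :]
prob_warmer = warmer.mean() + 0.5 * tied.mean()

print(f"p = {p:.4f}")
print(f"median difference = {median_diff:.1f} °C")
print(f"courtyard warmer in {prob_warmer:.0%} of pairs")
```

The pairwise comparison builds an n × m table, so for very long datasets it is worth resampling first.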

Coming Up — Session 07

Date TBC · Data Visualisation

What we'll cover

Principles of effective data visualisation, choosing the right chart type for each finding, building a multi-panel summary figure, dashboard design basics, and colour accessibility.

Before Session 07: Have your 3–5 findings written up. We'll build a figure for each one.

Stretch task: Compute a 7-day rolling average if you have enough data. Do weekly trends differ from daily patterns?

Questions?

jwilliams.science · HalesAir Project
