HalesAir · Session 06

Statistical Analysis

Halesowen College · T Level Data Analytics

Correlation analysis · Time-series trends · Comparing sites statistically · Turning results into findings
🔗

Part 1
Correlation Analysis

Do variables move together? How strongly?

What is Correlation?

Pearson's r

  • Measures linear relationship between two variables
  • Range: +1.0 (perfect positive) to -1.0 (perfect negative)
  • 0 = no linear relationship
  • |r| > 0.7 = strong · 0.4–0.7 = moderate · < 0.4 = weak
  • Does NOT imply causation

Example: If temperature and humidity are strongly negatively correlated (r = −0.72), it might mean the sensor heats up and dries the air around it, which would be a calibration artefact rather than a physical relationship.
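
To make the definition concrete, here is a minimal sketch of Pearson's r computed by hand with NumPy, plus a small helper that applies the strength thresholds listed above. The function names and the readings are made up for illustration; they are not HalesAir data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y scaled by the spread of each."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def strength(r):
    """Label |r| using the thresholds from the list above."""
    a = abs(r)
    if a > 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    return "weak"

# Made-up readings for illustration only
temp_c = [14.0, 15.5, 17.0, 18.5, 20.0, 21.5]
hum_pct = [71.0, 69.0, 64.0, 62.0, 55.0, 54.0]

r = pearson_r(temp_c, hum_pct)
print(f"r = {r:.2f} ({strength(r)})")
```

In practice df.corr() does the same calculation for every pair of columns at once.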

          temp_c  hum_pct  pres_hpa  gas_ohm
temp_c      1.00    −0.71      0.12     0.48
hum_pct    −0.71     1.00     −0.09    −0.22
pres_hpa    0.12    −0.09      1.00     0.18
gas_ohm     0.48    −0.22      0.18     1.00

Example correlation matrix from a 24-hour dataset

Computing Correlation in pandas

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Compute correlation matrix
    corr = df.corr(numeric_only=True)
    print(corr.round(2))

    # Visualise as heatmap
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(
        corr, annot=True, fmt=".2f",
        cmap="coolwarm", center=0, ax=ax
    )
    plt.savefig("correlation_heatmap.png", dpi=150)

    # Specific pair: scatter plot
    df.plot.scatter(x='temp_c', y='hum_pct', alpha=0.3, color='#2dd4bf')
    plt.savefig("temp_vs_hum.png", dpi=150)

Interpreting the heatmap

  • Red = positive correlation (both rise together)
  • Blue = negative correlation (as one rises, other falls)
  • White/neutral = no linear relationship
  • Diagonal is always 1.0 (variable vs itself)

Install with pip install seaborn. Seaborn is built on matplotlib and produces statistical visualisations with minimal code.

Always follow up a correlation with a scatter plot to visually confirm the relationship isn't driven by outliers.
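
A rough illustration of why the scatter plot check matters: in this sketch two unrelated synthetic variables gain a high Pearson r from a single extreme reading, while the rank-based Spearman correlation barely moves. The data and variable names are invented for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two unrelated synthetic variables (illustrative, not HalesAir data)
df_demo = pd.DataFrame({
    "x": rng.normal(0, 1, 200),
    "y": rng.normal(0, 1, 200),
})
r_clean = df_demo.corr().loc["x", "y"]

# Add one extreme reading, such as a sensor glitch
df_glitch = pd.concat(
    [df_demo, pd.DataFrame({"x": [50.0], "y": [50.0]})],
    ignore_index=True,
)
r_glitch = df_glitch.corr().loc["x", "y"]
rho_glitch = df_glitch.corr(method="spearman").loc["x", "y"]

print(f"Pearson r without outlier: {r_clean:.2f}")
print(f"Pearson r with outlier:    {r_glitch:.2f}")
print(f"Spearman rho with outlier: {rho_glitch:.2f}")
```

If Pearson and Spearman disagree sharply on your own data, look at the scatter plot before writing anything up.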

📈

Part 2
Time-Series Trends

Patterns over time · Resampling · Rolling averages

Resampling & Rolling Averages

    # Resample to hourly means
    df_hourly = df.resample('1h').mean()

    # 30-minute rolling average (smoothing)
    df['temp_rolling'] = (
        df['temp_c']
        .rolling('30min')
        .mean()
    )

    # Plot raw vs smoothed
    fig, ax = plt.subplots(figsize=(12, 4))
    df['temp_c'].plot(ax=ax, alpha=0.25, color='#2dd4bf', label='raw')
    df['temp_rolling'].plot(ax=ax, color='#fbbf24', label='30-min avg')
    ax.legend()
    plt.savefig("temp_smoothed.png", dpi=150)

Why resample?

  • 10-second data has 8,640 points/day: too noisy for trend spotting
  • Resample to hourly: 24 points/day, so patterns become visible
  • Rolling average: Smooth out noise while preserving real trends
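
The point counts above can be checked on a synthetic day of data. This is a sketch only: the daily cycle, noise level, and variable names are assumptions chosen for illustration, not measurements.

```python
import numpy as np
import pandas as pd

# One synthetic day of 10-second readings (illustrative values only)
idx = pd.date_range("2025-01-01", periods=8640, freq="10s")
rng = np.random.default_rng(1)
hours = idx.hour.to_numpy() + idx.minute.to_numpy() / 60

# Daily cycle peaking mid-afternoon, plus random sensor noise
temp = 15 + 4 * np.sin((hours - 9) / 24 * 2 * np.pi) + rng.normal(0, 0.5, len(idx))
raw = pd.Series(temp, index=idx, name="temp_c")

hourly = raw.resample("1h").mean()       # one mean per clock hour
rolling = raw.rolling("30min").mean()    # same length as raw, but smoother

print(len(raw), "raw points ->", len(hourly), "hourly points")
print("step-to-step std, raw:     ", round(raw.diff().std(), 3))
print("step-to-step std, smoothed:", round(rolling.diff().std(), 3))
```

Resampling reduces the number of points; a rolling average keeps every timestamp but damps the jitter between neighbouring readings.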

Patterns to look for

  • Morning temperature rise as sunlight hits the sensor
  • Humidity peaks at night, drops midday
  • IAQ worsens during rush hours (8–9am, 3–5pm)?
  • Weekend vs weekday differences in your data
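
One way to look for these patterns is to group readings by hour of day and by weekday versus weekend. The sketch below uses two synthetic weeks of humidity data; the column name hum_pct follows the earlier examples, and the values are invented.

```python
import numpy as np
import pandas as pd

# Two synthetic weeks of 10-minute readings, starting on a Monday
idx = pd.date_range("2025-01-06", periods=14 * 24 * 6, freq="10min")
rng = np.random.default_rng(2)
hour = idx.hour.to_numpy()
hum = 70 - 10 * np.sin((hour - 9) / 24 * 2 * np.pi) + rng.normal(0, 0.3, len(idx))
df_demo = pd.DataFrame({"hum_pct": hum}, index=idx)

# Average by hour of day to expose the daily cycle
by_hour = df_demo.groupby(df_demo.index.hour)["hum_pct"].mean()
print("Most humid hour:", by_hour.idxmax(), "| Driest hour:", by_hour.idxmin())

# Split weekday vs weekend (dayofweek: Monday=0 ... Sunday=6)
is_weekend = df_demo.index.dayofweek >= 5
day_type = np.where(is_weekend, "weekend", "weekday")
by_day_type = df_demo.groupby(day_type)["hum_pct"].mean()
print(by_day_type.round(1))
```

The same two groupby calls should work on any DataFrame with a DatetimeIndex.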

⚖️

Part 3
Comparing Sites

Is one location significantly different from another?

Do Sites Differ? A Simple Statistical Test

    import pandas as pd
    from scipy import stats

    # Load both site datasets
    df_a = pd.read_csv("site_a_clean.csv",
        parse_dates=['timestamp'], index_col='timestamp')
    df_b = pd.read_csv("site_b_clean.csv",
        parse_dates=['timestamp'], index_col='timestamp')

    # Mann-Whitney U test (non-parametric, two-sided)
    stat, p = stats.mannwhitneyu(
        df_a['temp_c'].dropna(),
        df_b['temp_c'].dropna(),
        alternative='two-sided'
    )
    print(f"p-value: {p:.4f}")
    if p < 0.05:
        print("Sites differ significantly (p < 0.05)")

What the p-value means

  • p < 0.05: statistically significant difference
  • p 0.05–0.1: marginally significant, interpret with caution
  • p > 0.1: no evidence of a difference at this sample size

Why Mann-Whitney? Environmental data is rarely normally distributed, and ours is unlikely to be an exception. Mann-Whitney is a non-parametric test that doesn't assume normality.
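
A quick way to see this in action is to test skewed data directly. The sketch below draws two synthetic right-skewed samples, uses the Shapiro-Wilk test (a common normality check, not something this session requires) to show they are not normal, then runs Mann-Whitney. All values are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic right-skewed readings for two hypothetical sites
site_a = rng.lognormal(mean=3.0, sigma=0.4, size=300)
site_b = rng.lognormal(mean=3.2, sigma=0.4, size=300)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
_, p_normal_a = stats.shapiro(site_a)
print(f"Shapiro-Wilk p for site A: {p_normal_a:.4f}")

# Mann-Whitney U compares the samples by rank, so skew does not invalidate it
u_stat, p_value = stats.mannwhitneyu(site_a, site_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_value:.4f}")
```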

Box plots are the best visual companion to this test: they show median, IQR, and outliers side by side for both sites.

Box Plots: Comparing Distributions

    import pandas as pd
    import matplotlib.pyplot as plt

    # Combine sites into one DataFrame
    df_a['site'] = 'Car Park'
    df_b['site'] = 'Courtyard'
    df_all = pd.concat([df_a, df_b])

    # Grouped box plot
    df_all.boxplot(
        column='temp_c', by='site',
        figsize=(7, 5), patch_artist=True
    )
    plt.suptitle("")  # remove auto title
    plt.title("Temperature by Site")
    plt.savefig("boxplot_temp.png", dpi=150)

Reading a box plot

  • Box = 25th to 75th percentile (IQR)
  • Central line = median (50th percentile)
  • Whiskers = extend to the furthest data point within 1.5 × IQR of the box
  • Points beyond whiskers = statistical outliers
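
The numbers behind a box plot can be computed directly, which helps when quoting them in a report. A minimal sketch with made-up temperature readings, one of which is an obvious spike:

```python
import numpy as np

# Made-up temperature readings with one obvious spike
temps = np.array([17.8, 18.1, 18.3, 18.4, 18.6, 18.9, 19.0, 19.2, 19.5, 24.0])

q1, median, q3 = np.percentile(temps, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the edges of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme readings inside the fences
inside = temps[(temps >= lower_fence) & (temps <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point
outliers = temps[(temps < lower_fence) | (temps > upper_fence)]

print(f"median={median:.2f}  IQR={iqr:.3f}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("outliers:", outliers)
```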

The box plot IS your results figure. If the boxes barely overlap, the sites probably differ. Pair it with the p-value for a complete finding.

Turning Results into Findings

A finding is a result with an interpretation. Not just numbers, but statements.

❌ Result (not enough)

Mean temperature at the car park was 18.4°C and at the courtyard was 19.1°C.

✓ Finding (good)

The courtyard recorded mean temperatures 0.7°C higher than the car park (Mann-Whitney U, p = 0.002), consistent with reduced airflow in an enclosed space.

Structure of a finding

What: the difference.
How much: magnitude.
Confidence: p-value or effect size.
Why: physical interpretation.

Aim for 3–5 findings in your final report. Each should be one clear, evidenced sentence. They become the backbone of your discussion section and your presentation.

Activity: Analyse Your Data

⏱ 35 minutes

Goal: correlation heatmap + two time-series plots + one written finding.

1. (8 min) Compute the correlation matrix. Save it as a heatmap. Which pair has the strongest correlation? Is it positive or negative? Write one sentence explaining why this might be.

2. (10 min) Resample your data to 30-minute means. Plot temperature and IAQ (normalised gas) on the same time axis. Do you see any patterns matching school hours?

3. (10 min) Combine your dataset with a partner's (different site). Build a box plot comparing temperature between the two locations. Run the Mann-Whitney U test. What is your p-value?

4. (7 min) Write one complete finding following the structure above: WHAT + HOW MUCH + CONFIDENCE + WHY. This goes directly into your report draft.

💡 Effect size matters as much as the p-value. A statistically significant but tiny difference (0.1°C) may not be practically meaningful. Always report both the statistical and the practical significance.
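
One possible way to put a number on effect size alongside a Mann-Whitney p-value is the difference in medians together with the probability that a random reading from one site exceeds a random reading from the other. This is a sketch of one option among several; the helper name effect_summary and the readings are hypothetical.

```python
import numpy as np

def effect_summary(a, b):
    """Return (median difference, P(random reading from a > random reading from b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    median_diff = float(np.median(a) - np.median(b))
    # Compare every reading in a with every reading in b; ties count as half.
    # This builds a len(a) x len(b) grid, so subsample very large datasets first.
    greater = (a[:, None] > b[None, :]).mean()
    ties = (a[:, None] == b[None, :]).mean()
    return median_diff, float(greater + 0.5 * ties)

# Made-up readings for illustration only
courtyard = [19.0, 19.3, 18.8, 19.4, 19.1]
car_park = [18.2, 18.6, 18.4, 18.9, 18.1]

diff, prob = effect_summary(courtyard, car_park)
print(f"Median difference: {diff:.1f} °C")
print(f"P(courtyard reading > car park reading): {prob:.2f}")
```

A probability near 0.5 means the two sites are practically interchangeable, however small the p-value.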

Coming Up โ€” Session 07

Date TBC · Data Visualisation

What we'll cover

Principles of effective data visualisation, choosing the right chart type for each finding, building a multi-panel summary figure, dashboard design basics, and colour accessibility.

Before Session 07: Have your 3–5 findings written up. We'll build a figure for each one.

Stretch task: Compute a 7-day rolling average if you have enough data. Do weekly trends differ from daily patterns?
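
A possible starting point for the stretch task, assuming your data has a DatetimeIndex. The sketch uses three synthetic weeks of hourly readings with a daily cycle and a slow warming trend; the values are invented.

```python
import numpy as np
import pandas as pd

# Three synthetic weeks of hourly readings (illustrative values only)
idx = pd.date_range("2025-01-06", periods=21 * 24, freq="1h")
hour = idx.hour.to_numpy()
temp = 15 + 4 * np.sin((hour - 9) / 24 * 2 * np.pi) + np.linspace(0, 3, len(idx))
series = pd.Series(temp, index=idx, name="temp_c")

# 7-day rolling mean; min_periods leaves gaps until a full week of hourly data exists
weekly = series.rolling("7D", min_periods=7 * 24).mean()

# Daily means for comparison with the weekly trend
daily = series.resample("1D").mean()

print("First 7-day average available at:", weekly.first_valid_index())
print("Spread of hourly data:    ", round(series.std(), 2))
print("Spread of 7-day averages: ", round(weekly.std(), 2))
```

The 7-day window averages out the daily cycle, so whatever remains is the slower trend.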

Questions?

jwilliams.science · HalesAir Project