โ† Back to Lectures
HalesAir Logo
HalesAir · Session 05

Data Cleaning &
Exploration

Halesowen College · T Level Data Analytics

What is dirty data · Outliers & missing values · Summary statistics in pandas · First exploratory plots
🧹

Part 1
What is Dirty Data?

Every real-world dataset has problems. Learning to find them is the skill.

Types of Data Quality Problems

Structural issues

  • Missing values – sensor disconnect, power cut, network drop
  • Duplicate rows – re-transmission or logging error
  • Wrong type – temperature stored as text "22.5°C"
  • Inconsistent format – mixed date styles in the same column

Value issues

  • Outliers – physically impossible (temp = 99°C, hum = 150%)
  • Sensor drift – steady upward creep over days
  • Stuck values – same number repeated for hours
  • Warm-up artefacts – first 3–5 minutes unreliable

Key insight: Dirty data doesn't mean bad science – it means you used a real sensor in the real world. Acknowledging and handling data quality issues is what makes your analysis credible.

Spotting Problems in Your CSV

timestamp           | temp_c | hum_pct | pres_hpa | gas_ohm | issue
2026-03-10 09:00:02 |  18.3  |   62    |  1013.2  |  48200  | ✓ clean
2026-03-10 09:00:12 |  99.0  |   61    |  1013.2  |  49100  | temp spike
(missing)           |  18.5  |   61    |  1013.3  |  49800  | missing timestamp
2026-03-10 09:00:32 |  18.5  |  150    |  1013.3  |  50200  | humidity impossible
2026-03-10 09:00:42 |  18.5  |   61    |  1013.3  |  49800  | stuck / duplicate

All of these have appeared in real HalesAir historical data. Today you'll detect and handle them systematically using pandas.
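Before cleaning anything, it helps to count how many rows show each problem. A minimal pandas sketch (the tiny inline DataFrame stands in for your real log; column names match the table above) that flags impossible values and stuck-sensor runs:

```python
import pandas as pd

# Stand-in for your real log – same columns as the table above
df = pd.DataFrame({
    "temp_c":  [18.3, 99.0, 18.5, 18.5, 18.5],
    "hum_pct": [62,   61,   61,   150,  61],
})

# Physically impossible readings
temp_spikes = df[~df["temp_c"].between(-10, 60)]
bad_hum     = df[~df["hum_pct"].between(0, 100)]

# Stuck sensor: the same value repeated in consecutive rows
stuck = df[df["temp_c"].diff() == 0]

print(len(temp_spikes), len(bad_hum), len(stuck))  # → 1 1 2
```

Counting first, cleaning second means your Methods notes can say exactly how many rows each rule affected.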

๐Ÿผ

Part 2
Loading & Cleaning with pandas

Python's essential data analysis library

Loading Your Data

import pandas as pd

# Load CSV, parse dates automatically
df = pd.read_csv(
    "log.csv",
    parse_dates=['timestamp']
)

# First checks
print(df.shape)           # (rows, cols)
print(df.dtypes)          # column types
print(df.head(5))         # first 5 rows
print(df.isnull().sum())  # null counts

What to check

  • df.shape – e.g. (8640, 5) = 8640 rows
  • df.dtypes – timestamp should be datetime64, not object
  • isnull().sum() – how many NaN per column?

Install pandas: On your laptop, not the Pico – run pip install pandas matplotlib in your laptop's terminal.

Never modify the original CSV. Load it once, then save your cleaned version separately as log_clean.csv.
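If df.dtypes shows a sensor column as object instead of a number, the CSV probably contains text like "22.5°C" (the wrong-type problem from Part 1). A hedged sketch of one way to repair it – the column values here are made up for illustration:

```python
import pandas as pd

# Hypothetical column that was logged as text with a unit attached
s = pd.Series(["22.5°C", "18.3°C", "bad reading"])

# Strip the unit, then convert; anything unparseable becomes NaN
temp = pd.to_numeric(s.str.replace("°C", "", regex=False), errors="coerce")
print(temp.dtype)  # → float64
```

errors="coerce" turns garbage entries into NaN rather than crashing, so they can be handled by the same dropna/impute steps as any other missing value.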

Cleaning: Step by Step

# 1. Remove exact duplicate rows
df = df.drop_duplicates()

# 2. Sort by time, set as index
df = df.set_index('timestamp').sort_index()

# 3. Drop rows with missing core values
df = df.dropna(subset=['temp_c', 'hum_pct'])

# 4. Filter physically impossible values
df = df[df['temp_c'].between(-10, 60)]
df = df[df['hum_pct'].between(0, 100)]

# 5. Save cleaned version
df.to_csv("log_clean.csv")

Decisions you're making

  • Drop vs impute: We drop rows missing critical values. For gas_ohm you might fill with the column median instead.
  • Physical bounds: −10 to 60°C is a realistic UK outdoor range; 0–100% for humidity is definitional.
  • Document everything: Record how many rows were removed and why – this becomes your Methods section.
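The impute option for gas_ohm mentioned above can be sketched in two lines – the values here are invented to show the shape of the operation:

```python
import pandas as pd
import numpy as np

# Hypothetical gas readings with one gap
df = pd.DataFrame({"gas_ohm": [48200.0, np.nan, 49800.0, 50200.0]})

# Impute: fill the missing reading with the column median
median = df["gas_ohm"].median()          # NaN is skipped → 49800.0
df["gas_ohm"] = df["gas_ohm"].fillna(median)
```

Median rather than mean, because the median isn't dragged around by the same outliers you are trying to clean.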

Never overwrite raw data. Always save cleaned data to a new file. Raw data is your ground truth.

📊

Part 3
Exploratory Data Analysis

Summary statistics and first visualisations

Summary Statistics

# One command gives everything
print(df.describe())

#        temp_c  hum_pct  pres_hpa
# count  8580.0   8580.0    8580.0
# mean     18.4     62.1    1013.5
# std       2.1      8.3       1.2
# min      12.1     41.0    1009.3
# 25%      16.8     57.0    1012.8
# 50%      18.2     62.4    1013.5
# 75%      19.9     67.2    1014.1
# max      24.6     89.0    1016.2

What each stat tells you

  • mean – average; where most readings cluster
  • std – spread; how variable readings are day-to-day
  • 50% (median) – middle value; robust to outliers
  • 25% / 75% – the IQR; middle 50% of data
  • min / max – confirm cleaning worked correctly

Research question check: Does mean temperature differ between sites? Is variability greater in one location? These are your first results.
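To answer the between-sites question, a single groupby gives per-site statistics. A sketch assuming you have combined both logs into one DataFrame with a hypothetical site column (the labels and numbers below are invented):

```python
import pandas as pd

# Hypothetical combined data from two sensor locations
df = pd.DataFrame({
    "site":   ["lab", "lab", "lab", "yard", "yard", "yard"],
    "temp_c": [19.0,  18.5,  19.5,  15.0,   16.0,   14.0],
})

# Mean and spread per site in one call
summary = df.groupby("site")["temp_c"].agg(["mean", "std"])
print(summary)
```

A higher std at one site is itself a finding – it might mean more sun exposure, more door traffic, or a draughtier mounting spot.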

First Visualisations

import matplotlib.pyplot as plt

# 1. Temperature over time
fig, ax = plt.subplots(figsize=(12, 4))
df['temp_c'].plot(ax=ax, color='#2dd4bf')
ax.set_ylabel("Temperature (°C)")
fig.savefig("temp_time.png", dpi=150)

# 2. Humidity histogram (new figure, so the
#    plots don't draw over each other)
fig2, ax2 = plt.subplots()
df['hum_pct'].hist(bins=30, color='#38bdf8', ax=ax2)
ax2.set_xlabel("Humidity (%)")
fig2.savefig("hum_hist.png", dpi=150)

What to look for

  • Time-series: Daily cycles? Patterns matching school hours? Anomalous spikes?
  • Histogram: Bell-shaped or skewed? Multiple peaks suggest different regimes.

Save properly: Use plt.savefig("plot.png", dpi=150) – not screenshots. This produces publication-quality figures.

Set timestamp as the DataFrame index first – the x-axis will automatically show dates when you call .plot().
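With a datetime index in place, daily cycles become much easier to see if you resample the ten-second log down to hourly means before plotting. A sketch with a synthetic index standing in for your real one:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: two hours of readings every 10 seconds
idx = pd.date_range("2026-03-10", periods=720, freq="10s")
df = pd.DataFrame({"temp_c": np.linspace(18.0, 19.0, 720)}, index=idx)

# Hourly means smooth out sensor noise; plot this instead of the raw log
hourly = df["temp_c"].resample("1h").mean()
print(len(hourly))  # → 2 (two hourly bins)
```

The same .resample("1h").mean() call on your real 8000-plus-row log collapses it to one point per hour, which makes school-hours patterns stand out clearly.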

Activity: Clean Your Data

โฑ 30 minutes

Goal: a clean CSV + three saved plots + one paragraph of methods notes.

1

(5 min) Load your CSV. Run df.shape, df.isnull().sum() and df.describe(). Write down: row count, any nulls, suspicious min/max values.

2

(8 min) Apply cleaning steps: dedup → sort by timestamp → drop nulls → filter impossible ranges. Print the row count after each step.

3

(10 min) Plot: temperature time-series, humidity time-series, temperature histogram. Save all three at 150 DPI.

4

(7 min) Compare df.describe() with the person next to you. Do your sensor locations show different means or spreads? What might explain the difference?

💡 Write one sentence per cleaning step in a Word or Markdown doc. This is your Methods section – it will directly justify every decision in your final report.

Coming Up – Session 06

Date TBC · Statistical Analysis

What we'll cover

Correlation between variables, time-series trend analysis, comparing sites using statistical tests, and distilling your findings into clear statements for the final report.

Before Session 06: Save your cleaned CSV as log_clean.csv. Bring both raw and clean files next session.

Stretch task: Plot temp and humidity on dual axes. Do they correlate during wet weather?

Questions?

jwilliams.science · HalesAir Project