โ† Back to Lectures
HalesAir Logo
HalesAir · Session 05

Data Cleaning &
Exploration

Halesowen College · T Level Data Analytics

What is dirty data · Outliers & missing values · Summary statistics in pandas · First exploratory plots
🧹

Part 1
What is Dirty Data?

Every real-world dataset has problems. Learning to find them is the skill.

Types of Data Quality Problems

Structural issues

  • Missing values – sensor disconnect, power cut, network drop
  • Duplicate rows – re-transmission or logging error
  • Wrong type – temperature stored as text "22.5°C"
  • Inconsistent format – mixed date styles in the same column

Value issues

  • Outliers – physically impossible (temp = 99°C, hum = 150%)
  • Sensor drift – steady upward creep over days
  • Stuck values – same number repeated for hours
  • Warm-up artefacts – first 3–5 minutes unreliable

Key insight: Dirty data doesn't mean bad science – it means you used a real sensor in the real world. Acknowledging and handling data quality issues is what makes your analysis credible.

Spotting Problems in Your CSV

timestamp           | temp_c | hum_pct | pres_hpa | gas_ohm | issue
2026-03-10 09:00:02 |  18.3  |   62    |  1013.2  |  48200  | ✓ clean
2026-03-10 09:00:12 |  99.0  |   61    |  1013.2  |  49100  | temp spike
(missing)           |  18.5  |   61    |  1013.3  |  49800  | missing timestamp
2026-03-10 09:00:32 |  18.5  |  150    |  1013.3  |  50200  | humidity impossible
2026-03-10 09:00:42 |  18.5  |   61    |  1013.3  |  49800  | stuck / duplicate

All of these have appeared in real HalesAir historical data. Today you'll detect and handle them systematically using pandas.
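Before cleaning anything, it helps to count how many rows show each problem. A minimal pandas sketch (the tiny inline DataFrame stands in for your real log; column names match the table above) that flags impossible values and stuck-sensor runs:

```python
import pandas as pd

# Stand-in for your real log – same columns as the table above
df = pd.DataFrame({
    "temp_c":  [18.3, 99.0, 18.5, 18.5, 18.5],
    "hum_pct": [62,   61,   61,   150,  61],
})

# Physically impossible readings
temp_spikes = df[~df["temp_c"].between(-10, 60)]
bad_hum     = df[~df["hum_pct"].between(0, 100)]

# Stuck sensor: the same value repeated in consecutive rows
stuck = df[df["temp_c"].diff() == 0]

print(len(temp_spikes), len(bad_hum), len(stuck))  # → 1 1 2
```

Counting first, cleaning second means your Methods notes can say exactly how many rows each rule affected.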

๐Ÿผ

Part 2
Loading & Cleaning with pandas

Python's essential data analysis library

Loading Your Data

import pandas as pd

# Load CSV, parse dates automatically
df = pd.read_csv(
    "log.csv",
    parse_dates=['timestamp']
)

# First checks
print(df.shape)           # (rows, cols)
print(df.dtypes)          # column types
print(df.head(5))         # first 5 rows
print(df.isnull().sum())  # null counts

What to check

  • df.shape – e.g. (8640, 5) = 8640 rows
  • df.dtypes – timestamp should be datetime64, not object
  • isnull().sum() – how many NaN per column?

Install pandas: On your laptop, not the Pico – run pip install pandas matplotlib in your laptop's terminal.

Never modify the original CSV. Load it once, then save your cleaned version separately as log_clean.csv.
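If df.dtypes shows a sensor column as object instead of a number, the CSV probably contains text like "22.5°C" (the wrong-type problem from Part 1). A hedged sketch of one way to repair it – the column values here are made up for illustration:

```python
import pandas as pd

# Hypothetical column that was logged as text with a unit attached
s = pd.Series(["22.5°C", "18.3°C", "bad reading"])

# Strip the unit, then convert; anything unparseable becomes NaN
temp = pd.to_numeric(s.str.replace("°C", "", regex=False), errors="coerce")
print(temp.dtype)  # → float64
```

errors="coerce" turns garbage entries into NaN rather than crashing, so they can be handled by the same dropna/impute steps as any other missing value.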

Cleaning: Step by Step

# 1. Remove exact duplicate rows
df = df.drop_duplicates()

# 2. Sort by time, set as index
df = df.set_index('timestamp').sort_index()

# 3. Drop rows with missing core values
df = df.dropna(subset=['temp_c', 'hum_pct'])

# 4. Filter physically impossible values
df = df[df['temp_c'].between(-10, 60)]
df = df[df['hum_pct'].between(0, 100)]

# 5. Save cleaned version
df.to_csv("log_clean.csv")

Decisions you're making

  • Drop vs impute: We drop rows missing critical values. For gas_ohm you might fill with the column median instead.
  • Physical bounds: −10 to 60°C is a realistic UK outdoor range; 0–100% for humidity is definitional.
  • Document everything: Record how many rows were removed and why – this becomes your Methods section.
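The impute option for gas_ohm mentioned above can be sketched in two lines – the values here are invented to show the shape of the operation:

```python
import pandas as pd
import numpy as np

# Hypothetical gas readings with one gap
df = pd.DataFrame({"gas_ohm": [48200.0, np.nan, 49800.0, 50200.0]})

# Impute: fill the missing reading with the column median
median = df["gas_ohm"].median()          # NaN is skipped → 49800.0
df["gas_ohm"] = df["gas_ohm"].fillna(median)
```

Median rather than mean, because the median isn't dragged around by the same outliers you are trying to clean.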

Never overwrite raw data. Always save cleaned data to a new file. Raw data is your ground truth.

📊

Part 3
Exploratory Data Analysis

Summary statistics and first visualisations

Summary Statistics

# One command gives everything
print(df.describe())

#        temp_c  hum_pct  pres_hpa
# count  8580.0   8580.0    8580.0
# mean     18.4     62.1    1013.5
# std       2.1      8.3       1.2
# min      12.1     41.0    1009.3
# 25%      16.8     57.0    1012.8
# 50%      18.2     62.4    1013.5
# 75%      19.9     67.2    1014.1
# max      24.6     89.0    1016.2

What each stat tells you

  • mean – average; where most readings cluster
  • std – spread; how variable readings are day-to-day
  • 50% (median) – middle value; robust to outliers
  • 25% / 75% – the IQR; middle 50% of data
  • min / max – confirm cleaning worked correctly

Research question check: Does mean temperature differ between sites? Is variability greater in one location? These are your first results.
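To answer the between-sites question, a single groupby gives per-site statistics. A sketch assuming you have combined both logs into one DataFrame with a hypothetical site column (the labels and numbers below are invented):

```python
import pandas as pd

# Hypothetical combined data from two sensor locations
df = pd.DataFrame({
    "site":   ["lab", "lab", "lab", "yard", "yard", "yard"],
    "temp_c": [19.0,  18.5,  19.5,  15.0,   16.0,   14.0],
})

# Mean and spread per site in one call
summary = df.groupby("site")["temp_c"].agg(["mean", "std"])
print(summary)
```

A higher std at one site is itself a finding – it might mean more sun exposure, more door traffic, or a draughtier mounting spot.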

First Visualisations

import matplotlib.pyplot as plt

# 1. Temperature over time
fig, ax = plt.subplots(figsize=(12, 4))
df['temp_c'].plot(ax=ax, color='#2dd4bf')
ax.set_ylabel("Temperature (°C)")
fig.savefig("temp_time.png", dpi=150)

# 2. Humidity histogram (new figure, so the
#    plots don't draw over each other)
fig2, ax2 = plt.subplots()
df['hum_pct'].hist(bins=30, color='#38bdf8', ax=ax2)
ax2.set_xlabel("Humidity (%)")
fig2.savefig("hum_hist.png", dpi=150)

What to look for

  • Time-series: Daily cycles? Patterns matching school hours? Anomalous spikes?
  • Histogram: Bell-shaped or skewed? Multiple peaks suggest different regimes.

Save properly: Use plt.savefig("plot.png", dpi=150) – not screenshots. This produces publication-quality figures.

Set timestamp as the DataFrame index first – the x-axis will automatically show dates when you call .plot().
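With a datetime index in place, daily cycles become much easier to see if you resample the ten-second log down to hourly means before plotting. A sketch with a synthetic index standing in for your real one:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: two hours of readings every 10 seconds
idx = pd.date_range("2026-03-10", periods=720, freq="10s")
df = pd.DataFrame({"temp_c": np.linspace(18.0, 19.0, 720)}, index=idx)

# Hourly means smooth out sensor noise; plot this instead of the raw log
hourly = df["temp_c"].resample("1h").mean()
print(len(hourly))  # → 2 (two hourly bins)
```

The same .resample("1h").mean() call on your real 8000-plus-row log collapses it to one point per hour, which makes school-hours patterns stand out clearly.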

Activity: Clean Your Data

โฑ 30 minutes

Goal: a clean CSV + three saved plots + one paragraph of methods notes.

1

(5 min) Load your CSV. Run df.shape, df.isnull().sum() and df.describe(). Write down: row count, any nulls, suspicious min/max values.

2

(8 min) Apply cleaning steps: dedup → sort by timestamp → drop nulls → filter impossible ranges. Print the row count after each step.

3

(10 min) Plot: temperature time-series, humidity time-series, temperature histogram. Save all three at 150 DPI.

4

(7 min) Compare df.describe() with the person next to you. Do your sensor locations show different means or spreads? What might explain the difference?

💡 Write one sentence per cleaning step in a Word or Markdown doc. This is your Methods section – it will directly justify every decision in your final report.

Coming Up – Session 06

Date TBC · Statistical Analysis

What we'll cover

Correlation between variables, time-series trend analysis, comparing sites using statistical tests, and distilling your findings into clear statements for the final report.

Before Session 06: Save your cleaned CSV as log_clean.csv. Bring both raw and clean files next session.

Stretch task: Plot temp and humidity on dual axes. Do they correlate during wet weather?

Questions?

jwilliams.science · HalesAir Project