Halesowen College · T Level Data Analytics
Every real-world dataset has problems. Learning to find them is the skill.
Key insight: Dirty data doesn't mean bad science. It means you used a real sensor in the real world. Acknowledging and handling data quality issues is what makes your analysis credible.
| timestamp | temp_c | hum_pct | pres_hpa | gas_ohm | issue |
|---|---|---|---|---|---|
| 2026-03-10 09:00:02 | 18.3 | 62 | 1013.2 | 48200 | ✓ clean |
| 2026-03-10 09:00:12 | 99.0 | 61 | 1013.2 | 49100 | temp spike |
| | 18.5 | 61 | 1013.3 | 49800 | missing timestamp |
| 2026-03-10 09:00:32 | 18.5 | 150 | 1013.3 | 50200 | humidity impossible |
| 2026-03-10 09:00:42 | 18.5 | 61 | 1013.3 | 49800 | stuck / duplicate |
All of these have appeared in real HalesAir historical data. Today you'll detect and handle them systematically using pandas.
Python's essential data analysis library
df.shape → e.g. (8640, 5) = 8640 rows
df.dtypes → timestamp should be datetime64, not object
df.isnull().sum() → how many NaN per column?
Install pandas: On your laptop, not the Pico, run pip install pandas matplotlib in your laptop's terminal.
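A minimal sketch of these first checks. It assumes the column names from the example table and uses a few of its rows typed inline as a stand-in for a real log file, so the `raw_csv` text is illustrative only.

```python
import io
import pandas as pd

# Stand-in for a real log file: rows copied from the example table.
raw_csv = io.StringIO(
    "timestamp,temp_c,hum_pct,pres_hpa,gas_ohm\n"
    "2026-03-10 09:00:02,18.3,62,1013.2,48200\n"
    "2026-03-10 09:00:12,99.0,61,1013.2,49100\n"
    ",18.5,61,1013.3,49800\n"
    "2026-03-10 09:00:32,18.5,150,1013.3,50200\n"
    "2026-03-10 09:00:42,18.5,61,1013.3,49800\n"
)

df = pd.read_csv(raw_csv)

print(df.shape)           # (rows, columns)
print(df.dtypes)          # timestamp is read as text until it is converted
print(df.isnull().sum())  # NaN count per column

# Convert the timestamp column to a real datetime type.
# The empty timestamp becomes NaT, which still counts as null.
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df.dtypes)
```

With a real log you would pass the file path to `pd.read_csv` instead of `raw_csv`.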
Never modify the original CSV. Load it once, then save your cleaned version separately as log_clean.csv.
Never overwrite raw data. Always save cleaned data to a new file. Raw data is your ground truth.
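One way to keep raw and cleaned data apart is sketched below. The raw file name `log.csv` is an assumption, and the temporary folder and two-row file exist only so the example can run anywhere without touching your real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Work in a temporary folder so this sketch is safe to run anywhere.
folder = Path(tempfile.mkdtemp())
raw_path = folder / "log.csv"          # assumed name for the raw file
clean_path = folder / "log_clean.csv"  # cleaned copy, saved separately

# Create a tiny raw file to stand in for a real sensor log.
raw_path.write_text(
    "timestamp,temp_c,hum_pct\n"
    "2026-03-10 09:00:02,18.3,62\n"
    ",18.5,61\n"
)
raw_before = raw_path.read_text()

# Load once, clean in memory, then write to a different file.
df = pd.read_csv(raw_path, parse_dates=["timestamp"])
df_clean = df.dropna()

# index=False stops pandas writing the row numbers as an extra column.
df_clean.to_csv(clean_path, index=False)

# The raw file is unchanged; only the new file holds the cleaned rows.
print(raw_path.read_text() == raw_before)
print(len(df), len(df_clean))
```

Because the cleaned rows go to a different path, the pipeline can be rerun from the raw file at any time.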
Summary statistics and first visualisations
Research question check: Does mean temperature differ between sites? Is variability greater in one location? These are your first results.
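A sketch of how the site comparison might look. It assumes the readings from both locations have been combined into one DataFrame with a `site` column, which is a hypothetical name; the numbers are made up purely to show the shape of the output.

```python
import pandas as pd

# Made-up readings for two hypothetical sites, for illustration only.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "temp_c": [18.0, 18.5, 19.0, 20.0, 22.0, 24.0],
})

# mean: does temperature differ between sites?
# std:  is one site more variable than the other?
summary = df.groupby("site")["temp_c"].agg(["count", "mean", "std"])
print(summary)
```

If each site has its own CSV, calling `df.describe()` on each file separately gives the same information.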
Save properly: Use plt.savefig("plot.png", dpi=150), not screenshots. This produces publication-quality figures.
Set timestamp as the DataFrame index first. The x-axis will then show dates automatically when you call .plot().
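A short sketch of a time-series plot and a histogram saved at 150 DPI. The readings and output file names are made up for illustration, and the `Agg` backend line is only there so the script runs without opening a window.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render straight to files; can be omitted on a normal laptop
import matplotlib.pyplot as plt
import pandas as pd

# Made-up readings, for illustration only.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2026-03-10 09:00:02", "2026-03-10 09:00:12", "2026-03-10 09:00:22",
        "2026-03-10 09:00:32", "2026-03-10 09:00:42",
    ]),
    "temp_c": [18.3, 18.4, 18.5, 18.5, 18.6],
})

# With a datetime index, .plot() puts time on the x-axis automatically.
df = df.set_index("timestamp")

# Time-series plot
ax = df["temp_c"].plot(title="Temperature over time")
ax.set_xlabel("Time")
ax.set_ylabel("Temperature (°C)")
series_path = Path("temp_timeseries.png")  # hypothetical file name
plt.savefig(series_path, dpi=150, bbox_inches="tight")
plt.close()

# Histogram of the same column
ax = df["temp_c"].plot(kind="hist", bins=5, title="Temperature distribution")
ax.set_xlabel("Temperature (°C)")
hist_path = Path("temp_hist.png")  # hypothetical file name
plt.savefig(hist_path, dpi=150, bbox_inches="tight")
plt.close()
```

Calling `plt.close()` after each save stops the next plot from being drawn on top of the previous one.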
Goal: a clean CSV + three saved plots + one paragraph of methods notes.
(5 min) Load your CSV. Run df.shape, df.isnull().sum() and df.describe(). Write down: row count, any nulls, suspicious min/max values.
(8 min) Apply cleaning steps: dedup → sort by timestamp → drop nulls → filter impossible ranges. Print row count after each step.
(10 min) Plot: temperature time-series, humidity time-series, temperature histogram. Save all three at 150 DPI.
(7 min) Compare df.describe() with the person next to you. Do your sensor locations show different means or spreads? What might explain the difference?
💡 Write one sentence per cleaning step in a Word or Markdown doc. This is your Methods section: it will directly justify every decision in your final report.
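The four cleaning steps could be sketched as follows. The inline rows are adapted from the example table, with one exact duplicate added. The temperature limits of -10 to 50 °C are an assumption about what is plausible for these sites, not a value from the project; humidity is limited to 0 to 100 because it is a percentage.

```python
import io
import pandas as pd

# Stand-in for a raw log; rows adapted from the example table.
raw_csv = io.StringIO(
    "timestamp,temp_c,hum_pct,pres_hpa,gas_ohm\n"
    "2026-03-10 09:00:12,99.0,61,1013.2,49100\n"   # temp spike
    "2026-03-10 09:00:02,18.3,62,1013.2,48200\n"   # clean, but out of order
    ",18.5,61,1013.3,49800\n"                      # missing timestamp
    "2026-03-10 09:00:32,18.5,150,1013.3,50200\n"  # impossible humidity
    "2026-03-10 09:00:42,18.5,61,1013.3,49800\n"   # clean
    "2026-03-10 09:00:42,18.5,61,1013.3,49800\n"   # exact duplicate row
)
df = pd.read_csv(raw_csv, parse_dates=["timestamp"])
print("loaded:", len(df))

# 1. Remove rows that are identical in every column.
df = df.drop_duplicates()
print("after dedup:", len(df))

# 2. Put readings in time order.
df = df.sort_values("timestamp")
print("after sort:", len(df))

# 3. Drop rows with any missing value.
df = df.dropna()
print("after dropping nulls:", len(df))

# 4. Keep only physically plausible readings (assumed limits, see above).
in_range = df["temp_c"].between(-10, 50) & df["hum_pct"].between(0, 100)
df = df[in_range].reset_index(drop=True)
print("after range filter:", len(df))
```

On this sample the row counts go 6, 5, 5, 4, 2. Note that `drop_duplicates()` only removes exact copies; a stuck sensor that repeats values under new timestamps needs a separate check.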
Date TBC · Statistical Analysis
Correlation between variables, time-series trend analysis, comparing sites using statistical tests, and distilling your findings into clear statements for the final report.
Before Session 06: Save your cleaned CSV as log_clean.csv. Bring both raw and clean files next session.
Stretch task: Plot temp and humidity on dual axes. Do they correlate during wet weather?
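One possible starting point for the stretch task, using matplotlib's `twinx()` to add a second y-axis. The readings and output file name are made up for illustration; swap in your cleaned DataFrame to see what your own data shows.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render straight to a file; can be omitted on a normal laptop
import matplotlib.pyplot as plt
import pandas as pd

# Made-up readings, for illustration only.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2026-03-10 09:00", "2026-03-10 10:00", "2026-03-10 11:00",
        "2026-03-10 12:00", "2026-03-10 13:00",
    ]),
    "temp_c": [18.0, 18.5, 19.0, 19.5, 20.0],
    "hum_pct": [66, 63, 62, 59, 58],
}).set_index("timestamp")

fig, ax_temp = plt.subplots()
ax_hum = ax_temp.twinx()  # second y-axis sharing the same x-axis

ax_temp.plot(df.index, df["temp_c"], color="tab:red")
ax_hum.plot(df.index, df["hum_pct"], color="tab:blue")

ax_temp.set_xlabel("Time")
ax_temp.set_ylabel("Temperature (°C)", color="tab:red")
ax_hum.set_ylabel("Humidity (%)", color="tab:blue")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap

dual_path = Path("temp_humidity.png")  # hypothetical file name
fig.savefig(dual_path, dpi=150, bbox_inches="tight")
plt.close(fig)

# Pearson correlation: near +1 or -1 means a strong linear relationship.
r = df["temp_c"].corr(df["hum_pct"])
print(round(r, 2))
```

A correlation coefficient on its own does not say anything about wet weather; you would need to pick out the wet periods first and compare them with dry ones.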
Questions?
jwilliams.science · HalesAir Project