Cancer Incidence and Environmental Pollutants

Creating a thriving and resilient community with machine-learning and data driven evidence.

Pollution Data Scatter Plot
Ozone Data Scatter Plot
The pollution and ozone datasets were broken down by year from 2001 to 2016. The cancer data set comprised a single “recent trend” data point per FIPS based on data from 2000 to 2014. The years 2015 and 2016 were removed from the environmental data to create a cohesive dataset, then all three datasets were aggregated using the pollution and ozone mean values by FIPS and year.
Model Dataset Table
FIPS Recent Trend PM25 Max Pred PM25 Med Pred PM25 Mean Pred O3 Max Pred O3 Med Pred O3 Mean Pred PM25 Max Pred 2001 PM25 Max Pred 2002 PM25 Max Pred 2003 PM25 Max Pred 2004 PM25 Max Pred 2005 PM25 Max Pred 2006
The confusion matrix generated using the SMOTE algorithm still favors the majority class, classifying around 90% of the stable class accurately, and only 20% of the falling class, and none of the rising. After applying the cluster centroid algorithm, the data points were undersampled to 37 data points for each class. This improved the overall balanced accuracy from 37% to 51%.
Oversampling
Confusion Matrix SMOTE Resampling
Confusion Matrix
Undersampling
Confusion Matrix Cluster Centroid Resampling
Confusion Matrix