Biodiversity loss is a pressing global concern, and U.S. National Parks serve as critical habitats for many vulnerable species. This project analyzed the conservation statuses of species across U.S. National Parks using datasets provided by Codecademy, which are modeled on National Park Service data but largely fictionalized.
The primary aim was to identify endangered species, detect patterns across parks, and explore ecological factors that might influence conservation status. The full data lifecycle was covered: cleaning, exploratory analysis, statistical testing, visualization, and interpretation.
The two datasets (species_info.csv and observations.csv) were cleaned with Pandas: missing values filled, species names standardized, datasets merged, and categorical variables encoded for analysis.
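The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names (scientific_name, conservation_status, park_name, observations), the "No Intervention" fill value, the is_protected flag, and the tiny inline tables standing in for the two CSV files are all assumptions made for the example.

```python
import pandas as pd

# Made-up stand-ins for species_info.csv and observations.csv.
# Column names and values are assumptions for illustration only.
species = pd.DataFrame({
    "category": ["Mammal", "Bird", "Mammal", "Vascular Plant"],
    "scientific_name": [" canis lupus", "Haliaeetus leucocephalus",
                        "Ursus americanus ", "Zizia aptera"],
    "conservation_status": ["Endangered", "In Recovery", None, None],
})
observations = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus",
                        "Ursus americanus", "Zizia aptera"],
    "park_name": ["Yellowstone National Park", "Bryce National Park",
                  "Yellowstone National Park", "Bryce National Park"],
    "observations": [60, 27, 150, 90],
})


def clean_species(df):
    """Fill missing statuses, standardize names, and add a binary flag."""
    out = df.copy()
    # Assumption: a missing status means the species needs no intervention.
    out["conservation_status"] = out["conservation_status"].fillna("No Intervention")
    # Standardize names: trim whitespace, capitalize only the first letter.
    out["scientific_name"] = out["scientific_name"].str.strip().str.capitalize()
    # Encode the categorical status as a 0/1 flag for later analysis.
    out["is_protected"] = (out["conservation_status"] != "No Intervention").astype(int)
    # Keep one row per species so the merge does not multiply observations.
    return out.drop_duplicates(subset="scientific_name")


cleaned = clean_species(species)

# Attach species attributes to every park observation.
merged = observations.merge(cleaned, on="scientific_name", how="left")
print(merged[["scientific_name", "park_name", "observations", "is_protected"]])
```

Dropping duplicate species before the merge is a design choice in this sketch: it keeps each observation row from being repeated when a species appears more than once in the species table.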
Exploratory analysis covered species distribution by conservation status, common characteristics among endangered species, and observation patterns across parks. Chi-squared tests via SciPy were used to check whether certain species categories were significantly more likely to be endangered than others.
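As a hedged sketch of the chi-squared step, the snippet below compares two species categories using scipy.stats.chi2_contingency. The counts are invented for illustration and are not results from this project; the choice of categories, the 0.05 threshold, and the variable names are likewise assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are species categories,
# columns are counts of protected vs. not-protected species.
# These numbers are made up for illustration only.
contingency = pd.DataFrame(
    {"protected": [30, 5], "not_protected": [150, 75]},
    index=["Mammal", "Reptile"],
)

# For a 2x2 table, SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Assumed significance level; adjust to the analysis at hand.
alpha = 0.05
verdict = "significant" if p_value < alpha else "not significant"
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof} -> difference is {verdict}")
```

The expected array returned by the test holds the counts that would be seen if category and protection status were independent, which is useful for checking that no cell is too small for the test to be reliable.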
Visualizations were built with Matplotlib and Seaborn. The entire project was written and executed in Jupyter Notebook.
Even with fictionalized data, the project made clear how complex conservation work is. Quantitative analysis alone is not enough: ecological context is needed to interpret what the numbers actually mean.
Next steps would include applying the same techniques to real NPS datasets and integrating geospatial analysis for deeper geographic insights.