College Admissions Data
Using the geostates package
geostates can be used to create choropleth plots of the United States or of individual states. It is easy to use, so we will start with an example that shows some of the ins and outs of the package.
Admissions analysis
Goal: To illustrate the power of the package, we will start out by creating a plot that shows how the number of Princeton University acceptances varies by state in the United States.
We will start by importing the pandas package. The geostates functions are imported further down, at the point where they are first needed.
[1]:
import pandas as pd
[2]:
%matplotlib inline
Loading in the data
For this example, we use admissions data on the Princeton University Class of 2025 from the Princeton University Undergraduate Admissions Department. The CSV includes the total number of admits in the United States as of 30 August 2021 broken down by each geography (state).
[16]:
# read in the data
admissions_data = pd.read_csv('Desktop/admissions_data_22.csv', index_col='state')
[20]:
# take a look at what the CSV file looks like
admissions_data.head()
[20]:
state | admits |
---|---|
Washington | 17 |
Oregon | 6 |
California | 140 |
Nevada | 4 |
Montana | 1 |
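If the CSV file is not at hand, the five rows printed above are enough to follow along. The sketch below types them in by hand and runs two basic checks that might also be applied to the real file. The name sample_admissions is illustrative and does not appear elsewhere in this notebook.

```python
import pandas as pd

# The five rows printed by admissions_data.head() above, entered by hand.
# `sample_admissions` is a hypothetical stand-in for the real CSV contents.
sample_admissions = pd.DataFrame(
    {
        "state": ["Washington", "Oregon", "California", "Nevada", "Montana"],
        "admits": [17, 6, 140, 4, 1],
    }
).set_index("state")

# Basic sanity checks before any analysis or merging.
states_are_unique = sample_admissions.index.is_unique
counts_are_valid = bool((sample_admissions["admits"] >= 0).all())

print(sample_admissions)
print("unique states:", states_are_unique, "| non-negative counts:", counts_are_valid)
```

Setting state as the index mirrors the index_col='state' argument used when reading the CSV.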
Analyzing the data
Now let’s take a look at which states have the most admits by sorting the table by the admits column in descending order.
[26]:
# sort the values to see which states have the most admits
sorted_admits_data = admissions_data.sort_values(by='admits', ascending=False)
# view the first 10 values of the sorted pandas dataframe
sorted_admits_data.head(10)
[26]:
state | admits |
---|---|
New Jersey | 188 |
California | 140 |
New York | 133 |
Massachusetts | 76 |
Pennsylvania | 68 |
Texas | 52 |
Florida | 50 |
Connecticut | 42 |
Maryland | 41 |
Illinois | 32 |
The table above shows that New Jersey, California, and New York have the most admits for the Princeton undergraduate Class of 2025.
[46]:
# see what percent of the total number of domestic admits come from these top three states
# calculate the total number of admits from New Jersey, California, and New York
top_three_total_admits = sorted_admits_data.head(3)['admits'].sum()
print('Total admits from top three states:', top_three_total_admits, 'students')
# calculate the total number of domestic admits
total_domestic_admits = sorted_admits_data['admits'].sum()
print('Total domestic admits:', total_domestic_admits, 'students')
# calculate the percent of the total admits that these three states contribute
percent = (top_three_total_admits/total_domestic_admits)
print('{:.2%}'.format(percent), 'of domestic admits come from NJ, CA, and NY')
Total admits from top three states: 461 students
Total domestic admits: 1145 students
40.26% of domestic admits come from NJ, CA, and NY
This is interesting! It turns out that just three states account for over 40% of the domestic undergraduate admits to Princeton University.
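The same idea extends beyond the top three states. As a rough sketch, the block below takes the ten rows from the sorted table and the total of 1145 domestic admits printed above, then computes each state's share and the running cumulative share. The names top_ten, share, and cumulative_share are illustrative.

```python
import pandas as pd

# Top ten states, copied from the sorted table above.
top_ten = pd.Series(
    {
        "New Jersey": 188, "California": 140, "New York": 133,
        "Massachusetts": 76, "Pennsylvania": 68, "Texas": 52,
        "Florida": 50, "Connecticut": 42, "Maryland": 41, "Illinois": 32,
    },
    name="admits",
)

# Total domestic admits, as printed by the cell above.
total_domestic_admits = 1145

# Share of all domestic admits per state, and the running total of shares.
share = top_ten / total_domestic_admits
cumulative_share = share.cumsum()

summary = pd.DataFrame(
    {"admits": top_ten, "share": share, "cumulative_share": cumulative_share}
)
print(summary.round(4))
```

The cumulative_share value in the New York row reproduces the top-three figure computed above, and later rows show how quickly the remaining states add to it.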
Visualize the data using geostates
The first step in using the geostates package is to load the geodataframe containing all of the state geometries. For this, we will use the load_states() function and assign the result to a variable df. Once we’ve loaded the geodataframe, we need to merge it with our admissions data.
[47]:
# import the load_states() function from the geostates package
from geostates.shapefiles import load_states
[48]:
# load in the geodataframe and assign it to df
df = load_states()
df.head()
[48]:
STUSPS | STATEFP | STATENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry |
---|---|---|---|---|---|---|---|---|---|
MS | 28 | 01779790 | 0400000US28 | 28 | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |
NC | 37 | 01027616 | 0400000US37 | 37 | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |
OK | 40 | 01102857 | 0400000US40 | 40 | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |
VA | 51 | 01779803 | 0400000US51 | 51 | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |
WV | 54 | 01779805 | 0400000US54 | 54 | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |
Merging the data
To successfully create a choropleth map of the college admissions data, we need to merge it with the geodataframe that contains the geometry needed to draw the states. We can do this with the pandas merge function. Since the index of the college admissions data is state and our geodataframe contains a matching column (NAME), we can merge the two dataframes on this value. Let’s start by renaming the NAME column in our geodataframe to state so that the key has the same name in both dataframes.
[49]:
# rename the 'NAME' column in the geodataframe to 'state'
geo_df = df.rename(columns={'NAME': 'state'})
geo_df.head()
[49]:
STUSPS | STATEFP | STATENS | AFFGEOID | GEOID | state | LSAD | ALAND | AWATER | geometry |
---|---|---|---|---|---|---|---|---|---|
MS | 28 | 01779790 | 0400000US28 | 28 | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |
NC | 37 | 01027616 | 0400000US37 | 37 | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |
OK | 40 | 01102857 | 0400000US40 | 40 | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |
VA | 51 | 01779803 | 0400000US51 | 51 | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |
WV | 54 | 01779805 | 0400000US54 | 54 | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |
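Renaming is one way to line up the keys. As an alternative sketch, pandas can also merge on differently named columns through left_on and right_on. The block below uses small plain dataframes as stand-ins (the states frame is a hypothetical substitute for the geodataframe, with the geometry column omitted), so it runs without geostates installed.

```python
import pandas as pd

# Stand-in for the admissions data: 'state' is the index, as in the CSV.
admissions = pd.DataFrame(
    {"admits": [17, 6]},
    index=pd.Index(["Washington", "Oregon"], name="state"),
)

# Hypothetical stand-in for the geodataframe: STUSPS index, NAME column,
# geometry left out to keep the example small.
states = pd.DataFrame(
    {"NAME": ["Washington", "Oregon"], "STATEFP": ["53", "41"]},
    index=pd.Index(["WA", "OR"], name="STUSPS"),
)

# Merge on differently named keys instead of renaming NAME to state.
merged = pd.merge(
    admissions.reset_index(),  # turn the 'state' index into a column
    states,
    left_on="state",
    right_on="NAME",
    how="outer",
)
print(merged)
```

This keeps both key columns (state and NAME) in the result, which is why renaming first, as done above, gives a tidier table.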
Important: To make sure that we do not accidentally lose any data during the merge, we include the how='outer' parameter in the merge statement. An outer merge keeps every row from both dataframes, even where a state name has no match on the other side.
[55]:
data = pd.merge(admissions_data, geo_df, on='state', how='outer')
data.head()
[55]:
 | state | admits | STATEFP | STATENS | AFFGEOID | GEOID | LSAD | ALAND | AWATER | geometry |
---|---|---|---|---|---|---|---|---|---|---|
0 | Washington | 17 | 53 | 01779804 | 0400000US53 | 53 | 00 | 172112588220 | 12559278850 | MULTIPOLYGON (((-122.57039 48.53785, -122.5686... |
1 | Oregon | 6 | 41 | 01155107 | 0400000US41 | 41 | 00 | 248606993270 | 6192386935 | MULTIPOLYGON (((-123.59892 46.25145, -123.5984... |
2 | California | 140 | 06 | 01779778 | 0400000US06 | 06 | 00 | 403503931312 | 20463871877 | MULTIPOLYGON (((-118.60442 33.47855, -118.5987... |
3 | Nevada | 4 | 32 | 01779793 | 0400000US32 | 32 | 00 | 284329506470 | 2047206072 | POLYGON ((-120.00574 39.22866, -120.00559 39.2... |
4 | Montana | 1 | 30 | 00767982 | 0400000US30 | 30 | 00 | 376962738765 | 3869208832 | POLYGON ((-116.04914 48.50205, -116.04913 48.5... |
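Because an outer merge keeps unmatched rows rather than dropping them, it is worth checking whether any state names failed to line up. One way to do this, sketched below with toy data, is the indicator=True option of pd.merge, which adds a _merge column marking each row as both, left_only, or right_only. The trailing space in "Oregon " is a deliberately made-up typo for illustration and is not taken from the real data.

```python
import pandas as pd

# Toy admissions data; "Oregon " (trailing space) is a hypothetical typo.
admissions = pd.DataFrame(
    {"state": ["Washington", "Oregon "], "admits": [17, 6]}
)

# Toy stand-in for the renamed geodataframe (geometry omitted).
geo = pd.DataFrame(
    {"state": ["Washington", "Oregon", "Nevada"], "STATEFP": ["53", "41", "32"]}
)

# indicator=True records where each row of the result came from.
checked = pd.merge(admissions, geo, on="state", how="outer", indicator=True)

# Rows that did not find a partner on the other side.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["state", "_merge"]])
```

Rows marked left_only usually point to spelling or whitespace problems in the data file, while right_only rows are states with no admissions record.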
Plotting the data
[56]:
# import the plot_states() function from geostates
from geostates.plot import plot_states
[57]:
# create a choropleth map that displays the admits for each state in the United States
# plot = plot_states(data_2, column='admits', cmap=new_cmap, labels='both', linestyle='none', legend='legend',
#bins=15)
# add a title to the plot
# plot.annotate('Princeton Admissions Data 2022', xy=(-97, 50.5), fontsize=18, ha='center');
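The commented-out call above passes a colormap named new_cmap, which is not defined in the cells shown here. As a guess at how such a colormap might be built, the sketch below uses matplotlib's LinearSegmentedColormap.from_list to create a simple two-color ramp. The colors and the colormap name are assumptions chosen for illustration, not the ones used by the original author.

```python
from matplotlib.colors import LinearSegmentedColormap

# Hypothetical stand-in for `new_cmap`: a white-to-orange ramp.
# Both hex colors are illustrative choices, not taken from the notebook.
new_cmap = LinearSegmentedColormap.from_list(
    "white_to_orange", ["#ffffff", "#e77500"]
)

# A colormap maps values in [0, 1] to RGBA tuples.
lowest_color = new_cmap(0.0)
highest_color = new_cmap(1.0)
print("lowest:", lowest_color)
print("highest:", highest_color)
```

A light-to-dark ramp like this tends to suit count data, since states with few admits fade toward the background while states with many stand out.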