College Admissions Data

Using the geostates package

The geostates package creates choropleth plots of the United States or of individual states. We will work through an example to show you some of the ins and outs of the package.

Admissions analysis

Goal: To illustrate the power of the package, we will start out by creating a plot that shows how the number of Princeton University acceptances varies by state in the United States.

We will start by importing pandas. The geostates functions are imported later, at the point where they are first needed.

[1]:
import pandas as pd
[2]:
%matplotlib inline

Loading in the data

For this example, we use admissions data on the Princeton University Class of 2025 from the Princeton University Undergraduate Admissions Department. The CSV lists the number of admits from the United States as of 30 August 2021, broken down by state.

[16]:
# read in the data
admissions_data = pd.read_csv('Desktop/admissions_data_22.csv', index_col='state')
[20]:
# take a look at what the CSV file looks like
admissions_data.head()
[20]:
            admits
state
Washington      17
Oregon           6
California     140
Nevada           4
Montana          1
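
To follow along without the original CSV file, the five rows shown above can be rebuilt in memory. This is only a minimal sketch: the `csv_text` and `sample` names are illustrative and not part of the tutorial, and the column names `state` and `admits` are taken from the output above.

```python
import io

import pandas as pd

# The five rows displayed by admissions_data.head(), typed in as CSV text.
csv_text = """state,admits
Washington,17
Oregon,6
California,140
Nevada,4
Montana,1
"""

# index_col='state' makes the state name the row label, as in the cell above.
sample = pd.read_csv(io.StringIO(csv_text), index_col='state')
print(sample)
```

Because `state` is the index here, a single value can be looked up with `sample.loc['California', 'admits']`.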

Analyzing the data

Now let’s take a look at which states have the most admits by sorting the dataframe in descending order of admits.

[26]:
# sort the values to see which states have the most admits
sorted_admits_data = admissions_data.sort_values(by='admits', ascending=False)

# view the first 10 values of the sorted pandas dataframe
sorted_admits_data.head(10)
[26]:
               admits
state
New Jersey        188
California        140
New York          133
Massachusetts      76
Pennsylvania       68
Texas              52
Florida            50
Connecticut        42
Maryland           41
Illinois           32

The table above shows that New Jersey, California, and New York have the most admits in the Princeton undergraduate Class of 2025.
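
Sorting the whole table and calling `head()` works well. When only the largest few values are needed, pandas also offers `nlargest`. The following is a minimal sketch using the ten values from the table above, entered in a deliberately shuffled order; the `admits` and `top_three` names are illustrative.

```python
import pandas as pd

# Values copied from the sorted table above, entered out of order on purpose.
admits = pd.Series(
    {
        'Texas': 52,
        'New York': 133,
        'Illinois': 32,
        'New Jersey': 188,
        'Maryland': 41,
        'California': 140,
        'Florida': 50,
        'Massachusetts': 76,
        'Connecticut': 42,
        'Pennsylvania': 68,
    },
    name='admits',
)

# nlargest returns the n largest values, already sorted in descending order.
top_three = admits.nlargest(3)
print(top_three)
```

This avoids keeping a fully sorted copy around when only the top few rows matter.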

[46]:
# see what percent of the total number of domestic admits come from these top three states

# calculate the total number of admits from New Jersey, California, and New York
top_three_total_admits = sorted_admits_data.head(3)['admits'].sum()
print('Total admits from top three states:', top_three_total_admits, 'students')

# calculate the total number of domestic admits
total_domestic_admits = sorted_admits_data['admits'].sum()
print('Total domestic admits:', total_domestic_admits, 'students')

# calculate the percent of the total admits that these three states contribute
percent = (top_three_total_admits/total_domestic_admits)
print('{:.2%}'.format(percent), 'of domestic admits come from NJ, CA, and NY')
Total admits from top three states: 461 students
Total domestic admits: 1145 students
40.26% of domestic admits come from NJ, CA, and NY

This is interesting! It turns out that just three states account for over 40% of the domestic undergraduate admits to Princeton University.
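
The same idea extends beyond the top three states: a running total shows how quickly the share of admits accumulates as more states are included. This sketch reuses the ten values from the sorted table and the total of 1145 domestic admits printed above; the `top_ten`, `share`, and `summary` names are illustrative.

```python
import pandas as pd

# Total number of domestic admits, as printed by the cell above.
total_domestic_admits = 1145

# Top-ten values copied from the sorted table, in descending order.
top_ten = pd.Series(
    {
        'New Jersey': 188,
        'California': 140,
        'New York': 133,
        'Massachusetts': 76,
        'Pennsylvania': 68,
        'Texas': 52,
        'Florida': 50,
        'Connecticut': 42,
        'Maryland': 41,
        'Illinois': 32,
    },
    name='admits',
)

# Share of all domestic admits contributed by each state.
share = top_ten / total_domestic_admits

# Running total of those shares, from the largest state downward.
summary = pd.DataFrame(
    {'admits': top_ten, 'share': share, 'cumulative_share': share.cumsum()}
)
print(summary.round(4))
```

The `cumulative_share` value on the New York row corresponds to the three-state percentage computed above.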

Visualize the data using geostates

The first step in using the geostates package is to load the geodataframe containing the shape information for all of the states. For this, we will call the load_states() function and assign the result to a variable, df. Once we’ve loaded the geodataframe, we need to merge it with our admissions data.

[47]:
# import the load_states() function from the geostates package
from geostates.shapefiles import load_states
[48]:
# load in the geodataframe and assign it to df
df = load_states()
df.head()
[48]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER geometry
STUSPS
MS 28 01779790 0400000US28 28 Mississippi 00 121533519481 3926919758 MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...
NC 37 01027616 0400000US37 37 North Carolina 00 125923656064 13466071395 MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
OK 40 01102857 0400000US40 40 Oklahoma 00 177662925723 3374587997 POLYGON ((-103.00257 36.52659, -103.00219 36.6...
VA 51 01779803 0400000US51 51 Virginia 00 102257717110 8528531774 MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
WV 54 01779805 0400000US54 54 West Virginia 00 62266474513 489028543 POLYGON ((-82.64320 38.16909, -82.64300 38.169...

Merging the data

In order to successfully create a choropleth map of the college admissions data, we need to merge it with the geodataframe, which contains all the information needed to plot the states. We can do this with the pandas merge function. Since the index of the college admissions data is state and our geodataframe contains a matching column (NAME), we can use these values to merge the two dataframes. Let’s start by renaming the NAME column in our geodataframe to state so that the names match.

[49]:
# rename the 'NAME' column in the geodataframe to 'state'
geo_df = df.rename(columns={'NAME': 'state'})
geo_df.head()
[49]:
STATEFP STATENS AFFGEOID GEOID state LSAD ALAND AWATER geometry
STUSPS
MS 28 01779790 0400000US28 28 Mississippi 00 121533519481 3926919758 MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...
NC 37 01027616 0400000US37 37 North Carolina 00 125923656064 13466071395 MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
OK 40 01102857 0400000US40 40 Oklahoma 00 177662925723 3374587997 POLYGON ((-103.00257 36.52659, -103.00219 36.6...
VA 51 01779803 0400000US51 51 Virginia 00 102257717110 8528531774 MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
WV 54 01779805 0400000US54 54 West Virginia 00 62266474513 489028543 POLYGON ((-82.64320 38.16909, -82.64300 38.169...
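
Renaming is one way to line the keys up. pandas can also match a column on one side against the index on the other, which leaves the NAME column untouched. The sketch below uses small stand-in tables rather than the real geodataframe: `shapes` and `admissions` are illustrative names, and the GEOID and admits values are copied from the outputs in this tutorial.

```python
import pandas as pd

# Stand-in for the admissions data: 'state' is the index, as in the tutorial.
admissions = pd.DataFrame(
    {'admits': [17, 6, 140]},
    index=pd.Index(['Washington', 'Oregon', 'California'], name='state'),
)

# Stand-in for the geodataframe: state names live in the 'NAME' column.
shapes = pd.DataFrame(
    {
        'NAME': ['California', 'Oregon', 'Washington', 'Nevada'],
        'GEOID': ['06', '41', '53', '32'],
    },
    index=pd.Index(['CA', 'OR', 'WA', 'NV'], name='STUSPS'),
)

# Match the 'NAME' column on the left against the index on the right.
# how='left' keeps every row of the shape table, matched or not.
merged = shapes.merge(admissions, left_on='NAME', right_index=True, how='left')
print(merged)
```

Nevada has no row in this cut-down `admissions` table, so its `admits` value comes back as NaN. If the shape table is a geopandas GeoDataFrame, calling `merge` on it in this way should also keep the result a GeoDataFrame, whereas passing a plain DataFrame as the left argument of `pd.merge` returns a plain DataFrame.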

Important: To make sure that we do not accidentally lose any data during the merge, we need to include the how='outer' parameter in the merge statement.

[55]:
data = pd.merge(admissions_data, geo_df, on='state', how='outer')
data.head()
[55]:
state admits STATEFP STATENS AFFGEOID GEOID LSAD ALAND AWATER geometry
0 Washington 17 53 01779804 0400000US53 53 00 172112588220 12559278850 MULTIPOLYGON (((-122.57039 48.53785, -122.5686...
1 Oregon 6 41 01155107 0400000US41 41 00 248606993270 6192386935 MULTIPOLYGON (((-123.59892 46.25145, -123.5984...
2 California 140 06 01779778 0400000US06 06 00 403503931312 20463871877 MULTIPOLYGON (((-118.60442 33.47855, -118.5987...
3 Nevada 4 32 01779793 0400000US32 32 00 284329506470 2047206072 POLYGON ((-120.00574 39.22866, -120.00559 39.2...
4 Montana 1 30 00767982 0400000US30 30 00 376962738765 3869208832 POLYGON ((-116.04914 48.50205, -116.04913 48.5...
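
An outer merge keeps rows that fail to match, but it does not say which rows those are. The `indicator=True` option of `pd.merge` adds a `_merge` column that does. In the sketch below the tables are small stand-ins, and one state name has been given a trailing space on purpose to simulate a spelling mismatch between the two files; the `admissions`, `shapes`, and `unmatched` names are illustrative.

```python
import pandas as pd

# Stand-in admissions table; 'Washington ' carries a deliberate trailing space.
admissions = pd.DataFrame(
    {'state': ['California', 'Oregon', 'Washington '], 'admits': [140, 6, 17]}
)

# Stand-in shape table with one state that has no admissions row here.
shapes = pd.DataFrame(
    {
        'state': ['California', 'Oregon', 'Washington', 'Nevada'],
        'GEOID': ['06', '41', '53', '32'],
    }
)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(admissions, shapes, on='state', how='outer', indicator=True)

# Rows that exist in only one of the two tables.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['state', '_merge']])
```

Any `left_only` row is an admissions record with no shape to draw, and any `right_only` row is a state that would be plotted without a value, so both are worth checking before plotting.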

Plotting the data

[56]:
# import the plot_states() function from geostates
from geostates.plot import plot_states
[57]:
# create a choropleth map that displays the admits for each state in the United States
# plot = plot_states(data_2, column='admits', cmap=new_cmap, labels='both', linestyle='none', legend='legend',
#                    bins=15)

# add a title to the plot
# plot.annotate('Princeton Admissions Data 2022', xy=(-97, 50.5), fontsize=18, ha='center');
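
The `bins=15` argument in the commented-out call suggests that the values are grouped into classes before being colored. How geostates computes those classes is not shown here, so the following is only a rough preview that assumes equal-width bins, built with `pd.cut`; the classes that `plot_states` actually uses may differ. The values are copied from the tables above, and five bins are used to keep the output short.

```python
import pandas as pd

# A handful of values copied from the tables earlier in this tutorial.
admits = pd.Series(
    {
        'New Jersey': 188,
        'California': 140,
        'New York': 133,
        'Washington': 17,
        'Oregon': 6,
        'Nevada': 4,
        'Montana': 1,
    },
    name='admits',
)

# ASSUMPTION: equal-width classes; this is not taken from geostates itself.
binned = pd.cut(admits, bins=5)

# Count how many states fall into each class, in class order.
print(binned.value_counts(sort=False))
```

With data this skewed, most states land in the lowest class and the middle classes stay empty, which is worth keeping in mind when choosing the number of bins or the color map.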