College Admissions Data

Using the geostates package

geostates can be used to create choropleth plots of the United States or of individual states. This example walks through some of the ins and outs of the package.

Admissions analysis

Goal: To illustrate the power of the package, we will start out by creating a plot that shows how the number of Princeton University acceptances varies by state in the United States.

We will start by importing pandas. The geostates functions are imported later, at the point where they are needed.

[1]:

import pandas as pd

[2]:

%matplotlib inline

Loading in the data

For this example, we use admissions data on the Princeton University Class of 2025 from the Princeton University Undergraduate Admissions Department. The CSV includes the total number of admits in the United States as of 30 August 2021 broken down by each geography (state).

[16]:

# read in the data
admissions_data = pd.read_csv('Desktop/admissions_data_22.csv', index_col='state')

[20]:

# take a look at what the CSV file looks like
admissions_data.head()

[20]:

	admits
state
Washington	17
Oregon	6
California	140
Nevada	4
Montana	1
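
To make the effect of index_col='state' concrete, the sketch below reads a small in-memory CSV instead of the file on disk. The two-column layout (state, admits) is an assumption inferred from the output above, and the rows are the five shown there.

```python
import io
import pandas as pd

# assumed layout of the CSV: a 'state' column and an 'admits' column
csv_text = """state,admits
Washington,17
Oregon,6
California,140
Nevada,4
Montana,1
"""

# index_col='state' turns the state names into the row labels
sample = pd.read_csv(io.StringIO(csv_text), index_col='state')

# rows can now be looked up directly by state name
california_admits = sample.loc['California', 'admits']
print(california_admits)
```

Because the state names form the index, they do not appear as a regular column, which matters later when the data is merged on state.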

Analyzing the data

Now let’s take a look at which states have the most admits by sorting the dataframe by the admits column in descending order.

[26]:

# sort the values to see which states have the most admits
sorted_admits_data = admissions_data.sort_values(by='admits', ascending=False)

# view the first 10 values of the sorted pandas dataframe
sorted_admits_data.head(10)

[26]:

	admits
state
New Jersey	188
California	140
New York	133
Massachusetts	76
Pennsylvania	68
Texas	52
Florida	50
Connecticut	42
Maryland	41
Illinois	32

The table above shows that New Jersey, California, and New York have the most admits for the Princeton undergraduate Class of 2025.

[46]:

# see what percent of the total number of domestic admits come from these top three states

# calculate the total number of admits from New Jersey, California, and New York
top_three_total_admits = sorted_admits_data.head(3)['admits'].sum()
print('Total admits from top three states:', top_three_total_admits, 'students')

# calculate the total number of domestic admits
total_domestic_admits = sorted_admits_data['admits'].sum()
print('Total domestic admits:', total_domestic_admits, 'students')

# calculate the percent of the total admits that these three states contribute
percent = (top_three_total_admits/total_domestic_admits)
print('{:.2%}'.format(percent), 'of domestic admits come from NJ, CA, and NY')

Total admits from top three states: 461 students
Total domestic admits: 1145 students
40.26% of domestic admits come from NJ, CA, and NY

This is interesting! It turns out that just three states account for over 40% of the domestic undergraduate admits to Princeton University.
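
The same calculation can be extended to a running share, which shows how quickly the total accumulates as more states are included. This is only a sketch: it re-enters the ten values from the sorted table and the printed total of 1145 by hand rather than reading the CSV, and the variable names are illustrative.

```python
import pandas as pd

# the ten largest values from the sorted table above, entered by hand
top_ten = pd.Series(
    [188, 140, 133, 76, 68, 52, 50, 42, 41, 32],
    index=['New Jersey', 'California', 'New York', 'Massachusetts', 'Pennsylvania',
           'Texas', 'Florida', 'Connecticut', 'Maryland', 'Illinois'],
    name='admits',
)

# total number of domestic admits, as printed above
total_domestic_admits = 1145

# running share of domestic admits as each additional state is included
cumulative_share = top_ten.cumsum() / total_domestic_admits
print(cumulative_share.map('{:.2%}'.format))
```

The third entry of cumulative_share reproduces the top-three figure computed above.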

Visualize the data using geostates

The first step in using the geostates package is to load the geodataframe containing the geometry for every state. For this, we will use the load_states() function and assign its result to a variable, df. Once we’ve loaded the geodataframe, we need to merge it with our admissions data.

[47]:

# import the load_states() function from the geostates package
from geostates.shapefiles import load_states

[48]:

# load in the geodataframe and assign it to df
df = load_states()
df.head()

[48]:

	STATEFP	STATENS	AFFGEOID	GEOID	NAME	LSAD	ALAND	AWATER	geometry
STUSPS
MS	28	01779790	0400000US28	28	Mississippi	00	121533519481	3926919758	MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...
NC	37	01027616	0400000US37	37	North Carolina	00	125923656064	13466071395	MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
OK	40	01102857	0400000US40	40	Oklahoma	00	177662925723	3374587997	POLYGON ((-103.00257 36.52659, -103.00219 36.6...
VA	51	01779803	0400000US51	51	Virginia	00	102257717110	8528531774	MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
WV	54	01779805	0400000US54	54	West Virginia	00	62266474513	489028543	POLYGON ((-82.64320 38.16909, -82.64300 38.169...

Merging the data

To successfully create a choropleth map of the college admissions data, we need to merge it with the geodataframe that contains the geometry needed to plot the states. We can do this with the pandas merge function. Since the index of the college admissions data is state and our geodataframe contains a matching column (NAME), we can use these values to merge the two dataframes. Let’s start by renaming the NAME column in our geodataframe to state so that both dataframes use the same name.

[49]:

# rename the 'NAME' column in the geodataframe to 'state'
geo_df = df.rename(columns={'NAME': 'state'})
geo_df.head()

[49]:

	STATEFP	STATENS	AFFGEOID	GEOID	state	LSAD	ALAND	AWATER	geometry
STUSPS
MS	28	01779790	0400000US28	28	Mississippi	00	121533519481	3926919758	MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ...
NC	37	01027616	0400000US37	37	North Carolina	00	125923656064	13466071395	MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ...
OK	40	01102857	0400000US40	40	Oklahoma	00	177662925723	3374587997	POLYGON ((-103.00257 36.52659, -103.00219 36.6...
VA	51	01779803	0400000US51	51	Virginia	00	102257717110	8528531774	MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ...
WV	54	01779805	0400000US54	54	West Virginia	00	62266474513	489028543	POLYGON ((-82.64320 38.16909, -82.64300 38.169...

Important: To make sure that we do not accidentally lose any data during the merge, we need to include the how='outer' parameter in the merge statement. An outer merge keeps every row from both dataframes, even when a state appears in only one of them.

[55]:

data = pd.merge(admissions_data, geo_df, on='state', how='outer')
data.head()

[55]:

	state	admits	STATEFP	STATENS	AFFGEOID	GEOID	ALAND	AWATER	geometry
0	Washington	17	53	01779804	0400000US53	53	172112588220	12559278850	MULTIPOLYGON (((-122.57039 48.53785, -122.5686...
1	Oregon	6	41	01155107	0400000US41	41	248606993270	6192386935	MULTIPOLYGON (((-123.59892 46.25145, -123.5984...
2	California	140	06	01779778	0400000US06	06	403503931312	20463871877	MULTIPOLYGON (((-118.60442 33.47855, -118.5987...
3	Nevada	4	32	01779793	0400000US32	32	284329506470	2047206072	POLYGON ((-120.00574 39.22866, -120.00559 39.2...
4	Montana	1	30	00767982	0400000US30	30	376962738765	3869208832	POLYGON ((-116.04914 48.50205, -116.04913 48.5...
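
An outer merge keeps unmatched rows, but it does not say which rows failed to match. One way to check is the indicator=True option of pandas merge, sketched below on toy dataframes. The shapes dataframe is a hypothetical stand-in for the geodataframe (no geometry), and the trailing space in 'Nevada ' is a deliberate, made-up mismatch for illustration.

```python
import pandas as pd

# toy admissions table; 'Nevada ' carries a deliberate trailing space
admits = pd.DataFrame({'state': ['Washington', 'Oregon', 'Nevada '],
                       'admits': [17, 6, 4]})

# hypothetical stand-in for the geodataframe: key column plus one attribute
shapes = pd.DataFrame({'state': ['Washington', 'Oregon', 'Nevada', 'Montana'],
                       'STATEFP': ['53', '41', '32', '30']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(admits, shapes, on='state', how='outer', indicator=True)

# rows that exist in only one of the two dataframes
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['state', '_merge']])
```

Rows marked left_only are present only in the admissions data and rows marked right_only only in the state table, so a misspelled or padded state name shows up as a pair of unmatched rows.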

Plotting the data

[56]:

# import the plot_states() function from geostates
from geostates.plot import plot_states

[57]:

# create a choropleth map that displays the admits for each state in the United States
# plot = plot_states(data_2, column='admits', cmap=new_cmap, labels='both', linestyle='none', legend='legend',
                   #bins=15)

# add a title to the plot
# plot.annotate('Princeton Admissions Data 2022', xy=(-97, 50.5), fontsize=18, ha='center');
