Cattle Data
Using the geostates package
geostates
can be used to create choropleth plots of the United States or individual states. It is easy to use so we will start out with an example to show you some of the ins and outs of the package.
Cattle analysis
Goal: To illustrate the power of the package, we will start out by creating a plot that shows how the number of cattle varies by state in the United States.
We will start by importing the pandas
and geostates
packages.
[1]:
import pandas as pd
[2]:
%matplotlib inline
Loading in the data
For this example, we use data on US cattle from the United States Department of Agriculture National Agricultural Statistics Service. The CSV includes the total number of cattle (including calves) in the United States as of January 2022 broken down by each state.
[3]:
# read in the data
cattle_data = pd.read_csv('Desktop/cattle_data_22.csv', index_col='State', thousands=',')
cattle_data.index = cattle_data.index.str.title()
[4]:
# take a look at what the CSV file looks like
cattle_data.head()
[4]:
Program | Year | Period | Week Ending | Geo Level | State ANSI | Ag District | Ag District Code | County | County ANSI | Zip Code | Region | watershed_code | Watershed | Commodity | Data Item | Domain | Domain Category | Value | CV (%) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
State | ||||||||||||||||||||
Alabama | SURVEY | 2022 | FIRST OF JAN | NaN | STATE | 1 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | CATTLE | CATTLE, INCL CALVES - INVENTORY | TOTAL | NOT SPECIFIED | 1260000 | NaN |
Alaska | SURVEY | 2022 | FIRST OF JAN | NaN | STATE | 2 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | CATTLE | CATTLE, INCL CALVES - INVENTORY | TOTAL | NOT SPECIFIED | 18000 | NaN |
Arizona | SURVEY | 2022 | FIRST OF JAN | NaN | STATE | 4 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | CATTLE | CATTLE, INCL CALVES - INVENTORY | TOTAL | NOT SPECIFIED | 960000 | NaN |
Arkansas | SURVEY | 2022 | FIRST OF JAN | NaN | STATE | 5 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | CATTLE | CATTLE, INCL CALVES - INVENTORY | TOTAL | NOT SPECIFIED | 1690000 | NaN |
California | SURVEY | 2022 | FIRST OF JAN | NaN | STATE | 6 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | CATTLE | CATTLE, INCL CALVES - INVENTORY | TOTAL | NOT SPECIFIED | 5200000 | NaN |
Cleaning the data
It looks like our CSV file has a few extra columns including Program, Commodity, Domain, etc. that we do not need. It also shows a few columns that have missing (NaN) values. Let’s start out by removing all of the unnecessary columns and removing all of the NaNs. Let’s also rename the ‘Value’ column to ‘Cattle’ to make it more clear. Finally, by using the type()
function we can check to see that the ‘Cattle’ column is of dtype str
. We need to convert this to an int
.
[5]:
# drop the NaN values and unnecessary columns
cattle_data = cattle_data.dropna(axis='columns')
cattle_data = cattle_data.drop(columns=['Program', 'Year', 'Period', 'Geo Level', 'State ANSI', 'watershed_code', 'Commodity',
'Data Item', 'Domain', 'Domain Category'])
# rename the column from 'Value' to 'Cattle'
cattle_data = cattle_data.rename(columns={'Value': 'Cattle'})
# view the first five values
cattle_data.head()
[5]:
Cattle | |
---|---|
State | |
Alabama | 1260000 |
Alaska | 18000 |
Arizona | 960000 |
Arkansas | 1690000 |
California | 5200000 |
Now that we have the total number of cattle for each state we could visualize this by creating a choropleth map that shows the variation in total cattle inventory by state. While this is interesting, it might not fully capture the variation we are looking for. For example, bigger states like California and Texas are likely to have the largest total number of cattle. One interesting metric we can use to compare the relative values of cattle across multiple states is by computing the cattle to person ratio. This allows us to examine a state’s total inventory of cattle relative to its population.
For this, we will use population data from the United States Census Bureau’s Population and Housing Unit Estimates.
[6]:
# read in the data
population_data = pd.read_csv('Desktop/state_population_21.csv', index_col='State', thousands=',')
[7]:
population_data.head()
[7]:
Population | |
---|---|
State | |
Oklahoma | 3986639 |
Nebraska | 1963692 |
Hawaii | 1441553 |
South Dakota | 895376 |
Tennessee | 6975218 |
Let’s merge these two datasets together.
[8]:
merged_df = pd.merge(cattle_data, population_data, on='State')
merged_df.head()
[8]:
Cattle | Population | |
---|---|---|
State | ||
Alabama | 1260000 | 5039877 |
Alaska | 18000 | 732673 |
Arizona | 960000 | 7276316 |
Arkansas | 1690000 | 3025891 |
California | 5200000 | 39237836 |
Analyzing the data
Now let’s compute the cattle to person ratio for each state and sort the list by descending values.
[34]:
# compute the cattle to person ratio by dividing the Cattle column by the Population column
cattle_ratio = merged_df['Cattle']/merged_df['Population']
# sort the values to see which states have the highest Cattle to Person ratio
sorted_cattle_ratio = cattle_ratio.sort_values(ascending=False)
# view the first 15 values of the sorted pandas series
sorted_cattle_ratio.head(15)
[34]:
State
South Dakota 4.244027
Nebraska 3.462865
North Dakota 2.387257
Kansas 2.214966
Wyoming 2.159629
Montana 1.992265
Idaho 1.341454
Oklahoma 1.304357
Iowa 1.205733
Missouri 0.654974
New Mexico 0.614402
Wisconsin 0.593632
Arkansas 0.558513
Colorado 0.455948
Kentucky 0.447954
dtype: float64
This is interesting! In fact, it turns out there are nine states where there are more cattle than people!
Finally, let’s append this as a third column to our original dataframe and round the values to three decimal places.
[35]:
# convert the series containing the ratio to a dataframe and merge it with the original dataframe
final_df = merged_df.merge(cattle_ratio.to_frame('Ratio'), on='State')
# round the values of the ratio column to three decimal places
final_df['Ratio'] = final_df['Ratio'].round(3)
[36]:
final_df.head(10)
[36]:
Cattle | Population | Ratio | |
---|---|---|---|
State | |||
Alabama | 1260000 | 5039877 | 0.250 |
Alaska | 18000 | 732673 | 0.025 |
Arizona | 960000 | 7276316 | 0.132 |
Arkansas | 1690000 | 3025891 | 0.559 |
California | 5200000 | 39237836 | 0.133 |
Colorado | 2650000 | 5812069 | 0.456 |
Connecticut | 47000 | 3605597 | 0.013 |
Delaware | 12000 | 1003384 | 0.012 |
Florida | 1630000 | 21781128 | 0.075 |
Georgia | 1050000 | 10799566 | 0.097 |
Now that we have a dataframe containing the ratio of cattle inventory to population we are ready to use geostates
to visualize it!
Visualize the data using geostates
The first step for using the geostates
package is to load in the geodataframe containing all of the state values. For this, we will use the load_states()
function and assign it to a value df
. Once we’ve loaded in the geodataframe we need to merge it with out cattle data.
[12]:
# import the load_states() function from the geostates package
from geostates.shapefiles import load_states
[13]:
# load in the geodataframe and assign it to df
df = load_states()
df.head()
[13]:
STATEFP | STATENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|
STUSPS | |||||||||
MS | 28 | 01779790 | 0400000US28 | 28 | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |
NC | 37 | 01027616 | 0400000US37 | 37 | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |
OK | 40 | 01102857 | 0400000US40 | 40 | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |
VA | 51 | 01779803 | 0400000US51 | 51 | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |
WV | 54 | 01779805 | 0400000US54 | 54 | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |
Merging the data
In order to sucessfully create a choropleth map of the cattle data, we need to merge it with the geodataframe that contains all the information for creating the plots of the states. We can do this by using the pandas merge
function. Since the index for the cattle data is State
and our geodataframe contains a similar column (NAME
) we can use this value to merge both dataframes. Let’s start out by renaming the NAME
column in our geodataframe to State
so that the names of both
columns match.
[25]:
# rename the 'NAME' column in the geodataframe to 'State'
geo_df = df.rename(columns={'NAME': 'State'})
geo_df.head()
[25]:
STATEFP | STATENS | AFFGEOID | GEOID | State | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|
STUSPS | |||||||||
MS | 28 | 01779790 | 0400000US28 | 28 | Mississippi | 00 | 121533519481 | 3926919758 | MULTIPOLYGON (((-88.50297 30.21523, -88.49176 ... |
NC | 37 | 01027616 | 0400000US37 | 37 | North Carolina | 00 | 125923656064 | 13466071395 | MULTIPOLYGON (((-75.72681 35.93584, -75.71827 ... |
OK | 40 | 01102857 | 0400000US40 | 40 | Oklahoma | 00 | 177662925723 | 3374587997 | POLYGON ((-103.00257 36.52659, -103.00219 36.6... |
VA | 51 | 01779803 | 0400000US51 | 51 | Virginia | 00 | 102257717110 | 8528531774 | MULTIPOLYGON (((-75.74241 37.80835, -75.74151 ... |
WV | 54 | 01779805 | 0400000US54 | 54 | West Virginia | 00 | 62266474513 | 489028543 | POLYGON ((-82.64320 38.16909, -82.64300 38.169... |
Important: To make sure that we do not accidentally loose any important data during the merge, we need to make sure that we include the how='outer'
parameter in the merge statement.
[37]:
data = pd.merge(final_df, geo_df, on='State', how='outer')
data.head()
[37]:
State | Cattle | Population | Ratio | STATEFP | STATENS | AFFGEOID | GEOID | LSAD | ALAND | AWATER | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 1260000.0 | 5039877.0 | 0.250 | 01 | 01779775 | 0400000US01 | 01 | 00 | 131174048583 | 4593327154 | MULTIPOLYGON (((-88.05338 30.50699, -88.05109 ... |
1 | Alaska | 18000.0 | 732673.0 | 0.025 | 02 | 01785533 | 0400000US02 | 02 | 00 | 1478839695958 | 245481577452 | MULTIPOLYGON (((179.48246 51.98283, 179.48656 ... |
2 | Arizona | 960000.0 | 7276316.0 | 0.132 | 04 | 01779777 | 0400000US04 | 04 | 00 | 294198551143 | 1027337603 | POLYGON ((-114.81629 32.50804, -114.81432 32.5... |
3 | Arkansas | 1690000.0 | 3025891.0 | 0.559 | 05 | 00068085 | 0400000US05 | 05 | 00 | 134768872727 | 2962859592 | POLYGON ((-94.61783 36.49941, -94.61765 36.499... |
4 | California | 5200000.0 | 39237836.0 | 0.133 | 06 | 01779778 | 0400000US06 | 06 | 00 | 403503931312 | 20463871877 | MULTIPOLYGON (((-118.60442 33.47855, -118.5987... |
Plotting the data
To plot the data we need to use the plot_states
function in the geostates package.
[24]:
# import the plot_states() function from geostates
from geostates.plot import plot_states
[33]:
# create a choropleth map that displays the cattle to person ratio for each state in the United States
# plot = plot_states(data_2, column='Ratio', cmap=new_cmap, labels='both', linestyle='none', legend='colorbar',
#bins=15)
# add a title to the plot
# plot.annotate('Cattle to Person Ratio 2022', xy=(-97, 50.5), fontsize=18, ha='center');
[ ]:
[ ]:
[ ]:
[ ]: