US Population Data

A meme floating around these days takes a swing at the existence of the Electoral College, stating that "land doesn't vote, people do". While I don't subscribe to either side of that particular argument, the meme does a great job of highlighting how population data is essential for any analysis involving people. As such, it is important to have a reliable source for population data.

Luckily, in the U.S. we have the Census Bureau. Census.gov provides several datasets for our consumption, and you can load them directly into a pandas DataFrame.

census.gov data

import pandas as pd

# Load the Census Bureau's 2010-2019 state population estimates straight from census.gov
df = pd.read_excel('https://www2.census.gov/programs-surveys/popest/tables/2010-2019/state/totals/nst-est2019-01.xlsx')

df.head(10)
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts) Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Table 1. Annual Estimates of the Resident Popu... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Geographic Area 2010-04-01 00:00:00 NaN Population Estimate (as of July 1) NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN Census Estimates Base 2010 2011.0 2012.0 2013.0 2014.0 2015.0 2016.0 2017.0 2018.0 2019.0
3 United States 308745538 308758105 309321666 311556874.0 313830990.0 315993715.0 318301008.0 320635163.0 322941311.0 324985539.0 326687501.0 328239523.0
4 Northeast 55317240 55318443 55380134 55604223.0 55775216.0 55901806.0 56006011.0 56034684.0 56042330.0 56059240.0 56046620.0 55982803.0
5 Midwest 66927001 66929725 66974416 67157800.0 67336743.0 67560379.0 67745167.0 67860583.0 67987540.0 68126781.0 68236628.0 68329004.0
6 South 114555744 114563030 114866680 116006522.0 117241208.0 118364400.0 119624037.0 120997341.0 122351760.0 123542189.0 124569433.0 125580448.0
7 West 71945553 71946907 72100436 72788329.0 73477823.0 74167130.0 74925793.0 75742555.0 76559681.0 77257329.0 77834820.0 78347268.0
8 .Alabama 4779736 4780125 4785437 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0
9 .Alaska 710231 710249 713910 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0

Next, we need to clean up our data.

You can see that our column headers are in row index 2.

We can extract that header row into a new Series.

new_header = df.iloc[2]  # the row holding Census / Estimates Base / 2010-2019
new_header
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts)               NaN
Unnamed: 1                                                                                                              Census
Unnamed: 2                                                                                                      Estimates Base
Unnamed: 3                                                                                                                2010
Unnamed: 4                                                                                                                2011
Unnamed: 5                                                                                                                2012
Unnamed: 6                                                                                                                2013
Unnamed: 7                                                                                                                2014
Unnamed: 8                                                                                                                2015
Unnamed: 9                                                                                                                2016
Unnamed: 10                                                                                                               2017
Unnamed: 11                                                                                                               2018
Unnamed: 12                                                                                                               2019
Name: 2, dtype: object
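
Notice that the year labels in this Series come through as floats (2011.0, 2012.0, and so on). That is purely cosmetic, but if you would rather have clean integer-style labels, an optional sketch (not applied in the rest of this walkthrough) is:

# Optional: turn float year labels like 2011.0 into the string '2011'.
# Cosmetic only; the steps below work without it.
new_header = new_header.map(
    lambda x: str(int(x)) if isinstance(x, float) and not pd.isna(x) else x
)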

Next, we only want state and territory data, so we need to remove the aggregate and non-data rows from our dataset.

Looking at the head(10) output above, we can see that the state and territory rows begin at row index 8 (.Alabama). We can therefore discard everything above row 8. (A less position-dependent way to find that starting row is sketched after the output below.)

df = df[8:]  # keep everything from the first state row (.Alabama) onward

df.head()
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts) Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
8 .Alabama 4779736 4780125 4785437 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0
9 .Alaska 710231 710249 713910 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0
10 .Arizona 6392017 6392288 6407172 6472643.0 6554978.0 6632764.0 6730413.0 6829676.0 6941072.0 7044008.0 7158024.0 7278717.0
11 .Arkansas 2915918 2916031 2921964 2940667.0 2952164.0 2959400.0 2967392.0 2978048.0 2989918.0 3001345.0 3009733.0 3017804.0
12 .California 37253956 37254519 37319502 37638369.0 37948800.0 38260787.0 38596972.0 38918045.0 39167117.0 39358497.0 39461588.0 39512223.0
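
As promised above, here is a sketch of a less position-dependent way to find where the state rows begin. It assumes the freshly loaded frame (before any slicing), and relies on the fact that the state names in this file are prefixed with a dot; the variable names are just for illustration.

# Sketch: find the first state row instead of hard-coding its position.
# Assumes the freshly loaded frame, before any slicing.
first_col = df.iloc[:, 0].astype(str)
first_state_label = first_col.str.startswith('.').idxmax()  # row label of '.Alabama'
df = df.loc[first_state_label:]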

We can also see, when reviewing the dataset, that there are 5 non-data rows at the end. We can discard those rows as well.

df = df[:-5]  # drop the five non-data note rows at the bottom of the sheet

df.tail(10)
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts) Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
51 .Texas 25145561 25146091 25241971 25645629.0 26084481.0 26480266.0 26964333.0 27470056.0 27914410.0 28295273.0 28628666.0 28995881.0
52 .Utah 2763885 2763891 2775332 2814384.0 2853375.0 2897640.0 2936879.0 2981835.0 3041868.0 3101042.0 3153550.0 3205958.0
53 .Vermont 625741 625737 625879 627049.0 626090.0 626210.0 625214.0 625216.0 623657.0 624344.0 624358.0 623989.0
54 .Virginia 8001024 8001049 8023699 8101155.0 8185080.0 8252427.0 8310993.0 8361808.0 8410106.0 8463587.0 8501286.0 8535519.0
55 .Washington 6724540 6724540 6742830 6826627.0 6897058.0 6963985.0 7054655.0 7163657.0 7294771.0 7423362.0 7523869.0 7614893.0
56 .West Virginia 1852994 1853018 1854239 1856301.0 1856872.0 1853914.0 1849489.0 1842050.0 1831023.0 1817004.0 1804291.0 1792147.0
57 .Wisconsin 5686986 5687285 5690475 5705288.0 5719960.0 5736754.0 5751525.0 5760940.0 5772628.0 5790186.0 5807406.0 5822434.0
58 .Wyoming 563626 563775 564487 567299.0 576305.0 582122.0 582531.0 585613.0 584215.0 578931.0 577601.0 578759.0
59 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60 Puerto Rico 3725789 3726157 3721525 3678732.0 3634488.0 3593077.0 3534874.0 3473232.0 3406672.0 3325286.0 3193354.0 3193694.0

Next, we can see that there is still a null row in our dataset (row 59 above). We can discard that as well.

df.dropna(inplace=True)  # drop rows with missing values (here, the blank separator row)
df.tail()
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts) Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
55 .Washington 6724540 6724540 6742830 6826627.0 6897058.0 6963985.0 7054655.0 7163657.0 7294771.0 7423362.0 7523869.0 7614893.0
56 .West Virginia 1852994 1853018 1854239 1856301.0 1856872.0 1853914.0 1849489.0 1842050.0 1831023.0 1817004.0 1804291.0 1792147.0
57 .Wisconsin 5686986 5687285 5690475 5705288.0 5719960.0 5736754.0 5751525.0 5760940.0 5772628.0 5790186.0 5807406.0 5822434.0
58 .Wyoming 563626 563775 564487 567299.0 576305.0 582122.0 582531.0 585613.0 584215.0 578931.0 577601.0 578759.0
60 Puerto Rico 3725789 3726157 3721525 3678732.0 3634488.0 3593077.0 3534874.0 3473232.0 3406672.0 3325286.0 3193354.0 3193694.0
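
One caution here: dropna() with no arguments drops any row that contains a missing value. That is fine in this file, where the only gap left is the completely blank separator row, but a slightly safer habit is to drop only rows that are entirely empty:

# Safer variant: only drop rows where every value is missing, so a state
# row with a single blank cell would survive.
df.dropna(how='all', inplace=True)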

To make this dataset compatible with our other state-based data, we need to remove the non-alphanumeric characters (the leading dots) from our state names.

df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)  # strip anything that isn't a letter, digit, or space
df.head()
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts) Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
8 Alabama 4779736 4780125 4785437 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0
9 Alaska 710231 710249 713910 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0
10 Arizona 6392017 6392288 6407172 6472643.0 6554978.0 6632764.0 6730413.0 6829676.0 6941072.0 7044008.0 7158024.0 7278717.0
11 Arkansas 2915918 2916031 2921964 2940667.0 2952164.0 2959400.0 2967392.0 2978048.0 2989918.0 3001345.0 3009733.0 3017804.0
12 California 37253956 37254519 37319502 37638369.0 37948800.0 38260787.0 38596972.0 38918045.0 39167117.0 39358497.0 39461588.0 39512223.0
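
Note that the replace above runs the regex across every text column in the frame. Since all we really need is to strip the leading dot from the state names, a narrower alternative (addressing the name column by position, because it is still unnamed at this point) would be:

# Narrower alternative: strip the leading '.' from the state-name column only,
# leaving the rest of the frame untouched.
df.iloc[:, 0] = df.iloc[:, 0].str.lstrip('.')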

Now we just need to reapply our headers.

import numpy as np
df.columns = new_header                              # restore the Census / year column names
df.rename(columns={np.nan: 'States'}, inplace=True)  # the blank first header becomes 'States'
df.head()
2 States Census Estimates Base 2010 2011.0 2012.0 2013.0 2014.0 2015.0 2016.0 2017.0 2018.0 2019.0
8 Alabama 4779736 4780125 4785437 4799069.0 4815588.0 4830081.0 4841799.0 4852347.0 4863525.0 4874486.0 4887681.0 4903185.0
9 Alaska 710231 710249 713910 722128.0 730443.0 737068.0 736283.0 737498.0 741456.0 739700.0 735139.0 731545.0
10 Arizona 6392017 6392288 6407172 6472643.0 6554978.0 6632764.0 6730413.0 6829676.0 6941072.0 7044008.0 7158024.0 7278717.0
11 Arkansas 2915918 2916031 2921964 2940667.0 2952164.0 2959400.0 2967392.0 2978048.0 2989918.0 3001345.0 3009733.0 3017804.0
12 California 37253956 37254519 37319502 37638369.0 37948800.0 38260787.0 38596972.0 38918045.0 39167117.0 39358497.0 39461588.0 39512223.0
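
One last bit of housekeeping worth knowing about: because the original columns mixed header text with numbers, the population columns most likely carry the generic object dtype at this point. The bar chart below copes with that, but if you plan to do any arithmetic, an optional sketch converts everything except the state names to proper numeric types:

# Optional: cast every column except 'States' to a numeric dtype so that
# later aggregation and plotting treat the values as numbers.
value_cols = [c for c in df.columns if c != 'States']
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors='coerce')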

Now it's time to plot the data.

import plotly.express as px

fig = px.bar(df, x='States', y='Census',
             hover_data=['Census'], color='Census',
             labels={'Census': 'Population'}, height=400)
fig.show()
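
With more than fifty bars, the default ordering (the order the states appear in the file) can be hard to read at a glance. Sorting by population first is a one-line tweak:

# Optional: sort states by population so the bars run from largest to smallest.
fig = px.bar(df.sort_values('Census', ascending=False), x='States', y='Census',
             hover_data=['Census'], color='Census',
             labels={'Census': 'Population'}, height=400)
fig.show()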

Tying it all together

import pandas as pd
import numpy as np

# Load the Census Bureau's 2010-2019 state population estimates
df = pd.read_excel('https://www2.census.gov/programs-surveys/popest/tables/2010-2019/state/totals/nst-est2019-01.xlsx')
new_header = df.iloc[2]                               # the row holding Census / Estimates Base / 2010-2019
df = df[8:-5]                                         # keep the state rows, drop the trailing note rows
df.dropna(inplace=True)                               # drop the blank separator row
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)      # strip the leading dots from the state names
df.columns = new_header
df.rename(columns={np.nan: 'States'}, inplace=True)

import plotly.express as px

fig = px.bar(df, x='States', y='Census',
             hover_data=['Census'], color='Census',
             labels={'Census': 'Population'}, height=400)
fig.show()
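
For completeness, here is an alternative sketch that pushes more of the cleanup into read_excel itself. The header and skipfooter offsets are inferred from the printouts above, so treat them as assumptions and verify them against the file before relying on this version.

import pandas as pd
import plotly.express as px

url = 'https://www2.census.gov/programs-surveys/popest/tables/2010-2019/state/totals/nst-est2019-01.xlsx'

# header=3: the sheet row holding Census / Estimates Base / 2010-2019
# skipfooter=5: the trailing note rows (both offsets are assumptions to verify)
pop = pd.read_excel(url, header=3, skipfooter=5)
pop = pop.rename(columns={pop.columns[0]: 'States'})  # the first header cell is blank
pop = pop.dropna(how='all')                           # drop the blank separator row
# Keep the dotted state rows plus Puerto Rico, then strip the leading dots.
pop = pop[pop['States'].str.startswith('.', na=False) | pop['States'].eq('Puerto Rico')]
pop['States'] = pop['States'].str.lstrip('.')

fig = px.bar(pop, x='States', y='Census', color='Census',
             labels={'Census': 'Population'}, height=400)
fig.show()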

As always, you can open an interactive version of this notebook at mybinder.org