State and Territory Abbreviations

Data sets from different sources are rarely uniform. This means in order to merge or combine data sets from different sources we have to do data cleanup. One of the common U.S. data source issues I encounter is State Name vs State Abbreviation. In this notebook, we will look at an easy way to gather state/abbreviation information for use in other analyses.

For this data set, we will use data from usps.com. I had some issues with their website causing multiple redirects when attempting to access the data. To work around this I began including a standard request header in my request

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}

Using the standard requests library I extracted the needed data from https://pe.usps.com/text/pub28/28apb.htm

Using the MS Edge developer tools I found that the needed HTML table had an id of ep18684.

usps_states

After identifying this table it is very easy to make an HTTP request and using beautiful soup, extract the exact information needed.

import bs4 as bs
import requests

resp = requests.get('https://pe.usps.com/text/pub28/28apb.htm', headers=headers)
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'id': 'ep18684'})

At this point, the table is still an HTML table so now we need to convert it to a more usable format. This could be a dictionary, JSON, XML or my favorite, a pandas dataframe. Using pandas extract the data from the table object and place it in a dataframe. In this case, I manually stipulated the column names, but we could very easily extract column names from the table object as well.

import pandas as pd

df = pd.read_html(str(table))[0]
df = df.iloc[1:]
df.columns = ['State', 'Abbreviation']

df.head(10)

	State	Abbreviation
1	Alabama	AL
2	Alaska	AK
3	American Samoa	AS
4	Arizona	AZ
5	Arkansas	AR
6	California	CA
7	Colorado	CO
8	Connecticut	CT
9	Delaware	DE
10	District of Columbia	DC

We now have a table that can be used to merge datasets using state abbreviation with datasets using full state name

Tying it all together

import pandas as pd
import bs4 as bs
import requests

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}

resp = requests.get('https://pe.usps.com/text/pub28/28apb.htm', headers=headers)
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'id': 'ep18684'})

df = pd.read_html(str(table))[0]
df = df.iloc[1:]
df.columns = ['State', 'Abbreviation']

df.head(10)

	State	Abbreviation
1	Alabama	AL
2	Alaska	AK
3	American Samoa	AS
4	Arizona	AZ
5	Arkansas	AR
6	California	CA
7	Colorado	CO
8	Connecticut	CT
9	Delaware	DE
10	District of Columbia	DC

As always, you can open an interactive version of this notebook at mybinder.org

State and Territory Abbreviations

Jeremiah Nelson

Jeremiah Nelson

Tying it all together

Using Python to Change Power BI's data source file

Quick and Simple Stock Technical Analysis using Jupyter

US Population Data

Installing Raspberry Pi OS Lite No Monitor Needed

US Population Data

Tying it all together

Subscribe to Make Time Labs

Subscribe to Make Time Labs