Data sets from different sources are rarely uniform, so merging or combining them usually requires some cleanup first. One common issue I encounter with U.S. data sources is that some use full state names while others use state abbreviations. In this notebook, we will look at an easy way to gather a state-name/abbreviation mapping for use in other analyses.

For this data set, we will use data from usps.com. I ran into issues with their website causing multiple redirects when attempting to access the data. To work around this, I began including a standard set of request headers in my request:

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}

Using the standard requests library, I extracted the needed data from https://pe.usps.com/text/pub28/28apb.htm.

Using the MS Edge developer tools, I found that the HTML table I needed had an id of ep18684.


After identifying this table, it is easy to make an HTTP request and, using Beautiful Soup, extract the exact information needed.

import bs4 as bs
import requests

resp = requests.get('https://pe.usps.com/text/pub28/28apb.htm', headers=headers)
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'id': 'ep18684'})  # the id found with the developer tools

At this point, the table is still an HTML table, so we need to convert it to a more usable format. This could be a dictionary, JSON, XML, or my favorite, a pandas DataFrame. Using pandas, extract the data from the table object and place it in a DataFrame. In this case, I manually specified the column names, but we could just as easily extract them from the table object.

import pandas as pd

df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
df = df.iloc[1:]                  # drop the header row, which was parsed as data
df.columns = ['State', 'Abbreviation']

df.head(10)
State Abbreviation
1 Alabama AL
2 Alaska AK
3 American Samoa AS
4 Arizona AZ
5 Arkansas AR
6 California CA
7 Colorado CO
8 Connecticut CT
9 Delaware DE
10 District of Columbia DC
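As mentioned above, the column names could also be read from the table itself rather than hardcoded. A minimal sketch, using an inline HTML snippet standing in for the USPS table (the real page's header markup may differ):

```python
import bs4 as bs

# Stand-in for the fetched page; the actual USPS markup may differ.
html = """
<table id="ep18684">
  <tr><th>State</th><th>Abbreviation</th></tr>
  <tr><td>Alabama</td><td>AL</td></tr>
</table>
"""
soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'ep18684'})

# Take the first row's cells as the column names.
columns = [cell.get_text(strip=True)
           for cell in table.find('tr').find_all(['th', 'td'])]
print(columns)  # ['State', 'Abbreviation']
```

These names could then be assigned to `df.columns` in place of the hardcoded list.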

We now have a table that can be used to merge datasets keyed by state abbreviation with datasets keyed by full state name.
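For example, given one hypothetical dataset keyed by full state name and another keyed by abbreviation, the mapping lets us join them (the datasets and figures below are made up for illustration):

```python
import pandas as pd

# The scraped name/abbreviation mapping (first two rows shown).
states = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                       'Abbreviation': ['AL', 'AK']})

# Two hypothetical datasets keyed differently.
population = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                           'Population': [5024279, 733391]})
sales = pd.DataFrame({'Abbreviation': ['AL', 'AK'],
                      'Sales': [100, 200]})

# Bridge the two keys through the mapping table.
merged = population.merge(states, on='State').merge(sales, on='Abbreviation')
print(merged)
```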

Tying it all together

import pandas as pd
import bs4 as bs
import requests

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}

resp = requests.get('https://pe.usps.com/text/pub28/28apb.htm', headers=headers)
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'id': 'ep18684'})

df = pd.read_html(str(table))[0]
df = df.iloc[1:]
df.columns = ['State', 'Abbreviation']

df.head(10)
State Abbreviation
1 Alabama AL
2 Alaska AK
3 American Samoa AS
4 Arizona AZ
5 Arkansas AR
6 California CA
7 Colorado CO
8 Connecticut CT
9 Delaware DE
10 District of Columbia DC
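If a plain dictionary is more convenient than a DataFrame for quick lookups, the two columns zip together directly (illustrated here with a hand-built two-row frame):

```python
import pandas as pd

# Stand-in for the scraped DataFrame.
df = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                   'Abbreviation': ['AL', 'AK']})

# Name -> abbreviation lookup dictionary.
state_to_abbrev = dict(zip(df['State'], df['Abbreviation']))
print(state_to_abbrev['Alaska'])  # AK
```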
As always, you can open an interactive version of this notebook at mybinder.org.