Stock Ticker Symbols for Machine Learning and Artificial Intelligence
I received a few questions from multiple followers regarding a working list of all traded stocks. This is generally followed by an explanation that they want to apply machine-learning or artificial intelligence to predict stock prices. My initial response is always, that sounds like fun, let's start with a data-set that is a bit smaller first and ideally one that you are familiar with.
Capturing ALL traded stocks would become unwieldy rather quickly. Most stock data services have limits even on their paid plans. Using funnhub.io as an example, their most expensive plan at \$500 a month only allows for 300 API calls a minute. That is 216,000 API calls a day to retrieve data and a quick look at gurufocus.com indicates around 120,000 stocks, and I'm not certain that is all of them. At 300 API call an hour it would take half a day to get the latest information and most likely you would want to run multiple API calls on at least some of those symbols. The only way you are going to get faster data than that is to spend many thousands of dollars and if you have that kind of money you wouldn't be asking me how to get all the ticker symbols.
As stated above, let's make something a bit more reasonable. Everyone is familiar with the New York Stock Exchange (NYSE), and whether they know it by its actual name or not, the National Association of Securities Dealers Automated Quotation System (Nasdaq). So let's start with these, and the very first thing you are going to need is a list of all their associated tickers symbols.
If you are just getting started with the stock market and do not yet have a Robinhood account, you can collect a free stock from Robinhood by following this link. Robinhood.com
Now we could scrape this data and indeed if we wanted to constrain our search to the S&P 500 only we could get its list of symbols in three lines of python
import pandas as pd
table=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
df.head()
Symbol | Security | SEC filings | GICS Sector | GICS Sub-Industry | Headquarters Location | Date first added | CIK | Founded | |
---|---|---|---|---|---|---|---|---|---|
0 | MMM | 3M Company | reports | Industrials | Industrial Conglomerates | St. Paul, Minnesota | 1976-08-09 | 66740 | 1902 |
1 | ABT | Abbott Laboratories | reports | Health Care | Health Care Equipment | North Chicago, Illinois | 1964-03-31 | 1800 | 1888 |
2 | ABBV | AbbVie Inc. | reports | Health Care | Pharmaceuticals | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
3 | ABMD | ABIOMED Inc | reports | Health Care | Health Care Equipment | Danvers, Massachusetts | 2018-05-31 | 815094 | 1981 |
4 | ACN | Accenture plc | reports | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |
Scraping the entire list would be quite cumbersome however as most websites that maintain an entire list break that list into multiple pages. Interestingly enough the website nasdaqtrader.com keeps a list of all symbols.
df = pd.read_csv('ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqtraded.txt', sep='|', index_col=False)
df.head()
Nasdaq Traded | Symbol | Security Name | Listing Exchange | Market Category | ETF | Round Lot Size | Test Issue | Financial Status | CQS Symbol | NASDAQ Symbol | NextShares | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Y | A | Agilent Technologies, Inc. Common Stock | N | N | 100.0 | N | NaN | A | A | N | |
1 | Y | AA | Alcoa Corporation Common Stock | N | N | 100.0 | N | NaN | AA | AA | N | |
2 | Y | AAA | Listed Funds Trust AAF First Priority CLO Bond... | P | Y | 100.0 | N | NaN | AAA | AAA | N | |
3 | Y | AAAU | Perth Mint Physical Gold ETF | P | Y | 100.0 | N | NaN | AAAU | AAAU | N | |
4 | Y | AACG | ATA Creativity Global - American Depositary Sh... | Q | G | N | 100.0 | N | N | NaN | AACG | N |
In order to use this list we need to know some things about it. First lets understand the Listing Exchange Column. This column gives us information regarding what exchange the stock was actually listed on. We can interpret this information using the below list.
- A = NYSE MKT
- N = New York Stock Exchange (NYSE
- P = NYSE ARCA
- Z = BATS Global Markets (BATS)
- V = Investors' Exchange, LLC (IEXG)
Since we are specifically looking at stocks, we need to remove our ETF's.
no_ETFs = df['Listing Exchange']=='N'
df = df[no_ETFs]
df.head()
Nasdaq Traded | Symbol | Security Name | Listing Exchange | Market Category | ETF | Round Lot Size | Test Issue | Financial Status | CQS Symbol | NASDAQ Symbol | NextShares | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Y | A | Agilent Technologies, Inc. Common Stock | N | N | 100.0 | N | NaN | A | A | N | |
1 | Y | AA | Alcoa Corporation Common Stock | N | N | 100.0 | N | NaN | AA | AA | N | |
9 | Y | AAIC | Arlington Asset Investment Corp Class A (new) | N | N | 100.0 | N | NaN | AAIC | AAIC | N | |
10 | Y | AAIC$B | Arlington Asset Investment Corp 7.00% | N | N | 100.0 | N | NaN | AAICpB | AAIC-B | N | |
11 | Y | AAIC$C | Arlington Asset Investment Corp 8.250% Seies C... | N | N | 100.0 | N | NaN | AAICpC | AAIC-C | N |
Next, we need to filter out all the test securities using the Test Issue Column.
no_Tests = df['Test Issue']=='N'
df = df[no_Tests]
df.head()
Nasdaq Traded | Symbol | Security Name | Listing Exchange | Market Category | ETF | Round Lot Size | Test Issue | Financial Status | CQS Symbol | NASDAQ Symbol | NextShares | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Y | A | Agilent Technologies, Inc. Common Stock | N | N | 100.0 | N | NaN | A | A | N | |
1 | Y | AA | Alcoa Corporation Common Stock | N | N | 100.0 | N | NaN | AA | AA | N | |
9 | Y | AAIC | Arlington Asset Investment Corp Class A (new) | N | N | 100.0 | N | NaN | AAIC | AAIC | N | |
10 | Y | AAIC$B | Arlington Asset Investment Corp 7.00% | N | N | 100.0 | N | NaN | AAICpB | AAIC-B | N | |
11 | Y | AAIC$C | Arlington Asset Investment Corp 8.250% Seies C... | N | N | 100.0 | N | NaN | AAICpC | AAIC-C | N |
This list provides various differences in securities by adding a period or dollar sign and denoting this variance after that indicator. In the list above you can see that AAIC is list 3 times. The second and third have \$B and \$C respectively. This indicates that AAIC has multiple classes, normal, class B, and class C stocks. These types of annotation can also denote common stock, preferred stock, etc. For this, we really only need to look at the base ticker symbol.
removeDollar = df['Symbol'].str.contains('$', regex=False) == False
df = df[removeDollar]
removePeriod = df['Symbol'].str.contains('.', regex=False) == False
df = df[removePeriod]
df.head()
Nasdaq Traded | Symbol | Security Name | Listing Exchange | Market Category | ETF | Round Lot Size | Test Issue | Financial Status | CQS Symbol | NASDAQ Symbol | NextShares | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Y | A | Agilent Technologies, Inc. Common Stock | N | N | 100.0 | N | NaN | A | A | N | |
1 | Y | AA | Alcoa Corporation Common Stock | N | N | 100.0 | N | NaN | AA | AA | N | |
9 | Y | AAIC | Arlington Asset Investment Corp Class A (new) | N | N | 100.0 | N | NaN | AAIC | AAIC | N | |
15 | Y | AAN | Aarons Holdings Company, Inc. Common Stock | N | N | 100.0 | N | NaN | AAN | AAN | N | |
18 | Y | AAP | Advance Auto Parts Inc Advance Auto Parts Inc W/I | N | N | 100.0 | N | NaN | AAP | AAP | N |
Now we need to put the ticker symbols into a list.
symbols = df['Symbol'].to_list()
print(len(symbols))
symbols[:10]
2537
['A', 'AA', 'AAIC', 'AAN', 'AAP', 'AAT', 'AB', 'ABB', 'ABBV', 'ABC']
We now have a rather large list of ticker symbols and better yet, we can parse information from finnhub.io under 10 minutes rather than taking 12 hours. Yahoo finance has a 2000 API hit per hour limit, so I am only going to grab the first 100 ticker symbols.
First we are going to write a function to retrieve the last close price of each stock in our list.
def getLastAdjustedClose(symbol):
import datetime
end = datetime.date.today()
start = datetime.date.today() - datetime.timedelta(days=7)
df = pdr.DataReader(symbol, "yahoo", start, end).reset_index()
return(df['Adj Close'].tail(1))
Next, I am going to loop through our entire list and get the last adjusted close price
import pandas_datareader as pdr
import datetime
symbol_dict = dict()
for symbol in symbols[:100]:
adjclose = getLastAdjustedClose(symbol)
symbol_dict[symbol] = adjclose
pd.DataFrame.from_dict(symbol_dict).round()
A | AA | AAIC | AAN | AAP | AAT | AB | ABB | ABBV | ABC | ... | ALG | ALK | ALL | ALLE | ALLY | ALSN | ALTG | ALUS | ALV | ALX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 111.0 | 19.0 | 3.0 | 63.0 | 144.0 | 29.0 | 32.0 | 27.0 | 101.0 | 100.0 | ... | 136.0 | 48.0 | 100.0 | 113.0 | 29.0 | 40.0 | 9.0 | 10.0 | 89.0 | 287.0 |
1 rows × 100 columns