Sensing Market Sentiment for Algorithmic Marketing at Sell Side

In the previous chapter, we learned about investment portfolio management. We covered portfolio construction techniques such as the Markowitz mean-variance model and the Treynor–Black model, and we also learned how to predict the trend of a security. So, the previous chapter was about the buy side of the market; it depicted the behavior of portfolio managers or asset managers.

In this chapter, we will look at the sell side of the market and understand the behavior of the counterparts of the portfolio managers. Sell side refers to securities firms/investment banks and their main services: sales, trading, and research. Sales refers to the marketing of securities that informs investors about the securities available for sale. Trading refers to the services through which investors buy and sell securities, and research refers to the analysis performed to assist investors in evaluating securities. Being client-centric, one of the key functions of a bank is to sense the needs and sentiments of the end investors, who in turn push the asset managers to buy products from the banks. We will begin this chapter by looking at a few concepts and techniques. We will then look at an example that illustrates how to sense the needs of investors, and at another example that analyzes an annual report and extracts information from it.

The following topics will be covered in this chapter:

  • Understanding sentiment analysis
  • Sensing market requirements using sentiment analysis
  • Network building and analysis using Neo4j

Understanding sentiment analysis

Sentiment analysis is a text mining technique in which contextual information is identified and extracted from the source material. It helps businesses understand the sentiment around their products, securities, or assets. Applying advanced artificial intelligence techniques can be very effective for in-depth research in the area of text analysis. It is important to classify the transactions around the following concepts:

  • The aspect of security the buyers and sellers care about
  • Customers' intentions and reactions concerning the securities

Sentiment analysis is the most common text analysis and classification tool. It receives an incoming message or transaction and classifies it according to whether the associated sentiment is positive, negative, or neutral. By using the sentiment analysis technique, we can input a sentence and understand the sentiment behind it.
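To make this concrete, the following is a minimal, self-contained sketch of lexicon-based sentiment scoring. The word lists here are tiny illustrative stand-ins and are not the lexicon used later in this chapter:

import re

POSITIVE_WORDS = {'gain', 'growth', 'beat', 'upgrade', 'strong'}
NEGATIVE_WORDS = {'loss', 'downgrade', 'weak', 'lawsuit', 'miss'}

def simple_sentiment(text):
    #return a score in [-1, +1]: (positives - negatives) / sentiment words found
    words = re.findall(r'[a-z]+', text.lower())
    pos = sum(1 for w in words if w in POSITIVE_WORDS)
    neg = sum(1 for w in words if w in NEGATIVE_WORDS)
    total = pos + neg
    if total == 0:
        return 0.0  #neutral when no sentiment-bearing words appear
    return (pos - neg) / total

print(simple_sentiment("Strong quarter and an upgrade, despite one lawsuit"))  # ~0.33

A score close to +1 indicates an overwhelmingly positive message, while a score close to -1 indicates an overwhelmingly negative one.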

Now that we have understood what sentiment analysis is, let's find out how to sense market requirements in the following section.

Sensing market requirements using sentiment analysis

One of the key requirements of a securities firm/investment bank on the sell side is to manufacture securities that are relevant to the market. We explored the fundamental behaviors and responsibilities of companies in Chapter 4, Mechanizing Capital Market Decisions, and Chapter 5, Predicting the Future of Investment Bankers. We learned about the momentum approach in Chapter 6, Automated Portfolio Management Using the Treynor–Black Model and ResNet. While the market does not always act rationally, it can be instructive to listen to the market's feelings, and that is what we will be doing in this chapter.

In this example, we will play the role of a salesperson at an investment bank on the trading floor, trading in equities. What we want to find out is the market's likes and dislikes regarding securities, so that we can market the relevant securities, including derivatives. We get our insights from Twitter Search and the stock prices from Quandl; both require a paid license.

Solution and steps

There are a total of three major steps in the coding implementation used to get the market sentiment. The data flows as shown in the following diagram:

The steps are as follows:

  1. Data will be retrieved from Twitter and be saved locally as a JSON file.
  2. The JSON file will then be read and further processed by counting the positive and negative words, and the results will be inserted as records into a SQLite database.
  3. Lastly, the sentiment will be read from the database and compared against stock prices retrieved from Quandl.

We will elaborate on these steps in more detail in the following sections.

Downloading the data from Twitter

By using a Twitter Search commercial license, we download data on the companies in the same industry, as defined by the Sharadar (Quandl) industry classification. We use the API key to search for and download the latest 500 tweets containing or tagged with each company name, one company at a time. All tweets are received in JSON format, which looks like a Python dictionary. The JSON file is then saved on the computer for further processing.

Sample Python code can be found on GitHub (https://github.com/twitterdev/search-tweets-python), particularly regarding authentication. The following is the code snippet for downloading tweets from Twitter:

'''*************************************
#1. Import libraries and key variable values

'''
from searchtweets import ResultStream, gen_rule_payload, load_credentials
from searchtweets import collect_results
import json
import os

script_dir = os.path.dirname(__file__)
#Twitter Search commercial account credentials
premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                       env_overwrite=False)
MAX_RESULTS = 500  #maximum at 500

#list of companies in the same industry
...

'''*************************************
#2. download tweets of each company

'''
for comp in comp_list:
    ...
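The loop body is elided above. Reusing the imports and variables defined in the snippet, a minimal sketch of what it might look like is shown below; the comp_list values and the output file naming are assumptions for illustration only:

#list of companies in the same industry (illustrative values only)
comp_list = ['Duke Energy', 'Dominion Energy', 'PG&E Corp']

for comp in comp_list:
    #build a premium search rule for tweets mentioning the company name
    rule = gen_rule_payload('"' + comp + '"', results_per_call=100)
    #pull up to MAX_RESULTS tweets using the credentials loaded above
    tweets = collect_results(rule,
                             max_results=MAX_RESULTS,
                             result_stream_args=premium_search_args)
    #each tweet behaves like a dictionary, so the list can be saved as JSON
    out_path = os.path.join(script_dir, comp + 'data.json')
    with open(out_path, 'w') as f:
        json.dump([dict(t) for t in tweets], f)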

Converting the downloaded tweets into records

The tweet's message and any linked page will then be loaded and read by a simple language processing program, which counts the number of positive and negative words in the message and in the body of the linked page. The parsed tweet is then converted into structured records and stored in a SQLite database.

The following is the code snippet to convert tweets into records:

'''*************************************
#1. Import libraries and key variable values

'''
import json
import os
import re
import sqlite3
import importlib
#module names that begin with a digit cannot be used in an import
#statement, so the helper library is loaded via importlib instead
sentiment = importlib.import_module('7A_lib_cnt_sentiment')

#db file
db_path = 'parsed_tweets.db'
db_name = 'tweet_db'

#sql db
...
#load tweet json
...
#loop through the tweets
...
for tweet in data:
    ...
    tweet_txt_pos, tweet_txt_neg = sentiment.cnt_sentiment(tweet_txt)
    keywords, sentences_list, words_list = \
        sentiment.NER_topics(tweet_txt)
    ...
    if len(url_link) > 0:
        ...
        url_txt = sentiment.url_to_string(url)
        temp_tweet_link_txt_pos, temp_tweet_link_txt_neg = \
            sentiment.cnt_sentiment(url_txt)
        #analyze the text of the linked page, not the tweet itself
        link_keywords, link_sentences_list, link_words_list = \
            sentiment.NER_topics(url_txt)
        ...

Three functions are called by the preceding program: one counts the positive and negative words, one identifies the topics concerned, and one retrieves the text from the URL given in the tweet.

The following code snippet defines the functions used in the program:

import os
import requests
from bs4 import BeautifulSoup
import re
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

...
#count the positive and negative sentiment words in the given text
def cnt_sentiment(text_to_be_parsed):
    ...

def noun_phrase(sentence, item_list, lower):
    ...

#NER
import spacy
from spacy import displacy
from collections import Counter
import math

#spaCy can only parse text shorter than 1,000,000 characters,
#so long documents are processed in chunks
def NER_topics(text_to_be_parsed):
    ...
    MAX_SIZE = 100000
    ...
    for nlp_cnt in range(number_nlp):
        start_pos = nlp_cnt * MAX_SIZE
        end_pos = min(MAX_SIZE, txt_len - start_pos) + start_pos - 1
        txt_selected = text_to_be_parsed[start_pos:end_pos]
        ...
        sentences_list = [x for x in article.sents]
        full_sentences_list += sentences_list
        for sent in sentences_list:
            phrases_list = []
            phrases_list, items_list = noun_phrase(sent, items_list, \
                                                   lower=True)
            ...

#convert the URL's content into string
def url_to_string(url):
    ...
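The bodies of these helper functions are elided. As a rough sketch, cnt_sentiment might look like the following, assuming the positive and negative word lists are loaded from simple text files; the file names here are illustrative, not the book's actual paths:

import re

def load_word_set(path):
    #one word per line, loaded into a lowercase set
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

positive_set = load_word_set('positive_words.txt')   #hypothetical lexicon file
negative_set = load_word_set('negative_words.txt')   #hypothetical lexicon file

def cnt_sentiment(text_to_be_parsed):
    #return the counts of positive and negative words found in the text
    words = re.findall(r'[a-z]+', text_to_be_parsed.lower())
    pos_cnt = sum(1 for w in words if w in positive_set)
    neg_cnt = sum(1 for w in words if w in negative_set)
    return pos_cnt, neg_cnt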

Performing sentiment analysis

The database that stores the parsed tweets is then read by another program. The sentiment of each security is represented by an aggregate sentiment on a daily basis. Each tweet's sentiment score is the number of positive words minus the number of negative words, normalized so that the score falls within the range of -1 to +1, with -1 representing a totally negative tweet and +1 a totally positive one. Each day's sentiment score for a security is calculated as the average of the sentiment scores of all that day's tweets about the security. The sentiment scores of all securities in the same industry are plotted on a graph, similar to the following:

For example, in the short period of our coverage, Dominion Energy has one of the most favorable sentiment scores (between Oct 29 and Oct 30).

The sample output of Dominion Energy is shown in the following graph:

The sentiment is the orange line and the price is the blue line (please refer to the color graphs provided in the graphics bundle of this book).

The following is the code snippet for sentiment analysis:

'''*************************************
#1. Import libraries and key variable values

'''
import sqlite3
import pandas as pd
import plotly
import plotly.graph_objs as go
import quandl
import json

# Create your connection.
db_path = 'parsed_tweets.db'
cnx = sqlite3.connect(db_path)
db_name = 'tweet_db'

'''*************************************
#2. Gauge the sentiment of each security

'''
...
sql_str = ...
...
print('Sentiment across securities')
field_list = ['positive','negative']
for sec in sec_list:
    ...
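The gauging logic is elided above. Reusing the cnx connection and db_name defined in the snippet, a minimal sketch of the daily aggregation might look like the following; the column names 'security', 'tweet_date', 'pos_cnt', and 'neg_cnt' are assumptions about the schema written by the parsing script:

sql_str = "SELECT security, tweet_date, pos_cnt, neg_cnt FROM " + db_name
df_tweets = pd.read_sql_query(sql_str, cnx)

#per-tweet score in [-1, +1]: (positive - negative) / (positive + negative)
denom = (df_tweets['pos_cnt'] + df_tweets['neg_cnt']).replace(0, 1)
df_tweets['score'] = (df_tweets['pos_cnt'] - df_tweets['neg_cnt']) / denom

#average the per-tweet scores by security and by day
daily_sentiment = (df_tweets
                   .groupby(['security', 'tweet_date'])['score']
                   .mean()
                   .reset_index())
print(daily_sentiment.head())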

Comparing the daily sentiment against the daily price

After we obtain the sentiment score for each stock, we also want to know the predictive power, or the influence, of the sentiment on the stock price. The stock price for each day is taken as the midpoint of the day's high and low. For each stock, we plot and compare the sentiment and the stock price over a period of time. The following screenshot illustrates PG&E Corp's sentiment versus its stock price:

The following is the code snippet for daily sentiment analysis data against the daily price:

#run it on different companies
print('Retrieve data')
df_comp = pd.read_csv('ticker_companyname.csv')
corr_results={}

for index, row in df_comp.iterrows():
    tkr = row['ticker']
    name = row['name']

    target_sec = '"' + name + '"data.json'

    corr_result = price_sentiment(tkr, target_sec, date_range)
    try:
        corr_results[name] = corr_result['close'][0]
    except Exception:
        continue

f_corr = open('corr_results.json', 'w')
json.dump(corr_results, f_corr)
f_corr.close()
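The price_sentiment helper is not shown in the book's snippet. The following is a rough sketch of what such a helper might do, simplified to take a ticker, a daily sentiment DataFrame with hypothetical 'ticker', 'tweet_date', and 'score' columns, and a (start, end) date pair; the Quandl dataset code 'EOD' is also an assumption:

import pandas as pd
import quandl

def price_sentiment(tkr, daily_sentiment, date_range):
    #daily prices from Quandl; the 'EOD' end-of-day dataset is assumed here
    prices = quandl.get('EOD/' + tkr,
                        start_date=date_range[0], end_date=date_range[1])
    #the day's price is taken as the midpoint of the high and the low
    prices['close'] = (prices['High'] + prices['Low']) / 2

    #daily sentiment for this ticker, indexed by date
    sent = daily_sentiment[daily_sentiment['ticker'] == tkr] \
        .set_index('tweet_date')['score'].rename('sentiment')
    sent.index = pd.to_datetime(sent.index)

    #align the two series on their common dates and correlate them
    combined = pd.concat([sent, prices['close']], axis=1, join='inner')
    return combined.corr()

Note that the version called in the preceding snippet takes the tweet JSON file name (target_sec) rather than a DataFrame, so its signature differs from this sketch.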

Congratulations! You have developed a program to assist sales in finding popular securities to develop products for.

Comparing this example to the technical analysis examples, we can see that the information content of sentiment is far richer than that of the technical trend alone. So far, we have only looked at the primary impact of trends, fundamentals, and sentiment; however, companies are interconnected in our society. So how can we model the linkages between firms and individuals? This brings us to the next topic: network analysis.

Network building and analysis using Neo4j

As sell-side analysts, besides gauging the primary impact of news on a company, we should also find out the secondary effects of that news. In our example, we will identify the suppliers, customers, and competitors of the company, so that any news about its stock can be traced through to them.

We can do this using three approaches:

  • By means of direct disclosure, such as annual reports
  • By means of secondary sources (media reporting)
  • By means of industry inferences (for example, raw materials industries, such as the oil industry, supply the inputs of transportation industries)

In this book, we use direct disclosure from the company to illustrate the point.

We are playing the role of equity researchers covering the company's stock, and one of our key roles is to understand the connections of the relevant parties to the company. We seek to find the related parties of the company, Duke Energy, by reading its annual report.

Solution

There are a total of four steps. The following diagram shows the data flow:

We will now look at the steps in more detail in the following sections.

Using PDFMiner to extract text from a PDF

Besides storage, we also need to extract relationships from text documents. Before we can start working with the text, we need to convert the PDF data into text. To do this, we use a library called PDFMiner (specifically, the module called pdfminer.six (https://github.com/pdfminer/pdfminer.six) for Python 3+). PDF is an open standard for describing a document. It stores the lines, text, images, and their exact locations in the document. We will only use a basic function of PDFMiner to extract the text from the document. Even though we could also extract the coordinates, we will skip this to simplify our work. Upon extracting the text, we append all the lines into one super-long line.
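As a minimal sketch, the extraction step can be done with the high-level extract_text helper of pdfminer.six; the book's own 7B_lib_parser_pdf module presumably wraps similar logic:

from pdfminer.high_level import extract_text

pdf_path = 'annualrpt/NYSE_DUK_2017.pdf'
text = extract_text(pdf_path)        #all pages, concatenated into one string
text = ' '.join(text.split())        #collapse line breaks into one long line
print(text[:500])                    #preview the first 500 characters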

The following code snippet imports the necessary libraries and initializes a PDF file to be processed:

'''*************************************
#1. Import relevant libraries and variables

'''
#custom-made helper modules; their names begin with a digit, so they
#cannot be used in an import statement and are loaded via importlib
import importlib
entitiesExtraction = importlib.import_module('7B_lib_entitiesExtraction')
pdf_parser = importlib.import_module('7B_lib_parser_pdf')
import json
import sqlite3

pdf_path = 'annualrpt/NYSE_DUK_2017.pdf'
...

Entity extractions

We deploy a linguistic analysis approach called part-of-speech (POS) tagging to decide whether words X and Z refer to a company or a person, and whether Y is a product or service. For example, in a sentence such as 'X sells Y to Z', it is the sentence structure that tells us these words are nouns, not any prior knowledge of what X, Y, and Z actually are.

However, POS tagging alone is still not enough to label an entity. An entity is a standalone subject or object. Since there are far too many candidate entities, we only tag entities whose first letter is uppercase, as these are the unique organizations or assets that are pertinent to our work.

The entity types include ORG, PERSON, FAC, NORP, GPE, LOC, and PRODUCT, that is, organization, person, facility, nationality or religious or political group, geopolitical entity, location, and product, as defined by the SpaCy model.

Upon getting the text chunk from the PDF in step 1, we run SpaCy to extract the entities from each of the sentences. For each sentence, we store the entity types and entities in a database record. SpaCy has a technical limitation on the length of the document it can analyze; therefore, we cut the very long text into several chunks to respect this limitation. However, this comes at the price of chopping up the sentences that straddle the chunk boundaries. Considering that we are handling hundreds of pages, we will take this shortcut. Of course, a better way would be to cut each chunk near its boundary at sentence-ending punctuation, so that complete sentences are preserved, as sketched below.
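The following is a minimal sketch of that punctuation-aware splitting, cutting the document into chunks of at most max_size characters but only at the end of a sentence; the function name and chunk size are illustrative:

def split_on_sentences(text, max_size=100000):
    #split text into chunks no longer than max_size, cutting only after '. '
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_size, len(text))
        if end < len(text):
            #walk back to the last sentence-ending period within the window
            cut = text.rfind('. ', start, end)
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks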

The following code snippet depicts how to extract various entities:

'''*************************************
#2. NLP

'''
#Named Entity Extraction
print('ner')
#see if we need to convert everything to lower case words - we keep the original format for this case
lower=False
common_words, sentences, words_list,verbs_list = entitiesExtraction.NER_topics(text,lower)
entities_in_sentences = entitiesExtraction.org_extraction(text)
...
#create this list to export the list of ent and cleanse them
...
print('looping sentences')
for sentence in entities_in_sentences:
    ents_dict[sentence_cnt] = {}
    for entity in sentence:
        ...
        if ent_type in ('ORG', 'PERSON', 'FAC', 'NORP', 'GPE', 'LOC', 'PRODUCT'):
            ...
        #handle other types
        ...
Entity classification via the lexicon: For our use case, we need to further classify the organizations as suppliers, customers, competitors, investors, governments, or sister companies/assets. For example, banks that are the credit investors of the company must first be classified as banks before they can be inferred to be the credit investors/bankers of the company in its annual report. So some of the relationships require us to check the organizations against a database in order to classify them further. Acquiring such knowledge requires us to download the relevant databases; in our case, we use Wikipedia to download the list of banks. Only by checking names against this list of banks can we classify an organization as a bank or not. We did not perform this step in our example, as we do not have the lexicon set that is normally available to banks. A sketch of such a lookup is shown below.
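A minimal sketch of this lexicon lookup is shown below, assuming a pre-downloaded list of bank names saved one per line; the file name and the category labels are illustrative:

def load_lexicon(path):
    #one organization name per line, loaded into a lowercase set
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

bank_lexicon = load_lexicon('banks_wikipedia.txt')   #hypothetical lexicon file

def classify_org(org_name):
    #return a coarse category for an organization extracted by SpaCy
    if org_name.lower() in bank_lexicon:
        return 'BANK'          #later refined to credit investor/banker
    return 'UNCLASSIFIED'      #suppliers, customers, etc. need other lexicons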

Using NetworkX to store the network structure

After processing the data, the entities are stored in a SQL database and further analyzed by NetworkX, a Python package that handles network data. Edges and nodes are the building blocks of any graph; however, there are many more indicators for measuring and describing a graph, as well as the position of a node or edge within it. What matters for our work now is to see whether the nodes are connected to the company in focus, and what type of connection they have.

At the end of the NetworkX step, the graph data is still pretty abstract. We need better interactive software to query and handle the data. Therefore, we will output the data as CSV files for Neo4j to handle further, as it provides a user interface for interacting with the data.

It is, however, still far from being usable; a lot of time is required to cleanse the dataset and define the types of relationships involved. Neo4j is a full-blown graph database that can satisfy complex relationship structures.

A relationship must be established between the entities mentioned in the company's annual report and the entities stored in the database. In our example, we did not do any filtering of entities, as the NLP model in the previous step has a lift of 85% and therefore does not spot entities perfectly. We extract only people and organizations as entities. For the type of relationship (edge), we do not differentiate between different edge types.

After defining the network structure, we prepare a list that stores the nodes and edges and generate a graph via matplotlib, which by itself is not sufficient for manipulation or visualization. Therefore, we output the data from NetworkX to CSV files: one storing the nodes and the other storing the edges.

The following is the code snippet for generating a network of entities:

'''*************************************
#1. Import relevant libraries and variables

'''
#generate network
import sqlite3
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

#db file
db_path = 'parsed_network.db'
db_name = 'network_db'

#sql db
conn = sqlite3.connect(db_path)
c = conn.cursor()

...

network_dict={}
edge_list=[]
curr_source =''
curr_entity = ''
org_list = []
person_list = []

'''*************************************
#2. generate the network with all entities connected to Duke Energy - whose annual report is parsed

'''
target_name = 'Duke Energy'
#loop through the database to generate the network format data
for index, row in df_org.iterrows():
    ...

#Generate the output in networkX
print('networkx')

#output the network
G = nx.from_edgelist(edge_list)
pos = nx.spring_layout(G)
nx.draw(G, with_labels=False, node_color='r', pos=pos, edge_color='b')
plt.savefig('network.png')
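Before exporting the data, it can be useful to inspect the graph programmatically. The following is a small sketch that reuses the G and target_name defined above to list the entities directly linked to Duke Energy and the most connected nodes overall:

#entities directly linked to the company in focus
if target_name in G:
    neighbors = list(G.neighbors(target_name))
    print('Entities directly linked to', target_name, ':', len(neighbors))

#rank nodes by degree centrality to spot the most connected entities
centrality = nx.degree_centrality(G)
top_nodes = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for node, score in top_nodes:
    print(node, round(score, 4))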

Using Neo4j for graph visualization and querying

We will install Neo4j and import the CSV files to construct the data network in Neo4j, the industry-grade graph database. Unfortunately, Neo4j requires its own query language, called Cypher, to manipulate its data. Cypher allows us to extract and search for the data we need.

We generate the files required for Neo4j. The following code snippet initializes Neo4j:

#Generate output for Neo4j
print('prep data for Neo4j')
f_org_node=open('node.csv','w+')
f_org_node.write('nodename\n')

f_person_node=open('node_person.csv','w+')
f_person_node.write('nodename\n')

f_vertex=open('edge.csv','w+')
f_vertex.write('nodename1,nodename2,weight\n')
...

In the terminal, we copy the output files to the home directory of Neo4j. The following are the commands to be executed from the terminal:

sudo cp '[path]/edge.csv' /var/lib/neo4j/import/edge.csv

sudo cp '[path]/node.csv' /var/lib/neo4j/import/node.csv

sudo service neo4j restart

We then log in to Neo4j via the browser. The following is the URL to enter into the browser:

http://localhost:7474/browser/

The following is the sample code snippet for Neo4j Cypher:

MATCH (n) DETACH DELETE n;

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///node.csv" AS row
CREATE (:ENTITY {node: row.nodename});

CREATE INDEX ON :ENTITY(node);


USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///edge.csv" AS row
MATCH (vertex1:ENTITY {node: row.nodename1})
MATCH (vertex2:ENTITY {node: row.nodename2})
MERGE (vertex1)-[:LINK]->(vertex2);

MATCH (n:ENTITY)-[:LINK]->(:ENTITY) RETURN n;

The following screenshot is the resulting output:

Congratulations! You have managed to extract lots of important names/parties from the annual report that you need to focus your research on for further analysis.

Summary

In this chapter, we learned about the behavior of the sell side of the market. We learned what sentiment analysis is and how to use it, and we looked at an example of sensing market needs using sentiment analysis. We also learned about network analysis using Neo4j, a NoSQL graph database, and about text mining using the PDFMiner tool.

In the next chapter, we will learn how to use bank APIs to build personal wealth advisers. Consumer banking will be a focus of the chapter. We will learn how to access the Open Bank Project to retrieve financial health data. We will also learn about document layout analysis in the chapter. Let's jump into it without any further ado.