Author: Brian A. Ree
0: Tensorflow Logistic Regression: Intro
Welcome back to our tutorial series on TensorFlow. In this segment we'll be augmenting our existing code to include a logistic regression example
using TensorFlow. If you haven't done so already, you should follow the previous tutorials, starting here, to get your knowledge and code set up, or just to review
what we've covered so far. With a linear regression model, our neural network is trained to predict a continuous value from a linear relationship in the input data. With logistic regression, we
are using a neural network to answer a yes-or-no question. In this case we'll be using our model to answer the question: will tomorrow's closing stock price be higher than today's closing price?
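Before we dig into the plumbing, here's a minimal sketch of what logistic regression boils down to in TensorFlow terms. This is illustrative only: the shapes, the zero-initialized weights, and the learning rate are placeholders, and our real model will live in the RegModelLogistic class we implement in the next tutorial.

import tensorflow as tf

# Minimal logistic regression sketch (TensorFlow 1.x style).
# X holds one row of five features per trading day; Y holds the label:
# 1.0 if the next day's close was higher, 0.0 otherwise.
X = tf.placeholder(tf.float32, [None, 5])
Y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([5, 1]))
b = tf.Variable(tf.zeros([1]))
logits = tf.matmul(X, W) + b
prediction = tf.sigmoid(logits)  # squashed into (0, 1), read as P(yes)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)  # illustrative rate

The only structural difference from our linear regression model is the sigmoid squashing the output into (0, 1) and the cross-entropy loss; everything else, loading data, splitting it, and looping over training steps, carries over.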
1: Tensorflow Logistic Regression: Setup
Most of the code we've written is very flexible. Our CSV parsing and loading class can stay as it is; we'll just use it to load some new stock data for our logistic
regression sample. So we won't need to do anything new until we get to the feature generation and statistics step. Let's add a new entry to
our Data2DataRow class to handle the new data we want to load. Below is the new entry we'll be using.
# use -1 to ignore loading a column
mapping = {
    "google_price":
    {
        "Date": "0",
        "Open": "1",
        "Close": "2",
        "High": "3",
        "Low": "4",
        "Volume": "5",
        "Symbol": "-1",   # 6
        "sma_100": "-1",  # 7
    },
    "weight_age":
    {
        "Weight": "0",
        "Age": "1",
        "BloodFat": "2"
    },
    "google_price2":
    {
        "Date": "0",
        "Open": "1",
        "Close": "2",
        "High": "3",
        "Low": "4",
        "Volume": "5",
        "Symbol": "-1",  # 6
    }
}
You can see that we've added a new entry called google_price2. It's the same underlying data structure from the same source, Google Finance, so we don't need
to add anything new to it. We won't be generating a 100 day running average in this case, though, so we can drop that column; everything else stays the same. To make the "-1" convention concrete, here's a rough sketch of how a mapping like this can drive the parser.
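This sketch is illustrative only; the real parsing lives in the CSV loading class from the earlier tutorials, and parse_row is a made-up helper, not a method from our engine.

# Hypothetical sketch of how a column mapping with "-1" entries
# could be consumed: columns mapped to "-1" are simply skipped.
def parse_row(raw_values, column_map):
    row = {}
    for name, idx in column_map.items():
        if idx == "-1":
            continue  # ignore this column entirely
        row[name] = raw_values[int(idx)]
    return row

# parse_row(["2017-01-03", "778.81", "786.14", "789.63", "775.80",
#            "1657300", "goog"], mapping["google_price2"])
# -> Symbol is dropped; everything else is keyed by its column name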
We're going to be using daily stock data and logistic regression to see if we can predict whether tomorrow's closing price will be higher or lower than today's closing price.
Now of course there is no correlation that would make this predictable in a stock unless that stock is trending. Trending is a state attributed to stocks that are behaving in a more
predictable manner due to the current market forces. No worries, we're having fun here, and who knows: if you build a more advanced prediction system that can predict the
next day's closing price change with 60% accuracy, you're 10% above the expected 50% coin-flip probability, cha ching! Just kidding. Back to the code. We'll be using the google_price2
data map to bring in our data, but we'll need to tell our little engine how to handle feature generation, so we'll look at that code next.
columns = {
    "goog_lin_reg_avg100day": ['Close', 'Open', 'sma_100'],
    "weight_age_lin_reg_blood_fat": ['Weight', 'Age'],
    "goog_log_reg": ['Close', 'Open', 'High', 'Low', 'Volume'],
}
We're going to be pulling in a little more data than with our linear regression example. The google_log_reg entry above defines the columns of data we'll be using
for our logistic regression model. I want to bring in all the columns available in the daily stock price CSV because I want the neural network to have as much data as possible
when trying to find a pattern. Wow, was that fast. With almost no effort we're primed to load new data and generate features and statistics. This is the part
of coding I enjoy the most: seeing how good planning and code structure can help you solve more problems faster. Instead of recoding and reworking things we can jump right in and
solve a new problem. But wait! How do we utilize our new data and features, you ask? There are two places in our code that we crudely left open for customization. One such place is in
our LoadFeatureData class, which allows us to generate custom features based on the data we're loading. The second customization point is in the evaluation method
of our RegModelLinear and RegModelLogistic classes. Before we dive in, here's a quick sketch of how a column list like goog_log_reg turns a row into a feature vector.
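The helper below is hypothetical; only getMemberByName is real, and you'll see it used in the feature generation code that follows.

# Illustrative only: build a model input vector from one DataRow
# using a column list such as columns['goog_log_reg'].
def row_to_features(row, column_names):
    # getMemberByName returns the raw string parsed from the CSV
    return [float(row.getMemberByName(name)) for name in column_names]

# row_to_features(some_row, ['Close', 'Open', 'High', 'Low', 'Volume'])
# might yield [786.14, 778.81, 789.63, 775.80, 1657300.0]

With that picture in mind, let's take a look at our custom feature generation code.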
elif type == 'goog_log_reg':
    self.resetRows()
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    rownum = 0
    length = len(lrows)
    for i in xrange(length):
        try:
            # conversion tests: a non-numeric value in any of these
            # columns raises and flags the row as an error below
            float(lrows[i].getMemberByName('Close'))
            float(lrows[i].getMemberByName('Open'))
            float(lrows[i].getMemberByName('High'))
            float(lrows[i].getMemberByName('Low'))
            if i + 1 < length:
                # label: 1.0 if tomorrow's close beats today's, else 0.0
                if float(lrows[i + 1].getMemberByName('Close')) > float(lrows[i].getMemberByName('Close')):
                    lrows[i].setMember('Answer', 7, 1.0)
                else:
                    lrows[i].setMember('Answer', 7, 0.0)
                # eif
            else:
                # the last row has no tomorrow to compare against
                lrows[i].setMember('Answer', 7, 0.0)
            # eif
            # categorical answer columns used by the logistic model
            if float(lrows[i].getMemberByName('Answer')) == 1.0:
                lrows[i].setMember('AnswerCatYes', 8, 2.00)
                lrows[i].setMember('AnswerCatNo', 9, 1.00)
            else:
                lrows[i].setMember('AnswerCatYes', 8, 1.00)
                lrows[i].setMember('AnswerCatNo', 9, 2.00)
            # eif
        except Exception:
            # any conversion failure marks this row as bad data
            lrows[i].error = True
        # etry
        rownum = rownum + 1
    # efl
    print ("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    self.rowCount = len(lrows)
    print ('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
    # eif
# eif
In our custom feature generation code we load our CSV data into a local data structure. As we loop through the data we set our answer column value.
The answer column is the column that the RegModelLinear and RegModelLogistic classes use to evaluate their performance, and they expect it to be there,
so we'll gladly set that column up now. The float conversion tests on the stock price data are done to detect any errors in our imported data.
If an exception occurs at this point, the except clause executes and we mark the data row as having an error. Very cool. Now back to that answer column.
We look forward to the next day's closing price. If that price is greater than today's closing price we set our answer value to 1.
If not we set our answer value to 0. Once this is done we sort the rows by their import id, ascending, which just maintains their original import order.
We then clean out any data rows we marked with an error and bam, we're good to go. We've gotten a lot done in almost no time at all. Let's sanity check
the labeling rule with a tiny example before defining our execution dictionary in Main.
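Here's a quick toy run of the same rule outside the engine; the closing prices are made up.

# Toy demonstration of the labeling rule used above: each day is
# labeled 1.0 if the NEXT day's close is higher, 0.0 otherwise.
closes = [100.0, 101.5, 101.0, 103.2]
answers = []
for i in xrange(len(closes)):
    if i + 1 < len(closes) and closes[i + 1] > closes[i]:
        answers.append(1.0)
    else:
        answers.append(0.0)
print (answers)  # [1.0, 0.0, 1.0, 0.0] -- the last day defaults to 0.0

Now we're ready to define our execution dictionary in Main.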
"goog_log_reg":
{
'type': 'csv',
'data_2_datarow_type': 'google_price2',
'datarow_2_tensor_type': 'goog_log_reg',
'version': '1.0',
'reset': False,
'checkpoint': False,
'limitLoad': False,
'cleanData': True,
'verbose': False,
'rowLimit': 125,
'validatePrct': 0.30,
'trainPrct': 0.70,
'randomSeed': False,
'trainStepsMultiplier': 1,
'learning_rate': 0.000001,
'log_reg_positive_result': 0.50,
'lin_reg_positive_result': 0.00,
'model_type': 'logistic_regression',
'loader': 'load_csv_data',
'feature_type': 'goog_log_reg',
'logPrint': 50,
'eval_type': '',
'files': {
'file1': {'name': dataDir + "/ivv.csv.xls", 'appendCols': [{'Symbol': 'ivv', 'idx': '6'}]},
}
},
You can see we've specified our new data loading entry, 'google_price2', our new feature generation entry, 'goog_log_reg', and our feature type, 'goog_log_reg'.
The entries here tell our engine how to load and process our CSV file in a data driven way. If limitLoad were enabled, rowLimit would cap the data at 125 rows, and the
0.70/0.30 train/validate split would then hand roughly 87 rows to training and 38 to validation. We're now just about ready to run some logistic regression
sample code. This tutorial is kind of short, so we'll stop here and pick up in the next one with an implementation of the RegModelLogistic class.