Author: Brian A. Ree
1: Tensorflow Linear Regression Generating Features
Welcome to the second part of our tensorflow tutorials on linear regression models. In our first tutorial
we learned how to import data from CSV files in a general and abstracted way. We learned about our DataRow and LoadCsvData
classes and also about some of our supporting classes and files. Next up we're going to start generating features and statistics on our loaded data.
This is the one part of the process that is difficult to abstract, because the features and statistics we generate depend on the column names and data
in our CSV file. Because of this we make LoadFeatureData a customizable class where you can create your own code to run statistics on a certain
set of data. Let's take a look at our LoadFeatureData class.
class LoadFeatureData:
    """ A class for adding calculated features to the data rows loaded from csv. """
    rows = []
    rowCount = 0
    limitLoad = True
    rowLimit = 25
    cleanData = False
    cleanCount = 0
    verbose = False
    loadCsvData = None

    def __init__(self, lLimitLoad=False, lRowLimit=-1, lCleanData=False, lVerbose=False, lLoadCsvData=None):
        self.limitLoad = lLimitLoad
        self.rowLimit = lRowLimit
        self.cleanData = lCleanData
        self.verbose = lVerbose
        self.loadCsvData = lLoadCsvData
    # edef

    def generateData(self, type='goog_stock_sma100'):
        print("")
        print("")
        print("Generating Feature Data: Type: " + type)
        if type == '':
            self.rowCount = self.loadCsvData.rowCount
            rownum = self.rowCount
            lrows = []
            lrows.extend(self.loadCsvData.rows)
            print("Loaded %i rows from this data file." % (rownum))
            lrows = self.sortRows(lrows)
            self.cleanRows(lrows)
            self.rows.extend(lrows)
            print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
        elif type == 'goog_stock_sma100':
            self.resetRows()
            lrows = []
            lrows.extend(self.loadCsvData.rows)
            rownum = 0
            avg = 0
            target = 100
            cnt = 0
            # Flag any row whose price columns aren't numeric so cleanRows can remove it.
            for i in xrange(len(lrows)):
                try:
                    float(lrows[i].getMemberByName('Close'))
                    float(lrows[i].getMemberByName('Open'))
                    float(lrows[i].getMemberByName('High'))
                    float(lrows[i].getMemberByName('Low'))
                except:
                    lrows[i].error = True
            # efl
            self.cleanRows(lrows)
            # Seed value used until we have a full window of history.
            startingVal = float(lrows[0].getMemberByName('Close'))
            for i in xrange(len(lrows)):
                # The answer in linear regression models will always be stored in the 'Answer' column
                lrows[i].setMember('Answer', 8, float(lrows[i].getMemberByName('Close')))
                if i < target:
                    lrows[i].setMember('sma_100', 7, startingVal)
                else:
                    # Average the previous `target` closes: rows (i - target) through (i - 1).
                    avg = 0
                    for j in xrange(i - target, i):
                        avg += float(lrows[j].getMemberByName('Close'))
                    # efl
                    if self.verbose == True:
                        print('sma_100: %f' % (float(avg) / float(target)))
                    # eif
                    lrows[i].setMember('sma_100', 7, (float(avg) / float(target)))
                # eif
                if self.limitLoad == True and cnt >= self.rowLimit and self.rowLimit > 0:
                    break
                # eif
                rownum += 1
                cnt += 1
            # efl
            print("Loaded %i rows from this data file." % (rownum))
            lrows = self.sortRows(lrows)
            self.cleanRows(lrows)
            self.rows.extend(lrows)
            self.rowCount += len(lrows)
            print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
        # eif
    # edef

    def resetRows(self):
        self.rows = []
        self.rowCount = 0
        self.cleanCount = 0
    # edef

    def sortRows(self, lrows):
        # Sort by the DataRow's own unique id member, not the builtin id() function.
        return sorted(lrows, key=lambda row: row.id)
    # edef

    def cleanRows(self, lrows):
        if self.cleanData == True:
            print("Cleaning row data...")
            should_restart = True
            while should_restart:
                should_restart = False
                for row in lrows:
                    if row.error == True:
                        lrows.remove(row)
                        self.rowCount -= 1
                        self.cleanCount += 1
                        should_restart = True
                    # eif
                # efl
            # fwl
        # eif
    # edef
# eclass
Let's review our class variables first.
- rows: A list of row data this class maintains.
- rowCount: The number of rows loaded into the rows list.
- limitLoad: A flag that limits the loaded data by the value in rowLimit.
- rowLimit: The maximum number of rows to load.
- cleanData: A flag that indicates if we should clean data or not.
- cleanCount: The number of rows cleaned from the rows list.
- verbose: A boolean flag that indicates whether or not verbose logging is turned on.
- loadCsvData: An instance of our LoadCsvData class where we will be accessing the loaded rows.
Next let's go over our class methods; a short usage sketch follows the list. If you need to review the data loading process that gets our data
from CSV files into our LoadCsvData class, please review this tutorial.
- __init__: The default constructor for the class. Takes arguments providing data and configuration settings.
- generateData: Generates the feature data on the passed in LoadCsvData class data.
- resetRows: Resets local class data variables.
- sortRows: Sorts the loaded rows by unique id.
- cleanRows: Cleans the loaded data rows by removing any that have an internal error flag set to true.
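Putting the pieces together, typical usage looks something like this. This is a hypothetical sketch: data stands in for a LoadCsvData
instance that has already loaded its CSV rows.
import LoadFeatureData

# Hypothetical usage; `data` is a LoadCsvData instance with rows already loaded.
fData = LoadFeatureData.LoadFeatureData(False, -1, True, False, data)
fData.generateData('goog_stock_sma100')
print(len(fData.rows))   # DataRow objects, now carrying 'sma_100' and 'Answer'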
Now let's look at some of the more important methods in this class. Many of the methods you see here are borrowed from the
LoadCsvData class, so we'll skip over those and get right to the nitty gritty. I mentioned it earlier, but I'll go over it again here.
The process we've created in code is meant to be as generic and general as possible so that we can support different types of data easily
without having to write a lot of new custom code. To this end we push the proprietary code, the code that is tied to unique features of our data and
that cannot be generalized, into our LoadFeatureData class. We do this by allowing users to pass in a string that defines which set of
statistics and feature generation code to run. This string is data driven in our execution configuration dictionary, but the actual statistics and the columns
we use to generate them are proprietary, so we assume our end user has created a special section to generate stats for their data.
It'll all make sense in a little bit. Let's review some code.
def generateData(self, type='goog_stock_sma100'):
    print("")
    print("")
    print("Generating Feature Data: Type: " + type)
    if type == '':
        self.rowCount = self.loadCsvData.rowCount
        rownum = self.rowCount
        lrows = []
        lrows.extend(self.loadCsvData.rows)
        print("Loaded %i rows from this data file." % (rownum))
        lrows = self.sortRows(lrows)
        self.cleanRows(lrows)
        self.rows.extend(lrows)
        print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
    elif type == 'goog_stock_sma100':
        self.resetRows()
        lrows = []
        lrows.extend(self.loadCsvData.rows)
        rownum = 0
        avg = 0
        target = 100
        cnt = 0
        # Flag any row whose price columns aren't numeric so cleanRows can remove it.
        for i in xrange(len(lrows)):
            try:
                float(lrows[i].getMemberByName('Close'))
                float(lrows[i].getMemberByName('Open'))
                float(lrows[i].getMemberByName('High'))
                float(lrows[i].getMemberByName('Low'))
            except:
                lrows[i].error = True
        # efl
        self.cleanRows(lrows)
        # Seed value used until we have a full window of history.
        startingVal = float(lrows[0].getMemberByName('Close'))
        for i in xrange(len(lrows)):
            # The answer in linear regression models will always be stored in the 'Answer' column
            lrows[i].setMember('Answer', 8, float(lrows[i].getMemberByName('Close')))
            if i < target:
                lrows[i].setMember('sma_100', 7, startingVal)
            else:
                # Average the previous `target` closes: rows (i - target) through (i - 1).
                avg = 0
                for j in xrange(i - target, i):
                    avg += float(lrows[j].getMemberByName('Close'))
                # efl
                if self.verbose == True:
                    print('sma_100: %f' % (float(avg) / float(target)))
                # eif
                lrows[i].setMember('sma_100', 7, (float(avg) / float(target)))
            # eif
            if self.limitLoad == True and cnt >= self.rowLimit and self.rowLimit > 0:
                break
            # eif
            rownum += 1
            cnt += 1
        # efl
        print("Loaded %i rows from this data file." % (rownum))
        lrows = self.sortRows(lrows)
        self.cleanRows(lrows)
        self.rows.extend(lrows)
        self.rowCount += len(lrows)
        print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
    # eif
# edef
The first thing you should notice is that we've built in a passthrough feature where nothing is done to the passed in
DataRow objects except to push them into the local data list. This lets us bypass the feature generation step when we don't need it,
without altering any of our execution code. So we're trying to keep this proprietary section of code as flexible as possible.
if type == '':
    self.rowCount = self.loadCsvData.rowCount
    rownum = self.rowCount
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    print("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
The bypass code is triggered by an empty string: it simply loads the rows from the LoadCsvData class
into a local list, runs a sort and clean step on the data, then pushes the local list into the class data list.
And bam, we're done. Nothing too crazy being done here, but very useful code indeed.
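Invoking the passthrough might look like this; a hypothetical snippet, again assuming a populated LoadCsvData instance named data.
fData = LoadFeatureData.LoadFeatureData(False, -1, True, False, data)
fData.generateData('')   # an empty type string triggers the passthrough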
Next we'll look at an actual feature and statistics generation example.
elif type == 'goog_stock_sma100':
    self.resetRows()
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    rownum = 0
    avg = 0
    target = 100
    cnt = 0
    # Flag any row whose price columns aren't numeric so cleanRows can remove it.
    for i in xrange(len(lrows)):
        try:
            float(lrows[i].getMemberByName('Close'))
            float(lrows[i].getMemberByName('Open'))
            float(lrows[i].getMemberByName('High'))
            float(lrows[i].getMemberByName('Low'))
        except:
            lrows[i].error = True
    # efl
    self.cleanRows(lrows)
    # Seed value used until we have a full window of history.
    startingVal = float(lrows[0].getMemberByName('Close'))
    for i in xrange(len(lrows)):
        # The answer in linear regression models will always be stored in the 'Answer' column
        lrows[i].setMember('Answer', 8, float(lrows[i].getMemberByName('Close')))
        if i < target:
            lrows[i].setMember('sma_100', 7, startingVal)
        else:
            # Average the previous `target` closes: rows (i - target) through (i - 1).
            avg = 0
            for j in xrange(i - target, i):
                avg += float(lrows[j].getMemberByName('Close'))
            # efl
            if self.verbose == True:
                print('sma_100: %f' % (float(avg) / float(target)))
            # eif
            lrows[i].setMember('sma_100', 7, (float(avg) / float(target)))
        # eif
        if self.limitLoad == True and cnt >= self.rowLimit and self.rowLimit > 0:
            break
        # eif
        rownum += 1
        cnt += 1
    # efl
    print("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    self.rowCount += len(lrows)
    print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
This is an example of a custom extension of this class to support proprietary feature and statistics generation.
We've named our new section goog_stock_sma100 because we're using Google stock data and we're generating a
100 day simple moving average on it. We reset our local data storage variables and load the data stored in our
LoadCsvData class into a local list. Then we run some checks on the data loaded in certain columns: basically we want to make sure
that we have numbers in each column, and if there is an exception during this process we set the local DataRow error flag to true.
This ensures the row gets cleaned out of our data set during the cleanRows method call.
The feature we are generating here is a simple moving average, and we're going to use linear regression and tensorflow
to come up with a model that predicts the next value of the 100 day moving average based on the data we have.
The looping structure is something you can review on your own: it's a double loop that calculates the 100 day moving average
from the hundred rows preceding the current row. You can see that a lot of this code can't really be data driven, or at least doing so is beyond the scope
of this tutorial.
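To make the windowing concrete, here is a minimal standalone sketch of the same calculation. The helper below is hypothetical
and not part of the tutorial code; it just mirrors the startingVal fallback and the averaging window.
# Hypothetical standalone helper showing the windowing logic generateData uses.
def simple_moving_average(closes, target=100):
    smas = []
    for i in range(len(closes)):
        if i < target:
            # Not enough history yet; fall back to the first close,
            # mirroring startingVal in generateData.
            smas.append(float(closes[0]))
        else:
            # Average the previous `target` closes: indices i - target through i - 1.
            window = closes[i - target:i]
            smas.append(sum(float(c) for c in window) / float(target))
    return smas

print(simple_moving_average([10, 11, 12, 13, 14], target=3))
# [10.0, 10.0, 10.0, 11.0, 12.0]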
Next we'll look at our DataRow2Tensor class. It's similar to our Data2DataRow class in that it provides a data driven way to specify
what columns of data end up in our tensor for processing.
columns = {
    "goog_lin_reg_avg100day": ['Close', 'High', 'Low', 'sma_100'],
    "weight_age_lin_reg_blood_fat": ['Weight', 'Age']
}
As you can see above the class is very simple and holds a dictionary that provides a data driven way to define our tensor shape.
Because the column names from our CSV file and the column names of our features and statistics can be different depending on what we're
trying to do, we've added an abstraction layer that allows end users to specify what data they want to load into their tensor by listing the
column names here.
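Supporting a new data set could be as simple as registering another entry; the mapping name and columns below are hypothetical.
# Hypothetical: mapping a custom model type to its input columns.
DataRow2Tensor.columns['my_custom_model'] = ['Open', 'Volume', 'my_feature']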
Next up we're going to take a look at our LoadTensorData class, which takes a LoadFeatureData instance as an argument.
You can see we're adjusting our data one step at a time and carrying forward the data from the previous step. This next step actually builds the tensors
we're going to run through our tensorflow linear regression model.
import tensorflow as tf
import numpy as np
import LoadFeatureData

class LoadTensorData:
    """ A simple class for converting feature data into tensor data. """
    rows = []
    rowCount = 0
    answers = []
    verbose = False
    loadFeatureData = None
    columnMap = []
    dataModelColCount = 0

    def __init__(self, lVerbose=False, lLoadFeatureData=None):
        self.verbose = lVerbose
        self.loadFeatureData = lLoadFeatureData
    # edef

    def generateData(self, lColumnMap=[]):
        print("")
        print("")
        print("Generating Tensor Data:")
        self.resetRows()
        self.columnMap = lColumnMap
        self.dataModelColCount = len(self.columnMap)
        # Convert base data
        val2 = []
        val3 = []
        rowcnt = 0
        for row in self.loadFeatureData.rows:
            val = []
            for col in self.columnMap:
                val.append(float(row.getMemberByName(col)))
            # efl
            val2.append(val)
            rowcnt += 1
            val3.append(float(row.getMemberByName('Answer')))
        # efl
        self.rows = tf.to_float(val2)
        self.answers = tf.to_float(val3)
        self.rowCount = rowcnt
        print("TensorRow Data Shape: %s" % self.rows.get_shape())
        print("TensorRow Answer Shape: %s" % self.answers.get_shape())
        print('TensorRow Count: %i' % (self.rowCount))
    # edef

    def resetRows(self):
        self.rows = []
        self.rowCount = 0
        self.answers = []
    # edef
# eclass
We're going to skip over a detailed class variable and method listing; the class is somewhat simple, and most of the
variables and class methods should be familiar from previous classes we've looked at. They have similar functionality and so
they have a familiar structure. Let's take a look at the generateData method next.
def generateData(self, lColumnMap=[]):
    print("")
    print("")
    print("Generating Tensor Data:")
    self.resetRows()
    self.columnMap = lColumnMap
    self.dataModelColCount = len(self.columnMap)
    # Convert base data
    val2 = []
    val3 = []
    rowcnt = 0
    for row in self.loadFeatureData.rows:
        val = []
        for col in self.columnMap:
            val.append(float(row.getMemberByName(col)))
        # efl
        val2.append(val)
        rowcnt += 1
        val3.append(float(row.getMemberByName('Answer')))
    # efl
    self.rows = tf.to_float(val2)
    self.answers = tf.to_float(val3)
    self.rowCount = rowcnt
    print("TensorRow Data Shape: %s" % self.rows.get_shape())
    print("TensorRow Answer Shape: %s" % self.answers.get_shape())
    print('TensorRow Count: %i' % (self.rowCount))
# edef
How can so much cool work be done by this little method, you ask? I don't know, it just does. Let's see how. First up, the internal
data storage variables are reset. Next we loop over the feature rich data and use our column mapping to pull the target columns into a list.
We load each row's column data as a list into a list of rows, so we have a list of lists to work with. Notice that the data driven column listing automatically
sets the dimensions of our tensor. At the very end of the method we call the tensorflow method to_float, which converts our list of lists into
a tensor object of the same values.
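Here is a toy version of that conversion, assuming TensorFlow 1.x where tf.to_float is available; the values are made up.
import tensorflow as tf  # TF 1.x, where tf.to_float is available

# Toy data: two rows, three mapped columns each, plus one answer per row.
val2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
val3 = [10.0, 20.0]
rows = tf.to_float(val2)      # shape (2, 3): rows x mapped columns
answers = tf.to_float(val3)   # shape (2,): one answer per row
print(rows.get_shape())       # (2, 3)
print(answers.get_shape())    # (2,)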
You can check the debugging output as we print the row count and the tensor dimensions at the end of the generateData call.
Congrats, we now have a tensor object of our data to begin running through our linear model. Let's take a look at our execution code so
we can see how the process has evolved.
def run(exeCfg):
    type = exeCfg['type']
    data_2_datarow_type = exeCfg['data_2_datarow_type']
    datarow_2_tensor_type = exeCfg['datarow_2_tensor_type']
    version = exeCfg['version']
    reset = exeCfg['reset']
    checkpoint = exeCfg['checkpoint']
    verbose = exeCfg['verbose']
    limitLoad = exeCfg['limitLoad']
    rowLimit = exeCfg['rowLimit']
    validatePrct = exeCfg['validatePrct']
    trainPrct = exeCfg['trainPrct']
    randomSeed = exeCfg['randomSeed']
    learning_rate = exeCfg['learning_rate']
    log_reg_positive_result = exeCfg['log_reg_positive_result']
    lin_reg_positive_result = exeCfg['lin_reg_positive_result']
    model_type = exeCfg['model_type']
    loader = exeCfg['loader']
    cleanData = exeCfg['cleanData']
    trainStepsMultiplier = exeCfg['trainStepsMultiplier']
    dataMap = Data2DataRow.mapping[data_2_datarow_type]
    files = exeCfg['files']
    featureType = exeCfg['feature_type']
    data = None
    fData = None
    tData = None
    tfModel = None
    print("Found loader: " + loader)
    if loader == 'load_csv_data':
        data = LoadCsvData.LoadCsvData()
        data.limitLoad = limitLoad
        data.rowLimit = rowLimit
        data.verbose = verbose
        for file in files:
            csvFileName = files[file]['name']
            appendCols = files[file]['appendCols']
            data.loadData(csvFileName, type, version, reset, dataMap, appendCols)
        # efl
    # eif
    if featureType != '':
        print("Found feature type: " + featureType)
        fData = LoadFeatureData.LoadFeatureData(limitLoad, rowLimit, cleanData, verbose, data)
        fData.generateData(featureType)
    else:
        print("Found no feature type.")
        fData = LoadFeatureData.LoadFeatureData(limitLoad, rowLimit, cleanData, verbose, data)
        fData.generateData('')
    # eif
    tData = LoadTensorData.LoadTensorData(verbose, fData)
    tData.generateData(DataRow2Tensor.columns[datarow_2_tensor_type])
If we take a look at our execution code we can see there are some new pieces to it. At the bottom we have a new block for running
the feature generation code. I put a special hook in to print a different message when no feature type is specified, i.e. passthrough mode.
Take a look at the next two lines and notice how similar our tensor generation calls are. Very clean and elegant at this level of the code; all our
complex methods are encapsulated by our classes. Such a thing of beauty.
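For reference, the feature related entries of the execution configuration dictionary might look something like this. The values are
hypothetical, and run reads several more keys than are shown here.
# Hypothetical values for the keys this tutorial touches; run() also reads
# type, version, reset, checkpoint, trainPrct, learning_rate, and the rest.
exeCfg = {
    'loader': 'load_csv_data',
    'feature_type': 'goog_stock_sma100',   # '' triggers passthrough mode
    'datarow_2_tensor_type': 'goog_lin_reg_avg100day',
    'cleanData': True,
    'limitLoad': False,
    'rowLimit': -1,
    'verbose': False,
}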
That wraps up this tutorial. I know this was a short one, but in reality we did most of the work in tutorial 0, and by creating a pipeline
to manipulate our data and using similar class structures we really didn't have a lot of new code to look at, although the small amount of
new code we have does a whole lot. The next tutorial covers the creation of training and validation sets as well as running the model.