Author: Brian A. Ree
1: Tensorflow Linear Regression Generating Features
Welcome to the second part of our tensorflow tutorials on linear regression models. In our first tutorial
we learned how to import data from CSV files in a general and abstracted way. We learned about our DataRow and LoadCsvData
classes and also about some of our supporting classes and files. Next up we're going to start generating features and statistics on our loaded data.
This is the one part of the process that is difficult to abstract, because the features and statistics we generate depend on the column names and data
in our CSV file. Because of this we make LoadFeatureData a customizable class where you can create your own code to run statistics on a certain
set of data. Let's take a look at our LoadFeatureData class.
class LoadFeatureData:
    """ A class for adding calculated features to the data rows loaded from csv. """
    rows = []
    rowCount = 0
    limitLoad = True
    rowLimit = 25
    cleanData = False
    cleanCount = 0
    verbose = False
    loadCsvData = None

    def __init__(self, lLimitLoad=False, lRowLimit=-1, lCleanData=False, lVerbose=False, lLoadCsvData=None):
        self.limitLoad = lLimitLoad
        self.rowLimit = lRowLimit
        self.cleanData = lCleanData
        self.verbose = lVerbose
        self.loadCsvData = lLoadCsvData
    # edef

    def generateData(self, type='goog_stock_sma100'):
        print("")
        print("")
        print("Generating Feature Data: Type: " + type)
        if type == '':
            self.rowCount = self.loadCsvData.rowCount
            rownum = self.rowCount
            lrows = []
            lrows.extend(self.loadCsvData.rows)
            print("Loaded %i rows from this data file." % (rownum))
            lrows = self.sortRows(lrows)
            self.cleanRows(lrows)
            self.rows.extend(lrows)
            print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
        elif type == 'goog_stock_sma100':
            self.resetRows()
            lrows = []
            lrows.extend(self.loadCsvData.rows)
            rownum = 0
            avg = 0
            target = 100
            cnt = 0
            # Flag any row whose price columns aren't numeric so cleanRows can remove it.
            for i in xrange(len(lrows)):
                try:
                    float(lrows[i].getMemberByName('Close'))
                    float(lrows[i].getMemberByName('Open'))
                    float(lrows[i].getMemberByName('High'))
                    float(lrows[i].getMemberByName('Low'))
                except:
                    lrows[i].error = True
            # efl
            self.cleanRows(lrows)
            # Seed value used until we have a full window of history.
            startingVal = float(lrows[0].getMemberByName('Close'))
            for i in xrange(len(lrows)):
                # The answer in linear regression models will always be stored in the 'Answer' column
                lrows[i].setMember('Answer', 8, float(lrows[i].getMemberByName('Close')))
                if i < target:
                    lrows[i].setMember('sma_100', 7, startingVal)
                else:
                    # Average the previous `target` closes: rows (i - target) through (i - 1).
                    avg = 0
                    for j in xrange(i - target, i):
                        avg += float(lrows[j].getMemberByName('Close'))
                    # efl
                    if self.verbose == True:
                        print('sma_100: %f' % (float(avg) / float(target)))
                    # eif
                    lrows[i].setMember('sma_100', 7, (float(avg) / float(target)))
                # eif
                if self.limitLoad == True and cnt >= self.rowLimit and self.rowLimit > 0:
                    break
                # eif
                rownum += 1
                cnt += 1
            # efl
            print("Loaded %i rows from this data file." % (rownum))
            lrows = self.sortRows(lrows)
            self.cleanRows(lrows)
            self.rows.extend(lrows)
            self.rowCount += len(lrows)
            print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
        # eif
    # edef

    def resetRows(self):
        self.rows = []
        self.rowCount = 0
        self.cleanCount = 0
    # edef

    def sortRows(self, lrows):
        # Sort by the DataRow's own unique id member, not the builtin id() function.
        return sorted(lrows, key=lambda row: row.id)
    # edef

    def cleanRows(self, lrows):
        if self.cleanData == True:
            print("Cleaning row data...")
            should_restart = True
            while should_restart:
                should_restart = False
                for row in lrows:
                    if row.error == True:
                        lrows.remove(row)
                        self.rowCount -= 1
                        self.cleanCount += 1
                        should_restart = True
                    # eif
                # efl
            # fwl
        # eif
    # edef
# eclass
Let's review our class variables first.
- rows: A list of row data this class maintains.
- rowCount: The number of rows loaded into the rows list.
- limitLoad: A flag that limits the loaded data by the value in rowLimit.
- rowLimit: The maximum number of rows to load.
- cleanData: A flag that indicates if we should clean data or not.
- cleanCount: The number of rows cleaned from the rows list.
- verbose: A boolean flag that indicates whether or not verbose logging is turned on.
- loadCsvData: An instance of our LoadCsvData class where we will be accessing the loaded rows.
Next let's go over our class methods; a short usage sketch follows the list. If you need to review the data loading process that gets our data
from CSV files into our LoadCsvData class, please review this tutorial.
- __init__: The default constructor for the class. Takes arguments providing data and configuration settings.
- generateData: Generates the feature data on the passed in LoadCsvData class data.
- resetRows: Resets local class data variables.
- sortRows: Sorts the loaded rows by unique id.
- cleanRows: Cleans the loaded data rows by removing any that have an internal error flag set to true.
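Putting the pieces together, typical usage looks something like this. This is a hypothetical sketch: data stands in for a LoadCsvData
instance that has already loaded its CSV rows.
import LoadFeatureData

# Hypothetical usage; `data` is a LoadCsvData instance with rows already loaded.
fData = LoadFeatureData.LoadFeatureData(False, -1, True, False, data)
fData.generateData('goog_stock_sma100')
print(len(fData.rows))   # DataRow objects, now carrying 'sma_100' and 'Answer'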
Now let's look at some of the more important methods in this class. Many of the methods you see here are borrowed from the
LoadCsvData class, so we'll skip over those and get right to the nitty gritty. I mentioned it earlier, but I'll go over it again here.
The process we've created in code is meant to be as generic and general as possible so that we can support different types of data easily
without having to write a lot of new custom code. To this end we push the proprietary code, the code that is tied to unique features of our data and
that cannot be generalized, into our LoadFeatureData class. We do this by allowing users to pass in a string that defines which set of
statistics and feature generation code to run. This string is data driven in our execution configuration dictionary, but the actual statistics and the columns
we use to generate them are proprietary, so we assume our end user has created a special section to generate stats for their data.
It'll all make sense in a little bit. Let's review some code.
def generateData(self, type='goog_stock_sma100'):
    print("")
    print("")
    print("Generating Feature Data: Type: " + type)
    if type == '':
        self.rowCount = self.loadCsvData.rowCount
        rownum = self.rowCount
        lrows = []
        lrows.extend(self.loadCsvData.rows)
        print("Loaded %i rows from this data file." % (rownum))
        lrows = self.sortRows(lrows)
        self.cleanRows(lrows)
        self.rows.extend(lrows)
        print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
    elif type == 'goog_stock_sma100':
        self.resetRows()
        lrows = []
        lrows.extend(self.loadCsvData.rows)
        rownum = 0
        avg = 0
        target = 100
        cnt = 0
        # Flag any row whose price columns aren't numeric so cleanRows can remove it.
        for i in xrange(len(lrows)):
            try:
                float(lrows[i].getMemberByName('Close'))
                float(lrows[i].getMemberByName('Open'))
                float(lrows[i].getMemberByName('High'))
                float(lrows[i].getMemberByName('Low'))
            except:
                lrows[i].error = True
        # efl
        self.cleanRows(lrows)
        # Seed value used until we have a full window of history.
        startingVal = float(lrows[0].getMemberByName('Close'))
        for i in xrange(len(lrows)):
            # The answer in linear regression models will always be stored in the 'Answer' column
            lrows[i].setMember('Answer', 8, float(lrows[i].getMemberByName('Close')))
            if i < target:
                lrows[i].setMember('sma_100', 7, startingVal)
            else:
                # Average the previous `target` closes: rows (i - target) through (i - 1).
                avg = 0
                for j in xrange(i - target, i):
                    avg += float(lrows[j].getMemberByName('Close'))
                # efl
                if self.verbose == True:
                    print('sma_100: %f' % (float(avg) / float(target)))
                # eif
                lrows[i].setMember('sma_100', 7, (float(avg) / float(target)))
            # eif
            if self.limitLoad == True and cnt >= self.rowLimit and self.rowLimit > 0:
                break
            # eif
            rownum += 1
            cnt += 1
        # efl
        print("Loaded %i rows from this data file." % (rownum))
        lrows = self.sortRows(lrows)
        self.cleanRows(lrows)
        self.rows.extend(lrows)
        self.rowCount += len(lrows)
        print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
    # eif
# edef
The first thing you should notice is that we've built in a passthrough feature where nothing is done to the passed in
DataRow objects except to push them into the local data list. This lets us bypass the feature generation step when we don't need it,
without altering any of our execution code. So we're trying to keep this proprietary section of code as flexible as possible.
if type == '':
    self.rowCount = self.loadCsvData.rowCount
    rownum = self.rowCount
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    print("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
The bypass code is triggered by an empty string: it simply loads the rows from the LoadCsvData class
into a local list, runs a sort and clean step on the data, then pushes the local list into the class data list.
And bam, we're done. Nothing too crazy being done here, but very useful code indeed.
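Invoking the passthrough might look like this; a hypothetical snippet, again assuming a populated LoadCsvData instance named data.
fData = LoadFeatureData.LoadFeatureData(False, -1, True, False, data)
fData.generateData('')   # an empty type string triggers the passthrough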
Next we'll look at an actual feature and statistics generation example.
elif type == 'goog_stock_sma100':
    self.resetRows()
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    rownum = 0
    avg = 0
    target = 100
    cnt = 0
    # Flag any row whose price columns aren't numeric so cleanRows can remove it.
    for i in xrange(len(lrows)):
        try:
            float(lrows[i].getMemberByName('Close'))
            float(lrows[i].getMemberByName('Open'))
            float(lrows[i].getMemberByName('High'))
            float(lrows[i].getMemberByName('Low'))
        except:
            lrows[i].error = True
    # efl
    self.cleanRows(lrows)
    # Seed value used until we have a full window of history.
    startingVal = float(lrows[0].getMemberByName('Close'))
    for i in xrange(len(lrows)):
        # The answer in linear regression models will always be stored in the 'Answer' column
        lrows[i].setMember('Answer', 8, float(lrows[i].getMemberByName('Close')))
        if i < target:
            lrows[i].setMember('sma_100', 7, startingVal)
        else:
            # Average the previous `target` closes: rows (i - target) through (i - 1).
            avg = 0
            for j in xrange(i - target, i):
                avg += float(lrows[j].getMemberByName('Close'))
            # efl
            if self.verbose == True:
                print('sma_100: %f' % (float(avg) / float(target)))
            # eif
            lrows[i].setMember('sma_100', 7, (float(avg) / float(target)))
        # eif
        if self.limitLoad == True and cnt >= self.rowLimit and self.rowLimit > 0:
            break
        # eif
        rownum += 1
        cnt += 1
    # efl
    print("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    self.rowCount += len(lrows)
    print('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
This is an example of a custom extension of this class to support proprietary feature and statistics generation.
We've named our new section goog_stock_sma100 because we're using Google stock data and we're generating a
100 day simple moving average on it. We reset our local data storage variables and load the data stored in our
LoadCsvData class into a local list. Then we run some checks on the data loaded in certain columns: basically we want to make sure
that we have numbers in each column, and if there is an exception during this process we set the local DataRow error flag to true.
This ensures the row gets cleaned out of our data set during the cleanRows method call.
The feature we are generating here is a simple moving average, and we're going to use linear regression and tensorflow
to come up with a model that predicts the next value of the 100 day moving average based on the data we have.
The looping structure is something you can review on your own: it's a double loop that calculates the 100 day moving average
from the hundred rows preceding the current row. You can see that a lot of this code can't really be data driven, or at least doing so is beyond the scope
of this tutorial.
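To make the windowing concrete, here is a minimal standalone sketch of the same calculation. The helper below is hypothetical
and not part of the tutorial code; it just mirrors the startingVal fallback and the averaging window.
# Hypothetical standalone helper showing the windowing logic generateData uses.
def simple_moving_average(closes, target=100):
    smas = []
    for i in range(len(closes)):
        if i < target:
            # Not enough history yet; fall back to the first close,
            # mirroring startingVal in generateData.
            smas.append(float(closes[0]))
        else:
            # Average the previous `target` closes: indices i - target through i - 1.
            window = closes[i - target:i]
            smas.append(sum(float(c) for c in window) / float(target))
    return smas

print(simple_moving_average([10, 11, 12, 13, 14], target=3))
# [10.0, 10.0, 10.0, 11.0, 12.0]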
Next we'll look at our DataRow2Tensor class. It's similar to our Data2DataRow class in that it provides a data driven way to specify
what columns of data end up in our tensor for processing.
columns = {
    "goog_lin_reg_avg100day": ['Close', 'High', 'Low', 'sma_100'],
    "weight_age_lin_reg_blood_fat": ['Weight', 'Age']
}
As you can see above the class is very simple and holds a dictionary that provides a data driven way to define our tensor shape.
Because the column names from our CSV file and the column names of our features and statistics can be different depending on what we're
trying to do, we've added an abstraction layer that allows end users to specify what data they want to load into their tensor by listing the
column names here.
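Supporting a new data set could be as simple as registering another entry; the mapping name and columns below are hypothetical.
# Hypothetical: mapping a custom model type to its input columns.
DataRow2Tensor.columns['my_custom_model'] = ['Open', 'Volume', 'my_feature']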
Next up we're going to take a look at our LoadTensorData class, which takes a LoadFeatureData instance as an argument.
You can see we're adjusting our data one step at a time and carrying forward the data from the previous step. This next step actually builds the tensors
we're going to run through our tensorflow linear regression model.
import tensorflow as tf
import numpy as np
import LoadFeatureData

class LoadTensorData:
    """ A simple class for converting feature data into tensor data. """
    rows = []
    rowCount = 0
    answers = []
    verbose = False
    loadFeatureData = None
    columnMap = []
    dataModelColCount = 0

    def __init__(self, lVerbose=False, lLoadFeatureData=None):
        self.verbose = lVerbose
        self.loadFeatureData = lLoadFeatureData
    # edef

    def generateData(self, lColumnMap=[]):
        print("")
        print("")
        print("Generating Tensor Data:")
        self.resetRows()
        self.columnMap = lColumnMap
        self.dataModelColCount = len(self.columnMap)
        # Convert base data
        val2 = []
        val3 = []
        rowcnt = 0
        for row in self.loadFeatureData.rows:
            val = []
            for col in self.columnMap:
                val.append(float(row.getMemberByName(col)))
            # efl
            val2.append(val)
            rowcnt += 1
            val3.append(float(row.getMemberByName('Answer')))
        # efl
        self.rows = tf.to_float(val2)
        self.answers = tf.to_float(val3)
        self.rowCount = rowcnt
        print("TensorRow Data Shape: %s" % self.rows.get_shape())
        print("TensorRow Answer Shape: %s" % self.answers.get_shape())
        print('TensorRow Count: %i' % (self.rowCount))
    # edef

    def resetRows(self):
        self.rows = []
        self.rowCount = 0
        self.answers = []
    # edef
# eclass
We're going to skip over a detailed class variable and method listing; the class is somewhat simple, and most of the
variables and class methods should be familiar from previous classes we've looked at. They have similar functionality and so
they have a familiar structure. Let's take a look at the generateData method next.
def generateData(self, lColumnMap=[]):
    print("")
    print("")
    print("Generating Tensor Data:")
    self.resetRows()
    self.columnMap = lColumnMap
    self.dataModelColCount = len(self.columnMap)
    # Convert base data
    val2 = []
    val3 = []
    rowcnt = 0
    for row in self.loadFeatureData.rows:
        val = []
        for col in self.columnMap:
            val.append(float(row.getMemberByName(col)))
        # efl
        val2.append(val)
        rowcnt += 1
        val3.append(float(row.getMemberByName('Answer')))
    # efl
    self.rows = tf.to_float(val2)
    self.answers = tf.to_float(val3)
    self.rowCount = rowcnt
    print("TensorRow Data Shape: %s" % self.rows.get_shape())
    print("TensorRow Answer Shape: %s" % self.answers.get_shape())
    print('TensorRow Count: %i' % (self.rowCount))
# edef
How can so much cool work be done by this little method, you ask? I don't know, it just does. Let's see how. First up, the internal
data storage variables are reset. Next we loop over the feature rich data and use our column mapping to pull the target columns into a list.
We load each row's column data as a list into a list of rows, so we have a list of lists to work with. Notice that the data driven column listing automatically
sets the dimensions of our tensor. At the very end of the method we call the tensorflow method to_float, which converts our list of lists into
a tensor object of the same values.
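Here is a toy version of that conversion, assuming TensorFlow 1.x where tf.to_float is available; the values are made up.
import tensorflow as tf  # TF 1.x, where tf.to_float is available

# Toy data: two rows, three mapped columns each, plus one answer per row.
val2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
val3 = [10.0, 20.0]
rows = tf.to_float(val2)      # shape (2, 3): rows x mapped columns
answers = tf.to_float(val3)   # shape (2,): one answer per row
print(rows.get_shape())       # (2, 3)
print(answers.get_shape())    # (2,)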
You can check the debugging output as we print the row count and the tensor dimensions at the end of the generateData call.
Congrats, we now have a tensor object of our data to begin running through our linear model. Let's take a look at our execution code so
we can see how the process has evolved.
def run(exeCfg):
    type = exeCfg['type']
    data_2_datarow_type = exeCfg['data_2_datarow_type']
    datarow_2_tensor_type = exeCfg['datarow_2_tensor_type']
    version = exeCfg['version']
    reset = exeCfg['reset']
    checkpoint = exeCfg['checkpoint']
    verbose = exeCfg['verbose']
    limitLoad = exeCfg['limitLoad']
    rowLimit = exeCfg['rowLimit']
    validatePrct = exeCfg['validatePrct']
    trainPrct = exeCfg['trainPrct']
    randomSeed = exeCfg['randomSeed']
    learning_rate = exeCfg['learning_rate']
    log_reg_positive_result = exeCfg['log_reg_positive_result']
    lin_reg_positive_result = exeCfg['lin_reg_positive_result']
    model_type = exeCfg['model_type']
    loader = exeCfg['loader']
    cleanData = exeCfg['cleanData']
    trainStepsMultiplier = exeCfg['trainStepsMultiplier']
    dataMap = Data2DataRow.mapping[data_2_datarow_type]
    files = exeCfg['files']
    featureType = exeCfg['feature_type']
    data = None
    fData = None
    tData = None
    tfModel = None
    print("Found loader: " + loader)
    if loader == 'load_csv_data':
        data = LoadCsvData.LoadCsvData()
        data.limitLoad = limitLoad
        data.rowLimit = rowLimit
        data.verbose = verbose
        for file in files:
            csvFileName = files[file]['name']
            appendCols = files[file]['appendCols']
            data.loadData(csvFileName, type, version, reset, dataMap, appendCols)
        # efl
    # eif
    if featureType != '':
        print("Found feature type: " + featureType)
        fData = LoadFeatureData.LoadFeatureData(limitLoad, rowLimit, cleanData, verbose, data)
        fData.generateData(featureType)
    else:
        print("Found no feature type.")
        fData = LoadFeatureData.LoadFeatureData(limitLoad, rowLimit, cleanData, verbose, data)
        fData.generateData('')
    # eif
    tData = LoadTensorData.LoadTensorData(verbose, fData)
    tData.generateData(DataRow2Tensor.columns[datarow_2_tensor_type])
If we take a look at our execution code we can see there are some new pieces to it. At the bottom we have a new block for running
the feature generation code. I put a special hook in to print a different message when no feature type is specified, i.e. passthrough mode.
Take a look at the next two lines and notice how similar our tensor generation calls are. Very clean and elegant at this level of the code; all our
complex methods are encapsulated by our classes. Such a thing of beauty.
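For reference, the feature related entries of the execution configuration dictionary might look something like this. The values are
hypothetical, and run reads several more keys than are shown here.
# Hypothetical values for the keys this tutorial touches; run() also reads
# type, version, reset, checkpoint, trainPrct, learning_rate, and the rest.
exeCfg = {
    'loader': 'load_csv_data',
    'feature_type': 'goog_stock_sma100',   # '' triggers passthrough mode
    'datarow_2_tensor_type': 'goog_lin_reg_avg100day',
    'cleanData': True,
    'limitLoad': False,
    'rowLimit': -1,
    'verbose': False,
}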
That wraps up this tutorial. I know this was a short one, but in reality we did most of the work in tutorial 0, and by creating a pipeline
to manipulate our data and using similar class structures we really didn't have a lot of new code to look at, although the small amount of
new code we have does a whole lot. The next tutorial covers the creation of training and validation sets as well as running the model.