Author: Brian A. Ree
0: Tensorflow Logistic Regression: Intro
Welcome back to our tutorial series on TensorFlow. In this segment we'll be augmenting our existing code to include a logistic regression example
using TensorFlow. If you haven't done so already, you should follow the previous tutorials, starting here, to get your knowledge and code set up, or just to review
what we've covered so far. With a linear regression model, our neural network is trained to predict a continuous value from a linear relationship in the input data. With logistic regression, we
are using a neural network to answer a yes-or-no question. In this case we'll be using our model to answer the question: will tomorrow's closing stock price be higher than today's closing price?
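Before we dig into the plumbing, here's a minimal sketch of what logistic regression boils down to in TensorFlow terms. This is illustrative only: the shapes, the zero-initialized weights, and the learning rate are placeholders, and our real model will live in the RegModelLogistic class we implement in the next tutorial.

import tensorflow as tf

# Minimal logistic regression sketch (TensorFlow 1.x style).
# X holds one row of five features per trading day; Y holds the label:
# 1.0 if the next day's close was higher, 0.0 otherwise.
X = tf.placeholder(tf.float32, [None, 5])
Y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([5, 1]))
b = tf.Variable(tf.zeros([1]))
logits = tf.matmul(X, W) + b
prediction = tf.sigmoid(logits)  # squashed into (0, 1), read as P(yes)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)  # illustrative rate

The only structural difference from our linear regression model is the sigmoid squashing the output into (0, 1) and the cross-entropy loss; everything else, loading data, splitting it, and looping over training steps, carries over.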
1: Tensorflow Logistic Regression: Setup
Most of the code we've written is very flexible. Our CSV parsing and loading class can stay as it is; we'll just use it to load some new stock data for our logistic
regression sample. So we won't need to do anything new until we get to the feature generation and statistics step. Let's add a new entry to
our Data2DataRow class to handle the new data we want to load. Below is the new entry we'll be using.
# use -1 to ignore loading a column
mapping = {
    "google_price":
    {
        "Date": "0",
        "Open": "1",
        "Close": "2",
        "High": "3",
        "Low": "4",
        "Volume": "5",
        "Symbol": "-1",   # 6
        "sma_100": "-1",  # 7
    },
    "weight_age":
    {
        "Weight": "0",
        "Age": "1",
        "BloodFat": "2"
    },
    "google_price2":
    {
        "Date": "0",
        "Open": "1",
        "Close": "2",
        "High": "3",
        "Low": "4",
        "Volume": "5",
        "Symbol": "-1",  # 6
    }
}
You can see that we've added a new entry called google_price2. It's the same underlying data structure from the same source, Google Finance, so we don't need
to add anything new to it. We won't be generating a 100 day running average in this case, though, so we can drop that column; everything else stays the same. To make the "-1" convention concrete, here's a rough sketch of how a mapping like this can drive the parser.
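This sketch is illustrative only; the real parsing lives in the CSV loading class from the earlier tutorials, and parse_row is a made-up helper, not a method from our engine.

# Hypothetical sketch of how a column mapping with "-1" entries
# could be consumed: columns mapped to "-1" are simply skipped.
def parse_row(raw_values, column_map):
    row = {}
    for name, idx in column_map.items():
        if idx == "-1":
            continue  # ignore this column entirely
        row[name] = raw_values[int(idx)]
    return row

# parse_row(["2017-01-03", "778.81", "786.14", "789.63", "775.80",
#            "1657300", "goog"], mapping["google_price2"])
# -> Symbol is dropped; everything else is keyed by its column name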
We're going to be using daily stock data and logistic regression to see if we can predict whether tomorrow's closing price will be higher or lower than today's closing price.
Now of course there is no correlation that would make this predictable in a stock unless that stock is trending. Trending is a state attributed to stocks that are behaving in a more
predictable manner due to the current market forces. No worries, we're having fun here, and who knows: if you build a more advanced prediction system that can predict the
next day's closing price change with 60% accuracy, you're 10% above the expected 50% coin-flip probability, cha ching! Just kidding. Back to the code. We'll be using the google_price2
data map to bring in our data, but we'll need to tell our little engine how to handle feature generation, so we'll look at that code next.
columns = {
    "goog_lin_reg_avg100day": ['Close', 'Open', 'sma_100'],
    "weight_age_lin_reg_blood_fat": ['Weight', 'Age'],
    "goog_log_reg": ['Close', 'Open', 'High', 'Low', 'Volume'],
}
We're going to be pulling in a little more data than with our linear regression example. The google_log_reg entry above defines the columns of data we'll be using
for our logistic regression model. I want to bring in all the columns available in the daily stock price CSV because I want the neural network to have as much data as possible
when trying to find a pattern. Wow, was that fast. With almost no effort we're primed to load new data and generate features and statistics. This is the part
of coding I enjoy the most: seeing how good planning and code structure can help you solve more problems faster. Instead of recoding and reworking things we can jump right in and
solve a new problem. But wait! How do we utilize our new data and features, you ask? There are two places in our code that we crudely left open for customization. One such place is in
our LoadFeatureData class, which allows us to generate custom features based on the data we're loading. The second customization point is in the evaluation method
of our RegModelLinear and RegModelLogistic classes. Before we dive in, here's a quick sketch of how a column list like goog_log_reg turns a row into a feature vector.
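The helper below is hypothetical; only getMemberByName is real, and you'll see it used in the feature generation code that follows.

# Illustrative only: build a model input vector from one DataRow
# using a column list such as columns['goog_log_reg'].
def row_to_features(row, column_names):
    # getMemberByName returns the raw string parsed from the CSV
    return [float(row.getMemberByName(name)) for name in column_names]

# row_to_features(some_row, ['Close', 'Open', 'High', 'Low', 'Volume'])
# might yield [786.14, 778.81, 789.63, 775.80, 1657300.0]

With that picture in mind, let's take a look at our custom feature generation code.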
elif type == 'goog_log_reg':
    self.resetRows()
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    rownum = 0
    length = len(lrows)
    for i in xrange(length):
        try:
            # conversion tests: a non-numeric value in any of these
            # columns raises and flags the row as an error below
            float(lrows[i].getMemberByName('Close'))
            float(lrows[i].getMemberByName('Open'))
            float(lrows[i].getMemberByName('High'))
            float(lrows[i].getMemberByName('Low'))
            if i + 1 < length:
                # label: 1.0 if tomorrow's close beats today's, else 0.0
                if float(lrows[i + 1].getMemberByName('Close')) > float(lrows[i].getMemberByName('Close')):
                    lrows[i].setMember('Answer', 7, 1.0)
                else:
                    lrows[i].setMember('Answer', 7, 0.0)
                # eif
            else:
                # the last row has no tomorrow to compare against
                lrows[i].setMember('Answer', 7, 0.0)
            # eif
            # categorical answer columns used by the logistic model
            if float(lrows[i].getMemberByName('Answer')) == 1.0:
                lrows[i].setMember('AnswerCatYes', 8, 2.00)
                lrows[i].setMember('AnswerCatNo', 9, 1.00)
            else:
                lrows[i].setMember('AnswerCatYes', 8, 1.00)
                lrows[i].setMember('AnswerCatNo', 9, 2.00)
            # eif
        except Exception:
            # any conversion failure marks this row as bad data
            lrows[i].error = True
        # etry
        rownum = rownum + 1
    # efl
    print ("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    self.rowCount = len(lrows)
    print ('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
    # eif
# eif
In our custom feature generation code we load our CSV data into a local data structure. As we loop through the data we set our answer column value.
The answer column is the column that the RegModelLinear and RegModelLogistic classes use to evaluate their performance, and they expect it to be there,
so we'll gladly set that column up now. The float conversion tests on the stock price data are done to detect any errors in our imported data.
If an exception occurs at this point, the except clause executes and we mark the data row as having an error. Very cool. Now back to that answer column.
We look forward to the next day's closing price. If that price is greater than today's closing price we set our answer value to 1.
If not we set our answer value to 0. Once this is done we sort the rows by their import id, ascending, which just maintains their original import order.
We then clean out any data rows we marked with an error and bam, we're good to go. We've gotten a lot done in almost no time at all. Let's sanity check
the labeling rule with a tiny example before defining our execution dictionary in Main.
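Here's a quick toy run of the same rule outside the engine; the closing prices are made up.

# Toy demonstration of the labeling rule used above: each day is
# labeled 1.0 if the NEXT day's close is higher, 0.0 otherwise.
closes = [100.0, 101.5, 101.0, 103.2]
answers = []
for i in xrange(len(closes)):
    if i + 1 < len(closes) and closes[i + 1] > closes[i]:
        answers.append(1.0)
    else:
        answers.append(0.0)
print (answers)  # [1.0, 0.0, 1.0, 0.0] -- the last day defaults to 0.0

Now we're ready to define our execution dictionary in Main.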
"goog_log_reg":
{
'type': 'csv',
'data_2_datarow_type': 'google_price2',
'datarow_2_tensor_type': 'goog_log_reg',
'version': '1.0',
'reset': False,
'checkpoint': False,
'limitLoad': False,
'cleanData': True,
'verbose': False,
'rowLimit': 125,
'validatePrct': 0.30,
'trainPrct': 0.70,
'randomSeed': False,
'trainStepsMultiplier': 1,
'learning_rate': 0.000001,
'log_reg_positive_result': 0.50,
'lin_reg_positive_result': 0.00,
'model_type': 'logistic_regression',
'loader': 'load_csv_data',
'feature_type': 'goog_log_reg',
'logPrint': 50,
'eval_type': '',
'files': {
'file1': {'name': dataDir + "/ivv.csv.xls", 'appendCols': [{'Symbol': 'ivv', 'idx': '6'}]},
}
},
You can see we've specified our new data loading entry, 'google_price2', our new feature generation entry, 'goog_log_reg', and our feature type, 'goog_log_reg'.
The entries here tell our engine how to load and process our CSV file in a data driven way. If limitLoad were enabled, rowLimit would cap the data at 125 rows, and the
0.70/0.30 train/validate split would then hand roughly 87 rows to training and 38 to validation. We're now just about ready to run some logistic regression
sample code. This tutorial is kind of short, so we'll stop here and pick up in the next one with an implementation of the RegModelLogistic class.