Author: Brian A. Ree
0: Introduction
Have you ever stood up an outward-facing web server only to find, a few weeks later, that it is constantly being hit by
malicious attempts to exploit common admin portals and other general attacks probing your web server for a security flaw to take
advantage of? Ever wonder why it seems to be ok for people to constantly attack and attempt to compromise your server? Ever wonder what you can do
about it? Well, I've thought about all of these questions and a few more. I've come to the conclusion that not only is it not ok, it's downright uncool.
So let's do something about it, shall we? Let's set up an active shield around our server that will block the IP address of anyone
running an attack against us.
Let's outline some of the main goals of our project.
To detect web-based attacks that try to find and exploit web resources.
To use a list of IP addresses as a blocked IP list.
To utilize TensorFlow in the detection of attacks.
To recognize simple bad URL use by a legitimate user.
1: Apache Logs
We can detect web-based attacks by using the Apache web server's access logs. What we'll do is load the log data into our DataRow
object and convert it into something we can feed into a TensorFlow logistic regression neural network. For a lesson on neural networks and our
neural network code design please review the tutorials starting here. We will assume an Apache log file format
like the following. Below is the apache2 config file entry describing the log structure. We will prepend a '1 - - ' or a '0 - - ' to each line when marking logs with
training and validation data.
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
A sample of the log entry is as follows.
173.208.197.203 - - [21/Sep/2016:15:55:50 +0000] "GET / HTTP/1.1" 200 11576 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)"
2: Parsing Apache Log Entries
The first thing we have to do in our project approach is to parse and load our Apache log data so we can use it in our neural network. Remember, we will be using logistic regression, which can answer a yes or no type of question. In this scenario the question
being asked is: is this log entry from a real user? The logistic regression model will learn the difference between an attack and a real use case. We can load up our Apache logs and use them to train our network. In order to be able to use our dataset we need to process the log files and add a '1 - - '
or a '0 - - ' to each line as mentioned earlier. This may be tedious, but it should be obvious from the URL, the response code, the content length, etc. which entries are from attacks. I recommend using a global text replace to add a seed entry, '1 - - ', to the beginning of each line, then modifying the rows where the value
should be different accordingly.
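For illustration, a marked pair of entries might look like the following; the first is the sample entry from above marked as a safe use case, the second is a hypothetical attack probing for an admin portal.
1 - - 173.208.197.203 - - [21/Sep/2016:15:55:50 +0000] "GET / HTTP/1.1" 200 11576 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)"
0 - - 192.0.2.55 - - [21/Sep/2016:16:02:11 +0000] "GET /phpmyadmin/index.php HTTP/1.1" 404 0 "-" "-"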
Once you have processed a fair number of log entries, a few thousand would be ideal, then we can move on and start thinking about a log row. Let's list a few things we're looking for and why. Then we will see what we can do with the apache log entry.
ip address: We want to block the IP addresses behind hack attempts.
url: We want to know what resource the client browser tried to access. This is not in a format TensorFlow can use directly.
http response code: The HTTP response code the server returned for the request.
length: The content length of the http request.
log entry category: A yes or no answer, '1 - - ' if the log entry is a safe use case, and '0 - - ' if the log entry is a hack attempt, or attack.
We can grab the IP address from the logs but it is not in a form TensorFlow can use. We cannot use strings in our matrices directly; they must be converted to a numeric representation. In this case the solution is simple: we split the IP into four integers.
We can then use those integers as four separate columns in our tensor if we end up using the data in our neural network. Next we have the URL to contend with. For the same reason as the IP address string we can't use the URL directly in our TensorFlow
matrices, so we need to convert the information into a numeric representation. We will choose a simple conversion technique. If the connection attempt to our server contains a valid URL root or a valid URL then we let the URL be represented by a 2, otherwise we
let it be represented by a 1. This approach works because when a valid use case has the wrong URL it is usually because of a slight typo or path mixup, but there are usually valid path elements in the URL. These indicate a valid use case that had an error of some kind.
We know the files on our web server, so if we take a list of all the files and their paths we can generate a list of URL roots or full URLs that we expect connections to our server to use.
So let's create an array of valid URL roots and full URLs for the web site we are protecting. This will be custom to your server, just as the Apache access logs are custom to your server and environment.
The following is what our safe URL list looks like; notice we changed the path characters '/' and '\' to '_'.
'extras': [
'_websvc.php',
'_websvcclient.php',
'_scripts_utils_Log.js',
'_scripts_utils_md5.js',
'_scripts_utils_BufferedReader.js',
'_scripts_webutils_MmgMd5.js',
'_scripts_webutils_MsNewsSrchWebSvcClient.js',
'_scripts_webutils_SaWebSvcClient.js',
'_scripts_webutils_YhStockQuoteWebSvcClient.js',
'_scripts_webutils_MsTxtSntWebSvcClient.js',
'_scripts_stockanalytics_StockAnalytics.js',
'_scripts_models_ModelsConfig.js',
'_scripts_models_ModelArticlesCount.js',
'_scripts_models_ModelArticlesGetAllActive.js',
'_scripts_models_ModelDataSourcesCount.js',
'_scripts_models_ModelStockQuotesCount.js',
'_scripts_models_ModelDataSourcesUsageCount.js',
'_scripts_models_ModelTargetSearchTextCount.js',
'_scripts_models_ModelTargetSymbolsCount.js',
'_scripts_models_ModelTextRefinementCount.js',
'_scripts_models_ModelDataSourcesUsageSummary.js',
'_scripts_models_ModelDataSourcesGetAllActive.js',
'_scripts_models_ModelTargetSearchTextGetAllActive.js',
'_scripts_models_ModelStockQuotesGetAllActive.js',
'_scripts_models_ModelTargetSymbolsGetAllActive.js',
'_scripts_models_ModelTextRefinementGetAllActive.js',
'_scripts_views_ViewArticlesCount.js',
'_scripts_views_ViewArticlesGetAllActive.js',
'_scripts_views_ViewDataSourcesCount.js',
'_scripts_views_ViewStockQuotesCount.js',
'_scripts_views_ViewDataSourcesUsageCount.js',
'_scripts_views_ViewTargetSearchTextCount.js',
'_scripts_views_ViewTargetSymbolsCount.js',
'_scripts_views_ViewTextRefinementCount.js',
'_scripts_views_ViewDataSourcesUsageSummary.js',
'_scripts_views_ViewDataSourcesGetAllActive.js',
'_scripts_views_ViewStockQuotesGetAllActive.js',
'_scripts_views_ViewTargetSearchTextGetAllActive.js',
'_scripts_views_ViewTargetSymbolsGetAllActive.js',
'_scripts_views_ViewTextRefinementGetAllActive.js',
'_scripts_controllers_ControllerArticlesCount.js',
'_scripts_controllers_ControllerDataSourcesCount.js',
'_scripts_controllers_ControllerArticlesGetAllActive.js',
'_scripts_controllers_ControllerStockQuotesCount.js',
'_scripts_controllers_ControllerDataSourcesUsageCount.js',
'_scripts_controllers_ControllerTargetSearchTextCount.js',
'_scripts_controllers_ControllerTargetSymbolsCount.js',
'_scripts_controllers_ControllerDataSourcesUsageSummary.js',
'_scripts_controllers_ControllerTextRefinementCount.js',
'_scripts_controllers_ControllerTargetSearchTextGetAllActive.js',
'_scripts_controllers_ControllerDataSourcesGetAllActive.js',
'_scripts_controllers_ControllerStockQuotesGetAllActive.js',
'_scripts_controllers_ControllerTargetSymbolsGetAllActive.js',
'_scripts_controllers_ControllerTextRefinementGetAllActive.js'
]
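Before moving on, here is a minimal sketch of the two conversions we just described. The names splitIp, encodeUrl, and safeUrls are mine for illustration only; the real version of this logic lives in the feature generation code we cover later.

# Illustration of the IP and URL conversions described above; names are placeholders.
safeUrls = ['_websvc.php', '_websvcclient.php', '_scripts_utils_Log.js']  # shortened extras list

def splitIp(ip):
    # '173.208.197.203' -> four integer columns for the tensor
    parts = ip.split(".")
    return [int(parts[0]), int(parts[1]), int(parts[2]), int(parts[3])]

def encodeUrl(url, extras):
    # 2 means the URL matches a known resource (or the site root '_'), 1 means it does not
    if url == "_":
        return 2
    for r in extras:
        if r in url:
            return 2
    return 1

print(splitIp("173.208.197.203"))                    # [173, 208, 197, 203]
print(encodeUrl("_websvc.php", safeUrls))            # 2
print(encodeUrl("_phpmyadmin_index.php", safeUrls))  # 1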
Now we have converted a piece of complex log data into a number that has meaning we can leverage in our neural network. If we ponder the implementation further we'll notice we have a strong correlation between certain data points.
The response code, the entry category or answer, and the URL encoding all have a strong connection to the attempted file access. This is important to notice because it means we can expect our neural network to pick up on these relationships;
there is a strong real world basis for it to do just that. All the other data points we want to use come straight from our log entry and are numbers we can use in a tensor.
There is one thing, however, that comes to mind: the length value of the HTTP request. This information may vary greatly and is not that useful in our neural network as is. Can we convert it to something useful? Possibly. In this case I have noticed that log entries
with a content length of zero are always attack entries. So I'm going to inspect the length value and generate a 2 if the length is zero and a 1 if the length is greater than zero. This is similar to the way we converted the URL to a useful numeric representation.
You'll notice that we chose not to use 0 and 1; we chose 1 and 2 instead because they move our numeric tests outside the range of zero. Things tend to stick to zero in certain mathematical operations, and using numbers in a higher range seems like a good idea.
You may have thought that we should encode the HTTP response code to a numeric value as well. We could, but there are only a few response codes and they are already in number form, so we can leave them as is if need be. As a side task try adding your own response code
encoding and add it to the data tracked by the neural network. Don't forget that you'll have to delete the checkpoint files and retrain the network when you change something like this.
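If you take on that side task, one possible encoding (purely illustrative, not part of the current code) follows the same 1/2 convention we used for the URL and the length.

# Hypothetical response code encoding for the side task; not part of the tutorial code.
# 2 for a 2xx success response, 1 for everything else, matching our 1/2 convention.
def encodeResponseCode(code):
    code = int(code)
    if code >= 200 and code < 300:
        return 2
    return 1

print(encodeResponseCode("200"))  # 2
print(encodeResponseCode("404"))  # 1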
3: Load Apache Log Data
Most of our main bullet points are addressed now. We still need to figure out how to write up an IP address block list and use it to configure a firewall, but we can conquer those tasks once our neural network is up and running. So let's begin our implementation. First we'll
look at a new class, LoadApacheLogData; it is very similar to our LoadCsvData class. This is the first time we're loading a new data type that is not in CSV format. We'll notice that we're implicitly forcing a structure on the data loading class; perhaps in a future
tutorial we can define an abstract, or interface like, base class, but for now we'll leave that as an aside. Please review the full LoadApacheLogData source code below.
import csv
import DataRow
import codecs
import sys

class LoadApacheLogData:
    """ A class for loading apache log data into a data row. """

    rows = []
    rowCount = 0
    limitLoad = True
    rowLimit = 25
    cleanData = False
    cleanCount = 0
    verbose = False
    ignoreHeader = True
    header = ['Ip1', 'Ip2', 'Ip3', 'Ip4', 'HttpVersion', 'Timestamp', 'Url', 'ResponseCode', 'Length', 'Client', 'Answer']

    def __init__(self, lRows=[], lLimitLoad=False, lRowLimit=-1, lCleanData=False, lVerbose=False):
        self.rows = lRows
        self.limitLoad = lLimitLoad
        self.rowLimit = lRowLimit
        self.cleanData = lCleanData
        self.verbose = lVerbose
    #edef

    '''
    1 - - 127.0.0.1 - - [timeStamp] "POST url HTTP/1.1" 200 957 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_91)"
    answer: 1
    ip: 127.0.0.1
    opt1: -
    opt2: -
    timeStamp: []
    url: "POST url HTTP/1.1"
    responseCode: 200
    length: 957
    opt3: "-"
    client: "Apache-HttpClient/4.5.2 (Java/1.8.0_91)"
    '''

    #alf = apache log file
    def loadData(self, logFile='', type='alf', version='1.0', reset=False, dataMap={}, appendCols={}):
        print ("")
        print ("")
        print("Loading Data: " + logFile + " Type: " + type + " Version: " + version + " Reset: " + str(reset))
        if self.verbose:
            print "Found data mapping:"
            for i in dataMap:
                print(i, dataMap[i])
            #efl
        #eif

        if self.verbose:
            print "Found append cols mapping:"
            for i in appendCols:
                print(i)
            #efl
        #eif

        if reset == True:
            print "Resetting rows:"
            self.resetRows()
        #eif

        if type == 'alf' and version == '1.0' and logFile != '':
            ifile = codecs.open(logFile, 'rb', encoding="utf-8-sig")
            reader = ifile.readlines()
            lrows = []
            rownum = 0
            for row in reader:
                if rownum == 0 and self.ignoreHeader == False:
                    header = row
                    rownum += 1
                else:
                    colnum = 0
                    dRow = DataRow.DataRow()
                    dRow.verbose = self.verbose

                    # Append static values outside of the csv like stock symbol etc
                    for entry in appendCols:
                        for key in entry:
                            dRow.setMember(key, str(entry['idx']), entry[key])
                        # efl
                    # efl

                    if self.verbose:
                        print ('')
                    # eif

                    rowCols = self.processRow(row)
                    if len(rowCols) > 0:
                        for col in rowCols:
                            if self.verbose:
                                print (' %-8s: %s' % (self.header[colnum], col))
                            #eif

                            colName = self.header[colnum]
                            if len(dataMap) > 0:
                                try:
                                    memberIdx = dataMap[colName]
                                    if int(memberIdx) != -1:
                                        dRow.setMember(colName, int(memberIdx), col)
                                    # eif
                                except:
                                    if self.verbose:
                                        print ("Error setting member with index: ", colnum, " with value: ", col)
                                        print ("Unexpected error:", sys.exc_info()[0])
                                    # eif
                                # etry
                            else:
                                dRow.setMember(colName, colnum, col)
                            # eif
                            colnum += 1
                        # efl

                        if self.verbose:
                            if self.limitLoad:
                                dRow.printDataRow()
                            # eif
                        # eif

                        dRow.stampId()
                        lrows.append(dRow.copy())
                        self.rowCount += 1
                    # eif
                    rownum += 1
                # eif

                if self.limitLoad == True and self.rowCount >= self.rowLimit and self.rowLimit > 0:
                    break;
                # eif
            # efl

            ifile.close()
            print ("Loaded %i rows from this data file." % (rownum))
            lrows = self.sortRows(lrows)
            self.cleanRows(lrows)
            self.rows.extend(lrows)
            print ('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
        # eif
    # edef

    def processRow(self, row):
        # 127.0.0.1 - - [21/Sep/2016:13:58:06 +0000] "POST url HTTP/1.1" 200 957 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_91)"
        # print("Row: %s" % (row))
        if row == None or row == "":
            return []
        # eif

        nr1 = row.split(" - - ")
        if len(nr1) == 2:
            answer = -1
            ip = nr1[0]
            nr2 = nr1[1]
        else:
            answer = nr1[0]
            ip = nr1[1]
            nr2 = nr1[2]
        # eif

        nr3 = nr2.split(" \"-\" ")
        if len(nr3) != 2:
            return []
        # eif

        client = nr3[1].replace("\"", "")
        client = client.replace("/", "_")
        nr4 = nr3[0]
        nr4 = nr4.replace("[", "")
        nr4 = nr4.replace("]", "")
        nr4 = nr4.replace("\"", "")

        # 21/Sep/2016:13:58:06 +0000 POST url HTTP/1.1 200 957
        nr5 = nr4.split(" ")
        if len(nr5) != 7:
            return []
        # eif

        date = nr5[0]
        offset = nr5[1]
        httpCmd = nr5[2]
        url = nr5[3].replace("/", "_")
        httpVer = nr5[4].replace("/", "_")
        httpResCode = nr5[5]
        httpLen = nr5[6]

        '''
        print ('Client: %s' % (client))
        print ('Ip: %s' % (ip))
        print ('Date: %s' % (date))
        print ('Offset: %s' % (offset))
        print ('HttpCmd: %s' % (httpCmd))
        print ('Url: %s' % (url))
        print ('HttpVer: %s' % (httpVer))
        print ('HttpResCode: %s' % (httpResCode))
        print ('HttpLen: %s' % (httpLen))
        '''

        ips = ip.split(".")
        ip1 = ips[0]
        ip2 = ips[1]
        ip3 = ips[2]
        ip4 = ips[3]
        return [ip1, ip2, ip3, ip4, httpVer, (date + " " + offset), url, httpResCode, httpLen, client, answer]
    # edef

    def resetRows(self):
        self.rows = []
        self.rowCount = 0
        self.cleanCount = 0
    # edef

    def sortRows(self, lrows):
        return sorted(lrows, key=id)
    # edef

    def cleanRows(self, lrows):
        if self.cleanData == True:
            print ("Cleaning row data...")
            should_restart = True
            while should_restart:
                should_restart = False
                for row in lrows:
                    if row.error == True:
                        lrows.remove(row)
                        self.rowCount -= 1
                        self.cleanCount += 1
                        should_restart = True
                    # eif
                # efl
            # fwl
        # eif
    # edef
# eclass
First up we'll take a look at our class members. There are only two new class members in the Apache log version of our data loader, ignoreHeader and header. For a review of the LoadCsvData class please peruse this tutorial. A common
feature of CSV data is the header row; this is not the case for Apache log data, so we've introduced some new class members to help handle the header columns. We reviewed the format of our Apache logs earlier in the tutorial, so we can expect a certain set of columns in our log data.
We're going to store column names in our LoadApacheLogData class so that we have a way to reference the individual pieces of data we pull from the Apache logs. I want to keep the functionality of the two data loading classes similar, and who knows, I may add a new header row feature, but for now
we'll use the ignoreHeader boolean to escape code that expects a header row. The expected header column list is as follows.
header = ['Ip1', 'Ip2', 'Ip3', 'Ip4', 'HttpVersion', 'Timestamp', 'Url', 'ResponseCode', 'Length', 'Client', 'Answer']
We'll start our deep dive into the code by reviewing the processRow method. This method is responsible for splitting up the apache log entry and returning an array of data. This step is done automatically for us in LoadCsvData by the CSV reader class. The log data goes through a number of splits
and transformations. You can add support for your own apache log entry format here and return an array of data that is aligned with our header array. The last line of processRow is shown below.
return [ip1, ip2, ip3, ip4, httpVer, (date + " " + offset), url, httpResCode, httpLen, client, answer]
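To make the splitting concrete, here is what processRow returns for the unmarked sample entry from earlier, assuming the class file is importable as LoadApacheLogData; a marked entry would return its '1' or '0' in the trailing answer slot instead of -1.

import LoadApacheLogData

loader = LoadApacheLogData.LoadApacheLogData()
row = '173.208.197.203 - - [21/Sep/2016:15:55:50 +0000] "GET / HTTP/1.1" 200 11576 "-" "Mozilla/4.0 (compatible; ...)"'
print(loader.processRow(row))
# ['173', '208', '197', '203', 'HTTP_1.1', '21/Sep/2016:15:55:50 +0000', '_',
#  '200', '11576', 'Mozilla_4.0 (compatible; ...)', -1]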
But wait, where is the URL encoding we discussed? We're also missing our content length encoding. What gives? Good catch; we're going to group those two missing columns into our feature generation and statistics generation step. For now we'll review some of the slight differences in our LoadApacheLogData
class when compared to our LoadCsvData class. The first thing you should notice in our loadData method is that the type argument defaults to 'alf', not 'csv'. Since we are loading an Apache log file we'll use 'alf' as the expected type value. There isn't a more specific file extension used by Apache other than .log, sometimes, and
that is too general for my taste. The next difference we'll look at is the initialization of the reader variable: in LoadApacheLogData it is set to reader = ifile.readlines() as opposed to reader = csv.reader(ifile). In both cases we're bringing in row data, but in the case of our CSV data the rows are processed as the file is read,
while in LoadApacheLogData each row is just a string until it is processed by a call to the local processRow method.
The first row of the Apache log file is currently protected from being interpreted as a header row by the ignoreHeader boolean. The next big difference occurs in the row column processing. The class LoadCsvData uses a CSV reader, so the rows arrive as arrays of column data. In LoadApacheLogData we need to do this ourselves, so you'll notice the following code where
row columns begin to be processed.
rowCols = self.processRow(row)
if len(rowCols) > 0:
    for col in rowCols:
The main difference is that we're parsing log rows with our own internal class method, and that method is designed to return an empty array if there is an issue parsing the log entry. To that end we apply a quick check to make sure there are columns to process before we proceed.
That pretty much takes care of our data loading. Notice how powerful our base design is; we were able to load a completely new type of data set with very few adjustments to our original CSV loading code.
Now that we have our data loading class in place, let's take a look at the data mapping entry for blue ice. The following is the latest entry in our Data2DataRow file; notice that the mapping matches the return array from our processRow method. This is similar to how the Data2DataRow mapping for CSV data maps to the column layout of the CSV file.
"blue_ice":
{
"Ip1": "0",
"Ip2": "1",
"Ip3": "2",
"Ip4": "3",
"HttpVersion": "4",
"Timestamp": "5",
"Url": "6",
"ResponseCode": "7",
"Length": "8",
"Client": "9",
"Answer": "10"
}
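As a quick usage sketch, this is roughly how the mapping gets handed to loadData as the dataMap argument. The data2DataRow dictionary name here is a placeholder for whatever your Data2DataRow file actually provides; the log path is one of the files from the blue ice config below.

import LoadApacheLogData

# Placeholder mapping; in the real project this comes from the Data2DataRow file.
data2DataRow = {
    "blue_ice": {
        "Ip1": "0", "Ip2": "1", "Ip3": "2", "Ip4": "3",
        "HttpVersion": "4", "Timestamp": "5", "Url": "6",
        "ResponseCode": "7", "Length": "8", "Client": "9", "Answer": "10"
    }
}

loader = LoadApacheLogData.LoadApacheLogData(lRows=[], lLimitLoad=False, lRowLimit=-1, lCleanData=True, lVerbose=False)
loader.loadData(logFile="./data/access_logs/access.log.1", type='alf', version='1.0',
                reset=False, dataMap=data2DataRow["blue_ice"], appendCols=[])
print("Rows loaded: %i" % (loader.rowCount))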
4: Blue Ice Config
Now that we have an idea how our data is loaded and how easily it fits into our TensorFlow workflow, let's take a look at the blue ice execution dictionary entry before we move on and see how
features are generated for blue ice. Below is the execution dictionary entry for blue ice; it looks a lot like previous entries we've worked with, but it also has a few new entries.
"blue_ice_log_reg":
{
# Apache Log File => alf
'type': 'alf',
'data_2_datarow_type': 'blue_ice',
'datarow_2_tensor_type': 'blue_ice_log_reg_hack_attack',
'version': '1.0',
'reset': False,
'checkpoint': True,
'limitLoad': False,
'cleanData': True,
'verbose': False,
'rowLimit': 25,
'validatePrct': 0.20,
'trainPrct': 0.80,
'randomSeed': False,
'trainStepsMultiplier': 20,
'learning_rate': 0.000001,
'log_reg_positive_result': 0.50,
'lin_reg_positive_result': 0.50,
'model_type': 'logistic_regression',
'loader': 'load_apache_log_data',
'feature_type': 'blue_ice_log_reg',
'logPrint': 500,
'checkpointSave': 500,
'eval_type': 'blue_ice',
'files': {
'file1': {'name': dataDir + "/access_logs/access.log.1", 'appendCols': []},
'file2': {'name': dataDir + "/access_logs/access.log.2", 'appendCols': []},
'file3': {'name': dataDir + "/access_logs/access.log.3", 'appendCols': []},
'file4': {'name': dataDir + "/access_logs/access.log.4", 'appendCols': []},
'file5': {'name': dataDir + "/access_logs/access.log.5", 'appendCols': []},
'file6': {'name': dataDir + "/access_logs/access.log.6", 'appendCols': []},
'file7': {'name': dataDir + "/access_logs/access.log.7", 'appendCols': []},
'file8': {'name': dataDir + "/access_logs/access.log.8", 'appendCols': []},
'file9': {'name': dataDir + "/access_logs/access.log.9", 'appendCols': []},
'file10': {'name': dataDir + "/access_logs/access.log.10", 'appendCols': []},
'file11': {'name': dataDir + "/access_logs/access.log.11", 'appendCols': []},
'file12': {'name': dataDir + "/access_logs/access.log.12", 'appendCols': []},
'file13': {'name': dataDir + "/other_vhosts_access/other_vhosts_access.log", 'appendCols': []},
'file14': {'name': dataDir + "/other_vhosts_access/other_vhosts_access.log.1", 'appendCols': []},
'file15': {'name': dataDir + "/other_vhosts_access/other_vhosts_access.log.2", 'appendCols': []},
'file16': {'name': dataDir + "/other_vhosts_access/other_vhosts_access.log.3", 'appendCols': []},
'file17': {'name': dataDir + "/other_vhosts_access/other_vhosts_access.log.4", 'appendCols': []},
'file18': {'name': dataDir + "/other_vhosts_access/other_vhosts_access.log.5", 'appendCols': []},
},
'extras': [
'_websvc.php',
'_websvcclient.php',
'_scripts_utils_Log.js',
'_scripts_utils_md5.js',
'_scripts_utils_BufferedReader.js',
'_scripts_webutils_MmgMd5.js',
'_scripts_webutils_MsNewsSrchWebSvcClient.js',
'_scripts_webutils_SaWebSvcClient.js',
'_scripts_webutils_YhStockQuoteWebSvcClient.js',
'_scripts_webutils_MsTxtSntWebSvcClient.js',
'_scripts_stockanalytics_StockAnalytics.js',
'_scripts_models_ModelsConfig.js',
'_scripts_models_ModelArticlesCount.js',
'_scripts_models_ModelArticlesGetAllActive.js',
'_scripts_models_ModelDataSourcesCount.js',
'_scripts_models_ModelStockQuotesCount.js',
'_scripts_models_ModelDataSourcesUsageCount.js',
'_scripts_models_ModelTargetSearchTextCount.js',
'_scripts_models_ModelTargetSymbolsCount.js',
'_scripts_models_ModelTextRefinementCount.js',
'_scripts_models_ModelDataSourcesUsageSummary.js',
'_scripts_models_ModelDataSourcesGetAllActive.js',
'_scripts_models_ModelTargetSearchTextGetAllActive.js',
'_scripts_models_ModelStockQuotesGetAllActive.js',
'_scripts_models_ModelTargetSymbolsGetAllActive.js',
'_scripts_models_ModelTextRefinementGetAllActive.js',
'_scripts_views_ViewArticlesCount.js',
'_scripts_views_ViewArticlesGetAllActive.js',
'_scripts_views_ViewDataSourcesCount.js',
'_scripts_views_ViewStockQuotesCount.js',
'_scripts_views_ViewDataSourcesUsageCount.js',
'_scripts_views_ViewTargetSearchTextCount.js',
'_scripts_views_ViewTargetSymbolsCount.js',
'_scripts_views_ViewTextRefinementCount.js',
'_scripts_views_ViewDataSourcesUsageSummary.js',
'_scripts_views_ViewDataSourcesGetAllActive.js',
'_scripts_views_ViewStockQuotesGetAllActive.js',
'_scripts_views_ViewTargetSearchTextGetAllActive.js',
'_scripts_views_ViewTargetSymbolsGetAllActive.js',
'_scripts_views_ViewTextRefinementGetAllActive.js',
'_scripts_controllers_ControllerArticlesCount.js',
'_scripts_controllers_ControllerDataSourcesCount.js',
'_scripts_controllers_ControllerArticlesGetAllActive.js',
'_scripts_controllers_ControllerStockQuotesCount.js',
'_scripts_controllers_ControllerDataSourcesUsageCount.js',
'_scripts_controllers_ControllerTargetSearchTextCount.js',
'_scripts_controllers_ControllerTargetSymbolsCount.js',
'_scripts_controllers_ControllerDataSourcesUsageSummary.js',
'_scripts_controllers_ControllerTextRefinementCount.js',
'_scripts_controllers_ControllerTargetSearchTextGetAllActive.js',
'_scripts_controllers_ControllerDataSourcesGetAllActive.js',
'_scripts_controllers_ControllerStockQuotesGetAllActive.js',
'_scripts_controllers_ControllerTargetSymbolsGetAllActive.js',
'_scripts_controllers_ControllerTextRefinementGetAllActive.js'
]
}
Most everything looks like previous entries we've reviewed, but there are a few small additions worth covering. First up we have a new entry, checkpointSave. I'll cover this in depth in just a little bit,
but the checkpoint system has been updated and is fully functional. Because we want to use our trained neural network in a server environment as a scheduled task, automating the scanning of web access logs and
subsequently updating the firewall's block list, we don't want to train the network every time we use it. Turning on checkpoints makes the code load up the last saved state of the network. This way we can pick up right where we left
off and use the trained network to generate inferences. We've adjusted the checkpoint code to use a local directory to store our checkpoint files. The checkpointSave entry defines how often we save the network state to a checkpoint file, every X number of training steps.
By default the checkpoint code saves the last five checkpoint files.
The only other entry that should catch your eye is the extras entry. Take a quick look at it, does it remind you of anything? This is the list of URLs and URL roots that we use to encode our Apache log URL to a two or a one. This is a new feature in our neural network code,
the ability to pass along extra information to be used in the feature generation process. Let's take a moment to look into our code execution method, run; we'll list the new feature generation code below.
print("Found feature type: " + featureType)
fData = LoadFeatureData.LoadFeatureData(limitLoad, rowLimit, cleanData, verbose, data)
if 'extras' in exeCfg:
fData.setExtras(exeCfg['extras'])
# eif
if featureType != '':
fData.generateData(featureType)
else:
fData.generateData('')
# eif
5: Generating Features
The latest addition to our setup is the 'extras' entry used in our feature generation step. Looking at the execution config run method in Main.py we can see that if the extras key
is defined then we set the extras class variable to the passed-in extras data; we've purposely left the structure loose and undefined so it can be flexible.
We should also note the handling of the checkpointSave information. If the entry exists we pass it along to our linear and logistic regression classes via the
setCheckPointSave and setCheckPointOn methods. I'll go over these small changes to the regression classes a little later on. Next up let's look at our feature generation code.
elif type == 'blue_ice_log_reg':
    self.resetRows()
    lrows = []
    lrows.extend(self.loadCsvData.rows)
    rownum = 0
    length = len(lrows)
    for i in xrange(length):
        try:
            float(lrows[i].getMemberByName('Ip1'))
            float(lrows[i].getMemberByName('Ip2'))
            float(lrows[i].getMemberByName('Ip3'))
            float(lrows[i].getMemberByName('Ip4'))
            float(lrows[i].getMemberByName('ResponseCode'))
            float(lrows[i].getMemberByName('Answer'))

            llen = lrows[i].getMemberByName('Length')
            # Compare numerically; the raw Length member is a string pulled from the log entry
            if float(llen) == 0:
                lrows[i].setMember('IsZeroLen', 12, 2.00)
            else:
                lrows[i].setMember('IsZeroLen', 12, 1.00)
            # eif

            url = lrows[i].getMemberByName('Url')
            lfound = False
            for r in self.extras:
                if url == "_" or r in url:
                    lrows[i].setMember('IsUrlKnown', 11, 2.00)
                    lfound = True
                    break
                # eif
            # efl

            if lfound == False:
                lrows[i].setMember('IsUrlKnown', 11, 1.00)
            # eif

            if float(lrows[i].getMemberByName('Answer')) == 1.0:
                lrows[i].setMember('AnswerCatYes', 13, 2.00)
                lrows[i].setMember('AnswerCatNo', 14, 1.00)
            else:
                lrows[i].setMember('AnswerCatYes', 13, 1.00)
                lrows[i].setMember('AnswerCatNo', 14, 2.00)
            # eif

            if self.verbose:
                print("ResponseCode: %s" % (lrows[i].getMemberByName('ResponseCode')))
                print("Answer: %s" % (lrows[i].getMemberByName('Answer')))
                print("Url: %s" % (lrows[i].getMemberByName('Url')))
                print("IsUrlKnown: %s" % (lrows[i].getMemberByName('IsUrlKnown')))
                print("AnswerCatYes: %s" % (lrows[i].getMemberByName('AnswerCatYes')))
                print("AnswerCatNo: %s" % (lrows[i].getMemberByName('AnswerCatNo')))
            # eif
        except:
            print "Unexpected error:", sys.exc_info()[0]
            lrows[i].error = True
            raise
        # etry
        rownum = rownum + 1
    # efl

    print ("Loaded %i rows from this data file." % (rownum))
    lrows = self.sortRows(lrows)
    self.cleanRows(lrows)
    self.rows.extend(lrows)
    self.rowCount = len(lrows)
    print ('CleanCount: %i RowCount: %i RowsFound: %i' % (self.cleanCount, self.rowCount, len(self.rows)))
# eif
# eif
Remember, the LoadFeatureData class is where we write our custom feature and statistic generation code, so it makes sense that we use our extras data in this custom code spot.
The code is not too different from other feature generation done in this class. We cast all expected numeric values to float; if there is a data type error then the DataRow object gets marked
as an error. The first feature we generate is the IsZeroLen feature, based on a simple test of the length value. The next little piece of code tests the Apache log URL to see if it contains as a substring
any of the valid URLs or URL roots listed in our extras array. If a match is found we encode our URL with an IsUrlKnown entry of 2; if no substring match is found we encode our URL with an IsUrlKnown entry of 1.
The additional feature columns must have unique indices by design, so we use indexes 11 and 12 here. That pretty much wraps up the feature generation; our DataRow class has everything we need. Next we'll take
a look at our DataRow2Tensor file, specifically our blue ice entry.
"blue_ice_log_reg_hack_attack": ['Ip1', 'Ip2', 'Ip3', 'Ip4', 'IsUrlKnown', 'ResponseCode', 'IsZeroLen'],
6: Updated Logistic Regression Model
Currently we're only interested in these columns for our regression model. We can run our neural network against our data now, since our initial design was modular and data driven. Before we wrap up we're going to
look at some code in our RegModelLogistic class, specifically the checkpoint features and some newer evaluation specific code.
def setCheckPointSave(self, cps):
    self.checkpointSave = cps;
# edef

def setCheckPointOn(self, cpo):
    self.checkpoint = cpo;
# edef

if self.checkpoint == True:
    # Verify we don't have a checkpoint saved already
    ckpt = tf.train.get_checkpoint_state(os.path.dirname(__file__) + "/checkpoints/")
    if ckpt and ckpt.model_checkpoint_path:
        # Restores from checkpoint
        saver.restore(sess, ckpt.model_checkpoint_path)
        initial_step = int(ckpt.model_checkpoint_path.rsplit('-', 1)[1])
    # eif
# eif

# Training loop
if self.checkpoint == True:
    for step in range(initial_step, training_steps):
        sess.run([train_op])
        if step % self.logPrint == 0:
            print ("Loss: ", sess.run([total_loss]))
        # eif
        if step % self.checkpointSave == 0:
            saver.save(sess, './checkpoints/eod-model', global_step=step)
        # eif
    # efl
First off we have two new class methods for passing in the checkpoint and checkpoint save options from our execution config dictionary entry. They are just simple setter methods. The other new checkpoint code is the
introduction of a checkpoints directory for storage and a step interval at which to save the checkpoint data. By default the last five checkpoints are saved during a training session. If checkpoint files were left behind, TensorFlow will restore the last saved
state of the neural network, and unless there are more training steps to run it simply picks up that trained state. We can then use Python and TensorFlow to process a new set of inferences. We will tackle this in our next tutorial.
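For context, the training loop above assumes that a model, a total_loss, and a train_op have already been defined elsewhere in RegModelLogistic. Below is a minimal sketch of what a two-class logistic regression of this shape can look like in TensorFlow 1.x; the function names are mine, and only the shapes (weight (7, 2), bias (2,)) and the learning rate are taken from this tutorial's config and output.

import tensorflow as tf

# Sketch only; the real RegModelLogistic class defines its own versions of these.
W = tf.Variable(tf.zeros([7, 2]), name="weights")  # 7 feature columns, 2 output classes
b = tf.Variable(tf.zeros([2]), name="bias")

def combine_inputs(X):
    # Linear combination of the seven feature columns into two class scores.
    return tf.matmul(X, W) + b

def inference(X):
    # Softmax turns the two scores into yes/no probabilities.
    return tf.nn.softmax(combine_inputs(X))

def loss(X, Y):
    # Cross entropy against the two-column answer tensor.
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=combine_inputs(X), labels=Y))

def train(total_loss, learning_rate=0.000001):
    # Plain gradient descent using the learning_rate from the execution config.
    return tf.train.GradientDescentOptimizer(learning_rate).minimize(total_loss)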
Last but not least we have a new eval type setup; the special, custom eval_type specific code has been moved out of the eval method and into the main TensorFlow execution method. The output of our program is as follows.
My training data resulted in a 98% success rate in matching good and bad apache access logs!
Application Version: 0.4.0.6
Found loader: load_apache_log_data
Loading Data: ./data/access_logs/access.log.3 Type: alf Version: 1.0 Reset: False
Loaded 8 rows from this data file.
CleanCount: 0 RowCount: 7 RowsFound: 7
Loading Data: ./data/access_logs/access.log.2 Type: alf Version: 1.0 Reset: False
Loaded 6 rows from this data file.
CleanCount: 0 RowCount: 13 RowsFound: 13
Loading Data: ./data/access_logs/access.log.1 Type: alf Version: 1.0 Reset: False
Loaded 647 rows from this data file.
CleanCount: 0 RowCount: 656 RowsFound: 656
Loading Data: ./data/access_logs/access.log.10 Type: alf Version: 1.0 Reset: False
Loaded 10 rows from this data file.
CleanCount: 0 RowCount: 662 RowsFound: 662
Loading Data: ./data/access_logs/access.log.7 Type: alf Version: 1.0 Reset: False
Loaded 8 rows from this data file.
CleanCount: 0 RowCount: 669 RowsFound: 669
Loading Data: ./data/access_logs/access.log.6 Type: alf Version: 1.0 Reset: False
Loaded 13 rows from this data file.
CleanCount: 0 RowCount: 682 RowsFound: 682
Loading Data: ./data/access_logs/access.log.5 Type: alf Version: 1.0 Reset: False
Loaded 3 rows from this data file.
CleanCount: 0 RowCount: 685 RowsFound: 685
Loading Data: ./data/access_logs/access.log.4 Type: alf Version: 1.0 Reset: False
Loaded 13 rows from this data file.
CleanCount: 0 RowCount: 698 RowsFound: 698
Loading Data: ./data/access_logs/access.log.9 Type: alf Version: 1.0 Reset: False
Loaded 9 rows from this data file.
CleanCount: 0 RowCount: 707 RowsFound: 707
Loading Data: ./data/access_logs/access.log.8 Type: alf Version: 1.0 Reset: False
Loaded 5 rows from this data file.
CleanCount: 0 RowCount: 712 RowsFound: 712
Loading Data: ./data/access_logs/access.log.11 Type: alf Version: 1.0 Reset: False
Loaded 2 rows from this data file.
CleanCount: 0 RowCount: 714 RowsFound: 714
Loading Data: ./data/other_vhosts_access/other_vhosts_access.log.5 Type: alf Version: 1.0 Reset: False
Loaded 988 rows from this data file.
CleanCount: 0 RowCount: 1551 RowsFound: 1551
Loading Data: ./data/other_vhosts_access/other_vhosts_access.log.3 Type: alf Version: 1.0 Reset: False
Loaded 863 rows from this data file.
CleanCount: 0 RowCount: 2414 RowsFound: 2414
Loading Data: ./data/access_logs/access.log.12 Type: alf Version: 1.0 Reset: False
Loaded 32 rows from this data file.
CleanCount: 0 RowCount: 2442 RowsFound: 2442
Loading Data: ./data/other_vhosts_access/other_vhosts_access.log.4 Type: alf Version: 1.0 Reset: False
Loaded 635 rows from this data file.
CleanCount: 0 RowCount: 3076 RowsFound: 3076
Loading Data: ./data/other_vhosts_access/other_vhosts_access.log.1 Type: alf Version: 1.0 Reset: False
Loaded 228 rows from this data file.
CleanCount: 0 RowCount: 3294 RowsFound: 3294
Loading Data: ./data/other_vhosts_access/other_vhosts_access.log.2 Type: alf Version: 1.0 Reset: False
Loaded 428 rows from this data file.
CleanCount: 0 RowCount: 3718 RowsFound: 3718
Loading Data: ./data/other_vhosts_access/other_vhosts_access.log Type: alf Version: 1.0 Reset: False
Loaded 4 rows from this data file.
CleanCount: 0 RowCount: 3722 RowsFound: 3722
Found feature type: blue_ice_log_reg
Generating Feature Data: Type: blue_ice_log_reg
Loaded 3722 rows from this data file.
Cleaning row data...
CleanCount: 0 RowCount: 3722 RowsFound: 3722
Generating Tensor Data:
TensorRow Answer Shape: (3722, 2)
TensorRow Data Shape: (3722, 7)
TensorRow Count: 3722
TensorTrain Answer Shape: (743, 2)
TensorTrain Data Shape: (743, 7)
TensorTrain Count: 743
TensorValidate Answer Shape: (745, 2)
TensorValidate Data Shape: (745, 7)
TensorValidate Count: 745
Found training steps: 14860
Found tensor dimension: 7
2017-08-14 08:51:49.058836: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-14 08:51:49.058878: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-14 08:51:49.058886: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Weight: (7, 2)
Bias: (2,)
('Loss: ', [1.9991543])
('Loss: ', [1.9160835])
('Loss: ', [1.9151578])
('Loss: ', [1.9145861])
('Loss: ', [1.9142262])
('Loss: ', [1.914006])
('Loss: ', [1.9138721])
('Loss: ', [1.9137927])
('Loss: ', [1.9137374])
('Loss: ', [1.9137044])
('Loss: ', [1.913682])
('Loss: ', [1.9136708])
('Loss: ', [1.9136642])
('Loss: ', [1.9136577])
('Loss: ', [1.9136544])
('Loss: ', [1.9136547])
('Loss: ', [1.9136541])
('Loss: ', [1.9136521])
('Loss: ', [1.9136524])
('Loss: ', [1.9136516])
('Loss: ', [1.9136503])
('Loss: ', [1.9136521])
('Loss: ', [1.9136512])
('Loss: ', [1.9136506])
('Loss: ', [1.9136508])
('Loss: ', [1.9136504])
('Loss: ', [1.9136502])
('Loss: ', [1.9136499])
('Loss: ', [1.9136497])
('Loss: ', [1.9136497])
Accuracy: 0.9826
Custom Evaluation: blue_ice
Test1: [Bad] False
Test2: [Good] True
Test3: [Bad] False
Test4: [Good] True