Misconfiguration detection framework
The misconfiguration detection framework adopted in this project is inspired by the research paper 'Synthesizing Configuration File Specifications with Association Rule Learning'. Association rule learning is a method for discovering frequently occurring patterns or associations between variables in a dataset. In association rule learning, support and confidence are two metrics widely used to filter proposed rules. Support is the percentage of training configuration files in which the keywords of a proposed rule appear. Confidence is the percentage of those appearances for which the proposed rule holds true.
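As a toy illustration (with invented counts, not the project's dataset), support and confidence for a candidate rule can be computed as:

```python
# Toy illustration: support and confidence for the candidate rule
# "the key 'port' has data type int" over 10 training files.
# All counts below are made up for the example.
total_files = 10
files_with_port = 8           # files in which the key 'port' appears
files_where_rule_holds = 7    # files in which 'port' parses as an int

support = files_with_port / total_files                 # 8/10 = 0.8
confidence = files_where_rule_holds / files_with_port   # 7/8 = 0.875

print('support =', support, 'confidence =', confidence)
```

A rule is kept only when both values clear their thresholds, as the learner below does.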
from IPython.display import Image
Image('images/framework.png')
As you can see in the above image, the misconfiguration detection framework has two important modules: the translator and the learner.
- Translator: The translator works as a parser; it converts raw configuration files into an intermediate representation, generally in the format of key, value, data type, frequency (k, v, τ, f).
- Learner: The learner discovers frequently occurring patterns or associations between keywords in configuration files to derive rules.
In this notebook, our objective is to find data type errors.
Many system failures are caused by assigning a value of the wrong data type to a keyword. In this method, for each key, we match the data type of the target value against the data type information inferred from the training set. An error is reported if the match fails.
Data type error framework
Following is an example of a data type error detection framework.
Image('images/type_error_framework.png')
Steps for data type error detection: 1) The translator parses the configuration files, determines the data type of values, and aggregates the results for each key. 2) The learner proposes rules by assigning every possible data type to each keyword in the configuration files, then filters the proposed rules using support and confidence.
In this notebook, we are finding data type errors in mysql.cnf files. The dataset used in this notebook is publicly available on GitHub. You can download the data by running the script data_downloader.py in the same directory where this notebook exists.
Data type errors in mysql.cnf files
import os
import json
import pandas as pd
import numpy as np
from itertools import combinations
import nltk
import re
from collections import defaultdict
import math
import ntpath
from pathvalidate import is_valid_filename, is_valid_filepath
import ipaddress
import urllib
import configparser
# The function below finds likely typos among configuration keywords based on their frequencies.
def get_typos(key_freq, confidence_t=0.9, typo_limit=2):
"""This function finds typos based on keyword frequencies.
We find similar words and filter the results based on the confidence threshold.
Parameters
----------
key_freq : dictionary
Frequency of keyword in configuration files.
confidence_t : float, default 0.9
confidence threshold for spell check
typo_limit : int, default 2
The number of typos allowed. We calculate typos using edit distance
For every ten characters, we increase this argument by 1
Returns
----------
spell_errors: dictionary
key in the dictionary is a misspelled keyword, and value is the correct version of a keyword
"""
spell_errors = {}
for i, j in combinations(key_freq.keys(), r=2):
# Calculate similarity between keywords using Levenshtein (edit) distance.
if nltk.edit_distance(i, j) >= min(len(i), len(j))//10+typo_limit:
continue
if key_freq[i] >= key_freq[j]:
if key_freq[i]/(key_freq[i] + key_freq[j]) > confidence_t:
spell_errors[j] = i
else:
if key_freq[j]/(key_freq[i] + key_freq[j]) > confidence_t:
spell_errors[i] = j
return spell_errors
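The typo-detection logic above can be sketched in a standalone form; the `edit_distance` helper below is a plain dynamic-programming Levenshtein distance standing in for `nltk.edit_distance`, and the example keywords and counts are invented:

```python
from itertools import combinations

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def get_typos_sketch(key_freq, confidence_t=0.9, typo_limit=2):
    # Same decision rule as get_typos: pairs within the edit-distance
    # budget are typo candidates; the rarer spelling is mapped to the
    # more frequent one when the frequency ratio clears the threshold.
    spell_errors = {}
    for i, j in combinations(key_freq, r=2):
        if edit_distance(i, j) >= min(len(i), len(j)) // 10 + typo_limit:
            continue
        hi, lo = (i, j) if key_freq[i] >= key_freq[j] else (j, i)
        if key_freq[hi] / (key_freq[hi] + key_freq[lo]) > confidence_t:
            spell_errors[lo] = hi
    return spell_errors

print(get_typos_sketch({'max_connections': 95, 'max_connektions': 2}))
# {'max_connektions': 'max_connections'}
```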
# The create_keyvalue_df function takes a folder path as input and creates a list of dictionaries as output.
def create_keyvalue_df(data_dir, fixed_typos=True):
"""This function creates a list of dictionaries as output for a set of configuration files.
We parse and convert each configuration file into a dictionary format.
and append all this dictionary into a list
Parameters
----------
data_dir : string
folder path of the configuration files.
fixed_typos : bool, default True
If True, fix typos in configuration files.
Returns
----------
dict_list: list of dictionaries
list where each element is a dictionary of a configuration file.
"""
dict_list = []
parse_errors = 0
errorconfigs_filename = []
for filename in os.listdir(data_dir):
try:
## Adding dummy section to avoid no header error
with open(os.path.join(data_dir, filename), 'r') as f:
config_string = '[dummy_section]\n' + f.read()
config = configparser.ConfigParser(allow_no_value=True, strict=False, inline_comment_prefixes= '#')
cfgs = {}
config.read_string(config_string)
for sect in config.sections():
cfgs.update(dict(config.items(sect)))
dict_list.append(cfgs)
except Exception:
errorconfigs_filename.append(filename)
parse_errors += 1
if fixed_typos:
## Fix typos
## Create dataframe
config_df = pd.DataFrame(dict_list)
## Find count of each key
key_dict = config_df.count().to_dict()
## get typos based on frequency of keys
key_changes = get_typos(key_dict)
## Fix typos
for dict_i in dict_list:
for key in dict_i.copy():
if key in key_changes:
dict_i[key_changes[key]] = dict_i.pop(key)
return dict_list
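The dummy-section trick used in `create_keyvalue_df` can be illustrated on an in-memory string (the option values below are made up for the example):

```python
import configparser

# MySQL option files may place keys before any [section] header, which
# configparser rejects, so create_keyvalue_df prepends a synthetic header.
raw = """port = 3306
socket = /var/run/mysqld/mysqld.sock  # inline comment

[mysqld]
max_connections = 151
"""

config = configparser.ConfigParser(allow_no_value=True, strict=False,
                                   inline_comment_prefixes='#')
config.read_string('[dummy_section]\n' + raw)

# Flatten all sections (including the synthetic one) into one dictionary,
# exactly as create_keyvalue_df does per file.
cfgs = {}
for sect in config.sections():
    cfgs.update(dict(config.items(sect)))
print(cfgs)
```

Note that the inline comment is stripped and all values come back as strings, which is why the data types must be inferred afterwards.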
Finding data type from the given value
We used a set of heuristics to infer predefined data types from a given value. Our framework detects the bool, size, int, hostname/IP address, file path, filename, URI, and string data types.
The following function takes the value of a keyword as input and infers the data type.
## array for boolean
bool_arr = ['on', 'off', 'true', 'false', 'yes', 'no']
def valid_ip(val):
val, separator, port = val.partition(':')
for ip in ipaddress.ip_network, ipaddress.ip_address:
try:
if ip==ipaddress.ip_network:
ip(val,False)
else:
ip(val)
return True
except ValueError:
pass
return False
def is_ip(val):
"""This function checks whether the data type of a value is an IP address.
Parameters
----------
value :
value for which we want to check data type
Returns
----------
datatype:
a binary value indicating whether the given input argument is an IP address
"""
val_list = val.split(',')
list_comp = [valid_ip(i) for i in val_list]
return all(list_comp)
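A standalone sketch of the IP check, mirroring `valid_ip`/`is_ip` above: plain addresses, CIDR networks, and an optional `:port` suffix are accepted. (As in the original, IPv6 literals containing `:` would be split incorrectly by the port handling.)

```python
import ipaddress

def valid_ip_sketch(val):
    # Strip an optional ':port' suffix, then try parsing as a network
    # (non-strict, so host bits are allowed) and as a plain address.
    addr, _sep, _port = val.partition(':')
    for parse in (lambda v: ipaddress.ip_network(v, strict=False),
                  ipaddress.ip_address):
        try:
            parse(addr)
            return True
        except ValueError:
            pass
    return False

print(valid_ip_sketch('127.0.0.1:3306'))   # True
print(valid_ip_sketch('192.168.0.0/24'))   # True
print(valid_ip_sketch('not-an-ip'))        # False
```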
def is_filepath(val):
"""This function checks whether the data type of a value is a file path.
Parameters
----------
value :
value for which we want to check data type
Returns
----------
datatype:
a binary value indicating whether the given input argument is a file path
"""
val = val.lstrip('/')
return is_valid_filepath(val) and (val.find('/')!=-1 or val.find('\\')!=-1)
def is_number(string):
"""This function checks whether the data type of a value is a number.
Parameters
----------
value :
value for which we want to check data type
Returns
----------
datatype:
a binary value indicating whether the given input argument is a number
"""
try:
float(string)
return True
except ValueError:
return False
def is_url(url):
"""This function checks whether the data type of a value is a URL.
Parameters
----------
value :
value for which we want to check data type
Returns
----------
datatype:
a binary value indicating whether the given input argument is a URL
"""
url = url.strip()
result = urllib.parse.urlparse(url)
scheme = result.scheme
domain = result.netloc
if not scheme or not domain:
return False
return True
def find_type(value):
"""This function infers the data type of the input argument.
Parameters
----------
value :
value for which we want to find data type
Returns
----------
datatype:
data type of the given input argument
"""
value = value.replace("'", "")
value = value.replace('"', "")
value = value.replace(" ", "")
if (value.lower() == 'nan') or (value.lower() == 'none') or len(value)==0:
datatype = None
elif value.lower() in bool_arr:
datatype = 'bool'
elif value=='0' or value=='1':
datatype = '0||1'
elif is_ip(value):
datatype = 'ip_address'
elif is_url(value):
datatype = 'uri'
elif is_number(value):
datatype = 'int'
elif is_valid_filename(ntpath.basename(value)) and ntpath.basename(value).find('.')!=-1:
datatype = 'filename||filepath+filename'
elif is_filepath(value):
datatype = 'filepath'
elif re.search(r'^[\d ]*(k|m|g|kb|mb|gb)$', value.lower()):
datatype = 'size'
else:
datatype = 'string'
return datatype
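A simplified, dependency-free sketch of the inference order in `find_type`, covering only the bool, 0||1, int, size, and string branches (the IP, URL, and path checks are omitted here to keep the example standalone):

```python
import re

BOOL_VALUES = {'on', 'off', 'true', 'false', 'yes', 'no'}

def find_type_sketch(value):
    # Normalize: drop quotes and spaces, as find_type does.
    value = value.replace("'", "").replace('"', "").replace(" ", "")
    if value.lower() in ('nan', 'none') or not value:
        return None
    if value.lower() in BOOL_VALUES:
        return 'bool'
    if value in ('0', '1'):
        return '0||1'          # ambiguous: could be int or bool
    try:
        float(value)
        return 'int'
    except ValueError:
        pass
    if re.search(r'^\d+(k|m|g|kb|mb|gb)$', value.lower()):
        return 'size'          # e.g. 128M, 16k
    return 'string'            # fallback, as in find_type

for v in ['ON', '1', '128M', '151', '/tmp/mysql.sock']:
    print(v, '->', find_type_sketch(v))
```

The ordering matters: '1' must be classified as 0||1 before the numeric check, and size strings like '128M' fail the float conversion before matching the size pattern.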
The following function takes the configuration files as input and infers the probabilistic data type for each key.
def assign_same_prob(df, curr_datatype, cols):
new_val = df[cols].sum(axis=1)
for col in cols:
df[col] = np.where(df['datatype'] == curr_datatype, new_val, df[col])
return df
def create_intermediate(config_df):
"""This function takes the configuration files as input and infers the probabilistic data type for each key.
Parameters
----------
config_df : pandas Dataframe
pandas Dataframe where each row corresponding to a configuration file.
Returns
----------
datatype: pandas Dataframe
pandas Dataframe where each row is a probabilistic data type for a key.
"""
# Below we calculated frequencies of different data types for each key.
dict_list = []
for column in config_df:
my_dict = defaultdict(int)
my_dict['key_name'] = column
for i in config_df[column].tolist():
i = str(i)
datatype = find_type(i)
my_dict[datatype] += 1
dict_list.append(my_dict)
# Finding frequency of each key
datatype = pd.DataFrame(dict_list)
datatype['frequency'] = datatype[datatype.columns[~datatype.columns.isin(['key_name', None])]].sum(axis=1)
# We converted frequencies of data types into probabilities.
# We divided frequency by total nonnull entries to calculate the probability.
prob_cols = datatype.columns[~datatype.columns.isin(['key_name', 'frequency', None])]
datatype[prob_cols] = datatype[prob_cols].div(datatype[prob_cols].sum(axis=1), axis=0)
datatype = datatype.drop([None], axis=1, errors='ignore')
# The data type for each keyword is inferred based on maximum probability
datatype['datatype'] = datatype.drop(['key_name', 'frequency'], axis=1, errors='ignore').idxmax(axis=1)
# size and 0||1 are also integers. Hence, whenever the data type is int, we assign the same probability to size and 0||1
datatype = assign_same_prob(datatype, 'int',['int','size','0||1'])
# an int can also be a size. Hence, whenever the data type is size, we assign the same probability to int
datatype = assign_same_prob(datatype, 'size',['int','size'])
# 0||1 is also a boolean. Hence, whenever the data type is bool, we assign the same probability to 0||1
datatype = assign_same_prob(datatype, 'bool',['bool','0||1'])
# a bool can also be 0||1. Hence, whenever the data type is 0||1, we assign the same probability to bool
datatype = assign_same_prob(datatype, '0||1',['0||1','bool'])
return datatype
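The probability step in `create_intermediate` can be illustrated on a toy frequency table (the counts are invented for the example):

```python
import pandas as pd

# Each row is a key; each column counts how often its values parsed as
# that data type across the training files.
freq = pd.DataFrame(
    {'int': [9, 0], 'size': [1, 0], 'string': [0, 5]},
    index=['port', 'user'])

# Normalize frequencies to probabilities row-wise, then take the argmax
# as the inferred data type, mirroring the div/idxmax steps above.
prob = freq.div(freq.sum(axis=1), axis=0)
inferred = prob.idxmax(axis=1)
print(prob)
print(inferred.to_dict())   # {'port': 'int', 'user': 'string'}
```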
The following cell works as a translator; it parses the configuration files, determines the data type of values, and aggregates the results for each key.
data_dir = '../data/raw/mysql'
dict_list = create_keyvalue_df(data_dir, fixed_typos=False)
config_df = pd.DataFrame(dict_list)
config_df = config_df.dropna(axis=1, how='all')
intermediate_representation = create_intermediate(config_df)
total_files = len(dict_list)
The following cell works as a learner and filters the proposed rules using support and confidence.
def create_type_rule(datatype, total_files, confidence_t=0.9, support_t=0.1):
"""This function works as a learner and filters the proposed data type rules using support and confidence.
Parameters
----------
datatype : pandas Dataframe
pandas Dataframe of the intermediate representation, where each row is the probabilistic data type for a key.
total_files : int
total number of configuration files.
confidence_t : float, default 0.9
confidence threshold for data type errors.
support_t : float, default 0.1
support threshold for data type errors.
Returns
----------
list_datatype_dict: list
list of dictionaries, each with two keys:
'key_name', the keyword in the configuration file, and 'datatype', the inferred data types for that keyword.
"""
# We filter out all the rules with support below the threshold
datatype = datatype[datatype['frequency']/total_files>support_t]
datatype = datatype.drop(['frequency'], axis=1, errors='ignore')
# We filter out all the rules with confidence below the threshold
datatype = datatype[datatype.drop(['key_name','datatype'], axis=1, errors='ignore').max(axis=1)>confidence_t]
# We converted the data type column into a list because multiple data types are possible, such as 0||1 and int, 0||1 and bool, etc.
datatype_numeric = datatype.drop(['key_name','datatype'], axis=1, errors='ignore')
datatype_bool = datatype_numeric.eq(datatype_numeric.max(1), axis=0)
datatype['datatype'] = datatype_bool.apply(lambda x: datatype_bool.columns[x.iloc[:] == True].tolist(), axis=1)
list_datatype_dict = datatype[['key_name','datatype']].to_dict('records')
return list_datatype_dict
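The learner's filtering step can be seen on a toy intermediate table (probabilities and frequencies invented for the example): a rule survives only if the key's support and its maximum type probability both clear their thresholds.

```python
import pandas as pd

total_files = 100
datatype = pd.DataFrame([
    {'key_name': 'port',  'int': 0.95, 'string': 0.05, 'frequency': 60},
    {'key_name': 'rare',  'int': 1.00, 'string': 0.00, 'frequency': 5},
    {'key_name': 'noisy', 'int': 0.60, 'string': 0.40, 'frequency': 80},
])

# Support: fraction of files in which the key appears.
support_ok = datatype['frequency'] / total_files > 0.1
# Confidence: maximum type probability for the key.
confidence_ok = datatype[['int', 'string']].max(axis=1) > 0.9

kept = datatype[support_ok & confidence_ok]['key_name'].tolist()
print(kept)   # ['port']  ('rare' fails support, 'noisy' fails confidence)
```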
## Learner
datatype_dict = create_type_rule(intermediate_representation, total_files, confidence_t = 0.9, support_t = 0.1)
datatype_dict
[{'key_name': 'character-set-server', 'datatype': ['string']},
{'key_name': 'connect_timeout', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'default-character-set', 'datatype': ['string']},
{'key_name': 'default-storage-engine', 'datatype': ['string']},
{'key_name': 'expire_logs_days', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'innodb_additional_mem_pool_size', 'datatype': ['int', 'size']},
{'key_name': 'innodb_buffer_pool_size', 'datatype': ['int', 'size']},
{'key_name': 'innodb_data_file_path', 'datatype': ['string']},
{'key_name': 'innodb_file_per_table', 'datatype': ['0||1', 'bool']},
{'key_name': 'innodb_flush_method', 'datatype': ['string']},
{'key_name': 'innodb_lock_wait_timeout', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'innodb_log_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'innodb_log_file_size', 'datatype': ['int', 'size']},
{'key_name': 'innodb_log_files_in_group',
'datatype': ['int', '0||1', 'size']},
{'key_name': 'innodb_thread_concurrency',
'datatype': ['int', '0||1', 'size']},
{'key_name': 'interactive_timeout', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'join_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'key_buffer', 'datatype': ['int', 'size']},
{'key_name': 'key_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'language', 'datatype': ['filepath']},
{'key_name': 'log-error', 'datatype': ['filename||filepath+filename']},
{'key_name': 'log_error', 'datatype': ['filename||filepath+filename']},
{'key_name': 'long_query_time', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'max_allowed_packet', 'datatype': ['int', 'size']},
{'key_name': 'max_binlog_size', 'datatype': ['int', 'size']},
{'key_name': 'max_connect_errors', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'max_connections', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'max_heap_table_size', 'datatype': ['int', 'size']},
{'key_name': 'myisam-recover', 'datatype': ['string']},
{'key_name': 'myisam_max_sort_file_size', 'datatype': ['int', 'size']},
{'key_name': 'myisam_sort_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'net_buffer_length', 'datatype': ['int', 'size']},
{'key_name': 'nice', 'datatype': ['0||1', 'bool']},
{'key_name': 'old_passwords', 'datatype': ['0||1', 'bool']},
{'key_name': 'open_files_limit', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'pid-file', 'datatype': ['filename||filepath+filename']},
{'key_name': 'port', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'query_cache_limit', 'datatype': ['int', 'size']},
{'key_name': 'query_cache_type', 'datatype': ['0||1', 'bool']},
{'key_name': 'read_buffer', 'datatype': ['int', 'size']},
{'key_name': 'read_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'read_rnd_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'slow_query_log_file',
'datatype': ['filename||filepath+filename']},
{'key_name': 'socket', 'datatype': ['filename||filepath+filename']},
{'key_name': 'sort_buffer_size', 'datatype': ['int', 'size']},
{'key_name': 'sql-mode', 'datatype': ['string']},
{'key_name': 'symbolic-links', 'datatype': ['0||1', 'bool']},
{'key_name': 'table_cache', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'table_definition_cache', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'table_open_cache', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'thread_cache_size', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'thread_concurrency', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'thread_stack', 'datatype': ['int', 'size']},
{'key_name': 'tmp_table_size', 'datatype': ['int', 'size']},
{'key_name': 'user', 'datatype': ['string']},
{'key_name': 'wait_timeout', 'datatype': ['int', '0||1', 'size']},
{'key_name': 'write_buffer', 'datatype': ['int', 'size']}]
Following are some of the generated rules for mysql.cnf files:
- Our first rule is {'key_name': 'character-set-server', 'datatype': ['string']}. This rule states that the key 'character-set-server' should have the data type 'string'.
- Our second rule is {'key_name': 'default-character-set', 'datatype': ['string']}. This rule states that the key 'default-character-set' should have the data type 'string', and so on.
Percentage of data type errors in mysql.cnf files
In the following cell, we calculated the percentage of mysql.cnf files that contain data type errors. We compared the data type of each key against the learned rules; if a mismatch was found, we counted the file as an error.
error = 0
for p in range(len(dict_list)):
for q in dict_list[p]:
val = next((item for item in datatype_dict if item["key_name"] == q), None)
if not val:
continue
datatype = find_type(str(dict_list[p][q]))
val_type = val['datatype']
if (datatype not in val_type) and (datatype != None):
error += 1
break
print('Percentage of data type errors in mysql.cnf files is', round(100*error/len(dict_list), 2))
Percentage of data type errors in mysql.cnf files is 6.98
Data type error detection in a particular mysql.cnf file
In the following example, we found data type errors in a mysql.cnf file selected from our dataset. The keyword 'socket' takes as its value a file directory plus file name, i.e., the location of the socket file that the server uses for communication with local clients (Reference). Hence, the data type for the keyword should be 'filename||filepath+filename', such as '/var/lib/mysql/mysql.sock'. But in our sample file, the value of the keyword 'socket' is 'MySQL', which has the data type string.
A socket file is a pseudo-file used for data transmission via reads and writes.
config_df['socket'].value_counts()
/var/lib/mysql/mysql.sock 82
/var/run/mysqld/mysqld.sock 72
/tmp/mysql.sock 20
/var/run/mysql/mysql.sock 3
/mysqls/tmp/mysql.sock.gleb.i 1
MySQL 1
/mysql/mysqld.sock 1
/var/run/mysqld/mysqld-5-5-20.sock 1
/var/lib/mysql/mysql.sock user=mysql 1
/backup/mysql-cbt/mysql.sock 1
/tmp/mysql_live.sock 1
/usr/local/mysql/tmp/mysql.sock 1
C:\WebSerer\MySQL\data\mysql.sock 1
/tmp/mysql5.sock 1
/home/iepm/mysql/mysql.sock 1
/mysql/mydatabase/mysql.sock 1
/var/lib/mysql/mysql.sock1 1
/data/mysql/mysql.sock 1
"/var/lib/mysql/mysql.sock" 1
"TCP/IP" 1
/tmp/mysqld.sock 1
/var/lib/mysql/data/mysql.sock 1
/tmp/mysql.galera1.sock 1
${path}/tmp/mysql.sock 1
/opt/local/lib/mysql5/bin/mysql 1
"C:/xampp/mysql/mysql.sock" 1
/var/www/data/mysql.sock 1
"MySQL" 1
/app/mysql/var/mysql.sock 1
/db/log/manager.sock 1
/opt/bitnami/mysql/tmp/mysql.sock 1
/export/data1/apps/local/mysql/mysql.sock 1
mysql 1
Name: socket, dtype: int64
with open('../data/raw/mysql/11b9fa75593f39219532d16270c0d5d5.cnf') as f:
config_string = '[dummy_section]\n' + f.read()
config = configparser.ConfigParser(allow_no_value=True, strict=False, inline_comment_prefixes= '#')
cfgs = {}
config.read_string(config_string)
for sect in config.sections():
cfgs.update(dict(config.items(sect)))
cfgs['socket']
'MySQL'
As we can see, in our file the value of the keyword 'socket' is 'MySQL', which has the data type string.
def mis_type_detection(rules, input_file):
"""This function detects data type errors in a configuration file using the learned rules.
Parameters
----------
rules : list
list of dictionaries, each with two keys:
'key_name', the keyword in the configuration file, and 'datatype', the inferred data types for that keyword.
input_file : file_path, string
Path of the configuration file for which we want to find data type error
This function prints the error messages
----------
error_message: string
error message corresponding to each incorrectly assigned data type
"""
with open(input_file) as f:
config_string = '[dummy_section]\n' + f.read()
config = configparser.ConfigParser(allow_no_value=True, strict=False, inline_comment_prefixes= '#')
cfgs = {}
config.read_string(config_string)
for sect in config.sections():
cfgs.update(dict(config.items(sect)))
for q in cfgs:
val = next((item for item in rules if item["key_name"] == q), None)
if not val:
continue
datatype = find_type(str(cfgs[q]))
val_type = val['datatype']
if (datatype not in val_type) and (datatype != None):
print('Type_Error: Data type of key \'' + q + '\' should be',val_type, 'but it is given as \''+ datatype+'\'')
mis_type_detection(datatype_dict, '../data/raw/mysql/11b9fa75593f39219532d16270c0d5d5.cnf')
Type_Error: Data type of key 'socket' should be ['filename||filepath+filename'] but it is given as 'string'
Conclusion
In this notebook, we demonstrated a framework for detecting data type errors in configuration files using a frequency-based approach. Our framework is able to detect subtle data type errors because of its wide variety of predefined data types. We can automate the data type error detection task using data-driven methods for all types of key-value based configuration files. Our implemented framework also suggests the correct data type for incorrectly assigned values. As an example, we detected misconfigurations in mysql.cnf files; the analysis can easily be replicated for other file types.