
Misconfiguration detection framework

The misconfiguration detection framework adopted in this project is inspired by the research paper 'Synthesizing Configuration File Specifications with Association Rule Learning'. Association rule learning is a method for discovering frequently occurring patterns or associations between variables in a dataset. In association rule learning, support and confidence are two metrics widely used to filter proposed rules. Support is the percentage of training configuration files in which the keywords of the proposed rule have been seen. Confidence is the percentage of those occurrences for which the proposed rule holds true.
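
As a quick illustration (hypothetical counts, not taken from the dataset), here is how support and confidence would be computed for a proposed rule such as "port has type int":

# Hypothetical counts over a training set of configuration files
total_files = 100            # training configuration files
files_with_key = 40          # files that contain the keyword 'port'
files_where_rule_holds = 38  # of those, files where 'port' parses as an int

support = files_with_key / total_files                 # 0.40
confidence = files_where_rule_holds / files_with_key   # 0.95
# With the thresholds used later in this notebook (support_t=0.1,
# confidence_t=0.9), this rule would be kept.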

[1]
from IPython.display import Image
Image('images/framework.png')

As you can see in the above image, the misconfiguration detection framework has two important modules: the translator and the learner.

  • Translator: The translator works as a parser; it converts raw configuration files into an intermediate representation, generally a tuple of key, value, data type, and frequency (k, v, τ, f). A small illustration follows this list.
  • Learner: The learner discovers frequently occurring patterns or associations between keywords in configuration files to derive rules.
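
As a small illustration (hypothetical values, not taken from the dataset), one entry of the intermediate representation might look like this:

# One (k, v, τ, f) entry: the keyword, an observed value, the inferred
# data type, and how often the keyword appears in the training files.
entry = {
    'key': 'max_connections',  # k
    'value': '151',            # v
    'datatype': 'int',         # τ, inferred from the value
    'frequency': 120,          # f, hypothetical keyword count
}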

In this notebook, our objective is to find data type errors.

Many system failures are caused by assigning a value of the wrong data type to a keyword. In this method, for each key, we match the data type of the target value against the data type information inferred from the training set. An error is reported if the match fails.
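
A minimal sketch of this matching step (the rule below is illustrative; the actual implementation appears later in this notebook):

# Learned rule (illustrative): 'port' should hold an int-like value
rule = {'key_name': 'port', 'datatype': ['int', '0||1', 'size']}

# Data type inferred from the value observed in the file under inspection
observed_type = 'string'  # e.g. inferred from the value 'three-thousand'

if observed_type not in rule['datatype']:
    print("Type_Error: 'port' should be", rule['datatype'],
          "but it is given as '" + observed_type + "'")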

Data type error framework

The following is an example of the data type error detection framework:

[2]
Image('images/type_error_framework.png')

Steps for data type error detection:

  1. The translator parses the configuration files, determines the data type of the values, and aggregates the results for each key.
  2. The learner proposes rules by assigning every possible data type to each keyword in the configuration files, then filters the proposed rules using support and confidence.

In this notebook, we are finding data type errors in mysql.cnf files. The dataset used in this notebook is publicly available on GitHub. You can download the data by running the script data_downloader.py in the same directory as this notebook.

Data type errors in mysql.cnf files

[3]
import os
import json
import pandas as pd
import numpy as np
from itertools import combinations
import nltk
import re
from collections import defaultdict
import math
import ntpath
from pathvalidate import is_valid_filename, is_valid_filepath
import ipaddress
import urllib
import configparser
[4]
# The functions below detect keyword typos based on frequencies and parse
# ini-style configuration files into dictionaries.

def get_typos(key_freq, confidence_t=0.9, typo_limit=2):
    """This function finds typos based on keyword frequencies.

    We find similar words and filter the results based on the confidence threshold.

    Parameters
    ----------
    key_freq : dictionary
            Frequency of each keyword in the configuration files.
    confidence_t : float, default 0.9
            confidence threshold for the spell check
    typo_limit : int, default 2
            The number of typos allowed, measured by edit distance.
            For every ten characters of the shorter keyword, this limit is increased by 1.

    Returns
    ----------
    spell_errors: dictionary
            the key is a misspelled keyword, and the value is its correct spelling
    """
    spell_errors = {}
    for i, j in combinations(key_freq.keys(), r=2):
        # Compute similarity between keywords using Levenshtein (edit) distance; skip pairs that differ too much.
        if nltk.edit_distance(i, j) >= min(len(i), len(j))//10+typo_limit:
            continue
        if key_freq[i] >= key_freq[j]:
            if key_freq[i]/(key_freq[i] + key_freq[j]) > confidence_t:
                spell_errors[j] = i
        else:
            if key_freq[j]/(key_freq[i] + key_freq[j]) > confidence_t:           
                spell_errors[i] = j
    return spell_errors
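
# Illustrative behavior of get_typos (hypothetical toy input, not from the
# dataset): the rare misspelling maps to the frequent spelling, since
# edit_distance('max_connections', 'max_connektions') == 1 and
# 95 / (95 + 2) > 0.9:
#   get_typos({'max_connections': 95, 'max_connektions': 2})
#   -> {'max_connektions': 'max_connections'}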

# create_keyvalue_df takes a folder path as input and returns a list of dictionaries
def create_keyvalue_df(data_dir, fixed_typos=True):
    """This function creates a list of dictionaries as output for a set of configuration files.

    We parse and convert each configuration file into a dictionary
    and append each dictionary to a list.

    Parameters
    ----------
    data_dir : string
            folder path of the configuration files.
    fixed_typos : bool, default True
            If True, fix typos in the configuration files.

    Returns
    ----------
    dict_list: list of dictionaries
            list where each element is a dictionary of a configuration file.
    """
    dict_list = []
    parse_errors = 0
    errorconfigs_filename = []
    
    for filename in os.listdir(data_dir):
        try:
            ## Add a dummy section to avoid a MissingSectionHeaderError
            with open(os.path.join(data_dir, filename), 'r') as f:
                config_string = '[dummy_section]\n' + f.read()
            config = configparser.ConfigParser(allow_no_value=True, strict=False, inline_comment_prefixes= '#')
            cfgs = {}
            config.read_string(config_string)
            for sect in config.sections():
                cfgs.update(dict(config.items(sect)))
            dict_list.append(cfgs)
        except Exception:
            errorconfigs_filename.append(filename)
            parse_errors += 1
    
    if fixed_typos:
        ## Fix typos
        ## Create dataframe
        config_df = pd.DataFrame(dict_list)
        ## Find count of each key
        key_dict = config_df.count().to_dict()
        ## Get typos based on frequency of keys
        key_changes = get_typos(key_dict)
        ## Fix typos
        for dict_i in dict_list:
            for key in dict_i.copy():
                if key in key_changes:
                    dict_i[key_changes[key]] = dict_i.pop(key)

    
    return dict_list

Finding data type from the given value

We used a set of heuristics to infer a predefined data type from a given value. Our framework detects the following data types: bool, size, int, hostname/IP address, file path, filename, URI, and string.

The following function takes the value of a keyword as input and infers the data type.

[5]
## array for boolean
bool_arr = ['on', 'off', 'true', 'false', 'yes', 'no']

def valid_ip(val):
    # Strip an optional ':port' suffix, then try to parse the remainder as
    # an IP network (non-strict) or as a single IP address.
    val, separator, port = val.partition(':')
    for ip in ipaddress.ip_network, ipaddress.ip_address:
        try:                                                           
            if ip==ipaddress.ip_network:
                ip(val,False)
            else:
                ip(val)
            return True
        except ValueError:
            pass
    return False

def is_ip(val):
    """This function check if data type of value is ip address.
    
    Parameters
    ----------
    value : 
            value for which we want to check data type

    Returns
    ----------
    datatype: 
            a binary value based on data type of the given input argument is ipaddress or not
    """
    val_list = val.split(',')
    list_comp = [valid_ip(i) for i in val_list]
    return all(list_comp)

def is_filepath(val):
    """This function check if data type of value or not.
    
    Parameters
    ----------
    value : 
            value for which we want to check data type

    Returns
    ----------
    datatype: 
            a binary value based on data type of the given input argument is filepath or not
    """
    val = val.lstrip('/')
    return is_valid_filepath(val) and (val.find('/')!=-1 or val.find('\\')!=-1)

def is_number(string):
    """This function check if data type of number.
    
    Parameters
    ----------
    value : 
            value for which we want to check data type

    Returns
    ----------
    datatype: 
            a binary value based on data type is number
    """
    try:
        float(string)
        return True
    except ValueError:
        return False

def is_url(url):
    """This function check if data type of value or not.
    
    Parameters
    ----------
    value : 
            value for which we want to check data type

    Returns
    ----------
    datatype: 
            a binary value based on data type of the given input argument is url or not
    """
    url = url.strip()

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc
    if not scheme or not domain:
        return False
    return True


def find_type(value):
    """This function infers the data type of the input argument.
    
    Parameters
    ----------
    value : 
            value for which we want to find data type

    Returns
    ----------
    datatype:
            data type of the given input argument
    """
    value = value.replace("'", "")
    value = value.replace('"', "")
    value = value.replace(" ", "")
    if (value.lower() == 'nan') or (value.lower() == 'none') or len(value)==0:
        datatype = None
    elif value.lower() in bool_arr:
        datatype = 'bool'
    elif value=='0' or value=='1':
        datatype = '0||1'
    elif is_ip(value):
        datatype = 'ip_address'
    elif is_url(value):
        datatype = 'uri'
    elif is_number(value):
        datatype = 'int'
    elif is_valid_filename(ntpath.basename(value)) and ntpath.basename(value).find('.')!=-1:
        datatype = 'filename||filepath+filename'
    elif is_filepath(value):
        datatype = 'filepath'
    elif re.search(r'^[\d ]*(k|m|g|kb|mb|gb)$', value.lower()):
        datatype = 'size'
    else:
        datatype = 'string'
    return datatype
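
A few hand-picked values illustrate the heuristics above (outputs follow from the branch order in find_type):

find_type('ON')                         # -> 'bool'
find_type('3306')                       # -> 'int'
find_type('128M')                       # -> 'size'
find_type('/var/lib/mysql')             # -> 'filepath'
find_type('/var/lib/mysql/mysql.sock')  # -> 'filename||filepath+filename'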

The following function takes the configuration files as input and infers the probabilistic data type for each key.

[6]
def assign_same_prob(df, curr_datatype, cols):
    """For rows whose inferred data type is `curr_datatype`, assign the
    summed probability of the interchangeable data types in `cols` to
    each of those columns (e.g. int, size, and 0||1 overlap)."""
    new_val = df[cols].sum(axis=1)
    for col in cols:
        df[col] = np.where(df['datatype'] == curr_datatype, new_val, df[col])
    return df
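
To see what assign_same_prob does, here is a minimal sketch with hypothetical numbers:

# Toy illustration: a key inferred as 'int' with probabilities 0.7/0.2/0.1
toy = pd.DataFrame([{'key_name': 'port', 'int': 0.7, 'size': 0.2,
                     '0||1': 0.1, 'datatype': 'int'}])
toy = assign_same_prob(toy, 'int', ['int', 'size', '0||1'])
# Each of 'int', 'size', and '0||1' now holds 0.7 + 0.2 + 0.1 = 1.0, so any
# of these interchangeable types will pass a 0.9 confidence check later.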
[7]
def create_intermediate(config_df):
    """This function takes the configuration files as input and infers the probabilistic data type for each key.
    
    Parameters
    ----------
    config_df : pandas Dataframe
            pandas Dataframe where each row corresponding to a configuration file.

    Returns
    ----------
    datatype: pandas Dataframe
            pandas Dataframe where each row is a probabilistic data type for a key.
    """

    # Below we calculated frequencies of different data types for each key.
    dict_list = []
    for column in config_df:
        my_dict = defaultdict(int)
        my_dict['key_name'] = column
        for i in config_df[column].tolist():
            i = str(i)
            datatype = find_type(i)
            my_dict[datatype] += 1
        dict_list.append(my_dict)
    
    # Finding frequency of each key
    datatype = pd.DataFrame(dict_list)
    datatype['frequency'] = datatype[datatype.columns[~datatype.columns.isin(['key_name', None])]].sum(axis=1)

    # We converted frequencies of data types into probabilities.
    # We divided each frequency by the total nonnull entries to calculate the probability.
    prob_cols = datatype.columns[~datatype.columns.isin(['key_name', 'frequency', None])]
    datatype[prob_cols] = datatype[prob_cols].div(datatype[prob_cols].sum(axis=1), axis=0)
    datatype = datatype.drop([None], axis=1, errors='ignore')

    # The data type for each keyword is inferred based on maximum probability
    datatype['datatype'] = datatype.drop(['key_name', 'frequency'], axis=1, errors='ignore').idxmax(axis=1)

    # size and 0||1 are also integers. Hence, whenever the data type is int, we assign the same probability to size and 0||1
    datatype = assign_same_prob(datatype, 'int',['int','size','0||1'])

    # int can also be a size. Hence, whenever the data type is size, we assign the same probability to int
    datatype = assign_same_prob(datatype, 'size',['int','size'])

    # 0||1 is also a boolean. Hence, whenever the data type is bool, we assign the same probability to 0||1
    datatype = assign_same_prob(datatype, 'bool',['bool','0||1'])

    # bool can also be 0||1. Hence, whenever the data type is 0||1, we assign the same probability to bool
    datatype = assign_same_prob(datatype, '0||1',['0||1','bool'])
    
    return datatype

The following cell works as the translator: it parses the configuration files, determines the data type of the values, and aggregates the results for each key.

[8]
data_dir = '../data/raw/mysql'
dict_list = create_keyvalue_df(data_dir, fixed_typos=False)
config_df = pd.DataFrame(dict_list)
config_df = config_df.dropna(axis=1, how='all')
intermediate_representation = create_intermediate(config_df)
total_files = len(dict_list)

The following cell works as a learner and filters the proposed rules using support and confidence.

[9]
def create_type_rule(datatype, total_files, confidence_t=0.9, support_t=0.1):
    """This function works as a learner and filters the proposed data type rules using support and confidence.
    
    Parameters
    ----------
    datatype : pandas Dataframe
            intermediate representation where each row is the probabilistic data type for a key.
    total_files : int
            total number of configuration files.
    confidence_t : float, default 0.9
            confidence threshold for data type errors.
    support_t : float, default 0.1
            support threshold for data type errors.

    Returns
    ----------
    list_datatype_dict: list
            list of dictionaries, each with two keys:
            'key_name' is the keyword in the configuration file and 'datatype' is the list of allowed data types for that keyword.
    """

    # We filter out all the rules with support below the threshold
    datatype = datatype[datatype['frequency']/total_files>support_t]
    datatype = datatype.drop(['frequency'], axis=1, errors='ignore')
    
    # We filter out all the rules with confidence below the threshold
    datatype = datatype[datatype.drop(['key_name','datatype'], axis=1, errors='ignore').max(axis=1)>confidence_t]

    # We converted the data type column into a list because multiple data types are possible, such as 0||1 and int, 0||1 and bool, etc.
    datatype_numeric = datatype.drop(['key_name','datatype'], axis=1, errors='ignore')
    datatype_bool = datatype_numeric.eq(datatype_numeric.max(1), axis=0)
    datatype['datatype'] = datatype_bool.apply(lambda x: datatype_bool.columns[x].tolist(), axis=1)
    list_datatype_dict = datatype[['key_name','datatype']].to_dict('records')

    return list_datatype_dict

## Learner
datatype_dict = create_type_rule(intermediate_representation, total_files, confidence_t = 0.9, support_t = 0.1)
datatype_dict
[{'key_name': 'character-set-server', 'datatype': ['string']},
 {'key_name': 'connect_timeout', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'default-character-set', 'datatype': ['string']},
 {'key_name': 'default-storage-engine', 'datatype': ['string']},
 {'key_name': 'expire_logs_days', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'innodb_additional_mem_pool_size', 'datatype': ['int', 'size']},
 {'key_name': 'innodb_buffer_pool_size', 'datatype': ['int', 'size']},
 {'key_name': 'innodb_data_file_path', 'datatype': ['string']},
 {'key_name': 'innodb_file_per_table', 'datatype': ['0||1', 'bool']},
 {'key_name': 'innodb_flush_method', 'datatype': ['string']},
 {'key_name': 'innodb_lock_wait_timeout', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'innodb_log_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'innodb_log_file_size', 'datatype': ['int', 'size']},
 {'key_name': 'innodb_log_files_in_group',
  'datatype': ['int', '0||1', 'size']},
 {'key_name': 'innodb_thread_concurrency',
  'datatype': ['int', '0||1', 'size']},
 {'key_name': 'interactive_timeout', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'join_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'key_buffer', 'datatype': ['int', 'size']},
 {'key_name': 'key_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'language', 'datatype': ['filepath']},
 {'key_name': 'log-error', 'datatype': ['filename||filepath+filename']},
 {'key_name': 'log_error', 'datatype': ['filename||filepath+filename']},
 {'key_name': 'long_query_time', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'max_allowed_packet', 'datatype': ['int', 'size']},
 {'key_name': 'max_binlog_size', 'datatype': ['int', 'size']},
 {'key_name': 'max_connect_errors', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'max_connections', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'max_heap_table_size', 'datatype': ['int', 'size']},
 {'key_name': 'myisam-recover', 'datatype': ['string']},
 {'key_name': 'myisam_max_sort_file_size', 'datatype': ['int', 'size']},
 {'key_name': 'myisam_sort_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'net_buffer_length', 'datatype': ['int', 'size']},
 {'key_name': 'nice', 'datatype': ['0||1', 'bool']},
 {'key_name': 'old_passwords', 'datatype': ['0||1', 'bool']},
 {'key_name': 'open_files_limit', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'pid-file', 'datatype': ['filename||filepath+filename']},
 {'key_name': 'port', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'query_cache_limit', 'datatype': ['int', 'size']},
 {'key_name': 'query_cache_type', 'datatype': ['0||1', 'bool']},
 {'key_name': 'read_buffer', 'datatype': ['int', 'size']},
 {'key_name': 'read_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'read_rnd_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'slow_query_log_file',
  'datatype': ['filename||filepath+filename']},
 {'key_name': 'socket', 'datatype': ['filename||filepath+filename']},
 {'key_name': 'sort_buffer_size', 'datatype': ['int', 'size']},
 {'key_name': 'sql-mode', 'datatype': ['string']},
 {'key_name': 'symbolic-links', 'datatype': ['0||1', 'bool']},
 {'key_name': 'table_cache', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'table_definition_cache', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'table_open_cache', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'thread_cache_size', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'thread_concurrency', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'thread_stack', 'datatype': ['int', 'size']},
 {'key_name': 'tmp_table_size', 'datatype': ['int', 'size']},
 {'key_name': 'user', 'datatype': ['string']},
 {'key_name': 'wait_timeout', 'datatype': ['int', '0||1', 'size']},
 {'key_name': 'write_buffer', 'datatype': ['int', 'size']}]

Following are some of the generated rules for mysql.cnf files:

  • Our first rule is {'key_name': 'character-set-server', 'datatype': ['string']}. This rule states that the key 'character-set-server' should have the data type 'string'.
  • Our second rule is {'key_name': 'default-character-set', 'datatype': ['string']}. This rule states that the key 'default-character-set' should have the data type 'string', and so on.

Percentage of data type errors in mysql.cnf files

In the following cell, we calculated the percentage of mysql.cnf files that contain at least one data type error. We compared the data type of each key against the created rules; if a key's type did not match, we counted the file as erroneous.

[10]
# Count the files that contain at least one data type error
error = 0
for p in range(len(dict_list)):
    for q in dict_list[p]:
        val = next((item for item in datatype_dict if item["key_name"] == q), None)
        if not val:
            continue
        datatype = find_type(str(dict_list[p][q]))
        val_type = val['datatype']
        if (datatype not in val_type) and (datatype != None):
            error += 1
            break
print('Percentage of mysql.cnf files with data type errors:', round(100*error/len(dict_list), 2))
Percentage of mysql.cnf files with data type errors: 6.98

Data type error detection in a particular mysql.cnf file

In the following example, we found data type errors in a mysql.cnf file selected from our dataset. The keyword 'socket' takes as its value a file path plus a file name, i.e., the location of the socket file that the server uses for communication with local clients (Reference). Hence, the data type for this keyword should be 'filename||filepath+filename', such as '/var/lib/mysql/mysql.sock'. But in our sample file, the value of the keyword 'socket' is 'MySQL', which has the data type string.

A socket file is a pseudo-file used for data transmission by reading and writing.

[11]
config_df['socket'].value_counts()
/var/lib/mysql/mysql.sock                    82
/var/run/mysqld/mysqld.sock                  72
/tmp/mysql.sock                              20
/var/run/mysql/mysql.sock                     3
/mysqls/tmp/mysql.sock.gleb.i                 1
MySQL                                         1
/mysql/mysqld.sock                            1
/var/run/mysqld/mysqld-5-5-20.sock            1
/var/lib/mysql/mysql.sock user=mysql          1
/backup/mysql-cbt/mysql.sock                  1
/tmp/mysql_live.sock                          1
/usr/local/mysql/tmp/mysql.sock               1
C:\WebSerer\MySQL\data\mysql.sock             1
/tmp/mysql5.sock                              1
/home/iepm/mysql/mysql.sock                   1
/mysql/mydatabase/mysql.sock                  1
/var/lib/mysql/mysql.sock1                    1
/data/mysql/mysql.sock                        1
"/var/lib/mysql/mysql.sock"                   1
"TCP/IP"                                      1
/tmp/mysqld.sock                              1
/var/lib/mysql/data/mysql.sock                1
/tmp/mysql.galera1.sock                       1
${path}/tmp/mysql.sock                        1
/opt/local/lib/mysql5/bin/mysql               1
"C:/xampp/mysql/mysql.sock"                   1
/var/www/data/mysql.sock                      1
"MySQL"                                       1
/app/mysql/var/mysql.sock                     1
/db/log/manager.sock                          1
/opt/bitnami/mysql/tmp/mysql.sock             1
/export/data1/apps/local/mysql/mysql.sock     1
mysql                                         1
Name: socket, dtype: int64
[12]
with open('../data/raw/mysql/11b9fa75593f39219532d16270c0d5d5.cnf') as f:
    config_string = '[dummy_section]\n' + f.read()
config = configparser.ConfigParser(allow_no_value=True, strict=False, inline_comment_prefixes= '#')
cfgs = {}
config.read_string(config_string)
for sect in config.sections():
    cfgs.update(dict(config.items(sect)))
cfgs['socket']
'MySQL'

As we can see, in our file the value of the keyword 'socket' is 'MySQL', which has the data type string.

[13]
def mis_type_detection(rules, input_file):
    """This function takes the configuration files as input and infers the probabilistic data type for each key.
    
    Parameters
    ----------
    rules : list
             list of dictionaries which contains two keys,
            The first keys are for key_name in the configuration file; Second keys are the data type of the given key.
    input_file : file_path, string
            Path of the configuration file for which we want to find data type error

    This function print the error message
    ----------
    error_message: string
            string with error message correponding to incorrect data type
    """
    with open(input_file) as f:
        config_string = '[dummy_section]\n' + f.read()
    config = configparser.ConfigParser(allow_no_value=True, strict=False, inline_comment_prefixes= '#')
    cfgs = {}
    config.read_string(config_string)
    for sect in config.sections():
        cfgs.update(dict(config.items(sect)))
    
    for q in cfgs:
        val = next((item for item in rules if item["key_name"] == q), None)
        if not val:
            continue
        datatype = find_type(str(cfgs[q]))
        val_type = val['datatype']
        if (datatype not in val_type) and (datatype != None):
            print('Type_Error: Data type of key \'' + q + '\' should be',val_type, 'but its given as \''+ datatype+'\'')

mis_type_detection(datatype_dict, '../data/raw/mysql/11b9fa75593f39219532d16270c0d5d5.cnf')
Type_Error: Data type of key 'socket' should be ['filename||filepath+filename'] but its given as 'string'

Conclusion

In this notebook, we demonstrated a framework for detecting data type errors in configuration files using a frequency-based approach. Our framework is able to detect sophisticated data type errors because it supports a wide variety of predefined data types. We can automate the data type error detection task using data-driven methods for all types of key-value based configuration files. Our implemented framework also suggests the correct data types for incorrectly assigned values. For example, we detected misconfigurations in mysql.cnf files, and the analysis can easily be replicated for other files.