Data mining is a powerful tool that many companies use to gain an advantage over their competitors: by analyzing patterns in their data they uncover future opportunities, and at the same time they strengthen security and fraud controls. Data mining has been widely used for fraud detection, applying different techniques to detect outliers, anomalies, and irregular transactions.
This survey studies multiple techniques used for fraud detection; it collects results from multiple papers and compares them against each other.
Research Papers:
Conferences:
- Conference on Systems, Man and Cybernetics
- International Conference on Intelligent Computation Technology and Automation
- International Conference on Data Mining
- International Conference on Machine Learning and Data Mining
Complete Report:
1. Introduction
Data mining mechanisms allow finding reliable insights from data. The data must be available, reliable, and clean. Data mining couples with accounting fraud detection by providing an analytical way to make decisions when an irregular pattern is found, one which may indicate a fraudulent transaction. It is impossible to be absolutely sure about the legality of a transaction in real time without further verification controls, but the most cost-effective approach is to check all possible evidence of fraud in a transaction, based on mathematical algorithms and data analysis.
Current software implementations for detecting fraudulent transactions are based on: artificial immune systems, auditing, econometrics, fuzzy logic, machine learning, neural networks, pattern recognition, parallel computing, and others.
2. Background
Fraud can be classified in different ways, but it is most commonly divided into four categories:
- Management Fraud: The party committing the fraud can be an employee, a customer, a vendor, or an investor, all of them related in some way to management. This category is best known for actions like falsification or manipulation of expenses, invoices, and sales figures. Data mining here can help the auditor detect such anomalies and take the required actions against the fraudulent management.
- Customer Fraud: This kind of fraud is committed directly by customers, by not paying what they agreed to or by claiming that they did not receive what they requested. Data mining uses classification mechanisms to detect this kind of action: by analyzing all previous and related data, it can assign an action to a fraudulent class. The caveat is that the data has to be representative.
- Network Fraud: Telecommunication and other networks have an active role in preventing or detecting fraud. Examples of network fraud include identity theft, telemarketing fraud, investment scams, and other related schemes.
- Computer-Based Fraud: Computer security holes can be the most popular target for committing fraud. Data mining uses a variety of techniques to detect fraud committed using computers. Inductive methods such as decision trees and support vector machines are the most popular ones used for this purpose.
For a company, the naïve method to detect fraud would be to manually analyze every user and transaction and determine their legitimacy. Nowadays that is impossible due to the number of users and transactions, and in general all the data that even a small company handles. Even automatic methods to analyze the data can take a long time, so the ideal approach is a mechanism that can detect an outlier pattern in real time, so the affected parties can take the required actions to avoid damage.
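A minimal sketch of this kind of real-time outlier screening, using a simple z-score on transaction amounts (the threshold and the data are illustrative; a small sample like this bounds how large a z-score can get, hence the modest cutoff):

```python
from statistics import mean, stdev

def flag_outliers(amounts, z_threshold=2.0):
    """Return indices of transactions whose amount deviates strongly
    from the account's norm (simple z-score screen)."""
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return []
    return [i for i, a in enumerate(amounts)
            if abs(a - mu) / sigma > z_threshold]

history = [42.0, 15.5, 38.2, 27.9, 44.1, 19.3, 33.0, 5000.0]
print(flag_outliers(history))  # [7]: the 5000.0 transaction stands out
```

In a deployed system the profile statistics would be maintained incrementally per account, so each new transaction can be scored in constant time.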
3. Data and Measurements
3.1 Structured Data
The quality of the data is the most critical factor when determining whether a transaction is fraudulent; the number of attributes that the data provides also plays an important role in the main task. Different kinds of data with different attributes are therefore used for specific studies, but in general the attributes used for detecting each fraud type are roughly the same. For example, credit card companies hold data on transactions, date, time, account, and history; insurance companies keep information about incomes, history, and police reports; and other companies also hold data relevant to detecting outliers.
3.2 Performance Measures
To measure fraud detection performance, most companies give a monetary value to predictions in order to maximize cost savings and profit according to local policies.
Cahill et al. suggest giving a score to every transaction based on its similarity to known fraud examples divided by its dissimilarity to known legal transactions.
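Cahill et al.'s actual scoring function is not reproduced here; the following is a minimal sketch of the similarity-over-dissimilarity idea, using nearest-neighbour Euclidean distance over hypothetical (amount, hour-of-day) features:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fraud_score(txn, fraud_examples, legal_examples):
    """Higher score = closer to known fraud, farther from known legal."""
    # similarity to fraud: inverse of distance to the nearest known fraud case
    d_fraud = min(euclidean(txn, f) for f in fraud_examples)
    # dissimilarity to legal: distance to the nearest known legitimate case
    d_legal = min(euclidean(txn, l) for l in legal_examples)
    return d_legal / (d_fraud + 1e-9)

fraud = [(900.0, 3.0)]                    # (amount, hour) of a known fraud
legal = [(40.0, 12.0), (25.0, 18.0)]
print(fraud_score((880.0, 2.0), fraud, legal))  # high: resembles past fraud
print(fraud_score((30.0, 13.0), fraud, legal))  # low: resembles normal use
```

A real system would normalize each attribute before computing distances, since raw amounts would otherwise dominate the score.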
4. Methods and Techniques
4.1 Overview
Current fraud detection systems operate by including a fraudulent transaction in a “black-list” and looking for future matches in new transactions, so it is easy to determine whether a transaction is potentially fraudulent. Some applications hard-code rules to match specific account numbers or amounts of money, and also set upper/lower bound limits on specific transactions. To understand the nature of a fraudulent transaction, Fawcett et al. describe the following:
- The volume of both fraud and legal data can fluctuate independently of each other.
- Different kinds of fraud can happen at the same time.
- Legitimate transaction behavior can change over time.
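A minimal sketch of such a hard-coded screen; the black-list entries, field names, and bounds below are invented for illustration:

```python
BLACKLIST = {"4532-0000-1111-2222"}     # accounts seen in past fraud
AMOUNT_LIMIT = 10_000.0                 # hard-coded upper bound

def screen(txn):
    """Return the reasons a transaction is flagged; empty if it passes."""
    reasons = []
    if txn["account"] in BLACKLIST:
        reasons.append("account on black-list")
    if txn["amount"] > AMOUNT_LIMIT:
        reasons.append("amount exceeds upper bound")
    if txn["amount"] <= 0:
        reasons.append("non-positive amount")
    return reasons

print(screen({"account": "4532-0000-1111-2222", "amount": 50.0}))
print(screen({"account": "4000-9999-8888-7777", "amount": 25_000.0}))
print(screen({"account": "4000-9999-8888-7777", "amount": 120.0}))  # []
```

The points Fawcett et al. raise are exactly why such static rules age poorly: as legitimate behavior drifts, fixed bounds start producing false positives and miss new fraud patterns.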
4.2 Supervised Approaches on Labeled Data
Sherman et al. (2002) proposed predictive supervised algorithms, which examine all transactions and mathematically determine what a fraudulent transaction looks like. Ghosh and Reilly (1994) proposed a three-layer, feed-forward radial-basis-function neural network, requiring only two training passes, that classifies a transaction as fraudulent by assigning it a score. Barse et al. (2003) proposed a multi-layer neural network with a memory trace to handle dependencies in synthetic log data.
Decision trees have also been used. Fan (2004) proposed systematic data selection for mining concept-drifting data streams, with a framework to select the optimal model out of four models; the previously selected data provides the examples from which the optimal model learns to predict correctly. A cross-validated decision tree ensemble is consistently better than all the other decision tree classifiers.
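Fan's framework itself is not reproduced here, but the select-the-best-model-by-cross-validation idea can be sketched with stand-in threshold "models" on synthetic data (a real learner would fit each candidate on the training folds rather than keep it fixed):

```python
import random

def kfold_accuracy(model_fit, data, k=4):
    """Average held-out accuracy of one candidate model over k folds."""
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = model_fit(train)
        accs.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accs) / k

def fit_threshold(t):
    # Stand-in "learner": the rule is fixed, ignoring `train`, for brevity.
    def fit(train):
        return lambda x: x > t
    return fit

random.seed(0)
# synthetic stream: amounts above 500 are labeled fraudulent
data = [(a, a > 500) for a in (random.uniform(0, 1000) for _ in range(200))]
candidates = {t: fit_threshold(t) for t in (100, 300, 500, 700)}
best = max(candidates, key=lambda t: kfold_accuracy(candidates[t], data))
print(best)  # 500: the threshold matching the true concept wins
```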
Statistical models have also been proposed. Foster and Stine (2004) use least squares regression with stepwise selection of predictors to show that standard methods are competitive. Their version has three modifications: first, it organizes calculations to accommodate interactions; second, it exploits modern decision-theoretic criteria to choose predictors; and finally, it conservatively estimates p-values to handle sparse data.
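A toy illustration of one forward step of stepwise predictor selection by least squares; the predictor names and data are invented, and Foster and Stine's decision-theoretic criteria and p-value adjustments are not modeled:

```python
def simple_rss(x, y):
    """Residual sum of squares of the least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# One forward step: among the candidates, keep the predictor explaining y best.
y = [1.0, 2.1, 2.9, 4.2, 5.1]
predictors = {
    "debt_ratio": [0.5, 1.0, 1.5, 2.0, 2.5],   # nearly collinear with y
    "noise":      [3.0, 1.0, 4.0, 1.0, 5.0],
}
best = min(predictors, key=lambda name: simple_rss(predictors[name], y))
print(best)  # "debt_ratio"
```

The full stepwise procedure repeats this step on the residuals, stopping when no remaining predictor passes the inclusion criterion.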
There also exist techniques involving expert systems, association rules, and genetic programming. Major and Riedinger (2002) deployed a five-layer expert system in which expert knowledge is integrated with statistical information assessment to identify insurance fraud.
4.3 Hybrid Approaches with Labeled Data
Supervised algorithms like neural networks, Bayesian networks, and decision trees have been combined to improve results.
Chan et al. (1999) use naïve Bayes, C4.5, CART, and RIPPER as base classifiers and combine them. Ormerod et al. (2003) recommend a rule generator to refine the weights of a Bayesian network. He et al. (1999) propose genetic algorithms to determine the optimal weights of attributes, followed by a k-NN algorithm.
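The papers combine their base classifiers in different ways; a simple majority vote over stand-in rule-based classifiers illustrates the general idea (the rules, fields, and labels below are invented, not the actual C4.5/CART/RIPPER models):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine base classifiers by simple majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Stand-in base classifiers: each votes "fraud"/"legal" from one crude rule.
by_amount   = lambda t: "fraud" if t["amount"] > 1000 else "legal"
by_hour     = lambda t: "fraud" if t["hour"] < 5 else "legal"
by_location = lambda t: "fraud" if t["abroad"] else "legal"

txn = {"amount": 2500, "hour": 3, "abroad": False}
print(majority_vote([by_amount, by_hour, by_location], txn))  # "fraud" (2 of 3)
```

Weighted voting or stacking (training a meta-classifier on the base classifiers' outputs) are common refinements of this scheme.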
Cortes and Pregibon (2001) propose the use of signatures which are updated daily. Fraudulent signatures are added to a special class and later analyzed as training data with algorithms like atree, slipper, and model-averaged regression.
Kim et al. (2003) propose a new fraud detection method in five steps: first, generate rules randomly using the Apriori algorithm and increase diversity with a calendar schema; second, apply the rules to legitimate transactions and discard any rule which matches this data; third, use the remaining rules to monitor the actual system and discard any rule which detects no anomalies; fourth, replicate any rule which detects anomalies, adding tiny random mutations; and finally, retain the successful rules.
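The five steps can be sketched as follows, in a deliberately toy form where a "rule" is just an amount threshold rather than Kim et al.'s Apriori-generated rules:

```python
import random

def fires(rule, txn):
    return txn["amount"] > rule          # a "rule" here is just a threshold

def evolve(rules, legal, live, rng):
    """One pass of the five-step filter (hedged sketch, not Kim et al.'s code)."""
    # step 2: discard rules that fire on known-legitimate transactions
    rules = [r for r in rules if not any(fires(r, t) for t in legal)]
    # step 3: keep only rules that detect something in the live stream
    rules = [r for r in rules if any(fires(r, t) for t in live)]
    # steps 4-5: retain survivors and replicate them with tiny random mutations
    return rules + [r + rng.uniform(-10, 10) for r in rules]

rng = random.Random(1)
legal = [{"amount": a} for a in (20, 80, 150)]
live  = [{"amount": a} for a in (40, 900)]
candidates = [rng.uniform(0, 1000) for _ in range(20)]   # step 1: random rules
survivors = evolve(candidates, legal, live, rng)
print(survivors)  # surviving thresholds plus their mutated copies
```

The mutation step is what gives the rule population its immune-system flavor: successful detectors are cloned with variation so the system can track drifting fraud patterns.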
Murad and Pinkas (1999) propose the use of profiling in telecommunications. Common daily profiles are extracted using a clustering algorithm with a cumulative distribution distance. An alert is raised if the daily call duration, destination, and quantity exceed the profile's threshold and standard deviation.
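A minimal sketch of this style of profile-based alerting, reduced to a single attribute (daily call duration) and an assumed mean-plus-k-sigma threshold rather than the paper's cumulative distribution distance:

```python
from statistics import mean, stdev

def build_profile(daily_durations):
    """Per-account profile: mean and spread of daily total call minutes."""
    return mean(daily_durations), stdev(daily_durations)

def alert(today, profile, k=3.0):
    """Raise an alert when today's usage exceeds mean + k standard deviations."""
    mu, sigma = profile
    return today > mu + k * sigma

history = [52, 47, 61, 55, 49, 58, 50]   # minutes of calls per day
profile = build_profile(history)
print(alert(240, profile))  # True: today's volume is far outside the profile
print(alert(60, profile))   # False
```

The full method profiles several attributes at once (duration, destination, call count), raising an alert only when the joint deviation is significant.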
Cortes et al. (2001) evaluate the temporal evolution of large dynamic graphs for telecommunications fraud detection. Each graph is made up of subgraphs called communities of interest.
Dorronsoro et al. (1997) create a non-linear discriminant analysis algorithm which does not need labels; it minimizes the ratio of the determinants of the within- and between-class variances of weight projections. Since there is no history of each credit card account's past transactions, all transactions have to be segregated into different geographical locations. The authors report that the installed detection system has low false positive rates, high cost savings, and high computational efficiency.
References:
- Phua, C., Lee, V., Smith, K. & Gayler, R. A Comprehensive Survey of Data Mining-based Fraud Detection Research.
- Abbott, D. W., Matkovsky, I. P. & Elder, J. F. An Evaluation of High-end Data Mining Tools for Fraud Detection.
- Aleskerov, E., Freisleben, B. & Rao, B. (1997). CARDWATCH: A Neural Network-Based Database Mining System for Credit Card Fraud Detection. Proc. of the IEEE/IAFE on Computational Intelligence for Financial Engineering, 220-226.
- Chan, P., Fan, W., Prodromidis, A. & Stolfo, S. (1999). Distributed Data Mining in Credit Card Fraud Detection. IEEE Intelligent Systems 14: 67-74.
- Cortes, C., Pregibon, D. & Volinsky, C. (2003). Computational Methods for Dynamic Graphs. Journal of Computational and Graphical Statistics 12: 950-970.
- Fan, W., Miller, M., Stolfo, S., Lee, W. & Chan, P. (2001). Using Artificial Anomalies to Detect Unknown and Known Network Intrusions. Proc. of ICDM01, 123-248.
- Fawcett, T. (2003). "In Vivo" Spam Filtering: A Challenge Problem for KDD. SIGKDD Explorations 5(2): 140-148.
- Foster, D. & Stine, R. (2004). Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy. Journal of the American Statistical Association 99: 303-313.
- Kim, H., Pang, S., Je, H., Kim, D. & Bang, S. (2003). Constructing Support Vector Machine Ensemble. Pattern Recognition 36: 2757-2767.
- Kim, J., Ong, A. & Overill, R. (2003). Design of an Artificial Immune System as a Novel Anomaly Detector for Combating Financial Fraud in Retail Sector. Congress on Evolutionary Computation.
- Major, J. & Riedinger, D. (2002). EFD: A Hybrid Knowledge/Statistical-based system for the Detection of Fraud. Journal of Risk and Insurance 69(3): 309-324.
- Murad, U. & Pinkas, G. (1999). Unsupervised Profiling for Identifying Superimposed Fraud. Proc. of PKDD99.