Metadata
- Author: S. Finlay
- Full Title:: Predictive Analytics, Data Mining and Big Data
- Category:: 📚Books
- Finished date:: 2023-02-02
Highlights
improvements are relatively small compared to the benefits of having more data, better quality data and analyzing this data more effectively. (Location 296)
would expect a well-implemented decision-making system, based on predictive analytics, to make decisions that are about 20–30% more accurate than their human counterparts. (Location 310)
Producing lists of people who would enjoy going on a date with you. (Location 350)
my experience, it’s still typical for model building to account for no more than 10–20% of the time, effort and cost involved in a modeling project. The rest of the effort is involved in doing all the other things that are needed to get the processes in place to be able to use the model operationally. (Location 412)
A huge proportion of the Big Data out there is absolutely useless when it comes to forecasting consumer behavior. You have to work pretty hard at finding the useful bits that will improve the accuracy of your predictive models (Location 493)
In banking, for example, the potential for new Big Data sources to improve the predictive ability of credit scoring models is fairly small, over and above the data already available. This is because the key driver of credit risk is past behavior, and the banks have ready access to people’s credit reports, plus a wealth of other data, supplied in a nice neat format by Credit Reference Agencies such as Equifax, Experian and TransUnion. (Location 508)
predictive models use no more than 20–30 data items (variables) to generate their predictions, and some considerably less than that. (Location 1497)
The most predictive data items are often derived from two or more other pieces of data. (Location 1514)
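As a minimal illustration of a derived data item (a hypothetical pandas sketch; the field names are invented), a ratio built from two raw variables is often more predictive than either raw item on its own:
```python
# Hypothetical example: deriving a debt-to-income ratio from two raw data items.
import pandas as pd

df = pd.DataFrame({
    "monthly_income": [3200, 5400, 2100],
    "monthly_debt_payments": [800, 2700, 300],
})

# The derived ratio combines the two raw items into a single, often more
# predictive, variable.
df["debt_to_income"] = df["monthly_debt_payments"] / df["monthly_income"]
print(df)
```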
data about primary behavior is the most important by a considerable margin. It often accounts for 80–90% of the predictive ability of a model. (Location 1561)
It’s true that every so often someone will put forward a really interesting and surprising behavioral association that they have found (such as beer purchases are associated with diaper (nappy) sales2 or that insurance claims can be predicted using information about credit usage3), but these associations are relatively rare, (Location 1573)
Sentiment data is one of those things that has really come to the fore in the Big Data world, but in practice is hard (Location 1583)
the new data is highly correlated with existing data then it won’t add much to the power of your predictions. (Location 1605)
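A quick way to sanity-check this before paying for a new data source is to measure how strongly the candidate variable correlates with what you already hold. A rough pandas sketch, with made-up column names and values:
```python
# Hypothetical data: existing variables versus a candidate new variable.
import pandas as pd

existing = pd.DataFrame({
    "bureau_score": [620, 700, 540, 760, 680],
    "months_on_book": [12, 48, 6, 60, 36],
})
candidate = pd.Series([0.31, 0.72, 0.15, 0.88, 0.66], name="new_data_item")

# Correlations close to +/-1 suggest the new data adds little incremental power.
print(existing.corrwith(candidate))
```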
However, in many mass markets, where we are talking about millions of potential customers, a 1–2% improvement can translate into a very large financial benefit that justifies big IT spending. (Location 1632)
Big Data has proved so interesting to marketing people, (Location 1636)
Consequently, predictive models have a finite lifespan. (Location 1763)
Typically, one needs at least 30–50 examples of an attribute to be able to say whether it’s an important predictor of behavior or not. (Location 1792)
A large bank may have five million mortgages on its books, but only 10,000 cases where foreclosure occurred. We say that foreclosure is the “minority class” and non-foreclosure is the “majority class,” and it’s the minority class that is the limiting factor. One approach would be to take all 10,000 foreclosure cases and then randomly sample 10,000 non-foreclosure cases from the rest of the population. (Location 1817)
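A minimal sketch of that under-sampling approach, assuming a pandas DataFrame with a hypothetical "foreclosure" flag (1 = minority class):
```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Keep every minority-class case plus an equal-sized random sample of the rest."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0].sample(n=len(minority), random_state=seed)
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)  # shuffle

# Usage (hypothetical): development = balanced_sample(mortgage_book, target="foreclosure")
```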
The over-fitting problem is less likely to occur if large samples are used.17 What you sometimes see is an over-fitted model constructed using a small sample being compared to a model constructed using a large sample where over-fitting has not occurred. Consequently, the difference in accuracy between the two models is larger than it should be. (Location 1871)
when it comes to liability, if a cloud provider leaks your data or someone hacks it, it’s still you that has ultimate responsibility – not the outsourcer. (Location 2129)
if I remove identifying features, such as the account number, name and address then it’s no longer possible to tell whose credit card details are whose. This is now anonymized data, and as such, data protection and privacy laws no longer apply. Consequently, I can sell or share this data without needing to seek permissions from the individuals in question. (Location 2135)
By cross referencing what you believe is anonymized data against these data sources it is often possible to identify specific individuals – (Location 2153)
it was shown that it was possible to identify some people on the Netflix dataset by cross-referencing it with publicly available information on the Internet Movie Database (IMDb).24 What was really interesting about this case was that all it took to identify someone on the Netflix dataset was a handful of their movie ratings and the approximate dates when those ratings had been made.25 (Location 2158)
One strategy that can be adopted is to slightly perturb the data before it is released. (Location 2163)
As noted by the Information Commissioner’s Office (ICO) in the UK, the risk of anonymized data becoming linked back to specific individuals is essentially unpredictable. This is because one can never fully ascertain what data is already available or what data may be released in the future. It is also infeasible to guarantee the recall or deletion of data (i.e. removing it from a website) once it has been placed in the public domain. You can never be sure it has not been copied to some other database.26 (Location 2172)
laws exist in many countries preventing immutable data such as age, race and gender from being used in many types of decision-making process. (Location 2208)
When faced with non-linear data the way to deal with it is to transform it into something else that is linear. (Location 2357)
the most practical way of dealing with non-linear data is to discretize (bin) the data. Instead of including raw Income, Income is divided into a number of ranges (bins). A 1/0 indicator variable is then used to represent each range. (Location 2361)
With the binning approach one simply creates another 0/1 indicator variable to represent missing data. (Location 2371)
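A minimal pandas sketch of this binning approach, using a hypothetical income column; the bin boundaries are arbitrary, and missing values get their own 0/1 indicator:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [1200, 3500, np.nan, 8000, 25000]})

# Discretize raw income into bins (boundaries chosen for illustration only).
bands = pd.cut(df["income"],
               bins=[0, 2000, 5000, 10000, np.inf],
               labels=["income_0_2k", "income_2k_5k", "income_5k_10k", "income_10k_plus"])

# One 0/1 indicator per bin; dummy_na=True adds the extra indicator for missing data.
indicators = pd.get_dummies(bands, dummy_na=True, dtype=int)
print(indicators)
```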
Binning all the predictor variables and replacing them with a set of indicator variables has several other advantages. One is that it standardizes the data, which makes it much easier to compare the contribution that each variable makes to the model score. In Figure 6.1, for example, you can see that a net monthly income of more than $8,000 contributes +172 points to the score, compared to +23 points if you are a home owner. (Location 2377)
Another issue that often comes up when building predictive models is outliers. Most families have annual incomes of no more than $250,000, but a (very) few have incomes of $10m or more. It doesn’t really matter if these high incomes are real or a mistake – what tends to happen is that they distort the results. In traditional statistics a popular solution is to simply exclude these types of cases and build your model without them.6 However, in predictive modeling it’s often the outliers that are of most interest, particularly when models are being constructed to predict “rare” events such as fraud. Binning automatically takes account of outliers, and treats them appropriately within the model. (Location 2387)
In my experience, the use of binning always results in better linear models than when the raw data is used or when simple transforms such as log or power functions are applied. (Location 2398)
In practice the number of hidden neurons is often somewhere between two and twice the number of input variables (so in this example somewhere between two and ten). (Location 2478)
An easy way to think about the operation of the network in Figure 6.6 is as a function of three separate linear models. (Location 2492)
The big strength of neural networks is their ability to take into account non-linear features in data, and if you build a network correctly it has the potential to outperform linear models and decision trees in some situations. (Location 2495)
Their main drawback, as you may have gathered, is their complexity. Figure 6.6 contains just five input variables, two hidden neurons and one output neuron, and that’s complex enough. Imagine what a network with 200 input variables and a few dozen hidden layer neurons (Location 2497)
The hard part with neural networks is determining how many neurons to have in the hidden layer and what the weights should be, and there is usually a degree of trial and error involved. You can’t determine the weights in a network using a simple formula. Instead, algorithms are applied which run through the data many times, each time adjusting the weights based on how well the model predicts. (Location 2499)
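To make the "function of three separate linear models" idea concrete, here is a sketch of the forward pass for a network shaped like Figure 6.6 (five inputs, two hidden neurons, one output), with made-up weights; in practice a training algorithm would adjust these weights over many passes through the data:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 1.0, 0.0, 0.7, 0.2])            # five (scaled) input variables

W_hidden = np.array([[0.8, -0.3, 0.5, 0.1, -0.6],   # linear model for hidden neuron 1
                     [-0.2, 0.9, -0.4, 0.3, 0.7]])  # linear model for hidden neuron 2
b_hidden = np.array([0.1, -0.2])

w_output = np.array([1.2, -0.8])                    # linear model for the output neuron
b_output = 0.05

hidden = sigmoid(W_hidden @ x + b_hidden)           # two linear models, squashed
score = sigmoid(w_output @ hidden + b_output)       # third linear model gives the score
print(score)
```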
support vector machines are a non-linear method (Location 2524)
The support vector machine works to find the equation of the line that maximizes the margin, and the equation of this line is your model. (Location 2526)
the structure of a support vector machine model is not interpretable in any meaningful way: (Location 2529)
With a support vector machine only the cases that are closest to the line are used. These are the “support vectors.” If you think about it this makes sense. Cases far from the margin (near the edge of the table where the events were scattered from) are not going to tell you much about where the line should be drawn. Only cases that are near to the middle of the table will have a significant impact on the classification process and are therefore included in the modeling process (these are the circles and crosses in bold in Figure 6.7). A key part of the SVM algorithm is determining which cases in the sample are the support vectors, before applying a suitable transformation to those cases and then using them to build the model. (Location 2533)
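A minimal scikit-learn sketch with hypothetical data; after fitting, the model exposes exactly those boundary cases (the support vectors) that it used to fix the line:
```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(X, y)
print(model.support_vectors_)    # the handful of cases closest to the separating line
print(model.predict([[4, 4]]))   # score a new case
```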
For predictive modeling a slightly different clustering process, called K-nearest neighbor, is widely applied, and for this type of clustering you do need to know something about behavior for cases in the development sample. (Location 2557)
The score for the new case is calculated to be the proportion of the K nearest cases in the development sample that displayed the behavior. (Location 2562)
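A minimal scikit-learn sketch with made-up numbers; with the default uniform weighting, predict_proba returns exactly that proportion of the K nearest neighbors:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_dev = np.array([[25, 1200], [40, 5300], [35, 2100], [52, 7800], [29, 1500]], dtype=float)
y_dev = np.array([1, 0, 1, 0, 1])            # 1 = displayed the behavior

knn = KNeighborsClassifier(n_neighbors=3).fit(X_dev, y_dev)
print(knn.predict_proba([[33, 2000]]))       # [proportion non-behavior, proportion behavior]
```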
- A knowledge base. This holds the decision-making logic, often in the form of IF/THEN type rules. For example: IF “Chest Pain” AND “Difficulty Breathing” THEN Probability of Heart Attack = 0.17.
- An inference engine. This interrogates the knowledge base, using data it has collected from the user to identify the most probable outcome, given the data that has been provided.
- An interface. This provides a mechanism for the system to interact with users to gather the data it needs. For example, to ask the user questions such as: “Does the patient have chest pain?” (Location 2573)
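A toy Python sketch of those three components, with made-up rules and probabilities:
```python
# Knowledge base: IF <set of symptoms> THEN (outcome, probability). All values are invented.
knowledge_base = [
    ({"chest pain", "difficulty breathing"}, ("heart attack", 0.17)),
    ({"chest pain"},                         ("heart attack", 0.05)),
    ({"fever", "cough"},                     ("flu",          0.40)),
]

def inference_engine(symptoms):
    """Return the outcome of the most specific rule whose IF-part is satisfied."""
    matches = [rule for rule in knowledge_base if rule[0] <= symptoms]
    return max(matches, key=lambda rule: len(rule[0]))[1] if matches else None

# Interface: in a real system the answers would be gathered by questioning the user.
answers = {"chest pain", "difficulty breathing"}
print(inference_engine(answers))   # ('heart attack', 0.17)
```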
Expert systems may only have a few applications, but where there is very little hard data available to build a model using a statistical/mathematical approach, the idea of capturing human expertise still has merit. (Location 2601)
On balance, support vector machines and neural networks are slightly better than other types of model for predicting consumer behavior, but often there is not much in it. Most algorithms perform similarly well in many situations23 (the flat maximum effect again). No single type of model is always best. Sometimes a simple linear model or decision tree will outperform more advanced/complex methods.24 (Location 2620)
Popular algorithms for deriving decision trees are not very efficient at utilizing data. (Location 2643)
This is particularly true when small and medium-sized samples are used to construct the model.29 (Location 2645)
If you have lots more examples of one behavior than the other (behavior versus non-behavior) in the development sample then model performance will be poor (e.g. (Location 2647)
There are however, ways of getting around this problem.31 (Location 2650)
weakness of clustering, neural networks and support vector machines is their complexity and “black box” nature. (Location 2659)
There are methods that can be used to infer what variables are important in a neural network, but that arguably just adds another layer of complexity. (Location 2661)
but what little evidence there is tends to support the case that simple linear models and decision trees are more stable than other types of model.36 (Location 2680)
Always use linear models as your benchmark, developed using stepwise linear or logistic regression, and replace the raw predictor variables with indicator variables.37 With the right software, these are quick to develop and will provide a baseline against which to assess other types of model. (Location 2695)
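A minimal sketch of such a benchmark in scikit-learn, using hypothetical data and binned indicator variables (stepwise variable selection is not shown, since scikit-learn has no built-in stepwise routine):
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

raw = pd.DataFrame({
    "income": [1200, 3500, np.nan, 8000, 25000, 4100],
    "home_owner": [0, 1, 0, 1, 1, 0],
    "defaulted": [1, 0, 1, 0, 0, 1],
})

# Replace raw income with binned 0/1 indicators (missing values get their own indicator).
income_band = pd.cut(raw["income"], bins=[0, 2000, 5000, np.inf], labels=["low", "mid", "high"])
X = pd.concat([pd.get_dummies(income_band, prefix="income", dummy_na=True, dtype=int),
               raw[["home_owner"]]], axis=1)
y = raw["defaulted"]

benchmark = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, benchmark.coef_[0])))
```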
ensembles can give significant uplifts in terms of predictive accuracy, but in many real-world applications they are subject to the same operational requirements for simplicity and explicability as single model solutions, and this is their Achilles heel. Where ensembles are used, it’s most commonly segmentation ensembles that are employed. This is because the model structure remains relatively simple. (Location 2771)
Recently, nature-inspired “swarm intelligence” approaches, such as Particle Swarm Optimization, Artificial Ant Colony Optimization and Artificial Bee Colonies, have been hot topics, in particular because these methods adopt a divide-and-conquer approach. (Location 2786)
In some circumstances a new method does provide a marginal improvement over established techniques for building predictive models, or there may be a very specific type of problem that a particular algorithm is well suited to. However, in practical situations the benefits are usually very small, or come with a price in terms of being very complex, or completely unintelligible to anyone without a PhD in statistics or computer science. This is not to say there aren’t some problems out there that have benefited from new, cutting-edge predictive analytical techniques, but in my opinion they are few and far between when it comes to predicting consumer behavior in real-world business environments. (Location 2790)
there are far better gains to be had by (Location 2796)
new methods of transforming data prior to model construction and improved ways for selecting the best subset of predictor variables to present to the chosen modeling algorithm. (Location 2797)
Therefore best practice is to maintain a third sample – a holdout sample – that is used to perform the final analysis of model performance. (Location 3569)
For an operational model with a long forecast horizon (such as credit scoring models), it’s also good practice to take additional holdout samples from different time points. These “out of time” samples are used to test the long-term stability of the model’s performance. (Location 3572)
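A rough sketch of how those samples might be carved out, assuming a hypothetical applications file with an application_date column and an arbitrary cut-off date:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("applications.csv", parse_dates=["application_date"])  # hypothetical file

# "Out of time" sample: cases from after an arbitrary cut-off date, kept for stability checks.
out_of_time = data[data["application_date"] >= "2013-01-01"]
development = data[data["application_date"] < "2013-01-01"]

# Within the development window: training data, a test set, and a final holdout sample.
train, rest = train_test_split(development, test_size=0.4, random_state=42)
test, holdout = train_test_split(rest, test_size=0.5, random_state=42)
```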
What Hadoop is less good at is real-time data processing on lots of small sub-sets of the data. (Location 4167)
better suited to large-scale batch processing. (Location 4178)
It’s only when your analytical databases get into the terabyte range and you have a lot of unstructured data that solutions like Hadoop come into their own. (Location 4195)
One thing you can do is use Hadoop tools to do the preliminary leg work, to identify any important data amongst all the chaff and then shift the small proportion that is actually useful to your existing relational databases and operational systems to be used in real-time decision making. (Location 4239)
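One way that preliminary legwork might look in practice, sketched with PySpark over a hypothetical clickstream dataset (the paths and field names are invented):
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("preliminary-legwork").getOrCreate()

# Large-scale batch scan of the raw data: the kind of job Hadoop-style tools handle well.
raw = spark.read.json("hdfs:///data/clickstream/*.json")

# Distil the useful fraction: one summary row per customer.
summary = (raw.groupBy("customer_id")
              .agg(F.count("*").alias("visits_90d"),
                   F.max("event_time").alias("last_visit")))

# Hand the small result set over for loading into the relational/operational systems.
summary.coalesce(1).write.mode("overwrite").csv("hdfs:///out/customer_summary", header=True)
```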
Model implementation. After constructing a model, how will it be implemented? Some packages produce code in Java, SQL, C++ and so on, allowing the models to be inserted directly into the relevant operational system. A fast-growing trend is the use of Predictive Modeling Markup Language (PMML). This allows a model built using one package to be automatically implemented by another. (Location 4287)
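One possible route from Python to PMML is the third-party sklearn2pmml package (which needs a Java runtime); a hedged sketch with invented data:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X = np.array([[1200, 0], [3500, 1], [8000, 1], [2100, 0]], dtype=float)
y = np.array([1, 0, 0, 1])

# Wrap the model in a PMML-aware pipeline, fit it, then export the PMML file,
# which a PMML-compliant scoring engine should then be able to execute.
pipeline = PMMLPipeline([("model", LogisticRegression())]).fit(X, y)
sklearn2pmml(pipeline, "credit_model.pmml")
```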
- Hosmer, D. and Lemeshow, S. (2013). Applied logistic regression (Wiley series in probability and statistics). 3rd Edition. Wiley. This book provides a detailed look at the theory and application of logistic regression, which remains the most widely applied method for generating classification models.
- Bishop, C. M. (1995). Neural networks for pattern recognition. Clarendon Press. This is one of the few definitive guides to the theory and application of neural networks. Although it was originally published back in the 1990s, most of the material remains as relevant as it was when it was first published.
- Hastie, T., Tibshirani, R. and Friedman, J. (2011). The elements of statistical learning: Data mining, inference, and prediction. 2nd Edition. Springer. This is a heavyweight guide to many of the data mining tools used in predictive analytics, written by three world-leading academics.
- Bishop, C. M. (2007). Pattern recognition and machine learning (Information science and statistics). Springer. This book covers a lot of the theoretical material underpinning many of the tools commonly used for data mining and predictive analytics.
- Crawley, M. (2012). The R book. 2nd Edition. Wiley. This book is comprehensive, but also suitable for relative beginners (with some rudimentary experience of programming languages – maybe Visual Basic, C++, SAS, Java or Python), as well as more experienced statistical programmers who wish to learn how to use the R programming language. (Location 4590)
Another application is predicting stock market movements, based on sentiments being expressed in blogs, tweets and (Location 4676)