
Data Mining

by Lee LeClair
08/10/2007
As seen in Inside Tucson Business

Recently, it made the geek news that police in Richmond, Virginia, had used data mining to predict where and when crimes were most likely to occur in their jurisdiction. After taking preventive steps based on those predictions, they reduced crime by 25% the first year and a further 19% the following year. The interesting thing is that they used their existing data and simply set about organizing it to find patterns that pointed to an increased risk of crime.

In essence, they made use of the ocean of data they had already collected to do something useful. Data mining is not a new concept, but it is most commonly associated with major corporations or the NSA, and with lots of expensive hardware. However, the increasing power of hardware and the decreasing cost of business intelligence software have brought data mining down to a level where medium-sized businesses can take advantage of it and attain real results if they apply it correctly.

Data mining involves examining your data in ways that were often not intended when the data was originally collected. It requires thinking about what you want to learn and then figuring out how to coax that information out of your system. This is where business intelligence software comes in: BI programs help you find ways to ask the right questions. Powerful hardware is usually necessary because the mining effort involves scouring a very large amount of data for the pieces of the puzzle you are trying to put together. The more mundane database systems that collected the data in the first place were usually designed to write records steadily into large storage and to query only small slices of that data at a time.
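To make that concrete, here is a rough sketch in Python of the kind of question a BI tool ultimately asks of the data. The table and column names are invented for illustration, not taken from any particular system:

```python
# A minimal sketch of a "business intelligence" question asked in SQL.
# The incidents table and its columns are hypothetical stand-ins for
# whatever data your own systems have been quietly accumulating.
import sqlite3

conn = sqlite3.connect("warehouse.db")  # assumed copy of the source data
rows = conn.execute("""
    SELECT neighborhood,
           strftime('%H', occurred_at) AS hour_of_day,
           COUNT(*)                    AS incident_count
    FROM   incidents
    GROUP  BY neighborhood, hour_of_day
    ORDER  BY incident_count DESC
    LIMIT  10
""").fetchall()

for neighborhood, hour, count in rows:
    print(f"{neighborhood:20s} {hour}:00  {count} incidents")
```

The question itself is simple; the work lies in scanning every record to answer it, which is why the hardware underneath matters.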

When mining data, on the other hand, it helps to have a separate, dedicated database server with the data spread across multiple smaller hard disks, so that queries can read from several disks in parallel. The advent of dual- and even quad-core processors also helps, since parallel processing generally beats serial processing for this kind of work. Finally, it helps to organize your data differently than it was organized when it was collected.
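As a sketch of why that layout pays off, each core can scan its own slice of the data at the same time and the partial answers can be merged afterward. The partition file names and record layout here are hypothetical:

```python
# Each worker scans one partition of the data independently; the
# partial counts are merged at the end. With the partitions on
# separate disks, the reads do not compete with one another.
from collections import Counter
from multiprocessing import Pool

PARTITIONS = ["part0.csv", "part1.csv", "part2.csv", "part3.csv"]

def scan_partition(path):
    """Count incidents per neighborhood in one partition file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            neighborhood = line.split(",")[0]
            counts[neighborhood] += 1
    return counts

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # one worker per core
        partials = pool.map(scan_partition, PARTITIONS)
    total = sum(partials, Counter())
    print(total.most_common(5))
```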

Most data designs are geared for Online Transaction Processing (OLTP). These designs are meant to maximize data integrity during the write process and to reduce data duplication (i.e., a given piece of data is stored only once). For data mining, Online Analytical Processing (OLAP) data designs are a better fit. These designs optimize query speed and are not as concerned with data integrity during entry. They are not at all good for OLTP, but they are great for business intelligence queries. Naturally, it is also a good idea to copy this data mining database from the original source, so that it can be abused with the kinds of heavy queries that would bring the source database to its knees and disrupt operations.
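Here is a small illustration of the difference, using Python's built-in sqlite3 module. The normalized source tables are hypothetical, but the pattern of flattening them into one wide table for querying is the heart of an OLAP-style design:

```python
# A sketch of the OLTP-versus-OLAP idea. The normalized source tables
# (orders, customers, products) are hypothetical; the flattened copy
# trades duplication for fast, simple analytical queries.
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Build a denormalized "fact" table from the copied OLTP tables.
# Every row repeats the customer and product details, which an OLTP
# design avoids, but analytical queries no longer need joins.
conn.execute("""
    CREATE TABLE IF NOT EXISTS sales_fact AS
    SELECT o.order_date,
           c.region,
           p.category,
           o.quantity * p.unit_price AS revenue
    FROM   orders    o
    JOIN   customers c ON c.customer_id = o.customer_id
    JOIN   products  p ON p.product_id  = o.product_id
""")

# The business question is now a flat scan with a GROUP BY.
for region, category, revenue in conn.execute("""
    SELECT region, category, SUM(revenue) AS revenue
    FROM   sales_fact
    GROUP  BY region, category
"""):
    print(region, category, round(revenue, 2))
```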

The key to getting what you need out of your data is understanding what you have and what can realistically be extracted from it. If you have crime statistics, crime location information, and the like, you should not expect to determine tomorrow's weather. You should, however, be able to figure out at what times of the month crime spikes occur and in what areas. Then you can look at external information, like when payday falls for local employers and how far frequent crime locations are from cash machines. If you have a small or medium-sized business, think about the data you have likely amassed and what you might learn from examining it. Make no mistake, it will take careful consideration and planning, but data mining could help you take your business to the next level.
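As a toy example of folding in that kind of external information, you can check how many incidents cluster around local paydays. All of the dates below are made up for illustration:

```python
# Flag incidents that fall within two days of a local payday.
# The payday and incident dates here are invented, standing in for
# an employer calendar and a query against the incident database.
from datetime import date

paydays   = [date(2007, 7, 1), date(2007, 7, 15), date(2007, 8, 1)]
incidents = [date(2007, 7, 2), date(2007, 7, 9), date(2007, 7, 16)]

def near_payday(day, window=2):
    """True if the date lands within `window` days of any payday."""
    return any(abs((day - p).days) <= window for p in paydays)

flagged = [d for d in incidents if near_payday(d)]
print(f"{len(flagged)} of {len(incidents)} incidents fell "
      f"within 2 days of a payday")
```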

Lee LeClair is the CTO at Ephibian. His Tech Talk column appears the third week of each month in Inside Tucson Business.