Friday, April 11, 2014

Questioning Information Security - It's all about the data

In Questioning Information Security Part 1, I argued that your security is only as good as the questions you ask. If you never ask the question - Is my network exposed to compromise through third-party connectivity? How can my password reset function be defeated? Who is using unauthorized systems on my network? - you will never know the answer.

But great questions are only the seed of the answer. Just as a scientific hypothesis necessitates a quest of rigorous experimentation, an Information Security question requires data and analytics.

Data Collection

Fortunately, your environment has all the data needed to answer your security questions. Let me say it again - your environment has all the data you need to answer your security questions. The diagram below shows some of the data that is commonly available in enterprise network environments.

Mixed together properly, your DNS activity, web logs, firewall logs, endpoint events, malware events, AD data, and NetFlow hold rich treasures that can answer any number of questions - ones that you are asking and ones that you haven't yet asked.

But how do you collect it? How do you process it? If your last attempt at wrangling and analyzing large amounts of data was back in 2005, you probably felt a bit like this guy.  The relational databases just didn't scale. It was expensive. It was a lot of work. And queries took forever.

The world of data processing began to change in the mid-2000s with the arrival of Hadoop, an open source Apache project that implements Google's MapReduce model for parallel processing of huge data sets. As Hadoop became more accessible and associated technologies such as Hive and Pig emerged, wrangling this data wasn't so difficult any more. Now data analytics can occur on an industrial scale.

Massive data sets can be loaded onto commodity hardware running open source software and analyzed effectively, with complex queries that were previously unthinkable completing in reasonable time. With the NoSQL paradigm, the shackles and overhead of relational data structures are off, and both structured and unstructured data can be processed from all kinds of sources.

Of course, you don't have to collect all of the data in your environment to answer really meaningful security questions. Just start with the data you need to answer the core questions well.

Data Correlation

Some cool questions can be answered by simply analyzing a single dimension of data that spans a long period of time, such as firewall or DNS logs. Things get exponentially more interesting when you correlate the different data sources together. How do you do this though with data from a variety of sources that was really never 'designed' to be together?

The common denominator across the data is an IP address or a hostname. All your data that interacts with the network can be tied pretty easily to one of these two attributes. With data collected over time and correlated, really interesting questions can start to be answered. Take Active Directory - it is useful to correlate other data against it because that is where you can tie in the users behind the events. AD doesn't have the IP address of each user, but the domain authentication logs do. Tie that to your DNS logs and you can get down to a hostname. The process is similar for just about any other data source. I have found it useful to do this data enrichment during data load processing.
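To make the enrichment idea concrete, here is a minimal Python sketch of tying firewall events back to a user and a hostname at load time. Everything in it is an assumption for illustration: the record layouts, the sample values, and the helper names (build_auth_index, user_at, enrich) are hypothetical, and real authentication, DNS, and firewall logs will look different. It attributes each event to the most recent user who authenticated from that source IP before the event occurred.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime

# Hypothetical, simplified records; real log formats will differ.
AUTH_LOG = [  # (timestamp, source IP, user) from domain authentication logs
    ("2014-04-11T08:00:00", "10.0.0.5", "alice"),
    ("2014-04-11T09:30:00", "10.0.0.5", "bob"),
    ("2014-04-11T08:15:00", "10.0.0.9", "carol"),
]
DNS_RECORDS = {"10.0.0.5": "lab-pc-01", "10.0.0.9": "fin-laptop-07"}
FIREWALL_EVENTS = [  # (timestamp, source IP, destination IP, action)
    ("2014-04-11T08:45:00", "10.0.0.5", "203.0.113.20", "deny"),
    ("2014-04-11T10:00:00", "10.0.0.5", "198.51.100.7", "allow"),
    ("2014-04-11T07:00:00", "10.0.0.9", "198.51.100.7", "allow"),
]


def parse(ts):
    """Parse the assumed ISO-like timestamp format."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")


def build_auth_index(auth_log):
    """Map each IP to a time-sorted list of (timestamp, user) pairs."""
    index = defaultdict(list)
    for ts, ip, user in auth_log:
        index[ip].append((parse(ts), user))
    for entries in index.values():
        entries.sort()
    return index


def user_at(index, ip, when):
    """Return the last user seen authenticating from ip at or before when."""
    entries = index.get(ip, [])
    times = [t for t, _ in entries]
    pos = bisect_right(times, when)
    return entries[pos - 1][1] if pos else None


def enrich(events, auth_index, dns):
    """Attach hostname and user to each event, keyed on the source IP."""
    for ts, src, dst, action in events:
        yield {
            "time": ts,
            "src_ip": src,
            "dst_ip": dst,
            "action": action,
            "hostname": dns.get(src),
            "user": user_at(auth_index, src, parse(ts)),
        }


ENRICHED = list(enrich(FIREWALL_EVENTS, build_auth_index(AUTH_LOG), DNS_RECORDS))
for row in ENRICHED:
    print(row)
```

Events with no earlier authentication from that IP come back with a user of None rather than a guess, which keeps the gaps in your data visible. In an environment with DHCP churn you would also want to bound how old an authentication record can be before it stops counting.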

Data Contextualization

Data contextualization is simply the process of taking that security data and putting it within the context of its place in the business, the environment, and the overall risk of the enterprise. Then, something as lonely as a vulnerability report can be really useful. You can know how important a vuln is because of the risk profile of the system it resides in, and you can report vulnerabilities by business unit or, better, by business process. You can tie vulns to specific people within specific business units. There are lots of places you can go with other data.
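As a rough illustration of contextualizing a vulnerability report, the Python sketch below joins scanner findings to an asset inventory and rolls them up by business unit. The inventory fields, the sample hosts and findings, and the scoring rule (severity multiplied by a 1-3 criticality rating) are all assumptions made for the example, not a standard; substitute whatever risk model your organization actually uses.

```python
from collections import defaultdict

# Hypothetical asset inventory: host -> business context.
ASSETS = {
    "web-01": {"business_unit": "Retail", "criticality": 3},
    "rd-ws-14": {"business_unit": "R&D", "criticality": 2},
    "kiosk-02": {"business_unit": "Retail", "criticality": 1},
}

# Hypothetical scanner output; IDs and severities are made up.
VULNS = [
    {"host": "web-01", "id": "VULN-A", "severity": 7.5},
    {"host": "rd-ws-14", "id": "VULN-B", "severity": 9.0},
    {"host": "kiosk-02", "id": "VULN-A", "severity": 7.5},
    {"host": "unknown-99", "id": "VULN-C", "severity": 5.0},
]


def contextualize(vulns, assets):
    """Attach business unit and an assumed risk score to each finding."""
    findings = []
    for vuln in vulns:
        asset = assets.get(vuln["host"])
        if asset is None:
            # Hosts missing from the inventory are flagged, not scored.
            findings.append({**vuln, "business_unit": "Unassigned", "risk": None})
        else:
            risk = vuln["severity"] * asset["criticality"]
            findings.append(
                {**vuln, "business_unit": asset["business_unit"], "risk": risk}
            )
    return findings


def risk_by_unit(findings):
    """Sum the scored risk per business unit."""
    totals = defaultdict(float)
    for finding in findings:
        if finding["risk"] is not None:
            totals[finding["business_unit"]] += finding["risk"]
    return dict(totals)


FINDINGS = contextualize(VULNS, ASSETS)
REPORT = risk_by_unit(FINDINGS)
print(REPORT)
```

The same vulnerability on web-01 and kiosk-02 ends up with very different weight because of where it lives, which is the point of contextualization. The "Unassigned" bucket is worth watching too, since a host the inventory doesn't know about is a finding in its own right.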


Ultimately, the answers to the questions you've asked are valuable insofar as they result in action, which requires good communication. One of the goals is to make them valuable to the business and create a meaningful business-specific dashboard that isn't color-coded based on swags.

Rather, the dashboard is based on real data such that you can dive into its dimensions and layers, such as the data behind fraudulent outsider transactions in the retail division or accidental employee data loss in the R&D group. Because the dashboard is driven by and derived from data, you might see rates of non-compliance with laptop encryption and host-based external media controls, and perhaps promiscuous policies related to the use of cloud storage.

In the last post in this series, I'll expand on ideas for questions to answer using data collection, correlation, and analytics.