In this blog post I will explain big data implementation projects: the why and how of big data, what drives big data projects and, most importantly, how business intelligence has changed over the years. One comment I hear often is, "This is the business intelligence I have been doing all along, and my data warehouse already handles huge amounts of data; big data is just old wine in a new bottle." There has always been hype around big data and its importance, so we need to understand why we need it, what value it provides and how we can best adapt to the changes.
Having worked in the business intelligence space for a considerable number of years, I have seen many changes along the way. To understand business intelligence and data mining projects better, I like to divide them into three components: first, understanding the data; second, the questions we need answered; and third, how we use technology to answer those questions adequately and insightfully.
The purpose and structure of business intelligence per se have not changed, but the three components above most definitely have. Here's how:
UNDERSTANDING DATA:
Data has changed dramatically over the years. Let's take a look at the 4 V's of data: volume, variety, velocity, and veracity.
Volume: Big data implies enormous volumes of data. Previously, data was created mostly by employees. Today, data is generated by machines, networks and human interactions across mediums such as social media, making the volume of data to be analyzed massive. Data is growing faster than ever before, and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet.
Variety: Variety refers to the many sources and types of data both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc.
In 2015, over 1.4 billion smartphones were shipped, all packed with sensors capable of collecting all kinds of data, not to mention the data the users create themselves.
Velocity: Velocity refers to the speed at which data flows in from sources such as business processes, machines, networks and human interaction with things like social media sites and mobile devices. The flow of data is massive and continuous. If we can handle its sheer velocity, this real-time data can help researchers and businesses make valuable decisions that deliver strategic competitive advantage and ROI.
Facebook users send on average 31.25 million messages and view 2.77 million videos every minute. There is massive growth in video and photo data, where every minute up to 300 hours of video are uploaded to YouTube alone.
Veracity: Veracity refers to the biases, noise and abnormality in data. It is very important to know whether the data being stored and mined is meaningful to the problem being analyzed. In scoping out a big data strategy, it is imperative to keep your data clean and to have processes in place that keep 'dirty data' from accumulating in your systems.
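The "keep dirty data out" advice can be sketched as a simple validation gate that runs before records reach downstream systems. This is a minimal, hypothetical example: the field names (`customer_id`, `age`) and the plausibility thresholds are invented for illustration, not taken from any particular system.

```python
# A minimal sketch of a veracity check: filter out records with
# missing or implausible values before they accumulate downstream.
# Field names and thresholds here are hypothetical.
def is_clean(record):
    return (
        record.get("customer_id") is not None
        and record.get("age") is not None
        and 0 < record["age"] < 120
    )

raw = [
    {"customer_id": 1, "age": 34},
    {"customer_id": None, "age": 28},   # missing key field
    {"customer_id": 3, "age": 999},     # implausible value
]
clean = [r for r in raw if is_clean(r)]
print(len(clean))  # only the first record passes the gate
```

In a real pipeline the rejected records would typically be routed to a quarantine area for review rather than silently dropped, so that systematic data-quality problems at the source can be found and fixed.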
THE QUESTIONS WE NEED ANSWERED:
Now that we have an abundance of changing data, we are able to give business decision makers more insightful answers. With the combination of structured and unstructured data, we can answer these questions far more meaningfully than we could in the past. Previously, we would spend a lot of time and money finding these answers; today, BI is far less time consuming and can be delivered at lower cost, driving ROI up significantly. The combination of new data sources, data mining, predictive analytics and today's complex machine learning algorithms is more effective than ever, and predictive business intelligence has become mainstream.
Machine learning brings value to all the data that enterprises have been saving for years by churning through high volumes of data, helping teams gain deeper insights and improve decision-making. The beauty of it all is that these algorithms keep getting better over time on their own.
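The idea that a model "gets better" as it churns through more data can be illustrated with a toy example: an online perceptron learning a simple rule from synthetic points. This is only a sketch of the principle, using standard-library Python on made-up data, not an enterprise ML pipeline.

```python
import random

# Toy illustration: label is 1 when x0 + x1 > 1, else -1.
# An online perceptron sees examples one at a time and updates
# its weights only on mistakes; more data tends to mean a
# better decision boundary.
random.seed(0)

def make_point():
    x = [random.random(), random.random()]
    return x, 1 if x[0] + x[1] > 1 else -1

def train(n_examples):
    w, b = [0.0, 0.0], 0.0
    for _ in range(n_examples):
        x, y = make_point()
        if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # misclassified
            w[0] += y * x[0]
            w[1] += y * x[1]
            b += y
    return w, b

def accuracy(w, b, n_test=1000):
    correct = 0
    for _ in range(n_test):
        x, y = make_point()
        pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1
        correct += (pred == y)
    return correct / n_test

small = accuracy(*train(10))      # model trained on little data
large = accuracy(*train(1000))    # model trained on much more data
print(small, large)  # accuracy generally improves with more data
```

The same dynamic, at vastly larger scale and with far more sophisticated algorithms, is what makes years of accumulated enterprise data so valuable.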
HOW WE USE TECHNOLOGY:
In the early years, companies like Google architected solutions that use huge amounts of unstructured data to deliver actionable, meaningful insights to users. Here are two research papers on how it was done: http://research.google.com/archive/mapreduce.html and http://research.google.com/archive/gfs.html.
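The core idea of the MapReduce paper linked above can be sketched in a few lines: a map phase emits key/value pairs, and a reduce phase aggregates them by key. This is a toy, single-machine word count illustrating the programming model, not Google's distributed implementation.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: group intermediate pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["Big data is big", "data drives decisions"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```

What made the real system powerful is everything this sketch omits: the framework transparently shards the map and reduce work across thousands of machines, handles failures, and moves the data, while the programmer writes only the two functions.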
Inspired by this work, the open source community has built an ecosystem of Hadoop technologies. These technologies have reached a state of maturity that companies of all sizes can use to build their data/BI solutions. They can handle the type of data available to organizations today, scale economically, and are well suited to building next-generation data warehouses for predictive and prescriptive analytics. Storing and processing petabytes has become economical enough that companies of all sizes can create data solutions.
Traditional business intelligence and data warehousing were not designed to handle, or make full use of, the type of data we have now. Newer technologies, on the other hand, are designed from the ground up for these scenarios. It is therefore important to assess the impact on data warehouse and ETL infrastructure: a project may require integrating a Hadoop distribution and creating an enterprise data lake or data hub, along with changes to architecture and infrastructure.
IMPACT AND SUMMARY:
The value that organizations can get from these changes in data and technology should drive new big data projects. Organizations can derive that value in multiple ways, for example by reducing data warehouse costs, extracting new insights from new types of data sources to improve operations, targeting customers better, improving customer experience, and creating new lines of business around data.
However, it is advisable to create a vision for a big data project before starting, and that vision should include the value the project will drive for the organization.
Many companies have invested in data mining, but the focus was always on structured data. Now we can apply the same techniques to valuable unstructured data, and more descriptive and prescriptive analytics can be done. This is possible because of advances in both data and machine learning / modelling algorithms, and it can be further integrated with agent-based technologies such as Cortana and Siri.
Data Sources: http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#3ec9f3f76c1d