Bill Gates once famously said, “640K is more memory than anyone will ever need on a computer.”
Many versions of this quote have circulated since. Whether Bill Gates really said it is questionable; what is not questionable is that we need a lot of space because we have a lot of data. We also need reliable software around the business processes that work with this data, and software to handle the data itself. Though both software and hardware have seen sweeping changes since the first computer was built, we will never see the day when a more scalable solution is not required.
Maintaining the infrastructure (software and hardware) for these burgeoning needs is a big challenge. With mounting IT costs, enterprises across industries prefer to outsource their IT requirements, which lets a company concentrate on its core business. In the IT world, outsourcing was a term mostly used for software development and systems maintenance. Yes, datacenters provided by external vendors have long met hardware infrastructure needs, but the IT team still had to look after the maintenance of those systems in one way or another. Enterprises also now have to deal with more data than ever. This is where a big change has come, after ‘Cloud’ and ‘Hadoop’ became the buzzwords.
For as long as I can remember, a cloud has been the picture used to represent the Internet in books and presentations. It was drawn that way to hide the complexity of the Internet, so that a school kid, or anyone outside software and hardware, could grasp the bigger picture: all that mattered to them was that they could connect to the Internet from anywhere and get what they wanted. As per Wikipedia, ‘cloud computing’ is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility over a network (the cloud). Now think of an enterprise running its mail services, or a software engineering team developing a next-generation application that produces petabytes of data. Without the cloud, each of these tasks would need significant expertise and resources (read: cost and time). This is where the cloud is making it big, providing IaaS, PaaS, and SaaS. Moving these services or tasks to the cloud hides a lot of complexity, saves cost and time, and helps the team focus on its core business.
Now that we know why everyone is talking about the cloud these days, let us see how the cloud and Hadoop helped a group of engineers finish their product on time. I will take the example of a software engineering team working on a next-generation Supply Chain Management web application for small and medium businesses, built on Microsoft technologies.
This particular application deals with a lot of data, both explicit and implicit. Explicit data is everything entered by users or administrators. By implicit data I mean data produced by analyzing the explicit data, based on observed facts as well as some assumptions; it also includes data gathered from user activity. The team plans to provide cutting-edge Business Intelligence and analytics tools so that users of the web application can analyze the data collected and maximize their revenue. All this data can be both structured and unstructured. Structured data can be analyzed using traditional BI tools, but the team realizes that much of its data is unstructured and huge in volume. One might ask what this unstructured data is. Think of how data has traditionally been organized: stored away in relational, SQL-based databases. Now think of something like user comments or forum content. A record in a database table is structured, but its comments field is just a chunk of raw text with no structure at all, and there is a lot of unstructured data inside it. One can mine it and extract information that is useful for the business. In this application most of the unstructured data comes from the notes and comments users leave during the supply-chain cycle; some of the implicit data generated from user activity is unstructured too.
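To make the distinction concrete, here is a minimal sketch in plain Java. The `ShipmentNote` record and its fields are hypothetical, not taken from the actual application: the id and status are structured fields a SQL database handles well, while the comment is raw text, and even a naive word count over it starts to surface signals (repeated mentions of delays, customs, and so on).

```java
import java.util.*;

public class CommentMining {
    // A hypothetical supply-chain record: the id and status fields are
    // structured, but the free-text comment is raw, unstructured data.
    record ShipmentNote(int shipmentId, String status, String comment) {}

    // Naive "mining": count how often each word appears across comments.
    static Map<String, Integer> termFrequencies(List<ShipmentNote> notes) {
        Map<String, Integer> counts = new HashMap<>();
        for (ShipmentNote note : notes) {
            for (String word : note.comment().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<ShipmentNote> notes = List.of(
            new ShipmentNote(1, "DELAYED", "Carrier delayed the truck at customs"),
            new ShipmentNote(2, "DELAYED", "Truck delayed again, customs paperwork missing"));
        Map<String, Integer> tf = termFrequencies(notes);
        System.out.println(tf.get("delayed")); // 2
        System.out.println(tf.get("customs")); // 2
    }
}
```

Real text mining is far more sophisticated, but the point stands: the useful information lives in a field the relational schema treats as an opaque blob.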
While the team was wrestling with this problem, higher management decided that the product should ship earlier than the original delivery date. They also told the engineering team to keep the initial infrastructure cost to a minimum while remaining scalable. The team came up with a cloud-based solution. The deployment, originally planned for dedicated servers in a datacenter, was moved to Azure, which brought down the hardware cost and provided options for scaling. Using open-source Hadoop (based on the MapReduce model Google originally created to solve its own Big Data problem), the team could process and analyze any kind of data, structured or unstructured, saving the cost of expensive BI tools. Hadoop provides a highly scalable distributed system in which increasing capacity simply means adding another server to the existing deployment, and it works seamlessly with NoSQL database management systems. The team used a combination of SQL Server and NoSQL on the Azure platform to store the data, and plans to move the Hadoop setup to Azure once support is available. Lucene was used to index the data and provide search capability; since the system is distributed, Solr (which builds on Lucene) was used to serve the index.
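To show why the MapReduce model scales by just adding servers, here is a sketch of its three phases in plain Java, without the Hadoop runtime: a real cluster runs the map calls on many machines in parallel, shuffles the pairs across the network, and runs the reducers in parallel too. This is the classic word-count example, not the team's actual jobs.

```java
import java.util.*;

public class MapReduceSketch {
    // Map phase: each input split (here, one line of text) is turned into
    // (word, 1) pairs. On a real cluster these calls run in parallel.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle phase: group all pairs by key, so each reducer sees
    // every value emitted for one word.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (var pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the counts for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new HashMap<>();
        grouped.forEach((word, ones) ->
            totals.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) all.addAll(map(line));
        return reduce(shuffle(all));
    }

    public static void main(String[] args) {
        var counts = wordCount(List.of("order shipped late", "order received late"));
        System.out.println(counts.get("order")); // 2
        System.out.println(counts.get("late"));  // 2
    }
}
```

Because the map and reduce steps are independent per split and per key, adding a server to the cluster adds capacity without changing the job itself, which is exactly the scaling property described above.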
The cloud, along with Hadoop and NoSQL, is changing how IT systems work. In coming articles we will cover Lucene/Solr, how SQL Server works with Hadoop, and why NoSQL is becoming more important day by day, based on our own experience building similar systems @ DreamOrbit. Stay tuned.