Yesterday’s blog walked through the ins and outs of a typical communications platform, and what we built our foundation on here at Fuze. When client demand had our data running hot, we knew we needed to explore new database options to grow our business. What came next?
Enter Business Intelligence
Some of us had experience with data warehouses and third-party business intelligence platforms from previous jobs. Although the use cases we knew best mostly involved analysis of sales and order data, we thought we could apply the same technology to reporting on our communications data. We had seen examples of very large data warehouses that could process complex queries quickly, and we were attracted to the idea of jettisoning our in-house data load code and related logic. Why build and maintain custom software when you can buy something that will get the job done?
So we implemented a business intelligence suite whose ETL platform took over the job of collecting, parsing, transforming, and loading the logs into the reporting database. The database itself changed from a standard relational database to a column-oriented database built specifically for data warehousing. The reporting schema was completely redesigned, from a normalized data model to a de-normalized star schema.
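To make the star schema idea concrete, here is a minimal sketch using SQLite from Python. The table and column names (`fact_call`, `dim_user`, `dim_date`, and so on) are invented for illustration, not the actual schema we built; the point is the shape: one wide de-normalized fact table of events, surrounded by small dimension tables joined in only for labels.

```python
import sqlite3

# Hypothetical, simplified star schema for call records: one fact table
# keyed to dimension tables, rather than a normalized transactional model.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, cal_date TEXT);
CREATE TABLE fact_call (
    user_key INTEGER REFERENCES dim_user(user_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    duration_sec INTEGER
);
""")

cur.executemany("INSERT INTO dim_user VALUES (?, ?)",
                [(1, "alice"), (2, "bob")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(20150601, "2015-06-01")])
cur.executemany("INSERT INTO fact_call VALUES (?, ?, ?)",
                [(1, 20150601, 300), (1, 20150601, 120), (2, 20150601, 600)])

# A typical reporting query: aggregate the fact table, joining out to
# the dimensions only for display labels -- exactly the query shape
# star schemas (and column stores) are optimized for.
cur.execute("""
SELECT u.user_name, d.cal_date, SUM(f.duration_sec) AS total_sec
FROM fact_call f
JOIN dim_user u ON u.user_key = f.user_key
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY u.user_name, d.cal_date
ORDER BY u.user_name
""")
print(cur.fetchall())  # [('alice', '2015-06-01', 420), ('bob', '2015-06-01', 600)]
```

Summarizing queries like this scan one numeric column over the fact table, which is where a column-oriented engine gets its large speedups over a row store.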
These changes produced several immediate gains. Replacing our custom code with an ETL platform standardized the load process and greatly improved its stability and reliability. The enterprise-class column-oriented database was a huge step forward in reliability and scalability: unlike the relational database it replaced, its clustered architecture could survive hardware failures and could be scaled out by adding nodes. For complex queries, particularly ones that summarized data, we saw performance improvements of two or more orders of magnitude in some cases. Our custom reporting application was replaced by the reporting services that came with the BI platform, which offered report designers, ad hoc reporting tools, OLAP, dashboards, and more, greatly expanding the scope of the end-user-facing reporting we could offer.
Several problems remained, though. The first was data timeliness: the batch-loaded nature of the log collection and ETL process meant that several hours could pass between an event taking place and its data showing up in the reporting system. This was one of the most visible user complaints: people wanted data to appear in reports right away. Another problem was that, while our data load process was standardized on a single ETL platform, many of the transformations were complex, and because the ETL was implemented entirely in SQL, some transforms required very complicated SQL to get the job done. A final issue was that the system was fairly labor-intensive to change: the single data warehouse offered a system of record for events, but it wasn't always well optimized for particular types of queries or use cases.
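The timeliness problem follows directly from the mechanics of batch loading: an event must wait for its collection window to close, and then for the ETL job over that window to finish, before it is queryable. A small sketch of that arithmetic (the 4-hour window and 1-hour ETL runtime are illustrative assumptions, not our actual schedule):

```python
from datetime import datetime, timedelta

def visible_at(event_time, batch_period_h=4, etl_runtime_h=1):
    """Earliest time a batch-loaded event appears in reports.

    Illustrative only: assumes fixed batch windows aligned to midnight
    and a constant ETL runtime.
    """
    window = timedelta(hours=batch_period_h)
    start_of_day = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    # The event waits for its collection window to close...
    elapsed = event_time - start_of_day
    window_close = start_of_day + (int(elapsed / window) + 1) * window
    # ...then for the ETL job over that window to finish.
    return window_close + timedelta(hours=etl_runtime_h)

evt = datetime(2015, 6, 1, 9, 5)  # event at 09:05
print(visible_at(evt))            # 2015-06-01 13:00:00 -- visible ~4 hours later
```

Shrinking the window or speeding up the ETL job reduces the delay, which is exactly the optimization described in the next section, but the structural floor of "window plus ETL runtime" never goes to zero.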
Business Intelligence Refined
The BI approach described above evolved over time to minimize some of the problems we were hitting. To combat data timeliness, for example, we optimized the load phase of the ETL process as much as we could; by running the batch load as aggressively as possible, we got the data delay down to under an hour in many cases. The other big architectural change, one common to BI implementations, was to create an initial storage point for raw data, a so-called data vault. Raw data from the vault is then transformed into specific data marts, where it is optimized for particular types of queries. This adds a degree of flexibility and allows highly optimized, application-specific data marts to be built from a common data set in the vault.
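The vault-to-mart flow can be sketched in a few lines of SQL, again run here through SQLite from Python. The table and column names are hypothetical: raw events land in a vault table unchanged, and a periodic job projects them into a mart pre-aggregated for one reporting use case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data vault: raw events, stored as received, one row per event.
cur.execute("""CREATE TABLE vault_events (
    event_id INTEGER, user_id TEXT, event_type TEXT,
    event_ts TEXT, duration_sec INTEGER)""")
cur.executemany("INSERT INTO vault_events VALUES (?, ?, ?, ?, ?)", [
    (1, "alice", "call", "2015-06-01T09:00:00", 300),
    (2, "alice", "call", "2015-06-01T10:30:00", 120),
    (3, "bob",   "chat", "2015-06-01T11:00:00", None),
])

# Data mart: daily call minutes per user, pre-aggregated so usage
# reports never have to touch the raw vault table.
cur.execute("""CREATE TABLE mart_daily_call_usage AS
    SELECT user_id,
           DATE(event_ts) AS cal_date,
           SUM(duration_sec) / 60.0 AS call_minutes
    FROM vault_events
    WHERE event_type = 'call'
    GROUP BY user_id, DATE(event_ts)""")

cur.execute("SELECT * FROM mart_daily_call_usage ORDER BY user_id")
print(cur.fetchall())  # [('alice', '2015-06-01', 7.0)]
```

The flexibility comes from the fact that the vault keeps every raw event: a new use case means writing a new mart-building job against the same vault, rather than re-collecting or re-parsing the source logs.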
Some of the problems inherent in the batch load approach can be minimized, but they can't be eliminated. Getting the data delay under an hour, for example, still doesn't satisfy the end-user need for near real-time visibility. The optimized data mart approach buys performance, but at the cost of complexity: ETL jobs that were already complicated only grew more so over time.
How could we make sure that real-time reporting didn’t get lost in the complexity of ETL jobs? Read on in tomorrow’s blog, the last in a three-part series leading up to our session at the Internet of Things Expo in New York City.