In preparing a presentation that I will be giving at the Internet of Things Expo this week in New York, I had an opportunity to reflect on how our backend data platform has evolved over the last 10 years. While the technology landscape in general has changed a great deal over that period, I would argue that there has been even more change and innovation in data science and in thinking about how to collect, organize, and get insights from data. There are a bewildering number of technology vendors, strategies, and use cases in the big data space. Sometimes people ask me: which technologies – Hadoop, data warehouses, stream processing – should I be using to get the most out of my data? To answer this question, I thought it would be useful to describe how both our platform and our thinking about data have evolved over time, in the hopes that others can learn something from our story and potentially avoid some of our mistakes.
First, some context
At Fuze, we provide enterprise communications – voice, video, messaging, and web collaboration – on a cloud platform. End users, on the left-hand side of the diagram above, make calls, send messages, and hold video sessions with colleagues. These services are handled by our cloud-based feature platform. Take calls as an example: each call made on the platform has a set of data associated with it, such as the source number, the destination number, the duration of the call, and so on. As end users make calls, data about those calls is generated on the feature platform.
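A call record like this can be sketched as a simple value class. The field names below are illustrative assumptions for this post, not our actual schema:

```java
// Minimal sketch of a call record: the data generated for each call
// on the feature platform. Field names are hypothetical.
public class CallRecord {
    public final String sourceNumber;       // who placed the call
    public final String destinationNumber;  // who received it
    public final long durationSeconds;      // how long it lasted

    public CallRecord(String sourceNumber, String destinationNumber, long durationSeconds) {
        this.sourceNumber = sourceNumber;
        this.destinationNumber = destinationNumber;
        this.durationSeconds = durationSeconds;
    }
}
```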
When we first started the company, the needs that drove the collection of this call data were highly practical. For example, the on-premises PBXes we were replacing had reporting features that allowed users to look at a history of the calls that had been made. Reports needed to answer questions like, “Show me all the calls that Derek made and received today,” so we needed to at least match this baseline reporting functionality. Beyond customer-facing reporting, we also had billing and invoicing requirements to meet. When you invoice customers for phone calls, there are always some calls that are billed on a per-minute usage basis. Calculating the charge for an individual phone call is a process called rating, and it adds to the complexity of invoicing. But we had to do it from the day we launched our service in order to get paid.
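As a rough sketch of what rating involves, here is a hypothetical per-minute charge calculation. The rounding rule (partial minutes billed as whole minutes) and the cents-based rate are assumptions for illustration, not our actual billing logic:

```java
// Hedged sketch of "rating": turning a call's duration into a usage charge.
public class Rater {
    // Charge in cents: round the duration up to whole minutes,
    // then multiply by the per-minute rate.
    public static long rateCallCents(long durationSeconds, long centsPerMinute) {
        long billedMinutes = (durationSeconds + 59) / 60; // round partial minutes up
        return billedMinutes * centsPerMinute;
    }
}
```

For example, a 61-second call at 3 cents per minute would be billed as two minutes, or 6 cents, under this rounding rule.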
In the beginning…RDBMS
Our first solution to the problem of reporting and invoicing was an RDBMS with custom data load routines. Focusing on calls: as calls are made on the feature platform, the servers handling those calls write call records into log files. We developed custom scripts that ran on a schedule to collect these log files onto centralized servers. On those servers, software we wrote parsed the incoming logs and loaded the call records into a SQL-based reporting database. Part of the load routine looked up external data to decorate call records with data from our master relational database and other data sources: customer, user, and other attributes needed to fully populate the reporting database.
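The parse-and-decorate step of such a load routine might look roughly like the following in Java. The comma-separated log format, the field order, and the lookup of customer names by source number are all assumptions for illustration:

```java
import java.util.Map;

// Illustrative sketch of the load routine: parse a CSV-style call log line,
// then decorate the record with master data before loading it into the
// reporting database.
public class CallLogLoader {
    // Split a log line of the (assumed) form "source,destination,durationSeconds".
    public static String[] parseLine(String line) {
        return line.split(",");
    }

    // Decorate the parsed record with the customer name for the source number,
    // looked up from master data, falling back to "unknown".
    public static String decorate(String[] fields, Map<String, String> customersByNumber) {
        String customer = customersByNumber.getOrDefault(fields[0], "unknown");
        return String.join(",", fields[0], fields[1], fields[2], customer);
    }
}
```

In a real pipeline the decorated record would be inserted into the reporting database rather than returned as a string; this sketch only shows the parse-and-enrich shape of the routine.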
This was not the most sophisticated reporting setup, but it had the benefit of being based on tools and technologies we were very familiar with (Java, MySQL), and it was relatively simple. We used this reporting database to drive simple end-user reporting as well as to generate customer invoices.
Scalability became our main problem. As our business grew, the amount of data being generated on the platform grew as well. Some of the tables in the reporting database got quite large, and queries against them became slower and slower. To make matters worse, our users were pushing us to provide more sophisticated reports, including reports that summarized large data sets over longer periods of time. The queries needed to produce these summaries ran very slowly against our relational reporting database. There were things we could do to optimize, such as adding read-only database slaves and archiving older data, but fundamentally the structure of the data did not lend itself to the queries we were trying to run, and the scale of the data was becoming a problem.
To respond to these issues, our data platform needed to evolve. Find out how we added third-party business intelligence platforms to process complex queries more quickly: tomorrow’s blog has part two in our series on the evolution of a backend data platform.
Attending Internet of Things Expo this week? Stop by my session and explore this topic in more depth.