For an organization to be data-driven, it's not enough to simply amass mountains of data; that data needs to be accurate and meaningful. Julianna Göbölös-Szabó, data engineer at Prezi, shared at devopsdays Amsterdam 2015 how her team worked with the rest of the company to improve the quality of its data and thus deliver better business insights. Their solution involved moving from unstructured to structured data with a lightweight, contract-based approach to nudge all teams in the right direction.
Prezi faced several challenges when it decided to tackle the data quality problem. Prezi was logging hundreds of gigabytes per day, but the logs were difficult to process and turn into useful information. The focus was mostly on monitoring systems and infrastructure health, not on getting information relevant to the business, so the tools were unfit for the job. The engineering teams didn't see data quality as a high priority, and responsibility for it was spread across multiple teams.
Göbölös-Szabó's team found that the keys to tackling those challenges were clear ownership, lowering the effort required to embed data quality into the products, and substantial investment in communication and collaboration.
The most common way to store clickstream data was as unstructured logs. This approach, although unwieldy, worked when Prezi was smaller and its workforce was composed mostly of engineers. They could code their way out of the problem with the help of lots of regular expressions and some scripting. But Prezi reached a point where it had 150 people without an engineering background and logged around 400GB of data per day. The "unstructured logs with lots of scripting" approach did not scale: Prezi had a lot of data but struggled to derive business insights from it, as the barrier to effective data usage was very high.
Göbölös-Szabó shared a few stories to illustrate the challenges her team faced in bringing data quality to the product teams' attention. On one occasion, there was a need to find out how often videos embedded in presentations - Prezi sells presentation software - were played, only to discover that this information was missing from all the logged data. On another occasion, a log call was accidentally removed from the code, leading to three days of data loss; fortunately for Prezi, it happened on a lower-traffic weekend. On data ownership, Göbölös-Szabó gave the example of location data. Who should own it? The web team, which gathers that information; the marketing team, which spends money based on its accuracy; or the data team, which processes it?
Prezi's data team turned to both technical and non-technical solutions. On the technical side, they moved away from unstructured to structured, JSON-based logs. Prezi chose JSON because it is both human- and machine-readable, making it self-documenting. Having a known, structured schema also makes it easier to build tooling around the logs. Göbölös-Szabó's team also built a living documentation process to enable collaboration. The documentation acts as the single source of truth: each team that needs to log some data documents the log schema, which then acts as a contract. Production logs are validated against what is documented to spot inconsistencies, ensuring that the documentation is kept up to date. This single source of truth also enabled dashboards to be created with much less effort.
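To make the contract idea concrete, here is a minimal sketch of what a documented schema and a matching production logline could look like. The event, field names and schema format are hypothetical, and the jsonschema library is used simply as one way to perform the check, not necessarily Prezi's choice.

```python
import json

import jsonschema  # one possible validation library, not necessarily Prezi's choice

# Hypothetical documented schema for one event type - this is the "contract".
video_play_schema = {
    "type": "object",
    "properties": {
        "action": {"enum": ["play"]},
        "object": {"enum": ["video"]},
        "trigger": {"enum": ["click"]},
        "presentation_id": {"type": "integer"},
        "duration_ms": {"type": "integer"},
    },
    "required": ["action", "object", "trigger", "presentation_id"],
}

# A structured logline as it might appear in production.
logline = ('{"action": "play", "object": "video", "trigger": "click", '
           '"presentation_id": 42, "duration_ms": 9500}')

# Validating production lines against the documented schema surfaces
# contract violations, e.g. a string where an integer was promised.
jsonschema.validate(json.loads(logline), video_play_schema)
```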
InfoQ reached out to Göbölös-Szabó to learn more about the process.
InfoQ: During your talk you mentioned that you made it easy to create Logster parsing scripts. How did you do it?
Göbölös-Szabó: When we introduced structured logging, it was not only about using JSON format instead of plain text; we also introduced some mandatory structure. Each logline must contain three mandatory fields: action, object and trigger. Together these must uniquely define one event on a platform. Of course you can have any other field in your log, but this is the minimum required. Our tool focuses on these three fields and we do two things with them:
- We count how many loglines arrived with each possible triplet. It shows if your event actually happened or not. This helps you to see if you've broken a feature with your last release - but doesn't help in ensuring data quality. It's still useful for the developers, though.
- Based on the logline's action-object-trigger triplet, we identify the logline's definition and validate it against its schema. This is the data quality check. It often turns out that after a release we find inconsistencies. For instance, a team "promised" an integer but is instead logging a string (e.g., "NULL").
To add a Logster rule, we have to create a Python module, which in our case is very generic. We know the log category, so we query all the log schemas of the category and, for each line, we try to find the matching schema. Basically, we provide a template in which only the log category needs to be replaced. Of course, if someone wants to have their own rules and metrics, they can append them to the respective Python file. We just give a default Logster parser that is enough for general use cases, and the teams are free to do anything with it.
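A rough sketch of what such a generic parser template might look like, assuming Logster's standard LogsterParser/MetricObject interface; the schema-loading helper, metric names and triplet keying below are illustrative assumptions, not Prezi's actual implementation.

```python
import json
from collections import defaultdict

import jsonschema
from logster.logster_helper import LogsterParser, MetricObject


def load_schemas_for_category(category):
    """Placeholder: fetch the documented schemas for one log category from
    the central documentation, keyed by "action.object.trigger" triplet."""
    return {}


class StructuredLogParser(LogsterParser):
    """Generic parser: count action-object-trigger triplets and validate
    each logline against its documented schema."""

    LOG_CATEGORY = "android"  # the only piece a team would normally change

    def __init__(self, option_string=None):
        self.schemas = load_schemas_for_category(self.LOG_CATEGORY)
        self.triplet_counts = defaultdict(int)
        self.valid_counts = defaultdict(int)
        self.invalid_lines = 0

    def parse_line(self, line):
        try:
            event = json.loads(line)
        except ValueError:
            self.invalid_lines += 1  # not JSON at all; should never happen
            return
        triplet = ".".join(str(event.get(k, "unknown"))
                           for k in ("action", "object", "trigger"))
        self.triplet_counts[triplet] += 1
        schema = self.schemas.get(triplet)
        if schema is None:
            return  # undocumented event; counted but never marked valid
        try:
            jsonschema.validate(event, schema)
            self.valid_counts[triplet] += 1
        except jsonschema.ValidationError:
            pass  # code/contract inconsistency, visible on the dashboards

    def get_state(self, duration):
        metrics = [MetricObject("events.invalid_json", self.invalid_lines)]
        for triplet, count in self.triplet_counts.items():
            metrics.append(MetricObject("events.%s.count" % triplet, count))
            metrics.append(
                MetricObject("events.%s.valid" % triplet, self.valid_counts[triplet]))
        return metrics
```

The appeal of keeping the parser this generic is that teams only have to document their schemas; the triplet counting and the validation then come essentially for free.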
InfoQ: How do you find the inconsistencies between the documentation/contract and what the code actually does? When do you perform that validation?
Göbölös-Szabó: We only find inconsistencies once the feature is deployed somewhere (e.g., pre-production) and starts sending loglines. As we count action-object-trigger triplets both with and without validation, it's easy to see if there is a problem with the code that sends the logline.
At the beginning of the project we considered integrating checks into Jenkins, or not accepting invalid loglines (e.g., not shipping them to S3, where we store them permanently). However, we wanted to build a tool that helps the teams and that they are happy to use. If we built something that just put more frustration on their shoulders, they would stop using it, ignore it, and our efforts would be wasted.
So we decided that we shouldn't build a bulletproof data quality checking system, but rather a tool that empowers and encourages teams to be more conscious about log quality.
A default dashboard. At the center, an example of a metric that may have an inconsistency between code and schema.
InfoQ: Are your dashboards built on top of a custom-built tool or are you using publicly available tools?
Göbölös-Szabó: Logster natively feeds metrics to Graphite, which is what Prezi has been using for years. We relied on this system because most of the developers are already familiar with it.
For dashboards we are using Grafana. It gets the data from Graphite but has a much nicer UI and better UX, and you can describe your dashboard with a JSON file - which is easy to generate from all the information in our central log schema documentation.
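As an illustration of how such dashboards could be generated, the sketch below builds a minimal Grafana dashboard definition from a list of documented triplets. The metric paths and the exact JSON layout are assumptions (the dashboard format varies between Grafana versions), not Prezi's actual generator.

```python
import json


def dashboard_for_category(category, triplets):
    """Build a minimal Grafana dashboard definition with one graph panel per
    documented action-object-trigger triplet, reading from Graphite."""
    panels = [
        {
            "type": "graph",
            "title": "%s valid loglines" % triplet,
            # Graphite metric path; illustrative naming, not Prezi's scheme.
            "targets": [{"target": "events.%s.valid" % triplet}],
        }
        for triplet in triplets
    ]
    return {
        "title": "%s log quality" % category,
        "rows": [{"title": category, "panels": panels}],
    }


# Triplets would come from the central schema documentation; these are made up.
print(json.dumps(
    dashboard_for_category("android", ["play.video.click", "share.prezi.click"]),
    indent=2))
```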
We use Amazon Redshift as a data warehouse. Once the logs are processed and cleaned up, they go to Redshift and users can create charts (or tables, funnels, anything) with Chart.io. Right now we need a few preprocessing steps before data gets loaded into Redshift; with high-quality structured logs we can eliminate these steps, and that's the ultimate goal.
InfoQ: What kind of information do you display in those dashboards?
Göbölös-Szabó: We can work with the metrics that are generated by Logster, so we can show:
- action-object-trigger counts
- number of valid loglines per schema
- a very general number of how many loglines we processed and how many were thrown away (i.e., the line was not JSON; this should never happen).
- distributions for "enum" fields (it's possible to define enums in the schemas; an illustrative schema fragment follows this list).
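As an aside, a hypothetical schema fragment with such an enum field might look like the following; the field names and values are invented for illustration.

```python
# Hypothetical schema with an "enum" field; the distribution of its values
# ("channel" here) is what such a chart would show.
share_schema = {
    "type": "object",
    "properties": {
        "action": {"enum": ["share"]},
        "object": {"enum": ["prezi"]},
        "trigger": {"enum": ["click"]},
        "channel": {"enum": ["email", "facebook", "twitter", "link"]},
    },
    "required": ["action", "object", "trigger", "channel"],
}
```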
By default we provide a dashboard that shows one type of chart from the four above for all of your log definitions in a certain log category. For example, we have a huge dashboard for Android logs with circa 50 charts showing valid lines for each log definition.
This information is mainly useful for engineers. As we count very basic things - e.g., how many times action A happened, or how many loglines match schema S - it is not very useful to the business in this raw form. For instance, this data often contains some test users that must be filtered out. As another example, we don't create funnels on the fly, which would give great insights to business people. However, we have a dedicated team that delivers metrics and numbers for stakeholders, and having high-quality logs is crucial for them.