How to Metadata Your Data for Faster Analysis


Building a better product is important for any startup, and building one that scales well after you get your initial traction is often the difference between success and failure. Scaling becomes especially hard when you are bottlenecked on your ability to analyze the data that flows into your system.

  • Bringing more engineers onto a team often means more features, but it also means longer timelines for writing code, because there are simply more people involved. If this slowdown hits at the wrong time, it can mean missing out on opportunities or losing market share to competitors.
  • The best way I have found so far to combat this issue is to spend time upfront sharing knowledge about what data flows in through our APIs and how it can be used, so new members of the dev team can quickly pull down particular subsets of data without depending on others for guidance or context about what is available in our system.
  • Additionally, with this approach, new engineers on the team can hit the ground running when it comes to productionizing their code by knowing exactly which subsets of data they need to pull down in order to do an analysis, without having to waste time figuring out what is available or how it is formatted. 
  • For example, if you are using Redshift or Postgres, export a CSV file for each table that your application writes records to (Redshift's UNLOAD command or Postgres's COPY ... TO ... WITH CSV both work for this). Then have someone who knows each table's structure write a small program that parses all of these CSV files and generates a metadata index JSON blob per sampled table. The index uses the metadata from the CSV files and outputs a set of key-value pairs for each column in the sampled table (see the first sketch after this list).
  • Once this is complete, you can ship these JSON blobs as part of an ETL process, or even push them up to S3 or DynamoDB for other engineers on the team to import directly into their own Redshift/Postgres databases (the second sketch below shows a simple push to S3).
  • The goal here is not to copy large volumes of data around, but to give new users on the team faster access to the subsets of data they need, without having to bother others about how it is formatted or structured. This has worked well for us so far, and we have been able to achieve 10X speed increases when analyzing our data compared to when we first started the company.
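
Below is a minimal sketch, in Python, of the kind of small program described above: it samples each exported CSV and writes one metadata index JSON blob per table. The `exports/` directory, the sampling depth, and the exact shape of the blob are assumptions for illustration; adjust them to match your own export layout.

```python
import csv
import json
from pathlib import Path

def build_metadata_index(csv_path, sample_rows=100, samples_per_column=3):
    """Build a metadata index for one exported table: one entry per column,
    with a few sampled values so readers can see what the data looks like."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        samples = {col: [] for col in columns}
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            for col in columns:
                if len(samples[col]) < samples_per_column and row.get(col):
                    samples[col].append(row[col])
    return {
        "table": Path(csv_path).stem,
        "columns": {col: {"sample_values": samples[col]} for col in columns},
    }

if __name__ == "__main__":
    # One JSON blob per exported CSV, written alongside the CSV itself.
    for csv_file in Path("exports").glob("*.csv"):
        index = build_metadata_index(csv_file)
        out_path = csv_file.parent / (csv_file.stem + ".metadata.json")
        out_path.write_text(json.dumps(index, indent=2))
        print(f"wrote {out_path}")
```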
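
And a companion sketch for pushing the resulting blobs up to S3 with boto3 so teammates can pull them down; the bucket name and key prefix are placeholders, and it assumes AWS credentials are already configured.

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET = "example-metadata-bucket"  # hypothetical bucket name

for blob_path in Path("exports").glob("*.metadata.json"):
    # One object per table's metadata index, under a common prefix.
    key = f"table-metadata/{blob_path.name}"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=blob_path.read_bytes(),
        ContentType="application/json",
    )
    print(f"uploaded s3://{BUCKET}/{key}")
```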

Hopefully, this article series helps other teams achieve similar results, which are often hard to come by, especially in challenging startup environments. In future posts we will share more of what we have learned about building systems that scale efficiently. Stay tuned!

FAQs: 

What formats are supported? 

Currently, only CSV is supported.

Can the tool import data directly from DynamoDB or S3?

Yes, with some caveats.

What happens when you reference a column in your JSON blob that does not exist in the table? 

The importer will throw an error and stop importing any further.
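
As an illustration only (the post does not show the importer's internals, so the names here are hypothetical), the check might look something like this:

```python
class MissingColumnError(Exception):
    """Raised when the metadata blob references a column the table does not have."""

def validate_blob_against_table(blob_columns, table_columns):
    # blob_columns: column names from the JSON metadata blob.
    # table_columns: column names actually present in the target table.
    for column in blob_columns:
        if column not in table_columns:
            # Fail fast: stop the import rather than loading partial data.
            raise MissingColumnError(
                f"metadata references column {column!r}, which does not exist in the table"
            )
```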

Do you validate that all of our CSV columns match up with one of our defined JSON blob keys? 

No, this is not currently supported.

Can you import data with fewer than two columns in the CSV file?

Yes, but the JSON key names will be assigned by us with specific semantics, so it may be hard to tell what each column represents later on if you no longer have the original metadata that our importer included when it wrote out your initial blob.

How does this compare to Cascalog/Scalding/Impala etc.?  

This differs from most ETL solutions because it operates at a lower level and builds up a schema based on the contents of your dataset, as opposed to expecting you to provide one upfront. It also supports exporting metadata for more than just Redshift and Postgres. With that said, it is different, but not necessarily better.
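
For a sense of what content-based schema inference looks like, here is a rough sketch; the coarse type names and the parse-everything approach are assumptions, not the tool's actual rules.

```python
def infer_type(values):
    """Guess a coarse column type from a list of sampled string values."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "text"

# Attach an inferred type to each column, using samples like those
# collected in the metadata index sketch earlier in the post.
columns_to_samples = {"user_id": ["1", "2", "3"], "email": ["a@example.com"]}
schema = {col: infer_type(samples) for col, samples in columns_to_samples.items()}
print(schema)  # {'user_id': 'integer', 'email': 'text'}
```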

What are some of the current limitations? 

Currently, this is very much a work in progress, so there are several limitations, including:

  • Only works on CSV files for now
  • Writing out large datasets (> 500MB) results in very long run times due to a lack of file caching/buffering
  • No support for nested data types (e.g. JSON inside of JSON)
  • No column name validation

There are also some additional limitations tied to our use cases at Buffer: Redshift's COPY only loads columns whose names match an existing column in the target table.

Conclusion: 

We have been using this tool to ingest our data at scale, across a growing number of AWS instances and into a single Redshift cluster. We have found that having a reusable standard for how we name columns within our JSON files lets engineers with little experience of our current infrastructure jump in and start sending us data without much further instruction. This post only scratches the surface: there is still room for improvement as we continue to use the tool internally at Buffer and gather more feedback and requests from the community. If you find any bugs or would like to share additional thoughts, feel free to reach out or submit pull requests via GitHub.

