Key Takeaways
- Data is the backbone of many SaaS-based services today.
- With the dynamic nature of data and cloud services, data transformation is a common need due to changing engineering requirements.
- Data transformation remains a continuous challenge in engineering and is still largely built on manual toil.
- There is currently a lack of tools for performing data transformations programmatically, automatically, and safely.
- The open source utility Dynamo Data Transform was built to simplify data transformation for DynamoDB-based systems and to add safety and guardrails. It grew out of a robust manual framework that was then automated and open sourced.
When designing a product to be a self-serve developer tool, there are often constraints - and likely the most common one is scale. Ensuring our product, Jit - a security-as-code SaaS platform - was built for scale was not something we could bolt on as an afterthought; it needed to be designed and handled from the very first line of code.
We wanted to focus on developing our application and its user experience, without scale issues becoming a constant struggle for our engineers. After researching the infrastructure that would enable this for our team, we decided on AWS with a serverless-based architecture.
AWS Lambda is an increasingly popular choice for fast-growing SaaS systems, as it provides many scale and performance benefits out of the box, along with the database that typically backs these systems, AWS’s DynamoDB.
One of DynamoDB’s key benefits is that it is already part of the AWS ecosystem, which abstracts away many operational management and maintenance tasks, such as maintaining connections with the database, and it requires minimal setup to get started in AWS environments.
As a fast-growing SaaS operation, we need to evolve quickly based on user and customer feedback and embed this within our product. Many of these changes in application design have a direct impact on data structures and schemas.
With rapid and often significant changes in application design and architecture, we found ourselves needing to perform data transformations in DynamoDB quite frequently, and with existing users, it was a priority to achieve this with zero downtime. (In the context of this article, data transformation refers to modifying data from state A to state B.)
Challenges with Data Transformation
In the spirit of Brendon Moreno from the UFC:
Maybe not today, maybe not tomorrow, and maybe not next month, but only one thing is true, you will need to make data transformations one day, I promise.
Yet, while data transformation is a known constant in engineering and data engineering, it remains a pain point and a challenge to do seamlessly. Surprisingly enough, there is currently no easy way to do it programmatically in DynamoDB in a managed way.
There are many forms of data transformation, from replacing an existing item’s primary key to adding or removing attributes, updating existing indexes, and the list goes on (these are just a few examples). Yet there remains no simple way to perform any of them in a managed and reproducible manner without resorting to brittle, one-off scripting.
User Table Data Transform Example
Below, we are going to dive into a real-world example of a data transformation process with production data.
Let’s take the example of splitting a “full name” field into its components, “first name” and “last name”. As you can see in the example below, the data aggregation currently writes names to the table in a “full name” attribute. Now let’s say we want to split this field into separate first name and last name fields.
Before

| FullName |
| -------- |
| Guy Br   |

After

| FullName | FirstName | LastName |
| -------- | --------- | -------- |
| Guy Br   | Guy       | Br       |
Looks easy, right? Not so fast. To achieve even this simple change, the following steps need to be performed on the business logic side in order to successfully transform the data.
- Scanning the user records
- Extracting the FullName attribute from each record
- Splitting the FullName attribute into new FirstName and LastName attributes
- Saving the new records
- Cleaning up the FullName attribute
But let's discuss some of the issues you would need to take into account before you even get started, such as: how do you run and manage these transformations across different application environments, particularly when direct access to each environment is not considered a security best practice? In addition, you need to think about service dependencies. For example, what should you do when another service depends on this specific data format? Your service needs to remain backward compatible and still provide the same interface to the external services relying on it.
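One common way to keep the read path backward compatible during such a transition is to fall back to the old attribute when the new ones are not yet present. The sketch below is illustrative only and not part of the original process; the getUser helper and the UserId key schema are assumptions made for the FullName example used in this article.
// Illustrative sketch: read users in a backward-compatible way while the
// transformation is in flight. Falls back to splitting FullName when the
// new FirstName/LastName attributes do not exist yet.
async function getUser(userId) {
  const { Item } = await dynamodbClient.get({
    TableName: 'Users',
    Key: { UserId: userId }, // assumed key schema for this example
  }).promise();

  if (!Item) return undefined;

  if (Item.FirstName && Item.LastName) {
    return Item; // already in the new format
  }

  // Old format: derive the new attributes on the fly
  const [FirstName, ...rest] = (Item.FullName || '').split(' ');
  return { ...Item, FirstName, LastName: rest.join(' ') };
}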
When you have production clients, possibly the most critical question you need to ask yourself before you modify a single line of code is: how do you ensure that zero downtime is maintained?
Much of what you’d need to plan for to avoid downtime revolves around testing and verification. How do you even test your data transformation script? What are some good practices for running a reliable dry run of a data transformation on production data?
There are so many things to consider before transforming data.
Now consider that this is usually, for the most part, done manually: an error-prone, tedious process. To avoid this, we understood we’d need to define a fine-grained process that would prevent mistakes, help us manage all of these steps, and tackle the challenges above.
The Rewrite Process
Figure 1: Rewrite Process Flow Chart
First, we adjusted the backend code to write the new data format to the database while still keeping the old one - writing FullName, FirstName, and LastName together - to give us some reassurance of backward compatibility. This would let us revert to the previous format if something went terribly wrong.
async function createUser(item) {
  // FullName = 'Guy Br'
  // 'Guy Br'.split(' ') === ['Guy', 'Br']
  // Just for the example assume that the FullName has one space between first and last name
  const [FirstName, LastName] = item.FullName.split(' ');
  const newItemFormat = { ...item, FirstName, LastName };
  return dynamodbClient.put({
    TableName: 'Users',
    Item: newItemFormat,
  }).promise();
}
Next, we wrote a data transformation script that scans the old records and appends the FirstName and LastName attributes to each of them; see the example below:
async function appendFirstAndLastNameTransformation() {
  let lastEvalKey;
  let scannedAllItems = false;
  while (!scannedAllItems) {
    // Pass the last evaluated key so each iteration continues from the previous page
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey,
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      // Same split logic used in createUser above
      const [FirstName, LastName] = splitFullNameIntoFirstAndLast(item.FullName);
      const newItemFormat = { ...item, FirstName, LastName };
      return newItemFormat;
    });

    await Promise.all(updatedItems.map(async (item) => {
      return dynamodbClient.put({
        TableName: 'Users',
        Item: item,
      }).promise();
    }));

    scannedAllItems = !lastEvalKey;
  }
}
After writing the actual script (which is the easy part), we needed to verify that it actually does what it’s supposed to do. The next step was to run the script on a test environment and make sure it works as expected. Only after the script’s behavior is confirmed can it be run on the application environments.
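One simple way to rehearse such a script safely, before the dedicated tooling described later in this article, is to add a dry-run flag that logs the intended writes instead of performing them. The following is a minimal sketch, not part of the original process; the isDryRun parameter and the logging behavior are assumptions.
// Illustrative sketch: the same transformation with a dry-run guard.
// When isDryRun is true, the intended writes are only logged, never executed.
async function appendFirstAndLastNameDryRunable({ isDryRun = true } = {}) {
  let lastEvalKey;
  let scannedAllItems = false;
  while (!scannedAllItems) {
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey,
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      const [FirstName, LastName] = splitFullNameIntoFirstAndLast(item.FullName);
      return { ...item, FirstName, LastName };
    });

    if (isDryRun) {
      // Show what would be written without touching the table
      console.log('Dry run - would write:', JSON.stringify(updatedItems, null, 2));
    } else {
      await Promise.all(updatedItems.map((item) =>
        dynamodbClient.put({ TableName: 'Users', Item: item }).promise()));
    }

    scannedAllItems = !lastEvalKey;
  }
}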
The last phase is the cleanup: taking the plunge and deleting the FullName attribute entirely from our records. This purges the old data format, which is no longer used, reducing clutter and preventing any future misuse of that format.
async function cleanup() {
  let lastEvalKey;
  let scannedAllItems = false;
  while (!scannedAllItems) {
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey, // continue from the previous page
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      // Remove the old attribute now that FirstName/LastName are in place
      delete item.FullName;
      return item;
    });

    await Promise.all(updatedItems.map(async (item) => {
      return dynamodbClient.put({
        TableName: 'Users',
        Item: item,
      }).promise();
    }));

    scannedAllItems = !lastEvalKey;
  }
}
Let’s quickly recap what we have done in this process:
- Adjusted the backend code to write in the new data format
- Created a data transformation script that updates each record
- Validated that script against a testing environment
- Ran the script on the application environments
- Cleaned up the old data
This well-defined process helped us build much-needed safety and guardrails into our data transformations. As mentioned above, it allowed us to avoid downtime by keeping the old record format until it was no longer needed, and it provided a good basis and framework for more complex data transformations.
Transforming Existing Global Secondary Index (GSI) using an External Resource
Now that we have a process, let’s be honest: real-world data transformations are rarely so simple. Let’s assume a more likely scenario, in which the data is actually ingested from an external resource, such as the GitHub API, and the more advanced transformation requires us to combine data from multiple sources.
Let’s take a look at the example below for how this could work.
In the following table, the GSI partition key is by GithubUserId.
For the sake of this data transformation example, we want to add a “GithubUsername” column to our existing table.
Before

| GithubUserId |
| ------------ |
| ...          |

After

| GithubUserId | GithubUsername |
| ------------ | -------------- |
| ...          | ...            |
This data transformation seems as straightforward as the full name example, but there is a little twist.
How can we get the GitHub username if we don’t have this information? We have to use an external resource - in this case, the GitHub API.
GitHub has a simple API for retrieving this data (you can read the documentation here). We pass the GithubUserId and get back information about the user, which contains the username field we want.
https://api.github.com/user/:id
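For context, the response body of this request contains a login field holding the username. The snippet below only illustrates the shape relied on later in the script; the values are placeholders and most fields are omitted.
// Illustrative shape of the GitHub API response body (values are examples only)
const githubUserResponseBody = {
  login: 'octocat', // the username we want
  id: 583231,       // the numeric GithubUserId we queried by
  type: 'User',
  // ...other fields omitted
};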
The naive flow is similar to the full name example above:
- Adjust our code to write in the new data format.
- Assume that we have the Github username when creating a user.
- Scan the user records (get `GithubUsername` by `GithubUserId` for each record using Github API), and update the record.
- Run that script on the testing environment
- Run it on the application environments
However, in contrast to our previous flow, this naive flow is not safe enough. What happens if something goes wrong while the data transformation is calling the external resource? Perhaps the external resource crashes, blocks your IP, or is simply unavailable for some other reason. In that case, you might end up with production errors, a partial transformation, or other issues with your production data.
What can we do on our end to make this process safer?
You can always resume the script if an error occurs, or try to handle errors in the script itself; however, it is important to be able to perform a dry run with the prepared data from the external resource before running the script on production. A good way to provide greater safety is to prepare the data in advance.
Below is the design of the safer flow:
- Adjust our code to write in the new data format (create a user with GithubUsername field)
- Create the preparation data for the transformation
Only after we do this do we scan the user records, get the GithubUsername for each of them using the GitHub API, append it to a JSON object `{ [GithubUserId]: GithubUsername }`, and then write that JSON to a file.
This is what such a flow would look like:
// fetch is available globally in Node 18+; in older versions, use a library such as node-fetch
const fs = require('fs/promises');

async function prepareGithubUsernamesData() {
  let preparationData = {};
  let lastEvalKey;
  let scannedAllItems = false;
  while (!scannedAllItems) {
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey, // continue from the previous page
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const currentIdNameMappings = await Promise.all(Items.map(async (item) => {
      const githubUserId = item.GithubUserId;
      const response = await fetch(`https://api.github.com/user/${githubUserId}`, { method: 'GET' });
      const githubUserResponseBody = await response.json();
      const GithubUsername = githubUserResponseBody.login;
      return { [item.GithubUserId]: GithubUsername };
    }));

    currentIdNameMappings.forEach((mapping) => {
      // append the current mapping to the preparationData object
      preparationData = { ...preparationData, ...mapping };
    });

    scannedAllItems = !lastEvalKey;
  }

  await fs.writeFile('preparation-data.json', JSON.stringify(preparationData));
}
Next, we scan the user records, look up the GithubUsername for each record by its GithubUserId in the preparation data, and update the record.
async function appendGithubUsername() {
  // Load the prepared GithubUserId -> GithubUsername mapping ('fs' is the fs/promises module required above)
  const preparationData = JSON.parse(await fs.readFile('preparation-data.json', 'utf8'));

  let lastEvalKey;
  let scannedAllItems = false;
  while (!scannedAllItems) {
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey, // continue from the previous page
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      const GithubUsername = preparationData[item.GithubUserId];
      // Only update items that have a prepared username
      const updatedItem = GithubUsername ? { ...item, GithubUsername } : item;
      return updatedItem;
    });

    await Promise.all(updatedItems.map(async (item) => {
      return dynamodbClient.put({
        TableName: 'Users',
        Item: item,
      }).promise();
    }));

    scannedAllItems = !lastEvalKey;
  }
}
And finally, like the previous process, we wrap up by running the script on the testing environment, and then the application environments.
Dynamo Data Transform
Once we had built a robust data transformation process that we could trust, we understood that the best way to do away with human toil, and ultimately error, would be to automate it.
We realized that even if this works for us today at our smaller scale, manual processes will not grow with us. This isn’t a practical long-term solution and would eventually break as our organization scales. That is why we decided to build a tool that would help us automate and simplify this process so that data transformation would no longer be a scary and painful process in the growth and evolution of our product.
Applying automation with open source tooling
Every data transformation is just a piece of code that performs a specific change in our database, and these scripts should ultimately live in your codebase.
This enables us to do a few important operations:
- Track the changes in the database and know its history at every moment, which helps when investigating bugs and issues.
- No need to reinvent the wheel - reusing data transformation scripts already written in your organization streamlines the process.
By automating data transformation processes, you essentially make it possible for every developer to be a data transformer. You likely should not give production access to every developer in your organization, but applying the change is only the last mile. When only a handful of people have access to production, automation leaves them with validating the scripts and running them, rather than also doing all of the heavy lifting of writing the scripts, which consumes more time than necessary and is not safe.
When the scripts are in your codebase and their execution is automated via CI/CD pipelines, other developers can review them, and essentially anyone can perform data transformations in all environments, alleviating bottlenecks.
Now that we understand the importance of having the scripts managed in our codebase, we want to create the best experience for every data-transforming developer.
Making every developer a data transformer
Every developer prefers to focus on their business logic, with as few context switches and disruptions as possible. This tool helps keep them focused on that business logic, so they don’t have to start from scratch every time they need a data transformation to support their current task.
For example, dynamo-data-transform provides the following benefits:
- Exported utility functions that are useful for most data transformations
- Managing the versioning of the data transformation scripts
- Supporting dry runs to easily test the data transformation scripts
- Rollback in the event the transformation goes wrong - without it, reverting to the previous state is not easy
- Usage via CLI––for dev friendliness and to remain within developer workflows. You can run the scripts with simple commands like `dynamodt up`, `dynamodt down` for rollback, `dynamodt history` to show which commands were executed.
Dynamo Data Transform:
Quick Installation for serverless:
The package can also be used as a standalone npm package; see here.
To get started with DynamoDT, first run:
npm install dynamo-data-transform --save-dev
This installs the package through npm (you can also install it via…).
Next, add the tool to your serverless.yml. Run:
npx sls plugin install -n dynamo-data-transform
You also have the option of adding it manually to your serverless.yml:
plugins:
- dynamo-data-transform
You can also run the command:
sls dynamodt --help
to see all of the capabilities that DynamoDT supports.
Let’s get started by running an example with DynamoDT. We’ll start by selecting one of the code samples in the repo; for the sake of this walkthrough we’ll use `v3_insert_users.js`, but you are welcome to test it out using any of the examples you’ll find here.
We’ll initialize the data transformation folder with the relevant tables by running the command:
npx sls dynamodt init --stage local
For serverless (it generates the folders using the resources section in the serverless.yml):
resources:
Resources:
UsersExampleTable:
Type: AWS::DynamoDB::Table
Properties:
TableName: UsersExample
The section above should be in serverless.yml
The data-transformations folder is generated with a template script that can be found here.
We will start by replacing the code in the template file v1_script-name.js with:
const { utils } = require('dynamo-data-transform');

const TABLE_NAME = 'UsersExample';

/**
 * The tool supplies the following parameters:
 * @param {DynamoDBDocumentClient} ddb - the DynamoDB document client https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-dynamodb
 * @param {boolean} isDryRun - true if this is a dry run
 */
const transformUp = async ({ ddb, isDryRun }) => {
  const addFirstAndLastName = (item) => {
    // Just for the example:
    // Assume the FullName has one space between first and last name
    const [firstName, ...lastName] = item.name.split(' ');
    return {
      ...item,
      firstName,
      lastName: lastName.join(' '),
    };
  };
  return utils.transformItems(ddb, TABLE_NAME, addFirstAndLastName, isDryRun);
};

module.exports = {
  transformUp,
  transformationNumber: 1,
};
For most regular data transformations, you can use the util functions from the dynamo-data-transform package. You also don’t need to manage the versions of the data transformation scripts; the package does that work for you. Once you’ve customized the data you want to transform, you can test the script using the dry run option by running:
npx sls dynamodt up --stage local --dry
The dry run option prints the records in your console so you can immediately see the results of the script, and ensure there is no data breakage or any other issues.
Once you’re happy with the test results, you can remove the --dry flag and run it again. This time it will run the script on your actual data, so make sure to validate the results and outcome.
Once you have created your data transformation files, the next logical step is to add this to your CI/CD. To do so, add the command to your workflow/CI file for production environments.
The command will run immediately after the `sls deploy` command, which is useful for serverless applications.
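For illustration, a deploy step in a CI pipeline could look roughly like the following; the stage name is an assumption, and the exact workflow syntax depends on your CI system.
# Run the data transformations right after deploying the service
sls deploy --stage production
npx sls dynamodt up --stage production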
Finally, all of this is saved, as noted above, so if you want to see the history of the data transformations, you can run:
`npx sls dynamodt history --table UsersExample --stage local`
The tool also provides an interactive CLI for those who prefer to work that way, and all of the commands above are supported there as well.
With Dynamo Data Transform, you get the added benefits of being able to version and order your data transformation operations and manage them in a single place. You also have the history of your data transformation operations if you would like to roll back an operation. And last but not least, you can reuse and review your previous data transformations.
We have open-sourced the Dynamo Data Transform tool that we built for internal use to perform data transformations on DynamoDB and serverless-based environments and manage these formerly manual processes in a safe way.
The tool can be used as a Serverless Plugin and as a standalone NPM package.
Feel free to provide feedback and contribute to the project if you find it useful.
Figure 2: Data Transformation Flow Chart