Key Takeaways
- Federated learning can be used to pursue advanced machine learning models while still keeping data in the hands of data owners.
- Through federated learning, the data can be kept in the hands of financial institutions and the intellectual property of data scientists can also be preserved.
- PySyft, an open-source library created by OpenMined, enables private AI by combining federated learning with two other key concepts: Secure Multi-Party Computation (SMPC) and Differential Privacy.
- Learn how to use federated learning in a financial services use case such as loan risk prediction.
- In the MNIST example discussed here, a federated CNN reaches 95.45% accuracy, within a few points of centrally trained models, while the data stays with its owners.
A model is only as strong as the data it’s provided, but what happens when data isn’t readily accessible or contains personally identifying information? In this case, can data owners and data scientists work together to create models on privatized data? Federated learning shows that it is indeed possible to pursue advanced models while still keeping data in the hands of data owners.
This new technology is readily applicable to financial services, as banks have extremely sensitive information ranging from transaction history to demographic information for customers. In general, it’s very risky to give data to a third party to perform analytical tasks. However, through federated learning, the data can be kept in the hands of financial institutions and the intellectual property of data scientists can also be preserved. In this article, we will demystify the technology of federated learning and touch upon one of the many use cases in finance: loan risk prediction.
What is Federated Learning?
Federated Learning, in short, is a method to train machine learning (ML) models securely via decentralization. That is, instead of aggregating all the data necessary to train a model, the model is instead sent to each individual data owner. Then, after models are trained on each subset of the data, the updated weights are sent back to the coordinator and averaged together for a final model. Through this approach, the data never leaves the hands of its original owner, ensuring a higher level of security and trust between data owner and data scientist without a compromise in model performance.
Currently, developing federated learning models carries a slight additional computational cost, and the most common federated learning frameworks mainly support neural networks. Despite this, federated learning has great potential to transform the way models are trained because of the vast improvements in data privacy and security.
Transitioning from Federated Learning to Privatized AI
Federated learning as a methodology is effective, yet still has some flaws by itself. How can the intellectual property of data scientists’ AI models be kept private? What techniques can a data scientist use to explore a private dataset? PySyft, an open-source library created by OpenMined, enables fully private AI by combining federated learning with two other key concepts: Secure Multi-Party Computation (SMPC) and Differential Privacy. Google’s TensorFlow Federated also provides federated learning capabilities, as it integrates with TensorFlow and Keras for a deep learning backend.
Secure Multi-Party Computation
Let us say we have some data which we want to perform an operation on. In this case, let’s say our data is the number 6 and the operation is multiplication by 2. How can we get a third party to complete this operation without knowledge of the data? We use Secure Multi-Party Computation, of course! Instead of dealing with the data as a whole, we can split it into multiple parts, perform the operation, and combine each part’s result back together.
Figure 1: Using Secure Multi-Party Computation to perform a calculation.
Notice, each individual only has a part of the data (colloquially referred to as “shared encryption”), but we are still able to obtain the correct result of the operation. In the case of performing multiplication, SMPC is relatively simple; more advanced algorithms like backpropagation or linear algebra are much more difficult to compute. Luckily, PySyft integrates with PyTorch in a process known as “hooking” and brings SMPC into deep learning, allowing models to be trained by data owners without knowledge of the weights or updates.
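The splitting idea can be sketched with additive secret sharing, one common way to build SMPC. This is a minimal illustration and not PySyft's implementation: the modulus `Q`, the number of parties, and the helper names `split_into_shares` and `reconstruct` are assumptions made for the example.

```python
import secrets

# Public modulus for all share arithmetic (an assumed value for this sketch)
Q = 2**61 - 1

def split_into_shares(secret, n_parties=3):
    """Split `secret` into additive shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares add up to the secret
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Recombine shares by summing them modulo Q."""
    return sum(shares) % Q

shares = split_into_shares(6)
# Each party multiplies only its own share by the public constant 2
doubled_shares = [(2 * share) % Q for share in shares]
result = reconstruct(doubled_shares)
print(result)  # 12
```

Each share on its own looks like a random number, so no single party learns the value 6, yet the recombined result is still 6 × 2 = 12.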
Differential Privacy
As a data scientist, it’s imperative to look at several statistics to gain an elementary understanding of a dataset. However, any individual record of the dataset will affect these statistics in some manner. Differential Privacy is a rigorous, mathematical definition of privacy that measures how much a statistic changes when individual rows are included or excluded from the dataset.
The common method of measuring privacy now is known as ε-differential privacy. The concept behind ε-differential privacy is that an individual whose record is not included in some data has perfect privacy. Thus, we want to limit the dependence of any statistical functions on any individual.
For example, the mean equally reflects all entries in a group, so by observing the change in mean when removing or adding an entry it’s possible to calculate the exact value of an entry.
Figure 2: Calculating an individual value via including/excluding from the mean statistic.
As shown in the figure above, when removing one individual from the statistical calculation it’s possible to understand exactly what the removed individual’s value is — therefore the mean is not a differentially private statistic.
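The same leak can be reproduced in a few lines. The values below are hypothetical and are not the numbers from the figure; the point is only that two exact means, one with and one without a record, are enough to recover that record.

```python
from statistics import mean

# Hypothetical private values; the last one belongs to the individual being removed
values = [50, 60, 40, 70]

mean_with = mean(values)          # statistic published with everyone included
mean_without = mean(values[:-1])  # statistic published after one person is removed

n = len(values)
# sum(all) - sum(all but one) isolates the removed individual's value
recovered = n * mean_with - (n - 1) * mean_without
print(recovered)  # 70
```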
Randomized response is one example of a differentially private algorithm. Imagine we are trying to find out if individuals have an iPhone or an Android phone. The procedure of collecting data is as follows:
- Toss a coin
- If the coin lands heads, toss the coin again, but regardless of the second coin toss, answer the question honestly.
- If the coin lands tails, toss the coin again. If the second coin lands heads, answer iPhone. If tails, answer Android.
By this process (and the second coin toss in step 2), each individual has plausible deniability about how they answered the question. However, the majority of people who have an iPhone will answer as such, and the same holds for those who have an Android. Therefore, it’s possible to estimate roughly what percentage of the population has an iPhone or Android.
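The coin-toss procedure above can be simulated to see how the population rate is recovered. This is a sketch under stated assumptions: the true iPhone rate of 0.7, the population size, and the function names are all hypothetical. Because half of the respondents answer honestly and a quarter answer "iPhone" by chance, the probability of reporting "iPhone" is 0.5 × p + 0.25, which the estimator inverts.

```python
import random

def randomized_response(has_iphone, rng):
    """Return one respondent's reported answer (True means 'iPhone')."""
    first_heads = rng.random() < 0.5
    # The second coin is always tossed, so an observer cannot tell which branch was taken
    second_heads = rng.random() < 0.5
    if first_heads:
        return has_iphone   # honest answer
    return second_heads     # heads -> 'iPhone', tails -> 'Android'

def estimate_true_rate(reports):
    """Invert P(report iPhone) = 0.5 * p + 0.25 to estimate p."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)

rng = random.Random(0)
true_rate = 0.7  # hypothetical share of iPhone owners
population = [rng.random() < true_rate for _ in range(100_000)]
reports = [randomized_response(owner, rng) for owner in population]
estimate = estimate_true_rate(reports)
print(round(estimate, 2))
```

The estimate is noisy for any single survey, but it concentrates around the true rate as the number of respondents grows, while no individual answer can be taken at face value.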
PySyft aims to track how much privacy is being lost through all analysis-related operations; data scientists and data owners can now collaborate to establish “privacy budgets” and ensure not too much PII is leaked.
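To make the idea of a privacy budget concrete, here is a small sketch that is not PySyft's API. It assumes the Laplace mechanism for a mean over values clipped to a known range, a publicly known dataset size, and basic sequential composition, under which the ε values of successive queries add up. The class `PrivacyBudget` and the function `private_mean` are hypothetical names.

```python
import random

class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse any query that would push total spending past the agreed budget
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_mean(values, lower, upper, epsilon, budget, rng):
    """Release a noisy mean of values clipped to [lower, upper]."""
    budget.charge(epsilon)
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record changes the clipped mean by at most this amount
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)

budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random(0)
incomes = [42.0, 55.0, 61.0, 38.0, 70.0]  # hypothetical values, in thousands

first = private_mean(incomes, 0.0, 100.0, 0.5, budget, rng)
second = private_mean(incomes, 0.0, 100.0, 0.5, budget, rng)
try:
    private_mean(incomes, 0.0, 100.0, 0.5, budget, rng)
except RuntimeError as err:
    print(err)  # prints "privacy budget exhausted"
```

A smaller ε per query means more noise and less leakage, so the data owner and data scientist trade accuracy against how many questions can be asked before the budget runs out.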
What does this look like in Financial Services?
The financial services industry, where applications manage very sensitive data, is a natural target for federated learning. Loan risk prediction is one specific example. Below, we will see how to get a basic federated learning application up and running and, in doing so, see the benefits of using PySyft to create private AI.
Figure 3: Federated Learning Pipeline
Let’s recap the workflow enabled via Federated Learning and PySyft:
- Data Scientists develop deep learning architecture in PyTorch
- PySyft encrypts model and sends model to the data owner
- Model is trained on data owner’s premises, with data scientist specifying training parameters (like learning rate or number of iterations)
- Updated model(s) are sent from data owner back to data scientist
- In the case of multiple data owners, each model which was trained on a subset of the total iterations is averaged with the other models
- Rinse and repeat for desired number of training iterations!
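The workflow above can be sketched without any federated learning library, leaving out the encryption step. In this minimal simulation the "model" is a single weight `w` for y = w · x, the two data owners and their datasets are hypothetical, and each round consists of local training followed by plain averaging at the coordinator.

```python
def local_train(w, data, lr, steps):
    """Run a few gradient-descent steps for y = w * x on one owner's data."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Hypothetical private datasets (all drawn from y = 3x); each stays with its owner
owner_data = {
    "bank_a": [(1.0, 3.0), (2.0, 6.0)],
    "bank_b": [(3.0, 9.0), (4.0, 12.0)],
}

global_w = 0.0
for round_idx in range(20):
    # The current model goes to every owner, who trains it locally
    local_weights = [
        local_train(global_w, data, lr=0.01, steps=5)
        for data in owner_data.values()
    ]
    # Only the updated weights come back; the coordinator averages them
    global_w = sum(local_weights) / len(local_weights)

print(round(global_w, 3))
```

The coordinator never touches `owner_data` directly; it only sees the weights returned after each round, and the averaged weight moves toward the true slope of 3.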
Now that a model has been trained, the process to get predictions to the data owner is very simple. This process is again facilitated by PySyft.
- Send model to data owner via PySyft’s encryption.
- Model runs on client data, only feeding forward through the neural net, and returns predictions to the data owner
Now, let’s go in depth to see the essential steps of building a sample application using Federated Learning.
To initialize PySyft and ensure all deep learning operations are made secure, PySyft is “hooked” to the PyTorch library.
import torch
import syft as sy
# Hook PyTorch so its tensors gain PySyft's remote-execution methods
hook = sy.TorchHook(torch)
On the data owner’s end, the data now needs to be deployed via a WebsocketServerWorker, which enables federated learning. Different arguments are specified, such as host IP, the PySyft hook, and the specific port that Finastra’s PySyft client will connect to.
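from syft.workers.websocket_server import WebsocketServerWorker  # import path as in PySyft 0.2.x

# Load dataset from csv or database into syft's dataset class
dataset = sy.BaseDataset(features_tensor, target_tensor)
# kwargs carry the host IP, the PySyft hook, and the port described above
server = WebsocketServerWorker(**kwargs)
server.add_dataset(dataset, key='LoanRiskDataset')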
Now that a dataset is deployed from the Data Owner, we data scientists can connect via a WebsocketClientWorker defined in the PySyft library.
from syft.workers.websocket_client import WebsocketClientWorker  # import path as in PySyft 0.2.x
# Specified args include the hook, port, host IP
client = WebsocketClientWorker(id="client", port=port, **kwargs_websocket)
Next, we can specify the model and its training configuration. These will be sent to each WebsocketServerWorker (in this case there’s only one). Since PySyft sits on top of PyTorch, these are defined via Torch syntax.
import torch
from torch import nn
import syft as sy
from syft.workers import websocket_client

# Model and Loss Function
def loss_fn(pred, target):
    # nll_loss expects log-probabilities as its input
    return torch.nn.functional.nll_loss(input=pred, target=target)

class Net(nn.Module):
    def __init__(self, D_in, D_out):
        super(Net, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, D_out)
        # Log-softmax over the D_out classes (the original sigmoid does not
        # produce the log-probabilities that nll_loss requires)
        self.activation = torch.nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        return x

# Training Configuration
async def fit_model_on_worker(
    worker: websocket_client.WebsocketClientWorker,
    traced_model: torch.jit.ScriptModule,
    batch_size: int,
    curr_round: int,
    max_nr_batches: int,
    lr: float,
):
    train_config = sy.TrainConfig(
        model=traced_model,
        loss_fn=loss_fn,
        batch_size=batch_size,
        shuffle=True,
        max_nr_batches=max_nr_batches,
        epochs=1,
        optimizer="SGD",
        optimizer_args={"lr": lr},
    )
    # Send the model and training configuration to the remote worker
    train_config.send(worker)
    # Train remotely on the dataset registered under this key
    loss = await worker.async_fit(dataset_key="LoanRiskDataset", return_ids=[0])
    # Retrieve the updated model from the worker
    model = train_config.model_ptr.get().obj
    return worker.id, model, loss
The function fit_model_on_worker handles several steps. Via Syft, the model and training configuration are encrypted, the model is trained securely on the data owner’s premises, and lastly the updated model and its training loss are retrieved. All that needs to happen now is to call the function on each of the Data Owners. Although the computation is happening on the Data Owner side, all of the training is initiated by the data scientist!
import asyncio

# Must run inside an async function (or a notebook cell that supports top-level await)
results = await asyncio.gather(
    *[
        fit_model_on_worker(
            worker=worker,
            traced_model=traced_model,
            batch_size=args.batch_size,
            curr_round=curr_round,
            max_nr_batches=args.federate_after_n_batches,
            lr=learning_rate,
        )
        for worker in worker_instances
    ]
)
We have now obtained one updated model from each remote worker. All that remains is to combine them into a final model by averaging their weights and parameters. This is also handled by PySyft.
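from syft.frameworks.torch.fl import utils  # import path as in PySyft 0.2.x
models = {}
loss_values = {}
# Collect the updated model and loss reported by each worker
for worker_id, worker_model, worker_loss in results:
    if worker_model is not None:
        models[worker_id] = worker_model
        loss_values[worker_id] = worker_loss
# Average the workers' parameters into the next global model
traced_model = utils.federated_avg(models)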
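The source does not show what happens inside utils.federated_avg, so the following is only an illustration of the averaging idea and not PySyft's code. It assumes each worker's model is represented as a dictionary of flattened parameter lists with identical names and lengths; `average_parameters` and the example values are hypothetical.

```python
def average_parameters(models):
    """Element-wise mean of parameter dicts keyed by worker id."""
    worker_params = list(models.values())
    averaged = {}
    for name in worker_params[0]:
        # Group the i-th entry of this parameter across all workers
        columns = zip(*(params[name] for params in worker_params))
        averaged[name] = [sum(col) / len(worker_params) for col in columns]
    return averaged

# Hypothetical flattened parameters returned by two data owners
models = {
    "bank_a": {"linear1.weight": [0.25, 0.5], "linear1.bias": [0.125]},
    "bank_b": {"linear1.weight": [0.75, 0.0], "linear1.bias": [0.375]},
}
averaged = average_parameters(models)
print(averaged)  # {'linear1.weight': [0.5, 0.25], 'linear1.bias': [0.25]}
```

Every worker counts equally here. A weighted average, for instance by the number of training examples per owner, is a common variation when the datasets differ a lot in size.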
After running the training and averaging loop for the desired number of iterations, training is complete. The final model can then be sent to banks to predict the loan risk of new customers. In this scenario, because there is only one data owner, the model performance would be identical to that of a model trained in a typical ML pipeline. However, what happens to the performance when a model is trained on multiple data owners?
These code snippets are based on the Asynchronous Federated Learning Tutorial from the OpenMined GitHub repository, which trains a model on MNIST handwritten digit recognition data. In this example, the data is split up by digit. Multiple models are trained on subsets of the data and later combined. Let’s compare the performance of Federated Learning to the best performance in research.
Model | Accuracy (%)
CNN | 98.32
ResNet | 99.16
DenseNet | 99.37
CapsNet | 99.75
CNN (Federated Learning) | 95.45
Federated Learning comes within about three percentage points of the centrally trained CNN (95.45% versus 98.32%), while the training data never leaves its owners.
We see two main benefits through this process. The first is the main goal of federated learning: privacy. No bank is able to understand what the model architecture is, and we are unable to gain access to any data. The second is enhanced intelligence. Banks gain access to more robust models trained on a wide variety of data sources, because the model combines information from several other financial institutions. The applications of private AI and federated learning are immense: any use case where personal data is involved, from loans to credit risk, can reap the benefits of these new methodologies.
For more information about the framework, check out PySyft on GitHub.
About the Author
Brendon Machado is a Data Scientist at Finastra’s Innovation Lab keen on bridging the gap between new developments in AI/ML and the Financial Services industry. Brendon holds a Master’s degree in Computer Science with a specialization in Machine Learning from the Georgia Institute of Technology.