In the podcast, I spoke with Meenakshi Kaushik and Neelima Mukiri from the Cisco team about responsible AI, machine learning bias, and how to address these biases when using ML in our applications.
Meenakshi Kaushik currently works in the product management team at Cisco. She leads Kubernetes and AI/ML product offerings in the organization. Meenakshi has an interest in AI/ML space and is excited about how the technology can enhance human wellbeing and productivity.
Neelima Mukiri is a principal engineer at Cisco. She currently works on Cisco's on-premises and software-as-a-service container platforms.
Key Takeaways
- How to evaluate fairness in the input data and the model in machine learning applications.
- How open-source tools like Kubeflow can help mitigate machine learning bias.
- The current state of standards and guidelines for developing responsible AI.
- The future of responsible AI and ethics, and the innovations happening in this space.
Transcript
Introductions
Srini Penchikala: Hi everyone. My name is Srini Penchikala. I am the lead editor for the AI/ML and data engineering community at InfoQ. Thank you for tuning into this podcast. In today's podcast, I will be speaking with Meenakshi Kaushik and Neelima Mukiri, both from the Cisco team. We will be talking about machine learning algorithm bias, and how to make machine learning models fair and unbiased.
Let me first introduce our guests. Meenakshi Kaushik currently works in the product management team at Cisco. She leads Kubernetes and AI/ML product offerings in the organization. Meenakshi has an interest in the AI/ML space and is excited about how the technology can enhance human wellbeing and productivity. Thanks, Meenakshi, for joining me today. And Neelima Mukiri is a principal engineer at Cisco. She currently works on Cisco's on-premises and software-as-a-service container platforms. Thank you both for joining me today in this podcast. Before we get started, do you have any additional comments about your research and the projects you've been working on that you would like to share with our readers?
Meenakshi Kaushik: Hi everyone. My name is Meenakshi. Thank you, Srini, for inviting Neelima and me to the podcast. And thank you for that great introduction. Other than what you have mentioned, I just want to say that it is exciting to see how machine learning is getting more and more mainstream, and we see that in our customers. So, this topic and this conversation are super important.
Neelima Mukiri: Thank you, Srini, for that introduction and thank you, Meenakshi. We are definitely very excited about the evolution of AI/ML, especially in the Kubernetes community, and how Kubernetes is making it easier to handle MLOps. I'm excited to be here to talk about fairness and reducing bias in machine learning pipelines. As ML becomes more pervasive in society, that's a very critical topic for us to focus on.
Srini Penchikala: Thank you. Definitely I'm excited to discuss this with you today. So, let's get started. I think the first question, Meenakshi, maybe you can start us off with this. So, how did you get interested in machine learning, and what would you like to accomplish by working on machine learning projects and initiatives?
Meenakshi Kaushik: Machine learning has been around for a while. What got me excited is when I started seeing real-world use cases getting deployed more and more. For example, I remember a marked change when I saw Amazon Rekognition, which could recognize facial expressions and tell you your mood. What I took away from that was, "Oh, isn't that helpful? You can change somebody's mood by making them aware that today you're not looking so happy." So, that was pretty groundbreaking. And then more and more applications came along, especially in image recognition, where you could tell something about a patient's health, and that became more and more real. And as Neelima pointed out, that went mainstream even with our customers, with the evolution of Kubernetes and Kubeflow. So, both these spaces together, where it became easier and easier for data scientists and even ordinary folks to apply machine learning day to day, really got me excited. And this evolution is still progressing, so I feel very happy about that.
Srini Penchikala: How about you, Neelima, what made you get interested in working on machine learning projects, and what are your goals in this space?
Neelima Mukiri: I've always been interested in AI/ML and the possibilities that it opens up. In recent years, advances in ML have been so marked compared to 10 years ago. There's so much improvement in what you can do, and it's so much more accessible to every domain that we are involved in. The possibilities of self-driving cars, robotics, healthcare, all of these are real-world applications that have a chance to affect our day-to-day lives. In addition to just how exciting the field is, being involved in Kubernetes and Kubeflow as part of Cisco's container platforms, we've seen our customers be very interested in using Kubeflow to make ML more accessible. And as part of that, we've started working on AI/ML in the context of Kubernetes.
Define AI fairness and examples of fair AI solutions
Srini Penchikala: Yeah. Definitely, Kubernetes brings that additional dimension to machine learning projects, making them more cloud native, elastic and performant, right? So, thank you. Before we jump into machine learning bias and fairness, which is the main focus of our discussion here, can you define AI fairness? What do we mean by AI fairness? And can you talk about a couple of examples of fair AI solutions, and an example of an ML solution that hasn't been fair?
Meenakshi Kaushik: Fairness comes into the picture when you start subdividing the population. For example, and this is an example we gave in our KubeCon presentation, let's say a bank is looking at giving loans to the entire population, and it finds that 80% of the time its decisions are accurate. So, overall across the population, things behave normally. But when you start looking at subsections of the population, you want to see whether those subsections are equally represented in the overall decision making.
So, let's say 80% of the time the loan application gets accepted. If you start slicing and dicing at a broad level between, let's say, male and female applicants, do both get accepted 80% of the time? Or within the population of people who had previous loans but defaulted, do they get equally represented or not? So, fairness is about looking at a broad solution and then slicing and dicing into specific subgroups, whether based on gender, racial differences or age differences. For example, if a COVID vaccine was tested on adults, it doesn't work as well on children. So, it's not fair to just push your machine learning model to children until you have looked at that population and checked that it's fair to that population. So, fairness is about equity, and it's really in the context of the stakeholder. A stakeholder decides at what level they want to define fairness, and which groups they want to check fairness across.
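To make that subgroup check concrete, here is a minimal sketch in Python that compares per-group approval rates against the overall rate. The data and column names ("gender", "approved") are hypothetical.

```python
import pandas as pd

# Hypothetical loan decisions with a protected attribute.
df = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M", "F", "M"],
    "approved": [1,    1,   0,   1,   1,   1,   0,   1],
})

overall_rate = df["approved"].mean()
group_rates = df.groupby("gender")["approved"].mean()

# A large gap between a group's rate and the overall rate is a signal
# that the decisions may not be fair to that subgroup.
print(f"overall approval rate: {overall_rate:.2f}")
print(group_rates)
print("disparity vs overall:")
print((group_rates - overall_rate).abs())
```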
Three sources of unfair algorithm bias
Srini Penchikala: That's a good background on what is fair and what is not, right? So, maybe Neelima, you can respond to this. In your presentation at the KubeCon conference last month, you both mentioned that the sources of unfair algorithm bias are data, user interactions, and the AI/ML pipeline. Can you discuss these three sources and how important each of them is in contributing to, and controlling, unfair bias?
Neelima Mukiri: So, data is the first step, where you are bringing real-world information into a machine learning pipeline. Let's take the same example of deciding whether a person is creditworthy enough for a bank to give them a loan or not. You populate the machine learning pipeline with data from previous customers, and the decisions made for those previous customers, whether to give them a loan or not. So, the data comes in from the real world, and it is filled with the bias that is present in our real world, because the world is evolving and what we considered fair a few years back may not be fair today. And we are all prone to prejudices and biases that we put into our decision-making process. So, the first step in the machine learning pipeline is the collection and processing of the data, where we have the chance to evaluate and see what biases are present.
For example, are there large subsets of the population that are missing in the dataset we have collected? Or is the data skewed towards being more positive for some subset of the population? So, that's the first step where we want to evaluate, understand and, when possible, mitigate the bias that is present.
The next step is the machine learning pipeline itself. As we build the model and as we serve the model, at every step of the pipeline we want to make sure that we are considering and evaluating the decisions that are made, the models that are built and the inference that is provided, to see whether it is being fair across the population set that we're covering. And when you bring that decision to a user and present it, that can in turn reinforce bias through the user acting on it.
So, let's say you're looking at policing and you give a wrong prediction that somebody is prone to commit a crime. It's possible that the police will actually do more enforcement in that region, and that ends up flagging more people in that region as likely to commit a crime, and then that feeds into your data and the cycle is reinforced. So, every step in the pipeline, right from where you collect your data, to building models, providing inference and then seeing how people act based on those inferences, is prone to bias. So, we need to evaluate and correct for fairness where possible.
AI/ML fairness toolkits
Srini Penchikala: Maybe, Meenakshi, you can respond to this one, right? So, can you talk about some of the AI/ML fairness toolkits that you discussed in the presentation, and why you both chose Kubeflow for your project?
Meenakshi Kaushik: As we were saying at the beginning, we work in the Kubernetes domain. And although there are many machine learning lifecycle management toolkits available on top of Kubernetes, Kubeflow has gained a lot of traction, and it's used by many of our customers. It's also pretty easy to use. So, we chose Kubeflow since it is one of the popular open-source machine learning lifecycle management toolkits. And really, it allows you to build everything, all the way from your exploration phase to the production pipeline. You can bring up a notebook and run your machine learning models, then chain them together in a workflow and deploy it in production. So, that's why we used Kubeflow.
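As a rough illustration of chaining steps into a Kubeflow pipeline, here is a minimal sketch assuming the Kubeflow Pipelines (KFP) v2 SDK; the component bodies are placeholders, not the actual pipeline from the presentation.

```python
from kfp import dsl, compiler

@dsl.component
def preprocess() -> str:
    # Placeholder: load and clean the training data.
    return "dataset-ready"

@dsl.component
def train(dataset: str) -> str:
    # Placeholder: train a model on the prepared dataset.
    return f"model-trained-on-{dataset}"

@dsl.pipeline(name="loan-model-pipeline")
def loan_pipeline():
    # Chain the steps: the training step consumes the preprocessing output.
    data_step = preprocess()
    train(dataset=data_step.output)

if __name__ == "__main__":
    # Compile to a pipeline spec that can be uploaded to Kubeflow Pipelines.
    compiler.Compiler().compile(loan_pipeline, "loan_pipeline.yaml")
```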
And then on the machine learning fairness toolkits, when we started this journey, we started looking at open-source toolkits. Fairness is an up-and-coming topic, so there is a lot of interest, and there are a lot of toolkits available. We picked the four popular ones because they had a wide portfolio of fairness features available. And the good thing is that they have many commonalities, but they also have interesting variations, so together they give you a large variety of capabilities. So, let me quickly talk about the four toolkits. We started by looking at Aequitas. The Aequitas fairness toolkit, I would say, is the simplest one to get into. You just give it your predictions and it will tell you about fairness; it will give you your entire fairness report. You provide your predictions, your data and the population you want to look at fairness for, the protected group, and it just gives you the results. So, it offers you insight as a black box, which is pretty nice.
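As a rough sketch of what such an audit looks like in code, here is a small example using Aequitas. The column conventions ("score", "label_value", plus attribute columns) and the Group/Bias classes are my understanding of the library and may differ across versions; the data is made up.

```python
import pandas as pd
from aequitas.group import Group
from aequitas.bias import Bias

# Aequitas expects the model's prediction, the ground truth, and one column
# per protected attribute.
df = pd.DataFrame({
    "score":       [1, 0, 1, 1, 0, 1],        # model's loan decision
    "label_value": [1, 0, 0, 1, 1, 1],        # what actually happened
    "gender":      ["F", "M", "F", "M", "F", "M"],
})

# Per-group counts and metrics (predicted positive rate, FPR, FNR, ...).
group = Group()
crosstab, _ = group.get_crosstabs(df)

# Disparities of each group relative to a chosen reference group.
bias = Bias()
disparities = bias.get_disparity_predefined_groups(
    crosstab, original_df=df, ref_groups_dict={"gender": "M"})

print(disparities.head())
```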
But what if you want to go a level deeper, or what if you want to do interactive analysis? In that case, what I found was that Google's What-If Tool was pretty nice, in the sense that it is a graphical user interface and you can make interactive changes to your data to see when it is fair: "Can I get a counterfactual? Can I change the threshold to see if it changes the bias in this subpopulation?" And how it impacts other things; for example, it might impact your accuracy if you try to change your bias threshold. So, the What-If Tool is pretty good from that perspective. It is interactive and it will help you with that. Obviously, because it's an interactive toolkit, if you have billions and billions of data points, you won't be able to pull all of them into the graphical user interface. But there is real strength in having a graphical toolkit.
Then the other toolkits we looked at are AI Fairness 360 from IBM, and Microsoft's Fairlearn. And these toolkits are awesome. They don't have the interactive capability of the What-If Tool or the black-box reporting of Aequitas, but they have very simple libraries that you can pick up and put in any of your machine learning workflows, in, I guess, any notebook. In the case of Kubeflow, it's a Jupyter notebook, but you could definitely run it on Colab. And as you are experimenting, you can see graphically, using those libraries, where your fairness criteria lie.
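As an example of the drop-in, notebook-friendly style these libraries offer, here is a minimal sketch with Fairlearn's MetricFrame, using made-up predictions and a made-up "gender" attribute.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate

# Hypothetical labels, predictions, and protected attribute.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])
gender = np.array(["F", "M", "F", "M", "F", "M", "F", "M"])

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=gender,
)

print(mf.overall)        # metrics over the whole population
print(mf.by_group)       # the same metrics sliced per subgroup
print(mf.difference())   # largest gap between subgroups, per metric
```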
So, those are the four toolkits, and all of them have their main strength in binary classification, because that's where machine learning fairness toolkits started. For other areas like natural language processing and computer vision, things are evolving, so these toolkits are adding more and more functionality. So, that's an overview of the landscape that we looked at.
Srini Penchikala: Neelima, do you have any additional comments on that same topic? Any other criteria you considered for the different frameworks?
Neelima Mukiri: In terms of the toolkits, Meenakshi covered what we looked at primarily. And in terms of the criteria, one of the primary things we were looking for was how easy it is to run these on your on-prem system versus having to put your data in a cloud, given a lot of our customers are running their workloads on-prem and have data locality restrictions. That was one key thing that was important for us to understand. And we were able to run all the toolkits on-prem in Kubeflow. Some of them, especially What-If, are a lot easier to run directly: you go to the website and run it in a browser, but you have to upload your data there. The other thing we looked at is programmability, or how easy it is to bundle this into a pipeline. And that's where, I think, both Fairlearn and IBM AI Fairness 360 are easier to plug in, as well as a bunch of the TensorFlow libraries that are available for bias detection and reduction.
Yeah. So, the two axes we were coming from were: how easy is it to plug your data into it, and where can you run it? How easy is it to run it in your machine learning pipeline versus having to have a separate process for it?
Srini Penchikala: So, once you chose the technology, Kubeflow, and you have the model defined and the requirements finalized, the most important thing next is the data pipeline development itself, right? So, can you discuss the details of the data pipelines you are proposing as part of the solution to detect bias and improve the fairness of these programs? There are a few different steps you talked about in the presentation, such as pre-processing, in-processing and post-processing. Can you provide more details on these steps? And also, more importantly, how do you ensure fairness at every step in the data pipeline?
Neelima Mukiri: Basically, we divided the machine learning pipeline into three phases: pre-processing, in-processing and post-processing. Pre-processing is everything that comes before you start building your model. In-processing is what happens while you're building your model. And post-processing is, you've built your model and you're ready to serve; is there something you can do at that point? So, the first part, pre-processing, is where you look at your data, analyze your data, and try to remove any biases that are present in the data. The types of biases that are better handled at this stage are cases where you have a large skew in the data available for different subgroups. The example we gave in the presentation was, let's say you're trying to build a dog classifier, and you train it on one breed of dogs. It's not going to perform very well when you give it a different dog breed, right?
So, where you're coming in with a large skew in the data available per subgroup, try to remove it at the pre-processing phase itself. The types of biases that are easier to remove, or better served by removing in the model building phase, are more the quality-of-service improvements. So, let's say you're trying to train a medical algorithm to see what type of medicine or treatment regimen works best for a subset of the population. You don't really want to give everyone the same medicine or the same type of medication, you want to give them what best serves their use case, what works well for that subset. So, you actually want to better fit the data.
And that's where doing the bias reduction during the model training phase, which is the in-processing step, works better. And there are a bunch of techniques available to reduce bias in the model training stage, which we talk about in the presentation, like adding an adversarial training step, where you're trying to optimize for both accuracy and reducing the bias parameter that you specify.
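The presentation mentions adversarial training; as a simpler illustration of in-processing mitigation, the sketch below uses a different technique, Fairlearn's reductions approach, which retrains a base estimator under an explicit fairness constraint. The data and feature names are made up.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Hypothetical training data with a protected attribute.
X = pd.DataFrame({"income":        [40, 85, 30, 70, 55, 90],
                  "prior_default": [0,  0,  1,  0,  1,  0]})
y = pd.Series([0, 1, 0, 1, 1, 1])                 # repaid on time?
gender = pd.Series(["F", "M", "F", "M", "F", "M"])

# Train a model subject to a demographic parity constraint instead of
# optimizing accuracy alone.
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(),
    constraints=DemographicParity(),
)
mitigator.fit(X, y, sensitive_features=gender)
print(mitigator.predict(X))
```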
Now, when we have trained the model, and we've decided on the model we are going to use, we can still evaluate for bias and improve fairness in the post-processing step. And the type of data that is really well suited for that is where you have existing bias in your data. Take the loan processing example, where let's say a subgroup of the population has traditionally been denied loans even though they are very good at paying back their loans. There you can actually go and say, "Hey, maybe their income is less than this threshold, but this population has traditionally been better at paying back loans than we've predicted, so let's adjust the threshold for them." You're not changing the model, you're not changing the data, you're just adjusting the threshold because you know that your prediction has traditionally been wrong.
So, that's where the post-processing step can remove that kind of bias better. At each step of the pipeline, I think it's important to first evaluate and then try to remove the bias, and also to try different mechanisms, because each one works better in different scenarios.
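Here is a minimal sketch of that kind of post-processing: the model and data stay unchanged, and a different decision threshold is applied to the group the model has historically under-approved. The scores, group names and thresholds are hypothetical; libraries such as Fairlearn's ThresholdOptimizer automate choosing per-group thresholds like these.

```python
import numpy as np

scores = np.array([0.55, 0.70, 0.48, 0.62, 0.80, 0.51])   # model output
group  = np.array(["A",  "B",  "A",  "B",  "A",  "B"])

# Group "A" has historically repaid more often than the model predicts,
# so we approve it at a lower score than group "B".
thresholds = {"A": 0.45, "B": 0.60}

approve = np.array([s >= thresholds[g] for s, g in zip(scores, group)])
print(approve)
```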
Srini Penchikala: Meenakshi, do you have any additional comments on the data pipeline that you both worked on?
Meenakshi Kaushik: Yeah. What happens even before we have the ability to do pre-processing, in-processing or post-processing is: what do we have at hand? For example, sometimes we didn't build the model, we are just consumers of the model. In that case, there isn't much you can do other than post-processing: can we massage the output of the model to make it fair? And post-processing actually works very well in many scenarios.
So, it's not that all is lost there; you can still make things more fair just by doing that. Now, sometimes you have access to the data but not to the model. So, in addition to what Neelima is saying about going through the different phases of the pipeline, do not be afraid even if you have a limited view of your infrastructure, or of how you are serving your customers. There is still an opportunity where you can massage the data, at the pre-processing layer.
If you don't have access to the model, but you have the ability to feed data to the model, that's good. You still have the ability to make changes at the pre-processing level to influence the decision. But it's important to look at what really works. The way I look at it is that, ideally, it's like security: you shift left, you try to make changes as early in the pipeline as possible. But sometimes influencing the earlier stages of the pipeline may not give you the best result. Still, ideally that's what you want to do.
First, you want to fix the world so that you get perfect data. But if you cannot get perfect data, can you massage it so that it is? If that's not possible, then you go further down the pipeline and say, "Okay, can I change my model?" At times, changing the model may not be possible. Then even at the last stage, as we've seen in a variety of examples, it's often good enough that, even if the model itself is not fair, you massage the actual result you give out to others by changing some simple thresholds, and make your pipeline fair.
Data pipelines to detect machine learning bias and improve the fairness of ML programs
Srini Penchikala: Very interesting. I still have a question on fairness quality assurance, right? Neelima, going back to your example of the loan threshold, which you might adjust because it's been traditionally wrong under the previous criteria: how do you decide that, and how do you ensure that that decision is also fair?
Neelima Mukiri: In examples like the bank loan, typically the way to evaluate fairness is, you have one set of data, let's say from your bank, and the decisions that you have made. But let's say you've denied a loan to a person, and that person has gone and taken a loan with another bank. You actually have real-world data about how they performed with that loan. Did they pay it back on time or not? So, you can actually come back and say, "Hey, that was a false negative. I said I didn't want to give a loan to that person, who is actually paying back on time."
So, you can take historic data and see how it performed versus your prediction. You can actually evaluate the real accuracy, and you can easily look at fairness in terms of subpopulations by looking at the positive rates per population. But as a business, you want to optimize for value. So, it's critical to know that you've actually made mistakes, both in terms of accuracy and the bias that is there.
The bias is what has induced the errors in accuracy. First of all, getting that historic data, and then getting a summary of how it's performed across these different dimensions, is the way for you to see what bias exists today, and whether improving it is actually going to improve your accuracy as well, and your goal of maximizing profit or whatever your goal is, right? Tools like Aequitas and What-If actually give you a very nice summary of the different dimensions: how is accuracy changing as you're changing fairness? How is it changing when you're trying to fit the data better, or when you're trying to change thresholds?
So, I would say evaluate this, run it through the system, see the data that it generates, and then decide what sort of bias reduction makes sense for you. Because really, it doesn't make sense to say, "Give it to everyone," because you have a business to run at the end of the day, right? So, evaluate, see the data and then act on the data.
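A minimal sketch of that evaluation, assuming a hypothetical dataframe that joins the bank's past decisions with repayment outcomes observed elsewhere, and compares false negative rates per subgroup.

```python
import pandas as pd

df = pd.DataFrame({
    "approved": [0, 1, 0, 1, 0, 1, 0, 1],   # the bank's past decision
    "repaid":   [1, 1, 1, 1, 0, 1, 1, 0],   # outcome observed in the real world
    "group":    ["A", "A", "A", "B", "B", "B", "A", "B"],
})

def false_negative_rate(g: pd.DataFrame) -> float:
    # People who would have repaid, but were denied a loan.
    positives = g[g["repaid"] == 1]
    return (positives["approved"] == 0).mean()

# A much higher false negative rate for one group is the kind of historic
# error that can justify a per-group threshold adjustment.
print(df.groupby("group").apply(false_negative_rate))
```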
Standards and community guidelines to implement responsible AI
Srini Penchikala: In that example, financial organizations definitely want to predict accurately to minimize risk, but they also need to be responsible and unbiased when it comes to customers' rights. Okay, thank you. We can now switch gears a little bit and talk about the current state of standards. Can you both talk about the current situation in terms of standards or community guidelines? Because responsible AI is still an emerging topic. What are some standards in this space that the developer community can adopt in their own organizations to be consistent with fairness? We don't want fairness to be defined differently by different organizations. How can we provide a consistent standard or guideline to all the developers in their organizations?
Meenakshi Kaushik: So, let me just start by saying that, as you mentioned, fairness is still in its infancy, so everybody's trying to figure it out. The good thing is that it's easier to evaluate fairness, because you can look at slices of the subpopulation and see whether the model is doing the same thing as it's doing for the population as a whole. Given that, the easiest thing you can do for now, which is commonly done for most of our software and even hardware, is to have a specification. It tells you, "These are the performance limits: it will only accept this many packets per second. This is the number of connections it can take." Things like that. What is the bounding limit under which you get the expected performance?
And models now have something called model cards, where you can give a similar specification: how was the model built? What assumptions did it make? This is the data it took, and this is the bounding limit under which it works, right? For example, if the model was doing some kind of medical analysis, and it took a population which is, let's say, from India, then it has a view of just that specific population. And if somebody is trying to use it in a generalized setting, a model card which tells you about that means I, as a consumer, can be aware of it and can say, "Aha, okay, I should expect some kind of discrepancy." Currently, those things are not readily available when you go and take a model from open source, or from anywhere for that matter. So, that's the first easy step that I think can be done in the near term.
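As an illustration only, a model card can be as simple as a structured record that ships alongside the model. The fields below are hypothetical, not a standard format.

```python
# A minimal, hypothetical model card for the loan example above.
model_card = {
    "model_name": "loan-approval-classifier",
    "training_data": "Bank X loan applications, 2015-2020, India only",
    "intended_use": "Pre-screening of consumer loan applications",
    "assumptions": [
        "Applicants are adults with a domestic credit history",
        "Income is reported in local currency",
    ],
    "evaluated_groups": ["gender", "age_band", "zip_code"],
    "known_limitations": "Not evaluated on applicants outside India",
    "metrics": {"accuracy": 0.82, "demographic_parity_difference": 0.06},
}

for key, value in model_card.items():
    print(f"{key}: {value}")
```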
In the longer term, there have to be perhaps more guidelines. Today there are different ways of mitigating bias; there is no one approach that fits all. However, what needs to be added to the pipelines is not standardized. What should be standardized is the set of checks your model should run against, right?
So, if your model is making decisions across all the age groups, then some of the protected groups should be predefined: "I should look at children versus pre-teens versus adults, and see if it is performing the same way." If there is some other kind of disparity, there has to be a common standard that an organization defines, depending on the space they are in. For a bank, for example, it would be based on gender differences, the zones or zip codes people live in, or ethnicity. In the case of medical applications, of course, the list is larger. So, those are the near-term standards. The broader standards, I think, will take a longer time. Even within machine learning, there are no standard ways to give predictions; you can bring your own algorithms and your own approaches. So, I think we're a little far away on that.
Neelima Mukiri: Yeah. I would echo what Meenakshi said. We were surprised by the lack of any standards. The field is in its very infancy, right? So, it's evolving very rapidly. There's a lot of research going on. We are still at the phase where we are trying to define what is required, versus a state where we are able to set standards. That said, there are a lot of existing legal and societal requirements in different settings about the level of disparity you can have across different populations. But again, that's very limited to certain domains, certain use cases, maybe things like applying for jobs or housing, or giving out loans. So, there are fields where there are legal guidelines already in place in terms of what is the acceptable bias across different subgroups; that's where we have some existing standards.
But when you bring it to machine learning and AI, there are really no standards there. When we looked at all these different frameworks that are available for reducing bias, one interesting thing is that even the definition of what is bias, or what is parity, is different across each of them. Broadly, they fall into either an allocation bias or a quality-of-service (QoS) bias. But again, each framework comes and says, "This is the bias that I'm trying to reduce, or these are the set of biases that I allow you to optimize for." So again, it makes sense at this stage to actually look at it from multiple angles, and try things out and see what works in a specific sub-domain. But as a society, we have a lot of progress to make and a ways to go before we can define standards and say, "This is the allowed parity in these domains."
Future innovations in addressing machine learning biases
Srini Penchikala: Right. Yeah. Definitely, bias is contextual, situational and relative, right? So, we have to take the business context into consideration to see what exactly is bias. Can you both talk about what the future looks like? You already mentioned a couple of gaps. What kind of innovations do you see happening in this space, or would like to see happen in this space?
Meenakshi Kaushik: As Neelima pointed out, we were happy to see that fairness can at least be evaluated, even if it is hard to define, because the decisions are model generated rather than made by a human in the loop, where you can't really evaluate them. So, that's a good thing. What I am excited to see is that fairness work is continuing across different domains of machine learning. It started with, as I said, classification problems, but it is now moving more and more towards the problems which are getting increasingly deployed: anything to do with image recognition and computer vision, for example, and it touches broad areas, from medical to, as Neelima was pointing out, the autonomous driving field. So, that I'm really excited to see.
The second thing is that, hopefully, model cards become the way of the future, so that every model comes with a description of what was used to generate it and what the expected output should be, and we can all figure out how it was built. Even for the advertisements that are served to me, if I know exactly how the model was defined, that's useful information to have. So, I'm excited to see that.
And the toolkits that are developing are also very good, because right now these toolkits are one-off toolkits. When Neelima and I started looking at not only Kubeflow, but researching what we wanted to demonstrate at KubeCon, we were looking at a way of automating this in our machine learning pipeline. Similar to how we do automated hyperparameter tuning, we wanted to automatically modify our machine learning model to have fairness criteria built in.
So, currently those things are not totally automated, but I think we're very close. We could just modify some of our routines, similar to hyperparameter tuning, so that there is machine learning fairness tuning: you tune your model so that you achieve fairness as well as your business objectives, and the accuracy-versus-fairness trade-off is handled easily. That's the other area I'm excited to see us achieve, so that it becomes built in like hyperparameter tuning: you also do fairness tuning for the model.
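As a sketch of what such a fairness-tuning objective could look like, the function below scores a trial on accuracy minus a weighted fairness penalty, the way a hyperparameter tuner scores trials; the weighting scheme and metric choice are assumptions, not a standard recipe.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import demographic_parity_difference

def tuning_objective(y_true, y_pred, sensitive_features, fairness_weight=0.5):
    """Higher is better: reward accuracy, penalize disparity between groups."""
    acc = accuracy_score(y_true, y_pred)
    disparity = demographic_parity_difference(
        y_true, y_pred, sensitive_features=sensitive_features)
    return acc - fairness_weight * disparity

# Hypothetical evaluation of one trial; a tuner (e.g. Katib or a grid
# search) would maximize this objective across trials instead of accuracy alone.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
groups = np.array(["F", "M", "F", "M", "F", "M"])
print(tuning_objective(y_true, y_pred, groups))
```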
Neelima Mukiri: Yeah. To echo what Meenakshi said, we really need to have more standards defined that we can use across different types of problems. We also want to see standardization in terms of defining fairness and evaluating fairness. And there's a lot of improvement to be done in making it easy to integrate fairness into the pipeline itself. There's work ongoing in Kubeflow, for example, to integrate evaluation of fairness into the inference side of the pipeline, in post-processing. And so, we need to be able to build explainable, interpretable models and make it easy for people to build fairness into their pipelines, so that it's not an afterthought; it's not something that comes in only when someone happens to be interested in making your model more fair, but part of the pipeline. Just as you do training, testing and cross validation, you also need to do fairness checks as part of the pipeline, as part of the standard development process.
Final thoughts and wrap-up
Srini Penchikala: Yeah. Definitely, I agree with both of you. If there is one area where we should introduce fairness as another dimension and build solutions to be fair out of the box, right from the beginning, that area would be machine learning, right? So, thanks, Neelima. Thanks, Meenakshi.
Neelima Mukiri: Thank you for this opportunity to talk to you and talk to your readers on this very exciting topic.
Meenakshi Kaushik: Thank you so much for the opportunity. It was fun chatting with you, Srini. Thank you.
Srini Penchikala: Thank you very much for joining this podcast. Again, it's been great to discuss this emerging and very important topic in the machine learning space: how to make programs more fair and unbiased. As we use more and more machine learning programs in our applications, and as we depend on machines to make decisions in different situations, it's very important to make sure, as much as possible, that there is no unfairness to one demographic group or another. To our listeners, thank you for listening to this podcast. If you would like to learn more about machine learning and deep learning topics, check out the AI/ML and data engineering community page on the infoq.com website. I encourage you to listen to the recent podcasts, and check out the articles and news items my team has posted on the website. Thank you.