Live from the venue of the QCon London Conference, we are talking with Mehrnoosh Sameki. In this podcast, Mehrnoosh discusses the importance of responsible AI, the principles behind it, and the challenges in ensuring fairness and transparency in AI systems. She also highlights various open-source tools and approaches for developers to incorporate responsible AI practices into their machine learning models and ensure better decision-making and ethical outcomes.
Key Takeaways
- Responsible AI refers to the right approach to developing and deploying artificial intelligence systems, encompassing principles such as fairness, inclusiveness, reliability, safety, privacy, and security.
- Fairness in AI means that systems should treat similarly situated people the same, regardless of their protected attributes, such as gender or race. It can be approached by considering potential harms, including allocation, quality of service, and representational concerns.
- Sensitive attributes can be used during the evaluation phase of an AI model to ensure fairness, even if they were not used during the model training process.
- To improve AI fairness, developers can augment data, collect better quality data, or adjust features and models based on the errors and blind spots identified by tools such as the Responsible AI Dashboard.
- Trade-offs between fairness and performance might be necessary, and the choice of the right balance depends on the specific context and goals of the AI model.
- Generative AI models like ChatGPT have the potential to democratize AI, but they also raise concerns about over-reliance and the need for proper assessment and monitoring.
Transcript
Introduction [00:44]
Welcome everyone to the InfoQ podcast. My name is Roland Meertens, your host for today, and I'll be interviewing Mehrnoosh Sameki. She is the Responsible AI Tools lead at Microsoft and Adjunct Assistant Professor at Boston University. We are talking to each other in person at the QCon London conference, where she hosted the track Emerging AI and Machine Learning Trends. She also gave the presentation Responsible AI from Principle to Practice. Make sure to watch her presentation, as she both delivers tremendous insights into responsible AI and points you to the tooling needed to get started with it yourself. During today's interview, we will dive deeper into the topic of responsible AI. I hope you enjoy it, and I hope you can learn from it.
Welcome, Mehrnoosh, to the InfoQ podcast. We are here at QCon London, and you are giving a talk tomorrow. Would you like to give a short summary of your talk, so people who haven't watched it online yet can check it out?
Mehrnoosh Sameki: Absolutely. So first of all, thank you so much for having me. It's so wonderful to be in this venue and also in London in general. I feel like I'm here as a part of history. It's so beautiful here. My name is Mehrnoosh Sameki, I'm a principal PM lead at Microsoft, and I'm also an Adjunct Assistant Professor at Boston University. My focus and passion is to help operationalize this buzzword of responsible AI in practice. I lead a product team that is on a mission to provide the tools you can incorporate in your machine learning lifecycle to operationalize concepts like fairness, transparency and interpretability, and reliability and safety. So it's wonderful to be here among the practitioners, understand their concerns and challenges, and see how we can support them in their ML lifecycle.
Defining responsibility and fairness [02:30]
Roland Meertens: You mentioned concepts like fairness. What do you mean by fairness? What do you mean by responsible?
Mehrnoosh Sameki: Great question. Responsible AI in general is a term that we use generically to talk about the right approach to developing and deploying artificial intelligence systems, and it consists of many different aspects. One possible enumeration is Microsoft's six responsible AI principles of fairness, inclusiveness, reliability and safety, and privacy and security, underpinned by two more foundational principles of transparency and accountability. During this podcast I could double-click on each one of them, but this aspect of fairness is particularly interesting, because fairness is a concept that has been studied by philosophers, psychologists, sociologists, and computer scientists for decades, maybe even longer. We cannot suddenly come from the tech world and come up with one great definition for it; it's truly a hugely multifaceted concept. However, the way that we refer to it in the space of AI is that AI systems should treat similarly situated people the same, regardless of their protected or sensitive attributes, like gender, sex, sexual orientation, religion, age, or whatever else that might be.
And my way of simplifying fairness for teams is to help them think through the harms, because harms make it very real. So think through what type of fairness harm you envision your AI could cause. Is it a harm of allocation, where the AI allocates opportunities or information differently across different groups of people? Is it a harm of quality of service, where the AI provides a different quality of service to different groups of people? Is it a representational harm, where your AI is stereotyping, using demeaning or derogatory language, or possibly even erasing a population from its answers or from the population it's serving? Thinking through the harms helps make this concept of fairness a little more real, and helps you pick the right measurements and mitigation techniques to resolve it.
Roland Meertens: And how do you then determine that two users have the same background or are essentially in a similar situation?
Mehrnoosh Sameki: You need to have something in your data that is considered a sensitive attribute. So you need to have that type of information. A lot of our customers, a lot of the users I'm seeing, do have access to some sensitive attributes, like gender, even if they haven't used them during model training. Those are the angles we use to see how the model is treating different groups defined in terms of those attributes, for instance female versus male versus non-binary.
Sometimes customers come and say, "We don't have access to sensitive attributes at all, because we're not allowed to even collect that." In such scenarios, what we really recommend to them is a tool that I'm presenting in detail in my presentation, called Error Analysis, which at least provides you with the blind spots of your AI. So it says, "Oh, a lot more people in this particular cohort are getting erroneous responses." It automatically puts those blind spots in front of you, and even though these cohorts might not be defined in terms of sensitive attributes, still, knowing those blind spots and improving the performance of the model for them will hopefully help raise the overall performance of the model for everyone. And you never know, maybe there is a correlation between those blind-spot features and some sensitive attributes that you haven't even collected.
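The Error Analysis tool surfaces these cohorts interactively; a rough way to approximate the same idea in plain code is to disaggregate error rates over candidate cohort columns. The sketch below assumes a fitted scikit-learn classifier named `model` and a pandas test DataFrame `test_df`; all column names are illustrative.

```python
# Approximate the "blind spot" idea: disaggregate the error rate over candidate
# cohort columns. High-error cohorts are potential blind spots worth investigating.
import pandas as pd

X_test = test_df.drop(columns=["label"])
y_true = test_df["label"].to_numpy()
y_pred = model.predict(X_test)

errors = pd.DataFrame({"error": y_true != y_pred})
for cohort_col in ["region", "age_bucket"]:  # candidate (non-sensitive) cohort columns
    errors[cohort_col] = test_df[cohort_col].to_numpy()
    rates = errors.groupby(cohort_col)["error"].mean().sort_values(ascending=False)
    print(f"Error rate by {cohort_col}:\n{rates}\n")
```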
Using sensitive attributes [06:13]
Roland Meertens: Would you then recommend that people explicitly gather these sensitive attributes for model evaluation, or maybe even for model training? Or would you rather recommend not collecting them and not using them at all?
Mehrnoosh Sameki: Great question. You are now alluding to a question that gets asked a lot: if we don't collect it and the model never knows, then what is the risk for fairness? Essentially, we are completely closing our eyes to these sensitive attributes with the best intent. The challenge is that there are often other factors that leak information about sensitive attributes. So even if you haven't used sensitive attributes in your machine learning model, there might be another factor that seems very non-sensitive, like neighborhood, for instance. I was working on a hiring scenario, and one feature we saw was time out of the job market, which might be an interesting feature for the model. But then we realized that it was leaking information about gender, because a lot of women take more time off for maternity leave, since maternity leave periods are often longer than paternity leave.
So in such cases I really recommend that, if they can legally collect it, they omit it from model training, just pretend they don't have it, but still bring it as metadata to the evaluation phase, just to make sure that no other non-sensitive factor has leaked information about that sensitive factor.
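A minimal sketch of that workflow uses Fairlearn's MetricFrame to disaggregate evaluation metrics by a sensitive attribute that was deliberately excluded from training; the DataFrames and column names ("gender", "label") are illustrative assumptions, not from the interview.

```python
# Sketch: keep the sensitive attribute out of the training features, but bring it
# back as evaluation metadata to check for leakage via proxy features.
from fairlearn.metrics import MetricFrame
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score

X_train = train_df.drop(columns=["label", "gender"])   # gender is not a model feature
X_test = test_df.drop(columns=["label", "gender"])

model = GradientBoostingClassifier().fit(X_train, train_df["label"])
y_pred = model.predict(X_test)

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "recall": recall_score},
    y_true=test_df["label"],
    y_pred=y_pred,
    sensitive_features=test_df["gender"],              # used only to disaggregate evaluation
)
print(mf.overall)
print(mf.by_group)                                     # large gaps hint at proxy leakage
```

If the per-group metrics diverge sharply even though the attribute was never a feature, some other feature is probably acting as a proxy for it.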
Roland Meertens: And would you then, during your evaluation step, do something like demographic parity, where you're comparing all the different people from different domains with different backgrounds?
Mehrnoosh Sameki: That is a great question again. There are many different ways you could enhance and improve the fairness of an AI system. You can never guarantee fairness, but you can improve it with the information you have at hand, and demographic parity is one of those criteria. There are others, like equalized odds or equal opportunity. The way I look at it is, when we sit down with a team that is interested in fixing, mitigating, or improving the fairness of their AI, we try to understand what is top of mind for them. In other words, what would a disaster look like for them?
Some companies might say, "I want the distribution of the favorable outcome to be exactly the same across different demographic groups." So then we go with that approach: exactly as you mentioned, we look into different demographic groups and look at the percentage of the favorable outcome the AI distributes across them. Some people say, for instance in an HR scenario, "I want to make sure that I'm making errors in a systematically similar way for women, men, and non-binary people. Imagine these people apply for my job; I want to make sure that the qualified ones, based on ground truth, get the same acceptance rate, and the unqualified ones get the same rejection rate." In other words, you want to go deeper into each demographic, divide it into more buckets, and then compare them. So it really depends on what your goal is, and demographic parity is definitely one of the top ones we see with customers.
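Both criteria mentioned here are available as single-number gaps in Fairlearn; a hedged sketch, reusing the illustrative variables from the previous example:

```python
# Sketch: two common group-fairness criteria, computed on held-out predictions.
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference

# 0.0 means the favorable-outcome (selection) rate is identical across groups.
dp_gap = demographic_parity_difference(
    y_true, y_pred, sensitive_features=test_df["gender"]
)

# 0.0 means true-positive and false-positive rates match across groups, i.e. the
# model makes its errors in a systematically similar way for each group.
eo_gap = equalized_odds_difference(
    y_true, y_pred, sensitive_features=test_df["gender"]
)

print(f"demographic parity difference: {dp_gap:.3f}")
print(f"equalized odds difference:     {eo_gap:.3f}")
```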
Trade Offs between performance and fairness [09:24]
Roland Meertens: Do you then ever get difficult trade-offs where, for example, you manage to improve the overall performance of your system except for one group? What do you do in such a case?
Mehrnoosh Sameki: Let me first acknowledge that the concept of responsibility is very tough, and often we have to approach it with a very open mind, a growth mindset, and empathy, and always acknowledge that even with the best intentions, mistakes will still happen in the real world. That's where good feedback loops and monitoring loops come into the picture, so that you can at least be very fast to respond. So to your point, yes, we see trade-offs all the time. We see trade-offs in two different facets when it comes to fairness. Sometimes you choose a couple of fairness criteria, and improving one, as far as you can measure it, ends up pushing the other one in the wrong direction. Sometimes it's fairness versus performance. I should say this doesn't happen all the time; sometimes you see that, "Wow, I improved the fairness of my model, and overall it brought up the performance for everyone."
But sometimes we do see that challenge, and then the way we tackle it is to provide different versions of a model. One of our mitigation techniques, available in open source under the package called Fairlearn, provides you with different model candidates. Each one has a fairness metric and a performance metric attached to it, and then it puts them on a diagram. Essentially, you're looking at a performance-versus-fairness trade-off, and you might say that you are okay losing five percent of performance in exchange for this much improvement in fairness. These types of decisions are made very dynamically with the stakeholders in the room.
Roland Meertens: It's like an ROC curve, but then fairness versus performance.
Mehrnoosh Sameki: Exactly. So you could take a look and ask, "Okay, for this particular use case, am I okay losing a little bit of performance?" You are trying to weigh your losses. If I lose performance, how much is that loss? But with fairness, the loss is beyond words, right? If you lose the trust of, say, a certain group in society, then I would say your business is gradually going to fall behind, with all the competition that is out there. People these days are very aware and educated about these concepts, and I'm so proud that in every single customer conversation we have, they genuinely care about making sure that everyone is served well by the AI system.
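A sketch of what that candidate sweep can look like with Fairlearn's reductions-based GridSearch; the estimator choice, grid size, and variable names are illustrative, and in practice you would plot the resulting points rather than print them.

```python
# Sketch of the multi-candidate mitigation idea: GridSearch trains a family of
# models, and each candidate gets a (performance, fairness-gap) point to compare.
from fairlearn.reductions import GridSearch, DemographicParity
from fairlearn.metrics import demographic_parity_difference
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

sweep = GridSearch(
    LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
    grid_size=20,                                   # number of candidate models
)
sweep.fit(X_train, y_train, sensitive_features=sensitive_train)

for i, candidate in enumerate(sweep.predictors_):
    y_pred = candidate.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    gap = demographic_parity_difference(
        y_test, y_pred, sensitive_features=sensitive_test
    )
    print(f"candidate {i:02d}: accuracy={acc:.3f}, demographic parity gap={gap:.3f}")
```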
Tools to help [11:50]
Roland Meertens: And you mentioned you have a tool which can give you insights into your data. How does this work?
Mehrnoosh Sameki: That's a great question. We have built several open-source tools that are absolutely free to use; people can just install them with a pip install. The three tools we started with are individual tools, each focusing on one concept. One of them is called InterpretML. That is a tool you can pass your machine learning model to, and it provides explainability, or feature importance values, both at the overall level ("What are the top factors impacting your model's predictions?") and at the individual level ("For Mehrnoosh, what are the top factors impacting the model's prediction?"). So that is one tool, for transparency specifically.
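A minimal sketch of that InterpretML workflow using its glass-box ExplainableBoostingClassifier (the package also wraps black-box explainers); the training and test variables are assumed to exist.

```python
# Sketch: a glass-box model plus global and local feature-importance views.
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

show(ebm.explain_global())                        # top factors driving predictions overall
show(ebm.explain_local(X_test[:5], y_test[:5]))   # top factors for individual predictions
```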
Then we worked on Fairlearn, with a focus on the concept of group fairness, so exactly as you said, how different demographic or sensitive groups are treated by the AI. It has a collection of fairness metrics and unfairness mitigation techniques that you can incorporate in your ML lifecycle. The third tool we worked on is called Error Analysis, and its goal is to give you the blind spots of your AI, because often people use aggregate metrics. For instance, they say, "My model is 95% accurate." That is a great proxy for building trust with that AI, but if you dive deeper you might realize, "Oh, but for, say, Middle Eastern women under 40, the model has a 95% error rate." Because they're a small portion of the dataset, you don't see it in the aggregate. Error Analysis puts those blind spots in front of you, and once you see them, you can't miss them.
More recently we brought them all together under one roof, in what's called the Responsible AI Dashboard, again open source. It gives you interpretability, transparency, fairness, and error analysis, plus other capabilities like causal analysis, which is how you would use historical data to inform some of your real-world decisions and interventions. It also has a capability for counterfactual analysis, which shows you the closest data points with opposing predictions, or in other words, the bare minimum change you could apply to the features so that the model would predict the opposite outcome. All of these tools under the Responsible AI Dashboard aim to give you the ability to identify your model's issues, errors, and fairness problems, and to diagnose them by diving deeper. And then we have a set of mitigation techniques to help you either augment your data or mitigate the fairness issue you just identified.
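A hedged sketch of wiring a model into the open-source Responsible AI Dashboard via the responsibleai and raiwidgets packages; the exact API can vary by version, and the DataFrames and column names are illustrative.

```python
# Sketch: assemble the analyses described above into one interactive dashboard.
# train_df/test_df are pandas DataFrames that include the target column.
from raiwidgets import ResponsibleAIDashboard
from responsibleai import RAIInsights

rai_insights = RAIInsights(
    model, train_df, test_df, target_column="label", task_type="classification"
)
rai_insights.explainer.add()                     # interpretability / feature importance
rai_insights.error_analysis.add()                # blind-spot discovery
rai_insights.counterfactual.add(total_CFs=10, desired_class="opposite")
rai_insights.causal.add(treatment_features=["feature_of_interest"])
rai_insights.compute()

ResponsibleAIDashboard(rai_insights)             # launches the interactive dashboard
```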
Roland Meertens: And for solving this, is it a matter of collecting more data? Is it a matter of balancing the data during training? How does one go about solving this?
Mehrnoosh Sameki: It could look like 20-plus different things, so let me just categorize a few of them for you. Sometimes it is, wow, you're really under-representing a certain group in your training data, and that's why the model never learns how to respond for that group. Sometimes you do have good representation of that demographic group, but in your training data it's always associated with negative outcomes. For instance, imagine you collected loan data, and in your dataset only young people get rejected for their loans. The model then probably picks up a correlation with age: the lower the age, the higher the rejection rate.
So sometimes it's about augmenting data, collecting more data on a completely erased population, or collecting higher-quality, more diverse data on a particular population.
Sometimes you take a look at the model explanation and think, "Based on my common sense and my domain expertise, I don't see a factor I'd expect showing up among the top important factors. So maybe I don't have the right features, or I have a model that is too simple to pick up on the complexity of the features I have." In those cases, maybe you chose an overly simple model and you move to a more advanced one, or you go and augment your feature set to come up with a better one. All of those are possible, and depending on what you diagnose, you can apply a targeted mitigation.
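As a concrete instance of the first category, augmenting the representation of a cohort, here is a minimal sketch that upsamples an under-represented group with scikit-learn's resample; the cohort column and target size are illustrative, and collecting genuinely new, diverse data is usually preferable to resampling.

```python
# Sketch: upsample an under-represented cohort in the training data.
import pandas as pd
from sklearn.utils import resample

minority = train_df[train_df["age_bucket"] == "under_30"]
rest = train_df[train_df["age_bucket"] != "under_30"]

# Resample the small cohort (with replacement) up to an illustrative target size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(rest) // 3, random_state=0
)
train_df_balanced = pd.concat([rest, minority_upsampled]).sample(frac=1, random_state=0)
```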
Roland Meertens: Do you know if it's often better to have a more advanced model, so that your model can maybe capture the complexity of the different uses you have, or is it sometimes better to have a simpler model, so that the model is not able to discriminate between people?
Mehrnoosh Sameki: That is a great question. I don't think I have an answer for when to go with which model. What I would say is: always start with the simpler model. Look at the performance of the model and the error blind spots, and then go with something more complex. I feel like everyone is defaulting to super-advanced technology, say the most sophisticated deep learning architecture, from the very get-go, and sometimes the nature of the problem is very simple. So I always recommend people start with a simple model and experiment, and it's okay if they end up with a more complex model, because now we have interpretability techniques that can sit on top of it, treat it as an opaque-box model, and extract explanations using a bunch of heuristics. So they don't lose transparency anymore by using more complex models. But always start from a simple model.
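One way to keep that transparency on a more complex model is a model-agnostic explainer; the sketch below uses SHAP's KernelExplainer as one common heuristic option (not the specific technique named in the interview), assuming any classifier that exposes predict_proba.

```python
# Sketch: explain an "opaque box" model with a model-agnostic explainer.
# KernelExplainer can be slow, so a small background sample and few rows are used.
import shap

background = shap.sample(X_train, 100)            # background sample for the explainer
explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = explainer.shap_values(X_test[:10])  # per-feature contributions for 10 rows
```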
Roland Meertens: And do you have specific models for which this toolbox is particularly suited or does it work for everything?
Mehrnoosh Sameki: For this particular open-source toolbox, we cover essentially any Python model, and even if the model is not in Python, you can wrap its prediction function in a Python file. It does need to have a .predict or .predict_proba conforming to the scikit-learn convention. If you're using a model type that doesn't have predict or predict_proba conforming to scikit-learn, it's just a matter of wrapping it in a prediction function that transforms the format of your model's output into predict or predict_proba. Because of that wrapper trick, we support any scikit-learn model, PyTorch, Keras, TensorFlow, or anything else.
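A minimal sketch of that wrapper trick: adapting an arbitrary prediction function to the scikit-learn predict/predict_proba convention the tools expect. The class name and the wrapped function are hypothetical.

```python
# Sketch: expose predict/predict_proba around any prediction callable
# (e.g. a PyTorch forward pass or a remote inference call).
import numpy as np

class SklearnStyleWrapper:
    def __init__(self, raw_predict_fn):
        # raw_predict_fn: takes a 2-D numpy array, returns class probabilities
        self._raw_predict_fn = raw_predict_fn

    def predict_proba(self, X):
        return np.asarray(self._raw_predict_fn(np.asarray(X)))

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

wrapped_model = SklearnStyleWrapper(my_raw_model_fn)   # now usable by the dashboard tools
```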
Large Language Models [17:59]
Roland Meertens: That sounds pretty good. Nowadays some of the more popular models are these ChatGPT-style models, these large language models. Do you think they will democratize the field, because now more people can use these tools, or do you think they will rather drive people further apart and introduce more bias?
Mehrnoosh Sameki: I absolutely think it is going to put AI in every single person's hands. I personally see literally everyone, from the tech world or outside it, talking about these models. I have friends outside the tech world using them on a day-to-day basis for everything they're doing right now; they're like, "Oh, how was life before these models?" And now we see this huge move of companies wanting to use such models in their ML lifecycle. So the answer is absolutely yes. One thing I have to say, though, is that the challenges we have with such models are fundamentally different from what I can now call more traditional ML, and I'm putting deep learning in that bucket as well. Previously we saw a lot of people wanting the tools we have: fairness, interpretability, error analysis, causal analysis, and so on. The concerns are still the same, and I can talk a little bit about what challenges arise with generative AI, but it needs fundamentally different tools, because the nature of the problem is completely different.
Roland Meertens: So now that more people are going to use it, is that a danger you see?
Mehrnoosh Sameki: The only danger I'm seeing is that the mindset should be to not over-rely on generative AI. Previously, when we had machine learning models, there was a lot more focus on assessing them: I'm going to evaluate it, I'm going to collaborate with a QA team on its results, things like that. Because generative AI models are super smart, and we are all super excited about them and surprised by their capability, I sometimes see the challenge of over-reliance. That's the part I'm hoping to work on with my team: better tools, or a new set of tools, for measuring generative AI issues, errors, and harms, and then helping people mitigate them. If we can change our mindset from "Wow, these are literally replacing all of us" to "These are our pets, our toys, and we are not going to over-rely on them; we still need the right assessment, the right mitigation, the right monitoring, and the right human in the loop," then I think there is so much potential, and we can minimize the harms they could cause.
Roland Meertens: Yes, especially if you are using them for zero-shot inference, where you can basically deploy any application in minutes if you want.
Mehrnoosh Sameki: Absolutely. In fact, when they originally came out I thought, "Oh, people are going to really fine-tune them on some data." Then it turned out that, because fine-tuning is quite expensive in such scenarios, all people are doing is zero-shot or few-shot. So it becomes even more important to test how they're going to do in a real-world scenario.
Roland Meertens: Do you then recommend that people take the output of these models and train a model on top of that again, so that you still control the entire training data pipeline, or the information pipeline?
Mehrnoosh Sameki: I don't think that would be the approach. The approach might be, for instance, that Microsoft has worked on content moderation capabilities that we expose on top of these models. They automatically sit on top of the output of the large language model and filter sexist, racist, harmful, or hateful language for you. But there is so much work that needs to happen in terms of the evaluation lifecycle, and it's not necessarily about training a new model on top. There should be a mechanism to make sure that the answers are measured before the system is even deployed, that the answers are grounded in reality for whatever context it is, and that there is no leakage of PII, so things like email addresses, social security numbers, or any other personally identifiable information. There also need to be filters in place to measure things like stereotyping, hate, or demeaning language; the content moderation capability I mentioned is definitely going to help with that. And there needs to be work around inaccurate text, which I mentioned as groundedness.
I think we as the tooling team need to step up and create a lot of different tools that people can incorporate in their ML lifecycle, and then we can help them with the right red teaming. One challenge we have here is that with traditional models, people usually had a test dataset set aside that came with ground truth, the actual source labels. They would hide those ground truth labels, pass the test data to their model, get the answers, and compare. That is not true anymore for generative AI, so it fundamentally requires things like bringing the stakeholders into the room to generate that dataset, or automating the generation of that dataset for them. So there's a lot of work that needs to happen so that people don't have to train another model on top of generative AI, which would defeat the purpose of that super-fast deployment.
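To make the PII-filtering point concrete, here is an illustrative post-processing sketch over LLM output; the regex patterns are deliberately simplistic, and the moderation call is a hypothetical placeholder rather than a real API.

```python
# Illustrative sketch only: redact obvious PII patterns in LLM output before any
# downstream use. A hosted content-moderation service would sit alongside this.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(text: str) -> str:
    text = EMAIL.sub("[REDACTED EMAIL]", text)
    text = SSN.sub("[REDACTED SSN]", text)
    return text

def postprocess(llm_output: str) -> str:
    safe_text = redact_pii(llm_output)
    # moderate(safe_text)  # hypothetical call to a content-safety / moderation service
    return safe_text
```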
Roland Meertens: For example, you mentioned red teaming, and the security boundaries for ChatGPT. We see that as soon as you try to do something unethical, or something which leads towards the unethical, it starts with this whole rant about being a large language model and not being able to do that, but we also see that people immediately defeat it. What do you think about the current security which is in place? Do you think that's enough? Do you think it's too little? Do you think it's the perfect amount?
Mehrnoosh Sameki: Definitely a lot more could happen, because we saw all these examples of... What you're referring to is known as jailbreaking, essentially, or manipulation. You provide an instruction to a large language model, and then you see, not immediately, but through some trials of putting it in the hands of users, that people get around that instruction. I think there are ways we could also detect things like jailbreaks. The very first stage is to understand why it's happening and in what scenarios it's happening. That's why I think red teaming is extremely important: people can attack the same model the company is working on and try to figure out the patterns that emerge. With those patterns, they can go and generate a lot more data and try to understand whether they can improve their prompt instructions to the model, or whether they can write post-processing techniques on top of the model's output. But even in that case, I think it comes down to red teaming, figuring out patterns, and seeing whether those patterns can lead to more reliable prompt instructions that are harder to manipulate.
Roland Meertens: That would be great. So if I want to get started with responsible AI, what would you recommend? Either for someone who is new to the field, or for someone who already has models they are deploying on a daily basis, what would you recommend they do?
Mehrnoosh Sameki: There are so many amazing resources out there, mostly by the big companies. Microsoft, for instance, but I've seen many other cloud providers and many consulting companies also have their own handouts about what responsible AI in practice means. One great place to start is the responsible AI standard that Microsoft publicly released, I think in June or July last year. It walks you through the six principles I mentioned at the very beginning and the considerations you need to have in place around each of these areas. So when it comes to fairness, what are the concerns, and what checks and balances do you need to put in place? The same for transparency, reliability and safety, privacy and security, inclusiveness, and accountability. And then there are so many great papers out there; if you search for fairness in AI, or interpretability and transparency in AI, there are many fundamental papers that explain the field really well.
Another great place to start is the open-source tools. As I said, tools like the Responsible AI Dashboard, Fairlearn, InterpretML, and Error Analysis are all open source, and I've seen many great tools from other research groups and companies as well. So what I would say is: start with reading one of these standards by the companies and educate yourself about the field. Look into an impact assessment form; Microsoft also has a PDF out there which you can access to see what questions you should ask yourself in order to identify what challenges might arise around your AI, or which demographic groups you need to pay attention to. Going through that impact assessment form will educate you, or at least bring the awareness that, "Okay, maybe I should use a fairness tool. Maybe I should use a reliability tool." And then there are so many open-source tools, the ones I named and many more out there, that can help you get started.
Roland Meertens: Thank you very much for joining the InfoQ podcast.
Mehrnoosh Sameki: Thank you so much for having me.
Roland Meertens: And I really hope that people are going to watch your talk either tomorrow or online.
Mehrnoosh Sameki: Thank you. Appreciate it.