We’ve seen it before, we’re seeing it again now with the recent Apple and Goldman Sachs alleged credit card bias issue, and we’ll very likely continue seeing it well into 2020 and beyond. Bias in AI is there, it’s usually hidden, (until it comes out), and it needs a foundational fix.
This past weekend we saw just how quickly the Apple Card, managed by Goldman Sachs, issue spiralled out of control. What started as a tweet thread with multiple reports of alleged bias (including from Apple’s very own co-founder, Steve Wozniak and his spouse), eventually led to a regulator opening an investigation into Goldman Sachs and their algorithm-prediction practices.
If we dig into the allegations to find the source of the problem, two issues stand out:
The algorithm making credit decisions for the Apple card is biased
The customer support teams from Goldman Sachs and Apple had zero insight into how the algorithm worked when they were asked to explain certain decisions
For (1) above we saw multiple responses to the original tweet thread that corroborated the allegation. Multiple people had faced a similar outcome where all other input factors being the same (or in some cases higher: like a higher annual income or credit score), and gender being the only difference, they were given significantly lower credit limits than their male spouse. This definitely comes across as problematic. Following the allegations, a separate group of non-related women and men ran a test experiment to check for bias and noticed significant differences in credit limits. Men with bad credit scores and irregular income got better offers than women with high incomes and good credit scores.
Was the algorithm biased against women? We can’t say for sure because we don’t know. This is the key – we don’t know what’s going on inside the algorithm to analyze the root-cause. But based on external outcomes, we’re guesstimating that this is likely what happened.
Impact of this problem
The primary issue here is with the black-box algorithm that generated Apple’s credit lending decisions. As laid out in the tweet thread, Apple Card’s customer service reps were rendered powerless to the algorithm’s decision. Not only did they have no insight into why certain decisions were made, they were unable to override it.
Humans don’t want a future ruled by algorithms, especially biased ones. Algorithms permeate all aspects of our lives today from lending and housing decisions to decisions in criminal justice. If we continue to let algorithms operate the way they do today, in the black-box and without human oversight, it results in a dystopian view of the world where unfair decisions are made by unseen algorithms operating in the unknown.
How could this issue have been avoided or at least handled better? Let’s come back to the initial statement above on how bias in AI needs a foundational fix. It’s very likely that in this particular credit lending decision, the algorithm was trained on biased data to begin with. To better understand this, let’s look at the high-level lifecycle of an AI solution:
Identify a use case for AI (credit lending, criminal justice, cancer prediction, etc.)
Access historical data to build models
Import this data to train and test models
Test and validate models
Deploy models into production
Monitor models in production to ensure optimal performance
If the data is flawed to begin with, this flaw permeates into everything that an algorithm does going forward. What we need is a way to check for bias and other issues in both data and models through all stages of the AI lifecycle. What we also need is human oversight – AI is just simply not ready to function on its own. We need humans-in-the-loop who will ensure that AI is functioning as it should.
In the Apple credit card example above, it’s likely this issue could have been avoided if humans had visibility into every stage of the AI lifecycle. They could have seen examples in the test and validate stage of how the model was behaving when a certain input factor was isolated and compared with the global dataset. They could have also had the ability to override an algorithm’s prediction in the test/validate stage if they felt it was unfair or incorrect. This would have resulted in an algorithm that was getting trained in the right way to produce accurate results when in production.
How Fiddler solves for this
This is exactly what we’re addressing at Fiddler. We’re working to unlock the AI black-box and empower all relevant stakeholders with more visibility into their AI than exists today. We’re working on infusing visibility and insight with explainable AI into every stage of an AI solution’s lifecycle: right from the data and training of it to when a model is deployed and in production.
We dive into the details to explain each prediction a model makes – whether a training, test, or production model – so users can understand the why behind individual decisions.
Our goal is to empower users with easy to grasp explanations of AI decisions. This empowers different stakeholders in an organization: data scientists are empowered to build the best and most accurate models, risk officers are empowered to publicize models with minimal-risk, and customer support representatives are empowered to answer customer questions around the why behind decisions.
With Fiddler’s explainable AI built into the AI lifecycle, it helps teams ensure they are compliant with regulations and are protecting their algorithms from inherent and hidden bias.
We’re continuing to build capabilities into Fiddler to ensure explainability is infused throughout the AI lifecycle and are working with a variety of customers to build this functionality into their existing and new models. If you’re interested in working with us, please reach out.
It is a bipartisan sentiment that, left unchecked, AI can pose a risk to fairness in financial services. While the exact extent of this danger might be debated, governments in the US and abroad acknowledge the necessity and assert the right to regulate financial institutions for this purpose.
The June 26 hearing was the first wake-up call for financial services: they need to be prepared to respond and comply with future legislation requiring transparency and fairness.
In this post, we review the notable events of this hearing, and we explore how the US House is beginning to examine the risks and benefits of AI in financial services.
Two new House Task Forces to regulate fintech and AI
The fintech task force should have a nearer-term focus on applications (e.g. underwriting, payments, immediate regulation).
The AI task force should have a longer-term focus on risks (e.g. fraud, job automation, digital identification).
And explicitly, Chairwoman Waters explained her overall interest in regulation:
Make sure that responsible innovation is encouraged, and that regulators and the law are adapting to the changing landscape to best protect consumers, investors, and small businesses.
The appointed chairman of the Task Force on AI, Congressman Bill Foster (D-IL), extolled AI’s potential in a similar statement, but also cautioned,
It is crucial that the application of AI to financial services contributes to an economy that is fair for all Americans.
This first hearing did find ample AI applications in financial services. But it also concluded that these worried sentiments are neither misrepresentative of their constituents nor misplaced.
Risks of AI
In a humorous exchange later in the hearing, Congresswoman Sylvia Garcia (D-TX) asks a witness, Dr. Bonnie Buchanan of the University of Surrey, to address the average American and explain AI in 25 words or less. It does not go well.
DR. BUCHANAN I would say it’s a group of technologies and processes that can look at determining general pattern recognition, universal approximation of relationships, and trying to detect patterns from noisy data or sensory perception.
REP. GARCIA I think that probably confused them more.
DR. BUCHANAN Oh, sorry.
Beyond making jokes, Congresswoman Garcia has a point. AI is extraordinarily complex. Not only that, to many Americans it can be threatening. As Garcia later expresses, “I think there’s an idea that all these robots are going to take over all the jobs, and everybody’s going to get into our information.”
In his opening statement, task force ranking member Congressman French Hill (R-AR) tries to preempt at least the first concern. He cites a World Economic Forum study that the 75 million jobs lost because of AI will be more than offset by 130 million new jobs. But Americans are still anxious about AI development.
overwhelming support for careful management of robots and/or AI (82% support)
more trust in tech companies than in the US government to manage AI in the interest of the public
mixed support for developing high-level machine intelligence (defined as “when machines are able to perform almost all tasks that are economically relevant today better than the median human today”)
This public apprehension about AI development is mirrored by concerns from the task force and experts. Personal privacy is mentioned nine times throughout the hearing, notably in Congressman Anthony Gonzalez’s (R-OH) broad question on “balancing innovation with empowering consumers with their data,” which the panel does not quite adequately address.
But more often, the witnesses discuss fairnessand how AI models could discriminate unnoticed. Most notably, Dr. Nicol Turner-Lee, a fellow at the the Brookings Institution, suggests implementing guardrails to prevent biased training data from “replicat[ing] and amplify[ing] stereotypes historically prescribed to people of color and other vulnerable populations.”
And she’s not alone. A separate April 2019 Brookings report seconds this concern of an unfairness “whereby algorithms deny credit or increase interest rates using a host of variables that are fundamentally driven by historical discriminatory factors that remain embedded in society.”
So if we’re so worried, why bother introducing the Pandora’s box of AI to financial services at all?
Benefits of AI
AI’s potential benefits, according to Congressman Hill, are to “gather enormous amounts of data, detect abnormalities, and solve complex problems.” In financial services, this means actually fairer and more accurate models for fraud, insurance, and underwriting. This can simultaneously improve bank profitability and extend services to the previously underbanked.
Both Hill and Foster cite a National Bureau of Economic Research working paper finding where in one case, algorithmic lending models discriminate 40% less than face-to-face lenders. Furthermore, Dr. Douglas Merrill, CEO of ZestFinance and expert witness, claims that customers using his company’s AI tools experience higher approval rates for credit cards, auto loans, and personal loans, each with no increase in defaults.
Moreover, Hill frames his statement with an important point about how AI could reshape the industry: this advancement will work “for both disruptive innovators and for our incumbent financial players.” At first this might seem counterintuitive.
“Disruptive innovators,” more agile and hindered less by legacy processes, can have an advantage in implementing new technology. But without the immense budgets and customer bases of “incumbent financial players,” how can these disruptors succeed? And will incumbents, stuck in old ways, ever adopt AI?
Mr. Jesse McWaters, financial innovation lead at the World Economic Forum and the final expert witness, addresses this apparent paradox, discussing what will “redraw the map of what we consider the financial sector.” Third-party AI service providers — from traditional banks to small fintech companies — can “help smaller community banks remain digitally relevant to their customers” and “enable financial institutions to leapfrog forward.”
Enabling competitive markets, especially in concentrated industries like financial services, is an unadulterated benefit according to free market enthusiasts in Congress. However, “redrawing the map” in this manner makes the financial sector larger and more complex. Congress will have to develop policy responding to not only more complex models, but also a more complex financial system.
This system poses risks both to corporations, acting in the interest of shareholders, and to the government, acting in the interest of consumers.
Business and government look at risks
Businesses are already acting to avert potential losses from AI model failure and system complexity. A June 2019 Gartner report predicts that 75% of large organizations will hire AI behavioral forensic experts to reduce brand and reputation risk by 2023.
However, governments recognize that business-led initiatives, if motivated to protect company brand and profits, may only go so far. For a government to protect consumers, investors, and small businesses (the relevant parties according to Chairwoman Waters), a gap may still remain.
As governments explore how to fill this gap, they are establishing principles that will underpin future guidance and regulation. The themes are consistent across governing bodies:
AI systems need to be trustworthy.
They therefore require some government guidance or regulation from government representing the people.
This guidance should encourage fairness, privacy, and transparency.
In the US, President Donald Trump signed an executive order in February 2019 “to Maintain American Leadership in Artificial Intelligence,” directing federal agencies to, among other goals, “foster public trust in AI systems by establishing guidance for AI development and use.” The Republican White House and Democratic House of Representatives seem to clash at every turn, but they align here.
The EU is also establishing a regulatory framework for ensuring trustworthy AI. Likewise included among the seven requirements in their latest communication from April 2019: privacy, transparency, and fairness.
And June’s G20 summit drew upon similar ideas to create their own set of principles, including fairness and transparency, but also adding explainability.
These governing bodies are in a fact-finding stage, establishing principles and learning what they are up against before guiding policy. In the words of Chairman Foster, the task force must understand “how this technology will shape the questions that policymakers will have to grapple with in the coming years.”
Conclusion: Explain your models
An hour before Congresswoman Garcia’s amusing challenge, Dr. Buchanan reflected upon a couple common themes of concern.
Policymakers need to be concerned about the explainability of artificial intelligence models. And we should avoid black-box modeling where humans cannot determine the underlying process or outcomes of the machine learning or deep learning algorithms.
But through this statement, she suggests a solution: make these AI models explainable. If humans can indeed understand the inputs, process, and outputs of a model, we can trust our AI. Then throughout AI applications in financial services, we can promote fairness for all Americans.
Zhang, Baobao and Allan Dafoe. “Artificial Intelligence: American Attitudes and Trends.” Oxford, UK: Center for the Governance of AI, Future of Humanity Institute, University of Oxford, 2019. https://ssrn.com/abstract=3312874
‘In 1935, the Federal Home Loan Bank Board asked the Home Owners’ Loan Corporation to look at 239 cities and create “residential security maps” to indicate the level of security for real-estate investments in each surveyed city. On the maps, “Type D” neighborhoods were outlined in red and were considered the most risky for mortgage support..
‘In the 1960s, sociologist John McKnight coined the term “redlining” to describe the discriminatory practice of fencing off areas where banks would avoid investments based on community demographics. During the heyday of redlining, the areas most frequently discriminated against were black inner city neighborhoods…’
Redlining is clearly unfair, since the decision to invest was not based on an individual homeowner’s ability to repay the loan, but rather on location; and that basis systematically denied loans to one racial group, black people. In fact, part 1 of a Pulitzer Prize-winning series in the Atlanta Journal-Constitution in 1988 suggests that location was more important than income: “Among stable neighborhoods of the same income [in metro Atlanta], white neighborhoods always received the most bank loans per 1,000 single-family homes. Integrated neighborhoods always received fewer. Black neighborhoods — including the mayor’s neighborhood — always received the fewest.”
More recently, in 2018, WUNC reported that blacks and latinos in some cities in North Carolina were denied mortgages at higher rates than whites:
“Lenders and their trade organizations do not dispute the fact that they turn away people of color at rates far greater than whites. But they maintain that the disparity can be explained by two factors the industry has fought to keep hidden: the prospective borrowers’ credit history and overall debt-to-income ratio. They singled out the three-digit credit score — which banks use to determine whether a borrower is likely to repay a loan — as especially important in lending decisions.”
The WUNC example raises an interesting point: it is possible to look unfair via one measure (loan rates by demographic), but not by another (ability to pay as judged by credit history and debt-to-income ratio). Measuring fairness is complicated. In this case, we can’t tell if the lending practices are fair because the data on credit history and debt-to-income ratio for these particular groups are not available to us to evaluate lenders’ explanations of the disparity.
In 2007, the federal reserve board (FRB) reported on credit scoring and its effects on the availability and affordability of credit. They concluded that the credit characteristics included in credit history scoring models are not a proxy for race, although different demographic groups have substantially different credit scores on average, and “for given credit scores, credit outcomes — including measures of loan performance, availability, and affordability — differ for different demographic groups.” This FRB study supports the lenders’ claims that credit score might explain disparity in mortgage denial rates (since demographic groups have different credit scores), while also pointing out that credit outcomes are different for different groups.
Is this fair or not?
As machine learning (ML) becomes widespread, there is growing interest in fairness, accountability, and transparency in ML (e.g., the fat* conference and fatml workshops).
Some researchers say that fairness is not a statistical concept, and no statistic will fully capture it. There are many statistical definitions that people try to relate to (if not define) fairness.
First, here are two legal concepts that come up in many discussions on fairness:
Disparate treatment: “unequal behavior toward someone because of a protected characteristic (e.g., race or gender) under Title VII of the United States Civil Rights Act.” Redlining is disparate treatment if the intent is to deny black people loans.
Disparate impact: “practices .. that adversely affect one group of people of a protected characteristic more than another, even though rules applied .. are formally neutral.” (“The disparate impact doctrine was formalized in the landmark U.S. Supreme Court case Griggs v. Duke Power Co. (1971). In 1955, the Duke Power Company instituted a policy that mandated employees have a high school diploma to be considered for promotion, which had the effect of drastically limiting the eligibility of black employees. The Court found that this requirement had little relation to job performance, and thus deemed it to have an unjustified — and illegal — disparate impact.” [Corb2018])
[Lipt2017] points out that these are legal concepts of disparity, and creates corresponding terms for technical concepts of parity applied to machine learning classifiers:
Treatment parity: a classifier should be blind to a given protected characteristic. Also called anti-classification in [Corb2018], or “fairness through unawareness.”
Impact parity: the fraction of people given a positive decision should be equal across different groups. This is also called demographic parity, statistical parity, or independence of the protected class and the score [Fair2018].
There is a large body of literature on algorithmic fairness. From [Corb2018], two more definitions:
Classification parity: some given measure of classification error is equal across groups defined by the protected attributes. [Hard2016] called this equal opportunity if the measure is true positive rates, and equalized odds if there were two equalized measures, true positive rates and false positive rates.
Calibration: outcomes are independent of protected attributes conditional on risk score. That is, reality conforms to risk score. For example, about 20% of all loans predicted to have a 20% chance of default actually do.
There is lack of consensus in the research community on an ideal statistical definition of fairness. In fact, there are impossibility results on achieving multiple fairness notions simultaneously ([Klei2016] [Chou2017]). As we noted previously, some researchers say that fairness is not a statistical concept.
No definition is perfect
Each statistical definition described above has counterarguments.
Treatment parity unfairly ignores real differences. [Corb2018] describes the case of the COMPAS score used to predict recidivism (whether someone will commit a crime if released from jail). After controlling for COMPAS score and other factors, women are less likely to recidivate. Thus, ignoring sex in this prediction might unfairly punish women. Note that the Equal Credit Opportunity Act legally mandates treatment parity: “Creditors may ask you for [protected class information like race] in certain situations, but they may not use it when deciding whether to give you credit or when setting the terms of your credit.” Thus, [Corb2018] implies that this sort of unfairness is enshrined in law.
Impact parity doesn’t ensure fairness (people argue against quotas), and can cripple a model’s accuracy, harming the model’s utility to society. [Hard2016] discusses this issue (using the term “demographic parity”) in its introduction.
Corbett et al. [Corb2018] argue at length that classification parity is naturally violated: “when base rates of violent recidivism differ across groups, the true risk distributions will necessarily differ as well — and this difference persists regardless of which features are used in the prediction.”
They also argue that calibration is not sufficient to prevent unfairness. Their hypothetical example is a bank that gives loans based solely on the default rate within a zip code, ignoring other attributes like income. Suppose that (1) within zip code, white and black applicants have similar default rates; and (2) black applicants live in zip codes with relatively high default rates. Then the bank’s plan would unfairly punish creditworthy black applicants, but still be calibrated.
In summary, likely fairness has no single measure. We took a whirlwind tour of four statistical definitions, two motivated by history and two more recently motivated by machine learning, and summarized the counterarguments to each.
This also means it is challenging to automatically decide if an algorithm is fair. Open-source fairness-measuring packages reflect this by offering many different measures.
However, this doesn’t mean we should ignore statistical measures. They can give us an idea of whether we should look more carefully. Food for thought. We should feed our brain well, it being the most likely to make the final call.
(A note: this subject is rightfully contentious. Our intention is to add to the conversation in a productive, respectful way. We welcome feedback of any kind.)
[Chou2017] Chouldechova, Alexandra. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5, no. 2 (June 1, 2017): 153–63. https://doi.org/10.1089/big.2016.0047.
[Corb2018] Corbett-Davies, Sam, and Sharad Goel. “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning.” ArXiv:1808.00023 [Cs], July 31, 2018. http://arxiv.org/abs/1808.00023.
[Hard2016] Hardt, Moritz, Eric Price, and Nathan Srebro. “Equality of Opportunity in Supervised Learning.” ArXiv:1610.02413 [Cs], October 7, 2016. http://arxiv.org/abs/1610.02413.
[Klei2016] Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” ArXiv:1609.05807 [Cs, Stat], September 19, 2016. http://arxiv.org/abs/1609.05807.
[Lipt2017] Lipton, Zachary C., Alexandra Chouldechova, and Julian McAuley. “Does Mitigating ML’s Impact Disparity Require Treatment Disparity?” ArXiv:1711.07076 [Cs, Stat], November 19, 2017. http://arxiv.org/abs/1711.07076.