Surfacing Hidden User Data: Multi-Step Data-Science Approaches Using NLP Methods

Written by Noa Haas. Published Jan 24, 2019 (updated Nov 10, 2022).

Missing data is a critical problem for data scientists. It can lead to invalid analysis and predictions, as well as a degraded user experience. I’m a data scientist at Intuit working on Mint, and we came up with a solution for missing financial data.

Mint enables users to keep track of their finances in one place. They enter information about their financial accounts, and the software automatically downloads transactions so users can create budgets, schedule bill payments, and keep track of their balances. Sometimes users don’t add all their accounts, which prevents them from getting the full benefit of the product. Our challenge was to detect which users had missing financial accounts, and then to make it easy for them to add those accounts to their financial profile.

In the classic case of missing data, some values are missing for some users within the structured fields of a data set. For example, you might know the value of the “age” field for only some of your users. For those missing a value in the “age” field, you might be able to use supervised learning methods to impute it, because the users with a known age serve as labeled examples.

In our case, the existing structured data gave no signal as to which data was missing for which users. In other words, nobody had flagged their own profile as missing financial accounts. This put us in an unsupervised framework. Our first objective was to reveal the users who had provided incomplete financial data, i.e., users who had financial accounts not listed in our app.
We did that by learning and characterizing the textual aspects of personal account transactions in our existing data. We found that some users were extremely likely to have missing financial accounts, because their listed accounts contained transfer transactions of the kind that normally appear as matched pairs, with one side in each account. Using fuzzy matching and domain knowledge, we leveraged these implicit clues and traces of unlisted accounts within the listed accounts’ transactions.

Our second objective was to use machine learning and deep learning NLP methods to infer the missing data source. Our ultimate goal was to detect the name of the financial institution associated with the missing financial account. We realized that the matched pairs from the first stage let us treat this as a supervised problem: we could use them to train a model to predict the name of the financial institution used for personal transfers into known financial accounts. We then used a multi-class classification model whose target was the financial institution associated with the transaction.

Our methodology identified fine linguistic characteristics of the textual descriptions provided by the financial institutions. This representation recognized strong patterns and properties in the text associated with each financial institution, which together form its unique fingerprint. Usually, NLP methods are used in situations where expressive language is unpredictable and varied. In our case, the transaction information was deterministic and computer-generated. Even so, applying probabilistic methods to our data produced highly accurate predictions.

What was fun about this project was that we used different strategies to tackle different aspects of this challenge, all in service of a clear user benefit. Applying diverse learning algorithms in combination with logic-based heuristics in a multi-step process reflects data science at its best.
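The two-stage process described above can be illustrated with a small sketch. This is not Mint’s implementation: the `Txn` record, the transfer keywords, the matching tolerances, and the sample descriptions are all hypothetical, and a hand-rolled character n-gram Naive Bayes classifier stands in for the machine learning and deep learning models used in the project. Stage 1 pairs transfer-like transactions across listed accounts and keeps the unpaired ones as traces of unlisted accounts. Stage 2 trains on the matched pairs, assuming the inflow description is the text and the counterpart account’s institution is the label.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher

@dataclass
class Txn:  # hypothetical record layout, for illustration only
    account: str
    institution: str
    day: date
    amount: float  # negative = money leaving the account
    description: str

def looks_like_transfer(description, keywords=("transfer", "xfer"), threshold=0.8):
    """Fuzzy test: does any token resemble a transfer keyword (e.g. 'TRNSFR')?"""
    return any(SequenceMatcher(None, token, kw).ratio() >= threshold
               for token in description.lower().split() for kw in keywords)

def match_pairs(txns, max_days=3):
    """Stage 1: pair each outflow with an equal inflow in a different account."""
    transfers = [t for t in txns if looks_like_transfer(t.description)]
    inflows = [t for t in transfers if t.amount > 0]
    outflows = [t for t in transfers if t.amount < 0]
    pairs, orphans = [], []
    for out in outflows:
        partner = next((inc for inc in inflows
                        if inc.account != out.account
                        and abs(inc.amount + out.amount) < 0.005
                        and abs((inc.day - out.day).days) <= max_days), None)
        if partner is None:
            orphans.append(out)
        else:
            inflows.remove(partner)
            pairs.append((out, partner))
    # Unpaired transfers hint at an account that is not listed.
    return pairs, orphans + inflows

def char_ngrams(text, n=3):
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramNB:
    """Stage 2 stand-in: multinomial Naive Bayes over character trigrams."""
    def fit(self, texts, labels):
        self.priors = Counter(labels)
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        def score(label):
            c = self.counts[label]
            total = sum(c.values()) + len(self.vocab)  # Laplace smoothing
            return math.log(self.priors[label]) + sum(
                math.log((c[g] + 1) / total) for g in char_ngrams(text))
        return max(self.priors, key=score)

# Toy data; in practice the pairs would come from users who list both accounts.
txns = [
    Txn("alpha-chk", "Alpha Bank", date(2019, 1, 2), -100.0, "ONLINE TRANSFER TO BETA"),
    Txn("beta-chk", "Beta Bank", date(2019, 1, 3), 100.0, "ALPHA ONLINE XFER FROM CHK"),
    Txn("gamma-sav", "Gamma CU", date(2019, 1, 5), -60.0, "TRANSFER WITHDRAWAL"),
    Txn("beta-chk", "Beta Bank", date(2019, 1, 5), 60.0, "GAMMA CU TRNSFR DEP"),
    Txn("beta-chk", "Beta Bank", date(2019, 1, 10), 250.0, "GAMMA CU TRNSFR DEP 0099"),
]
pairs, orphans = match_pairs(txns)
model = NgramNB().fit([inc.description for _, inc in pairs],
                      [out.institution for out, _ in pairs])
for t in orphans:
    print(t.description, "->", model.predict(t.description))
```

Character n-grams are a natural fit here because computer-generated descriptions repeat the same abbreviations and layouts for a given institution, so even a simple probabilistic model can pick up that fingerprint. A production system would need far more data and a more careful definition of what counts as a transfer.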
For more details, please see the poster below.

Noa Haas works as a data scientist at Intuit’s Tel Aviv location. She is passionate about non-standard data-related problems, and enjoys the process of exploring and solving these riddles. She is currently pursuing her MSc in Applied Statistics.