Missing data is a critical problem for data scientists that can lead to invalid analysis and predictions, as well as a degraded user experience.
I’m a data scientist at Intuit working on Mint, and we came up with a solution for missing financial data. Mint enables users to keep track of their finances in one place. They enter information about their financial accounts, and the software automatically downloads transactions so users can create budgets, schedule bill payments, and keep track of their balances. Sometimes users don’t add all their accounts, which prevents them from having an optimal experience of the product. Our challenge was to detect which users had missing financial accounts, and then to make it easy for them to add those accounts to their financial profile.
In the classic missing-data problem, some structured fields in a data set are empty for some users. For example, you might know the value of “age” for only some of your users. For those missing a value in the “age” field, you might be able to use supervised learning methods to impute it. In our case, the existing structured data gave no signal as to which data was missing for which users. In other words, nobody had flagged their own profile as missing financial accounts. This put us in an unsupervised setting.
Our first objective was to identify the users who had provided incomplete financial data, i.e., users who had financial accounts not listed in our app. We did that by learning and characterizing the textual aspects of personal account transactions in our existing data. We found that the presence of matched pairs of transactions indicated that some users were extremely likely to have missing financial accounts. Using fuzzy matching and domain knowledge, we exploited implicit clues and traces of unlisted accounts within the listed accounts’ transactions.
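To make the first stage more concrete, here is a minimal Python sketch of one way such a heuristic could look. It is an illustration under my own assumptions, not our production logic: the `Txn` record, the hint words, the similarity threshold, the three-day window, and the sample transactions are all hypothetical. It fuzzy-matches description tokens against transfer-like words, pairs each outgoing transfer with an opposite-amount transaction in another listed account, and treats outgoing transfers with no counterpart as possible traces of an unlisted account.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction record; field names are illustrative."""
    account: str       # listed account the transaction belongs to
    day: date
    amount: float      # negative means money leaving the account
    description: str


# Hypothetical hint words; a real list would come from domain knowledge.
TRANSFER_HINTS = ("transfer", "xfer")


def looks_like_transfer(description, threshold=0.8):
    """Fuzzy-match each token of the description against the hint words."""
    for word in description.lower().split():
        for hint in TRANSFER_HINTS:
            if SequenceMatcher(None, word, hint).ratio() >= threshold:
                return True
    return False


def split_transfers(txns, max_days=3):
    """Return (matched_pairs, unmatched_outgoing_transfers)."""
    outgoing = [t for t in txns if t.amount < 0 and looks_like_transfer(t.description)]
    incoming = [t for t in txns if t.amount > 0]
    used = set()
    pairs, unmatched = [], []
    for out in outgoing:
        match = None
        for i, inc in enumerate(incoming):
            same_amount = abs(inc.amount + out.amount) < 0.005
            close_in_time = abs((inc.day - out.day).days) <= max_days
            if i not in used and inc.account != out.account and same_amount and close_in_time:
                match = i
                break
        if match is None:
            unmatched.append(out)   # no counterpart among the listed accounts
        else:
            used.add(match)
            pairs.append((out, incoming[match]))
    return pairs, unmatched


# Made-up sample data for illustration only.
sample = [
    Txn("checking", date(2020, 1, 2), -500.0, "Online Transfer to SAV 1234"),
    Txn("savings", date(2020, 1, 3), 500.0, "Transfer from CHK 9876"),
    Txn("checking", date(2020, 1, 10), -250.0, "ACH Trnsfr to external acct"),
    Txn("checking", date(2020, 1, 11), -4.5, "Coffee shop"),
]
pairs, unmatched = split_transfers(sample)
```

In this toy data, the 500.00 transfer is paired across the two listed accounts, while the 250.00 transfer has no counterpart and would be flagged for a closer look.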
Secondly, we used machine learning and deep learning NLP methods to infer the missing data source.
Our ultimate goal was to detect the name of the financial institution associated with the missing financial account. We realized that the transaction pairs from our first stage could help us solve this problem in a supervised manner: with some creativity, we could use the matched pairs to train a model to predict the name of the financial institution behind personal transfers into known financial accounts.
Then we used a multi-class classification model whose target was the financial institution associated with the transaction.
Our methodology identified fine-grained linguistic characteristics of the textual descriptions provided by the financial institutions. This representation captured strong patterns and properties in the text associated with each financial institution, forming its unique fingerprint.
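As a rough illustration of the fingerprint idea, the sketch below trains a tiny multi-class classifier on character trigrams of transaction descriptions. It swaps in a simple naive Bayes model with add-one smoothing, which is not the model we used, and the institution names and description formats are invented for the example. In the setting described above, the labels would come from the matched pairs, where the institution on the other side of the transfer is already known.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lowercased character n-grams, padded so word edges are captured."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (illustrative stand-in)."""

    def fit(self, descriptions, labels):
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(descriptions, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)           # class prior
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: description formats and bank names are hypothetical.
train_text = [
    "ONLINE TRANSFER REF #IB0123 TO CHK",
    "ONLINE TRANSFER REF #IB0456 TO SAV",
    "Ext Trnsfr 20200103 Conf 88121",
    "Ext Trnsfr 20200217 Conf 90344",
]
train_label = ["Bank A", "Bank A", "Bank B", "Bank B"]

model = NgramNaiveBayes().fit(train_text, train_label)
guess_a = model.predict("ONLINE TRANSFER REF #IB0999 TO CHK")
guess_b = model.predict("Ext Trnsfr 20200425 Conf 77310")
```

Character n-grams suit this kind of text because machine-generated descriptions repeat the same templates, abbreviations, and reference-number layouts, so even a simple model can pick up each institution's formatting habits.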
Usually, NLP methods are used where language is expressive, unpredictable, and varied. In our case, the transaction information was deterministic and computer-generated. Even so, applying probabilistic methods to our data yielded highly accurate predictions.
What was fun about this project was that we used different strategies to tackle different aspects of this challenge, all in service to a clear user benefit. Applying diverse learning algorithms in combination with logic-based heuristics in a multi-step process reflects data science at its best.
For more details, please see the poster below.