Surfacing Hidden User Data: Multi-Step Data-Science Approaches Using NLP Methods

Technology screen shot 2019-01-16 at 5.14.56 pm

Missing data is a critical problem for data scientists that can lead to invalid analysis and predictions, as well as a degraded user experience.

I’m a data scientist at Intuit working on Mint, and we came up with a solution for missing financial data. Mint enables users to keep track of their finances in one place. They enter information about their financial accounts, and the software automatically downloads transactions so users can create budgets, schedule bill payments, and keep track of their balances. Sometimes users don’t add all their accounts, which prevents them from having an optimal experience of the product. Our challenge was to detect which users had missing financial accounts, and then to make it easy for them to add those accounts to their financial profile.

In the classic case of missing data, in a given data set, some data is missing for some users within the structured fields. For example, you might know the value “age” for some of your users. For those missing a value in the “age” field, you might be able to use supervised learning methods to impute the missing value. In our case, the existing structured data gave no signal as to which data was missing for which users. In other words, nobody had flagged their own profiles as missing financial accounts. This put us in an unsupervised framework.

Our first objective was to reveal the users who had provided incomplete financial data, i.e., users who had financial accounts not listed in our app. We did that through learning and characterizing the textual aspects of personal account transactions using our existing data. We found that some users were extremely likely to have missing financial accounts due to the presence of matched pairs of transactions. Using fuzzy matching and domain knowledge, we leveraged inexplicit clues and traces of unlisted accounts within the listed accounts’ transactions.

Secondly, we used machine learning and deep learning NLP methods to infer the missing data source.

Our ultimate goal was to detect the name of the financial institution associated with the missing financial account. We first detected that the data pairing from our first stage could assist us in solving this problem in a supervised manner. With some creativity, we realized we could use the matched pairs from the first stage to train our model to predict the names of the financial institution used for personal transfers into known financial accounts.

Then we used a multi-classification model where the target was to identify which financial institution was associated with the transaction.

Our methodology identified fine linguistic characteristics of the textual descriptions provided by the financial institutions. This representation recognized strong patterns and properties in the text associated with each financial institution, formulating its unique fingerprint.

Usually, NLP methods are used in situations where expressive language is unpredictable and varied. In our case, the transaction information was deterministic and computer-generated. Still, in this case, the application of probabilistic methods on our data provided extremely highly accurate predictions.

What was fun about this project was that we used different strategies to tackle different aspects of this challenge, all in service to a clear user benefit. Applying diverse learning algorithms in combination with logic-based heuristics in a multi-step process reflects data science at its best.

For more details, please see the poster below.

Comments (1) Leave your comment

  1. Hello there I am so delighted I found your weblog, I really found
    you by accident, while I was researching on Bing for something
    else, Regardless I am here now and would just like to say
    kudos for a incredible post and a all round enjoyable blog (I also
    love the theme/design), I don’t have time to browse it all at
    the minute but I have book-marked it and also included your RSS
    feeds, so when I have time I will be back to read a great deal more, Please do keep up the great job.

Leave a Reply

Your email address will not be published. Required fields are marked *