Tuesday, July 18, 2017

Can our payment institutions innovate?

Bank earnings season is rolling in with Q2/2017 results. Most major banks, including Bank of America, Wells Fargo and Goldman Sachs, have lowered their guidance for the rest of the year, citing many reasons - trading, market making, interest rates etc. The mortgage and lending business also seems to be under pressure. At the same time, Netflix reported record consumer growth in Q2/2017 while also projecting strongly negative free cash flow for the foreseeable future. The market was happy with that and sent the stock surging. It's just part of doing business, right? That sounds too bad for U.S. banks, which are considered fundamental pillars of the economy. Fundamental enough that taxpayers bailed them out a few years back, but they are still not allowed to innovate.

The bitter truth is that innovation in core parts of the economy can make the economy unstable. Innovation implies risk, and risk materializes once in a while. So we allow innovation to happen on the fringes of the economy. Fringe enough that if we lose it, not many people would notice.

Back to FinTech, or innovation in financial services. In the U.S. and EU, FinTech is helping the micro-lending and online banking industries evolve. The pace is much more rapid in developing parts of the world, though, like Africa. We may be surprised to hear that the U.S. is a laggard when it comes to financial innovation. The reason is not that innovators in the U.S. are not smart enough - they simply focus on other important problems like talking robots, space travel and self-driving cars.

So what is going on with FinTech that should worry us? Here is a news flash - China is fast becoming the new FinTech capital of the world, taking that title away from the western world. The transaction volumes processed by Baidu, Tencent and Alibaba are growing so fast that they are set to pass the volumes Visa and Mastercard process in Europe and the U.S. in 2017. China's mobile payment system does not use any high technology, though. It uses bar codes to process payments! That is far inferior to and less secure than the system used by Mastercard and Visa (EMVCo) or by Google/Apple Pay. One thing it gets right, though - it's fast, convenient and cheap, and that's what matters. It has connected about a billion people in China who never had bank accounts or credit history before. So why should we be worried? Chinese companies are slowly bringing this system into the U.S. and EU through partnerships and deals with processors like Stripe and companies like Airbnb. They are so big and powerful that U.S. companies have little choice but to accommodate them. That would slowly cut into the western financial system and could cause a huge disruption in these economies over the next ten years.

What can we learn from the success of the Chinese system? First, it processes transactions without communicating with the banks each time. If the amount of each transaction is capped and the replication is fast enough, this should work. The result is that it is very fast for high-volume processing. It is cheap and convenient because there is no need for expensive NFC hardware, scanning devices or chips/cards. Compare that with our system - there are two banks involved. We use a dedicated transmission network for the payments and there are middlemen involved. More middlemen mean more fees and commissions, plus the need to keep reconciling data as it passes through all the intermediate systems. Chip cards make things even slower. Evolution in our system is not going to be easy since the credit card companies have a tight grip on end users with rewards and points systems.

What can save us from certain failure is the need for innovation - the U.S. still has around 50 million people without bank accounts and thus no credit cards. PayNearMe tapped into this market by offering an easy way to make cash-based payments where credit cards were not an option. So who is going to get to it first?

How social networking combined with NLP analytics is helping expand the economy

Recently I have been researching the rise of NLP again. This was the topic of my bachelor thesis in 1995, more than 20 years ago, and it has become a hot area of research again in the last 5 years. The science and tools have evolved, and a lot of new open source tools like NLTK are available for researchers.

Clearly the early users of social networking data were doing a lot of sentiment analysis on it to determine trends for companies, products, politics etc. Things have changed now. Governments are interested in scouring the billions of bytes of data generated daily in social networks for intelligence hints, and FinTech upstarts are starting to successfully use the same data to disrupt financial services - for example, the lending industry.

The sale of Troo.ly to Airbnb piqued my interest in this. Research revealed the existence of a whole bunch of companies and existing patents in the science of using social network data to determine a person's trust or risk score. It appears that a lot of tricks have evolved in the last five years. However, this is quite scary as well! In the next few years, I can see a lot of people trying to use this score instead of just pulling credit scores in business transactions. This could be landlords, people selling goods on Craigslist (e.g. cars), or small businesses engaging in seasonal hiring to keep costs low. It could certainly replace simple background checks. While this is all great, the scary part is that it presents a Kafkaesque situation for people at the other end of it. Unlike their FICO scores, there is a lot less visibility into how the machine learning algorithms work. Imagine a customer service rep trying to explain why your social risk was considered too high when renting an apartment for a few days on Airbnb.

This led me to find out how the big Chinese payment companies (Ant Financial, Baidu) are now planning to use all the data that they collect from repeated payment transactions and the social networks they own to determine people's credit scores. This idea is definitely not new, but it has been breaking new ground in the last couple of years. For example, companies like HelloSoda and Guardian Analytics have been doing risk scoring for quite some time now. There are also a lot of banking and lending upstarts in the U.S. and EU - Kabbage, Simple, Moven, Fidor etc. However, FinTech has taken social networking data to a new level now, and this time it's not just to send marketing and sales offers (like the famous American Express examples show) or to make your banking very cool indeed.

There are multiple innovation areas - companies like Jumo have made significant inroads into the underbanked markets of Africa, enabling micro-lending based on the social networking data that they collect. In China, Baidu already has more than 10% of its assets involved in some kind of lending, the biggest being in the education market. Now Alibaba and Tencent are also getting involved. Not only is the mobile payments market in China set to pass the transactions Visa and Mastercard process, but it is also spinning out new uses of the data collected. In one way it looks like social networking companies may have an edge in managing risk and may have slowly started to disintermediate traditional lending companies. There are 800 million people using Tencent for payments who have no credit history with the central bank, and their daily transaction patterns reflect a lot about their behavior and risk. Take a look at AutoGravity - users can not only select cars and schedule a test drive, but once they connect it to their social networking account, the site also prefills the application, verifies identity and lines up four lending companies without having the user fill out scores of forms or go to an office. All on the mobile phone in minutes. Facebook, Google and Apple could be doing this next year. It all starts by focusing on an underserved market, and there are enough of those in financial services - there are almost 64M Americans without sufficient credit history and almost 2B people around the world without a bank account.

Thursday, July 6, 2017

Learning TensorFlow

TensorFlow is Google's deep learning library, released in 2015. In many ways, it may be puzzling why we should pay attention to it given there are so many machine learning frameworks around that seem to be doing a pretty good job so far!

The following should provide a good motivation:

- TensorFlow supports GPUs.
- TensorFlow supports distributed computation.
- Primarily, TensorFlow is good for deep learning. Let's just say it seems to be much more focused on DL.

Note that TensorFlow is a low-level library, roughly at the level of the numpy module in Python. There is a lot of development still going on, and hopefully easy-to-use higher-level libraries along the lines of scikit-learn will be available soon. One may also point out that Apache Spark provides distributed computation, has an ML library and supports GPUs as well. So why not just use Spark by itself? The answer may be that Spark is not focused on DL as much as TF is. Moreover, the distributed computation model of Spark is very different from TensorFlow's. Spark has a resource manager hidden from the user that parallelizes an RDD computation over a cluster. In TensorFlow distributed programming, the user is involved and the program has a lot more control over the computation. IMO, Spark may sit in the data pipeline ahead of TensorFlow to massage, clean and process the data that is used to train a very large neural network. At this point, TensorFlow needs a considerable simplification of its cluster management and programming API before it can be used by data scientists used to working with tools like numpy/R or Spark.

Here are some good talks and links to understand TensorFlow better:

Monday, May 15, 2017

Why Logistic Regression is linear even though it uses a non-linear function

Logistic Regression uses a non-linear activation function - the logistic function:

\[z = \frac{1}{(1 + e^{-y})}\]

where $\textit{y}$ is linear in $\textit{x}$, the input variable.

Note that this is equivalent to:

\[z = \frac{e^{y}}{(1 + e^{y})}\]

So why is logistic regression considered linear and the result used for classification rather than predicting a continuous output? Having more of a computer science background, this was something that did not initially catch my eye. This related post made it quite easy for me to understand logistic regression. Here I provide some key points related to logistic regression and some references from a theoretical perspective to help develop a better understanding:
  • The logistic function is used to give the probability of the output being in a binary class. Its output is always between 0 and 1 given any value of inputs in any number of dimensions.
  • If you rearrange the logistic function, the natural log of the odds (the ratio of the probabilities of an event being successful and unsuccessful) is the familiar linear regression equation: $\ln\frac{z}{1 - z} = y$. The reason why logistic regression is considered linear is that the inputs are combined using a linear function, so the decision boundary is linear in $\textit{x}$.
  • The tanh function, which is a rescaled and shifted version of the logistic function ($\tanh(y) = \frac{2}{1 + e^{-2y}} - 1$), is often a better choice than the logistic function since it has a steeper gradient. The steeper gradient is better in backprop training. A steeper gradient passes feedback from the output back to the input much faster and has a larger impact on the weights closer to the input nodes, making convergence faster.
  • While logistic regression works very well in binary classification for any number of dimensions, the softmax function is a much better choice for multi-class classification. The softmax outputs sum to one over all the classes in a multi-class situation. The logistic function does not have this property.
  • In a binary classification, using a softmax function is equivalent to using the sigmoid function.
  • One may ask why we use these complicated exponential functions in the output. If we want probabilities, we could simply normalize - divide each output by the sum of the outputs. The problem with this approach is that individual values can be negative even if they add up to one. Exponentiating makes everything positive. Also, exponentiation works well for back propagation since it amplifies errors, making algorithms converge faster.
  • The use of the logistic function as an activation function inside the network also has an issue - the vanishing gradient problem, which makes deep neural nets very hard to train. ReLU, y = max(x, 0), has been a popular choice since its gradient is exactly 1 for positive inputs, so it does not shrink the gradients as they are propagated back to the input. This nice blog entry provides a great explanation.
  • ReLU is also only used for hidden layers. Outputs would still be softmax (classification) or linear (regression).
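
The points above can be checked numerically. The following is a minimal sketch in plain Python, not taken from any library; the weights `w`, bias `b`, input `x` and the example scores are made-up values used only for illustration. It checks that the log-odds of the logistic output recover the linear score, that a two-class softmax reduces to the logistic function, that tanh is a rescaled logistic, and that dividing by the plain sum can produce negative "probabilities".

```python
import math

def logistic(y):
    """Logistic (sigmoid) function: maps any real y into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def softmax(scores):
    """Softmax over a list of scores; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights, bias and input, chosen only for illustration.
w, b = [0.8, -1.5], 0.3
x = [2.0, 1.0]

# y is linear in x; z is the predicted probability of the positive class.
y = sum(wi * xi for wi, xi in zip(w, x)) + b
z = logistic(y)

# Rearranging the logistic function: the log of the odds is the linear score.
log_odds = math.log(z / (1.0 - z))

# A two-class softmax with scores [y, 0] gives the same probability as logistic(y).
p_two_class = softmax([y, 0.0])[0]

# tanh is a rescaled and shifted logistic function.
tanh_from_logistic = 2.0 * logistic(2.0 * y) - 1.0

# Dividing by the plain sum: the values add to one, but one of them is negative.
raw = [2.0, -1.0]
naive = [r / sum(raw) for r in raw]

# Softmax on the same scores gives positive values that also add to one.
probs = softmax(raw)

print(log_odds, y)
print(p_two_class, z)
print(naive, probs)
```

Since the log-odds equal y, the predicted class flips exactly where y crosses zero, which is a straight line (or hyperplane) in the input space.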

Wednesday, May 10, 2017

Better ways to do One Hot Encoding

While running an ML algorithm on any data, you may have to convert categorical data into numerical data. The reason is that almost all scikit-learn code requires numeric input. Though one may think this is a scikit-learn limitation, that is not really true. Since ML uses math and vectors behind the scenes, the data has to be numerical for most good algorithms.

One of the common ways to convert categorical data to numeric data is One Hot Encoding. This kind of encoding uses indicator variables, where each value of the category is replaced by a column of its own. This can lead to column explosion, so one must be careful. Often, categories that have a natural order can instead be mapped directly to numerical values, which may be helpful as well.
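
As a small illustration of the two options, here is a sketch assuming a made-up `size` column whose values have a natural order. The column name, the values and the `order` mapping are all hypothetical.

```python
import pandas as pd

# Made-up data: a categorical column with a natural order.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Option 1: map the ordered categories to integers (a single numeric column).
order = {"small": 0, "medium": 1, "large": 2}
sizes["size_code"] = sizes["size"].map(order)

# Option 2: one indicator column per distinct value (three columns here).
indicators = pd.get_dummies(sizes["size"], prefix="size", dtype=int)

print(sizes)
print(indicators)
```

The integer mapping keeps the column count fixed but only makes sense when the order is meaningful; the indicator columns make no such assumption but grow with the number of distinct values.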

Several methods of One Hot Encoding have been mentioned. The most prominent and simplest of them uses the get_dummies function in pandas:
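
A minimal sketch of such a function, assuming that every non-numeric column should be treated as categorical. The name `oneHotEncode` and the sample data are hypothetical and used only for illustration.

```python
import pandas as pd

def oneHotEncode(df):
    """Replace each non-numeric column with get_dummies indicator columns."""
    df = df.copy()
    # Assumption: any column that is not numeric is categorical.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    for col in cat_cols:
        # One numeric indicator column per distinct value in this column.
        dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
        # Drop the original categorical column and append the indicators.
        df = pd.concat([df.drop(columns=col), dummies], axis=1)
    return df

# Made-up data: the test set contains a value the training set never saw.
train_df = pd.DataFrame({"color": ["red", "blue", "red"], "n": [1, 2, 3]})
test_df = pd.DataFrame({"color": ["blue", "green"], "n": [4, 5]})

train_enc = oneHotEncode(train_df)
test_enc = oneHotEncode(test_df)

# The two encoded frames do not have the same columns.
print(list(train_enc.columns))
print(list(test_enc.columns))
```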
In this function, we loop through all the categorical variables in the pandas dataframe one by one. For each one, we use get_dummies to create indicator variable columns (which are numeric) and then delete the original categorical column. This is pretty simple and is what forums recommend most often. However, there are a few catches with this method in practice:

  1. The categorical variable may have different sets of values in training, validation and test data. If you run get_dummies separately on each, the indicator columns it creates may differ in number and position, so the same column position can end up standing for different categorical values. When fed into the ML algorithm, this can cause unintended data alteration and wrong results.
  2. The order in which the categorical values are encountered by get_dummies may further contribute to the above issue.
  3. Some categorical values may only appear in validation and test data and may be absent in training data. This can cause more problems. Training a model on one set of values and then testing predictions on another may not make sense. get_dummies does not help identify this problem.

The best course of action is to map the categorical values to a set of indicator variables once and then use the same set during test and validation. This mapping should not be changed. In addition, if certain categorical values are only going to be seen in validation or test data, we must take that into account. We will leave this specific problem to another post. In this post, let's see how we can fix get_dummies to at least address the first two problems and alert us to the third one.

Python provides a few other alternatives that are a bit more complex to use but that I feel are totally necessary. One of these is the LabelEncoder class in scikit-learn. LabelEncoder looks at a categorical variable and creates a transformation which maps the values to integer labels. This does not create any indicator variables, so one may think it is inadequate for our needs. However, what LabelEncoder does is store the mapping as a model which can be used repeatedly later on. Combining LabelEncoder with get_dummies provides the ideal solution:
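
A minimal sketch of what such a function might look like, written to match the calls shown in this post. The body is an assumption rather than a definitive implementation: it treats every non-numeric column as categorical, stores one fitted LabelEncoder per column in a dictionary, and builds the indicator columns from the stored class list so that training and test data always end up with the same columns. The warning text and the choice to encode unseen values as all zeros are also assumptions.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def oneHotEncode2(df, le_dict=None):
    """One-hot encode df, reusing the encoders in le_dict when it is given."""
    # No dictionary passed in means this is the training data.
    train = le_dict is None
    if train:
        le_dict = {}
    df = df.copy()
    # Assumption: any column that is not numeric is categorical.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    for col in cat_cols:
        values = df[col].astype(str).tolist()
        if train:
            # Fit and remember the mapping for this column.
            le_dict[col] = LabelEncoder().fit(values)
        le = le_dict[col]
        # Alert us to values that were not present when the encoder was fitted.
        unseen = sorted(set(values) - set(le.classes_))
        if unseen:
            warnings.warn(f"column {col!r} has values not seen in training: {unseen}")
        # Use the stored classes as the category list so the indicator columns
        # are identical (same names, same order) for train and test data.
        # Unseen values fall outside the list and get all-zero indicators.
        cats = pd.Series(
            pd.Categorical(values, categories=list(le.classes_)), index=df.index
        )
        dummies = pd.get_dummies(cats, prefix=col, dtype=int)
        df = pd.concat([df.drop(columns=col), dummies], axis=1)
    return df, le_dict
```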
So what's up with the dictionary and the train variable in this function? Recall our initial objective - we must use the same mapping for both the training and test data. The Python dictionary stores the mappings created by the LabelEncoder for each column in the dataframe. The call made to this function for training data looks as follows:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned back from training:

test_data, _ = oneHotEncode2(test_data, le_dict)

When the function is called with an already created transform, it also checks whether it encounters any new values in the test data. If it does, it will warn us and we can go back and take appropriate action.