Let's start with an example of a categorical value. Everyone would be familiar with the Iris dataset: the species value [“Iris Setosa”, “Iris Versicolour”, “Iris Virginica”] is a categorical variable. It is difficult to work with categorical values directly; be it a complex machine learning algorithm or a simple statistical test done using SPSS, we are required to encode the information in an integer encoding scheme.

Two approaches are listed below (both are sketched in code right after the list); I will take the iris dataset's species values as the example.

  • Simple integer encoding (Iris Setosa is encoded as 0, Iris Versicolour as 1, and Iris Virginica as 2)
  • One hot encoding (Iris Setosa is encoded as [1,0,0], Iris Versicolour as [0,1,0], and Iris Virginica as [0,0,1])
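
Both schemes can be sketched in a few lines of plain Python (a minimal illustration; the dictionary and helper function below are just for demonstration, not part of any library):

```python
# The three species labels from the iris dataset
species = ["Iris Setosa", "Iris Versicolour", "Iris Virginica"]

# Simple integer encoding: map each label to its index
label_to_int = {label: i for i, label in enumerate(species)}
print(label_to_int["Iris Versicolour"])  # 1

# One hot encoding: a binary vector with a single 1 at the label's index
def one_hot(label):
    vec = [0] * len(species)
    vec[label_to_int[label]] = 1
    return vec

print(one_hot("Iris Versicolour"))  # [0, 1, 0]
```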

One hot encoding is a binary representation technique used with categorical data. Machine learning algorithms cannot work directly with categorical data; this is true for both input and output variables. One hot encoding is also called the one-of-K scheme. To achieve one hot encoding we must first convert the categorical values to numbers (simple integer encoding). The question then is: why do we need to convert the numbers again into binary vectors? Why not use the numbers directly to represent the data; after all, machine learning algorithms do work with numbers?

The answer is that we could indeed use an integer encoding scheme directly and rescale it wherever required. This works for data that have a natural ordinal relationship between the categories, and hence between the numbers representing them, for example temperature {cold, warm, hot} or speed {low, medium, high}. Problems arise when there is no ordinal relationship; in such a case the representation might hurt machine learning performance. Here we are instead interested in expressing the relationship as probability-like values, one per class, creating a vector of probability-like values for each data record and thereby a better, implicit representation of the relationship. When one hot encoding is used for the output variable, it can offer a finer and more distinct set of predictions than a single label.
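
For the ordinal case mentioned above, a hand-written mapping that preserves the natural order is usually enough. A minimal sketch (the particular integer values are an assumption; any increasing sequence would serve):

```python
# Ordinal categories: the integer order mirrors the natural order
temperature_order = {"cold": 0, "warm": 1, "hot": 2}

readings = ["cold", "hot", "warm", "cold"]
encoded = [temperature_order[t] for t in readings]
print(encoded)  # [0, 2, 1, 0] -- here cold < warm < hot is meaningful
```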

As an example of the problem with simple integer encoding for categorical data without an ordinal relationship, consider the iris dataset again: there is no ordering among the species classes. If we represent them with simple integer encoding, Iris Setosa is encoded as 0, Iris Versicolour as 1, and Iris Virginica as 2. An algorithm might then infer an ordering from the values, Iris Setosa < Iris Versicolour < Iris Virginica. Things get worse if we take the average of two classes: avg(0, 2) = 1, which is equivalent to saying the average of Iris Setosa and Iris Virginica is Iris Versicolour. I understand the example is primitive, but the conclusion is that such a representation can lead us to misleading patterns in the data that did not exist in the first place.
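
To make the pitfall concrete, here is the averaging example as a toy sketch; the arithmetic is perfectly valid on the integers, yet meaningless for the species:

```python
codes = {"Iris Setosa": 0, "Iris Versicolour": 1, "Iris Virginica": 2}
inverse = {v: k for k, v in codes.items()}

# Averaging two unrelated classes yields a third, unrelated class
avg = (codes["Iris Setosa"] + codes["Iris Virginica"]) // 2
print(inverse[avg])  # "Iris Versicolour" -- a spurious relationship
```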

So how do we implement one hot encoding? I personally prefer using my own code, but the scikit-learn and keras packages provide functions to achieve the encoding scheme. Nowadays I use sklearn.preprocessing (LabelEncoder() and OneHotEncoder(); there are many more functions to explore) for my day-to-day data preprocessing tasks, and I have used keras (to_categorical()) as well.
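
As a rough sketch of that workflow (assuming a recent scikit-learn; on versions before 1.2 the OneHotEncoder keyword is sparse rather than sparse_output, and the Keras lines assume Keras is installed):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(["Iris Setosa", "Iris Versicolour",
                   "Iris Virginica", "Iris Setosa"])

# Step 1: simple integer encoding of the string labels
le = LabelEncoder()
ints = le.fit_transform(labels)
print(ints)  # [0 1 2 0]

# Step 2: one hot encoding; OneHotEncoder expects a 2D array
ohe = OneHotEncoder(sparse_output=False)
one_hot = ohe.fit_transform(labels.reshape(-1, 1))
print(one_hot)  # rows like [1. 0. 0.], [0. 1. 0.], ...

# The same result from the integer codes via Keras:
# from tensorflow.keras.utils import to_categorical
# print(to_categorical(ints, num_classes=3))
```

Note that a recent OneHotEncoder accepts the string labels directly, so the LabelEncoder step is shown mainly to mirror the two-step description above.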