Feature Engineering
Feature engineering is the process of transforming raw data into features that better capture the problem being analysed, with the aim of improving model accuracy. It involves finding the best representation of the sample data for learning a solution to a defined problem. A feature is an attribute that is relevant and useful for modelling that problem: the part of an observation that best captures what is being modelled.
Feature engineering includes the objective estimation of attribute usefulness, feature extraction, feature selection, and feature construction. Popular use cases include decomposing categorical attributes, decomposing a date-time into its components, and re-framing numerical quantities.
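As an illustration of two of these use cases, the sketch below decomposes a date-time into simpler numeric features and decomposes a categorical attribute into one binary feature per category. The function names, the chosen components, and the sample values are assumptions made for this example, not a prescribed method.

```python
from datetime import datetime

def decompose_datetime(ts):
    """Split an ISO-8601 timestamp string into simpler numeric features."""
    dt = datetime.fromisoformat(ts)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),          # Monday = 0, Sunday = 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),  # Saturday or Sunday
    }

def one_hot(value, categories):
    """Decompose a categorical attribute into one binary feature per category."""
    return {"is_" + c: int(value == c) for c in categories}

# Hypothetical observation: a timestamp and a colour attribute.
features = decompose_datetime("2021-03-06T14:30:00")
features.update(one_hot("red", ["red", "green", "blue"]))
print(features)
```

Which components are worth extracting depends on the problem; an hour-of-day feature, for instance, only helps if the target plausibly varies within a day.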
The process of carrying out feature engineering for data mining and machine learning tasks involves brainstorming features, devising them using applicable techniques, selecting features, and evaluating models.
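The selection step can be sketched as scoring each candidate feature and keeping the highest-ranked ones. The example below uses the absolute Pearson correlation with the target as the usefulness score; this choice of score, the function names, and the toy data are assumptions for illustration only, and in practice the final judgement comes from evaluating models built on the selected features.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(candidates, target):
    """Rank candidate features by absolute correlation with the target, best first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical candidate features and target values.
candidates = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [10.0, 20.0, 30.0, 40.0, 50.0]

ranking = rank_features(candidates, target)
for name, score in ranking:
    print(name, round(score, 3))
```

Correlation only detects linear relationships, so a low score here does not prove that a feature is useless.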
Some of the most commonly used feature engineering techniques include binning, feature hashing, log transforms, binarisation, bag-of-words, and n-grams. Binning, or quantisation, is used to prepare numerical data for machine learning, especially when an attribute is continuous and has many unique values. To model the problem effectively, the continuous numbers are replaced with categorical values that capture defined ranges. Feature hashing, or the hashing trick, is a way of vectorising features that is used to scale up machine learning algorithms. Commonly used in text mining, it converts tokens into integers. In effect it applies a random sparse projection matrix that approximately preserves the Euclidean norm, reducing the dimension of the data and making large-scale text analysis practical. The log transform comes in handy when dealing with highly skewed distributions. It reduces the skewness of a distribution, making patterns in a dataset more interpretable. Binarisation transforms features into binary (0/1) values, typically by thresholding, for better classification. Bag-of-words (BoW) counts the number of occurrences of each word in a specific document. It is popular in text mining for comparing documents and estimating their similarity, and it makes text data interpretable by machine learning algorithms. The n-grams technique uses word sequences to generalise the bag-of-words technique; n-grams are also used to predict the next item in a sequence.
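The techniques above can be sketched in a few lines each using only the Python standard library. All function names, bin edges, thresholds, and the bucket count are assumptions chosen for illustration; the hashing function in particular is a simplified, unsigned variant of the hashing trick rather than a reference implementation.

```python
import math
import zlib
from collections import Counter

def bin_value(x, edges):
    """Binning: index of the range x falls into, given ascending bin edges."""
    return sum(x >= e for e in edges)

def log_transform(x):
    """Log transform: log(1 + x) compresses the long tail of skewed, non-negative data."""
    return math.log1p(x)

def binarize(x, threshold):
    """Binarisation: 1 if x exceeds the threshold, otherwise 0."""
    return int(x > threshold)

def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real pipelines handle punctuation)."""
    return text.lower().split()

def bag_of_words(tokens):
    """Bag-of-words: number of occurrences of each token in one document."""
    return Counter(tokens)

def ngrams(tokens, n):
    """N-grams: every run of n consecutive tokens, joined into one feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hash_features(tokens, n_buckets=16):
    """Hashing trick: map each token to an integer bucket and count per bucket.

    The vector length stays fixed however large the vocabulary grows;
    distinct tokens may collide in the same bucket.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec

# Hypothetical inputs for demonstration.
ages = [3, 17, 42, 70]
age_bins = [bin_value(a, edges=[13, 18, 65]) for a in ages]

tokens = tokenize("the cat sat on the mat")
bow = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)
hashed = hash_features(tokens)

print(age_bins)
print(log_transform(999))
print(binarize(3, threshold=0))
print(bow["the"], bigrams, hashed)
```

With n = 1 the n-gram function returns the individual tokens, which is the sense in which n-grams generalise bag-of-words: passing the bigrams to `bag_of_words` yields counts that retain some word order.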
In summary, feature engineering applies a variety of techniques to obtain representative and relevant features for analysis and to improve model accuracy.
