{"id":182,"date":"2020-04-20T21:24:16","date_gmt":"2020-04-20T21:24:16","guid":{"rendered":"https:\/\/youpple.com\/dataclergy\/?p=182"},"modified":"2020-06-20T13:44:25","modified_gmt":"2020-06-20T13:44:25","slug":"dimensionality-reduction-techniques","status":"publish","type":"post","link":"https:\/\/youpple.com\/dataclergy\/2020\/04\/20\/dimensionality-reduction-techniques\/","title":{"rendered":"Dimensionality Reduction Techniques"},"content":{"rendered":"<p>In data mining and machine learning, dimensionality reduction is the process of reducing the number of variables in a dataset to simplify analysis. Dimensionality reduction techniques address feature redundancy, especially when many of the features are correlated. Dimensionality reduction is conventionally divided into feature selection, which keeps the most relevant variables from the original dataset, and feature extraction, which finds a smaller set of new variables by combining input variables that carry highly similar information.<\/p>\n<p>Simply put, dimensionality reduction techniques are used to address the curse of dimensionality, the set of challenges that arise when working with high-dimensional data. Machine learning models trained on many features tend to overfit, which leads to poor performance when they are applied to real data. Avoiding overfitting is a cogent reason for performing dimensionality reduction. Models are best kept simple by training on fewer features, which also means making fewer assumptions.<\/p>\n<p>There are several dimensionality reduction techniques, suited to varying levels of complexity in data mining and machine learning tasks. The Missing Value Ratio technique drops variables whose share of missing values exceeds a chosen threshold. The Low Variance filter technique is applied to identify and drop constant or near-constant variables from a dataset. 
This is done because variables with little or no variance carry little information about the dependent variable. The High Correlation filter technique finds highly correlated features and drops one of each correlated pair to address multicollinearity in a dataset. The Random Forest technique is arguably one of the most commonly used: it estimates the importance of each feature in a dataset so that the top-ranked features can be selected. Forward Feature Selection and Backward Feature Elimination are computationally demanding and are therefore best suited to small datasets. The Factor Analysis technique divides highly correlated variables into groups by assigning a factor to each group. The Principal Component Analysis (PCA) technique is widely used when the structure in a dataset is approximately linear. It builds new components as linear combinations of the original variables, ordered so that the first few explain as much of the variance as possible. The Independent Component Analysis (ICA) technique goes a step further and looks for statistically independent components, which can describe the data with fewer components. ISOMAP is a manifold projection technique that is applied when the dataset is strongly non-linear. The t-SNE technique is also used on strongly non-linear datasets but is mainly preferred for visualization. Last but not least, UMAP is a more recent projection-based technique with a shorter run time than ISOMAP and t-SNE.<\/p>\n<p>In summary, dimensionality reduction techniques are used in data mining and machine learning to remove redundant features and noise from a dataset, thus allowing the use of algorithms suited to small datasets, improving model accuracy, and reducing computation time. 
However, dimensionality reduction techniques must be combined with domain expertise and sound heuristics to avoid losing information, address multicollinearity, select components, and define the final dataset.<\/p>\n<p>&nbsp;<\/p>\n<p>Image Source: MathWorks<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In data mining and machine learning, dimensionality reduction is the process of reducing the number of variables in a dataset to simplify analysis. Dimensionality reduction techniques address feature redundancy, especially when many of the features are correlated. Dimensionality reduction is conventionally divided into feature selection, which keeps the most [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":302,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"pagelayer_contact_templates":[],"_pagelayer_content":"","footnotes":""},"categories":[3],"tags":[],"class_list":["post-182","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science"],"_links":{"self":[{"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/posts\/182","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/comments?post=182"}],"version-history":[{"count":6,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/posts\/182\/revisions"}],"predecessor-version":[{"id":304,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/posts\/182\/revisions\/304"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/media\/302"}],"wp:attachment":[{"href":"https:\/\/youpple.com\/datacle
rgy\/wp-json\/wp\/v2\/media?parent=182"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/categories?post=182"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youpple.com\/dataclergy\/wp-json\/wp\/v2\/tags?post=182"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}