Think of food without seasoning or salt: that is what a digital data collection tool without conditions is like, tasteless. That said, too much seasoning or salt can also make the food inedible.

Conditions are a vital part of building or developing a digital data collection tool. By conditions, I'm referring to display conditions (skip logic), validation conditions (constraints), required conditions (mandatory fields), and default values (autofill).

It is the duty of tool developers, programme/field experts, and data analysts to ensure data collection tools meet a certain standard. Beyond coming up with questions that can measure the performance of a programme or initiative, it is important to think through, carefully, how each question is collected and how its response validates or invalidates other questions. It is in this thought process that questions are properly sequenced, structured, and, to a great extent, rid of possible wrong responses.

Conditions significantly improve data quality in both new and existing systems. In this article, I'll highlight the relevance of each condition type.

A display condition, also known as form filtering or question display logic, filters the content of your application/form (the visibility and accessibility of folders, lists, or questions) for a specific use case. It determines the conditions under which questions are displayed. Display conditions avert conflicting realities within your records/dataset. For example, you don't want a pregnant man in your analysis, so the question 'Are you pregnant?' should only be available to answer when the preceding response to the question on gender is female.
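As a minimal, platform-agnostic sketch (the function and field names here are illustrative, not taken from any specific form builder), a display condition is simply a predicate over earlier responses:

```python
# Hypothetical sketch of a display condition (skip logic): the pregnancy
# question is shown only when the preceding gender response is "female".
def should_display_pregnancy_question(responses: dict) -> bool:
    """Return True only when the question is applicable to the respondent."""
    return responses.get("gender") == "female"

print(should_display_pregnancy_question({"gender": "female"}))  # True
print(should_display_pregnancy_question({"gender": "male"}))    # False
```

In most form platforms this logic is configured declaratively rather than coded by hand, but the underlying idea is the same.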

Validation conditions check responses to make sure they meet the constraints you specify. If the entered information does not meet your logic, an error message pops up and blocks the user from submitting the form. Validation conditions increase your chances of having correct responses entered. A good example of a missing validation condition is a future date of birth in your dataset. To avoid this, set the field collecting date of birth to throw an error message when a date later than 'today' is entered.
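A minimal sketch of that date-of-birth check in Python (the function name and message are illustrative; real form platforms let you configure this declaratively):

```python
from datetime import date

def validate_date_of_birth(dob, today=None):
    """Flag any date of birth later than 'today' as invalid."""
    today = today or date.today()
    if dob > today:
        return False, "Date of birth cannot be later than today."
    return True, ""

ok, message = validate_date_of_birth(date(2099, 1, 1))
print(ok, message)  # False Date of birth cannot be later than today.
```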

A required condition determines the circumstances under which an answer must be provided before the form can be submitted. Simply put, it determines when questions are mandatory. It is similar to a display condition in function, but without the skip. Required conditions avert errors of omission, that is, missing needed responses. For example, if a preceding question asked whether a woman is pregnant and the response is affirmative, then a follow-on question on pregnancy duration is required; it is not required for women whose responses were not affirmative.
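A hedged sketch of the same idea in Python (question and field names are made up for illustration):

```python
def is_required(question_id, responses):
    """Decide whether a question is mandatory, given earlier responses."""
    required_rules = {
        # question -> predicate over earlier responses
        "pregnancy_duration": lambda r: r.get("is_pregnant") == "yes",
    }
    rule = required_rules.get(question_id)
    return rule(responses) if rule else False

print(is_required("pregnancy_duration", {"is_pregnant": "yes"}))  # True
print(is_required("pregnancy_duration", {"is_pregnant": "no"}))   # False
```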

A default value displays a value upon accessing a form question and can be changed by the user. This is often used when we don't want a field left without a value. A good example is entering zero as opposed to leaving null values; some analyses require zeros for the right insight to be drawn. Another example, common in case management settings, is the collection of follow-up data on a question that already has registration data. The registration data is used as a default value (placeholder), giving users the opportunity to update it or leave it as is during the follow-up entry.
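A small sketch of both uses of default values, zeros instead of nulls and registration data as placeholders (all field names here are hypothetical):

```python
def apply_defaults(entry, registration_data, numeric_fields):
    """Fill missing values: zeros for numeric fields, registration data elsewhere."""
    filled = dict(entry)                      # do not mutate the caller's record
    for field in numeric_fields:
        filled.setdefault(field, 0)           # zero rather than a null value
    for field, value in registration_data.items():
        filled.setdefault(field, value)       # user may overwrite during follow-up
    return filled

followup = apply_defaults(
    {"weight_kg": 61},                        # user-entered follow-up data
    {"name": "A. Client", "village": "Gao"},  # registration data as placeholders
    ["children_count"],
)
print(followup)
# {'weight_kg': 61, 'children_count': 0, 'name': 'A. Client', 'village': 'Gao'}
```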

The consequences of not paying due diligence to the use of conditions are numerous. They include, but are not limited to, dirty data, misleading data, biased data, missed analysis opportunities, and ultimately wrong insights or information.

Setting conditions requires careful consideration, hence the need to work with field/programme experts and analysts/statisticians to be certain our tool is relatable and fit for purpose.

Finally, going back to our seasoning-or-salt analogy, one may argue that seasoning can be added after the food has been cooked, but you'll also agree it's hardly as tasty or evenly spread as seasoning cooked along with the meal. This can be likened to data cleaning: it is a far more arduous task than setting conditions before data collection. Also, remember that some seasonings require a level of heat to add real value (aroma and taste) to the meal; likewise, there are data quality issues that conditions can avert but that no amount of data cleaning can resolve.

Feature engineering refers to the process of transforming raw data into features that better capture the research problem being analysed, in a bid to improve model accuracy. It involves finding the best representation of sample data to learn the solution to a defined problem. A feature is an attribute that is relevant to a problem; an important part of an observation that best captures the problem being modeled. Feature engineering includes the objective estimation of attribute usefulness, feature extraction, feature selection, and feature construction. Popular use cases include decomposing categorical attributes, decomposing a date-time, and re-framing numerical quantities. The process of carrying out feature engineering for data mining and machine learning tasks involves brainstorming features, devising features using applicable techniques, selecting features, and evaluating models.

Some of the most commonly used feature engineering techniques include binning, log transforms, n-grams, binarisation, feature hashing, and bag-of-words.

Binning (or quantisation) is used in preparing numerical data for machine learning, especially when an attribute holds continuous numbers with many unique values. To effectively model a problem, the continuous numbers are replaced with categorical values that capture defined ranges.

Feature hashing (the hashing trick) refers to the process of vectorising features and is used to scale up machine learning algorithms. Commonly used for text mining, it involves the conversion of tokens into integers. Using a random sparse projection matrix and preserving the Euclidean norm, it reduces the dimension of the data, thus making text analysis possible.

The log transform comes in handy when dealing with highly skewed distributions. It reduces the skewness of a distribution, making patterns in a dataset more interpretable.

Binarisation transforms features into vectors of binary numbers for better classification.

Bag-of-Words (BoW) counts the number of occurrences of each word in a document, representing text data in a form machines can interpret. It is popularly used in text mining to compare documents and estimate similarities. The n-grams technique generalises bag-of-words using word sequences and can be used to predict the next item in a sequence.

In summary, feature engineering adopts varying techniques to address the challenge of having representative and relevant features for analysis and model accuracy improvement.
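To make a few of these techniques concrete, here is a minimal plain-Python sketch of binning, a log transform, and bag-of-words (the thresholds, category names, and example text are illustrative only):

```python
import math
from collections import Counter

# 1. Binning: replace a continuous number with a categorical range.
def bin_age(age):
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

# 2. Log transform: compress a highly skewed value such as income.
#    log1p computes log(1 + x), which stays defined at zero.
def log_transform(value):
    return math.log1p(value)

# 3. Bag-of-words: count word occurrences in a document.
def bag_of_words(text):
    return Counter(text.lower().split())

print(bin_age(42))                                # adult
print(log_transform(0))                           # 0.0
print(bag_of_words("to be or not to be")["to"])   # 2
```

In practice, libraries such as Scikit-learn provide tested implementations of these transforms; the sketch above only shows the idea behind each one.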
In data mining and machine learning, dimensionality reduction refers to the process of reducing the number of variables in a dataset for better analysis. Dimensionality reduction techniques solve the problem of feature redundancy, especially when most features are correlated. The field is academically divided into feature selection (keeping the most relevant variables from the original dataset) and feature extraction (finding a smaller set of new variables by combining input variables that carry highly similar information). Simply put, dimensionality reduction techniques address the curse of dimensionality: the set of challenges posed by working with data in higher dimensions. Machine learning models trained on many features have a strong tendency to overfit, which leads to poor performance when such models are applied to real data. Avoiding overfitting is a cogent reason for performing dimensionality reduction; models are best kept simple by training on fewer features, and hence with fewer assumptions.

There are several dimensionality reduction techniques used to address varying levels of complexity in data mining and machine learning tasks. The Missing Value Ratio technique reduces the number of variables when a dataset has too many missing values. The Low Variance Filter technique identifies and drops constant variables from a dataset, since variables with low variance have little or no impact on the dependent variable. The High Correlation Filter technique finds and drops highly correlated features to address multicollinearity in a dataset. The Random Forest technique, arguably one of the most commonly used, estimates the importance of the features present in a dataset in a bid to select the topmost features. Forward Feature Selection and Backward Feature Elimination are computationally demanding and best suited for small datasets. The Factor Analysis technique divides highly correlated variables into groups by assigning a factor to each group.

The Principal Component Analysis (PCA) technique is widely used when analysing a dataset with linear data; it explains as much variance as possible by dividing the data into components. Independent Component Analysis (ICA) goes a step further, using independent components to describe the data with fewer components. ISOMAP is a manifold projection technique applied when the dataset is strongly non-linear. t-SNE is also used on strongly non-linear datasets, but is mainly preferred for its visualisations. Last but not least, UMAP is an advanced projection-based technique with a shorter run-time compared to ISOMAP and t-SNE.

In summary, dimensionality reduction techniques are used in data mining and machine learning to remove redundant features and noise from a dataset, thus allowing the use of algorithms fit for small datasets, improving model accuracy, and reducing computation time. However, the arsenal of dimensionality reduction techniques, domain expertise, and heuristics must be well harnessed to avoid data loss, address multicollinearity, select components, and define the final dataset.

Image Source: MathWorks
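As a toy illustration of PCA on linear data (a pure-Python sketch with made-up data; in practice one would reach for a library implementation such as Scikit-learn's PCA), two highly correlated features collapse onto a single component that explains almost all of the variance:

```python
import math
import random

random.seed(0)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [2 * a + random.gauss(0, 0.1) for a in x1]   # almost perfectly correlated

# Center both features.
m1, m2 = sum(x1) / n, sum(x2) / n
c1 = [a - m1 for a in x1]
c2 = [b - m2 for b in x2]

# 2x2 sample covariance matrix [[s11, s12], [s12, s22]].
s11 = sum(a * a for a in c1) / (n - 1)
s22 = sum(b * b for b in c2) / (n - 1)
s12 = sum(a * b for a, b in zip(c1, c2)) / (n - 1)

# Closed-form larger eigenvalue of a symmetric 2x2 matrix:
# lambda = (trace + sqrt(trace^2 - 4 * det)) / 2
trace = s11 + s22
det = s11 * s22 - s12 * s12
lam1 = (trace + math.sqrt(trace * trace - 4 * det)) / 2

explained = lam1 / trace   # fraction of variance on the first component
print(explained > 0.95)    # True: one component captures nearly everything
```

Because the second feature is just a noisy multiple of the first, keeping one principal component loses almost no information, which is exactly the redundancy dimensionality reduction exploits.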
Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use for both research and development. Why Python?
  1. Simple and consistent: Python offers concise and readable code. While complex algorithms and versatile workflows stand behind machine learning and deep learning, Python’s simplicity allows developers to write reliable systems. Python code is understandable by humans, which makes it easier to build models for machine learning and deep learning.
  2. Extensive selection of libraries and frameworks: Implementing machine learning and deep learning algorithms can be tricky and requires a lot of time. Python provides a well-structured and well-tested environment that enables developers to come up with the best coding solutions.
  3. Platform independence: Python is a platform-independent language. Python code can be used to create standalone executable programs for most common operating systems, and Python is supported by many platforms including Linux, Windows, and macOS.
  4. Great community and popularity: Python is among the top 10 most popular programming languages.
Some popular Python libraries are Pandas, NumPy, SciPy, Seaborn, Scikit-learn, TensorFlow, and Keras. A comprehensive list of Python's standard libraries can be found in the official Python documentation.
It is common for today's scientific and business industries to collect large amounts of data, and the ability to analyze the data and learn from it is critical to making informed decisions. Machine learning is a branch of computer science that studies the design of algorithms that can learn. R is one of the major languages for data science. It provides excellent visualization features, which are essential for exploring the data before submitting it to any automated learning, as well as for assessing the results of the learning algorithm. Why R?
  1. R is free: Unlike other proprietary software packages that require expensive licenses, R is open source and you can always download it for free. No matter where you travel, you can have access to R on your computer.
  2. R gives you access to cutting-edge technology: Top researchers develop statistical learning methods in R, and new algorithms are constantly added to the list of packages you can download.
  3. R is a useful skill: Employers that value analytics recognize R as useful and important. If for no other reason, learning R is worthwhile to help boost your resume.
Some popular R packages/libraries are caret, ggplot2, mlbench, class, caTools, randomForest, impute, ranger, kernlab, glmnet, naivebayes, rpart, and rpart.plot.