Analyzing Your Data: Why Do Data Types Matter? — Part 1

Armel Djangone
Published in The Startup
5 min read · Jan 28, 2021


Introduction

When analyzing your data, it is crucial to recognize the type of each variable. The data type determines which analyses are appropriate, and it also drives the choice of data visualization techniques. In this series of two posts, I first present the different types of statistical data with examples of each, and then show why data types matter using Python and R programming.

In Part 1, I cover data types in Python, converting data types in Python, and visualizing quantitative and qualitative data.

Numerical data versus categorical data

Various kinds of data are collected for analysis: for example, demographic data (age, ethnicity), employee salaries, job grade, height, student grades, and curriculum. These observations can be classified as either categorical (also known as qualitative) or numerical (also known as quantitative) data.

Numerical data is any data where what you are measuring can be expressed as a number. It can be divided into two subgroups: discrete and continuous.

Discrete numerical data are typically whole numbers. They are counted rather than measured. Examples include the number of people in a class and the number of test questions answered correctly. By contrast, your weight is not discrete data, because you measure it on a scale. There are two questions you can ask when deciding whether data is discrete:

o Can we count the data?

o Can it be divided into smaller and smaller parts?

If the data can be counted and cannot be meaningfully divided into smaller parts, it is discrete.

Continuous numerical variables may take any value within some range. Continuous data is measured rather than counted. Examples of continuous data include a person’s height, weight, time in a race, and temperature.

Categorical data describe a ‘characteristic’ of a data unit, with values selected from a small group of categories. Just like numerical data, categorical data can be divided into two subgroups: nominal and ordinal. The key difference is that with nominal data the order does not matter, while ordinal data has a specific order.

Nominal scales (sometimes called labels) are used for labeling variables, without any quantitative value. Examples include gender (Male, Female), hair color (Brown, Black, Blonde, Gray), and citizenship (American, Ivorian, French, Canadian).

With ordinal scales, the order of the values is what matters most. An example is rating temperature as cold, cool, warm, or hot.
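In pandas, the nominal versus ordinal distinction can be expressed with the categorical dtype. The following is only an illustrative sketch: the hair color and temperature values reuse the examples above, and the variable names are my own assumptions.

```python
import pandas as pd

# Nominal: labels with no inherent order.
hair = pd.Series(["Brown", "Black", "Blonde", "Gray"], dtype="category")

# Ordinal: the order of the categories carries meaning.
temp_levels = pd.CategoricalDtype(["cold", "cool", "warm", "hot"], ordered=True)
temperature = pd.Series(["warm", "cold", "hot", "cool"], dtype=temp_levels)

print(hair.cat.ordered)         # False
print(temperature.cat.ordered)  # True

# Comparisons and meaningful sorting are only available for ordered categories.
warmer_than_cool = (temperature > "cool").tolist()
sorted_levels = temperature.sort_values().tolist()
print(warmer_than_cool)  # [True, False, True, False]
print(sorted_levels)     # ['cold', 'cool', 'warm', 'hot']
```

Declaring the order explicitly is what lets pandas sort "cold" before "hot" instead of falling back to alphabetical order.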

Summary of data types

Data types in Python

Now that we have a better understanding of data types, let’s see how to use this information in Python. For this post, I will use the 2017 & 2018 SAT and ACT scores datasets. I will show how to check data types and how to identify and fix incorrect ones.

Let’s begin by importing the necessary libraries: NumPy and pandas.
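The import step would typically look like the sketch below. The aliases np and pd are the usual community convention and are assumed here, not taken from the original code.

```python
import numpy as np
import pandas as pd

# Printing the versions is optional, but handy when results differ between setups.
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)
```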

Next, let’s read in the SAT and ACT data: read the sat_2017.csv and act_2017.csv files and assign them to appropriately named pandas DataFrames.
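Loading would normally be done with pd.read_csv. The sketch below assumes the files sit in the working directory, which the post does not state, so the real calls are shown as comments. To stay runnable without the files, it reads a tiny stand-in CSV whose rows are made up and are not real scores.

```python
import io

import pandas as pd

# With the real files on disk, the calls would be along these lines
# (the relative paths are an assumption):
#   sat_2017 = pd.read_csv("sat_2017.csv")
#   act_2017 = pd.read_csv("act_2017.csv")

# Stand-in CSV text with made-up rows, so the sketch runs without the files.
act_csv = io.StringIO(
    "State,Participation,Reading,Science,English,Math,Composite\n"
    "State A,100%,20.0,19.5,19.0,18.5,19.3\n"
    "State B,65%,22.0,21.5,21.0,20.5,21.3\n"
)
act_2017 = pd.read_csv(act_csv)

print(act_2017.shape)             # (2, 7)
print(act_2017.columns.tolist())
```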

Let’s take a look at the first 10 rows of the ACT data

Similarly, let’s take a look at the first 10 rows of the SAT dataset
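For either DataFrame, the first rows can be shown with head(10). Here is a minimal sketch on a made-up stand-in table; the state names and participation values are placeholders, not the real data.

```python
import pandas as pd

# Made-up stand-in with 12 rows; the real table has one row per state.
act_2017 = pd.DataFrame({
    "State": [f"State {i}" for i in range(1, 13)],
    "Participation": [f"{p}%" for p in range(40, 100, 5)],
})

# head(n) returns the first n rows; the default is n=5.
first_ten = act_2017.head(10)
print(first_ten)
```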

We can see that the 2017 SAT data includes five features, as follows: State, Participation rate, Writing and Reading, Math and Total.

The 2017 ACT data includes seven features, as follows: State, Participation, Reading, Science, English, Math and Composite.

The data look complete, but let’s take a closer look at the column names and data types.

Let’s now display the data type for each feature by using “dtypes”
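A minimal sketch of this check, again on a made-up stand-in table. The stray character in one Composite value is an assumption, included only to show how a single bad entry can keep a whole column from being read as numeric.

```python
import pandas as pd

# Made-up stand-in rows; the column names follow the ACT features listed above.
act_2017 = pd.DataFrame({
    "State": ["State A", "State B"],
    "Participation": ["100%", "65%"],
    "Math": [18.5, 20.5],
    "Composite": ["19.3", "21.3x"],
})

# dtypes is an attribute (no parentheses) that lists one dtype per column.
print(act_2017.dtypes)

# select_dtypes narrows the view to the columns pandas already treats as numeric.
numeric_cols = act_2017.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Math']
```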

Do any of the data types here seem odd? Which ones are not correct?

There are some issues with the data. The data type for Participation shows as a string, but the values are really numeric and should be floats. Similarly, Composite should be numeric rather than a string. There are other issues too, such as misspelled column names, but for the purpose of today’s topic I will focus on correcting the data types.
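When a column that should be numeric is read as a string, one way to find the offending entries is pd.to_numeric with errors="coerce". The post does not say why Composite was read as a string, so the stray "x" below is purely a hypothetical cause.

```python
import pandas as pd

# Hypothetical values: one entry carries a stray character.
composite = pd.Series(["19.3", "21.3x", "20.1"], name="Composite")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(composite, errors="coerce")

# The rows that became NaN are the ones blocking a numeric dtype.
bad_values = composite[as_number.isna()].tolist()
print(bad_values)  # ['21.3x']
```

Once the bad entries are known, they can be corrected at the source and the column converted as a whole.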

Fixing incorrect data types

Based on what I discovered above, I use an appropriate method to convert the incorrectly typed data to the correct type. Here I use a function together with “map” to convert the participation rates to an appropriate numeric type.

Note: Various methods would work. I chose a function here because I will need to reuse that code throughout the project. A simple one-off line of code would also resolve this.
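A sketch of what such a function and map call could look like. The name percent_to_float and the sample rows are hypothetical, and the original implementation may differ. The last lines show a one-line alternative using the pandas string methods.

```python
import pandas as pd


def percent_to_float(value):
    """Convert a string such as '65%' to the float 65.0."""
    return float(str(value).strip().rstrip("%"))


# Made-up stand-in rows.
act_2017 = pd.DataFrame({
    "State": ["State A", "State B"],
    "Participation": ["100%", "65%"],
})

# Keep the raw strings around to compare against the one-line alternative.
raw = act_2017["Participation"].copy()

# map applies the function to every value in the column.
act_2017["Participation"] = act_2017["Participation"].map(percent_to_float)
print(act_2017["Participation"].tolist())  # [100.0, 65.0]

# One-line alternative without a helper function.
one_liner = raw.str.rstrip("%").astype(float)
print(one_liner.tolist())  # [100.0, 65.0]
```

A named function is easy to reuse on the SAT table and on later years, while the one-liner is convenient for a one-off fix.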

Let’s now verify that the data type was changed.
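One way to verify the change is to compare the column's dtype before and after the conversion. This sketch uses made-up values and a one-line conversion as a stand-in for whatever conversion was actually applied.

```python
import pandas as pd

# Made-up stand-in values.
act_2017 = pd.DataFrame({"Participation": ["100%", "65%"]})

before = act_2017["Participation"].dtype
act_2017["Participation"] = (
    act_2017["Participation"].str.rstrip("%").astype(float)
)
after = act_2017["Participation"].dtype

print(before, "->", after)

# A programmatic check is handy inside a longer cleaning script.
is_float = pd.api.types.is_float_dtype(act_2017["Participation"])
print(is_float)  # True
```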

As we can see, the participation column has now changed from object to float64, which is the appropriate data type for this feature based on what we saw above.

Conclusion

  • Whether you are using Python, R or SAS, you will usually deal with two main types of statistical data: quantitative and qualitative data.
  • Understanding the data types and their use is very important when analyzing your data.
  • The type of data will drive the way the data is managed, analyzed and represented (visually).
  • The full code for this blog can be found in my personal GitHub.

References

Personal GitHub: https://github.com/amaso13/Analysing-SAT-ACT-scores.git
