Cleaning Messy Data: A Step-by-Step Guide Using ChatGPT

Data is rarely perfect. Inconsistent formatting, missing values, duplicates, outliers, and messy open-ended responses can all impact your analysis.

In this guide, we’ll walk you through a step-by-step process for cleaning messy survey data using ChatGPT.

Using a synthetic dataset with common data issues, we’ll show you how to:

Identify missing values 🧪
Detect and remove duplicates 🗑
Standardise categorical data 🏷
Correct formatting errors ✏️
Clean open-ended responses 💬
Handle outliers 🚨

We’ve included two downloadable resources, a messy dataset and a ChatGPT prompt document, so you can practise each technique hands-on. You can download the files below: 👇

The files are hosted securely on Internext.com and no account is required. If you’d prefer me to email them to you, send me a DM.

📊 Dataset
✏️ Prompts

If you do not have an account yet, simply go to ChatGPT to create one. Whether you’re setting up a new account or already have one, it is good practice to stop your content from being used to train OpenAI’s models. You can do this as follows:

➡️ Head to the Settings menu and look for the section labelled Data Controls.

➡️ Toggle the setting for ‘Improve the model for everyone’ to disable it.

➡️ For a recap on how to use ChatGPT for Market Research safely, click here.

Let’s get started! 🚀

Step 1: Identifying Missing Data 🔍

Scenario

Before analysing data, we need to check for missing values in Q2 (Age). Missing data can skew results or create biased insights.

Starter Prompt

Scan Q2 (Age) for missing values. Identify how many missing entries exist and suggest strategies for handling them. Summarise your output in the chat.

ChatGPT Output

🤖 ChatGPT will highlight missing values and provide a summary of where data gaps exist. It may suggest solutions such as imputation (filling in missing values), removing affected rows, or flagging them for further review.

Follow-Up Prompt

In Q2 (Age), replace missing values with the flag 'Missing'.
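
If you’d like to double-check what ChatGPT reports, here is a minimal Python sketch of the same idea using only the standard library. The sample rows, the column key "Q2", and the list of markers treated as missing are all assumptions for illustration; adjust them to match the actual dataset.

```python
# Hypothetical sample rows; the real dataset's column names may differ.
rows = [
    {"Q0": "R001", "Q2": "34"},
    {"Q0": "R002", "Q2": ""},
    {"Q0": "R003", "Q2": "N/A"},
    {"Q0": "R004", "Q2": "51"},
]

# Values we choose to treat as missing (an assumption; extend as needed).
MISSING_MARKERS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value):
    """Return True when a cell is empty or holds a common 'no data' marker."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def flag_missing(rows, column, flag="Missing"):
    """Return a cleaned copy of the rows and the number of cells flagged."""
    cleaned, count = [], 0
    for row in rows:
        new_row = dict(row)  # copy so the raw data stays untouched
        if is_missing(new_row.get(column)):
            new_row[column] = flag
            count += 1
        cleaned.append(new_row)
    return cleaned, count


cleaned_rows, missing_count = flag_missing(rows, "Q2")
print(f"Missing values in Q2: {missing_count} of {len(rows)}")
```

Working on a copy keeps the raw data intact, so you can compare the count here against the number ChatGPT gives you.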

Step 2: Detecting and Removing Duplicates 🗑

Scenario

Duplicate entries in Q0 (Respondent ID) can distort analysis, especially in survey data where respondents may submit multiple responses.

Starter Prompt

Check Q0 (Respondent ID) for duplicate rows. Identify any duplicates and summarise their frequency. Summarise your output in the chat.

ChatGPT Output

📋 ChatGPT will detect duplicate entries and suggest whether they should be removed or merged.

Follow-Up Prompt

Remove duplicate rows while keeping the first occurrence.
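
For reference, "keep the first occurrence" can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows are made up, and we assume a duplicate means a repeated value in the "Q0" column rather than a fully identical row.

```python
from collections import Counter

# Hypothetical sample rows; "Q0" stands in for the Respondent ID column.
rows = [
    {"Q0": "R001", "Q1": "Alice Brown"},
    {"Q0": "R002", "Q1": "John Doe"},
    {"Q0": "R001", "Q1": "Alice Brown"},
    {"Q0": "R003", "Q1": "Sam Lee"},
    {"Q0": "R002", "Q1": "John Doe"},
]


def duplicate_frequency(rows, key):
    """Map each repeated ID to the number of times it appears."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def drop_duplicates_keep_first(rows, key):
    """Keep only the first row seen for each ID, preserving row order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


dupes = duplicate_frequency(rows, "Q0")
deduped = drop_duplicates_keep_first(rows, "Q0")
print("Repeated IDs:", dupes)
print("Rows before:", len(rows), "after:", len(deduped))
```

Checking the before and after row counts is a quick way to confirm that ChatGPT removed only the repeats.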

Step 3: Standardising Categorical Data 🏷

Scenario

Survey respondents may provide different variations of the same answer in Q3 (Country), such as “USA,” “U.S.A.,” or “United States.”

Starter Prompt

Standardise Q3 (Country) by identifying inconsistencies in spelling, capitalisation, or abbreviations. Suggest a consistent format for each category. Summarise your output in the chat.

ChatGPT Output

📝 ChatGPT will list inconsistent values and propose a standardised format for categorical variables such as country names.

Follow-Up Prompt

Replace inconsistent values with standardised categories in Q3 (Country).
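
The standardisation ChatGPT proposes is essentially a lookup table. Below is a rough Python sketch of that approach; the mapping and the helper names are hypothetical, and a real dataset will almost certainly need more entries than shown here.

```python
# Hypothetical lookup table: normalised variant -> standard label.
COUNTRY_MAP = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def normalise_key(value):
    """Lowercase, drop full stops, and collapse spaces so variants share a key."""
    return " ".join(value.replace(".", "").lower().split())


def standardise_country(value):
    """Return the standard label, or the trimmed original if it is not mapped."""
    return COUNTRY_MAP.get(normalise_key(value), value.strip())


raw_values = ["USA", "U.S.A.", " united states ", "U.K.", "Canada "]
standardised = [standardise_country(v) for v in raw_values]
print(standardised)
```

Values that are not in the table pass through unchanged (apart from trimming), which makes it easy to spot anything the mapping has not covered yet.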

Step 4: Correcting Formatting Errors ✏️

Scenario

Data formatting errors in Q1 (Name) include extra spaces, inconsistent capitalisation, and incorrect name formatting.

Starter Prompt

Identify formatting inconsistencies in Q1 (Name), such as extra spaces and inconsistent capitalisation. Suggest corrected versions. Summarise your output in the chat.

ChatGPT Output

🔠 ChatGPT will detect inconsistencies like extra spaces (“ John Doe ”) or mixed capitalisation (“alice brown” vs. “Alice Brown”).

Follow-Up Prompt

Apply the suggested formatting corrections across Q1 (Name).
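
To see what these corrections amount to, here is a small Python sketch. It assumes a simple rule (trim, collapse repeated spaces, capitalise each word), which may not be exactly what ChatGPT applies, and the helper name is made up for this example.

```python
def clean_name(name):
    """Trim, collapse repeated spaces, and capitalise each word."""
    # Caveat: capitalize() lowercases the rest of each word, so names such as
    # "McDonald" or "anne-marie" would need a manual check afterwards.
    return " ".join(word.capitalize() for word in name.split())


raw_names = [" John Doe ", "alice brown", "ALICE   BROWN", "Alice Brown"]
cleaned_names = [clean_name(n) for n in raw_names]

# List only the names that actually changed, for a quick review.
changed = [(raw, new) for raw, new in zip(raw_names, cleaned_names) if raw != new]
for raw, new in changed:
    print(repr(raw), "->", repr(new))
```

Reviewing only the changed names keeps the check manageable and helps catch cases where automatic capitalisation gets a name wrong.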

Step 5: Cleaning Open-Ended Responses 💬

Scenario

Survey respondents provide open-ended responses in Q6 (Describe your experience using digital media). These responses contain typos, inconsistent punctuation, and extra spaces.

📌 Example Raw Responses

“ great way to stay in touch with friends ” (Extra spaces, lowercase first letter, missing full stop)
“it’s addictive, but also very useful.” (Lowercase first letter)
“I love how easy it is to access information!” (Correct)

Starter Prompt

Review responses in Q6 (Digital Media Experience). Identify inconsistencies in capitalisation, punctuation, and spacing. Suggest cleaned versions. Summarise your output in the chat.

ChatGPT Output

🛠 ChatGPT will detect inconsistencies such as missing punctuation, improper capitalisation, and extra spaces. It will generate cleaned versions of responses.

Follow-Up Prompt

Create a new column called Q6_Cleaned_Response with standardised and properly formatted responses.
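
As a point of comparison, the sketch below applies three simple rules in Python: collapse extra spaces, capitalise the first letter, and add a full stop when there is no closing punctuation. These rules are our assumption of a sensible baseline; they do not fix typos, which is where ChatGPT adds the most value. The "Q6" key is also assumed.

```python
import re


def clean_response(text):
    """Tidy spacing, capitalise the first letter, and close with punctuation."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    if not cleaned:
        return cleaned  # leave empty responses empty
    cleaned = cleaned[0].upper() + cleaned[1:]
    if cleaned[-1] not in ".!?":
        cleaned += "."
    return cleaned


# Sample rows based on the raw responses shown above.
rows = [
    {"Q6": " great way to stay in touch with friends "},
    {"Q6": "it's addictive, but also very useful."},
    {"Q6": "I love how easy it is to access information!"},
]

# Write to a new column so the original responses are preserved.
for row in rows:
    row["Q6_Cleaned_Response"] = clean_response(row["Q6"])
    print(row["Q6_Cleaned_Response"])
```

Keeping the raw text alongside the cleaned version means you can always trace a cleaned response back to what the respondent actually wrote.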

Step 6: Handling Outliers 🚨

Scenario

Outliers in Q4 (Income) and Q5 (Satisfaction Score) can distort statistical analysis.

Starter Prompt

Identify outliers in Q4 (Income) and Q5 (Satisfaction Score) using statistical methods such as Z-score or IQR. Suggest whether they should be kept, adjusted, or removed. Summarise your output in the chat.

ChatGPT Output

📊 ChatGPT will highlight extreme values and suggest approaches such as truncation (removing extreme values) or winsorisation (replacing extreme values with the nearest value inside a chosen cut-off, such as the 5th or 95th percentile).

Follow-Up Prompt

Flag outliers in new columns Q4_Outlier_Flag and Q5_Outlier_Flag and recommend actions for handling them.
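
If you want to see how the IQR method works under the hood, here is a minimal Python sketch. It assumes the common rule of thumb that flags anything more than 1.5 × IQR beyond the quartiles; the sample values and the "Q4" and "Q5" keys are invented for illustration.

```python
import statistics

# Hypothetical income and satisfaction values, one pair per respondent.
incomes = [32000, 45000, 38000, 950000, 41000, 35000, 50000, 47000, 40000, 43000]
scores = [7, 8, 6, 9, 7, 8, 7, 99, 6, 8]
rows = [{"Q4": income, "Q5": score} for income, score in zip(incomes, scores)]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def add_outlier_flag(rows, column, flag_column):
    """Add a True/False column marking values outside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows])
    for row in rows:
        row[flag_column] = not (low <= row[column] <= high)
    return low, high


add_outlier_flag(rows, "Q4", "Q4_Outlier_Flag")
add_outlier_flag(rows, "Q5", "Q5_Outlier_Flag")

for row in rows:
    if row["Q4_Outlier_Flag"] or row["Q5_Outlier_Flag"]:
        print(row)
```

Flagging rather than deleting keeps the decision in your hands: an income of 950000 could be a typo or a genuine high earner, and only context can tell you which.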

Step 7: Finalising and Exporting Clean Data 🏁

Scenario

Now that the data has been cleaned, we need to validate and save the final dataset.

Starter Prompt

Perform a final check on the dataset to ensure all missing values, duplicates, and formatting errors have been resolved. Generate a summary of the cleaning steps applied. Summarise your output in the chat.

ChatGPT Output

✅ ChatGPT will confirm that all key cleaning steps have been completed and provide a summary report.

Follow-Up Prompt

Save the cleaned dataset as a new file, ensuring that all corrections are applied.
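
It is worth running an independent check of your own before trusting the exported file. The Python sketch below is one possible version: it assumes "Q0" is the ID column, uses a hypothetical file name (cleaned_survey.csv), and writes to the system temp folder purely for demonstration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical cleaned rows; every value is stored as text, as in a CSV file.
rows = [
    {"Q0": "R001", "Q1": "Alice Brown", "Q2": "34", "Q3": "United States"},
    {"Q0": "R002", "Q1": "John Doe", "Q2": "Missing", "Q3": "United Kingdom"},
    {"Q0": "R003", "Q1": "Sam Lee", "Q2": "51", "Q3": "United States"},
]


def validate(rows, id_column, required_columns):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    ids = [row[id_column] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append(f"duplicate values in {id_column}")
    for column in required_columns:
        blanks = sum(1 for row in rows if str(row.get(column, "")).strip() == "")
        if blanks:
            problems.append(f"{blanks} blank value(s) in {column}")
    return problems


problems = validate(rows, "Q0", ["Q1", "Q2", "Q3"])
print("Problems found:", problems)

# Save to a new file so the original messy dataset is never overwritten.
out_path = Path(tempfile.gettempdir()) / "cleaned_survey.csv"
with out_path.open("w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
print("Saved to", out_path)
```

The same validate function can be pointed at the file ChatGPT gives you by loading it with csv.DictReader first.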

Next Steps: Where to Go From Here? ✨📊

🧪 Experimentation: Apply these cleaning techniques to different datasets such as customer data, social media insights, or research surveys.
🔄 Validation: Test the impact of data cleaning on analysis. Does removing outliers change trends? Does standardising categories improve segmentation?
🤖 Automation: Use ChatGPT to automate repetitive cleaning tasks or generate Python scripts for handling large datasets!
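
To make the validation question above concrete, here is a short Python sketch that compares the mean and median income before and after removing IQR outliers. The income values are invented and the 1.5 × IQR rule is an assumption; treat it as a starting point for your own checks.

```python
import statistics

# Hypothetical income values with one extreme entry.
incomes = [32000, 45000, 38000, 950000, 41000, 35000, 50000, 47000, 40000, 43000]


def summarise(values):
    """Return the mean (rounded to 2 decimal places) and the median."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
    }


# Keep only values inside the 1.5 * IQR fences.
q1, _, q3 = statistics.quantiles(incomes, n=4)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
kept = [value for value in incomes if lower <= value <= upper]

before = summarise(incomes)
after = summarise(kept)
print("Before:", before)
print("After: ", after)
```

In this made-up example the mean falls sharply once the extreme value is removed while the median barely moves, which is a useful reminder that the median is often the safer summary for skewed data such as income.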

This guide is just the starting point. The more you experiment, the better you’ll get at cleaning and structuring messy data!

This training guide is yours to use however you like. Share it, apply it in your business, or adapt it to fit your brand. That is what DIY culture is all about! If you find it valuable, consider supporting my work with a coffee or a beer through my Buy Me a Coffee link below. Cheers! ☕🍻