Abhinandan Borse

"Dirty" Data Issues: General


  1. Missing Values:

    • Example: In a dataset of customer orders, some entries in the "Address" field are blank.

    • Cleaning: You can either remove rows with missing addresses or fill in the missing values (e.g., with "Unknown").
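A minimal sketch of both options, using only the Python standard library. The `orders` records and the `is_missing` helper are hypothetical, made up for illustration.

```python
# Hypothetical order records; None or "" marks a missing address.
orders = [
    {"order_id": 1, "address": "12 Elm St"},
    {"order_id": 2, "address": ""},
    {"order_id": 3, "address": None},
]

def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""

# Option A: remove rows with a missing address.
dropped = [row for row in orders if not is_missing(row["address"])]

# Option B: keep every row and fill the gap with a placeholder.
filled = [
    {**row, "address": "Unknown" if is_missing(row["address"]) else row["address"]}
    for row in orders
]
```

Dropping rows loses the rest of the order; filling keeps the row but makes "Unknown" a value that later steps have to recognize.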


  2. Outliers:

    • Example: In a dataset of employee salaries, there's an entry with an unusually high value.

    • Cleaning: Identify outliers with statistical methods (e.g., Z-score), then check each one before correcting or removing it, since an extreme value can still be genuine.
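The Z-score check could look roughly like this. The salary figures and the 2.5 threshold are assumptions chosen for the example; 3 is a common choice on larger datasets, but with only ten points no Z-score can exceed 3.

```python
import statistics

def z_score_outliers(values, threshold=2.5):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical salaries with one suspiciously high entry.
salaries = [50000, 52000, 48000, 51000, 49000, 53000, 47000, 50500, 49500, 500000]
outliers = z_score_outliers(salaries)
```

The function only flags candidates; whether to correct or remove them is still a judgment call.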


  3. Duplicate Records:

    • Example: In a database of sales transactions, there are two identical entries for the same sale.

    • Cleaning: Duplicates can be identified using unique identifiers (e.g., transaction IDs) and removed.
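A small sketch of de-duplication by identifier. The `transaction_id` field name and the sample rows are assumptions; the first occurrence of each ID is kept.

```python
# Hypothetical sales rows; T1001 appears twice.
transactions = [
    {"transaction_id": "T1001", "amount": 25.00},
    {"transaction_id": "T1002", "amount": 40.00},
    {"transaction_id": "T1001", "amount": 25.00},
]

def drop_duplicates(rows, key):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

deduped = drop_duplicates(transactions, "transaction_id")
```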


  4. Incorrect Data Types:

    • Example: A column that should contain dates is formatted as text.

    • Cleaning: Convert the data type to the correct format (e.g., using Excel's "Text to Columns" or pandas' pd.to_datetime).
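As a rough illustration with the standard library, text can be converted to real date objects as follows. The sample strings and the assumed "yyyy-mm-dd" input format are hypothetical; pandas' pd.to_datetime does the same job for whole columns.

```python
from datetime import datetime, date

# Hypothetical text column that should hold dates.
raw_dates = ["2023-01-15", "2023-02-28", "not a date"]

def parse_iso_date(text):
    """Convert 'yyyy-mm-dd' text to a date, or None if it does not parse."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

parsed = [parse_iso_date(t) for t in raw_dates]
```

Returning None for unparseable text keeps the bad rows visible so they can be reviewed rather than silently dropped.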


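  5. Inconsistent Formatting:

    • Example: Dates appear in a mix of 'dd/mm/yyyy' and 'mm/dd/yyyy' formats, so a value like 03/04/2023 is ambiguous.

    • Cleaning: Standardize on a single date format (e.g., using Excel's "Text to Columns" or Python's datetime functions). Establish which format each record came in first, because the text alone cannot tell 3 April from 4 March.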
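One way to sketch this with Python's datetime functions, assuming each record carries (or can be tagged with) its source format. The `records` list and its `format` field are hypothetical; ISO 'yyyy-mm-dd' is picked here as the target format.

```python
from datetime import datetime

# Hypothetical records, each tagged with the format its source system uses.
records = [
    {"date": "25/12/2023", "format": "%d/%m/%Y"},
    {"date": "12/25/2023", "format": "%m/%d/%Y"},
    {"date": "03/04/2023", "format": "%d/%m/%Y"},
]

def to_iso(text, source_format):
    """Parse text in its known source format and re-emit it as yyyy-mm-dd."""
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d")

standardized = [to_iso(r["date"], r["format"]) for r in records]
```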


  6. Inaccurate or Incomplete Data:

    • Example: A sensor recorded a temperature of -300 degrees Celsius, which is physically impossible (absolute zero is -273.15 degrees Celsius).

    • Cleaning: Identify and correct data points that are outside of plausible ranges based on domain knowledge.
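A plausibility check might be sketched like this. The lower bound is absolute zero; the upper bound of 60 degrees Celsius is an assumption that would have to come from domain knowledge about the sensor.

```python
ABSOLUTE_ZERO_C = -273.15  # physical lower limit
MAX_PLAUSIBLE_C = 60.0     # assumed upper limit for this hypothetical sensor

# Hypothetical temperature readings in degrees Celsius.
readings = [21.5, -300.0, 19.8, 75.2]

def split_by_range(values, low, high):
    """Separate values inside [low, high] from those outside it."""
    valid = [v for v in values if low <= v <= high]
    suspect = [v for v in values if not (low <= v <= high)]
    return valid, suspect

valid, suspect = split_by_range(readings, ABSOLUTE_ZERO_C, MAX_PLAUSIBLE_C)
```

Keeping the suspect values in their own list makes it possible to correct them later instead of discarding them outright.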


  7. Non-Standard Values:

    • Example: Gender is recorded as "M," "Male," and "male."

    • Cleaning: Standardize values to a consistent format (e.g., using Excel's Find and Replace or Python's string manipulation functions).
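With Python's string functions, the mapping could be sketched as below. `GENDER_MAP` and its target labels are assumptions for illustration; values that are not in the map are returned unchanged so they can be reviewed.

```python
# Hypothetical lookup from lowercased variants to one canonical label.
GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

def standardize_gender(value):
    """Map known variants to a canonical label; leave unknown values as-is."""
    key = value.strip().lower()
    return GENDER_MAP.get(key, value)

raw = ["M", "Male", "male", " F ", "n/a"]
standardized = [standardize_gender(v) for v in raw]
```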


  8. Unusual Characters:

    • Example: Text fields contain special characters like emojis or non-printable characters.

    • Cleaning: Remove or replace unusual characters using text processing functions.
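A rough sketch using the standard unicodedata module. Which Unicode categories count as "unusual" is an assumption here: control (Cc), format (Cf), and "other symbol" (So, which covers most emoji) are removed. Emoji built from several code points may leave fragments behind, so treat this as a starting point.

```python
import unicodedata

# Assumed set of Unicode general categories to strip.
REMOVED_CATEGORIES = {"Cc", "Cf", "So"}

def clean_text(text):
    """Drop control, format, and symbol characters, then collapse whitespace."""
    kept = [
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in REMOVED_CATEGORIES
    ]
    return " ".join("".join(kept).split())

cleaned = clean_text("Great product!\x00 \U0001F600")
```

Accented letters such as "é" are ordinary letters and are left alone.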


  9. Data Integrity Issues:

    • Example: In a database, a foreign key value doesn't match any corresponding primary key in the referenced table.

    • Cleaning: Identify and rectify discrepancies, possibly by updating or removing the incorrect foreign key references.
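Finding the broken references can be sketched in plain Python. The `customers` and `orders` tables and their column names are hypothetical; in a real database the same check is usually a query against the two tables.

```python
# Hypothetical parent table (primary key: customer_id).
customers = [{"customer_id": 1}, {"customer_id": 2}]

# Hypothetical child table; order 11 points at a customer that does not exist.
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

# Collect the valid primary keys, then flag rows whose foreign key is not among them.
valid_ids = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in valid_ids]
```

Each orphan then needs a decision: fix the reference, restore the missing parent row, or remove the orphaned row.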


  10. Data Privacy Concerns:

    • Example: Social Security Numbers or credit card numbers are stored in a dataset without proper encryption or masking.

    • Cleaning: Apply data masking techniques or encryption to protect sensitive information.
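A minimal masking sketch, assuming the goal is to show only the last four digits of a card number. `mask_card_number` is a hypothetical helper and the number is a made-up test value. Masking hides data for display or analysis; it is not a substitute for encrypting the stored original.

```python
import re

def mask_card_number(number):
    """Replace all but the last four digits with asterisks."""
    digits = re.sub(r"\D", "", number)  # strip spaces, dashes, etc.
    if len(digits) <= 4:
        return "*" * len(digits)  # too short to reveal anything safely
    return "*" * (len(digits) - 4) + digits[-4:]

masked = mask_card_number("4111 1111 1111 1111")
```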

