Missing Values:
Example: In a dataset of customer orders, some entries in the "Address" field are blank.
Cleaning: You can either remove rows with missing addresses or fill in the missing values (e.g., with "Unknown").
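Both options can be sketched in pandas. This is a minimal illustration, not a prescribed method: the `orders` table and its `order_id` column are made up for the example, and it assumes that empty or whitespace-only strings should count as missing alongside true nulls.

```python
import pandas as pd

# Hypothetical order data: one null address and one empty-string address
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "Address": ["12 Elm St", None, ""],
})

# Treat nulls and blank strings alike as missing
is_missing = orders["Address"].fillna("").str.strip() == ""

# Option 1: remove rows with a missing address
dropped = orders[~is_missing]

# Option 2: keep every row and fill the gaps with a placeholder
filled = orders.copy()
filled.loc[is_missing, "Address"] = "Unknown"
```

Dropping rows loses the rest of the order record, so filling is usually the safer choice when the other columns are still useful.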
Outliers:
Example: In a dataset of employee salaries, there's an entry with an unusually high value.
Cleaning: Identify outliers with a statistical method (e.g., Z-score), then confirm whether each one is an error before correcting or removing it, since an extreme value can still be genuine.
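A Z-score check can be sketched with the standard library alone. The salary figures and the threshold of 3 are assumptions for illustration; 3 is a common convention rather than a rule, and in very small samples a single extreme value inflates the standard deviation enough that it may never cross the threshold.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical salaries with one unusually high entry
salaries = [52000, 48000, 55000, 51000, 49500, 53000,
            50500, 47000, 54000, 51500, 950000]
flagged = zscore_outliers(salaries)
```

The function only flags values; deciding whether to correct or remove them is left to a review step.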
Duplicate Records:
Example: In a database of sales transactions, there are two identical entries for the same sale.
Cleaning: Duplicates can be identified using unique identifiers (e.g., transaction IDs) and removed.
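In pandas, duplicates on an identifier column can be listed and then dropped as sketched below. The `sales` table and the `transaction_id` column name are hypothetical, and keeping the first occurrence is an assumption; another policy may suit your data better.

```python
import pandas as pd

# Hypothetical sales data where transaction T002 was recorded twice
sales = pd.DataFrame({
    "transaction_id": ["T001", "T002", "T002", "T003"],
    "amount": [19.99, 5.00, 5.00, 42.50],
})

# Rows that repeat an earlier transaction_id (useful for review before deleting)
duplicate_rows = sales[sales.duplicated(subset="transaction_id", keep="first")]

# Keep the first occurrence of each transaction_id
deduplicated = (
    sales.drop_duplicates(subset="transaction_id", keep="first")
    .reset_index(drop=True)
)
```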
Incorrect Data Types:
Example: A column that should contain dates is formatted as text.
Cleaning: Convert the data type to the correct format (e.g., using Excel's "Text to Columns" or pandas' pd.to_datetime).
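A short sketch of the pandas route follows. The column name `order_date` and the ISO-style input format are assumptions; `errors="coerce"` turns anything unparseable into `NaT` so bad values surface as missing instead of raising an error.

```python
import pandas as pd

# Hypothetical column of dates stored as text, including two bad values
df = pd.DataFrame({"order_date": ["2023-01-15", "2023-02-30", "not a date"]})

# Parse with an explicit format; invalid entries become NaT
df["order_date"] = pd.to_datetime(
    df["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Rows that failed to parse and need manual attention
unparsed_count = int(df["order_date"].isna().sum())
```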
Inconsistent Formatting:
Example: Dates are inconsistently formatted as 'dd/mm/yyyy' and 'mm/dd/yyyy'.
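Cleaning: Standardize all dates to one format (e.g., using Excel's "Text to Columns" or Python's datetime functions). A value such as 03/04/2023 is valid in both formats, so establish which format each record uses before converting.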
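One way to standardize mixed dates is to parse each value with the format of its source and emit ISO 8601. In this sketch the `"eu"` and `"us"` source labels are hypothetical; it assumes each record carries some indication of where it came from, because the text alone cannot distinguish the two formats.

```python
from datetime import datetime

# Hypothetical mapping from a record's source to its date format
FORMATS = {"eu": "%d/%m/%Y", "us": "%m/%d/%Y"}

def to_iso(date_text, source):
    """Parse date_text using the source's format and return yyyy-mm-dd."""
    return datetime.strptime(date_text, FORMATS[source]).date().isoformat()

# The same text means different days depending on the source
records = [("03/04/2023", "eu"), ("03/04/2023", "us"), ("25/12/2023", "eu")]
standardized = [to_iso(text, source) for text, source in records]
```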
Inaccurate or Incomplete Data:
Example: A sensor recorded a temperature of -300 degrees Celsius, which is physically impossible because it is below absolute zero (-273.15 degrees Celsius).
Cleaning: Identify and correct data points that are outside of plausible ranges based on domain knowledge.
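A plausibility check can be as simple as a range filter. In the sketch below the lower bound is absolute zero, while the upper bound `SENSOR_MAX_C` is a made-up limit standing in for whatever the sensor's specification says. Implausible readings are replaced with `None` so they can be reviewed or imputed later.

```python
ABSOLUTE_ZERO_C = -273.15  # physical lower limit in degrees Celsius
SENSOR_MAX_C = 150.0       # hypothetical upper limit for this sensor

def flag_implausible(readings, low=ABSOLUTE_ZERO_C, high=SENSOR_MAX_C):
    """Replace readings outside [low, high] with None."""
    return [r if low <= r <= high else None for r in readings]

cleaned = flag_implausible([21.5, -300.0, 19.8])
```

In practice the bounds would usually be much tighter than absolute zero, based on where the sensor is installed.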
Non-Standard Values:
Example: Gender is recorded as "M", "Male", and "male".
Cleaning: Standardize values to a consistent format (e.g., using Excel's Find and Replace or Python's string manipulation functions).
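In Python, a lookup table applied after trimming and lowercasing handles most variants. The mapping below is illustrative only and assumes the target labels are "Male" and "Female"; values not in the table are returned unchanged so they can be reviewed instead of silently altered.

```python
# Hypothetical mapping from normalized input to the canonical label
GENDER_MAP = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}

def standardize_gender(value):
    """Map known variants to a canonical label; leave unknown values as-is."""
    key = value.strip().lower()
    return GENDER_MAP.get(key, value)

standardized = [standardize_gender(v) for v in ["M", "Male", "male", " F "]]
```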
Unusual Characters:
Example: Text fields contain special characters like emojis or non-printable characters.
Cleaning: Remove or replace unusual characters using text processing functions.
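One possible text-processing approach uses Unicode character categories from the standard library. This sketch assumes that control and format characters (categories starting with "C") and "Symbol, other" characters (category "So", which covers most emojis) are unwanted. That choice is deliberately blunt: it also strips symbols such as the copyright and degree signs, so adjust it to your data.

```python
import unicodedata

KEEP = {"\n", "\t"}  # whitespace control characters worth preserving

def clean_text(text):
    """Drop control/format characters and 'Symbol, other' characters."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch in KEEP or not (category.startswith("C") or category == "So"):
            kept.append(ch)
    return "".join(kept)

# Input contains a zero-width space, an emoji, and a NUL character
cleaned = clean_text("Great product\u200b \U0001F600!\x00")
```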
Data Integrity Issues:
Example: In a database, a foreign key value doesn't match any corresponding primary key in the referenced table.
Cleaning: Identify and rectify discrepancies, possibly by updating or removing the incorrect foreign key references.
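Outside the database, orphaned references can be found by comparing foreign key values against the set of existing primary keys. The table and column names below (`customers`, `orders`, `customer_id`) are hypothetical, and the rows are plain dictionaries to keep the sketch self-contained.

```python
def find_orphans(child_rows, parent_rows, foreign_key, primary_key):
    """Return child rows whose foreign key has no matching parent row."""
    valid_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in valid_keys]

# Hypothetical tables: order 11 points at a customer that does not exist
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},
]
orphans = find_orphans(orders, customers, "customer_id", "customer_id")
```

Whether an orphaned row should be repaired or deleted depends on whether the correct parent record can be recovered.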
Data Privacy Concerns:
Example: Social Security Numbers or credit card numbers are stored in a dataset without proper encryption or masking.
Cleaning: Apply data masking techniques or encryption to protect sensitive information.
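Masking can be sketched as replacing every digit except the last few with a placeholder. The function name and the choice to keep four digits visible are assumptions for illustration. Note that masking only hides values in the output; it does not replace encryption of the stored data.

```python
def mask_digits(value, visible=4, mask_char="*"):
    """Mask all digits except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)

# Made-up example values, not real identifiers
masked_ssn = mask_digits("123-45-6789")
masked_card = mask_digits("4111 1111 1111 1111")
```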