Data

#1) Data Cleaning

Data cleaning is the removal of noisy or incomplete data from the collection. Many methods that clean data automatically are available, but they are not robust.
This step carries out the routine cleaning work by:
(1) Filling in the missing data:
Missing data can be filled by methods such as the following (a pandas sketch follows the list):
Ignoring the tuple (i.e., choosing to disregard that row of the dataset).
Filling in the missing value manually.
Using a measure of central tendency, such as the mean, median, or mode.
Filling in the most probable value.
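A minimal pandas sketch of these fill strategies; the column names and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "income": [50_000, 62_000, None, 48_000, 51_000],
})

# Ignore the tuple: drop any row that contains a missing value.
dropped = df.dropna()

# Fill with a measure of central tendency (here, the column median).
filled_median = df.fillna(df.median(numeric_only=True))

# Approximate the "most probable value" with the most frequent value (mode).
filled_mode = df.fillna(df.mode().iloc[0])

print(filled_median)
```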

(2) Removing noisy data: random error or variance in a measured variable is called noise.
Methods to remove noise are the following (a smoothing sketch follows the list):
Binning: binning methods sort the values and distribute them into buckets, or bins; smoothing is then performed by consulting the neighboring values within each bin.
Smoothing can be done by bin means (each value in a bin is replaced by the bin mean), by bin medians (each value is replaced by the bin median), or by bin boundaries (the minimum and maximum values in a bin are its boundaries, and each value is replaced by the closest boundary value).
Identifying outliers (values that differ markedly from the rest of the data).
Resolving Inconsistencies
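A minimal NumPy sketch of smoothing by bin means and by bin boundaries; the nine values and the choice of three equal-frequency bins are illustrative:

```python
import numpy as np

values = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(values, 3)  # three equal-frequency bins

# Smoothing by bin means: each value becomes its bin's mean.
by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(by_mean)      # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]

# Smoothing by bin boundaries: each value becomes the closer of
# its bin's minimum and maximum.
by_boundary = np.concatenate([
    np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins
])
print(by_boundary)  # [ 4  4 15 21 21 24 25 25 34]
```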

#2) Data Integration

When multiple heterogeneous data sources such as databases, data cubes or files are combined for analysis, this process is called data integration. This can help in improving the accuracy and speed of the data mining process.
Different databases use different naming conventions for variables, which causes redundancies when sources are combined. Additional data cleaning can be performed to remove the redundancies and inconsistencies introduced by data integration without affecting the reliability of the data.
Data integration can be performed using data migration tools such as Oracle Data Service Integrator and Microsoft SQL Server Integration Services (SSIS).
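At a small scale, the same idea can be sketched with pandas; the two tables, their column names, and the naming mismatch are hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2], "total": [99.0, 42.5]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Lin"]})

# Resolve the naming-convention mismatch, then merge into one view.
orders = orders.rename(columns={"cust_id": "customer_id"})
combined = orders.merge(customers, on="customer_id", how="inner")

# Additional cleaning: drop exact duplicates the integration may add.
combined = combined.drop_duplicates()
print(combined)
```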

#3) Data Reduction

This technique is applied to obtain a reduced representation of the data for analysis: much smaller in volume, yet maintaining the integrity of the original data. Learning models such as decision trees, Naive Bayes, and neural networks are sometimes used to guide the reduction, for example by indicating which attributes matter most.
Some strategies of data reduction are (a PCA sketch follows this list):
Dimensionality Reduction: Reducing the number of attributes in the dataset.
Numerosity Reduction: Replacing the original data volume by smaller forms of data representation.
Data Compression: Compressed representation of the original data.
In short: summarize, classify, and group the data.
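A minimal dimensionality-reduction sketch using PCA from scikit-learn; the 4-attribute random input and the choice of 2 components are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 rows, 4 attributes

pca = PCA(n_components=2)              # keep 2 derived attributes
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance retained per component
```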

#4) Data Transformation

In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process becomes more efficient and the patterns are easier to understand. Data transformation involves data mapping and code generation.
Strategies for data transformation are (a normalization and discretization sketch follows this list):
Smoothing: Removing noise from data using clustering, regression techniques, etc.
Aggregation: Summary operations are applied to data.
Normalization: Scaling of data to fall within a smaller range.
Discretization: Raw values of numeric data are replaced by intervals. For example, raw age values can be replaced by ranges such as 0-17 and 18-35.
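A minimal sketch of normalization and discretization in pandas; the ages, bin edges, and labels are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])

# Normalization: min-max scale the values into the range [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw ages with labeled intervals.
groups = pd.cut(ages, bins=[0, 17, 35, 60, 120],
                labels=["child", "young adult", "adult", "senior"])

print(pd.DataFrame({"age": ages, "scaled": normalized, "group": groups}))
```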

#5) Data Mining

Data mining is the process of identifying interesting patterns and knowledge from a large amount of data. In this step, intelligent methods are applied to extract data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques; a small clustering sketch follows.
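A minimal clustering sketch with scikit-learn's k-means; the toy points and the choice of two clusters are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each row
print(model.cluster_centers_)  # the discovered pattern: centroids
```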

#6) Pattern Evaluation

This step identifies the truly interesting patterns that represent knowledge, based on interestingness measures. Data summarization and visualization methods are used to make the results understandable to the user; a sketch of two classic measures follows.
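A minimal sketch of two classic interestingness measures, support and confidence, for an association rule A -> B; the toy transactions are assumptions:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

a, b = {"bread"}, {"milk"}        # candidate rule: bread -> milk
n = len(transactions)
support = sum((a | b) <= t for t in transactions) / n
confidence = support / (sum(a <= t for t in transactions) / n)

print(f"support={support:.2f}, confidence={confidence:.2f}")
```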

#7) Knowledge Representation

Knowledge representation is the step where data visualization and knowledge representation tools are used to present the mined results. Results are visualized in the form of reports, tables, charts, etc.
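A minimal presentation sketch with matplotlib; the cluster counts stand in for hypothetical output of the mining step:

```python
import matplotlib.pyplot as plt

cluster_sizes = {"cluster 0": 120, "cluster 1": 45, "cluster 2": 80}

plt.bar(list(cluster_sizes), list(cluster_sizes.values()))
plt.title("Records per discovered cluster")
plt.ylabel("count")
plt.savefig("clusters.png")    # or plt.show() for interactive use
```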

Extra:

1. Monitor errors

Keep a record of trends showing where most of your errors come from. This will make it a lot easier to identify and fix incorrect or corrupt data. Records are especially important if you are integrating other solutions with your fleet management software, so that errors don't clog up the work of other departments.

2. Standardize your process

Standardize the point of entry to help reduce the risk of duplication.

3. Validate data accuracy

Once you have cleaned your existing database, validate the accuracy of your data. Research and invest in data tools that let you clean your data in real time. Some tools even use AI or machine learning to better test for accuracy; a small rule-based sketch follows.
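A minimal rule-based validation sketch in pandas; the column names and the fleet-style rules (non-negative odometer, driver present) are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"odometer_km": [1200, -5, 98000],
                   "driver_id": ["D1", None, "D3"]})

violations = pd.DataFrame({
    "negative_odometer": df["odometer_km"] < 0,
    "missing_driver": df["driver_id"].isna(),
})

print(df[violations.any(axis=1)])  # rows that fail any accuracy rule
```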
