What is it?

From Wikipedia:

“Data Integration involves combining data residing in different sources and providing users with a unified view of these data.”

from_to

Here is an example:

In a dream world, companies could perfectly manage every piece of information, even the smallest and most hidden ones. Each department would be well connected with all the others, information would flow freely, and everyone could reach the data they need at the right time.

In the real world that situation is very difficult to achieve. It’s quite common to have different data sources in a work environment: different departments usually have different ways to collect and store data about clients and products, so data quickly becomes overwhelming and information is:

  • duplicated
  • fragmentary
  • unrelated
  • not up-to-date
  • inaccessible
  • unreadable

Data Integration is a solution to all these problems, and with a small impact on the already consolidated information system!

ETL and Federation

Today there are many systems for Data Integration that solve these problems in different ways, but the most common approaches are ETL and Federation, which are diametrically opposed.

ETL stands for Extract, Transform and Load, and identifies those tools that extract data from the sources, manipulate, check and integrate it in some way, and then load the integrated data into a new database or data warehouse.

These tools are usually easy to use but also have very big drawbacks, such as:

  • you have to duplicate the data
  • integration is a scheduled event, for example a nightly run, so the data can be up to one day old

etl_eng
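
The three ETL phases can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two in-memory source lists, their field names, the `clients` table and the rule of deduplicating on the email address are hypothetical and do not describe any specific ETL tool.

```python
import sqlite3

# Hypothetical departmental sources with different field names.
sales_rows = [
    {"client": "ACME Corp ", "email": "INFO@ACME.COM"},
    {"client": "Globex", "email": "sales@globex.com"},
]
support_rows = [
    {"customer_name": "acme corp", "mail": "info@acme.com"},
    {"customer_name": "Initech", "mail": "help@initech.com"},
]

def extract():
    """Extract: read raw (name, email) pairs from every source."""
    for row in sales_rows:
        yield row["client"], row["email"]
    for row in support_rows:
        yield row["customer_name"], row["mail"]

def transform(records):
    """Transform: normalize the fields and drop duplicates (keyed on email)."""
    seen = {}
    for name, email in records:
        key = email.strip().lower()
        if key not in seen:  # the first occurrence wins
            seen[key] = (name.strip().title(), key)
    return list(seen.values())

def load(rows, conn):
    """Load: write the integrated copy into a separate warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clients (name TEXT, email TEXT PRIMARY KEY)"
    )
    conn.executemany("INSERT OR REPLACE INTO clients VALUES (?, ?)", rows)
    conn.commit()

# In-memory database standing in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
load(transform(extract()), warehouse)
integrated = warehouse.execute(
    "SELECT name, email FROM clients ORDER BY name"
).fetchall()
print(integrated)
```

Note that the warehouse holds a second copy of the data, and that copy is only as fresh as the last run of the script, which is exactly the pair of drawbacks listed above.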

With Federation, instead, a new unified representation called the global schema is created for the user to work with. This is also the approach used by MOMIS.

Once the global schema is up and running, the user can query it directly and get a complete and uniform view of all the existing data.

Look at the steps triggered by a query:

  • a user submits a query to the system, also called the federator
  • the query is split into subqueries, one submitted to each source
  • all answers are collected and merged by the federator
  • the federator presents the user with a well-formed and complete answer

There’s no need to duplicate information, and each query shows the user the latest available data.

federation_eng
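
The query steps of a federator can be sketched in a few lines of Python. This is only an illustrative toy and not how MOMIS is implemented: the two sources, their local field names, the global schema (`name`, `total_orders`, `open_tickets`) and all function names are hypothetical.

```python
# Hypothetical sources, each with its own local schema.
SALES_DB = [
    {"client": "ACME Corp", "total_orders": 12},
    {"client": "Globex", "total_orders": 3},
]
SUPPORT_DB = [
    {"customer_name": "ACME Corp", "open_tickets": 2},
    {"customer_name": "Initech", "open_tickets": 5},
]

def query_sales(name):
    """Subquery for the sales source, mapped to global-schema field names."""
    return [
        {"name": r["client"], "total_orders": r["total_orders"]}
        for r in SALES_DB
        if r["client"] == name
    ]

def query_support(name):
    """Subquery for the support source, mapped to global-schema field names."""
    return [
        {"name": r["customer_name"], "open_tickets": r["open_tickets"]}
        for r in SUPPORT_DB
        if r["customer_name"] == name
    ]

def federated_query(name):
    """Federator: split the query, collect the answers, merge them."""
    # Split: one subquery per source, run against the live data.
    partial_answers = query_sales(name) + query_support(name)
    if not partial_answers:
        return None
    # Merge: fill one global-schema record from the partial answers.
    merged = {"name": name, "total_orders": None, "open_tickets": None}
    for answer in partial_answers:
        merged.update(answer)
    return merged

result = federated_query("ACME Corp")
print(result)
```

Nothing is copied into a new store: each call reads the sources as they are at query time, so a client known to only one department (for example `federated_query("Initech")`) still gets an answer, with the missing fields left empty.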