Data Warehousing

Data Warehousing is the concept of collection and storage of data from various sources or databases after performing the ETL process (Extract, Transform, Load). Data warehouse is a computing system that is used for reporting and data analysis, and it has proved to be an important component of business intelligence. They store historical and current data in one single place and is beneficial for companies as it enables them to draw insights from the data effectively and arrive at decisions.

Image Source: corporatefinanceinstitute.com

Difference between Database and Data Warehouse

OLTP Database

A database is designed to capture and record data from a variety of enterprise systems i.e. OLTP - Online Transaction Processing. OLTP includes resource planning, customer relationships, transaction management etc. OLTP database is an operational database which contains highly detailed live, real-time data.

These databases should operate consistently for the success of a business and there should be no error or ambiguity in the data that is being stored. The transactions stored in the OLTP databases are simple and direct which can be queried with speed and ease. However, databases work slowly for querying large amounts of data and can slowdown the transactional process which negatively affects the business.

OLAP Data Warehouse

On the other hand, data warehouses are designed to store summarized data from various databases or sources after transforming the data according to the business necessity for analysis, reporting and decision making, i.e. OLAP - Online Analytical Processing. Since data from various sources is stored in a warehouse, the structure of a warehouse should be thoroughly pre-planned.

Quering huge amounts of data is easy with data warehouses, and it does not generally affect business as fresh data is generally not dynamically stored in the data warehouse, rather data is updated periodically. Data stored is secure, reliable and can be easily managed and retrieved.

Data Warehouse Architecture

Data warehouse architecture comes in three types:

Two-tier
Three-tier
Four-tier

Two-tier architecture:

Two-tier architecture is basically a client-server architecture. The end users are directly communicating with data sources. It is easy to maintain, and the communication is fast as there is no mediator in between.

In this type of architecture, the performance of the warehouse decreases when the number of end users increases, and it is also cost-ineffective.

Image source: dwgeek.com

Three-tier architecture:

Three-tier architecture has three layers. The top tier, middle tier and bottom tier. The top tier is the User layer, which allows users to retrieve data from the data warehouse and analyze, generate visualizations and share insights with the team. The middle tier is the OLAP layer, which acts as an interface between the User and Data layer, enhances the communication speed and processes large volumes of data. The bottom tier is the data layer that has the actual databases, and all RDBMS actions are performed in this layer.

The three-tier architecture improves performance as load is reduced on the User and Data layer. This architecture provides improved security as users do not have direct access to the database.

Image source: https://starship-knowledge.com/tag/data-warehousing-2

Four-tier architecture:

In four-tier architecture, a Presentation layer is included in addition to the layers present in the three-tier architecture. When users want to retrieve certain data from the warehouse, requests are sent from the user layer to the presentation layer which interacts with the business layer/OLAP layer which in turn accesses the data warehouse and the retrieved data is sent as a response to the user. The advantage of having a presentation layer is that it enhances the way the data is presented to the end user.

Image source: DATA WAREHOUSING AND DATA MINING: architecture model, 2-tier, 3-tier and 4-tier data warehouse (dwdmnotes.blogspot.com )

Why do businesses need Data Warehousing?

Let us consider a business that has several data sources:

1. Databases

2. Flat files like Excel spreadsheets

3. Web services

4. Online Surveying

5. Social media

Let us say these data are stored in various locations:

1. Local files on a computer

2. Data centers

3. Cloud storages

Ideally, an analyst would want all the business data in one place rather from these multiple sources and locations. If the data were to be sourced from various locations to query and extract business insights, the data will not be organized, there might be redundancy, disparity might arise from different data sources, data cleaning and processing might be required every time the data is queried.

The following are the advantages of having a centralized data storage like a data warehouse in an organization:

Data need not be fetched from various sources and locations every time we want to query the data
Data will not have disparities. Data is collected at different times, and it comes in from different sources so they can have inconsistent data types and timestamp, and varied levels of pre-processing is required. So, when data from different sources is pre-processed at regular intervals and loaded to a warehouse, the problem of disparity is handled
Data cleaning would not be required as it would have already been handled in ETL process before warehousing the data
Sharing data within teams will become easy. There will be no duplication of data, and all teams will work with the same format of data which results in better efficiency in terms of time and cost
As a business grows, so does the amount of data. A data warehouse scales effortlessly to accommodate more data and complex queries
Storage of historical data for trend analysis and forecasting is also an advantage of data warehousing

Steps involved in Designing a Data Warehouse

1. Define the business objectives

2. Identify all the data sources

3. Choose the right Data Warehouse architecture

4. Designing the data model

5. Choose the necessary technical tools

6. Implement ETL Processes

7. Maintain Data Quality

8. Testing and monitoring the warehouse

1. Define your business objectives

Ponder upon the purpose of implementing a data warehouse in the organization. The reasons for having a data warehouse in a business can be for data-driven strategic decision-making, to increase the operational efficiency with a centralized storage solution and for analysing the data to understand trends, which help in forecasting and better understanding of the customer’s needs.

2. Identify all the data sources

Identify all potential data sources, which can include CRM systems, ERP systems, operational databases, social media analytics, market research etc. accessing the volume of data each source generates. This will help in determining the scalability needs of the data warehouse. Also understand the frequency at which the data needs to be updated to the warehouse. Ensure that the data collected aligns with the standards and privacy policies.

3. Choose the right type of warehouse

Depending on the business objectives and requirements, choose the right type of warehouse architecture. The commonly used types are:

Centralized Data Warehouse: The one that we have been discussing in the article
Data Lake: This can hold unstructured or semi-structured data of huge volumes
Data Mart: Department specific data warehouse used for analytical needs of a specific department
Hybrid models: Structured data warehouses combined with data lakes are hybrid models
Cloud based: Amazon Redshift, Google Big Query, Snowflake etc. provide storage solutions with minimal physical infrastructure requirements

4. Designing the data model

Designing the model starts with understanding the data and identifying the key entities and how they are related to the data from other data sources as well. Choosing the right schema plays a pivotal role. The Star schema can be chosen for its simplicity and efficiency. The Snowflake schema can be chosen, which requires multiple tables related to each other.

5. Choose the necessary technical tools

In this step, choose required database management system, ETL tools, Analytics and BI tools. This step impacts how the data is processed, stored and analyzed, so choosing according to the organization’s needs is crucial.

6. Implement ETL processes

ETL process is one important step that decides how data is collected from different data sources, how unstructured data is processed and stored in the warehouse as structured data. Decide how frequently data needs to be updated to the warehouse. Whether data needs to be periodically updated in batches or if real-time processing is needed for dynamic data analysis. The ETL process can be automated as much as possible to reduce manual errors.

7. Maintain data quality

Once the architecture has been decided and the model has been set up, ensure that the data that flows into the warehouse is properly cleaned and processed. Quality standards should be setup to maintain the data quality. Ensure that the data collected has strict access controls and grant access to only necessary people in the organization. Regular audits are to be performed to check that the data storage process follows the organization’s policies and standards.

8. Testing and monitoring the warehouse

The warehouse has to be tested for its effective processing. Performance testing, security testing and User Acceptance Testing are done to ensure the smooth working of the warehouse, data storage and easy retrieval. The warehouse has to be monitored and checked upon from time to time to ensure optimal performance of the system.

To summarize, Data Warehousing increases the efficiency and performance of data storage and retrieval, while the data being retrieved will be of the highest quality information which is used in decision-making, which in turn increases the productivity of the organization. Implementing a data warehouse in an organization can also prove to be a cost and time effective solution.