CMU-Database-Notes

Topics:

  1. Database
  2. Purpose of Database Systems
  3. Flat File Strawman
  4. View of Data
  5. Database Languages
  6. Database Design Process
  7. Data Storage and Querying
  8. Database Architecture
  9. Database Users and Administrators
  10. CHAPTER 2

Database

A database is an organized collection of inter-related data that models some aspect of the real-world (e.g modeling the students in a class or a digital music store).


Purpose of Database Systems

A database system aims to address the limitations of the traditional file-processing systems by providing a more efficient, reliable, and flexible way to manage data. Some of the key purposes of a database system include:

A file-processing system is supported by a conventional operating system. The system stores permanent records in various files, and it requires different application programs to extract records from and add records to the appropriate files.

The file-processing system has several major disadvantages:


Flat File Strawman

A Flat File Strawman refers to a simplistic proposal or initial model for data storage using flat files, which are basic text files organized in rows and columns, such as CSV or TXT files. This approach is often utilized as a starting point in discussions or designs to explore data storage concepts without the complexities of relational databases. The term “strawman” signifies that this model is not fully developed and serves as a basis for further exploration, debate, or refinement before adopting a more sophisticated data management solution.


Issues with Flat Files

Issues with Flat Files refer to the various challenges and limitations encountered when using flat files for data storage in applications. Flat files, such as CSV files, are simple text files where data is organized in a tabular format, with records separated by new lines and attributes within records separated by commas. While flat files are easy to implement and manage for small datasets, they pose significant problems as the volume of data grows or as the complexity of data relationships increases.

1.Data Integrity

Ensuring consistency and validity of data can be challenging, especially when multiple entries relate to the same entity. For example, ensuring that artist details remain accurate across multiple albums.

2. Implementation

Locating specific records can be inefficient, and integrating with new applications or systems can be complicated, particularly if they operate on different platforms. Furthermore, concurrent write operations can lead to conflicts and data corruption.

3. Durability

Flat files are vulnerable to data loss in the event of system failures during updates, and maintaining high availability through replication across multiple systems can be complex and resource-intensive.


Database Management System

A Database Management System (DBMS) is software that allows applications to store and analyze information in a database. A general-purpose DBMS supports the definition, creation, querying, updating, and administration of databases in accordance with some data model.


View of Data

A major purpose of a database system is to provide users with an abstract view of the data. The system hides certain details about how the data are stored and maintained, allowing users to interact with data without needing to understand the complexities of its storage.

Data Abstraction

Data Abstraction

For the system to be usable, it must retrieve data efficiently. Designers use complex data structures to represent data in the database to ensure this efficiency. To simplify users’ interactions with the system, data abstraction is categorized into three levels:

Each level of abstraction ensures that users or programs only deal with the relevant data and ignore the complexities of other layers, improving efficiency and ease of use.


Data Model

A data model is a collection of concepts for describing the data in a database. A schema is a description of a particular collection of data, using a given data model.


Types of Data Models

  1. Hierarchical Data Model
    Organizes data in a tree-like structure with parent-child relationships. Each parent can have multiple children, but each child has only one parent.
    Example: File systems, organizational structures.

  2. Network Data Model
    Allows for more complex relationships with multiple parent-child links. Represents data as a graph, with nodes (entities) and edges (relationships).
    Example: Telecommunications networks.

  3. Relational Data Model
    Organizes data into tables (relations) with rows and columns. Each table represents an entity, and relationships are established through foreign keys.
    Example: SQL databases (e.g., MySQL, PostgreSQL).

  4. Object-oriented Data Model
    Combines data and behavior in objects, allowing for inheritance and encapsulation. Suitable for complex data structures and relationships.
    Example: Object-oriented databases (e.g., MongoDB).

  5. Entity-Relationship Model (ER Model)
    Visual representation of data entities and their relationships, primarily used for database design and conceptualization.
    Example: ER diagrams in database design.

  6. Document Data Model
    Stores data in documents, typically in formats like JSON or XML. Each document can have a varying structure, allowing flexibility.
    Example: NoSQL databases (e.g., MongoDB, CouchDB).

  7. Key-Value Data Model
    Stores data as a collection of key-value pairs, where each key is unique. Ideal for fast lookups and simple data structures.
    Example: Key-value stores (e.g., Redis, DynamoDB).

  8. Column-Family Data Model
    Organizes data into column families rather than rows, allowing for efficient storage and retrieval. Each column family can store different types of data.
    Example: Columnar databases (e.g., Apache Cassandra).

  9. Graph Data Model
    Represents data as nodes and edges, capturing relationships and interactions between entities. Optimized for connected data and complex queries.
    Example: Graph databases (e.g., Neo4j).


Database Languages

A database system provides two key languages:


Data Manipulation Languages (DMLs)

A Data Manipulation Language (DML) enables users to access or manipulate data organized by the appropriate data model. DMLs allow for the following types of access:

There are two types of DMLs:


Data-Definition Language (DDL)

A Data-Definition Language (DDL) is used to define the structure of the database. It specifies the database schema through a set of definitions. In addition to specifying the schema, DDL also allows users to define various properties of the data, such as:

DDL commands, such as CREATE, ALTER, and DROP, are used to define and modify the structure of the database.


Database Design Process

The database design process begins with a high-level data model that provides a conceptual framework for specifying user data requirements and structuring the database to meet those needs.

Phases of Database Design:

1. User Requirement Specification

2. Conceptual Schema Development

3. Defining Attributes and Tables

4. Functional Requirements Specification

5. Logical and Physical Design


The Entity-Relationship Model

The Entity-Relationship (E-R) data model is a framework for describing the logical structure of a database using basic objects known as entities and the relationships among them.

Key Concepts

Entities

Relationships

E-R Diagrams

The overall logical structure (schema) of a database can be visually represented using an E-R diagram. In E-R diagrams:

Constraints

The E-R model includes certain constraints that the database must adhere to. One important constraint is mapping cardinalities, which specify the number of entities that can be associated with another entity through a relationship set. For example, if each instructor can be associated with only one department, this constraint can be represented in the E-R model.


Normalization

Normalization is a method for designing a relational database aimed at generating a set of relation schemas that minimize redundancy while facilitating easy information retrieval. The process focuses on creating schemas that conform to appropriate normal forms.

Goals of Normalization

Key Concepts

To determine whether a relation schema is in a desirable normal form, additional information about the real-world enterprise being modeled is required. The most common approach to provide this information is through functional dependencies.

Issues with Poor Database Design

Bad database designs may exhibit undesirable properties, including:

Normalization helps address these issues, ensuring that the database structure effectively supports the necessary operations and relationships.


Data Storage and Querying

A database system is divided into modules that manage different responsibilities. The main components of a database system include the Storage Manager and the Query Processor.

Storage Manager

The Storage Manager is crucial as databases require significant storage space, ranging from hundreds of gigabytes to terabytes. Data is stored on disks, and movement between disk storage and main memory must be minimized due to the slower speed of disk operations compared to the CPU.

Responsibilities of the Storage Manager:

Components of the Storage Manager:

  1. Authorization and Integrity Manager: Ensures integrity constraints are met and verifies user access.
  2. Transaction Manager: Maintains database consistency during system failures and manages concurrent transactions.
  3. File Manager: Manages disk space allocation and data structures for stored information.
  4. Buffer Manager: Fetches data from disk to main memory and decides which data to cache, enabling handling of larger data sizes than available memory.

Data Structures Implemented:

Query Processor

The Query Processor simplifies and facilitates data access for users, enabling them to work at a logical level without dealing with physical implementation details.

Components of the Query Processor:

  1. DDL Interpreter: Interprets Data Definition Language (DDL) statements and updates the data dictionary.
  2. DML Compiler: Translates DML statements into evaluation plans of low-level instructions for the query evaluation engine, while optimizing the plan for efficiency.
  3. Query Evaluation Engine: Executes the low-level instructions generated by the DML compiler.

Database Architecture

Database architecture encompasses the various components of a database system and their interconnections. It is influenced by the underlying computer system and can be categorized into different types:

Databasearc

Types of Database Systems

  1. Centralized Systems: All components run on a single machine.
  2. Client-Server Systems: One server executes tasks on behalf of multiple clients.
  3. Distributed Databases: Data is spread across multiple geographically separated machines.
  4. Parallel Architectures: Designed to exploit parallel computing capabilities.

Architecture Overview

Application Partitioning

Database applications can be organized into two or three tiers:

Two-Tier Architecture

Two-Tier Architecture

Three-Tier Architecture

This architecture is more suitable for larger applications where business logic can be centralized in the application server.

Three-Tier Architecture


Database Users and Administrators

There are four different types of database-system users, differentiated by how they interact with the system:


Database Administrator (DBA)

One of the key reasons for using a DBMS is to centralize control over both the data and the programs that access it. The individual responsible for this central control is called the Database Administrator (DBA). The functions of a DBA include:


CHAPTER 2

Topics:

  1. Relational Model
  2. Relational Algebra
  3. Summary

Relational Model

The relational model defines a database abstraction based on relations to avoid maintenance overhead.

Key Ideas

Concepts of the Relational Data Model

  1. Structure:

    • Defines the relations and their contents, independent of their physical representation.
  2. Integrity:

    • Ensures that the contents of the database satisfy specific constraints.
  3. Manipulation:

    • Provides a programming interface for accessing and modifying the contents of the database.

Data Independence

Data Independence

Relational Model: Keys and Constraints

1. Primary Keys

2. Foreign Keys

3. Superkeys

4. Candidate Keys

5. Constraints


Relational Algebra

Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes in one or more relations as inputs, and outputs a new relation. To write queries we can ”chain” these operators together to create more complex operations.

relationalalgebra


Relational Algebra Operations

This document provides an overview of key operations in relational algebra, which is used for querying and manipulating relational databases. relational algebra

Selection (σ)

Selection is an operation that takes a relation and outputs a subset of tuples that satisfy a specified selection predicate.

Example:
If we have a relation Students and we want to select students older than 20:

σ_Age > 20(Students)


Projection (π)

Projection is an operation that takes a relation and outputs a new relation containing only specified attributes.

Example:
To project only the names and majors of students:

π_Name, Major(Students)


Union (∪)

Union combines the tuples from two relations and removes duplicates. Both relations must have the same attributes.

Example:
If UndergraduateStudents and GraduateStudents have the same structure, the operation would be:

UndergraduateStudents ∪ GraduateStudents


Intersection (∩)

Intersection returns the tuples that are present in both relations.

Example:
To find students enrolled in both the CS and Math courses:

Students_CS ∩ Students_Math


Difference (−)

Difference returns the tuples that are in the first relation but not in the second.

Example:
To find students who are not enrolled in any Math courses:

Students − Students_Math


Product (×)

Product (Cartesian product) combines all tuples from two relations, producing a new relation that contains every possible pair of tuples.

Example:
If we have relations Students and Courses, the operation would be:

Students × Courses


Join (⨝)

Join combines tuples from two relations based on a related attribute. There are several types of joins:

1. Inner Join

Returns only the tuples that have matching values in both relations.

Example:
To join Students and Enrollments on the StudentID:

Students ⨝ Enrollments

2. Left Outer Join

Returns all tuples from the left relation and matching tuples from the right relation. If there is no match, null values are returned.

3. Right Outer Join

Returns all tuples from the right relation and matching tuples from the left relation. If there is no match, null values are returned.

4. Full Outer Join

Returns all tuples when there is a match in either relation. Null values are filled in when there is no match.

Example:
To perform a left outer join between Students and Enrollments:

Students ⨝ Enrollments (Left Outer)


Summary