A database is an organized collection of inter-related data that models some aspect of the real-world (e.g modeling the students in a class or a digital music store).
A database system aims to address the limitations of the traditional file-processing systems by providing a more efficient, reliable, and flexible way to manage data. Some of the key purposes of a database system include:
A file-processing system is supported by a conventional operating system. The system stores permanent records in various files, and it requires different application programs to extract records from and add records to the appropriate files.
The file-processing system has several major disadvantages:
A Flat File Strawman refers to a simplistic proposal or initial model for data storage using flat files, which are basic text files organized in rows and columns, such as CSV or TXT files. This approach is often utilized as a starting point in discussions or designs to explore data storage concepts without the complexities of relational databases. The term “strawman” signifies that this model is not fully developed and serves as a basis for further exploration, debate, or refinement before adopting a more sophisticated data management solution.
Issues with Flat Files refer to the various challenges and limitations encountered when using flat files for data storage in applications. Flat files, such as CSV files, are simple text files where data is organized in a tabular format, with records separated by new lines and attributes within records separated by commas. While flat files are easy to implement and manage for small datasets, they pose significant problems as the volume of data grows or as the complexity of data relationships increases.
Ensuring consistency and validity of data can be challenging, especially when multiple entries relate to the same entity. For example, ensuring that artist details remain accurate across multiple albums.
Locating specific records can be inefficient, and integrating with new applications or systems can be complicated, particularly if they operate on different platforms. Furthermore, concurrent write operations can lead to conflicts and data corruption.
Flat files are vulnerable to data loss in the event of system failures during updates, and maintaining high availability through replication across multiple systems can be complex and resource-intensive.
A Database Management System (DBMS) is software that allows applications to store and analyze information in a database. A general-purpose DBMS supports the definition, creation, querying, updating, and administration of databases in accordance with some data model.
A major purpose of a database system is to provide users with an abstract view of the data. The system hides certain details about how the data are stored and maintained, allowing users to interact with data without needing to understand the complexities of its storage.
For the system to be usable, it must retrieve data efficiently. Designers use complex data structures to represent data in the database to ensure this efficiency. To simplify users’ interactions with the system, data abstraction is categorized into three levels:
Physical level: The lowest level of abstraction that describes how data is physically stored in the database. This level details the complex low-level data structures such as indexes, storage blocks, and file organization.
Logical level: The middle level of abstraction that describes what data is stored in the database and the relationships between the different data elements. This level defines tables, fields, and data types, but hides details of how data is physically stored.
View level: The highest level of abstraction that describes only a part of the entire database. It defines user-specific views of the data, often showing only relevant information to the user or application while hiding other details.
Each level of abstraction ensures that users or programs only deal with the relevant data and ignore the complexities of other layers, improving efficiency and ease of use.
A data model is a collection of concepts for describing the data in a database. A schema is a description of a particular collection of data, using a given data model.
Hierarchical Data Model
Organizes data in a tree-like structure with parent-child relationships. Each parent can have multiple children, but each child has only one parent.
Example: File systems, organizational structures.
Network Data Model
Allows for more complex relationships with multiple parent-child links. Represents data as a graph, with nodes (entities) and edges (relationships).
Example: Telecommunications networks.
Relational Data Model
Organizes data into tables (relations) with rows and columns. Each table represents an entity, and relationships are established through foreign keys.
Example: SQL databases (e.g., MySQL, PostgreSQL).
Object-oriented Data Model
Combines data and behavior in objects, allowing for inheritance and encapsulation. Suitable for complex data structures and relationships.
Example: Object-oriented databases (e.g., MongoDB).
Entity-Relationship Model (ER Model)
Visual representation of data entities and their relationships, primarily used for database design and conceptualization.
Example: ER diagrams in database design.
Document Data Model
Stores data in documents, typically in formats like JSON or XML. Each document can have a varying structure, allowing flexibility.
Example: NoSQL databases (e.g., MongoDB, CouchDB).
Key-Value Data Model
Stores data as a collection of key-value pairs, where each key is unique. Ideal for fast lookups and simple data structures.
Example: Key-value stores (e.g., Redis, DynamoDB).
Column-Family Data Model
Organizes data into column families rather than rows, allowing for efficient storage and retrieval. Each column family can store different types of data.
Example: Columnar databases (e.g., Apache Cassandra).
Graph Data Model
Represents data as nodes and edges, capturing relationships and interactions between entities. Optimized for connected data and complex queries.
Example: Graph databases (e.g., Neo4j).
A database system provides two key languages:
A Data Manipulation Language (DML) enables users to access or manipulate data organized by the appropriate data model. DMLs allow for the following types of access:
There are two types of DMLs:
SELECT COUNT(*) FROM artist
to count the number of records in a table.A Data-Definition Language (DDL) is used to define the structure of the database. It specifies the database schema through a set of definitions. In addition to specifying the schema, DDL also allows users to define various properties of the data, such as:
DDL commands, such as
CREATE
,ALTER
, andDROP
, are used to define and modify the structure of the database.
The database design process begins with a high-level data model that provides a conceptual framework for specifying user data requirements and structuring the database to meet those needs.
The Entity-Relationship (E-R) data model is a framework for describing the logical structure of a database using basic objects known as entities and the relationships among them.
dept_name
, building
, and budget
.ID
, name
, and salary
. The attribute ID
uniquely identifies each instructor.The overall logical structure (schema) of a database can be visually represented using an E-R diagram. In E-R diagrams:
The E-R model includes certain constraints that the database must adhere to. One important constraint is mapping cardinalities, which specify the number of entities that can be associated with another entity through a relationship set. For example, if each instructor can be associated with only one department, this constraint can be represented in the E-R model.
Normalization is a method for designing a relational database aimed at generating a set of relation schemas that minimize redundancy while facilitating easy information retrieval. The process focuses on creating schemas that conform to appropriate normal forms.
To determine whether a relation schema is in a desirable normal form, additional information about the real-world enterprise being modeled is required. The most common approach to provide this information is through functional dependencies.
Bad database designs may exhibit undesirable properties, including:
Normalization helps address these issues, ensuring that the database structure effectively supports the necessary operations and relationships.
A database system is divided into modules that manage different responsibilities. The main components of a database system include the Storage Manager and the Query Processor.
The Storage Manager is crucial as databases require significant storage space, ranging from hundreds of gigabytes to terabytes. Data is stored on disks, and movement between disk storage and main memory must be minimized due to the slower speed of disk operations compared to the CPU.
The Query Processor simplifies and facilitates data access for users, enabling them to work at a logical level without dealing with physical implementation details.
Database architecture encompasses the various components of a database system and their interconnections. It is influenced by the underlying computer system and can be categorized into different types:
Database applications can be organized into two or three tiers:
This architecture is more suitable for larger applications where business logic can be centralized in the application server.
There are four different types of database-system users, differentiated by how they interact with the system:
Naïve users: Users who interact with the system through predefined programs or interfaces.
Application programmers: Computer professionals who write application programs that interact with the database.
Sophisticated users: Users who interact with the system without writing programs. They use database query languages or tools like data analysis software to form their requests.
Specialized users: Sophisticated users who develop specialized database applications that fall outside the traditional data-processing framework.
One of the key reasons for using a DBMS is to centralize control over both the data and the programs that access it. The individual responsible for this central control is called the Database Administrator (DBA). The functions of a DBA include:
Schema definition: The DBA creates the original database schema by executing a set of data definition statements in the DDL (Data Definition Language).
Storage structure and access-method definition: The DBA defines how the data will be stored and how it can be accessed efficiently.
The relational model defines a database abstraction based on relations to avoid maintenance overhead.
Structure:
Integrity:
Manipulation:
Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes in one or more relations as inputs, and outputs a new relation. To write queries we can ”chain” these operators together to create more complex operations.
This document provides an overview of key operations in relational algebra, which is used for querying and manipulating relational databases.
Selection is an operation that takes a relation and outputs a subset of tuples that satisfy a specified selection predicate.
Example:
If we have a relation Students
and we want to select students older than 20:
σ_Age > 20(Students)
Projection is an operation that takes a relation and outputs a new relation containing only specified attributes.
Example:
To project only the names and majors of students:
π_Name, Major(Students)
Union combines the tuples from two relations and removes duplicates. Both relations must have the same attributes.
Example:
If UndergraduateStudents
and GraduateStudents
have the same structure, the operation would be:
UndergraduateStudents ∪ GraduateStudents
Intersection returns the tuples that are present in both relations.
Example:
To find students enrolled in both the CS and Math courses:
Students_CS ∩ Students_Math
Difference returns the tuples that are in the first relation but not in the second.
Example:
To find students who are not enrolled in any Math courses:
Students − Students_Math
Product (Cartesian product) combines all tuples from two relations, producing a new relation that contains every possible pair of tuples.
Example:
If we have relations Students
and Courses
, the operation would be:
Students × Courses
Join combines tuples from two relations based on a related attribute. There are several types of joins:
Returns only the tuples that have matching values in both relations.
Example:
To join Students
and Enrollments
on the StudentID
:
Students ⨝ Enrollments
Returns all tuples from the left relation and matching tuples from the right relation. If there is no match, null values are returned.
Returns all tuples from the right relation and matching tuples from the left relation. If there is no match, null values are returned.
Returns all tuples when there is a match in either relation. Null values are filled in when there is no match.
Example:
To perform a left outer join between Students
and Enrollments
:
Students ⨝ Enrollments (Left Outer)
Relational Data Model: The relational data model is based on a collection of tables. Users of the database system can query these tables, insert new tuples, delete tuples, and update (modify) tuples using various languages for expressing these operations.
Schema vs. Instance: The schema of a relation refers to its logical design, while an instance of the relation refers to its contents at a specific point in time. The schema includes attributes, types of attributes, and constraints such as primary and foreign key constraints.
Schema Diagram: A schema diagram is a pictorial representation of the database schema, showing the relations, their attributes, primary keys, and foreign keys.
Relational Query Languages: Relational query languages define a set of operations that operate on tables and produce tables as results. These operations can be combined to express desired queries.