Topics Discussed:
- What is Database
- Evolution of data in software industry
- Types of Database – RDBMS vs NoSQL
- Choosing the right type of database – ACID Compliance vs Eventuak consistency
- Data partitioning : Vertical partitioning vs Horizontal
- SQL : What is it?
- Select/Update/Delete
- Joins
- Advanced concepts : Procedure or function
- Indexing the data
- Designing an efficient and normalized scheme
- ORM or native libraries
Software’s Primary objective is to act on data
In Memory Data – The process will be processed in in memory. When you close its lost
Persistent Data – Here once all the process is completed the data can be saved. The data is persistent. Some naive ways of persisting data,
1. Serialization
2. File storage
Problems of the above 2:- restoring is slow, searching through is slow. the solution to the above problem is a database.
Evolution of data in softwares
○ Traditionally, banks/telecom vendors were the biggest number of customers.
○ Country specific deployments
○ Few millions of rows of data was considered big data.
○ What changed then?
■ Social media across the globe.
■ Machine learning.
○ What happened with the growth of data? Previously thought solutions became obsolete or not enough.
Evolution of databases
○ RDBMS- solved all the problems of early softwares.
■ Relational database
■ Normalized for optimized storage
■ A predefined schema.
■ Collection of columns is a table.
■ Collection of tables and other objects are a schema.
■ Examples:- Oracle, MySQL, Postgres and few more.
■ Vertical scaling.- . When data increases we can add ram, storage, processing etc. We can make it more powerful But it’s costly. And there will be a fixed limit.
■ Costly.
■ Functions, Procedures, triggers support.
○ NoSQL databases
■ Big amounts of data. – analysis, social media etc.
■ Unstructured data:- example a social media post, its comments, likes etc. Survey responses, Documents like pdf/text contents
■ Ambiguous, difficult to parse.
■ Various types:- Each of these solves one type of problem. Examples:-
● Key-value store – dynamo db, Redis, Memcache
● Document database – mongo db. Each document has its own structure.
● Graph database:- Neo4J. Social graph, vertices and edges.
● Wide column databases:- Cassandra, BigTable etc.
■ Horizontal scaling.
■ Cheap
Which database is right for you?
Here confusion arises. Which Database is best for your software? No perfect yes/no answer available for this question. You can keep in mind these points and can choose the right one.
○ Concepts:-
■ ACID compliance- Atomicity, Consistency, Isolation and Durability.
● Atomicity means a transaction either completes or fails in entirety. There is no state in between. Nobody sees a partial completion of a transaction.
● Consistency means the transaction leaves the database in the valid state.
● Isolation means no two transactions mingle or interfere with each other. The result of two transactions executed in parallel would be same as sequential execution.
● Durability means the changes of the transaction are saved. It remains there even if power is turned off.
■ Consistency (strict versus eventual).
● Strong – Instance Update
E.g – bank transitions
● Eventual – takes time to reflect
E.g – Social media posting.
Consistency – SQL Strong, NoSQL- Eventual/ Strong
○ Considerations:-
■ Scale
■ structure of your data
■ Type of business
■ Consistency requirement
■ ACIDicity of the application
○ Often, this means a big software system might use multiple data sources for different purposes.
● Data partitioning– Dividing data across databases and tables.
● Vertical partitioning:– ‘not vertical scaling’. Normalization. Splitting data into tables based on different features etc. Mostly application logic. To send different data to different databases. Like store all images in db1 store all videos in db2 etc.
● Horizontal partitioning:- splitting rows into different tables based on a sharing logic. Data sharing. Phone number based sharing, date based sharing etc. Hash based sharding etc. In many nosql databases, sharding is a first-class concept, and the database knows how to store and retrieve data within a cluster. Almost all modern databases are natively sharded. Cassandra, HBase, HDFS, and MongoDB are popular distributed databases
■ Cross partition queries are costly. So you need to be very careful here.
■ Uneven distribution of data is very much a possibility.
● How to talk to databases
○ SQL for RDBMS databases.
■ Types
1. DDL- data definition language- create, alter, drop table/schema objects
2. DML- data manipulation language- insert/update/delete/select
3. Transaction Control Statements – commit/rollback etc.
■ Joins select a.x, b.y from a inner join b on a.z=b.z1
1. Inner join (intersection)- both tables should have data
2. Left join – all of left+matching right
3. Right join
4. Full join – union
5. Self join
Good sql vs bad sql
1. Analyze function
2. Have enough indexes
3. Deadlocks from procedures/functions.
1. T1 update table X where I’d=1; T2 update table Y where I’d=1
2. T1 update table Y where I’d=1; T2 update table X
3. Database auto recovers by aborting one transaction.
1. Keep Short transaction
2. Access order make similar in both transactions
3. Try catch
ORM versus Native- How does a programming language talk to database?
1. JDBC- driver for every type of databases. Create connection and send native SQL. Maintain connection and other stuff on your own.
2. ORM- hibernate- maps tables to java objects and creates a layer. Programmer deals with objects and the framework converts to Query language and deals with it. Easy to use. Difficult to debug in complex cases. Complex sqls are difficult.
no sql databases
1. Each database has its own syntax.
2. Some are sql like
3. Some are just get(key). Put(key) format
4. Use a driver, create connection. Insert collection, get etc. MongoDb.
DevOps and databases
○ Monitoring. Logs, slowness, deadlock detection
○ Design decisions, which database to choose. how manageable are they
○ Sharding and scalability
○ DBA is getting obsolete- self driving database by Oracle
Micro services and databases
■ Multiple database/ One database multiple schema/ Never allow shared tables.
■ Zero down time upgrades.
■ Database code reviews.
■ Upgrade/ Roll back, or backward compatible approach.
■ Move away from the culture of procedures and functions.