Learning DB Performance/Behavior

So despite a SQL database being something I with day-to-day at my job, i've come to realize its a huge black box for me. I was wondering if folks on here have any resources for learning about the internals of such dbs in a similar way to understanding C code in terms of what instructions it breaks down to (and how costly those instructios are), how it access memory and penalties, etc. Are there even simpler parts you can break things down to with SQL dbs? My ignorance of this subject is pretty big, I mean I can guess that there's disk access going on in response to running a command and that it does something a little smarter than just scan the whole disk for every last place your data might be, but after that...?
If you want actual source code, then SQLite is best learning material for this.
It also has a good documentation available on architecture and internals: https://www.sqlite.org/arch.html
Take a look there, you'll see various technologies/algorithms that are pretty much the same also in other SQL databases. And check out other topics:
https://www.sqlite.org/atomiccommit.html
https://www.sqlite.org/malloc.html
https://www.sqlite.org/isolation.html
https://www.sqlite.org/optoverview.html
https://www.sqlite.org/queryplanner-ng.html
https://www.sqlite.org/opcode.html

Edited by Mārtiņš Možeiko on
At it's core DBs work with the constraint that not all data can fit in ram so it has to page in and out the data it needs from disk.

I'm not too familiar with DB code myself but My gut tells me that the primary optimization target is to minimize the amount of data that needs to be read into ram (at least until you can install a massive SSD RAID) or written out back to disk. They will also be working with fixed sized file blocks rather than simple flat files because then they don't have to rely on the OS file IO caching doing the right thing and not eating up too much ram.

Minimizing reads can be done by sorting the table or adding index tables so the DB can find where the data is stored based on the constraints in the WHERE clause in sub linear time (you don't want things to scale linearly for the big tables when they don't have to).

Those index tables will likely be K-ary trees with each node taking up an entire block in the hard drive.

Keeping the variable length data and the fixed length data separate lets the fixed length be easily traversable (and then you can use a offset to the variable length data in the fixed length struct).

Making changes to a sorted list without doing a O(n) update each time can be done by soft-deleting records and keeping a second smaller array for the insertions. Then during daily/weekly maintenance the full table can be traversed, compacting out the deleted records and merging in the smaller array.

@mmozeiko Thanks! These look pretty awesome

@ratchetfreak Yeah, moving/transforming a small amount of data is always better than more, I imagine that holds true at all levels of hardware. What caused me to realize that I didn't have the mental tools to understand querying a db was my coworker explaining how a query he had written got "optimized" by having a certain select reversed. I'd imagine this could be something similar to reading from a linear array backwards vs forwards, thanks to a little game dev experience I know about cache behavior at the cpu end but I have no clue as to what the reasoning is for a db.
Floresy


@ratchetfreak Yeah, moving/transforming a small amount of data is always better than more, I imagine that holds true at all levels of hardware. What caused me to realize that I didn't have the mental tools to understand querying a db was my coworker explaining how a query he had written got "optimized" by having a certain select reversed. I'd imagine this could be something similar to reading from a linear array backwards vs forwards, thanks to a little game dev experience I know about cache behavior at the cpu end but I have no clue as to what the reasoning is for a db.


If he's talking about "SELECT A B" vs. "SELECT B A" then that shouldn't matter. It may perhaps change some cpu cache behavior but I doubt it's significant when taking into account the surrounding processing.

The only way that could matter significantly is if the query interpreter uses the order of SELECT fields in some way for the execution and frankly I'd say that's a bug.

Not including some columns of a table can have significant impact (when a table is split up internally to separate variable length fields from fixed length fields and you don't SELECT any varchar column.)
I was thinking is it possible to use sqlite asynchronously (inside game loop).
Yes, it is possible. You can use sqlite asynchronously by putting query execution on separate thread. Just make game loop to issue queries on thread, and after they finish post back the result. Very similar how HandmadeHero did async asset loading.

Edited by Mārtiņš Možeiko on
Is it possible without multi threading. Because I haven't done multi threading before.
No, sqlite library doesn't provide async API.
The database course professor at Rice University highly recommends this book:
https://www.amazon.com/Database-S...s-Complete-Book-2nd/dp/0131873253

Also the PostgreSQL source code looks really clean and well organized, so it might be a good place to poke around.
Hi, I'am working on a app using sqlite and was wondering how would you protect the app from Sql Injection. Is there a specific way of coding sqlite api to prevent sql injection.

Edited by Shazan Shums on
msmshazan
Hi, I'am working on a app using sqlite and was wondering how would you protect the app from Sql Injection.Is there a specific way of coding sqlite api to prevent sql injection.


Don't build sql query strings at runtime. Instead use parametered queries. That way there is no way for the input to change what the query does.
What do you mean by parameterized queries can you provide an example.
1
2
3
4
5
6
7
8
sqlite3_stmt* statement;
sqlite3_prepare_v2(db, "select name from accounts where id = ?1", -1, &statement, null);
sqlite3_bind_text(statement, 1, id_from_input, free);

sqlite3_step(statement);
//grab data out of statement

sqlite3_finalize(statement);


the "?1" in the query string is the parameter with "bind" I set the value of it to "id_from_input" which is then used as if the query string was build as sprintf(buff, "select name from accounts where id = \"%s\"", id_from_input); but without any chance that a quote character inside id_from_input could change the meaning of the query.
I seem to have a problem making Parameterized query.

Here's the function I implemented.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
int CreateTable(sqlite3* Database,const char* Tablename){
    const char* Sql ="CREATE TABLE ?1 ( bar TEXT );";
    sqlite3_stmt *CreateStatement;
    const char *Errormessage;
    int Result= sqlite3_prepare(Database, Sql,-1,&CreateStatement,NULL);
    if(SQLITE_OK != Result) {
        // TODO: Correct error handling
        Errormessage= sqlite3_errstr(Result);
        fprintf(stderr, "Error creating table (%i): %s\n", Result,Errormessage);
        
    }else{
        Result= sqlite3_bind_text(CreateStatement, 1, Tablename, -1,SQLITE_TRANSIENT);
        if(Result != SQLITE_OK){
            // TODO: Correct error handling
            Errormessage= sqlite3_errstr(Result);
            fprintf(stderr, "Error Binding text (%i): %s\n", Result,Errormessage);
          
        }else{
            Result=  sqlite3_step(CreateStatement);
            if(SQLITE_OK != Result) {
                // TODO: Correct error handling
                Errormessage= sqlite3_errstr(Result);
                fprintf(stderr, "Error stepping sql (%i): %s\n", Result,Errormessage);
            }else{
            }
        }
        }
    sqlite3_finalize(CreateStatement);
    
    return Result;
}


But when i run it it say's sql logic error.