Exploring Kùzu Graph Database Management System code
Introduction
Kùzu is a Graph Database Manaagement System born
after extensive research conducted over several years at University of Waterloo.
Kùzu is highly optimized to handle complex join-heavy
analytical workloads on very large databases. It is similar to what
DuckDB is doing for SQL. It is extremely useful when you
need to model your data as a graph from different sources and store it in one
place for fast extraction in analytics. Kùzu has integration with Pytorch
Geometric, making it easy to extract graph data and feed it into your PyG models
to perform a GNN task.
This article contains my annotations from when I started exploring how Kùzu database
works. I took a ‘depth limited search’ approach exploring the code by first
going to the CLI and running a simple query. I used LLDB to debug and learn
more about the overall design of the database.
Starting from the embedded shell
Starting from the CLI tool, the purpose is to track what is happening
internally from the initialization to a match query.
Kùzu uses args library to parse the arguments.
#include "args.hxx". For instance, database path (-i parameter) can be
retrieved by:
Initialize default bufferPoolSize as -1u bit mask:
uint64_t bpSizeInBytes = -1u;.
systemMemSize: total memory in the system. This is accomplished by mutiplying
the number of pages of physical memory by the size of a page in bytes. Both
values are retrieved using sysconf from unistd.h library.
_SC_PHYS_PAGES : the number of pages of physical memory
_SC_PAGESIZE : size of a page in bytes
bufferPoolSize: defined by the system memory or UINTPTR_MAX x default
pages buffer ratio. UINTPTR_MAX is the larges value uintptr_t can hold. StorageConfig
is located at include/common/configs.h and contains the struct with many default values used by the application.
defaultPageBufferPoolSize and largePageBufferPoolSize: the bufferPoolSize
multiplied by the ratio defined for default pages and large pages.
maxNumThreads: the number of concurrent threads supported by the available
hardware. This number is only a hint and might not be accurate.
Embedded Shell
Initialize an instance of EmbddedShell (tools/shell/embedded_shell.cpp):
tools/shell/embedded_shell.cpp:
Initialize the embedded shell using the databasePath from the parameter and
also the systemConfig previously defined:
linenoise is a lightweight library for
editing line, providing useful functionalities such as single and multi line
editing mode, history handling, completion, hints as you type, among others.
It is used in Redis, MongoDB and Android. The library is embedded in the
codebase (tools/shell/linenoise.cpp). I won’t get into the details of
linenoise configuration.
database and conn are both defined in embedded_shell.h:
Line 205 and 206 define the database and get the current connection,
respectively. Before getting into connection in the next section, I’ll take a
look at the updateTableNames(), since now we are dealing with catalogue to read
the database schema.
updateTableNames()
There are two type of tables: node and relations. updateTableNames will store
the table names for both by fetching from database->catalog. In my database, I
have “person” and “animal” node tables and “hasOwner” and “knows” relations
tables:
lldb output:
Connection (src/main/connection.cpp)
Connection is used to interact with a Database instance, and each Connection is thread-safe.
Multiple connections can connect to the same Database instance in a multi-threaded environment.
The description of the API below was extracted from src/include/main/connection.h:
Creates a connection to the database.
Destructor
Manually starts a new read-only transaction in the current connection.
Manually starts a new write transaction in the current connection.
Manually commits the current transaction.
Manually rollbacks the current transaction.
Sets the maximum number of threads to use for execution in the current connection.
Returns the maximum number of threads to use for execution in the current connection.
Executes the given query and returns the result.
Prepares the given query and returns the prepared statement.
Executes the given prepared statement with args and returns the result.
Executes the given prepared statement with inputParams and returns the result.
Return all node table names in string format.
Return all rel table names in string format.
Return the node property names.
Return the relation property names.
If you wondering what is behind KUZU_API, the datatype is defined in src/include/common/types/types.h:
Starting from C++ API
I will now explore COPY command from the C++ API by using the existing
example from examples/cpp/main.cpp. To compile, you just have to add
add_subdirectory(examples/cpp) inside CMakeLists.txt and run make test
or make debug. The example will be compiled and available at
build/debug/examples/cpp or build/release/examples/cpp depending on the
make parameter used to compile.
This example created a node table named tableOfTypes (from copy-test schema)
and use the command COPY to import 50k rows from types_50k.csv file.
I will start debugging by adding a breakpoint before the COPY command:
Inside Connection::query, a mutex lock is set, a preparedStatement will be
created and executed through executeAndAutoCommitIfNecessaryNoLock.
A prepared statement is a parameterized query used to avoid repeated execution
of the same query. prepareNoLock will go through the following steps:
parsing, binding, planning and optmizing and then return a PreparedStatement
object to Connection::query.