Setting up the Database Layer

In the previous post, I set up the basic dependencies for Literature Vault. In this post, I want to do the same thing for the database.

The goal is not to build the full storage layer yet. I am not going to parse PDFs, extract metadata, create a full-text index, or store embeddings in this post. For now, I just want to make sure the app can create a local SQLite database, write a paper record to it, and read that record back through the same kind of clean application boundary I used for the CLI and MCP examples.

I already decided in the project overview that Literature Vault should use SQLite. Python has built-in support for SQLite; it has a standard library module called sqlite3. That is enough for a simple database, and it is the right choice for a local application that does not need to scale to multiple users or servers.

sqlite3 requires no extra installation, but it is also a very low-level interface. We would have to write SQL statements by hand, and we would have to manage the database connection and transactions ourselves. I'm fine with that!

Creating the Database Module

The first thing I want is a small database module. Not the whole storage layer, just enough structure so the rest of the app does not have to know where the database lives or how the tables are created.

Create a new file called src/literature_vault/db.py:

from pathlib import Path
import sqlite3


def connect(db_path: Path) -> sqlite3.Connection:
    connection = sqlite3.connect(db_path)
    connection.row_factory = sqlite3.Row
    return connection


def init_db(connection: sqlite3.Connection) -> None:
    connection.execute(
        """
        CREATE TABLE IF NOT EXISTS papers (
            id INTEGER PRIMARY KEY,
            fingerprint TEXT NOT NULL UNIQUE,
            path TEXT NOT NULL,
            title TEXT,
            year INTEGER,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    connection.commit()

Let's walk through this code.

The import statements are straightforward. I need Path to work with file paths, and I need sqlite3 to interact with the database.
The connect() function takes a Path to the database file and returns a sqlite3.Connection object. This function is responsible for opening the connection to the database. It also sets the row_factory to sqlite3.Row, which allows us to access columns by name instead of by index (see the discussion below).
The init_db() function takes a sqlite3.Connection and creates the necessary tables if they do not already exist. In this case, it creates a papers table with the specified columns.

There are two small choices here that are worth pointing out.

First, connect() takes a Path instead of hard-coding a database location. This keeps the function easy to test, because tests can create a temporary database file and throw it away afterward. Later, the CLI can decide where the real database should live.

Second, I set connection.row_factory = sqlite3.Row. By default, SQLite rows come back as tuples, so you have to access values by position. With sqlite3.Row, the rows still behave like lightweight row objects, but I can access columns by name: so you can write row["title"] instead of row[3].

The string passed to execute() in init_db() is a SQL statement. It creates a table called papers with the following columns:

fingerprint, a unique identifier for the paper
path, the location of the PDF on disk
title, if I know it
year, if I know it
created_at, when the row was created

The fingerprint is the important part. Eventually, this will be a hash of the PDF contents. In this post, I am going to pass it in manually so I can focus on the database layer. The UNIQUE constraint means SQLite will not allow the same fingerprint to be inserted twice, which is exactly what I want for idempotent ingestion later.

Before adding more application code, I want to test this module. We already have a test setup from the previous post, so the next step is to add a database test file.

Create src/tests/test_db.py:

from pathlib import Path
from literature_vault.db import connect, init_db


def test_init_db_creates_papers_table(tmp_path: Path) -> None:
    db_path = tmp_path / "vault.db"
    connection = connect(db_path)

    try:
        init_db(connection)

        row = connection.execute(
            """
            SELECT name
            FROM sqlite_master
            WHERE type = 'table' AND name = 'papers'
            """
        ).fetchone()

        assert row is not None
        assert row["name"] == "papers"
    finally:
        connection.close()


def test_connect_returns_rows_with_named_columns(tmp_path: Path) -> None:
    db_path = tmp_path / "vault.db"
    connection = connect(db_path)

    try:
        row = connection.execute("SELECT 'hello' AS message").fetchone()

        assert row["message"] == "hello"
    finally:
        connection.close()

These tests do not insert papers yet. They only check the behavior of the database module as it exists right now:

connect() opens a SQLite database and returns rows that can be accessed by column name
init_db() creates the papers table

Run the tests:

uv run pytest

The existing tests should still pass, and now there should be two new database tests as well.

This is a small step, but it gives me a useful boundary: the database module can create the first table, and the tests can verify that without touching the CLI or MCP layer yet.

I will stop here for this post. In the next one, I will add the first actual database operation: inserting a paper record and reading it back.