The Problem We Had to Solve
I was leading a migration off a legacy Oracle reporting system—one that had quietly grown for over two decades. The infrastructure was beyond support: the hardware, the OS, and the reporting tools were all aging out. But the real risk wasn’t just technical—it was operational. This system was still powering production decisions. We couldn’t afford to break it.
We chose Snowflake as the modern target. But before we could move anything, we had to understand what we were moving. That meant dissecting hundreds of Oracle views—some built years ago by engineers long gone, many undocumented, and most more complex than expected.
The first proof-of-concept view looked simple on the surface. But under the hood, it had 521 columns, 117 joins, and a dependency chain five layers deep. It referenced other views, which called even more views, some of which relied on Oracle packages that injected logic at runtime. Much of that behavior—especially the parameterized configuration—simply wouldn’t translate to Snowflake.
And that was just one view. There were 1,127 tagged as production-critical, and over 5,000 more behind them.
We couldn’t migrate what we couldn’t explain. And we couldn’t explain what we couldn’t see.
It became obvious that manually tracing dependencies wasn’t just slow—it was impossible. We needed visibility. Not just for ourselves, but for QA, governance, and business stakeholders who needed to understand the impact of every change.
So I built a tool to surface the truth buried inside those views.
What Was Built—and Why It Works
I wrote a Python script that connects to Oracle, pulls the definition of a view, and traces every object it depends on. It parses the SQL, walks through view-on-view chains, and extracts every table, intermediate view, and user-defined function along the way. If something is referenced, it’s surfaced. And if it’s nested five layers deep, it still gets pulled.
The script builds a clean lineage map—one that’s readable, versionable, and testable. It saves each view’s cleaned-up SQL. It outputs an ASCII tree that shows how everything connects. And it generates a summary file we can drop into a Git repo or send to stakeholders for review.
This wasn’t just about speed—it was about certainty.
Instead of guessing which tables were involved or hoping nothing would break downstream, we had concrete answers. We knew what we were moving and what depended on it. When something failed QA, we didn’t have to start from scratch—we already had the map.
For engineers, it replaced hours of manual tracing with structured output. For analysts, it explained why their reports behaved the way they did. And for leadership, it reduced the risk of surprises during migration.
That’s why it worked.
How It Works
This is a command-line Python script designed to trace the full lineage of an Oracle view—no GUI, no manual inspection. Just run it with the right credentials and the fully qualified name of the view you want to analyze.
python oracrawl.py <host> <port> <service> <user> <password> <schema.view>
Once connected, the script does three things:
- Pulls the view’s SQL from Oracle system tables.
- Parses the SQL using sqlglot to identify all referenced tables, views, and user-defined functions.
- Recursively walks through each dependency until it hits the base tables.
It writes everything to a dedicated output folder:
- A summary.txt listing all views, tables, and UDFs involved.
- A tree.txt file that shows the view lineage as an ASCII diagram.
- One .sql file per view with the cleaned-up SQL definition.
Here’s an illustrative sketch of what the dependency tree can look like (the object names are hypothetical, drawn from the description below, not actual output):
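└── APPS.ORDER_SUMMARY_V
    ├── APPS.INVENTORY_ITEMS
    ├── APPS.HZ_PARTIES
    ├── APPS.FN_GET_ORDER_STATUS   (UDF)
    └── APPS.ORDER_LINES_V
        ├── APPS.FND_GLOBAL        (package)
        └── APPS.INVENTORY_ITEMS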
The diagram shows how deeply nested logic connects back to tables (INVENTORY_ITEMS, HZ_PARTIES), other views (*_V), user-defined functions (FN_GET_ORDER_STATUS), and Oracle packages (FND_GLOBAL). Each object is followed and mapped recursively.
The output is structured, human-readable, and Git-friendly—ready for validation, review, or version control.
If you’re migrating to a new platform or just trying to understand what a legacy view actually depends on, this gives you immediate visibility without opening a single database IDE.
Design Strategy: What Each Part Brings to the Table
This script wasn’t built as a perfect abstraction—it was built to solve a real migration problem under pressure. Each function emerged from a blocker, and each one plays a focused role in turning legacy SQL chaos into something structured and clear. Here’s a quick walk-through of what each component contributes.
__main__
This is the bootstrap. It takes command-line arguments, connects to Oracle, and kicks off the crawl. It includes error handling for common Oracle connection issues so the script fails clearly and early.
connect_thin()
Keeps the setup lightweight. By using Oracle’s thin client, we avoid the friction of environment setup and can connect from anywhere with minimal config.
get_object_type_and_text()
Determines whether the input is a view (which we can crawl) or a table (which we treat as a leaf). If it’s a view, it pulls the full SQL text so we can analyze it.
clean_sql()
Cleans up the SQL so the parser doesn’t choke. Oracle view definitions often contain legacy formatting, multi-line comments, and other noise. This function strips it down to just the logic.
extract_objects_from_sql()
This is the parser’s core. It walks the SQL syntax tree and identifies which tables and functions the view touches. This gives us a structured list of what to crawl next.
classify_function()
Filters out Oracle built-ins so we can focus only on UDFs—those are the ones that need attention during a migration.
strip_alias() and fully_qualify()
These utilities clean up names. Oracle SQL often references objects with aliases or omits schemas entirely. These functions normalize object names so recursion works reliably.
crawl_object() and _crawl()
This is the recursive engine. It tracks what we’ve already seen, prevents loops, and walks through every dependency layer until we reach base tables. It’s cautious, state-aware, and critical to the whole process.
write_summary(), write_view_ddls(), render_tree(), write_tree()
These output functions take everything we’ve collected and turn it into artifacts: clean SQL files, dependency trees, and a summary of all referenced objects. It’s the handoff layer—for engineers, analysts, and anyone who needs to understand what’s really going on.
This wasn’t designed for elegance. It was designed for clarity under fire. Every function exists because something failed or something needed to be explained. The result is a small, sharp tool that gets the job done.
Behind the Build: Priorities & Design
__main__
This is the script’s control center—the part that runs when the script is executed from the command line. It takes six required arguments (host, port, service, user, password, and schema.object), establishes a connection to Oracle, initiates the recursive dependency crawl, and then writes all outputs.
It’s intentionally procedural and easy to trace—built for clarity, not abstraction. Every line plays a role in making this tool usable by engineers who just need answers fast.
if __name__ == "__main__":
    if len(sys.argv) != 7:
        print("Usage:")
        print("  python oracrawl.py <host> <port> <service> <user> <password> <schema.object>")
        sys.exit(1)

    host, port, service, user, password, fq_object = sys.argv[1:]

    try:
        conn = connect_thin(host, port, service, user, password)
    except oracledb.Error as e:
        logging.error("Failed to connect to Oracle: %s", e)
        if "ORA-12520" in str(e):
            logging.error("Tip: ORA-12520 often means your service name is incorrect, misspelled, or missing a handler.")
            logging.error("Docs: https://docs.oracle.com/error-help/db/ora-12520/")
        sys.exit(2)

    crawl_object(conn, fq_object)

    output_dir = os.path.join(OUTPUT_BASE_DIR, fq_object)
    os.makedirs(output_dir, exist_ok=True)
    write_view_ddls(output_dir)
    write_summary(output_dir)
    write_tree(output_dir, fq_object)

    logging.info("Dependency crawl complete. Output in: %s", output_dir)
    sys.exit(0)
This block ensures:
- Arguments are validated before anything risky runs
- Oracle errors are surfaced early and clearly
- All outputs go into a structured folder for version control or review
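One note for engineers reading along: these excerpts all come from a single script, so they lean on module-level imports and shared state that never appear inline. Here’s a minimal sketch of what the module scope is assumed to look like; the names match the excerpts, but the exact definitions in the repo may differ. (SQL_KEYWORDS and ORACLE_BUILTINS are sketched later, under classify_function().)

import logging
import os
import re
import sys
from collections import defaultdict

import oracledb
import sqlglot
from sqlglot.expressions import Func, Table

logging.basicConfig(level=logging.INFO)

OUTPUT_BASE_DIR = "output"  # root folder for per-view artifacts

# Shared crawl state, filled in as the recursion walks the dependency graph.
SEEN_OBJECTS = set()         # fully qualified names already visited
found_views = set()
found_tables = set()
found_functions = {}         # function name -> "UDF"
view_ddl_map = {}            # fully qualified view name -> cleaned SQL
dependency_graph = defaultdict(list)  # parent object -> [child objects]

def sanitize_filename(name):
    # Conservative guess at the repo's helper: keep only filename-safe characters.
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)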
connect_thin()
This function creates a connection to an Oracle database using the thin driver. It avoids the need for a full Oracle client installation, which makes the script easier to run in CI/CD pipelines or on lightweight developer machines.
It’s small by design—minimal dependencies, minimal assumptions, and fast to fail if something goes wrong.
def connect_thin(host, port, service, user, password):
    dsn = f"{host}:{port}/{service}"
    return oracledb.connect(user=user, password=password, dsn=dsn)
Why this matters:
- No client install required: The thin driver connects via TCP without needing sqlplus, TNS files, or heavy Oracle tooling.
- DSN-style connection string: The host:port/service format aligns with how developers often access Oracle in containerized or remote setups.
- Enables early failure: If credentials or connectivity are broken, the script exits early—before any crawling begins.
This function ensures every run starts from a clean, testable connection. If it doesn’t connect, it doesn’t guess.
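A quick smoke test from a Python shell might look like this (host and credentials are placeholders):

conn = connect_thin("dbhost.example.com", "1521", "ORCLPDB1", "readonly_user", "s3cret")
print(conn.version)  # e.g. "19.3.0.0.0" confirms the thin driver reached the database
conn.close()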
get_object_type_and_text()
This function inspects an Oracle object to determine whether it’s a VIEW or a TABLE, and if it’s a view, it retrieves the full SQL definition. This is the branching point in the crawl—tables are leaf nodes; views are where the recursion continues.
def get_object_type_and_text(conn, owner, object_name):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT object_type
            FROM all_objects
            WHERE object_name = :obj AND owner = :owner
            """,
            {"obj": object_name.upper(), "owner": owner.upper()},
        )
        row = cur.fetchone()
        if not row:
            return None, None

        obj_type = row[0]
        if obj_type == "VIEW":
            cur.execute(
                """
                SELECT text
                FROM all_views
                WHERE view_name = :obj AND owner = :owner
                """,
                {"obj": object_name.upper(), "owner": owner.upper()},
            )
            rows = cur.fetchall()
            ddl = clean_sql("\n".join(r[0] for r in rows))
            view_ddl_map[f"{owner}.{object_name}"] = ddl
            return "VIEW", ddl

    return "TABLE", None
Why this matters:
- Detects the crawl path: Tables terminate recursion. Views continue it.
- Grabs the SQL we need: For views, this is where the recursive engine gets the raw material to parse.
- Handles edge cases: If the object doesn’t exist or isn’t found, the function exits gracefully—avoiding false assumptions during a crawl.
It also updates a shared view_ddl_map so that later output functions can write the raw view SQL for review or versioning. It’s a critical link between Oracle’s metadata layer and the parser logic.
clean_sql()
This function prepares a raw SQL string for parsing by removing Oracle-style comments and compressing the formatting. Oracle views can contain multi-line comments, inline -- comments, and inconsistent whitespace that break parsers like sqlglot.
def clean_sql(sql):
    sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.DOTALL)
    sql = re.sub(r"--.*?$", "", sql, flags=re.MULTILINE)
    return sql.strip()
Why this matters:
- Multi-line comment removal: Strips /* ... */ blocks which often contain metadata or explanations irrelevant to parsing.
- Single-line comment cleanup: Gets rid of -- comment lines, which are common in generated or hand-written views.
- Whitespace normalization: Ensures leading/trailing spaces don’t interfere with parsing or output formatting.
This step is essential for consistent and successful parsing. Without it, downstream functions like extract_objects_from_sql() would fail on even mildly messy input—which, in legacy Oracle environments, is the norm.
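Here’s a quick before-and-after on a made-up view body:

raw = """
SELECT e.employee_id,   -- surrogate key
       e.last_name
  FROM employees e      /* fed from the legacy HR extract */
"""
print(clean_sql(raw))
# SELECT e.employee_id,
#        e.last_name
#   FROM employees e
# (comment removal leaves some trailing spaces, which the parser ignores)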
extract_objects_from_sql()
This function parses the cleaned SQL from a view and extracts all referenced tables and user-defined functions (UDFs). It uses the sqlglot library to walk the syntax tree instead of relying on brittle regex or string matching.
def extract_objects_from_sql(sql):
    sql = clean_sql(sql)
    try:
        parsed = sqlglot.parse_one(sql, dialect="oracle")
    except Exception:
        return [], {}

    tables, functions = set(), {}
    for node in parsed.walk():
        if isinstance(node, Table):
            tables.add(strip_alias(node.sql(dialect="oracle")))
        elif isinstance(node, Func):
            fname = node.sql_name().upper()
            if classify_function(fname) == "UDF":
                functions[fname] = "UDF"
    return sorted(tables), functions
Why this matters:
- Reliable parsing: Uses AST parsing from sqlglot, which understands real SQL structure—not just patterns.
- Captures both tables and UDFs: Ensures we know which objects influence the view logic.
- Filters irrelevant functions: Works in tandem with classify_function() to exclude Oracle built-ins.
This function gives us the actionable list of dependencies needed to recurse into deeper layers of the view stack—safely and predictably.
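To make that concrete, here’s a hypothetical call; the view body and object names are invented, and the exact output can vary slightly with the sqlglot version:

sql = """
SELECT O.ORDER_ID, FN_GET_ORDER_STATUS(O.ORDER_ID)
  FROM ORDERS O
  JOIN HZ_PARTIES P ON P.PARTY_ID = O.PARTY_ID
"""
tables, functions = extract_objects_from_sql(sql)
print(tables)     # roughly: ['HZ_PARTIES', 'ORDERS']
print(functions)  # roughly: {'FN_GET_ORDER_STATUS': 'UDF'}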
classify_function()
Oracle SQL includes hundreds of built-in functions—mathematical, string, date, and analytic. When crawling views, we only care about user-defined functions (UDFs) that need to be migrated or verified. This function separates the two.
def classify_function(name):
    name = name.upper()
    if (
        not name.isidentifier()
        or name in SQL_KEYWORDS
        or name in {"ANONYMOUS", "BLOCK"}
    ):
        return None
    return "BUILTIN" if name in ORACLE_BUILTINS else "UDF"
Why this matters:
- Reduces noise: We ignore TO_CHAR, NVL, SYSDATE, etc., which don’t affect migration.
- Flags logic we own: UDFs represent logic we need to inspect, rewrite, or replace.
- Prevents false positives: Only classifies valid identifiers, skipping junk like SQL blocks or inline strings.
Without this filter, every crawl would include hundreds of irrelevant functions. With it, we isolate just the logic that’s ours to manage.
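classify_function() leans on two lookup sets the excerpts never define. Here’s a minimal sketch of what they might contain; the real lists in the repo are far longer:

# Tokens the parser can surface as "functions" that aren't callables we care about.
SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "JOIN", "CASE", "WHEN", "CAST", "EXISTS"}

# A small slice of Oracle's built-in catalog; the full set runs to hundreds of entries.
ORACLE_BUILTINS = {
    "TO_CHAR", "TO_DATE", "TO_NUMBER", "NVL", "NVL2", "DECODE", "COALESCE",
    "SYSDATE", "TRUNC", "ROUND", "SUBSTR", "INSTR", "UPPER", "LOWER",
    "COUNT", "SUM", "MIN", "MAX", "AVG", "ROW_NUMBER", "RANK",
}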
strip_alias() and fully_qualify()
These two small utilities do crucial cleanup work on object names before recursion begins. Without them, ambiguity in how views reference tables or other views would cause misclassification or crawl failures.
def strip_alias(name):
    return name.split()[0]
- Purpose: Removes aliases like EMPLOYEES E or ORDERS O so we only process the base object name.
- Why it matters: Aliases aren’t valid identifiers when crawling dependencies; we need to recurse on actual object names.
def fully_qualify(name, default_schema):
    return f"{default_schema}.{name}" if "." not in name else name
- Purpose: Ensures that every object name is schema-qualified.
- Why it matters: Oracle lets users omit the schema if it’s implied, but for recursion to work reliably across schemas, names must be fully qualified.
Together, these functions normalize naming so that the crawler sees consistent, predictable identifiers—no matter how inconsistent the original SQL is.
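In practice they behave like this (illustrative values):

print(strip_alias("EMPLOYEES E"))           # EMPLOYEES
print(fully_qualify("EMPLOYEES", "HR"))     # HR.EMPLOYEES
print(fully_qualify("HR.EMPLOYEES", "HR"))  # HR.EMPLOYEES (already qualified)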
crawl_object() and _crawl()
This is the recursive engine that powers the entire dependency analysis. Once a view is identified, these two functions work together to walk its dependency graph—view by view, function by function—until it hits base tables.
def crawl_object(conn, fq_object):
    owner, object_name = fq_object.split(".")
    _crawl(conn, owner.upper(), object_name.upper(), owner.upper())
- Purpose: Parses the fully qualified view name and kicks off the recursive crawl.
def _crawl(conn, owner, object_name, default_schema):
    fq_name = f"{owner}.{object_name}"
    if fq_name in SEEN_OBJECTS:
        return
    SEEN_OBJECTS.add(fq_name)

    obj_type, ddl = get_object_type_and_text(conn, owner, object_name)
    if not obj_type:
        logging.error(f"Object not found: {fq_name}")
        sys.exit(3)

    if obj_type == "TABLE":
        found_tables.add(fq_name)
        return

    found_views.add(fq_name)
    tables, functions = extract_objects_from_sql(ddl)
    for t in tables:
        fq = fully_qualify(t, default_schema)
        dependency_graph[fq_name].append(fq)
        if fq.upper() != fq_name.upper():
            _crawl(conn, *fq.split("."), default_schema)
    found_functions.update(functions)
- Avoids loops: Uses SEEN_OBJECTS to ensure no object is processed more than once.
- Handles base cases: If it’s a table, stop. If it’s a view, parse and keep crawling.
- Builds the graph: Adds every edge to dependency_graph, which powers the final tree output.
- Captures lineage: Adds found views, tables, and UDFs to global sets for reporting.
This engine is what makes the tool work at scale. With five layers of view nesting or 117 joins in a single definition, this function ensures nothing is missed—and nothing is processed twice.
write_summary(), write_view_ddls(), render_tree(), write_tree()
Once the crawl is complete, these functions turn raw data into artifacts that are useful, shareable, and reviewable. They power the handoff from script to human.
def write_summary(output_dir):
    with open(os.path.join(output_dir, "summary.txt"), "w") as f:
        f.write("Views and UDFs:\n")
        for v in sorted(found_views):
            f.write(f"  - {v}\n")
        for func in sorted(found_functions):
            f.write(f"  - {func} (UDF)\n")
        f.write("\nTables:\n")
        for t in sorted(found_tables):
            f.write(f"  - {t}\n")
- Purpose: Human-readable list of every object involved.
- Why it matters: Great for audits, status tracking, and surfacing surprises.
def write_view_ddls(output_dir):
    for fq_name, ddl in view_ddl_map.items():
        with open(
            os.path.join(output_dir, f"{sanitize_filename(fq_name)}.sql"),
            "w",
            encoding="utf-8",
        ) as f:
            f.write(ddl)
- Purpose: Saves cleaned-up SQL for every crawled view.
- Why it matters: Supports diffing, version control, and migration prep.
def render_tree(fq_view):
    lines = []

    def _walk(node, prefix="", is_last=True):
        lines.append(f"{prefix}{'└── ' if is_last else '├── '}{node}")
        children = dependency_graph.get(node, [])
        for i, child in enumerate(children):
            is_last_child = i == len(children) - 1
            new_prefix = prefix + ("    " if is_last else "│   ")
            _walk(child, new_prefix, is_last_child)

    _walk(fq_view)
    return "\n".join(lines)
- Purpose: Builds a visual ASCII tree showing how the view depends on other views, UDFs, and tables.
- Why it matters: Turns abstract lineage into something you can actually reason about.
def write_tree(output_dir, fq_view):
    with open(os.path.join(output_dir, "tree.txt"), "w") as f:
        f.write(render_tree(fq_view))
- Purpose: Saves the rendered dependency tree to a file.
- Why it matters: Becomes a trusted artifact for engineers, analysts, and reviewers alike.
Together, these functions create a complete snapshot of the view’s lineage—explained, versioned, and easy to share.
Who This Helps and How
This tool was built by engineers, for engineers—but its impact reaches further.
It helps:
- Engineers understand tangled Oracle logic and migrate with confidence.
- Analysts trace lineage, explain logic, and validate changes.
- Leaders reduce migration risk and increase visibility into legacy systems.
What started as a one-off crawler became a cornerstone in a major modernization effort. We didn’t just save time—we restored trust in the system.
Try It, Fork It, Extend It
You can find the full script, usage instructions, and setup guide in the GitHub repo:
GitHub: andrusone/oracle-view-crawler
Install with pip install -r requirements.txt and go.
To run:
python oracrawl.py <host> <port> <service> <user> <password> <schema.view>
Share What You Build
I’d love to hear how you used this—whether it saved your team hours, helped with Snowflake migration, or inspired a new version entirely.
Open an issue, send a pull request, or just shoot a message. The work is better when it’s shared.
And if you’re solving problems like this—especially in healthcare, manufacturing, or data platforms—reach out.
That’s the kind of work I care about.
Thanks for reading. Build responsibly.