Introduction

Bogleheads are a large group of investors who follow the general investment principles laid out by Jack Bogle, the founder of Vanguard. These include:

Develop a Workable Plan
Invest Early and Often
Maintain Proper Asset Allocation
Diversify Broadly
Use Low-Cost Index Funds
Minimize Costs
Minimize Taxes
Invest with Simplicity
Never Try to Time the Market
Stay the Course

In addition to a very active and informative forum (https://www.bogleheads.org/) and other resources, the non-profit John C. Bogle Center for Financial Literacy maintains a wiki to serve as a reference resource and knowledge base for both forum members and the general public.

The wiki provides in-depth, detailed, and well-sourced information. However, it can be hard to understand, and the information relevant to a specific question may be spread across multiple wiki entries. To improve the wiki’s value and ease of use, one of the forum members proposed creating a chatbot based on the information in the wiki. I volunteered to help and, as a proof of concept and demonstration, implemented a RAG-based (retrieval-augmented generation) chatbot on Google’s NotebookLM platform. Because the chatbot uses the Bogleheads wiki, the notebook is called BHBot. This blog post describes the process of building the chatbot and the results.

Goal

The goal of the project was to develop a reliable, accurate RAG-based chatbot that uses the Bogleheads wiki as its reference material. The bot should not hallucinate, should not provide personal financial advice (though it may direct the user to other reference sources, possibly including the Bogleheads forum), and should cite the specific sections of the wiki that it used to formulate its answers. If possible, it would include the actual URLs for the cited wiki pages.

About NotebookLM

While Google has not documented the internal workings of NotebookLM, it appears to be a RAG-based platform. For each notebook that you build, you provide a set of documents that NotebookLM will draw from to formulate its answers when you later ask questions. Your documents can be text based, such as PDFs, Google Docs, presentations, audio or video transcripts, etc. This information is broken down into smaller chunks, and each chunk is indexed semantically (by meaning) by mapping it to a high-dimensional vector that captures its meaning. Then, when you ask a question, NotebookLM can use both keyword and semantic search to find the relevant chunks of your reference material. The answer to your question is primarily drawn from these chunks.
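The retrieval step described above can be illustrated with a toy sketch. Because NotebookLM’s internals are undocumented, nothing here reflects its actual implementation: the chunk size, the word-count "embedding", and the function names (chunk_text, embed, retrieve) are all assumptions chosen for illustration. A real system would use a learned embedding model rather than word counts.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into chunks of at most max_words words (hypothetical size)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding: a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Three made-up chunks standing in for indexed wiki text
docs = [
    "Index funds have low expense ratios and low turnover.",
    "A Roth IRA is funded with after-tax contributions.",
    "Rebalancing restores a portfolio to its target asset allocation.",
]
best = retrieve("What is a Roth IRA?", docs, top_k=1)
print(best[0])
```

The retrieved chunks would then be passed to the language model together with the question, which is why the answer is drawn primarily from the reference material.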

NotebookLM is very easy to set up, is totally free, and specific notebooks can be made publicly available. This made it a good choice for a proof-of-concept demonstration (as well as for personal projects). It is significantly less work than building your own RAG platform.

It does have several downsides, some of which make it a non-starter for a production system. Users must have a Gmail address. It also has very limited customization, and the free tier is limited to 50 queries per day in total across all users of that notebook. You are also limited in how many resources can be in a notebook. For this application, this meant that each wiki page could not be uploaded as a separate resource; the pages had to be consolidated. At the same time, when I tried to upload the entire wiki as a single resource, NotebookLM would not index it, so I ended up breaking it into four sections.

File Preparation

First, I needed a current, or at least relatively recent, dump of the complete text content of the wiki. I tried to use wikiteam3dumpgenerator, which will pull the contents from a MediaWiki wiki. However, the Bogleheads wiki website seems to reject its requests, so that was not an option. It might be possible if the owners of the wiki provided permission along with an ID and/or password.

Another option is to use MediaWiki’s Special:Export to export the pages, after getting a list of all page titles using the MediaWiki API. One could use https://bogleheads.org/w/api.php?action=query&list=allpages&aplimit=500&format=json to get up to 500 titles at a time. You can paginate through the results using the apcontinue token in each response to get all titles, then feed that list into Special:Export. Special:Export accepts a newline-separated list of page titles in its text box and can export them all at once. To do that, one would POST the full title list to Special:Export with curonly=1 (for current revisions only) and wpDownload=1. However, that would have been quite cumbersome, even using a script to automate much of it.
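The pagination and export steps just described could be sketched roughly as follows. This is an untested outline of the approach I did not take, not a working exporter: the function names are hypothetical, the form field name pages reflects my understanding of the Special:Export form, and the site may reject scripted requests just as it did for wikiteam3dumpgenerator. The fetch function is passed in as a parameter so the paging logic can be exercised without touching the network.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://bogleheads.org/w/api.php"  # endpoint quoted in the text above


def fetch_json(params):
    """GET the API with the given query parameters and parse the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def all_titles(fetch=fetch_json):
    """Collect every page title, following apcontinue until it is absent."""
    params = {"action": "query", "list": "allpages", "aplimit": 500, "format": "json"}
    titles = []
    while True:
        data = fetch(params)
        titles.extend(p["title"] for p in data["query"]["allpages"])
        if "continue" not in data:
            return titles
        # Merge the continuation values (including apcontinue) into the next request
        params = {**params, **data["continue"]}


def export_form(titles):
    """Build the URL-encoded POST body for Special:Export (nothing is sent here)."""
    fields = {"pages": "\n".join(titles), "curonly": "1", "wpDownload": "1"}
    return urllib.parse.urlencode(fields).encode("utf-8")
```

Calling all_titles() and passing the result to export_form() would yield the body to POST to the wiki’s Special:Export page; the response would be an XML file similar in structure to the dump used below.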

A better alternative, when available, and the one I used, is to download a dump from the Internet Archive, which had a complete copy of the wiki that was less than a year old. I downloaded all of the text (non-image) files. The dump was compressed with Zstandard, so I installed Meta.Zstandard (winget install Meta.Zstandard) to uncompress it. The --long=30 option lets zstd use the large (up to 1 GiB) window that the archive requires. The command is:

zstd -d --long=30 www.bogleheads.org_w-20250503-history.xml.zst

Next, I needed to process only the current pages and convert the XML dump to Markdown using mwxml and pandoc. It’s possible that NotebookLM would have worked with the XML, but in general, both RAG tools and large language models work better with Markdown. I wrote a script that first converted the dump to one large markdown file, keeping only the main article pages and skipping Talk pages, User pages, Templates, etc. Then I used a second script to break the document into four, being sure to split along a wiki article (page) boundary. Once this was done, the four files were ready to upload to NotebookLM. Here are the two scripts:

make_markdown.py

"""
Convert a MediaWiki XML dump to a single merged markdown file.

Iterates over all pages in the dump, skips non-article namespaces and
redirects, converts each page's latest revision from MediaWiki markup to
markdown via pandoc, and writes everything to one output file with '---'
horizontal rules between pages.

Requires: mwxml, pandoc (on PATH)
"""

import os
import subprocess
import sys
from pathlib import Path

import mwxml

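# Ensure UTF-8 console output on Windows, where the default encoding may differ.
# Setting PYTHONUTF8 here only affects child Python processes; the reconfigure()
# calls below are what change this process's stdout and stderr.
os.environ['PYTHONUTF8'] = '1'
sys.stdout.reconfigure(encoding='utf-8')
sys.stderr.reconfigure(encoding='utf-8')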

BASE_DIR = Path(__file__).parent.parent # step up from code/ to project root
DATA_DIR = BASE_DIR / "data"
RAW_DATA_DIR = DATA_DIR / "raw data"

input_path = RAW_DATA_DIR / "www.bogleheads.org_w-20250503-history.xml"
output_path = DATA_DIR / "bogleheads.md"

output = []
seen = set()  # Tracks titles already added, in case a title appears more than once in the dump

with open(input_path, 'r', encoding='utf-8') as f:
    dump = mwxml.Dump.from_file(f)

    for page in dump:
        # Namespace 0 is the main article space; skip talk pages, templates, etc.
        if page.namespace != 0:
            continue

        if page.title in seen:
            continue

        # Iterate through all revisions to land on the latest one
        latest = None
        for revision in page:
            latest = revision

        if latest is None or not latest.text:
            continue

        # Redirects contain no useful article content
        if latest.text.strip().upper().startswith('#REDIRECT'):
            print(f'Skipping redirect: {page.title}')
            continue

        seen.add(page.title)

        try:
            result = subprocess.run(
                ['pandoc', '-f', 'mediawiki', '-t', 'markdown'],
                input=latest.text.encode('utf-8'),
                capture_output=True,
                encoding=None,  # Raw bytes in/out; we decode stdout manually below
                env={**os.environ, 'PYTHONUTF8': '1'},
                timeout=30
            )
        except subprocess.TimeoutExpired:
            print(f'Timeout, skipping: {page.title}')
            continue
        except Exception as e:
            print(f'Error on {page.title}: {e}')
            continue

        # A non-zero exit means pandoc failed; skip rather than write an empty page
        if result.returncode != 0:
            print(f'pandoc failed on {page.title}: {result.stderr.decode("utf-8", errors="replace").strip()}')
            continue
        markdown = result.stdout.decode('utf-8')

        output.append(f'# {page.title}\n\n{markdown}')
        print(f'Converted: {page.title}')

with open(output_path, 'w', encoding='utf-8') as f:
    f.write('\n\n---\n\n'.join(output))

print('Done!')
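One of the project goals was for the bot to cite actual wiki URLs, which is easier if each page’s URL appears in the reference text itself. The following is a minimal sketch of how that could be done, not part of the script above. It assumes articles live under https://www.bogleheads.org/wiki/ and follow the usual MediaWiki convention of replacing spaces with underscores; page_url and page_block are hypothetical helpers.

```python
from urllib.parse import quote

WIKI_BASE = "https://www.bogleheads.org/wiki/"  # assumed article URL prefix


def page_url(title):
    """Build an article URL: spaces become underscores, the rest is percent-encoded."""
    return WIKI_BASE + quote(title.replace(" ", "_"), safe="/:(),")


def page_block(title, markdown):
    """Format one page as its heading, a source-URL line, and the converted body."""
    return f"# {title}\n\nSource: {page_url(title)}\n\n{markdown}"


print(page_block("Three-fund portfolio", "Body text."))
```

In make_markdown.py, page_block(page.title, markdown) would take the place of the f-string passed to output.append.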

split.py

"""
Split a large markdown file into four parts with roughly equal page counts.

The input file is expected to contain pages separated by '---'
horizontal rules (with surrounding blank lines). Each part is written
as a separate numbered markdown file in the same directory.
"""

from pathlib import Path

DATA_DIR = Path(__file__).parent.parent / "data"

input_path = DATA_DIR / "bogleheads.md"
with open(input_path, 'r', encoding='utf-8') as f:
    content = f.read()

# Split on the page separator used by the markdown export
pages = content.split('\n\n---\n\n')
print(f'Total pages: {len(pages)}')

chunk_size = len(pages) // 4

for i in range(4):
    # Last chunk gets any remainder pages so nothing is dropped
    start = i * chunk_size
    end = start + chunk_size if i < 3 else len(pages)

    chunk = '\n\n---\n\n'.join(pages[start:end])

    output_path = DATA_DIR / f'bogleheads_part{i+1}.md'
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(chunk)
    print(f'Wrote {output_path.name}: {end - start} pages')
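split.py divides the wiki by page count, so the four files can differ in size when article lengths vary. If file size turns out to matter, a variant that balances by character count while still splitting only on page boundaries might look like the sketch below. split_balanced is a hypothetical helper, and no particular NotebookLM size limit is assumed.

```python
def split_balanced(pages, parts=4, sep="\n\n---\n\n"):
    """Group pages into at most `parts` consecutive groups of similar total length.

    A new group starts once the running character count reaches the next
    multiple of total/parts, so every split falls on a page boundary.
    """
    total = sum(len(p) for p in pages)
    target = total / parts
    groups, current, running = [], [], 0
    for page in pages:
        current.append(page)
        running += len(page)
        # Close the group at each size threshold, but keep the last group open
        if running >= target * (len(groups) + 1) and len(groups) < parts - 1:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return [sep.join(g) for g in groups]


# Four equal-length dummy pages end up one per part
demo = split_balanced(["a" * 10, "b" * 10, "c" * 10, "d" * 10])
print([len(part) for part in demo])
```

Joining the returned parts with the same separator reproduces the original file, so no page is dropped or split.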

Prompt

One of the downsides of NotebookLM is that it does not have many user-adjustable parameters. In fact, it has only two. First, you can choose between short, long, or default-length responses. Second, you can choose the conversational style or role: Default, Learning Guide, or Custom. If you choose Custom, you can provide a system prompt that is invoked along with every question. This is the system prompt that I used:

You are an expert providing information about financial and investment topics. Help the user learn from the wiki. You can simplify the language if it doesn’t change the meaning. At the end of each answer, include the wiki pages and sections that you used to prepare your answer. Include the links to the actual wiki pages. Please also remind the user to confirm recommendations by checking the actual wiki. If the question doesn’t relate to personal finance, investing, or taxes, please politely tell the user that the question is out of scope. If and only if the user asks for personal financial advice, provide the relevant general answer, but tell the user that you do not provide individual financial advice, and they should seek credible financial sources, including the Bogleheads forum.

This system prompt instructs the bot to:

Play the role of an expert on financial and investment topics.
Always provide the user with specific wiki pages and perhaps sections used to generate the reply and remind the user to check the answers against the actual wiki text.
Politely decline to answer questions that are not related to finances, investing, or taxes.
Not provide personal financial advice, while still giving the relevant general answer.

Example Queries

Here are two queries and the responses provided by BHBot:

Question: Which are better: actively managed mutual funds and ETFs or passive index funds?

Response: The Bogleheads philosophy generally holds that passive index funds are superior to actively managed funds for the vast majority of investors. While active managers seek to “beat the market” through individual stock selection or market timing, research consistently shows that this approach usually results in lower returns for the investor after accounting for fees and taxes.

Here is a breakdown of why passive index funds are considered the better choice according to the wiki:

The Cost Advantage

The most significant factor making index funds “better” is their lower cost. In 2012, the average expense ratio for U.S. stock mutual funds was 1.40%, compared to index funds which are often available for 0.20% or less.

The Power of Compounding: Even a 1% difference in annual fees can reduce your final portfolio value by nearly 20% over 30 years.
Hidden Costs: Active funds trade more frequently, leading to higher “transaction costs”—such as brokerage commissions and bid/ask spreads—which are not included in the expense ratio but still lower your returns. Index funds have much lower turnover, minimizing these costs.

Consistent Performance

Academic research indicates that while some active managers may have skill, it is rarely enough to overcome their higher costs.

Lack of Persistence: There is no evidence that managers who outperform one year will continue to do so in the next.
Underperformance: Over long periods, more than half of actively managed funds underperform index funds because they are effectively trying to beat the market average while burdened by high fees.

Superior Tax Efficiency

Because index funds buy and hold securities rather than trading frequently, they generate fewer “capital gains” that you have to pay taxes on. They are also better at providing “qualified dividends,” which are taxed at lower rates. Active funds frequently “churn” their holdings, creating a higher tax bill for investors in taxable accounts.

Style Purity and Reduced Risk

Style Drift: Active managers often suffer from “style drift,” where they buy stocks outside of the fund’s stated category (e.g., a large-cap manager buying small-cap stocks). This takes control of your asset allocation away from you. Index funds stay “pure” to their category, making them better building blocks for a planned portfolio.
Manager Risk: With an active fund, you face the risk that the manager will make poor decisions or leave the fund entirely, potentially forcing you to sell and pay taxes to find a new fund.

Summary of Comparison

Feature	Passive Index Funds	Actively Managed Funds
Costs	Very Low	Generally High
Returns	Market Average	Often below average after fees
Taxes	Very Efficient	Less Efficient
Simplicity	Easy to manage	Complex to monitor

Wiki pages and sections used for this answer:

Bogleheads® investment philosophy – Sections: Use index funds when possible; Keep costs low; Minimize taxes; Never try to time the market.
Index fund – Sections: Advantages; Low costs; Tax efficiency; Reduced manager risk.
Principles of tax-efficient fund placement – Section: Explanation for the estimated order.
Mutual funds and fees – Section: The impact of costs.

Please remind yourself to confirm recommendations by checking the actual wiki. I do not provide individual financial advice, and you should seek credible financial sources, including the Bogleheads forum, for your specific situation.

Note: The response is thorough and includes a summary table. It provides hyperlinks to the four wiki pages used to produce the answer. It also reminds the user that the answer does not constitute individual financial advice.

Question: How does the offside rule work in soccer?

Response: I am sorry, but your question about the offside rule in soccer does not relate to personal finance, investing, or taxes. Therefore, it is out of scope for this assistant.

For information regarding financial and investment topics, please feel free to ask about the Bogleheads® investment philosophy, retirement planning, or tax-efficient investing strategies as described in the Bogleheads Wiki.

Note: The bot politely declines to attempt to answer off-topic questions.

Try it Yourself!

Feel free to try BHBot out for yourself: https://notebooklm.google.com/notebook/664a244b-900a-4e50-b3c4-7887dc0b6849

Remember that BHBot will only answer a total of 50 questions a day, so if it hits the limit, you’ll get a message that you need to wait for the next day.

I’d welcome your feedback in the comments section.

The Aspiring Roboticist

Notes from a computer and robotics hobbyist

Author Archives: ViennaMike

Using NotebookLM as a RAG Chatbot for Financial Advice