One of the more fruitful ways I’ve been using ChatGPT is as a different kind of web search. It works best when I want the original data summarized: ChatGPT’s responses focus on the specific details of the prompt, whereas Google tends to return a lot of generic information from which I have to extract the relevant answer myself.
The equivalence of ChatGPT to web search is evident in how similarly the two operate. ChatGPT and other large language models (LLMs) trained on the web as a corpus have effectively indexed the entire web. That index happens to be encoded as the numerical factors of a very large equation. In contrast to Google and traditional inverted indexes, ChatGPT has memorized the contents of the web as the bits of those numerical factors.
In classic web search, a web crawler visits many websites. The pages from each site are fed into an intake pipeline. The pipeline identifies tokens in the text of each page and updates the inverted index under construction. The relationships between tokens and their pages are encoded within the index. There is some fuzziness and duplication in the tokens and the indexes, but the foundations of any web search service follow this broad outline.
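The indexing step above can be sketched in a few lines. This is a toy illustration, not any real search engine's pipeline: the tokenizer is deliberately crude, and the page contents and URLs are made up for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(pages):
    """Build an inverted index mapping each token to the set of
    page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index

# Hypothetical crawled pages
pages = {
    "a.example": "the cat sat on the mat",
    "b.example": "the dog chased the cat",
}
index = build_index(pages)
# index["cat"] → {"a.example", "b.example"}
```

A production index also stores positions, frequencies, and ranking signals, but the core idea is the same: the work of reading pages happens once, up front.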
The “search” of web search is a lookup in this pre-computed index, not a real-time scan of current web content. A user’s query is parsed into tokens, which are passed to a query service. The query service feeds the tokens into the inverted index and gathers the referenced pages. The pages are then ranked based on data stored within the precomputed index.
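A minimal sketch of that query path, assuming a tiny hand-built index where each token maps to per-page occurrence counts (the URLs and counts are invented, and summing counts is only a stand-in for real ranking signals):

```python
from collections import defaultdict

# Toy precomputed inverted index: token -> {url: occurrence count}
index = {
    "html":  {"tutorial.example": 12, "ref.example": 3},
    "float": {"tutorial.example": 1,  "ref.example": 7},
}

def query(q):
    """Tokenize the query, gather referenced pages from the index,
    and rank them by the counts stored at index-build time."""
    tokens = q.lower().split()
    scores = defaultdict(int)
    for token in tokens:
        for url, count in index.get(token, {}).items():
            scores[url] += count
    return sorted(scores, key=scores.get, reverse=True)

# query("HTML float") → ["tutorial.example", "ref.example"]
```

Note that the query never touches the live web; everything it needs was computed when the index was built.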
The crawling, indexing, and querying processes of web search services are remarkably similar to the content acquisition, training, and prompting processes of LLMs. The vocabulary is different, but the steps and the results show a strong equivalence.
Training an LLM requires a large corpus of text for it to read. Whether the content comes from a curated list or from web crawlers, the pages are read from the source and fed into an intake pipeline. The pipeline identifies tokens in the text of each page and updates the LLM under construction with new token relationships. The LLM encodes these relationships as numerical factors in a large equation.
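As a loose analogy (real LLMs learn billions of weights by gradient descent, not by counting), the idea of “token relationships encoded as numbers” can be illustrated with next-token counts over a made-up corpus:

```python
from collections import defaultdict

def train(corpus):
    """Count how often each token follows another -- a crude
    stand-in for the numerical relationships an LLM learns
    during training."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        tokens = text.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

# Hypothetical three-document corpus
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train(corpus)
# model["the"] → {"cat": 2, "dog": 1}
```

The point of the analogy is only this: after training, the corpus itself is no longer needed, because its regularities now live in the numbers.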
Prompting an LLM is likewise analogous to query handling in web search. The user’s prompt is parsed into tokens, which are fed to the model. The model generates output based on the numerical values computed during training.
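Continuing the counting analogy, prompting can be sketched as tokenizing the prompt and repeatedly emitting the most likely next token from the trained numbers (real models sample from a probability distribution; the fixed counts here are hypothetical):

```python
# Toy "model": next-token counts fixed at training time
model = {
    "the": {"cat": 2, "dog": 1},
    "cat": {"sat": 1, "ran": 1},
}

def respond(prompt):
    """Tokenize the prompt, then extend it greedily with the
    most frequent next token until the model has no continuation."""
    tokens = prompt.lower().split()
    while tokens[-1] in model:
        nxt = max(model[tokens[-1]], key=model[tokens[-1]].get)
        tokens.append(nxt)
    return " ".join(tokens)

# respond("the") → "the cat sat"
```

As with the search query, the response is produced entirely from the precomputed numbers; no source text is consulted at prompt time.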
For ChatGPT, this corpus was largely web content. Its underlying LLM has effectively memorized the entire web.
For me, that’s a great resource for certain kinds of questions. It’s best when I want a summary of the original data, rather than the original data.
In some casual reading, the Italian “Years of Lead” was mentioned. ChatGPT provided a nice seven-paragraph summary. Wikipedia also offers a nice summary, but a bit longer. We’re planning a vacation to London. ChatGPT provided a nice itinerary tailored to the length of our stay and interests.
The other day I was fussing with some HTML layout. It’s been a few months, and I’ve forgotten both the syntax and the vocabulary. The prompt was simple:
remind me of the correct HTML to insert an image with the text flowing around on the right
The first ChatGPT response reminded me that images float (versus text flowing) and that presentation is called style (versus layout). It also provided a partial answer and the recommendation that style sheets are the modern best practice. A second prompt provided the complete answer I was seeking, with the less desirable but necessary style property.
style="float: left;"
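For completeness, a minimal example of that property in context. The image floats left so the text flows around it on the right; the file name is a placeholder, and the margin is an optional addition for readability:

```html
<!-- Image floats left; text flows around it on the right -->
<img src="photo.jpg" alt="Example" style="float: left; margin-right: 1em;">
<p>The paragraph text wraps along the right side of the floated image.</p>
```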
Issuing the same query to Google provides the answer, but I have to extract it from the referenced pages myself. Today [10-Jun-2023], the first page uses the archaic align attribute, not CSS. The next page provides the answer, but it is on the fourth screenful of a lengthy tutorial. None of the web search results provides the answer “above the fold”.
The observation that ChatGPT has memorized the web guides how I use ChatGPT to solve problems. I use it where recall is effective and where the answers from a text search would require integration or review.
As any regular user has experienced, you need to be on the lookout for “hallucinations”. Even so, I often find it a quick and easy source for a forgotten fact.