Curated Resource

Using Brave Search for higher quality training data and better AI | Brave

my notes

The Brave browser project shows that it was ahead of the curve back in late 2023, pointing out:

  • to train an LLM you need training data that is "diverse... span[ning] a wide variety of genres, topics, viewpoints, languages, and more... [to] reduce the errors, biases, and misrepresentations that might be more pronounced in smaller data sets... [ensuring] models can become more knowledgeable, fair, and representative of the real world."
  • the internet was, of course, a large source of training data, but "much of that data can be poor quality". For example, "Common Crawl is a non-profit organization that crawls the Web ... October 2023 crawl... [is] often cited as a training source for large, well-funded LLMs ... [but it] isn’t perfect... overrepresentation of young people from developed countries". Other datasets therefore exist.

So "AI projects should look for more accessible, better curated, high-quality data sets that fit their needs", hence Brave's approach:

  • offer Brave Search: "the only independent, global-scale search engine outside of Google and Bing... index contains over 19 billion webpages, and indexes about 50–70 million" more per day
  • and an API that allows LLM developers to access high-quality training data while respecting user privacy.

A key factor (for me at least) is that "Brave’s index is much more representative of the Web people actually care about" because of their related "Web Discovery Project mechanism (which allows real users to contribute anonymous data about the pages they’re actually visiting)... [so our] index represent[s] the 99% of the Web that people actually want to visit... a filtered search index ... curated by its millions of actual users".

The article then looks at Brave's (early) "AI Summarizer, which returns AI-powered, contextual answers at the top of the search results page... [and] cites its sources".

The article then explores how the Brave Search index can be used both for "assembling a data set to train AI models ... [and] To help at the time of inference... via RAG (basically when a model retrieves new information that it wasn’t originally trained on)".
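To make the RAG idea concrete, here is a minimal sketch of my own in Python; it is not taken from the article and does not call Brave's API. A toy keyword-overlap ranking over a small in-memory list stands in for a real search index, and the names `Page`, `retrieve`, `build_prompt` and the example.org URLs are all hypothetical. It only shows the general shape: fetch relevant text at question time, then hand it to the model alongside the question so the answer can cite its sources.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """A retrieved document: where it came from and what it says (hypothetical type)."""
    url: str
    text: str


# Made-up pages standing in for search results; a real system would query a search index.
DEMO_PAGES = [
    Page("https://example.org/a", "the index adds new pages every day"),
    Page("https://example.org/b", "cats sleep most of the day"),
    Page("https://example.org/c", "a search index stores pages for retrieval"),
]


def retrieve(query: str, pages: list[Page], k: int = 2) -> list[Page]:
    """Return up to k pages sharing the most words with the query (toy ranking)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.text.lower().split())), p) for p in pages]
    # Drop pages with no overlap, then sort by score, highest first.
    scored = [(score, page) for score, page in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:k]]


def build_prompt(query: str, sources: list[Page]) -> str:
    """Place the retrieved text ahead of the question, numbered so it can be cited."""
    context = "\n".join(
        f"[{i}] {p.url}: {p.text}" for i, p in enumerate(sources, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    question = "search index pages"
    print(build_prompt(question, retrieve(question, DEMO_PAGES)))
```

The resulting prompt would then be sent to whatever language model is in use; that step is left out because it depends entirely on the model provider.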

The above notes were curated from the full post brave.com/ai/using-brave-search-api/.
