Ziff Davis's Study Reveals That LLMs Favor High DA Websites

Published by Manomita Mandal · Feb 5


For years, SEOs have relied on Moz's Domain Authority (DA) metric as a benchmark for a website's perceived authority. While Moz has consistently stated that DA is not a Google ranking factor, the metric has remained a key point of discussion in the industry.

New research from Ziff Davis sheds more light on how Domain Authority correlates with LLM content preferences, suggesting that the future might not be so different from the present.


Why did Ziff Davis conduct this study?

Ziff Davis, a major publisher whose brands include PCMag, Mashable, IGN, and Moz, faces the same challenge as other media companies: it suspects that Large Language Models (LLMs) are being trained on its content without licensing agreements, but because training data is rarely disclosed, it's hard to determine which content is being used, let alone favored.

The study set out to address this issue. Researchers analyzed datasets like Common Crawl, C4, OpenWebText, and OpenWebText2 to understand how LLMs are trained, what types of content they prefer, and how these choices influence AI behavior and output.
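
The full methodology is in the report, but the general shape of a domain-level DA analysis is easy to sketch in Python. Everything below is illustrative: the domains and scores are made up, and a real pipeline would query a DA provider such as Moz's Links API rather than a hard-coded table.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-in: a real pipeline would query a DA provider
# (e.g. Moz's Links API) instead of this table of made-up scores.
ILLUSTRATIVE_DA = {
    "bignews.example.com": 85,
    "nicheblog.example.net": 32,
    "forum.example.org": 55,
}

def da_tier(score):
    """Bucket a 0-100 DA score into coarse tiers."""
    return "high" if score >= 70 else "medium" if score >= 40 else "low"

def da_distribution(urls):
    """Share of a URL sample's unique domains falling in each DA tier."""
    domains = {urlparse(u).netloc for u in urls}
    tiers = Counter(da_tier(ILLUSTRATIVE_DA.get(d, 0)) for d in domains)
    total = sum(tiers.values())
    return {tier: round(n / total, 2) for tier, n in tiers.items()}

sample = [
    "https://bignews.example.com/story-1",
    "https://bignews.example.com/story-2",
    "https://nicheblog.example.net/post",
    "https://forum.example.org/thread",
]
print(da_distribution(sample))
# -> {'high': 0.33, 'medium': 0.33, 'low': 0.33} (key order may vary)
```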

You can read the full study report here.


Key takeaways from the Ziff Davis LLM Study

If you want to skip the rest of the article, I’ve summarized the key findings below:

  • LLM training pipelines weight heavily curated, high-quality datasets above raw web data
  • Authoritative publishers dominate these curated datasets
  • OpenWebText and OpenWebText2 contain a much higher proportion of high-DA content than uncurated datasets
  • LLM developers prioritize commercial publisher content, reflecting a preference for quality and credibility

Which datasets were analyzed?

The Ziff Davis study examined four datasets that are central to training large language models; a short loading sketch follows the list:

  • Common Crawl: An uncurated repository of web text scraped from the entire internet with minimal quality control.
  • C4: A cleaned version of Common Crawl that focuses on English pages and excludes duplicates and low-quality text. It offers a more refined dataset without strict curation.
  • OpenWebText: A proxy for OpenAI’s WebText, emphasizing high-quality content linked from Reddit with a minimum upvote threshold.
  • OpenWebText2: A follow-up to OpenWebText featuring an expanded and updated dataset while maintaining the same quality-focused approach.
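
For readers who want to poke at these corpora directly, here's the loading sketch promised above. It assumes the Hugging Face `datasets` library and the `allenai/c4` Hub mirror; OpenWebText and OpenWebText2 have community mirrors on the Hub as well, though IDs and loading details change over time.

```python
# A minimal sketch: stream a few C4 records from its Hugging Face Hub
# mirror (pip install datasets). streaming=True iterates over the
# corpus without downloading the full multi-terabyte dump.
from itertools import islice

from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for record in islice(c4, 3):
    # C4 keeps the source URL alongside the cleaned text, which is
    # what makes domain-level analyses like Ziff Davis's possible.
    print(record["url"], record["text"][:60])
```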

It's important to note that these datasets aren't created equal. More curated datasets, like OpenWebText and OpenWebText2, contain a higher proportion of authoritative content, while unfiltered sources like Common Crawl pull from a much wider but lower-quality pool of web pages. These differences in composition shape how LLMs learn and what they generate.
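
To see why curation skews the pool this way, consider the WebText recipe that OpenWebText reproduces: keep only outbound links from Reddit submissions that cleared a karma threshold (OpenAI's original WebText required at least 3 karma). A toy sketch with illustrative URLs and scores:

```python
# WebText-style curation: keep only URLs whose Reddit submissions
# cleared a karma threshold. The (url, karma) pairs are illustrative.
KARMA_THRESHOLD = 3  # the threshold OpenAI described for WebText

submissions = [
    ("https://bignews.example.com/analysis", 127),
    ("https://nicheblog.example.net/post", 1),
    ("https://forum.example.org/thread", 5),
]

curated = sorted({url for url, karma in submissions if karma >= KARMA_THRESHOLD})
print(curated)
# ['https://bignews.example.com/analysis', 'https://forum.example.org/thread']
```

Because users mostly upvote links to outlets they already recognize, even this crude filter tilts the surviving URL pool toward well-known, high-DA publishers.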

How were publishers chosen for the study?

The study used Comscore web traffic data to determine which publishers to analyze. Researchers focused on the top 15 portfolio publishers in the Media category as of August 2020, representing the most widely visited news and media organizations.
