Ziff Davis's Study Reveals That LLMs Favor High DA Websites


Posted by Manomita Mandal on February 5


For years, SEOs have relied on Domain Authority (DA) as a benchmark for assessing a website’s authority. While Moz has consistently stated that DA is not a Google ranking factor, the metric has remained a key point of discussion in the industry.

New research from Ziff Davis sheds more light on how Domain Authority correlates with LLM content preferences, suggesting that the future might not be so different from the present.


Why did Ziff Davis conduct this study?

Ziff Davis, a major publisher with brands like PCMag, Mashable, IGN, and Moz, faces the same challenges as other media companies. They suspect that Large Language Models (LLMs) are training on their content without licensing agreements, but with no visibility into what goes into training data, it's difficult to determine which content is being favored.

The study set out to address this issue. Researchers analyzed datasets like Common Crawl, C4, OpenWebText, and OpenWebText2 to understand how LLMs are trained, what types of content they prefer, and how these choices influence AI behavior and output.

You can read the full study report here.
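As a rough illustration of what this kind of dataset analysis can look like in practice, the sketch below streams a small sample of the public C4 corpus with the Hugging Face datasets library and tallies which domains appear most often. The dataset name ("allenai/c4"), the sample size, and the field names are assumptions about the publicly hosted mirror, not details taken from the Ziff Davis study.

```python
# Rough illustration only: stream a small sample of the public C4 corpus
# and count which domains appear most often. The dataset name, sample size,
# and field names are assumptions about the public mirror, not the
# methodology used in the Ziff Davis study.
from collections import Counter
from itertools import islice
from urllib.parse import urlparse

from datasets import load_dataset  # Hugging Face `datasets` library

# Stream the English split so nothing has to be downloaded up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

domain_counts = Counter()
for record in islice(c4, 10_000):  # small sample, purely for illustration
    domain = urlparse(record["url"]).netloc.lower()
    domain_counts[domain.removeprefix("www.")] += 1

for domain, count in domain_counts.most_common(20):
    print(f"{domain:40s} {count}")
```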


Key takeaways from the Ziff Davis LLM Study

If you want to skip the rest of the article, I’ve summarized the key findings below:

  • LLM training pipelines weight heavily curated, high-quality datasets above raw web data
  • Authoritative publishers dominate these curated datasets
  • OpenWebText and OpenWebText2 feature a much higher proportion of high-DA content compared to uncurated datasets
  • LLM developers prioritize commercial publisher content, reflecting a preference for quality and credibility

Which datasets were analyzed?

The Ziff Davis study examined four key datasets that are crucial in training large language models:

  • Common Crawl: An uncurated repository of web text scraped from the entire internet with minimal quality control.
  • C4: A cleaned version of Common Crawl that focuses on English pages and excludes duplicates and low-quality text. It offers a more refined dataset without strict curation.
  • OpenWebText: A proxy for OpenAI’s WebText, emphasizing high-quality content linked from Reddit that cleared a minimum upvote threshold (a toy version of this filter is sketched just after this list).
  • OpenWebText2: A follow-up to OpenWebText featuring an expanded and updated dataset while maintaining the same quality-focused approach.
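The upvote filter behind this WebText-style curation is simple enough to sketch. The snippet below is a toy illustration that assumes you already have Reddit submissions as (url, score) pairs; the function name and input format are made up for the example, and only the karma threshold of 3 follows the published WebText/OpenWebText recipe.

```python
# Toy illustration of a WebText-style quality filter: keep only URLs whose
# Reddit submissions reached a minimum karma score, then de-duplicate.
# The (url, score) input format and function name are illustrative.
from urllib.parse import urlparse


def filter_submissions(submissions, min_karma=3):
    """Return unique URLs whose submissions scored at least `min_karma`."""
    seen = set()
    kept = []
    for url, score in submissions:
        if score < min_karma:
            continue
        # Normalize lightly so trivially duplicated links collapse together.
        key = urlparse(url)._replace(fragment="").geturl()
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


sample = [
    ("https://example.com/article", 12),
    ("https://example.com/article#comments", 5),
    ("https://low-effort.example/post", 1),
]
print(filter_submissions(sample))  # -> ['https://example.com/article']
```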

It’s important to note that these datasets aren’t created equal. More curated datasets, like OpenWebText and OpenWebText2, contain a higher proportion of authoritative content, while unfiltered sources like Common Crawl pull from a much wider but lower-quality pool of web pages. These differences in dataset composition affect how LLMs learn and what they generate.
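The comparison the study draws, namely what share of a dataset's domains sits at or above a given DA level, can be sketched in a few lines. The snippet below assumes a list of sampled domains (for example, from the C4 sketch above) and a hypothetical domain_authority.csv mapping domains to DA scores; the file name, its columns, and the threshold of 60 are illustrative assumptions, since real DA scores would come from Moz's tooling.

```python
# Rough sketch: estimate what share of sampled domains count as "high DA".
# Assumes a hypothetical domain_authority.csv with columns: domain,da
# (real DA scores would come from Moz); values and threshold are illustrative.
import csv


def load_da_scores(path="domain_authority.csv"):
    """Read a domain -> DA mapping from a simple two-column CSV."""
    with open(path, newline="") as f:
        return {row["domain"]: int(row["da"]) for row in csv.DictReader(f)}


def high_da_share(domains, da_scores, threshold=60):
    """Fraction of sampled domains with a known DA at or above `threshold`."""
    scored = [da_scores[d] for d in domains if d in da_scores]
    if not scored:
        return 0.0
    return sum(score >= threshold for score in scored) / len(scored)


if __name__ == "__main__":
    da_scores = load_da_scores()
    sample_domains = ["pcmag.com", "mashable.com", "unknown-blog.example"]
    print(f"High-DA share: {high_da_share(sample_domains, da_scores):.1%}")
```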

How were publishers chosen for the study?

The study used Comscore web traffic data to determine which publishers to analyze. Researchers focused on the top 15 portfolio publishers in the Media category as of August 2020, representing the most widely visited news and media organizations.
