Ziff Davis's Study Reveals That LLMs Favor High DA Websites

Blog Information

  • Posted by: Manomita Mandal
  • Posted on: Feb 05, 2025
  • Views: 113
  • Category: General
  • Description: New research from Ziff Davis sheds more light on how Domain Authority correlates with LLM content preferences, suggesting that the future might not be so different from the present.
  • Location: Kolkata, West Bengal, India

Overview

  • For years, SEOs have relied on Domain Authority (DA) as a benchmark for assessing a website’s authority. While Moz has consistently stated that DA is not a Google ranking factor, the metric has remained a key point of discussion in the industry.

    New research from Ziff Davis sheds more light on how Domain Authority correlates with LLM content preferences, suggesting that the future might not be so different from the present.

     

    Why did Ziff Davis conduct this study?

    Ziff Davis, a major publisher with brands like PCMag, Mashable, IGN, and Moz, faces the same challenges as other media companies. They suspect that Large Language Models (LLMs) are training on their content without licensing agreements, but it’s difficult to determine which content is being favored.

    The study set out to address this issue. Researchers analyzed datasets like Common Crawl, C4, OpenWebText, and OpenWebText2 to understand how LLMs are trained, what types of content they prefer, and how these choices influence AI behavior and output.

    You can read the full study report here.
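
    To make this concrete, here is a minimal, hypothetical sketch (not taken from the study) of how one might stream a few records from these training corpora using the Hugging Face datasets library. The dataset identifiers and field names below are assumptions and may need adjusting.

    ```python
    # Minimal sketch, not taken from the Ziff Davis study: stream a few
    # records from the corpora discussed above. The dataset identifiers
    # ("allenai/c4", "Skylion007/openwebtext") and field names are assumptions.
    from itertools import islice

    from datasets import load_dataset  # pip install datasets

    # C4, the cleaned Common Crawl derivative; each record keeps its source URL.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for record in islice(c4, 3):
        print(record["url"])

    # OpenWebText records carry only the text, not the originating URL.
    owt = load_dataset("Skylion007/openwebtext", split="train", streaming=True)
    for record in islice(owt, 3):
        print(record["text"][:80])
    ```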

     

    Key takeaways from the Ziff Davis LLM Study

    If you want to skip the rest of the article, I’ve summarized the key findings below:

    • LLMs give heavily curated, high-quality datasets more weight than raw web data
    • Authoritative publishers dominate these curated datasets
    • OpenWebText and OpenWebText2 feature a much higher proportion of high-DA content compared to uncurated datasets
    • LLM developers prioritize commercial publisher content, reflecting a preference for quality and credibility
     

    Which datasets were analyzed?

    The Ziff Davis study examined four key datasets that are crucial in training large language models:

    • Common Crawl: An uncurated repository of web text scraped from the entire internet with minimal quality control.
    • C4: A cleaned version of Common Crawl that focuses on English pages and excludes duplicates and low-quality text. It offers a more refined dataset without strict curation.
    • OpenWebText: A proxy for OpenAI’s WebText, emphasizing high-quality content linked from Reddit with a minimum upvote threshold (a small sketch of this filter follows the list).
    • OpenWebText2: A follow-up to OpenWebText featuring an expanded and updated dataset while maintaining the same quality-focused approach.
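
    As flagged in the OpenWebText item above, here is a tiny, hypothetical illustration of the upvote-threshold idea: keep only URLs whose Reddit submissions cleared a minimum score. The submissions structure and the exact threshold are stand-ins, not the actual WebText pipeline.

    ```python
    # Hypothetical illustration of upvote-threshold curation (the original
    # WebText reportedly kept links whose Reddit posts had at least 3 karma).
    # The `submissions` list is invented, not the project's actual pipeline.
    MIN_SCORE = 3

    submissions = [
        {"url": "https://example.com/long-form-article", "score": 57},
        {"url": "https://example.org/low-effort-post", "score": 1},
    ]

    # Keep only URLs whose submissions cleared the threshold.
    curated_urls = {s["url"] for s in submissions if s["score"] >= MIN_SCORE}
    print(curated_urls)  # {'https://example.com/long-form-article'}
    ```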

    It’s important to note that these datasets aren’t created equal. More curated datasets, like OpenWebText and OpenWebText2, contain a higher proportion of authoritative content, while unfiltered sources like Common Crawl pull from a much wider but lower-quality pool of web pages. This difference in dataset composition shapes how LLMs learn and generate responses.
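
    To illustrate the kind of comparison the study describes, here is a rough, hypothetical sketch that measures what share of a sample’s source domains clears a Domain Authority threshold. The da_scores mapping is invented; in practice, scores would come from a tool such as Moz.

    ```python
    # Rough, hypothetical sketch: what share of a sample's source domains
    # clears a Domain Authority threshold? The DA lookup below is invented;
    # real scores would come from a tool such as Moz.
    from urllib.parse import urlparse

    da_scores = {"nytimes.com": 95, "example-blog.net": 12}  # hypothetical values

    def high_da_share(urls, threshold=60):
        """Fraction of URLs whose domain meets or exceeds the DA threshold."""
        domains = [urlparse(u).netloc.removeprefix("www.") for u in urls]
        return sum(da_scores.get(d, 0) >= threshold for d in domains) / max(len(domains), 1)

    curated_sample = ["https://www.nytimes.com/story", "https://example-blog.net/post"]
    print(f"High-DA share of sample: {high_da_share(curated_sample):.0%}")
    ```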

    How were publishers chosen for the study?

    The study used Comscore web traffic data to determine which publishers to analyze. Researchers focused on the top 15 portfolio publishers in the Media category as of August 2020, representing the most widely visited news and media organizations.
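
    For completeness, here is a hypothetical sketch of that selection step: given a Comscore-style list of Media-category publishers and their traffic, keep the top 15. The field names and figures are invented for illustration.

    ```python
    # Hypothetical sketch of the publisher-selection step; the names and
    # visitor counts are placeholders, not Comscore data.
    publishers = [
        {"name": "Publisher A", "unique_visitors": 120_000_000},
        {"name": "Publisher B", "unique_visitors": 95_000_000},
        {"name": "Publisher C", "unique_visitors": 88_000_000},
    ]

    # Rank by traffic and keep the top 15 (here the sample has only three).
    top_publishers = sorted(publishers, key=lambda p: p["unique_visitors"], reverse=True)[:15]
    for p in top_publishers:
        print(p["name"], f'{p["unique_visitors"]:,}')
    ```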