Thirty-five local and regional newspaper publishers filed a copyright lawsuit on June 24, 2026, in the Southern District of New York against Microsoft and a web of OpenAI entities. Together, the publishers operate nearly 400 outlets across 33 states.
The complaint alleges that Microsoft and OpenAI used automated systems to crawl publishers’ websites, including paywalled content, copy articles to their own servers, remove copyright management information (CMI), and incorporate the content into training datasets for ChatGPT and Microsoft Copilot. The publishers claim that the companies neither sought permission nor paid compensation.
The plaintiffs range from large regional chains to small family-owned weeklies. They include the Arkansas Democrat-Gazette, The New York Amsterdam News (founded in 1909), The Santa Fe New Mexican (founded in 1849), Ogden Newspapers (founded in 1890 and operating in 17 states with roughly 1,400 employees), and dozens of smaller outlets, some with circulations of fewer than 2,000.
What the complaint says happened: According to the filing, OpenAI’s data collection pipeline worked as follows:
- Automated crawlers scraped article text from publishers’ websites, including paywalled content.
- OpenAI used content extraction tools called Dragnet and Newspaper to pull article body text. According to the complaint, both tools were designed to strip surrounding page elements, including copyright notices, author bylines, publication names, and terms of use.
- The stripped text was compiled into training datasets, including WebText, WebText2, and filtered versions of Common Crawl.
- Those datasets were then used to train successive GPT models, which the complaint alleges have “memorized” portions of the scraped material and reproduced them in response to user prompts.
- The complaint further alleges that OpenAI has repeated this process continuously as it updates its models with new material.
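To make the extraction step concrete, here is a minimal sketch of how a body-text extractor can drop everything outside the article itself. It is not the code of Dragnet or Newspaper, whose actual heuristics are not described here; the sample page, the `ArticleBodyExtractor` class, and the rule of keeping only text inside `<article>` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration; not taken from the complaint.
SAMPLE_PAGE = """
<html><body>
<header>The Example Gazette</header>
<p class="byline">By Jane Doe</p>
<article>
<p>The city council approved the budget on Tuesday.</p>
<p>The vote was 5 to 2.</p>
</article>
<footer>Copyright 2026 The Example Gazette. Terms of Use.</footer>
</body></html>
"""


class ArticleBodyExtractor(HTMLParser):
    """Keep only text found inside <article>; drop everything else.

    This is an assumed, simplified rule, not the logic of any real tool.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside <article> elements
        self.chunks = []  # text fragments collected from the article body

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_body(html: str) -> str:
    """Return the article body text, one fragment per line."""
    parser = ArticleBodyExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


body = extract_body(SAMPLE_PAGE)
print(body)
```

In this toy example the publication name, byline, copyright notice, and terms-of-use text all sit outside the article element, so none of them reach the output. That is the kind of result the complaint describes, although the sketch says nothing about intent.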
The token counts: The publishers present two tables that quantify the presence of their content in OpenAI’s training data, based on analyses of open-source dataset approximations.
In OpenWebText, an approximation of OpenAI’s WebText dataset, the plaintiffs identified millions of tokens sourced from their websites. AIM Media Indiana accounted for more than 891,000 tokens, while AmNews Corp. contributed over 706,000.
In C4, a filtered snapshot of Common Crawl used to train GPT-3, the figures are significantly higher. Ogden Newspapers accounted for more than 71 million tokens, WEHCO Newspapers for over 6.3 million, and Richner Communications for more than 2.9 million. Across all plaintiffs, the total number of tokens in C4 exceeded 115 million.
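A per-publisher tally of this kind can be approximated by grouping dataset records by source domain and summing their token counts. The sketch below is an assumption about the general approach, not the plaintiffs' method: the records are invented, the field names `url` and `text` are hypothetical, and whitespace splitting stands in for whatever tokenizer the analysis actually used.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for rows of an open dataset.
RECORDS = [
    {"url": "https://www.example-gazette.com/news/budget",
     "text": "The council approved the budget"},
    {"url": "https://example-gazette.com/sports/final",
     "text": "The home team won"},
    {"url": "https://other-site.example/post",
     "text": "Unrelated text here"},
]


def domain_of(url: str) -> str:
    """Return the host name without a leading 'www.' prefix."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def tokens_by_domain(records) -> Counter:
    """Sum whitespace-delimited token counts per source domain."""
    counts = Counter()
    for record in records:
        # Whitespace splitting is a rough stand-in for a real tokenizer.
        counts[domain_of(record["url"])] += len(record["text"].split())
    return counts


counts = tokens_by_domain(RECORDS)
for domain, total in counts.most_common():
    print(domain, total)
```

A model's own tokenizer would generally produce different totals than whitespace splitting, so figures from a sketch like this are only indicative.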
The CMI stripping claim: Beyond standard copyright infringement, the complaint adds a claim under the Digital Millennium Copyright Act (DMCA) based on the alleged deliberate removal of copyright management information.
The complaint states that OpenAI selected Dragnet and Newspaper because these tools were known to remove “navigation chrome, advertising blocks, copyright notices and the like.” As a result, the extracted article text allegedly entered the training pipeline without author credits, copyright notices, publication names, or terms-of-use information.
The publishers argue that this was intentional. They claim that had CMI been retained in the training data, model outputs reproducing the articles would have carried that information, alerting users that the content was copyrighted. According to the complaint, removing CMI concealed the material’s origin and made infringement more difficult to detect and prove.
The complaint also alleges that the C4 dataset contains full-text articles from the plaintiffs’ publications without bylines, titles, copyright notices, or terms-of-use links, consistent with the intended output of the named extraction tools.
Scale of defendants’ businesses: The complaint places the alleged infringement in the context of OpenAI’s financial growth:
- OpenAI generates $2 billion in monthly revenue.
- It was valued at $852 billion following a $122 billion funding round in March 2026.
- It confidentially filed for an IPO in June 2026, with some analysts projecting a valuation exceeding $1 trillion.
- More than 92% of Fortune 500 companies reportedly use ChatGPT.
- ChatGPT has over 900 million weekly active users.
- Microsoft reported $82.9 billion in quarterly revenue in early 2026, roughly 20% higher than a year earlier.
The complaint alleges that none of this revenue was shared with the publishers whose work contributed to the development of the underlying AI models.
Legal claims: The complaint asserts three counts.
Count I — Direct copyright infringement: This claim is brought by the five publishers with registered copyrights on file: the Arkansas Democrat-Gazette, Concord Publishing House, H.S. Gere & Sons, The New Mexican, and Newspapers of New Hampshire. They allege that the defendants scraped, reproduced, stored, and distributed their registered works without authorization, both during model training and through model outputs.
Count II — Vicarious copyright infringement: This claim is brought by the same five publishers. It targets Microsoft and the OpenAI parent entities on the theory that they controlled and profited from infringement carried out by subsidiary and partner entities while having the ability to stop it.
Count III — DMCA violation (17 U.S.C. § 1202): This claim is brought by all 35 plaintiffs against the OpenAI entities. It alleges the knowing removal of CMI with the intent to conceal infringement. Unlike copyright claims, DMCA claims do not require registered copyrights, allowing all 35 plaintiffs, not just those with registered works, to pursue this count.
Prior litigation context: The complaint acknowledges that this is not the first lawsuit of its kind. It notes that similar suits have been filed by The New York Times, the New York Daily News, the Chicago Tribune, the Denver Post, The Intercept, Raw Story, and others. Those cases have been consolidated into multidistrict litigation in the Southern District of New York (MDL No. 1:25-md-03143).
The filing states that the cases have largely survived motions to dismiss and alleges that the defendants have continued the challenged conduct rather than changing it.