Florent Daudens

fdaudens

AI & ML interests

AI & Journalism

fdaudens's activity

posted an update about 9 hours ago
🚀 Your AI toolkit just got a major upgrade! I updated the Journalists on Hugging Face community's collection with tools for investigative work, content creation, and data analysis.

Sharing these new additions with the links in case it’s helpful:
- @wendys-llc 's excellent 6-part video series on AI for investigative journalism https://www.youtube.com/playlist?list=PLewNEVDy7gq1_GPUaL0OQ31QsiHP5ncAQ
- @jeremycaplan 's curated AI Spaces on HF https://wondertools.substack.com/p/huggingface
- @Xenova 's Whisper Timestamped (with diarization!) for private, on-device transcription Xenova/whisper-speaker-diarization & Xenova/whisper-word-level-timestamps
- Flux models for image gen & LoRAs autotrain-projects/train-flux-lora-ease
- FineGrain's object cutter finegrain/finegrain-object-cutter and object eraser (this one's cool) finegrain/finegrain-object-eraser
- FineVideo: massive open-source annotated dataset + explorer HuggingFaceFV/FineVideo-Explorer
- Qwen2 chat demos, including 2.5 & multimodal versions (crushing it on handwriting recognition) Qwen/Qwen2.5 & Qwen/Qwen2-VL
- GOT-OCR integration stepfun-ai/GOT_official_online_demo
- HTML to Markdown converter maxiw/HTML-to-Markdown
- Text-to-SQL query tool by @davidberenstein1957 for HF datasets davidberenstein1957/text-to-sql-hub-datasets

There's a lot of potential here for journalism and beyond. Give these a try and let me know what you build!

You can also add your favorite ones if you're part of the community!

Check it out: https://huggingface.co/JournalistsonHF

#AIforJournalism #HuggingFace #OpenSourceAI
posted an update 8 days ago
🚨 Cool tool alert! 🚨

Finally tried Kotaemon, an open-source RAG tool for document chat!

With local models, it's free and private. Perfect for journalists and researchers.

I put Kotaemon to the test with the EPA's Greenhouse Gas Inventory. It accurately answered questions about the CO2 percentage in 2022 emissions and compared 2022 with 2021 data.

🛠️ Kotaemon's no-code interface makes it user-friendly.
- Use your own models or APIs from OpenAI or Cohere
- Great documentation & easy installation
- Multimodal capabilities + reranking
- View sources, navigate docs & create graphRAG

🌟 Kotaemon is gaining traction with 11.3k GitHub stars

Try the online demo: cin-model/kotaemon-demo
GitHub: https://github.com/Cinnamon/kotaemon
Docs: https://cinnamon.github.io/kotaemon/usage/
posted an update 9 days ago
A lot of coverage of the Apple event! I’ve selected a few unique angles and distinctive takes.

**The NYT**
- "The iPhone’s limited feature set is emblematic of how Apple is taking a cautious approach to generative A.I."
- "Wall Street is enthusiastic about the artificially intelligent phones, with analysts predicting the features could help Apple sell a record 240 million iPhones next year."

**The Guardian**
- "Despite the bells and whistles, and being a tech-adopting lot, I bet many of you won’t be lining up to buy it."
- One reason is the simple cost of the iPhone 16, which starts at $799.
- "The adoption of AI into the iPhone could be considered a step change in how the iPhone works. But there may not be a huge hankering to use ChatGPT on your phone."

**The WSJ**
- Apple didn’t say when the AI services would be available in China, its second-largest market after the U.S.
- The delay puts the iPhone maker at a disadvantage against rivals offering AI services
- Huawei held its own announcement in China to release the Mate XT, a three-way foldable smartphone with AI features.
- Apple said that the launch of Apple Intelligence was subject to regulatory approval. In China, any generative AI models that could influence public opinion need government approval.

**CNN**
- "For an event built around unveiling Apple’s first AI-powered iPhone, there was one striking absence over the two-hour presentation: the words 'artificial intelligence.'"
- "But Apple understands something that often gets lost in the bot-pilled bubble of Silicon Valley: Regular people don’t trust AI."

Links:
https://www.nytimes.com/2024/09/09/technology/apple-event-iphone-16-watch.html
https://www.theguardian.com/technology/article/2024/sep/10/techscape-iphone-16-cost-features
https://www.wsj.com/tech/apples-challenge-in-china-rises-with-new-rival-phones-and-ai-delay-8cf871fb?mod=rss_Technology
https://www.cnn.com/2024/09/10/business/apple-iphone-ai-nightcap/
replied to their post 10 days ago
posted an update 10 days ago
posted an update 15 days ago
posted an update 16 days ago
Is AI’s impact on elections being overblown? Three researchers think so in this opinion piece published in MIT Technology Review.

Highlights:

• “AI is being used to try to influence electoral processes, but these efforts have not been fruitful.”
• “Why were these initial speculations about AI-enabled electoral interference so off (…) ? The short answer: Because they ignored decades of research on the limited influence of mass persuasion campaigns, the complex determinants of voting behaviors, and the indirect and human-mediated causal role of technology.”
• “Yet we should remember that there’s a cost to overreaction based on ill-founded assumptions, especially when other critical issues go unaddressed.”

👉Read more here: https://technologyreview.com/2024/09/03/1103464/ai-impact-elections-overblown/
posted an update 20 days ago
AI in the News: Llama 10x growth, Apple & Nvidia in talks with OpenAI, universal basic income, AI & art

* Meta leads open-source AI boom, Llama downloads surge 10x year-over-year - VB
https://venturebeat.com/ai/meta-leads-open-source-ai-boom-llama-downloads-surge-10x-year-over-year/

* Apple, Nvidia Are in Talks to Invest in OpenAI - WSJ
https://www.wsj.com/tech/ai/openai-apple-funding-chatgpt-50754cd6?mod=rss_Technology

* The Report Card on Guaranteed Income Is Still Incomplete - NYT
https://www.nytimes.com/2024/08/30/business/economy/the-report-card-on-guaranteed-income-is-still-incomplete.html

* Ethically dubious or a creative gift? How artists are grappling with AI in their work - The Guardian
https://www.theguardian.com/artanddesign/article/2024/aug/30/xanthe-dobbie-futuer-sex-love-sounds-ai-video-celebrity-clones

Want more? Subscribe to my daily newsletter!
https://linkedin.com/build-relation/newsletter-follow?entityUrn=7233909926606053377
posted an update 21 days ago
📫 A packed AI in the News edition today!

📉 Nvidia Revenue Jumps 122% in Positive Sign for Tech's A.I. Boom - NYT
- $30.04 billion revenue, $16.95 billion net income (up from $6.19 billion a year ago)
- Shares in the company fell by as much as 7% in after-hours trading
- Nvidia faces production challenges with its new Blackwell chip and growing competition, including from its own customers
- Spending on data centers and energy costs to support A.I. is expected to be $1 trillion
👉 https://www.nytimes.com/2024/08/28/technology/nvidia-earnings-ai-stocks.html

🏛️ California Legislature Approves Bill Proposing Sweeping A.I. Restrictions - NYT
- Bill S.B. 1047 would require AI companies to test their systems for safety before public release and allow the state attorney general to sue for serious harms caused by AI.
- Supporters argue it’s necessary to mitigate AI risks, while critics worry it’s excessively focused on catastrophic harms and could jeopardize open-source AI development.
- Governor Gavin Newsom has until September 30 to decide on the bill, which could set a national standard for AI regulation if signed into law.
👉 https://www.nytimes.com/2024/08/28/technology/california-ai-safety-bill.html

🧑‍🏫 Generative AI Transformed English Homework. Math Is Next
- Gauth app, which can solve math problems from photos, has millions of downloads
- It earned a low B in high-school-level algebra and geometry in tests by Wired. "Likely good enough to satisfy bored students who'd rather spend their time after school doing literally anything else."
- The rise of such AI tools challenges educators to rethink their approach to math homework and teaching methods, possibly leading to a shift towards more in-class practice and personalized learning.
👉 https://www.wired.com/story/gauth-ai-math-homework-app/
posted an update 22 days ago
📫 AI in the News today:

X’s Grok bot now points to government website after election misinformation warnings - The Verge
https://www.theverge.com/2024/8/28/24230325/x-grok-chatbot-election-misinformation-warnings-vote

Klarna aims to halve workforce with AI-driven gains
https://www.ft.com/content/bfd9af3d-d607-4877-9571-078ab82a837e

Artificial intelligence: questioning the loss of employee autonomy - Le Monde (Google Translate)
https://www-lemonde-fr.translate.goog/emploi/article/2024/08/28/intelligence-artificielle-la-perte-d-autonomie-des-salaries-en-question_6297347_1698637.html?_x_tr_sl=fr&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp

Make AI tools to reduce teacher workloads, tech companies urged - The Guardian
https://www.theguardian.com/education/article/2024/aug/28/make-ai-tools-to-reduce-teacher-workloads-tech-companies-urged

Can Tech Executives Be Held Responsible for What Happens on Their Platforms?
https://www.nytimes.com/2024/08/28/technology/durov-telegram-liability-platforms.html

‘Being on camera is no longer sensible’: persecuted Venezuelan journalists turn to AI - The Guardian
https://www.theguardian.com/world/article/2024/aug/27/venezuela-journalists-nicolas-maduro-artificial-intelligence-media-election

Read my daily newsletter here: https://linkedin.com/pulse/ai-news-august-28th-2024-florent-daudens-o7mjc/
posted an update 23 days ago
‘AI in the News’ of the day:

Anthropic publishes the ‘system prompts’ that make Claude tick
- "In its continued effort to paint itself as a more ethical, transparent AI vendor, Anthropic has published the system prompts for its latest models"
- They specify that “Claude cannot open URLs, links, or videos, perform facial recognition or identify or name any humans in photos.”
- "Anthropic is exerting pressure on competitors to publish the same. We’ll have to see if the gambit works."
https://techcrunch.com/2024/08/26/anthropic-publishes-the-system-prompt-that-makes-claude-tick/

China’s tech giants splash out on AI despite US restrictions (paywall)
- "Alibaba, Tencent and Baidu had combined capital expenditure of Rmb50bn ($7bn) in the first half, compared with Rmb23bn a year earlier. TikTok parent ByteDance (which is private) has also increased AI-related spending"
- Nvidia's H100 and upcoming Blackwell series are under US restrictions, but China’s tech giants can buy H20
- Analysts expect Nvidia to ship more than 1mn of the processors to Chinese tech groups in the coming months.
https://www.ft.com/content/31bffc48-2ca7-472b-9d53-3deaad2d86ce

Mark Zuckerberg "said it was improper for the Biden administration to have pressured Facebook to censor content in 2021 related to the coronavirus pandemic"
- "At the time, Facebook’s publicly stated goal was to push millions of people toward Covid-19 vaccines. In his letter, Zuckerberg didn’t indicate whether he had changed his mind about that goal"
https://www.wsj.com/tech/mark-zuckerberg-neutral-politics-letter-election-2024-02b86372

Food for thought:
- Why don’t women use artificial intelligence?
https://www.economist.com/finance-and-economics/2024/08/21/why-dont-women-use-artificial-intelligence
- Most AI avatars look female, young and attractive. Are they a passing trend or here to stay?
https://reutersinstitute.politics.ox.ac.uk/news/most-ai-avatars-look-female-young-and-attractive-are-they-passing-trend-or-here-stay
posted an update about 1 month ago
🚀 How The Washington Post Uses AI to Empower Journalists 🔍📰

An exciting new example in the world of AI-assisted journalism! The Post has developed an internal tool called "Haystacker" that's enhancing in-depth reporting. Here's why it matters:

🎥 What it does:
• Extracts stills from video files
• Processes on-screen text
• Labels objects in images

🗳️ First big project:
Analyzed 745 Republican campaign ads on immigration (Jan-Jun 2024)

🤝 Human-AI collaboration:
• AI extracts and organizes data
• Reporters verify and analyze findings

🔎 Thorough approach:
• Manual review of all 745 ads
• Reverse image searches when context is lacking
• Cross-referencing with AdImpact transcripts

💡 Key insight from WaPo's Senior Editor for AI strategy Phoebe Connelly:
"The more exciting choice is putting AI in the hands of reporters early on in the process."

This tool showcases how AI can augment journalistic capabilities without replacing human insight and verification. It's a powerful example of technology enhancing, not replacing, traditional reporting skills.

👉 Read the full article and the methodology: https://www.washingtonpost.com/elections/interactive/2024/republican-campaign-ads-immigration-border-security/
posted an update about 1 month ago
📫 AI in the news today: Struggling AI startups, Figure 02 robot, chipmakers

- OpenAI Co-Founders Schulman and Brockman Step Back
https://finance.yahoo.com/news/openai-co-founders-schulman-brockman-010542796.html

- Struggling AI Startups Look for a Bailout from Big Tech
"More exits—either pseudo-acquisitions or real ones—are coming, investors say, as a bubble built by the excitement around generative AI is showing signs of peaking."
https://www.wsj.com/tech/ai/struggling-ai-startups-look-for-a-bailout-from-big-tech-3e635927?mod=rss_Technology

- Did Google Just Pay $2.5 Billion to Hire Character's CEO?
https://www.theinformation.com/articles/did-google-just-pay-2-5-billion-to-hire-characters-ceo

- Figure’s new humanoid robot leverages OpenAI for natural speech conversations
Figure has unveiled its latest humanoid robot, the Figure 02.
The most notable addition this time out arrives by way of a longstanding partnership with OpenAI, which helped Figure raise a $675 million Series B back in February, valuing the South Bay firm at $2.6 billion.
https://techcrunch.com/2024/08/06/figures-new-humanoid-robot-leverages-openai-for-natural-speech-conversations/

- World’s Five Leading Chipmakers Have Now Promised U.S. Investment
The Biden administration will award up to $450 million in grants to a South Korean chipmaker, SK Hynix, to help build its new chip facility in Indiana.
The US now has commitments from all five of the world’s leading-edge semiconductor manufacturers to construct chip plants in the US with financial assistance from the administration.
https://www.nytimes.com/2024/08/06/business/economy/chipmakers-promise-investment.html
posted an update about 2 months ago
posted an update about 2 months ago
posted an update about 2 months ago
Running Gemma 2 2B at 41.66 tokens/s on my MacBook 💻🚀

- MLX Community's swift conversion
- One-line download from the Hub
- Small yet powerful on-device model

Try it yourself: mlx-community/google-gemma2-667dca89bc9abbfa34080066

#GemmaAI #OnDeviceAI #MachineLearning
posted an update about 2 months ago
Journalists, this is a must-read for your career evolution. I just read @ndiakopoulos on "The Impact of Generative AI on Journalistic Labor". Here are my 5 takeaways:

🚀 LLMs could make 83% of reporter tasks and 76% of editor tasks way more efficient. "It’s important to emphasize that these figures are fundamentally about augmentation rather than automation".

🧩 Four emerging job clusters to consider: AI-doers, AI-users, AI-strategizers, and AI-reporters. Where do you fit?

🌊 US newsrooms "will need people (1) who have the skills to use the current generation of LLMs and (2) who can develop the bespoke software to unlock their full potential, particularly if building new in-house tools". Get ahead of the curve.

🔑 Key takeaway: upskill in AI. It's not just about using tools, but understanding how to integrate them into your workflow.

📚 News orgs "would be wise to accelerate hiring and invest in upskilling their existing workforce". Take advantage of, or seek out, learning opportunities.

Journalists: What AI skills are you planning to develop? How might this reshape your role?

👉 Read the full blog post here: https://generative-ai-newsroom.com/the-impact-of-generative-ai-on-journalistic-labor-e87a6c333245 It's worth your time if you're thinking about the future of journalism & your career.

#AIinJournalism #CareerEvolution #FutureofNews
posted an update about 2 months ago
Barefoot developer experiment: AI-powered app creation with my poor coding skills 🧠💻

I recently discussed the "barefoot developer" concept - using AI to build apps for specific needs without coding expertise. Decided to put it to the test. 🔬

🍅⏲️ The challenge: Create a menu bar Pomodoro app for my computer to boost my focus. Previous attempt? Messy.
🤯 The twist: I've never coded in Swift.
⚡️ The result: 30 minutes. No joke. Elegant, functional, and shareable.

🔗 Want to try it yourself? Grab the open source code and app here:
- Code: https://github.com/fdaudens/pomodoro2
- App: https://github.com/fdaudens/pomodoro2/releases/tag/v1.0.0

Key takeaways:
🚀 AI-assisted development is evolving rapidly
🧩 Domain expertise + AI tools can yield impressive results
🌐 This approach democratizes app creation

🤔 What's your take on AI-powered development? Have you experimented with it?

#AIinDevelopment #BarefootDeveloper #OpenSource
replied to their post about 2 months ago
posted an update about 2 months ago
Aspen Institute's wake-up call for journalism: Embrace AI or risk obsolescence 📰🤖

"Every new technology comes with risks—it's how the media industry responds that determines how (or whether) news providers can prevail." — Vivian Schiller
 
Bonus: In need of ideas for AI projects? The report is a goldmine of real-world experiments. Here are some lesser-known innovations:

📊 Spotting patterns: Semafor uses chatbots to assess newsroom performance
📜 Extending reach: Politico summarizes state and federal legislation with AI
🎙️ Transformation: Washington Post uses AI-generated voices for newsletter narration
🏀 Comprehensive coverage: Richland Source covers 10,000 Ohio high school sports games yearly with AI
🔑 Summarization: Gannett adds AI-generated bullet points to stories
🌍 Translation: Finnish broadcaster Yle built an AI tool to reach Ukrainian immigrants
🎥 Internal info: AP's Merlin tool pinpoints key video moments
🗳️ Civic engagement: Spotlight PA's AI assistant answers election questions
🗣️ Personalization: Baltimore Times customizes health news with AI voice readers
🎯 Targeted ads: NYT's AI tool aligns content with advertisers' focus
💼 Conversion: WSJ uses ML to boost subscription renewals

A must-read for all in media: https://aspendigital.org/wp-content/uploads/2024/07/Aspen-Digital_Here-Come-the-Robots_July-2024.pdf

#AIinJournalism #MediaInnovation #FutureofNews
replied to their post about 2 months ago
posted an update about 2 months ago
🚀 Introducing the Model Drops Tracker! 🕵️‍♂️

Feeling overwhelmed by the AI model release frenzy? 🤯 You're not alone!

I built this simple tool to help us all keep up:
- Filter recent models from the 🤗 Hub
- Set minimum likes threshold
- Choose how recent you want to go

Try it out and let me know what you think: fdaudens/Model-Drops-Tracker
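
For the curious, here is a minimal sketch of the filtering idea described above (recency window plus minimum likes). It is not the code behind the Space: the function name, the record fields (`id`, `likes`, `created_at`) and the sample records are all hypothetical placeholders. In practice the records could come from the Hub API, for example via the `huggingface_hub` library.

```python
from datetime import datetime, timedelta, timezone


def filter_recent_models(models, min_likes, max_age_days, now=None):
    """Keep models created within max_age_days that have at least min_likes.

    `models` is a list of dicts with hypothetical keys:
    "id" (str), "likes" (int), "created_at" (timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    kept = [
        m for m in models
        if m["likes"] >= min_likes and m["created_at"] >= cutoff
    ]
    # Most-liked models first.
    return sorted(kept, key=lambda m: m["likes"], reverse=True)


# Illustrative placeholder records, not real Hub data.
NOW = datetime(2024, 8, 1, tzinfo=timezone.utc)
SAMPLE = [
    {"id": "org-a/model-new-popular", "likes": 120, "created_at": NOW - timedelta(days=2)},
    {"id": "org-b/model-new-quiet", "likes": 3, "created_at": NOW - timedelta(days=1)},
    {"id": "org-c/model-old-popular", "likes": 900, "created_at": NOW - timedelta(days=90)},
]

recent = filter_recent_models(SAMPLE, min_likes=50, max_age_days=7, now=NOW)
print([m["id"] for m in recent])
```

Lowering `min_likes` or widening `max_age_days` trades noise for coverage, which is the same trade-off the three controls in the tool expose.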

Any features you'd like to see added?
#AIModels
posted an update about 2 months ago
🤖💡Just tried out @m-ric 's new LLaMA-3.1 70B agent for data analysis. Impressive stuff.

🚢📊 Fed it the Titanic passenger dataset with minimal instructions. The agent autonomously dug in, tested hypotheses, and reached some intriguing conclusions:

"Lower class passengers less likely to survive, slight negative correlation with age, and positive correlation between fare price and survival."

📈It even generated charts to visualize the findings!

🧠💼 Great potential for business intelligence, research, and decision-making when we can just upload datasets and let AI agents loose on them.

👉 Check it out: m-ric/agent-data-analyst

🤔 Any particular use cases you're excited about?

#AIinDataAnalysis #MachineLearning #DataScience
posted an update about 2 months ago
I just had a masterclass in open-source collaboration with the release of Llama 3.1 🦙🤗

Meta dropped Llama 3.1, and seeing firsthand the Hugging Face team working to integrate it is nothing short of impressive. Their swift integration, comprehensive documentation, and innovative tools showcase the power of open-source teamwork.

For the curious minds:

📊 Check out independent evaluations: open-llm-leaderboard/open_llm_leaderboard

🧠 Deep dive into the tech: https://huggingface.co/blog/llama31

👨‍🍳 Try different recipes (including running 8B on free Colab!): https://github.com/huggingface/huggingface-llama-recipes

📈 Visualize open vs. closed LLM progress: andrewrreed/closed-vs-open-arena-elo

🤖 Generate synthetic data with distilabel, thanks to the new license allowing the use of outputs to train other LLMs https://huggingface.co/blog/llama31#synthetic-data-generation-with-distilabel

💡 Pro tip: Experience the 405B version for free on HuggingChat, now with tool-calling capabilities! https://huggingface.co/chat/

#OpenSourceAI #AIInnovation
posted an update about 2 months ago
posted an update 2 months ago
Websites slam doors on AI data harvesting 🚪🔒

New study "Consent in Crisis: The Rapid Decline of the AI Data Commons" reveals a rapid decline in open web access.

Key findings from 14,000 web domains audit:
- +5% of three common data sets (C4, RefinedWeb and Dolma) now fully restricted, +25% of the highest-quality sources now fully restricted
- 45% of C4 restricted by Terms of Service

Noteworthy trends:
🚫🔄 OpenAI banned 2x more than any other company
📰🔐 News sites leading restrictions: 45% of tokens off-limits
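
One common way sites express crawler restrictions is a robots.txt file. As a rough illustration (not the study's methodology), the sketch below uses Python's standard `urllib.robotparser` to check whether a given crawler may fetch a page. The robots.txt content and the `news.example.com` URL are made-up examples, and `is_crawler_allowed` is a hypothetical helper name; `GPTBot` is the user agent OpenAI's crawler identifies itself with.

```python
from urllib.robotparser import RobotFileParser


def is_crawler_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt: blocks one AI crawler, allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

URL = "https://news.example.com/2024/07/story.html"
print(is_crawler_allowed(EXAMPLE_ROBOTS, "GPTBot", URL))
print(is_crawler_allowed(EXAMPLE_ROBOTS, "SomeOtherBot", URL))
```

Note that robots.txt is only one signal: restrictions written into Terms of Service, like the C4 figure above, are not machine-readable this way.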

Two quotes in the NYT piece to ponder:

“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods.” — @yjernite

“Major tech companies already have all of the data. Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers.” — @stellaathena

👉 Dive into the research: https://www.dataprovenance.org/consent-in-crisis-paper
👉 Read the NYT story: https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html

#AIEthics #DataPrivacy

posted an update 2 months ago
Thrilled to share some AI insights for journalism! 📊🤖

Just wrote a guest post on @ndiakopoulos 's blog about the Hugging Face on Sheets tool.

Why it matters:
🔌 Brings AI power directly to spreadsheets
📈 Huge potential for data journalism
🚫💻 No coding required!

If you've read Nicholas' "Automating the News" (a must-read!), you'll appreciate how this tool fits into the evolving landscape of AI in journalism.

Read here: https://generative-ai-newsroom.com/bringing-open-source-models-to-spreadsheets-c440fc4818b4

#AIJournalism #DataJournalism
posted an update 2 months ago
Small models, BIG impact: SmolLM is here! 🚀🔬

We're launching a series of small but mighty language models:
🏎️ Super fast - runs on laptops, phones, you name it!
📏 3 sizes: 135M, 360M, and 1.7B parameters
🥇 Outperforms same size models from Meta, Microsoft, and Qwen
🔓 Fully open-source: datasets, training code, models

𝐊𝐞𝐲 𝐟𝐞𝐚𝐭𝐮𝐫𝐞𝐬
- Trained on FineWeb-Edu and Cosmopedia v2 (largest synthetic pre-training dataset)
- No cloud needed - run locally for privacy and energy efficiency
- Everything is public, from data curation to training steps

𝐏𝐨𝐭𝐞𝐧𝐭𝐢𝐚𝐥 𝐮𝐬𝐞 𝐜𝐚𝐬𝐞𝐬
- On-device autocomplete
- Local request parsing
- Custom fine-tuning for specific needs without the need for expensive GPUs

𝐆𝐨 𝐝𝐞𝐞𝐩𝐞𝐫
👉 Check it out: https://huggingface.co/collections/HuggingFaceTB/smollm-models-6695016cad7167254ce15966
👉 Run the 360M model in your browser, 100% private: HuggingFaceTB/SmolLM-360M-Instruct-WebGPU
👉 Read the blog explaining everything in detail: huggingface.co/blog/smollm

Kudos to the stellar team who worked on this project: @loubnabnl @anton-l @eliebak @lvwerra
replied to lucianosb's post 2 months ago

So cool to apply this tool to your own models!

posted an update 2 months ago
Exciting news for audio AI enthusiasts! 🎙️🌍

The Emilia dataset dropped last week, and it's a cool one:
- 101k+ hours of high-quality audio
- 6 languages: 🇨🇳 🇺🇸 🇯🇵 🇰🇷 🇩🇪 🇫🇷
- Diverse content: talk shows, interviews, debates, sports commentary, audiobooks

This dataset could improve multilingual speech generation and recognition. Opens up many possibilities for global media, language learning, and accessibility!

Explore it: amphion/Emilia

#AIAudio
posted an update 2 months ago
How's 2024 been so far? 🎉 Well, we've got the data to find out!

New dataset just dropped on the Hub: 12.5 million Reddit comments from 2024! 📊🗣️

Features:
- Personal info anonymized 🕵️‍♀️
- Language detection 🌐
- Token count 🔢
- NSFW filtering 🚫

Most popular post? Classic AITA drama: "AITA for 'ruining Christmas' and being upset the only gifts I got from my family were 'joke gifts'" 🎄😅

Some things never change, huh?

Dive in: OpenCo7/UpVoteWeb
posted an update 2 months ago
posted an update 2 months ago
Very cool dataset for journalists and historians just dropped: 2.7 million unique public domain U.S. news wire articles (1878-1977) 📰🕰️

This is a goldmine for tracking historical events & newspaper coverage trends! An example? If we still wonder whether gender diversity in the media is important... "Only 4.6% of disambiguated entity mentions refer to women, and the most mentioned woman is Golda Meir."

Bonus:
- Locations in these articles are georeferenced
- Topics are tagged using customized neural topic classification
- Named entities are recognized
- Individuals are disambiguated to Wikipedia using a novel entity disambiguation model

Anyone thinking of cool AI projects with this data? Maybe tracking the spread of news stories over time & space?

𝐆𝐨 𝐝𝐞𝐞𝐩𝐞𝐫
👉 Dig into the dataset: dell-research-harvard/newswire
👉 Read the paper: Newswire: A Large-Scale Structured Database of a Century of Historical News (2406.09490)
posted an update 2 months ago
Animate a portrait with a driving video. Lots of potential fun here 😅 KwaiVGI/LivePortrait
posted an update 2 months ago
🧠 How to create more diverse, realistic synthetic AI training data?

@TencentAIGC-Lab AI Lab created @proj-persona , a vast collection of 1 billion diverse personas, to help create synthetic data with LLMs that encapsulate a wide array of perspectives, knowledge, experiences, interests, and professions.

These personas were created with automatically curated data, representing approximately 13% of the world’s total population.

💡 The authors argue that integrating a persona into data synthesis prompts effectively steers LLMs to adopt specific perspectives, creating unique and relevant synthetic data with minimal effort.

They demonstrated Persona Hub in several synthetic data creation scenarios: mathematical and logical reasoning problems, simulated user requests and prompts for LLMs, informative and detailed text on a range of topics, and more.

🚀 It's one of the trending datasets on Hugging Face. Digging into it is quite fun! I found one that reminds me of several people I know: "A journalist who covers technology and innovation in the print and digital media industries." It helped generate the prompt attached to this post (about which I'd be curious to know your answers 😉).

Synthetic data is a hot topic in AI. It will be interesting to see if this research could help make LLMs more robust, versatile, and capable of handling a wide array of real-world scenarios.

👉Explore the dataset: proj-persona/PersonaHub
👉 Read the paper: https://arxiv.org/pdf/2406.20094
posted an update 3 months ago
New dataset filtering feature just dropped! 🤗🚀

Find exactly what you need with filters for:
- Modalities (text, image, audio, etc.)
- Dataset size
- File format

Try it now: https://huggingface.co/datasets

What other filters would you find useful? Drop your ideas!
posted an update 3 months ago
posted an update 3 months ago
Updated the Journalists on 🤗 community page:
- new text-to-speech tools collection JournalistsonHF/text-to-speech-6675c4dccdaa11e86928a15b
- additional leaderboards in the eval collection: TTS-AGI/TTS-Arena and dylanebert/3d-arena
- new tools in the Text-Analysis collection: gokaygokay/Florence-2, pdf2dataset/pdf2dataset, cvachet/pdf-chatbot
- Xenova/realtime-whisper-webgpu in the Transcription collection
- radames/flash-sd3-taesd3 in the Image Tools collection
- Last but not least, okaris/omni-zero in the fun collection for zero-shot stylized portrait creation

Is there any tool you would like to see added?

Find all the curated tools here: https://huggingface.co/collections/JournalistsonHF/
posted an update 3 months ago
Finally, a good handwriting recognition tool?

I'm impressed by Microsoft's latest vision model, Florence-2 microsoft/Florence-2-large

The results are really good, boasting a remarkably low error rate, as you can see with this letter from George W. Bush to Bill Clinton!

🚀🔒 What’s even better? You can run it locally on your device, ensuring your data stays 100% safe.

👉 Try it out here: gokaygokay/Florence-2
posted an update 3 months ago
posted an update 3 months ago
A nice improvement for Hugging Face on Sheets: You can now customize your prompt and select the model of your choice directly on the sheet.

Thanks to @louisbrulenaudet for the contribution. Really cool to see the community improving this tool!

Try it here: JournalistsonHF/huggingface-on-sheets
replied to their post 3 months ago
replied to their post 3 months ago
replied to their post 3 months ago
posted an update 4 months ago
Hugging Face in your spreadsheet?

Because spreadsheets can be incredibly useful for journalists, I created this little project yesterday evening. Handy for prompting, extraction, classification, translation...

Ping me if you’re interested in trying it out!
posted an update 4 months ago
Impressed by the work of @guipenedo @hynky @loubnabnl @anton-l @craffel @lvwerra @thomwolf on FineWeb.

LLMs are only as good as the data they have been trained on, but the crucial aspect of pretraining data remains obscure. Our approach lifts the veil on building high-quality pretraining datasets by sharing every detail about this process to enable a wider community to build on top of it.

- The FineWeb-Edu dataset, which outperforms all openly accessible web datasets in a number of educational benchmarks. We built it by developing a quality classifier using annotations generated by an LLM.

- A new technical report explaining in detail how to create a large and high-quality web-scale dataset for LLM pretraining such as FineWeb

👉 HuggingFaceFW/blogpost-fineweb-v1
posted an update 4 months ago
How can AI help us write better headlines and reach more people?

I experimented with a new approach that is both useful and fun. It can help you overcome writer’s block, find better headlines, and make your blog posts and news articles climb in search engine results. Plus, we will learn new concepts along the way!

1️⃣ First, I scraped all the blog posts written on Hugging Face to create a dataset with the headlines, texts, dates, and authors' names.

2️⃣ I filtered the dataset to remove posts that were too long and would require a model with a longer context window. This was done to keep the project simple and cost-effective (actually, free).

3️⃣ Then, I used a dataset generation workflow built by @davanstrien to generate a DPO dataset.

4️⃣ As a last step, you can collectively rate these evaluations to improve the quality of the dataset using an easy-to-use interface with Argilla. Take a look at it and rate some of them! This way, you can contribute to making this dataset useful for different newsrooms that could use it as a starting point.
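The length filter from step 2 can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the field names (`headline`, `text`), the sample records, and the `MAX_WORDS` budget are all assumptions, and a word count is only a rough proxy for a model's context window.

```python
# Hypothetical scraped posts; the field names are assumptions for this sketch.
posts = [
    {"headline": "Short post", "text": "word " * 200},
    {"headline": "Very long post", "text": "word " * 5000},
]

MAX_WORDS = 1500  # assumed budget; tune it to the context window of your model


def keep_short_posts(posts, max_words=MAX_WORDS):
    """Return only the posts whose text fits within max_words."""
    return [p for p in posts if len(p["text"].split()) <= max_words]


short_posts = keep_short_posts(posts)
print([p["headline"] for p in short_posts])
```

Counting tokens with the model's own tokenizer would be more precise, but a word budget keeps the sketch dependency-free.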

𝐖𝐡𝐲 𝐢𝐭 𝐦𝐚𝐭𝐭𝐞𝐫𝐬. This example is compelling because, if you look at the dataset, you can see some examples where the headlines are enhanced by the addition of an important keyword or an action verb.
These tweaks can have a big impact on your position in search engines and, therefore, on your traffic. It’s also good leverage for our creativity since you can compare the initial idea with another one from an outside perspective.

Imagine if you’re a large news organization; you could run this experiment with thousands of news articles.

With a dataset of several hundred to thousands of entries, you could fine-tune a model to suggest headlines better tailored to your needs and writing style.

👉 Take a look at it and rate the headlines fdaudens/journalism-argilla-space
👉 Daniel's code https://github.com/huggingface/data-is-better-together/blob/main/dpo/README.md
  • 1 reply
·
posted an update 4 months ago
view post
Post
965
If you're part of the Journalists on Hugging Face community, did you know you can receive notifications on ongoing discussions?

- "Repo discussions" for repo discussions you're participating in or mentioned in
- "New activity on watched orgs/users" for repo discussions & posts from users & orgs you're watching

Activate them here: https://huggingface.co/settings/notifications

Join the community: https://huggingface.co/JournalistsonHF
posted an update 4 months ago
view post
Post
1131
Switching from French to German to Chinese in the same discussion 😅

Impressive to see the multilingual capabilities of Cohere for AI's new Aya model.

- C4AI Aya 23 is a research open weights release
- 8 and 35 billion parameter models
- 23 languages supported

You can try it out here: https://huggingface.co/spaces/CohereForAI/aya-23
·
posted an update 4 months ago
view post
Post
1028
Excited to share a new project to make journalists’ lives easier when gathering information!

Collecting data like lists, URLs, etc., from websites is not always easy (and sometimes painful). Web scraping requires technical skills that only a handful of people in each newsroom have.

I recently stumbled upon @scrapegraphai , a scraper that uses AI to do the heavy lifting from a simple natural-language prompt. I asked them if they could integrate the Hugging Face Hub to use open-source models and created a no-code, easy-to-use interface on Gradio.

You can then save time and focus on storytelling!

🔧 How It Works
1. Input Your Prompt and Source URL
2. Click ‘Scrape and Summarize’
3. Receive Summarized Results

👩‍💻 Get Involved!
This is just the first version of the tool, and it’s pretty basic. I’ve uploaded it to the Journalists on Hugging Face community so we can work together on it. Whether you’re a developer, a data scientist, or a journalist with ideas, you can contribute to this project.

You can also copy this app to your own account or organization to customize it to your needs.

👉 Test the scraper here: JournalistsonHF/ai-scraper

🤝 Join the Journalists on 🤗 community: https://huggingface.co/JournalistsonHF
posted an update 4 months ago
view post
Post
1777
80% of fact-checked misinformation claims involve media, with a rise in AI-generated content in 2023, according to a new study, “A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild.” Worth a read for journalists, especially fact-checkers.

TL;DR:
• 📊 135,838 fact checks analyzed
• 📸 80% of these claims involve media
• 🎥 Videos became more common starting in 2022 and now account for more than 60% of fact-checked claims that include media
• 🤖 AI-generated content was rare until Spring of 2023, and then dramatically increased
• 🖼️ Image manipulations don’t require complex operations. Most of the time, they are context manipulations

• Read the paper here: AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild (2405.11697)
• Take a look at the dataset: academic-datasets/AMMeBa

Thanks @davanstrien for spotting it!
posted an update 4 months ago
view post
Post
2190
Do you want to improve AI in your language? Here's how you can help.

I'm exploring different AI techniques for an upcoming project in journalism, and I wanted to test a cool idea by @davanstrien , Data is better together, which aims to foster a community of people to create DPO datasets in different languages.

This project gives the opportunity to explore various concepts:
- Direct Preference Optimization (DPO)
- Synthetic data
- Data annotation
- LLM as a judge

1️⃣ Take the Aya dataset of human-annotated prompt-completion pairs across 71 languages and filter it to include only those in the language you’re interested in.

2️⃣ Use distilabel from Argilla to generate a second response for each prompt and evaluate which response is best.

Basically, a DPO dataset pairs each prompt with a chosen and a rejected response, which helps align models on specific tasks. To quote Daniel: "Currently, there are only a few DPO datasets available for a limited number of languages. By generating more DPO datasets for different languages, we can help to improve the quality of generative models in a wider range of languages."

3️⃣ Send this dataset and evaluations to the easy-to-use interface to evaluate the evaluations.

This is where you can help. :) You can rate the LLM evaluation of the prompt-response pairs. For my example, I built a dataset in French. And without wanting to start a debate about homeopathy, the second result is clearly better in the example below! fdaudens/demo-aya-dpo-french

The final dataset can be found here: fdaudens/aya_french_dpo
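To make the shape of a DPO record concrete, here is a small sketch of the language filter from step 1 and of how two rated responses become a chosen/rejected pair. It is an illustration only and does not call distilabel, which handles generation and judging in the actual workflow. The column names (`language`, `inputs`, `targets`), the sample rows, the numeric ratings, and the helper names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    """One DPO record: a prompt with a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


def filter_by_language(rows, language):
    """Keep only the rows written in the requested language."""
    return [r for r in rows if r["language"] == language]


def build_pair(prompt, response_a, response_b, rating_a, rating_b) -> Optional[PreferencePair]:
    """Turn two rated responses into a pair; ties carry no preference signal."""
    if rating_a == rating_b:
        return None
    if rating_a > rating_b:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


# Hypothetical rows standing in for the real dataset.
rows = [
    {"language": "French", "inputs": "Quelle est la capitale de la France ?", "targets": "Paris."},
    {"language": "German", "inputs": "Was ist die Hauptstadt von Deutschland?", "targets": "Berlin."},
]

french = filter_by_language(rows, "French")
# The ratings here are made up; in the workflow above they come from an LLM judge.
pair = build_pair(french[0]["inputs"], french[0]["targets"], "Je ne sais pas.", rating_a=5, rating_b=1)
print(pair)
```

Dropping ties is one possible design choice; human raters can then review the remaining pairs and correct the judge where it got the preference wrong.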

To contribute to other languages and learn more about synthetic data, you can also produce datasets in the language of your choice! Read more about the project: https://github.com/huggingface/data-is-better-together/blob/main/dpo/README.md
  • 1 reply
·
posted an update 4 months ago
view post
Post
1551
A useful tool for journalists: AutoQuizzer generates a quiz from a URL. You can play the quiz, or let the LLM play it!

deepset/autoquizzer
posted an update 4 months ago
view post
Post
1306
Access to computational resources is key for democratizing AI, in all domains.

We cooked up something we're proud of: Hugging Face is committing $10 million in free GPUs to help developers create new AI technologies.

“AI should not be held in the hands of the few. With this commitment to open-source developers, we’re excited to see what everyone will cook up next in the spirit of collaboration and transparency.” — @clem

Read the exclusive by Kylie Robison: https://www.theverge.com/2024/5/16/24156755/hugging-face-celement-delangue-free-shared-gpus-ai
posted an update 4 months ago
view post
Post
861
"This Journalism Professor Made a NYC Chatbot in Minutes. It Actually Worked"

A lot of interesting quotes in this interview in The Markup with Jonathan Soma, a professor in data journalism at Columbia University: https://themarkup.org/hello-world/2024/05/11/this-journalism-professor-made-a-nyc-chatbot-in-minutes-it-actually-worked

When the New York City government released its chatbot, journalists found "again and again, the bot messing up on city laws and regulations."

Enter Jonathan Soma, who tried to build his own version of the chatbot. And guess what? He got accurate responses.

💬 Still, he remains cautious: "Chatbots are great for low-stakes things. They are great when something is fun, they are great for a task where you do not need 100 percent accuracy, when you just want a little bit of guidance."

"I think that AI in general is absolutely useful for journalism, and I’ve been teaching machine learning and AI to journalists long before ChatGPT hit the scene. I think it is explicitly chatbots that are probably the most problematic part, because they are so confident in everything that they say."

🤗 I have a particular soft spot for this project, as it uses many Hugging Face tools under the hood. This is precisely the kind of work we want to build with the Journalists on HF community. Join us: https://huggingface.co/JournalistsonHF

📺 I can't recommend his video series "Practical AI for Investigative Journalism" enough: https://www.youtube.com/watch?v=N5wvtYRYbfA&list=PLewNEVDy7gq1_GPUaL0OQ31QsiHP5ncAQ

— Thanks @BrigitteTousi for the link!
posted an update 4 months ago
view post
Post
2089
What tools do you need to deconstruct bias in algorithms? (You know, this thing that is becoming increasingly prevalent in our lives)

Participate in the new discussion in the Journalists on Hugging Face community: JournalistsonHF/README#4
posted an update 5 months ago
view post
Post
2476
A new dataset for anyone interested in satellite imagery: 3 million @Satellogic images of unique locations (6 million images, including location revisits) from around the world under a Creative Commons CC-BY 4.0 license.

Interesting potential in journalism.

satellogic/EarthView
posted an update 5 months ago
view post
Post
2054
I've added new collections to the Journalists on 🤗 community, focusing on Data Visualization, Optical Character Recognition, and Multimodal Models:

- TinyChart-3B: This model interprets data visualizations based on your prompts. It can generate the underlying data table from a chart or recreate the chart with Python code.
- PDF to OCR: Convert your PDFs to text—ideal for FOI records sent as images.
- Idefics-8b: A multimodal model that allows you to ask questions about images.

Explore these tools here: 👉 https://huggingface.co/JournalistsonHF