---
title: Fineweb-edu-fortified Semantic Search Demo
emoji: 📚
sdk: gradio
sdk_version: 4.41.0
app_file: app.py
pinned: false
datasets:
- airtrain-ai/fineweb-edu-fortified
- HuggingFaceFW/fineweb-edu
models:
- TaylorAI/bge-micro
license: apache-2.0
---

# Semantic Search on Fineweb-edu-fortified sample

This Space performs semantic search on one crawl ({{CRAWL_DUMP}}) from Fineweb-edu-fortified. It is intended to illustrate the contents of [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [fineweb-edu-fortified](https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified). To explore Fineweb-edu-fortified further, you can view automatic clustering, embedding projections, and more for a 500k row sample using [this Airtrain dashboard](https://app.airtrain.ai/dataset/c232b33f-4f4a-49a7-ba55-8167a5f433da/null/1/0).

The embeddings are the ones present in the dataset itself, and the same embedding model is used to embed your search phrase. The search returns the 15 rows whose embedding vectors are closest to the embedding of the search phrase.

The search data is lazily loaded, so shortly after the Space is launched it may not yet have the full corpus of text from that crawl available for search. Check 'Rows searched' to see how many rows were searched to produce the results.
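
The retrieval step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `app.py`: it assumes cosine similarity as the closeness measure (the text does not say which metric the Space uses), and the function names, the `text` and `embedding` row keys, and the toy vectors are all hypothetical. It ranks whatever rows are currently loaded and reports how many were searched, which mirrors the 'Rows searched' figure.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k_rows(query_embedding, rows, k=15):
    """Rank the rows loaded so far and return (results, rows_searched).

    `rows` is a list of dicts with hypothetical keys "text" and "embedding".
    With lazy loading, `rows` may be only part of the crawl, so the count of
    rows searched is returned alongside the results.
    """
    scores = [cosine_similarity(query_embedding, row["embedding"]) for row in rows]
    # Pick the indices of the k highest-scoring rows, best first.
    best = heapq.nlargest(k, range(len(rows)), key=lambda i: scores[i])
    results = [{"text": rows[i]["text"], "score": scores[i]} for i in best]
    return results, len(rows)


# Toy data standing in for dataset rows and their precomputed embeddings.
toy_rows = [
    {"text": "photosynthesis basics", "embedding": [1.0, 0.0]},
    {"text": "medieval trade routes", "embedding": [0.0, 1.0]},
    {"text": "how plants make sugar", "embedding": [0.9, 0.1]},
]

if __name__ == "__main__":
    # In the Space, the query vector would come from the same embedding
    # model that produced the dataset embeddings; here it is hard-coded.
    hits, searched = top_k_rows([1.0, 0.0], toy_rows, k=2)
    print(f"Rows searched: {searched}")
    for hit in hits:
        print(f"{hit['score']:.3f}  {hit['text']}")
```

A brute-force scan like this is linear in the number of loaded rows. It is easy to reason about for a single crawl, but a larger corpus would normally call for a vectorized or approximate nearest-neighbor index instead.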