mirror of
https://github.com/YGGverse/Yo.git
synced 2026-03-31 17:55:35 +00:00
Micro Web Crawler in PHP & Manticore
Yo! Micro Web Crawler in PHP & Manticore
Next generation of the YGGo! project, with the goal of reducing server requirements and simplifying deployment
- Index model changed to a distributed cluster model, now oriented toward aggregating search results from network instances through the API
- Refactored the data exchange model, dropping all internal key dependencies
- Snaps now use tar.gz compression to reduce storage requirements and still support remote mirrors, including FTP
- Minimalism everywhere
Implementation
The engine is written in PHP 8 and uses Manticore as the backend.
The default build is adapted for Yggdrasil but can also be used to build an internet search portal.
Components
- CLI tools for index operations
- JS-less frontend for building a web search portal
- API tools for distributing the search index
Features
- MIME-based crawler with flexible filter settings
- Page snap history with local and remote mirrors support
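The idea behind MIME-based filtering can be pictured with a small shell sketch. This is not Yo's implementation (the engine itself is PHP): the allow-list values and the `mime_allowed` helper are hypothetical names introduced only for illustration, and the real filter settings may be structured differently.

```shell
#!/bin/sh
# Hypothetical allow-list of media types the crawler should index.
ALLOWED_MIME='text/html text/plain application/xhtml+xml'

# mime_allowed HEADER: succeed if the media type in a Content-Type value
# is on the allow-list. Parameters such as "; charset=utf-8" are ignored.
mime_allowed() {
  # Keep the part before the first ";", lower-case it, strip spaces.
  media=$(printf '%s' "${1%%;*}" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
  for allowed in $ALLOWED_MIME; do
    [ "$media" = "$allowed" ] && return 0
  done
  return 1
}

# Decide what to do with a few example Content-Type values.
for header in 'text/html; charset=utf-8' 'image/png' 'TEXT/PLAIN'; do
  if mime_allowed "$header"; then
    echo "index: $header"
  else
    echo "skip:  $header"
  fi
done
```

Matching on the normalized media type rather than on the file extension means a page served as `text/html` is accepted whatever its URL looks like.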
Install
- Install composer, php and manticore
- Grab the latest Yo version
  git clone https://github.com/YGGverse/Yo.git
- Run composer update inside the project directory
- Copy and customize the config file
  cp example/config.json config.json
- Make sure the storage folder is writable
- Run the index init script
  php src/cli/index/init.php
- Add a new URL
  php src/cli/document/add.php URL
- Run the crawler
  php src/cli/document/crawl.php
- Get search results
  php src/cli/document/search.php '*'
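The install steps can be collected into one script. The sketch below is a convenience wrapper, not part of Yo: the `run` and `yo_setup` helpers, the `DRY_RUN` switch, the `Yo` checkout directory, the `storage` path relative to the project root, and the example URL are all assumptions. It prints the steps by default and executes them only when `DRY_RUN=0` is set. Installing composer, php and manticore is left out because it depends on your package manager.

```shell
#!/bin/sh
# Sketch only: prints each step by default; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: show the command in dry-run mode, otherwise execute it.
# The printed form is for display and does not preserve quoting.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# yo_setup URL: the install steps from the list above, in order.
yo_setup() {
  run git clone https://github.com/YGGverse/Yo.git
  run cd Yo
  run composer update
  run cp example/config.json config.json
  # Assumes the storage folder sits in the project root.
  run test -w storage || echo "storage folder is not writable" >&2
  run php src/cli/index/init.php
  run php src/cli/document/add.php "$1"
  run php src/cli/document/crawl.php
  run php src/cli/document/search.php '*'
}

# Placeholder URL; replace it with the first site you want to crawl.
yo_setup 'http://example.com/'
```

Remember to edit config.json after it is copied; the script only creates it from the example.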
Web UI
- cd src/webui
- php -S 127.0.0.1:8080
- now open 127.0.0.1:8080 in your browser!
Documentation
CLI
Index
Init
Create initial index
php src/cli/index/init.php [reset]
- reset - optional, reset existing index
Document
Add
php src/cli/document/add.php URL
- URL - add new URL to the crawl queue
Crawl
php src/cli/document/crawl.php
Search
php src/cli/document/search.php '@title "*"' [limit]
- query - required
- limit - optional search results limit
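A few more query shapes may help. The `@title` example above suggests the query string is handed to Manticore's full-text matcher, so the sketch below assumes the usual Manticore operators (field search, quoted phrase, `|` for OR) are accepted; that pass-through is an assumption, and `yo_search_cmd` is a hypothetical helper that only prints the command so it can be inspected before running.

```shell
#!/bin/sh
# yo_search_cmd QUERY [LIMIT]: print the search command for a query.
yo_search_cmd() {
  # ${2:+ $2} appends the limit only when one was given.
  printf "php src/cli/document/search.php '%s'%s\n" "$1" "${2:+ $2}"
}

yo_search_cmd '@title yggdrasil'      # match only in the title field
yo_search_cmd '"web crawler"' 5       # exact phrase, at most 5 results
yo_search_cmd 'php | manticore' 20    # either term, at most 20 results
```

Wrapping the query in single quotes keeps the shell from interpreting `*`, `|` and double quotes before PHP receives them.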
Migration
YGGo
Import index from YGGo database
php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique]
Source DB fields required:
- host
- port
- user
- password
- database
- unique - optional, check for unique URL (takes more time)