[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-wang-rui--phishguard-scaffold":3,"tool-wang-rui--phishguard-scaffold":65},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 
既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[13,37],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":83,"owner_website":84,"owner_url":85,"languages":86,"stars":91,"forks":92,"last_commit_at":93,"license":83,"difficulty_score":45,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":108,"github_topics":83,"view_count":45,"oss_zip_url":83,"oss_zip_packed_at":83,"status":16,"created_at":109,"updated_at":110,"faqs":111,"releases":112},1038,"wang-rui\u002Fphishguard-scaffold","phishguard-scaffold","Joint Semantic Detection and Dissemination Control of Phishing Attacks on Social Media via LLaMA-Based Modeling","PhishGuard 是一个专为社交媒体环境设计的钓鱼攻击防护框架，旨在实现语义检测与传播控制的联合优化。面对日益复杂的网络钓鱼威胁，传统方法往往只关注内容识别，而忽略了信息在社交网络中的扩散路径。PhishGuard 通过结合 LLaMA 大模型的深度语义理解能力与基于图论的传播干预策略，不仅能精准识别钓鱼内容，还能主动定位关键传播节点以遏制攻击蔓延。\n\n技术层面，PhishGuard 集成了 LLaMA-2-7B 编码器与 LoRA 微调技术，支持在资源受限环境下高效部署。其独特的对抗性训练机制显著提升了模型面对语义扰动时的鲁棒性，而独立级联模型则用于模拟并最小化有害信息的预期传播范围。实验数据显示，PhishGuard 在检测准确率与传播抑制效果上均表现优异。\n\n这套系统非常适合网络安全研究人员、AI 
开发者以及企业安全团队使用。无论是希望深入探索大模型在安全领域应用的研究者，还是需要构建实际防御体系的工程师，PhishGuard 都提供了一个完整且可扩展的研究框架与实战基线。","# PhishGuard: Joint Semantic Detection & Propagation Control\n\n[![Python 3.8+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.8+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.0+-red.svg)](https:\u002F\u002Fpytorch.org\u002F)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n\n**A unified framework for phishing detection and propagation control on social media, powered by LLaMA and advanced graph-based intervention strategies.**\n\nThis repository implements the complete research framework described in *\"Joint Semantic Detection and Dissemination Control of Phishing Attacks on Social Media via LLaMA-Based Modeling\"*, featuring deep semantic understanding, adversarial robustness, and targeted intervention for comprehensive phishing mitigation.\n\n## 🏗️ System Architecture\n\n*Alternative view: [Architecture Diagram PNG](docs\u002Farchitecture_diagram.png)*\n\n```mermaid\ngraph TB\n    subgraph \"Data Layer\"\n        TD[Twitter Data\u003Cbr\u002F>~100k tweets] --> DP[Data Preprocessing\u003Cbr\u002F>• Deduplication\u003Cbr\u002F>• Language filtering\u003Cbr\u002F>• Text standardization]\n        ED[Edge Data\u003Cbr\u002F>User interactions] --> GC[Graph Construction\u003Cbr\u002F>• Social network\u003Cbr\u002F>• Temporal patterns]\n    end\n    \n    subgraph \"Model Architecture\"\n        DP --> LLaMA[LLaMA Encoder\u003Cbr\u002F>• Semantic embeddings\u003Cbr\u002F>• Attention mechanisms\u003Cbr\u002F>• LoRA fine-tuning]\n        LLaMA --> SE[Semantic Enhancement\u003Cbr\u002F>• Deep projection\u003Cbr\u002F>• Phishing-specific attention\u003Cbr\u002F>• Risk assessment]\n        SE --> CLS[Classifier Head\u003Cbr\u002F>• 2-way classification\u003Cbr\u002F>• Confidence 
scoring]\n    end\n    \n    subgraph \"Training Framework\"\n        CLS --> L1[Classification Loss\u003Cbr\u002F>L_cls = CrossEntropy]\n        SE --> ADV[Adversarial Training\u003Cbr\u002F>• Semantic perturbation\u003Cbr\u002F>• KL divergence\u003Cbr\u002F>• Distribution sharpening]\n        ADV --> L2[Adversarial Loss\u003Cbr\u002F>KL divergence robustness]\n        \n        GC --> PG[Propagation Graph\u003Cbr\u002F>G with nodes V and edges E\u003Cbr\u002F>• User nodes\u003Cbr\u002F>• Interaction edges]\n        PG --> IC[Independent Cascade\u003Cbr\u002F>• Diffusion simulation\u003Cbr\u002F>• Spread estimation σS]\n        IC --> L3[Propagation Loss\u003Cbr\u002F>L_prop = Expected spread]\n        \n        L1 --> JO[Joint Optimization\u003Cbr\u002F>L_total = L_cls + λ·L_adv + μ·L_prop]\n        L2 --> JO\n        L3 --> JO\n    end\n    \n    subgraph \"Intervention System\"\n        PG --> RA[Risk Assessment\u003Cbr\u002F>• Model predictions\u003Cbr\u002F>• Network centrality\u003Cbr\u002F>• Behavioral patterns]\n        RA --> CS[Candidate Selection\u003Cbr\u002F>• Influence ranking\u003Cbr\u002F>• Activity analysis]\n        CS --> GI[Greedy Intervention\u003Cbr\u002F>• Budget optimization\u003Cbr\u002F>• Marginal gain\u003Cbr\u002F>• Impact evaluation]\n    end\n    \n    subgraph \"Output\"\n        JO --> PM[Trained Model\u003Cbr\u002F>• Phishing detection\u003Cbr\u002F>• Risk scoring]\n        GI --> IN[Intervention Nodes\u003Cbr\u002F>• Optimal selection\u003Cbr\u002F>• Spread minimization]\n        PM --> MT[Metrics\u003Cbr\u002F>• Accuracy: 85-95%\u003Cbr\u002F>• F1-Score: 80-92%\u003Cbr\u002F>• AUC: 90-98%]\n        IN --> PR[Propagation Reduction\u003Cbr\u002F>• 15-40% spread decrease\u003Cbr\u002F>• Cost-effectiveness]\n    end\n    \n    style LLaMA fill:#e1f5fe\n    style JO fill:#f3e5f5\n    style GI fill:#e8f5e8\n    style MT fill:#fff3e0\n    style PR fill:#fff3e0\n```\n\n## ✨ Key Features\n\n### 🧠 **Advanced Phishing Detection**\n- 
**LLaMA-2-7B Integration**: Deep semantic understanding with fallback to DistilBERT for CPU deployment\n- **Enhanced Architecture**: Multi-layer semantic projection with phishing-specific attention mechanisms\n- **LoRA\u002FPEFT Support**: Parameter-efficient fine-tuning for resource-constrained environments\n- **Robust Preprocessing**: Automated deduplication, language filtering, and text standardization\n\n### 🛡️ **Adversarial Robustness**\n- **Semantic Perturbations**: Embedding-space adversarial examples with ‖δ‖ \u003C ε constraints\n- **KL Divergence Training**: Maximize distribution differences between clean and perturbed inputs\n- **Multiple Attack Strategies**: FGSM, PGD, and semantic perturbation methods\n- **Temperature Scaling**: Enhanced distribution sharpening for better robustness\n\n### 🌐 **Social Network Analysis**\n- **Graph Construction**: Automated social network building from user interactions and temporal patterns\n- **Independent Cascade Model**: Monte Carlo simulation of information diffusion with influence decay\n- **Multi-factor Risk Assessment**: Combines model predictions, network centrality, and behavioral patterns\n- **Real-time Propagation Control**: Dynamic loss computation based on actual graph structure\n\n### 🎯 **Targeted Intervention**\n- **Greedy Optimization**: Budget-constrained intervention node selection with marginal gain analysis\n- **Influence-aware Selection**: PageRank, betweenness centrality, and risk-based candidate ranking\n- **Impact Quantification**: Measurable spread reduction with cost-effectiveness metrics\n- **Scalable Implementation**: Efficient algorithms for large social networks (100k+ users)\n\n### 🔧 **Production-Ready Features**\n- **Mixed Precision Training**: FP16 support for memory efficiency\n- **Gradient Checkpointing**: Training large models on limited hardware\n- **Comprehensive Logging**: Detailed metrics and progress tracking\n- **Checkpoint Management**: Model state preservation and recovery\n- 
**Real Data Integration**: Twitter API collection and dataset formatting tools\n\n### 🧪 **MLOps & Experimentation**\n- **MLflow Integration**: Complete experiment tracking, model registry, and reproducible runs\n- **Ray Tune Hyperparameter Optimization**: Automated hyperparameter search with early stopping\n- **Ray Train Distributed Training**: Multi-GPU and multi-node training capabilities\n- **Experiment Comparison**: Grid search automation and result visualization\n- **Advanced Schedulers**: ASHA, Hyperband, and Optuna integration for efficient tuning\n\n## 🚀 Quick Start\n\n### Option 1: Use Demo Data (Fastest - 5 minutes)\n```bash\n# 1. Install dependencies\npip install -r requirements.txt\n\n# 2. Generate realistic demo data\npython scripts\u002Fgenerate_demo_data.py --tweets 5000 --users 1000\n\n# 3. Train the model\npython -m training.train --config configs\u002Fconfig.yaml\n\n# 4. View results\ncat runs\u002Fphishguard_exp\u002Ffinal_results.yaml\n```\n\n### Option 2: Use Real Twitter Data\n```bash\n# 1. Install dependencies\npip install -r requirements.txt tweepy\n\n# 2. Get Twitter API access (developer.twitter.com)\nexport TWITTER_BEARER_TOKEN=\"your_bearer_token_here\"\n\n# 3. Collect real data (10-15 minutes)\npython scripts\u002Fcollect_twitter_data.py\n\n# 4. Train with real data\npython -m training.train --config configs\u002Fconfig.yaml\n```\n\n### Option 3: Format Existing Dataset\n```bash\n# Format existing phishing dataset\npython scripts\u002Fformat_existing_data.py \\\n    --input your_dataset.csv \\\n    --output data\u002Ftweets.csv \\\n    --text-col \"tweet_content\" \\\n    --label-col \"is_phishing\"\n\n# Train on formatted data\npython -m training.train --config configs\u002Fconfig.yaml\n```\n\n### Option 4: MLflow Experiment Tracking (Recommended for Research)\n```bash\n# 1. Install MLflow and Ray dependencies\npip install -r requirements.txt\n\n# 2. 
Quick start with MLflow and Ray\npython scripts\u002Fquick_start_mlflow_ray.py --setup\n\n# 3. Run single experiment with tracking\npython scripts\u002Fquick_start_mlflow_ray.py --single-experiment\n\n# 4. Start MLflow UI to view results\npython scripts\u002Fquick_start_mlflow_ray.py --start-ui\n# Access at: http:\u002F\u002Flocalhost:5000\n```\n\n### Option 5: Hyperparameter Optimization with Ray Tune\n```bash\n# 1. Run automated hyperparameter search\npython scripts\u002Fquick_start_mlflow_ray.py --tune-hyperparams\n\n# 2. Or run advanced hyperparameter tuning\npython -m training.ray_tune_hyperparams --num-samples 20 --max-epochs 5\n\n# 3. Use best configuration found\npython -m training.train_mlflow --config configs\u002Fbest_config_ray.yaml\n```\n\n### Using the trained model\n\nAfter training, the model is saved under `runs\u002Fphishguard_exp\u002F` as `best_model.pt` and `model.pt`.\n\n**Single text (command line):**\n```bash\npython -m inference.run_inference --checkpoint runs\u002Fphishguard_exp\u002Fbest_model.pt --text \"Claim your prize now: https:\u002F\u002Fbit.ly\u002Fxyz\"\n```\n\n**Batch from CSV:**\n```bash\npython -m inference.run_inference --checkpoint runs\u002Fphishguard_exp\u002Fbest_model.pt --input data\u002Ftweets.csv --output predictions.csv --text-col text\n```\n\n**In Python:**\n```python\nimport torch\nfrom models.llama_classifier import PhishGuardClassifier\n\nckpt = torch.load(\"runs\u002Fphishguard_exp\u002Fbest_model.pt\", map_location=\"cpu\")\ncfg = ckpt[\"config\"]\nmodel = PhishGuardClassifier(cfg[\"model\"][\"model_name_or_path\"], num_labels=2, peft_cfg=cfg[\"model\"])\nmodel.load_state_dict(ckpt[\"model_state_dict\"])\nmodel.eval()\n\n# One prediction\ninputs = model.tokenizer(\"Your text here\", return_tensors=\"pt\", truncation=True, max_length=cfg[\"model\"][\"max_length\"])\nwith torch.no_grad():\n    logits = model(input_ids=inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"]).logits\np_phishing = 
torch.softmax(logits, dim=-1)[0, 1].item()  # P(phishing)\n```\n\n## 📋 **Detailed Running Instructions**\n\n### **🎯 Quick Navigation**\n- **[First Time? Start Here](#1-quick-demo-run-recommended-for-first-time-users)** - Get up and running in 5 minutes\n- **[Prerequisites & Installation](#prerequisites--installation)** - System requirements and setup\n- **[Basic Usage](#basic-usage-examples)** - Simple training and evaluation\n- **[Advanced Features](#advanced-usage)** - MLflow, Ray Tune, distributed training\n- **[Real Data](#working-with-real-data)** - Twitter API and custom datasets\n- **[Configuration](#configuration-guide)** - Customize model and training settings\n- **[Troubleshooting](#troubleshooting-guide)** - Common issues and solutions\n- **[Optimization](#performance-optimization-tips)** - Speed up training and improve quality\n- **[Production](#production-deployment)** - Deploy trained models\n\n---\n\n### **Prerequisites & Installation**\n\n1. **System Requirements:**\n   ```bash\n   # Check Python version (3.8+ required)\n   python --version\n   \n   # Check available GPU (optional but recommended)\n   nvidia-smi  # For NVIDIA GPUs\n   ```\n\n2. **Install Dependencies:**\n   ```bash\n   # Clone the repository\n   git clone \u003Cyour-repo-url>\n   cd phishguard-scaffold\n   \n   # Install Python dependencies\n   pip install -r requirements.txt\n   \n   # Optional: Install development dependencies\n   pip install black ruff pytest  # For code formatting and testing\n   ```\n\n3. **Verify Installation:**\n   ```bash\n   # Test basic imports\n   python -c \"from training.train import run; print('✅ Installation successful!')\"\n   ```\n\n### **Basic Usage Examples**\n\n#### **1. 
Quick Demo Run (Recommended for First-Time Users)**\n```bash\n# Generate synthetic demo data (5K tweets, 1K users)\npython scripts\u002Fgenerate_demo_data.py --tweets 5000 --users 1000\n\n# Run basic training with demo data\npython -m training.train --config configs\u002Fconfig.yaml\n\n# View results\ncat runs\u002Fphishguard_exp\u002Ffinal_results.yaml\n```\n\n#### **2. Training with Custom Configuration**\n```bash\n# Create custom config (copy and modify)\ncp configs\u002Fconfig.yaml configs\u002Fmy_config.yaml\n# Edit configs\u002Fmy_config.yaml as needed\n\n# Run with custom config\npython -m training.train --config configs\u002Fmy_config.yaml\n```\n\n#### **3. Evaluation Only (No Training)**\n```bash\n# Evaluate with pre-trained model checkpoints\npython -m training.train --config configs\u002Fconfig.yaml --eval-only\n```\n\n### **Advanced Usage**\n\n#### **MLflow Experiment Tracking**\n```bash\n# 1. Set up MLflow experiment tracking\npython scripts\u002Fquick_start_mlflow_ray.py --setup\n\n# 2. Run single tracked experiment\npython -m training.train_mlflow --config configs\u002Fmlflow_config.yaml \\\n    --experiment-name \"PhishGuard_Research\" \\\n    --run-name \"baseline_experiment\"\n\n# 3. 
Start MLflow UI to view results\nmlflow ui --host 0.0.0.0 --port 5000\n# Open: http:\u002F\u002Flocalhost:5000\n```\n\n#### **Hyperparameter Optimization with Ray Tune**\n```bash\n# Basic hyperparameter search (20 trials)\npython -m training.ray_tune_hyperparams \\\n    --config configs\u002Fmlflow_config.yaml \\\n    --num-samples 20 \\\n    --max-epochs 5\n\n# Advanced hyperparameter search with custom parameters\npython scripts\u002Frun_mlflow_experiments.py \\\n    --experiment-type lr \\\n    --config configs\u002Fmlflow_config.yaml\n```\n\n#### **Distributed Training**\n```bash\n# Multi-GPU training (if available)\npython -m training.ray_tune_hyperparams \\\n    --config configs\u002Fmlflow_config.yaml \\\n    --distributed\n\n# Check GPU utilization\nwatch -n 1 nvidia-smi\n```\n\n### **Working with Real Data**\n\n#### **Option 1: Twitter API Collection**\n```bash\n# 1. Get Twitter API credentials from developer.twitter.com\nexport TWITTER_BEARER_TOKEN=\"your_bearer_token_here\"\n\n# 2. Collect real tweets\npython scripts\u002Fcollect_twitter_data.py \\\n    --output data\u002Ftweets.csv \\\n    --count 10000 \\\n    --keywords \"phishing,scam,bitcoin,cryptocurrency\"\n\n# 3. 
Train with real data\npython -m training.train --config configs\u002Fconfig.yaml\n```\n\n#### **Option 2: Format Existing Dataset**\n```bash\n# Format your existing CSV dataset\npython scripts\u002Fformat_existing_data.py \\\n    --input your_dataset.csv \\\n    --output data\u002Ftweets.csv \\\n    --text-col \"message_content\" \\\n    --label-col \"is_malicious\" \\\n    --user-col \"user_id\"\n\n# Verify data format\nhead -n 5 data\u002Ftweets.csv\n\n# Train with formatted data\npython -m training.train --config configs\u002Fconfig.yaml\n```\n\n### **Configuration Guide**\n\n#### **Key Configuration Parameters**\n```yaml\n# Essential settings in configs\u002Fconfig.yaml\n\nmodel:\n  model_name_or_path: \"meta-llama\u002FLlama-2-7b-hf\"  # Main model\n  fallback_model: \"distilbert-base-uncased\"        # CPU fallback\n  peft: lora                                       # Enable LoRA\n  max_length: 512                                  # Token length\n\ntrain:\n  batch_size: 8        # Adjust based on GPU memory\n  num_epochs: 5        # Training epochs\n  lr: 1e-4            # Learning rate\n  fp16: true          # Mixed precision (saves memory)\n\nloss:\n  lambda_cls: 1.0     # Classification loss weight\n  lambda_adv: 0.3     # Adversarial loss weight  \n  mu_prop: 0.2        # Propagation loss weight\n```\n\n#### **Memory Optimization Settings**\n```bash\n# For limited GPU memory (\u003C 8GB)\n# Edit configs\u002Fconfig.yaml:\n# - Set batch_size: 4 or 2\n# - Set fp16: true\n# - Set gradient_checkpointing: true\n# - Use fallback_model: \"distilbert-base-uncased\"\n\n# For CPU-only training\n# Edit configs\u002Fconfig.yaml:\n# - Set model_name_or_path: \"distilbert-base-uncased\"\n# - Set peft: null\n# - Set fp16: false\n```\n\n### **Monitoring & Evaluation**\n\n#### **Real-time Training Monitoring**\n```bash\n# Terminal 1: Start training\npython -m training.train_mlflow --config configs\u002Fmlflow_config.yaml\n\n# Terminal 2: Monitor with MLflow UI\nmlflow ui 
--host 0.0.0.0 --port 5000\n\n# Terminal 3: Monitor system resources\nwatch -n 2 'nvidia-smi; echo \"\"; free -h'\n```\n\n#### **Evaluate Model Performance**\n```bash\n# Basic evaluation metrics\npython -c \"\nfrom training.train import run\nresults = run('configs\u002Fconfig.yaml', eval_only=True)\nprint(f'Accuracy: {results[\\\"test_metrics\\\"][\\\"accuracy\\\"]:.3f}')\nprint(f'F1-Score: {results[\\\"test_metrics\\\"][\\\"f1\\\"]:.3f}')\nprint(f'AUC: {results[\\\"test_metrics\\\"][\\\"auc\\\"]:.3f}')\n\"\n\n# Detailed analysis with intervention impact\npython scripts\u002Fanalyze_results.py runs\u002Fphishguard_exp\u002F\n```\n\n### **Troubleshooting Guide**\n\n#### **Common Issues & Solutions**\n\n**1. CUDA Out of Memory:**\n```bash\n# Solution: Reduce batch size and enable optimizations\n# In configs\u002Fconfig.yaml:\ntrain:\n  batch_size: 2  # Reduce from 8\n  fp16: true\n  gradient_checkpointing: true\n```\n\n**2. Model Loading Failures:**\n```bash\n# Check if model name is correct\npython -c \"\nfrom transformers import AutoTokenizer\ntry:\n    tokenizer = AutoTokenizer.from_pretrained('meta-llama\u002FLlama-2-7b-hf')\n    print('✅ Model accessible')\nexcept Exception as e:\n    print(f'❌ Model error: {e}')\n\"\n\n# Use fallback model\n# In configs\u002Fconfig.yaml set: model_name_or_path: \"distilbert-base-uncased\"\n```\n\n**3. Data Loading Issues:**\n```bash\n# Check data file format\npython -c \"\nimport pandas as pd\ndf = pd.read_csv('data\u002Ftweets.csv')\nprint(f'Data shape: {df.shape}')\nprint(f'Columns: {list(df.columns)}')\nprint(f'Sample:\\\\n{df.head(2)}')\n\"\n\n# Regenerate demo data if needed\npython scripts\u002Fgenerate_demo_data.py --tweets 1000 --users 200\n```\n\n**4. 
Import Errors:**\n```bash\n# Check Python path\npython -c \"import sys; print('\\\\n'.join(sys.path))\"\n\n# Install missing dependencies\npip install -r requirements.txt\n\n# Run from project root\ncd \u002Fpath\u002Fto\u002Fphishguard-scaffold\npython -m training.train --config configs\u002Fconfig.yaml\n```\n\n### **Performance Optimization Tips**\n\n#### **Speed Up Training:**\n```bash\n# 1. Use mixed precision\n# Set fp16: true in config\n\n# 2. Increase batch size (if memory allows)\n# Set batch_size: 16 or 32\n\n# 3. Use LoRA for faster fine-tuning\n# Set peft: lora, lora_r: 16\n\n# 4. Pre-compile model (PyTorch 2.0+)\n# Set compile_model: true in config\n```\n\n#### **Improve Model Quality:**\n```bash\n# 1. Use more training epochs\n# Set num_epochs: 10\n\n# 2. Tune loss weights\n# Experiment with lambda_adv: 0.5, mu_prop: 0.3\n\n# 3. Use larger model (if resources allow)\n# Set model_name_or_path: \"meta-llama\u002FLlama-2-13b-hf\"\n\n# 4. Collect more\u002Fbetter training data\npython scripts\u002Fcollect_twitter_data.py --count 50000\n```\n\n### **Development Workflow**\n\n#### **Code Quality Checks:**\n```bash\n# Format code\nblack training\u002F scripts\u002F --line-length 88\n\n# Check for issues\nruff check .\n\n# Run tests (if available)\npytest tests\u002F -v\n```\n\n#### **Experiment Tracking:**\n```bash\n# Compare multiple experiments\npython scripts\u002Frun_mlflow_experiments.py --experiment-type lr\npython scripts\u002Frun_mlflow_experiments.py --experiment-type loss-weights\n\n# View comparisons in MLflow UI\nmlflow ui\n```\n\n### **Production Deployment**\n\n#### **Export Trained Model:**\n```bash\n# Save model for inference\npython -c \"\nfrom training.train import run\nfrom models.llama_classifier import PhishGuardClassifier\nimport torch\n\n# Load trained model\nmodel = PhishGuardClassifier('path\u002Fto\u002Fcheckpoint')\ntorch.save(model.state_dict(), 'phishguard_production.pth')\nprint('Model saved for production use')\n\"\n```\n\n#### 
**Batch Inference:**\n```bash\n# Run inference on new data\npython -m inference.run_inference \\\n    --checkpoint runs\u002Fphishguard_exp\u002Fbest_model.pt \\\n    --input new_tweets.csv \\\n    --output predictions.csv \\\n    --text-col text\n```\n\n## 📊 Expected Performance\n\n### Model Performance (with 10k+ real tweets)\n- **Accuracy**: 85-95%\n- **F1-Score**: 80-92%\n- **AUC**: 90-98%\n- **Precision**: 82-94%\n- **Recall**: 78-90%\n\n### Propagation Control\n- **Spread Reduction**: 15-40%\n- **Intervention Efficiency**: Budget-optimal node selection\n- **Cost-Effectiveness**: 2-8 nodes per 1% spread reduction\n- **Graph Coverage**: Supports 10k-100k+ user networks\n\n### Training Performance\n- **CPU Training**: 2-6 hours (DistilBERT)\n- **GPU Training**: 30-90 minutes (LLaMA + LoRA + FP16)\n- **Memory Requirements**: 4-16GB depending on model choice\n- **Data Processing**: 5-15 minutes for 50k tweets\n\n## 📁 Project Structure\n\n```\nphishguard_scaffold\u002F\n├── README.md, requirements.txt\n├── configs\u002F\n│   ├── config.yaml              # Main configuration file\n│   └── mlflow_config.yaml       # MLflow and Ray configuration\n├── data\u002F\n│   ├── dataset.py              # Enhanced data loading & preprocessing\n│   ├── tweets.csv              # Tweet dataset (text, labels, metadata)\n│   └── edges.csv               # Social network edges\n├── models\u002F\n│   └── llama_classifier.py     # LLaMA-based classifier with LoRA\n├── training\u002F\n│   ├── train.py               # Joint optimization training loop\n│   ├── train_mlflow.py        # MLflow-enhanced training with experiment tracking\n│   ├── ray_tune_hyperparams.py # Ray Tune hyperparameter optimization\n│   └── adversarial.py         # Adversarial training components\n├── propagation\u002F\n│   ├── graph.py               # Social network & IC simulation\n│   └── intervene.py           # Intervention strategies\n├── eval\u002F\n│   └── metrics.py             # Evaluation metrics\n├── 
inference\u002F\n│   └── run_inference.py       # Load checkpoint & run phishing detection (CLI)\n├── scripts\u002F\n│   ├── collect_twitter_data.py      # Real Twitter data collection\n│   ├── format_existing_data.py      # Dataset formatting utility\n│   ├── generate_demo_data.py        # Synthetic data generation\n│   ├── run_mlflow_experiments.py    # Automated experiment grid search\n│   └── quick_start_mlflow_ray.py    # MLflow and Ray quick start guide\n└── runs\u002F                            # Training outputs and checkpoints\n   ├── mlruns\u002F                       # MLflow experiment tracking data\n   └── ray_results\u002F                  # Ray Tune optimization results\n```\n\n## ⚙️ Configuration\n\nThe framework is highly configurable via `configs\u002Fconfig.yaml`:\n\n```yaml\nmodel:\n  model_name_or_path: meta-llama\u002FLlama-2-7b-hf  # Primary model\n  fallback_model: distilbert-base-uncased        # CPU fallback\n  peft: lora                                     # Enable LoRA\n  lora_r: 16                                     # LoRA rank\n  max_length: 512                                # Input sequence length\n\ntrain:\n  batch_size: 8                                  # Batch size\n  num_epochs: 5                                  # Training epochs\n  lr: 1e-4                                       # Learning rate\n  fp16: true                                     # Mixed precision\n\nloss:\n  lambda_cls: 1.0                                # Classification weight\n  lambda_adv: 0.3                                # Adversarial weight\n  mu_prop: 0.2                                   # Propagation weight\n\npropagation:\n  ic_samples: 100                                # IC simulation samples\n  budget: 20                                     # Intervention budget\n  topk_candidates: 200                           # Candidate pool size\n```\n\n## 📚 Data Format\n\n### Required Tweet Data (`tweets.csv`)\n| Column | Type | Description 
|\n|--------|------|-------------|\n| `text` | string | Tweet content |\n| `label` | int | 0=legitimate, 1=phishing |\n| `user_id` | string | Unique user identifier |\n| `timestamp` | string | ISO format timestamp |\n| `parent_user_id` | string | For retweets\u002Freplies (optional) |\n| `url` | string | Extracted URLs (optional) |\n\n### Social Network Data (`edges.csv`)\n| Column | Type | Description |\n|--------|------|-------------|\n| `src` | string | Source user ID |\n| `dst` | string | Destination user ID |\n| `weight` | float | Influence probability [0,1] |\n| `timestamp` | string | Interaction timestamp (optional) |\n\n## 🔬 Methodology\n\n\n\n- ✅ **Deep Semantic Modeling**: LLaMA-based semantic representations\n- ✅ **Adversarial Training**: Enhanced robustness against deceptive messages\n- ✅ **Social Network Analysis**: Influence-based diffusion modeling\n- ✅ **Targeted Intervention**: High-risk propagation path disruption\n- ✅ **Joint Optimization**: Unified loss combining all components\n- ✅ **Comprehensive Evaluation**: Multiple metrics and intervention impact\n\n### Joint Optimization Objective\n```\nL_total = λ_cls × L_cls + λ_adv × L_adv + μ_prop × L_prop\n\nWhere:\n- L_cls: Standard cross-entropy classification loss\n- L_adv: KL(clean vs perturbed) adversarial robustness loss\n- L_prop: Graph-based propagation control loss\n```\n\n## 🛠️ Advanced Usage\n\n### Custom Model Integration\n```python\nfrom models.llama_classifier import PhishGuardClassifier\n\n# Initialize with custom model\nmodel = PhishGuardClassifier(\n    \"microsoft\u002FDialoGPT-medium\",\n    peft_cfg={\"peft\": \"lora\", \"lora_r\": 8}\n)\n```\n\n### Programmatic Training\n```python\nfrom training.train import run\n\n# Run training programmatically\nresults = run(\"configs\u002Fconfig.yaml\", eval_only=False)\nprint(f\"Final accuracy: {results['test_metrics']['accuracy']:.3f}\")\n```\n\n### Custom Intervention Analysis\n```python\nfrom propagation.intervene import 
evaluate_intervention_impact\n\nimpact = evaluate_intervention_impact(graph, intervention_nodes, risk_scores)\nprint(f\"Spread reduction: {impact['relative_reduction']:.1%}\")\n```\n\n### MLflow Experiment Tracking\n```python\nimport mlflow\nfrom training.train_mlflow import MLflowPhishGuardTrainer, TrainingConfig\n\n# Configure experiment\nconfig = TrainingConfig(\n    experiment_name=\"PhishGuard_Research\",\n    run_name=\"adversarial_loss_study\",\n    lambda_adv=0.4,  # Custom adversarial weight\n    num_epochs=10\n)\n\n# Run tracked experiment\ntrainer = MLflowPhishGuardTrainer(config)\nresults = trainer.train()\n\n# View in MLflow UI: mlflow ui\n```\n\n### Ray Tune Hyperparameter Optimization\n```python\nfrom ray import tune\nfrom training.ray_tune_hyperparams import run_hyperparameter_optimization\n\n# Define custom search space\nsearch_space = {\n    \"lr\": tune.loguniform(1e-5, 1e-3),\n    \"lambda_adv\": tune.uniform(0.1, 0.5),\n    \"batch_size\": tune.choice([8, 16, 32])\n}\n\n# Run optimization\nresults, best_trial = run_hyperparameter_optimization(\n    num_samples=50,\n    max_num_epochs=10,\n    gpus_per_trial=0.25\n)\n\nprint(f\"Best F1 score: {best_trial.metrics['f1']:.4f}\")\n```\n\n### Distributed Training with Ray\n```python\nfrom training.ray_tune_hyperparams import distributed_training_example\n\n# Run distributed training across multiple GPUs\nresults = distributed_training_example(\"configs\u002Fmlflow_config.yaml\")\n```\n\n## 📈 Extending the Framework\n\n### Adding New Models\n1. Implement in `src\u002Fmodels\u002F` following the `PhishGuardClassifier` interface\n2. Update configuration options in `configs\u002Fconfig.yaml`\n3. Test with the existing training pipeline\n\n### Custom Loss Functions\n1. Add new loss components in `src\u002Ftraining\u002Fadversarial.py`\n2. Integrate into joint optimization in `src\u002Ftraining\u002Ftrain.py`\n3. Configure weights via YAML settings\n\n### Alternative Intervention Strategies\n1. 
Implement new algorithms in `src\u002Fpropagation\u002Fintervene.py`\n2. Follow the interface expected by `greedy_minimize_spread`\n3. Evaluate using provided impact metrics\n\n## 🤝 Contributing\n\nWe welcome contributions! Please see our [Contributing Guide](docs\u002FCONTRIBUTING.md) for details.\n\n### Development Setup\n```bash\ngit clone \u003Crepository-url>\ncd phishguard_scaffold\npip install -r requirements.txt\npip install -r requirements-dev.txt  # Additional dev dependencies\n\n# Run tests\npython -m pytest tests\u002F\n\n# Format code\nblack src\u002F scripts\u002F\n```\n\n## 📊 Benchmarks\n\nPerformance on various dataset sizes:\n\n| Dataset Size | Training Time (GPU) | Memory Usage | Accuracy | F1-Score |\n|--------------|-------------------|--------------|----------|----------|\n| 1k tweets    | 10 minutes        | 4GB          | 89%      | 87%      |\n| 10k tweets   | 45 minutes        | 8GB          | 92%      | 90%      |\n| 50k tweets   | 2.5 hours         | 14GB         | 94%      | 92%      |\n| 100k tweets  | 4 hours           | 16GB         | 95%      | 93%      |\n\n*Benchmarks run on NVIDIA A100 with LLaMA-2-7B + LoRA*\n\n## 📝 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 📖 Citation\n\nIf you use this framework in your research, please cite:\n\n```bibtex\n@inproceedings{phishguard2025,\n  title={Joint Semantic Detection and Dissemination Control of Phishing Attacks on Social Media via LLaMA-Based Modeling},\n  author={Rui Wang},\n  booktitle={[ICCEA 2025 Proceedings]},\n  year={2025},\n  url={https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F11103392}\n}\n```\n\n## 🆘 Support\n\n- 📖 **Documentation**: See `docs\u002FDATA_INTEGRATION_GUIDE.md` and `docs\u002FREQUIREMENTS_ANALYSIS.md`\n- 🐛 **Issues**: Report bugs and feature requests via GitHub Issues\n- 💬 **Discussions**: Join our community discussions\n- 📧 **Contact**: [contact information]\n\n## 🎯 What's Next?\n\nPlanned 
improvements and extensions, by area and effort:\n\n| Area | Suggestion | Effort |\n|------|------------|--------|\n| **Training** | Save checkpoint + write `final_results.yaml` | Low |\n| **Scripts** | Add `batch_inference.py`, `analyze_results.py` | Low |\n| **Testing** | Add `tests\u002F` + pytest | Low–Med |\n| **Training** | Checkpointing + early stopping | Medium |\n| **Propagation** | Use real graph propagation loss in training | Medium |\n| **Eval** | Confusion matrix, calibration, threshold sweep | Medium |\n| **Interpretability** | Token-level explanations | Medium |\n| **Deployment** | Small API + Docker | Medium |\n| **Roadmap** | Multi-language, dashboard, Twitter v2\u002Fstreaming | Higher |\n\nExisting roadmap items:\n\n- [ ] Multi-language support beyond English\n- [ ] Real-time deployment pipeline\n- [ ] Integration with Twitter API v2 streaming\n- [ ] Advanced visualization dashboard\n- [ ] Federated learning capabilities\n\n---\n\n⭐ **Star this repository** if you find it useful for your research or applications!\n\n*Built with ❤️ for the cybersecurity and social media safety community*","# PhishGuard Quick Start Guide\n\nPhishGuard is a unified framework, based on LLaMA and graph models, for phishing detection and dissemination control on social media. This guide walks you through environment setup, installation, and a first run.\n\n## Prerequisites\n\nBefore you start, make sure your system meets the following requirements:\n\n*   **Python**: 3.8 or later\n*   **Deep learning framework**: PyTorch 2.0+\n*   **Hardware**: NVIDIA GPU supported (optional, but recommended for training)\n*   **System check**:\n    ```bash\n    python --version\n    nvidia-smi  # Check GPU status\n    ```\n\n## Installation\n\n1.  **Clone the repository**\n    ```bash\n    git clone \u003Cyour-repo-url>\n    cd phishguard-scaffold\n    ```\n\n2.  **Install dependencies**\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n3.  **Verify the installation**\n    ```bash\n    python -c \"from training.train import run; print('✅ Installation successful!')\"\n    ```\n\n## Basic Usage\n\nThe quickest demo flow below uses generated synthetic data to complete training and inference in about 5 minutes.\n\n### 1. Generate demo data\nGenerate a synthetic dataset of 5,000 tweets and 1,000 users:\n```bash\npython scripts\u002Fgenerate_demo_data.py --tweets 5000 --users 1000\n```\n\n### 2. Train the model\nStart training with the default configuration file:\n```bash\npython -m training.train --config configs\u002Fconfig.yaml\n```\nAfter training completes, the model is saved under `runs\u002Fphishguard_exp\u002F` (`best_model.pt`).\n\n### 3. 
Model Inference\n\n**Single text via the command line:**\n```bash\npython -m inference.run_inference --checkpoint runs\u002Fphishguard_exp\u002Fbest_model.pt --text \"Claim your prize now: https:\u002F\u002Fbit.ly\u002Fxyz\"\n```\n\n**Batch processing a CSV:**\n```bash\npython -m inference.run_inference --checkpoint runs\u002Fphishguard_exp\u002Fbest_model.pt --input data\u002Ftweets.csv --output predictions.csv --text-col text\n```\n\n**From Python:**\n```python\nimport torch\nfrom models.llama_classifier import PhishGuardClassifier\n\nckpt = torch.load(\"runs\u002Fphishguard_exp\u002Fbest_model.pt\", map_location=\"cpu\")\ncfg = ckpt[\"config\"]\nmodel = PhishGuardClassifier(cfg[\"model\"][\"model_name_or_path\"], num_labels=2, peft_cfg=cfg[\"model\"])\nmodel.load_state_dict(ckpt[\"model_state_dict\"])\nmodel.eval()\n\n# Single prediction\ninputs = model.tokenizer(\"Your text here\", return_tensors=\"pt\", truncation=True, max_length=cfg[\"model\"][\"max_length\"])\nwith torch.no_grad():\n    logits = model(input_ids=inputs[\"input_ids\"], attention_mask=inputs[\"attention_mask\"]).logits\np_phishing = torch.softmax(logits, dim=-1)[0, 1].item()  # P(phishing)\n```","The security team of a social platform must monitor a massive volume of tweets every day to identify and block the spread of phishing links and protect users from scams.\n\n### Without phishguard-scaffold\n- Traditional keyword matching cannot recognize semantic variants, so the miss rate on well-disguised phishing copy is very high.\n- Only individual malicious accounts can be banned; without analysis of the social graph, it is hard to cut off diffusion paths between users.\n- Without adversarial training, attackers can evade detection by lightly rewording the text.\n- Manual intervention is costly and cannot pinpoint the most influential spreader nodes, wasting budget.\n\n### With phishguard-scaffold\n- LLaMA-based semantic understanding accurately identifies phishing variants, raising detection accuracy above 90%.\n- A propagation-graph model simulates diffusion paths and automatically selects key nodes for intervention, effectively cutting the spread chain.\n- Built-in adversarial training resists semantic-perturbation attacks, keeping the model robust under sophisticated attacks.\n- A greedy intervention strategy optimizes the budget, achieving the widest spread suppression with minimal resources and reducing diffusion by 15-40%.\n\nBy combining semantic detection with propagation control, phishguard-scaffold moves from classifying individual messages to network-wide defense, significantly improving the efficiency of phishing mitigation.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fwang-rui_phishguard-scaffold_ada41e42.png","wang-rui","Rui","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fwang-rui_d896fe4d.png","ML Engineer | Independent Researcher","Carnegie Mellon University","Pittsburgh, 
PA","rwang778@gmail.com",null,"http:\u002F\u002Fwww.linkedin.com\u002Fruiwang2","https:\u002F\u002Fgithub.com\u002Fwang-rui",[87],{"name":88,"color":89,"percentage":90},"Python","#3572A5",100,1006,169,"2026-03-06T15:22:17","Linux, macOS, Windows","Not required; NVIDIA GPU recommended for LLaMA training, CPU deployment supported (DistilBERT); 8GB+ VRAM suggested","Not specified (16GB+ suggested for large-scale data and models)",{"notes":98,"python":99,"dependencies":100},"Real Twitter data requires an API token; LLaMA model weights must be downloaded separately; LoRA fine-tuning is supported to reduce resource needs; the first run downloads model files","3.8+",[101,102,103,104,105,106,107],"torch>=2.0","transformers","mlflow","ray","tweepy","peft","accelerate",[15,37],"2026-03-27T02:49:30.150509","2026-04-06T07:13:59.242596",[],[]]
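The README repeatedly refers to Monte Carlo estimation of the expected spread σ(S) under the Independent Cascade model and to greedy, budget-constrained intervention, but shows neither in code. The stdlib-only sketch below illustrates both ideas; the function names and the `(src, dst, prob)` edge format are illustrative assumptions, not the actual API of `propagation\u002Fgraph.py` or `greedy_minimize_spread`.

```python
import random
from collections import defaultdict


def independent_cascade(edges, seeds, trials=200, seed=0):
    """Monte Carlo estimate of expected spread sigma(S) under the
    Independent Cascade model: each newly activated node gets one chance
    to activate each out-neighbor with the edge's influence probability."""
    rng = random.Random(seed)
    graph = defaultdict(list)
    for src, dst, prob in edges:
        graph[src].append((dst, prob))
    total = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v, prob in graph[u]:
                    if v not in active and rng.random() < prob:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / trials


def greedy_intervention(edges, seeds, candidates, budget, trials=200):
    """Greedy budgeted selection: repeatedly block the candidate node whose
    removal yields the largest marginal reduction in estimated spread."""
    blocked = set()

    def spread(blocked_nodes):
        live_seeds = [s for s in seeds if s not in blocked_nodes]
        live_edges = [(s, d, p) for s, d, p in edges
                      if s not in blocked_nodes and d not in blocked_nodes]
        return independent_cascade(live_edges, live_seeds, trials)

    for _ in range(budget):
        base = spread(blocked)
        best, best_gain = None, 0.0
        for c in candidates:
            if c in blocked:
                continue
            gain = base - spread(blocked | {c})
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate reduces spread further
            break
        blocked.add(best)
    return blocked
```

On a deterministic chain a→b→c→d with all probabilities 1.0 and seed node `a`, blocking `b` cuts the estimated spread from 4 to 1, so a budget of one selects `b`; the real implementation would additionally weight candidates by risk scores and centrality, as the intervention section describes.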