[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-tech-srl--code2vec":3,"tool-tech-srl--code2vec":62},[4,18,26,36,46,54],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",158594,2,"2026-04-16T23:34:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":42,"last_commit_at":43,"category_tags":44,"status":17},8272,"opencode","anomalyco\u002Fopencode","OpenCode 是一款开源的 AI 编程助手（Coding Agent），旨在像一位智能搭档一样融入您的开发流程。它不仅仅是一个代码补全插件，而是一个能够理解项目上下文、自主规划任务并执行复杂编码操作的智能体。无论是生成全新功能、重构现有代码，还是排查难以定位的 Bug，OpenCode 都能通过自然语言交互高效完成，显著减少开发者在重复性劳动和上下文切换上的时间消耗。\n\n这款工具专为软件开发者、工程师及技术研究人员设计，特别适合希望利用大模型能力来提升编码效率、加速原型开发或处理遗留代码维护的专业人群。其核心亮点在于完全开源的架构，这意味着用户可以审查代码逻辑、自定义行为策略，甚至私有化部署以保障数据安全，彻底打破了传统闭源 AI 助手的“黑盒”限制。\n\n在技术体验上，OpenCode 提供了灵活的终端界面（Terminal UI）和正在测试中的桌面应用程序，支持 macOS、Windows 及 Linux 全平台。它兼容多种包管理工具，安装便捷，并能无缝集成到现有的开发环境中。无论您是追求极致控制权的资深极客，还是渴望提升产出的独立开发者，OpenCode 都提供了一个透明、可信",144296,1,"2026-04-16T14:50:03",[13,45],"插件",{"id":47,"name":48,"github_repo":49,"description_zh":50,"stars":51,"difficulty_score":32,"last_commit_at":52,"category_tags":53,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 
都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":55,"name":56,"github_repo":57,"description_zh":58,"stars":59,"difficulty_score":32,"last_commit_at":60,"category_tags":61,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[45,13,15,14],{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":76,"owner_url":77,"languages":78,"stars":95,"forks":96,"last_commit_at":97,"license":98,"difficulty_score":99,"env_os":100,"env_gpu":101,"env_ram":102,"env_deps":103,"category_tags":110,"github_topics":111,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":118,"updated_at":119,"faqs":120,"releases":154},8142,"tech-srl\u002Fcode2vec","code2vec","TensorFlow code for the neural network presented in the paper: \"code2vec: Learning Distributed Representations of Code\"","code2vec 是一款基于深度学习的开源工具，旨在让计算机像理解自然语言一样“读懂”代码。它核心解决了如何从源代码中提取语义信息并转化为机器可处理的数学向量这一难题。传统方法往往难以捕捉代码的结构逻辑，而 code2vec 通过独特的路径神经网络架构，能够分析代码的抽象语法树（AST），将代码片段映射为分布式表示。这使得它不仅能精准预测 Java 等语言中的方法名称，还能用于代码分类、相似性搜索及漏洞检测等任务。\n\n该工具特别适合人工智能研究人员、编程语言学者以及希望探索代码智能应用的开发者使用。其技术亮点在于不依赖特定编程语言的硬编码规则，而是通过学习代码结构路径来通用化地理解逻辑，因此易于扩展至 Python、JavaScript 
等其他语言。作为学术界广受引用的经典模型（发表于 POPL'2019），code2vec 提供了基于 TensorFlow 和 Keras 的两种实现版本，兼顾了研究实验的灵活性与工程落地的便利性。无论是想要复现前沿论文成果，还是构建智能编程辅助原型，code2vec 都是一个扎实且易上手的起点。","# Code2vec\nA neural network for learning distributed representations of code.\nThis is an official implementation of the model described in:\n\n[Uri Alon](http:\u002F\u002Furialon.cswp.cs.technion.ac.il), [Meital Zilberstein](http:\u002F\u002Fwww.cs.technion.ac.il\u002F~mbs\u002F), [Omer Levy](https:\u002F\u002Flevyomer.wordpress.com) and [Eran Yahav](http:\u002F\u002Fwww.cs.technion.ac.il\u002F~yahave\u002F),\n\"code2vec: Learning Distributed Representations of Code\", POPL'2019 [[PDF]](https:\u002F\u002Furialon.cswp.cs.technion.ac.il\u002Fwp-content\u002Fuploads\u002Fsites\u002F83\u002F2018\u002F12\u002Fcode2vec-popl19.pdf)\n\n_**October 2018** - The paper was accepted to [POPL'2019](https:\u002F\u002Fpopl19.sigplan.org)_!\n\n_**April 2019** - The talk video is available [here](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=EJ8okcxL2Iw)_.\n\n_**July 2019** - Add `tf.keras` model implementation (see [here](#choosing-implementation-to-use))._\n\nAn **online demo** is available at [https:\u002F\u002Fcode2vec.org\u002F](https:\u002F\u002Fcode2vec.org\u002F).\n\n## See also:\n  * **code2seq** (ICLR'2019) is our newer model. It uses LSTMs to encode paths node-by-node (rather than monolithic path embeddings as in code2vec), and an LSTM to decode a target sequence (rather than predicting a single label at a time as in code2vec). See [PDF](https:\u002F\u002Fopenreview.net\u002Fpdf?id=H1gKYo09tX), demo at [http:\u002F\u002Fwww.code2seq.org](http:\u002F\u002Fwww.code2seq.org) and [code](https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2seq\u002F).\n  * **Structural Language Models of Code** is a new paper that learns to generate the missing code within a larger code snippet. This is similar to code completion, but is able to predict complex expressions rather than a single token at a time. 
See [PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.00577.pdf), demo at [http:\u002F\u002FAnyCodeGen.org](http:\u002F\u002FAnyCodeGen.org).\n  * **Adversarial Examples for Models of Code** is a new paper that shows how to slightly mutate the input code snippet of code2vec and GNN models (thus introducing adversarial examples), such that the model (code2vec or GNN) will output a prediction of our choice. See [PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.07517.pdf) (code: soon).\n  * **Neural Reverse Engineering of Stripped Binaries** is a new paper that learns to predict procedure names in stripped binaries, thus using neural networks for reverse engineering. See [PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1902.09122) (code: soon).\n\nThis is a TensorFlow implementation, designed to be easy and useful in research, \nand for experimenting with new ideas in machine learning for code tasks.\nBy default, it learns from Java source code and predicts Java method names, but it can be easily extended to other languages, \nsince the TensorFlow network is agnostic to the input programming language (see [Extending to other languages](#extending-to-other-languages)).\nContributions are welcome.\nThis repo actually contains two model implementations. The 1st uses pure TensorFlow and the 2nd uses TensorFlow's Keras ([more details](#choosing-implementation-to-use)). 
\n\n\u003Ccenter style=\"padding: 40px\">\u003Cimg width=\"70%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftech-srl_code2vec_readme_61eac4ac9408.png\" \u002F>\u003C\u002Fcenter>\n\nTable of Contents\n=================\n  * [Requirements](#requirements)\n  * [Quickstart](#quickstart)\n  * [Configuration](#configuration)\n  * [Features](#features)\n  * [Extending to other languages](#extending-to-other-languages)\n  * [Additional datasets](#additional-datasets)\n  * [Citation](#citation)\n\n## Requirements\nOn Ubuntu:\n  * [Python3](https:\u002F\u002Fwww.linuxbabe.com\u002Fubuntu\u002Finstall-python-3-6-ubuntu-16-04-16-10-17-04) (>=3.6). To check the version:\n> python3 --version\n  * TensorFlow - version 2.0.0 ([install](https:\u002F\u002Fwww.tensorflow.org\u002Finstall\u002Finstall_linux)).\n  To check TensorFlow version:\n> python3 -c 'import tensorflow as tf; print(tf.\\_\\_version\\_\\_)'\n  * If you are using a GPU, you will need CUDA 10.0\n  ([download](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-10.0-download-archive-base)) \n  as this is the version that is currently supported by TensorFlow. 
To check CUDA version:\n> nvcc --version\n  * For GPU: cuDNN (>=7.5) ([download](http:\u002F\u002Fdeveloper.nvidia.com\u002Fcudnn)). To check cuDNN version:\n> cat \u002Fusr\u002Finclude\u002Fcudnn.h | grep CUDNN_MAJOR -A 2\n  * For [creating a new dataset](#creating-and-preprocessing-a-new-java-dataset)\n  or [manually examining a trained model](#step-4-manual-examination-of-a-trained-model)\n  (any operation that requires parsing of a new code example) - [Java JDK](https:\u002F\u002Fopenjdk.java.net\u002Finstall\u002F)\n\n## Quickstart\n### Step 0: Cloning this repository\n```\ngit clone https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\ncd code2vec\n```\n\n### Step 1: Creating a new dataset from Java sources\nIn order to have a preprocessed dataset to train a network on, you can either download our\npreprocessed dataset, or create a new dataset of your own.\n\n#### Download our preprocessed dataset of ~14M examples (compressed: 6.3GB, extracted 32GB)\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava14m_data.tar.gz\ntar -xvzf java14m_data.tar.gz\n```\nThis will create a data\u002Fjava14m\u002F sub-directory, containing the files that hold the training, test and validation sets,\nand a vocabulary file for various dataset properties.\n\n#### Creating and preprocessing a new Java dataset\nIn order to create and preprocess a new dataset (for example, to compare code2vec to another model on another dataset):\n  * Edit the file [preprocess.sh](preprocess.sh) using the instructions there, pointing it to the correct training, validation and test directories.\n  * Run the preprocess.sh file:\n> source preprocess.sh\n\n### Step 2: Training a model\nYou can either download an already-trained model, or train a new model using a preprocessed dataset.\n\n#### Downloading a trained model (1.4 GB)\nWe already trained a model for 8 epochs on the data that was preprocessed in the previous step.\nThe number of epochs was chosen using [early 
stopping](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEarly_stopping), as the version that maximized the F1 score on the validation set. This model can be downloaded [here](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model.tar.gz) or using:\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model.tar.gz\ntar -xvzf java14m_model.tar.gz\n```\n\n##### Note:\nThis trained model is in a \"released\" state, which means that we stripped it of its training parameters, so it can be used for inference but cannot be further trained. If you use this trained model in the next steps, use 'saved_model_iter8.release' instead of 'saved_model_iter8' in every command line example that loads the model such as: '--load models\u002Fjava14_model\u002Fsaved_model_iter8'. To read how to release a model, see [Releasing the model](#releasing-the-model).\n\n#### Downloading a trained model (3.5 GB) _which can be further trained_\n\nA non-stripped trained model can be obtained [here](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model_trainable.tar.gz) or using:\n\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model_trainable.tar.gz\ntar -xvzf java14m_model_trainable.tar.gz\n```  \n\nThis model weighs more than twice as much as the stripped version, and it is recommended only if you wish to continue training a model which is already trained. To continue training this trained model, use the `--load` flag to load the trained model; the `--data` flag to point to the new dataset to train on; and the `--save` flag to provide a new save path.\n\n#### A model that was trained on the Java-large dataset\nWe provide an additional code2vec model that was trained on the \"Java-large\" dataset (this dataset was introduced in the code2seq paper). 
See [Java-large](#java-large-compressed-72gb-extracted-37gb).\n\n#### Training a model from scratch\nTo train a model from scratch:\n  * Edit the file [train.sh](train.sh) to point it to the right preprocessed data. By default, \n  it points to our \"java14m\" dataset that was preprocessed in the previous step.\n  * Before training, you can edit the configuration hyper-parameters in the file [config.py](config.py),\n  as explained in [Configuration](#configuration).\n  * Run the [train.sh](train.sh) script:\n```\nsource train.sh\n```\n\n##### Notes:\n  1. By default, the network is evaluated on the validation set after every training epoch.\n  2. The newest 10 versions are kept (older ones are deleted automatically). This can be changed, but keeping more versions consumes more disk space.\n  3. By default, the network is trained for 20 epochs.\nThese settings can be changed by simply editing the file [config.py](config.py).\nTraining on a Tesla V100 GPU takes about 50 minutes per epoch. \nTraining on a Tesla K80 takes about 4 hours per epoch.\n\n### Step 3: Evaluating a trained model\nOnce the score on the validation set stops improving over time, you can stop the training process (by killing it)\nand pick the iteration that performed the best on the validation set.\nSupposing that iteration #8 is the chosen model, run:\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --test data\u002Fjava14m\u002Fjava14m.test.c2v\n```\nWhile evaluating, a file named \"log.txt\" is written with each test example name and the model's prediction.\n\n### Step 4: Manual examination of a trained model\nTo manually examine a trained model, run:\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --predict\n```\nAfter the model loads, follow the instructions: edit the file [Input.java](Input.java), enter a Java \nmethod or code snippet, and examine the model's predictions and attention scores.\n\n## Configuration\nChanging 
hyper-parameters is possible by editing the file\n[config.py](config.py).\n\nHere are some of the parameters and their descriptions:\n#### config.NUM_TRAIN_EPOCHS = 20\nThe max number of epochs to train the model. Stopping earlier must be done manually (kill).\n#### config.SAVE_EVERY_EPOCHS = 1\nAfter how many training epochs a model should be saved.\n#### config.TRAIN_BATCH_SIZE = 1024 \nBatch size in training.\n#### config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE\nBatch size during evaluation. Affects only the evaluation speed and memory consumption; it does not affect the results.\n#### config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10\nNumber of words with highest scores in $ y_hat $ to consider during prediction and evaluation.\n#### config.NUM_BATCHES_TO_LOG_PROGRESS = 100\nNumber of batches (during training \u002F evaluating) to complete between two progress-logging records.\n#### config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100\nNumber of training batches to complete between model evaluations on the test set.\n#### config.READER_NUM_PARALLEL_BATCHES = 4\nThe number of threads enqueuing examples to the reader queue.\n#### config.SHUFFLE_BUFFER_SIZE = 10000\nSize of the reader buffer within which examples are shuffled during training.\nA bigger buffer allows better randomness, but requires more memory and may harm training throughput.\n#### config.CSV_BUFFER_SIZE = 100 * 1024 * 1024  # 100 MB\nThe buffer size (in bytes) of the CSV dataset reader.\n\n#### config.MAX_CONTEXTS = 200\nThe number of contexts to use in each example.\n#### config.MAX_TOKEN_VOCAB_SIZE = 1301136\nThe max size of the token vocabulary.\n#### config.MAX_TARGET_VOCAB_SIZE = 261245\nThe max size of the target words vocabulary.\n#### config.MAX_PATH_VOCAB_SIZE = 911417\nThe max size of the path vocabulary.\n#### config.DEFAULT_EMBEDDINGS_SIZE = 128\nDefault embedding size to be used for token and path if not specified otherwise.\n#### config.TOKEN_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE\nEmbedding 
size for tokens.\n#### config.PATH_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE\nEmbedding size for paths.\n#### config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE\nSize of code vectors.\n#### config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE\nEmbedding size for target words.\n#### config.MAX_TO_KEEP = 10\nKeep this number of newest trained versions during training.\n#### config.DROPOUT_KEEP_RATE = 0.75\nDropout keep rate used during training (the fraction of units that are kept).\n#### config.SEPARATE_OOV_AND_PAD = False\nWhether to treat `\u003COOV>` and `\u003CPAD>` as two different special tokens whenever possible.\n\n## Features\nCode2vec supports the following features: \n\n### Choosing implementation to use\nThis repo comes with two model implementations:\n(i) uses pure TensorFlow (written in [tensorflow_model.py](tensorflow_model.py));\n(ii) uses TensorFlow's Keras (written in [keras_model.py](keras_model.py)).\nThe default implementation used by `code2vec.py` is the pure TensorFlow one.\nTo explicitly choose the desired implementation to use, specify `--framework tensorflow` or `--framework keras`\nas an additional argument when executing the script `code2vec.py`.\nIn particular, this argument can be added to each one of the usage examples (of `code2vec.py`) detailed in this file.\nNote that in order to load a trained model (from file), one should use the same implementation used during its training.\n\n### Releasing the model\nIf you wish to keep a trained model for inference only (without the ability to continue training it) you can\nrelease the model using:\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8 --release\n```\nThis will save a copy of the trained model with the '.release' suffix.\nA \"released\" model usually takes 3x less disk space.\n\n### Exporting the trained token vectors and target vectors\nToken and target embeddings are available to download: \n\n[[Token 
vectors]](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Ftoken_vecs.tar.gz) [[Method name vectors]](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Ftarget_vecs.tar.gz)\n\nThese saved embeddings are saved without subtoken-delimiters (\"*toLower*\" is saved as \"*tolower*\").\n\nIn order to export embeddings from a trained model, use the \"--save_w2v\" and \"--save_t2v\" flags:\n\nExporting the trained *token* embeddings:\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --save_w2v models\u002Fjava14_model\u002Ftokens.txt\n```\nExporting the trained *target* (method name) embeddings:\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --save_t2v models\u002Fjava14_model\u002Ftargets.txt\n```\nThis saves the tokens\u002Ftargets embedding matrices in word2vec format to the specified text file, in which:\nthe first line is: \\\u003Cvocab_size\\> \\\u003Cdimension\\>\nand each of the following lines contains: \\\u003Cword\\> \\\u003Cfloat_1\\> \\\u003Cfloat_2\\> ... 
\\\u003Cfloat_dimension\\>\n\nThese word2vec files can be manually parsed or easily loaded and inspected using the [gensim](https:\u002F\u002Fradimrehurek.com\u002Fgensim\u002Fmodels\u002Fword2vec.html) Python package:\n```python\n# Run in a python3 session or as a script (requires gensim).\nfrom gensim.models import KeyedVectors as word2vec\nvectors_text_path = 'models\u002Fjava14_model\u002Ftargets.txt'  # or: 'models\u002Fjava14_model\u002Ftokens.txt'\nmodel = word2vec.load_word2vec_format(vectors_text_path, binary=False)\n# Names closest to both 'equals' and 'to|lower' (use 'tolower' with the downloaded embeddings)\nprint(model.most_similar(positive=['equals', 'to|lower']))\n# Analogy-style query: download + send - receive\nprint(model.most_similar(positive=['download', 'send'], negative=['receive']))\n```\nThe top result of the first query is the name closest to both \"equals\" and \"to|lower\", which is \"equals|ignore|case\".\nNote: In embeddings that were exported manually using the \"--save_w2v\" or \"--save_t2v\" flags, the input token and target words are saved using the symbol \"|\" as a subtoken delimiter (\"*toLower*\" is saved as: \"*to|lower*\"). In the embeddings that are available to download (which are the same as in the paper), the \"|\" symbol is not used, thus \"*toLower*\" is saved as \"*tolower*\".\n\n### Exporting the code vectors for the given code examples\nThe flag `--export_code_vectors` exports the code vectors for the given examples. \n\nIf used with the `--test \u003CTEST_FILE>` flag,\na file named `\u003CTEST_FILE>.vectors` will be saved in the same directory as `\u003CTEST_FILE>`. 
\nEach row in the saved file is the code vector of the code snippet in the corresponding row in `\u003CTEST_FILE>`.\n \nIf used with the `--predict` flag, the code vector will be printed to the console.\n\n\n## Extending to other languages  \n\nThis project currently supports Java and C\\# as the input languages.\n\n_**April 2020** - an extension for code2vec that addresses obfuscated Java code was developed by [@basedrhys](https:\u002F\u002Fgithub.com\u002Fbasedrhys), and is available here:\n[https:\u002F\u002Fgithub.com\u002Fbasedrhys\u002Fobfuscated-code2vec](https:\u002F\u002Fgithub.com\u002Fbasedrhys\u002Fobfuscated-code2vec)._\n\n\n_**January 2020** - an extractor for predicting TypeScript type annotations for JavaScript input using code2vec was developed by [@izosak](https:\u002F\u002Fgithub.com\u002Fizosak) and Noa Cohen, and is available here:\n[https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fid2vec](https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fid2vec)._\n\n~~_**June 2019** - an extractor for **C** that is compatible with our model was developed by [CMU SEI team](https:\u002F\u002Fgithub.com\u002Fcmu-sei\u002Fcode2vec-c)._~~ - removed by CMU SEI team.\n\n_**June 2019** - an extractor for **Python, Java, C, C++** by JetBrains Research is available here: [PathMiner](https:\u002F\u002Fgithub.com\u002FJetBrains-Research\u002Fastminer)._\n\nIn order to extend code2vec to work with other languages, a new extractor (similar to the [JavaExtractor](JavaExtractor))\nshould be implemented and called by [preprocess.sh](preprocess.sh).\nBasically, an extractor should be able to output for each directory containing source files:\n  * A single text file, where each row is an example.\n  * Each example is a space-delimited list of fields, where:\n  1. The first \"word\" is the target label, internally delimited by the \"|\" character.\n  2. Each of the following words is a context, where each context has three components separated by commas (\",\"). 
None of these components may include spaces or commas.\n  We refer to these three components as a token, a path, and another token, but in general other types of ternary contexts can be considered.  \n\nFor example, a possible novel Java context extraction for the following code example:\n```java\nvoid fooBar() {\n\tSystem.out.println(\"Hello World\");\n}\n```\nMight be (in a new context extraction algorithm, which is different from ours since it doesn't use paths in the AST):\n> foo|Bar System,FIELD_ACCESS,out System.out,FIELD_ACCESS,println THE_METHOD,returns,void THE_METHOD,prints,\"hello_world\" \n\nConsider the first example context \"System,FIELD_ACCESS,out\". \nIn the current implementation, the 1st (\"System\") and 3rd (\"out\") components of a context are taken from the same \"tokens\" vocabulary, \nand the 2nd component (\"FIELD_ACCESS\") is taken from a separate \"paths\" vocabulary. \n\n## Additional datasets\nWe preprocessed three additional datasets used by the [code2seq](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1808.01400) paper, using the code2vec preprocessing.\nThese datasets are available in raw format (i.e., .java files) at [https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2seq\u002Fblob\u002Fmaster\u002FREADME.md#datasets](https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2seq\u002Fblob\u002Fmaster\u002FREADME.md#datasets),\nand are also available to download in a preprocessed format (i.e., ready to train a code2vec model on) here:\n\n### Java-small (compressed: 366MB, extracted 1.9GB)\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava-small_data.tar.gz\n```\nThis dataset is based on the dataset of [Allamanis et al. (ICML'2016)](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fcup\u002Fcodeattention\u002F), with the difference that training\u002Fvalidation\u002Ftest are split by-project rather than by-file.\nThis dataset contains 9 Java projects for training, 1 for validation and 1 for testing. 
Overall, it contains about 700K examples.\n\n### Java-med (compressed: 1.8GB, extracted 9.3GB)\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava-med_data.tar.gz\n```\nA dataset of the 1000 top-starred Java projects from GitHub. It contains\n800 projects for training, 100 for validation and 100 for testing. Overall, it contains about 4M examples.\n\n### Java-large (compressed: 7.2GB, extracted 37GB)\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava-large_data.tar.gz\n```\nA dataset of the 9500 top-starred Java projects from GitHub that were created\nsince January 2007. It contains 9000 projects for training, 200 for validation and 300 for\ntesting. Overall, it contains about 16M examples.\n\nAdditionally, we provide a trained code2vec model that was trained on the Java-large dataset (this model was not part of the original code2vec paper, but was later used as a baseline in the code2seq paper which introduced this dataset).\nTrainable model (3.5 GB):\n```\nwget https:\u002F\u002Fcode2vec.s3.amazonaws.com\u002Fmodel\u002Fjava-large-model.tar.gz\n```\n\n\"Released model\" (1.4 GB, cannot be further trained).\n```\nwget https:\u002F\u002Fcode2vec.s3.amazonaws.com\u002Fmodel\u002Fjava-large-released-model.tar.gz\n```\n\n## Citation\n\n[code2vec: Learning Distributed Representations of Code](https:\u002F\u002Furialon.cswp.cs.technion.ac.il\u002Fwp-content\u002Fuploads\u002Fsites\u002F83\u002F2018\u002F12\u002Fcode2vec-popl19.pdf)\n\n```\n@article{alon2019code2vec,\n author = {Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran},\n title = {Code2Vec: Learning Distributed Representations of Code},\n journal = {Proc. ACM Program. 
Lang.},\n issue_date = {January 2019},\n volume = {3},\n number = {POPL},\n month = jan,\n year = {2019},\n issn = {2475-1421},\n pages = {40:1--40:29},\n articleno = {40},\n numpages = {29},\n url = {http:\u002F\u002Fdoi.acm.org\u002F10.1145\u002F3290353},\n doi = {10.1145\u002F3290353},\n acmid = {3290353},\n publisher = {ACM},\n address = {New York, NY, USA},\n keywords = {Big Code, Distributed Representations, Machine Learning},\n}\n```\n","# Code2vec\n一种用于学习代码分布式表示的神经网络。\n这是对以下论文中描述的模型的官方实现：\n\n[Uri Alon](http:\u002F\u002Furialon.cswp.cs.technion.ac.il), [Meital Zilberstein](http:\u002F\u002Fwww.cs.technion.ac.il\u002F~mbs\u002F), [Omer Levy](https:\u002F\u002Flevyomer.wordpress.com) 和 [Eran Yahav](http:\u002F\u002Fwww.cs.technion.ac.il\u002F~yahave\u002F),\n“code2vec: 学习代码的分布式表示”, POPL'2019 [[PDF]](https:\u002F\u002Furialon.cswp.cs.technion.ac.il\u002Fwp-content\u002Fuploads\u002Fsites\u002F83\u002F2018\u002F12\u002Fcode2vec-popl19.pdf)\n\n_**2018年10月** - 该论文已被[POPL'2019](https:\u002F\u002Fpopl19.sigplan.org) 接受！_\n\n_**2019年4月** - 演讲视频可在[这里](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=EJ8okcxL2Iw)观看。_\n\n_**2019年7月** - 添加了 `tf.keras` 模型实现（见[此处](#choosing-implementation-to-use))。_\n\n一个**在线演示**可在[https:\u002F\u002Fcode2vec.org\u002F](https:\u002F\u002Fcode2vec.org\u002F)访问。\n\n## 参阅：\n  * **code2seq** (ICLR'2019) 是我们更新的模型。它使用 LSTM 对路径进行逐节点编码（而不是像 code2vec 那样使用整体路径嵌入），并使用 LSTM 解码目标序列（而不是像 code2vec 那样一次预测单个标签）。请参阅 [PDF](https:\u002F\u002Fopenreview.net\u002Fpdf?id=H1gKYo09tX)，演示可在 [http:\u002F\u002Fwww.code2seq.org](http:\u002F\u002Fwww.code2seq.org) 查看，代码可在 [GitHub](https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2seq\u002F) 上找到。\n  * **代码的结构化语言模型**是一篇新论文，它学习在较大的代码片段中生成缺失的代码。这类似于代码补全，但能够预测复杂的表达式，而不仅仅是一次一个标记。请参阅 [PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.00577.pdf)，演示可在 [http:\u002F\u002FAnyCodeGen.org](http:\u002F\u002FAnyCodeGen.org) 查看。\n  * **代码模型的对抗样本**是一篇新论文，展示了如何轻微地改变 code2vec 和 GNN 模型的输入代码片段（从而引入对抗样本），使得模型（code2vec 或 
GNN）会输出我们所选择的预测结果。请参阅 [PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.07517.pdf)（代码：即将发布）。\n  * **剥离二进制文件的神经网络逆向工程**是一篇新论文，它学习预测剥离后的二进制文件中的过程名称，从而利用神经网络进行逆向工程。请参阅 [PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1902.09122)（代码：即将发布）。\n\n这是一个 TensorFlow 实现，旨在便于研究和实验机器学习在代码任务中的新想法。\n默认情况下，它学习 Java 源代码并预测 Java 方法名，但可以轻松扩展到其他语言，\n因为 TensorFlow 网络对输入编程语言是无感的（参见 [扩展到其他语言](#extending-to-other-languages)）。\n欢迎贡献。\n这个仓库实际上包含两种模型实现。第一种使用纯 TensorFlow，第二种使用 TensorFlow 的 Keras（[更多详情](#choosing-implementation-to-use)）。\n\n\u003Ccenter style=\"padding: 40px\">\u003Cimg width=\"70%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftech-srl_code2vec_readme_61eac4ac9408.png\" \u002F>\u003C\u002Fcenter>\n\n目录\n=================\n  * [要求](#requirements)\n  * [快速入门](#quickstart)\n  * [配置](#configuration)\n  * [功能](#features)\n  * [扩展到其他语言](#extending-to-other-languages)\n  * [附加数据集](#additional-datasets)\n  * [引用](#citation)\n\n## 要求\n在 Ubuntu 上：\n  * [Python3](https:\u002F\u002Fwww.linuxbabe.com\u002Fubuntu\u002Finstall-python-3-6-ubuntu-16-04-16-10-17-04)（>=3.6）。检查版本：\n> python3 --version\n  * TensorFlow - 版本 2.0.0（[安装](https:\u002F\u002Fwww.tensorflow.org\u002Finstall\u002Finstall_linux)）。\n  检查 TensorFlow 版本：\n> python3 -c 'import tensorflow as tf; print(tf.__version__)'\n  * 如果您使用 GPU，需要 CUDA 10.0\n  ([下载](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-10.0-download-archive-base))\n  因为这是目前 TensorFlow 支持的版本。检查 CUDA 版本：\n> nvcc --version\n  * 对于 GPU：cuDNN（>=7.5）([下载](http:\u002F\u002Fdeveloper.nvidia.com\u002Fcudnn))。检查 cuDNN 版本：\n> cat \u002Fusr\u002Finclude\u002Fcudnn.h | grep CUDNN_MAJOR -A 2\n  * 对于 [创建新数据集](#creating-and-preprocessing-a-new-java-dataset)\n  或 [手动检查训练好的模型](#step-4-manual-examination-of-a-trained-model)\n  （任何需要解析新代码示例的操作）- [Java JDK](https:\u002F\u002Fopenjdk.java.net\u002Finstall\u002F)\n\n## 快速入门\n### 第 0 步：克隆此仓库\n```\ngit clone https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\ncd code2vec\n```\n\n### 第 1 步：从 Java 
源代码创建新数据集\n为了获得可用于训练网络的预处理数据集，您可以下载我们的预处理数据集，也可以自行创建一个新的数据集。\n\n#### 下载我们约 1400 万条目的预处理数据集（压缩后：6.3GB，解压后：32GB）\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava14m_data.tar.gz\ntar -xvzf java14m_data.tar.gz\n```\n这将创建一个 data\u002Fjava14m\u002F 子目录，其中包含存储训练、测试和验证集的文件，以及用于各种数据集属性的词汇表文件。\n\n#### 创建并预处理新的 Java 数据集\n要创建并预处理一个新的数据集（例如，为了在另一个数据集上将 code2vec 与另一模型进行比较）：\n  * 根据文件 [preprocess.sh](preprocess.sh) 中的说明编辑该文件，将其指向正确的训练、验证和测试目录。\n  * 运行 preprocess.sh 文件：\n> source preprocess.sh\n\n### 步骤 2：训练模型\n您可以选择下载一个已经训练好的模型，或者使用预处理过的数据集来训练一个新的模型。\n\n#### 下载已训练好的模型（1.4 GB）\n我们已经在上一步预处理的数据上训练了一个模型，共进行了8个epoch。epoch数是通过[早停法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEarly_stopping)确定的，最终选择了在验证集上F1分数最高的版本。该模型可以从[这里](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model.tar.gz)下载，或者使用以下命令：\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model.tar.gz\ntar -xvzf java14m_model.tar.gz\n```\n\n##### 注意：\n这个已训练好的模型处于“发布”状态，这意味着我们已经移除了它的训练参数，因此只能用于推理，而不能继续训练。如果您在后续步骤中使用此模型，请在所有加载模型的命令行示例中把`saved_model_iter8`替换为`saved_model_iter8.release`，例如：`--load models\u002Fjava14_model\u002Fsaved_model_iter8.release`。有关如何发布模型的详细信息，请参阅[发布模型](#releasing-the-model)。\n\n#### 下载可继续训练的已训练模型（3.5 GB）\n\n未剥离的已训练模型可以从[这里](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model_trainable.tar.gz)获取，或者使用以下命令：\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model_trainable.tar.gz\ntar -xvzf java14m_model_trainable.tar.gz\n```\n\n该模型的文件大小是剥离版本的两倍多，仅建议在您希望继续训练一个已经训练好的模型时使用。要继续训练此模型，可以使用`--load`标志加载已训练好的模型；使用`--data`标志指向新的训练数据集；并使用`--save`标志指定新的保存路径。\n\n#### 在Java-large数据集上训练的模型\n我们还提供了一个基于“Java-large”数据集训练的code2vec模型（该数据集在code2seq论文中被介绍）。请参阅[Java-large](#java-large-compressed-72gb-extracted-37gb)。\n\n#### 从头开始训练模型\n要从头开始训练模型：\n  * 编辑[train.sh](train.sh)文件，使其指向正确的预处理数据。默认情况下，它指向我们在上一步预处理的“java14m”数据集。\n  * 
在训练之前，您可以编辑[config.py](config.py)文件中的配置超参数，具体说明请参见[配置](#configuration)。\n  * 运行[train.sh](train.sh)脚本：\n```\nsource train.sh\n```\n\n##### 注意事项：\n  1. 默认情况下，网络会在每个训练epoch结束后在验证集上进行评估。\n  2. 系统会保留最近的10个版本（较旧的版本会被自动删除）。这一设置可以更改，但会占用更多存储空间。\n  3. 默认情况下，网络会训练20个epoch。这些设置可以通过简单地编辑[config.py](config.py)文件来更改。  \n    在Tesla v100 GPU上训练，每个epoch大约需要50分钟。而在Tesla K80上训练，则每个epoch大约需要4小时。\n\n### 步骤 3：评估已训练好的模型\n一旦验证集上的得分不再随时间提升，您可以停止训练过程（通过终止进程），并选择在验证集上表现最好的迭代版本。假设第8次迭代是我们选定的模型，运行以下命令：\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --test data\u002Fjava14m\u002Fjava14m.test.c2v\n```\n\n在评估过程中，系统会生成一个名为“log.txt”的文件，记录每个测试样本的名称以及模型的预测结果。\n\n### 步骤 4：手动检查已训练好的模型\n要手动检查已训练好的模型，运行以下命令：\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --predict\n```\n\n模型加载完成后，按照提示编辑[Input.java](Input.java)文件，输入一个Java方法或代码片段，然后查看模型的预测结果和注意力权重。\n\n## 配置\n通过编辑[config.py](config.py)文件，可以调整超参数。\n\n以下是一些参数及其说明：\n#### config.NUM_TRAIN_EPOCHS = 20\n模型训练的最大epoch数。如果需要提前停止，必须手动终止训练。\n#### config.SAVE_EVERY_EPOCHS = 1\n每经过多少个训练epoch保存一次模型。\n#### config.TRAIN_BATCH_SIZE = 1024 \n训练时的批次大小。\n#### config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE\n评估时的批次大小。这仅影响评估的速度和内存消耗，不会影响结果。\n#### config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10\n在预测和评估过程中，考虑$ y_hat $中得分最高的前K个词。\n#### config.NUM_BATCHES_TO_LOG_PROGRESS = 100\n在训练或评估过程中，每经过多少个批次记录一次进度。\n#### config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100\n在对测试集进行评估之前，需要完成多少个训练批次。\n#### config.READER_NUM_PARALLEL_BATCHES = 4\n将样本排队到读取器队列中的线程数。\n#### config.SHUFFLE_BUFFER_SIZE = 10000\n读取器中用于在训练过程中随机打乱样本的缓冲区大小。更大的缓冲区可以提高随机性，但会占用更多内存，并可能降低训练吞吐量。\n#### config.CSV_BUFFER_SIZE = 100 * 1024 * 1024  # 100 MB\nCSV数据集读取器的缓冲区大小（以字节为单位）。\n\n#### config.MAX_CONTEXTS = 200\n每个样本中使用的上下文数量。\n#### config.MAX_TOKEN_VOCAB_SIZE = 1301136\n标记词汇表的最大大小。\n#### config.MAX_TARGET_VOCAB_SIZE = 261245\n目标词汇表的最大大小。\n#### config.MAX_PATH_VOCAB_SIZE = 911417\n路径词汇表的最大大小。\n#### config.DEFAULT_EMBEDDINGS_SIZE = 
128\n如果未另行指定，标记和路径的默认嵌入维度。\n#### config.TOKEN_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE\n标记的嵌入维度。\n#### config.PATH_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE\n路径的嵌入维度。\n#### config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE\n代码向量的维度。\n#### config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE\n目标词的嵌入维度。\n#### config.MAX_TO_KEEP = 10\n在训练过程中保留最新版本的数量。\n#### config.DROPOUT_KEEP_RATE = 0.75\n训练过程中 dropout 的保留概率（即每个单元以 0.75 的概率被保留，对应的丢弃率为 0.25）。\n#### config.SEPARATE_OOV_AND_PAD = False\n是否尽可能将`\u003COOV>`和`\u003CPAD>`视为两个不同的特殊标记。\n\n## 功能\nCode2vec支持以下功能：\n\n### 选择使用的实现\n本仓库提供了两种模型实现：\n(i) 使用纯 TensorFlow（在 [tensorflow_model.py](tensorflow_model.py) 中编写）；\n(ii) 使用 TensorFlow 的 Keras（在 [keras_model.py](keras_model.py) 中编写）。\n`code2vec.py` 默认使用的是纯 TensorFlow 实现。\n要显式选择所需的实现，可以在执行 `code2vec.py` 脚本时，添加 `--framework tensorflow` 或 `--framework keras` 作为额外参数。\n特别地，这一参数可以添加到本文件中详细列出的每个 `code2vec.py` 使用示例中。\n请注意，为了从文件加载一个训练好的模型，必须使用与其训练时相同的实现。\n\n### 发布模型\n如果您希望保留一个仅用于推理的训练好的模型（而不再继续训练），可以使用以下命令发布模型：\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8 --release\n```\n这将保存一份带有 `.release` 后缀的训练好的模型副本。发布后的模型占用的磁盘空间通常约为原模型的三分之一。\n\n### 导出训练好的词向量和目标向量\n词嵌入和目标嵌入可供下载：\n\n[[词向量]](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Ftoken_vecs.tar.gz) [[方法名向量]](https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Ftarget_vecs.tar.gz)\n\n这些保存的嵌入不包含子词分隔符（例如，“*toLower*”会被保存为“*tolower*”）。\n\n若要从训练好的模型中导出嵌入，可使用 `--save_w2v` 和 `--save_t2v` 标志：\n\n导出训练好的 *词* 嵌入：\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --save_w2v models\u002Fjava14_model\u002Ftokens.txt\n```\n\n导出训练好的 *目标*（方法名）嵌入：\n```\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --save_t2v models\u002Fjava14_model\u002Ftargets.txt\n```\n\n这会将词\u002F目标嵌入矩阵以 word2vec 格式保存到指定的文本文件中，其中：\n第一行是：\\\u003C词汇表大小\\> \\\u003C维度\\>\n随后的每一行包含：\\\u003C词\\> \\\u003C浮点数_1\\> \\\u003C浮点数_2\\> ... 
\\\u003C浮点数_维度\\>\n\n这些 word2vec 文件可以手动解析，也可以使用 [gensim](https:\u002F\u002Fradimrehurek.com\u002Fgensim\u002Fmodels\u002Fword2vec.html) Python 包轻松加载和检查：\n```python\npython3\n>>> from gensim.models import KeyedVectors as word2vec\n>>> vectors_text_path = 'models\u002Fjava14_model\u002Ftargets.txt' # 或：`models\u002Fjava14_model\u002Ftokens.txt`\n>>> model = word2vec.load_word2vec_format(vectors_text_path, binary=False)\n>>> model.most_similar(positive=['equals', 'to|lower']) # 或：'tolower'，如果使用下载的嵌入\n>>> model.most_similar(positive=['download', 'send'], negative=['receive'])\n```\n\n上述 Python 命令将返回与“equals”和“to|lower”同时最相似的名字，即“equals|ignore|case”。  \n注意：通过 `--save_w2v` 或 `--save_t2v` 标志手动导出的嵌入中，输入的词和目标词会使用符号“|”作为子词分隔符（例如，“*toLower*”会被保存为“*to|lower*”）。而在可供下载的嵌入中（与论文中的相同），则未使用“|”符号，因此“*toLower*”会被保存为“*tolower*”。\n\n### 导出给定代码示例的代码向量\n`--export_code_vectors` 标志可用于导出给定示例的代码向量。\n\n如果与 `--test \u003CTEST_FILE>` 标志一起使用，\n将在 `\u003CTEST_FILE>` 所在目录下生成一个名为 `\u003CTEST_FILE>.vectors` 的文件。\n保存文件中的每一行对应于 `\u003CTEST_FILE>` 中相应行的代码片段的代码向量。\n\n如果与 `--predict` 标志一起使用，则代码向量将被打印到控制台。\n\n## 扩展到其他语言\n\n目前，该项目支持 Java 和 C# 作为输入语言。\n\n**2020 年 4 月**——由 [@basedrhys](https:\u002F\u002Fgithub.com\u002Fbasedrhys) 开发了一种针对混淆 Java 代码的 code2vec 扩展，现已发布在此处：\n[https:\u002F\u002Fgithub.com\u002Fbasedrhys\u002Fobfuscated-code2vec](https:\u002F\u002Fgithub.com\u002Fbasedrhys\u002Fobfuscated-code2vec)。\n\n**2020 年 1 月**——由 [@izosak](https:\u002F\u002Fgithub.com\u002Fizosak) 和 Noa Cohen 开发了一种利用 code2vec 预测 JavaScript 输入中 TypeScript 类型注解的提取器，现已发布在此处：\n[https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fid2vec](https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fid2vec)。\n\n**2019 年 6 月**——由 CMU SEI 团队开发的与我们的模型兼容的 **C** 语言提取器：[CMU SEI team](https:\u002F\u002Fgithub.com\u002Fcmu-sei\u002Fcode2vec-c)——已由 CMU SEI 团队移除。\n\n**2019 年 6 月**——由 JetBrains Research 开发的适用于 **Python、Java、C、C++** 的提取器：[PathMiner](https:\u002F\u002Fgithub.com\u002FJetBrains-Research\u002Fastminer)。\n\n要将 code2vec 扩展到其他语言，需要实现一个新的提取器（类似于 
[JavaExtractor](JavaExtractor)），并由 [preprocess.sh](preprocess.sh) 调用。\n基本上，提取器应能为每个包含源文件的目录输出：\n  * 一个单一的文本文件，每行代表一个示例。\n  * 每个示例是一个由空格分隔的字段列表，其中：\n    1. 第一个“词”为目标标签，内部由“|”字符分隔。\n    2. 随后的每个词都是上下文，每个上下文由三个用逗号分隔的组件组成。这些组件不能包含空格或逗号。\n    我们将这三个组件称为一个词、一条路径和另一个词，但一般情况下也可以考虑其他类型的三元上下文。\n\n例如，对于以下 Java 示例代码：\n```java\nvoid fooBar() {\n\tSystem.out.println(\"Hello World\");\n}\n```\n一种可能的新 Java 上下文提取方式（不同于我们当前的实现，因为它不使用 AST 中的路径）可能是：\n> foo|Bar System,FIELD_ACCESS,out System.out,FIELD_ACCESS,println THE_METHOD,returns,void THE_METHOD,prints,\"hello_world\" \n\n以第一个上下文“System,FIELD_ACCESS,out”为例。\n在当前实现中，上下文的第一个（“System”）和第三个（“out”）组件来自同一个“词”词汇表，而第二个组件（“FIELD_ACCESS”）则来自独立的“路径”词汇表。\n\n## 其他数据集\n我们使用 code2vec 的预处理流程，对 [code2seq](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1808.01400) 论文中使用的另外三个数据集进行了预处理。\n这些原始数据集（即 .java 文件）可在 [https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2seq\u002Fblob\u002Fmaster\u002FREADME.md#datasets](https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2seq\u002Fblob\u002Fmaster\u002FREADME.md#datasets) 获取，\n同时也提供预处理后的格式（即可以直接用于训练 code2vec 模型的格式），请见：\n\n### Java-small（压缩后：366MB，解压后1.9GB）\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava-small_data.tar.gz\n```\n该数据集基于[Allamanis等人（ICML'2016）]的研究数据集，不同之处在于，训练集、验证集和测试集是按项目划分的，而非按文件划分。\n\n此数据集包含9个用于训练的Java项目、1个用于验证的项目以及1个用于测试的项目。总体而言，它包含了约70万个示例。\n\n### Java-med（压缩后：1.8GB，解压后9.3GB）\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava-med_data.tar.gz\n```\n该数据集涵盖了GitHub上Star数排名前1000的Java项目。其中，800个项目用于训练，100个项目用于验证，100个项目用于测试。总体而言，它包含了约400万个示例。\n\n### Java-large（压缩后：7.2GB，解压后37GB）\n```\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava-large_data.tar.gz\n```\n该数据集包含了自2007年1月以来在GitHub上创建的、Star数排名前9500的Java项目。其中，9000个项目用于训练，200个项目用于验证，300个项目用于测试。总体而言，它包含了约1600万个示例。\n\n此外，我们还提供了一个在Java-large数据集上训练好的code2vec模型。该模型并未出现在最初的code2vec论文中，而是在后来介绍该数据集的code2seq论文中被用作基准模型。\n\n可训练模型（3.5 GB）：\n```\nwget 
https:\u002F\u002Fcode2vec.s3.amazonaws.com\u002Fmodel\u002Fjava-large-model.tar.gz\n```\n\n“已发布模型”（1.4 GB，不可进一步训练）：\n```\nwget https:\u002F\u002Fcode2vec.s3.amazonaws.com\u002Fmodel\u002Fjava-large-released-model.tar.gz\n```\n\n## 引用\n\n[code2vec：学习代码的分布式表示](https:\u002F\u002Furialon.cswp.cs.technion.ac.il\u002Fwp-content\u002Fuploads\u002Fsites\u002F83\u002F2018\u002F12\u002Fcode2vec-popl19.pdf)\n\n```\n@article{alon2019code2vec,\n author = {Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran},\n title = {Code2Vec: Learning Distributed Representations of Code},\n journal = {Proc. ACM Program. Lang.},\n issue_date = {January 2019},\n volume = {3},\n number = {POPL},\n month = jan,\n year = {2019},\n issn = {2475-1421},\n pages = {40:1--40:29},\n articleno = {40},\n numpages = {29},\n url = {http:\u002F\u002Fdoi.acm.org\u002F10.1145\u002F3290353},\n doi = {10.1145\u002F3290353},\n acmid = {3290353},\n publisher = {ACM},\n address = {New York, NY, USA},\n keywords = {Big Code, Distributed Representations, Machine Learning},\n}\n```","# Code2vec 快速上手指南\n\nCode2vec 是一个用于学习代码分布式表示的神经网络模型，默认支持 Java 方法名预测，也可扩展至其他语言。本指南帮助中国开发者快速完成环境搭建与基础运行。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**：推荐 Ubuntu Linux\n- **Python**：版本 >= 3.6\n- **GPU 支持（可选但推荐）**：\n  - CUDA 10.0\n  - cuDNN >= 7.5\n\n### 前置依赖\n请确保已安装以下组件：\n\n```bash\n# 检查 Python 版本\npython3 --version\n\n# 安装 TensorFlow 2.0.0\npip3 install tensorflow==2.0.0\n\n# 检查 TensorFlow 版本\npython3 -c 'import tensorflow as tf; print(tf.__version__)'\n\n# 若使用 GPU，检查 CUDA 和 cuDNN\nnvcc --version\ncat \u002Fusr\u002Finclude\u002Fcudnn.h | grep CUDNN_MAJOR -A 2\n```\n\n> 💡 **国内加速建议**：可使用清华或阿里镜像源加速 pip 安装：\n> ```bash\n> pip3 install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple tensorflow==2.0.0\n> ```\n\n如需解析 Java 代码或创建新数据集，还需安装：\n- **Java JDK**（OpenJDK 推荐）\n\n## 安装步骤\n\n### 1. 克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\ncd code2vec\n```\n\n### 2. 
获取预处理数据集（可选）\n可直接下载官方提供的 Java 数据集（约 1400 万样本）：\n\n```bash\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fdata\u002Fjava14m_data.tar.gz\ntar -xvzf java14m_data.tar.gz\n```\n\n> 💡 **国内加速替代方案**：若上述链接下载缓慢，可尝试通过国内云存储中转或使用代理加速。\n\n### 3. 获取预训练模型（可选）\n下载已训练好的模型（仅用于推理）：\n\n```bash\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model.tar.gz\ntar -xvzf java14m_model.tar.gz\n```\n\n如需继续训练，请下载可训练版本（体积更大）：\n```bash\nwget https:\u002F\u002Fs3.amazonaws.com\u002Fcode2vec\u002Fmodel\u002Fjava14m_model_trainable.tar.gz\ntar -xvzf java14m_model_trainable.tar.gz\n```\n\n## 基本使用\n\n### 场景一：评估预训练模型\n使用测试集评估模型性能：\n\n```bash\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --test data\u002Fjava14m\u002Fjava14m.test.c2v\n```\n\n运行后将在当前目录生成 `log.txt`，包含每个测试样本的预测结果。\n\n### 场景二：手动预测自定义代码\n1. 编辑 `Input.java` 文件，填入待分析的 Java 方法或代码片段。\n2. 运行预测命令：\n\n```bash\npython3 code2vec.py --load models\u002Fjava14_model\u002Fsaved_model_iter8.release --predict\n```\n\n模型将输出预测的方法名及各路径的注意力权重，便于理解决策依据。\n\n### 场景三：从头训练模型（进阶）\n1. 修改 `preprocess.sh` 指定你的训练\u002F验证\u002F测试数据目录。\n2. 执行预处理：\n   ```bash\n   source preprocess.sh\n   ```\n3. （可选）调整超参数：编辑 `config.py`。\n4. 
启动训练：\n   ```bash\n   source train.sh\n   ```\n\n> ⏱️ 训练耗时参考：Tesla V100 GPU 约 50 分钟\u002Fepoch；Tesla K80 约 4 小时\u002Fepoch。\n\n---\n\n现在你已成功运行 Code2vec！可进一步探索其配置选项或扩展至其他编程语言。","某大型金融科技公司正在对遗留的 Java 核心系统进行重构，技术团队需要从数百万行缺乏文档且命名混乱的代码中快速梳理业务逻辑。\n\n### 没有 code2vec 时\n- 开发人员只能依靠人工逐行阅读源码来猜测方法功能，面对 `funcA`、`test1` 等无意义命名时效率极低。\n- 静态分析工具仅能识别语法结构，无法理解代码深层的语义逻辑，导致自动生成的文档准确率低。\n- 在搜索特定业务功能（如“计算复利”）时，因方法名与功能不匹配，传统关键词搜索完全失效。\n- 新入职员工需要数周时间熟悉代码库，极易因误解旧代码意图而引入严重 Bug。\n\n### 使用 code2vec 后\n- code2vec 能自动分析代码抽象语法树路径，为无名或命名糟糕的方法预测出高置信度的语义名称（如 `calculateCompoundInterest`）。\n- 基于学习到的代码分布式表示，系统可自动生成贴合业务逻辑的方法描述文档，大幅减少人工编写成本。\n- 支持语义级代码搜索，即使输入自然语言描述，也能精准定位到功能相符但命名无关的代码片段。\n- 团队利用模型输出的向量相似度快速聚类功能相近的代码，迅速识别冗余逻辑并制定重构计划。\n\ncode2vec 通过将代码转化为可计算的语义向量，让机器真正“读懂”了程序意图，将原本耗时数月的代码审计工作缩短至几天完成。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftech-srl_code2vec_eca2fb77.png","tech-srl","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ftech-srl_ad3af89e.jpg","",null,"https:\u002F\u002Fgithub.com\u002Ftech-srl",[79,83,87,91],{"name":80,"color":81,"percentage":82},"Python","#3572A5",68.2,{"name":84,"color":85,"percentage":86},"C#","#178600",14.7,{"name":88,"color":89,"percentage":90},"Java","#b07219",12.8,{"name":92,"color":93,"percentage":94},"Shell","#89e051",4.2,1143,292,"2026-04-10T00:35:43","MIT",4,"Linux (Ubuntu)","非必需（CPU 可运行），若使用 GPU 需 NVIDIA 显卡，支持 CUDA 10.0 及 cuDNN >= 7.5","未说明（处理大型数据集如 java14m 解压后约 32GB，建议大内存）",{"notes":104,"python":105,"dependencies":106},"默认针对 Java 源代码进行训练和预测，但可扩展至其他语言。若使用 GPU，必须严格匹配 CUDA 10.0 版本。项目包含两种实现：纯 TensorFlow 和 tf.keras。预处理数据集（如 java14m）解压后高达 32GB，训练模型文件约为 1.4GB 至 3.5GB。在 Tesla V100 GPU 上每轮训练约需 50 分钟。",">=3.6",[107,108,109],"tensorflow==2.0.0","tf.keras (内置于 TensorFlow 2.0)","Java JDK 
(用于数据集预处理和代码解析)",[35,14,45],[65,112,113,114,115,116,117],"learning","distributed","representations","of","code","technion","2026-03-27T02:49:30.150509","2026-04-17T08:23:10.978337",[121,126,131,136,141,145,149],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},36407,"如何使用 code2vec 训练自定义数据集（如测试套件命名规范）？数据量较少时该怎么办？","对于自定义任务（如测试方法命名），应将脚本指向包含源代码的根目录（例如 Maven 项目中的 `src\u002Ftest\u002Fjava\u002F`），无需手动提取方法合并为单个文件。如果训练数据量较少，模型效果可能会受限，建议尽可能收集更多数据或尝试数据增强技术。维护者指出具体配置需参考项目文档中的数据集格式说明。","https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\u002Fissues\u002F5",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},36408,"为什么使用 --release 或 --export_code_vectors 参数后没有生成输出文件？","这是一个已知问题，在最新的提交中已针对 TensorFlow 2 (TF2) 框架修复了 `--save_w2v` 和 `--save_t2v` 相关标志。对于 TF2 版本，现在可以直接运行以下命令来释放模型或导出向量，即使不使用虚拟的 `--test` 标志也能正常工作：\n`python3 code2vec.py --load models\u002Ftest_model\u002Fsaved_model_iter8 --release`\n如果您使用的是 Keras 版本，可能尚未修复，建议切换到 TensorFlow 2 环境尝试。","https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\u002Fissues\u002F54",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},36409,"运行 preprocess.sh 脚本时出现\"Is a directory\"错误或训练数据文件大小为 0，如何解决？","该错误通常发生在 Windows 环境下或路径配置不正确时。脚本期望输入的是具体的文件路径或符合特定结构的目录，而不是直接指向包含子目录的父文件夹导致被识别为目录而非文件集。请检查 `preprocess.sh` 中配置的 `TRAIN_DIR`, `VAL_DIR`, `TEST_DIR` 变量，确保它们指向包含实际代码文件的正确路径，并且脚本有权限读取这些文件。如果在 Windows 上运行，建议使用 Git Bash 或 WSL，并确保路径分隔符正确（使用正斜杠 `\u002F`）。","https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\u002Fissues\u002F70",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},36410,"如何将非 Java 语言（如 Python, Go, C）的代码集成到 code2vec 中？","Code2vec 原生主要支持 Java。对于其他语言，您需要使用外部工具（如 ASTMiner）将代码转换为 code2vec 可识别的 `path_contexts` 格式（.c2v 或 .c2s 文件）。\n1. 使用 ASTMiner 提取路径上下文。\n2. 确保生成的文件格式符合 code2vec 的输入要求（参考项目 README 中的 Notes 部分）。\n3. 
注意：直接使用 `--predict` 选项可能无法处理非 Java 代码，因为内部解析器是为 Java 设计的。对于 C 语言等，如果模型效果不佳，可以尝试增加训练轮数（epochs），或者考虑使用专门针对多语言优化的模型（如 Code-LMs 或 code2seq）。","https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\u002Fissues\u002F144",{"id":142,"question_zh":143,"answer_zh":144,"source_url":125},36411,"如何处理自定义数据集的配置文件（DATASET_NAME 等参数）？","在配置自定义数据集时，`DATASET_NAME` 可以设置为任意名称（例如 'java' 或 'my_dataset'），它主要用于标识当前运行的实验。关键是要正确设置以下环境变量或参数：\n- `TRAIN_DIR`, `VAL_DIR`, `TEST_DIR`: 分别指向训练、验证和测试数据的目录路径。\n- `MAX_DATA_CONTEXTS`: 每个样本保留的最大上下文数量（例如 1000）。\n- `MAX_CONTEXTS`: 模型每次迭代使用的最大上下文数（例如 200）。\n- `SUBTOKEN_VOCAB_SIZE` 和 `TARGET_VOCAB_SIZE`: 根据预处理后的数据统计结果进行设置。\n确保目录结构正确，且预处理脚本能从中提取出非空的数据文件。",{"id":146,"question_zh":147,"answer_zh":148,"source_url":140},36412,"训练过程中默认运行多少个 epoch？最终会生成几个模型文件？","默认的 epoch 数量取决于配置文件（通常在 `config.py` 或命令行参数中设定）。程序会在每个 epoch 结束后保存一个模型快照。如果您运行了 50 个 epoch，通常会生成 50 个对应的模型文件（除非配置了只保存最佳模型）。建议在训练完成后，根据验证集的表现选择效果最好的那个模型文件进行后续的预测或发布操作。",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},36413,"针对 C 语言代码，code2vec 的效果不好或无法预测怎么办？","如果在 C 语言代码上训练效果不佳（例如只训练了 2 个 epoch），建议大幅增加训练轮数。此外，由于 code2vec 的预测功能 (`--predict`) 内部依赖 Java 解析器，直接用于 C 代码可能会失败。对于 C 语言任务，维护者推荐使用专门为此设计的模型，如 Code-LMs（在 C 语言任务上表现优于 Codex），或者尝试 code2seq 模型。相关链接：https:\u002F\u002Fgithub.com\u002FVHellendoorn\u002FCode-LMs","https:\u002F\u002Fgithub.com\u002Ftech-srl\u002Fcode2vec\u002Fissues\u002F141",[]]