[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-microsoft--mattergen":3,"tool-microsoft--mattergen":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 
适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 
轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":29,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":108,"github_topics":109,"view_count":29,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":113,"updated_at":114,"faqs":115,"releases":145},970,"microsoft\u002Fmattergen","mattergen","Official implementation of MatterGen -- a generative model for inorganic materials design across the periodic table that can be fine-tuned to steer the generation towards a wide range of property constraints.","MatterGen 是一款用于设计无机材料的生成模型，能够覆盖整个元素周期表，并可根据多种材料特性进行微调，从而引导生成符合特定需求的新材料。它解决了传统材料设计中效率低、探索范围有限的问题，通过生成模型快速探索化学空间，帮助发现具有目标特性的新型材料。\n\n这款工具特别适合材料科学领域的研究人员和开发者使用。研究人员可以用它来探索新材料的生成，而开发者则可以基于其代码库进行扩展或集成到其他应用中。普通用户可能需要一定的技术背景才能充分利用其功能。\n\nMatterGen 的独特亮点在于其高度灵活性：不仅提供了一个基础的无条件生成模型，还支持针对化学系统、空间群、磁密度、带隙等多种材料属性的微调模型。此外，它还兼容主流硬件环境（如 CUDA 和 Apple Silicon），并提供了预训练模型和数据集，方便用户快速上手。无论是生成全新材料还是优化特定性能，MatterGen 都为材料设计领域带来了强大的助力。","\n\u003Ch1>\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_1ebb8cf99582.png\" alt=\"MatterGen logo\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\u003C\u002Fh1>\n\n\u003Ch4 
align=\"center\">\n\n[![DOI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.1038%2Fs41586--025--08628--5-blue)](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-025-08628-5)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2312.03687-blue.svg?logo=arxiv&logoColor=white.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03687)\n[![Requires Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-blue.svg?logo=python&logoColor=white)](https:\u002F\u002Fpython.org\u002Fdownloads)\n\u003C\u002Fh4>\n\nMatterGen is a generative model for inorganic materials design across the periodic table that can be fine-tuned to steer the generation towards a wide range of property constraints.\n\n\n## Table of Contents\n- [Installation](#installation)\n- [Get started with a pre-trained model](#get-started-with-a-pre-trained-model)\n- [Generating materials](#generating-materials)\n- [Evaluation](#evaluation)\n- [Train MatterGen yourself](#train-mattergen-yourself)\n- [Data release](#data-release)\n- [Citation](#citation)\n- [Trademarks](#trademarks)\n- [Responsible AI Transparency Documentation](#responsible-ai-transparency-documentation)\n- [Get in touch](#get-in-touch)\n\n## Installation\n\n\nThe easiest way to install prerequisites is via [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F), a fast Python package and project manager.\n\nThe MatterGen environment can be installed via the following command (assumes you are running Linux and have a CUDA GPU):\n```bash\npip install uv\nuv venv .venv --python 3.10 \nsource .venv\u002Fbin\u002Factivate\nuv pip install -e .\n```\n\nNote that our datasets and model checkpoints are provided inside this repo via [Git Large File Storage (LFS)](https:\u002F\u002Fgit-lfs.com\u002F).\nTo find out whether LFS is installed on your machine, run\n```bash\ngit lfs --version\n```\nIf this prints some version like `git-lfs\u002F3.0.2 (GitHub; linux amd64; go 1.18.1)`, you can skip the following 
step.\n\n### Install Git LFS\nIf Git LFS was not installed before you cloned this repo, you can install it via:\n```bash\nsudo apt install git-lfs\ngit lfs install\n```\n\n### Apple Silicon\n> [!WARNING]\n> Running MatterGen on Apple Silicon is **experimental**. Use at your own risk.  \n> Further, you need to run `export PYTORCH_ENABLE_MPS_FALLBACK=1` before any training or generation run.\n\n## Get started with a pre-trained model\nWe provide checkpoints of an unconditional base version of MatterGen as well as fine-tuned models for these properties:\n* `mattergen_base`: unconditional base model trained on Alex-MP-20\n* `mp_20_base`: unconditional base model trained on MP-20\n* `chemical_system`: fine-tuned model conditioned on chemical system\n* `space_group`: fine-tuned model conditioned on space group\n* `dft_mag_density`: fine-tuned model conditioned on magnetic density from DFT\n* `dft_band_gap`: fine-tuned model conditioned on band gap from DFT\n* `ml_bulk_modulus`: fine-tuned model conditioned on bulk modulus from ML predictor\n* `dft_mag_density_hhi_score`: fine-tuned model jointly conditioned on magnetic density from DFT and HHI score\n* `chemical_system_energy_above_hull`: fine-tuned model jointly conditioned on chemical system and energy above hull from DFT\n\nThe checkpoints are located at `checkpoints\u002F\u003Cmodel_name>` and are also available on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fmattergen). By default, they are downloaded from Huggingface when requested. You can also manually download them from Git LFS via \n```bash\ngit lfs pull -I checkpoints\u002F\u003Cmodel_name> --exclude=\"\" \n```\n\n> [!NOTE]\n> The checkpoints provided were re-trained using this repository, i.e., are not identical to the ones used in the paper. Hence, results may slightly deviate from those in the publication. 
\n\n## Generating materials\n### Unconditional generation\nTo sample from the pre-trained base model, run the following command.\n```bash\nexport MODEL_NAME=mattergen_base\nexport RESULTS_PATH=results\u002F  # Samples will be written to this directory\n\n# generate batch_size * num_batches samples\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --num_batches 1\n```\nThis script will write the following files into `$RESULTS_PATH`:\n* `generated_crystals_cif.zip`: a ZIP file containing a single `.cif` file per generated structure.\n* `generated_crystals.extxyz`, a single file containing the individual generated structures as frames.\n* If `--record-trajectories == True` (default): `generated_trajectories.zip`: a ZIP file containing a `.extxyz` file per generated structure, which contains the full denoising trajectory for each individual structure.\n> [!TIP]\n> For best efficiency, increase the batch size to the largest your GPU can sustain without running out of memory.\n\n> [!NOTE]\n> To sample from a model you've trained yourself, replace `--pretrained-name=$MODEL_NAME` with `--model_path=$MODEL_PATH`, filling in your model's location for `$MODEL_PATH`.\n### Property-conditioned generation\nWith a fine-tuned model, you can generate materials conditioned on a target property.\nFor example, to sample from the model trained on magnetic density, you can run the following command.\n```bash\nexport MODEL_NAME=dft_mag_density\nexport RESULTS_PATH=\"results\u002F$MODEL_NAME\u002F\"  # Samples will be written to this directory, e.g., `results\u002Fdft_mag_density`\n\n# Generate conditional samples with a target magnetic density of 0.15\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on=\"{'dft_mag_density': 0.15}\" --diffusion_guidance_factor=2.0\n```\n> [!TIP]\n> The argument `--diffusion-guidance-factor` corresponds to the $\\gamma$ parameter in [classifier-free diffusion 
guidance](https:\u002F\u002Fsander.ai\u002F2022\u002F05\u002F26\u002Fguidance.html). Setting it to zero corresponds to unconditional generation, and increasing it further tends to produce samples which adhere more to the input property values, though at the expense of diversity and realism of samples.\n\n### Multiple property-conditioned generation\nYou can also generate materials conditioned on more than one property. For instance, you can use the pre-trained model located at `checkpoints\u002Fchemical_system_energy_above_hull` to generate conditioned on chemical system and energy above the hull, or the model at `checkpoints\u002Fdft_mag_density_hhi_score` for joint conditioning on [HHI score](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHerfindahl%E2%80%93Hirschman_index) and magnetic density.\nAdapt the following command to your specific needs:\n```bash\nexport MODEL_NAME=chemical_system_energy_above_hull\nexport RESULTS_PATH=\"results\u002F$MODEL_NAME\u002F\"  # Samples will be written to this directory, e.g., `results\u002Fdft_mag_density`\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on=\"{'energy_above_hull': 0.05, 'chemical_system': 'Li-O'}\" --diffusion_guidance_factor=2.0\n```\n## Evaluation\n\nOnce you have generated a list of structures contained in `$RESULTS_PATH` (either using MatterGen or another method), you can relax the structures using the default MatterSim machine learning force field (see [repository](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattersim)) and compute novelty, uniqueness, stability (using energy estimated by MatterSim), and other metrics via the following command:\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_MP2020correction.gz --exclude=\"\"  # first download the MP2020 reference dataset from Git LFS\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' 
--save_as=\"$RESULTS_PATH\u002Fmetrics.json\"\n```\n\nIf you want to use the reference dataset while applying the TRI2024 correction scheme (recommended), instead run the following:\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_TRI2024correction.gz --exclude=\"\"  # first download the TRI2024 reference dataset\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as=\"$RESULTS_PATH\u002Fmetrics.json\" --reference_dataset_path=\"data-release\u002Falex-mp\u002Freference_TRI2024correction.gz\"\n```\n\nThis script will write `metrics.json` containing the metric results to `$RESULTS_PATH` and will print it to your console.\n> [!IMPORTANT]\n> The evaluation script in this repository uses [MatterSim](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattersim), a machine-learning force field (MLFF), to relax structures and assess their stability via MatterSim's predicted energies. While this is orders of magnitude faster than evaluation via density functional theory (DFT), requires no license to run, and typically has high accuracy, there are important caveats. (1) In the MatterGen publication we use DFT to evaluate structures generated by all models and baselines; (2) DFT is more accurate and reliable, particularly in less common chemical systems. Thus, evaluation results obtained with this code may differ from DFT evaluation, and we recommend confirming results obtained with MLFFs via DFT before drawing conclusions.\n\n> [!TIP]\n> By default, this uses `MatterSim-v1-1M`. If you would like to use the larger `MatterSim-v1-5M` model, you can add the `--potential_load_path=\"MatterSim-v1.0.0-5M.pth\"` argument. You may also check the [MatterSim repository](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattersim) for the latest version of the model. 
\n\n\nIf, instead, you have relaxed the structures and obtained the relaxed total energies via another mean (e.g., DFT), you can evaluate the metrics via:\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_MP2020correction.gz --exclude=\"\"  # first download the reference dataset from Git LFS\nmattergen-evaluate --structures_path=$RESULTS_PATH --energies_path='energies.npy' --relax=False --structure_matcher='disordered' --save_as='metrics'\n```\nThis script will try to read structures from disk in the following precedence order:\n* If `$RESULTS_PATH` points to a `.xyz` or `.extxyz` file, it will read it directly and assume each frame is a different structure.\n* If `$RESULTS_PATH` points to a `.zip` file containing `.cif` files, it will first extract and then read the cif files.\n* If `$RESULTS_PATH` points to a directory, it will read all `.cif`,  `.xyz`, or `.extxyz` files in the order they occur in `os.listdir`.\n\nHere, we expect `energies.npy` to be a numpy array with the entries being `float` energies in the same order as the structures read from `$RESULTS_PATH`.\n\n> [!IMPORTANT]\n> For any task beyond benchmarking against existing literature, we recommend using the TRI2024 correction scheme and reference dataset. 
To do so, run:\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_TRI2024correction.gz --exclude=\"\"  # first download the reference dataset from Git LFS\nmattergen-evaluate --structures_path=$RESULTS_PATH --energies_path='energies.npy' --relax=False --structure_matcher='disordered' --save_as='metrics' --energy_correction_scheme=\"TRI2024\" --reference_dataset_path=\"data-release\u002Falex-mp\u002Freference_TRI2024correction.gz\" \n```\n\nIf you want to save the relaxed structures, together with their energies, forces, and stresses, add `--structures_output_path=YOUR_PATH` to the script call, like so:\n```bash\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as='metrics' --structures_output_path=\"relaxed_structures.extxyz\"\n```\n\nIf you want to obtain per-structure metrics (e.g., `energy_above_hull` for every crystal rather than just the average), add `--save_detailed_as` to save a JSON file with per-structure values:\n```bash\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as='metrics.json' --save_detailed_as='detailed_metrics.json'\n```\nThe detailed metrics file contains per-structure values for `energy_above_hull`, `self_consistent_energy_above_hull`, `stability`, `novelty`, `uniqueness`, and other metrics.\n\n### Benchmark\nIn [`plot_benchmark_results.ipynb`](benchmark\u002Fplot_benchmark_results.ipynb) we provide a Jupyter notebook to generate figures like Figs. 2e and 2f in the paper. We further provide the metrics resulting from analyzing samples generated by several baselines under [`benchmark\u002Fmetrics`](benchmark\u002Fmetrics). You can add your own model's results by copying the metrics JSON file resulting from `mattergen-evaluate` into the same folder. 
Note, again, that these results were obtained via MatterSim relaxation and energies, so results will differ from those obtained via DFT (e.g., those in the paper).\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_27da50042f36.png\" alt=\"S.U.N. plot\" width=\"410\"\u002F>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_61c2e5a51bb8.png\" alt=\"RMSD plot\" width=\"410\"\u002F>\n\u003C\u002Fp>\nFor convenience, here are the **numerical results from Figs. 2e and 2f in the paper** (as well as Table D4 in the supplementary information):\n\nModel | % S.U.N. | RMSD | % Stable | % Unique | % Novel\n------|----------|------|----------|----------|--------|\nMatterGen | 38.57 | 0.021 | 74.41 | 100.0 | 61.96\nMatterGen MP20 | 22.27 | 0.110 | 42.19 | 100.0 | 75.44\nDiffCSP Alex-MP-20 | 33.27 | 0.104 | 63.33 | 99.90 | 66.94\nDiffCSP MP20 | 12.71 | 0.232 | 36.23 | 100.0 | 70.73\nCDVAE | 13.99 | 0.359 | 19.31 | 100.0 | 92.00 \nFTCP | 0.0 | 1.492 | 0.0 | 100.0 | 100.0\nG-SchNet | 0.98 | 1.347 | 1.63 | 100.0 | 98.23\nP-G-SchNet | 1.29 | 1.360 | 3.11 | 100.0 | 88.40\n\n### Evaluate using your own reference dataset\n\n> [!IMPORTANT]\n> If you are planning to use MatterSim to evaluate the stability of the generated structures, then the reference dataset you provide must contain energies\n> that are compatible with MatterSim, meaning they should be either DFT-computed energies calculated according to the Materials Project Compatibility scheme,\n> or energies directly computed with MatterSim.\n\nIf you want to use your own custom dataset for evaluation, you first need to serialize and save it as follows:\n\n``` python\nfrom mattergen.evaluation.reference.reference_dataset import ReferenceDataset\nfrom mattergen.evaluation.reference.reference_dataset_serializer import LMDBGZSerializer\n\n\nreference_dataset = 
ReferenceDataset.from_entries(name=\"my_reference_dataset\", entries=entries)\nLMDBGZSerializer().serialize(reference_dataset, \"path_to_file.gz\")\n```\n\nwhere `entries` is a list of `pymatgen.entries.computed_entries.ComputedStructureEntry` objects containing structure-energy pairs for each structure.\n\nBy default, we apply the MaterialsProject2020Compatibility energy correction scheme to all input structures during evaluation, and assume that the reference dataset \nhas already been pre-processed using the same compatibility scheme. \nTherefore, unless you have already done this, you should obtain the `entries` object for\nyour custom reference dataset in the following way:\n\n``` python\nfrom mattergen.evaluation.utils.vasprunlike import VasprunLike\nfrom pymatgen.entries.compatibility import MaterialsProject2020Compatibility\n\nentries = []\nfor structure, energy in zip(structures, energies):\n  vasprun_like = VasprunLike(structure=structure, energy=energy)\n  entries.append(vasprun_like.get_computed_entry(\n      inc_structure=True, energy_correction_scheme=MaterialsProject2020Compatibility()\n  ))\n```\n\n> [!NOTE]\n> Because of some known issues with the MaterialsProject2020Compatibility scheme, we recommend using the `TRI110Compatibility2024` reference dataset and correction scheme to evaluate stability of materials outside benchmarks.\nTo do so, run: \n``` python\nfrom mattergen.evaluation.utils.vasprunlike import VasprunLike\nfrom mattergen.evaluation.reference.correction_schemes import TRI110Compatibility2024\n\nentries = []\nfor structure, energy in zip(structures, energies):\n  vasprun_like = VasprunLike(structure=structure, energy=energy)\n  entries.append(vasprun_like.get_computed_entry(\n      inc_structure=True, energy_correction_scheme=TRI110Compatibility2024()\n  ))\n```\n\n\n## Train MatterGen yourself\nBefore we can train MatterGen from scratch, we have to unpack and preprocess the dataset files.\n\n### Pre-process a dataset for 
training\n\nYou can run the following command for `mp_20`:\n```bash\n# Download file from LFS\ngit lfs pull -I data-release\u002Fmp-20\u002F --exclude=\"\"\nunzip data-release\u002Fmp-20\u002Fmp_20.zip -d datasets\ncsv-to-dataset --csv-folder datasets\u002Fmp_20\u002F --dataset-name mp_20 --cache-folder datasets\u002Fcache\n```\nYou will get preprocessed data files in `datasets\u002Fcache\u002Fmp_20`.\n\nTo preprocess our larger `alex_mp_20` dataset, run:\n```bash\n# Download file from LFS\ngit lfs pull -I data-release\u002Falex-mp\u002Falex_mp_20.zip --exclude=\"\"\nunzip data-release\u002Falex-mp\u002Falex_mp_20.zip -d datasets\ncsv-to-dataset --csv-folder datasets\u002Falex_mp_20\u002F --dataset-name alex_mp_20 --cache-folder datasets\u002Fcache\n```\nThis will take some time (~1h). You will get preprocessed data files in `datasets\u002Fcache\u002Falex_mp_20`.\n\n### Training\nYou can train the MatterGen base model on `mp_20` using the following command.\n\n```bash\nmattergen-train data_module=mp_20 ~trainer.logger\n```\n> [!NOTE]\n> For Apple Silicon training, add `~trainer.strategy trainer.accelerator=mps` to the above command.\n\nThe validation loss (`loss_val`) should reach 0.4 after 360 epochs (about 80k steps). The output checkpoints can be found at `outputs\u002Fsinglerun\u002F${now:%Y-%m-%d}\u002F${now:%H-%M-%S}`. We call this folder `$MODEL_PATH` for future reference. \n> [!NOTE]\n> We use [`hydra`](https:\u002F\u002Fhydra.cc\u002Fdocs\u002Fintro\u002F) to configure our training and sampling jobs. The hierarchical configuration can be found under [`mattergen\u002Fconf`](mattergen\u002Fconf). In the following we make use of `hydra`'s config overrides to update these configs via the CLI. 
See the `hydra` [documentation](https:\u002F\u002Fhydra.cc\u002Fdocs\u002Fadvanced\u002Foverride_grammar\u002Fbasic\u002F) for an introduction to the config override syntax.\n\n> [!TIP]\n> By default, we disable Weights & Biases (W&B) logging via the `~trainer.logger` config override. You can enable it by removing this override. In [`mattergen\u002Fconf\u002Ftrainer\u002Fdefault.yaml`](mattergen\u002Fconf\u002Ftrainer\u002Fdefault.yaml), you may enter your W&B logging info or specify your own logger.\n\nTo train the MatterGen base model on `alex_mp_20`, use the following command:\n```bash\nmattergen-train data_module=alex_mp_20 ~trainer.logger trainer.accumulate_grad_batches=4\n```\n> [!NOTE]\n> For Apple Silicon training, add `~trainer.strategy trainer.accelerator=mps` to the above command.\n\n> [!TIP]\n> Note that a single GPU's memory usually is not enough for the batch size of 512, hence we accumulate gradients over 4 batches. If you still run out of memory, increase this further.\n\n#### Crystal structure prediction\nEven though not a focus of our paper, you can also train MatterGen in crystal structure prediction (CSP) mode, where it does not denoise the atom types during generation. \nThis gives you the ability to condition on a specific chemical formula for generation. You can train MatterGen in this mode by passing `--config-name=csp` to `run.py`.\n\nTo sample from this model, pass `--target_compositions=['{\"\u003Celement1>\": \u003Cnumber_of_element1_atoms>, \"\u003Celement2>\": \u003Cnumber_of_element2_atoms>, ..., \"\u003CelementN>\": \u003Cnumber_of_elementN_atoms>}'] --sampling-config-name=csp` to `generate.py`. 
\nAn example composition could be `--target_compositions=['{\"Na\": 1, \"Cl\": 1}']`.\n### Fine-tuning on property data\n\nYou can fine-tune the MatterGen base model using the following command.\n\n```bash\nexport PROPERTY=dft_mag_density\nmattergen-finetune adapter.pretrained_name=mattergen_base data_module=mp_20 +lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY=$PROPERTY ~trainer.logger data_module.properties=[\"$PROPERTY\"]\n```\n`dft_mag_density` denotes the target property for fine-tuning. You can also fine-tune a model you've trained yourself by **replacing** `adapter.pretrained_name=mattergen_base` with `adapter.model_path=$MODEL_PATH`, filling in your model's location for `$MODEL_PATH`.\n> [!NOTE]\n> For Apple Silicon training, add `~trainer.strategy trainer.accelerator=mps` to the above command.\n\n\n> [!TIP]\n> You can select any property that is available in the dataset. See [`mattergen\u002Fconf\u002Fdata_module\u002Fmp_20.yaml`](mattergen\u002Fconf\u002Fdata_module\u002Fmp_20.yaml) or [`mattergen\u002Fconf\u002Fdata_module\u002Falex_mp_20.yaml`](mattergen\u002Fconf\u002Fdata_module\u002Falex_mp_20.yaml) for the list of supported properties. You can also add your own custom property data. See [below](#fine-tune-on-your-own-property-data) for instructions.\n\n#### Multi-property fine-tuning\nYou can also fine-tune MatterGen on multiple properties. 
For instance, to fine-tune it on `dft_mag_density` and `dft_band_gap`, you can use the following command.\n\n```bash\nexport PROPERTY1=dft_mag_density\nexport PROPERTY2=dft_band_gap \nexport MODEL_NAME=mattergen_base\nmattergen-finetune adapter.pretrained_name=$MODEL_NAME data_module=mp_20 +lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY1=$PROPERTY1 +lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY2=$PROPERTY2 ~trainer.logger data_module.properties=[\"$PROPERTY1\",\"$PROPERTY2\"]\n```\n> [!TIP]\n> Add more properties analogously by adding these overrides:\n> 1. `+lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.\u003Cmy_property>=\u003Cmy_property>`\n> 2. Add `\u003Cmy_property>` to the `data_module.properties=[\"$PROPERTY1\",\"$PROPERTY2\",...,\u003Cmy_property>]` override.\n\n> [!NOTE]\n> For Apple Silicon training, add `~trainer.strategy trainer.accelerator=mps` to the above command.\n\n#### Fine-tune on your own property data\nYou may also fine-tune MatterGen on your own property data. Essentially what you need is a property value (typically `float`) for a subset of the data you want to train on (e.g., `alex_mp_20`). Proceed as follows:\n1. Add the name of your property to the `PROPERTY_SOURCE_IDS` list inside [`mattergen\u002Fcommon\u002Futils\u002Fglobals.py`](mattergen\u002Fcommon\u002Futils\u002Fglobals.py).\n2. Add a new column with this name to the dataset(s) you want to train on, e.g., `datasets\u002Falex_mp_20\u002Ftrain.csv` and `datasets\u002Falex_mp_20\u002Fval.csv` (requires you to have followed the [pre-processing steps](#pre-process-a-dataset-for-training)).\n3. 
Re-run the CSV to dataset script `csv-to-dataset --csv-folder datasets\u002F\u003CMY_DATASET>\u002F --dataset-name \u003CMY_DATASET> --cache-folder datasets\u002Fcache`, substituting your dataset name for `MY_DATASET`.\n4. Add a `\u003Cyour_property>.yaml` config file to [`mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings). If you are adding a float-valued property, you may copy an existing configuration, e.g., [`dft_mag_density.yaml`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings\u002Fdft_mag_density.yaml). More complicated properties will require you to create your own custom `PropertyEmbedding` subclass, e.g., see the [`space_group`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings\u002Fspace_group.yaml) or [`chemical_system`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings\u002Fchemical_system.yaml) configs.\n5. Follow the [instructions for fine-tuning](#fine-tuning-on-property-data) and reference your own property in the same way as we used the existing properties like `dft_mag_density`.\n\n## Data release\nWe provide datasets to train as well as evaluate MatterGen. For more details and license information see the respective README files under [`data-release`](data-release).\n### Training datasets\n* MP-20 ([Jain et al., 2013](https:\u002F\u002Fpubs.aip.org\u002Faip\u002Fapm\u002Farticle\u002F1\u002F1\u002F011002\u002F119685)): contains 45k general inorganic materials, including most experimentally known materials with no more than 20 atoms in unit cell.\n* Alex-MP-20: Training dataset consisting of around 600k structures from MP-20 and Alexandria ([Schmidt et al. 
2022](https:\u002F\u002Farchive.materialscloud.org\u002Frecord\u002F2022.126)) with at most 20 atoms inside the unit cell and below 0.1 eV\u002Fatom of the convex hull. See the venn diagram below and the MatterGen paper for more details.\n\n### Reference dataset\nWe further provide the Alex-MP reference dataset which can be used to evaluate novelty and stability of generated samples. \nThe reference set contains 845,997 structures with their DFT energies. See the following Venn diagram for more details about the composition of the training and reference datasets.\n> [!NOTE]\n> For license reasons, we cannot share the 4.4k ordered + 117.7k disordered ICSD structures, so results may differ from those in the paper. \n\n![Dataset Venn diagram](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_e726c906263f.png)\n\n### CIFs and experimental measurements\nThe [`data-release`](data-release) directory also contains the CIF files to all structures shown in the paper as well as xps, xrd, and nanoindentation measurements of the TaCr2O6 sample presented in the paper.\n\n## Citation\nIf you are using our code, model, data, or evaluation pipeline, please consider citing our work:\n```bibtex\n@article{MatterGen2025,\n  author  = {Zeni, Claudio and Pinsler, Robert and Z{\\\"u}gner, Daniel and Fowler, Andrew and Horton, Matthew and Fu, Xiang and Wang, Zilong and Shysheya, Aliaksandra and Crabb{\\'e}, Jonathan and Ueda, Shoko and Sordillo, Roberto and Sun, Lixin and Smith, Jake and Nguyen, Bichlien and Schulz, Hannes and Lewis, Sarah and Huang, Chin-Wei and Lu, Ziheng and Zhou, Yichi and Yang, Han and Hao, Hongxia and Li, Jielan and Yang, Chunlei and Li, Wenjie and Tomioka, Ryota and Xie, Tian},\n  journal = {Nature},\n  title   = {A generative model for inorganic materials design},\n  year    = {2025},\n  doi     = {10.1038\u002Fs41586-025-08628-5},\n}\n```\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or 
services.\nAuthorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks\u002Fusage\u002Fgeneral).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.\n\n## Responsible AI Transparency Documentation\n\nThe responsible AI transparency documentation can be found [here](MODEL_CARD.md).\n\n## Get in touch\nIf you have any questions not covered here, please ask a question in the Q&A section of Discussions.\nIf you want to report a bug or propose a feature, create an Issue using the template and\u002For open a pull request.\n","\u003Ch1>\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_1ebb8cf99582.png\" alt=\"MatterGen 标志\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\u003C\u002Fh1>\n\n\u003Ch4 align=\"center\">\n\n[![DOI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.1038%2Fs41586--025--08628--5-blue)](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-025-08628-5)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2312.03687-blue.svg?logo=arxiv&logoColor=white.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03687)\n[![需要 Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-blue.svg?logo=python&logoColor=white)](https:\u002F\u002Fpython.org\u002Fdownloads)\n\u003C\u002Fh4>\n\nMatterGen 是一个用于整个元素周期表范围内无机材料设计的生成模型，可以微调以引导生成满足各种属性约束的材料。\n\n\n## 目录\n- [安装](#installation)\n- [使用预训练模型开始](#get-started-with-a-pre-trained-model)\n- [生成材料](#generating-materials)\n- [评估](#evaluation)\n- [自行训练 MatterGen](#train-mattergen-yourself)\n- [数据发布](#data-release)\n- [引用](#citation)\n- [商标](#trademarks)\n- 
[负责任的人工智能透明文档](#responsible-ai-transparency-documentation)\n- [联系我们](#get-in-touch)\n\n## 安装\n\n\n安装依赖项的最简单方法是通过 [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)，这是一个快速的 Python 包和项目管理器。\n\n可以通过以下命令安装 MatterGen 环境（假设您正在运行 Linux 并且拥有 CUDA GPU）：\n```bash\npip install uv\nuv venv .venv --python 3.10 \nsource .venv\u002Fbin\u002Factivate\nuv pip install -e .\n```\n\n请注意，我们的数据集和模型检查点通过 [Git Large File Storage (LFS)](https:\u002F\u002Fgit-lfs.com\u002F) 提供在此存储库中。\n要检查您的机器上是否安装了 LFS，请运行：\n```bash\ngit lfs --version\n```\n如果输出类似 `git-lfs\u002F3.0.2 (GitHub; linux amd64; go 1.18.1)` 的版本信息，则可以跳过以下步骤。\n\n### 安装 Git LFS\n如果您在克隆此存储库之前未安装 Git LFS，可以通过以下方式安装：\n```bash\nsudo apt install git-lfs\ngit lfs install\n```\n\n### Apple Silicon\n> [!警告]\n> 在 Apple Silicon 上运行 MatterGen 是**实验性的**。请自行承担风险使用。  \n> 此外，在任何训练或生成运行之前，您需要运行 `export PYTORCH_ENABLE_MPS_FALLBACK=1`。\n\n## 使用预训练模型开始\n我们提供了 MatterGen 的无条件基础版本的检查点，以及针对这些属性微调的模型：\n* `mattergen_base`: 基于 Alex-MP-20 训练的无条件基础模型\n* `mp_20_base`: 基于 MP-20 训练的无条件基础模型\n* `chemical_system`: 基于化学系统条件微调的模型\n* `space_group`: 基于空间群条件微调的模型\n* `dft_mag_density`: 基于 DFT 磁密度条件微调的模型\n* `dft_band_gap`: 基于 DFT 带隙条件微调的模型\n* `ml_bulk_modulus`: 基于 ML 预测器预测的体积模量条件微调的模型\n* `dft_mag_density_hhi_score`: 基于 DFT 磁密度和 HHI 分数联合条件微调的模型\n* `chemical_system_energy_above_hull`: 基于化学系统和 DFT 能量高于凸包联合条件微调的模型\n\n检查点位于 `checkpoints\u002F\u003Cmodel_name>`，也可以在 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fmattergen) 上获取。默认情况下，它们会在请求时从 Huggingface 下载。您也可以通过以下命令手动从 Git LFS 下载：\n```bash\ngit lfs pull -I checkpoints\u002F\u003Cmodel_name> --exclude=\"\" \n```\n\n> [!注意]\n> 提供的检查点是使用此存储库重新训练的，因此与论文中使用的检查点不完全相同。结果可能会略有偏差。\n\n## 生成材料\n### 无条件生成\n要从预训练的基础模型中采样，请运行以下命令：\n```bash\nexport MODEL_NAME=mattergen_base\nexport RESULTS_PATH=results\u002F  # 样本将写入此目录\n\n# 生成 batch_size * num_batches 个样本\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --num_batches 1\n```\n该脚本将在 `$RESULTS_PATH` 中写入以下文件：\n* `generated_crystals_cif.zip`: 
包含每个生成结构的单个 `.cif` 文件的 ZIP 文件。\n* `generated_crystals.extxyz`: 包含各个生成结构作为帧的单个文件。\n* 如果 `--record-trajectories == True`（默认值）：`generated_trajectories.zip`: 包含每个生成结构的完整去噪轨迹的 `.extxyz` 文件的 ZIP 文件。\n> [!提示]\n> 为了获得最佳效率，请将批处理大小增加到您的 GPU 可以承受的最大值，而不会耗尽内存。\n\n> [!注意]\n> 要从您自己训练的模型中采样，请将 `--pretrained-name=$MODEL_NAME` 替换为 `--model_path=$MODEL_PATH`，并填写您的模型位置以替换 `$MODEL_PATH`。\n### 属性条件生成\n使用微调模型，您可以根据目标属性生成材料。\n例如，要从基于磁密度训练的模型中采样，可以运行以下命令：\n```bash\nexport MODEL_NAME=dft_mag_density\nexport RESULTS_PATH=\"results\u002F$MODEL_NAME\u002F\"  # 样本将写入此目录，例如 `results\u002Fdft_mag_density`\n\n# 使用目标磁密度为 0.15 生成条件样本\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on=\"{'dft_mag_density': 0.15}\" --diffusion_guidance_factor=2.0\n```\n> [!提示]\n> 参数 `--diffusion-guidance-factor` 对应于 [classifier-free diffusion guidance](https:\u002F\u002Fsander.ai\u002F2022\u002F05\u002F26\u002Fguidance.html) 中的 $\\gamma$ 参数。将其设置为零对应于无条件生成，进一步增加它往往会生成更符合输入属性值的样本，但会牺牲样本的多样性和真实性。\n\n### 多属性条件生成\n您还可以根据多个属性生成材料。例如，您可以使用位于 `checkpoints\u002Fchemical_system_energy_above_hull` 的预训练模型，根据化学系统和能量高于凸包进行生成，或者使用位于 `checkpoints\u002Fdft_mag_density_hhi_score` 的模型，根据 [HHI 分数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHerfindahl%E2%80%93Hirschman_index) 和磁密度进行联合条件生成。\n根据您的具体需求调整以下命令：\n```bash\nexport MODEL_NAME=chemical_system_energy_above_hull\nexport RESULTS_PATH=\"results\u002F$MODEL_NAME\u002F\"  # 样本将写入此目录，例如 `results\u002Fdft_mag_density`\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on=\"{'energy_above_hull': 0.05, 'chemical_system': 'Li-O'}\" --diffusion_guidance_factor=2.0\n```\n\n## 评估\n\n一旦你生成了一个包含在 `$RESULTS_PATH` 中的结构列表（无论是使用 MatterGen 还是其他方法），你可以使用默认的 MatterSim 机器学习力场（MLFF，Machine Learning Force Field，请参见 [仓库](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattersim)）对这些结构进行弛豫，并通过以下命令计算新颖性、独特性、稳定性（使用 MatterSim 估计的能量）及其他指标：\n```bash\ngit lfs pull -I 
data-release\u002Falex-mp\u002Freference_MP2020correction.gz --exclude=\"\"  # 首先从 Git LFS 下载 MP2020 参考数据集\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as=\"$RESULTS_PATH\u002Fmetrics.json\"\n```\n\n如果你想在应用 TRI2024 校正方案时使用参考数据集（推荐），请改为运行以下命令：\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_TRI2024correction.gz --exclude=\"\"  # 下载 TRI2024 参考数据集\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as=\"$RESULTS_PATH\u002Fmetrics.json\" --reference_dataset_path=\"data-release\u002Falex-mp\u002Freference_TRI2024correction.gz\"\n```\n\n该脚本会将包含指标结果的 `metrics.json` 写入 `$RESULTS_PATH` 并将其打印到你的控制台。\n> [!重要]\n> 此存储库中的评估脚本使用了 [MatterSim](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattersim)，一种机器学习力场（MLFF），用于弛豫结构并通过 MatterSim 预测的能量评估其稳定性。虽然这种方法比基于密度泛函理论（DFT，Density Functional Theory）的评估快几个数量级，无需许可证即可运行评估，并且通常具有高精度，但仍有一些重要的注意事项：(1) 在 MatterGen 论文中，我们使用 DFT 来评估所有模型和基线生成的结构；(2) DFT 更加准确和可靠，尤其是在不太常见的化学体系中。因此，使用此评估代码获得的评估结果可能与 DFT 评估的结果不同；我们建议在得出结论之前，使用 DFT 确认通过 MLFF 获得的结果。\n\n> [!提示]\n> 默认情况下，这会使用 `MatterSim-v1-1M`。如果你想使用更大的 `MatterSim-v1-5M` 模型，可以添加 `--potential_load_path=\"MatterSim-v1.0.0-5M.pth\"` 参数。你还可以查看 [MatterSim 仓库](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattersim) 获取模型的最新版本。\n\n\n如果相反，你已经通过其他方式（例如 DFT）弛豫了结构并获得了弛豫后的总能量，则可以通过以下方式评估指标：\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_MP2020correction.gz --exclude=\"\"  # 首先从 Git LFS 下载参考数据集\nmattergen-evaluate --structures_path=$RESULTS_PATH --energies_path='energies.npy' --relax=False --structure_matcher='disordered' --save_as='metrics'\n```\n该脚本将尝试按照以下优先顺序从磁盘读取结构：\n* 如果 `$RESULTS_PATH` 指向一个 `.xyz` 或 `.extxyz` 文件，它将直接读取该文件，并假设每一帧是一个不同的结构。\n* 如果 `$RESULTS_PATH` 指向一个包含 `.cif` 文件的 `.zip` 文件，它将首先解压然后读取 cif 文件。\n* 如果 `$RESULTS_PATH` 指向一个目录，它将按照 `os.listdir` 中出现的顺序读取所有的 `.cif`、`.xyz` 或 `.extxyz` 文件。\n\n在这里，我们期望 `energies.npy` 是一个 NumPy 数组，其中的条目为 `float` 类型的能量，顺序与从 
`$RESULTS_PATH` 读取的结构相同。\n\n> [!重要]\n> 对于任何超出与现有文献进行基准对比范围的任务，我们建议使用 TRI2024 校正方案和参考数据集。为此，请运行：\n```bash\ngit lfs pull -I data-release\u002Falex-mp\u002Freference_TRI2024correction.gz --exclude=\"\"  # 首先从 Git LFS 下载参考数据集\nmattergen-evaluate --structures_path=$RESULTS_PATH --energies_path='energies.npy' --relax=False --structure_matcher='disordered' --save_as='metrics' --energy_correction_scheme=\"TRI2024\" --reference_dataset_path=\"data-release\u002Falex-mp\u002Freference_TRI2024correction.gz\" \n```\n\n如果你想保存弛豫后的结构及其能量、力和应力，可以在脚本调用中添加 `--structures_output_path=YOUR_PATH`，如下所示：\n```bash\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as='metrics' --structures_output_path=\"relaxed_structures.extxyz\"\n```\n\n如果你想获取每个结构的指标（例如，每个晶体的 `energy_above_hull` 而不仅仅是平均值），可以添加 `--save_detailed_as` 来保存包含每个结构值的 JSON 文件：\n```bash\nmattergen-evaluate --structures_path=$RESULTS_PATH --relax=True --structure_matcher='disordered' --save_as='metrics.json' --save_detailed_as='detailed_metrics.json'\n```\n详细的指标文件包含每个结构的 `energy_above_hull`、`self_consistent_energy_above_hull`、`stability`、`novelty`、`uniqueness` 和其他指标的值。\n\n### 基准测试\n在 [`plot_benchmark_results.ipynb`](benchmark\u002Fplot_benchmark_results.ipynb) 中，我们提供了一个 Jupyter 笔记本来生成类似于论文中图 2e 和 2f 的图表。我们进一步提供了分析多个基线生成样本所得的指标结果，位于 [`benchmark\u002Fmetrics`](benchmark\u002Fmetrics)。你可以通过将 `mattergen-evaluate` 生成的指标 JSON 文件复制到同一文件夹来添加你自己模型的结果。再次提醒，这些结果是通过 MatterSim 弛豫和能量获得的，因此将与通过 DFT 获得的结果（例如论文中的结果）不同。\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_27da50042f36.png\" alt=\"S.U.N. 图表\" width=\"410\"\u002F>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_61c2e5a51bb8.png\" alt=\"RMSD 图表\" width=\"410\"\u002F>\n\u003C\u002Fp>\n为了方便起见，以下是**论文中图 2e 和 2f 的数值结果**（以及补充信息中的表 D4）：\n\n模型 | % S.U.N. 
| RMSD | % 稳定 | % 独特 | % 新颖\n------|----------|------|----------|----------|--------\nMatterGen | 38.57 | 0.021 | 74.41 | 100.0 | 61.96\nMatterGen MP20 | 22.27 | 0.110 | 42.19 | 100.0 | 75.44\nDiffCSP Alex-MP-20 | 33.27 | 0.104 | 63.33 | 99.90 | 66.94\nDiffCSP MP20 | 12.71 | 0.232 | 36.23 | 100.0 | 70.73\nCDVAE | 13.99 | 0.359 | 19.31 | 100.0 | 92.00 \nFTCP | 0.0 | 1.492 | 0.0 | 100.0 | 100.0\nG-SchNet | 0.98 | 1.347 | 1.63 | 100.0 | 98.23\nP-G-SchNet | 1.29 | 1.360 | 3.11 | 100.0 | 88.40\n\n### 使用您自己的参考数据集进行评估\n\n> [!重要]\n> 如果您计划使用 MatterSim（物质模拟器）来评估生成结构的稳定性，那么您提供的参考数据集必须包含与 MatterSim 兼容的能量值，\n> 这意味着这些能量值应为根据 Materials Project（材料项目）兼容性方案计算的 DFT（密度泛函理论）能量，或者是直接通过 MatterSim 计算的能量。\n\n如果您想使用自定义数据集进行评估，则首先需要将其序列化并保存，操作如下：\n\n``` python\nfrom mattergen.evaluation.reference.reference_dataset import ReferenceDataset\nfrom mattergen.evaluation.reference.reference_dataset_serializer import LMDBGZSerializer\n\n\nreference_dataset = ReferenceDataset.from_entries(name=\"my_reference_dataset\", entries=entries)\nLMDBGZSerializer().serialize(reference_dataset, \"path_to_file.gz\")\n```\n\n其中 `entries` 是一个包含结构-能量对的 `pymatgen.entries.computed_entries.ComputedStructureEntry` 对象列表。\n\n默认情况下，我们在评估过程中会对所有输入结构应用 MaterialsProject2020Compatibility 能量校正方案，并假设参考数据集已使用相同的兼容性方案进行了预处理。\n因此，除非您已经完成了此步骤，否则应按以下方式获取自定义参考数据集的 `entries` 对象：\n\n``` python\nfrom mattergen.evaluation.utils.vasprunlike import VasprunLike\nfrom pymatgen.entries.compatibility import MaterialsProject2020Compatibility\n\nentries = []\nfor structure, energy in zip(structures, energies):\n  vasprun_like = VasprunLike(structure=structure, energy=energy)\n  entries.append(vasprun_like.get_computed_entry(\n      inc_structure=True, energy_correction_scheme=MaterialsProject2020Compatibility()\n  ))\n```\n\n> [!注意]\n> 由于 MaterialsProject2020Compatibility 方案存在一些已知问题，我们建议使用 `TRI110Compatibility2024` 参考数据集和校正方案来评估基准测试之外材料的稳定性。\n为此，请运行以下代码：\n``` python\nfrom mattergen.evaluation.utils.vasprunlike import VasprunLike\nfrom 
mattergen.evaluation.reference.correction_schemes import TRI110Compatibility2024\n\nentries = []\nfor structure, energy in zip(structures, energies):\n  vasprun_like = VasprunLike(structure=structure, energy=energy)\n  entries.append(vasprun_like.get_computed_entry(\n      inc_structure=True, energy_correction_scheme=TRI110Compatibility2024()\n  ))\n```\n\n\n## 自行训练 MatterGen\n在从头开始训练 MatterGen 之前，我们需要解压并预处理数据集文件。\n\n### 预处理用于训练的数据集\n\n您可以为 `mp_20` 数据集运行以下命令：\n```bash\n# 从 LFS 下载文件\ngit lfs pull -I data-release\u002Fmp-20\u002F --exclude=\"\"\nunzip data-release\u002Fmp-20\u002Fmp_20.zip -d datasets\ncsv-to-dataset --csv-folder datasets\u002Fmp_20\u002F --dataset-name mp_20 --cache-folder datasets\u002Fcache\n```\n您将在 `datasets\u002Fcache\u002Fmp_20` 中获得预处理后的数据文件。\n\n要预处理我们的较大数据集 `alex_mp_20`，请运行：\n```bash\n# 从 LFS 下载文件\ngit lfs pull -I data-release\u002Falex-mp\u002Falex_mp_20.zip --exclude=\"\"\nunzip data-release\u002Falex-mp\u002Falex_mp_20.zip -d datasets\ncsv-to-dataset --csv-folder datasets\u002Falex_mp_20\u002F --dataset-name alex_mp_20 --cache-folder datasets\u002Fcache\n```\n这将花费一些时间（约 1 小时）。您将在 `datasets\u002Fcache\u002Falex_mp_20` 中获得预处理后的数据文件。\n\n### 训练\n您可以使用以下命令在 `mp_20` 数据集上训练 MatterGen 基础模型：\n\n```bash\nmattergen-train data_module=mp_20 ~trainer.logger\n```\n> [!注意]\n> 对于 Apple Silicon（苹果芯片）训练，请在上述命令中添加 `~trainer.strategy trainer.accelerator=mps`。\n\n验证损失 (`loss_val`) 应在 360 个 epoch（约 80k 步）后达到 0.4。输出检查点可以在 `outputs\u002Fsinglerun\u002F${now:%Y-%m-%d}\u002F${now:%H-%M-%S}` 找到。我们称此文件夹为 `$MODEL_PATH` 以供将来引用。\n> [!注意]\n> 我们使用 [`hydra`](https:\u002F\u002Fhydra.cc\u002Fdocs\u002Fintro\u002F) 来配置训练和采样任务。分层配置可以在 [`mattergen\u002Fconf`](mattergen\u002Fconf) 下找到。接下来我们将使用 `hydra` 的配置覆盖功能通过 CLI 更新这些配置。有关配置覆盖语法的介绍，请参阅 `hydra` [文档](https:\u002F\u002Fhydra.cc\u002Fdocs\u002Fadvanced\u002Foverride_grammar\u002Fbasic\u002F)。\n\n> [!提示]\n> 默认情况下，我们通过 `~trainer.logger` 配置覆盖禁用 Weights & Biases (W&B) 日志记录。您可以通过移除此覆盖来启用它。在 
[`mattergen\u002Fconf\u002Ftrainer\u002Fdefault.yaml`](mattergen\u002Fconf\u002Ftrainer\u002Fdefault.yaml) 中，您可以输入您的 W&B 日志信息或指定您自己的日志记录器。\n\n要在 `alex_mp_20` 数据集上训练 MatterGen 基础模型，请使用以下命令：\n```bash\nmattergen-train data_module=alex_mp_20 ~trainer.logger trainer.accumulate_grad_batches=4\n```\n> [!注意]\n> 对于 Apple Silicon 训练，请在上述命令中添加 `~trainer.strategy trainer.accelerator=mps`。\n\n> [!提示]\n> 请注意，单个 GPU 的内存通常不足以支持 512 的批量大小，因此我们在 4 个批次上累积梯度。如果仍然内存不足，请进一步增加此值。\n\n#### 晶体结构预测\n尽管这不是我们论文的重点，您也可以在晶体结构预测（CSP，Crystal Structure Prediction）模式下训练 MatterGen，在这种模式下，它在生成过程中不会去噪原子类型。\n这使您能够基于特定化学式进行生成。您可以通过向 `run.py` 传递 `--config-name=csp` 来在此模式下训练 MatterGen。\n\n要从此模型中采样，请向 `generate.py` 传递 `--target_compositions=['{\"\u003Celement1>\": \u003Cnumber_of_element1_atoms>, \"\u003Celement2>\": \u003Cnumber_of_element2_atoms>, ..., \"\u003CelementN>\": \u003Cnumber_of_elementN_atoms>}'] --sampling-config-name=csp`。\n例如，目标组成可以是 `--target_compositions=['{\"Na\": 1, \"Cl\": 1}']`。\n\n### 在属性数据上进行微调\n\n你可以使用以下命令对 MatterGen 基础模型（MatterGen base model）进行微调。\n\n```bash\nexport PROPERTY=dft_mag_density\nmattergen-finetune adapter.pretrained_name=mattergen_base data_module=mp_20 +lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY=$PROPERTY ~trainer.logger data_module.properties=[\"$PROPERTY\"]\n```\n`dft_mag_density` 表示用于微调的目标属性。你也可以通过**替换** `adapter.pretrained_name=mattergen_base` 为 `adapter.model_path=$MODEL_PATH` 来微调你自己训练的模型，将 `$MODEL_PATH` 替换为你模型的实际路径。\n> [!注意]\n> 如果在 Apple Silicon 上进行训练，请在上述命令中添加 `~trainer.strategy trainer.accelerator=mps`。\n\n\n> [!提示]\n> 你可以选择数据集中可用的任何属性。支持的属性列表请参见 [`mattergen\u002Fconf\u002Fdata_module\u002Fmp_20.yaml`](mattergen\u002Fconf\u002Fdata_module\u002Fmp_20.yaml) 或 [`mattergen\u002Fconf\u002Fdata_module\u002Falex_mp_20.yaml`](mattergen\u002Fconf\u002Fdata_module\u002Falex_mp_20.yaml)。你还可以添加自己的自定义属性数据。有关说明请参见[下方](#在你自己的属性数据上进行微调)。\n\n#### 多属性微调\n你还可以对多个属性进行 MatterGen 微调。例如，要对 
`dft_mag_density` 和 `dft_band_gap` 进行微调，可以使用以下命令。\n\n```bash\nexport PROPERTY1=dft_mag_density\nexport PROPERTY2=dft_band_gap \nexport MODEL_NAME=mattergen_base\nmattergen-finetune adapter.pretrained_name=$MODEL_NAME data_module=mp_20 +lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY1=$PROPERTY1 +lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.$PROPERTY2=$PROPERTY2 ~trainer.logger data_module.properties=[\"$PROPERTY1\",\"$PROPERTY2\"]\n```\n> [!提示]\n> 按照以下方式添加更多属性：\n> 1. 添加覆盖项：`+lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@adapter.adapter.property_embeddings_adapt.\u003Cmy_property>=\u003Cmy_property>`\n> 2. 将 `\u003Cmy_property>` 添加到 `data_module.properties=[\"$PROPERTY1\",\"$PROPERTY2\",...,\u003Cmy_property>]` 覆盖项中。\n\n> [!注意]\n> 如果在 Apple Silicon 上进行训练，请在上述命令中添加 `~trainer.strategy trainer.accelerator=mps`。\n\n#### 在你自己的属性数据上进行微调\n你还可以在自己的属性数据上对 MatterGen 进行微调。本质上，你需要的是一个子集数据的属性值（通常是 `float` 类型），例如 `alex_mp_20`。按照以下步骤操作：\n1. 将你的属性名称添加到 [`mattergen\u002Fcommon\u002Futils\u002Fglobals.py`](mattergen\u002Fcommon\u002Futils\u002Fglobals.py) 文件中的 `PROPERTY_SOURCE_IDS` 列表中。\n2. 在你要训练的数据集（例如 `datasets\u002Falex_mp_20\u002Ftrain.csv` 和 `datasets\u002Falex_mp_20\u002Fval.csv`）中添加一个以此名称命名的新列（需要完成[预处理步骤](#预处理用于训练的数据集)）。\n3. 重新运行 CSV 到数据集脚本 `csv-to-dataset --csv-folder datasets\u002F\u003CMY_DATASET>\u002F --dataset-name \u003CMY_DATASET> --cache-folder datasets\u002Fcache`，将你的数据集名称替换为 `MY_DATASET`。\n4. 
在 [`mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings) 中添加一个 `\u003Cyour_property>.yaml` 配置文件。如果你添加的是浮点值属性，可以复制现有的配置文件，例如 [`dft_mag_density.yaml`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings\u002Fdft_mag_density.yaml)。更复杂的属性则需要创建自己的自定义 `PropertyEmbedding` 子类，例如参考 [`space_group`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings\u002Fspace_group.yaml) 或 [`chemical_system`](mattergen\u002Fconf\u002Flightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings\u002Fchemical_system.yaml) 配置。\n5. 按照[属性数据微调说明](#在属性数据上进行微调)，并以与现有属性（如 `dft_mag_density`）相同的方式引用你自己的属性。\n\n## 数据发布\n我们提供了用于训练和评估 MatterGen 的数据集。更多详细信息和许可信息，请参阅 [`data-release`](data-release) 下的相应 README 文件。\n### 训练数据集\n* MP-20 ([Jain et al., 2013](https:\u002F\u002Fpubs.aip.org\u002Faip\u002Fapm\u002Farticle\u002F1\u002F1\u002F011002\u002F119685))：包含 45k 种通用无机材料，包括大多数实验已知的、晶胞中原子数不超过 20 的材料。\n* Alex-MP-20：由来自 MP-20 和 Alexandria ([Schmidt et al. 
2022](https:\u002F\u002Farchive.materialscloud.org\u002Frecord\u002F2022.126)) 的约 600k 种结构组成的训练数据集，晶胞内原子数最多为 20 且凸包能量低于 0.1 eV\u002Fatom。更多信息请参见下文的维恩图以及 MatterGen 论文。\n\n### 参考数据集\n我们进一步提供了 Alex-MP 参考数据集，可用于评估生成样本的新颖性和稳定性。\n参考集包含 845,997 种结构及其 DFT 能量。有关训练和参考数据集组成的更多详细信息，请参见以下维恩图。\n> [!注意]\n> 出于许可原因，我们无法分享 4.4k 有序 + 117.7k 无序 ICSD 结构，因此结果可能与论文中的有所不同。\n\n![数据集维恩图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_readme_e726c906263f.png)\n\n### CIF 文件和实验测量\n[`data-release`](data-release) 目录还包含论文中展示的所有结构的 CIF 文件，以及论文中介绍的 TaCr2O6 样品的 XPS、XRD 和纳米压痕测量数据。\n\n## 引用\n如果你使用我们的代码、模型、数据或评估管道，请考虑引用我们的工作：\n```bibtex\n@article{MatterGen2025,\n  author  = {Zeni, Claudio and Pinsler, Robert and Z{\\\"u}gner, Daniel and Fowler, Andrew and Horton, Matthew and Fu, Xiang and Wang, Zilong and Shysheya, Aliaksandra and Crabb{\\'e}, Jonathan and Ueda, Shoko and Sordillo, Roberto and Sun, Lixin and Smith, Jake and Nguyen, Bichlien and Schulz, Hannes and Lewis, Sarah and Huang, Chin-Wei and Lu, Ziheng and Zhou, Yichi and Yang, Han and Hao, Hongxia and Li, Jielan and Yang, Chunlei and Li, Wenjie and Tomioka, Ryota and Xie, Tian},\n  journal = {Nature},\n  title   = {A generative model for inorganic materials design},\n  year    = {2025},\n  doi     = {10.1038\u002Fs41586-025-08628-5},\n}\n```\n\n## 商标\n\n本项目可能包含某些项目、产品或服务的商标或标志。\nMicrosoft 商标或标志的授权使用需遵守并遵循 [Microsoft 的商标和品牌指南](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks\u002Fusage\u002Fgeneral)。\n在修改版本的项目中使用 Microsoft 商标或标志不得引起混淆或暗示 Microsoft 赞助。\n任何第三方商标或标志的使用均需遵守这些第三方的政策。\n\n## 负责任的人工智能透明性文档\n\n负责任的人工智能透明性文档可以在[这里](MODEL_CARD.md)找到。\n\n## 联系我们\n如果您有任何这里未涵盖的问题，请在讨论区的问答部分提问。\n如果您想报告错误或提出新功能，请使用模板创建一个问题和\u002F或发起一个拉取请求。","```markdown\n# MatterGen 快速上手指南\n\nMatterGen 是一个用于无机材料设计的生成模型，支持多种属性约束的微调和生成。\n\n---\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux（推荐）或 macOS（实验性支持）\n- **Python 版本**: 3.10 及以上\n- **硬件要求**: 支持 CUDA 的 GPU（推荐），或 Apple Silicon（实验性）\n\n### 前置依赖\n- Git 
Large File Storage (LFS)\n- [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)（快速 Python 包管理工具）\n\n> **注意**: 如果使用 Apple Silicon，请在运行训练或生成命令前设置环境变量：\n> ```bash\n> export PYTORCH_ENABLE_MPS_FALLBACK=1\n> ```\n\n---\n\n## 安装步骤\n\n1. 安装 `uv` 并创建虚拟环境：\n   ```bash\n   pip install uv\n   uv venv .venv --python 3.10 \n   source .venv\u002Fbin\u002Factivate\n   uv pip install -e .\n   ```\n\n2. 检查并安装 Git LFS：\n   ```bash\n   git lfs --version\n   ```\n   如果未安装，请运行以下命令：\n   ```bash\n   sudo apt install git-lfs\n   git lfs install\n   ```\n\n3. 下载预训练模型和数据集（通过 Git LFS）：\n   ```bash\n   git lfs pull -I checkpoints\u002F\u003Cmodel_name> --exclude=\"\"\n   ```\n   或从 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fmattergen) 下载。\n\n---\n\n## 基本使用\n\n### 无条件生成\n运行以下命令以使用预训练模型生成材料：\n```bash\nexport MODEL_NAME=mattergen_base\nexport RESULTS_PATH=results\u002F  # 生成结果将保存在此目录\n\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --num_batches 1\n```\n生成结果包括：\n- `generated_crystals_cif.zip`: 包含生成结构的 `.cif` 文件。\n- `generated_crystals.extxyz`: 包含所有生成结构的单个文件。\n- （可选）`generated_trajectories.zip`: 包含去噪轨迹的 `.extxyz` 文件。\n\n> **提示**: 调整 `--batch_size` 参数以充分利用 GPU 内存。\n\n---\n\n### 属性条件生成\n使用微调模型生成具有特定属性的材料。例如，生成磁密度为 0.15 的材料：\n```bash\nexport MODEL_NAME=dft_mag_density\nexport RESULTS_PATH=\"results\u002F$MODEL_NAME\u002F\"\n\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on=\"{'dft_mag_density': 0.15}\" --diffusion_guidance_factor=2.0\n```\n\n> **提示**: `--diffusion-guidance-factor` 参数控制生成样本对目标属性的贴合程度：值越高，样本越贴合目标属性值，但可能降低多样性。\n\n---\n\n### 多属性条件生成\n生成同时满足多个属性条件的材料。例如，生成化学系统为 `Li-O` 且凸包上方能量（energy above hull）为 0.05 的材料：\n```bash\nexport MODEL_NAME=chemical_system_energy_above_hull\nexport RESULTS_PATH=\"results\u002F$MODEL_NAME\u002F\"\n\nmattergen-generate $RESULTS_PATH --pretrained-name=$MODEL_NAME --batch_size=16 --properties_to_condition_on=\"{'energy_above_hull': 0.05, 'chemical_system': 'Li-O'}\" 
--diffusion_guidance_factor=2.0\n```\n\n---\n\n完成上述步骤后，您即可开始使用 MatterGen 进行材料生成！\n```","一位材料科学家正在为新型太阳能电池开发高效、稳定的无机材料，需要快速筛选出符合特定性能约束的候选材料。\n\n### 没有 mattergen 时\n- 材料设计完全依赖实验试错，周期长且成本高，可能需要数月甚至数年才能找到合适的材料。\n- 筛选范围受限于已知材料数据库，难以探索全新化学组成或晶体结构。\n- 针对特定性能（如带隙或磁性）优化时，缺乏系统性方法，只能手动调整参数进行有限尝试。\n- 实验室资源有限，无法同时测试大量假设，导致研究效率低下。\n- 跨团队协作困难，因为材料设计过程缺乏标准化和可复现性。\n\n### 使用 mattergen 后\n- 通过生成模型快速预测和设计新材料，将实验验证周期从数月缩短至数天或数周。\n- 能够探索整个元素周期表范围内的全新化学组成和晶体结构，突破现有数据库的限制。\n- 利用预训练模型（如`dft_band_gap`或`ml_bulk_modulus`），直接生成满足特定性能约束的候选材料，显著提升优化效率。\n- 批量生成多样化候选材料，支持实验室并行验证，最大化利用资源。\n- 提供统一框架和可复现的结果，促进团队间协作与知识共享。\n\nmattergen 的核心价值在于大幅加速新材料发现过程，同时降低成本和资源消耗，为材料科学创新提供强大助力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_mattergen_1ebb8cf9.png","microsoft","Microsoft","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmicrosoft_4900709c.png","Open source projects and samples from Microsoft",null,"opensource@microsoft.com","OpenAtMicrosoft","https:\u002F\u002Fopensource.microsoft.com","https:\u002F\u002Fgithub.com\u002Fmicrosoft",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",91.2,{"name":91,"color":92,"percentage":93},"Jupyter Notebook","#DA5B0B",8.8,1668,312,"2026-04-05T07:31:22","MIT","Linux","需要支持 CUDA 的 NVIDIA GPU，显存未说明，CUDA 版本未说明","未说明",{"notes":102,"python":103,"dependencies":104},"建议使用 uv 管理 Python 环境，首次运行需下载模型文件（大小未说明）。Apple Silicon 支持为实验性质，需设置 PYTORCH_ENABLE_MPS_FALLBACK=1。部分功能依赖 MatterSim 库进行结构优化和评估。","3.10+",[105,106,107],"uv","torch","git-lfs",[18],[110,111,112],"generative-ai","materials-design","materials-science","2026-03-27T02:49:30.150509","2026-04-06T05:16:47.344597",[116,121,126,131,136,140],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},4299,"如何解决生成样本时出现 `_pickle.UnpicklingError: invalid load key, 'v'` 错误？","确保检查点文件已完全下载，并尝试重新运行命令 `mattergen-generate $RESULTS_PATH --model_path=checkpoints\u002Fmattergen_base --batch_size=16 --num_batches 
1`。如果问题仍然存在，可以尝试克隆一个新的仓库副本。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattergen\u002Fissues\u002F62",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},4300,"为什么在条件采样中设置 `dft_band_gap = 3 eV` 时，生成的结构中有很多接近零的带隙值？","可能是由于生成的结构未经过松弛处理。论文中的结果通常基于 `BandStructureMaker`，而您使用的是 `MPGGAStaticMaker`。建议在计算带隙之前对生成的结构进行松弛处理，以获得更准确的结果。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattergen\u002Fissues\u002F213",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},4297,"如何在自定义数据集上训练 MatterGen 模型以生成具有目标带隙值的新晶体结构？","需要提供覆盖参数 `hydra.run.dir=\u002Fpath\u002Fto\u002Fyour-existing-run`，然后模型应该能够加载现有的检查点。此外，确保数据格式包含化合物名称、CIF 文件内容和 DFT 计算的带隙值。如果需要从之前的检查点恢复训练，可以使用命令 `+lightning_module\u002Fdiffusion_module\u002Fmodel\u002Fproperty_embeddings@lightning_module.diffusion_module.model.property_embeddings.${PROPERTY}=${PROPERTY}`。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattergen\u002Fissues\u002F128",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},4298,"为什么在生成材料时，晶体晶格无法适应指定的密度参数？","建议检查训练样本中是否包含目标材料（如 Inconel）的数据。如果没有，可能需要扩展数据集。此外，可以通过调整化学元素的质量分数或优先级来改进生成结果，但目前模型可能不直接支持这些功能。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattergen\u002Fissues\u002F106",{"id":137,"question_zh":138,"answer_zh":139,"source_url":125},4301,"如何正确配置 `_BASE_MPGGA_SET` 参数？","`_BASE_MPGGA_SET` 是基于 `MPRelaxSet` 的自定义配置，定义如下：\n```python\nfrom pkg_resources import resource_filename\n_BASE_MPGGA_SET: Dict[str, Any] = loadfn(resource_filename(\"pymatgen.io.vasp\", \"MPRelaxSet.yaml\"))\n_BASE_MPGGA_SET[\"POTCAR\"][\"Yb\"] = \"Yb_2\"\n```\n如果您使用 `atomate2` 提供的 `MPGGAStaticMaker`，也可以直接替换为该工具流。",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},4302,"如何在 Windows 上训练 MatterGen 模型？","建议在 Linux 环境下运行 MatterGen，因为其依赖项与 Linux 兼容性更好。如果必须在 Windows 上运行，可以尝试使用 WSL（Windows Subsystem for 
Linux）部署虚拟环境。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmattergen\u002Fissues\u002F82",[146,151,156,161],{"id":147,"version":148,"summary_zh":149,"released_at":150},103759,"v1.0.3","Adds more metadata to pyproject.toml.","2025-07-23T11:23:07",{"id":152,"version":153,"summary_zh":154,"released_at":155},103760,"v1.0.2","Bumps version in `pyproject.toml`.","2025-07-23T10:15:16",{"id":157,"version":158,"summary_zh":159,"released_at":160},103761,"v1.0.1","* Adds support running on Mac\r\n* Adds results and figures to README\r\n* Adds crystal structure prediction mode\r\n* Adds log10 transform to property scalers\r\n* Adds huggingface checkpoint support\r\n* Various small bugfixes","2025-07-23T09:43:18",{"id":162,"version":163,"summary_zh":164,"released_at":165},103762,"v1.0.0","Initial release of the MatterGen codebase.","2025-01-17T17:14:04"]