[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Element-Research--rnn":3,"tool-Element-Research--rnn":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
# rnn: recurrent neural networks #

Note: this repository is deprecated in favor of https://github.com/torch/rnn.

This is a Recurrent Neural Network library that extends Torch's nn.
You can use it to build RNNs, LSTMs, GRUs, BRNNs, BLSTMs, and so forth and so on.
This library includes documentation for the following objects:

Modules that consider successive calls to `forward` as different time-steps in a sequence:
 * [AbstractRecurrent](#rnn.AbstractRecurrent) : an abstract class inherited by Recurrent and LSTM;
 * [Recurrent](#rnn.Recurrent) : a generalized recurrent neural network container;
 * [LSTM](#rnn.LSTM) : a vanilla Long-Short Term Memory module;
  * [FastLSTM](#rnn.FastLSTM) : a faster [LSTM](#rnn.LSTM) with optional support for batch normalization;
 * [GRU](#rnn.GRU) : a Gated Recurrent Units module;
 * [MuFuRu](#rnn.MuFuRu) : a [Multi-Function Recurrent Unit](https://arxiv.org/abs/1606.03002) module;
 * [Recursor](#rnn.Recursor) : decorates a module to make it conform to the [AbstractRecurrent](#rnn.AbstractRecurrent) interface;
 * [Recurrence](#rnn.Recurrence) : decorates a module that outputs `output(t)` given `{input(t), output(t-1)}`;
 * [NormStabilizer](#rnn.NormStabilizer) : implements the [norm-stabilization](http://arxiv.org/abs/1511.08400) criterion (add this module between RNNs);

Modules that `forward` entire sequences through a decorated `AbstractRecurrent` instance:
 * [AbstractSequencer](#rnn.AbstractSequencer) : an abstract class inherited by Sequencer, Repeater, RecurrentAttention, etc.;
 * [Sequencer](#rnn.Sequencer) : applies an encapsulated module to all elements in an input sequence (Tensor or Table);
 * [SeqLSTM](#rnn.SeqLSTM) : a very fast version of `nn.Sequencer(nn.FastLSTM)` where the `input` and `output` are tensors;
  * [SeqLSTMP](#rnn.SeqLSTMP) : `SeqLSTM` with a projection layer;
 * [SeqGRU](#rnn.SeqGRU) : a very fast version of `nn.Sequencer(nn.GRU)` where the `input` and `output` are tensors;
 * [SeqBRNN](#rnn.SeqBRNN) : a bidirectional RNN based on SeqLSTM;
 * [BiSequencer](#rnn.BiSequencer) : used for implementing bidirectional RNNs and LSTMs;
 * [BiSequencerLM](#rnn.BiSequencerLM) : used for implementing bidirectional RNNs and LSTMs for language models;
 * [Repeater](#rnn.Repeater) : repeatedly applies the same input to an AbstractRecurrent instance;
 * [RecurrentAttention](#rnn.RecurrentAttention) : a generalized attention model for [REINFORCE modules](https://github.com/nicholas-leonard/dpnn#nn.Reinforce);

Miscellaneous modules and criterions:
 * [MaskZero](#rnn.MaskZero) : zeroes the `output` and `gradOutput` rows of the decorated module for commensurate `input` rows which are tensors of zeros;
 * [TrimZero](#rnn.TrimZero) : same behavior as `MaskZero`, but more efficient when the `input` contains lots of zero-masked rows;
 * [LookupTableMaskZero](#rnn.LookupTableMaskZero) : extends `nn.LookupTable` to support zero indexes for padding.
Zero indexes are forwarded as tensors of zeros;
 * [MaskZeroCriterion](#rnn.MaskZeroCriterion) : zeroes the `gradInput` and `err` rows of the decorated criterion for commensurate `input` rows which are tensors of zeros;
 * [SeqReverseSequence](#rnn.SeqReverseSequence) : reverses an input sequence on a specific dimension;

Criterions used for handling sequential inputs and targets:
 * [SequencerCriterion](#rnn.SequencerCriterion) : sequentially applies the same criterion to a sequence of inputs and targets (Tensor or Table);
 * [RepeaterCriterion](#rnn.RepeaterCriterion) : repeatedly applies the same criterion with the same target on a sequence.

To install this repository:
```
git clone git@github.com:Element-Research/rnn.git
cd rnn
luarocks make rocks/rnn-scm-1.rockspec
```
Note that `luarocks install rnn` now installs https://github.com/torch/rnn instead.
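After installing, a quick way to check that everything loads is to build one of the documented modules and forward a short sequence through it. A minimal sketch (sizes are arbitrary; `nn.LSTM` and `nn.Sequencer` are documented below):

```lua
require 'rnn'

-- a sequence of 2 time-steps for a batch of 2 examples, feature size 10
local input = {torch.randn(2, 10), torch.randn(2, 10)}

-- decorating an LSTM with a Sequencer forwards the whole table in one call
local lstm = nn.Sequencer(nn.LSTM(10, 10))
print(lstm:forward(input)) -- a table of 2 output tensors, each 2 x 10
```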
<a name='rnn.examples'></a>
## Examples ##

The following are example training scripts using this package:

  * [RNN/LSTM/GRU](examples/recurrent-language-model.lua) for the Penn Tree Bank dataset;
  * [Noise Contrastive Estimate](examples/noise-contrastive-estimate.lua) for training multi-layer [SeqLSTM](#rnn.SeqLSTM) language models on the [Google Billion Words dataset](https://github.com/Element-Research/dataload#dl.loadGBW). The example uses [MaskZero](#rnn.MaskZero) to train independent variable-length sequences using the [NCEModule](https://github.com/Element-Research/dpnn#nn.NCEModule) and [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion). This script is our fastest yet, boasting speeds of 20,000 words/second (on an NVIDIA Titan X) with a 2-layer LSTM having 250 hidden units, a batch size of 128 and a sequence length of 100. Note that you will need to have [Torch installed with Lua instead of LuaJIT](http://torch.ch/docs/getting-started.html#_);
  * [Recurrent Model for Visual Attention](examples/recurrent-visual-attention.lua) for the MNIST dataset;
  * [Encoder-Decoder LSTM](examples/encoder-decoder-coupling.lua) shows you how to couple encoder and decoder `LSTMs` for sequence-to-sequence networks;
  * [Simple Recurrent Network](examples/simple-recurrent-network.lua) shows a simple example for building and training a simple recurrent neural network;
  * [Simple Sequencer Network](examples/simple-sequencer-network.lua) is a version of the above script that uses the Sequencer to decorate the `rnn` instead;
  * [Sequence to One](examples/sequence-to-one.lua) demonstrates how to do many-to-one sequence learning, as is the case for sentiment analysis;
  * [Multivariate Time Series](examples/recurrent-time-series.lua) demonstrates how to train a simple RNN to do multivariate time-series prediction.

### External Resources

  * [rnn-benchmarks](https://github.com/glample/rnn-benchmarks) : benchmarks comparing Torch (using this library), Theano and TensorFlow.
  * [Harvard Jupyter Notebook Tutorial](http://nbviewer.jupyter.org/github/CS287/Lectures/blob/gh-pages/notebooks/ElementRNNTutorial.ipynb) : an in-depth tutorial on how to use the Element-Research rnn package, by Harvard University;
  * [dpnn](https://github.com/Element-Research/dpnn) : a dependency of the __rnn__ package. It contains useful nn extensions, modules and criterions;
  * [dataload](https://github.com/Element-Research/dataload) : a collection of Torch dataset loaders;
  * [RNN/LSTM/BRNN/BLSTM training script](https://github.com/nicholas-leonard/dp/blob/master/examples/recurrentlanguagemodel.lua) for the Penn Tree Bank or Google Billion Words datasets;
  * A brief (1 hour) overview of Torch7, which includes some details about the __rnn__ package (at the end), is available via this [NVIDIA GTC Webinar video](http://on-demand.gputechconf.com/gtc/2015/webinar/torch7-applied-deep-learning-for-vision-natural-language.mp4). In any case, this presentation gives a nice overview of Logistic Regression, Multi-Layer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks using Torch7;
  * [Sequence to Sequence mapping using encoder-decoder RNNs](https://github.com/rahul-iisc/seq2seq-mapping) : a complete training example using synthetic data.
  * [ConvLSTM](https://github.com/viorik/ConvLSTM) is a repository for training a [Spatio-temporal video autoencoder with differentiable memory](http://arxiv.org/abs/1511.06309).
  * A [time-series example](https://github.com/rracinskij/rnntest01/blob/master/rnntest01.lua) for univariate time-series prediction.

## Citation ##

If you use __rnn__ in your work, we'd really appreciate it if you could cite the following paper:

Léonard, Nicholas, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. [rnn: Recurrent Library for Torch.](http://arxiv.org/abs/1511.07889) arXiv preprint arXiv:1511.07889 (2015).

Any significant contributor to the library will also get added as an author to the paper.
A [significant contributor](https://github.com/Element-Research/rnn/graphs/contributors)
is anyone who added at least 300 lines of code to the library.

## Troubleshooting ##

Most issues can be resolved by updating the various dependencies:
```bash
luarocks install torch
luarocks install nn
luarocks install dpnn
luarocks install torchx
```

If you are using CUDA:
```bash
luarocks install cutorch
luarocks install cunn
luarocks install cunnx
```

And don't forget to update this package:
```bash
git clone git@github.com:Element-Research/rnn.git
cd rnn
luarocks make rocks/rnn-scm-1.rockspec
```

If that doesn't fix it, open an issue on GitHub.

<a name='rnn.AbstractRecurrent'></a>
## AbstractRecurrent ##
An abstract class inherited by [Recurrent](#rnn.Recurrent), [LSTM](#rnn.LSTM) and [GRU](#rnn.GRU).
The constructor takes a single argument:
```lua
rnn = nn.AbstractRecurrent([rho])
```
Argument `rho` is the maximum number of steps to backpropagate through time (BPTT).
Sub-classes can set this to a large number like 99999 (the default) if they want to backpropagate through
the entire sequence, whatever its length. Setting lower values of `rho` is
useful when long sequences are forward propagated but we only wish to
backpropagate through the last `rho` steps, which means that the remainder
of the sequence doesn't need to be stored (so there is no additional cost).
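For instance, a truncated-BPTT setup might pass a small `rho` to a concrete sub-class such as [LSTM](#rnn.LSTM). A minimal sketch:

```lua
-- keep only the last 5 time-steps for backpropagation,
-- however long the forward-propagated sequence gets
local lstm = nn.LSTM(10, 10, 5) -- inputSize, outputSize, rho
```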
### [recurrentModule] getStepModule(step) ###
Returns a module for time-step `step`. This is used internally by sub-classes
to obtain copies of the internal `recurrentModule`. These copies share
`parameters` and `gradParameters`, but each has its own `output`, `gradInput`
and any other intermediate states.

### setOutputStep(step) ###
This is a method reserved for internal use by [Recursor](#rnn.Recursor)
when doing backward propagation. It sets the object's `output` attribute
to point to the output at time-step `step`.
This method was introduced to solve a very annoying bug.

<a name='rnn.AbstractRecurrent.maskZero'></a>
### maskZero(nInputDim) ###
Decorates the internal `recurrentModule` with [MaskZero](#rnn.MaskZero).
Each row (i.e. sample) of the `output` Tensor (or table thereof) of the `recurrentModule`
will be zeroed when the commensurate row of the `input` is a tensor of zeros.

The `nInputDim` argument must specify the number of non-batch dims
in the first Tensor of the `input`. In the case of an `input` table,
the first Tensor is the first one encountered when doing a depth-first search.

Calling this method makes it possible to pad sequences with different lengths in the same batch with zero vectors.

When a sample time-step is masked (i.e. the `input` is a row of zeros), the
hidden state is effectively reset (i.e. forgotten) for the next non-masked time-step.
In other words, it is possible to separate unrelated sequences with a masked element.

### trimZero(nInputDim) ###
Decorates the internal `recurrentModule` with [TrimZero](#rnn.TrimZero).

### [output] updateOutput(input) ###
Forward propagates the input for the current step. The outputs or intermediate
states of the previous steps are used recurrently. This is transparent to the
caller as the previous outputs and intermediate states are memorized. This
method also increments the `step` attribute by 1.

<a name='rnn.AbstractRecurrent.updateGradInput'></a>
### updateGradInput(input, gradOutput) ###
Like `backward`, this method should be called in the reverse order of the
`forward` calls used to propagate a sequence. For example:

```lua
-- assumes tables inputs, gradOutputs and integer nStep are given
rnn = nn.LSTM(10, 10) -- AbstractRecurrent instance
local outputs, gradInputs = {}, {}
for i=1,nStep do -- forward propagate sequence
   outputs[i] = rnn:forward(inputs[i])
end

for i=nStep,1,-1 do -- backward propagate sequence in reverse order
   gradInputs[i] = rnn:backward(inputs[i], gradOutputs[i])
end

rnn:forget()
```

The reverse order implements backpropagation through time (BPTT).

### accGradParameters(input, gradOutput, scale) ###
Like `updateGradInput`, but for accumulating gradients w.r.t. parameters.

### recycle(offset) ###
This method goes hand in hand with `forget`. It is useful when the current
time-step is greater than `rho`, at which point it starts recycling
the oldest `recurrentModule` `sharedClones`,
such that they can be reused for storing the next step. The `offset`
is used for modules like `nn.Recurrent` that use a different module
for the first step. The default offset is 0.

<a name='rnn.AbstractRecurrent.forget'></a>
### forget(offset) ###
This method brings all states back to the start of the sequence buffers,
i.e. it forgets the current sequence. It also resets the `step` attribute to 1.
It is highly recommended to call `forget` after each parameter update.
Otherwise, the previous state will be used to activate the next, which
will often lead to instability. This is caused by the previous state being
the result of now-changed parameters. It is also good practice to call
`forget` at the start of each new sequence.
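Extending the BPTT snippet from [updateGradInput](#rnn.AbstractRecurrent.updateGradInput), a typical training step would therefore end with a parameter update followed by `forget` (a sketch; the learning rate is arbitrary):

```lua
for i=1,nStep do -- forward propagate sequence
   outputs[i] = rnn:forward(inputs[i])
end
for i=nStep,1,-1 do -- backward propagate in reverse order (BPTT)
   gradInputs[i] = rnn:backward(inputs[i], gradOutputs[i])
end
rnn:updateParameters(0.1) -- update parameters with learning rate 0.1
rnn:forget()              -- then reset the sequence buffers
```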
<a name='rnn.AbstractRecurrent.maxBPTTstep'></a>
### maxBPTTstep(rho) ###
This method sets the maximum number of time-steps for which to perform
backpropagation through time (BPTT). So say you set this to `rho = 3` time-steps,
feed-forward for 4 steps, and then backpropagate: only the last 3 steps will be
used for the backpropagation. If your AbstractRecurrent instance is wrapped
by a [Sequencer](#rnn.Sequencer), this will be handled auto-magically by the Sequencer.
Otherwise, setting this value to a large value (i.e. 9999999) is good for most, if not all, cases.

<a name='rnn.AbstractRecurrent.backwardOnline'></a>
### backwardOnline() ###
This method was deprecated Jan 6, 2016.
Since then, by default, `AbstractRecurrent` instances use the
backwardOnline behaviour.
See [updateGradInput](#rnn.AbstractRecurrent.updateGradInput) for details.

### training() ###
In training mode, the network remembers all previous `rho` (number of time-steps)
states. This is necessary for BPTT.

### evaluate() ###
During evaluation, since there is no need to perform BPTT at a later time,
only the previous step is remembered. This is very efficient memory-wise,
such that evaluation can be performed on potentially infinite-length
sequences.

<a name='rnn.Recurrent'></a>
## Recurrent ##
References:
 * A. [Sutskever Thesis Sec. 2.5 and 2.8](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf)
 * B. [Mikolov Thesis Sec. 3.2 and 3.3](http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf)
 * C. [RNN and Backpropagation Guide](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.9311&rep=rep1&type=pdf)

A [composite Module](https://github.com/torch/nn/blob/master/doc/containers.md#containers) for implementing Recurrent Neural Networks (RNN), excluding the output layer.

The `nn.Recurrent(start, input, feedback, [transfer, rho, merge])` constructor takes 6 arguments:
 * `start` : the size of the output (excluding the batch dimension), or a Module that will be inserted between the `input` Module and the `transfer` module during the first step of the propagation. When `start` is a size (a number or `torch.LongTensor`), this *start* Module will be initialized as `nn.Add(start)` (see Ref. A);
 * `input` : a Module that processes input Tensors (or Tables). Its output must be of the same size as `start` (or its output, in the case of a `start` Module), and the same size as the output of the `feedback` Module;
 * `feedback` : a Module that feeds back the previous output Tensor (or Tables) up to the `merge` module;
 * `merge` : a [table Module](https://github.com/torch/nn/blob/master/doc/table.md#table-layers) that merges the outputs of the `input` and `feedback` Modules before being forwarded through the `transfer` Module;
 * `transfer` : a non-linear Module used to process the output of the `merge` module, or, in the case of the first step, the output of the `start` Module;
 * `rho` : the maximum number of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Due to the vanishing-gradients effect, references A and B recommend `rho = 5` (or lower). Defaults to 99999.
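Putting these arguments together, a minimal construction might look like the following sketch (sizes and `rho` are placeholder values; `merge` is left at its default):

```lua
local hiddenSize, nIndex, rho = 10, 10000, 5
local r = nn.Recurrent(
   hiddenSize,                          -- start : initialized as nn.Add(hiddenSize)
   nn.LookupTable(nIndex, hiddenSize),  -- input : processes each input Tensor
   nn.Linear(hiddenSize, hiddenSize),   -- feedback : processes the previous output
   nn.Sigmoid(),                        -- transfer : non-linearity after the merge
   rho                                  -- maximum number of BPTT steps
)
```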
An RNN is used to process a sequence of inputs.
Each step in the sequence should be propagated by its own `forward` (and `backward`),
one `input` (and `gradOutput`) at a time.
Each call to `forward` keeps a log of the intermediate states (the `input` and many `Module.outputs`)
and increments the `step` attribute by 1.
Method `backward` must be called in reverse order of the sequence of calls to `forward` in
order to backpropagate through time (BPTT). This reverse order is necessary
to return a `gradInput` for each call to `forward`.

The `step` attribute is only reset to 1 when a call to the `forget` method is made,
in which case the Module is ready to process the next sequence (or batch thereof).
Note that the longer the sequence, the more memory will be required to store all the
`output` and `gradInput` states (one for each time-step).

To use this module with batches, we suggest using different
sequences of the same size within a batch and calling `updateParameters`
every `rho` steps and `forget` at the end of the sequence.

Note that calling the `evaluate` method turns off long-term memory;
the RNN will only remember the previous output. This allows the RNN
to handle long sequences without allocating any additional memory.

For a simple, concise example of how to make use of this module, please consult the
[simple-recurrent-network.lua](examples/simple-recurrent-network.lua)
training script.

<a name='rnn.Recurrent.Sequencer'></a>
### Decorate it with a Sequencer ###

Note that any `AbstractRecurrent` instance can be decorated with a [Sequencer](#rnn.Sequencer)
such that an entire sequence (a table) can be presented with a single `forward`/`backward` call.
This is actually the recommended approach, as it allows RNNs to be stacked and makes the
rnn conform to the Module interface: each call to `forward` can be
followed by its own immediate call to `backward`, as each `input` to the
model is an entire sequence, i.e. a table of tensors where each tensor represents
a time-step.

```lua
seq = nn.Sequencer(module)
```

The [simple-sequencer-network.lua](examples/simple-sequencer-network.lua) training script
is equivalent to the above-mentioned [simple-recurrent-network.lua](examples/simple-recurrent-network.lua)
script, except that it decorates the `rnn` with a `Sequencer` which takes
a table of `inputs` and `gradOutputs` (the sequence for that batch).
This lets the `Sequencer` handle the looping over the sequence.

You should only think about using the `AbstractRecurrent` modules without
a `Sequencer` if you intend to use them for real-time prediction.
Actually, you can even use an `AbstractRecurrent` instance decorated by a `Sequencer`
for real-time prediction by calling `Sequencer:remember()` and presenting each
time-step `input` as `{input}`.

Other decorators can be used, such as the [Repeater](#rnn.Repeater) or [RecurrentAttention](#rnn.RecurrentAttention).
The `Sequencer` is only the most common one.

<a name='rnn.LSTM'></a>
## LSTM ##
References:
 * A. [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778v1.pdf)
 * B. [Long-Short Term Memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)
 * C. [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069v1.pdf)
 * D. [nngraph LSTM implementation on github](https://github.com/wojzaremba/lstm)
This is an implementation of a vanilla Long-Short Term Memory module.
We used Ref. A's LSTM as a blueprint for this module as it was the most concise.
Yet it is also the vanilla LSTM described in Ref. C.

The `nn.LSTM(inputSize, outputSize, [rho])` constructor takes 3 arguments:
 * `inputSize` : a number specifying the size of the input;
 * `outputSize` : a number specifying the size of the output;
 * `rho` : the maximum number of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999.

![LSTM](https://oss.gittoolsai.com/images/Element-Research_rnn_readme_a70564f3bef3.png)

The actual implementation corresponds to the following algorithm:
```lua
i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + W[c->i]c[t−1] + b[1->i])      (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + W[c->f]c[t−1] + b[1->f])      (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)
c[t] = f[t]c[t−1] + i[t]z[t]                                         (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + W[c->o]c[t] + b[1->o])        (5)
h[t] = o[t]tanh(c[t])                                                (6)
```
where `W[s->q]` is the weight matrix from `s` to `q`, `t` indexes the time-step,
`b[1->q]` are the biases leading into `q`, `σ()` is `Sigmoid`, `x[t]` is the input,
`i[t]` is the input gate (eq. 1), `f[t]` is the forget gate (eq. 2),
`z[t]` is the input to the cell (which we call the hidden) (eq. 3),
`c[t]` is the cell (eq. 4), `o[t]` is the output gate (eq. 5),
and `h[t]` is the output of this module (eq. 6). Also note that the
weight matrices `W[c->s]` from cell to gate vectors, where `s`
is `i`, `f`, or `o`, are diagonal.

As you can see, unlike [Recurrent](#rnn.Recurrent), this
implementation isn't generic enough to take arbitrary component Module
definitions at construction. However, the LSTM module can easily be adapted
through inheritance by overriding the different factory methods:
  * `buildGate` : builds the generic gate that is used to implement the input, forget and output gates;
  * `buildInputGate` : builds the input gate (eq. 1). Currently calls `buildGate`;
  * `buildForgetGate` : builds the forget gate (eq. 2). Currently calls `buildGate`;
  * `buildHidden` : builds the hidden (eq. 3);
  * `buildCell` : builds the cell (eq. 4);
  * `buildOutputGate` : builds the output gate (eq. 5). Currently calls `buildGate`;
  * `buildModel` : builds the actual LSTM model which is used internally (eq. 6).

Note that we recommend decorating the `LSTM` with a `Sequencer`
(refer to [this](#rnn.Recurrent.Sequencer) for details).
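As a sketch of that recommendation (sizes are arbitrary):

```lua
-- an LSTM decorated with a Sequencer, so a whole sequence
-- (a table of time-step tensors) is forwarded in one call
local lstm = nn.Sequencer(nn.LSTM(100, 100))
local inputs = {torch.randn(8, 100), torch.randn(8, 100)} -- seqlen 2, batchsize 8
local outputs = lstm:forward(inputs) -- a table of 2 tensors, each 8 x 100
```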
<a name='rnn.FastLSTM'></a>
## FastLSTM ##

A faster version of the [LSTM](#rnn.LSTM).
Basically, the input, forget and output gates, as well as the hidden state, are computed in one fell swoop.

Note that `FastLSTM` does not use peephole connections between cell and gates. The algorithm from `LSTM` changes as follows:
```lua
i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i])                      (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f])                      (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)
c[t] = f[t]c[t−1] + i[t]z[t]                                         (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o])                      (5)
h[t] = o[t]tanh(c[t])                                                (6)
```
i.e. omitting the summands `W[c->i]c[t−1]` (eq. 1), `W[c->f]c[t−1]` (eq. 2), and `W[c->o]c[t]` (eq. 5).

### usenngraph ###
This is a static attribute of the `FastLSTM` class. The default value is `false`.
Setting `usenngraph = true` will force all newly instantiated instances of `FastLSTM`
to use `nngraph`'s `nn.gModule` to build the internal `recurrentModule`, which is
cloned for each time-step.

<a name='rnn.FastLSTM.bn'></a>
#### Recurrent Batch Normalization ####

This extends the `FastLSTM` class to enable faster convergence during training by zero-centering the input-to-hidden and hidden-to-hidden transformations.
It reduces the [internal covariate shift](https://arxiv.org/abs/1502.03167v3) between time-steps. It is an implementation of Cooijmans et al.'s [Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025). The hidden-to-hidden transition of each LSTM cell is normalized according to
```lua
i[t] = σ(BN(W[x->i]x[t]) + BN(W[h->i]h[t−1]) + b[1->i])                      (1)
f[t] = σ(BN(W[x->f]x[t]) + BN(W[h->f]h[t−1]) + b[1->f])                      (2)
z[t] = tanh(BN(W[x->c]x[t]) + BN(W[h->c]h[t−1]) + b[1->c])                   (3)
c[t] = f[t]c[t−1] + i[t]z[t]                                                 (4)
o[t] = σ(BN(W[x->o]x[t]) + BN(W[h->o]h[t−1]) + b[1->o])                      (5)
h[t] = o[t]tanh(c[t])                                                        (6)
```
where the batch normalizing transform is:
```lua
BN(hd; gamma, beta) = beta + gamma * (hd - E[hd]) / sqrt(Var[hd] + eps)
```
where `hd` is a vector of (pre)activations to be normalized, and `gamma` and `beta` are model parameters that determine the mean and standard deviation of the normalized activation. `eps` is a regularization hyperparameter to keep the division numerically stable, and `E[hd]` and `Var[hd]` are the estimates of the mean and variance in the mini-batch, respectively. The authors recommend initializing `gamma` to a small value, and found 0.1 to be the value that did not cause vanishing gradients. `beta`, the shift parameter, is `null` by default.

To turn on batch normalization during training, do:
```lua
nn.FastLSTM.bn = true
lstm = nn.FastLSTM(inputsize, outputsize, [rho, eps, momentum, affine])
```

where `momentum` is the same as `gamma` in the equation above (defaults to 0.1), `eps` is defined above, and `affine` is a boolean whose state determines whether the learnable affine transform is turned off (`false`) or on (`true`, the default).

<a name='rnn.GRU'></a>
## GRU ##

References:
 * A. [Learning Phrase Representations Using RNN Encoder-Decoder For Statistical Machine Translation.](http://arxiv.org/pdf/1406.1078.pdf)
 * B. [Implementing a GRU/LSTM RNN with Python and Theano](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/)
 * C. [An Empirical Exploration of Recurrent Network Architectures](http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf)
 * D. [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling](http://arxiv.org/abs/1412.3555)
 * E. [RnnDrop: A Novel Dropout for RNNs in ASR](http://www.stat.berkeley.edu/~tsmoon/files/Conference/asru2015.pdf)
 * F. [A Theoretically Grounded Application of Dropout in Recurrent Neural Networks](http://arxiv.org/abs/1512.05287)

This is an implementation of the Gated Recurrent Units module.

The `nn.GRU(inputSize, outputSize [,rho [,p [,mono]]])` constructor takes 3 arguments, like `nn.LSTM`, or 4 arguments for dropout:
 * `inputSize` : a number specifying the size of the input;
 * `outputSize` : a number specifying the size of the output;
 * `rho` : the maximum number of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999;
 * `p` : the dropout probability for the inner connections of the GRU;
 * `mono` : monotonic sampling for the dropouts inside the GRU. Only needed in a `TrimZero` + `BGRU` (`p > 0`) situation.

![GRU](https://oss.gittoolsai.com/images/Element-Research_rnn_readme_595e6882a6c3.png)

The actual implementation corresponds to the following algorithm:
```lua
z[t] = σ(W[x->z]x[t] + W[s->z]s[t−1] + b[1->z])            (1)
r[t] = σ(W[x->r]x[t] + W[s->r]s[t−1] + b[1->r])            (2)
h[t] = tanh(W[x->h]x[t] + W[hr->c](s[t−1]r[t]) + b[1->h])  (3)
s[t] = (1-z[t])h[t] + z[t]s[t-1]                           (4)
```
where `W[s->q]` is the weight matrix from `s` to `q`, `t` indexes the time-step, `b[1->q]` are the biases leading into `q`, `σ()` is `Sigmoid`, `x[t]` is the input and `s[t]` is the output of the module (eq. 4). Note that unlike the [LSTM](#rnn.LSTM), the GRU has no cells.

The GRU was benchmarked on the `PennTreeBank` dataset using the [recurrent-language-model.lua](examples/recurrent-language-model.lua) script.
It slightly outperformed `FastLSTM`; however, since LSTMs have more parameters than GRUs,
a dataset larger than `PennTreeBank` might change the result.
Don't be too hasty to judge which one is the better of the two (see Ref. C and D).

```
                Memory   examples/s
    FastLSTM      176M        16.5K
    GRU            92M        15.8K
```

__Memory__ is measured by the size of the `dp.Experiment` save file. __examples/s__ is measured by the training speed at 1 epoch, so it may have a disk IO bias.

![GRU-BENCHMARK](https://oss.gittoolsai.com/images/Element-Research_rnn_readme_a33c86babf09.png)

RNN dropout (see Ref. E and F) was also benchmarked on the `PennTreeBank` dataset using the `recurrent-language-model.lua` script. The details can be found in the script. In the benchmark, `GRU` utilizes a dropout after `LookupTable`, while `BGRU`, which stands for Bayesian GRU, uses dropouts on inner connections (following the naming in Ref. F), but not after `LookupTable`.

As Yarin Gal (Ref. F) mentioned, `p = 0.25` is recommended as a first attempt.
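In terms of the constructor above, that corresponds to the following sketch (sizes are arbitrary; `rho` is left at a large value):

```lua
-- a Bayesian GRU: dropout with p = 0.25 on the inner connections
local bgru = nn.GRU(100, 100, 9999, 0.25)
```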
![GRU-BENCHMARK](https://oss.gittoolsai.com/images/Element-Research_rnn_readme_bac51fe002ea.png)

### SAdd

To implement `GRU`, a simple module was added which cannot be built using only existing `nn` modules.

```lua
module = nn.SAdd(addend, negate)
```
Applies a single scalar addition to the incoming data, i.e. `y_i = x_i + b`, then negates all components if `negate` is true. This is used to implement `s[t] = (1-z[t])h[t] + z[t]s[t-1]` of the `GRU` (see equation (4) above).

```lua
nn.SAdd(-1, true)
```
Here, if the incoming data is `z[t]`, then the output becomes `-(z[t]-1) = 1-z[t]`. Notice that `nn.Mul()`, by contrast, multiplies by a scalar which is a learnable parameter.

<a name='rnn.MuFuRu'></a>
## MuFuRu ##

References:
 * A. [MuFuRU: The Multi-Function Recurrent Unit.](https://arxiv.org/abs/1606.03002)
 * B. [Tensorflow Implementation of the Multi-Function Recurrent Unit](https://github.com/dirkweissenborn/mufuru)

This is an implementation of the Multi-Function Recurrent Unit module.

The `nn.MuFuRu(inputSize, outputSize [,ops [,rho]])` constructor takes 2 required arguments, plus optional arguments:
 * `inputSize` : a number specifying the dimension of the input;
 * `outputSize` : a number specifying the dimension of the output;
 * `ops` : a table of strings representing which composition operations should be used. The table can be any subset of `{'keep', 'replace', 'mul', 'diff', 'forget', 'sqrt_diff', 'max', 'min'}`. By default, all composition operations are enabled;
 * `rho` : the maximum number of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999.

The Multi-Function Recurrent Unit generalizes the GRU by allowing weightings of arbitrary composition operators to be learned. As in the GRU, the reset gate is computed based on the current input and previous hidden state, and used to compute a new feature vector:

```lua
r[t] = σ(W[x->r]x[t] + W[s->r]s[t−1] + b[1->r])            (1)
v[t] = tanh(W[x->v]x[t] + W[sr->v](s[t−1]r[t]) + b[1->v])  (2)
```

where `W[a->b]` denotes the weight matrix from activation `a` to `b`, `t` denotes the time-step, `b[1->a]` is the bias for activation `a`, and `s[t-1]r[t]` is the element-wise multiplication of the two vectors.

Unlike in the GRU, rather than computing a single update gate (`z[t]` in [GRU](#rnn.GRU)), MuFuRu computes a weighting over an arbitrary number of composition operators.

A composition operator is any differentiable operator which takes two vectors of the same size, the previous hidden state and a new feature vector, and returns a new vector representing the new hidden state. The GRU implicitly defines two such operations, `keep` and `replace`, defined as `keep(s[t-1], v[t]) = s[t-1]` and `replace(s[t-1], v[t]) = v[t]`.

[Ref. A](https://arxiv.org/abs/1606.03002) proposes 6 additional operators, which all operate element-wise:

* `mul(x,y) = x * y`
* `diff(x,y) = x - y`
* `forget(x,y) = 0`
* `sqrt_diff(x,y) = 0.25 * sqrt(|x - y|)`
* `max(x,y)`
* `min(x,y)`

The weightings of each operation are computed via a softmax from the current input and previous hidden state, similar to the update gate in the GRU.
The produced hidden state is then the element-wise weighted sum of the output of each operation.
```lua
p^[t][j] = W[x->pj]x[t] + W[s->pj]s[t−1] + b[1->pj]          (3)
(p[t][1], ..., p[t][J]) = softmax(p^[t][1], ..., p^[t][J])   (4)
s[t] = sum(p[t][j] * op[j](s[t-1], v[t]))                    (5)
```

where `p[t][j]` is the weighting for operation `j` at time-step `t`, and the `sum` in equation 5 is over all operators `J`.

<a name='rnn.Recursor'></a>
## Recursor ##

This module decorates a `module` to be used within an `AbstractSequencer` instance.
It does this by making the decorated module conform to the `AbstractRecurrent` interface,
which, like the `LSTM` and `Recurrent` classes, this class inherits.

```lua
rec = nn.Recursor(module[, rho])
```

For each successive call to `updateOutput` (i.e. `forward`), this
decorator will create a `stepClone()` of the decorated `module`.
So for each time-step, it clones the `module`. Both the clone and the
original share parameters and gradients w.r.t. parameters. However, for
modules that already conform to the `AbstractRecurrent` interface,
the clone and original module are one and the same (i.e. no clone).

Examples:

Let's assume I want to stack two LSTMs. I could use two Sequencers:

```lua
lstm = nn.Sequential()
   :add(nn.Sequencer(nn.LSTM(100,100)))
   :add(nn.Sequencer(nn.LSTM(100,100)))
```

Using a `Recursor`, I can make the same model with a single `Sequencer`:

```lua
lstm = nn.Sequencer(
   nn.Recursor(
      nn.Sequential()
         :add(nn.LSTM(100,100))
         :add(nn.LSTM(100,100))
      )
   )
```

Actually, the `Sequencer` will wrap any non-`AbstractRecurrent` module automatically,
so I could simplify this further to:

```lua
lstm = nn.Sequencer(
   nn.Sequential()
      :add(nn.LSTM(100,100))
      :add(nn.LSTM(100,100))
   )
```

I can also add a `Linear` between the two `LSTM`s. In this case,
a `Linear` will be cloned (and have its parameters shared) for each time-step,
while the `LSTM`s will do whatever cloning they need internally:

```lua
lstm = nn.Sequencer(
   nn.Sequential()
      :add(nn.LSTM(100,100))
      :add(nn.Linear(100,100))
      :add(nn.LSTM(100,100))
   )
```

`AbstractRecurrent` instances like `Recursor`, `Recurrent` and `LSTM` are
expected to manage time-steps internally. Non-`AbstractRecurrent` instances
can be wrapped by a `Recursor` to have the same behavior.

Every call to `forward` on an `AbstractRecurrent` instance like `Recursor`
will increment the `self.step` attribute by 1, using a shared-parameter clone
for each successive time-step (for a maximum of `rho` time-steps, which defaults to 9999999).
In this way, `backward` can be called in reverse order of the `forward` calls
to perform backpropagation through time (BPTT).
This is exactly what
[AbstractSequencer](#rnn.AbstractSequencer) instances do internally.
The `backward` call, which is actually divided into calls to `updateGradInput` and
`accGradParameters`, decrements the `self.updateGradInputStep` and `self.accGradParametersStep`
counters by 1, respectively, starting at `self.step`.
Successive calls to `backward` will decrement these counters and use them to
backpropagate through the appropriate internal step-wise shared-parameter clones.

In most cases, however, you will not have to deal with the `Recursor` object directly, as
`AbstractSequencer` instances automatically decorate non-`AbstractRecurrent` instances
with a `Recursor` in their constructors.

For a concrete example of its use, please consult the [simple-recurrent-network.lua](examples/simple-recurrent-network.lua)
training script.

<a name='rnn.Recurrence'></a>
## Recurrence ##

An extremely general container for implementing pretty much any type of recurrence.

```lua
rnn = nn.Recurrence(recurrentModule, outputSize, nInputDim, [rho])
```

Unlike [Recurrent](#rnn.Recurrent), this module doesn't manage separate
modules like `inputModule`, `startModule`, `mergeModule` and the like.
Instead, it only manages a single `recurrentModule`, which should
output a Tensor or table `output(t)`
given an input table `{input(t), output(t-1)}`.
Using a mix of `Recursor` (say, via `Sequencer`) with `Recurrence`, one can implement
pretty much any type of recurrent neural network, including LSTMs and RNNs.

For the first step, the `Recurrence` forwards a Tensor (or table thereof)
of zeros through the recurrent layer (like LSTM, unlike Recurrent).
So it needs to know the `outputSize`, which is either a number,
a `torch.LongStorage`, or a table thereof. The batch dimension should be
excluded from the `outputSize`. Instead, the size of the batch dimension
(i.e. the number of samples) will be extrapolated from the `input` using
the `nInputDim` argument. For example, say that our input is a Tensor of size
`4 x 3` where `4` is the number of samples; then `nInputDim` should be `1`.
As another example, if our input is a table of tables [...] of tensors
where the first tensor (depth first) is the same as in the previous example,
then our `nInputDim` is also `1`.

As an example, let's use `Sequencer` and `Recurrence`
to build a simple RNN for language modeling:

```lua
rho = 5
hiddenSize = 10
outputSize = 5 -- num classes
nIndex = 10000

-- recurrent module
rm = nn.Sequential()
   :add(nn.ParallelTable()
      :add(nn.LookupTable(nIndex, hiddenSize))
      :add(nn.Linear(hiddenSize, hiddenSize)))
   :add(nn.CAddTable())
   :add(nn.Sigmoid())

rnn = nn.Sequencer(
   nn.Sequential()
      :add(nn.Recurrence(rm, hiddenSize, 1))
      :add(nn.Linear(hiddenSize, outputSize))
      :add(nn.LogSoftMax())
)
```

Note: We could very well reimplement the `LSTM` module using the
newer `Recursor` and `Recurrence` modules, but that would mean
breaking backwards compatibility for existing models saved on disk.

<a name='rnn.NormStabilizer'></a>
## NormStabilizer ##

Ref. A : [Regularizing RNNs by Stabilizing Activations](http://arxiv.org/abs/1511.08400)
This module implements the [norm-stabilization](http://arxiv.org/abs/1511.08400) criterion:

```lua
ns = nn.NormStabilizer([beta])
```

This module regularizes the hidden states of RNNs by minimizing the difference between the
L2-norms of consecutive steps. The cost function is defined as:
```
loss = beta * 1/T sum_t( ||h[t]|| - ||h[t-1]|| )^2
```
where `T` is the number of time-steps. Note that we do not divide the gradient by `T`,
such that the chosen `beta` can scale to different sequence sizes without being changed.

The sole argument, `beta`, is defined in Ref. A. Since we don't divide the gradients by
the number of time-steps, the default value of `beta = 1` should be valid for most cases.

This module should be added between RNNs (or LSTMs or GRUs) to provide better regularization of the hidden states.
For example:
```lua
local stepmodule = nn.Sequential()
   :add(nn.FastLSTM(10,10))
   :add(nn.NormStabilizer())
   :add(nn.FastLSTM(10,10))
   :add(nn.NormStabilizer())
local rnn = nn.Sequencer(stepmodule)
```

To use it with `SeqLSTM` you can do something like this:
```lua
local rnn = nn.Sequential()
   :add(nn.SeqLSTM(10,10))
   :add(nn.Sequencer(nn.NormStabilizer()))
   :add(nn.SeqLSTM(10,10))
   :add(nn.Sequencer(nn.NormStabilizer()))
```

<a name='rnn.AbstractSequencer'></a>
## AbstractSequencer ##
This abstract class implements a light interface shared by
subclasses like `Sequencer`, `Repeater`, `RecurrentAttention`, `BiSequencer` and so on.

<a name='rnn.Sequencer'></a>
## Sequencer ##

The `nn.Sequencer(module)` constructor takes a single argument, `module`, which is the module
to be applied from left to right, on each element of the input sequence.

```lua
seq = nn.Sequencer(module)
```

This Module is a kind of [decorator](http://en.wikipedia.org/wiki/Decorator_pattern)
used to abstract away the intricacies of `AbstractRecurrent` modules. While an `AbstractRecurrent` instance
requires a sequence to be presented one input at a time, each with its own call to `forward` (and `backward`),
the `Sequencer` forwards an `input` sequence (a table) into an `output` sequence (a table of the same length).
It also takes care of calling `forget` on `AbstractRecurrent` instances.

### Input/Output Format

The `Sequencer` requires inputs and outputs to be of shape `seqlen x batchsize x featsize`:

 * `seqlen` is the number of time-steps that will be fed into the `Sequencer`;
 * `batchsize` is the number of examples in the batch. Each example is its own independent sequence;
 * `featsize` is the size of the remaining non-batch dimensions. So this could be `1` for language models, or `c x h x w` for convolutional models, etc.

![Hello Fuzzy](https://oss.gittoolsai.com/images/Element-Research_rnn_readme_c0962d6e8c3f.png)

Above is an example input sequence for a character-level language model.
Its `seqlen` is 5, which means that it contains sequences of 5 time-steps.
The opening `{` and closing `}` illustrate that the time-steps are elements of a Lua table, although
it also accepts full Tensors of shape `seqlen x batchsize x featsize`.
The `batchsize` is 2, as there are two independent sequences: `{ H, E, L, L, O }` and `{ F, U, Z, Z, Y }`.
The `featsize` is 1, as there is only one feature dimension per character, and each such character is of size 1.
So the input in this case is a table of `seqlen` time-steps, where each time-step is represented by a `batchsize x featsize` Tensor.

![Sequence](https://oss.gittoolsai.com/images/Element-Research_rnn_readme_b6f087017adc.png)

Above is another example of a sequence (input or output).
It has a `seqlen` of 4 time-steps.
The `batchsize` is again 2, which means there are two sequences.
The `featsize` is 3, as each time-step of each sequence has 3 variables.
So each time-step (element of the table) is represented again as a tensor
of size `batchsize x featsize`.
Note that while in both examples the `featsize` encodes one dimension,
it could encode more.

### Example

For example, `rnn`, an instance of nn.AbstractRecurrent, can forward an `input` sequence one element at a time:
```lua
input = {torch.randn(3,4), torch.randn(3,4), torch.randn(3,4)}
rnn:forward(input[1])
rnn:forward(input[2])
rnn:forward(input[3])
```

Equivalently, we can use a Sequencer to forward the entire `input` sequence at once:

```lua
seq = nn.Sequencer(rnn)
seq:forward(input)
```

We can also forward Tensors instead of Tables:

```lua
-- seqlen x batchsize x featsize
input = torch.randn(3,3,4)
seq:forward(input)
```

### Details

The `Sequencer` can also take non-recurrent Modules (i.e. non-AbstractRecurrent instances) and apply them to each
input to produce an output table of the same length.
This is especially useful for processing variable-length sequences (tables).

Internally, the `Sequencer` expects the decorated `module` to be an
`AbstractRecurrent` instance. When this is not the case, the `module`
is automatically decorated with a [Recursor](#rnn.Recursor) module, which makes it
conform to the `AbstractRecurrent` interface.

Note: this is due to a recent update (27 Oct 2015); before this,
`AbstractRecurrent` and non-`AbstractRecurrent` instances needed to
be decorated by their own `Sequencer`. The recent update, which introduced the
`Recursor` decorator, allows a single `Sequencer` to wrap any type of module:
`AbstractRecurrent`, non-`AbstractRecurrent`, or a composite structure of both types.
Nevertheless, existing code shouldn't be affected by the change.

For a concise example of its use, please consult the [simple-sequencer-network.lua](examples/simple-sequencer-network.lua)
training script.
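As a minimal sketch of the non-recurrent case described above (sizes are arbitrary):

```lua
-- a Sequencer decorating a non-recurrent module: the same Linear
-- (with shared parameters) is applied independently to each time-step
local seq = nn.Sequencer(nn.Linear(4, 2))
local output = seq:forward({torch.randn(3, 4), torch.randn(3, 4)}) -- a table of 2 tensors, each 3 x 2
```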
<a name='rnn.Sequencer.remember'></a>
### remember([mode]) ###
When `mode='neither'` (the default behavior of the class), the Sequencer will additionally call [forget](#rnn.AbstractRecurrent.forget) before each call to `forward`.
When `mode='both'` (the default when calling this function), the Sequencer will never call [forget](#rnn.AbstractRecurrent.forget).
In this case, it is up to the user to call `forget` between independent sequences.
This behavior is only applicable to decorated AbstractRecurrent `modules`.
Accepted values for the `mode` argument are as follows:

 * 'eval' only affects evaluation (recommended for RNNs)
 * 'train' only affects training
 * 'neither' affects neither training nor evaluation (the default behavior of the class)
 * 'both' affects both training and evaluation (recommended for LSTMs)

### forget() ###
Calls the decorated AbstractRecurrent module's `forget` method.

<a name='rnn.SeqLSTM'></a>
## SeqLSTM ##

This module is a faster version of `nn.Sequencer(nn.FastLSTM(inputsize, outputsize))`:

```lua
seqlstm = nn.SeqLSTM(inputsize, outputsize)
```

Each time-step is computed as follows (same as [FastLSTM](#rnn.FastLSTM)):

```lua
i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i])                      (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f])                      (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)
c[t] = f[t]c[t−1] + i[t]z[t]                                         (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o])                      (5)
h[t] = o[t]tanh(c[t])                                                (6)
```

A notable difference is that this module expects the `input` and `gradOutput` to
be tensors instead of tables. The default shape is `seqlen x batchsize x inputsize` for
the `input` and `seqlen x batchsize x outputsize` for the `output`:

```lua
input = torch.randn(seqlen, batchsize, inputsize)
gradOutput = torch.randn(seqlen, batchsize, outputsize)

output = seqlstm:forward(input)
gradInput = seqlstm:backward(input, gradOutput)
```

Note that if you prefer to transpose the first two dimensions (i.e. `batchsize x seqlen` instead of the default `seqlen x batchsize`),
you can set `seqlstm.batchfirst = true` following initialization.

For variable-length sequences, set `seqlstm.maskzero = true`.
This is equivalent to calling `maskZero(1)` on a `FastLSTM` wrapped by a `Sequencer`:
```lua
fastlstm = nn.FastLSTM(inputsize, outputsize)
fastlstm:maskZero(1)
seqfastlstm = nn.Sequencer(fastlstm)
```

With `maskzero = true`, input sequences are expected to be separated by a time-step's worth of zero tensors.

The `seqlstm:toFastLSTM()` method generates a [FastLSTM](#rnn.FastLSTM) instance initialized with the parameters
of the `seqlstm` instance. Note, however, that the resulting parameters will not be shared (nor can they ever be).

Like the `FastLSTM`, the `SeqLSTM` does not use peephole connections between cell and gates (see [FastLSTM](#rnn.FastLSTM) for details).

Like the `Sequencer`, the `SeqLSTM` provides a [remember](#rnn.Sequencer.remember) method.

Note that a `SeqLSTM` cannot replace a `FastLSTM` in code that decorates it with an
`AbstractSequencer` or `Recursor`, as this would be equivalent to `Sequencer(Sequencer(FastLSTM))`.
You have been warned.

<a name='rnn.SeqLSTMP'></a>
## SeqLSTMP ##
References:
 * A. [LSTM RNN Architectures for Large Scale Acoustic Modeling](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf)
 * B. [Exploring the Limits of Language Modeling](https://arxiv.org/pdf/1602.02410v2.pdf)
```lua
lstmp = nn.SeqLSTMP(inputsize, hiddensize, outputsize)
```

The `SeqLSTMP` is a subclass of [SeqLSTM](#rnn.SeqLSTM).
It differs in that after computing the hidden state `h[t]` (eq. 6), it is
projected onto `r[t]` using a simple linear transform (eq. 7).
The computation of the gates also uses the previous such projection `r[t-1]` (eq. 1, 2, 3, 5).
This differs from `SeqLSTM`, which uses `h[t-1]` instead of `r[t-1]`.

The computation of a time-step outlined in `SeqLSTM` is replaced with the following:
```lua
i[t] = σ(W[x->i]x[t] + W[r->i]r[t−1] + b[1->i])                      (1)
f[t] = σ(W[x->f]x[t] + W[r->f]r[t−1] + b[1->f])                      (2)
z[t] = tanh(W[x->c]x[t] + W[r->c]r[t−1] + b[1->c])                   (3)
c[t] = f[t]c[t−1] + i[t]z[t]                                         (4)
o[t] = σ(W[x->o]x[t] + W[r->o]r[t−1] + b[1->o])                      (5)
h[t] = o[t]tanh(c[t])                                                (6)
r[t] = W[h->r]h[t]                                                   (7)
```

The algorithm is outlined in Ref. A and benchmarked with state-of-the-art results on the Google Billion Words dataset in Ref. B.
`SeqLSTMP` can be used with `hiddensize >> outputsize`, such that the effective size of the memory cells `c[t]`
and gates `i[t]`, `f[t]` and `o[t]` can be much larger than the actual input `x[t]` and output `r[t]`.
For fixed `inputsize` and `outputsize`, the `SeqLSTMP` will be able to remember much more information than the `SeqLSTM`.

<a name='rnn.SeqGRU'></a>
## SeqGRU ##

This module is a faster version of `nn.Sequencer(nn.GRU(inputsize, outputsize))`:

```lua
seqGRU = nn.SeqGRU(inputsize, outputsize)
```

Usage of SeqGRU differs from GRU in the same manner as SeqLSTM differs from LSTM. Therefore see [SeqLSTM](#rnn.SeqLSTM) for more details.

<a name='rnn.SeqBRNN'></a>
## SeqBRNN ##

```lua
brnn = nn.SeqBRNN(inputSize, outputSize, [batchFirst], [merge])
```

A bidirectional RNN that uses SeqLSTM. Internally contains a 'fwd' and a 'bwd' module of SeqLSTM. Expects an input shape of `seqlen x batchsize x inputsize`.
By setting `batchFirst` to true, the input shape can be `batchsize x seqlen x inputsize`. The merge module defaults to `CAddTable()`, summing the outputs of the two
output layers.

Example:
```lua
input = torch.rand(1, 1, 5)
brnn = nn.SeqBRNN(5, 5)
print(brnn:forward(input))
```
Prints an output of a 1x1x5 tensor.

<a name='rnn.BiSequencer'></a>
## BiSequencer ##
Applies encapsulated `fwd` and `bwd` rnns to an input sequence in forward and reverse order.
It is used for implementing bidirectional RNNs and LSTMs.

```lua
brnn = nn.BiSequencer(fwd, [bwd, merge])
```

The input to the module is a sequence (a table) of tensors
and the output is a sequence (a table) of tensors of the same length.
Applies a `fwd` rnn (an [AbstractRecurrent](#rnn.AbstractRecurrent) instance) to each element in the sequence in
forward order and applies the `bwd` rnn in reverse order (from last element to first element).
The `bwd` rnn defaults to:

```lua
bwd = fwd:clone()
bwd:reset()
```

For each step (in the original sequence), the outputs of both rnns are merged together using
the `merge` module (defaults to `nn.JoinTable(1,1)`).
\n\n\u003Ca name='rnn.SeqGRU'>\u003C\u002Fa>\n## SeqGRU ##\n\nThis module is a faster version of `nn.Sequencer(nn.GRU(inputsize, outputsize))` :\n\n```lua\nseqGRU = nn.SeqGRU(inputsize, outputsize)\n``` \n\nUsage of SeqGRU differs from GRU in the same manner as SeqLSTM differs from LSTM. Therefore see [SeqLSTM](#rnn.SeqLSTM) for more details.\n\n\u003Ca name='rnn.SeqBRNN'>\u003C\u002Fa>\n## SeqBRNN ##\n\n```lua\nbrnn = nn.SeqBRNN(inputSize, outputSize, [batchFirst], [merge])\n``` \n\nA bi-directional RNN that uses SeqLSTM. Internally contains a `fwd` and a `bwd` SeqLSTM module. Expects an input shape of `seqlen x batchsize x inputsize`.\nBy setting `batchFirst` to true, the input shape can be `batchsize x seqlen x inputsize`. The `merge` module defaults to `nn.CAddTable()`, summing the outputs from each\noutput layer.\n\nExample:\n```lua\ninput = torch.rand(1, 1, 5)\nbrnn = nn.SeqBRNN(5, 5)\nprint(brnn:forward(input))\n``` \nPrints an output of a 1x1x5 tensor.\n\n\u003Ca name='rnn.BiSequencer'>\u003C\u002Fa>\n## BiSequencer ##\nApplies encapsulated `fwd` and `bwd` rnns to an input sequence in forward and reverse order.\nIt is used for implementing Bidirectional RNNs and LSTMs.\n\n```lua\nbrnn = nn.BiSequencer(fwd, [bwd, merge])\n```\n\nThe input to the module is a sequence (a table) of tensors\nand the output is a sequence (a table) of tensors of the same length.\nApplies a `fwd` rnn (an [AbstractRecurrent](#rnn.AbstractRecurrent) instance) to each element in the sequence in\nforward order and applies the `bwd` rnn in reverse order (from last element to first element).\nThe `bwd` rnn defaults to:\n\n```lua\nbwd = fwd:clone()\nbwd:reset()\n```\n\nFor each step (in the original sequence), the outputs of both rnns are merged together using\nthe `merge` module (defaults to `nn.JoinTable(1,1)`). \nIf `merge` is a number, it specifies the [JoinTable](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Ftable.md#nn.JoinTable)\nconstructor's `nInputDim` argument, so that the `merge` module is initialized as:\n\n```lua\nmerge = nn.JoinTable(1,merge)\n```\n\nInternally, the `BiSequencer` is implemented by decorating a structure of modules that makes \nuse of 3 Sequencers for the forward, backward and merge modules.\n\nSimilarly to a [Sequencer](#rnn.Sequencer), the sequences in a batch must have the same size.\nBut the sequence length of each batch can vary.\n\nNote: make sure you call `brnn:forget()` after each call to `updateParameters()`. \nAlternatively, one could call `brnn.bwdSeq:forget()` so that only the `bwd` rnn forgets.\nThis is the minimum requirement, as it would not make sense for the `bwd` rnn to remember future sequences.\n\n\n\u003Ca name='rnn.BiSequencerLM'>\u003C\u002Fa>\n## BiSequencerLM ##\n\nApplies encapsulated `fwd` and `bwd` rnns to an input sequence in forward and reverse order.\nIt is used for implementing Bidirectional RNNs and LSTMs for Language Models (LM).\n\n```lua\nbrnn = nn.BiSequencerLM(fwd, [bwd, merge])\n```\n\nThe input to the module is a sequence (a table) of tensors\nand the output is a sequence (a table) of tensors of the same length.\nApplies a `fwd` rnn (an [AbstractRecurrent](#rnn.AbstractRecurrent) instance) to the \nfirst `N-1` elements in the sequence in forward order.\nApplies the `bwd` rnn in reverse order to the last `N-1` elements (from second-to-last element to first element).\nThis is the main difference between this module and the [BiSequencer](#rnn.BiSequencer).\nThe latter cannot be used for language modeling because the `bwd` rnn would be trained to predict the input it had just been fed as input.\n\n![BiDirectionalLM](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_6f299e328d3e.png)\n\nThe `bwd` rnn defaults to:\n\n```lua\nbwd = fwd:clone()\nbwd:reset()\n```\n\nWhile the `fwd` rnn will output representations for the last `N-1` steps,\nthe `bwd` rnn will output representations for the first `N-1` steps.\nThe missing outputs for each rnn (the first step for the `fwd`, the last step for the `bwd`)\nwill be filled with zero Tensors of the same size as the commensurate rnn's outputs.\nThis way they can be merged. If `nn.JoinTable` is used (the default), then the first \nand last output elements will be padded with zeros for the missing `fwd` and `bwd` rnn outputs, respectively.\n\nFor each step (in the original sequence), the outputs of both rnns are merged together using\nthe `merge` module (defaults to `nn.JoinTable(1,1)`). \nIf `merge` is a number, it specifies the [JoinTable](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Ftable.md#nn.JoinTable)\nconstructor's `nInputDim` argument, so that the `merge` module is initialized as:\n\n```lua\nmerge = nn.JoinTable(1,merge)\n```\n\nSimilarly to a [Sequencer](#rnn.Sequencer), the sequences in a batch must have the same size.\nBut the sequence length of each batch can vary.\n\nNote that LMs implemented with this module will not be classical LMs as they won't measure the \nprobability of a word given the previous words. Instead, they measure the probability of a word\ngiven the surrounding words, i.e. its context. While for mathematical reasons you may not be able to use this to measure the \nprobability of a sequence of words (like a sentence), \nyou can still measure the pseudo-likelihood of such a sequence (see [this](http:\u002F\u002Farxiv.org\u002Fpdf\u002F1504.01575.pdf) for a discussion).
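\n\nAs a hedged sketch (sizes invented, and `FastLSTM` chosen arbitrarily as the `fwd` rnn), a bidirectional context encoder over a table of tensors might be assembled as:\n\n```lua\nrequire 'rnn'\n\nlocal hiddensize = 10\nlocal fwd = nn.FastLSTM(hiddensize, hiddensize)\nlocal brnn = nn.BiSequencerLM(fwd) -- bwd defaults to fwd:clone():reset()\n\n-- 4 time-steps, batch of 2: each step is a batchsize x hiddensize tensor\nlocal input = {}\nfor t = 1, 4 do input[t] = torch.randn(2, hiddensize) end\n\n-- with the default nn.JoinTable merge, each output step is 2 x (2*hiddensize)\nlocal output = brnn:forward(input)\n\nbrnn:forget() -- between independent sequences, as with BiSequencer\n```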
\n\n\u003Ca name='rnn.Repeater'>\u003C\u002Fa>\n## Repeater ##\nThis Module is a [decorator](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern) similar to [Sequencer](#rnn.Sequencer).\nIt differs in that the sequence length is fixed beforehand and the input is repeatedly forwarded \nthrough the wrapped `module` to produce an output table of length `nStep`:\n```lua\nr = nn.Repeater(module, nStep)\n```\nArgument `module` should be an `AbstractRecurrent` instance.\nThis is useful for implementing models like [RCNNs](http:\u002F\u002Fjmlr.org\u002Fproceedings\u002Fpapers\u002Fv32\u002Fpinheiro14.pdf),\nwhich are repeatedly presented with the same input.\n\n\u003Ca name='rnn.RecurrentAttention'>\u003C\u002Fa>\n## RecurrentAttention ##\nReferences:\n  \n  * A. [Recurrent Models of Visual Attention](http:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F5542-recurrent-models-of-visual-attention.pdf)\n  * B. [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http:\u002F\u002Fincompleteideas.net\u002Fsutton\u002Fwilliams-92.pdf)\n  \nThis module can be used to implement the Recurrent Attention Model (RAM) presented in Ref. A:\n```lua\nram = nn.RecurrentAttention(rnn, action, nStep, hiddenSize)\n```\n\n`rnn` is an [AbstractRecurrent](#rnn.AbstractRecurrent) instance. \nIts input is `{x, z}` where `x` is the input to the `ram` and `z` is an \naction sampled from the `action` module. \nThe output size of the `rnn` must be equal to `hiddenSize`.\n\n`action` is a [Module](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Fmodule.md#nn.Module) \nthat uses a [REINFORCE module](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.Reinforce) (ref. B) like \n[ReinforceNormal](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.ReinforceNormal), \n[ReinforceCategorical](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.ReinforceCategorical), or \n[ReinforceBernoulli](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.ReinforceBernoulli) \nto sample actions given the previous time-step's output of the `rnn`. \nDuring the first time-step, the `action` module is fed with a Tensor of zeros of size `input:size(1) x hiddenSize`.\nIt is important to understand that the sampled actions do not receive gradients \nbackpropagated from the training criterion. \nInstead, a reward is broadcast from a reward criterion like [VRClassReward](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.VRClassReward) to \nthe `action`'s REINFORCE module, which will backpropagate gradients computed from the `output` samples \nand the `reward`. \nTherefore, the `action` module's outputs are only used internally, within the RecurrentAttention module.\n\n`nStep` is the number of actions to sample, i.e. the number of elements in the `output` table.\n\n`hiddenSize` is the output size of the `rnn`. This variable is necessary \nto generate the zero Tensor to sample an action for the first step (see above).\n\nA complete implementation of Ref. A is available [here](examples\u002Frecurrent-visual-attention.lua).
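\n\nFor orientation only, here is a hedged sketch of the `action` module's typical shape; the layer sizes and the 0.1 standard deviation are invented, and `ReinforceNormal` comes from the dpnn package linked above. The `rnn` itself is omitted; see the linked example script for a complete model:\n\n```lua\nrequire 'rnn' -- the rnn package depends on dpnn, which provides the Reinforce modules\n\nlocal hiddenSize = 256\n\n-- maps the rnn's previous output to a 2D glimpse location and samples around it\nlocal action = nn.Sequential()\n   :add(nn.Linear(hiddenSize, 2))\n   :add(nn.HardTanh())           -- keep the mean location within [-1, 1]\n   :add(nn.ReinforceNormal(0.1)) -- stochastic sampling; trained via REINFORCE rewards\n\n-- given an AbstractRecurrent `rnn` with output size hiddenSize and a step count\n-- nStep, the attention model would then be assembled as:\n-- ram = nn.RecurrentAttention(rnn, action, nStep, hiddenSize)\n```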
\n\n\u003Ca name='rnn.MaskZero'>\u003C\u002Fa>\n## MaskZero ##\nThis module zeroes the `output` rows of the decorated module \nfor commensurate `input` rows which are tensors of zeros.\n\n```lua\nmz = nn.MaskZero(module, nInputDim)\n```\n\nThe `output` Tensor (or table thereof) of the decorated `module`\nwill have each row (sample) zeroed when the commensurate row of the `input` \nis a tensor of zeros. \n\nThe `nInputDim` argument must specify the number of non-batch dims \nin the first Tensor of the `input`. In the case of an `input` table,\nthe first Tensor is the first one encountered when doing a depth-first search.\n\nThis decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\nCaveat: `MaskZero` does not guarantee that the `output` and `gradInput` tensors of the internal modules \nof the decorated `module` will also be zeroed when the `input` is zero. \n`MaskZero` only affects the immediate `gradInput` and `output` of the module that it encapsulates.\nHowever, for most modules, the gradient update for that time-step will be zero because \nbackpropagating a gradient of zeros will typically yield zeros all the way to the input.\nIn this respect, modules to avoid encapsulating inside a `MaskZero` are `AbstractRecurrent` \ninstances, as these flow gradients between different time-steps internally. \nInstead, call the [AbstractRecurrent.maskZero](#rnn.AbstractRecurrent.maskZero) method\nto encapsulate the internal `recurrentModule`.\n\n\u003Ca name='rnn.TrimZero'>\u003C\u002Fa>\n## TrimZero ##\n\nWARNING : only use this module if your input contains lots of zeros. \nIn almost all cases, [`MaskZero`](#rnn.MaskZero) will be faster, especially with CUDA.\n\nRef. A : [TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing](https:\u002F\u002Fbi.snu.ac.kr\u002FPublications\u002FConferences\u002FDomestic\u002FKIIS2016S_JHKim.pdf)\n\nThe usage is the same as `MaskZero`.\n\n```lua\nmz = nn.TrimZero(module, nInputDim)\n```\n\nThe only difference from `MaskZero` is that it reduces computational costs by shrinking the effective batch size when sequences of varying lengths are provided in the input. \nNotice that when the lengths are consistent, `MaskZero` will be faster, because `TrimZero` has an additional operational cost. \n\nIn short, the result is the same as `MaskZero`'s; however, `TrimZero` is faster than `MaskZero` only when sentence lengths vary considerably.\n\nIn practice, e.g. for a language model, `TrimZero` is expected to be about 30% faster than `MaskZero`. (You can test this using `test\u002Ftest_trimzero.lua`.)\n\n\u003Ca name='rnn.LookupTableMaskZero'>\u003C\u002Fa>\n## LookupTableMaskZero ##\nThis module extends `nn.LookupTable` to support zero indexes. Zero indexes are forwarded as zero tensors.\n\n```lua\nlt = nn.LookupTableMaskZero(nIndex, nOutput)\n```\n\nThe `output` Tensor will have each row zeroed when the commensurate row of the `input` is a zero index.
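\n\nA minimal sketch (vocabulary size and embedding size are arbitrary) of how index `0` behaves as padding:\n\n```lua\nrequire 'rnn'\n\nlocal nIndex, nOutput = 100, 16\nlocal lt = nn.LookupTableMaskZero(nIndex, nOutput)\n\n-- a batch of 2 samples at one time-step; index 0 marks padding\nlocal input = torch.LongTensor{3, 0}\nlocal output = lt:forward(input) -- output[2] is a row of zeros\n```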
\n\nThis lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\n\u003Ca name='rnn.MaskZeroCriterion'>\u003C\u002Fa>\n## MaskZeroCriterion ##\nThis criterion zeroes the `err` and `gradInput` rows of the decorated criterion \nfor commensurate `input` rows which are tensors of zeros.\n\n```lua\nmzc = nn.MaskZeroCriterion(criterion, nInputDim)\n```\n\nThe `gradInput` Tensor (or table thereof) of the decorated `criterion`\nwill have each row (sample) zeroed when the commensurate row of the `input` \nis a tensor of zeros. The `err` will also disregard such zero rows.\n\nThe `nInputDim` argument must specify the number of non-batch dims \nin the first Tensor of the `input`. In the case of an `input` table,\nthe first Tensor is the first one encountered when doing a depth-first search.\n\nThis decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\n\u003Ca name='rnn.SeqReverseSequence'>\u003C\u002Fa>\n## SeqReverseSequence ##\n\n```lua\nreverseSeq = nn.SeqReverseSequence(dim)\n```\n\nReverses an input tensor on a specified dimension. The reversal dimension can be no larger than three.\n\nExample:\n```lua\ninput = torch.Tensor({{1,2,3,4,5}, {6,7,8,9,10}})\nreverseSeq = nn.SeqReverseSequence(1)\nprint(reverseSeq:forward(input))\n-- gives us an output of torch.Tensor({{6,7,8,9,10},{1,2,3,4,5}})\n```\n\n\u003Ca name='rnn.SequencerCriterion'>\u003C\u002Fa>\n## SequencerCriterion ##\n\nThis Criterion is a [decorator](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern):\n\n```lua\nc = nn.SequencerCriterion(criterion, [sizeAverage])\n``` \n\nBoth the `input` and `target` are expected to be a sequence, either as a table or Tensor. \nFor each step in the sequence, the corresponding elements of the input and target \nwill be applied to the `criterion`.\nThe output of `forward` is the sum of all individual losses in the sequence. \nThis is useful when used in conjunction with a [Sequencer](#rnn.Sequencer).\n\nIf `sizeAverage` is `true` (default is `false`), the `output` loss and `gradInput` are averaged over each time-step.\n\n\u003Ca name='rnn.RepeaterCriterion'>\u003C\u002Fa>\n## RepeaterCriterion ##\n\nThis Criterion is a [decorator](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern):\n\n```lua\nc = nn.RepeaterCriterion(criterion)\n``` \n\nThe `input` is expected to be a sequence (table or Tensor). 
A single `target` is \nrepeatedly applied using the same `criterion` to each element in the `input` sequence.\nThe output of `forward` is the sum of all individual losses in the sequence.\nThis is useful for implementing models like [RCNNs](http:\u002F\u002Fjmlr.org\u002Fproceedings\u002Fpapers\u002Fv32\u002Fpinheiro14.pdf),\nwhich are repeatedly presented with the same target.\n","# rnn: 循环神经网络 #\n\n注意：此仓库已被弃用，推荐使用 https:\u002F\u002Fgithub.com\u002Ftorch\u002Frnn。\n\n这是一个扩展了 Torch 的 nn 模块的循环神经网络库。你可以用它来构建 RNN、LSTM、GRU、双向 RNN、双向 LSTM 等等。\n\n该库包含以下对象的文档：\n\n将连续的 `forward` 调用视为序列中不同时间步的模块：\n * [AbstractRecurrent](#rnn.AbstractRecurrent)：由 Recurrent 和 LSTM 继承的抽象类；\n * [Recurrent](#rnn.Recurrent)：一个通用的循环神经网络容器；\n * [LSTM](#rnn.LSTM)：一个标准的长短期记忆模块；\n  * [FastLSTM](#rnn.FastLSTM)：一个更快的 [LSTM](#rnn.LSTM)，可选支持批量归一化；\n * [GRU](#rnn.GRU)：门控循环单元模块；\n * [MuFuRu](#rnn.MuFuRu)：[多功能循环单元](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.03002)模块；\n * [Recursor](#rnn.Recursor)：装饰一个模块使其符合 [AbstractRecurrent](#rnn.AbstractRecurrent) 接口；\n * [Recurrence](#rnn.Recurrence)：装饰一个根据 `{input(t), output(t-1)}` 输出 `output(t)` 的模块；\n * [NormStabilizer](#rnn.NormStabilizer)：实现 [范数稳定化](http:\u002F\u002Farxiv.org\u002Fabs\u002F1511.08400)准则（可在 RNN 之间添加此模块）；\n\n通过装饰后的 `AbstractRecurrent` 实例对整个序列进行 `forward` 的模块：\n * [AbstractSequencer](#rnn.AbstractSequencer)：由 Sequencer、Repeater、RecurrentAttention 等继承的抽象类；\n * [Sequencer](#rnn.Sequencer)：将封装的模块应用于输入序列中的所有元素（张量或表）；\n * [SeqLSTM](#rnn.SeqLSTM)：`nn.Sequencer(nn.FastLSTM)` 的极速版本，其中 `input` 和 `output` 均为张量；\n  * [SeqLSTMP](#rnn.SeqLSTMP)：带有投影层的 `SeqLSTM`；\n * [SeqGRU](#rnn.SeqGRU)：`nn.Sequencer(nn.GRU)` 的极速版本，其中 `input` 和 `output` 均为张量；\n * [SeqBRNN](#rnn.SeqBRNN)：基于 SeqLSTM 的双向 RNN；\n * [BiSequencer](#rnn.BiSequencer)：用于实现双向 RNN 和 LSTM；\n * [BiSequencerLM](#rnn.BiSequencerLM)：用于实现语言模型的双向 RNN 和 LSTM；\n * [Repeater](#rnn.Repeater)：将相同的输入重复应用于一个 AbstractRecurrent 实例；\n * [RecurrentAttention](#rnn.RecurrentAttention)：一种针对 [REINFORCE 模块](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.Reinforce)的通用注意力模型；\n\n其他模块和损失函数：\n * [MaskZero](#rnn.MaskZero)：对于输入中与零张量对应的行，将被装饰模块的 `output` 和 `gradOutput` 行置零；\n * [TrimZero](#rnn.TrimZero)：行为与 `MaskZero` 相同，但在输入包含大量零掩码行时效率更高；\n * [LookupTableMaskZero](#rnn.LookupTableMaskZero)：扩展 `nn.LookupTable` 以支持用于填充的零索引。零索引会被当作零张量传递；\n * [MaskZeroCriterion](#rnn.MaskZeroCriterion)：对于输入中与零张量对应的行，将被装饰损失函数的 `gradInput` 和 `err` 行置零；\n * [SeqReverseSequence](#rnn.SeqReverseSequence)：在特定维度上反转输入序列；\n\n用于处理序列型输入和目标的损失函数：\n * [SequencerCriterion](#rnn.SequencerCriterion)：依次将同一损失函数应用于输入和目标序列（张量或表）。\n * [RepeaterCriterion](#rnn.RepeaterCriterion)：对同一个目标重复应用同一损失函数于一个序列。\n\n安装此仓库的方法如下：\n```\ngit clone git@github.com:Element-Research\u002Frnn.git\ncd rnn\nluarocks make rocks\u002Frnn-scm-1.rockspec\n```\n\n请注意，现在执行 `luarocks install rnn` 将会安装 https:\u002F\u002Fgithub.com\u002Ftorch\u002Frnn。\n\n\u003Ca name='rnn.examples'>\u003C\u002Fa>\n## 示例 ##\n\n以下是使用本包的一些训练脚本示例：\n\n  * [RNN\u002FLSTM\u002FGRU](examples\u002Frecurrent-language-model.lua) 用于 Penn Tree Bank 数据集；\n  * [噪声对比估计](examples\u002Fnoise-contrastive-estimate.lua) 用于在 [Google Billion Words 数据集](https:\u002F\u002Fgithub.com\u002FElement-Research\u002Fdataload#dl.loadGBW)上训练多层 [SeqLSTM](#rnn.SeqLSTM) 语言模型。该示例使用 [MaskZero](#rnn.MaskZero) 来训练具有不同长度的独立序列，结合 [NCEModule](https:\u002F\u002Fgithub.com\u002FElement-Research\u002Fdpnn#nn.NCEModule) 和 [NCECriterion](https:\u002F\u002Fgithub.com\u002FElement-Research\u002Fdpnn#nn.NCECriterion)。此脚本是我们目前最快的，速度可达每秒 20,000 个词（在 NVIDIA Titan X 上），采用两层隐藏单元数为 250 的 
LSTM，批次大小为 128，序列长度为 100。请注意，你需要安装 [Torch，并使用 Lua 而不是 LuaJIT](http:\u002F\u002Ftorch.ch\u002Fdocs\u002Fgetting-started.html#_)；\n  * [视觉注意力的循环模型](examples\u002Frecurrent-visual-attention.lua) 用于 MNIST 数据集；\n  * [编码器-解码器 LSTM](examples\u002Fencoder-decoder-coupling.lua) 展示如何将编码器和解码器的 `LSTM` 耦合起来，用于序列到序列网络；\n  * [简单循环网络](examples\u002Fsimple-recurrent-network.lua) 展示如何构建并训练一个简单的循环神经网络；\n  * [简单序列网络](examples\u002Fsimple-sequencer-network.lua) 是上述脚本的一个版本，使用 Sequencer 来装饰 `rnn`；\n  * [序列到单个输出](examples\u002Fsequence-to-one.lua) 演示如何进行多对一的序列学习，例如情感分析；\n  * [多元时间序列](examples\u002Frecurrent-time-series.lua) 演示如何训练一个简单的 RNN 来进行多元时间序列预测。\n\n### 外部资源\n\n  * [rnn-benchmarks](https:\u002F\u002Fgithub.com\u002Fglample\u002Frnn-benchmarks) : 对 Torch（使用本库）、Theano 和 TensorFlow 进行比较的基准测试。\n  * [哈佛大学 Jupyter Notebook 教程](http:\u002F\u002Fnbviewer.jupyter.org\u002Fgithub\u002FCS287\u002FLectures\u002Fblob\u002Fgh-pages\u002Fnotebooks\u002FElementRNNTutorial.ipynb) : 哈佛大学提供的关于如何使用 Element-Research rnn 包的深入教程；\n  * [dpnn](https:\u002F\u002Fgithub.com\u002FElement-Research\u002Fdpnn) : 这是 __rnn__ 包的一个依赖项。它包含有用的 nn 扩展、模块和损失函数；\n  * [dataload](https:\u002F\u002Fgithub.com\u002FElement-Research\u002Fdataload) : 一系列 Torch 数据集加载器；\n  * [RNN\u002FLSTM\u002FBRNN\u002FBLSTM 训练脚本 ](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdp\u002Fblob\u002Fmaster\u002Fexamples\u002Frecurrentlanguagemodel.lua) 用于 Penn Tree Bank 或 Google Billion Words 数据集；\n  * 关于 Torch7 的简短概述（1 小时），其中包括关于 __rnn__ 包的一些细节（在结尾处），可通过此 [NVIDIA GTC 网络研讨会视频](http:\u002F\u002Fon-demand.gputechconf.com\u002Fgtc\u002F2015\u002Fwebinar\u002Ftorch7-applied-deep-learning-for-vision-natural-language.mp4)观看。无论如何，该演示文稿都很好地概述了使用 Torch7 的逻辑回归、多层感知机、卷积神经网络和循环神经网络；\n  * [使用编码器-解码器 RNN 进行序列到序列映射](https:\u002F\u002Fgithub.com\u002Frahul-iisc\u002Fseq2seq-mapping) : 一个使用合成数据的完整训练示例。\n  * [ConvLSTM](https:\u002F\u002Fgithub.com\u002Fviorik\u002FConvLSTM) 是一个用于训练 [具有可微分记忆的时空视频自编码器](http:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06309)的仓库。\n  * 一个用于单变量时间序列预测的 [时间序列示例](https:\u002F\u002Fgithub.com\u002Frracinskij\u002Frnntest01\u002Fblob\u002Fmaster\u002Frnntest01.lua)。\n\n## 引用 ##\n\n如果您在工作中使用了 __rnn__，我们非常感谢您能引用以下论文：\n\nLéonard, Nicholas, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. 
[rnn: Recurrent Library for Torch.](http:\u002F\u002Farxiv.org\u002Fabs\u002F1511.07889) arXiv 预印本 arXiv:1511.07889 (2015)。\n\n任何对该库做出重大贡献的人都将被添加为论文的作者。\n[重大贡献者](https:\u002F\u002Fgithub.com\u002FElement-Research\u002Frnn\u002Fgraphs\u002Fcontributors) \n是指向该库添加至少 300 行代码的人。\n\n## 故障排除 ##\n\n大多数问题可以通过更新各种依赖项来解决：\n```bash\nluarocks install torch\nluarocks install nn\nluarocks install dpnn\nluarocks install torchx\n```\n\n如果您使用 CUDA：\n```bash\nluarocks install cutorch\nluarocks install cunn\nluarocks install cunnx\n```\n\n别忘了更新这个包：\n```bash\ngit clone git@github.com:Element-Research\u002Frnn.git\ncd rnn\nluarocks make rocks\u002Frnn-scm-1.rockspec\n```\n\n如果这样仍然无法解决问题，请在 GitHub 上提交一个问题。\n\n\u003Ca name='rnn.AbstractRecurrent'>\u003C\u002Fa>\n## AbstractRecurrent ##\n由 [Recurrent](#rnn.Recurrent)、[LSTM](#rnn.LSTM) 和 [GRU](#rnn.GRU) 继承的抽象类。\n构造函数接受一个参数：\n```lua\nrnn = nn.AbstractRecurrent([rho])\n```\n参数 `rho` 是反向传播通过时间（BPTT）的最大步数。\n子类可以将其设置为一个很大的数字，比如 99999（默认值），如果他们希望对整个序列进行反向传播，无论其长度如何。设置较小的 `rho` 值在处理长序列时很有用，但我们只希望对最后 `rho` 步进行反向传播，这意味着不需要存储序列的其余部分（因此没有额外的成本）。\n\n### [recurrentModule] getStepModule(step) ###\n返回第 `step` 个时间步的模块。这由子类内部使用，以获取内部 `recurrentModule` 的副本。这些副本共享 `parameters` 和 `gradParameters`，但各自拥有自己的 `output`、`gradInput` 以及任何其他中间状态。\n\n### setOutputStep(step) ###\n这是为 [Recursor](#rnn.Recursor) 在反向传播时保留的内部方法。它将对象的 `output` 属性设置为指向第 `step` 个时间步的输出。\n引入此方法是为了修复一个非常烦人的错误。\n\n\u003Ca name='rnn.AbstractRecurrent.maskZero'>\u003C\u002Fa>\n### maskZero(nInputDim) ###\n用 [MaskZero](#rnn.MaskZero) 装饰内部 `recurrentModule`。当 `input` 中相应行是零张量时，`recurrentModule` 的 `output` 张量（或其表）中的每一行（即样本）都会被置零。\n\n`nInputDim` 参数必须指定 `input` 中第一张张量的非批次维度数量。对于 `input` 表格，第一张张量是在深度优先搜索中遇到的第一张。\n\n调用此方法可以使同一批次中不同长度的序列用零向量填充。\n\n当某个样本时间步被掩蔽（即 `input` 是一行零）时，隐藏状态实际上会在下一个非掩蔽时间步被重置（即遗忘）。换句话说，可以用掩蔽元素将不相关的序列分开。\n\n### trimZero(nInputDim) ###\n用 [TrimZero](#rnn.TrimZero) 装饰内部 `recurrentModule`。\n\n### [output] updateOutput(input) ###\n向前传播当前步骤的输入。前几步的输出或中间状态会被递归地使用。这对调用者来说是透明的，因为之前的输出和中间状态会被记住。此方法还会将 `step` 属性加 1。\n\n\u003Ca name='rnn.AbstractRecurrent.updateGradInput'>\u003C\u002Fa>\n### updateGradInput(input, gradOutput) ###\n与 `backward` 类似，此方法应按照与传播序列时的 `forward` 调用相反的顺序调用。例如：\n\n```lua\nrnn = nn.LSTM(10, 10) -- AbstractRecurrent 实例\nlocal outputs = {}\nfor i=1,nStep do -- 向前传播序列\n   outputs[i] = rnn:forward(inputs[i])\nend\n\nfor i=nStep,1,-1 do -- 按照相反顺序向后传播序列\n   gradInputs[i] = rnn:backward(inputs[i], gradOutputs[i])\nend\n\nrnn:forget()\n``` \n\n相反的顺序实现了通过时间的反向传播（BPTT）。\n\n### accGradParameters(input, gradOutput, scale) ###\n与 `updateGradInput` 类似，但用于累积相对于参数的梯度。\n\n### recycle(offset) ###\n此方法与 `forget` 密切相关。当当前时间步大于 `rho` 时，它很有用，此时会开始回收最旧的 `recurrentModule` `sharedClones`，以便它们可以用于存储下一步。此 `offset` 用于像 `nn.Recurrent` 这样的模块，它们在第一步使用不同的模块。默认偏移量为 0。\n\n\u003Ca name='rnn.AbstractRecurrent.forget'>\u003C\u002Fa>\n\n### forget(offset) ###\n此方法将所有状态重置到序列缓冲区的起始位置，\n即忘记当前序列。它还会将 `step` 属性重置为 1。\n强烈建议在每次参数更新后调用 `forget` 方法。\n否则，前一个状态会被用来激活下一个状态，这通常会导致不稳定。\n这是因为前一个状态是基于已经改变的参数计算得出的。\n此外，在每个新序列开始时调用 `forget` 也是一个良好的实践。\n\n\u003Ca name='rnn.AbstractRecurrent.maxBPTTstep'>\u003C\u002Fa>\n### maxBPTTstep(rho) ###\n此方法设置进行时间反向传播（BPTT）的最大时间步数。例如，如果你将此值设置为 `rho = 3` 个时间步，\n然后进行 4 步前向传播，再进行反向传播，那么只有最后 3 步会被用于反向传播。如果你的 AbstractRecurrent 实例被 [Sequencer](#rnn.Sequencer) 包装，\nSequencer 会自动处理这一过程。否则，将此值设置为较大的数（如 9999999）对于大多数情况都是合适的。\n\n\u003Ca name='rnn.AbstractRecurrent.backwardOnline'>\u003C\u002Fa>\n### backwardOnline() ###\n此方法已于 2016 年 1 月 6 日弃用。\n自那以后，AbstractRecurrent 实例默认使用 backwardOnline 行为。\n详情请参阅 
[updateGradInput](#rnn.AbstractRecurrent.updateGradInput)。\n\n### training() ###\n在训练模式下，网络会记住过去 `rho`（时间步数）内的所有状态。\n这是进行 BPTT 所必需的。\n\n### evaluate() ###\n在评估过程中，由于无需稍后执行 BPTT，\n因此仅需记住上一步的状态即可。这样在内存使用上非常高效，\n使得可以对潜在无限长度的序列进行评估。\n\n\u003Ca name='rnn.Recurrent'>\u003C\u002Fa>\n## 循环神经网络 ##\n参考文献：\n * A. [Sutsekever 论文第 2.5 和 2.8 节](http:\u002F\u002Fwww.cs.utoronto.ca\u002F~ilya\u002Fpubs\u002Filya_sutskever_phd_thesis.pdf)\n * B. [Mikolov 论文第 3.2 和 3.3 节](http:\u002F\u002Fwww.fit.vutbr.cz\u002F~imikolov\u002Frnnlm\u002Fthesis.pdf)\n * C. [RNN 和反向传播指南](http:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.3.9311&rep=rep1&type=pdf)\n\n这是一个用于实现循环神经网络（RNN）的[复合模块](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Fcontainers.md#containers)，不包括输出层。\n\n`nn.Recurrent(start, input, feedback, [transfer, rho, merge])` 构造函数接受 6 个参数：\n * `start`：输出的大小（不包括批次维度），或者是在传播的第一步中插入在 `input` 模块和 `transfer` 模块之间的模块。当 `start` 是一个大小（数字或 `torch.LongTensor`）时，这个 *start* 模块会被初始化为 `nn.Add(start)`（见参考文献 A）。\n * `input`：处理输入张量（或表）的模块。其输出必须与 `start` 的大小相同（或者在有 `start` 模块的情况下为其输出大小相同），并且与 `feedback` 模块的输出大小一致。\n * `feedback`：将前一步的输出张量（或表）反馈至 `merge` 模块的模块。\n * `merge`：一个[表模块](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Ftable.md#table-layers)，用于在通过 `transfer` 模块之前合并 `input` 和 `feedback` 模块的输出。\n * `transfer`：一个非线性模块，用于处理 `merge` 模块的输出，或者在第一步中处理 `start` 模块的输出。\n * `rho`：可回溯的最大反向传播步数。限制了存储在内存中的先前步骤数量。由于梯度消失效应，参考文献 A 和 B 建议将 `rho` 设置为 5（或更低）。默认值为 99999。\n\nRNN 用于处理一系列输入。\n序列中的每一步都应通过各自的 `forward`（和 `backward`）来传播，\n一次处理一个 `input`（和 `gradOutput`）。\n每次调用 `forward` 都会记录中间状态（`input` 和许多 `Module.outputs`），\n并将 `step` 属性加 1。\n`backward` 方法必须按照与 `forward` 调用相反的顺序调用，\n以进行时间反向传播（BPTT）。这种逆序是必要的，\n以便为每次 `forward` 调用返回对应的 `gradInput`。\n\n只有在调用 `forget` 方法时，`step` 属性才会重置为 1。\n此时，模块便准备好处理下一个序列（或其批次）。\n需要注意的是，序列越长，存储所有 `output` 和 `gradInput` 状态所需的内存就越多（每个时间步都需要一个状态）。\n\n为了将此模块用于批次处理，我们建议在一个批次中使用多个相同大小的序列，\n并在每 `rho` 步调用 `updateParameters`，在序列结束时调用 `forget`。\n\n请注意，调用 `evaluate` 方法会关闭长期记忆；\nRNN 将只记住上一步的输出。这使得 RNN 可以处理长序列而无需额外分配内存。\n\n\n有关如何使用此模块的简单示例，请参阅\n[simple-recurrent-network.lua](examples\u002Fsimple-recurrent-network.lua)\n训练脚本。\n\n\u003Ca name='rnn.Recurrent.Sequencer'>\u003C\u002Fa>\n### 使用 Sequencer 进行装饰 ###\n\n请注意，任何 `AbstractRecurrent` 实例都可以用 [Sequencer](#rnn.Sequencer) 进行装饰，\n这样只需一次 `forward\u002Fbackward` 调用就可以处理整个序列（一个表）。\n实际上，这也是推荐的做法，因为它允许堆叠 RNN，并使 rnn 符合模块接口，\n即每次调用 `forward` 后都可以立即调用 `backward`，\n因为模型的每个输入都是一个完整的序列，即一个由张量组成的表，\n其中每个张量代表一个时间步。\n\n```lua\nseq = nn.Sequencer(module)\n```\n\n[simple-sequencer-network.lua](examples\u002Fsimple-sequencer-network.lua) 训练脚本\n与上述 [simple-recurrent-network.lua](examples\u002Fsimple-recurrent-network.lua) 脚本等效，\n唯一的区别在于它用 `Sequencer` 装饰了 `rnn`，该 `Sequencer` 接受一个包含 `inputs` 和 `gradOutputs` 的表（该批次的序列）。\n这使得 `Sequencer` 能够自动处理序列的循环。\n\n只有当你打算将其用于实时预测时，才需要考虑不使用 `Sequencer` 的 `AbstractRecurrent` 模块。\n实际上，即使使用被 `Sequencer` 装饰的 `AbstractRecurrent` 实例，\n也可以通过调用 `Sequencer:remember()` 并将每个时间步的输入表示为 `{input}` 来进行实时预测。\n其他装饰器也可以使用，比如 [Repeater](#rnn.Repeater) 或 [RecurrentAttention](#rnn.RecurrentAttention)。\n`Sequencer` 只是最常见的选择。\n\n\u003Ca name='rnn.LSTM'>\u003C\u002Fa>\n\n## LSTM ##\n参考文献：\n * A. [深度循环神经网络在语音识别中的应用](http:\u002F\u002Farxiv.org\u002Fpdf\u002F1303.5778v1.pdf)\n * B. [长短期记忆网络](http:\u002F\u002Fweb.eecs.utk.edu\u002F~itamar\u002Fcourses\u002FECE-692\u002FBobby_paper1.pdf)\n * C. [LSTM：搜索空间奥德赛](http:\u002F\u002Farxiv.org\u002Fpdf\u002F1503.04069v1.pdf)\n * D. 
[github上的nngraph LSTM实现](https:\u002F\u002Fgithub.com\u002Fwojzaremba\u002Flstm)\n\n这是一个标准的长短期记忆模块的实现。我们以参考文献A中的LSTM作为本模块的设计蓝图，因为它最为简洁。同时，这也是参考文献C中描述的标准LSTM。\n\n`nn.LSTM(inputSize, outputSize, [rho])`构造函数接受三个参数：\n * `inputSize`：指定输入大小的数字；\n * `outputSize`：指定输出大小的数字；\n * `rho`：反向传播时回溯的最大步数。限制了保存在内存中的先前步骤数量。默认值为9999。\n\n![LSTM](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_a70564f3bef3.png) \n\n实际实现对应于以下算法：\n```lua\ni[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + W[c->i]c[t−1] + b[1->i])      (1)\nf[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + W[c->f]c[t−1] + b[1->f])      (2)\nz[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + W[c->o]c[t] + b[1->o])        (5)\nh[t] = o[t]tanh(c[t])                                                (6)\n```\n其中`W[s->q]`表示从`s`到`q`的权重矩阵，`t`表示时间步，`b[1->q]`是进入`q`的偏置项，`σ()`为Sigmoid函数，`x[t]`为输入，`i[t]`为输入门（公式1），`f[t]`为遗忘门（公式2），`z[t]`为传入细胞的状态（我们称之为隐藏状态）（公式3），`c[t]`为细胞状态（公式4），`o[t]`为输出门（公式5），而`h[t]`则是该模块的输出（公式6）。另外需要注意的是，从细胞到门向量的权重矩阵都是对角矩阵`W[c->s]`，其中`s`可以是`i`、`f`或`o`。\n\n正如你所见，与[RNN](#rnn.Recurrent)不同，这个实现并不足够通用，无法在构造时接受任意组件模块的定义。然而，可以通过继承的方式轻松地对LSTM模块进行定制，只需重写不同的工厂方法：\n  * `buildGate`：构建用于实现输入、遗忘和输出门的通用门；\n  * `buildInputGate`：构建输入门（公式1）。目前调用`buildGate`；\n  * `buildForgetGate`：构建遗忘门（公式2）。目前调用`buildGate`；\n  * `buildHidden`：构建隐藏状态（公式3）；\n  * `buildCell`：构建细胞状态（公式4）；\n  * `buildOutputGate`：构建输出门（公式5）。目前调用`buildGate`；\n  * `buildModel`：构建内部使用的实际LSTM模型（公式6）。\n  \n请注意，我们建议将`LSTM`与`Sequencer`结合使用（详情请参阅[这里](#rnn.Recurrent.Sequencer)）。\n  \n\u003Ca name='rnn.FastLSTM'>\u003C\u002Fa>\n## FastLSTM ##\n\n[长短期记忆网络](#rnn.LSTM)的一个更快版本。基本上，输入门、遗忘门、输出门以及隐藏状态是在一次计算中同时完成的。\n\n需要注意的是，`FastLSTM`不使用细胞与门之间的窥视连接。其算法与`LSTM`相比有所变化：\n```lua\ni[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i])                      (1)\nf[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f])                      (2)\nz[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o])                      (5)\nh[t] = o[t]tanh(c[t])                                                (6)\n```\n即省略了公式1中的`W[c->i]c[t−1]`、公式2中的`W[c->f]c[t−1]`以及公式5中的`W[c->o]c[t]`这些项。\n\n### usenngraph ###\n这是`FastLSTM`类的一个静态属性，默认值为`false`。将`usenngraph`设置为`true`会强制所有新实例化的`FastLSTM`对象使用`nngraph`的`nn.gModule`来构建内部的`recurrentModule`，该模块会在每个时间步被克隆。\n\n\u003Ca name='rnn.FastLSTM.bn'>\u003C\u002Fa>\n#### 循环批量归一化 ####\n\n此扩展使`FastLSTM`类能够在训练过程中通过将输入到隐藏层以及隐藏层之间的变换中心化为零，从而加快收敛速度。它减少了时间步之间的[内部协变量偏移](https:\u002F\u002Farxiv.org\u002Fabs\u002F1502.03167v3)。这是Cooijmans等人提出的[循环批量归一化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.09025)的一种实现。每个LSTM单元的隐藏层到隐藏层的转换按照以下方式归一化：\n```lua\ni[t] = σ(BN(W[x->i]x[t]) + BN(W[h->i]h[t−1]) + b[1->i])                      (1)\nf[t] = σ(BN(W[x->f]x[t]) + BN(W[h->f]h[t−1]) + b[1->f])                      (2)\nz[t] = tanh(BN(W[x->c]x[t]) + BN(W[h->c]h[t−1]) + b[1->c])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                                 (4)\no[t] = σ(BN(W[x->o]x[t]) + BN(W[h->o]h[t−1]) + b[1->o])                      (5)\nh[t] = o[t]tanh(c[t])                                                        (6)\n``` \n其中批量归一化变换为：\n```lua\nBN(h; gamma, beta) = beta + gamma * (hd - E(hd)) \u002F sqrt(E(σ(hd)) + eps)
\n```\n其中`hd`是要归一化的（预）激活向量，`gamma`和`beta`是模型参数，分别决定归一化后激活的均值和标准差。`eps`是一个正则化超参数，用于保持除法运算的数值稳定性，而`E(hd)`和`E(σ(hd))`分别是小批量中激活向量的均值和方差的估计值。作者建议将`gamma`初始化为一个小值，并发现0.1是既不会导致梯度消失又能有效工作的值。`beta`作为平移参数，默认值为`null`。\n\n要在训练中启用批量归一化，请执行以下操作：\n```lua\nnn.FastLSTM.bn = true\nlstm = nn.FastLSTM(inputsize, outputsize, [rho, eps, momentum, affine])\n``` \n\n其中`momentum`与上述公式中的`gamma`相同（默认值为0.1），`eps`如前所述，而`affine`是一个布尔值，用于决定是否关闭可学习的仿射变换（`false`）或开启（`true`，默认值）。\n\n\u003Ca name='rnn.GRU'>\u003C\u002Fa>\n\n## GRU ##\n\n参考文献：\n * A. [使用 RNN 编码器-解码器进行统计机器翻译的短语表示学习。](http:\u002F\u002Farxiv.org\u002Fpdf\u002F1406.1078.pdf)\n * B. [用 Python 和 Theano 实现 GRU\u002FLSTM RNN](http:\u002F\u002Fwww.wildml.com\u002F2015\u002F10\u002Frecurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano\u002F)\n * C. [循环网络架构的实证探索](http:\u002F\u002Fjmlr.org\u002Fproceedings\u002Fpapers\u002Fv37\u002Fjozefowicz15.pdf)\n * D. [门控循环神经网络在序列建模上的经验评估](http:\u002F\u002Farxiv.org\u002Fabs\u002F1412.3555)\n * E. [RnnDrop：一种用于 ASR 中 RNN 的新型 Dropout 技术](http:\u002F\u002Fwww.stat.berkeley.edu\u002F~tsmoon\u002Ffiles\u002FConference\u002Fasru2015.pdf)\n * F. [循环神经网络中 Dropout 的理论基础应用](http:\u002F\u002Farxiv.org\u002Fabs\u002F1512.05287)\n\n这是一个门控循环单元模块的实现。\n\n`nn.GRU(inputSize, outputSize [,rho [,p [, mono]]])` 构造函数与 `nn.LSTM` 类似，接受 3 个参数，或对于 dropout 接受 4 个参数：\n * `inputSize`：指定输入大小的数字；\n * `outputSize`：指定输出大小的数字；\n * `rho`：反向传播时回溯的最大步数。限制了保存在内存中的先前步骤数量。默认值为 9999；\n * `p`：GRU 内部连接的 dropout 概率。\n * `mono`：GRU 内部 dropout 的单调采样。仅在 `TrimZero` + `BGRU`(p>0) 的情况下需要。\n\n![GRU](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_595e6882a6c3.png) \n\n实际实现对应于以下算法：\n```lua\nz[t] = σ(W[x->z]x[t] + W[s->z]s[t−1] + b[1->z])            (1)\nr[t] = σ(W[x->r]x[t] + W[s->r]s[t−1] + b[1->r])            (2)\nh[t] = tanh(W[x->h]x[t] + W[hr->c](s[t−1]r[t]) + b[1->h])  (3)\ns[t] = (1-z[t])h[t] + z[t]s[t-1]                           (4)\n```\n其中 `W[s->q]` 是从 `s` 到 `q` 的权重矩阵，`t` 表示时间步，`b[1->q]` 是指向 `q` 的偏置，`σ()` 是 Sigmoid 函数，`x[t]` 是输入，`s[t]` 是模块的输出（等式 4）。请注意，与 [LSTM](#rnn.LSTM) 不同，GRU 没有细胞。\n\nGRU 使用 [recurrent-language-model.lua](examples\u002Frecurrent-language-model.lua) 脚本在 `PennTreeBank` 数据集上进行了基准测试。它略胜于 `FastLSTM`，然而由于 LSTM 的参数比 GRU 多，\n如果数据集大于 `PennTreeBank`，性能结果可能会发生变化。不要急于判断哪一种更好（参见参考文献 C 和 D）。\n\n```\n                内存   示例\u002F秒\n    FastLSTM      176M        16.5K \n    GRU            92M        15.8K\n```\n\n__内存__ 以 `dp.Experiment` 保存文件的大小来衡量。__示例\u002F秒__ 以 1 个 epoch 的训练速度来衡量，因此可能受到磁盘 IO 的影响。\n\n![GRU-BENCHMARK](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_a33c86babf09.png)\n\nRNN 的 dropout（参见参考文献 E 和 F）也使用 `recurrent-language-model.lua` 脚本在 `PennTreeBank` 数据集上进行了基准测试。详细信息可在脚本中找到。在基准测试中，`GRU` 在 `LookupTable` 之后使用 dropout，而 `BGRU`，即贝叶斯 GRU，则在内部连接上使用 dropout（如参考文献 F 所述），而不是在 `LookupTable` 之后使用。\n\n正如 Yarin Gal（参考文献 F）所提到的，建议首次尝试时可以使用 `p = 0.25`。\n\n![GRU-BENCHMARK](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_bac51fe002ea.png)\n\n### SAdd\n\n为了实现 `GRU`，添加了一个简单的模块，该模块无法仅使用 `nn` 模块构建。\n\n```lua\nmodule = nn.SAdd(addend, negate)\n```\n对输入数据应用单个标量加法，即 y_i = x_i + b，然后如果 `negate` 为真，则将所有分量取反。这用于实现 `GRU` 的 `s[t] = (1-z[t])h[t] + z[t]s[t-1]`（见上述等式 4）。\n\n```lua\nnn.SAdd(-1, true)\n```\n在这里，如果输入数据是 `z[t]`，那么输出就会变成 `-(z[t]-1)=1-z[t]`。请注意，`nn.Mul()` 是一个可学习参数的标量乘法。\n\n\u003Ca name='rnn.MuFuRu'>\u003C\u002Fa>\n## MuFuRu ##\n\n参考文献：\n * A. [MuFuRU：多功能循环单元。](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.03002)\n * B. 
[Tensorflow 实现的多功能循环单元](https:\u002F\u002Fgithub.com\u002Fdirkweissenborn\u002Fmufuru)\n\n这是一个多功能循环单元模块的实现。\n\n`nn.MuFuRu(inputSize, outputSize [,ops [,rho]])` 构造函数接受 2 个必填参数，以及可选参数：\n * `inputSize`：指定输入维度的数字；\n * `outputSize`：指定输出维度的数字；\n * `ops`：字符串表，表示应使用哪些组合操作。该表可以是 `{'keep', 'replace', 'mul', 'diff', 'forget', 'sqrt_diff', 'max', 'min'}` 的任意子集。默认情况下，所有组合操作都启用。\n * `rho`：反向传播时回溯的最大步数。限制了保存在内存中的先前步骤数量。默认值为 9999；\n\n多功能循环单元通过允许学习任意组合算子的权重，对 GRU 进行了泛化。与 GRU 类似，重置门根据当前输入和前一隐藏状态计算，并用于计算新的特征向量：\n\n```lua\nr[t] = σ(W[x->r]x[t] + W[s->r]s[t−1] + b[1->r])            (1)\nv[t] = tanh(W[x->v]x[t] + W[sr->v](s[t−1]r[t]) + b[1->v])  (2)\n```\n\n其中 `W[a->b]` 表示从激活 `a` 到 `b` 的权重矩阵，`t` 表示时间步，`b[1->a]` 是激活 `a` 的偏置，而 `s[t-1]r[t]` 是两个向量的逐元素相乘。\n\n与 GRU 不同，MuFuRU 不是计算单个更新门（如 [GRU](#rnn.GRU) 中的 `z[t]`），而是对任意数量的组合算子进行加权。\n\n组合算子是指任何可微分的算子，它接受两个相同大小的向量——前一隐藏状态和新特征向量——并返回代表新隐藏状态的新向量。GRU 隐式定义了两种这样的操作，即 `keep` 和 `replace`，分别定义为 `keep(s[t-1], v[t]) = s[t-1]` 和 `replace(s[t-1], v[t]) = v[t]`。\n\n[参考文献 A](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.03002) 提出了 6 种额外的算子，它们都是逐元素操作：\n\n* `mul(x,y) = x * y`\n* `diff(x,y) = x - y`\n* `forget(x,y) = 0`\n* `sqrt_diff(x,y) = 0.25 * sqrt(|x - y|)`\n* `max(x,y)`\n* `min(x,y)`\n\n每个操作的权重通过当前输入和前一隐藏状态的 softmax 计算得出，类似于 GRU 中的更新门。然后，生成的隐藏状态是每个操作输出的逐元素加权和。\n```lua\n\np^[t][j] = W[x->pj]x[t] + W[s->pj]s[t−1] + b[1->pj])         (3)\n(p[t][1], ... p[t][J])  = softmax (p^[t][1], ..., p^[t][J])  (4)\ns[t] = sum(p[t][j] * op[j](s[t-1], v[t]))                    (5)\n```\n\n其中 `p[t][j]` 是时间步 `t` 时操作 `j` 的权重，而等式 5 中的 `sum` 是对所有算子 `J` 的求和。\n\u003Ca name='rnn.Recursor'>\u003C\u002Fa>\n\n## 递归器 ##\n\n该模块用于装饰一个 `module`，以便在 `AbstractSequencer` 实例中使用。\n它通过使被装饰的模块符合 `AbstractRecurrent` 接口来实现这一点，\n而该类本身也继承自 `LSTM` 和 `Recurrent` 类。\n\n```lua\nrec = nn.Recursor(module[, rho])\n```\n\n对于每次连续调用 `updateOutput`（即 `forward`），这个装饰器都会创建被装饰模块的一个 `stepClone()`。\n因此，在每个时间步长上，它都会克隆该模块。克隆的模块和原始模块共享参数以及相对于这些参数的梯度。\n然而，对于已经符合 `AbstractRecurrent` 接口的模块，克隆的模块和原始模块是同一个（即不进行克隆）。\n\n示例：\n\n假设我想堆叠两个 LSTM。我可以使用两个序列器：\n\n```lua\nlstm = nn.Sequential()\n   :add(nn.Sequencer(nn.LSTM(100,100)))\n   :add(nn.Sequencer(nn.LSTM(100,100)))\n```\n\n使用 `Recursor`，我可以用一个 `Sequencer` 实现相同的模型：\n\n```lua\nlstm = nn.Sequencer(\n   nn.Recursor(\n      nn.Sequential()\n         :add(nn.LSTM(100,100))\n         :add(nn.LSTM(100,100))\n      )\n   )\n```\n\n实际上，`Sequencer` 会自动包装任何非 `AbstractRecurrent` 模块，\n因此我可以进一步简化为：\n\n```lua\nlstm = nn.Sequencer(\n   nn.Sequential()\n      :add(nn.LSTM(100,100))\n      :add(nn.LSTM(100,100))\n   )\n```\n\n我还可以在两个 LSTM 之间添加一个 `Linear` 层。在这种情况下，\n`Linear` 层会在每个时间步长上被克隆（并共享其参数），\n而 LSTM 层则会在内部自行处理克隆操作：\n\n```lua\nlstm = nn.Sequencer(\n   nn.Sequential()\n      :add(nn.LSTM(100,100))\n      :add(nn.Linear(100,100))\n      :add(nn.LSTM(100,100))\n   )\n```\n\n像 `Recursor`、`Recurrent` 和 `LSTM` 这样的 `AbstractRecurrent` 实例\n应该在内部管理时间步长。非 `AbstractRecurrent` 实例可以通过 `Recursor` 包装，\n以获得相同的行为。\n\n对像 `Recursor` 这样的 `AbstractRecurrent` 实例每次调用 `forward` 时，\n都会将 `self.step` 属性增加 1，并为每个后续时间步长使用共享参数的克隆版本\n（最多可达到 `rho` 个时间步长，默认值为 9999999）。这样，\n`backward` 可以按照与 `forward` 调用相反的顺序被调用，\n从而执行时间反向传播（BPTT）。这正是 [AbstractSequencer](#rnn.AbstractSequencer) 实例\n在内部所做的事情。`backward` 调用实际上分为对 `updateGradInput` 和\n`accGradParameters` 的调用，分别从 `self.step` 开始，依次递减 `self.updateGradInputStep` 和\n`self.accGradParametersStep`。连续的 `backward` 调用会递减这些计数器，\n并利用它们通过相应的内部逐层共享参数的克隆版本进行反向传播。\n\n无论如何，在大多数情况下，你不需要直接操作 `Recursor` 对象，\n因为 `AbstractSequencer` 实例会在其构造函数中自动使用 `Recursor` 装饰非\n`AbstractRecurrent` 实例。\n\n有关其具体使用的示例，请参阅 
[simple-recurrent-network.lua](examples\u002Fsimple-recurrent-network.lua)\n训练脚本中的示例。\n\n\u003Ca name='rnn.Recurrence'>\u003C\u002Fa>\n## 循环单元 ##\n\n一个极其通用的容器，可用于实现几乎任何类型的循环结构。\n\n```lua\nrnn = nn.Recurrence(recurrentModule, outputSize, nInputDim, [rho])\n```\n\n与 [Recurrent](#rnn.Recurrent) 不同，该模块不管理单独的\n`inputModule`、`startModule`、`mergeModule` 等模块。\n相反，它只管理一个 `recurrentModule`，该模块应根据输入表：\n`{input(t), output(t-1)}`，输出张量或表：`output(t)`。\n通过结合使用 `Recursor`（例如通过 `Sequencer`）和 `Recurrence`，\n可以实现几乎任何类型的循环神经网络，包括 LSTM 和 RNN。\n\n在第一步中，`Recurrence` 会将零张量（或其表）传递给循环层（如 LSTM，不同于 Recurrent）。\n因此，它需要知道 `outputSize`，它可以是数字或 `torch.LongStorage`，\n也可以是其表。批次维度应排除在 `outputSize` 之外。相反，\n批次维度的大小（即样本数量）将通过 `nInputDim` 参数从输入中推断出来。\n例如，假设我们的输入是一个大小为 `4 x 3` 的张量，其中 `4` 是样本数量，\n那么 `nInputDim` 应该为 `1`。再举一个例子，如果我们的输入是一系列表 [...]\n组成的张量，其中第一个张量（深度优先）与前一个例子相同，\n那么我们的 `nInputDim` 也是 `1`。\n\n作为一个示例，让我们使用 `Sequencer` 和 `Recurrence` 来构建一个用于语言建模的简单 RNN：\n\n```lua\nrho = 5\nhiddenSize = 10\noutputSize = 5 -- 类别数\nnIndex = 10000\n\n-- 循环模块\nrm = nn.Sequential()\n   :add(nn.ParallelTable()\n      :add(nn.LookupTable(nIndex, hiddenSize))\n      :add(nn.Linear(hiddenSize, hiddenSize)))\n   :add(nn.CAddTable())\n   :add(nn.Sigmoid())\n\nrnn = nn.Sequencer(\n   nn.Sequential()\n      :add(nn.Recurrence(rm, hiddenSize, 1))\n      :add(nn.Linear(hiddenSize, outputSize))\n      :add(nn.LogSoftMax())\n)\n```\n\n注意：我们完全可以用新的 `Recursor` 和 `Recurrence` 模块重新实现 `LSTM` 模块，\n但这将破坏现有已保存在磁盘上的模型的向后兼容性。\n\n\u003Ca name='rnn.NormStabilizer'>\u003C\u002Fa>\n## 范数稳定器 ##\n\n参考文献 A：[通过稳定激活来正则化 RNN](http:\u002F\u002Farxiv.org\u002Fabs\u002F1511.08400)\n\n该模块实现了 [范数稳定化](http:\u002F\u002Farxiv.org\u002Fabs\u002F1511.08400) 准则：\n\n```lua\nns = nn.NormStabilizer([beta])\n``` \n\n该模块通过最小化连续步骤之间 L2 范数的差异来正则化 RNN 的隐藏状态。\n成本函数定义如下：\n\n```\nloss = beta * 1\u002FT sum_t( ||h[t]|| - ||h[t-1]|| )^2\n``` \n其中 `T` 是时间步长的数量。请注意，我们不会将梯度除以 `T`，\n这样选择的 `beta` 就可以在不同序列长度下进行缩放，而无需更改。\n\n唯一参数 `beta` 在参考文献 A 中定义。由于我们不将梯度除以\n时间步长的数量，因此默认值 `beta=1` 应适用于大多数情况。\n\n该模块应添加到 RNN（或 LSTM 或 GRU）之间，以更好地正则化隐藏状态。\n例如：\n\n```lua\nlocal stepmodule = nn.Sequential()\n   :add(nn.FastLSTM(10,10))\n   :add(nn.NormStabilizer())\n   :add(nn.FastLSTM(10,10))\n   :add(nn.NormStabilizer())\nlocal rnn = nn.Sequencer(stepmodule)\n``` \n\n要将其与 `SeqLSTM` 一起使用，你可以这样做：\n\n```lua\nlocal rnn = nn.Sequential()\n   :add(nn.SeqLSTM(10,10))\n   :add(nn.Sequencer(nn.NormStabilizer()))\n   :add(nn.SeqLSTM(10,10))\n   :add(nn.Sequencer(nn.NormStabilizer()))\n``` \n\n\u003Ca name='rnn.AbstractSequencer'>\u003C\u002Fa>\n\n## 抽象序列器 ##\n这个抽象类实现了一个轻量级接口，供诸如 `Sequencer`、`Repeater`、`RecurrentAttention`、`BiSequencer` 等子类共享。\n\n\u003Ca name='rnn.Sequencer'>\u003C\u002Fa>\n## 序列器 ##\n\n`nn.Sequencer(module)` 构造函数接受一个参数 `module`，即要从左到右依次应用于输入序列中每个元素的模块。\n\n```lua\nseq = nn.Sequencer(module)\n```\n\n该模块是一种[装饰器](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern)，用于抽象出 `AbstractRecurrent` 模块的复杂性。虽然 `AbstractRecurrent` 实例要求按时间步逐个输入数据，并且每次都需要调用 `forward`（以及 `backward`），但 `Sequencer` 会将一个 `input` 序列（表）传递为一个 `output` 序列（长度相同的表）。它还会负责在 `AbstractRecurrent` 实例上调用 `forget` 方法。\n\n### 输入\u002F输出格式\n\n`Sequencer` 要求输入和输出的形状为 `seqlen x batchsize x featsize`：\n\n * `seqlen` 是将被送入 `Sequencer` 的时间步数。\n * `batchsize` 是批次中的样本数量。每个样本都是独立的序列。\n * `featsize` 是除批次维度之外的其余维度大小。例如，对于语言模型来说可能是 1，而对于卷积模型来说可能是 `c x h x w` 等。\n\n![Hello Fuzzy](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_c0962d6e8c3f.png)\n\n以上是一个字符级别的语言模型的示例输入序列。它的 `seqlen` 为 5，意味着它包含 5 个时间步的序列。开头的 `{` 和结尾的 `}` 表明这些时间步是 Lua 
表的元素，不过它也接受形状为 `seqlen x batchsize x featsize` 的完整张量。`batchsize` 为 2，因为有两个独立的序列：`{ H, E, L, L, O }` 和 `{ F, U, Z, Z, Y, }`。`featsize` 为 1，因为每个字符只有一个特征维度，且每个字符的大小为 1。因此，在这种情况下，输入是一个包含 `seqlen` 个时间步的表，其中每个时间步由一个 `batchsize x featsize` 的张量表示。\n\n![Sequence](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_b6f087017adc.png)\n\n以上是另一个序列（输入或输出）的示例。它有 4 个时间步 (`seqlen`)。`batchsize` 再次为 2，这意味着有两个序列。`featsize` 为 3，因为每个序列的每个时间步都有 3 个变量。因此，每个时间步（表的元素）再次由一个大小为 `batchsize x featsize` 的张量表示。需要注意的是，在这两个示例中，`featsize` 编码的是一个维度，但它也可以编码更多维度。\n\n\n### 示例\n\n例如，`rnn`：一个 `nn.AbstractRecurrent` 的实例，可以一次向前传递一个 `input` 序列：\n```lua\ninput = {torch.randn(3,4), torch.randn(3,4), torch.randn(3,4)}\nrnn:forward(input[1])\nrnn:forward(input[2])\nrnn:forward(input[3])\n``` \n\n同样地，我们也可以使用 `Sequencer` 来一次性向前传递整个 `input` 序列：\n\n```lua\nseq = nn.Sequencer(rnn)\nseq:forward(input)\n``` \n\n我们也可以传递张量而不是表：\n\n```lua\n-- seqlen x batchsize x featsize\ninput = torch.randn(3,3,4)\nseq:forward(input)\n``` \n\n### 详情\n\n`Sequencer` 也可以接受非递归模块（即非 `AbstractRecurrent` 实例），并将其应用于每个输入，以生成一个长度相同的输出表。这对于处理可变长度序列（表）特别有用。\n\n在内部，`Sequencer` 假设被装饰的 `module` 是一个 `AbstractRecurrent` 实例。如果情况并非如此，`module` 将自动被 [Recursor](#rnn.Recursor) 模块装饰，使其符合 `AbstractRecurrent` 接口。\n\n注意：这是由于最近的一次更新（2015年10月27日）所致，在此之前，`AbstractRecurrent` 和非 `AbstractRecurrent` 实例需要分别由各自的 `Sequencer` 进行装饰。而这次引入了 `Recursor` 装饰器的更新，使得单个 `Sequencer` 可以包装任何类型的模块，无论是 `AbstractRecurrent`、非 `AbstractRecurrent`，还是两者的复合结构。尽管如此，现有代码不应受到此次更改的影响。\n\n有关其使用的简明示例，请参阅训练脚本 [simple-sequencer-network.lua](examples\u002Fsimple-sequencer-network.lua)。\n\n\u003Ca name='rnn.Sequencer.remember'>\u003C\u002Fa>\n### remember([mode]) ###\n当 `mode='neither'`（类的默认行为）时，`Sequencer` 在每次调用 `forward` 之前还会额外调用 [forget](#nn.AbstractRecurrent.forget)。当 `mode='both'`（调用此函数时的默认值）时，`Sequencer` 将永远不会调用 [forget](#nn.AbstractRecurrent.forget)。在这种情况下，用户需要在独立序列之间手动调用 `forget`。此行为仅适用于被装饰的 `AbstractRecurrent` `modules`。参数 `mode` 的有效值如下：\n\n * 'eval' 仅影响评估阶段（推荐用于 RNN）\n * 'train' 仅影响训练阶段\n * 'neither' 既不影响训练也不影响评估（类的默认行为）\n * 'both' 同时影响训练和评估（推荐用于 LSTM）\n\n### forget() ###\n调用被装饰的 `AbstractRecurrent` 模块的 `forget` 方法。\n\n\u003Ca name='rnn.SeqLSTM'>\u003C\u002Fa>\n\n## SeqLSTM ##\n\n该模块是 `nn.Sequencer(nn.FastLSTM(inputsize, outputsize))` 的更快版本：\n\n```lua\nseqlstm = nn.SeqLSTM(inputsize, outputsize)\n``` \n\n每个时间步的计算方式如下（与 [FastLSTM](#rnn.FastLSTM) 相同）：\n\n```lua\ni[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i])                      (1)\nf[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f])                      (2)\nz[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o])                      (5)\nh[t] = o[t]tanh(c[t])                                                (6)\n``` \n\n一个显著的区别是，该模块期望 `input` 和 `gradOutput` 是张量而不是表。默认形状为：`input` 为 `seqlen x batchsize x inputsize`，`output` 为 `seqlen x batchsize x outputsize`：\n\n```lua\ninput = torch.randn(seqlen, batchsize, inputsize)\ngradOutput = torch.randn(seqlen, batchsize, outputsize)\n\noutput = seqlstm:forward(input)\ngradInput = seqlstm:backward(input, gradOutput)\n``` \n\n请注意，如果您希望交换前两个维度（即 `batchsize x seqlen` 而不是默认的 `seqlen x batchsize`），可以在初始化后设置 `seqlstm.batchfirst = true`。\n\n对于可变长度序列，设置 `seqlstm.maskzero = true`。这等价于在由 `Sequencer` 包装的 `FastLSTM` 上调用 `maskZero(1)`：\n\n```lua\nfastlstm = nn.FastLSTM(inputsize, outputsize)\nfastlstm:maskZero(1)\nseqfastlstm = nn.Sequencer(fastlstm)\n``` \n\n当 `maskzero = 
true` 时，输入序列应以零张量分隔各个时间步。\n\n`seqlstm:toFastLSTM()` 方法会生成一个使用 `seqlstm` 实例参数初始化的 [FastLSTM](#rnn.FastLSTM) 实例。需要注意的是，生成的参数不会共享（也无法共享）。\n\n与 `FastLSTM` 一样，`SeqLSTM` 不使用单元和门之间的窥孔连接（详情请参阅 [FastLSTM](#rnn.FastLSTM)）。\n\n与 `Sequencer` 一样，`SeqLSTM` 提供一个 [remember](rnn.Sequencer.remember) 方法。\n\n请注意，`SeqLSTM` 不能替换代码中被 `AbstractSequencer` 或 `Recursor` 装饰的 `FastLSTM`，因为那样就相当于 `Sequencer(Sequencer(FastLSTM))`。特此提醒。\n\n\u003Ca name='rnn.SeqLSTMP'>\u003C\u002Fa>\n## SeqLSTMP ##\n参考文献：\n * A. [用于大规模声学建模的 LSTM RNN 架构](http:\u002F\u002Fstatic.googleusercontent.com\u002Fmedia\u002Fresearch.google.com\u002Fen\u002F\u002Fpubs\u002Farchive\u002F43905.pdf)\n * B. [探索语言建模的极限](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1602.02410v2.pdf)\n \n```lua\nlstmp = nn.SeqLSTMP(inputsize, hiddensize, outputsize)\n``` \n\n`SeqLSTMP` 是 [SeqLSTM](#rnn.SeqLSTM) 的子类。其不同之处在于，在计算隐藏状态 `h[t]`（公式 6）之后，它会通过一个简单的线性变换将其投影到 `r[t]` 上（公式 7）。门的计算也使用了前一次的这种投影 `r[t-1]`（公式 1、2、3、5）。这与 `SeqLSTM` 使用 `h[t-1]` 而不是 `r[t-1]` 的情况不同。\n \n`SeqLSTM` 中概述的时间步计算被以下内容取代：\n```lua\ni[t] = σ(W[x->i]x[t] + W[r->i]r[t−1] + b[1->i])                      (1)\nf[t] = σ(W[x->f]x[t] + W[r->f]r[t−1] + b[1->f])                      (2)\nz[t] = tanh(W[x->c]x[t] + W[h->c]r[t−1] + b[1->c])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x->o]x[t] + W[r->o]r[t−1] + b[1->o])                      (5)\nh[t] = o[t]tanh(c[t])                                                (6)\nr[t] = W[h->r]h[t]                                                   (7)\n``` \n\n该算法在参考文献 A 中有详细介绍，并在参考文献 B 中使用 Google 十亿词数据集进行了基准测试，取得了最先进的结果。`SeqLSTMP` 可以与 `hiddensize >> outputsize` 一起使用，这样记忆单元 `c[t]` 以及门 `i[t]`、`f[t]` 和 `o[t]` 的有效尺寸可以远大于实际输入 `x[t]` 和输出 `r[t]`。对于固定的 `inputsize` 和 `outputsize`，`SeqLSTMP` 将能够记住比 `SeqLSTM` 更多的信息。\n\n\u003Ca name='rnn.SeqGRU'>\u003C\u002Fa>\n## SeqGRU ##\n\n该模块是 `nn.Sequencer(nn.GRU(inputsize, outputsize))` 的更快版本：\n\n```lua\nseqGRU = nn.SeqGRU(inputsize, outputsize)\n``` \n\nSeqGRU 的使用方式与 GRU 的区别类似于 SeqLSTM 与 LSTM 的区别。因此，请参阅 [SeqLSTM](#rnn.SeqLSTM) 以获取更多详细信息。\n\n\u003Ca name='rnn.SeqBRNN'>\u003C\u002Fa>\n## SeqBRNN ##\n\n```lua\nbrnn = nn.SeqBRNN(inputSize, outputSize, [batchFirst], [merge])\n``` \n\n一种使用 SeqLSTM 的双向 RNN。内部包含一个正向和一个反向的 SeqLSTM 模块。期望输入形状为 `seqlen x batchsize x inputsize`。通过将 `[batchFirst]` 设置为 `true`，输入形状可以变为 `batchsize x seqLen x inputsize`。合并模块默认为 CAddTable()，对来自每个输出层的输出进行求和。\n\n示例：\n```\ninput = torch.rand(1, 1, 5)\nbrnn = nn.SeqBRNN(5, 5)\nprint(brnn:forward(input))\n``` \n打印出一个 1x1x5 张量的输出。\n\n\u003Ca name='rnn.BiSequencer'>\u003C\u002Fa>\n## BiSequencer ##\n将封装的正向和反向 RNN 分别按正向和反向顺序应用于输入序列。它用于实现双向 RNN 和 LSTM。\n\n```lua\nbrnn = nn.BiSequencer(fwd, [bwd, merge])\n```\n\n该模块的输入是一个张量序列（表），输出也是一个相同长度的张量序列（表）。它按照正向顺序将一个正向 RNN（一个 [AbstractRecurrent](#rnn.AbstractRecurrent) 实例）应用于序列中的每个元素，并按照反向顺序（从最后一个元素到第一个元素）应用反向 RNN。反向 RNN 默认为：\n\n```lua\nbwd = fwd:clone()\nbwd:reset()\n```\n\n对于原始序列中的每个步骤，两个 RNN 的输出会通过合并模块合并在一起（默认为 `nn.JoinTable(1,1)`）。如果 `merge` 是一个数字，则指定 [JoinTable](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Ftable.md#nn.JoinTable) 构造函数的 `nInputDim` 参数。因此，合并模块会被初始化为：\n\n```lua\nmerge = nn.JoinTable(1,merge)\n```\n\n在内部，`BiSequencer` 是通过装饰一个由三个 Sequencer 组成的结构来实现的，分别用于正向、反向和合并模块。\n\n与 [Sequencer](#rnn.Sequencer) 类似，批次中的序列必须具有相同的大小。但每个批次的序列长度可以不同。\n\n注意：每次调用 `updateParameters()` 后，请务必调用 `brnn:forget()`。或者也可以只调用 `brnn.bwdSeq:forget()`，使只有反向 RNN 忘记状态。这是最低要求，因为让反向 RNN 记住未来的序列是没有意义的。\n\n\n\u003Ca name='rnn.BiSequencerLM'>\u003C\u002Fa>\n\n## BiSequencerLM 
##\n\n将封装的 `fwd` 和 `bwd` RNN 按照正向和反向顺序应用于输入序列。\n它用于实现语言模型（LM）中的双向 RNN 和 LSTM。\n\n```lua\nbrnn = nn.BiSequencerLM(fwd, [bwd, merge])\n```\n\n该模块的输入是一个张量序列（表），输出也是一个长度相同的张量序列（表）。\n它会按照正向顺序将一个 `fwd` RNN（一个 [AbstractRecurrent](#rnn.AbstractRecurrent) 实例）应用于序列中的前 `N-1` 个元素。\n然后，它会以反向顺序将 `bwd` RNN 应用于最后 `N-1` 个元素（从倒数第二个元素到第一个元素）。\n这是该模块与 [BiSequencer](#rnn.BiSequencer) 的主要区别。后者不能用于语言建模，因为 `bwd` RNN 会被训练去预测它刚刚接收到的输入。\n\n![BiDirectionalLM](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_readme_6f299e328d3e.png)\n\n`bwd` RNN 的默认设置是：\n\n```lua\nbwd = fwd:clone()\nbwd:reset()\n```\n\n`fwd` RNN 将为最后 `N-1` 个时间步生成表示，而 `bwd` RNN 则会为前 `N-1` 个时间步生成表示。\n每个 RNN 缺失的时间步输出（`fwd` 的第一个时间步和 `bwd` 的最后一个时间步）将用与相应 RNN 输出大小相同的零张量填充。\n这样它们就可以被合并。如果使用 `nn.JoinTable`（默认值），则第一个和最后一个输出元素将分别用零来填充缺失的 `fwd` 和 `bwd` RNN 输出。\n\n对于原始序列中的每一个时间步，两个 RNN 的输出都会使用 `merge` 模块进行合并（默认为 `nn.JoinTable(1,1)`）。如果 `merge` 是一个数字，则它指定了 [JoinTable](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Ftable.md#nn.JoinTable) 构造函数的 `nInputDim` 参数。因此，`merge` 模块会被初始化为：\n\n```lua\nmerge = nn.JoinTable(1,merge)\n```\n\n与 [Sequencer](#rnn.Sequencer) 类似，批次中的序列必须具有相同的大小，但每个批次的序列长度可以不同。\n\n需要注意的是，使用此模块实现的语言模型并不是传统的语言模型，因为它们不会计算给定先前词的情况下某个词的概率。相反，它们计算的是在上下文（即周围词）给定的情况下某个词的概率。虽然出于数学原因，你可能无法用它来计算一串词（如一句话）的概率，但你仍然可以计算这种序列的伪似然性（有关讨论，请参阅 [this](http:\u002F\u002Farxiv.org\u002Fpdf\u002F1504.01575.pdf)）。\n\n\u003Ca name='rnn.Repeater'>\u003C\u002Fa>\n## Repeater ##\n该模块是一个类似于 [Sequencer](#rnn.Sequencer) 的 [装饰器](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern)。\n它的不同之处在于，序列长度是预先固定的，输入会反复通过被包装的 `module`，从而产生一个长度为 `nStep` 的输出表：\n```lua\nr = nn.Repeater(module, nStep)\n```\n参数 `module` 应该是一个 `AbstractRecurrent` 实例。\n这对于实现像 [RCNNs](http:\u002F\u002Fjmlr.org\u002Fproceedings\u002Fpapers\u002Fv32\u002Fpinheiro14.pdf) 这样的模型非常有用，这些模型会反复接收相同的输入。\n\n\u003Ca name='rnn.RecurrentAttention'>\u003C\u002Fa>\n## RecurrentAttention ##\n参考文献：\n\n  * A. [视觉注意力的循环模型](http:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F5542-recurrent-models-of-visual-attention.pdf)\n  * B. 
[连接主义强化学习的简单统计梯度跟随算法](http:\u002F\u002Fincompleteideas.net\u002Fsutton\u002Fwilliams-92.pdf)\n\n该模块可用于实现参考文献 A 中提出的循环注意力模型（RAM）：\n```lua\nram = nn.RecurrentAttention(rnn, action, nStep, hiddenSize)\n```\n\n`rnn` 是一个 [AbstractRecurrent](#rnn.AbstractRecurrent) 实例。\n它的输入是 `{x, z}`，其中 `x` 是 RAM 的输入，`z` 是从 `action` 模块中采样得到的动作。\n`rnn` 的输出大小必须等于 `hiddenSize`。\n\n`action` 是一个 [Module](https:\u002F\u002Fgithub.com\u002Ftorch\u002Fnn\u002Fblob\u002Fmaster\u002Fdoc\u002Fmodule.md#nn.Module)，它使用一个 [REINFORCE 模块](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.Reinforce)（参考文献 B），例如 [ReinforceNormal](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.ReinforceNormal)、[ReinforceCategorical](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.ReinforceCategorical) 或 [ReinforceBernoulli](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.ReinforceBernoulli)，根据 `rnn` 上一个时间步的输出来采样动作。\n在第一个时间步，`action` 模块会接收到一个大小为 `input:size(1) x hiddenSize` 的零张量。\n重要的是要理解，采样的动作不会接收来自训练准则的反向传播梯度。\n相反，奖励会从一个奖励准则（如 [VRClassReward](https:\u002F\u002Fgithub.com\u002Fnicholas-leonard\u002Fdpnn#nn.VRClassReward)）广播到 `action` 的 REINFORCE 模块，该模块会根据 `output` 样本和 `reward` 计算并反向传播梯度。\n因此，`action` 模块的输出仅在循环注意力模块内部使用。\n\n`nStep` 是要采样的动作数量，即 `output` 表中的元素数量。\n\n`hiddenSize` 是 `rnn` 的输出大小。这个变量对于生成用于第一个时间步采样动作的零张量是必要的（见上文）。\n\n参考文献 A 的完整实现可以在 [这里](examples\u002Frecurrent-visual-attention.lua) 找到。\n\n\u003Ca name='rnn.MaskZero'>\u003C\u002Fa>\n## MaskZero ##\n该模块会将被装饰模块的 `output` 行清零，前提是对应的 `input` 行是零张量。\n\n```lua\nmz = nn.MaskZero(module, nInputDim)\n```\n\n当被装饰模块的 `input` 对应行是零张量时，其 `output` 张量（或张量表）的每一行（样本）都会被清零。\n\n`nInputDim` 参数必须指定 `input` 中第一个张量的非批处理维度的数量。如果是 `input` 表，则第一个张量是指在深度优先搜索中遇到的第一个张量。\n这个装饰器使得在同一批次中对不同长度的序列用零向量进行填充成为可能。\n\n注意：`MaskZero` 并不能保证当 `input` 为零时，被装饰模块内部的 `output` 和 `gradInput` 张量也会被清零。`MaskZero` 只会影响它所封装模块的直接 `gradInput` 和 `output`。\n然而，对于大多数模块而言，该时间步的梯度更新将会是零，因为反向传播零梯度通常会导致整个路径上的梯度都为零。在这方面，不建议将 `AbsractRecurrent` 实例封装在 `MaskZero` 内，因为它们会在不同时间步之间传递梯度。相反，应该调用 [AbstractRecurrent.maskZero](#rnn.AbstractRecurrent.maskZero) 方法来封装内部的 `recurrentModule`。\n\n\u003Ca name='rnn.TrimZero'>\u003C\u002Fa>\n\n## TrimZero ##\n\n警告：仅当您的输入包含大量零时才使用此模块。在几乎所有情况下，[`MaskZero`](#rnn.MaskZero) 都会更快，尤其是在使用 CUDA 时。\n\n参考 A：[TrimZero：用于高效自然语言处理的 Torch 循环神经网络模块](https:\u002F\u002Fbi.snu.ac.kr\u002FPublications\u002FConferences\u002FDomestic\u002FKIIS2016S_JHKim.pdf)\n\n其用法与 `MaskZero` 相同。\n\n```lua\nmz = nn.TrimZero(module, nInputDim)\n```\n\n与 `MaskZero` 的唯一区别在于，当输入中存在不同长度的序列时，它可以通过调整批次大小来降低计算成本。请注意，当序列长度一致时，`MaskZero` 会更快，因为 `TrimZero` 存在一定的操作开销。\n\n简而言之，结果与 `MaskZero` 相同，但只有在句子长度变化较大时，`TrimZero` 才会比 `MaskZero` 更快。\n\n在实践中，例如在语言模型中，`TrimZero` 的速度预计比 `MaskZero` 快约 30%。您可以通过 `test\u002Ftest_trimzero.lua` 进行测试。\n\n\u003Ca name='rnn.LookupTableMaskZero'>\u003C\u002Fa>\n## LookupTableMaskZero ##\n该模块扩展了 `nn.LookupTable`，以支持零索引。零索引会被传递为零张量。\n\n```lua\nlt = nn.LookupTableMaskZero(nIndex, nOutput)\n```\n\n当 `input` 中对应行的索引为零时，输出张量的相应行将被置零。\n\n此查找表使得在同一批次中对不同长度的序列用零向量进行填充成为可能。\n\n\u003Ca name='rnn.MaskZeroCriterion'>\u003C\u002Fa>\n## MaskZeroCriterion ##\n该准则会对被装饰准则中的 `err` 和 `gradInput` 行进行置零处理，这些行对应于 `input` 中为零张量的行。\n\n```lua\nmzc = nn.MaskZeroCriterion(criterion, nInputDim)\n```\n\n被装饰 `criterion` 的 `gradInput` 张量（或其表）中的每一行（样本）都会在 `input` 中对应行为零张量时被置零。同时，`err` 也会忽略这些零行。\n\n`nInputDim` 参数必须指定 `input` 中第一个张量的非批次维度数量。如果 `input` 是一个表，则第一个张量是指在深度优先搜索中遇到的第一个张量。\n\n此装饰器使得在同一批次中对不同长度的序列用零向量进行填充成为可能。\n\n\u003Ca name='rnn.SeqReverseSequence'>\u003C\u002Fa>\n## SeqReverseSequence 
##\n\n```lua\nreverseSeq = nn.SeqReverseSequence(dim)\n```\n\n沿指定维度反转输入张量。反转维度不能超过三。\n\n示例：\n```\ninput = torch.Tensor({{1,2,3,4,5}, {6,7,8,9,10}})\nreverseSeq = nn.SeqReverseSequence(1)\nprint(reverseSeq:forward(input))\n\n输出为 torch.Tensor({{6,7,8,9,10},{1,2,3,4,5}})\n```\n\n\u003Ca name='rnn.SequencerCriterion'>\u003C\u002Fa>\n## SequencerCriterion ##\n该准则是一种 [装饰器模式](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern)：\n\n```lua\nc = nn.SequencerCriterion(criterion, [sizeAverage])\n``` \n\n`input` 和 `target` 预期都是一系列数据，可以是表或张量。对于序列中的每一步，`input` 和 `target` 中对应的元素都会被应用于 `criterion`。\n`forward` 的输出是序列中所有单个损失的总和。这在与 [Sequencer](#rnn.Sequencer) 结合使用时非常有用。\n\n如果 `sizeAverage` 为 `true`（默认为 `false`），则 `output` 损失和 `gradInput` 会在每个时间步上取平均值。\n\n\u003Ca name='rnn.RepeaterCriterion'>\u003C\u002Fa>\n## RepeaterCriterion ##\n该准则是一种 [装饰器模式](http:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDecorator_pattern)：\n\n```lua\nc = nn.RepeaterCriterion(criterion)\n``` \n\n`input` 预期是一个序列（表或张量）。单个 `target` 会使用相同的 `criterion` 反复应用于 `input` 序列中的每个元素。\n`forward` 的输出是序列中所有单个损失的总和。这对于实现如 [RCNNs](http:\u002F\u002Fjmlr.org\u002Fproceedings\u002Fpapers\u002Fv32\u002Fpinheiro14.pdf) 等模型非常有用，这些模型会反复接收相同的目标。","# rnn 快速上手指南\n\n> **注意**：本仓库（`Element-Research\u002Frnn`）已弃用，官方推荐迁移至 [`torch\u002Frnn`](https:\u002F\u002Fgithub.com\u002Ftorch\u002Frnn)。以下指南基于原 `rnn` 库编写，适用于仍在使用 Torch7 生态的旧项目维护。\n\n## 环境准备\n\n本工具基于 **Torch7** 深度学习框架，主要依赖 Lua 环境。\n\n*   **操作系统**: Linux \u002F macOS (Windows 需通过 WSL 或 Docker)\n*   **核心依赖**:\n    *   Torch7 (`torch`)\n    *   神经网络包 (`nn`)\n    *   扩展神经网络包 (`dpnn`)\n    *   工具扩展包 (`torchx`)\n*   **可选依赖 (GPU 加速)**:\n    *   `cutorch`, `cunn`, `cunnx`\n\n> **提示**：由于 Torch7 社区活跃度降低，国内暂无官方维护的镜像源。建议确保网络通畅以从 GitHub 和 Luarocks 官方源拉取代码。若需使用 Lua 而非 LuaJIT（部分高级示例如 NCE 需要），请在安装 Torch 时指定。\n\n## 安装步骤\n\n1.  **克隆仓库**\n    ```bash\n    git clone git@github.com:Element-Research\u002Frnn.git\n    cd rnn\n    ```\n\n2.  **安装依赖**\n    在执行构建前，请确保已安装基础依赖：\n    ```bash\n    luarocks install torch\n    luarocks install nn\n    luarocks install dpnn\n    luarocks install torchx\n    ```\n    如需 CUDA 支持：\n    ```bash\n    luarocks install cutorch\n    luarocks install cunn\n    luarocks install cunnx\n    ```\n\n3.  **编译安装**\n    在 `rnn` 目录下执行：\n    ```bash\n    luarocks make rocks\u002Frnn-scm-1.rockspec\n    ```\n\n## 基本使用\n\n`rnn` 库扩展了 Torch 的 `nn` 模块，支持构建 RNN、LSTM、GRU 等循环神经网络。其核心逻辑是将序列数据按时间步（time-step）依次传入网络。\n\n### 1. 构建一个简单的 LSTM 网络\n\n以下示例展示如何创建一个输入维度为 10、隐藏层维度为 10 的 LSTM 单元：\n\n```lua\nrequire 'rnn'\n\n-- 创建 LSTM 模块：输入大小 10, 输出大小 10\nlocal rnn = nn.LSTM(10, 10)\n\n-- 设置最大反向传播步数 (BPTT)，默认为 99999\nrnn.rho = 5 \n```\n\n### 2. 序列前向传播与反向传播\n\n使用 `rnn` 处理序列时，需要在循环中调用 `forward`，并在**逆序**循环中调用 `backward` 以实现随时间反向传播（BPTT）。\n\n```lua\nlocal nStep = 5 -- 序列长度\nlocal inputs = {}\nlocal gradOutputs = {}\n\n-- 模拟生成输入数据和梯度输出 (占位符)\nfor i=1,nStep do\n   inputs[i] = torch.Tensor(1, 10):uniform()      -- Batch size 1, Input dim 10\n   gradOutputs[i] = torch.Tensor(1, 10):uniform() -- 对应的梯度\nend\n\nlocal outputs = {}\nlocal gradInputs = {}\n\n-- 前向传播：按时间步顺序\nfor i=1,nStep do\n   outputs[i] = rnn:forward(inputs[i])\nend\n\n-- 反向传播：必须按时间步逆序\nfor i=nStep,1,-1 do\n   gradInputs[i] = rnn:backward(inputs[i], gradOutputs[i])\nend\n\n-- 重要：处理完一个序列后，必须调用 forget() 清除内部状态缓存\nrnn:forget()\n```\n\n### 3. 
使用 Sequencer 简化操作\n\n如果你希望一次性传入整个序列张量（Tensor）而不是逐个时间步处理，可以使用 `Sequencer` 或专用的 `SeqLSTM`：\n\n```lua\nrequire 'rnn'\n\n-- 方法 A: 使用 Sequencer 包装普通 RNN 模块\nlocal model = nn.Sequencer(nn.LSTM(10, 10))\n\n-- 方法 B: 使用高性能的 SeqLSTM (输入输出均为 Tensor)\nlocal fastModel = nn.SeqLSTM(10, 10)\n\n-- 此时可以直接传入维度为 [序列长度 x 批次大小 x 输入维度] 的张量\nlocal inputSequence = torch.Tensor(5, 32, 10) -- 5 个时间步，Batch 32\nlocal outputSequence = fastModel:forward(inputSequence)\n```\n\n### 4. 处理变长序列 (Padding)\n\n对于批次中长度不一的序列，通常使用零填充（Zero Padding）。`rnn` 提供了 `MaskZero` 装饰器来处理这种情况，确保填充部分的零值不会干扰隐藏状态的计算：\n\n```lua\nlocal lstm = nn.LSTM(10, 10)\n-- 装饰模块，nInputDim 指定非 batch 维度的数量 (此处为 1)\nlstm:maskZero(1) \n\n-- 现在可以将填充了 0 的短序列放入同一批次训练，\n-- 当输入行为全 0 时，该时间步的输出将被掩码，且隐藏状态会被重置。\n```","某自然语言处理团队正在基于 Torch7 框架开发一个实时金融新闻情感分析系统，需要处理带有时间依赖性的文本序列数据。\n\n### 没有 rnn 时\n- 开发者必须手动编写复杂的循环逻辑来模拟时间步传播，代码冗长且极易出错，难以维护。\n- 想要尝试更先进的 LSTM 或 GRU 单元以提升准确率时，需从零推导数学公式并实现反向传播，研发周期长达数周。\n- 处理变长句子填充（Padding）时，缺乏原生支持，导致大量无效计算浪费 GPU 资源，推理延迟居高不下。\n- 构建双向网络捕捉上下文信息时，需自行拼接正向和反向传递结果，调试困难且容易引发维度不匹配错误。\n\n### 使用 rnn 后\n- 直接调用 `Sequencer` 和 `LSTM` 模块即可自动处理序列时间步，将核心模型代码缩减至几十行，逻辑清晰直观。\n- 通过切换 `FastLSTM` 或 `SeqGRU` 等预置模块，几分钟内即可完成模型架构升级，显著提升了情感分类的准确度。\n- 利用 `MaskZero` 和 `LookupTableMaskZero` 原生支持零值掩码，自动忽略填充部分的计算，推理速度提升约 40%。\n- 借助 `BiSequencer` 一键构建双向循环网络，轻松捕捉前后文语境，无需关心底层数据流向的细节实现。\n\nrnn 通过提供模块化、高性能的循环神经网络组件，让开发者从繁琐的底层实现中解放出来，专注于算法优化与业务落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FElement-Research_rnn_a33c86ba.png","Element-Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FElement-Research_2b82e713.jpg","",null,"https:\u002F\u002Fgithub.com\u002FElement-Research",[78,82],{"name":79,"color":80,"percentage":81},"Lua","#000080",99.9,{"name":83,"color":84,"percentage":85},"CMake","#DA3434",0.1,944,309,"2026-03-27T06:18:29","BSD-3-Clause",5,"Linux, macOS","可选。若使用 CUDA 加速，需 NVIDIA GPU 及对应的 cutorch\u002Fcunn\u002Fcunnx 支持（具体型号和显存未说明，示例中提到在 NVIDIA Titan X 上运行）。","未说明",{"notes":95,"python":96,"dependencies":97},"1. 该仓库已弃用，官方建议迁移至 https:\u002F\u002Fgithub.com\u002Ftorch\u002Frnn。\n2. 这是一个基于 Lua 和 Torch7 框架的库，不使用 Python。\n3. 安装需使用 luarocks 包管理器。\n4. 部分高性能示例（如噪声对比估计）明确要求安装使用 Lua 而非 LuaJIT 版本的 Torch。\n5. 若遇到错误，建议更新上述依赖库。","不适用 (该项目基于 Lua\u002FTorch7，非 Python)",[98,99,100,101,102,103,104],"torch","nn","dpnn","torchx","cutorch (如需 CUDA)","cunn (如需 CUDA)","cunnx (如需 CUDA)",[35,14],"2026-03-27T02:49:30.150509","2026-04-18T22:33:45.390920",[109,114,119,123,128,133],{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},40895,"训练 LSTM 网络时，输出和权重变成 NaN（非数字）怎么办？","NaN 通常由梯度爆炸、准则（criterion）问题、参数初始化不良或序列过长引起。特别是当使用 log 或 exp 函数时，数值容易被放大为 inf 或 nan。\n解决方案包括：\n1. 使用梯度裁剪（Gradient Clipping）来限制梯度范围。\n2. 尝试更好地初始化偏置（bias）参数。\n3. 检查输入数据处理：如果词汇量小，建议使用 `nn.OneHot` 代替 `nn.LookupTable`。\n4. 确认是否错误地使用了 MaskZero，它仅适用于用零索引分隔独立序列的情况。","https:\u002F\u002Fgithub.com\u002FElement-Research\u002Frnn\u002Fissues\u002F187",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},40896,"如何正确使用 MaskZero 处理变长序列进行分类任务？","使用 `MaskZero` 时有一个常见的误区：`MaskZeroCriterion` 是基于输入（outputs）中的零值进行掩码，而不是基于目标值（targets）中的零值。\n如果预测值全为零即使目标值不为零，误差也会计算为零，这会导致模型倾向于输出全零。\n建议：\n1. 确保理解掩码机制是基于输出的零值。\n2. 如果需要基于目标值的零进行掩码，可能需要自定义或使用特定的模块（如社区提供的 gist 代码）。\n3. 
对于变长序列，确保填充部分在输入中被正确标记为零，以便 MaskZero 生效。","https:\u002F\u002Fgithub.com\u002FElement-Research\u002Frnn\u002Fissues\u002F75",{"id":120,"question_zh":121,"answer_zh":122,"source_url":118},40897,"使用 MaskZero 包装循环模块时收到警告，应该如何修正？","如果出现警告\"Warning : you are most likely using MaskZero the wrong way\"，说明你可能错误地将 `MaskZero` 直接包裹在 `AbstractRecurrent` 模块（如 LSTM）本身，而不是其内部的循环模块。\n正确做法是使用 `AbstractRecurrent:maskZero()` 方法，或者手动将 `MaskZero` 应用于内部的 `recurrentModule`，而不是外层的序列模块。这样可以确保掩码逻辑正确地作用于递归单元内部。",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},40898,"如何实现“一对多”预测（单个输入生成多个时间步的输出）？","标准的 `Sequencer` 模块通常用于处理序列到序列或一步对一步的预测。如果你需要从一个单一输入生成一系列输出（例如在临床场景中根据一个状态预测未来的多步运动），这属于“一对多”架构。\n解决方案：\n1. 不要简单地在循环中多次调用 forward 并累积输入，这可能导致维度不匹配。\n2. 考虑使用专门的模块如 `Repeater`（如果库支持）来重复输入特征。\n3. 或者构建自定义网络结构，将单个输入通过线性层映射后，作为初始状态输入到 RNN 中，并在后续时间步仅依赖隐藏状态生成输出，而不需要新的外部输入。","https:\u002F\u002Fgithub.com\u002FElement-Research\u002Frnn\u002Fissues\u002F179",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},40899,"LSTM 在处理多变量时间序列玩具数据集时无法收敛，误差很高，可能的原因是什么？","如果 LSTM 在简单的多变量时间序列（如交替的加法和乘法序列）上无法学习，可能原因包括：\n1. 数据归一化问题：时间序列数据的尺度差异可能导致梯度不稳定，需确保输入数据经过适当的标准化或归一化。\n2. 学习率设置不当：过高或过低的学习率都会阻碍收敛，建议尝试调整学习率。\n3. 序列长度与记忆能力：如果序列模式超出 LSTM 的记忆范围（rho 参数），可能难以捕捉长依赖。\n4. 模型架构：检查是否正确克隆了前向传播的输出（某些论坛建议克隆输出以断开计算图），以及网络层数是否足够。\n5. 优化器选择：尝试不同的优化算法（如 Adam 而非 SGD）。","https:\u002F\u002Fgithub.com\u002FElement-Research\u002Frnn\u002Fissues\u002F113",{"id":134,"question_zh":135,"answer_zh":136,"source_url":113},40900,"在使用 CTC 准则进行字符级 OCR 任务时，如何选择合适的输入编码方式？","在字符级 OCR 任务中，如果词汇量较小（例如只有几十个字符），不建议使用 `nn.LookupTable`，而应使用 `nn.OneHot` 编码。\n`nn.LookupTable` 更适合词汇量巨大的自然语言处理任务。对于图像列作为输入、输出为 Unicode 字符的任务，直接使用 `BiSequencer LSTM` 配合 `OneHot` 编码通常能解决 NaN 问题并提高训练稳定性。此外，确保输入图像列在每一步的时间步长处理正确，无需额外的查找表嵌入。",[]]