EU GMP Annex 22 "Artificial Intelligence" (2025)

嘉峪检测网 2025-07-09 21:41

Summary: On 7 July, the European Commission and PIC/S both published new GMP revisions: Chapter 4 "Documentation" of the main text, Annex 11 "Computerised Systems", and one new annex, Annex 22 "Artificial Intelligence".

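Annex 22 is the first GMP guideline dedicated to artificial intelligence. It repeatedly stresses that dynamic models, generative AI and large language models (LLMs), and models with probabilistic output should not be used in critical GMP applications. The text of Annex 22 follows.
Annex 22: Artificial Intelligence
Reasons for changes: Not applicable (new annex).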

Document map
1. Scope
2. Principles
3. Intended Use
4. Acceptance Criteria
5. Test Data
6. Test Data Independency
7. Test Execution
8. Explainability
9. Confidence
10. Operation
Glossary

1. Scope
This annex applies to all types of computerised systems used in the manufacturing of medicinal products and active substances, where Artificial Intelligence models are used in critical applications with direct impact on patient safety, product quality or data integrity, e.g. to predict or classify data. The document provides additional guidance to Annex 11 for computerised systems in which AI models are embedded.
The document applies to machine learning (AI/ML) models which have obtained their functionality through training with data, rather than being explicitly programmed. Models may consist of several individual models, each automating specific process steps in GMP.
The document applies to static models, i.e. models that do not adapt their performance during use by incorporating new data. The use of dynamic models, which continuously and automatically learn and adapt performance during use, is not covered by this document, and such models should not be used in critical GMP applications.
The document applies to models with a deterministic output which, when given identical inputs, provide identical outputs. Models with a probabilistic output which, when given identical inputs, might not provide identical outputs are not covered by this document and should not be used in critical GMP applications.
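Illustrative note (not part of the Annex): the determinism condition above can be probed with a simple repeat test. The sketch below is a minimal example; the `predict` callables and the two toy models are hypothetical and stand in for whatever inference interface a real system exposes. Passing such a check is only evidence of determinism, not proof.

```python
import random
from typing import Callable, Sequence


def is_deterministic(predict: Callable[[Sequence[float]], object],
                     inputs: Sequence[Sequence[float]],
                     repeats: int = 5) -> bool:
    """Return True if every input yields the same output on each repeated call."""
    for x in inputs:
        first = predict(x)
        for _ in range(repeats - 1):
            if predict(x) != first:
                return False
    return True


# Hypothetical models, for illustration only.
def threshold_model(x: Sequence[float]) -> str:
    # Fixed rule: identical inputs always give identical outputs.
    return "reject" if sum(x) > 1.0 else "accept"


def sampling_model(x: Sequence[float]) -> str:
    # Draws a random number, so identical inputs may give different outputs.
    return "reject" if random.random() < 0.5 else "accept"


if __name__ == "__main__":
    samples = [[0.2, 0.3], [0.9, 0.4]]
    print(is_deterministic(threshold_model, samples))
    print(is_deterministic(sampling_model, samples, repeats=50))
```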
Following the above, the document does not apply to Generative AI and Large Language Models (LLM), and such models should not be used in critical GMP applications. If used in non-critical GMP applications, which do not have direct impact on patient safety, product quality or data integrity, personnel with adequate qualification and training should always be responsible for ensuring that the outputs from such models are suitable for the intended use, i.e. a human-in-the-loop (HITL), and the principles described in this document may be considered where applicable.

2. Principles
2.1. Personnel. In order to adequately understand the intended use and the associated risks of the application of an AI model in a GMP environment, there should be close cooperation between all relevant parties during algorithm selection, and model training, validation, testing and operation. This includes but may not be limited to process subject matter experts (SMEs), QA, data scientists, IT, and consultants. All personnel should have adequate qualifications, defined responsibilities and appropriate level of access.
2.2. Documentation. Documentation for activities described in this section should be available and reviewed by the regulated user irrespective of whether a model is trained, validated and tested in-house or whether it is provided by a supplier or service provider.
2.3. Quality Risk Management. Activities described in this document should be implemented based on the risk to patient safety, product quality and data integrity.

3. Intended Use
3.1. Intended use. The intended use of a model and the specific tasks it is designed to assist or automate should be described in detail based on an in-depth knowledge of the process the model is integrated in. This should include a comprehensive characterisation of the data the model is intended to use as input and all common and rare variations; i.e. the input sample space. Any limitations and possible erroneous and biased inputs should be identified. A process subject matter expert (SME) should be responsible for the adequacy of the description, and it should be documented and approved before the start of acceptance testing.
3.2. Subgroups. Where applicable, the input sample space should be divided into subgroups based on relevant characteristics. Subgroups may be defined by characteristics like the decision output (e.g. ‘accept’ or ‘reject’), process specific baseline characteristics (e.g. geographical site or equipment), specific characteristics in material or product, and characteristics specific to the task being automated (e.g. types and severity of defects).
3.3. Human-in-the-loop. Where a model is used to give an input to a decision made by a human operator (human-in-the-loop), and where the effort to test such model has been diminished, the description of the intended use should include the responsibility of the operator. In this case, the training and consistent performance of the operator should be monitored like any other manual process.

4. Acceptance Criteria
4.1. Test metrics. Suitable, case dependent test metrics, should be defined to measure the performance of the model according to the intended use. As an example, suitable test metrics for a model used to classify products (e.g. ‘accept’ or ‘reject’) may include, but may not be limited to, a confusion matrix, sensitivity, specificity, accuracy, precision and/or F1 score.
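Illustrative note (not part of the Annex): the metrics listed in 4.1 all derive from the four cells of a binary confusion matrix. The sketch below computes them for an ‘accept’/‘reject’ classifier, treating ‘reject’ as the positive class; that choice and the function names are assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "reject") -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    # Undefined ratios (empty denominator) are reported as NaN, not as 0.
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                           positive: str = "reject") -> Dict[str, float]:
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "sensitivity": _ratio(tp, tp + fn),      # share of true rejects detected
        "specificity": _ratio(tn, tn + fp),      # share of true accepts passed
        "precision": _ratio(tp, tp + fp),        # share of predicted rejects that are real
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and sensitivity
    }


if __name__ == "__main__":
    truth = ["reject"] * 3 + ["accept"] * 5
    pred = ["reject", "reject", "accept", "accept", "accept", "accept", "accept", "reject"]
    print(confusion_counts(truth, pred))
    print(classification_metrics(truth, pred))
```

Which of these metrics matters most depends on the intended use: for defect detection, a missed reject (false negative) usually carries more risk than a false alarm, so sensitivity tends to receive the stricter criterion.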
4.2. Acceptance criteria. Acceptance criteria for the defined test metrics should be established by which the performance of the model should be considered acceptable for the intended use. The acceptance criteria may differ for specific subgroups within the intended use. A process subject matter expert (SME) should be responsible for the definition of the acceptance criteria, which should be documented and approved before the start of acceptance testing.
4.3. No decrease. The acceptance criteria of a model should be at least as high as the performance of the process it replaces. This implies that the performance should be known for the process which is to be replaced by a model (see Annex 11 2.7).
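Illustrative note (not part of the Annex): 4.2 and 4.3 together imply two comparisons per metric. The criterion must not be lower than the performance of the replaced process, and the measured result must meet the criterion. The sketch below assumes all three are available as plain dictionaries keyed by metric name; the function name and the numbers in the example are hypothetical. It could be called once per subgroup where criteria differ.

```python
from typing import Dict, List


def check_acceptance(measured: Dict[str, float],
                     criteria: Dict[str, float],
                     baseline: Dict[str, float]) -> List[str]:
    """Return a list of findings; an empty list means every check passed."""
    findings: List[str] = []
    for metric, minimum in criteria.items():
        # 4.3: the criterion should be at least as high as the replaced process.
        if metric not in baseline:
            findings.append(f"{metric}: performance of the replaced process is unknown")
        elif minimum < baseline[metric]:
            findings.append(
                f"{metric}: criterion {minimum} is below the replaced process ({baseline[metric]})")
        # 4.2: the measured performance should meet the criterion.
        if metric not in measured:
            findings.append(f"{metric}: no measured value")
        elif measured[metric] < minimum:
            findings.append(
                f"{metric}: measured {measured[metric]} does not meet criterion {minimum}")
    return findings


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    criteria = {"sensitivity": 0.99, "specificity": 0.95}
    baseline = {"sensitivity": 0.98, "specificity": 0.96}
    measured = {"sensitivity": 0.995, "specificity": 0.94}
    for finding in check_acceptance(measured, criteria, baseline):
        print(finding)
```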

5. Test Data
5.1. Selection. Test data should be representative of and expand the full sample space of the intended use. It should be stratified, include all subgroups, and reflect the limitations, complexity and all common and rare variations within the intended use of the model. The criteria and rationale for selection of test data should be documented.
5.2. Sufficient in size. The test dataset, and any of its subgroups, should be sufficient in size to calculate the test metrics with adequate statistical confidence.
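Illustrative note (not part of the Annex): one way to see what "sufficient in size" in 5.2 means is to put a confidence interval around a proportion-type metric such as sensitivity. The sketch below uses the Wilson score interval, which is one common choice among several and is an assumption of this example. The same observed rate gives a much wider interval on a small subgroup than on a large one.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (e.g. sensitivity)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Same observed rate (98 %), different subgroup sizes.
    for detected, total in [(49, 50), (490, 500)]:
        low, high = wilson_interval(detected, total)
        print(f"{detected}/{total}: 95% interval {low:.3f} to {high:.3f}")
```

If the acceptance criterion is stated for the lower bound rather than the point estimate, the interval width translates directly into a minimum number of test samples per subgroup.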
5.3. Labelling. The labelling of test data should be verified following a process that ensures a very high degree of correctness. This may include independent verification by multiple experts, validated equipment or laboratory tests.
5.4. Pre-processing. Any pre-processing of the test data, e.g. transformation, normalisation, or standardisation, should be pre-specified and a rationale should be provided that it represents intended use conditions.
5.5. Exclusion. Any cleaning or exclusion of test data should be documented and fully justified.
5.6. Data generation. Generation of test data or labels, e.g. by means of generative AI, is not recommended and any use hereof should be fully justified.

6. Test Data Independency
6.1. Independence. Effective measures consisting of technical and/or procedural controls should be implemented to ensure the independency of test data, i.e. that data which will be used to test a model is not used during development, training or validation of the model. This may be by capturing test data only after completion of training and validation, or by splitting test data from a complete pool of data before training has started.
6.2. Data split. If test data is split from a complete pool of data before training of the model, it is essential that employees involved in the development and training of the model have never had access to the test data. The test data should be protected by access control and audit trail functionality logging accesses and changes to these. There should be no copies of test data outside this repository.
6.3. Identification. It should be recorded which data has been used for testing, when and how many times.
6.4. Physical objects. When test data originates from physical objects, it should be ensured that the objects used for the final test of the model have not previously been used to train or validate the model, unless features are independent.
6.5. Staff independency. Effective procedural and/or technical controls should be implemented to prevent staff members who have had access to test data from being involved in training and validation of the same model. In organisations where it is impossible to maintain this independency, a staff member who might have had access to test data for a model should only have access to training and validation of the same model when working together (in pair) with a colleague who has not had this access (4-eyes principle).
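Illustrative note (not part of the Annex): the split described in 6.1 and 6.2 and the usage record in 6.3 might be sketched as follows. Records are assigned to the test set by hashing their identifier, so the assignment is reproducible and can be made before any training starts. The function names, the `salt` value and the `UsageLog` class are hypothetical. Access control and audit trail, which the Annex also requires, are outside the scope of this sketch.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set, Tuple


def split_before_training(record_ids: Iterable[str], test_fraction: float = 0.2,
                          salt: str = "hypothetical-salt") -> Tuple[List[str], List[str]]:
    """Deterministically assign each record ID to the development or the test pool."""
    dev: List[str] = []
    test: List[str] = []
    for rid in record_ids:
        digest = hashlib.sha256((salt + rid).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
        (test if bucket < test_fraction else dev).append(rid)
    return dev, test


def leaked_ids(dev_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """IDs that appear in both pools; should be empty if independence holds."""
    return set(dev_ids) & set(test_ids)


class UsageLog:
    """Records which test records were used, when, and how many times (cf. 6.3)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, object]] = []

    def record(self, test_ids: Iterable[str], purpose: str) -> None:
        self.entries.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "purpose": purpose,
            "ids": sorted(test_ids),
        })

    def times_used(self, record_id: str) -> int:
        return sum(record_id in entry["ids"] for entry in self.entries)


if __name__ == "__main__":
    ids = [f"unit-{i:04d}" for i in range(1000)]
    dev, test = split_before_training(ids)
    print(len(dev), len(test), leaked_ids(dev, test))
    log = UsageLog()
    log.record(test, "acceptance test, run 1")
    print(log.times_used(test[0]))
```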

7. Test Execution
7.1. Fit for intended use. The test should ensure that a model is fit for intended use and is ‘generalising well’, i.e. that the model has a satisfactory performance with new data from the intended use. This includes detecting possible over- or underfitting of the model to the training data.
7.2. Test plan. Before the test is initiated, a test plan should be prepared and approved. It should contain a summary of the intended use, the pre-defined metrics and acceptance criteria, a reference to the test data, a test script including a description of all steps necessary to conduct the test, and a description of how to calculate the test metrics. A process subject matter expert (SME) should be involved in developing the plan.
7.3. Deviation. Any deviation from the test plan, failure to meet acceptance criteria, or omission to use all test data should be documented, investigated, and fully justified.
7.4. Test documentation. All test documentation should be retained along with the description of the intended use, the characterisation of test data, the actual test data, and where relevant, physical test objects. In addition, documentation for access control to test data and related audit trail records should be retained similarly to other GMP documentation.

8. Explainability
8.1. Feature attribution. During testing of models used in critical GMP applications, systems should capture and record the features in the test data that have contributed to a particular classification or decision (e.g. rejection). Where applicable, techniques like feature attribution (e.g. SHAP values or LIME) or visual tools like heat maps should be used to highlight key factors contributing to the outcome.
8.2. Feature justification. In order to ensure that a model is making decisions based on relevant and appropriate features and based on risk, a review of these features should be part of the process for approval of test results.
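Illustrative note (not part of the Annex): the idea behind feature attribution can be shown with a simple occlusion (ablation) scheme. This is not SHAP or LIME; it is a much cruder technique chosen here because it fits in a few lines. Each feature is replaced in turn by a reference value and the resulting change in the model score is recorded. The `score` callable, the weights and the baseline values are hypothetical.

```python
from typing import Callable, List, Sequence


def occlusion_attribution(score: Callable[[Sequence[float]], float],
                          x: Sequence[float],
                          baseline: Sequence[float]) -> List[float]:
    """Change in score when each feature is individually replaced by its baseline value."""
    full = score(x)
    attributions: List[float] = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        attributions.append(full - score(occluded))
    return attributions


# Hypothetical linear reject score, for illustration only.
WEIGHTS = [2.0, -1.0, 0.0]


def linear_score(x: Sequence[float]) -> float:
    return sum(w * v for w, v in zip(WEIGHTS, x))


if __name__ == "__main__":
    sample = [3.0, 1.0, 5.0]
    reference = [1.0, 1.0, 1.0]
    print(occlusion_attribution(linear_score, sample, reference))
```

A reviewer applying 8.2 would then ask whether the features with large attributions are ones a process expert considers relevant, and whether irrelevant features (here the third, with zero weight) indeed contribute nothing.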

9. Confidence
9.1. Confidence score. When testing a model used to predict or classify data, the system should, where applicable, log the confidence score of the model for each prediction or classification outcome.
9.2. Threshold. Models used to predict or classify data should have an appropriate threshold setting to ensure predictions or classifications are made only when suitable. If the confidence score is very low, it should be considered whether the model should flag the outcome as ‘undecided’, rather than making potentially unreliable predictions or classifications.
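Illustrative note (not part of the Annex): the threshold behaviour in 9.2 can be sketched for a binary classifier that outputs a score for the ‘reject’ class. The name `p_reject`, the definition of confidence as the larger of the two class scores, and the default threshold of 0.9 are assumptions made for this example; an actual threshold would need to be justified for the intended use.

```python
from typing import Tuple


def classify_with_threshold(p_reject: float, threshold: float = 0.9) -> Tuple[str, float]:
    """Return (label, confidence); the label is 'undecided' below the threshold."""
    if not 0.0 <= p_reject <= 1.0:
        raise ValueError("p_reject must lie in [0, 1]")
    confidence = max(p_reject, 1.0 - p_reject)  # score of the more likely class
    if confidence < threshold:
        return "undecided", confidence
    label = "reject" if p_reject >= 0.5 else "accept"
    return label, confidence


if __name__ == "__main__":
    for score in (0.97, 0.02, 0.60):
        label, confidence = classify_with_threshold(score)
        # Logging the confidence with each outcome corresponds to 9.1.
        print(f"p_reject={score:.2f} -> {label} (confidence {confidence:.2f})")
```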

10. Operation
10.1. Change control. A tested model, the system it is implemented in, and the whole process it is automating or assisting should be put under change control before it is deployed in operation. Any change to the model itself, the system, or the process in which it is used, including any change to physical objects the model is using as input, should be documented and evaluated to determine if the model needs to be retested. Any decision not to conduct such retest should be fully justified.
10.2. Configuration control. A tested model should be put under configuration control before being deployed in operation, and effective measures should be used to detect any unauthorised change.
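Illustrative note (not part of the Annex): one technical measure for detecting an unauthorised change to a frozen model is to record a cryptographic digest of the model file at approval and compare it before use. The sketch below uses SHA-256; the function names are hypothetical, and where the approved digest is stored and who may change it are procedural questions this example does not address.

```python
import hashlib
import hmac


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to handle large model files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: str, approved_digest: str) -> bool:
    """True if the file still matches the digest recorded at approval."""
    return hmac.compare_digest(file_sha256(path), approved_digest)
```

A typical use would be to call `file_sha256` once when the tested model is released and to call `is_unchanged` each time the model is loaded, raising an alarm on a mismatch.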
10.3. System performance monitoring. The performance of a model as defined by its metrics should be regularly monitored to detect any changes in the computerised system (e.g. deterioration or change of a lighting condition).

10.4. Input sample space monitoring. It should be regularly monitored whether the input data are still within the model sample space and intended use. Metrics should be defined for monitoring any drift in the input data.
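Illustrative note (not part of the Annex): 10.4 asks for metrics that reveal drift in the input data. One widely used candidate for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of current inputs with a reference distribution. The Annex does not name any particular metric; PSI, the bin edges and the function names below are assumptions of this sketch, and any alert limit would have to be defined and justified by the user.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; values outside the edges go to the first or last bin."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference: Sequence[float], current: Sequence[float],
                               edges: Sequence[float], eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref); 0 means identical binned distributions."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    psi = 0.0
    for r, c in zip(ref, cur):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) for empty bins
        psi += (c - r) * math.log(c / r)
    return psi


if __name__ == "__main__":
    edges = [0.0, 0.25, 0.5, 0.75, 1.0]
    reference = [i / 100 for i in range(100)]      # inputs seen during testing
    shifted = [0.5 + i / 200 for i in range(100)]  # inputs concentrated in the upper half
    print(population_stability_index(reference, reference, edges))
    print(population_stability_index(reference, shifted, edges))
```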

10.5. Human review. When a model is used to give an input to a decision made by a human operator (human-in-the-loop), and where the effort to test such model has been diminished, records should be kept from this process. Depending on the criticality of the process and the level of testing of the model, this may imply a consistent review and/or test of every output from the model, according to a procedure.

Glossary
Artificial Intelligence – ‘AI system’ means a machine-based system that is designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment, and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
Deep Learning – Approach to creating rich hierarchical representations through the training of neural networks with many hidden layers.
Feature – A pattern in data that can be reduced to a simpler higher-level representation.
LIME – Local Interpretable Model-Agnostic Explanations; a technique that approximates any black box machine learning model with a local, interpretable model to explain each individual prediction.
Machine Learning – Machine learning refers to the computational process of optimising the parameters of a model from data, which is a mathematical construct generating an output based on input data. Machine learning approaches include, for instance, supervised, unsupervised and reinforcement learning, using a variety of methods including deep learning with neural networks.
Model – Mathematical algorithms with parameters (weights) arranged in an architecture that allows learning of patterns (features) from training data.
Overfitting – Learning details from training data that cannot be generalised to new data.
SHAP – Shapley Additive Explanations; an explainable AI (XAI) framework that can provide model-agnostic local explainability for tabular, image, and text datasets.
Static – Frozen model: A model where all parameters have been finally set, not allowing further adaption to new data.
Test dataset – The "hold-out" data that is used to estimate performance of the final ML model.
Training dataset – The data used to train the ML model.
Validation dataset (in AI) – The dataset used during model development, to inform on how to optimally train the model from training data. Its size is smaller than that of the training set.

Source: Internet
