Flink Application Development - Table API - Streaming Concepts

Author: ApacheFlink

Streaming Concepts

Flink's [Table API] is a unified API for stream and batch processing. This means that Table API & SQL queries have the same semantics regardless of whether their input is bounded batch input or unbounded streaming input. Because traditional relational algebra and SQL were originally designed for batch processing, relational queries are less intuitive in streaming scenarios than in batch scenarios.

The following pages cover the concepts, practical limitations, and streaming-specific configuration options of streaming data processing.

State Management

Table programs that run in streaming mode leverage all capabilities of Flink as a stateful stream processor. In particular, a table program can be configured with a [state backend] for handling different requirements regarding state size and fault tolerance. It is possible to take a savepoint of a running Table API & SQL pipeline and to restore the application's state at a later point in time.

State Usage

Due to the declarative nature of Table API & SQL programs, it is not always obvious where and how much state is used within a pipeline. The planner decides whether state is necessary to compute a correct result. A pipeline is optimized to claim as little state as possible given the current set of optimizer rules.
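
Returning to the configuration point at the start of this section: the state backend and checkpointing are chosen through ordinary Flink configuration. A minimal sketch, assuming a SQL Client session and that the deployment picks up these standard keys (the concrete values are purely illustrative):

   -- use a state backend suited to large state and enable periodic checkpoints
   SET 'state.backend' = 'rocksdb';
   SET 'execution.checkpointing.interval' = '10 s';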

Conceptually, source tables are never kept entirely in state. An implementer deals with logical tables (i.e. [dynamic tables]). Their state requirements depend on the used operations.

Queries such as SELECT ... FROM ... WHERE which only consist of field projections or filters are usually stateless pipelines. However, operations such as joins, aggregations, or deduplications require keeping intermediate results in a fault-tolerant storage for which Flink's state abstractions are used.
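
As an illustration, the first query below typically runs without state, while the second one must remember every key it has already seen (the url column on the clicks table is assumed here purely for the example):

   -- stateless: only projection and filtering
   SELECT sessionId, url FROM clicks WHERE url LIKE '%/product/%';

   -- stateful: deduplication keeps every observed key in state
   SELECT DISTINCT sessionId FROM clicks;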

Please refer to the individual operator documentation for more details about how much state is required and how to limit a potentially ever-growing state size.

For example, a regular SQL join of two tables requires the operator to keep both input tables in state entirely. For correct SQL semantics, the runtime needs to assume that a match could occur at any point in time from both sides. Flink provides [optimized window and interval joins] that keep the state size bounded by exploiting time attributes. Another example is the following query that computes the number of clicks per session.

   SELECT sessionId, COUNT(*) FROM clicks GROUP BY sessionId;

The sessionId attribute is used as a grouping key and the continuous query maintains a count for each sessionId it observes. The sessionId attribute evolves over time and sessionId values are only active until the session ends, i.e., for a limited period of time. However, the continuous query cannot know about this property of sessionId and expects that every sessionId value can occur at any point of time. It maintains a count for each observed sessionId value. Consequently, the total state size of the query grows continuously as more and more sessionId values are observed.

Idle State Retention Time

The Idle State Retention Time parameter [table.exec.state.ttl] defines for how long the state of a key is retained without being updated before it is removed. For the previous example query, the count of a sessionId would be removed as soon as it has not been updated for the configured period of time. By removing the state of a key, the continuous query completely forgets that it has seen this key before. If a record with a key whose state has been removed before is processed, the record will be treated as if it was the first record with the respective key. For the example above this means that the count of a sessionId would start again at 0.

Stateful Upgrades and Evolution

Table programs that are executed in streaming mode are intended as standing queries, which means they are defined once and are continuously evaluated as static end-to-end pipelines. In the case of stateful pipelines, any change to either the query or Flink's planner might lead to a completely different execution plan. This makes stateful upgrades and the evolution of table programs challenging at the moment. The community is working on improving those shortcomings.

For example, by adding a filter predicate, the optimizer might decide to reorder joins or change the schema of an intermediate operator. This prevents restoring from a savepoint due to either a changed topology or a different column layout within the state of an operator. The query implementer must ensure that the optimized plans before and after the change are compatible. Use the EXPLAIN command in SQL or table.explain() in Table API to [get insights].

Since new optimizer rules are continuously added and operators become more efficient and specialized, upgrading to a newer Flink version could also lead to incompatible plans.
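
Both of the knobs discussed above, the idle state retention time and plan inspection, can be exercised directly in SQL. A minimal sketch reusing the clicks query (the one-hour TTL is an arbitrary illustrative value):

   -- drop per-key state that has not been updated for one hour
   SET 'table.exec.state.ttl' = '1 h';

   -- print the optimized plan, e.g. to compare it before and after a query change
   EXPLAIN SELECT sessionId, COUNT(*) FROM clicks GROUP BY sessionId;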

Currently, the framework cannot guarantee that state can be mapped from a savepoint to a new table operator topology.

In other words: Savepoints are only supported if both the query and the Flink version remain constant.

Since the community rejects contributions that modify the optimized plan and the operator topology in a patch version (e.g. from 1.13.1 to 1.13.2), it should be safe to upgrade a Table API & SQL pipeline to a newer bug fix release. However, upgrades across major or minor versions (e.g. from 1.12 to 1.13) are not supported. For both shortcomings (i.e. modified query and modified Flink version), we recommend investigating whether the state of an updated table program can be "warmed up" (i.e. initialized) with historical data again before switching to real-time data. The Flink community is working on a [hybrid source] to make this switching as convenient as possible.

Where to go next?

  • [Dynamic Tables]: Describes the concept of dynamic tables.
  • [Time Attributes]: Explains time attributes and how they are used in Table API & SQL.
  • [Joins on Streams]: The kinds of joins supported over streaming data.
  • [Temporal Tables]: Describes the concept of temporal tables.
  • [Query Configuration]: Table API & SQL specific configuration options.
