在之前的文章《解析SparkStreaming和Kafka集成的两种方式》中已详细介绍SparkStreaming和Kafka集成主要有Receiver based Approach和Direct Approach。同时对比了二者的优劣势,以及针对不同的Spark、Kafka集成版本处理方式的支持:

本文主要介绍,SparkStreaming和Kafka使用Direct Approach方式处理任务时,如何自己管理offset?

SparkStreaming通过Direct Approach接收数据的入口:

KafkaUtils.createDirectStream。在调用该方法时,会先创建

KafkaCluster:val kc = new KafkaCluster(kafkaParams)

KafkaCluster负责和Kafka,该类会获取Kafka的分区信息、创建DirectKafkaInputDStream,每个DirectKafkaInputDStream对应一个topic,每个DirectKafkaInputDStream也会持有一个KafkaCluster实例。

到了计算周期后,会调用DirectKafkaInputDStream的compute方法,执行以下操作:

  1. 获取对应Kafka Partition的untilOffset,以确定需要获取数据的区间
  2. 构建KafkaRDD实例。每个计算周期里,DirectKafkaInputDStream和KafkaRDD是一一对应的
  3. 将相关的offset信息报给InputInfoTracker
  4. 返回该RDD

关于KafkaRDD和Kafka的分区对应关系,可以参考这篇文章《重要 | Spark分区并行度决定机制》

SparkStreaming和Kafka通过Direct方式集成,自己管理offsets代码实践:

1. 业务逻辑处理

/**
* @Author: 微信公众号-大数据学习与分享
*/
object SparkStreamingKafkaDirect { def main(args: Array[String]) {
if (args.length < 3) {
System.err.println(
s"""
|Usage: SparkStreamingKafkaDirect <brokers> <topics> <groupid>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
| <groupid> is a consume group
|
""".stripMargin)
System.exit(1)
} val Array(brokers, topics, groupId) = args val sparkConf = new SparkConf().setAppName("DirectKafka")
sparkConf.setMaster("local[*]")
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "10")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") val ssc = new StreamingContext(sparkConf, Seconds(6)) val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
"metadata.broker.list" -> brokers,
"group.id" -> groupId,
"auto.offset.reset" -> "smallest"
) val km = new KafkaManager(kafkaParams) val streams = km.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet) streams.foreachRDD(rdd => {
if (!rdd.isEmpty()) {
// 先处理消息
do something... // 再更新offsets
km.updateZKOffsets(rdd)
}
}) ssc.start()
ssc.awaitTermination()
}
}

2. offset管理核心逻辑

2.1 利用zookeeper

注意:自定义的KafkaManager必须在包org.apache.spark.streaming.kafka下

package org.apache.spark.streaming.kafka
/**
* @Author: 微信公众号-大数据学习与分享
* Spark-Streaming和Kafka直连方式:自己管理offsets
*/
class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {
private val kc = new KafkaCluster(kafkaParams) def createDirectStream[
K: ClassTag,
V: ClassTag,
KD <: Decoder[K] : ClassTag,
VD <: Decoder[V] : ClassTag](ssc: StreamingContext,
kafkaParams: Map[String, String],
topics: Set[String]): InputDStream[(K, V)] = {
val groupId = kafkaParams.get("group.id").get //从zookeeper上读取offset前先根据实际情况更新offset
setOrUpdateOffsets(topics, groupId) //从zookeeper上读取offset开始消费message
val messages = {
//获取分区 //Either处理异常的类,通常Left表示异常,Right表示正常
val partitionsE: Either[Err, Set[TopicAndPartition]] = kc.getPartitions(topics)
if (partitionsE.isLeft) throw new SparkException(s"get kafka partition failed:${partitionsE.left.get}") val partitions = partitionsE.right.get val consumerOffsetsE = kc.getConsumerOffsets(groupId, partitions)
if (consumerOffsetsE.isLeft) throw new SparkException(s"get kafka consumer offsets failed:${consumerOffsetsE.left.get}")
val consumerOffsets = consumerOffsetsE.right.get KafkaUtils.createDirectStream[K, V, KD, VD, (K, V)](ssc, kafkaParams, consumerOffsets, (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message))
}
messages
} /** 创建数据流之前,根据实际情况更新消费offsets */
def setOrUpdateOffsets(topics: Set[String], groupId: String): Unit = {
topics.foreach { topic =>
var hasConsumed = true
//获取每一个topic分区
val partitionsE = kc.getPartitions(Set(topic))
if (partitionsE.isLeft) throw new SparkException(s"get kafka partition failed:${partitionsE.left.get}") //正常获取分区结果
val partitions = partitionsE.right.get
//获取消费偏移量
val consumerOffsetsE = kc.getConsumerOffsets(groupId, partitions)
if (consumerOffsetsE.isLeft) hasConsumed = false if (hasConsumed) {
val earliestLeaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
if (earliestLeaderOffsetsE.isLeft) throw new SparkException(s"get earliest leader offsets failed: ${earliestLeaderOffsetsE.left.get}") val earliestLeaderOffsets: Map[TopicAndPartition, KafkaCluster.LeaderOffset] = earliestLeaderOffsetsE.right.get
val consumerOffsets: Map[TopicAndPartition, Long] = consumerOffsetsE.right.get var offsets: mutable.HashMap[TopicAndPartition, Long] = mutable.HashMap[TopicAndPartition, Long]()
consumerOffsets.foreach { case (tp, n) =>
val earliestLeaderOffset = earliestLeaderOffsets(tp).offset
//offsets += (tp -> n)
if (n < earliestLeaderOffset) {
println("consumer group:" + groupId + ",topic:" + tp.topic + ",partition:" + tp.partition + "offsets已过时,更新为:" + earliestLeaderOffset)
offsets += (tp -> earliestLeaderOffset)
}
println(n, earliestLeaderOffset, kc.getLatestLeaderOffsets(partitions).right)
}
println("map...." + offsets)
if (offsets.nonEmpty) kc.setConsumerOffsets(groupId, offsets.toMap) // val cs = consumerOffsetsE.right.get
// val lastest = kc.getLatestLeaderOffsets(partitions).right.get
// val earliest = kc.getEarliestLeaderOffsets(partitions).right.get
// var newCS: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long]()
// cs.foreach { f =>
// val max = lastest.get(f._1).get.offset
// val min = earliest.get(f._1).get.offset
// newCS += (f._1 -> f._2)
// //如果zookeeper中记录的offset在kafka中不存在(已过期)就指定其现有kafka的最小offset位置开始消费
// if (f._2 < min) {
// newCS += (f._1 -> min)
// }
// println(max + "-----" + f._2 + "--------" + min)
// }
// if (newCS.nonEmpty) kc.setConsumerOffsets(groupId, newCS)
} else {
println("没有消费过....")
val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase) val leaderOffsets: Map[TopicAndPartition, LeaderOffset] = if (reset == Some("smallest")) {
val leaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
if (leaderOffsetsE.isLeft) throw new SparkException(s"get earliest leader offsets failed: ${leaderOffsetsE.left.get}")
leaderOffsetsE.right.get
} else {
//largest
val leaderOffsetsE = kc.getLatestLeaderOffsets(partitions)
if (leaderOffsetsE.isLeft) throw new SparkException(s"get latest leader offsets failed: ${leaderOffsetsE.left.get}")
leaderOffsetsE.right.get
}
val offsets = leaderOffsets.map { case (tp, lo) => (tp, lo.offset) }
kc.setConsumerOffsets(groupId, offsets) /*
val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
val result = for {
topicPartitions <- kc.getPartitions(topics).right
leaderOffsets <- (if (reset == Some("smallest")) {
kc.getEarliestLeaderOffsets(topicPartitions)
} else {
kc.getLatestLeaderOffsets(topicPartitions)
}).right
} yield {
leaderOffsets.map { case (tp, lo) =>
(tp, lo.offset)
}
}
*/ }
}
} /** 更新zookeeper上的消费offsets */
def updateZKOffsets(rdd: RDD[(String, String)]): Unit = {
val groupId = kafkaParams("group.id")
val offsetList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetList.foreach { offset =>
val topicAndPartition = TopicAndPartition(offset.topic, offset.partition)
val o = kc.setConsumerOffsets(groupId, Map((topicAndPartition, offset.untilOffset)))
if (o.isLeft) println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
}
}
}

2.2 不利用zookeeper

/**
* @author 大数据学习与分享
* Spark Streaming和Kafka082通过mysql维护offset
*/
object SaveOffset2Mysql { def getLastOffsets(database: String, sql: String, jdbcOptions:Map[String,String]): HashMap[TopicAndPartition, Long] = {
val getConnection: () => Connection = JdbcUtils.createConnectionFactory(new JDBCOptions(jdbcOptions))
val conn = getConnection() val pst = conn.prepareStatement(sql)
val res = pst.executeQuery()
var map: HashMap[TopicAndPartition, Long] = HashMap()
while (res.next()) {
val o = res.getString(1)
val jSONArray = JSONArray.fromObject(o)
jSONArray.toArray.foreach { offset =>
val json = JSONObject.fromObject(offset)
val topicAndPartition = TopicAndPartition(json.getString("topic"), json.getInt("partition"))
map += topicAndPartition -> json.getLong("untilOffset")
}
}
pst.close()
conn.close()
map
} def offsetRanges2Json(offsetRanges: Array[OffsetRange]): JSONArray = {
val jSONArray = new JSONArray
offsetRanges.foreach { offsetRange =>
val jSONObject = new JSONObject()
jSONObject.accumulate("topic", offsetRange.topic)
jSONObject.accumulate("partition", offsetRange.partition)
jSONObject.accumulate("fromOffset", offsetRange.fromOffset)
jSONObject.accumulate("untilOffset", offsetRange.untilOffset)
jSONArray.add(jSONObject)
}
jSONArray
} def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5)) val kafkaParams = Map("metadata.broker.list" -> SystemProperties.BROKERS,
"zookeeper.connect" -> SystemProperties.ZK_SERVERS,
"zookeeper.connection.timeout.ms" -> "10000")
val topics = Set("pv") val tpMap = getLastOffsets("test", "select offset from res where id = (select max(id) from res)")
var messages: InputDStream[(String, String)] = null
if (tpMap.nonEmpty) {
messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
ssc, kafkaParams, tpMap, (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message()))
} else {
kafkaParams + ("auto.offset.reset" -> "largest")
messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
}
// var oRanges = Array[OffsetRange]()
// messages.transform { rdd =>
// oRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// rdd
// }.foreachRDD { rdd =>
// val offset = offsetRanges2Json(oRanges).toString
// } messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges rdd.map(_._2).flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).repartition(1)
.foreachPartition { par =>
if (par.nonEmpty) {
val conn = MysqlUtil.getConnection("test")
conn.setAutoCommit(false)
val pst = conn.prepareStatement("INSERT INTO res (word,count,offset,time) VALUES (?,?,?,?)")
par.foreach { case (word, count) =>
pst.setString(1, word)
pst.setLong(2, count)
pst.setString(3, offset)
pst.setTimestamp(4, new Timestamp(System.currentTimeMillis()))
pst.addBatch()
}
pst.executeBatch()
conn.commit()
pst.close()
conn.close()
}
}
}
ssc.start()
ssc.awaitTermination()
}
}
// Spark Streaming和Kafka010整合维护offset
val kafkaParams = Map[String, Object]("bootstrap.servers" -> SystemProperties.BROKERS,
"key.deserializer" -> classOf[StringDeserializer],
"key.deserializer" -> classOf[StringDeserializer],
"group.id" -> "g1",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)) val messages = KafkaUtils.createDirectStream[String, String](ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(topicSet, kafkaParams, getLastOffsets(kafkaParams, topicSet))) messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
iter.foreach { each =>
s"Do Something with $each"
}
} messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
} def getLastOffsets(kafkaParams: Map[String, Object], topicSet: Set[String]): Map[TopicPartition, Long] = {
val props = new Properties()
props.putAll(kafkaParams.asJava) val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(topicSet.asJavaCollection) paranoidPoll(consumer) val consumerAssign = consumer.assignment().asScala.map(tp => tp -> consumer.position(tp)).toMap consumer.close()
consumerAssign
} /** 思考: 消息已消费但提交offsets失败时的offsets矫正? */
def paranoidPoll(consumer: KafkaConsumer[String, String]): Unit = {
val msg = consumer.poll(Duration.ZERO)
if (!msg.isEmpty) {
// position should be minimum offset per topic partition
// val x: ((Map[TopicPartition, Long], ConsumerRecord[String, String]) => Map[TopicPartition, Long]) => Map[TopicPartition, Long] = msg.asScala.foldLeft(Map[TopicPartition, Long]())
msg.asScala.foldLeft(Map[TopicPartition, Long]()) { (acc, m) =>
val tp = new TopicPartition(m.topic(), m.partition())
val off = acc.get(tp).map(o => Math.min(o, m.offset())).getOrElse(m.offset()) acc + (tp -> off)
}.foreach { case (tp, off) =>
consumer.seek(tp, off)
}
}
}

上述给出一个demo思路。实际生产中,还要结合具体的业务场景,根据不同情况做特殊处理。

推荐文章:

Spark流式状态管理

Kafka作为消息系统的系统补充

Spark中广播变量详解以及如何动态更新广播变量

Java并发队列与容器

Hadoop调优

Redis中一致性哈希问题

Kafka高性能分析

对Spark硬件配置的建议

Spark集群和任务执行

通过spark.default.parallelism谈Spark谈并行度


关注微信公众号:大数据学习与分享,获取更对技术干货

SparkStreaming和Kafka基于Direct Approach如何管理offset实现exactly once的更多相关文章

  1. SparkStreaming获取kafka数据的两种方式:Receiver与Direct

    简介: Spark-Streaming获取kafka数据的两种方式-Receiver与Direct的方式,可以简单理解成: Receiver方式是通过zookeeper来连接kafka队列, Dire ...

  2. 基于Java+SparkStreaming整合kafka编程

    一.下载依赖jar包 具体可以参考:SparkStreaming整合kafka编程 二.创建Java工程 太简单,略. 三.实际例子 spark的安装包里面有好多例子,具体路径:spark-2.1.1 ...

  3. Spark-Streaming获取kafka数据的两种方式:Receiver与Direct的方式

    简单理解为:Receiver方式是通过zookeeper来连接kafka队列,Direct方式是直接连接到kafka的节点上获取数据 Receiver 使用Kafka的高层次Consumer API来 ...

  4. sparkStreaming读取kafka的两种方式

    概述 Spark Streaming 支持多种实时输入源数据的读取,其中包括Kafka.flume.socket流等等.除了Kafka以外的实时输入源,由于我们的业务场景没有涉及,在此将不会讨论.本篇 ...

  5. 解析SparkStreaming和Kafka集成的两种方式

    spark streaming是基于微批处理的流式计算引擎,通常是利用spark core或者spark core与spark sql一起来处理数据.在企业实时处理架构中,通常将spark strea ...

  6. 工具篇-Spark-Streaming获取kafka数据的两种方式(转载)

    转载自:https://blog.csdn.net/weixin_41615494/article/details/7952173 一.基于Receiver的方式 原理 Receiver从Kafka中 ...

  7. 《Apache kafka实战》读书笔记-管理Kafka集群安全之ACL篇

    <Apache kafka实战>读书笔记-管理Kafka集群安全之ACL篇 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 想必大家能看到这篇博客的小伙伴,估计你对kaf ...

  8. sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到zookeeper)-- 2

    参考上篇博文:https://www.cnblogs.com/niutao/p/10547718.html 同样的逻辑,不同的封装 package offsetInZookeeper /** * Cr ...

  9. sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到zookeeper)

    版本声明: kafka:1.0.1 spark:2.1.0 注意:在使用过程中可能会出现servlet版本不兼容的问题,因此在导入maven的pom文件的时候,需要做适当的排除操作 <?xml ...

  10. Spark+Kafka的Direct方式将偏移量发送到Zookeeper实现(转)

    原文链接:Spark+Kafka的Direct方式将偏移量发送到Zookeeper实现 Apache Spark 1.3.0引入了Direct API,利用Kafka的低层次API从Kafka集群中读 ...

随机推荐

  1. jquery datatable

    <html><head></head> <script type="text/javascript"> $(document).re ...

  2. JDE Section设置默认不执行

    此属性设置后,该Section仅能通过手动调用,默认不执行.

  3. Paip.Php Java 异步编程。推模型与拉模型。响应式(Reactive)”编程FutureData总结... 1

    Paip.Php  Java 异步编程.推模型与拉模型.响应式(Reactive)"编程FutureData总结... 1.1.1       异步调用的实现以及角色(:调用者 提货单) F ...

  4. 不再让内容把td撑开

    <style type="text/css"> table {width:600px;table-layout:fixed;} td {white-space:nowr ...

  5. bzoj 2134 单选错位(期望)

    [题目链接] http://www.lydsy.com/JudgeOnline/problem.php?id=2134 [题意] ai与ai+1相等得1分,求期望. [思路] 每个题的期望都是独立的. ...

  6. [转]LINQ语句之Select/Distinct和Count/Sum/Min/Max/Avg

    在讲述了LINQ,顺便说了一下Where操作,这篇开始我们继续说LINQ语句,目的让大家从语句的角度了解LINQ,LINQ包括LINQ to Objects.LINQ to DataSets.LINQ ...

  7. web基础-web工作原理,http协议,浏览器缓存

    1,web工作原理 2,http协议 3,浏览器缓存 4,cookie和session -------------------------------------------------------- ...

  8. DKNY_百度百科

    DKNY_百度百科 DKNY

  9. Jquery如何获取iframe里面body的html呢?

    如果是自己网页的话,可以这样,$("iframe").contents().find("body").html();意思是,获取iframe里面页面body的内 ...

  10. python全栈开发day48-jqurey自定义动画,jQuery属性操作,jQuery的文档操作,jQuery中的ajax

    一.昨日内容回顾 1.jQuery初识 1).使用jQuery而非JS的六大理由 2).jQuery对象和js对象转换 3).jQuery的两大特点 4).jQuery的入口函数三大写法 5).jQu ...