Category
devops

Class loaders in Flink and Spark

Flink

classloader.resolve-order: child-first # the default; the alternative is parent-first
#classloader.parent-first-patterns.default:
classloader.parent-first-patterns.additional: com.codahale.metrics
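Flink decides per class name: if the fully qualified name starts with any configured parent-first prefix, the class is delegated to the parent loader even in child-first mode. A minimal sketch of that prefix check (not Flink's exact implementation; the pattern list below is an illustrative subset of the defaults plus the additional entry configured above):

```java
import java.util.List;

class ParentFirstPatterns {
    // Illustrative subset of the default parent-first prefixes, plus
    // the additional pattern from the config above.
    static final List<String> PATTERNS = List.of(
            "java.", "javax.", "org.apache.flink.", "org.slf4j",
            "com.codahale.metrics");

    // Even a child-first loader delegates to the parent when the class
    // name matches one of the configured prefixes.
    static boolean isParentFirst(String className) {
        return PATTERNS.stream().anyMatch(className::startsWith);
    }
}
```

So `isParentFirst("com.codahale.metrics.Histogram")` is true (loaded once, by the parent), while user code such as `org.apache.iceberg.flink.sink.IcebergStreamWriterMetrics` stays child-first.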
With child-first resolution and without the additional parent-first pattern above, com.codahale.metrics.Histogram is loaded by both the user-code class loader and the application class loader, and the job fails with:

java.lang.LinkageError: loader constraint violation: when resolving method 'void org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper.<init>(com.codahale.metrics.Histogram)' the class loader org.apache.flink.util.ChildFirstClassLoader @1f4f7e8c of the current class, org/apache/iceberg/flink/sink/IcebergStreamWriterMetrics, and the class loader 'app' for the method's defining class, org/apache/flink/dropwizard/metrics/DropwizardHistogramWrapper, have different Class objects for the type com/codahale/metrics/Histogram used in the signature (org.apache.iceberg.flink.sink.IcebergStreamWriterMetrics is in unnamed module of loader org.apache.flink.util.ChildFirstClassLoader @1f4f7e8c, parent loader 'app'; org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper is in unnamed module of loader 'app')
    at org.apache.iceberg.flink.sink.IcebergStreamWriterMetrics.<init>(IcebergStreamWriterMetrics.java:54)
    at org.apache.iceberg.flink.sink.IcebergStreamWriter.open(IcebergStreamWriter.java:56)
    at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:107)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:858)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$5(StreamTask.java:812)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:812)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:771)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:970)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:939)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:763)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
    at java.base/java.lang.Thread.run(Unknown Source)
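The root cause ("different Class objects for the type") can be reproduced in isolation: two sibling loaders that each define a class of the same name produce two distinct Class objects. A minimal sketch, assuming a JDK is available (it uses the javax.tools compiler); the Histogram name here is just a stand-in:

```java
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

class DuplicateClassDemo {

    // Compile a throwaway class outside the application classpath, then
    // load it through two sibling loaders: same name, two Class objects.
    static Class<?>[] loadTwice() throws Exception {
        Path dir = Files.createTempDirectory("dup-class-demo");
        Files.writeString(dir.resolve("Histogram.java"), "public class Histogram {}");
        int rc = ToolProvider.getSystemJavaCompiler()
                .run(null, null, null, dir.resolve("Histogram.java").toString());
        if (rc != 0) throw new IllegalStateException("javac failed");

        URL[] urls = { dir.toUri().toURL() };
        URLClassLoader a = new URLClassLoader(urls, null); // no parent delegation
        URLClassLoader b = new URLClassLoader(urls, null);
        return new Class<?>[] { a.loadClass("Histogram"), b.loadClass("Histogram") };
    }

    public static void main(String[] args) throws Exception {
        Class<?>[] c = loadTwice();
        System.out.println(c[0].getName().equals(c[1].getName())); // true
        System.out.println(c[0] == c[1]);                          // false
    }
}
```

Casting an instance from one loader's class to the other's fails for exactly the reason in the stack trace; forcing the package parent-first (as in the config above) guarantees a single defining loader.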

Spark

The equivalent Spark settings; userClassPathFirst=true switches the driver/executor to child-first (user-jars-first) loading:

--conf "spark.driver.extraClassPath="
--conf "spark.executor.extraClassPath="
--conf "spark.driver.extraJavaOptions="
--conf "spark.executor.extraJavaOptions="
--conf "spark.driver.extraLibraryPath="
--conf "spark.executor.extraLibraryPath="
--conf "spark.driver.userClassPathFirst=false"
--conf "spark.executor.userClassPathFirst=false"

MutableURLClassLoader vs AppClassLoader

  • org.apache.spark.util.MutableURLClassLoader

java.net.URLClassLoader ← org.apache.spark.util.MutableURLClassLoader ← org.apache.spark.util.ChildFirstURLClassLoader (each class extends the one to its left)
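The hierarchy above boils down to a loadClass override. A minimal child-first loader, sketching the delegation order used by ChildFirstURLClassLoader and Flink's ChildFirstClassLoader (not their exact production code):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first delegation: try to define the class from our own URLs
// first, and only fall back to the parent when we don't have it.
class ChildFirstClassLoader extends URLClassLoader {
    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);        // already defined here?
            if (c == null) {
                try {
                    c = findClass(name);               // child first: our own jars
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false);  // fall back to the parent
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

With no URLs everything falls through to the parent, so JDK classes stay shared; the production versions additionally force java.* and the configured parent-first patterns to the parent up front.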

Third-party packages

Loading dependencies in Spark

## Downloaded from Ivy
bin/spark-submit \
--conf "spark.jars.packages=cn.hutool:hutool-all:5.8.39" \
jar-file
## 25/07/29 03:10:56 INFO SparkContext: Added JAR file:///opt/bitnami/spark/.ivy2/jars/cn.hutool_hutool-all-5.8.39.jar at spark://192.168.66.115:24911/jars/cn.hutool_hutool-all-5.8.39.jar with timestamp 1753758656002

When using spark-submit, the application jar and any jars passed via the --jars option are automatically shipped to the cluster. URLs given after --jars must be comma-separated, and the list is added to both the driver and executor classpaths. Directory expansion does not work with --jars. Spark supports the following URL schemes, allowing different strategies for distributing jars:

  • file: – Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
  • hdfs:, http:, https:, ftp: – these pull down files and JARs from the URI as expected
  • local: – a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

Loading dependencies in Flink

bin/flink run \
--classpath https://maven.aliyun.com/repository/public/cn/hutool/hutool-all/5.8.39/hutool-all-5.8.39.jar \
jar-file

Note: the jar-file argument to bin/spark-submit may be an http URL, but the jar-file argument to bin/flink run may not.

Must-know Spark topics

  • Spark Thrift Server (STS)
  • Adaptive Query Execution (AQE)
  • Dynamic Resource Allocation (DRA)
  • Spark shuffle
  • Storage Partition Join (SPJ)
  • In Spark SQL, the catalog.database.table form is called a three-part identifier; it uniquely identifies a table (e.g. spark_catalog.default.users, where spark_catalog is the built-in session catalog).