Category
devops

metadata

Open-source projects

| name | description | link |
| --- | --- | --- |
| unitycatalog | Unity Catalog: Open, Multimodal Catalog for Data & AI | https://github.com/unitycatalog/unitycatalog |
| Apache Atlas | Apache Atlas – Data Governance and Metadata framework for Hadoop | https://atlas.apache.org/ |
| datahub | The #1 open source metadata platform. | https://datahubproject.io/ |
| Metacat | Metacat is a unified metadata exploration API service. | https://github.com/Netflix/metacat |
| openmetadata | Open and unified metadata platform for data discovery, observability, and governance | https://open-metadata.org/ |
| apache polaris | Apache Polaris is an open-source, fully-featured catalog for Apache Iceberg™. | https://polaris.apache.org/ |
| projectnessie | Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics | https://projectnessie.org/ |
| GRAVITINO | Apache Gravitino is a high-performance, geo-distributed, federated metadata lake. With its technical data catalog and metadata lake you can manage access to all of your data sources (file stores, relational databases, and event streams) and securely process data in different formats with multiple engines (such as Spark, Trino, or Flink) across cloud providers, while enforcing data governance. | https://gravitino.apache.org/ |
| tabulario/iceberg-rest | Sample REST image for experimentation and testing with Iceberg RESTCatalog implementations (note: this is a reference implementation, suitable for experimentation only). From Tabular, the company behind Apache Iceberg. | |

DataHub dependency analysis

Core components: datahub-frontend / datahub-gms / datahub-ingestion / datahub-actions


I. Storage backend

    Cassandra ---> DataHub GMS can use Cassandra as an alternate storage backend.
    MariaDB
    MySQL (default)
    PostgreSQL

II. Neo4j

    DataHub uses Neo4j as its backend graph database to serve graph queries.

III. Elasticsearch & Kibana

    DataHub uses Elasticsearch as a search engine.

IV. Kafka, Zookeeper and Schema Registry

    DataHub uses Kafka as the pub-sub message queue in the backend. 

DataHub Docker Images:

acryldata/datahub-ingestion
acryldata/datahub-gms
acryldata/datahub-frontend-react
acryldata/datahub-mae-consumer
acryldata/datahub-mce-consumer
acryldata/datahub-upgrade
acryldata/datahub-kafka-setup
acryldata/datahub-elasticsearch-setup
acryldata/datahub-mysql-setup
acryldata/datahub-postgres-setup
acryldata/datahub-actions
## Do not use acryldata/acryl-datahub-actions as that is deprecated and no longer used.

## Dependencies:
Kafka, Zookeeper, and Schema Registry
Elasticsearch
MySQL
(Optional) Neo4j

Do not use the latest or debug tags for any of the images, as they are unsupported and exist only for legacy reasons. Use head or a version-specific tag such as v0.8.40. For production we recommend version-specific tags rather than head.

docker pull acryldata/datahub-frontend-react:v1.0.0
docker pull acryldata/datahub-gms:v1.0.0

# https://github.com/datahub-project/datahub/blob/master/docker/docker-compose.yml
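
Once the compose stack is up (GMS listens on port 8080 by default), one quick way to verify it from Python is the REST emitter in the acryl-datahub package. A minimal sketch, assuming default ports and no authentication:

    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Point the emitter at the local GMS instance and check connectivity.
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.test_connection()  # raises if GMS is unreachable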

spark.sql.catalogImplementation

spark.sql.catalogImplementation is a Spark SQL configuration option that selects the catalog implementation. Spark SQL currently ships two built-in implementations (a sketch of setting the option follows the list):

  • hive: the default when Spark is built with Hive support; uses the Hive metastore to store and manage table metadata.
  • in-memory: keeps table metadata in memory; suitable for small datasets and development environments.
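
A minimal PySpark sketch; note that spark.sql.catalogImplementation is a static configuration, so it must be set before the SparkSession is created:

    from pyspark.sql import SparkSession

    # Build a session whose built-in session catalog is backed by the Hive
    # metastore; "in-memory" would keep metadata only for the session's lifetime.
    spark = (
        SparkSession.builder
        .appName("catalog-impl-demo")
        .config("spark.sql.catalogImplementation", "hive")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Confirm which implementation is active.
    print(spark.conf.get("spark.sql.catalogImplementation"))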

Some third-party catalog implementations are listed below (note that a few of these, such as SessionCatalog and HiveExternalCatalog, are Spark-internal classes rather than pluggable catalogs); a registration sketch follows the list:

  • org.apache.iceberg.spark.SparkCatalog
  • org.apache.spark.sql.hudi.catalog.HoodieCatalog
  • org.apache.spark.sql.delta.catalog.DeltaCatalog
  • org.apache.spark.sql.hive.HiveExternalCatalog
  • org.apache.iceberg.hive.HiveCatalog
  • org.apache.spark.sql.hive.HiveSessionCatalog
  • org.apache.spark.sql.catalyst.catalog.SessionCatalog
  • org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
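
To use a pluggable implementation, register it under spark.sql.catalog.{catalog_name}. A sketch for Iceberg's SparkCatalog; the catalog name "demo" and the warehouse path are made up for illustration, and the Iceberg Spark runtime jar is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg-catalog-demo")
        # Register a DataSource V2 catalog named "demo" backed by Iceberg.
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")  # filesystem-based catalog
        .config("spark.sql.catalog.demo.warehouse", "hdfs://nn:8020/warehouse/iceberg")
        .getOrCreate()
    )

    # Tables created under the "demo" prefix are managed by the Iceberg catalog.
    spark.sql("CREATE TABLE demo.db.t (id BIGINT) USING iceberg")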

What other sub-properties does spark.sql.catalog.spark_catalog have, and what does each one do?

The sub-properties of spark.sql.catalog.spark_catalog configure Spark's catalog, in particular when the session catalog is replaced with Iceberg or another pluggable implementation. Common sub-properties:

Sub-property list

  • type: the catalog type, such as hive, hadoop, rest, glue, jdbc, or nessie.

    spark.sql.catalog.spark_catalog.type=hive

  • uri: the catalog URI, e.g. the Hive Metastore Thrift URL or a REST API URL.

    spark.sql.catalog.spark_catalog.uri=thrift://localhost:9083

  • warehouse: the warehouse path, typically used by Hadoop-type catalogs.

    spark.sql.catalog.spark_catalog.warehouse=hdfs://nn:8020/warehouse/path

  • default-namespace: the catalog's default namespace.

    spark.sql.catalog.spark_catalog.default-namespace=default
  • catalog-impl: a custom catalog implementation class.

    spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.rest.RESTCatalog

  • io-impl: a custom FileIO implementation class (FileIO itself is an interface, so pass a concrete implementation).

    spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO

  • metrics-reporter-impl: a custom MetricsReporter implementation class.

    spark.sql.catalog.spark_catalog.metrics-reporter-impl=org.apache.iceberg.metrics.LoggingMetricsReporter
  • cache-enabled: whether to enable catalog caching.

    spark.sql.catalog.spark_catalog.cache-enabled=true

  • cache.expiration-interval-ms: cache expiration interval in milliseconds.

    spark.sql.catalog.spark_catalog.cache.expiration-interval-ms=30000
  • table-default.propertyKey: a default value for table property propertyKey, applied at the catalog level.

    spark.sql.catalog.spark_catalog.table-default.format-version=2

  • table-override.propertyKey: a table property value enforced at the catalog level.

    spark.sql.catalog.spark_catalog.table-override.format-version=2

  • view-default.propertyKey: a default value for view property propertyKey.

    spark.sql.catalog.spark_catalog.view-default.key1=value1

  • use-nullable-query-schema: whether to preserve field nullability when creating tables.

    spark.sql.catalog.spark_catalog.use-nullable-query-schema=true

Configure these sub-properties according to the catalog type and your requirements to get flexible metadata management and data access; a combined sketch follows the list.
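
A combined PySpark sketch wiring several of these sub-properties together. It assumes the Iceberg Spark runtime jar is available and the metastore URI is illustrative; Iceberg's SparkSessionCatalog is used so that non-Iceberg tables keep working:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark-catalog-demo")
        # Replace the default session catalog. SparkSessionCatalog delegates
        # non-Iceberg tables to Spark's built-in catalog.
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.iceberg.spark.SparkSessionCatalog")
        .config("spark.sql.catalog.spark_catalog.type", "hive")
        .config("spark.sql.catalog.spark_catalog.uri", "thrift://localhost:9083")
        .config("spark.sql.catalog.spark_catalog.cache-enabled", "true")
        # New Iceberg tables default to format version 2.
        .config("spark.sql.catalog.spark_catalog.table-default.format-version", "2")
        .enableHiveSupport()
        .getOrCreate()
    )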

Is spark.sql.catalog.{catalog_name}.type the counterpart of spark.sql.catalogImplementation? Only roughly: spark.sql.catalogImplementation chooses Spark's built-in session catalog (hive or in-memory), whereas type is just one option interpreted by a pluggable V2 catalog such as Iceberg's SparkCatalog.

https://iceberg.apache.org/javadoc/1.4.1/org/apache/iceberg/spark/SparkCatalog.html

A Spark TableCatalog implementation that wraps an Iceberg Catalog.
This supports the following catalog configuration options:

type - catalog type, "hive" or "hadoop" or "rest". To specify a non-hive or hadoop catalog, use the catalog-impl option.
uri - the Hive Metastore URI for Hive catalog or REST URI for REST catalog
warehouse - the warehouse path (Hadoop catalog only)
catalog-impl - a custom Catalog implementation to use
io-impl - a custom FileIO implementation to use
metrics-reporter-impl - a custom MetricsReporter implementation to use
default-namespace - a namespace to use as the default
cache-enabled - whether to enable catalog cache
cache.case-sensitive - whether the catalog cache should compare table identifiers in a case sensitive way
cache.expiration-interval-ms - interval in millis before expiring tables from catalog cache. Refer to CatalogProperties.CACHE_EXPIRATION_INTERVAL_MS for further details and significant values.
table-default.$tablePropertyKey - table property $tablePropertyKey default at catalog level
table-override.$tablePropertyKey - table property $tablePropertyKey enforced at catalog level

In fact, the spark.sql.catalog.{catalog_name}.type property alone is no longer sufficient; as the javadoc above shows, it has been extended with many additional fields.

2025/04/14: First pass at exploring DataHub. Deployment still feels quite difficult, with many dependencies.