安装

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple dbt-core

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade dbt-mysql

dbt --version

Core:
  - installed: 1.1.5
  - latest:    1.5.0 - Update available!
  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation
Plugins:
  - mariadb: 1.1.0 - Could not determine latest version
  - mysql:   1.1.0 - Up to date!
  - mysql5:  1.1.0 - Could not determine latest version

创建项目与连接数据库

在命令行中进入自己常用的存放代码的文件夹之后，按照下面步骤操作，就可以创建一个新的项目：

dbt init dbt_hello_world

06:42:48  Running with dbt=1.1.5
06:42:48  Creating dbt configuration folder at C:\Users\wrss01\.dbt
Which database would you like to use?
[1] mariadb
[2] mysql
[3] mysql5
(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 2
06:43:04  Profile dbt_hello_world written to C:\Users\wrss01\.dbt\profiles.yml using target's sample configuration. Once updated, you'll be able to start developing with dbt.
06:43:04
Your new dbt project "dbt_hello_world" was created!
For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:
  https://docs.getdbt.com/docs/configure-your-profile
One more thing:
Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:
  https://community.getdbt.com/
Happy modeling!

创建成功后的目录结构如下：

│  .gitignore
│  dbt_project.yml
│  README.md
│
├─analyses
│      .gitkeep
│
├─macros
│      .gitkeep
│
├─models
│  └─example
│          my_first_dbt_model.sql
│          my_second_dbt_model.sql
│          schema.yml
│
├─seeds
│      .gitkeep
│
├─snapshots
│      .gitkeep
│
└─tests
        .gitkeep

找到 ~/.dbt/profiles.yml 文件，内容改为：

dbt_hello_world:
  target: dev
  outputs:
    dev:
      type: mysql
      server: 172.30.81.86
      port: 3307  # optional
      database: cte_test # optional, should be same as schema
      schema: cte_test
      username: root
      password: mysqlpwd123
      driver: MySQL ODBC 8.0 ANSI Driver
    prod:
      type: mysql
      server: 172.30.81.86
      port: 3307  # optional
      database: cte_test # optional, should be same as schema
      schema: cte_test
      username: root
      password: mysqlpwd123
      driver: MySQL ODBC 8.0 ANSI Driver

测试链接：

dbt debug

07:20:20  Running with dbt=1.1.5
dbt version: 1.1.5
python version: 3.9.16
python path: C:\ProgramData\Anaconda3\envs\3.9\python.exe
os info: Windows-10-10.0.19045-SP0
Using profiles.yml file at C:\Users\wrss01\.dbt\profiles.yml
Using dbt_project.yml file at C:\Users\wrss01\Desktop\dbt_test\dbt_hello_world\dbt_project.yml
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]
Required dependencies:
 - git [OK found]
Connection:
  server: 172.30.81.86
  port: 3307
  database: None
  schema: cte_test
  user: root
  Connection test: [OK connection ok]
All checks passed!

第一次运行

通过 init 已经自带了两个模型，我们可以直接通过 dbt run 尝试运行一下，结果如下：

07:24:15  Running with dbt=1.1.5
07:24:15  Partial parse save file not found. Starting full parse.
07:24:16  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
07:24:16
07:24:16  Concurrency: 1 threads (target='dev')
07:24:16
07:24:16  1 of 2 START table model cte_test.my_first_dbt_model ........................... [RUN]
07:24:18  1 of 2 OK created table model cte_test.my_first_dbt_model ...................... [SUCCESS 2 in 1.69s]
07:24:18  2 of 2 START view model cte_test.my_second_dbt_model ........................... [RUN]
07:24:19  2 of 2 OK created view model cte_test.my_second_dbt_model ...................... [SUCCESS 0 in 0.98s]
07:24:19
07:24:19  Finished running 1 table model, 1 view model in 3.19s.
07:24:19
07:24:19  Completed successfully
07:24:19
07:24:19  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

通过命令行的输出可以看到生成了 1 个 table model 和 1 个 view model，同时我们也可以在 MySQL Workbench 中看到 1 张表和 1 个视图。

载入数据集

我们先从这里 https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset 下载 UCI 的共享单车数据集。我们将 hour.csv 和 day.csv 放到 seeds 文件夹下，然后创建 bike_share.yml 文件，内容如下：

version: 2
seeds:
  - name: hour
    config:
      column_types: 
        dteday: date
  - name: day
    config:
      column_types:
        dteday: date

这里我们指定一下 dteday 为 date 类型（没有指定的列，dbt 会自己推断类型，具体可见参考链接里 column_types 的链接），接着就可以执行 dbt seed 命令了，结果如下：

08:05:37  Running with dbt=1.1.5
08:05:37  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
08:05:37
08:05:38  Concurrency: 1 threads (target='dev')
08:05:38
08:05:38  1 of 2 START seed file cte_test.day ............................................ [RUN]
08:05:44  1 of 2 OK loaded seed file cte_test.day ........................................ [INSERT 731 in 6.32s]
08:05:44  2 of 2 START seed file cte_test.hour ........................................... [RUN]
08:08:11  2 of 2 OK loaded seed file cte_test.hour ....................................... [INSERT 17379 in 146.61s]
08:08:13
08:08:13  Finished running 2 seeds in 155.65s.
08:08:13
08:08:13  Completed successfully
08:08:13
08:08:13  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

再执行一次 dbt seed，就会发现虽然执行时间没变，但是数据并没有重复两次，因为导入前会把老的数据清空。但如果是表结构有变化，那么就需要先删除表才可以。

创建模型

我们创建 models/shared_bike 文件夹，之后我们的模型会统一放在这个文件夹中。要做模型当然要先了解数据，我们有两张表：hour 和 day，两个的差别在于 hour 表多了一个 hr 字段，表示具体的小时，其他的属性说明如下：

instant: 记录编号
dteday : 日期
season : 季节 (1: 冬天, 2: 春天, 3: 夏天, 4:秋天)
yr : 年份 (0: 2011, 1:2012)
mnth : 月份 ( 1 to 12)
hr : 小时 (0 to 23)
holiday : 是否是节假日
weekday : 星期几
workingday : 是否是工作日
weathersit : 天气类型

1: 晴天少云 Clear, Few clouds, Partly cloudy, Partly cloudy
2: 有雾 Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: 小雨小雪雷阵雨 Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: 大雨大雪冰雹 Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

temp: 标准化的摄氏度 Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: 标准化的体感摄氏度 Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: 标准化的湿度 Normalized humidity. The values are divided to 100 (max)
windspeed: 标准化的风速 Normalized wind speed. The values are divided to 67 (max)
casual: 非注册用户数量 count of casual users
registered: 注册用户数量 count of registered users
cnt: 租借单车的数量（全部用户） count of total rental bikes including both casual and registered

看了一眼这些数据，我们可以先利用 group 的能力得到如下一些数据模型：

group by 季节，得到每个季节的趋势
group by 月份，得到每个月份的趋势
group by 天气，得到每种天气的趋势
group by 星期几，得到星期的趋势

/* 月度趋势表 */
{{ config(materialized='table') }}
with mnth_trend as (
    select mnth , min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by mnth 
)
select * from mnth_trend

/* 季节趋势表 */
{{ config(materialized='table') }}
with season_trend as (
    select season , min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by season 
)
select * from season_trend

/* 天气趋势表 */
{{ config(materialized='table') }}
with weather_trend as (
    select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by weathersit
)
select * from weather_trend

/* 周趋势表 */
{{ config(materialized='table') }}
with weekday_trend as (
    select weekday , min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by weekday 
)
select * from weekday_trend

编写month_trend.sql、season_trend.sql、weather_trend.sql、weekday_trend.sql完成之后，执行 dbt run，可以得到如下结果：

08:50:29  Running with dbt=1.1.5
08:50:29  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
08:50:29
08:50:29  Concurrency: 1 threads (target='dev')
08:50:29
08:50:29  1 of 6 START table model cte_test.month_trend .................................. [RUN]
08:50:31  1 of 6 OK created table model cte_test.month_trend ............................. [SUCCESS 12 in 1.94s]
08:50:31  2 of 6 START table model cte_test.my_first_dbt_model ........................... [RUN]
08:50:33  2 of 6 OK created table model cte_test.my_first_dbt_model ...................... [SUCCESS 2 in 1.80s]
08:50:33  3 of 6 START table model cte_test.season_trend ................................. [RUN]
08:50:34  3 of 6 OK created table model cte_test.season_trend ............................ [SUCCESS 4 in 1.07s]
08:50:34  4 of 6 START table model cte_test.weather_trend ................................ [RUN]
08:50:35  4 of 6 OK created table model cte_test.weather_trend ........................... [SUCCESS 3 in 1.02s]
08:50:35  5 of 6 START table model cte_test.weekday_trend ................................ [RUN]
08:50:38  5 of 6 OK created table model cte_test.weekday_trend ........................... [SUCCESS 7 in 2.66s]
08:50:38  6 of 6 START view model cte_test.my_second_dbt_model ........................... [RUN]
08:50:40  6 of 6 OK created view model cte_test.my_second_dbt_model ...................... [SUCCESS 0 in 1.57s]
08:50:40
08:50:40  Finished running 5 table models, 1 view model in 10.35s.
08:50:40
08:50:40  Completed successfully
08:50:40
08:50:40  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

查询数据库，数据如下：

生成文档

dbt 的一大好处就是可以自动根据 sql 语句生成文档，命令如下：

#生成文档
dbt docs generate
#文档本地访问
dbt docs serve

执行的输出：

09:03:03  Running with dbt=1.1.5
09:03:03  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
09:03:03
09:03:03  Concurrency: 1 threads (target='dev')
09:03:03
09:03:03  Done.
09:03:03  Building catalog
09:03:03  Catalog written to C:\Users\wrss01\Desktop\dbt_test\dbt_hello_world\target\catalog.json
09:04:02  Running with dbt=1.1.5
09:04:02  Serving docs at 0.0.0.0:8080
09:04:02  To access from your browser, navigate to:  http://localhost:8080
09:04:02
09:04:02
09:04:02  Press Ctrl+C to exit.

文档的展现：

到目前为止，相信大家已经领略了dbt的魅力。它无需手动创建表格，无需手动维护文档和血缘关系。就像编写前后端代码一样，可以以模块化的方式进行数据分析。如果您还在犹豫，现在是时候跳上这辆dbt的列车了！它将为您带来更高效、可靠和可维护的数据处理和分析体验。别再等待，赶快加入吧！

网站首页 > 技术文章正文

数据集成的未来之路:dbt(data build tool)起步

安装

创建项目与连接数据库

第一次运行

载入数据集

创建模型

生成文档

猜你喜欢

网站首页 > 技术文章 正文

数据集成的未来之路:dbt(data build tool)起步

安装

创建项目与连接数据库

第一次运行

载入数据集

创建模型

生成文档

猜你喜欢

网站首页 > 技术文章正文