Installation
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple dbt-core
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade dbt-mysql
dbt --version
Core:
  - installed: 1.1.5
  - latest:    1.5.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - mariadb: 1.1.0 - Could not determine latest version
  - mysql:   1.1.0 - Up to date!
  - mysql5:  1.1.0 - Could not determine latest version
Creating a project and connecting to the database
From the command line, change into the folder where you usually keep your code, then follow the steps below to create a new project:
dbt init dbt_hello_world
06:42:48 Running with dbt=1.1.5
06:42:48 Creating dbt configuration folder at C:\Users\wrss01\.dbt
Which database would you like to use?
[1] mariadb
[2] mysql
[3] mysql5
(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 2
06:43:04 Profile dbt_hello_world written to C:\Users\wrss01\.dbt\profiles.yml using target's sample configuration. Once updated, you'll be able to start developing with dbt.
06:43:04
Your new dbt project "dbt_hello_world" was created!
For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:
https://docs.getdbt.com/docs/configure-your-profile
One more thing:
Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:
https://community.getdbt.com/
Happy modeling!
After creation, the directory structure looks like this:
│ .gitignore
│ dbt_project.yml
│ README.md
│
├─analyses
│ .gitkeep
│
├─macros
│ .gitkeep
│
├─models
│ └─example
│ my_first_dbt_model.sql
│ my_second_dbt_model.sql
│ schema.yml
│
├─seeds
│ .gitkeep
│
├─snapshots
│ .gitkeep
│
└─tests
.gitkeep
Open the ~/.dbt/profiles.yml file and change its contents to:
dbt_hello_world:
  target: dev
  outputs:
    dev:
      type: mysql
      server: 172.30.81.86
      port: 3307  # optional
      database: cte_test  # optional, should be same as schema
      schema: cte_test
      username: root
      password: mysqlpwd123
      driver: MySQL ODBC 8.0 ANSI Driver
    prod:
      type: mysql
      server: 172.30.81.86
      port: 3307  # optional
      database: cte_test  # optional, should be same as schema
      schema: cte_test
      username: root
      password: mysqlpwd123
      driver: MySQL ODBC 8.0 ANSI Driver
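Before running dbt debug, a quick structural check of the profile can save a round trip. As a rough sketch (the field list below simply mirrors the sample configuration above; it is not an official adapter schema), a few lines of Python can confirm that every target defines those fields:

```python
# Sketch: check that each output target defines the fields used in the
# sample dbt-mysql profile above. The field list mirrors that sample,
# not an official adapter schema.
REQUIRED = {"type", "server", "port", "schema", "username", "password"}

profile = {  # mirrors ~/.dbt/profiles.yml (password elided)
    "dbt_hello_world": {
        "target": "dev",
        "outputs": {
            "dev":  {"type": "mysql", "server": "172.30.81.86", "port": 3307,
                     "schema": "cte_test", "username": "root", "password": "..."},
            "prod": {"type": "mysql", "server": "172.30.81.86", "port": 3307,
                     "schema": "cte_test", "username": "root", "password": "..."},
        },
    },
}

for name, target in profile["dbt_hello_world"]["outputs"].items():
    missing = REQUIRED - target.keys()
    print(name, "missing:", missing if missing else "nothing")
```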
Test the connection:
dbt debug
07:20:20 Running with dbt=1.1.5
dbt version: 1.1.5
python version: 3.9.16
python path: C:\ProgramData\Anaconda3\envs\3.9\python.exe
os info: Windows-10-10.0.19045-SP0
Using profiles.yml file at C:\Users\wrss01\.dbt\profiles.yml
Using dbt_project.yml file at C:\Users\wrss01\Desktop\dbt_test\dbt_hello_world\dbt_project.yml
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  server: 172.30.81.86
  port: 3307
  database: None
  schema: cte_test
  user: root
  Connection test: [OK connection ok]
All checks passed!
First run
The init scaffold already ships with two example models, so we can try them out directly with dbt run. The result:
07:24:15 Running with dbt=1.1.5
07:24:15 Partial parse save file not found. Starting full parse.
07:24:16 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
07:24:16
07:24:16 Concurrency: 1 threads (target='dev')
07:24:16
07:24:16 1 of 2 START table model cte_test.my_first_dbt_model ........................... [RUN]
07:24:18 1 of 2 OK created table model cte_test.my_first_dbt_model ...................... [SUCCESS 2 in 1.69s]
07:24:18 2 of 2 START view model cte_test.my_second_dbt_model ........................... [RUN]
07:24:19 2 of 2 OK created view model cte_test.my_second_dbt_model ...................... [SUCCESS 0 in 0.98s]
07:24:19
07:24:19 Finished running 1 table model, 1 view model in 3.19s.
07:24:19
07:24:19 Completed successfully
07:24:19
07:24:19 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
The command-line output shows that 1 table model and 1 view model were created, and MySQL Workbench likewise shows 1 new table and 1 new view.
Loading a dataset
First, download the UCI Bike Sharing dataset from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. Put hour.csv and day.csv into the seeds folder, then create a bike_share.yml file with the following content:
version: 2

seeds:
  - name: hour
    config:
      column_types:
        dteday: date
  - name: day
    config:
      column_types:
        dteday: date
Here we pin dteday to the date type; for columns that are not listed, dbt infers the type on its own (see the column_types link in the references). Now run dbt seed. The result:
08:05:37 Running with dbt=1.1.5
08:05:37 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
08:05:37
08:05:38 Concurrency: 1 threads (target='dev')
08:05:38
08:05:38 1 of 2 START seed file cte_test.day ............................................ [RUN]
08:05:44 1 of 2 OK loaded seed file cte_test.day ........................................ [INSERT 731 in 6.32s]
08:05:44 2 of 2 START seed file cte_test.hour ........................................... [RUN]
08:08:11 2 of 2 OK loaded seed file cte_test.hour ....................................... [INSERT 17379 in 146.61s]
08:08:13
08:08:13 Finished running 2 seeds in 155.65s.
08:08:13
08:08:13 Completed successfully
08:08:13
08:08:13 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
Run dbt seed again and you will find that, although it takes just as long, the data is not duplicated: the old rows are cleared before the load. If the table structure has changed, however, you need to drop the table first.
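For columns without a column_types entry, dbt infers the type from the CSV values (under the hood it uses the agate library). The real inference is more involved, but the idea can be sketched roughly like this; infer_type below is a simplified, hypothetical helper, not dbt's actual code:

```python
from datetime import datetime

def infer_type(values):
    """Rough sketch of CSV type inference: try date, then int, then float, else text.

    Simplified illustration only; dbt's real inference (via agate) handles
    many more formats, nulls, and edge cases.
    """
    def all_parse(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    return "text"

print(infer_type(["2011-01-01", "2011-01-02"]))  # date
print(infer_type(["0.24", "0.22"]))              # float
print(infer_type(["1", "2"]))                    # int
```

This is also why pinning dteday explicitly is worthwhile: a column of date-like strings is only as reliable as the inference rules applied to it.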
Creating models
Create a models/shared_bike folder; from now on our models will all live there. To build models we first need to understand the data. There are two tables, hour and day; the only difference is that hour has an extra hr column for the hour of day. The remaining columns are:
- instant: record index
- dteday: date
- season: season (1: winter, 2: spring, 3: summer, 4: fall)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour (0 to 23)
- holiday: whether the day is a holiday
- weekday: day of the week
- workingday: whether the day is a working day
- weathersit: weather type
- 1: clear, few clouds, partly cloudy
- 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist
- 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
- 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
- temp: normalized temperature in Celsius; values are derived via (t - t_min)/(t_max - t_min), t_min = -8, t_max = +39 (only on the hourly scale)
- atemp: normalized feeling temperature in Celsius; values are derived via (t - t_min)/(t_max - t_min), t_min = -16, t_max = +50 (only on the hourly scale)
- hum: normalized humidity; values are divided by 100 (max)
- windspeed: normalized wind speed; values are divided by 67 (max)
- casual: count of casual (non-registered) users
- registered: count of registered users
- cnt: count of total rental bikes, casual and registered combined
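Because temp is min-max normalized, it can be mapped back to degrees Celsius by inverting the formula above: t = temp * (t_max - t_min) + t_min, i.e. t = temp * 47 - 8. A quick sanity check in Python:

```python
T_MIN, T_MAX = -8.0, 39.0  # bounds from the dataset description

def denormalize_temp(temp):
    """Invert (t - t_min) / (t_max - t_min) to recover degrees Celsius."""
    return temp * (T_MAX - T_MIN) + T_MIN  # i.e. temp * 47 - 8

# temp = 0 maps back to the minimum, temp = 1 to the maximum
print(denormalize_temp(0.0))  # -8.0
print(denormalize_temp(1.0))  # 39.0
print(denormalize_temp(0.5))  # 15.5
```

The same inversion with t_min = -16, t_max = +50 (i.e. atemp * 66 - 16) recovers the feeling temperature.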
A first look at these columns suggests a few group-by data models:
- group by season: the trend per season
- group by month: the trend per month
- group by weather: the trend per weather type
- group by weekday: the trend per day of the week
/* Monthly trend table */
{{ config(materialized='table') }}

with mnth_trend as (
    select
        mnth,
        /* temp * 47 - 8 undoes the normalization (t_min = -8, t_max = +39) */
        min(temp * 47 - 8) as min_temp,
        max(temp * 47 - 8) as max_temp,
        avg(temp * 47 - 8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by mnth
)

select * from mnth_trend
/* Season trend table */
{{ config(materialized='table') }}

with season_trend as (
    select
        season,
        min(temp * 47 - 8) as min_temp,
        max(temp * 47 - 8) as max_temp,
        avg(temp * 47 - 8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by season
)

select * from season_trend
/* Weather trend table */
{{ config(materialized='table') }}

with weather_trend as (
    select
        weathersit,
        min(temp * 47 - 8) as min_temp,
        max(temp * 47 - 8) as max_temp,
        avg(temp * 47 - 8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by weathersit
)

select * from weather_trend
/* Weekday trend table */
{{ config(materialized='table') }}

with weekday_trend as (
    select
        weekday,
        min(temp * 47 - 8) as min_temp,
        max(temp * 47 - 8) as max_temp,
        avg(temp * 47 - 8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by weekday
)

select * from weekday_trend
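All four models share one shape: group by a dimension, then aggregate temperature statistics and user-count sums. To make the pattern concrete, here is the same aggregation over a tiny hand-made sample in plain Python (the rows are made up for illustration, not taken from day.csv):

```python
from collections import defaultdict

# Made-up sample of (season, temp, casual, registered) rows
rows = [
    (1, 0.20, 100, 500),
    (1, 0.25, 150, 600),
    (2, 0.60, 300, 900),
]

# Group by season, denormalizing temp to Celsius (temp * 47 - 8) on the way in
groups = defaultdict(list)
for season, temp, casual, registered in rows:
    groups[season].append((temp * 47 - 8, casual, registered))

# Aggregate each group like the season_trend model does
for season, vals in sorted(groups.items()):
    temps = [t for t, _, _ in vals]
    trend = {
        "min_temp": min(temps),
        "max_temp": max(temps),
        "mean_temp": sum(temps) / len(temps),
        "total_casual": sum(c for _, c, _ in vals),
        "total_registered": sum(r for _, _, r in vals),
        "total_cnt": sum(c + r for _, c, r in vals),
    }
    print(season, trend)
```

In dbt the database does this work and the result is materialized as a table, but the computation per group is the same.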
After writing month_trend.sql, season_trend.sql, weather_trend.sql, and weekday_trend.sql, run dbt run, which produces the following:
08:50:29 Running with dbt=1.1.5
08:50:29 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
08:50:29
08:50:29 Concurrency: 1 threads (target='dev')
08:50:29
08:50:29 1 of 6 START table model cte_test.month_trend .................................. [RUN]
08:50:31 1 of 6 OK created table model cte_test.month_trend ............................. [SUCCESS 12 in 1.94s]
08:50:31 2 of 6 START table model cte_test.my_first_dbt_model ........................... [RUN]
08:50:33 2 of 6 OK created table model cte_test.my_first_dbt_model ...................... [SUCCESS 2 in 1.80s]
08:50:33 3 of 6 START table model cte_test.season_trend ................................. [RUN]
08:50:34 3 of 6 OK created table model cte_test.season_trend ............................ [SUCCESS 4 in 1.07s]
08:50:34 4 of 6 START table model cte_test.weather_trend ................................ [RUN]
08:50:35 4 of 6 OK created table model cte_test.weather_trend ........................... [SUCCESS 3 in 1.02s]
08:50:35 5 of 6 START table model cte_test.weekday_trend ................................ [RUN]
08:50:38 5 of 6 OK created table model cte_test.weekday_trend ........................... [SUCCESS 7 in 2.66s]
08:50:38 6 of 6 START view model cte_test.my_second_dbt_model ........................... [RUN]
08:50:40 6 of 6 OK created view model cte_test.my_second_dbt_model ...................... [SUCCESS 0 in 1.57s]
08:50:40
08:50:40 Finished running 5 table models, 1 view model in 10.35s.
08:50:40
08:50:40 Completed successfully
08:50:40
08:50:40 Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
Querying the database confirms the new trend tables and their data.
Generating documentation
One of dbt's big advantages is that it can generate documentation automatically from the SQL. The commands:
# generate the docs
dbt docs generate
# serve the docs locally
dbt docs serve
The output:
09:03:03 Running with dbt=1.1.5
09:03:03 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
09:03:03
09:03:03 Concurrency: 1 threads (target='dev')
09:03:03
09:03:03 Done.
09:03:03 Building catalog
09:03:03 Catalog written to C:\Users\wrss01\Desktop\dbt_test\dbt_hello_world\target\catalog.json
09:04:02 Running with dbt=1.1.5
09:04:02 Serving docs at 0.0.0.0:8080
09:04:02 To access from your browser, navigate to: http://localhost:8080
09:04:02
09:04:02
09:04:02 Press Ctrl+C to exit.
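The artifacts behind the docs site are plain JSON files under target/. As a sketch of inspecting them programmatically, the snippet below builds a minimal stand-in for target/catalog.json (the real file carries far more metadata per node) and relies only on its top-level nodes mapping:

```python
import json
import os
import tempfile

# Fabricated stand-in for target/catalog.json; the real catalog holds column
# and stats metadata for every node, but we only use the "nodes" mapping here.
fake_catalog = {
    "nodes": {
        "model.dbt_hello_world.month_trend": {},
        "model.dbt_hello_world.season_trend": {},
        "seed.dbt_hello_world.day": {},
    }
}

path = os.path.join(tempfile.mkdtemp(), "catalog.json")
with open(path, "w") as f:
    json.dump(fake_catalog, f)

# In a real project you would open <project>/target/catalog.json instead
with open(path) as f:
    catalog = json.load(f)

models = sorted(k for k in catalog["nodes"] if k.startswith("model."))
print(models)
```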
The docs site renders each model together with its description, columns, compiled SQL, and a lineage graph.

By now, you have hopefully gotten a feel for dbt's appeal: no tables to create by hand, no documentation or lineage to maintain manually, and data analysis that is modular, much like writing front-end and back-end code. If you are still hesitating, now is the time to hop aboard the dbt train. It brings a more efficient, reliable, and maintainable data processing and analysis experience.