Dewey Decimal Classification
Fun with DDC
Intro🔗
Most of us have probably set a foot in a library with the intent to find a book. You'd probably have noticed that the books are 'sorted' by topic. Maybe you even saw a number like '745.2' that covers a category of books. That is the Dewey Decimal Classification
For the fontter project I needed a structured way to use sa classification.
This post is a quick summary of the steps that I took.
First step: download the file🔗
I found this list that I could use as a starting point. It is probably outdated, but for my needs it was just sufficient.
We will use the aiohttp
client to download the page and store it locally.
file: 1_prepare.sh
#!/bin/bash pip install bs4
file: download_file.py
#!/usr/bin/env python3 import os import asyncio import aiohttp IN_FILE = './tmp/ddc.html' URL = 'http://bpeck.com/references/ddc/ddc.htm' async def download_file(url: str) -> str: "aiohttp client" async with aiohttp.ClientSession() as session: async with session.get(url) as resp: if resp.status == 200: content = await resp.text() return content async def read_or_create_file(url: str, in_file: str) -> str: "make a local cache, to work from (during development)" if os.path.exists(in_file): with open(in_file, 'r') as file_h: content = file_h.read() else: content = await download_file(url) with open(in_file, 'w') as file_h: file_h.write(content) return content async def main(loop: any) -> None: content = await read_or_create_file(URL, IN_FILE) if __name__ == '__main__': LOOP = asyncio.get_event_loop() LOOP.run_until_complete(main(LOOP))
source ./1_prepare.sh ./download.py
Second step: parse the html🔗
We will use beautifull soup
to extract the data in the html tables and save it in a json file.
file: parse_file.py
#!/usr/bin/env python3 import json import asyncio from typing import Dict from bs4 import BeautifulSoup from download_file import read_or_create_file, IN_FILE, URL OUT_FILE = './tmp/ddc.json' def split_ddc(summary: str) -> str: parts = summary.split(' ') return parts[0], ' '.join(parts[1:]) def parse(content) -> Dict[str, str]: ddict = {} soup = BeautifulSoup(content, 'html.parser') for table in soup('table', cellpadding="2", width="99%", border="1"): rows = table.findAll('tr') summary = rows[0].p.contents[0].contents[0].string dec, subject = split_ddc(summary) ddict[dec] = subject for cell in rows[1].findAll('td'): dec = "" subject = "" size = len(cell.contents) if size == 1: dec = cell.contents[0].contents[0].string subject = cell.contents[0].contents[2].string elif size == 3: dec = cell.contents[0].contents[0].string subject = cell.contents[2].contents[0].string else: pass ddict[dec] = subject for cell in rows[2].findAll('td'): for item in cell.findAll('font'): summary = item.contents[0].string dec, subject = split_ddc(summary) ddict[dec] = subject for p_item in cell.findAll('p'): for item in p_item.findAll('font'): summary = item.contents[0].string dec, subject = split_ddc(summary) ddict[dec] = subject return ddict async def main(loop: any) -> None: content = await read_or_create_file(URL, IN_FILE) ddc = parse(content) with open(OUT_FILE, 'w') as file_h: file_h.write(json.dumps(ddc)) if __name__ == '__main__': LOOP = asyncio.get_event_loop() LOOP.run_until_complete(main(LOOP))
pip install bs4 export PYTHONPATH=`pwd`/.. touch __init__.py ./parse_file.py
Now we have the file saved as ./tmp/ddc.json
Third step: prepare the database table.🔗
file: create_table.sql
CREATE TABLE IF NOT EXISTS ddc( ddc_cid VARCHAR( 8 ) PRIMARY KEY, ddc_name VARCHAR( 256 ) NOT NULL );
Fourth step: insert the data into the table🔗
We will use asyncpg
to extract the json rows and insert them in the database table.
file: import_data.py
#!/usr/bin/env python3 import re import json import asyncio import asyncpg from parse_file import OUT_FILE CONF = { 'database': 'a-database', 'user': 'a-user', 'password': 'a-password', } async def create_record(conn: asyncpg.Connection, cid: str, name: str) -> None: sql = """INSERT INTO ddc ( ddc_cid, ddc_name ) VALUES ( $1, $2 ) RETURNING * """ transaction = conn.transaction() await transaction.start() try: row = await conn.fetchrow(sql, cid, name) except: await transaction.rollback() raise finally: await transaction.commit() async def gen_ddc(conn: asyncpg.Connection) -> None: with open(OUT_FILE, 'r') as file_h: text = file_h.read() data = json.loads(text) pattern = re.compile(r"\d{3}(\.\d+)*") for cid, name in data.items(): if pattern.match(cid): await create_record(conn, cid, name) async def main(loop: any) -> None: db_pool = await asyncpg.create_pool(**CONF, loop=loop) conn = await db_pool.acquire() await gen_ddc(conn) await conn.close() await db_pool.release(conn) await db_pool.close() if __name__ == '__main__': LOOP = asyncio.get_event_loop() LOOP.run_until_complete(main(LOOP))
pip install asyncpg ./import_data.py
Nice, now we can query it,
SELECT * FROM ddc WHERE ddc_cid IN ('700', '740', '745', '745.2') ORDER BY ddc_cid;
ddc_cid | ddc_name
---------+---------------------------
700 | The arts
740 | Drawing & decorative arts
745 | Decorative arts
745.2 | Industrial art and design
(4 rows)
Fifth step: create a view🔗
Just a query is not enough. I want the whole display of the hiarchy in one row, even if that means one badass query.
file: create_view.sql
CREATE OR REPLACE VIEW ddc_v AS WITH first AS ( SELECT LEFT(ddc_cid, 1) AS a, ddc_cid, ddc_name FROM ddc WHERE ddc_cid IN ('0', '100', '200', '300', '400', '500', '600', '700', '800', '900') ORDER BY ddc_cid ), second AS ( SELECT LEFT(ddc_cid, 1) AS a, LEFT(ddc_cid, 2) AS b, ddc_cid, ddc_name FROM ddc WHERE ddc_cid ~ '0$' AND ddc_cid !~ '\.' ORDER BY ddc_cid ), third AS ( SELECT LEFT(ddc_cid, 1) AS a, LEFT(ddc_cid, 2) AS b, LEFT(ddc_cid, 3) AS c, ddc_cid, ddc_name from ddc WHERE ddc_cid !~ '\.' ORDER BY ddc_cid ), fourth AS ( SELECT LEFT(ddc_cid, 1) AS a, LEFT(ddc_cid, 2) AS b, LEFT(ddc_cid, 3) AS c, array_to_string(regexp_matches(ddc_cid, '\d*\.(\d+)'), '') AS d, ddc_cid, ddc_name from ddc WHERE ddc_cid ~ '.' ORDER BY ddc_cid ) SELECT t4.ddc_cid, COALESCE( t4.ddc_name, t3.ddc_name, t2.ddc_name, t1.ddc_name) AS name, t1.ddc_name AS t1_name, t2.ddc_name AS t2_name, t3.ddc_name AS t3_name, t4.ddc_name AS t4_name, CAST( CAST( t1.a AS INT) * 100 AS TEXT ) AS a, CAST( CAST( t2.b AS INT) * 10 AS TEXT ) AS b, t3.c AS c, t4.d AS d FROM First t1 INNER JOIN second t2 ON t2.a = t1.a INNER JOIN third t3 ON t3.b = t2.b LEFT OUTER JOIN fourth t4 ON t4.c = t3.c ORDER BY t1.a, t2.b, t3.c, t4.d ASC NULLS FIRST
Some interesting things to note here are:
- Common Table Expressions (CTE) are used so structure the code
- PostgreSQL regular expression are used in the WHERE clauses:
- second:
WHERE ddc_cid ~ '0$' AND ddc_cid !~ '\.'
- third:
WHERE ddc_cid !~ '\.'
- fourth:
WHERE ddc_cid ~ '.'
- second:
- column 'fourth.d' uses these two interesting functions
array_to_string(regexp_matches(ddc_cid, '\d*\.(\d+)'), '')
SELECT * FROM ddc_v WHERE ddc_cid = '745.2';
ddc_cid | name | t1_name | t2_name | t3_name | t4_name | a | b | c | d
---------+---------------------------+----------+---------------------------+-----------------+---------------------------+-----+-----+-----+---
745.2 | Industrial art and design | The arts | Drawing & decorative arts | Decorative arts | Industrial art and design | 700 | 740 | 745 | 2
(1 row)
What can be improved?🔗
- [ ] The parsing routine is a bit ugly.
- [ ] The first level of hiarchy has a
IN
condition, this could be a set ofVALUES
or aSEQUENCE
. - [ ] The aliases of the column; 'a', 'b', 'c', 'd' are not so well chosen.