This is still a work in progress – you have been warned! (Please note I’m not necessarily recommending you code the SQL in this way either, since there are far more optimal ways of doing it in a real program, but it helps to do it this way to explain what I’m doing.)
As you may have noticed from this post, the Association of Train Operating Companies have released their fares data as a series of text files in a ZIP file under a Creative Commons licence. Not one to pass up a challenge like this, I quickly downloaded the file and then went about seeing what I could make of it. Just over a week later, I had created a working, if not entirely finished and polished, Python script to import those data files into a PostgreSQL database. (I will open source this tool and release it to the world once it’s finished, but it’s really only usable by me currently. More news on that when I’ve finished it)
So, now I have a database that is 4GiB in size and 58 tables in size, so what can I do with it? Well, the obvious answer would be to find a fare from A to B. Easy, you’d think. Wrong!
First of all, let me include a small glossary of railway fares terms for you:
- NLC – National Location Code. A four-character code, usually numeric but not always, which identifies either a particular station or other place that can issue tickets.
- RJIS – Rail Journey Information System. The big railway industry database, run by Fujitsu, that the ATOC-provided files are sourced from. Basically the same set of files are sent to the various ticket issuing systems in use, but obviously they are updated more than once every three months.
- TOC – Train Operating Company
- Flow – A journey between a given origin and destination is known as a “flow”. The fare for each flow is assigned to a particular TOC or other similar body permitted to set fares (such as a Passenger Transport Executive such as West Yorkshire Metro, GMPTE in Manchester, etc.) or other body such as Transport for London. It is important to note that not every journey between two points on the UK rail network is individually priced.
- CRS code – A CRS code is a three-character (usually three letters) code which identifies a station. These usually follow the station name, for example LDS is Leeds and BRI is Bristol Temple Meads.
- Station Cluster – A station cluster is a group of stations which belong in a group for the purposes of pricing up a flow.
So let’s attempt to find a fare.
Let’s pick two stations at random, a good distance apart, say Taunton to Leeds. The obvious choice is to look up those two stations in the database and see what it tells us. According to the flow_f table, I need to specify an origin and a destination. These are stored in the database as NLCs, so I need to convert Taunton and Leeds into NLCs. To do that I need to query the location_l table…
SELECT description,nlc_code FROM location_l WHERE description='LEEDS'
OR description='TAUNTON';
That didn’t work! Why not? Well, the answer is a bit subtle. Every record in the data files is fixed width, which means you (at the very least) have to strip spaces from it, and at worst get rid of superfluous dots at the end of it. Madness. Let’s try again…
SELECT description,nlc_code FROM location_l WHERE description LIKE 'LEEDS%'
OR description LIKE 'TAUNTON%';
That’s better. But wait a minute, I’ve got multiple records? What on earth … ?
"TAUNTON. ";"3471"
"TAUNTON STN FRCT";"7434"
"LEEDS. ";"8487"
"LEEDS + BUS ";"H975"
"TAUNTON+BUS ";"J945"
"LEEDS BUS ";"K202"
"LEEDS BRD AIRBUS";"K650"
"LEEDS FEST BUS ";"K684"
OK, this isn’t ideal, but it does at least provide the information we wanted – TAUNTON. (with the dot) is NLC 3471, and LEEDS. (also with the dot) is NLC 8487. So now we can query the flow_f table with the right information. So, let’s have a go …
SELECT * FROM flow_f, flow_t WHERE flow_f.flow_id=flow_t.flow_id AND
origin_code='3471' AND destination_code='8487';
Nothing! Why ever not? The answer to this a little less than simple. As I mentioned above, a flow is not the same as a journey. In other words, not all journeys have an individually priced flow. In the old days of the paper fares manuals, this was generally dealt with in the book by having a ‘related station’ – in other words, if you couldn’t find the fare explicitly listed from A to B in the listing for station A, you looked it up from a related station (which would be nearby). All that is gone out of the window and replaced with something massively more complex – the idea of the Station Cluster. Hold on tight … !
What we need to do next is to find out if NLC 3471 (Taunton) and NLC 8487 are in any station clusters. If they are, we then need to repeat the above query replacing the station NLCs with the cluster NLCs. So let’s have a grapple around the station_clusters table to see if we can find what we want:
SELECT cluster_nlc,cluster_id FROM station_clusters WHERE cluster_nlc='3471'
OR cluster_nlc='8487' ORDER BY cluster_nlc;
So let’s see what that query gave us:
"3471";"Q824"
"3471";"Q942"
"3471";"S327"
"3471";"Q903"
"3471";"R070"
"3471";"R409"
"3471";"Q391"
"3471";"Q909"
"3471";"T104"
"3471";"Q814"
"3471";"Q715"
"3471";"T204"
"3471";"R708"
"3471";"T327"
"3471";"T406"
"3471";"T511"
"8487";"Q864"
"8487";"Q865"
"8487";"Q903"
"8487";"Q931"
"8487";"Q937"
"8487";"Q938"
"8487";"Q964"
"8487";"R629"
"8487";"R639"
"8487";"R640"
"8487";"R705"
"8487";"S031"
"8487";"S205"
"8487";"T007"
"8487";"T018"
"8487";"T020"
"8487";"T027"
"8487";"T029"
"8487";"GLA1"
"8487";"T143"
"8487";"T229"
"8487";"T129"
"8487";"Q475"
"8487";"Q614"
"8487";"Q661"
"8487";"Q696"
"8487";"Q716"
"8487";"Q814"
"8487";"Q816"
"8487";"Q819"
"8487";"Q822"
"8487";"Q824"
"8487";"Q844"
"8487";"Q845"
"8487";"Q846"
"8487";"Q847"
"8487";"Q850"
Erk! What is all this lot? Well, the first column is the NLC of the station we’re looking for, and the second column is a list of all the station clusters that that NLC is in. And you’ll notice there’s loads of them…
Thankfully there’s a clever bit of SQL that can do this task for me. Try this …
SELECT <some fields> FROM flow_f, flow_t WHERE flow_f.flow_id=flow_t.flow_id AND
flow_f.origin_code IN (SELECT cluster_id FROM station_clusters
WHERE cluster_nlc='3471') AND flow_f.destination_code IN
(SELECT cluster_id FROM station_clusters WHERE cluster_nlc='8487');
To be continued…