To build the PF2e Encounter Builder I needed to collect data for the database. I wrote a web scraper in Rust that uses the scraper crate to collect the data from the list of monsters on pf2srd.com. This involved three separate but related tasks: parsing the table to get the URLs of the individual monsters, scraping the data for each monster, and then processing the resulting vector of traits so that the traits were stored correctly in the database.
This project required extra attention to testing and validation so that as few data inconsistencies as possible were entered into the database. I was working with unsanitized data directly from the web, and that data had been entered in a very inconsistent manner. It appears that CSS classes were applied by hand and inconsistently, which made it harder to target the right elements reliably. I approached validation in two steps.
At runtime the data passes through a validation function that ensures it conforms to the data types of the database. I initially used if statements to achieve this:
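    pub fn validate(&self) -> bool {
        // name: required, at most 100 bytes
        if self.name.len() > 100 || self.name.is_empty() {
            return false;
        }
        // level: no monster is above level 25
        if self.level > 25 {
            return false;
        }
        // alignment: required, at most 10 bytes
        if self.alignment.len() > 10 || self.alignment.is_empty() {
            return false;
        }
        // monster_type: required, at most 100 bytes
        if self.monster_type.len() > 100 || self.monster_type.is_empty() {
            return false;
        }
        // size: required, at most 20 bytes
        if self.size.len() > 20 || self.size.is_empty() {
            return false;
        }
        // traits: each one required, at most 50 bytes
        for t in &self.traits {
            if t.len() > 50 || t.is_empty() {
                return false;
            }
        }
        true
    }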
The function checks each member of the struct and ensures that it conforms to the data types. As this was my first significant Rust project, I wasn't aware that this is not idiomatic Rust, so when I later returned to the project I rewrote the function to use match:
    for t in &self.traits {
        match t {
            t if t.len() > 50 => return false,
            t if t.is_empty() => return false,
            _ => (),
        }
    }
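One limitation of returning a bare bool is that a rejected monster gives no hint about which check failed. The following is a minimal sketch of an alternative, not the project's actual code, assuming the same limits as the checks above. The ValidationError enum and the check_len helper are hypothetical names, the struct is trimmed to the validated fields, and the i32 type for level is an assumption.

```rust
/// Trimmed-down copy of the struct, keeping only the validated fields.
/// The `i32` type for `level` is an assumption.
#[derive(Debug, Clone)]
pub struct Monster {
    pub name: String,
    pub level: i32,
    pub alignment: String,
    pub monster_type: String,
    pub size: String,
    pub traits: Vec<String>,
}

/// Hypothetical error type naming the check that failed.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    /// A text field was empty or longer than `max` bytes.
    BadLength { field: &'static str, max: usize },
    /// The level was above the maximum of 25.
    LevelTooHigh(i32),
}

/// Hypothetical helper: rejects empty strings and strings longer than `max` bytes.
fn check_len(field: &'static str, value: &str, max: usize) -> Result<(), ValidationError> {
    if value.is_empty() || value.len() > max {
        Err(ValidationError::BadLength { field, max })
    } else {
        Ok(())
    }
}

impl Monster {
    /// Same checks as the bool version, but the first failure is reported.
    pub fn validate(&self) -> Result<(), ValidationError> {
        check_len("name", &self.name, 100)?;
        if self.level > 25 {
            return Err(ValidationError::LevelTooHigh(self.level));
        }
        check_len("alignment", &self.alignment, 10)?;
        check_len("monster_type", &self.monster_type, 100)?;
        check_len("size", &self.size, 20)?;
        for t in &self.traits {
            check_len("trait", t, 50)?;
        }
        Ok(())
    }
}
```

With this shape the scraper could log the returned error next to the monster's URL, which makes inconsistent source pages easier to track down. Callers that only need a yes or no answer can still use validate().is_ok().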
In order to ensure that the validation functions were working correctly, I created tests designed to make each of the matches return false. For example:
    // Test: a size longer than 20 bytes is rejected.
    let size = String::from("1234567890123456789012312");
    let traits = vec![
        String::from("Fast"),
        String::from("Slow"),
        String::from("speedy"),
    ];
    let monster = Monster {
        url: String::from("www.foo.com"),
        name: String::from("ghost"),
        level: 19,
        // Every other field is valid, so only the size check can fail.
        alignment: String::from("CE"),
        monster_type: String::from("Undead"),
        size,
        traits,
        is_caster: false,
        is_ranged: false,
        is_aquatic: false,
    };
    assert!(!monster.validate());
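Because every test has to spell out a full Monster, it is easy for an unrelated field to be invalid by accident, in which case the test passes for the wrong reason. One way to keep each test focused is to build a single known-good value and override one field with struct update syntax. This is a sketch under assumptions: the struct uses the fields shown in the test above, the i32 type for level is a guess, and valid_monster and the compact validate are hypothetical stand-ins for the real code.

```rust
#[derive(Debug, Clone)]
pub struct Monster {
    pub url: String,
    pub name: String,
    pub level: i32, // assumed type
    pub alignment: String,
    pub monster_type: String,
    pub size: String,
    pub traits: Vec<String>,
    pub is_caster: bool,
    pub is_ranged: bool,
    pub is_aquatic: bool,
}

impl Monster {
    /// Compact stand-in for the real validation, using the same limits.
    pub fn validate(&self) -> bool {
        let text_fields = [
            (&self.name, 100),
            (&self.alignment, 10),
            (&self.monster_type, 100),
            (&self.size, 20),
        ];
        self.level <= 25
            && text_fields.iter().all(|(s, max)| !s.is_empty() && s.len() <= *max)
            && self.traits.iter().all(|t| !t.is_empty() && t.len() <= 50)
    }
}

/// Hypothetical helper: a monster that passes every check.
pub fn valid_monster() -> Monster {
    Monster {
        url: String::from("www.foo.com"),
        name: String::from("ghost"),
        level: 19,
        alignment: String::from("CE"),
        monster_type: String::from("Undead"),
        size: String::from("Medium"),
        traits: vec![String::from("Fast"), String::from("Slow")],
        is_caster: false,
        is_ranged: false,
        is_aquatic: false,
    }
}

#[cfg(test)]
mod size_tests {
    use super::*;

    #[test]
    fn baseline_is_valid() {
        // Guards against the helper itself drifting out of date.
        assert!(valid_monster().validate());
    }

    #[test]
    fn rejects_size_over_20_bytes() {
        // Only `size` differs from the known-good monster.
        let monster = Monster {
            size: "x".repeat(21),
            ..valid_monster()
        };
        assert!(!monster.validate());
    }
}
```

The baseline test matters as much as the negative ones: if valid_monster ever stops validating, every "rejects" test would pass without exercising its own check.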
After scraping the traits from the web, I first ensured that the trait strings were properly trimmed:
    let mut clean_traits: Vec<String> = Vec::new();
    for t in traits {
        // If the trait is wrapped in an anchor tag, keep only its inner HTML.
        let anchor = Html::parse_fragment(&t)
            .select(&anchor_selector)
            .map(|x| x.inner_html())
            .next();
        if let Some(s) = anchor {
            clean_traits.push(s.trim().to_string());
        } else {
            clean_traits.push(t.trim().to_string());
        }
    }
In some cases the traits contained anchor tags, while in others they didn't. Where a string contained an anchor tag, I removed the tag by retrieving its inner HTML. As this lookup returned an Option<String>, I used an if let to check whether the anchor was Some or None and then trimmed the resulting string.
Next I checked whether I had successfully retrieved data for each field in the struct. In some cases the size trait was contained in a <span> with a class of size to color it correctly, and in others it was in a <span> with no class. To handle both cases I first checked whether the scraper had returned Some(String) and, if so, extracted the string from the Option<String>. In the else branch of the if let statement I looped over the traits and used a match to see whether any of them was a valid size.
    let size: String = if let Some(s) = size_base {
        // The scraper found the size through the class selector.
        s
    } else {
        // Otherwise pull the size out of the trait list and keep the rest.
        let mut size_string = String::from("NO SIZE");
        let mut new_traits: Vec<String> = Vec::new();
        for t in traits {
            match t.as_str() {
                "Tiny" | "Small" | "Medium" | "Large" | "Huge" | "Gargantuan" => size_string = t,
                _ => new_traits.push(t),
            }
        }
        traits = new_traits;
        size_string
    };
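The fallback branch can also be written as a single pass with Iterator::partition, wrapped in a function so it can be tested without running the scraper. This is a sketch rather than the project's code: split_size and SIZES are hypothetical names, and it assumes, as the loop above does, that the last size-like trait wins when more than one is present.

```rust
/// Size names recognised in the trait list (taken from the match arms above).
const SIZES: [&str; 6] = ["Tiny", "Small", "Medium", "Large", "Huge", "Gargantuan"];

/// Hypothetical helper: returns the monster's size and the remaining traits.
///
/// If the scraper already found a size, the traits are returned untouched.
/// Otherwise every size-like trait is removed from the list and the last one
/// is used, falling back to "NO SIZE" when none is present.
pub fn split_size(size_base: Option<String>, traits: Vec<String>) -> (String, Vec<String>) {
    if let Some(size) = size_base {
        return (size, traits);
    }
    // Split the traits into size names and everything else, preserving order.
    let (sizes, rest): (Vec<String>, Vec<String>) = traits
        .into_iter()
        .partition(|t| SIZES.contains(&t.as_str()));
    let size = sizes
        .into_iter()
        .last()
        .unwrap_or_else(|| String::from("NO SIZE"));
    (size, rest)
}
```

Returning the remaining traits instead of reassigning a mutable variable keeps the function free of side effects. The "NO SIZE" sentinel is kept to match the original behavior.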