Release 0.4 Part 1 and 2!

I was a little busy last week and ended up not blogging about which project’s issues I planned to tackle for our 0.4 release, oops. So this week I’m rolling the first and second blog posts into one. I decided to work on Telescope, as I already had the environment set up on my system and had some familiarity with the backend. I looked through the issues with the backend label and found one about removing whitespace-only blocks found in some posts. I decided to take it, as it was a step up from the blank post fix I made in release 0.3, and I figured it was another issue I could fix with the power of regex.

I really like regex for some reason. Working with it is sort of like solving a puzzle: you try to find the right combination of characters to match exactly what you want in a piece of text. In this case, the issue lies with empty <p> tags and repetitive <br> tags. Before I started writing the regex, I looked through a number of posts to see the cases I might have to deal with.

When looking for empty <p> tags, I considered a tag empty if it contains only whitespace (spaces, tabs and newlines) and no text, or if it is closed immediately with </p>. I noticed some posts also had <p> tags containing nothing but <br> tags, so I added that to the list of cases, which should cover the different types of empty we can expect from RSS feeds. The other case was <br> tags used back to back, which is more straightforward than finding empty <p> tags.

Now, equipped with knowledge of the cases I could run into, I created the regex to find empty <p> tags and back-to-back <br> tags. The first regex I created finds repetitive <br> tags and looks like this: (<br>(\s?)+?){2,}. I wrote a little snippet of text to test the regex and plugged it into regex101. The regex ended up highlighting all repetitive <br> tags like so:

(<br>(\s?)+?){2,}

Basically, the regex looks for a literal <br> followed by any whitespace, if there is some, with the {2,} telling it to look for 2 or more instances of that pair in a row.
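To make the behaviour above concrete, here is a minimal sketch that runs the same pattern outside regex101. It is written in Python purely for illustration; the helper name find_repeated_br and the sample string are hypothetical and are not the snippet used in the post.

```python
import re

# The post's first pattern: a literal <br>, optional whitespace after it,
# and the whole unit repeated two or more times.
REPEATED_BR = re.compile(r"(<br>(\s?)+?){2,}")


def find_repeated_br(html):
    """Return each run of back-to-back <br> tags found in html."""
    return [m.group(0) for m in REPEATED_BR.finditer(html)]


# Hypothetical sample, not the snippet used in the post.
sample = "first line<br>second line<br><br>\n<br>third line"

# The lone <br> after "first line" is skipped; the run of three is one match.
print(find_repeated_br(sample))
```

One edge case that may be worth a test: once two tags have matched, the lazy whitespace quantifier stops after a single character, so in <br><br>  <br> (two spaces before the third tag) the match ends after the first space and the third <br> is left out. A greedy variant such as (?:<br>\s*){2,} would take the whole run, though it would also swallow any whitespace after the last tag. That variant is a suggestion, not something from the post.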

With that issue solved, I moved on to creating the regex for finding empty <p> tags. I created the <br> regex first because I could re-use it here: as I mentioned previously, a <p> tag containing <br> tags and nothing else is also empty. This regex is a bit nastier: <p>(\s+)?((<br>(\s?)+?){1,})?(\s+)?<\/p>. Again, I wrote a little snippet for testing, and these were the results:

<p>(\s+)?((<br>(\s?)+?){1,})?(\s+)?<\/p>

This regex looks for any text that starts with <p> and ends with </p> and contains nothing in between except whitespace of any length and/or <br> tags, using the same regex from above except looking for 1 or more instances.
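The cases listed earlier can be checked against this pattern one by one. The sketch below does that in Python for illustration; is_empty_paragraph and both case lists are hypothetical names and data, not taken from the post.

```python
import re

# The post's second pattern: <p>, optional whitespace, optional <br> tags
# (each with optional whitespace), optional whitespace, then </p>.
EMPTY_P = re.compile(r"<p>(\s+)?((<br>(\s?)+?){1,})?(\s+)?<\/p>")


def is_empty_paragraph(fragment):
    """True when the whole fragment is an empty <p> element."""
    return EMPTY_P.fullmatch(fragment) is not None


# Hypothetical cases mirroring the kinds of "empty" described in the post.
empty_cases = ["<p></p>", "<p> \t\n</p>", "<p><br></p>", "<p> <br> <br> </p>"]
non_empty_cases = ["<p>text</p>", "<p><br>text</p>"]

for case in empty_cases + non_empty_cases:
    print(repr(case), is_empty_paragraph(case))
```

Note that the pattern only recognises the bare forms <p> and <br>, so a paragraph with attributes or a self-closing <br/> would not count as empty here.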

All I need to do now is add this to the post-processing part of the backend: replace repetitive <br> tags with a single one using the first regex, and completely erase empty <p> tags with the second. Of course, the text snippets I created will be used to write tests for the two functions I plan on implementing to remove these whitespace elements.
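As a rough sketch of what those two functions might look like, here is a Python version for illustration only. The post does not show the eventual implementation, so the function names, the clean_post wrapper and the sample input are all assumptions.

```python
import re

REPEATED_BR = re.compile(r"(<br>(\s?)+?){2,}")
EMPTY_P = re.compile(r"<p>(\s+)?((<br>(\s?)+?){1,})?(\s+)?<\/p>")


def collapse_repeated_br(html):
    """Replace each run of back-to-back <br> tags with a single <br>."""
    return REPEATED_BR.sub("<br>", html)


def remove_empty_paragraphs(html):
    """Delete <p> elements that hold only whitespace and/or <br> tags."""
    return EMPTY_P.sub("", html)


def clean_post(html):
    """Hypothetical post-processing step that applies both removals."""
    return collapse_repeated_br(remove_empty_paragraphs(html))


# Hypothetical input that could double as a test fixture.
raw = "<p>Hello</p><p> </p><p><br>\n<br></p><p>World<br><br><br>!</p>"
print(clean_post(raw))  # <p>Hello</p><p>World<br>!</p>
```

Keeping the two removals as separate functions means each one can be tested against its own snippet, which matches the plan of reusing the regex101 test text as test cases.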

Thanks for reading!
