Feature metrics in software development

Introduction

Among the most popular sets of software metrics are the DORA metrics. Here is what Google has to say about them:

https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance

These offer valuable insights into DevOps performance, focusing on four key indicators:

  • Deployment Frequency—How often an organization successfully releases to production
  • Lead Time for Changes—The amount of time it takes a commit to get into production
  • Change Failure Rate—The percentage of deployments causing a failure in production
  • Time to Restore Service—How long it takes an organization to recover from a failure in production
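Two of these metrics can be sketched directly from deployment records. A minimal illustration in Python, where the record format is my own assumption and not part of DORA:

```python
from datetime import datetime, timedelta

# Toy deployment records; the fields are an assumed format.
deployments = [
    {"committed": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 1, 12), "failed": False},
    {"committed": datetime(2024, 1, 2, 9), "deployed": datetime(2024, 1, 3, 9), "failed": True},
]

# Lead Time for Changes: commit-to-production time, averaged.
lead_times = [d["deployed"] - d["committed"] for d in deployments]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change Failure Rate: share of deployments causing a failure.
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
```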

Achieving excellence in these metrics can earn organizations prestigious titles like ‘Elite.’ However, while continuous deployment offers benefits such as rapid feedback cycles and quick deployment of fixes, it also presents challenges:

  • (+) Quick feedback cycle: You can deploy new fixes/functionality in a matter of minutes
  • (-) No explicit QA cycle. No room for explorative testing before releasing to production
  • (-) Continuous deployment is not free: it’s complicated and needs babysitting
  • (+/-) Strong emphasis on good tests to guarantee deployments are successful

While these metrics advocate for the adoption of continuous deployment, it’s essential to recognize that customer priorities may differ. Pursuing the ‘Elite’ status might overshadow more pressing concerns. In many cases, issues stem from a sluggish or inadequate product feature pipeline, highlighting the importance of balancing deployment efficiency with product development.

Feature metrics

Changing acceptance criteria

While embracing change is a core principle of agile methodologies, it’s essential to strike a balance. Constantly shifting acceptance criteria even before customer engagement can lead to inefficiency and wasted developer resources. Agile should not be a shield for indecisiveness. It’s crucial to thoroughly discuss and refine features before committing to development. Conducting preliminary discussions, creating simple mockups, and engaging with customers can prevent costly iterations and rework.

Remember, the majority of users prefer stability and consistency in their applications. Overzealous changes can disrupt user experience and lead to frustration. It’s important to recognize that most users are not early adopters; they value familiarity and reliability in their software.

Minimize stories with unclear acceptance criteria

Stories should not enter the backlog unless they are well-defined. Remember, zero clearly defined stories in the backlog will likely result in zero stories released by developers. It’s crucial to do your homework upfront.

Many developers tend to be introverted, and while asking questions may seem bothersome, it’s essential that developers clarify any uncertainties rather than leaving them for the users to discover. Developers excel at identifying potential issues in features. Embrace feedback and encourage questions. Measure success by tracking the number of stories that reach developers but aren’t yet ready for development due to unclear requirements.

I’ve worked with quite a few companies that had very unstable acceptance criteria. Both the developers and the product managers were frustrated. A runway of zero user stories ready to be worked on is a recipe for a “bore-out”.

Avoid shielding your developers from understanding the domain

In many companies, this aspect is often overlooked. Developers must grasp the context and intricacies of the domain they’re working in. Without this understanding, they risk becoming mere executors of tasks devoid of deeper comprehension.

From personal experience as an engineer, I’ve found immense value in acquiring domain knowledge. However, it’s not advisable for developers to immerse themselves entirely alongside customers to become experts. Striking a balance is key.

Some companies share nothing of the domain at all, reducing the work to a boring sequence of “add a button here, add a field there”. Don’t stay there.

Uncharted acceptance criteria

Often nobody knows exactly what the old/legacy system was supposed to do. Better not to touch that code. It is the same terror as not having any tests: am I breaking something? Nobody knows. Is it even still used?

If it’s unclear whether the code is still in use, a pragmatic approach is necessary. If it’s obsolete, it’s best to remove it entirely. However, if it’s still in use, documenting its functionality and establishing acceptance tests become imperative. This process parallels the concept of technical debt but from the perspective of the product owner.

A useful metric to gauge this scenario is to map out the number of features lacking clear user criteria. This provides insight into the extent of technical debt from the product owner’s standpoint.

Features (in %) that are actually used by >50% of the users

Unused features are not just dormant—they come with a price tag. Every refactor that encompasses redundant or unused features adds to your expenses. Moreover, each additional feature contributes to the cognitive load for both users and developers alike.

Consider the cost implications: if only a minuscule percentage, say 1%, of your customer base actually utilizes a particular feature, it’s worth questioning its necessity. Assess whether the benefit it provides justifies the resources expended on its maintenance and inclusion in your product.
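As a sketch, this metric can be computed from raw usage events. The event format here is an assumption:

```python
from collections import defaultdict

# Hypothetical (user, feature) usage events.
events = [("alice", "export"), ("bob", "export"), ("alice", "dark_mode")]
users = {u for u, _ in events}

# Which distinct users touched each feature?
feature_users = defaultdict(set)
for user, feature in events:
    feature_users[feature].add(user)

# Features used by more than 50% of the user base.
widely_used = [f for f, us in feature_users.items() if len(us) / len(users) > 0.5]
```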

Addendum


Here are several technical metrics that I find particularly valuable but are often overlooked elsewhere:

Technical metrics

  • Refactoring time as a percentage of total development time. This is a metric to keep in equilibrium.
    • Refactoring is essential for maintaining code health, but like any good practice, it can be overdone. Striking the right balance is key: excessively high percentages, say 99%, indicate that feature delivery is severely impeded. Conversely, if refactoring comprises 0% of development time, it suggests that technical debt is accumulating unchecked. Achieving a sustainable balance ensures steady progress while safeguarding the quality and maintainability of the codebase.
  • Regression, where an acceptance criterion unexpectedly fails after making it to production, should be considered an anomaly.
    • In an ideal scenario, such occurrences are prevented through a robust testing framework. Either automated acceptance tests are absent, leaving room for oversight, or the acceptance criterion lacks explicit definition, leading to ambiguity.
    • Identifying and rectifying such regressions is imperative to uphold product integrity and customer satisfaction. It underscores the importance of stringent testing practices and clearly defined acceptance criteria throughout the development lifecycle.
  • Time spent babysitting existing infrastructure/tests
    • Flaky tests waste time, and leniency towards failing tests may let regressions slip through
  • Cost of the infrastructure
    • For instance, introducing complex systems like Kafka when a simple REST endpoint would suffice represents a substantial investment. This decision not only entails the initial setup and maintenance expenses but also potential overhead costs associated with managing the added complexity.
    • Careful consideration of the trade-offs between complexity and functionality is essential to optimize infrastructure costs and ensure resources are allocated efficiently. Choosing the most appropriate solution for the specific requirements can help mitigate unnecessary expenditures in the long run.

Debugging sessions: Core Issue Proximity

This metric gauges how much code a programmer needs to explore to find the troublesome segment, starting from the initial error trace.

To illustrate this concept, let’s look at a simple example.

Imagine an Invoice with an empty Customer field, resulting in a null value.

When this invoice is processed, a null pointer issue arises in the process function.

Two scenarios emerge:

  1. Optional Invoice Customer Field: It’s acceptable for the Invoice’s “customer” attribute to be empty. For example, an invoice in progress might not yet have a specified Customer. The error trace precisely points to the problem. Since the process function should handle empty customer entries, the solution lies within that function. Here, the distance to the core problem is short. A short distance provides quick feedback for anomalies.
  2. Mandatory Non-empty Invoice Customer Field: Now consider a situation where the customer field unexpectedly remains empty. The solution isn’t within the process function. By the time the invoice reaches this function, it’s already compromised. The issue might originate from incorrect invoice storage due to a faulty repository load function. Thus, scrutinizing this code is necessary. The problem might extend further – investigating the database could reveal incorrect storage. Importantly, a different microservice manages Invoice object storage. Analyzing this microservice is crucial; any attempt to store a corrupted invoice should fail early. In this context, the distance to the core issue is significant. Navigating through multiple classes, possibly crossing application boundaries, is required to identify the root cause.

Lesson Learned

Reducing the distance to the core issue is vital. This can be achieved by minimizing the amount of code exploration. Excessive movement across files or applications burdens developers. Quick failure or state validation can save substantial time and effort.
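To make the lesson concrete, here is a minimal sketch, using the Invoice/Customer names from the example above, of validating state at construction time so that failures surface early:

```python
class Invoice:
    def __init__(self, customer):
        if customer is None:
            # Fail early: a corrupted invoice never reaches process().
            raise ValueError("Invoice requires a customer")
        self.customer = customer

def process(invoice):
    # No defensive null check needed here; the invariant already holds,
    # so any error trace from this point on stays close to its cause.
    return f"processed invoice for {invoice.customer}"
```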

Handling factorials using prime factor vectors

Introduction

Working with big factorials can be a challenge. For instance, to calculate the number of combinations one uses the following formula:

nCr = n!/(r!(n-r)!)

A naïve approach costs roughly (n − 1) + (r − 1) + (n − r − 1) ≈ 2n multiplications and one division.

The problem is that it’s easy to cause overflow of traditional built-in number types. 

13! already overflows a 32 bit number.

21! overflows a 64 bit number.

There exist libraries that circumvent these limitations, like BigInteger in Java. But once the numbers get big, they become inefficient to compute with.

Another solution is that of Mark Dominus (https://blog.plover.com/math/choose.html), who adopted an interesting approach from the Līlāvatī (https://en.wikipedia.org/wiki/L%C4%ABl%C4%81vat%C4%AB).

Other than the two solutions above, I did not immediately find another approach that lends the flexibility I was searching for.

A solution based on prime factor vectors

This post however describes another approach.  

It is based on the fundamental theorem of arithmetic: Every integer greater than 1 is either a prime number or can be written as a product of its prime factors.

Once we have the arrays of prime factors, calculating the result is quite simple. Letting brackets denote the vector of prime-factor exponents, we can rewrite the above formula as follows:

[nCr] = [n!] - ([r!] + [(n-r)!])

Where + and - are simple element-wise vector addition/subtraction.

The challenge is how to efficiently convert n! into a vector of prime factors.

A simple approach calculates the prime factors of n! by factoring each component of the product:

[n!] = [n] + [n – 1] + … + [2]

This is costly, given it needs to factor every component of n! separately.

The following approach however is more efficient: 

calculate_for_factorial(n):
   primes = all primes <= n (sieve of Eratosthenes)
   N = new int[primes.length]
   for the i-th prime p in primes:
      N[i] = exponent of p in n!

To find the exponent for one prime p we count how many multiples of p, p², p³, … fit in n (Legendre’s formula): the exponent is floor(n/p) + floor(n/p²) + ….

This calculates the vector in roughly log(n) · n / ln(n) steps, since it needs log_p(n) iterations per prime, and there are about n / ln(n) primes smaller than n.

The upside is that it avoids overflows under the following limits:

For 32 bit numbers the maximum is 2,147,483,647!.

For 64 bit numbers this is 9,223,372,036,854,775,807!.

Here is a Java implementation:

public static List<Integer> calculate_for_factorial(int n) {
    List<Integer> allPrimes = Primes.getAllPrimes(n); // sieve of Eratosthenes
    List<Integer> N = new ArrayList<>(allPrimes.size());
    for (int i = 0; i < allPrimes.size(); i++) {
        int prime = allPrimes.get(i);
        // Legendre's formula: floor(n/p) + floor(n/p^2) + ...
        int c = 0;
        int r = prime;
        while (r <= n) {
            c += n / r;
            r *= prime;
        }
        N.add(c);
    }
    return N;
}
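For comparison, here is a self-contained sketch in Python of the full pipeline, from the sieve to the final nCr via vector subtraction. The helper names are mine; it illustrates the approach rather than being a drop-in implementation:

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def factorial_prime_vector(n, primes):
    """Exponent of each prime in n!, via Legendre's formula."""
    vec = []
    for p in primes:
        count, power = 0, p
        while power <= n:
            count += n // power  # multiples of p, p^2, p^3, ...
            power *= p
        vec.append(count)
    return vec

def n_choose_r(n, r):
    primes = primes_up_to(n)
    # [nCr] = [n!] - ([r!] + [(n-r)!]) as exponent vectors
    nf = factorial_prime_vector(n, primes)
    rf = factorial_prime_vector(r, primes)
    nrf = factorial_prime_vector(n - r, primes)
    result = 1
    for p, e_n, e_r, e_nr in zip(primes, nf, rf, nrf):
        result *= p ** (e_n - e_r - e_nr)
    return result
```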

This approach can also be used for other combination problems like permutations with multisets or multinomial coefficient problems.

Code guidelines won’t fix your problem. Simple process design will.

I was going through the outstanding book “Code craft The practice of writing excellent code” by Pete Goodliffe. It has a lot of tips I wish I had when I started programming. Part of the book is about code guidelines which got me thinking.

Is ignoring code guidelines really what makes software hard to read or understand?

Definitions of complexity have been setup and thoroughly discussed. More on this here: https://en.wikipedia.org/wiki/Cyclomatic_complexity

I would like to propose another metric. To illustrate my point I need to introduce a new (as far as I know) kind of diagram.

Process diagrams

First let’s define what a process is. A process is anything that changes the state of an entity/object. An example would be Gaussian blurring an image.

process diagram for gaussian blur

The big arrow indicates a process: “Gaussian blur of an image” which will change some state: “the Bitmap pixel values”. State is displayed as Venn diagrams.

The gaussian blur parameters have an obvious impact on the blur process but are not changed by the blurring process.

Now we have a simple way of discussing processes and state.

An example where code guidelines never helped

Now let’s consider a web application in which we use a message queue to process videos. Here is a process diagram of how that works:

A simple video processing process

The process is simple.

  • A user uploads a video and adds some parameters on how to process the video.
  • The web application puts the uploaded file in a directory and puts a message on a message queue.
  • Later on another process picks up the message. This message is then removed from the message queue and the video is processed.
  • The finalized video gets dropped in a “processed video files” directory and the uploaded video is deleted.

This is simple enough. But let’s say Jimmy is looking for a raise. He wants to prove himself, and what better way to do this than to wow everyone with something complex? (Based on a real story.)

The idea is that we want to give priority to paying customers. Also we want the user to be able to cancel video processing.

Jimmy, the victim of my story on bad design, adjusts the messages in the queue directly. A “priority requeuing” process bumps free users’ messages towards the back of the queue every time a paying customer comes along.

While we’re adjusting the queue anyway, Jimmy also goes over the queue and removes messages that have been cancelled. This way the queue only contains actual work. Very clever!

If you’ve ever heard about race conditions you know trouble is ahead.

A better and simpler design, however, would be to have multiple queues. One for free users, one for paying customers. Please don’t adjust the queue directly. It’s a queue for a reason.
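A minimal sketch of that design, with hypothetical job records: paying customers get their own queue, and cancellations live in a separate set so nobody has to edit a queue in place:

```python
import queue

# Two queues instead of mutating one: paying customers first.
paid_jobs, free_jobs = queue.Queue(), queue.Queue()
cancelled = set()  # ids of jobs cancelled by users

def next_job():
    """Serve paid work first; skip cancelled jobs instead of
    removing them from the middle of a queue."""
    for q in (paid_jobs, free_jobs):
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                break
            if job["id"] not in cancelled:
                return job
    return None
```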

Many other designs are possible. But some are simpler or better than others.

Conclusion

Now, what was the point of all of that?

When you are doing a code review, are you really looking at the process? Or just at brackets and class names? I know I have been. Maybe it’s time to start visualizing the processes and having long debates about those instead.

My advice? Keep your processes clean and simple and you’ll spend less time babysitting chaos.

K-anonymization converting data by using k-means clustering

Introduction

Sometimes a company, researcher or government needs to publicly share data. In an age where the GDPR dictates strong data protection, people resort to anonymizing the data, since anonymous data falls outside the scope of the GDPR.

Yet making data anonymous isn’t that easy. Simply de-identifying (for example by following HIPAA guidelines) isn’t enough.

And so k-anonymity was born. The original problem is that quasi-identifiers like age or zip code can be used in a linkage attack.

Combining quasi-identifiers, which in their own right aren’t identifiers, can still identify people.

The solution is to adjust the data in such a manner that the quasi-identifiers of each record cannot be distinguished from those of at least k – 1 other records.

Converting data

K-anonymity is reached by converting the data, typically through generalization, suppression, or perturbation of the quasi-identifiers.

In this blog I’m explaining a new simple way of scrambling data.

n/k clustering

Assume you have n records you want to convert. To reach k-anonymity, you’ll need the records to be in equivalence classes with at least k elements each.

One way of doing this is by converting the data into mean values of floor(n/k) clusters. The idea is quite simple. So here is a proof of concept:

import numpy
from sklearn.cluster import KMeans

k = 2            # target k-anonymity
nrElements = 10

# Random integer ages in [18, 88) as example data
vals = numpy.floor((numpy.random.rand(nrElements, 1) * 70) + 18).astype(int)

kmeans = KMeans(n_clusters=nrElements // k).fit(vals)
clusters = kmeans.cluster_centers_
clusters.sort(axis=0)

# Replace each original value with its nearest cluster centroid
converted = [int(min(clusters, key=lambda c: abs(c - v))[0]) for v in vals]
  • First we generate some random age values as an example.
  • KMeans clustering calculates the centroids
  • Finally we convert the original values into the closest centroid value.

Example values:
[[35], [47], [18], [82], [37], [74], [38], [66], [23], [67]]
Clusters:
[[20.5], [36.666666666666664], [47.0], [66.5], [78.0]]
Converted example:
[36, 47, 20, 78, 36, 78, 36, 66, 20, 66]

The remarkable thing about the result is that it may not be k-anonymized. For instance, the value 47 is unique. This may well be a sign that the record is too different to be included. The suggestion would then be to suppress the value or skip the record altogether.

Conclusion

While n/k clustering doesn’t guarantee a k-anonymized solution it gives a good idea of which values may be too risky to be used.

Finding an optimal generalization may seem simple but is equivalent to the subset problem which is proven to be NP-hard.

What is your passion as a programmer? Passion and conflict in programming

Recently I asked myself: what is it that draws me to computers and programming? I’ve had a passion – obsession – for computers since I was 8. Almost 30 years later and I’m still at it.

When I first started I would be tinkering on a 386 PC, learning more as I went. I couldn’t read English yet, but that didn’t stop me from figuring out how things worked for hours on end. I carefully tweaked things together with a friend and it was the most fun I’ve had in my life.

This is compatible with the ‘hackers’ movement. It turns out I’m far from alone in having a passion for tinkering with computers and hardware. Only back then they called us geeks. 🙂

But are there other passions? After a brief search I figured there are plenty.

  • tinkering lust: Love tinkering with things.
  • uniqueness lust: Love to be different from others.
  • discovery lust: Love to discover new things.
  • wanderlust
  • beauty lust: Enjoy the mirror or enjoy anything aesthetically pleasing
  • higher number lust: Enjoy a steadily increasing number improving something or oneself.
  • hoarding lust: love collecting things.
  • secrecy lust: Love the thrill of doing things nobody should know about.
  • knowledge lust: love to figure things out, proving intelligence.
  • finish lust: Wants to finish tasks. Enjoying that good feeling of finishing something.
  • simplicity lust: Enjoying simplicity. Hate over-complicating things.
  • competition lust: Enjoy competition, being better, loving the struggle, loving the fight.
  • talk lust: Love talking to people.

Passions and their profiles

In my career as a software engineer I’ve met colleagues with different passions for programming. Some had no passion for IT at all and only did it to make a living. Those usually don’t stick around longer than a few years (unless the pay is really good) or, sadly, get depressed.

Mostly I’ve seen people tend to programming because of tinkering, finish, simplicity or competition lust.

These passions, though they complement each other, can cause conflicts. Here is a list of some conflicts I’ve seen in the workplace. With these passions in mind, most conflicts I’ve seen suddenly make sense.

Tinkering – early finish conflict

The tinkerer likes to take his or her time, but will work on it tirelessly. It’s insane. Longer nights? No problem. Anyone wants to call about this in the middle of the night? No problem. Finishing things on time? Might be an issue. The solution may not be the most elegant one, but it will work.

The early finisher needs to finish the task as fast as possible. Every slow down is reason to complain. It’s not uncommon for the early finisher to become annoyed.

The conflict between the early finisher and the tinkerer is self-evident. The tinkerer just wants to spend time on it; the finisher needs it yesterday.

Simplicity – tinkering – finisher conflict

After doing my masters in Informatics I didn’t fully comprehend why there was so much math involved. Some said it was to shape me as a person.

To do proper math you need to be accurate, and I found my need for simplicity growing, along with a preference for a certain “elegance”. Dijkstra had a very strong opinion on this:

“programmers should not be puzzle-minded …. We would be much better served by clean, systematic minds, with a sense of elegance.”
— Dijkstra

This striving for simplicity is commonly a problem for the tinkerer. It works, what’s the problem? To the person striving for simplicity, the code is just a mess. They’ll go on long rants, sometimes questioning the intelligence of the tinkerer. Clearly this will cause conflict.

But don’t think striving for elegance is simple. Reaching elegance needs time too. Sometimes more than the tinkerer would put into it.

This causes great conflict with the finisher. This elegance person is wasting time! The customer will never see the code. So what’s the point? Finish it already.

Competition conflict

There are competitive people out there too. To them everyone is the competition. The plus side is that they’ll learn from the best and may push the competence of all programmers. They learn quickly. Have they learnt everything they can? Then they’ll leave, in search of better competition.

Not everyone however is looking for competition. This can cause plenty of frustration. The tinkerer isn’t looking for someone to compete with, he/she just wants to tinker with their toy. The elegance worker wants to debate elegance, not silently compete around it.

These people usually do competitive programming on the side, read a lot and typically show sportsmanship. They cannot, however, stand people who are not passionate about their job. This will cause conflict with the “nine to five” workforce.

Conclusion

I found that understanding passions helps prevent conflict, and may also help create excellent combinations. What is your passion? Am I missing important passions? I’ll set up a survey to gather intel on which passions are most prevalent and post the results.

A new technique for testing: distribution testing

I’m not going to tell you to unit test. If you don’t know what unit testing is or why you’d do it, well, there are plenty of books out there.

The idea is simple. You write some tests for some specific inputs and check if the output is what you’d expect. You can also put in some special cases, usually after the code broke.

A unit testing example and its problem

Let’s take a seemingly trivial example. I want to test the toLowerCase function. What would you test?

Most of the time people would test a single input point like this:

assertEquals("a", toLowerCase("A"));

and perhaps someone got creative and tried this:

assertEquals("a", toLowerCase("a"));

The problem is that you stab at the problem one input at a time. If there is a sea of possible input what should you check? Write a test for each input/output pair?

I’d like to present a different way of testing that I believe is more effective in finding bugs.

Distribution testing

Instead of testing one point we test a whole range of input values. In our example we’ll test each possible character. If you’re like me you probably have the character class [a-zA-Z] in mind. But it’s so much broader than that.

As input let us take a flat distribution of each element over the range [0..65535]. But let’s keep it simple for now: [0..255]. As a histogram this renders as a boring flat line of 1 element at each point. I’ve not put in a screenshot, as I assume you’ve seen a flat line before. The output, however, is slightly more exciting.

The histogram output of toLowerCase

The X axis is the character ordinal value. The Y axis is the amount of times we’ve seen this character.

We find that there are 3 cases in this histogram. The case in which the character isn’t converted. This gives us a value of 1. There are evidently also the characters that are converted into another character like A => a. That’s what’s happening in the first depression. It is countered by a small lift of all characters from [a-z].

Around 200 we can see a small range of accented characters like “À”, which converts into “à”. But there is one spike out there that is interesting. What’s going on at 215? Maybe it’s something worth examining? The character we find there is ×, and the depression that follows is ÷. Funny how they are each other’s opposite. It makes sense.

But what else can this technique deliver, except finding some weird outliers? We can also check that the sum over all characters equals 256 (which it does).
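As a sketch of the idea, here is how the same experiment could look in Python, with str.lower standing in for the toLowerCase under test:

```python
from collections import Counter

# Distribution testing: run the function over a whole input range
# and inspect the output histogram instead of single points.
inputs = [chr(i) for i in range(256)]
histogram = Counter(c.lower() for c in inputs)

# Invariant: every input maps to exactly one output character,
# so the total mass must still be 256.
assert sum(histogram.values()) == 256

# Outliers: output characters that more than one input collapses
# into, e.g. 'a' receives both 'a' and 'A'.
spikes = {c: n for c, n in histogram.items() if n > 1}
```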

Another self-evident case: sql injection

Had everyone used this technique, or any modern database access technique, SQL injection would never have been a thing. Why not?

If they had tested all characters, they’d quickly have seen their program crash on ‘ and backslash characters. This should give you a quick hint that you’re doing something wrong. Sadly, most people only test with the evident examples. Hopefully this technique can help find such problems in the future.

Conclusion

I believe that looking into the output distribution of your code can show you surprising things and surface bugs you’ve never thought of. Sure, it takes some time from your end as a programmer, and it becomes more complex when you have multiple inputs. Visualizing also helps you think deeply about your code, its corner cases, and how distributions come together.

Is this merely intellectual exercise or is it truly a useful tool? Let me know.

The dangers of the solitary developer

For about 15 years straight I’ve been developing software in a team. Most teams did code reviews. A team of software developers that would look at your code and ask annoying questions like: 

  • Why did you do this?
  • Why not do this instead?
  • Can’t you rename this class?
  • I have no idea what your code is doing.

A good code review makes you feel uneasy. That doesn’t mean you have to agree with every comment that comes your way. A good code review should be about how you do things: the design choices you’ve made. This way more than one person knows how your code works, so if you’re on holiday someone can cover for you.

It’s not uncommon for people to get aggressive over their territory. It may create conflict. In some people’s eyes it may even “slow down” development. But it starts a healthy dialogue about design.

Not having someone review your choices, however, is far worse. I’ve seen it far too often, and it’s not limited to programming. I’ve seen isolated managers, software architects, mathematicians … well, any role really, fall into this trap.

It happens most commonly when they are the only developer in a team or even in the company. And it’s a slippery slope. The daily grind wears down best practices. It seems far more practical not to have unit tests, decent architecture, … You secretly know that this is wrong. Before you know it you’re dropping model-view-controller principles. Hell, why not put all code in one class?

Suddenly you are continuously firefighting defects, you’re no longer creating new exciting features. How did this happen? Usually it’s bias and nearsightedness and code that is far from perfect. 

How can you prevent this?

Well, the obvious solution is to do code reviews. Have someone review your work. Have healthy, albeit uneasy, discussions about your decisions.

Don’t have someone to go over your code? Make a list of best practices. Stick to them.

Take the time to make your code right. Once you have nothing to change, nothing to improve? Then it’s done. Not merely because it gives the impression of working.

It will help you sleep at night. Did you do the right thing? Well at least you did everything you could. 

Using stylometry and a neural network I can finally prove I didn’t write Genesis or did I?

It was a slow evening. I was watching ‘Elementary’, a modern take on ‘Sherlock Holmes’. Holmes smirked as he convincingly caught the criminal by applying the ever-so-mysterious stylometry. By cleverly combining the three sentences the criminal had written, he could easily reveal who the suspect in the chatroom was. Isn’t Sherlock a genius?

After some careful examination I found out that stylometry actually is a thing, and has been since Principes de stylométrie (1890). So I wanted to give it a stab and wrote a program to identify whether I wrote a piece or not, much like Sherlock. Here is my feeble attempt.

Classic method: distance between function words

First I tried a somewhat simple classic method: I sought a writer invariant. Wikipedia states that “An example of a writer invariant is frequency of function words used by the writer.”

As examples I used my thesis and a long report I had written for work, and compared them to the Bible, more specifically Genesis. Out of these I extracted the relevant ‘function words’ and their frequencies relative to the total number of words.

Combine these and you have a few nice vectors. I figured that by calculating the Euclidean distance between these vectors I could see how different one text is from another. And voilà, the first proof that I didn’t write Genesis was in.

  • Bible vs. mails: 814.72
  • Bible vs. thesis: 906.4926
  • Mails vs. thesis: 288.67

But what is far away? Is 814 sufficiently far to be convincing? Also, I hear you think: well, that isn’t a modern approach at all! Worry not, here comes the neural network solution.
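For the curious, the classic method fits in a few lines. This sketch uses an illustrative function-word list (not the one I actually used) and normalizes frequencies by document length, so its distances are on a different scale than the numbers above:

```python
import math
import re
from collections import Counter

# Illustrative function-word list; a real study would use a longer one.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]

def profile(text):
    """Frequency of each function word, relative to total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

def distance(a, b):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

genesis = "In the beginning God created the heaven and the earth."
report = "The model was trained and evaluated on a held-out test set."
```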

The Neural network solution

The idea is simple. Throw my mails and thesis at a neural network with label 1 (I wrote it). Throw some other text at it and label it 0 (I didn’t write it).

A neural network, however, does not take raw text. So we need to create a shared dictionary first, then encode each document as a sequence of dictionary indices ready to feed to our neural network.

To do this exercise I used tensorflow together with python 3.6. And now for my neural network architecture:

I borrowed this neural network from the standard TensorFlow examples, which I can now proudly pass off as my own. But how does it perform?

Training and validation accuracy

Conclusion

With the impressive graph I’ve shown you above you must deem me the next Sherlock, only even more arrogant. The graph shows 100% accuracy on all of the data I fed it, with a test and a validation set of 10% each. The dataset, however, was pathetically small.

The bad news is that if you are a regular person like me, and not a prolific writer, you will not have enough training data. The graph above is an excellent example of overfitting. When it comes to stylometry on regular people, a neural network won’t make a strong case, as it suffers from a large amount of overfitting.

Lessons learned: Learchy, an experimental webcrawler and an alternative to robots.txt

Tags

What’s this about?

Everyone knows Google. Some even know other search engines.

Way back, Google used PageRank to identify the more popular pages: a page’s rank came from its popularity, measured by how many other pages link to it. This tactic makes me sad. It encourages people to read the same popular junk over and over again. A simple example is Batman. No, not the superhero: the town in Turkey (https://en.wikipedia.org/wiki/Batman,_Turkey). Or how about a servant to an army officer?

It looks as if facts and non-fiction are becoming secondary to popular culture, which is pretty wrong and something for another discussion.

But is something useful because it is popular? SEO link farms and other SEO tricks may be living proof that the model is broken.

To get a better idea of the workings I wrote my own crawler. Not because I think I can compete with Google, but as a thought experiment on what can be done differently.

Architecture

By making the system flexible I cowardly avoid taking any design decisions whatsoever. Here are the components I came up with.

webcrawler interfaces

This system of interfaces allows for a very simple webcrawler. The webcrawler decides the strategy for walking through the URLs. It gets a summary (content and outgoing links) from a PageSummaryProcessor.
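
The interfaces described above might be reconstructed like this. The original Learchy code is Java (it uses JSoup), so this Python sketch is a hypothetical reconstruction: the names follow the text, but the signatures are assumptions.

```python
from abc import ABC, abstractmethod

class PageSummary:
    """Content of one page plus the links it points to."""
    def __init__(self, url, content, links):
        self.url = url
        self.content = content
        self.links = links

class PageSummaryProcessor(ABC):
    @abstractmethod
    def summarize(self, url):
        """Download a page and return a PageSummary."""

class IndexCreator(ABC):
    @abstractmethod
    def store(self, summary):
        """Persist the summary's content for later searching."""

class Webcrawler(ABC):
    @abstractmethod
    def crawl(self, start_url, processor, index):
        """Decide the strategy for walking the URL graph."""

# Trivial in-memory index, just to show how the pieces plug together.
class InMemoryIndex(IndexCreator):
    def __init__(self):
        self.items = []
    def store(self, summary):
        self.items.append(summary)

index = InMemoryIndex()
index.store(PageSummary("https://example.org/", "hello world", []))
print(len(index.items))  # → 1
```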

PageSummaryProcessor

JSoup was used to download the contents of the pages.

Not all words are equally interesting, so I decided to skip the most common words like ‘the’, ‘and’, etc. These are filtered out by a WordFilter.
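
A WordFilter in that spirit can be sketched in a few lines. The stop-word set here is a small illustrative sample, not the list the original used.

```python
# Small illustrative stop-word set; the real filter would use a longer list.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def word_filter(words):
    """Drop the most common words so the index keeps only interesting ones."""
    return [w for w in words if w.lower() not in STOP_WORDS]

print(word_filter(["The", "crawler", "and", "the", "index"]))
# keeps only "crawler" and "index"
```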

Webcrawler

This summary comes back to the webcrawler, which then stores the information in an index. The IndexCreator stores the content.

Not all pages should be crawled, however. A robots.txt file ensures some pages, such as a login page or a “print this page” view, are not indexed. I found out that some crawlers disrespect proper crawl delays and are therefore denied access as broadly as possible. Bad crawlers do bad things with your bandwidth. Crawlers, however, are a necessary evil: without them no one will find your content.

Webcrawling strategies

Simple crawling

The simplest strategy is to have one thread and crawl without any limits. Some hosts, like reddit.com, quickly stop overly aggressive crawlers by rate-limiting them.

Respectful crawling

It’s better to crawl while respecting the crawl delay. A crawl delay of 1 second was used. This, however, makes crawling ever so slow: it limits you to 86400 pages per host per day, which is far from desirable. That would make a very ineffective crawler.
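
The 86400 figure follows directly from the one-second delay: one request per second, times the number of seconds in a day.

```python
# With a one-second crawl delay, a single host serves at most one
# request per second, so the per-host daily ceiling is:
seconds_per_day = 24 * 60 * 60
pages_per_day = seconds_per_day // 1  # one-second crawl delay
print(pages_per_day)  # → 86400
```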

According to https://www.siteground.com/blog/crawl-delay/ Google uses no crawl delay.

Multithreaded crawling

If we have to respect crawl delays we could build a complex system that keeps track of the last crawl action on each host. I, however, found it much easier to just start a new thread for each host and sleep for 1 second between page requests for that host. For each host a shared collection of pages to visit was created, and threads can add pages for other hosts. The number of threads, however, quickly exceeds the available resources.

A thread limiter was introduced, capping the number of threads at 20.
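
The thread-per-host scheme with a limiter can be sketched as follows. This is a minimal stand-in for the real crawler: `fetch` is replaced by simply recording the page, and the demo uses a zero delay so it finishes instantly.

```python
import queue
import threading
import time

MAX_THREADS = 20   # cap from the text
CRAWL_DELAY = 1.0  # seconds between requests to the same host

thread_limiter = threading.Semaphore(MAX_THREADS)

def crawl_host(host, pages, results, delay=CRAWL_DELAY):
    """One worker per host: visit each queued page, pausing between requests."""
    with thread_limiter:  # at most MAX_THREADS workers run at once
        while True:
            try:
                page = pages.get_nowait()
            except queue.Empty:
                break
            results.append((host, page))  # fetch(page) in the real crawler
            time.sleep(delay)

pages = queue.Queue()
for p in ["/", "/about", "/contact"]:
    pages.put(p)

results = []
worker = threading.Thread(target=crawl_host,
                          args=("example.org", pages, results, 0.0))
worker.start()
worker.join()
print(results)
```

Threads discovering links to other hosts would enqueue them on that host’s queue and, if needed, spawn a new worker under the same semaphore.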

The drop point to start crawling was en.wikipedia.org.

It turns out crawling Wikipedia resulted in 30% errors. It remains to be examined whether these are truly missing pages or a bug in the tool. Hosts with a .com domain were not crawled, as most commercial domains are not all that interesting when it comes to content.

An alternative to robots.txt

I figured that rather than having crawlers go over all pages, it would be much more efficient to generate one big compressed file with all the content for the host. This file would contain only flat text content and the page on which to find it: no ads, no formatting, no scripts, ... While this would require some work from the site maintainer’s side, it would reduce crawler-induced traffic and guarantee that bots don’t mess with the internals of the website. According to a 2013 study, over 61% of web traffic is generated by crawlers. For more info follow the above link.
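
Such a per-host dump could be as simple as a gzipped file of JSON lines, one per page. The file name and layout here are assumptions for illustration, not a proposed standard.

```python
import gzip
import json

def write_site_dump(path, pages):
    """Site maintainer's side: one compressed file of flat text per host."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for url, text in pages.items():
            f.write(json.dumps({"url": url, "text": text}) + "\n")

def read_site_dump(path):
    """Crawler's side: one download replaces crawling every page."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return {rec["url"]: rec["text"] for rec in map(json.loads, f)}

pages = {"/index": "Welcome to the site",
         "/blog/1": "First post, flat text only"}
write_site_dump("site_dump.jsonl.gz", pages)
print(read_site_dump("site_dump.jsonl.gz"))
```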


The Learchy crawler code is available here:

https://github.com/denshade/Learchy