Monday, July 22, 2019

Trying not to fertilize a content farm

I got an email this week from a person purporting to be the webmaster of a gardening website, asking if I would link from my page about buckwheat pancakes to his page about growing buckwheat at home. The email seemed a little strange: it looked like a form letter with inserted fields. And there was an unsubscribe link at the bottom. That's not something that I'd expect from a hobbyist looking to increase his connections.

But, rather than just ignore the email, I decided to do some investigation. If you'd like to play along at home, you'll find the site at “the daily gardener” (for obvious reasons, I don't want to insert a link or anything like a link, but I'm sure you can figure it out). Beware: while it has innocuous content today, it may not be so innocuous six months from now; I used a guest account on my home Linux box, and deleted the browser's config after visiting.

As I said, it has innocuous content. In fact, for a likely content farm, it has pretty good content: easily digestible chunks of information and lots of pictures. In other words, consistent with most of the gardening sites that I've seen.

There were, however, a few red flags. The first was that certain sentence phrasings were not those of a native English speaker. That, however, could just be the result of editing (I know that I've left some bizarre sentences online after deleting and re-arranging words). The second red flag was that the word “daily” in the title really did mean daily: there was a new long-form article published every day. That too could be explained: the profile picture showed someone who appeared to be post-retirement; he could craft content full-time.

A third red flag was that there were very few outbound links. This is another place where my experience is not necessarily relevant: while I put a lot of outbound links on my site to provide the reader with more information, not everybody does. But most of the gardening sites that I looked at do. And if he's asking me to link to his site, you'd think he'd be doing the same.

None of these red flags were convincing, so my next step was to take arbitrary sentences from his pages and put them into Google. In the past I've done this to find people who are pirating content from my site, and have issued several DMCA takedown requests as a result: people who put together pastiches tend not to change their source material. Surprisingly, it took a half-dozen attempts before I found almost-exact text from another site (which made me wonder whether most of the site's text was generated by a bot).

By now I was pretty much convinced this site was a content farm, so I explored its history. First stop was whois, which told me that the domain had been registered in 2005. Hmm, but all of the content looks new.

Next stop was the Wayback Machine, to look at how the site changed over the years. And what I discovered was that it was originally an online store, and remained so until the domain expired at the end of 2018. And then in April of 2019, the site was revived with a new look and new content.

Last up was a Google image search, using the picture from the “About Us” page. And, in addition to multiple links to “his” site, there was a link to Shutterstock. OK, that pretty much seals it.

After doing this research, I did some searches to find out if there was anything else that I should have looked for. And I was surprised that there were very few “how to spot a content farm” articles available (one of the better ones is from Austin Community College, apparently intended to keep students from citing such sites). Most of the articles were about how content farms don't work, and that Google actively penalizes them. But if you search for “growing buckwheat” you will find the daily gardener site on the first page of results; perhaps it's too new to have triggered alarms.

And there is one thing that I find really strange about this site: there weren't any ads. That is — or has been — the main reason for content farms: get people on the site and hope that they click an ad. But not here. Nor did the network tab on my browser show any activity after page load, so it wasn't (when I looked) some sort of click scam.

So what's the goal? Is this a “long con” approach to black-hat SEO? Are they building traffic before turning the site into something nasty?

I don't know, but I won't be linking to it.

Friday, July 19, 2019

Broken Backwards Compatibility: The Bane of the Library Developer

Backwards compatibility has long been a hallmark of Java: there are many classes and methods still in the standard library even though they've been deprecated since the 1990s. Imagine my surprise, then, when I saw this exception:

java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at net.sf.kdgcommons.util.Counters.keySet(Counters.java:179)
    ...

To understand why, you need some back-story. I maintain a couple of open-source libraries that I claim are compatible with Java 1.6. To ensure that I don't accidentally use methods that didn't exist in 1.6, I have a 1.6 JVM installed on my development machine, and tell Eclipse to use it to compile the project. Ditto for the library that was running this test.

However, I don't use the real 1.6 JDK for producing the released versions of these libraries. I use Maven for releases, and the newer versions of Maven won't run on anything older than JDK 1.8 (and there are enough backwards-incompatibilities between older versions of Maven and its plugins that trying to use one is an exercise in frustration). This means that, even though I'm setting my target property to 1.6, it's compiling against the 1.8 standard library.
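
For concreteness, the relevant compiler configuration looks something like the sketch below (not the actual POM from these projects). The source/target settings control which language features are accepted and what bytecode version is emitted, but they don't change which copy of the class library javac resolves against:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <!-- accept 1.6 syntax and emit 1.6-level class files -->
        <source>1.6</source>
        <target>1.6</target>
        <!-- without a bootclasspath pointing at a real 1.6 rt.jar, javac
             still resolves references against the JDK it is running on -->
    </configuration>
</plugin>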

And in 1.8, the Java maintainers changed the return type of ConcurrentHashMap.keySet(): previously, it returned a Set, as defined by the contract of java.util.Map. In 1.8, it returns a ConcurrentHashMap.KeySetView. This concrete class implements Set, so it doesn't break the contract at the source level.

However, things are different at the bytecode level: the actual return type is recorded in the method's descriptor, and the invokevirtual operation looks for a defined method whose descriptor matches exactly. Code compiled against the older library keeps working on 1.8, because 1.8's ConcurrentHashMap also contains a compiler-generated bridge method with the original Set-returning descriptor; there is no such safety net in the other direction. So, when code compiled against 1.8 is run against a standard library in which keySet() returns a Set, the JVM throws NoSuchMethodError.
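
To make that concrete, here's a minimal sketch (a hypothetical class, not code from the library in the stack trace). It compiles cleanly against either JDK, but the descriptor recorded for the keySet() call depends on which library it was compiled against:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class KeySetExample
{
    public static void main(String[] argv)
    {
        ConcurrentHashMap<String,Integer> counts = new ConcurrentHashMap<String,Integer>();
        counts.put("buckwheat", 1);

        // compiled against 1.6/1.7, this call records a descriptor returning Set;
        // compiled against 1.8, it records one returning ConcurrentHashMap$KeySetView,
        // which older libraries don't have; running that class file against a
        // 1.6/1.7 library fails with NoSuchMethodError
        Set<String> keys = counts.keySet();
        System.out.println(keys);
    }
}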

There is a work-around: cast the map reference to java.util.Map. This causes the compiler to use the method definition from the interface, which fortunately has not changed since it was released in 1.2.

ConcurrentHashMap<String,String> map = new ConcurrentHashMap<String,String>(); // or however the map is obtained

// casting to Map makes the compiler emit the interface's keySet() descriptor,
// which hasn't changed since 1.2
Set<String> keys = ((Map<String,String>)map).keySet();

There is a bug report for this behavior, which was closed with “Won't Fix”. Unfortunately, that's the correct resolution: reverting the method signature would not only break source code that relies on the new return type, but also any deployed application that was compiled against the 1.8 signature and is running on a JVM with the reverted one.

I can only hope that, in reviewing the bug report, the JVM maintainers realized that changing the public API of a class, any class, is a mistake. Unfortunately, some of the changes arriving in newer releases of Java suggest that such breaks will become more frequent. Which is probably why 1.8 remains (in my experience) the primary deployment platform for Java apps, even though we're now at Java 12.

Monday, July 15, 2019

Stupid Database Tricks: Accessing the filesystem from Postgres

A while back, one of my younger co-workers asked me why Amazon RDS doesn't support a real Postgres superuser. I replied by showing him this:

create table passwords (entry text);

-- the file is read by the database server process, not the client
copy passwords from '/etc/passwd';

He ran off to try running the same code against our production database using Splunk (I didn't ask; I didn't want to know, and still don't). But this example, while dramatic, doesn't really answer the question; /etc/passwd is intended to be read by anyone.

What I should have done was show him this example (warning: don't try this at home unless you enjoy restoring from backups!):

copy passwords to '/usr/local/pgsql/data/postgresql.conf';

You may be wondering, as my colleague was, why Postgres would have such a huge security hole. The answer is that it doesn't: only superusers can copy to or from files. If an ordinary user tries either of these statements, this is the response:

ERROR:  must be superuser to COPY to or from a file
HINT:  Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.
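
If an ordinary user genuinely needs to load a file, the hint points the way: have the client read the file and stream it over the connection. A minimal sketch using psql's \copy, with the same throwaway passwords table as above:

-- psql opens /etc/passwd on the client and sends it as COPY ... FROM STDIN,
-- so no superuser privilege is required
\copy passwords from '/etc/passwd'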

The real security hole was that we were using the superuser as our standard login, because it made managing the database easier. And while the ability to overwrite files is one reason not to do that, there's a bigger one: we were one mis-qualified table name away from corrupting the data in the database, because the superuser also bypasses any checks on schema or database permissions.
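
The boring alternative is to keep the superuser for administration and connect day-to-day with a least-privilege role. A minimal sketch, with hypothetical role, password, and schema names:

-- hypothetical names; run once as an administrator
create role app_user with login password 'changeme';
grant usage on schema app to app_user;
grant select, insert, update, delete on all tables in schema app to app_user;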