Objects and persistence in databases

  • History

    1. 2020-04-27

      Time of publication.

  1. There are situations where Object-oriented programming (OOP) isn't the best fit, and I keep coming across more of them. Way back when I was working on our graduation project, I had my first taste of Object-relational mapping (ORM), and I thought it was pretty nifty. However, I've come to realize that ORM is just trying to fix a problem caused by OOP, or how OOP is usually implemented.

    As an example, a post like this one, is associated with one topic and any number of keywords. Now let's consider how to store this information in a database. If the topics and keywords are to be shared between different posts, and possibly other types of data, it makes sense to pick a relational database for the task. And, following good practice in that regard, we'll implement the structure, in a MySQL database, as follows.

    CREATE TABLE topic (
    
      id    int           AUTO_INCREMENT,
      name  varchar (50)  NOT NULL,
      
      PRIMARY KEY   (id),
      UNIQUE        (name)
      
    );
    
    CREATE TABLE keyword (
    
      id    int           AUTO_INCREMENT,
      name  varchar (50)  NOT NULL,
      
      PRIMARY KEY   (id),
      UNIQUE        (name)
      
    );
    
    CREATE TABLE post (
    
      id        int             AUTO_INCREMENT,
      name      varchar (255)   NOT NULL,
      text      blob,
      topic_id  int             NOT NULL,
      
      PRIMARY KEY (id),
      FOREIGN KEY (topic_id) REFERENCES topic (id),
      UNIQUE      (name)
      
    );
    
    CREATE TABLE post_keyword (
    
      post_id     int   NOT NULL,
      keyword_id  int   NOT NULL,
      sort_key    int   NOT NULL DEFAULT 0,
      
      FOREIGN KEY (post_id)     REFERENCES post (id),
      FOREIGN KEY (keyword_id)  REFERENCES keyword (id)
      
    );

    Below is an example, in C++, of how this would typically be modelled using OOP. The members are declared public for simplicity's sake.

    struct Post {
    
      int               id;
      string            name;
      string            text;
      Topic             topic;
      vector<Keyword>   keywords;
      
    };

    Comparing this to the database model, we see that there's a discrepancy, and it has to do with the relationships with other data types. We can see a difference in cardinality, but we're blind to the fact that those types could be shared. To populate an object of this type, some "magic" has to happen that converts foreign key values and junction tables. This process might involve recursion if the nested object(s) are nested further . Sometimes this recursion might lead to loops, which is a big no-no, and certainly there are performance costs to consider. Also, there is additional complexity when objects are modified, or deleted (depending on if the relationship is a composition or an aggregate). Lastly, not all languages support reflection, increasing the inconvenience of defining things twice, or use of additional tools.

    A normal class definition can't really hold all this additional information. To bridge the gap, object-relational mapping (ORM) is often used. There are several such solutions out there. The one I used in the project back then was Java Persistence API (JPA). Java supports reflection, which JPA takes advantage of, making it a bit more convenient than other solutions. JPA uses a declarative approach using annotations or XML descriptor files, so that the information is more "detached" from the class definition itself. An example using annotations is given below. Note that you can get away with much less if the default behavior is sufficient, but it might also be necessary with more in some cases.

    @Entity
    @Table(name="Post")
    public class Post {
    
      @Id
      @GeneratedValue(strategy = GenerationType.IDENTITY)
      public    int         id;
      
      @Column(name="name", length=255, nullable=false, unique=true)
      public    String      name;
      
      @Column(name="text", nullable=true, unique=false)
      public    String      text;
      
      @ManyToOne(fetch=FetchType.EAGER, cascade={CascadeType.PERSIST, CascadeType.MERGE, CascadeType.REFRESH})
      @JoinColumn(name="topic_id")
      public    Topic       topic;
      
      @ManyToMany(fetch=FetchType.EAGER, cascade={CascadeType.PERSIST, CascadeType.MERGE, CascadeType.REFRESH})
      @JoinTable(
        name                = "post_keyword", 
        joinColumns         = @JoinColumn(name = "post_id"), 
        inverseJoinColumns  = @JoinColumn(name = "keyword_id"))
      public    Keyword[]   keywords;
      
    }

    To apply this, is a rather laborious task, and it's hard to make the right decisions in many cases. A typical trait for declarative approaches, is that things might appear quite elegant for simple cases, but when it gets more complicated, you end up with a mess. This is certainly true also for JPA. It's been a long while since I last worked with JPA, and just coming up with this example now reminded me of some of the mentioned problems. They are evident here by the "fetch" and "cascade" properties in particular. Another related issue, which is not shown here, is bi-directional relationships. In the case of JPA you basically introduce another language/vocabulary into your project, and you still have to write code to do deletions correctly and efficiently. Other ORM libraries might of course use different approaches, but the point here is that it might not be as simple as one might think.

    To model the persistent data, we could just skip this nesting of objects entirely (at least for aggregate relationships), and stick to Plain Old Data (POD) types. That means we basically make a 1:1 representation of the database structure. If we want to retrieve related objects, and perform deletions, we write the code to do so. It doesn't have to be a bad thing to be more data-oriented. Below is the revised model. Note that in the spirit of data-driven design, we could go further by replacing the per-instance members with lists.

    struct Post {
    
      int       id;
      string    name;
      string    text;
      int       topic_id;
    
    };
    
    struct Post_Keyword {
    
      int post_id;
      int keyword_id;
      int sort_key;
    
    };

    For my own website, I wrote a function many years ago to automate some of the database retrieval. The built-in PHP libraries for MYSQL databases support populating query results into objects without requiring the user to define classes, but I wanted to include some related data directly into those objects, without resorting to an ORM library. The function goes only one level deep, only looks for many-to-one and many-to-many relations, and relies on some naming conventions. For my website, it's sufficient. The code is given below.

    /**
     * Returns the current row of a result set as an object. Objects from relations are also included through their table name.
     * 
     * Relation requirements:
     * Many-to-one:     field "othertype_id"
     * Many-to-many:    table "thistype_othertype" with fields "thistype_id", "othertype_id", "sort_key"
     * parent-child:    field "parent_id"
     * 
     * @param     result    a mysqli query result object
     *
     * @return              An object representation
     */
    function fetch($result) {
      
      global $databaseConnection;
      
      $type = $result->fetch_field_direct(0)->table;
      
      $obj = mysqli_fetch_object($result);
      
      //Add any Many-to-one relationships to the object
      //Look through fields for any foreign key fields, and fetch the relevant object/record using that key
      for($i=0; $i<($result->field_count); $i++) {
        $field = $result->fetch_field_direct($i)->name;
        if(preg_match('/[a-zA-Z]+_id/', $field) && !preg_match('/parent_id/', $field)) { //Column contains foreign keys
          $table = str_replace("_id", "", $field);
          $objRef = mysqli_fetch_object(mysqli_query($databaseConnection, "SELECT * FROM ".$table." WHERE id = ".$obj->$field));
          if($objRef) {
            $obj->$table = $objRef;
          }
        }
      }
      
      //Add any Many-to-many relationships to the object (assumes a SORT_KEY field on the relation table)
      //Look through tables for any matching junction tables, and fetch relevant objects/records
      $resultRelTables = mysqli_query($databaseConnection, "show tables where Tables_in_mydb like '" . $type . "\_%'");
      while($resultRelTable = $resultRelTables->fetch_array()) {
        $table = str_replace($type."_","", $resultRelTable[0]);
        $objRefs = array();
        $resultObjRefs = $databaseConnection->query("SELECT ref.* FROM ".$table." ref INNER JOIN ".$resultRelTable[0]." rel ON rel.".$table."_id = ref.id WHERE rel.".$type."_id = ".$obj->id." ORDER BY rel.sort_key ASC");
        while($objRef = $resultObjRefs->fetch_object()) {
          array_push($objRefs, $objRef);
        };
        $obj->$table = $objRefs;
      }
      
      return $obj;
      
    }

    It might seem a bit gritty, but it's not overly complicated to deal with, I think. I'm just scratching the surface of the topic here, but I do believe that ORM is difficult, and that it's just a bi-product of the discrepancy between typical OOP practices and persistent data. Most applications have to deal with persistence one way or another, and in those cases I think a more data-oriented approach is preferable. Also, I don't like adding another language/vocabulary into the mix if it can be avoided.