Re: [IMP-dev] clustering - code consolidation

On Thu, Jun 14, 2012 at 1:58 PM, Barak Raveh <" target="_blank">> wrote:

good points. Here is the proposed interface.

class Clusterer{

// use explicit vector embedding to initialize the data
Clusterer(Embedding data, params);

// use implicit conversion of data from e.g. FloatsList to data
Clusterer(EmbeddingAdaptor data, params);

// run the clustering on the stored data
virtual PartitioningClusteringResults do_clustering() = 0;
};

IMP tends to use do_X method in IMP for protected virtual methods in non-virtual interface style interfaces (that is, when one has a non-virtual method that is public that calls a private virtual method that does the work, the public non-virtual method can check the arguments and results). And tend to use create_X when creating an Object that needs to be ref counted on return.

That said, I'm not sure what having such an architecture buys you over just having a function like we do now. It is hard to do anything meaningful with a long term connection between the returned Clustering and the state of the Clusterer. That is, what happens when you call a method on your clusterer to eg, change the number of clusters? The Clustering is now running around somewhere so you can't modify it safely, so at best you have to copy the results out of it. And since the Embedding is passed the constructor, you can't easily cluster another set of points.

I suspect either inheriting from the Clustering and doing the clustering in your constructor or taking the embedding as an argument to the create_clustering method are better for a class. Or just going with the existing structure (you can always use argument classes to pass large number of arguments in a nice manner).

class XXXXIncrementalClusterer : Clusterer{
// use explicit vector embedding to initialize the data

XXXXIncrementalClusterer(Embedding data, params);

// use implicit conversion of data from e.g. FloatsList to data
XXXXIncremenralClusterer(EmbeddingAdaptor data, params);

//
PartitioningClusteringResults add_data(XXXX data);
};

It's a little awkward to have such an add_data method as the Clusterer doesn't know anything about the underlying thing that is being clustered (only the Embedding knows). So data would have to be first added to the Embedding and then perhaps a create_cluster_with_added_items() method called (they are items in embedding). And the Incremental clusterer effectively has to keep a copy of any results it returns internally anyway. So I think either of the alternative structures above may make more sense.

On Thu, Jun 14, 2012 at 1:34 PM, Daniel Russel <" target="_blank">> wrote:
Sorry, I didn't see this before. I think my previous comments still stand. As a couple additional ones:
- having "execute" style methods on objects isn't a very nice practice. It destroys type safety, since for class A you really have two different types of A, the pre-execute A and the post-execute A and some contexts require one and some the other (and the compiler can't check). And doesn't really give you anything that you can't get from producing a new object as the result. Ideally, classes should have what I have seen called the "no protocols" property: you can call any function of the class in any order.

- I kind of prefer Clusterer to Clustering when referring to the algorithm as the latter very much means the result of running a clustering algorithm to me as opposed to something that does clustering. But that may be my eccentricity (but if you google "clusterer" the results seem consistent with that usage). In any case, we need to make sure that distinct terms are used for distinct things.

On Thu, Jun 14, 2012 at 12:38 PM, Barak Raveh <" target="_blank">> wrote:

Now the full version...
Daniel and I discussed a little bit consolidation of clustering things in statistics and kmeans modules. Please tell me if it is agreed that things will work with the following interface:
* The Embedding family of classes (used to embed data in vector form) will remain as is
* There will be a "Clustering" class from which all clustering algorithms will derive, with a constructor that takes either Embedding class, or EmbeddingAdaptor for implicit conversions from e.g., FloatsList

* The Clustering classes will also have a void ::execute() method and a ::get_clustering results() method, that will return the clustering results (using the exisiting PartitioningClustering class, perhaps we can change its name to PartitioningClusteringResults).

So bottom line, if you will want to cluster some data, you will do something like
FloatsList data; // or create an Embedding object
KMeansClustering kmeans(data, params);

kmeans.execute();
PartitioningClustering clustering_results = kmeans.get_clustering_results()

Makes sense? For backward compatibility, I will add a DEPRECATED warning to existing clustering methods, so they will be removed within a few months completely.

Barak

_______________________________________________
IMP-dev mailing list
" target="_blank">
https://salilab.org/mailman/listinfo/imp-dev

--
Barak